General Documentation
Documentation is a critical part of research data management. Describing your dataset well and documenting what you've done benefits not only you (and your research group or lab), but also anyone else who wants to reuse your data in the future. In order for someone to interpret and reuse research data, they must first understand the research -- when, why, and how the data was collected or generated, what the variables mean, how the data was processed or transformed, and how the final dataset was created.
Clear and detailed documentation allows others to reconstruct the context of the dataset, without which it cannot be effectively used in further analysis.
One important principle for documentation is to start at the beginning of the research project and update as you go along. This will facilitate accuracy and thoroughness, since people tend to forget details over time. It's best not to wait until the middle or end of the project to write documentation.
There are many aspects of your research project and associated dataset(s) that warrant documentation. Some broad categories to consider documenting:
README Files
A common type of documentation included with datasets deposited into repositories is a README file. Creating a README is especially important if your research project or dataset isn't well served by an existing metadata standard, since in that case the README will be the primary vehicle for providing information about the project and data to others. We recommend that a README file include the following descriptive information (if not covered by other types of documentation):
The Cornell guide to README-style documentation and the DMPTool guidance on data description below both offer excellent advice for writing READMEs, and we recommend reviewing these resources.
In addition to README files and other unstructured descriptive material, documentation of data and datasets frequently comes in more structured forms, which we cover in the next two sections. Note also that if your project involves code that will be shared along with your dataset, the code should be well-documented too.
Codebooks and data dictionaries are two forms of structured documentation primarily focused on defining variables. They are related in function but differ somewhat in form, focus, and approach.
Codebooks
A codebook is a document commonly included with datasets in the social and behavioral sciences, intended to assist with understanding the contents and structure of those datasets. Codebooks open with front matter: the study title, the names of the principal investigators, and an introduction to the data. They may also include methodological information if that is not documented elsewhere. The main content of a codebook, however, is detailed definitions and descriptions of the variables in the dataset.
Codebooks are commonly included with studies where lengthy questionnaires, surveys, or similar instruments are used and result in large numbers of variables, often named with opaque alphanumeric codes. For each coded variable, a codebook offers the question text, what the data values mean (e.g. 1 = good, 2 = fair, etc., also called value labels), and sometimes additional information such as summary statistics or notes and comments about that variable.
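To make this concrete, here is a minimal sketch of what a single codebook entry might capture, represented as a Python dictionary. The variable name, question text, and value labels are all hypothetical, not drawn from a real study.

```python
# A hypothetical codebook entry for one coded survey variable.
codebook_entry = {
    "variable": "HLTH01",  # opaque alphanumeric code as it appears in the data file
    "question": "In general, how would you rate your health?",
    "type": "integer",
    "value_labels": {      # what the coded data values mean
        1: "Excellent",
        2: "Good",
        3: "Fair",
        4: "Poor",
        -9: "Refused / missing",
    },
    "notes": "Asked of all respondents; summary statistics in codebook appendix.",
}

# Decoding a raw data value using the value labels:
raw_value = 2
print(codebook_entry["value_labels"][raw_value])  # -> Good
```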
Data Dictionaries
Data dictionaries are, in contrast, typically in tabular/spreadsheet form. A typical data dictionary might contain columns for variable name (exactly as it appears in the dataset), a more descriptive human-readable variable name, unit of measurement, allowed values, a definition of the variable, and additional explanation, comments, or notes for each variable. Data dictionaries are not exclusively intended for quantitative empirical data, but they are more suited for that purpose than codebooks, since they foreground the units and allowed/expected values of variables.
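As an illustration, the sketch below writes a small data dictionary to a CSV file using Python's standard csv module. The column headings and the two example variables are hypothetical, but the layout follows the tabular form described above.

```python
import csv

# Hypothetical data dictionary: one row per variable in the dataset.
rows = [
    {
        "variable_name": "temp_c",
        "label": "Water temperature",
        "unit": "degrees Celsius",
        "allowed_values": "-2.0 to 40.0",
        "definition": "Surface water temperature at the sampling site",
        "notes": "Measured with a handheld probe; see methods documentation",
    },
    {
        "variable_name": "site_id",
        "label": "Sampling site identifier",
        "unit": "n/a",
        "allowed_values": "S01-S24",
        "definition": "Unique code for each sampling location",
        "notes": "",
    },
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```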
If either of these forms of documentation is suitable for your study and dataset(s), it is good practice to create and maintain it and to later include it with your data when sharing. Codebooks and data dictionaries are crucial documentation when a research project has variables that are difficult to understand or need explanation.
Introduction to Metadata
Depending on the resource or the person, the terms documentation and metadata are sometimes used interchangeably. For the purposes of this guide, we use the term metadata to refer to structured description that makes use of a formally defined standard or schema. By structured and formally defined, we mean that a list of elements or fields is delineated, some of which may have suggestions or restrictions on the content or form of the associated values. Many metadata standards exist, and they range from broadly applicable and general to highly domain-specific.
The term schema usually indicates that a standard also has a technical specification for how the metadata should be encoded (and refers to that specification), for example in a markup language like XML.
Structured metadata fundamentally serves multiple purposes:
Metadata is structured and encoded the way it is so that it can be indexed and used in searches; in other words, it is machine-readable. Machine-readable metadata is increasingly important as our reliance on computers to find and retrieve information and data grows. Without metadata that can be parsed by computers, we would not have search features like filters and field searching (keywords, title, author, date, abstract text, etc.).
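As a rough illustration of why machine-readability matters, the sketch below filters a small list of metadata records by field, which is essentially what a repository's field search does at much larger scale. The records and field names are hypothetical.

```python
# Hypothetical, highly simplified metadata records; a real repository
# indexes thousands of records and supports much richer queries.
records = [
    {"title": "Stream temperature survey", "creator": "Lopez", "date": "2021"},
    {"title": "Urban heat island sensor data", "creator": "Chen", "date": "2023"},
    {"title": "Stream macroinvertebrate counts", "creator": "Lopez", "date": "2023"},
]

def field_search(records, field, term):
    """Return the records whose given field contains the search term."""
    return [r for r in records if term.lower() in r[field].lower()]

print(field_search(records, "creator", "lopez"))  # filter by author
print(field_search(records, "title", "stream"))   # keyword search on titles
```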
Here is a table of summary metadata for a hypothetical dataset to give you an idea of what you might see when you search for datasets in a repository (an example containing more complete metadata can be viewed by visiting this dataset landing page and clicking on the metadata tab):
This metadata is both indexed and displayed for human readability so you can quickly discern what the dataset is, how it may be used, and whether or not it would be useful to you.
Example Standard: Dublin Core Element Set
As an example of a metadata standard and to demonstrate how these standards differ from less formalized documentation such as the READMEs discussed previously, we introduce the 15 basic elements of the popular cross-disciplinary Dublin Core standard and how they might be used for datasets (see link for their official definitions, which are broader and apply to more than just datasets):
Note the recommended use of various controlled vocabularies -- we will return to these in the section on controlled vocabularies below.
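To show what an encoded record might look like, here is a sketch that builds a minimal Dublin Core record in XML using Python's standard library. The dataset being described is hypothetical, and a real record would typically populate more of the 15 elements.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)

record = ET.Element("record")
# A few of the 15 Dublin Core elements, with hypothetical values.
for element, value in [
    ("title", "Stream Temperature Survey, 2021"),
    ("creator", "Lopez, Maria"),
    ("date", "2021-08-15"),   # ISO 8601 date format
    ("type", "Dataset"),      # a term from the DCMI Type Vocabulary
    ("subject", "water temperature"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```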
Domain-specific Metadata Standards
While Dublin Core is a discipline-agnostic standard, many disciplines have more specialized metadata standards with elements tailored to studies (and data) in those fields. Using an appropriate standard designed with certain types of studies or data in mind can make your documentation more effective, and it can also enhance your dataset's interoperability with other data that use the same standard.
Here are a few examples of specialized metadata standards:
Selecting a Metadata Standard
Whether you choose a metadata standard because your target data repository requires it, because it is endorsed in your field of study, or simply because you find it most suitable, take care to accurately fill out as many elements/attributes as you reasonably can. All of metadata's core functions (description, enabling discovery, enhancing interoperability) depend on metadata that is as complete and accurate as possible.
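One simple habit that supports completeness is checking records for empty required fields before deposit. Below is a minimal sketch of such a check; the record and the set of required elements are hypothetical, since required fields vary by repository and standard.

```python
# Hypothetical required elements; actual requirements vary by repository.
REQUIRED = {"title", "creator", "date", "description"}

record = {
    "title": "Stream Temperature Survey, 2021",
    "creator": "Lopez, Maria",
    "date": "2021-08-15",
    "description": "",  # empty: will be flagged below
}

missing = [k for k in sorted(REQUIRED) if not record.get(k, "").strip()]
if missing:
    print("Incomplete metadata; empty or missing fields:", missing)
```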
Note that if you deposit your dataset into a data repository, you will likely generate and/or enter this metadata through a form at the time of deposit.
Here are links to pages introducing several popular standards and schema in more detail. If you plan on using LibraData (see the section on choosing a data repository and the LibraData subpage), it uses the DDI standard for dataset metadata, so that may be of particular interest to you.
Resources for finding (meta)data standards are linked below. If you are unsure of what standard is best for your dataset, consider the following:
Controlled Vocabularies
An important step researchers can take to make their data (and associated metadata) more shareable, comprehensible, and interoperable is to standardize it when possible. One approach to data standardization is to use specific sets of values that are shared among studies in the same field or on the same topic. To that end, many disciplines make use of controlled vocabularies (term lists, thesauri, taxonomies, or similar sets of standard terms and names) to ensure (meta)data consistency – both within and between research groups and studies – and to improve data quality.
One common example of such an approach is a thesaurus that designates a preferred term for indexing, with synonyms or near-synonyms of that term redirecting to it. This aids search consistency, as searching for any of a number of synonyms for a term or subject yields the same results.
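Here is a minimal sketch of that idea in Python: a mapping from synonyms to a single preferred indexing term, so that any variant resolves to the same term at search time. The terms themselves are illustrative.

```python
# Hypothetical synonym ring: every variant maps to one preferred term.
PREFERRED = {
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}

def normalize(term):
    """Resolve a search term to its preferred form, if one is defined."""
    return PREFERRED.get(term.strip().lower(), term)

# Searching with any synonym yields the same indexed term:
print(normalize("Heart attack"))  # -> myocardial infarction
print(normalize("MI"))            # -> myocardial infarction
```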
Earlier, we noted that the popular Dublin Core metadata standard recommends several controlled vocabularies. Sometimes, as with ISO 8601 (for date formatting) and the Getty Thesaurus of Geographic Names, these controlled vocabularies help ensure broad data consistency across fields. This is one important aspect of standardization; another is for researchers in the same discipline to share the same language: a standard set of terms and names that refer to the same concepts and represent the same values. This helps make research in those fields more mutually intelligible and encourages meta-analysis and other syntheses.
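For example, ISO 8601 dates can be produced directly with Python's standard library, keeping the dates in your data and metadata consistent regardless of local conventions:

```python
from datetime import date

d = date(2021, 8, 15)
print(d.isoformat())  # -> 2021-08-15, the ISO 8601 YYYY-MM-DD form
```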
Some examples of controlled vocabularies that target particular disciplines or research areas include:
Included below are several more useful general resources on controlled vocabularies: