Research Data Management

This guide offers advice and resources for managing research data in any discipline.

Writing Good Documentation

Documentation is a critical part of research data management. Describing your dataset well and documenting what you've done benefits not only you (and your research group or lab) but also anyone else who wants to reuse your data in the future. In order for someone to interpret and reuse research data, they must first understand the research -- when, why, and how the data was collected or generated, what the variables mean, how it was processed or transformed, and how the final dataset was created.

Clear and detailed documentation allows others to reconstruct the context of the dataset, without which it cannot be effectively used in further analysis.

One important principle for documentation is to start at the beginning of the research project and update as you go along. This will facilitate accuracy and thoroughness, since people tend to forget details over time. It's best not to wait until the middle or end of the project to write documentation.

There are many aspects of your research project and associated dataset(s) that warrant documentation. Some broad categories to consider documenting:

  • Context of data collection
  • Data collection methods
  • Information about variables used
  • File organization and naming schemes
  • How data has been transformed or processed for analysis
  • Software used for data processing and analysis
  • Outside data sources used
  • The roles and responsibilities of project personnel

A common type of documentation included with datasets deposited into repositories is a README file. Creating a README file is especially important if your research project or dataset isn't well served by an existing metadata standard, as in this case it will be the primary vehicle for providing information about the project and data to others. We recommend a README file include the following descriptive information (if not covered by other types of documentation):

  • General Information: Dataset title, contact information for principal investigators and other key personnel, time frame and location for data collection, language information, persistent identifier (if available), and identification of funding source
  • Content overview: Explanation of directory structure or file organization, a list of files and short descriptions of each, important relationships between files, other information necessary to understand files
  • Methodological information: Description of method for data collection or generation, documentation of data processing/transforming/cleaning steps, details on software or instrumentation necessary to correctly interpret the data, information on quality assurance procedures (if applicable)
  • Access and reuse information: Licenses or reuse restrictions that apply to the data, links to the associated publication and other related resources or publications (including code repositories), recommended citation for the dataset
  • Other data-specific details: A list of variables with descriptions of each, units of measurement, definitions of codes/abbreviations/symbols used in the dataset, how missing data values have been recorded/denoted
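As a starting point, the recommendations above can be turned into a fill-in-the-blank README skeleton. The following is a minimal Python sketch, not a required format: the section headings, prompt wording, and the function name build_readme_skeleton are our own illustrative choices, so adapt them to your project.

```python
# Hypothetical README outline; headings and prompts paraphrase the
# recommendations above and are not an official template.
README_SECTIONS = {
    "GENERAL INFORMATION": [
        "Dataset title",
        "Principal investigator and contact information",
        "Time frame and location of data collection",
        "Language",
        "Persistent identifier (if available)",
        "Funding source",
    ],
    "CONTENT OVERVIEW": [
        "Directory structure or file organization",
        "File list with short descriptions",
        "Relationships between files",
    ],
    "METHODOLOGICAL INFORMATION": [
        "Data collection or generation methods",
        "Processing, transformation, and cleaning steps",
        "Software or instrumentation needed to interpret the data",
        "Quality assurance procedures (if applicable)",
    ],
    "ACCESS AND REUSE INFORMATION": [
        "License or reuse restrictions",
        "Related publications and code repositories",
        "Recommended citation",
    ],
    "DATA-SPECIFIC DETAILS": [
        "Variable list with descriptions and units",
        "Codes, abbreviations, and symbols used",
        "How missing values are recorded",
    ],
}


def build_readme_skeleton(sections):
    """Return README text with an underlined heading per section
    and one empty 'prompt: ' line per item to fill in."""
    lines = []
    for heading, prompts in sections.items():
        lines.append(heading)
        lines.append("-" * len(heading))
        for prompt in prompts:
            lines.append(f"{prompt}: ")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


readme_text = build_readme_skeleton(README_SECTIONS)
```

Writing readme_text to a plain-text file at the start of the project gives you a document to update as you go along.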

Both the Cornell guide to README-style documentation and the DMPTool guidance on data description below offer excellent advice on writing READMEs, and we recommend you review these resources.

In addition to README files and other types of unstructured descriptive material, documentation of data and datasets frequently also comes in more structured forms, which we cover in the next two sections. Also worth noting is that if your project involves use of code and if code is going to be shared along with your dataset, it too should be well-documented.

Codebooks and Data Dictionaries

Codebooks and data dictionaries are two forms of structured documentation primarily focused on defining variables. They are related in function but differ somewhat in form, focus, and approach.

A codebook is a document commonly included with datasets in the social and behavioral sciences intended to assist with understanding the contents and structure of those datasets. Codebooks include front matter, including the study title, names of the principal investigators, and an introduction to the data. They may include methodological information too, if that is not documented elsewhere. However, the main content of a codebook is detailed definitions and descriptions of variables in the dataset.

Codebooks are commonly included with studies where lengthy questionnaires, surveys, or similar instruments are used and result in large numbers of variables, often named with opaque alphanumeric codes. For each coded variable, a codebook offers the question text, what the data values mean (e.g. 1 = good, 2 = fair, etc., also called value labels), and sometimes additional information such as summary statistics or notes and comments about that variable.
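To illustrate how value labels are used in practice, here is a small Python sketch that decodes raw survey responses using a codebook entry. The variable name Q12A, the question text, the label for code 3, and the missing-data code -9 are all invented for this example; a real codebook defines its own.

```python
# Hypothetical codebook entry for one coded survey variable.
codebook = {
    "Q12A": {
        "question": "How would you rate your overall health?",
        "value_labels": {1: "good", 2: "fair", 3: "poor"},
        "missing_codes": {-9: "refused"},
    }
}


def decode(variable, value, codebook):
    """Translate a raw coded value into its label.
    Values listed as missing codes are returned as None."""
    entry = codebook[variable]
    if value in entry["missing_codes"]:
        return None
    return entry["value_labels"][value]


raw_responses = [1, 2, -9, 3]
decoded = [decode("Q12A", v, codebook) for v in raw_responses]
```

Without the codebook, the raw values 1, 2, -9, and 3 would be uninterpretable to someone outside the project.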

Data dictionaries are, in contrast, typically in tabular/spreadsheet form. A typical data dictionary might contain columns for variable name (exactly as it appears in the dataset), a more descriptive human-readable variable name, unit of measurement, allowed values, a definition of the variable, and additional explanation, comments, or notes for each variable. Data dictionaries are not exclusively intended for quantitative empirical data, but they are more suited for that purpose than codebooks, since they foreground the units and allowed/expected values of variables.
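Because a data dictionary is tabular, it can be stored as a CSV file and checked against the dataset it describes. The sketch below assumes a hypothetical dictionary with the columns described above; the variable names (temp_c, site_id, humidity) and the helper functions are made up for illustration.

```python
import csv
import io

# Hypothetical data dictionary stored as CSV text.
DATA_DICTIONARY_CSV = """variable_name,label,unit,allowed_values,definition
temp_c,Air temperature,degrees Celsius,-50 to 60,Air temperature at 2 m above ground
site_id,Site identifier,,A|B|C,Code for the sampling site
"""


def load_dictionary(text):
    """Parse the CSV text into a mapping of variable name -> dictionary row."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["variable_name"]: row for row in reader}


def undocumented_columns(dataset_columns, dictionary):
    """Return dataset columns that have no entry in the data dictionary."""
    return [col for col in dataset_columns if col not in dictionary]


dictionary = load_dictionary(DATA_DICTIONARY_CSV)
missing = undocumented_columns(["temp_c", "site_id", "humidity"], dictionary)
```

Here missing contains "humidity", flagging a column in the dataset that still needs a dictionary entry.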

If either of these forms of documentation is suitable for your study and dataset(s), it is good practice to create and maintain it and to later include it with your data when sharing it. They are crucial documentation when a research project has variables that are difficult to understand or need explanation.

Choosing and Using Metadata Standards

Depending on the resource or the person, sometimes the terms documentation and metadata are used interchangeably. For the purposes of this guide, we use the term metadata to refer to structured description that makes use of a formally defined standard or schema. What we mean by structured and formally defined is that a list of elements or fields is delineated, some of which may have suggestions or restrictions on the content or form of the associated values. Many metadata standards exist, and they range from broadly applicable and general to highly domain-specific.

When the term schema is used, this usually means that a standard also has a technical specification for how the metadata should be encoded (and refers to this specification), such as in a markup language like XML.

Structured metadata fundamentally serves multiple purposes:

  1. To act as documentation
  2. To facilitate discovery, and
  3. To allow interoperability

Metadata is structured and encoded the way it is to allow it to be indexed and used in searches. In other words, it is machine-readable. Machine-readable metadata is increasingly important as our reliance on computers to help us find and retrieve information and data grows. Without metadata that can be parsed by computers, we would not have search features like filters and field searching (keywords, title, author, date, abstract text, etc.).
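A brief Python sketch can show why structured fields make filtering and field searching possible. The two records, their field names, and the functions field_search and filter_by_year are invented for illustration and do not reflect how any particular repository implements search.

```python
# Hypothetical metadata records with named fields.
records = [
    {
        "title": "Stream temperature survey",
        "creator": "Doe, Jane",
        "date": "2021-06-30",
        "keywords": ["hydrology", "temperature"],
    },
    {
        "title": "Household energy use interviews",
        "creator": "Roe, Richard",
        "date": "2019-11-15",
        "keywords": ["energy", "survey"],
    },
]


def field_search(records, field, term):
    """Return records whose given field contains the term (case-insensitive)."""
    term = term.lower()
    matches = []
    for record in records:
        value = record.get(field, "")
        values = value if isinstance(value, list) else [value]
        if any(term in v.lower() for v in values):
            matches.append(record)
    return matches


def filter_by_year(records, year):
    """Return records whose ISO 8601 date falls in the given year."""
    return [r for r in records if r["date"].startswith(f"{year}-")]


keyword_hits = field_search(records, "keywords", "survey")
title_hits = field_search(records, "title", "survey")
from_2021 = filter_by_year(records, 2021)
```

Note that the same term, "survey", returns different records depending on whether the keywords field or the title field is searched, which is only possible because the metadata distinguishes between fields.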

Here is a table of summary metadata for a hypothetical dataset to give you an idea of what you might see when you search for datasets in a repository (an example containing more complete metadata can be viewed by visiting this dataset landing page and clicking on the metadata tab):

This table contains the metadata fields description, subjects, keywords, related publications, and license/data use agreement with associated values for each for a made-up example dataset.

This metadata is both indexed and displayed for human readability so you can quickly discern what the dataset is, how it may be used, and whether or not it would be useful to you.

As an example of a metadata standard and to demonstrate how these standards differ from less formalized documentation such as the READMEs discussed previously, we introduce the 15 basic elements of the popular cross-disciplinary Dublin Core standard and how they might be used for datasets (see link for their official definitions, which are broader and apply to more than just datasets):

  • Contributor: Name(s) of those who contributed to the creation of the dataset. This could be co-PIs, research staff, and/or research scientists. Include ORCiDs if available.
  • Coverage: The spatial or temporal extent of the resource/dataset. Can include geospatial information as well as a date range. Best practice for locations is the Getty Thesaurus of Geographic Names (TGN).
  • Creator: Name(s) of researcher, group, or organization that created the dataset. Best practice for personal names is surname first. Include ORCiDs if available.
  • Date: Point or period of time associated with an event in the lifecycle of the data. Usually indicates when the dataset was completed. Best practice is to use a standard like ISO 8601 (YYYY-MM-DD).
  • Description: An account of the resource. May include but is not limited to: an abstract, a table of contents, or a free-text account of the resource.
  • Format: The file format, physical medium, or dimensions of the resource. Best practice is to use a controlled vocabulary such as the IANA media types.
  • Identifier: A unique reference to the resource (dataset). Best practice is to use a DOI or similar persistent ID. Often supplied by the repository you are submitting the dataset to. 
  • Language: The language of the resource. Best practice is to use a controlled code list such as ISO 639-2/639-1.
  • Publisher: The person, organization, or service that made the resource available.
  • Relation: A related resource, such as the associated article or another dataset. Best practice is to use a URI/DOI. If a persistent identifier is not available, use a formal identification system (like an ISBN for a book or volume/issue for a journal).
  • Rights: A statement of known rights information, including copyright, licenses, and data (re)use restrictions.
  • Source: A related resource from which the dataset is derived. If you reused another dataset, you would list it here.
  • Subject: Keywords or key phrases that describe the resource/dataset. Best practice is to use a controlled vocabulary.
  • Title: The name of the dataset or research project.
  • Type: The "nature or genre" of the resource. Best practice is to use the DCMI Type vocabulary, and here 'dataset' is an option.

Note the recommended use of various controlled vocabularies -- we will discuss this in the next section.
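To make the difference between a README and schema-based metadata concrete, here is a minimal Python sketch that encodes a few of the Dublin Core elements listed above as XML using the standard library. The namespace URI is the one published for the Dublin Core element set; the wrapper element name "metadata", the function build_dc_record, and all of the field values are assumptions made for this example, and a repository will typically generate the real record for you.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Build an XML record with one dc:<element> per value.
    The 'metadata' wrapper element is a hypothetical container."""
    root = ET.Element("metadata")
    for element, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return root


# Invented example values following the best practices noted above.
fields = {
    "title": ["Example stream temperature dataset"],
    "creator": ["Doe, Jane"],                   # surname first
    "date": [date(2021, 6, 30).isoformat()],    # ISO 8601 (YYYY-MM-DD)
    "type": ["Dataset"],                        # DCMI Type vocabulary
    "language": ["eng"],                        # ISO 639-2 code
    "format": ["text/csv"],                     # IANA media type
    "subject": ["hydrology", "water temperature"],
}

record = build_dc_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
```

Elements such as subject may repeat, which is why each field holds a list of values.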

While Dublin Core is a discipline-agnostic standard, many disciplines have more specialized metadata standards that have elements tailored to and fitting for studies (and data) in those fields. Use of appropriate standards designed with certain types of studies or data in mind can help make your documentation more effective, and it can also enhance your dataset's interoperability with other data that share use of the same standard.

Here are a few examples of specialized metadata standards:

When you decide on a metadata standard, whether it is because it is the one your target data repository requires, one that is endorsed in your field of study, or simply because it is the one you find most suitable, take care to accurately fill out as many elements/attributes as you reasonably can. All of metadata's core functions (description, enabling discovery, enhancing interoperability) depend on metadata that is as complete and accurate as possible.

It is important to note that if you decide to deposit your dataset into a data repository, it is likely that you will generate and/or enter this metadata through use of a form upon deposit.

Here are links to pages introducing several popular standards and schemas in more detail. If you plan on using LibraData (see the section on choosing a data repository and the LibraData subpage), it uses the DDI standard for dataset metadata, so that may be of particular interest to you.

Resources for finding (meta)data standards are linked below. If you are unsure of what standard is best for your dataset, consider the following:

  1. Search by discipline to familiarize yourself with what standards exist for your research area.
  2. Search for and examine other shared data related to your project or research area to see what standards (if any) your peers are using.
  3. If you plan on depositing into a discipline-specific repository, see if they have standards recommendations or preferences.
  4. Ask us at dmconsult@virginia.edu! If you aren't sure if you should use a standard or which standard to use, we are happy to discuss this with you, and we will consult our subject specialist colleagues if more knowledge of the field is required.

Using Controlled Vocabularies and Thesauri

An important step researchers can take to make their data (and associated metadata) more shareable, comprehensible, and interoperable is to standardize it when possible. One approach to data standardization is to use specific sets of values that are shared among studies in the same field or on the same topic. To that end, many disciplines make use of controlled vocabularies (term lists, thesauri, taxonomies, or similar sets of standard terms and names) to ensure (meta)data consistency – both within and between research groups and studies – and to improve data quality.

One common example of such an approach is the use of a thesaurus that designates a preferred term for indexing and redirects synonyms or near-synonyms to that preferred term. This aids searching consistency, as searching for any of a number of synonyms for a term or subject yields the same results.
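The redirect behavior of a thesaurus can be sketched in a few lines of Python. The terms, dataset identifiers, and function names below are invented for illustration and are not drawn from any published vocabulary.

```python
# Hypothetical thesaurus: every entry term points to its preferred term.
PREFERRED_TERMS = {
    "heart attack": "myocardial infarction",
    "cardiac infarction": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

# Hypothetical index: datasets are indexed only under the preferred term.
index = {"myocardial infarction": ["dataset-001", "dataset-007"]}


def normalize(term, thesaurus):
    """Lowercase the term, collapse whitespace, and map it to its
    preferred term; unknown terms are returned unchanged."""
    key = " ".join(term.lower().split())
    return thesaurus.get(key, key)


def search(term):
    """Look up datasets by any synonym of an indexed subject."""
    return index.get(normalize(term, PREFERRED_TERMS), [])


by_synonym = search("Heart  Attack")
by_preferred = search("myocardial infarction")
```

Both searches return the same two datasets, because the synonym is redirected to the preferred term before the index is consulted.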

In the previous section, we noted that the popular metadata standard Dublin Core recommends several controlled vocabularies. Sometimes, as with ISO 8601 (for date formatting) and the Getty Thesaurus of Geographic Names, these controlled vocabularies help ensure broad data consistency across fields. This is one important aspect of standardization, but another important aspect is for researchers in the same discipline to share use of the same language: a standard set of terms and names to refer to the same concepts and to represent the same values. This helps make research in those fields more mutually intelligible and encourages meta-analysis or other syntheses.

Some examples of controlled vocabularies that target particular disciplines or research areas include:

Included below are several more useful general resources on controlled vocabularies: