Research Data Management

This guide offers guidance and resources for managing research data in any discipline.

Data Sharing

(Adapted from Rourk, Will, 2020, "150- Architectural Detail - Pavilion V parlor", https://doi.org/10.18130/V3/CVKEMV, University of Virginia Dataverse, V1; pav_v_parlor_bucrania_STL.stl [fileName] with a CC0 1.0 Public Domain License.)

Data sharing isn't new. Researchers have been sending each other data files for years. What is new is that many funders require data sharing as a key component of their research funding strategy. Publishers may also require you to share the data that supports the articles that they publish.

Research data is a valuable resource that usually takes considerable time and money to produce. Many data collections have significant value beyond the original research. Because digital data can easily be stored, disseminated, and made accessible online, many institutions are keen to share research data to increase the impact and visibility of their research.

Why should you share your research data?

Enable others to replicate and verify results as part of the scientific process
Allow researchers to ask new questions and conduct new analyses, and improve research methods
Link to research products like publications & presentations
Create a more complete understanding of a research study
Meet the expectations of sponsors, funders, publishers and institutions
Receive credit for data creation for career advancement
Reduce the costs of duplicating data collection

How do I share my data?

Deposit it in a repository
Submit as supplemental material to a journal in support of an article
Link it to a data paper in a journal

Data repositories are not just placeholders; many of them also preserve and curate the data. Funders may specify repositories for the research data produced by projects they fund. Publishers may require that the data supporting research they publish be deposited in a specific location.

Advantages to using a repository:

Persistent Identifiers -- unique and citable
Terms of use and licenses
Access controls
Repository guidelines for deposit
Long-term data preservation in standard file formats
Regular data backup
Quality standards

Managing and Sharing Research Data: A Guide to Good Practice (2nd edition)

The Data Repository Landscape

About Research Data Repositories

Research data repositories are storage locations and services designed and intended for long-term data archiving and preservation. Data repositories generally offer support for data sharing, enable access controls for datasets, provide persistent and citable identifiers, and create landing pages for deposited datasets that display descriptive metadata. While most data repositories share some core features, they vary considerably in the level of support offered beyond the basics.

Research data repositories are frequently separated into three broad categories:

Generalist/Discipline-agnostic repositories: These repositories are characterized by the broad range of datasets they will host. Repositories in this category typically don't place domain- or data type-based restrictions on deposits, though some may not accept data that requires access limitations (e.g. human subjects or medical research). These repositories can range from completely general (like Zenodo) to having a skew toward a certain field or topic (for instance, Dryad is generalist but hosts many life sciences datasets).
Discipline- and data type-specific repositories: These repositories cater to a specific domain or type of data. These repositories tend to have staff with specialized expertise and more targeted options and robust features. One example is the Inter-university Consortium for Political and Social Research (ICPSR), a respected repository for the social sciences. One data type-specific repository is GenBank, the NIH database for genetic data.
Institutional repositories: These repositories are local to research institutions. In the case of universities, they are often housed in the university's library. Historically, institutional repositories have focused on hosting publications (especially theses and dissertations), but now increasingly include data repositories. The University of Virginia's institutional data repository is LibraData.

Beyond these categories, one other major factor in choosing a repository is whether or not it offers curation services. The NLM defines data curation as "the ongoing processing and maintenance of data throughout its lifecycle to ensure long term accessibility, sharing, and preservation." Some repositories provide curation services for hosted datasets, or expect deposited datasets to meet some minimum standard of curation. Most generalist repositories do not curate deposited data. Some guidance that aligns with data curation is given in a later section.

Data Repository Characteristics

What makes a data repository effective and trustworthy? Federal agencies have devised a consensus set of "desirable characteristics" of data repositories for federally funded data, including but not limited to:

Free and open access
Clear (re)use guidance
Stated retention policies
Demonstrated capabilities for long-term planning and risk management
Curation and quality assurance
A persistent identifier and metadata for every dataset
Data provenance tracking
Established data security and authentication practices

We have linked to this document below.

We have the following recommendations for choosing a repository for your dataset:

Choice of repository may also depend on details such as data formats and size, a repository's submission requirements and preservation policies, and whether or not there are associated fees. If you have questions or concerns about choice of repository please reach out to us at dmconsult@virginia.edu. One of the topics we cover in consultations with faculty is where and when to deposit and share your data.

Registry of Research Data Repositories
Re3data is the best-known and most widely referenced catalog of data repositories, containing listings for over 3,000 repositories across disciplines and offering various search and filter options.
Desirable Characteristics of Data Repositories for Federally Funded Research
Guidance from the National Science and Technology Council identifying characteristics of data repositories that support FAIR data.

Preparing Data for Sharing

Many recommendations and practices covered in this guide apply when it comes time to prepare a dataset for deposit. Below is a checklist that should help you think over what steps you have taken and what is left to be done before your dataset is deposit-ready:

Review your data management plan (if applicable). Note which repository you planned on using to share your dataset.
Consult that repository's website, documentation, and policies to see if they have any specifications for deposit, such as recommended file formats or metadata standards.
If your data is sensitive, ensure that an appropriate option for mediated or controlled access is available for your hosted dataset. Make sure to remove or properly anonymize any personally identifying information if necessary.
Determine which parts of your project and, specifically, what data will be deposited. Repositories aren't meant for hosting your entire project, but rather are for making available complete & final data and any code, documentation, and other supplemental files necessary to understand and make use of that data.
Ensure the data and other files you are planning to deposit are organized in a logical, coherent manner. Try to avoid unnecessary depth of hierarchy in the directory structure, especially if you plan on depositing as a zipped or compressed archive file.
Review your dataset to assure yourself it is in a good state to be deposited. Datasets often require some "cleaning" to be in the best shape for sharing and reuse: reorganizing or reformatting for clarity and fixing erroneous values to improve accuracy. If you have tabular data, follow spreadsheet best practices: avoid special characters in column headers, don't use special formatting like colors to encode information, save charts or diagrams as separate files, have only one table per tab, and export each tab of an Excel spreadsheet as a separate .csv file.
Make sure files are named descriptively and are in formats that support long-term preservation or that comport with repository recommendations (to the greatest extent possible/feasible). If files need to be converted, keep the original files.
Choose a suitable license for your dataset. Choose public domain or a more permissive license unless there's reason to be restrictive. If your dataset includes code, this may require a separate license.
Include documentation of reasonable thoroughness to help others understand the dataset. If your dataset has a data dictionary or a codebook, this should be part of the deposit. At minimum, include a README that covers basic information about the dataset, gives an overview of the data and file list, states a license for the data, defines variables, and covers any methodological information key to understanding the research.
Include accurate and complete metadata for your dataset. Dataset metadata should include (among other things) a title, authors, study abstract, links to related publications and codebases, funding/sponsorship information, and subject keywords.
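As one illustration of the spreadsheet advice in the checklist above, the following Python sketch strips special characters from the column headers of a CSV export. The function names `clean_header` and `clean_csv_text` and the sample table are hypothetical, chosen only for this example; the snake_case naming convention is an assumption, not a repository requirement. Note that cleaning can drop information such as units, so those belong in your data dictionary or README.

```python
import csv
import io
import re


def clean_header(name: str) -> str:
    """Turn a spreadsheet column header into a plain snake_case name."""
    # Replace every run of characters other than ASCII letters and digits with "_".
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Drop leading/trailing underscores left behind by symbols, then lowercase.
    return cleaned.strip("_").lower()


def clean_csv_text(raw_csv: str) -> str:
    """Return the same table with cleaned headers; data rows are unchanged."""
    rows = list(csv.reader(io.StringIO(raw_csv)))
    if not rows:
        return ""
    rows[0] = [clean_header(header) for header in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical one-row table exported from a single spreadsheet tab.
example = "Site ID,Temp (°C),% Cover\nA1,21.5,40\n"
print(clean_csv_text(example))
```

Two different headers can collapse to the same cleaned name (for example "Cover %" and "% Cover"), so it is worth checking the result for duplicates before depositing.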

LibraData Deposit Checklist
A checklist provided by LibraData to guide researchers in how to deposit data.
eCommons Submission Checklist
Cornell's data repository deposit checklist - this covers a lot of ground and is a useful supplement to this section.
CURATE(D) Training
A set of training modules developed by the Data Curation Network, based on their curation process. May be of interest to researchers who want a view into how data curators see datasets.
Data Organization in Spreadsheets for Social Scientists
An instructional module from Data Carpentry that covers (among other things) best practices for formatting data in spreadsheets.

Making Data FAIR

(This FAIR graphic is licensed under a CC BY-SA 4.0 International license.)

Defining FAIR

FAIR stands for Findable, Accessible, Interoperable, and Reusable, and has become recognized as an ideal for archived and shared research data since its introduction as a framework. While these terms may seem clear, it is helpful to explain their specific meaning in the context of the FAIR guiding principles:

Findable: A dataset is described with rich metadata, is assigned a unique and persistent identifier, and is indexed and searchable online. The metadata should include the identifier of the dataset it describes.
Accessible: The dataset and associated metadata are retrievable via their identifier using a standardized, open communications protocol that allows for authentication/authorization. Metadata should be accessible even if the data itself is no longer available.
Interoperable: Data and metadata use formally specified, standardized, and broadly applicable data/metadata schemas and vocabularies.
Reusable: Data and metadata are sufficiently described with fitting and accurate attributes, have a clear associated license covering reuse permissions, and meet the standards of the relevant domain or community.
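To make the principles above more concrete, here is a minimal Python sketch of a dataset description that uses the schema.org `Dataset` vocabulary serialized as JSON-LD, one widely used option among several metadata standards. The record carries its own identifier (Findable), uses a shared vocabulary (Interoperable), and states a license (Reusable). The helper name `dataset_record` and every value passed to it, including the DOI `10.1234/example`, are placeholders invented for illustration; in practice a repository usually generates a record like this from the deposit form.

```python
import json


def dataset_record(title, doi, license_url, creators, keywords, description):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        # The metadata includes the identifier of the dataset it describes.
        "identifier": f"https://doi.org/{doi}",
        "license": license_url,
        "creator": [{"@type": "Person", "name": name} for name in creators],
        "keywords": keywords,
    }


# All values below are hypothetical placeholders.
record = dataset_record(
    title="Example survey responses",
    doi="10.1234/example",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    creators=["Researcher, Example"],
    keywords=["survey", "example"],
    description="Placeholder abstract describing the study and the data.",
)
print(json.dumps(record, indent=2))
```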

Please consult the links below on the FAIR principles for a more thorough exposition as well as examples.

Machine Actionability

The main feature that distinguishes the FAIR guiding principles from other principles or recommendations regarding data is their strong emphasis on machine actionability. For (meta)data to be machine actionable, a computational system or agent must be able to discover, access, parse, and usefully act on it with little or no human intervention.

This capability is enabled by use of standards and protocols that are defined and structured in ways that programs and algorithms can "understand" and follow. Machine actionability is a desirable quality for hosted datasets due to our increasing reliance on computer systems to automate, expedite, or otherwise facilitate handling large volumes of data.
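As a small illustration of machine actionability, the Python sketch below prepares an HTTP request that asks the DOI resolver for a dataset's metadata in a structured format instead of its human-readable landing page. It uses the DOI cited in the image credit at the top of this guide. The helper name `metadata_request` is hypothetical, and the sketch assumes that the DOI's registration agency supports content negotiation for the CSL JSON media type; the request is only constructed here, not sent.

```python
import urllib.request

# Media type for citation metadata in CSL JSON (assumed to be supported
# by the registration agency behind the DOI).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str, media_type: str = CSL_JSON) -> urllib.request.Request:
    """Prepare (but do not send) a request for a DOI's metadata."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": media_type},
    )


request = metadata_request("10.18130/V3/CVKEMV")
print(request.full_url)
print(request.get_header("Accept"))

# To actually fetch the record (requires network access):
# import json
# with urllib.request.urlopen(request) as response:
#     record = json.load(response)
```

The point of the design is that a program needs only the persistent identifier and a standard protocol (HTTP) to reach structured metadata, with no person clicking through a web page.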

FAIR Principles
An outline of the FAIR principles from the GO FAIR initiative.
The FAIR Guiding Principles for scientific data management and stewardship
The original 2016 Nature paper on the FAIR principles by Wilkinson et al.
Preparing FAIR data for reuse and reproducibility
An informative page by Cornell Data Services on FAIR data.
ARDC FAIR Self-assessment tool
A tool to self-assess your data for FAIRness created by the Australian Research Data Commons.
The CARE Principles for Indigenous Data Governance
The CARE principles are intended as a complement for the FAIR principles, with a focus on Indigenous data and who controls it and benefits from it.
