Research Data Management

This guide offers advice and resources for managing research data in any discipline.

Data Sharing

Five copies of a bovine face

(Adapted from Rourk, Will, 2020, "150- Architectural Detail - Pavilion V parlor", https://doi.org/10.18130/V3/CVKEMV, University of Virginia Dataverse, V1; pav_v_parlor_bucrania_STL.stl [fileName] with a CC0 1.0 Public Domain License.)

Data sharing isn't new. Researchers have been sending each other data files for years. What is new is that many funders now require data sharing as a key component of their research funding strategy. Publishers may also require you to share the data that supports the articles they publish.

Research data is a valuable resource that usually takes considerable time and money to produce. Many data collections have significant value beyond their use in the original research. Because digital data can be stored, disseminated, and made accessible online with relative ease, many institutions are keen to share research data to increase the impact and visibility of their research.

Why should you share your research data?

  • Enable others to replicate and verify results as part of the scientific process
  • Allow researchers to ask new questions and conduct new analyses, and improve research methods
  • Link to research products like publications & presentations
  • Create a more complete understanding of a research study
  • Meet the expectations of sponsors, funders, publishers and institutions
  • Receive credit for data creation for career advancement
  • Reduce the costs of duplicating data collection

How do I share my data?

  • Deposit it in a repository
  • Submit as supplemental material to a journal in support of an article
  • Link it to a data paper in a journal

Data repositories are not just placeholders; many of them also preserve and curate the data. Funders may specify repositories for the research data produced by projects they fund. Publishers may require that the data supporting research they publish be deposited in a specific location.

Advantages to using a repository:

  • Persistent Identifiers -- unique and citable
  • Terms of use and licenses
  • Access controls
  • Repository guidelines for deposit
  • Long-term data preservation in standard file formats
  • Regular data backup
  • Quality standards

The Data Repository Landscape

About Research Data Repositories

Research data repositories are storage locations and services designed and intended for long-term data archiving and preservation. Data repositories generally offer support for data sharing, enable access controls for datasets, provide persistent and citable identifiers, and create landing pages for deposited datasets that display descriptive metadata. While most data repositories share some core features, they vary considerably in the level of support offered beyond the basics.

Research data repositories are frequently separated into three broad categories:

  1. Generalist/Discipline-agnostic repositories: These repositories are characterized by the broad range of datasets they will host. Repositories in this category typically don't place domain- or data type-based restrictions on deposits, though some may not accept data that requires access limitations (e.g. human subjects or medical research). These repositories can range from completely general (like Zenodo) to having a skew toward a certain field or topic (for instance, Dryad is generalist but hosts many life sciences datasets).
  2. Discipline- and data type-specific repositories: These repositories cater to a specific domain or type of data. These repositories tend to have staff with specialized expertise and more targeted options and robust features. One example is the Inter-university Consortium for Political and Social Research (ICPSR), a respected repository for the social sciences. One data type-specific repository is GenBank, the NIH database for genetic data.
  3. Institutional repositories: These repositories are local to research institutions. In the case of universities, they are often housed in the university's library. Historically, institutional repositories have focused on hosting publications (especially theses and dissertations), but now increasingly include data repositories. The University of Virginia's institutional data repository is LibraData.

Beyond these categories, one other major factor in choosing a repository is whether or not it offers curation services. The NLM defines data curation as "the ongoing processing and maintenance of data throughout its lifecycle to ensure long term accessibility, sharing, and preservation." Some repositories will provide curation services for hosted datasets, or expect that deposited datasets meet some minimum standard of curation. Generalist repositories typically do not curate deposited data. Some guidance that aligns with data curation is given in a later section.

Data Repository Characteristics

What makes a data repository effective and trustworthy? Federal agencies have devised a consensus set of "desirable characteristics" of data repositories for federally funded data, including but not limited to:

  • Free and open access
  • Clear (re)use guidance
  • Stated retention policies
  • Demonstrated capabilities for long-term planning and risk management
  • Curation and quality assurance
  • A persistent identifier and metadata for every dataset
  • Data provenance tracking
  • Established data security and authentication practices

We have linked to this document below.

We have the following recommendations for choosing a repository for your dataset:

Choosing a data repository:

  1. If the funder recommends or requires a specific repository, use that one. Funders usually have good reasons for their pick.
  2. If there is no funder-mandated repository, prioritize data type- and discipline-specific repositories. These tend to be better known and more frequently searched, meaning your dataset will be more likely to be found and reused.
  3. If neither of the previous points applies, LibraData is a fine choice. See the LibraData subpage or other LibraData resources for more information.

Choice of repository may also depend on details such as data formats and size, a repository's submission requirements and preservation policies, and whether or not there are associated fees. If you have questions or concerns about choice of repository, please reach out to us at dmconsult@virginia.edu. One of the topics we cover in consultations with faculty is where and when to deposit and share your data.

Preparing Data for Sharing

Many recommendations and practices covered in this guide apply when it comes time to prepare a dataset for deposit. Below is a checklist that should help you think over what steps you have taken and what is left to be done before your dataset is deposit-ready:

  1. Review your data management plan (if applicable). Note which repository you planned on using to share your dataset.
  2. Consult that repository's website, documentation, and policies to see if they have any specifications for deposit, such as recommended file formats or metadata standards.
  3. If your data is sensitive, ensure that an appropriate option for mediated or controlled access is available for your hosted dataset. Make sure to remove or properly anonymize any personally identifying information if necessary.
  4. Determine which parts of your project and, specifically, what data will be deposited. Repositories aren't meant for hosting your entire project. They are for making available complete and final data, along with any code, documentation, and other supplemental files necessary to understand and make use of that data.
  5. Ensure the data and other files you are planning to deposit are organized in a logical, coherent manner. Try to avoid unnecessary depth of hierarchy in the directory structure, especially if you plan on depositing as a zipped or compressed archive file.
  6. Review your dataset to assure yourself it is in a good state to be deposited. Datasets often require some "cleaning" to be in the best shape for sharing and reuse, such as reorganizing or reformatting for clarity and fixing erroneous values to improve accuracy. If you have tabular data, follow spreadsheet best practices: avoid special characters in column headers, don't use special formatting like colors to encode information, save charts or diagrams as separate files, have only one table per tab, and export each tab of an Excel spreadsheet as a separate .csv file.
  7. Make sure files are named descriptively and are in formats that support long-term preservation or that comport with repository recommendations (to the greatest extent possible/feasible). If files need to be converted, keep the original files.
  8. Choose a suitable license for your dataset. Choose a public domain dedication or a permissive license unless there's reason to be restrictive. If your dataset includes code, the code may require a separate license.
  9. Include documentation of reasonable thoroughness to help others understand the dataset. If your dataset has a data dictionary or a codebook, this should be part of the deposit. At minimum, include a README that covers basic information about the dataset, gives an overview of the data and file list, states a license for the data, defines variables, and covers any methodological information key to understanding the research.
  10. Include accurate and complete metadata for your dataset. Dataset metadata should include (among other things) a title, authors, study abstract, links to related publications and codebases, funding/sponsorship information, and subject keywords.
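
Some of the spreadsheet recommendations in step 6 can be checked automatically before deposit. The following is a minimal Python sketch, not an official tool, that flags column headers containing spaces or special characters in a .csv file. The function name `find_problem_headers`, the `SAFE_HEADER` pattern, and the sample data are illustrative assumptions; adjust the pattern to whatever your chosen repository recommends.

```python
import csv
import io
import re

# Assumption: a "safe" header starts with a letter and contains only
# letters, digits, and underscores. This rule is illustrative, not a standard.
SAFE_HEADER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def find_problem_headers(csv_text):
    """Return the column headers in csv_text that do not match SAFE_HEADER."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader, [])  # first row is treated as the header row
    return [h for h in headers if not SAFE_HEADER.match(h)]


# Hypothetical sample data: two headers contain spaces or special characters.
sample = "site_id,Temp (°C),sample date,ph\nA1,21.5,2020-03-01,7.2\n"
problems = find_problem_headers(sample)
print(problems)
```

Headers flagged this way can be renamed (for example, to `temp_c` and `sample_date`), with the units and original labels recorded in the data dictionary or README.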

Making Data FAIR

(This FAIR graphic is licensed under a CC BY-SA 4.0 International license.)

Defining FAIR

FAIR stands for Findable, Accessible, Interoperable, and Reusable, and has become recognized as an ideal for archived and shared research data since its introduction as a framework. While these terms may seem clear, it is helpful to explain their specific meaning in the context of the FAIR guiding principles:

  • Findable: A dataset is described with rich metadata, is assigned a unique and persistent identifier, and is indexed and searchable online. The metadata should include the identifier of the dataset it describes.
  • Accessible: The dataset and associated metadata are retrievable via their identifier using a standardized, open communications protocol that allows for authentication/authorization. Metadata should be accessible even if the data itself is no longer available.
  • Interoperable: Data and metadata use formally specified, standardized, and broadly applicable data/metadata schemas and vocabularies.
  • Reusable: Data and metadata are sufficiently described with fitting and accurate attributes, have a clear associated license covering reuse permissions, and meet the standards of the relevant domain or community.
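
As a rough illustration of how these principles translate into a metadata record, the Python sketch below checks a record for a handful of fields loosely tied to the descriptions above. The field names and the `FAIR_FIELDS` mapping are assumptions made for this example, not part of the FAIR principles or of any repository's schema. The sample record reuses the citation details of the Dataverse dataset credited at the top of this page and deliberately leaves out a description and keywords.

```python
# Assumption: these field names are hypothetical and chosen for illustration.
# Real repositories define their own metadata schemas.
FAIR_FIELDS = {
    "identifier": "Findable: unique, persistent identifier (e.g. a DOI)",
    "title": "Findable: part of rich descriptive metadata",
    "creators": "Findable: part of rich descriptive metadata",
    "description": "Findable: part of rich descriptive metadata",
    "keywords": "Findable: supports indexing and search",
    "license": "Reusable: clear statement of reuse permissions",
}


def missing_fields(metadata):
    """Return the names in FAIR_FIELDS that are absent or empty in metadata."""
    return [name for name in FAIR_FIELDS if not metadata.get(name)]


record = {
    "identifier": "https://doi.org/10.18130/V3/CVKEMV",
    "title": "150- Architectural Detail - Pavilion V parlor",
    "creators": ["Rourk, Will"],
    "license": "CC0 1.0",
}

for name in missing_fields(record):
    print(f"Missing {name!r} ({FAIR_FIELDS[name]})")
```

A check like this covers only the presence of fields. Whether the metadata is rich, accurate, and expressed in a community standard still requires human judgment.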

Please consult the links below on the FAIR principles for a more thorough exposition as well as examples.

Machine Actionability

The main feature that distinguishes the FAIR guiding principles from other principles or recommendations regarding data is their strong emphasis on machine actionability. For (meta)data to be machine actionable, a computational system or agent must be able to discover, access, parse, and in general usefully act upon it with little or no human intervention.

This capability is enabled by use of standards and protocols that are defined and structured in ways that programs and algorithms can "understand" and follow. Machine actionability is a desirable quality for hosted datasets due to our increasing reliance on computer systems to automate, expedite, or otherwise facilitate handling large volumes of data.
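
To make this concrete, the Python sketch below shows a program acting on a structured metadata record without human help: it parses the record, extracts the DOI, and decides whether the license appears on a list of open licenses. The record is hand-written for illustration using schema.org's `Dataset` vocabulary and is not a copy of what any repository actually serves. The `OPEN_LICENSES` list and the function name `summarize_dataset` are likewise assumptions.

```python
import json

# Hand-written example record (an assumption for illustration only).
RECORD_JSON = """
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "150- Architectural Detail - Pavilion V parlor",
  "identifier": "https://doi.org/10.18130/V3/CVKEMV",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/"
}
"""

# Assumption: a short, hypothetical allow-list of license URLs that this
# script treats as open. A real workflow would maintain its own list.
OPEN_LICENSES = {
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "https://creativecommons.org/licenses/by/4.0/",
}

DOI_PREFIX = "https://doi.org/"


def summarize_dataset(record_text):
    """Parse a JSON metadata record and pull out the fields a script can act on."""
    record = json.loads(record_text)
    if record.get("@type") != "Dataset":
        raise ValueError("record does not describe a Dataset")
    identifier = record.get("identifier", "")
    # Strip the resolver prefix to get the bare DOI, if the identifier is a DOI URL.
    doi = identifier[len(DOI_PREFIX):] if identifier.startswith(DOI_PREFIX) else None
    return {
        "name": record.get("name"),
        "doi": doi,
        "openly_licensed": record.get("license") in OPEN_LICENSES,
    }


summary = summarize_dataset(RECORD_JSON)
print(summary)
```

The point of the sketch is that the license is given as a standard URL and the identifier as a resolvable DOI, so the script can make these decisions without a person reading a landing page. Free-text statements such as "contact the author for terms" would not be actionable in the same way.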