Research Data Management

This guide offers recommendations and resources for managing research data in any discipline.

Project and File Organization

The way project directories and folders are structured is an underappreciated but important aspect of research data management. Of course, sensible file organization schemes are highly context-dependent and will vary from project to project and across domains. That said, there are some general principles and guidelines that apply broadly:

  • Plan it early: Don't wait until partway through a project to devise an organization plan. A plan is easier to follow if it is in place from the beginning, before following it requires moving or renaming existing files.
  • Write it down: You will want to document your file organization scheme for several reasons, including making it easier for students or collaborators to follow, ensuring you understand the organization if you come back to a project after a long pause, and for use in your final documentation that you submit to a repository at the end of the project.
  • Make it logical: There won't ever be only one logical organization for any project, but some choices are better than others. Beyond organizing your files and folders by project, you might choose to organize projects by
    • Time period
    • Researcher/project staff
    • File type or function
    • Location
    • Research activity or experiment
    • Specimen
    • Or some other way your data and files naturally group together.
  • Follow it consistently: However you choose to organize your project folders and files, it is crucial to follow your scheme consistently. It is too easy to forget to go back and fix instances where the scheme wasn't followed, likely leading to confusion. If the plan needs to be changed, change the written plan and reorganize the project files systematically.
  • Avoid encoding information in the directory structure: You should be able to understand the contents of a file independently of the directory structure. In other words, each file should be described sufficiently by its name alone. See the next section on file naming conventions.

Below is a diagram of an example directory structure. Again, there is no right answer, and you may choose to organize your project folders differently (for instance, you might keep code and results together, or you may choose to keep documentation specific to each experiment or sub-study).

National-Parks-NSF-Study-20023 directory with folders for Project Documentation, Project Communication, Yellowstone substudy, and Yosemite substudy. The Yellowstone and Yosemite substudies each include raw data, code, and outputs.

(Example file directory image created by Michael and licensed CC-BY)
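If you prefer to set up a structure like this programmatically, a small script can guarantee every sub-study gets the same internal layout. Below is a minimal sketch in Python; the folder names follow the example diagram above and would of course be adapted to your own project:

```python
from pathlib import Path

# Top-level project folder and shared subfolders
# (names follow the example diagram; adapt to your project)
project = Path("National-Parks-NSF-Study-20023")
shared = ["Project-Documentation", "Project-Communication"]

# Each sub-study gets the same internal layout: raw data, code, outputs
substudies = ["Yellowstone-substudy", "Yosemite-substudy"]
internal = ["raw-data", "code", "outputs"]

for name in shared:
    (project / name).mkdir(parents=True, exist_ok=True)

for study in substudies:
    for folder in internal:
        (project / study / folder).mkdir(parents=True, exist_ok=True)
```

Creating folders with a script rather than by hand also doubles as written documentation of the scheme, and makes it trivial to add a new sub-study with an identical layout later.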

File Naming Conventions

Best practice for file naming is to make names descriptive of a file's contents. The goal is to be able to tell at a glance what is in any given file. Some potential attributes and information to include are:

  • Date and time
  • Researcher name or laboratory
  • Data type
  • Version number and/or status
  • Data collection location
  • Type of experiment or observation
  • Type of instrumentation or equipment used

Some technical formatting guidelines for file naming:

  • Be consistent: Just as with your organizing scheme, a naming scheme is most effective when followed consistently. This applies especially to the order in which you use the above elements in names. Also be consistent with case (lowercase, UPPERCASE, or CamelCase).
  • Avoid spaces and special characters: Use only letters, numbers, and underscores or dashes.
  • Use a date format: YYYY-MM-DD or YYYYMMDD is common, and will sort chronologically.
  • Pad with zeroes: If you expect 100 or more numbered files, start with 001, not 1. This again allows for proper sorting.
  • Limit to 32 characters: Longer filenames can be unwieldy, and more importantly can cause issues with some file systems that limit the length of file paths.
  • Avoid vague terms: Both overly common terms ("data", "sample") and vague versioning terms ("revision", "final") should be avoided when possible. Use version numbers, dates, or other forms of version control for the latter case.
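These guidelines can be combined into a small helper function. The sketch below is in Python; the particular elements and their order (date, researcher, data type, sequence number) are just one hypothetical convention, and the point is the consistency and the zero-padding:

```python
from datetime import date

def make_filename(d, researcher, data_type, seq, ext, width=3):
    """Build a name like 2024-03-01_smith_temperature_001.csv.

    Elements are joined with underscores (no spaces or special
    characters); the sequence number is zero-padded so that files
    sort correctly.
    """
    return f"{d.isoformat()}_{researcher}_{data_type}_{seq:0{width}d}.{ext}"

names = [make_filename(date(2024, 3, 1), "smith", "temperature", n, "csv")
         for n in (1, 2, 10, 100)]

# Zero-padding makes lexicographic order match numeric order
print(sorted(names) == names)  # True
```

Without the padding, a plain-text sort would place `_10` and `_100` before `_2`, which is exactly the problem the "pad with zeroes" guideline avoids.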

File Types and Formats

File formats should be chosen to enable sharing, long-term access, and preservation of your data. Ideally, this means standard and open (non-proprietary) formats, but this may not always be possible depending on the file type and project needs. Researchers of course must consider which formats are best suited to data creation/collection and analysis vs. which are most easily preserved and shared.

When open formats are an option, however, openly available documentation and continued community support for these formats increase the likelihood that such files will be successfully preserved and able to be (re)used down the road by a wider audience.

If you use a program with a proprietary file format as a part of your research, we recommend exporting a copy of that data/file in an open format if possible (e.g. exporting tabular data from Excel as a comma-separated value file), especially when it comes time to deposit and share your data. Note that such format conversions may in some cases result in the loss of data, metadata, formatting, or other information. For this reason, we also recommend keeping the original data files, as they may contain the most complete version of your dataset.
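Part of what makes a format like CSV preservation-friendly is that it can be read and written with nothing more than a language's standard library. A minimal sketch in Python (the column names and values here are made up for illustration):

```python
import csv

rows = [
    {"site": "Yellowstone", "date": "2024-03-01", "temp_c": "4.2"},
    {"site": "Yosemite", "date": "2024-03-01", "temp_c": "11.8"},
]

# Write the data as CSV -- plain text, readable by virtually any tool
with open("observations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "temp_c"])
    writer.writeheader()
    writer.writerows(rows)

# Reading it back recovers the same values. Note that everything in a
# CSV is text: numeric types, formulas, and cell formatting from a
# spreadsheet do not survive -- one kind of information such
# conversions can lose, which is why keeping the originals matters.
with open("observations.csv", newline="") as f:
    recovered = list(csv.DictReader(f))

print(recovered == rows)  # True
```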

For certain file types such as images, audio, and video, you will have a choice between lossy and lossless formats. Lossy formats employ (irreversible) compression to reduce file size at the cost of fidelity. Lossless formats are generally preferred unless storage space is at a premium.

Recommended Digital Formats Overview (for preservation):

  • Text: Plain text (.txt), Rich text (.rtf), Markdown, XML, HTML, PDF/A
  • Raster image: TIFF (uncompressed), JPEG2000 (lossless), PNG
  • Vector image: SVG
  • Tabular data: CSV, TSV, JSON
  • Database: XML, CSV, .db/.db3, SQLite
  • Statistical: .R/.Rmd/.Rdata, .por (SPSS portable), .do/.dta (STATA), SAS formats
  • Geospatial: ESRI shapefiles, GML, NetCDF, GeoTIFF, GeoJSON
  • Audio: FLAC, WAV, AIFF
  • Video/Moving image: MPEG-4, MOV, AVI, MXF
  • Code: Uncompiled source code (.py, .c, .cpp, .java, .js, .php, etc.)

The above are just common suggestions. If you have a repository in mind, check their site to see what they recommend. If you do convert your files or export copies of your files in standard/open formats, consider adjusting filenames to make this clear, as the differences between these files will not necessarily be obvious from the file extensions/types alone.

If you have questions about file types/formats and are wondering which are best for your project, reach out to us.

Version Control

It is often important when working with research files to be able to track changes and revert to an earlier version of a file. Version control refers both to the process of tracking multiple versions of a file and to software that implements such functionality. Version control is often thought of as something only programmers use and need, but it has benefits for many types of research and data files besides code. These benefits may be particularly useful for datasets that require complex processing and for collaborative research projects where many people edit the same files.
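At its core, version control software detects that a file's contents have changed, often by comparing a hash of the content. The toy sketch below (Python; this is the general idea, not how any particular tool stores its data) shows why even a one-character edit is reliably noticed:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a short content hash; identical content gives an
    identical fingerprint, so any change is detectable."""
    return hashlib.sha256(data).hexdigest()[:12]

v1 = b"site,date,temp_c\nYellowstone,2024-03-01,4.2\n"
v2 = b"site,date,temp_c\nYellowstone,2024-03-01,4.3\n"  # one digit edited

# The single-digit edit produces a completely different fingerprint,
# which is how a tool can tell a new version needs to be recorded
print(fingerprint(v1) != fingerprint(v2))  # True
```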

If you want to employ version control in your project's file management, there are a few ways to do it:

  1. Manually: As noted previously, you can manually version your files. Several approaches include:
    • Keeping numbered or dated versions of them as you make edits, revisions, additions, or transformations
    • Including file history or version control information notes in the files themselves
    • Using a dedicated table or file to record version information as you change and update files
  2. File sharing platforms: Some common file sharing platforms have versioning functionality. This includes UVA Box, Open Science Framework, Google Docs, and Google Drive.
  3. Repository hosting platforms: GitHub, GitLab, and Bitbucket are examples of file/repository hosting platforms that are built on a version control backbone (in this case, Git; see #4). These are a flexible and powerful option for research projects that involve code.
  4. Version control software: There is dedicated version control software that allows for the most control over the version control process, at the cost of a steeper technical learning curve. Git is the most popular and well-known of these, though some other examples include Mercurial, Apache Subversion, and SmartSVN.
  5. Electronic lab notebooks: Some electronic lab notebooks (ELNs) come with built-in version control.

Some best practices for manual version control:

  • Decide on a system for identifying versions and stick with it
  • Record all changes made to a file, whether in the file itself, in a changelog file, or in the documentation
  • Identify "milestone" versions to keep (e.g. if you update to version 3.0, you might keep 2.0 but not 2.1 and 2.2)
  • Decide how many versions of files to keep, for how long, and how you will organize them
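A manual scheme along these lines can be sketched in a few lines of Python. The `_vNNN` suffix and the tab-separated changelog layout below are hypothetical choices; any consistent, written-down scheme works just as well:

```python
import re
from pathlib import Path

def next_version(filename: str) -> str:
    """Bump a _vNNN suffix: analysis_v002.R -> analysis_v003.R.

    Padding width is preserved so files keep sorting correctly.
    """
    match = re.search(r"_v(\d+)(\.[^.]+)$", filename)
    if not match:
        raise ValueError("filename has no _vNNN suffix")
    number, ext = match.groups()
    bumped = str(int(number) + 1).zfill(len(number))
    return filename[:match.start()] + f"_v{bumped}{ext}"

def log_change(logfile: Path, filename: str, note: str) -> None:
    """Record each change in a plain-text changelog, one line per edit."""
    with logfile.open("a") as f:
        f.write(f"{filename}\t{note}\n")

print(next_version("analysis_v002.R"))  # analysis_v003.R
```

Generating version names and changelog entries with helpers like these removes the temptation to fall back on vague labels such as "final" or "revision".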