Data Organization

Data organization begins in the planning phase but becomes especially important when data collection starts. Well-organized file names and folder structures make it easier to find and keep track of data files, and to share and reuse research data.
Documentation of the filenaming convention and folder structure is also important metadata to provide when sharing or publishing research data.

To organize data properly, make sure to:

Name your files in a structured and consistent manner.
Choose appropriate file formats.
Organize files in a workable folder structure.
When available, adopt data organization standards from your discipline.

File Naming Conventions

Information to Include in a File Name

Good file names can provide useful cues to the content and status of a file, uniquely identify a file and help in classifying files. You might consider including some of the following information in your file names, but you can include any information that will allow you to distinguish your files from one another.

Project or experiment name or acronym
Location/spatial coordinates
Researcher name/initials
Date or date range of experiment
Type of data
Conditions
Version number of file

Keep in Mind

When naming files, keep in mind:

Choose short but meaningful file names
Avoid special characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " and | in file names
Avoid spaces, use other options (e.g. underscore, dash, no separation, Camel case) instead
Filenames with year-month-day start in the form YYYYMMDD or YYMMDD so that the files are listed chronologically
When possible, follow naming conventions that are standard or broadly used in your field.

Version Control

Version control is a system that keeps track of changes made to a file (usually source code) over time so that earlier versions can be recalled later. This can be either done automatically (e.g. Git) or manually by adding version numbering to a file. Each new version indicates file revisions or edits, which is especially useful in collaborations.

For example:

File name	Changes to file
LC_Interviewschedule_1.0	Original document
LC_Interviewschedule_1.1	Minor revisions made
LC_Interviewschedule_1.2	Further minor revisions
LC_Interviewschedule_2.0	Substantive changes

Some types of text documentation (e.g. a "terms and conditions" document) may include a Changes Log section at the end of a document, describing the main changes in each version.

Examples of Good File Names

20100212_FG1_CONS is the file that contains the transcript of the first focus group with a study of consumers, that took place on 12 February 2010.
20080605_Int024_R01 is an interview with participant 024, interviewed by Researcher 01 on 5 June 2008. (It can be useful to mask the researchers instead of using their names or initials if you want to conduct blind analyses.)

Top

File Formats

A file format is a standard way to encode data for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open. The file formats you use have a direct impact on your ability to open those files at a later date and on the ability of other people to access those data.

When selecting file formats for archiving, the formats should ideally be:

Non-proprietary
Unencrypted
Uncompressed
In common usage by the research community

File Formats for Reusability and Preservation

Type of data	Appropriate	Acceptable	not suitable
Tabular data with extensive metadata	.csv / .tsv / .hdf5	.txt / .html / .tex / .por
Tabular data with minimal metadata	.csv / .tsv /.tab / .ods / SQL	.xml if appropriate DTD / .xlsx	.xls / .xlsb
Textual data	.pdf / .txt / .odt / .odm / .tex / .md / .htm / .xml	.pptx / .pdf with embedded forms / .rtf	.doc / .ppt
Code	.m / .R / .py / .iypnb / .rstudio / .rmd / NetCDF	.sdd	.mat / .rdata
Digital image data	.tif (uncompressed) / .png / .svg / .jpeg	.jpg / .jp2 / .tif (compressed) / .tiff / .pdf / .gif / .bmp	.indd / .ait / .psd
Digital audio data	.flac / .wav / .ogg	.mp3 / .mp4 / .aif
Digital video data	.mp4 / .mj2 / .avi / .mkv	.ogm / .webm	.wmv / .mov
Geospatial data	NetCDF, tabular GIS attribute data, .shp / .shx / .dbf / .prj / .sbx / .sbn / PostGIS / .tif / .tfw / GeoJSON	.mdb / .mif
CAD / vector and raster data	.x3d / .x3dv / .x3db / PDF3D .pdf	.dwg / .dxf
Generic data	.xml / .json / .rdf

Source: EPFL Library. 2018. Recommended data formats. Available at: https://www.epfl.ch/campus/library/wp-content/uploads/2018/05/Recommended_DataFormats_-2018_03_05_Final.pdf

A note on textual data: The encoding format UNICODE UTF-8 is recommended for interoperability. If formatting is not important, convert the file to plain text. If formatting is important, the recommended format for archiving is .pdf (following the PDF/A standardized version).

Compressing Files

If you need to use compressed files, favor lossless compression as opposed to lossy compression formats that involves permanently discarding original data. Depending on the data features that are scientifically relevant, compressed formats of any type may not be acceptable.

Additional References

van de Wiel, H., Fraga Gonzalez, G., Furrer, E., & Held, L. (2024). Primer: File Naming Conventions. Zenodo. https://doi.org/10.5281/zenodo.13322390

ETHZ: File formats for archiving
CLARIN: Format recommendations for language data

Top

Folder Structure

When organizing the folder structure, it helps if the data is divided into categories, e.g. by project/subproject, time, date, file type, location, etc. When possible, follow folder and file organization standards in your field and organize folders in a way that an external collaborator will be able to find and reuse the files with minimal instructions.

Keep in Mind:

The folder structure should not be more than 3-4 levels deep.
Keep source data (original data as collected) folders clearly separated from the preprocessed data (derivative data, where transformations are performed on the original data).
Clearly label any folder that contains identifying or particularly sensitive information. Such folders may have additional encryption and they may be excluded from archiving, hence they should be easy to distinguish from others.
Balance informative names and concision. If some information is encoded in the filename, avoid repetition in the folder name and vice versa.
The use of numbering in folder names (e.g., 01_data, 02_documents, etc.) is generally discouraged as it can lead to ambiguities, frequent need to rename folders or to a too inflexible folder organization. However, numbering script folders or filenames can be helpful to indicate the intended execution order.

Some examples:

Additional References

Fraga González, G., Clark, A., Furrer, E., & Held, L. (2024). Primer: Long-term Archiving of Experimental Data (Version v3). Zenodo. https://doi.org/10.5281/zenodo.13880988

Top

This tutorial was jointly created by the Open Science Services and Gorka Fraga González, Data Steward at LiRI.

Quicklinks

Main navigation

Data Organization

To organize data properly, make sure to:

File Naming Conventions

Information to Include in a File Name

Keep in Mind

Version Control

Examples of Good File Names

File Formats

File Formats for Reusability and Preservation

Compressing Files

Additional References

Folder Structure

Keep in Mind:

Additional References

Make sure that your data organization system is practical and used consistently.

Additional Information

Questions on Data Management?

Training

More information