Header

Search

Data Organization

Data organization begins in the planning phase but becomes especially important when data collection starts. Well-organized file names and folder structures make it easier to find and keep track of data files, and to share and reuse research data.
Documentation of the filenaming convention and folder structure is also important metadata to provide when sharing or publishing research data.

To organize data properly, make sure to:

  1. Name your files in a structured and consistent manner.
  2. Choose appropriate file formats.
  3. Organize files in a workable folder structure.
  4. When available, adopt data organization standards from your discipline.

File Naming Conventions

Information to Include in a File Name

Good file names can provide useful cues to the content and status of a file, uniquely identify a file and help in classifying files. You might consider including some of the following information in your file names, but you can include any information that will allow you to distinguish your files from one another.

  • Project or experiment name or acronym
  • Location/spatial coordinates
  • Researcher name/initials
  • Date or date range of experiment
  • Type of data
  • Conditions
  • Version number of file

Keep in Mind

When naming files, keep in mind:

  • Choose short but meaningful file names
  • Avoid special characters such as  ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " and | in file names
  • Avoid spaces, use other options (e.g. underscore, dash, no separation, Camel case) instead
  • Filenames with year-month-day start in the form YYYYMMDD or YYMMDD so that the files are listed chronologically
  • When possible, follow naming conventions that are standard or broadly used in your field.

Version Control

Version control is a system that keeps track of changes made to a file (usually source code) over time so that earlier versions can be recalled later. This can be either done automatically (e.g. Git) or manually by adding version numbering to a file. Each new version indicates file revisions or edits, which is especially useful in collaborations.

For example:

File name

Changes to file

LC_Interviewschedule_1.0

Original document

LC_Interviewschedule_1.1

Minor revisions made

LC_Interviewschedule_1.2

Further minor revisions

LC_Interviewschedule_2.0

Substantive changes

Some types of text documentation (e.g. a "terms and conditions" document) may include a Changes Log section at the end of a document, describing the main changes in each version.

Examples of Good File Names

  • 20100212_FG1_CONS is the file that contains the transcript of the first focus group with a study of consumers, that took place on 12 February 2010.
  • 20080605_Int024_R01 is an interview with participant 024, interviewed by Researcher 01 on 5 June 2008. (It can be useful to mask the researchers instead of using their names or initials if you want to conduct blind analyses.)

File Formats

A file format is a standard way to encode data for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open. The file formats you use have a direct impact on your ability to open those files at a later date and on the ability of other people to access those data.

When selecting file formats for archiving, the formats should ideally be:

  • Non-proprietary
  • Unencrypted
  • Uncompressed
  • In common usage by the research community

File Formats for Reusability and Preservation

Type of data

Appropriate

Acceptable

not suitable

Tabular data with extensive metadata

.csv / .tsv / .hdf5

.txt / .html / .tex / .por

 

Tabular data with minimal metadata

.csv / .tsv /.tab / .ods / SQL

.xml if appropriate DTD / .xlsx

.xls / .xlsb

Textual data

.pdf / .txt / .odt / .odm / .tex / .md / .htm / .xml

.pptx / .pdf with embedded forms / .rtf

.doc / .ppt

Code

.m / .R / .py / .iypnb / .rstudio / .rmd / NetCDF

.sdd

.mat / .rdata

Digital image data

.tif (uncompressed) / .png / .svg / .jpeg

.jpg / .jp2 / .tif (compressed) / .tiff / .pdf / .gif / .bmp

.indd / .ait / .psd

Digital audio data

.flac / .wav / .ogg

.mp3 / .mp4 / .aif

 

Digital video data

.mp4 / .mj2 / .avi / .mkv

.ogm / .webm

.wmv / .mov

Geospatial data

NetCDF, tabular GIS attribute data, .shp / .shx / .dbf / .prj / .sbx / .sbn / PostGIS / .tif / .tfw / GeoJSON

.mdb / .mif

 

CAD / vector and raster data

.x3d / .x3dv / .x3db / PDF3D .pdf

.dwg / .dxf

 

Generic data

.xml / .json / .rdf

 

 

Source: EPFL Library. 2018. Recommended data formats. Available at: https://www.epfl.ch/campus/library/wp-content/uploads/2018/05/Recommended_DataFormats_-2018_03_05_Final.pdf

A note on textual data: The encoding format UNICODE UTF-8 is recommended for interoperability. If formatting is not important, convert the file to plain text. If formatting is important, the recommended format for archiving is .pdf (following the PDF/A standardized version).

Compressing Files

If you need to use compressed files, favor lossless compression as opposed to lossy compression formats that involves permanently discarding original data. Depending on the data features that are scientifically relevant, compressed formats of any type may not be acceptable.

Additional References

van de Wiel, H., Fraga Gonzalez, G., Furrer, E., & Held, L. (2024). Primer: File Naming Conventions. Zenodo. https://doi.org/10.5281/zenodo.13322390

ETHZ: File formats for archiving
CLARIN: Format recommendations for language data

Folder Structure

When organizing the folder structure, it helps if the data is divided into categories, e.g. by project/subproject, time, date, file type, location, etc. When possible, follow folder and file organization standards in your field and organize folders in a way that an external collaborator will be able to find and reuse the files with minimal instructions.

Keep in Mind:

  • The folder structure should not be more than 3-4 levels deep.
  • Keep source data (original data as collected) folders clearly separated from the preprocessed data (derivative data, where transformations are performed on the original data).
  • Clearly label any folder that contains identifying or particularly sensitive information. Such folders may have additional encryption and they may be excluded from archiving, hence they should be easy to distinguish from others.
  • Balance informative names and concision. If some information is encoded in the filename, avoid repetition in the folder name and vice versa.
  • The use of numbering in folder names (e.g., 01_data, 02_documents, etc.) is generally discouraged as it can lead to ambiguities, frequent need to rename folders or to a too inflexible folder organization. However, numbering script folders or filenames can be helpful to indicate the intended execution order.

Some examples:

Image of an exemplary folder structure where all subfolders are visible.

 

Additional References

Fraga González, G., Clark, A., Furrer, E., & Held, L. (2024). Primer: Long-term Archiving of Experimental Data (Version v3). Zenodo. https://doi.org/10.5281/zenodo.13880988

Make sure that your data organization system is practical and used consistently.


This tutorial was jointly created by the Open Science Services and Gorka Fraga González, Data Steward at LiRI.

Additional Information

Questions on Data Management?

data@ub.uzh.ch

+41 44 635 47 44

Training