File formats in which data are created depend on:
- Software in which research data are created and digitized
- How researchers plan to analyze data
- Hardware used
- Availability of software
When formatting your data for storage, store data in non-proprietary software formats (e.g., comma delimited text file, .csv).
Formats with the following characteristics are considered relatively stable and better suited to long-term preservation:
- open documentation
- support across a range of software platforms
- wide adoption
- no compression (or lossless compression)
- no embedded files or embedded programs/scripts
- non-proprietary format
|Type|Formats|File extensions|
|---|---|---|
|Text|Acrobat PDF/A, Comma-Separated Values, MS Word, Plain Text (US-ASCII, UTF-8), RTF, XML|.pdf, .csv, .html, .txt, .xml|
|Multimedia|AVI, AIFF, Bitmap, DICOM, MPEG, JPEG, JPEG2000, PNG, QuickTime, WAVE, TIFF|.aif, .aiff, .avi, .jpg, .jp2, .png, .tif, .tiff, .wav|
|Models|3D, Statistical, Similitude, Macroeconomic, Causal|.aif, .aiff,|
|Software|Java, C, Perl, Python, Ruby, PHP|.java, .c, .pyc|
|Discipline or Instrument Specific|Flexible Image Transport System (astronomy); Carl Zeiss Digital Microscopic Image Format||
See a list and descriptions of file formats maintained by the Library of Congress.
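As a minimal sketch of the advice to store data in a non-proprietary format, the following Python snippet writes a small table as comma-separated text using only the standard library. The column names and values are hypothetical and chosen purely for illustration; the helper name `write_csv` is likewise an assumption, not part of any tool mentioned in this guide.

```python
import csv
import io

# Hypothetical observations; column names and values are illustrative only.
rows = [
    {"site": "Corvallis", "date": "2007-05-07", "species_count": 12},
    {"site": "Corvallis", "date": "2007-05-08", "species_count": 15},
]


def write_csv(rows, handle):
    """Write a list of dicts as comma-separated text with a header row."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained. For a real file, use
# open("data.csv", "w", newline="", encoding="utf-8") as the handle instead.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Plain UTF-8 text with a header row can be opened by spreadsheets, statistical packages, and text editors alike, which is the practical benefit of the characteristics listed above.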
Organizing files according to best practices is essential for keeping data files accessible and for finding and keeping track of them.
- Develop a system that works for your project
- Use file names to classify broad types of files
- Create meaningful but brief names (“Corvallis_VegBiodiv_2007” is clearer than “Year01” or “Fall03”)
- Capitalize each word to set it apart from the next.
- Avoid using special characters in a file name (\ / : * ? “ < > | [ ] & $).
- Use underscores or hyphens (“_” or “-”) instead of periods or spaces
- Capture place, time, and theme – extremely useful, even if done in a highly abbreviated manner
- Reverse dates so they sort usefully (YYYYMMDD), e.g., filenaming_20080507
- Capture document version control (v01, v02, v03 instead of filenaming_latest)
- Be consistent. A naming system is only effective if everyone in the group follows the rules
See more on Electronic Research Notebook (ERN) and File Naming from the University of Michigan.
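The naming guidelines above can be sketched as a small helper. This is only one possible reading of them, not a prescribed implementation: the function names `clean_part` and `build_filename`, the part order (place, theme, date, version), and the default extension are all assumptions made for the example.

```python
import re
from datetime import date

# Characters the guidelines say to avoid, plus periods and whitespace.
_UNSAFE = re.compile(r'[\\/:*?"“”<>|\[\]&$.\s]+')


def clean_part(text):
    """Capitalize each word, then drop unsafe characters and spaces."""
    words = _UNSAFE.sub(" ", text).split()
    return "".join(word[:1].upper() + word[1:] for word in words)


def build_filename(place, theme, when, version, extension="csv"):
    """Join place, theme, reversed date (YYYYMMDD), and version with underscores."""
    stamp = when.strftime("%Y%m%d")
    parts = [clean_part(place), clean_part(theme), stamp, f"v{version:02d}"]
    return "_".join(parts) + "." + extension


# Hypothetical inputs, modelled on the example name in the guidelines.
name = build_filename("corvallis", "veg biodiv", date(2007, 5, 8), 1)
print(name)  # Corvallis_VegBiodiv_20070508_v01.csv
```

Generating names from a single function is one way to keep a whole group consistent, since nobody has to remember the rules by hand.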
Types of data
The type of data generated affects how the data are managed and preserved. There are four main types of research data: observational (captured in real time and not exactly reproducible); experimental (from labs and equipment; can often be reproduced but may be expensive); simulation (from models; can be reproduced if the input data are known); and derived/compiled (produced by data mining or analysis; can be reproduced if documented).
Document your data at three levels: the project (the who, what, and how), the files and databases (include a readme.txt file), and the individual items (what the variables mean). Documentation should include:
- good record-keeping practices; see Good Record Keeping (from UF Office of Technology Licensing)
- lab notebooks; see examples
- codebooks, questionnaires, etc.
- software and output files
- information about equipment, instruments, and measures
- database schema
- methodology reports
- information about sources (provenance) of data
- how files relate, including formats
- a key for understanding the results (include each variable name and its full meaning)
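A readme.txt with a variable key can be generated rather than typed by hand. The sketch below assumes a very simple two-section layout (project details, then variables); the project details, variable names, and the helper `render_readme` are hypothetical and shown only to illustrate the idea.

```python
# Hypothetical project details and variables, used only for illustration.
project = {
    "title": "Vegetation biodiversity survey",
    "creator": "A. Researcher",
    "method": "Field counts along fixed transects",
}
variables = {
    "site": "Name of the sampling location",
    "date": "Date of observation (YYYY-MM-DD)",
    "species_count": "Number of distinct plant species recorded",
}


def render_readme(project, variables):
    """Return readme text: project-level notes followed by a variable key."""
    lines = ["PROJECT"]
    for key, value in project.items():
        lines.append(f"{key}: {value}")
    lines.append("")
    lines.append("VARIABLES")
    for name, meaning in variables.items():
        lines.append(f"{name}: {meaning}")
    return "\n".join(lines) + "\n"


readme_text = render_readme(project, variables)
print(readme_text)

# To save it next to the data files:
# with open("readme.txt", "w", encoding="utf-8") as handle:
#     handle.write(readme_text)
```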
Metadata is descriptive information about your data and can help make your data more machine-readable. Metadata can be descriptive (title, author, abstract, keywords, etc.), administrative (preservation, rights management, technical metadata), or structural (how data relate to other data; relationships between tables). See Metadata Standards from the UK Digital Curation Centre.
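To make the three categories concrete, here is a minimal sketch of a metadata record serialized as JSON. The field names and values are hypothetical and do not follow any particular metadata standard; a real project should use the schema recommended for its discipline.

```python
import json

# Hypothetical record; field names are illustrative, not a formal standard.
metadata = {
    "descriptive": {
        "title": "Vegetation biodiversity survey",
        "author": "A. Researcher",
        "abstract": "Plant species counts along fixed transects.",
        "keywords": ["vegetation", "biodiversity"],
    },
    "administrative": {
        "rights": "Contact the author before reuse",
        "file_format": "text/csv",
        "character_encoding": "UTF-8",
    },
    "structural": {
        # How this file relates to other files in the project.
        "related_files": ["readme.txt"],
    },
}

# JSON is plain text, so the record stays readable by people and by software.
metadata_json = json.dumps(metadata, indent=2, ensure_ascii=False)
print(metadata_json)
```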
An ontology is a formal framework for describing an area of study. Ontologies may be general, such as the Australian National Data Service (ANDS) Ontology that describes data, or specific to a discipline, such as the Plant Structure Ontology that describes the morphology of a flowering plant. Ontologies are built by specialists in the field and can help assign meaning to your data, with the overall purpose of encouraging data sharing, classification, and machine readability.
- REDCap is a web-based clinical data capture tool available for free for any research project at UF. REDCap allows you to create a database and related forms quickly and output data into Excel and common statistical packages (SPSS, SAS, Stata, R).
- LabArchives is an Electronic Lab Notebook (also known as an ELN) used to store, organize, share, and publish laboratory data.
- Visual Understanding Environment (VUE) provides a flexible visual environment for structuring, presenting, and sharing digital information. It manages digital information through concept maps (visual representations of complex ideas) that become content maps, and it supports semantic mapping, analysis and comparison between maps, and presentations.
- SPSS Statistics allows you to explore your data, formulate hypotheses, perform statistical analyses, clarify relationships between variables, create clusters, identify trends, and make predictions. Free SPSS training workshops are provided by e-Learning. Those interested should contact Dr. Silva-Lugo. SPSS is available in the computer labs and also for personal or home use.
- SAS Statistics software provides a wide range of analytical capabilities, from traditional analysis of variance to exact methods and statistical visualization techniques. SAS also offers an interactive matrix programming language and exploratory data analysis with integration to R. Access to this software is available at some UF HSC Library computers. It is also available for personal or home use.
Training & Support Tools
- MANTRA Research Data Management Training is a free online course from the University of Edinburgh designed “for PhD students and others who are planning a research project using digital data.”
- Lynda.com provides training in basic tools such as Excel and Data Analysis. Log in with your Gatorlink.
- Data Curation Profiles includes information on specific data generated by a single scientist or lab, based on their reported needs and preferences for the data.