There are many different ways to set up and organise your documentation.
Project level documentation gives contextual information about the study/project: it explains the aims of the study, the research questions, the methodologies, etc.
Project level documentation also seeks answers to questions such as:
For what purpose was the data created? Describe the project history, its aims, objectives, concepts and hypotheses, including:
The title of the project
Authors, creators and co-workers of the dataset
The institution of the author(s)/creator(s)
References to related projects
Publications from the data.
What does the dataset contain?
Kind of data (interviews, images, questionnaires, instrumental, etc.)
Organisation and structure
Relationships between files
Description of data file(s): version and edition, structure of the database, associations, links between files, external links, formats, compatibility
How was the data collected?
The methodology and technique used in collecting and creating the data
Description of all the sources the data originate from
The methods/modes of data collection (for example):
The instruments, hardware and software used to collect the data
Digitisation or transcription methods
Data collection protocols
Sampling design and procedure
Target population, units of observation
What possible manipulations were done to the data? How was the data processed?
Modifications made to data over time since their original creation and identification of different versions of datasets
Describe workflow and specific tools, instruments, procedures, hardware/software or protocols you might have used to process the data
Anonymisation/pseudonymisation strategy
What were the quality assurance procedures?
Checking for equipment and transcription errors
Quality control of materials
Data integrity checks
Data capture resolution and repetitions
Other procedures related to data quality such as weighting, calibration, reasons for missing values, checks and corrections of transcripts, transformations.
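One common way to carry out the data integrity checks listed above is to record a checksum for every data file and verify it again later. The following is only a minimal sketch: the choice of SHA-256 and the function names file_checksum, build_manifest and verify_manifest are assumptions made for illustration, not part of any prescribed procedure.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): file_checksum(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def verify_manifest(folder, manifest):
    """Return the sorted names of files that are missing or have changed."""
    current = build_manifest(folder)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

The manifest could be stored alongside the documentation when the dataset is deposited. An empty list from verify_manifest then indicates that every recorded file is still present and unchanged.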
How can the data be accessed? Describe the use and access conditions of the data:
Where the data can be found
Access conditions such as embargo
Parts of the data that are restricted, protected or confidential
Copyright and ownership issues
A complete academic thesis normally contains this information in detail, but a published article may not. If a dataset is shared, a detailed technical report needs to be included so that the user can understand how the data were collected and processed. You should also provide a sample bibliographic citation to indicate how you would like secondary users of your data to cite it in any publication.
File or Database Level
File or database level documentation documents how all the files (or tables in a database) that make up the dataset relate to each other, what format they are in, whether they supersede or are superseded by previous files, etc.
For this purpose, a codebook is advised. A codebook can be kept as a separate file or embedded within the data file. A separate file allows for much flexibility, but is yet another document to maintain. An embedded codebook sits close to the data and is easy to use, but is hardly flexible and may get lost in conversion.
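As an illustration of the separate-file option, a codebook can be as simple as a small CSV table with one row per variable. The sketch below is hypothetical: the column names (name, label, unit), the bmi entry and the helper functions write_codebook and read_codebook are assumptions, not a required layout.

```python
import csv
import io

# Assumed codebook columns; adapt them to your own dataset.
FIELDS = ["name", "label", "unit"]

# Illustrative entries only.
CODEBOOK = [
    {"name": "bmi", "label": "Body mass index", "unit": "kg/m2"},
    {
        "name": "q11bhexw",
        "label": "q11b: hours spent taking physical exercise in a typical week",
        "unit": "hours per week",
    },
]


def write_codebook(entries, handle):
    """Write codebook entries as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)


def read_codebook(handle):
    """Read a codebook CSV into a dict keyed by variable name."""
    return {row["name"]: row for row in csv.DictReader(handle)}


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_codebook(CODEBOOK, buffer)
codebook = read_codebook(io.StringIO(buffer.getvalue()))
```

Because the codebook is plain text, it stays readable without the software that produced the data file and can be kept under version control with the data.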
Data level documentation should also seek to document the processing steps, answering questions such as:
What happens between data files and why?
What is the chronology like? What happens when, and why?
Use annotated scripts or cookbooks that describe all steps, decisions and the study protocol.
Variable or Item Level
Variable or item level documentation documents how an object of analysis came about. For example, it does not just document a variable name at the top of a spreadsheet file, but also the full label explaining the meaning of that variable in terms of how it was operationalised.
Best practices regarding variable names:
Use valid variable names
Meaningful abbreviations, e.g. use bmi, not var1
Refer to numbering system in instrument, e.g. q1a, q1b, q2, q3a
Avoid simplistic numerical order system like v1, v2, v3
Keep names short and lower case, with no spaces or special characters (e.g. gender, not Gender)
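The naming guidance above can be checked automatically. The following sketch is one possible interpretation, not a standard. The 32-character limit, the decision to allow underscores, and the function name check_variable_name are all assumptions.

```python
import re

# Assumed rules: start with a letter, then lower-case letters, digits or underscores.
VALID_NAME = re.compile(r"[a-z][a-z0-9_]*")
# Simplistic numbered names such as v1, v2 or var1.
GENERIC_NAME = re.compile(r"(v|var)\d+")
MAX_LENGTH = 32  # assumption: the guidance only says "short"


def check_variable_name(name):
    """Return a list of problems found in a variable name (empty if none)."""
    problems = []
    lowered = name.lower()
    if len(name) > MAX_LENGTH:
        problems.append("too long")
    if name != lowered:
        problems.append("not lower case")
    if not VALID_NAME.fullmatch(lowered):
        problems.append("must start with a letter and use only letters, digits or underscores")
    if GENERIC_NAME.fullmatch(lowered):
        problems.append("generic numbered name")
    return problems
```

For example, check_variable_name("bmi") and check_variable_name("q1a") report no problems, whereas "Gender" is flagged for its capital letter and "var1" for being a generic numbered name.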
Best practices regarding variable descriptions: Variables in tabular data should have descriptive labels.
Be brief, max. 80 characters
Spaces or special characters are ok
Include unit of measurement where applicable
Refer to the number used in the instrument, e.g. variable q11bhexw with label "q11b: hours spent taking physical exercise in a typical week". The label gives the unit of measurement and a reference to the question number (q11b).
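The label guidance can be checked in a similar way. This is a minimal sketch under stated assumptions: the pattern for question numbers (q followed by digits and an optional letter, as in q1a or q11b) and the function name check_label are illustrative, while the 80-character limit is taken from the guidance above.

```python
import re

MAX_LABEL_LENGTH = 80  # limit taken from the guidance above
# Assumed instrument numbering at the start of a variable name: q1a, q2, q11b.
QUESTION_PREFIX = re.compile(r"q\d+[a-z]?")


def check_label(name, label):
    """Return a list of problems found in a variable label (empty if none)."""
    problems = []
    if not label.strip():
        problems.append("label is empty")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append("label longer than 80 characters")
    match = QUESTION_PREFIX.match(name)
    if match and match.group(0) not in label.lower():
        problems.append("label does not mention " + match.group(0))
    return problems
```

Applied to the example above, check_label("q11bhexw", "q11b: hours spent taking physical exercise in a typical week") reports no problems, because the label is under 80 characters and mentions q11b.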