Skip to content

Commit

Permalink
DOC: datasets
Browse files Browse the repository at this point in the history
  • Loading branch information
jotelha committed Nov 9, 2023
1 parent 51bdb69 commit ea785c9
Show file tree
Hide file tree
Showing 2 changed files with 44 additions and 0 deletions.
43 changes: 43 additions & 0 deletions docs/source/dataset.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
## dtool datasets

A _dataset_ is a collection of files that belong together logically. For example, this could be...

* ...raw data recorded in a measurement.
* ...input files for a simulation and the output of that simulation.
* ...the LaTeX manuscript of a paper, the raw data underlying that paper and the scripts required to plot that data.

The idea behind a dataset is that it becomes immutable at some point. The process of making a dataset immutable is called 'freezing'. Once a dataset is frozen, it can be moved between storage systems but no longer be edited. Changing a dataset would then require the creation of a new (derived) dataset. While this may seem cumbersome, it has a few advantages: the storage backend can be kept simple, the inadvertent modification of primary data is not possible, and the data provenance is traceable exhaustively as long as derived datasets refer to their source datasets properly.

### Granularity

What goes into a _dataset_ is the decision of the dataset's creator. In general, a finer granularity makes it easier to move, copy and backup datasets. We distinguish between...

* ..._primary data_, i.e. the raw results from a measurement or the output of a simulation...
* ...and _secondary or derived data_, which is obtained by processing primary or other derived data. Derived data could for example be some elemental analysis based on some spectrometry measurement. The recorded spectrum itself would be primary data.

This distinction allows to freeze a dataset immediately after completion of an experiment, while there may still be many postprocessing steps that follow. Those would go into separate datasets. The relationship between datasets can be specified using the `derived_from` property or other means discussed below.

### Owners

Each dataset has at least one owner. This is typically the person who created the dataset. __Note that the owner has scientific responsibility for the contents of the dataset.__ Part of this scientific responsibility is for example ensuring that the data has not been fabricated or falsified. Attaching an owner to a dataset allows traceability of these scientific responsibilities.

## Metadata

A dataset _always_ has metadata attached to it. Common formats for specifying metadata of arbitrary complexity are [YAML](https://yaml.org/) and [JSON](https://www.json.org). *dtool* strongly ecourages the use of YAML. YAML is suitable for specifying metadata in simple key-value pairs, but can do much more. The [YAML cheatsheet](https://quickref.me/yaml) offers a compact overview on both formats. Often, metadata are categorized into administrative, bibliographic and descriptive metadata. The example below shows a minimum example of administrative metadata in YAML format.

```yaml
project: 2023-11-09-dtool-example
description: Examples on how document a dtool dataset with YAML-formatted metadata
owners:
- name: Johannes Laurin Hoermann
email: [email protected]
orcid: 0000-0001-5867-695X
funders:
- organization: DFG
program: Clusters of Excellence
code: EXC 2193
creation_date: '2023-11-09'
```
We recommend to always use your clear name, your email and your [ORCID](https://orcid.org/) for tracing ownership.
The more (descriptive) metadata are available, the easier it will be to search for a specific dataset.
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ Contents

readme
story
datasets
contributing
changelog
api

0 comments on commit ea785c9

Please sign in to comment.