One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty.
The command "datalad save [-m] PATH" saves the file (modifications) to
history.
Note to self: Always use informative, concise commit messages.
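A minimal sketch of the create-and-save cycle noted above, assuming
datalad is installed. The dataset path, file name, description, and
commit message are hypothetical. The commands are wrapped in a function
so that nothing runs until it is called.

```shell
# Sketch: create a dataset, add one file, and save it to history.
# "notes.txt" and both messages are hypothetical examples.
create_and_save() {
    ds="$1"
    datalad create --description "example dataset on my laptop" "$ds" || return 1
    echo "first note" > "$ds/notes.txt" || return 1
    # Save from inside the dataset with an informative, concise message.
    ( cd "$ds" && datalad save -m "Add first note" notes.txt )
}

# Usage (hypothetical path): create_and_save my-dataset
```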
The command 'datalad clone URL/PATH [PATH]' installs a dataset from
e.g., a URL or a path. If you install a dataset into an existing
dataset (as a subdataset), remember to specify the root of the
superdataset with the '-d' option.
There are two useful commands to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.
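To see what such a comparison looks like, here is a small sketch that
uses plain Git in a throwaway repository. The file name and commit
messages are hypothetical. In a DataLad dataset the same two shasums
could be passed to "datalad diff" instead.

```shell
# Sketch: compare two commits in a scratch repository (plain Git only).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Identity is passed per command so the sketch needs no global config.
commit() { git -c user.name="Example" -c user.email="example@example.com" commit -q -m "$1"; }

echo "first line" > notes.txt
git add notes.txt && commit "Add notes"
echo "second line" >> notes.txt
git add notes.txt && commit "Extend notes"

old=$(git rev-parse HEAD~1)   # shasum of the earlier commit
new=$(git rev-parse HEAD)     # shasum of the later commit
diff_output=$(git diff "$old" "$new")
echo "$diff_output"
# DataLad equivalent inside a dataset:
#   datalad diff --from "$old" --to "$new"
```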
The datalad run command can record the impact a script or command has
on a dataset. In its simplest form, datalad run only takes a commit
message and the command that should be executed.
Any datalad run command can be re-executed by using its commit shasum
as an argument in datalad rerun CHECKSUM. DataLad will take
information from the run record of the original commit, and re-execute
it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!
You should specify all files that a command takes as input with an
-i/--input flag. These files will be retrieved prior to the command
execution. Any content that is modified or produced by the command
should be specified with an -o/--output flag. Upon a run or rerun of
the command, the contents of these files will get unlocked so that
they can be modified.
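A sketch of a run with declared inputs and outputs, followed by a rerun
of the same record. It assumes datalad is installed and the current
directory is a clean dataset. The file names and the command are
hypothetical. The function only defines the steps; call it to execute
them.

```shell
# Sketch: record a command with --input/--output, then re-execute it.
# "data/raw.csv" and "line_count.txt" are hypothetical file names.
run_and_rerun() {
    datalad run -m "Count lines in the raw table" \
        --input "data/raw.csv" \
        --output "line_count.txt" \
        "wc -l < data/raw.csv > line_count.txt" || return 1
    # The run record is stored in the newest commit; reuse its shasum.
    checksum=$(git rev-parse HEAD) || return 1
    datalad rerun "$checksum"
}
```

If the rerun produces identical output, no new commit should appear in
the history.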
Important! If the dataset is not "clean" (that is, the output of
"datalad status" is not empty), datalad run will not work - you will
first have to save the modifications present in your dataset.
A suboptimal alternative is the --explicit flag, used to record only
those changes done to the files listed with --output flags.
A source to install a dataset from can also be a path, for example as
in "datalad clone ../DataLad-101".
Just as in creating datasets, you can add a description on the
location of the new dataset clone with the -D/--description option.
Note that subdatasets will not be installed by default, but are only
registered in the superdataset -- you will have to do a
"datalad get -n PATH/TO/SUBDATASET" to install the subdataset for file
availability metadata. The -n/--no-data option prevents file contents
from also being downloaded.
Note that a recursive "datalad get" would install all further
registered subdatasets underneath a subdataset, so a safer way to
proceed is to set a decent --recursion-limit:
"datalad get -n -r --recursion-limit 2 <subds>"
The command "git annex whereis PATH" lists the repositories that have
the file content of an annexed file. When using "datalad get" to
retrieve file content, those repositories will be queried.
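The clone, get, and whereis steps from the last few notes can be strung
together as in the sketch below. It assumes datalad and git-annex are
installed. All paths are placeholders passed in by the caller, and the
description text is hypothetical.

```shell
# Sketch: clone a dataset, install subdatasets without file content,
# check where a file's content lives, then retrieve it.
clone_and_get() {
    source_ds="$1"   # e.g. ../DataLad-101
    target="$2"      # where the clone should go
    subds="$3"       # path of a registered subdataset inside the clone
    file="$4"        # an annexed file in the clone
    datalad clone --description "clone for a quick look" "$source_ds" "$target" || return 1
    (
        cd "$target" || exit 1
        # -n: no file content; -r with a limit: descend at most two levels.
        datalad get -n -r --recursion-limit 2 "$subds" || exit 1
        # List the repositories that hold the content of this file.
        git annex whereis "$file" || exit 1
        datalad get "$file"
    )
}
```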
To update a shared dataset, run the command "datalad update --merge".
This command will query its origin for changes, and integrate the
changes into the dataset.
To update from a dataset with a shared history, you need to add this
dataset as a sibling to your dataset. "Adding a sibling" means
providing DataLad with info about the location of a dataset, and a
name for it.
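A sketch of registering a sibling and pulling in its changes, assuming
datalad is installed and the function is called from inside the
dataset. The sibling name and location are hypothetical arguments, and
the branch name "master" is an assumption that may differ in your
setup.

```shell
# Sketch: add a sibling, fetch its changes, inspect them, then merge.
update_from_sibling() {
    name="$1"   # hypothetical sibling name, e.g. roommate
    url="$2"    # hypothetical location, e.g. ../roommate/DataLad-101
    datalad siblings add -d . --name "$name" --url "$url" || return 1
    # Fetch only: nothing is merged into the dataset yet.
    datalad update -s "$name" || return 1
    # Compare the current state with the sibling's branch before merging.
    datalad diff --to "remotes/$name/master" || return 1
    datalad update --merge -s "$name"
}
```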
Afterwards, a "datalad update --merge -s name" will integrate the
changes made to the sibling into the dataset. A safe step in between
is to do a "datalad update -s name" and check out the changes with
"git diff" or "datalad diff" against remotes/origin/master.
Configurations for datasets exist on different levels (systemwide,
global, and local), and in different types of files (not version
controlled (git)config files, or version controlled .datalad/config,
.gitattributes, or .gitmodules files), or environment variables.
With the exception of .gitattributes, all configuration files share a
common structure, and can be modified with the git config command, but
also with an editor by hand.
Configurations in version-controlled files are shared together with
the dataset; configurations in other files are not.
More specific configurations and not-shared configurations will always
take precedence over more global or shared configurations, and
environment variables take precedence over configurations in files.
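The local-over-global part of this precedence rule can be tried out
with plain Git. The sketch below uses a scratch HOME directory so that
no real configuration is touched. The key "demo.setting" is a made-up
name for illustration only.

```shell
# Sketch: a local setting overrides a global one (plain Git only).
scratch=$(mktemp -d)
export HOME="$scratch/home" XDG_CONFIG_HOME="$scratch/home/.config"
mkdir -p "$HOME"
git init -q "$scratch/repo"
cd "$scratch/repo" || exit 1

git config --global demo.setting "from-global"   # written to ~/.gitconfig
value_before=$(git config --get demo.setting)
git config --local demo.setting "from-local"     # written to .git/config
value_after=$(git config --get demo.setting)

echo "global only: $value_before"
echo "with local:  $value_after"
# Show which file the effective value comes from.
git config --show-origin --get demo.setting
```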
The git config --list --show-origin command is a useful tool to give
an overview of existing configurations. Particularly important may
be the .gitattributes file, in which one can set rules for git-annex
about which files should be version-controlled with Git instead of
being annexed.
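As an illustration, a rule of this kind might look like the sketch
below. It assumes the goal is to keep all .txt files in Git, writes the
rule in a throwaway repository, and then asks Git which attribute value
applies to a hypothetical file name.

```shell
# Sketch: a .gitattributes rule telling git-annex to leave text files to Git.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# "annex.largefiles=nothing": matching files are never treated as large,
# so they are stored in Git rather than in the annex.
echo "*.txt annex.largefiles=nothing" > .gitattributes
# Query the attribute for a path (the file does not need to exist).
attr=$(git check-attr annex.largefiles -- notes.txt)
echo "$attr"
```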
It can be useful to use pre-configured procedures that can apply
configurations, create files or file hierarchies, or perform arbitrary
tasks in datasets. They can be shipped with DataLad, its extensions,
or datasets, and you can even write your own procedures and distribute
them.
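Working with procedures could look like the sketch below, assuming
datalad is installed. The text2git procedure ships with DataLad; the
dataset path is a hypothetical argument. Each step is a function, so
nothing runs until it is called.

```shell
# Sketch: list available procedures, apply one, or apply it at creation.
list_procedures() {
    datalad run-procedure --discover
}
apply_text2git() {
    # Call from inside an existing dataset.
    datalad run-procedure cfg_text2git
}
create_with_text2git() {
    # Same procedure applied at creation time; the "cfg_" prefix is dropped.
    datalad create -c text2git "$1"
}
```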
The "datalad run-procedure" command is used to apply such a procedure
to a dataset. Procedures shipped with DataLad or its extensions
starting with a "cfg" prefix can also be applied at the creation of a
dataset with "datalad create -c <PROC-NAME> <PATH>" (omitting the
"cfg" prefix).
Git has many handy tools to go back and forth in time and work with the
history of datasets. Among many other things you can rewrite commit
messages, undo changes, or look at previous versions of datasets.
A superb resource to find out more about this and practice such Git
operations is this chapter in the Pro Git book:
https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
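One small example of such an operation is rewriting the most recent
commit message. The sketch below uses plain Git in a throwaway
repository with hypothetical file names and messages. Amending is best
kept to commits that have not been shared yet.

```shell
# Sketch: replace the message of the latest commit with git commit --amend.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
# Identity is passed per command so the sketch needs no global config.
git_as_me() { git -c user.name="Example" -c user.email="example@example.com" "$@"; }

echo "draft" > notes.txt
git add notes.txt
git_as_me commit -q -m "stuff"
# Rewrite the uninformative message; the commit is replaced, not added to.
git_as_me commit -q --amend -m "Add first draft of notes"

subject=$(git log -1 --format=%s)
count=$(git rev-list --count HEAD)
echo "$subject (commits in history: $count)"
```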