Merge branch 'book-intro' of github.com:datalad-handbook/book into book-intro
adswa committed Nov 14, 2023
2 parents b4fac7b + 7b71502 commit 924db85
Showing 7 changed files with 111 additions and 61 deletions.
10 changes: 3 additions & 7 deletions docs/basics/101-114-txt2git.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,6 @@ the first student in your lecturer's office hours.
"Oh, you are really attentive. This is a great question!" our lecturer starts
to explain.

.. figure:: ../artwork/src/teacher.svg
:width: 50%

.. index:: ! dataset procedure; text2git

Do you remember that we created the ``DataLad-101`` dataset with a
Expand Down Expand Up @@ -51,17 +48,16 @@ But what does it mean if files are in Git instead of git-annex?
Well, procedurally it means that everything that is stored in git-annex is
content-locked, and everything that is stored in Git is not. You can modify
content stored in Git straight away, without unlocking it first.
This is easy enough, and illustrated in :numref:`fig-gitvsannex`.

.. _fig-gitvsannex:

.. figure:: ../artwork/src/git_vs_gitannex.svg
:alt: A simplified illustration of content lock in files managed by git-annex.
:width: 50%
:width: 70%

A simplified overview of the tools that manage data in your dataset.

That's easy enough, and illustrated in :numref:`fig-gitvsannex`.

"So, first of all: If we hadn't provided the ``-c text2git`` argument, text files
would get content-locked, too?". "Yes, indeed. However, there are also ways to
later change how file content is handled based on its type or size. It can be specified
Expand Down Expand Up @@ -93,7 +89,7 @@ modifications are performed outside of a :dlcmd:`run`.
But there comes the second, tricky part: There are ways to get rid of locking and
unlocking within git-annex, using so-called :term:`adjusted branch`\es.
This functionality is dependent on the git-annex version one has installed, the git-annex version of the repository, and a use-case dependent comparison of the pros and cons.
On Windows systems, this *adjusted mode* is even the *only* mode of operation.
On Windows systems, this :term:`adjusted mode` is even the *only* mode of operation.
In later sections we will see how to use this feature.
The next lecture, in any case, will guide us deeper into git-annex, and improve our understanding a bit further.

Expand Down
75 changes: 46 additions & 29 deletions docs/basics/101-115-symlinks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,15 @@ It is a crucial component to understanding certain aspects of a dataset, but it

You might have noticed already that an ``ls -l`` or ``tree`` command in your dataset shows small arrows and quite cryptic paths following each non-text file.
Maybe your shell also displays these files in a different color than text files when listing them.
We'll take a look together, using the ``books/`` directory as an example:
We'll take a look together, using the ``books/`` directory as an example.
Also check the :windows-wit:`on directory appearance <ww-directories>` for comparison:

.. index::
pair: no symlinks; on Windows
pair: tree; terminal command
.. windows-wit:: Dataset directories look different on Windows
:name: ww-directories
:float: tb

.. include:: topic/tree-symlinks.rst

Expand All @@ -42,10 +45,10 @@ here to understand it.

The small ``->`` symbol connecting one path (the book's name) to another path (the weird
sequence of characters ending in ``.pdf``) is what is called a
*symbolic link* (short: :term:`symlink`) or *softlink*.
*symbolic link*, :term:`symlink` or *softlink* for short.
It is a term for any file that contains a reference to another file or directory as
a :term:`relative path` or :term:`absolute path`.
If you use Windows, you are familiar with a related, although more basic concept: a shortcut.
If you use Windows, you are familiar with a related, although more basic concept: a shortcut. But see the :windows-wit:`on how the actual behavior is there <ww-adjusted-mode>`.
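
To make the idea of "a file that contains a reference to another file" concrete, here is a minimal Python sketch. The file names ``content.txt`` and ``book.txt`` and the helper ``make_relative_symlink`` are made up for illustration, and the sketch assumes a file system that supports symlinks (it is not how git-annex itself creates its links):

```python
import os
import tempfile


def make_relative_symlink(directory, target_name, link_name):
    """Create ``link_name`` inside ``directory`` that refers to ``target_name``.

    The reference is stored as a relative path (hypothetical helper).
    """
    link_path = os.path.join(directory, link_name)
    os.symlink(target_name, link_path)
    return link_path


with tempfile.TemporaryDirectory() as tmp:
    # The actual content lives in one file ...
    target = os.path.join(tmp, "content.txt")
    with open(target, "w") as f:
        f.write("hello\n")

    # ... and the symlink only stores a path that refers to it.
    link = make_relative_symlink(tmp, "content.txt", "book.txt")
    is_link = os.path.islink(link)
    stored_reference = os.readlink(link)

    # Opening the link transparently reads the target's content.
    with open(link) as f:
        content_via_link = f.read()
```

``os.readlink`` returns the stored path, which is what ``ls -l`` prints after the ``->`` arrow.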

This means that the files that are in the locations in which you saved content
and are named as you named your files (e.g., ``TLCL.pdf``),
Expand Down Expand Up @@ -87,9 +90,10 @@ tree is also known as the *annex* of a dataset.
.. index::
pair: elevated storage demand; in adjusted mode
pair: no symlinks; on Windows
.. windows-wit:: File content management on Windows
:name: woa_objecttree
:float:
pair: adjusted mode; on Windows
.. windows-wit:: File content management on Windows (adjusted mode)
:name: ww-adjusted-mode
:float: tbp

.. include:: topic/adjustedmode-nosymlinks.rst

Expand Down Expand Up @@ -140,7 +144,7 @@ This comes with two very important advantages:
One, should you have copies of the
same data in different places of your dataset, the symlinks of these files
point to the same place - in order to understand why this is the case, you
will need to read the :find-out-more:`about the object tree <fom-objecttree>`.
will need to read the :find-out-more:`on how git-annex manages file content <fom-objecttree>`.
Therefore, any amount of copies of a piece of data
is only one single piece of data in your object tree. This, depending on
how much identical file content lies in different parts of your dataset,
Expand All @@ -151,14 +155,14 @@ Compared to copying and deleting huge data files, small symlinks can be written

.. gitusernote:: Speedy branch switches

Switching branches fast, even when they track vast amounts of data, lets you work with data with the same routines as in software development.
Switching branches fast, even when they track vast amounts of data, lets you work with data using the same routines as in software development workflows.

This leads to a few conclusions:

The first is that you should not be worried
to see cryptic looking symlinks in your repository -- this is how it should look.
You can read the :ref:`find-out-more on why these paths look so weird <fom-objecttree>` and what all of this has to do with data integrity, if you want to.
It's additional information that can help to establish trust in that your data are safely stored and tracked, and understanding more about the object tree and knowing bits of the git-annex basics can make you more confident in working with your datasets.
Again, you can read the :find-out-more:`on why these paths look so weird <fom-objecttree>` and what all of this has to do with data integrity, if you want to.
It has additional information that can help establish trust that your data are safely stored and tracked, and understanding more about the object tree and knowing bits of the git-annex basics can make you more confident in working with your datasets.

The second is that it should now be clear to you why the ``.git`` directory
should not be deleted or in any way modified by hand. This place is where
Expand All @@ -174,30 +178,31 @@ will take a closer look at that.
.. _objecttree:
.. index::
pair: key; git-annex concept
.. find-out-more:: Paths, checksums, object trees, and data integrity
.. find-out-more:: Data integrity and annex keys
:name: fom-objecttree
:float: tbp

So how do these cryptic paths and names in the object tree come into existence?
It's not malicious intent that leads to these paths and file names - it's checksums.

When a file is annexed, git-annex generates a *key* (or :term:`checksum`) from the **file content**.
When a file is annexed, git-annex typically generates a *key* (or :term:`annex key`) from the **file content**.
It uses this key (in part) as a name for the file and as the path
in the object tree.
Thus, the key is associated with the content of the file (the *value*),
and therefore, using this key, file content can be identified --
or rather: Based on the keys, it can be identified whether file content changed,
and whether two files have identical contents.
and therefore, using this key, file content can be identified.

The key is generated using *hashes*. A hash is a function that turns an
input (e.g., a PDF file) into a string of characters with a fixed length based on its contents.
Most key types contain a :term:`checksum`. This is a string of a fixed number of characters
computed from some input, for example the content of a PDF file,
by a *hash* function.

Importantly, a hash function will generate the same character sequence for the same file content, and once file content changes, the generated hash changes, too.
This checksum *uniquely* identifies a file's content.
A hash function will generate the same character sequence for the same file content, and once file content changes, the generated checksum changes, too.
Basing the file name on its contents thus becomes a way of ensuring data integrity:
File content cannot be changed without git-annex noticing, because file's hash, and thus its key in its symlink, will change.
Furthermore, if two files have identical hashes, the content in these files is identical.
File content cannot be changed without git-annex noticing, because the file's checksum, and thus its key in its symlink, will change.
Furthermore, if two files have identical checksums, the content in these files is identical.
Consequently, if two files have the same symlink, and thus link the same file in the object-tree, they are identical in content.
This can save disk space if a dataset contains many identical files: Copies of the same data only need one instance of that content in the object tree, and all copies will symlink to it.
If you want to read more about the computer science basics about hashes check out the `Wikipedia page <https://en.wikipedia.org/wiki/Hash_function>`_.
If you want to read more about the computer science basics about hash functions check out the `Wikipedia page <https://en.wikipedia.org/wiki/Hash_function>`_.
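
The behavior of a hash function can be tried out in a few lines of Python. This is only a sketch using the standard library's ``hashlib`` module and the MD5 function. The helper name ``md5_checksum`` and the example byte strings are made up, and git-annex computes its checksums with its own implementation:

```python
import hashlib


def md5_checksum(content: bytes) -> str:
    """Return the MD5 checksum of ``content`` as a hexadecimal string."""
    return hashlib.md5(content).hexdigest()


# Identical content always yields the identical checksum ...
original = md5_checksum(b"hello")
copy = md5_checksum(b"hello")

# ... while even a one-character change yields a different one.
modified = md5_checksum(b"hello!")

print(original)          # 5d41402abc4b2a76b9719d911017c592
print(original == copy)  # True
print(original == modified)  # False
```

The checksum always has the same length (32 hexadecimal characters for MD5), no matter how large the input is.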

.. runrecord:: _examples/DL-101-115-104
:language: console
Expand Down Expand Up @@ -230,30 +235,40 @@ will take a closer look at that.
The next subdirectory in the symlink helps to prevent accidental deletions and changes, as it does not have write :term:`permissions`, so that users cannot modify any of its underlying contents.
This is the reason that annexed files need to be unlocked prior to modifications, and this information will be helpful to understand some file system management operations such as removing files or datasets. Section :ref:`file system` takes a look at that.

The next part of the symlink contains the actual hash.
There are different hash functions available.
The next part of the symlink contains the actual checksum.
There are different :term:`annex key` backends that use different checksums.
Depending on which is used, the resulting :term:`checksum` has a certain length and structure, and the first part of the symlink actually states which hash function is used.
By default, DataLad uses the ``MD5E`` git-annex backend (the ``E`` adds file extensions to annex keys), but should you want to, you can change this default to `one of many other types <https://git-annex.branchable.com/backends>`_.
The reason why MD5E is used is the relatively short length of the underlying MD5 checksums -- thus it is possible to ensure cross-platform compatibility and share datasets also with users on operating systems that have restrictions on total path lengths, such as Windows.
The reason why MD5E is used is the relatively short length of the underlying MD5 checksums -- which facilitates cross-platform compatibility for sharing datasets also with users on operating systems that have restrictions on total path length, such as Windows.

The one remaining unidentified bit in the file name is the one after the checksum identifier.
This part is the size of the content in bytes.
An annexed file in the object tree thus has a file name following this structure:
An annexed file in the object tree thus has a file name following this structure
(but see `the git-annex documentation on keys <https://git-annex.branchable.com/internals/key_format>`_ for the complete details):

``checksum-identifier - size -- checksum . extension``
``<backend type>-s<size>--<checksum>.<extension>``
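
As an illustration of this structure, the following Python sketch splits a key name into its parts. The pattern is a deliberate simplification that only covers the fields named above, the function ``parse_annex_key`` is hypothetical, and the example key was constructed for a five-byte file containing ``hello``. The git-annex documentation on keys remains the authoritative description of the format:

```python
import re

# Simplified pattern: <backend type>-s<size>--<checksum>.<extension>
# (assumption: real keys can carry additional fields that are ignored here)
KEY_PATTERN = re.compile(
    r"^(?P<backend>[A-Z0-9_]+)"
    r"-s(?P<size>\d+)"
    r"--(?P<checksum>[0-9a-f]+)"
    r"(?P<extension>\.[^/]+)?$"
)


def parse_annex_key(key):
    """Split a key name into backend, size, checksum, and extension."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a key in the simplified format: {key!r}")
    parts = match.groupdict()
    parts["size"] = int(parts["size"])  # size of the content in bytes
    return parts


example = parse_annex_key("MD5E-s5--5d41402abc4b2a76b9719d911017c592.txt")
print(example)
```

For the example key, the backend is ``MD5E``, the size is 5 bytes, and the extension is ``.txt``.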

You now know a great deal more about git-annex and the object tree.
Maybe you are as amazed as we are about some of the ingenuity used behind the scenes.
Even more mesmerizing things about git-annex can be found in its `documentation <https://git-annex.branchable.com/git-annex>`_.


.. raw:: latex

\vspace{1cm}

.. image:: ../artwork/src/teacher.svg
:width: 50%
:align: center

.. index:: ! broken symlink, ! symlink; broken
.. _wslfiles:

Broken symlinks
^^^^^^^^^^^^^^^

Whenever a symlink points to a non-existent target, this symlink is called
*broken*, and opening the symlink would not work as it does not resolve. The
*broken* or *dangling*, and opening the symlink would not work as it does not resolve. The
section :ref:`file system` will give a thorough demonstration of how symlinks can
break, and how one can fix them again. Even though *broken* sounds
troublesome, most types of broken symlinks you will encounter can be fixed,
Expand All @@ -276,14 +291,16 @@ Alternatively, use the :shcmd:`ls` command in a terminal instead of a file manag
Other tools may be more specialized, smaller, or domain-specific, and may fail to work correctly with broken symlinks, display unhelpful error messages when handling them, or require additional flags to modify their behavior.
When encountering unexpected behavior or failures, keep in mind that a dataset without retrieved content appears to a range of tools as a pile of broken symlinks. Consult the tool's documentation with regard to symlinks, and check whether data retrieval fixes persisting problems.
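
If you want to check from a script whether a path is a broken symlink, a small Python sketch like the following may help. The helper ``is_broken_symlink`` and the file names are made up for illustration, and the sketch assumes a file system with symlink support:

```python
import os
import tempfile


def is_broken_symlink(path):
    """Return True if ``path`` is a symlink whose target does not exist."""
    # os.path.islink() inspects the link itself,
    # os.path.exists() follows the link to its target.
    return os.path.islink(path) and not os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    # A symlink to a target that was never created (hypothetical name).
    dangling = os.path.join(tmp, "book.pdf")
    os.symlink("missing-annex-object", dangling)

    # A regular file for comparison.
    regular = os.path.join(tmp, "notes.txt")
    with open(regular, "w") as f:
        f.write("text\n")

    dangling_is_broken = is_broken_symlink(dangling)
    regular_is_broken = is_broken_symlink(regular)
```

In a dataset, a file for which such a check reports a broken symlink is typically one whose content has not been retrieved yet.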

A last special case on symlinks exists if you are using DataLad on the Windows Subsystem for Linux.
If so, please take a look into the Windows Wit below.
A last special case on symlinks exists if you are using DataLad on the Windows Subsystem for Linux. Take a look at the :windows-wit:`on WSL2 symlink access <ww-wsl2-symlinks>`
for that.

.. index::
pair: access WSL2 symlinked files; on Windows
single: WSL2; symlink access
pair: log; Git command
.. windows-wit:: Accessing symlinked files from your Windows system
:name: ww-wsl2-symlinks
:float: tbp

.. include:: topic/wsl2-symlinkaccess.rst

Expand Down
3 changes: 1 addition & 2 deletions docs/basics/101-122-config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -256,8 +256,7 @@ remaining sections in that file, and the :ref:`that dissects this config file fu
Let's walk through the Git config file of ``DataLad-101``:
As mentioned above, git-annex will use the
:term:`Git config file` for some of its configurations, such as the second section.
It lists the repository version and git-annex
UUID [#f4]_ (:gitannexcmd:`whereis` displays information about where the
It lists the repository version and :term:`annex UUID` [#f4]_ (:gitannexcmd:`whereis` displays information about where the
annexed content is with these UUIDs).

You may recognize the fourth part of the configuration, the subsection
Expand Down
22 changes: 9 additions & 13 deletions docs/basics/101-123-config2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,7 @@ There even is one key word that you recognize: MD5E.
If you have read the :ref:`Find-out-more on object trees <objecttree>`
you will recognize it as a reference to the type of
key used by git-annex to identify and store file content in the object-tree.
The first row, ``* annex.backend=MD5E``, therefore translates to "Everything in this
directory should be checksummed with the MD5 hash function".
The first row, ``* annex.backend=MD5E``, therefore translates to "The ``MD5E`` git-annex backend should be used for any file".
But what is the rest? We'll start with the last row:

.. code-block:: bash
Expand Down Expand Up @@ -66,17 +65,14 @@ configured git-annex to regard all files of type "binary" as a large file.
Thanks to this little line, your text files are not annexed, but stored
directly in Git.

The patterns ``*`` and ``**`` are so-called "wildcards" used in :term:`globbing`.
``*`` matches any file or directory in the current directory, and ``**`` matches
all files and directories in the current directory *and subdirectories*. In technical
terms, ``**`` matches *recursively*. The third row therefore
translates to "Do not annex anything that is a text file in this directory" for git-annex.

However, rules can be even simpler. The second row simply takes a complete directory
(``.git``) and instructs git-annex to regard nothing in it as a "large file".
The second row, ``**/.git* annex.largefiles=nothing`` means that no
``.git`` repository in this directory or a subdirectory should be considered
a "large file". This way, the ``.git`` repositories are protected from being annexed.
The patterns ``*`` and ``**`` are so-called "wildcards" that you might recognize from :term:`globbing`.
In Git configuration files, an asterisk "*" matches anything except a slash.
The third row therefore
translates to "Do not annex anything that is a text file" for git-annex.
A leading "``**``" followed by a slash matches
*recursively* in all directories.
Therefore, the second row instructs git-annex to regard nothing starting with ``.git`` as a "large file", including contents inside of ``.git`` directories.
This way, the ``.git`` repositories are protected from being annexed.
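
The two wildcard rules described above can be sketched in Python. This is a simplified model for illustration only, not Git's actual matching code: the function names are hypothetical, and other parts of Git's pattern syntax (such as ``?``, character ranges, or how patterns without any slash are applied) are left out on purpose:

```python
import re


def attribute_pattern_to_regex(pattern):
    """Translate a simplified path pattern into a regular expression."""
    prefix = ""
    if pattern.startswith("**/"):
        # A leading "**/" may stand for any number of directories, or none.
        prefix = r"(?:.*/)?"
        pattern = pattern[3:]
    # A single "*" matches any run of characters except a slash.
    body = "".join(
        "[^/]*" if char == "*" else re.escape(char) for char in pattern
    )
    return re.compile("^" + prefix + body + "$")


def matches(pattern, path):
    """Return True if ``path`` matches ``pattern`` in this simplified model."""
    return attribute_pattern_to_regex(pattern).match(path) is not None


# "*" does not cross directory boundaries:
print(matches("books/*.pdf", "books/TLCL.pdf"))      # True
print(matches("books/*.pdf", "books/old/TLCL.pdf"))  # False

# A leading "**/" matches at any depth:
print(matches("**/.git*", ".gitattributes"))         # True
print(matches("**/.git*", "code/.gitmodules"))       # True
print(matches("**/.git*", "code/script.py"))         # False
```
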
If you had a single file (``myfile.pdf``) you would not want annexed, specifying a rule such as:

.. code-block:: bash
Expand Down
4 changes: 2 additions & 2 deletions docs/basics/topic/adjustedmode-nosymlinks.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Windows has insufficient support for :term:`symlink`\s and revoking write :term:`permissions` on files.
Therefore, :term:`git-annex` classifies it as a :term:`crippled file system` and has to stray from its default behavior.
While git-annex on Unix-based operating systems stores data in the annex and creates a symlink in the data's original place, on Windows it moves data into the :term:`annex` and creates a *copy* of the data in its original place.
Therefore, :term:`git-annex` classifies it as a :term:`crippled file system` and has to stray from its default behavior: it enters :term:`adjusted mode`.
While git-annex on Unix-based operating systems stores data in the annex and creates a symlink in the data's original place, on Windows it moves data into the :term:`annex` and creates a *copy* of the data in its original place. This behavior is not specific to Windows, but applies to any impaired file system, such as a dataset on a USB stick plugged into a Mac.

**Why is that?**
Data *needs* to be in the annex for version control and transport logistics -- the annex is able to store all previous versions of the data, and manage the transport to other storage locations if you want to publish your dataset.
Expand Down