diff --git a/docs/basics/101-114-txt2git.rst b/docs/basics/101-114-txt2git.rst index 3105749e9..948baaf24 100644 --- a/docs/basics/101-114-txt2git.rst +++ b/docs/basics/101-114-txt2git.rst @@ -21,9 +21,6 @@ the first student in your lecturer's office hours. "Oh, you are really attentive. This is a great question!" our lecturer starts to explain. -.. figure:: ../artwork/src/teacher.svg - :width: 50% - .. index:: ! dataset procedure; text2git Do you remember that we created the ``DataLad-101`` dataset with a @@ -51,17 +48,16 @@ But what does it mean if files are in Git instead of git-annex? Well, procedurally it means that everything that is stored in git-annex is content-locked, and everything that is stored in Git is not. You can modify content stored in Git straight away, without unlocking it first. +This is easy enough, and illustrated in :numref:`fig-gitvsannex`. .. _fig-gitvsannex: .. figure:: ../artwork/src/git_vs_gitannex.svg :alt: A simplified illustration of content lock in files managed by git-annex. - :width: 50% + :width: 70% A simplified overview of the tools that manage data in your dataset. -That's easy enough, and illustrated in :numref:`fig-gitvsannex`. - "So, first of all: If we hadn't provided the ``-c text2git`` argument, text files would get content-locked, too?". "Yes, indeed. However, there are also ways to later change how file content is handled based on its type or size. It can be specified @@ -93,7 +89,7 @@ modifications are performed outside of a :dlcmd:`run`. But there comes the second, tricky part: There are ways to get rid of locking and unlocking within git-annex, using so-called :term:`adjusted branch`\es. This functionality is dependent on the git-annex version one has installed, the git-annex version of the repository, and a use-case dependent comparison of the pros and cons. -On Windows systems, this *adjusted mode* is even the *only* mode of operation. 
+On Windows systems, this :term:`adjusted mode` is even the *only* mode of operation. In later sections we will see how to use this feature. The next lecture, in any way, will guide us deeper into git-annex, and improve our understanding a slight bit further. diff --git a/docs/basics/101-115-symlinks.rst b/docs/basics/101-115-symlinks.rst index d8643b323..070520019 100644 --- a/docs/basics/101-115-symlinks.rst +++ b/docs/basics/101-115-symlinks.rst @@ -16,12 +16,15 @@ It is a crucial component to understanding certain aspects of a dataset, but it You might have noticed already that an ``ls -l`` or ``tree`` command in your dataset shows small arrows and quite cryptic paths following each non-text file. Maybe your shell also displays these files in a different color than text files when listing them. -We'll take a look together, using the ``books/`` directory as an example: +We'll take a look together, using the ``books/`` directory as an example. +Also check the :windows-wit:`on directory appearance ` for comparison: .. index:: pair: no symlinks; on Windows pair: tree; terminal command .. windows-wit:: Dataset directories look different on Windows + :name: ww-directories + :float: tb .. include:: topic/tree-symlinks.rst @@ -42,10 +45,10 @@ here to understand it. The small ``->`` symbol connecting one path (the book's name) to another path (the weird sequence of characters ending in ``.pdf``) is what is called a -*symbolic link* (short: :term:`symlink`) or *softlink*. +*symbolic link*, :term:`symlink` or *softlink* for short. It is a term for any file that contains a reference to another file or directory as a :term:`relative path` or :term:`absolute path`. -If you use Windows, you are familiar with a related, although more basic concept: a shortcut. +If you use Windows, you are familiar with a related, although more basic concept: a shortcut. But see the :windows-wit:`on how the actual behavior is there `. 
This means that the files that are in the locations in which you saved content and are named as you named your files (e.g., ``TLCL.pdf``), @@ -87,9 +90,10 @@ tree is also known as the *annex* of a dataset. .. index:: pair: elevated storage demand; in adjusted mode pair: no symlinks; on Windows -.. windows-wit:: File content management on Windows - :name: woa_objecttree - :float: + pair: adjusted mode; on Windows +.. windows-wit:: File content management on Windows (adjusted mode) + :name: ww-adjusted-mode + :float: tbp .. include:: topic/adjustedmode-nosymlinks.rst @@ -140,7 +144,7 @@ This comes with two very important advantages: One, should you have copies of the same data in different places of your dataset, the symlinks of these files point to the same place - in order to understand why this is the case, you -will need to read the :find-out-more:`about the object tree `. +will need to read the :find-out-more:`on how git-annex manages file content `. Therefore, any amount of copies of a piece of data is only one single piece of data in your object tree. This, depending on how much identical file content lies in different parts of your dataset, @@ -151,14 +155,14 @@ Compared to copying and deleting huge data files, small symlinks can be written .. gitusernote:: Speedy branch switches - Switching branches fast, even when they track vasts amounts of data, lets you work with data with the same routines as in software development. + Switching branches fast, even when they track vast amounts of data, lets you work with data using the same routines as in software development workflows. This leads to a few conclusions: The first is that you should not be worried to see cryptic looking symlinks in your repository -- this is how it should look. -You can read the :ref:`find-out-more on why these paths look so weird ` and what all of this has to do with data integrity, if you want to. 
-It's additional information that can help to establish trust in that your data are safely stored and tracked, and understanding more about the object tree and knowing bits of the git-annex basics can make you more confident in working with your datasets. +Again, you can read the :find-out-more:`on why these paths look so weird ` and what all of this has to do with data integrity, if you want to. +It has additional information that can help to establish trust in that your data are safely stored and tracked, and understanding more about the object tree and knowing bits of the git-annex basics can make you more confident in working with your datasets. The second is that it should now be clear to you why the ``.git`` directory should not be deleted or in any way modified by hand. This place is where @@ -174,30 +178,31 @@ will take a closer look at that. .. _objecttree: .. index:: pair: key; git-annex concept -.. find-out-more:: Paths, checksums, object trees, and data integrity +.. find-out-more:: Data integrity and annex keys :name: fom-objecttree + :float: tbp So how do these cryptic paths and names in the object tree come into existence? It's not malicious intent that leads to these paths and file names - its checksums. - When a file is annexed, git-annex generates a *key* (or :term:`checksum`) from the **file content**. + When a file is annexed, git-annex typically generates a *key* (or :term:`annex key`) from the **file content**. It uses this key (in part) as a name for the file and as the path in the object tree. Thus, the key is associated with the content of the file (the *value*), - and therefore, using this key, file content can be identified -- - or rather: Based on the keys, it can be identified whether file content changed, - and whether two files have identical contents. + and therefore, using this key, file content can be identified. - The key is generated using *hashes*. 
A hash is a function that turns an - input (e.g., a PDF file) into a string of characters with a fixed length based on its contents. + Most key types contain a :term:`checksum`. This is a string of a fixed number of characters + computed from some input, for example the content of a PDF file, + by a *hash* function. - Importantly, a hash function will generate the same character sequence for the same file content, and once file content changes, the generated hash changes, too. + This checksum *uniquely* identifies a file's content. + A hash function will generate the same character sequence for the same file content, and once file content changes, the generated checksum changes, too. Basing the file name on its contents thus becomes a way of ensuring data integrity: - File content cannot be changed without git-annex noticing, because file's hash, and thus its key in its symlink, will change. - Furthermore, if two files have identical hashes, the content in these files is identical. + File content cannot be changed without git-annex noticing, because the file's checksum, and thus its key in its symlink, will change. + Furthermore, if two files have identical checksums, the content in these files is identical. Consequently, if two files have the same symlink, and thus link the same file in the object-tree, they are identical in content. This can save disk space if a dataset contains many identical files: Copies of the same data only need one instance of that content in the object tree, and all copies will symlink to it. - If you want to read more about the computer science basics about hashes check out the `Wikipedia page `_. + If you want to read more about the computer science basics about hash functions check out the `Wikipedia page `_. .. runrecord:: _examples/DL-101-115-104 :language: console @@ -230,22 +235,32 @@ will take a closer look at that. 
The next subdirectory in the symlink helps to prevent accidental deletions and changes, as it does not have write :term:`permissions`, so that users cannot modify any of its underlying contents. This is the reason that annexed files need to be unlocked prior to modifications, and this information will be helpful to understand some file system management operations such as removing files or datasets. Section :ref:`file system` takes a look at that. - The next part of the symlink contains the actual hash. - There are different hash functions available. + The next part of the symlink contains the actual checksum. + There are different :term:`annex key` backends that use different checksums. Depending on which is used, the resulting :term:`checksum` has a certain length and structure, and the first part of the symlink actually states which hash function is used. By default, DataLad uses the ``MD5E`` git-annex backend (the ``E`` adds file extensions to annex keys), but should you want to, you can change this default to `one of many other types `_. - The reason why MD5E is used is the relatively short length of the underlying MD5 checksums -- thus it is possible to ensure cross-platform compatibility and share datasets also with users on operating systems that have restrictions on total path lengths, such as Windows. + The reason why MD5E is used is the relatively short length of the underlying MD5 checksums -- which facilitates cross-platform compatibility for sharing datasets also with users on operating systems that have restrictions on total path length, such as Windows. The one remaining unidentified bit in the file name is the one after the checksum identifier. This part is the size of the content in bytes. 
- An annexed file in the object tree thus has a file name following this structure: + An annexed file in the object tree thus has a file name following this structure + (but see `the git-annex documentation on keys `_ for the complete details): - ``checksum-identifier - size -- checksum . extension`` + ``-s--.`` You now know a great deal more about git-annex and the object tree. Maybe you are as amazed as we are about some of the ingenuity used behind the scenes. Even more mesmerizing things about git-annex can be found in its `documentation `_. + +.. raw:: latex + + \vspace{1cm} + +.. image:: ../artwork/src/teacher.svg + :width: 50% + :align: center + .. index:: ! broken symlink, ! symlink; broken .. _wslfiles: @@ -253,7 +268,7 @@ Broken symlinks ^^^^^^^^^^^^^^^ Whenever a symlink points to a non-existent target, this symlink is called -*broken*, and opening the symlink would not work as it does not resolve. The +*broken* or *dangling*, and opening the symlink would not work as it does not resolve. The section :ref:`file system` will give a thorough demonstration of how symlinks can break, and how one can fix them again. Even though *broken* sounds troublesome, most types of broken symlinks you will encounter can be fixed, @@ -276,14 +291,16 @@ Alternatively, use the :shcmd:`ls` command in a terminal instead of a file manag Other tools may be more more specialized, smaller, or domain-specific, and may fail to correctly work with broken symlinks, or display unhelpful error messages when handling them, or require additional flags to modify their behavior. When encountering unexpected behavior or failures, try to keep in mind that a dataset without retrieved content appears to be a pile of broken symlinks to a range of tools, consult a tools documentation with regard to symlinks, and check whether data retrieval fixes persisting problems. -A last special case on symlinks exists if you are using DataLad on the Windows Subsystem for Linux. 
-If so, please take a look into the Windows Wit below. +A last special case on symlinks exists if you are using DataLad on the Windows Subsystem for Linux. Take a look at the :windows-wit:`on WSL2 symlink access ` +for that. .. index:: pair: access WSL2 symlinked files; on Windows single: WSL2; symlink access pair: log; Git command .. windows-wit:: Accessing symlinked files from your Windows system + :name: ww-wsl2-symlinks + :float: tbp .. include:: topic/wsl2-symlinkaccess.rst diff --git a/docs/basics/101-122-config.rst b/docs/basics/101-122-config.rst index 5d37e95f4..b7e04d079 100644 --- a/docs/basics/101-122-config.rst +++ b/docs/basics/101-122-config.rst @@ -256,8 +256,7 @@ remaining sections in that file, and the :ref:`that dissects this config file fu Let's walk through the Git config file of ``DataLad-101``: As mentioned above, git-annex will use the :term:`Git config file` for some of its configurations, such as the second section. - It lists the repository version and git-annex - UUID [#f4]_ (:gitannexcmd:`whereis` displays information about where the + It lists the repository version and :term:`annex UUID` [#f4]_ (:gitannexcmd:`whereis` displays information about where the annexed content is with these UUIDs). You may recognize the fourth part of the configuration, the subsection diff --git a/docs/basics/101-123-config2.rst b/docs/basics/101-123-config2.rst index 01384ff13..3f07ba268 100644 --- a/docs/basics/101-123-config2.rst +++ b/docs/basics/101-123-config2.rst @@ -37,8 +37,7 @@ There even is one key word that you recognize: MD5E. If you have read the :ref:`Find-out-more on object trees ` you will recognize it as a reference to the type of key used by git-annex to identify and store file content in the object-tree. -The first row, ``* annex.backend=MD5E``, therefore translates to "Everything in this -directory should be checksummed with the MD5 hash function". 
+The first row, ``* annex.backend=MD5E``, therefore translates to "The ``MD5E`` git-annex backend should be used for any file". But what is the rest? We'll start with the last row: .. code-block:: bash @@ -66,17 +65,14 @@ configured git-annex to regard all files of type "binary" as a large file. Thanks to this little line, your text files are not annexed, but stored directly in Git. -The patterns ``*`` and ``**`` are so-called "wildcards" used in :term:`globbing`. -``*`` matches any file or directory in the current directory, and ``**`` matches -all files and directories in the current directory *and subdirectories*. In technical -terms, ``**`` matches *recursively*. The third row therefore -translates to "Do not annex anything that is a text file in this directory" for git-annex. - -However, rules can be even simpler. The second row simply takes a complete directory -(``.git``) and instructs git-annex to regard nothing in it as a "large file". -The second row, ``**/.git* annex.largefiles=nothing`` means that no -``.git`` repository in this directory or a subdirectory should be considered -a "large file". This way, the ``.git`` repositories are protected from being annexed. +The patterns ``*`` and ``**`` are so-called "wildcards" you might recognize from :term:`globbing`. +In Git configuration files, an asterisk "*" matches anything except a slash. +The third row therefore +translates to "Do not annex anything that is a text file" for git-annex. +Two leading asterisks ("``**``") followed by a slash match +*recursively* in all directories. +Therefore, the second row instructs git-annex to regard nothing starting with ``.git`` as a "large file", including contents inside of ``.git`` directories. +This way, the ``.git`` repositories are protected from being annexed. If you had a single file (``myfile.pdf``) you would not want annexed, specifying a rule such as: .. 
code-block:: bash diff --git a/docs/basics/topic/adjustedmode-nosymlinks.rst b/docs/basics/topic/adjustedmode-nosymlinks.rst index c95a2460a..662704ae2 100644 --- a/docs/basics/topic/adjustedmode-nosymlinks.rst +++ b/docs/basics/topic/adjustedmode-nosymlinks.rst @@ -1,6 +1,6 @@ Windows has insufficient support for :term:`symlink`\s and revoking write :term:`permissions` on files. -Therefore, :term:`git-annex` classifies it as a :term:`crippled file system` and has to stray from its default behavior. -While git-annex on Unix-based file operating systems stores data in the annex and creates a symlink in the data's original place, on Windows it moves data into the :term:`annex` and creates a *copy* of the data in its original place. +Therefore, :term:`git-annex` classifies it as a :term:`crippled file system` and has to stray from its default behavior: it enters :term:`adjusted mode`. +While git-annex on Unix-based file operating systems stores data in the annex and creates a symlink in the data's original place, on Windows it moves data into the :term:`annex` and creates a *copy* of the data in its original place. This behavior is not specific to Windows, but is done for any impaired file system, such as a dataset on a USB-stick plugged into a Mac. **Why is that?** Data *needs* to be in the annex for version control and transport logistics -- the annex is able to store all previous versions of the data, and manage the transport to other storage locations if you want to publish your dataset. diff --git a/docs/glossary.rst b/docs/glossary.rst index 678021e50..20d4daa5c 100644 --- a/docs/glossary.rst +++ b/docs/glossary.rst @@ -16,13 +16,32 @@ Glossary adjusted branch .. index:: pair: adjusted branch; in adjusted mode - - git-annex concept: a special :term:`branch` in a dataset. - Adjusted branches refer to a different, existing branch that is not adjusted. 
The adjusted branch is called "adjusted/(unlocked)", and on an adjusted branch, all files handled by :term:`git-annex` are not locked -- - They will stay "unlocked" and thus modifiable. - Instead of referencing data in the :term:`annex` with a :term:`symlink`, unlocked files need to be copies of the data in the annex. - Adjusted branches primarily exist as the default branch on so-called :term:`crippled file system`\s such as Windows. + pair: adjusted branch; git-annex concept + + A specially managed :term:`branch` in a dataset. + An adjusted branch presents a modified (adjusted) view on its + :term:`corresponding branch`. The most common use of an adjusted branch + is a work tree where all files are "unlocked". + Such a branch is named ``adjusted/(unlocked)``, and + all files handled by :term:`git-annex` are immediately modifiable. + Instead of referencing data in the :term:`annex` with a :term:`symlink`, + unlocked files need to be copies of the data in the annex. + Files where no content is available locally are also files, but only + contain placeholder content. Some adjusted modes hide files without + available content entirely. + Adjusted branches are locally managed, and it is not meaningful to push + them to other dataset clones. + Adjusted branches primarily exist as the default branch on so-called + :term:`crippled file system`\s such as Windows. + + adjusted mode + .. index:: + pair: adjusted mode; git-annex concept + + A repository mode that uses an :term:`adjusted branch` for the work tree. + This mode can be entered manually (see ``git annex adjust``), or automatically + when git-annex detects a file system with insufficient capabilities + (see :term:`crippled file system`). annex .. index:: @@ -30,9 +49,23 @@ Glossary git-annex concept: a different word for :term:`object-tree`. + annex key + .. index:: + pair: file content identifier; git-annex concept + pair: annex key; git-annex concept + + Git-annex file content identifier. 
It is used for naming objects + in a dataset :term:`annex`. These identifiers follow a + `strict naming scheme `_. + However, various types of identifiers, so-called + `backends `_, can be used. Most + backends are based on a :term:`checksum`, thereby enabling content verification + and data integrity checks for files in an annex. + annex UUID .. index:: pair: location identifier; git-annex concept + pair: annex uuid; git-annex concept A :term:`UUID` assigned to an annex of each individual :term:`clone` of a dataset repository. :term:`git-annex` uses this UUID to track file content availability information. @@ -127,6 +160,15 @@ Glossary A text file that lists all required components of the computational environment that a :term:`software container` should contain. It is made by a human user. + corresponding branch + .. index:: + pair: corresponding branch; in adjusted mode + + A :term:`branch` underlying a particular :term:`adjusted branch`. + Changes committed to an adjusted branch are propagated to its corresponding + branch. Only the corresponding branch is suitable for sharing with other + repository clones. + crippled file system .. index:: pair: crippled file system; git-annex concept diff --git a/docs/latex/preamble_end.sty b/docs/latex/preamble_end.sty index 678ca80b6..73ee56885 100644 --- a/docs/latex/preamble_end.sty +++ b/docs/latex/preamble_end.sty @@ -1,7 +1,7 @@ \renewcommand{\sphinxstyletheadfamily}{\bfseries} % make :term: references visually distinct in a print % by adding a small "dictionary" symbol -\renewcommand{\sphinxtermref}[1]{\mbox{\textit{#1}\hspace{0.1em}\raisebox{.1em}{\scriptsize{\color{dataladlightgray}\faBook}}}} +\renewcommand{\sphinxtermref}[1]{\mbox{\textit{#1}\raisebox{.3em}{\tiny{\color{dataladlightgray}\faBook}}}} % better control over the spacing of list items % MIH: we cannot use enumitem, it messes with the (description) list