diff --git a/modules/unix-linux-introduction/_posts/2024-02-02-task-two.md b/modules/unix-linux-introduction/_posts/2024-02-02-task-two.md
index e39d7b2c2..4d6d60471 100644
--- a/modules/unix-linux-introduction/_posts/2024-02-02-task-two.md
+++ b/modules/unix-linux-introduction/_posts/2024-02-02-task-two.md
@@ -38,6 +38,10 @@ Shell variables can be specified using braces e.g. `echo ${MYVAR}`. These braces
 variable name starts and ends so that you can do things like `echo ${MYVAR}_word`. Without the braces `echo $MYVAR_word` would be interpreted as a reference to a variable called `MYVAR_word`.

+The variables that you set are only available in the shell session where you set them unless you use the `export` command. If
+you say `export MYVAR` then `MYVAR` is available to any commands that you run in your shell. For more on `export`,
+read [this guide](https://www.digitalocean.com/community/tutorials/export-command-linux) from DigitalOcean.
+
 ##### Advanced shell variable expansion

 The [GNU Bash manual](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html) has a section on shell
@@ -56,11 +60,13 @@ echo ${MYVAR#*/}
 # this will print patrice
 echo ${MYVAR##*/}

-# print the part of the variable's value that comes after the last pattern match
+# NOTE: the pattern now changes from */ to /* for the next two examples
+
+# print the part of the variable's value that comes before the last pattern match
 # this will print /home/user
 echo ${MYVAR%/*}

-# print the part of the variable's value that comes after the first pattern match
+# print the part of the variable's value that comes before the first pattern match
 # this will print a blank line
 echo ${MYVAR%%/*}
 ```
@@ -165,15 +171,162 @@ three
 ```

 The Bash shell `$()` syntax provides an effective way to create lists to loop over. E.g. `$(cat myfile.txt)` creates a list of all the
-lines in the file and runs the loop once for each line (and for each word in a line if the lines have spaces in them).
+lines in the file and runs the loop once for each line (and for each word in a line if the lines have spaces in them). This is
+called _command substitution_: the shell replaces the `$()` with the output (`stdout`) of the command inside the brackets.
+
+For example, using an example from Task 1, `$(tail -n+2 cases.csv |cut -d, -f3)` would give a list of all the dates in the `cases.csv` file. When
+this is used with a `for` loop, all whitespace (spaces, tabs and newlines) acts to separate _items_ and the loop then runs once for each item in the
+list. For a more formal description of what command substitution is doing, look in the [GNU Bash manual](https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html).
+
+
+#### Shell conditionals (If statements)
+
+One final component of Bash shell scripts is `if statements`. The basic structure of a conditional (i.e. `if statement`) is:
+
+```bash
+if SOMETEST
+then
+    # do something
+fi
+```
+
+For an introductory tutorial see [here](https://ryanstutorials.net/bash-scripting-tutorial/bash-if-statements.php) and for a more complete
+reference consult the [GNU Bash manual](https://www.gnu.org/software/bash/manual/html_node/Conditional-Constructs.html).
+
+An important point about the `SOMETEST` part listed above is that it is actually run as a command. Most commonly one uses Bash's conditional
+construct, which is written with double square brackets (`[[ ]]`), but any command can be used. Any command that returns an exit status of 0 counts as True,
+and any other exit status counts as False.
+
+The square bracket tests (`[[ ]]`) allow [conditional expressions](https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html)
+and combinations of these conditional expressions. Some of the most common tests are `-f` to test if a path exists and is a regular file, `-d` to test if a path is a
+directory, `=` to test if two strings are equal and `-eq` to test if two numbers are equal.
+There are many others; please refer to the linked tutorials.
+
+Expressions can be combined with logical operators: `&&` for AND, `||` for OR, `!` for NOT (changing True to False or False to True) and brackets `( )`
+for grouping.
+
+#### A practical shell script
+
+Here is an example shell script that fetches files using a file like the `file_list1.csv` mentioned previously:
+
+{% highlight bash linenos %}
+#!/bin/bash
+
+set -u
+set -e
+
+LISTFILE=$1
+
+if [[ ! -f $LISTFILE ]] ; then
+    echo "Incorrect filename, $LISTFILE not found" >&2
+    exit 1
+fi
+
+if [[ ! -d tracking ]] ; then
+    mkdir tracking
+fi
+
+for URL in $(cut -d, -f2 "$LISTFILE"); do
+    # links look like
+    # https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364252_1.fastq.gz
+    OUTPUT=${URL##*/}
+    if [[ ! -f tracking/$OUTPUT ]] ; then
+        wget -c -O "$OUTPUT" "$URL"
+        # curl -L -o "$OUTPUT" "$URL"
+        touch "tracking/$OUTPUT"
+    fi
+done
+{% endhighlight %}
+
+1. Line 1 has the `#!/bin/bash`, the so-called `shebang line`, that tells the system how to run this script.
+
+2. Line 3 (`set -u`) tells the Bash shell that referring to uninitialised values should trigger an error. This will force the user to provide
+a command line parameter (i.e. the `$1` variable).
+
+3. Line 4 (`set -e`) tells the Bash shell to exit if there is any error. This means that if e.g. a download fails and `wget` exits with an error,
+the script won't carry on running.
+
+4. Line 6 sets the `LISTFILE` variable to the value of `$1` and lines 8 through 11 check if the file exists. If the file does not exist
+(the `! -f` means "file does not exist") then a message is printed to `stderr` (that's what the `>&2` redirect does) and the program exits
+with an error (`exit 1`).
+
+5. Lines 13 through 15 check for the existence of a directory (a subdirectory of the directory in which this script is run) called `tracking` and
+create it if it doesn't exist. This directory will be used to keep track of which files have been downloaded.
+
+6. Line 17 sets up the loop. `cut -d, -f2` is used to extract the URL part from the `LISTFILE` and the list of URLs (links) is returned using a `$()`
+operator.
+
+7. Line 20 uses a pattern match to remove the longest prefix that matches the shell (i.e. glob) pattern `*/`, that is, everything up to and including
+the last `/`. This leaves only the filename part of the URL. The `OUTPUT` variable is set to this destination filename.
+
+8. Line 21 tests to see if there is a file in the `tracking` directory with the `OUTPUT` filename. If no such file exists, the download will proceed.
+
+9. Line 22 downloads the file, storing it under the `OUTPUT` filename. The `-c` flag of `wget` will continue a download if it finds a partially downloaded
+file.
+
+10. Line 23 is commented out but shows how to do a download using `curl`. The `-o` flag names the output file and the `-L` flag ensures that if the server
+tells `curl` that the file has moved, `curl` will download from the new link. Note that this `curl` command line will not resume partial downloads. The `-C -` option of `curl` does
+support resuming downloads but it has some technical differences from the way `wget` resumes partial downloads that make it a bit trickier to use.
+
+11. Line 24 uses `touch` to make an empty file in the `tracking` directory with the same name as the `OUTPUT` filename.
+
+12. Lines 25 and 26 end the `if` statement and the `for` loop.
+
+In summary, this script downloads files as specified in `file_list1.csv` and uses a tracking system to ensure that already-downloaded files are not downloaded
+a second time. If it turns out to be necessary to re-download a file, the corresponding tracking file can be deleted from the `tracking` directory.
+
+
+#### Examining the downloaded files
+
+The files are gzip-compressed FASTQ data files from a DNA sequencer. The presumption is that each file is the output of a sequencing run. Let us examine
+the downloaded files. One simple test is to check the sizes of the files. Are there any files with unusual sizes?
+
+
+Show answer
+Sort the listing of file sizes with `ls -l *.fastq.gz |sort -k5nr`. The `-k5` specifies that `sort` must use the 5th column for sorting, the `n` specifies that
+this must be a numeric sort and the `r` specifies that the sort order must be reversed, with the results in descending, not ascending, order.
+
+Looking at the files sorted this way shows that `SRR8364257_1.fastq.gz` and `SRR8364257_2.fastq.gz` are dramatically smaller than the other files. This could be
+a signal that there is something wrong. Perhaps there were problems with the sequencing run or the sample being sequenced?
+
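A minimal sketch of the size check described above, run against made-up files rather than the real downloads. The names `sampleA.fastq.gz`, `sampleB.fastq.gz` and `sampleC.fastq.gz` and their sizes are assumptions chosen for illustration, and the sketch assumes that the file size is the 5th column of `ls -l`, as it is on typical Linux systems.

```bash
#!/bin/bash
set -u

# Work in a throwaway directory so that no real files are touched.
DEMO_DIR=$(mktemp -d)

# Create three hypothetical files with known sizes in bytes.
# They are filled with zero bytes, so they are not real gzip files.
head -c 3000 /dev/zero > "$DEMO_DIR/sampleA.fastq.gz"
head -c 20 /dev/zero > "$DEMO_DIR/sampleB.fastq.gz"
head -c 2500 /dev/zero > "$DEMO_DIR/sampleC.fastq.gz"

# List the files sorted on column 5 (the size), numerically, largest first.
LISTING=$(cd "$DEMO_DIR" && ls -l *.fastq.gz | sort -k5nr)
echo "$LISTING"

# With a descending sort the smallest file is on the last line.
# awk prints the last field of that line, which is the filename.
SMALLEST=$(echo "$LISTING" | tail -n 1 | awk '{print $NF}')
echo "Smallest file: $SMALLEST"

# Clean up the throwaway directory.
rm -r "$DEMO_DIR"
```

Dropping the `r` (i.e. `sort -k5n`) would put the smallest file first instead, which can be easier to read when there are many files.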
+
+##### Checksums and using them
+
+[Checksums](https://en.wikipedia.org/wiki/Checksum) are values calculated from the contents of a file and used to verify that the file has not been altered and has been downloaded correctly. The
+checksums for the downloaded files are available at [this link](https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/checksums.txt) and can
+be downloaded using e.g. `wget https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/checksums.txt`. The checksums in this file are
+[SHA256](https://en.wikipedia.org/wiki/SHA-2) checksums generated using the `sha256sum` tool. Another common checksum type is [MD5](https://en.wikipedia.org/wiki/MD5)
+which can be generated using the `md5sum` tool.
+
+The `checksums.txt` file contains lines like:
+
+```
+d4d7dac720ae285317f8d8adba2ba9077a9e5e4f6099bc9aee3f9900c77a518a  SRR8364252_1.fastq.gz
+38827e72486e2b1025270c619ea110ead5d80901a1f7f7a7ecc5a5f3f604b69b  SRR8364252_2.fastq.gz
+```
+
+where the first column is the checksum and the second is the filename that the checksum applies to. With this file in hand and the files already downloaded,
+we can check if the files match their checksums:
+
+```
+sha256sum -c checksums.txt
+```
+
+What does this show? What might be going on?
+
+
+Show answer
+For most of the files we get the response that the file is OK and matches its checksum. For the file `SRR8364261_2.fastq.gz` we get the message `FAILED`.
+
+Our download script did not show any errors, and we can verify this by manually re-downloading the file using `wget` or `curl` and checking the checksums
+again. This shows that there is a problem either with the checksum provided or with the file on the remote server. This is a problem that you would raise
+with the people who are providing the downloadable files and checksums.
+
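The `OK` and `FAILED` responses can be reproduced on a small scale. The sketch below uses two made-up files (`good.txt` and `bad.txt` are hypothetical names, not part of the course data) and assumes that the GNU coreutils `sha256sum` tool is installed. It records checksums, alters one file afterwards, and then verifies both files.

```bash
#!/bin/bash
set -u

# Work in a throwaway directory so that no real files are touched.
DEMO_DIR=$(mktemp -d)
cd "$DEMO_DIR" || exit 1

# Two hypothetical data files.
printf 'ACGT\n' > good.txt
printf 'TTGA\n' > bad.txt

# Record the checksums while both files are in their original state.
sha256sum good.txt bad.txt > checksums.txt

# Simulate a damaged or altered file by appending to it.
printf 'extra\n' >> bad.txt

# -c recomputes each checksum and compares it with the recorded one.
# The summary warning goes to stderr, so it is discarded here.
RESULT=$(sha256sum -c checksums.txt 2>/dev/null)
STATUS=$?

echo "$RESULT"
echo "exit status: $STATUS"

# Return to the previous directory and clean up.
cd "$OLDPWD" || exit 1
rm -r "$DEMO_DIR"
```

With GNU coreutils this should report `good.txt: OK` and `bad.txt: FAILED`. The exit status is non-zero when any file fails, so a script running under `set -e` would stop at this point.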
+
+##### Resources
+[Introduction to Bash Shell Scripting](https://www.linode.com/docs/guides/intro-bash-shell-scripting/)
+[Advanced Bash Scripting Guide](https://tldp.org/LDP/abs/html/)
+[Introduction to Shell scripting](https://sambitsinha.hashnode.dev/linux-shell-scripting-basics-an-introduction-to-shell-scripting-in-linux) and [Intermediate Shell Scripting](https://sambitsinha.hashnode.dev/intermediate-shell-scripting-in-linux) from [Sambit Sinha](https://sambitsinha.hashnode.dev/?source=top_nav_blog_home)
-1. Look at file_list1
-2. Write a loop to fetch data
-  2.1. dealing with incomplete fetch
-3. Checksums - check if the data is right
-4. Which is the smallest sample?
-5. zcat / zmore - looking inside a file
-6. count the number of reads in the files
\ No newline at end of file