Skip to content

Commit

Permalink
Merge pull request #8 from pvanheus/make_data
Browse files Browse the repository at this point in the history
Completed Task 2 on writing shell scripts
  • Loading branch information
pvanheus authored Mar 5, 2024
2 parents ca6d276 + 086b8d4 commit e942669
Showing 1 changed file with 163 additions and 10 deletions.
173 changes: 163 additions & 10 deletions modules/unix-linux-introduction/_posts/2024-02-02-task-two.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,10 @@ Shell variables can be specified using braces e.g. `echo ${MYVAR}`. These braces
variable name starts and ends so that you can do things like `echo ${MYVAR}_word`. Without the braces `echo $MYVAR_word` would
be interpreted as a reference to a variable called `MYVAR_word`.

The variables that you set are only available in the shell session where you set them unless you use the `export` command. If
you say `export MYVAR` then `MYVAR` is available to any commands that you run in your shell. For more on `export`,
read [this guide](https://www.digitalocean.com/community/tutorials/export-command-linux) from DigitalOcean.

##### Advanced shell variable expansion

The [GNU Bash manual](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html) has a section on shell
Expand All @@ -56,11 +60,13 @@ echo ${MYVAR#*/}
# this will print patrice
echo ${MYVAR##*/}

# print the part of the variable's value that comes after the last pattern match
# NOTE: the pattern now changes from /* to */ for the next two examples

# print the part of the variable's value that comes before the last pattern match
# this will print /home/user
echo ${MYVAR%/*}

# print the part of the variable's value that comes after the first pattern match
# print the part of the variable's value that comes before the first pattern match
# this will print a blank line
echo ${MYVAR%%/*}
```
Expand Down Expand Up @@ -165,15 +171,162 @@ three
```

The Bash shell `$()` syntax provides an effective way to create lists to loop over. E.g. `$(cat myfile.txt)` creates a list of all the
lines in the file and runs the loop once for each line (and for each word in a line if the lines have spaces in them).
lines in the file and runs the loop once for each line (and for each word in a line if the lines have spaces in them). This is
called _command substitution_ and what it does is to replace the `$()` with the output (`stdout`) of the command inside the brackets.

For example, using an example from Task 1, `$(tail -n+2 cases.csv |cut -d, -f3)` would give a list of all the dates in the `cases.csv` file. When
this is used with a `for` loop, all whitespace (spaces, tabs and newlines) acts to separate _items_ and then loop runs once for each item in the
list. For a more formal description of what command substitution is doing, look in the [GNU Bash manual](https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html).


#### Shell conditionals (If statements)

One final component of Bash shell scripts is `if statements`. The basic structure of a conditional (i.e. `if statement`) is:

```bash
if SOMETEST
then
# do something
fi
```

For an introductory tutorial see [here](https://ryanstutorials.net/bash-scripting-tutorial/bash-if-statements.php) and for a more complete
reference consult the [GNU Bash manual](https://www.gnu.org/software/bash/manual/html_node/Conditional-Constructs.html).

An important point about the `SOMETEST` part listed above is that it is actually run as a command. Most commonly one uses the test command,
which is expressed using square brackets (`[[ ]]`) but any command can be used. Any command that returns an exit status of 0 counts as True,
and any other exit status counts as False.

The square bracket tests (`[[ ]]`) allow [conditional expressions](https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html)
and combinations of these conditional expressions. Some of the most common test are `-f` to test if a file exists and `-d` to test if a path is a
directory, `=` to test if two strings are equal and `-eq` to test if two numbers are equal. There are many others, please refer to the linked tutorials.

Expressions can be combined with logical operators: `&&` for AND, `||` for OR, `!` for NOT (changing True to False or False to True) and brackets `( )`
for grouping.

#### A practical shell script

Here is an example shell script that fetches files using a file like the `file_list1.csv` mentioned previously:

{% highlight bash linenos %}
#!/bin/bash

set -u
set -e

LISTFILE=$1

if [[ ! -f $LISTFILE ]] ; then
echo "Incorrect filename, $LISTFILE not found" >&2
exit 1
fi

if [[ ! -d tracking ]] ; then
mkdir tracking
fi

for URL in $(cut -d, -f2 $LISTFILE); do
# links look like
# https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/SRR8364252_1.fastq.gz
OUTPUT=${URL##*/}
if [[ ! -f tracking/$OUTPUT ]] ; then
wget -c -O $OUTPUT $URL
# curl -L -O $OUTPUT $URL
touch tracking/$OUTPUT
fi
done
{% endhighlight %}

1. Line 1 has the `#!/bin/bash`, the so-called `shebang line`, that tells the system how to run this script

2. Line 3 (`set -u`) tells the Bash shell that referring to uninitialised values should trigger an error. This will force the user to provide
a command line parameter (i.e. the $1 variable)

3. Line 4 (`set -e`) tells the Bash shell to exit if there is any error. This means that if e.g. a download fails and `wget` exits with an error,
the script won't carry on running.

4. Line 6 sets the `LISTFILE` variable to the value of `$1` and lines 8 through 11 check if the file exists. If the file does not exist
(the `! -f` means "file does not exist") then a message is printed to `stderr` (that's what the `>&2` redirect does) and the program exits
with an error (`exit 1`).

5. Lines 13 through check for the existing of a directory (a subdirectory of the directory in which this script is run) called `tracking` and
create it if it doesn't exist. This directory will be used to keep track fo which file has been downloaded.

6. Line 17 sets up the loop. `cut -d, -f2` is used to extract the URL part from the LISTFILE and the list of URLs (links) is returned using a `$()`
operator.

7. Line 20 uses a pattern match to remove everything before the last match of the shell (i.e. glob) pattern `*/`. This leaves only the filename part of
the URL. The OUTPUT variable is set to the destination filename.

8. Line 21 tests to see if there is a file in the `tracking` directory with the `OUTPUT` filename. If no such file exists, the download will proceed

9. Line 22 downloads the file, storing it in the `OUTPUT` filename. The `-c` flag of `wget` will continue a download if it finds a partially downloaded
file.

10. Line 23 is commented out but shows how to do a download using `curl`. The `-L` flag to `curl` ensures that if the server tells `curl` that the file has
move, `curl` will download from the new link. Note that this `curl` command line will not resume partial downloads. The `-C -` option of `curl` does
support resuming downloads but it has some technical differences from the way `wget` does resume of partial downloads that make it a bit trickier to use.

11. Line 24 uses `touch` to make an empty file in the `tracking` directory with the same name as the `OUTPUT` filename.

12. Lines 25 and 26 end the `if` statement and the `for` loop.

In summary, this script downloads files as specified in `file_list1.csv` and uses a tracking system to ensure that already-downloaded files are not downloaded
a second time. If it turns out to be necessary to re-download a file, the corresponding tracking file can be deleted from the `tracking` directory.


#### Examining the downloaded files

The files are gzip-compressed FASTQ data files from a DNA sequencer. The presumption is that each file is the output of a sequencing run. Let us examine
the downloaded files. One simple test is to check the sizes of the files. Are there any files with unusual sizes?

<details markdown="1">
<summary>Show answer</summary>
Sort the listing of file sizes with `ls -l *.fastq.gz |sort -k5nr`. The `-k5` specificies that `sort` must use the 5th column for sorting and the `n` specifies that
this must be a numeric sort and finally the `r` specifies that the sort order must be reversed, with the results in descending, not ascending, order.

Looking at the files sorted this way shows that `SRR8364257_1.fastq.gz` and `SRR8364257_2.fastq.gz` are dramatically smaller than the other files. This could be
a signal that there is something wrong. Perhaps there were problems with the sequencing run or the sample being sequenced?
</details>

##### Checksums and using them

[Checksums](https://en.wikipedia.org/wiki/Checksum) are calculations used to verify that a file has not been altered or has been downloaded correctly. The
checksums for the downloaded files are available at [this link](https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/checksums.txt) and can
be downloaded using e.g. `wget https://pathogen-genomics-march-2024.sanbi.ac.za/data/shell/reads/checksums.txt`. The checksums in this file are
[SHA256](https://en.wikipedia.org/wiki/SHA-2) checksums generated using the `sha256sum` tool. Another common checksum type is [MD5](https://en.wikipedia.org/wiki/MD5)
which can be generated using the `md5sum` tool.

The `checksums.txt` file contains lines like:

```
d4d7dac720ae285317f8d8adba2ba9077a9e5e4f6099bc9aee3f9900c77a518a SRR8364252_1.fastq.gz
38827e72486e2b1025270c619ea110ead5d80901a1f7f7a7ecc5a5f3f604b69b SRR8364252_2.fastq.gz
```

where the first column is the checksum and the second is the filename that the checksum applies to. With this file in hand and the files already downloaded,
we can check if the files match their checksums:

```
sha256sum -c checksums.txt
```

What does this show? What might be going on?

<details markdown="1">
<summary>Show answer</summary>
For most of the files we get the response that the file is OK and matches it's checksum. For sample `SRR8364261_2.fastq.gz` we get the message `FAILED`.

Our download script did not show any errors, and we can verify this by manually re-downloading the file using `wget` or `curl` and checking the checksums
again. This shows that either there is a problem with the checksum provided or the with the file on the remote server. This is a problem that you would raise
with the people who are providing the downloadable files and checksums.
</details>

##### Resources

[Introduction to Bash Shell Scripting](https://www.linode.com/docs/guides/intro-bash-shell-scripting/)

[Advanced Bash Scripting Guide](https://tldp.org/LDP/abs/html/)

[Introduction to Shell scripting](https://sambitsinha.hashnode.dev/linux-shell-scripting-basics-an-introduction-to-shell-scripting-in-linux) and [Intermediate Shell Scripting](https://sambitsinha.hashnode.dev/intermediate-shell-scripting-in-linux) from [Sambit Sinha](https://sambitsinha.hashnode.dev/?source=top_nav_blog_home)

1. Look at file_list1
2. Write a loop to fetch data
2.1. dealing with incomplete fetch
3. Checksums - check if the data is right
4. Which is the smallest sample?
5. zcat / zmore - looking inside a file
6. count the number of reads in the files

0 comments on commit e942669

Please sign in to comment.