Merge pull request #29 from hbctraining/heather_edits
Heather edits
hwick authored Mar 15, 2024
2 parents 12c13af + 2efbc87 commit 50b1501
Showing 8 changed files with 426 additions and 348 deletions.
33 changes: 22 additions & 11 deletions Accelerate_with_automation/README.md

### Description

**Add more here**

This repository has teaching materials for a **3 hour**, hands-on **Intermediate bash** workshop led at a relaxed pace. Many tools for the analysis of big data require knowledge of the command line, and this workshop will build on the basic skills taught in **The Foundation - Basic Shell** workshop to allow for greater automation using scripts.

### Learning Objectives

* Define what a variable is and store information using variables
* Distinguish between variables and positional parameters
* Create a script to run multiple commands as a single command
* Implement loops, positional parameters and variables in a bash script
* Run existing R and Python scripts via the command line
* Learn about Slurm arrays for automation on a high-performance computing cluster

> These materials are developed for a trainer-led workshop, but are also amenable to self-guided learning.

### Contents

| Lessons | Estimated Duration |
|:------------------------|:----------|
|[Setting up](lessons/setting_up.md) | 15 min |
|[Shell scripts and `for` loops](lessons/loops_and_scripts.md) | 75 min |
|[Positional parameters](lessons/positional_params.md) | 45 min |
|[Slurm arrays](lessons/arrays_in_slurm.md)| 15 min |


### Dataset

### Installation Requirements

***Mac users:***

No installation requirements.

***Windows users:***
[GitBash](https://git-scm.com/download/win)

### Resources

* Shell cheatsheets:
  * [http://fosswire.com/post/2007/08/unixlinux-command-cheat-sheet/](http://fosswire.com/post/2007/08/unixlinux-command-cheat-sheet/)
  * [https://github.com/swcarpentry/boot-camps/blob/master/shell/shell_cheatsheet.md](https://github.com/swcarpentry/boot-camps/blob/master/shell/shell_cheatsheet.md)
* [Explain shell](http://explainshell.com) - a website that breaks down a shell command and explains what each component is doing.
* Software Carpentry tutorial: [The Unix shell](https://swcarpentry.github.io/shell-novice/)
* [BASH Programming - Introduction HOW-TO](https://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html)

---

*These materials have been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

Binary file added Accelerate_with_automation/img/Enjoy_Slurm.png
Binary file added Accelerate_with_automation/img/Slurm-1-.webp
Binary file not shown.
55 changes: 55 additions & 0 deletions Accelerate_with_automation/lessons/advanced_pos_param_loop.md
@@ -0,0 +1,55 @@
## Positional Parameters, continued

This lesson is a continuation of **Tying everything together: Using positional parameters in a loop** from the [Positional Parameters Lesson](https://github.com/hbctraining/Training-modules/blob/heather_edits/Accelerate_with_automation/lessons/positional_params.md)

### Advanced use: using positional parameters with and without loops in the same script

What if we still wanted to run this on multiple files AND also provide a different sequence besides "NNNNNNNNNN"? We can do that, but we would need to modify the above script a little bit. Because we are using a `for` loop that iterates over all positional parameters, we can't simply add our sequence to the command like this: `sh generate_bad_reads_summary_param_loop.sh *fq CTGCTAGA`, because bash would treat our new sequence, "CTGCTAGA", exactly as if it were another file in the list. However, using the long form of the `for` loop mentioned above, we can specify where in the list of positional parameters the loop should start, while capturing the earlier positional parameters in their own variables.
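
To make the problem concrete, here is a hypothetical illustration (the `sample*.fq` filenames are made up for this example):

```bash
# Suppose the directory contains sample1.fq and sample2.fq, and we run:
#   sh generate_bad_reads_summary_param_loop.sh *fq CTGCTAGA
# bash expands the glob BEFORE the script starts, so inside the script:
#   $1 = sample1.fq   $2 = sample2.fq   $3 = CTGCTAGA
# and a loop over "$@" would treat CTGCTAGA as just another "file".
```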

Below is the `generate_bad_reads_summary_param_loop.sh` script, modified so that we capture the first positional parameter outside of the `for` loop and use it as a user-provided sequence. Like the `generate_bad_reads_summary_param.sh` script we modified in the positional parameters exercise, it will search for whatever sequence we provide and incorporate the sequence string into the output filenames. Here are the modifications we have made, enumerated:
1. Capture the first positional parameter in a variable named `$sequence` and echo it for the user to see
2. Change the `for` loop to start at the second positional parameter by using the long form `for filename in "${@:2}"`. Specifically, the `:2` in this statement tells the loop to start with the second positional parameter instead of the first; any number could be used here, depending on the needs of your specific script (see the short aside at the end of this lesson)
3. Replace `param` with `${sequence}` to add the sequence to the output filenames and differentiate them from the output files of the previous scripts
4. Update the USAGE and EXAMPLE to reflect the changes we have made

```bash
#!/bin/bash

## USAGE: User provides sequence to be searched for in user-provided list of files
## Script will output files in the same directory
## EXAMPLE: generate_bad_reads_summary_param_loop2.sh sequence *.fq

sequence=$1

# tell us what sequence we're looking for
echo $sequence

# count bad reads for each FASTQ file in the provided list of files
for filename in "${@:2}"
do
# create a prefix for all output files
base=$(basename $filename .subset.fq)

# tell us what file we're working on
echo $filename

# grab all the bad read records
grep -B1 -A2 $sequence $filename > ${base}.${sequence}.loop.fastq

# grab the number of bad reads and write it to a summary file
grep -cH $sequence $filename > ${base}.${sequence}.loop.count.summary
done
```
Open `nano`, copy/paste the above code, and save it as a new script named `generate_bad_reads_summary_param_loop2.sh`.

Try running the script with the following command:

```bash
sh generate_bad_reads_summary_param_loop2.sh GATTACA *fq
```

Check your output:
```bash
ls -lt
```
If it worked, you should now have yet another set of output files with `GATTACA.loop` in the file names.
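
As a quick aside (referenced in the list of modifications above), if you want to convince yourself what `"${@:2}"` expands to, you can use a minimal throwaway script like the sketch below (the filename `slice_demo.sh` is just for illustration):

```bash
#!/bin/bash

# Print the first positional parameter, then loop over everything from the second one onward
echo "first: $1"

for arg in "${@:2}"
do
  echo "remaining: $arg"
done
```

Running `bash slice_demo.sh A B C` should print `first: A`, followed by `remaining: B` and `remaining: C`.
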
47 changes: 25 additions & 22 deletions Accelerate_with_automation/lessons/arrays_in_slurm.md
@@ -1,7 +1,7 @@

# Arrays in Slurm

When we are working on large data sets our minds often drift back to an old Simpsons episode. Bart is in France and being taught to pick grapes. They show him a detailed technique and he does it successfully. Then they say:


<p align = "center">
<i>(image collapsed in this view)</i>
</p>

One easy way to scale up is to use the array feature in slurm.

Atlassian says this about job arrays on O2: "Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks." [link](https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Job-Arrays).

> **`sbatch` vs. `sh`**
>
> So far we have run all of our scripts as `sh script.sh`, which runs the script while we wait on the command line. However, for jobs that are going to take a very long time this is less than ideal because:
> * You have to wait for the script to finish before you get the command line back to run other tasks
> * If you get disconnected from the cluster, the job will automatically quit
>
> Jobs submitted with `sbatch` give you the command line back immediately and are not dependent on you staying connected to the cluster. We will not cover the basics of `sbatch` here, but to learn how to write these scripts come to our module **Shell tips and tricks on O2**!
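
Purely for orientation (a sketch, not this module's material), the practical difference looks like this; the job ID printed by Slurm is hypothetical:

```bash
# Runs the script in your current shell; you wait until it finishes
sh my_script.sh

# Hands the script to the Slurm scheduler and returns your prompt right away
sbatch my_script.sh
# Submitted batch job 123456
```
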
## Running an array

The tasks in a job array run simultaneously rather than one at a time, which means they are very fast! Additionally, running a job array is very simple:

```bash
sbatch --array=1-10 my_script.sh
```

This will run `my_script.sh` 10 times, with the job IDs 1 through 10.

We can also put this directly into the bash script itself (although we will continue with the command line version here).
```bash
#SBATCH --array=1-10
```
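
For context, that directive sits alongside the other `#SBATCH` lines at the top of a submission script. A minimal sketch (the runtime and core values are placeholders, not settings from this module) might look like:

```bash
#!/bin/bash
#SBATCH --array=1-10   # run this script as array tasks 1 through 10
#SBATCH -t 0-00:10     # placeholder runtime (D-HH:MM)
#SBATCH -c 1           # placeholder number of cores

# Each task can check which member of the array it is
echo "This is array task ${SLURM_ARRAY_TASK_ID}"
```

The script would then be submitted with a plain `sbatch my_script.sh`, with no `--array` needed on the command line.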

We can specify any job IDs we want.
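
For instance (an illustrative sketch; the exact IDs are arbitrary):

```bash
# Run only the tasks with IDs 2, 4 and 6, plus the range 8-10
sbatch --array=2,4,6,8-10 my_script.sh
```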

```bash
sbatch --array=1-16 my_script.sh
```
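
Here `sbatch --array=1-16 my_script.sh` launches one task per line of a 16-line `samples.txt` file (one sample name per line). The submitted script itself is not shown above, but based on the explanation below and the folder-based variant later in this lesson, it would look something like the following sketch, with `samtools` as the assumed example command:

```bash
#!/bin/bash

# Pull the line of samples.txt whose line number (NR) matches this task's ID
file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)

# Use that sample name to run the command, e.g. convert its SAM file to BAM
samtools view -S -b ${file}.sam > ${file}.bam
```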

So what is this script doing? `file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)` pulls the line of `samples.txt` whose line number matches the job ID. We then assign that line to a variable called `${file}` and use it to run our command.

**We will come back to this awk one-liner in our Needle in a Haystack module!**

Job IDs can also be helpful for naming output files or folders. We saw above how we used the job ID to help name our output bam file, but creating and naming folders is helpful in some instances as well.

```bash
#!/bin/bash

# Pull the sample name for this task from samples.txt (we are still in the main directory here)
file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)

# Make a folder named after the job ID and move into it
PREFIX="Folder_${SLURM_ARRAY_TASK_ID}"
mkdir $PREFIX
cd $PREFIX

# Run the command from inside the new folder; the .sam input sits one level up
samtools view -S -b ../${file}.sam > ${file}.bam
```

This script differs from our previous one in that it makes a folder named with the job ID (Folder_1 for job ID 1) and then moves inside it to execute the command. Instead of all 16 of our bam files ending up in a single folder, each will be in its own folder, labeled Folder_1 to Folder_16.

**NOTE:** We define `${file}` BEFORE we move into our new folder, as `samples.txt` is only present in the main directory.
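
To visualize the result, after all 16 tasks finish the top level of the working directory would contain something like the following (the sample name shown is a made-up placeholder):

```bash
ls
# Folder_1  Folder_2  ...  Folder_16  my_script.sh  samples.txt  sampleA.sam  ...

ls Folder_1
# sampleA.bam
```
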
<p align = "center">
<img src="../img/Enjoy_Slurm.png">
</p>
<p align = "center">
Enjoy Slurm!
</p>

***

*This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

* *The materials used in this lesson were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/).
All Data Carpentry instructional material is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0).*
* *Adapted from the lesson by Tracy Teal. Original contributors: Paul Wilson, Milad Fatenejad, Sasha Wood and Radhika Khetani for Software Carpentry (http://software-carpentry.org/)*



