# 16S Intermediate Bioinformatics Training - Software setup and testing

  1. Test sshing into the cluster
  2. Running RStudio on a cluster
  3. Running the DADA2 Nextflow pipeline on test data

## 1. Test sshing into the cluster

Log in with your USERNAME (if you have not sent us your public key, ssh will not work):

```
$ ssh USERNAME@cbio-training.ilifu.ac.za
USERNAME@slurm-login:~$ hostname
slurm-login
```

Submit an interactive job:

```
$ srun --nodes=1 --ntasks 1 --mem=8g --pty bash
USERNAME@slurm-worker-0001:~$ hostname
slurm-worker-0001
```

## 2. Running RStudio on the Ilifu Slurm cluster

These steps work for both Linux and Windows users. Windows users should use git-bash.

First, configure ssh so that it is simple to connect to a worker node once a job is running. The easiest way is to add the following to the `~/.ssh/config` file on your local machine (open it with, for example, `nano ~/.ssh/config`). Note that this must be done on your local machine:

```
Host cbio-training.ilifu.ac.za
    User USERNAME
    ForwardAgent yes

Host slurm-worker-*
    Hostname %h
    User USERNAME
    StrictHostKeyChecking no
    ProxyCommand ssh cbio-training.ilifu.ac.za nc %h 22
```

Substitute your own login ID for USERNAME in the settings above.

Open a terminal and start a new session. Next, start an interactive job and launch RStudio. Below is an example of launching a single-node, 1-core job with 8 GB of RAM:

```
USERNAME@slurm-login:~$ srun --nodes=1 --ntasks 1 --mem=8g --pty bash
USERNAME@slurm-worker-0001:~$
```
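Launching RStudio below requires a password in the `RSTUDIO_PASSWORD` variable. Rather than inventing one by hand, you could generate a throwaway one; this is only a sketch and assumes `python3` is available on the node:

```shell
# Generate a random, URL-safe password (~22 characters from 16 random
# bytes) using Python's standard-library secrets module.
RSTUDIO_PASSWORD=$(python3 -c 'import secrets; print(secrets.token_urlsafe(16))')
echo "RSTUDIO_PASSWORD is ${RSTUDIO_PASSWORD}"
```

Keep a note of the value — you will need to type the same password into the RStudio login page later.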

Once the interactive session has begun on a specific node (in this case slurm-worker-0001), RStudio can be launched as follows:

```
USERNAME@slurm-worker-0001:~$ RSTUDIO_PASSWORD='Make your own secure password here' /cbio/images/bionic-R3.6.1-RStudio1.2.1335-bio.simg
Running rserver on port 45299
```

This launches an RStudio server listening on a random free port (in this case 45299). Next, forward a port from your local machine to that node.

Open a new terminal on your local machine and connect to the appropriate node by running:

```
ssh slurm-worker-0001 -L8082:localhost:45299
```

This forwards traffic on your local machine's port 8082 to the worker node's port 45299 (ssh knows how to reach slurm-worker-0001 via the `.ssh/config` settings above). You may use any free local port; ssh will complain if you choose one that is already in use, with an error message like:

```
bind [127.0.0.1]:8000: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 8000
```
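If the port you picked is taken, one way to find a free local port is to ask the operating system for one before setting up the tunnel. A sketch, assuming `python3` is available on your local machine:

```shell
# Binding to port 0 makes the OS hand back an unused ephemeral port;
# print it so it can be used for the ssh tunnel.
free_port=$(python3 -c 'import socket
s = socket.socket()
s.bind(("127.0.0.1", 0))
print(s.getsockname()[1])
s.close()')
echo "Forwarding via local port ${free_port}"
# Then use it in place of 8082, e.g.:
# ssh slurm-worker-0001 -L"${free_port}":localhost:45299
```

There is a small window between closing the probe socket and ssh binding the port, but in practice this is rarely a problem.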

Finally, point your browser at http://localhost:8082 and log in with your USERNAME and the RSTUDIO_PASSWORD you set.

*(Screenshot: RStudio)*

DADA2 should already be installed; verify it in the R console:

```
> library("dada2")
Loading required package: Rcpp
Registered S3 methods overwritten by 'ggplot2':
  method         from
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
> packageVersion("dada2")
[1] ‘1.12.1’
```

Looks OK.

## 3. Running the DADA2 Nextflow pipeline on test data

Log in to the login node and do the setup:

```
$ cd $HOME
$ git clone https://github.com/h3abionet/16S-rDNA-dada2-pipeline
$ cd $HOME/16S-rDNA-dada2-pipeline
```

Get onto a worker node and select 2 cores:

```
$ srun --nodes=1 --cpus-per-task 2 --mem=8g --pty bash
```

Run Nextflow:

```
$ nextflow run main.nf -profile standard --reads="/cbio/data/test-data/*_R{1,2}.fastq.gz" --trimFor 24 --trimRev 25 --reference="/cbio/data/ref-data/silva_nr_v132_train_set.fa.gz" --species="/cbio/data/ref-data/silva_species_assignment_v132.fa.gz" --runTree="fasttree" --outdir="$HOME/out"
N E X T F L O W  ~  version 19.07.0
Launching `main.nf` [exotic_heisenberg] - revision: 1696132777
===================================
uct-cbio/16S-rDNA-dada2-pipeline  ~  version 0.4
===================================
Run Name       : exotic_heisenberg
Reads          : /cbio/data/test-data/*_R{1,2}.fastq.gz
trimFor        : 24
trimRev        : 25
truncFor       : 248
truncRev       : 212
truncQ         : 2
maxEEFor       : 2
maxEERev       : 2
maxN           : 0
maxLen         : Inf
minLen         : 50
rmPhiX         : T
minOverlap     : 20
maxMismatch    : 0
trimOverhang   : F
species        : /cbio/data/ref-data/silva_species_assignment_v132.fa.gz
pool           : pseudo
Reference      : /cbio/data/ref-data/silva_nr_v132_train_set.fa.gz
Max Memory     : 384 GB
Max CPUs       : 40
Max Time       : 3d
Output dir     : /cbio/home/gerrit/out
Working dir    : /cbio/home/gerrit/16S-rDNA-dada2-pipeline/work
Container      : /cbio/containers/16s-rdna-dada2-pipeline.simg
Current home   : /cbio/home/gerrit
Current user   : gerrit
Current path   : /cbio/home/gerrit/16S-rDNA-dada2-pipeline
Script dir     : /cbio/home/gerrit/16S-rDNA-dada2-pipeline
Config Profile : training
=========================================
executor >  local (15)
[fd/c5dfb3] process > runFastQC (rFQC.Dog24)                          [100%] 4 of 4 ✔
[9e/6a2141] process > runMultiQC (rMQC)                               [100%] 1 of 1 ✔
[95/9e5acf] process > filterAndTrim (filterAndTrim)                   [100%] 4 of 4 ✔
[37/15cf0a] process > runFastQC_postfilterandtrim (rFQC_post_FT.Dog2) [100%] 4 of 4 ✔
[-        ] process > runMultiQC_postfilterandtrim
[-        ] process > mergeTrimmedTable
[dc/19e08d] process > LearnErrorsFor (LearnErrorsFor)                 [100%] 1 of 1 ✔
[7e/62c5a5] process > LearnErrorsRev (LearnErrorsRev)                 [  0%] 0 of 1
[-        ] process > SampleInferDerepAndMerge
[-        ] process > mergeDadaRDS
[-        ] process > SequenceTable
[-        ] process > ChimeraTaxonomySpecies
[-        ] process > AlignAndGenerateTree
[-        ] process > BiomFile
[-        ] process > ReadTracking
```

Looks OK.