Amplicon 454 ion torrent Nextflow workflow #103

Open · wants to merge 21 commits into base: DEV_Amplicon_454-IonTorrent_NF_conversion
@@ -0,0 +1,4 @@
# Workflow change log

## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_Amp454IonTor_1.0.0/Amplicon/454-and-IonTorrent/Workflow_Documentation/NF_Amp454IonTor)
- Workflow version converted from Snakemake to Nextflow
@@ -0,0 +1,194 @@
# Workflow Information and Usage Instructions

## General Workflow Info

### Implementation Tools

The current GeneLab 454 and IonTorrent amplicon sequencing data processing pipeline (Amp454IonTor), [GL-DPPD-7106.md](../../Pipeline_GL-DPPD-7106_Versions/GL-DPPD-7106.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) containers or [conda](https://docs.conda.io/en/latest/) environments to install and run all tools. This workflow is run through the command line interface (CLI) of any Unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.

## Utilizing the Workflow

1. [Install Nextflow and Singularity](#1-install-nextflow-and-singularity)  
   1a. [Install Nextflow](#1a-install-nextflow)  
   1b. [Install Singularity](#1b-install-singularity)  

2. [Download the workflow files](#2-download-the-workflow-files)  

3. [Fetch Singularity Images](#3-fetch-singularity-images)  

4. [Run the workflow](#4-run-the-workflow)  
   4a. [Approach 1: Run slurm jobs in singularity containers with OSD accession as input](#4a-approach-1-run-slurm-jobs-in-singularity-containers-with-osd-accession-as-input)  
   4b. [Approach 2: Run slurm jobs in singularity containers with a csv file as input](#4b-approach-2-run-slurm-jobs-in-singularity-containers-with-a-csv-file-as-input)  
   4c. [Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environments](#4c-approach-3-run-jobs-locally-in-conda-environments-and-specify-the-path-to-one-or-more-existing-conda-environments)  
   4d. [Modify parameters and cpu resources in the nextflow config file](#4d-modify-parameters-and-cpu-resources-in-the-nextflow-config-file)  

5. [Workflow outputs](#5-workflow-outputs)  
   5a. [Main outputs](#5a-main-outputs)  
   5b. [Resource logs](#5b-resource-logs)  

<br>

---

### 1. Install Nextflow and Singularity

#### 1a. Install Nextflow

Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).

> Note: If you want to install conda, we recommend installing the Miniconda, Python 3 version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
>
> Once conda is installed on your system, you can install the latest version of Nextflow by running the following commands:
>
> ```bash
> conda install -c bioconda nextflow
> nextflow self-update
> ```
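>
> After installation, you can confirm that Nextflow is available on your PATH and check its version with:
>
> ```bash
> nextflow -version
> ```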

<br>

#### 1b. Install Singularity

Singularity is a container platform that allows usage of containerized software. This enables the GeneLab workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.

We recommend installing Singularity system-wide as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).

> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).
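>
> For example, a minimal conda-based install looks like the following (a sketch; a system-wide install is still the recommended route):
>
> ```bash
> conda install -c conda-forge singularity
> singularity --version
> ```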

<br>

---

### 2. Download the workflow files

All files required for utilizing the NF_Amp454IonTor GeneLab workflow for processing 454 and IonTorrent amplicon sequencing data are in the [workflow_code](workflow_code) directory. To get a copy of the latest *NF_Amp454IonTor* version onto your system, download the zip file from the release page and unzip it by running the following commands:

```bash
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_Amp454IonTor/NF_Amp454IonTor.zip
unzip NF_Amp454IonTor.zip && cd NF_Amp454IonTor-X_X.X.X
```

Or, by using the `genelab-utils` conda package:

```bash
GL-get-workflow Amplicon-454-IonTorrent
```

<br>

---

### 3. Fetch Singularity Images

Although Nextflow can fetch Singularity images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).

To avoid this issue, run the following command to fetch the Singularity images prior to running the NF_Amp454IonTor workflow:

> Note: This command should be run in the location containing the `NF_Amp454IonTor` directory that was downloaded in [step 2](#2-download-the-workflow-files) above.

```bash
bash ./bin/prepull_singularity.sh nextflow.config
```

Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Nextflow configuration environment variable to ensure Nextflow can locate the fetched images:

```bash
export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
```
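
If you plan to run the workflow repeatedly, you may want to make this setting persistent across sessions. A minimal sketch (adjust the path and shell startup file to your setup):

```bash
# append the cache directory export to your shell startup file (assumes bash)
echo "export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity" >> ~/.bashrc
```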

<br>

---


### 4. Run the Workflow

For options and detailed help on how to run the workflow, run the following command:

```bash
nextflow run main.nf --help
```

> Note: Nextflow commands use both single-hyphen arguments (e.g. `-help`) that denote general Nextflow arguments and double-hyphen arguments (e.g. `--csv_file`) that denote workflow-specific parameters. Take care to use the proper number of hyphens for each argument.

<br>

#### 4a. Approach 1: Run slurm jobs in singularity containers with OSD accession as input

```bash
nextflow run main.nf -resume -profile slurm,singularity --GLDS_accession OSD-72 --target_region 16S --min_bbduk_len 50 --min_bbduk_avg_quality 15
```

<br>

#### 4b. Approach 2: Run slurm jobs in singularity containers with a csv file as input

```bash
nextflow run main.nf -resume -profile slurm,singularity --csv_file file.csv --target_region 16S --F_primer AGAGTTTGATCCTGGCTCAG --R_primer CTGCCTCCCGTAGGAGT --min_bbduk_len 50 --min_bbduk_avg_quality 15
```

<br>

#### 4c. Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environment(s)

```bash
nextflow run main.nf -resume -profile conda --csv_file file.csv --target_region 16S --F_primer AGAGTTTGATCCTGGCTCAG --R_primer CTGCCTCCCGTAGGAGT --min_bbduk_len 50 --min_bbduk_avg_quality 15 --conda.qc <path/to/existing/conda/environment>
```

<br>

**Required Parameters For All Approaches:**

* `run main.nf` – Instructs Nextflow to run the NF_Amp454IonTor workflow
* `-resume` – Resumes workflow execution using previously cached results
* `-profile` – Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use Singularity for all software called in the workflow
* `--target_region` – Specifies the amplicon target region to be analyzed, 16S or ITS.
* `--min_bbduk_len` – Specifies the minimum read length to retain after filtering with bbduk.
* `--min_bbduk_avg_quality` – Specifies the minimum average read quality for bbduk read filtering.



*Required only if you would like to pull and process data directly from OSDR*

* `--GLDS_accession` – A GeneLab / OSD accession number, e.g. OSD-72.

*Required only if --GLDS_accession is not passed as an argument*

* `--csv_file` – A 2-column input file with these headers [sample_id, read]. Please see the sample [file.csv](workflow_code/file.csv) in this repository for an example of how to format this file; an illustrative sketch is also shown after this list.

* `--F_primer` – Forward primer sequence.

* `--R_primer` – Reverse primer sequence.
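
Below is a minimal, purely illustrative sketch of what such a file might look like (the sample names and read file paths are hypothetical; [file.csv](workflow_code/file.csv) in the repository is the authoritative example):

```bash
cat file.csv
# sample_id,read
# Sample-1,/path/to/Raw_Sequence_Data/Sample-1_raw.fastq.gz
# Sample-2,/path/to/Raw_Sequence_Data/Sample-2_raw.fastq.gz
```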

> See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details on how to run nextflow.

<br>

#### 4d. Modify parameters and cpu resources in the nextflow config file

Additionally, the parameters and workflow resources can be directly specified in the nextflow.config file. For detailed instructions on how to modify and set parameters in the nextflow.config file, please see the [documentation here](https://www.nextflow.io/docs/latest/config.html).

Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and the cpu/memory requirements in the `process` scope of your downloaded [nextflow.config](workflow_code/nextflow.config) file as needed to match your dataset and system setup. For example, you can directly set the full paths to available conda environments in the `conda` scope within the `params` scope. If necessary, also modify any other variables in the [nextflow.config](workflow_code/nextflow.config) file so they are consistent with the study you want to process and the machine you're using. A sketch of one way to apply such overrides without editing the shipped file in place is shown below.
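
As a minimal sketch (the file name `my_overrides.config` is hypothetical), parameters can also be kept in a small supplementary config file and passed to Nextflow with its `-c` option, leaving the shipped [nextflow.config](workflow_code/nextflow.config) untouched:

```bash
# write a small override config (hypothetical file name); the parameter names are
# those documented above, and the values shown are only examples
cat > my_overrides.config <<'EOF'
params {
    target_region         = "16S"
    min_bbduk_len         = 50
    min_bbduk_avg_quality = 15
}
EOF

# -c adds the file to Nextflow's configuration for this run
nextflow run main.nf -resume -profile slurm,singularity -c my_overrides.config --csv_file file.csv
```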

<br>

---

### 5. Workflow outputs

#### 5a. Main outputs

The outputs from this pipeline are documented in the [GL-DPPD-7106](../../Pipeline_GL-DPPD-7106_Versions/GL-DPPD-7106.md) processing protocol.

#### 5b. Resource logs

Standard Nextflow resource usage logs are also produced as follows (for reference, a sketch of the equivalent command-line options is shown after this list):

- Output:
- Resource_Usage/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
- Resource_Usage/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
- Resource_Usage/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)
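
These reports correspond to Nextflow's built-in tracing features. If you ever need to request them explicitly for a run, a sketch using Nextflow's tracing options looks like the following (the output file names here are illustrative):

```bash
nextflow run main.nf -profile slurm,singularity --GLDS_accession OSD-72 \
    -with-report Resource_Usage/execution_report.html \
    -with-timeline Resource_Usage/execution_timeline.html \
    -with-trace Resource_Usage/execution_trace.txt
```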

> Further details about these logs can also be found within [this Nextflow documentation page](https://www.nextflow.io/docs/latest/tracing.html#execution-report).


@@ -0,0 +1,101 @@
#!/usr/bin/env Rscript
##################################################################################
## R processing script for 454/Ion Torrent amplicon data ##
## Developed by Michael D. Lee ([email protected]) ##
##################################################################################

# as called from the associated Nextflow workflow, this expects to be run as: Rscript full-R-processing.R <otus_fasta_path> <trimmed_dir> <filtered_dir> <final_outputs_directory> <output_prefix> <target_region> <assay_suffix>

# setting variables used within R:
args <- commandArgs(trailingOnly = TRUE)

suppressWarnings(otus_fasta_path <- args[1])
suppressWarnings(trimmed_dir <- args[2])
suppressWarnings(filtered_dir <- args[3])
suppressWarnings(final_outputs_dir <- args[4])
suppressWarnings(output_prefix <- args[5])
suppressWarnings(target_region <- args[6])
suppressWarnings(assay_suffix <- args[7])

# loading libraries
library(DECIPHER)
library(biomformat)

# Set default internet timeout to 1 hour
options(timeout=3600)


### assigning taxonomy ###
# reading OTUs into a DNAStringSet object
dna <- readDNAStringSet(paste0(final_outputs_dir, output_prefix, "OTUs", assay_suffix, ".fasta"))


# downloading reference R taxonomy object
cat("\n\n Downloading reference database...\n\n")

if ( target_region == "16S" ) {

    download.file("https://figshare.com/ndownloader/files/46245217", "SILVA_SSU_r138_2019.RData")
    load("SILVA_SSU_r138_2019.RData")
    #file.remove("SILVA_SSU_r138_2019.RData")

} else if ( target_region == "ITS" ) {

    download.file("https://figshare.com/ndownloader/files/46245586", "UNITE_v2020_February2020.RData")
    load("UNITE_v2020_February2020.RData")
    #file.remove("UNITE_v2020_February2020.RData")

}


# assigning taxonomy
cat("\n\n Assigning taxonomy...\n\n")

tax_info <- IdTaxa(dna, trainingSet, strand="both", processors=NULL)

cat("\n\n Making and writing out tables...\n\n")

# making and writing out a taxonomy table:
# creating vector of desired ranks
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

# creating table of taxonomy and setting any that are unclassified as "NA"
tax_tab <- t(sapply(tax_info, function(x) {
    m <- match(ranks, x$rank)
    taxa <- x$taxon[m]
    taxa[startsWith(taxa, "unclassified_")] <- NA
    taxa
}))

colnames(tax_tab) <- ranks
row.names(tax_tab) <- NULL
otu_ids <- names(tax_info)
tax_tab <- data.frame("OTU_ID"=otu_ids, tax_tab, check.names=FALSE)

write.table(tax_tab, paste0(final_outputs_dir, output_prefix,"taxonomy", assay_suffix, ".tsv"), sep = "\t", quote=F, row.names=FALSE)

# reading in counts table to generate other outputs
otu_tab <- read.table(paste0(final_outputs_dir, output_prefix, "counts", assay_suffix, ".tsv"), sep="\t", header=TRUE, check.names=FALSE)

# generating and writing out biom file format
biom_object <- make_biom(data=otu_tab, observation_metadata=tax_tab)
write_biom(biom_object, paste0(final_outputs_dir, output_prefix, "taxonomy-and-counts", assay_suffix, ".biom"))

# making a tsv of combined tax and counts
tax_and_count_tab <- merge(tax_tab, otu_tab)
write.table(tax_and_count_tab, paste0(final_outputs_dir, output_prefix, "taxonomy-and-counts", assay_suffix, ".tsv"), sep="\t", quote=FALSE, row.names=FALSE)

# making final count summary table
cutadapt_tab <- read.table(paste0(trimmed_dir, output_prefix, "trimmed-read-counts", assay_suffix, ".tsv"), sep="\t", header=TRUE)
bbduk_tab <- read.table(paste0(filtered_dir, output_prefix, "filtered-read-counts", assay_suffix, ".tsv"), sep="\t", header=TRUE)[,c(1,3)]
# re-reading in counts table to this time set first col as rownames (rather than doing it another way)
otu_tab <- read.table(paste0(final_outputs_dir, output_prefix, "counts", assay_suffix, ".tsv"), sep="\t", header=TRUE, check.names=FALSE, row.names = 1)
mapped_sums <- colSums(otu_tab)
mapped_tab <- data.frame(sample=names(mapped_sums), mapped_to_OTUs=mapped_sums, row.names=NULL)

t1 <- merge(cutadapt_tab, bbduk_tab)
count_summary_tab <- merge(t1, mapped_tab)
count_summary_tab$final_perc_reads_retained <- round(count_summary_tab$mapped_to_OTUs / count_summary_tab$raw_reads * 100, 2)

write.table(count_summary_tab, paste0(final_outputs_dir, output_prefix, "read-count-tracking", assay_suffix, ".tsv"), sep="\t", quote=FALSE, row.names=FALSE)

cat("\n\n Session info:\n\n")
sessionInfo()
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
set -e

# only built for use on N288 cluster

# example usage: bash clean-paths.sh <input-file> <workflow-dir>

# making sure we are not accidentally overwriting a wanted file called 't'

if [ -s t ]; then
printf "\n This simple program temporarily writes to a file called 't'\n"
printf " Since that exists already here, we are not going to continue.\n\n"
exit
fi


# getting the parent of the provided workflow directory; this prefix is stripped from paths below
ROOT_DIR=$(echo $2 | awk '{N=split($0,a,"/"); for(i=0; i < N-1; i++) printf "%s/", a[i]}' | sed 's|//|/|')


# removing absolute path prefixes and conda environment paths, then overwriting the input file in place
sed -E 's|.*/GLDS_Datasets/(.+)|\1|g' ${1} \
| sed -E 's|.+/miniconda.+/envs/[^/]*/||g' \
| sed -E 's|/[^ ]*/GLDS-|GLDS-|g' \
| sed -E 's|/[a-z]{6}/[^ ]*|<path-removed-for-security-purposes>|g' \
| sed -E "s|${ROOT_DIR}||g" > t && mv t ${1}
@@ -0,0 +1,15 @@
#!/usr/bin/env bash

# A script to generate the input csv file required by this pipeline when a GLDS accession is provided
# rather than a user-supplied input csv file

ASSAY_TABLE=$1 # target_region_assay_table.txt

# building a CSV (sample_id,read,f_primer,r_primer,target_region) from the assay table,
# pointing each read file at $PWD/Raw_Sequence_Data/ and stripping the 5'-...-3' decoration from primers
cat ${ASSAY_TABLE} | \
sed 's/"//g' | \
awk -v PWD=$PWD 'BEGIN{FS="\t"; OFS=","; print "sample_id,read,f_primer,r_primer,target_region"} \
NR==1{ for(i=1; i<=NF; i++) header[$i]=i} \
NR>1{ split ($header["Parameter Value[Primer Info]"], primers, ","); \
printf "%s,%s/Raw_Sequence_Data/%s,%s,%s,%s\n", $header["Sample Name"], PWD, $header["Raw Data File"], primers[2], primers[4],$header["Parameter Value[Target Molecule]"] }' | \
sed -E "s/5'-([A-Z]+)-3'/\1/g" | \
sed -E 's/\s+//g'
@@ -0,0 +1,10 @@
#!/usr/bin/env Rscript

# Get versions
VERSIONS <- sprintf("DECIPHER %s\nbiomformat %s\n",
packageVersion("DECIPHER"),
packageVersion("biomformat"))

# Write versions to file

write(x= VERSIONS, file="versions.txt", append=TRUE)
@@ -0,0 +1,31 @@
#!/usr/bin/env bash

# Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210

CONFILE=${1:-nextflow.config}
OUTDIR=${2:-./singularity}

if [ ! -e $CONFILE ]; then
echo "$CONFILE does not exist"
exit
fi

TMPFILE=`mktemp`

CURDIR=$(pwd)

mkdir -p $OUTDIR

# pulling the container image URIs out of the config, skipping entries that already point to local .sif/.img files
cat ${CONFILE}|grep 'container'|perl -lane 'if ( $_=~/container\s*\=\s*\"(\S+)\"/ ) { $_=~/container\s*\=\s*\"(\S+)\"/; print $1 unless ( $1=~/^\s*$/ or $1=~/\.sif/ or $1=~/\.img/ ) ; }' > $TMPFILE

cd ${OUTDIR}

while IFS= read -r line; do
name=$line
name=${name/:/-}
name=${name//\//-}
echo $name
singularity pull ${name}.img docker://$line
done < $TMPFILE

cd $CURDIR
@@ -0,0 +1,8 @@
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- r-base=4.1.1
- bioconductor-decipher=2.20.0
- bioconductor-biomformat=1.20.0