
Bioinformatics pipeline for amplicon Illumina sequencing data

This page holds an overview and instructions for how GeneLab processes Illumina amplicon datasets. Exact processing commands for specific datasets that have been released are available in the GLDS_Processing_Scripts sub-directory and are also provided with their processed data in the GeneLab Data Systems (GLDS) repository.


Date: May 13, 2020
Revision: A
Document Number: GL-DPPD-7104-A

Submitted by:
Michael D. Lee (GeneLab Data Processing Team)

Approved by:
Sylvain Costes (GeneLab Project Manager)
Samrawit Gebre (GeneLab Deputy Project Manager)
Homer Fogle (GeneLab Data Processing Representative)
Jonathan Galazka (GeneLab Project Scientist)
Anushree Sonic (GeneLab Configuration Manager)


Table of contents

  • Software used
  • General processing overview with example commands
      • 1. Raw Data QC
      • 2. Trim Primers
      • 3. Quality filtering
      • 4. Filtered Data QC
      • 5. Calculate error model, apply DADA2 algorithm, assign taxonomy, and create output tables

Software used

| Program | Version* | Relevant Links |
|:--------|:---------|:---------------|
| FastQC | `fastqc -v` | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |
| MultiQC | `multiqc -v` | https://multiqc.info/ |
| Cutadapt | `cutadapt --version` | https://cutadapt.readthedocs.io/en/stable/ |
| DADA2 | `packageVersion("dada2")` | https://www.bioconductor.org/packages/release/bioc/html/dada2.html |
| DECIPHER | `packageVersion("DECIPHER")` | https://bioconductor.org/packages/release/bioc/html/DECIPHER.html |
| biomformat | `packageVersion("biomformat")` | https://github.com/joey711/biomformat |

* Exact versions are available along with the processing commands for each specific dataset.
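
The exact R package versions can also be captured programmatically at run time. The following is a minimal sketch (the output file name is illustrative only):

# record R package versions used, e.g. to accompany a dataset's processing commands
pkgs <- c("dada2", "DECIPHER", "biomformat")
versions <- sapply(pkgs, function(p) as.character(packageVersion(p)))
writeLines(paste(pkgs, versions, sep="\t"), "software_versions.txt")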


General processing overview with example commands

Exact processing commands for specific datasets are available in the GLDS_Processing_Scripts sub-directory of this repository, as well as being provided with their processed data in the GeneLab Data Systems (GLDS) repository.


1. Raw Data QC

fastqc -o raw_fastqc_output *.fastq.gz

Parameter Definitions:

  • -o – the output directory to store results
  • *.fastq.gz – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them

Input Data:

  • fastq, compressed or uncompressed

Output Data:

  • fastqc.html (FastQC output html summary)
  • fastqc.zip (FastQC output data)

Compile Raw Data QC

multiqc -o raw_multiqc_output raw_fastqc_output

Parameter Definitions:

  • -o – the output directory to store results
  • raw_fastqc_output/ – the directory holding the output data from the fastqc run, provided as a positional argument

Input Data:

  • fastqc.zip (FastQC output data)

Output Data:

  • multiqc_report.html (multiqc output html summary)
  • multiqc_data (directory containing multiqc output data)


2. Trim Primers

Understanding the location and orientation of the primers in the data is important for deciding how to perform this step. cutadapt has many options for primer identification and removal. They are described in detail on their documentation page here: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types

The following example commands show how it was done for some samples of GLDS-200, which was 2x250 sequencing of the 16S rRNA gene using these primers:

  • forward: 5’-GTGCCAGCMGCCGCGGTAA-3’
  • reverse: 5’-GGACTACVSGGGTATCTAAT-3’

These primers contain IUPAC degenerate bases: M represents A or C, V represents A, C, or G, and S represents G or C. Their reverse complements in the command below additionally contain B (C, G, or T) and K (G or T).

Due to the size of the target amplicon and the type of sequencing done here, both forward and reverse primers are expected to be on each of the forward and reverse reads. The command below therefore takes “linked” primers as input for forward and reverse reads, specified by the ... between them. It also expects the primers to start at the first position of the reads (“anchored”), specified with the leading ^ characters.

The following website is useful for reverse complementing primers and dealing with degenerate bases appropriately: http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html
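
If working in R, the Biostrings package (installed as a dependency of DADA2 and DECIPHER) can do this as well, since its reverseComplement() function respects IUPAC degenerate codes. A minimal sketch using the primers above:

library(Biostrings)

# the complement of M (A/C) is K (G/T), of V (A/C/G) is B (C/G/T), and S (G/C) complements itself
reverseComplement(DNAString("GTGCCAGCMGCCGCGGTAA"))   # returns TTACCGCGGCKGCTGGCAC
reverseComplement(DNAString("GGACTACVSGGGTATCTAAT"))  # returns ATTAGATACCCSBGTAGTCC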

cutadapt -a ^GTGCCAGCMGCCGCGGTAA...ATTAGATACCCSBGTAGTCC -A ^GGACTACVSGGGTATCTAAT...TTACCGCGGCKGCTGGCAC \
         -o Primer-trimmed-R1.fq.gz -p Primer-trimmed-R2.fq.gz Input_R1_raw.fastq.gz Input_R2_raw.fastq.gz \
         --discard-untrimmed

Parameter Definitions:

  • -a – specifies the primers and orientations expected on the forward reads (when primers are linked as noted above)

  • -A – specifies the primers and orientations expected on the reverse reads (when primers are linked as noted above)

  • -o – specifies output of forward, primer-trimmed reads

  • -p – specifies output of reverse, primer-trimmed reads

  • Input_R1_raw.fastq.gz – this and following “R2” version are positional arguments specifying the forward and reverse reads, respectively, for input

  • --discard-untrimmed – this filters out reads in which the primers were not found as expected

Input Data:

  • fastq, compressed or uncompressed (original reads)

Output Data:

  • fastq, compressed or uncompressed (trimmed reads)
  • tsv (per sample read counts before and after trimming; a sketch for generating these counts follows below)
  • log (log file of standard output and error from cutadapt)
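
cutadapt reports read counts in its log; alternatively, the per-sample counts can be tallied directly from the fastq files, e.g. in R with the ShortRead package. A minimal sketch for one sample (file and sample names are illustrative):

library(ShortRead)

# count records in the raw and primer-trimmed fastq files for one sample
raw_count <- countFastq("Input_R1_raw.fastq.gz")$records
trimmed_count <- countFastq("Primer-trimmed-R1.fq.gz")$records

counts_tab <- data.frame(sample="Sample-1", raw_reads=raw_count, primer_trimmed_reads=trimmed_count)
write.table(counts_tab, "trimmed-read-counts.tsv", sep="\t", quote=FALSE, row.names=FALSE)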


3. Quality filtering

The following is run in an R environment.

Specific settings required will depend on the dataset being processed. These include parameters such as truncLen, which may depend on the target amplicon and its size, and maxEE, which may depend on the quality of the sequencing run. For instance, when working with ITS data, it may be preferable to omit the truncLen parameter if the length of the target amplified region is expected to vary beyond the read size. More information on these parameters can be found in the DADA2 documentation.
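
Before settling on values, it can help to inspect the quality profiles of the primer-trimmed reads, e.g. with DADA2's plotQualityProfile() function; a minimal sketch (file names are illustrative):

# visualize per-base quality to help choose truncLen values
plotQualityProfile("Primer-trimmed-R1.fq.gz")
plotQualityProfile("Primer-trimmed-R2.fq.gz")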

The following is an example from a GLDS-200 sample that used paired-end 2x250 sequencing with the following 16S primers:

  • forward: 5’-GTGCCAGCMGCCGCGGTAA-3’
  • reverse: 5’-GGACTACVSGGGTATCTAAT-3’

filtered_out <- filterAndTrim(fwd="Primer-trimmed-R1.fq.gz", filt="Filtered-R1.fq.gz",
                              rev="Primer-trimmed-R2.fq.gz", filt.rev="Filtered-R2.fq.gz",
                              truncLen=c(220, 160), maxN=0, maxEE=c(2,2),
                              truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

Parameter Definitions:

  • filtered_out <- – specifies the variable that will store the summary results within our R environment

  • filterAndTrim() – the DADA2 function we are calling, with the following parameters set within it

  • fwd= – specifying the path to the forward reads, here “Primer-trimmed-R1.fq.gz”

  • filt= – specifying the path to where the output forward reads will be written

  • rev= – specifying the path to the reverse reads, here “Primer-trimmed-R2.fq.gz”; only applicable if paired-end

  • filt.rev= – specifying the path to where the output reverse reads will be written; only applicable if paired-end

  • truncLen=c(220, 160) – specifying the forward reads to be truncated at 220 bp, and the reverse to be truncated at 160 bp (note that this parameter also functions as a minimum-length filter); would only have one value if not paired-end

  • maxN=0 – setting the maximum allowed Ns to 0, any reads with an N will be filtered out

  • maxEE=c(2,2) – setting the maximum allowed expected errors to 2 for both forward and reverse reads; would only have one value if not paired-end

  • truncQ=2 – truncate each read at the first instance of a quality score less than or equal to 2

  • rm.phix=TRUE – filter out reads with exact kmers matching the PhiX genome

  • compress=TRUE – gzip-compress the output filtered reads

  • multithread=TRUE – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)

Input Data:

  • fastq, compressed or uncompressed (primer-trimmed reads)

Output Data:

  • fastq, compressed or uncompressed (filtered reads)
  • tsv (per sample read counts before and after filtering; see the sketch below)
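
Since filterAndTrim() returns the per-sample read counts before (“reads.in”) and after (“reads.out”) filtering in the object it is assigned to, one way to generate the tsv noted above is simply (output file name is illustrative):

# write the per-sample before/after filtering read counts held in "filtered_out"
write.table(filtered_out, "filtered-read-counts.tsv", sep="\t", quote=FALSE, col.names=NA)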


4. Filtered Data QC

fastqc -o filtered_fastqc_output/ filtered*.fastq.gz

Parameter Definitions:

  • -o – the output directory to store results
  • filtered*.fastq.gz – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them

Input Data:

  • fastq, compressed or uncompressed (filtered reads)

Output Data:

  • fastqc.html (FastQC output html summary)
  • fastqc.zip (FastQC output data)

Compile Filtered Data QC

multiqc -o filtered_multiqc_output filtered_fastqc_output

Parameter Definitions:

  • -o – the output directory to store results
  • filtered_fastqc_output – the directory holding the output data from the fastqc run, provided as a positional argument

Input Data:

  • fastqc.zip (FastQC output data)

Output Data:

  • multiqc_report.html (multiqc output html summary)
  • multiqc_data (directory containing multiqc output data)


5. Calculate error model, apply DADA2 algorithm, assign taxonomy, and create output tables

The following is run in an R environment.

These example commands as written assume paired-end data, with notes included on what would differ if working with single-end data. The taxonomy reference database used below is an example only, suitable for the example 16S dataset (GLDS-200) used here. Others designed for DECIPHER can be found here: http://www2.decipher.codes/Downloads.html
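
The examples below assume the required libraries have been loaded into the R environment, e.g.:

library(dada2)       # error modeling, sequence inference, chimera removal
library(DECIPHER)    # IdTaxa() taxonomic classification (also loads Biostrings)
library(biomformat)  # writing the biom-format output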


Learning the error rates

forward_errors <- learnErrors(fls="Filtered-R1.fq.gz", multithread=TRUE)

Parameter Definitions:

  • forward_errors <- – specifies the variable that will store the results within our R environment

  • learnErrors() – the DADA2 function we are calling, with the following parameters set within it

  • fls= – the path to the forward filtered reads

  • multithread=TRUE – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)

reverse_errors <- learnErrors(fls="Filtered-R2.fq.gz", multithread=TRUE)

Parameter Definitions:

  • same as above, but providing reverse filtered reads; not needed if data are single-end
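
Though not part of the standard processing commands, the learned error models can be sanity-checked visually with DADA2's plotErrors() function; the estimated error rates (black lines) should track the observed rates (points) reasonably well:

plotErrors(forward_errors, nominalQ=TRUE)
plotErrors(reverse_errors, nominalQ=TRUE)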

Inferring sequences

forward_seqs <- dada(derep="Filtered-R1.fq.gz", err=forward_errors, pool="pseudo", multithread=TRUE)

Parameter Definitions:

  • forward_seqs <- – specifies the variable that will store the results within our R environment

  • dada() – the DADA2 function we are calling, with the following parameters set within it

  • derep= – the path to the forward filtered reads

  • err= – the object holding the error profile for the forward reads created in the step above; if the data are not paired-end, this would be the only error-profile object created, and the following “reverse_seqs” object would not be needed

  • pool=“pseudo” – setting the method of incorporating information from multiple samples

  • multithread=TRUE – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)


reverse_seqs <- dada(derep="Filtered-R2.fq.gz", err=reverse_errors, pool="pseudo", multithread=TRUE)

Parameter Definitions:

  • same as above, but providing reverse filtered reads and reverse error-profile object; not needed if data are single-end

Merging forward and reverse reads; not needed if data are single-end

merged_contigs <- mergePairs(dadaF=forward_seqs, derepF="Filtered-R1.fq.gz", dadaR=reverse_seqs, derepR="Filtered-R2.fq.gz")

Parameter Definitions:

  • merged_contigs <- – specifies the variable that will store the results within our R environment

  • mergePairs() – the DADA2 function we are calling, with the following parameters set within it

  • dadaF= – specifying the object holding the forward-read inferred sequences

  • derepF= – specifying the path to the filtered forward reads

  • dadaR= – specifying the object holding the reverse-read inferred sequences

  • derepR= – specifying the path to the filtered reverse reads


Generating sequence table with counts per sample

seqtab <- makeSequenceTable(merged_contigs)

Parameter Definitions:

  • If single-end data, instead of “merged_contigs”, the forward_seqs object would be provided to the makeSequenceTable() function here, as shown below
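
For single-end data, that call would be:

# single-end variant: build the sequence table directly from the forward-read inferred sequences
seqtab <- makeSequenceTable(forward_seqs)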

Removing putative chimeras

seqtab.nochim <- removeBimeraDenovo(unqs=seqtab, method="consensus", multithread=TRUE)

Parameter Definitions:

  • seqtab.nochim <- – specifies the variable that will store the results within our R environment

  • removeBimeraDenovo() – the DADA2 function we are calling, with the following parameters set within it

  • unqs= – specifying the “seqtab” object created above

  • method= – specifying the method for putative-chimera identification and removal

  • multithread=TRUE – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
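
Though not part of the standard commands, checking the fraction of read counts retained after chimera removal can be informative, as a large drop at this step may indicate primers were not fully removed upstream:

# fraction of total counts retained after removing putative chimeras
sum(seqtab.nochim) / sum(seqtab)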


Assigning taxonomy

Creating a DNAStringSet object from the ASVs:

dna <- DNAStringSet(getSequences(seqtab.nochim))

Downloading the reference R taxonomy object:

download.file(url="http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData", destfile="SILVA_SSU_r138_2019.RData")

Parameter Definitions:

  • download.file() – the function we are calling, with the following parameters set within it

  • url= – specifying the url address of the file to download

  • destfile= – specifying the path/name of the file after downloading


Loading taxonomy object:

load("SILVA_SSU_r138_2019.RData")

Classifying sequences:

tax_info <- IdTaxa(test=dna, trainingSet=trainingSet, strand="both", processors=NULL)

Parameter Definitions:

  • tax_info <- – specifies the variable that will store the results within our R environment

  • IdTaxa() – the DECIPHER function we are calling, with the following parameters set within it

  • test= – specifying the “dna” object created above holding the sequences we want to classify

  • trainingSet= – specifying the reference database we downloaded and loaded above

  • strand=“both” – specifying to check taxonomy assignment in both orientations

  • processors=NULL – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)


Generating and writing standard outputs

Giving sequences more manageable names (e.g., ASV_1, ASV_2, …):

asv_seqs <- colnames(seqtab.nochim)
asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")

for (i in 1:dim(seqtab.nochim)[2]) {
  asv_headers[i] <- paste(">ASV", i, sep="_")
}

Making then writing a fasta of final ASV seqs:

asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fa")

Making then writing a count table:

asv_tab <- t(seqtab.nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)

write.table(asv_tab, "ASVs_counts.tsv", sep="\t", quote=FALSE, col.names=NA)

Creating table of taxonomy and setting any that are unclassified as "NA":

ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")
tax_tab <- t(sapply(tax_info, function(x) {
  m <- match(ranks, x$rank)
  taxa <- x$taxon[m]
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
}))
colnames(tax_tab) <- ranks
rownames(tax_tab) <- gsub(pattern=">", replacement="", x=asv_headers)

write.table(tax_tab, "Taxonomy.tsv", sep="\t", quote=FALSE, col.names=NA)

Generating then writing biom file format:

biom_object <- make_biom(data=asv_tab, observation_metadata=tax_tab)
write_biom(biom_object, "Taxonomy_and_counts.biom")

Making then writing a combined taxonomy and count table (merging on row names so the ASV IDs align):

tax_and_count_tab <- merge(tax_tab, asv_tab, by="row.names")
write.table(tax_and_count_tab, "Taxonomy_and_counts.tsv", sep="\t", quote=FALSE, row.names=FALSE)

Input Data:

  • fastq, compressed or uncompressed (filtered reads)

Output Data:

  • fasta (inferred sequences)
  • count_table.tsv (count table)
  • taxonomy_table.tsv (taxonomy table)
  • taxonomy_and_counts.tsv (combined taxonomy and count table)
  • biom (count and taxonomy table in biom format)
  • read_count_tracking.tsv (read counts at each processing step; a sketch for assembling this table follows below)
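
The read-count-tracking table can be assembled from the objects created above, following the approach used in the DADA2 tutorial. A minimal sketch for a single paired-end sample (column names are illustrative):

# sum of unique-sequence counts at each step, for one sample
getN <- function(x) sum(getUniques(x))

track <- data.frame(filtered=filtered_out[1, 2], denoised_F=getN(forward_seqs),
                    denoised_R=getN(reverse_seqs), merged=getN(merged_contigs),
                    non_chimeric=sum(seqtab.nochim))
write.table(track, "read_count_tracking.tsv", sep="\t", quote=FALSE, row.names=FALSE)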