This page holds an overview and instructions for how GeneLab processes Illumina amplicon datasets. Exact processing commands for specific datasets that have been released are available in the GLDS_Processing_Scripts sub-directory and are also provided with their processed data in the GeneLab Data Systems (GLDS) repository.
Date: May 13, 2020
Revision: A
Document Number: GL-DPPD-7104-A
Submitted by:
Michael D. Lee (GeneLab Data Processing Team)
Approved by:
Sylvain Costes (GeneLab Project Manager)
Samrawit Gebre (GeneLab Deputy Project Manager)
Homer Fogle (GeneLab Data Processing Representative)
Jonathan Galazka (GeneLab Project Scientist)
Anushree Sonic (GeneLab Configuration Manager)
- Software used
- General processing overview with example commands
Program | Version* | Relevant Links |
---|---|---|
FastQC | `fastqc -v` | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |
MultiQC | `multiqc -v` | https://multiqc.info/ |
Cutadapt | `cutadapt --version` | https://cutadapt.readthedocs.io/en/stable/ |
DADA2 | `packageVersion("dada2")` | https://www.bioconductor.org/packages/release/bioc/html/dada2.html |
DECIPHER | `packageVersion("DECIPHER")` | https://bioconductor.org/packages/release/bioc/html/DECIPHER.html |
biomformat | `packageVersion("biomformat")` | https://github.com/joey711/biomformat |
* Exact versions are available along with the processing commands for each specific dataset.
Exact processing commands for specific datasets are available in the GLDS_Processing_Scripts sub-directory of this repository, as well as being provided with their processed data in the GeneLab Data Systems (GLDS) repository.
```bash
fastqc -o raw_fastqc_output *.fastq.gz
```
Parameter Definitions:
- `-o` – the output directory to store results
- `*.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
Input Data:
- fastq, compressed or uncompressed
Output Data:
- fastqc.html (FastQC output html summary)
- fastqc.zip (FastQC output data)
```bash
multiqc -o raw_multiqc_output raw_fastqc_output
```
Parameter Definitions:
- `-o` – the output directory to store results
- `raw_fastqc_output/` – the directory holding the output data from the fastqc run, provided as a positional argument
Input Data:
- fastqc.zip (FastQC output data)
Output Data:
- multiqc_report.html (multiqc output html summary)
- multiqc_data (directory containing multiqc output data)
Understanding the location and orientation of the primers in the data is important when deciding how to perform this step. `cutadapt` has many options for primer identification and removal, which are described in detail on its documentation page here: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
The following example commands show how it was done for some samples of GLDS-200, which used 2x250 sequencing of the 16S gene with these primers (M, V, and S are IUPAC degenerate-base codes: M = A or C; V = A, C, or G; S = G or C; the B and K appearing in the reverse-complemented sequences of the command below are likewise IUPAC codes: B = C, G, or T; K = G or T):
- forward: 5’-GTGCCAGCMGCCGCGGTAA-3’
- reverse: 5’-GGACTACVSGGGTATCTAAT-3’
Due to the size of the target amplicon and the type of sequencing done here, both the forward and reverse primers are expected to be on each of the forward and reverse reads. The command below therefore takes “linked” primers as input for the forward and reverse reads, specified by the `...` between them. It also expects the primers to start at the first position of the reads (“anchored”), specified by the leading `^` characters.
The following website is useful for reverse complementing primers and dealing with degenerate bases appropriately: http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html
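As an example of how the linked sequences in the command below were derived, the segment after each `...` is the reverse complement of the other primer. A minimal sketch using the Biostrings R package (an additional package used here for illustration only, not one of the pipeline tools listed above), which handles degenerate bases appropriately:

```R
library(Biostrings)

# Reverse complement of the reverse primer; the degenerate bases V (A/C/G)
# and S (G/C) are complemented to B (C/G/T) and S, respectively
reverseComplement(DNAString("GGACTACVSGGGTATCTAAT"))
# returns ATTAGATACCCSBGTAGTCC, the sequence linked after the forward primer below
```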
```bash
cutadapt -a ^GTGCCAGCMGCCGCGGTAA...ATTAGATACCCSBGTAGTCC -A ^GGACTACVSGGGTATCTAAT...TTACCGCGGCKGCTGGCAC \
         -o Primer-trimmed-R1.fq.gz -p Primer-trimmed-R2.fq.gz Input_R1_raw.fastq.gz Input_R2_raw.fastq.gz \
         --discard-untrimmed
```
Parameter Definitions:
- `-a` – specifies the primers and orientations expected on the forward reads (when primers are linked as noted above)
- `-A` – specifies the primers and orientations expected on the reverse reads (when primers are linked as noted above)
- `-o` – specifies the output of the forward, primer-trimmed reads
- `-p` – specifies the output of the reverse, primer-trimmed reads
- `Input_R1_raw.fastq.gz` – this and the following “R2” version are positional arguments specifying the forward and reverse reads, respectively, for input
- `--discard-untrimmed` – filters out reads where the primers were not found as expected
Input Data:
- fastq, compressed or uncompressed (original reads)
Output Data:
- fastq, compressed or uncompressed (trimmed reads)
- tsv (per sample read counts before and after trimming)
- log (log file of standard output and error from cutadapt)
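The per-sample read-count tsv listed above can be produced in a number of ways; the following is a minimal base-R sketch using the example file names from the command above (the `count_reads()` helper and output file name are illustrative only, not part of the pipeline):

```R
# each fastq record spans 4 lines; gzfile() transparently handles gzip compression
count_reads <- function(fastq_path) length(readLines(gzfile(fastq_path))) / 4

counts_tab <- data.frame(sample           = "example-sample",
                         raw_reads        = count_reads("Input_R1_raw.fastq.gz"),
                         cutadapt_trimmed = count_reads("Primer-trimmed-R1.fq.gz"))
write.table(counts_tab, "trimmed-read-counts.tsv", sep="\t", quote=FALSE, row.names=FALSE)
```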
The following is run in an R environment.
Specific settings required will depend on the dataset being processed. These include parameters such as `truncLen`, which may depend on the target amplicon and its size, and `maxEE`, which may depend on the quality of the sequencing run. For instance, when working with ITS data, it may be preferable to omit the `truncLen` parameter entirely if the length of the target amplified region is expected to vary beyond the read length. More information on these parameters can be found at these sites (a quality-profile sketch for choosing values follows the links):
- https://benjjneb.github.io/dada2/tutorial.html
- https://astrobiomike.github.io/amplicon/dada2_workflow_ex
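One way to choose these values, assuming the example primer-trimmed files from the previous step, is to inspect per-base quality with DADA2's `plotQualityProfile()` before filtering:

```R
library(dada2)

# Visualize per-base quality of the primer-trimmed reads; truncLen is typically
# set near where the median quality begins to drop off
plotQualityProfile(c("Primer-trimmed-R1.fq.gz", "Primer-trimmed-R2.fq.gz"))
```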
The following is an example from a GLDS-200 sample that used paired-end 2x250 sequencing with the following 16S primers:
- forward: 5’-GTGCCAGCMGCCGCGGTAA-3’
- reverse: 5’- GGACTACVSGGGTATCTAAT-3’
```R
filtered_out <- filterAndTrim(fwd="Primer-trimmed-R1.fq.gz", filt="Filtered-R1.fq.gz",
                              rev="Primer-trimmed-R2.fq.gz", filt.rev="Filtered-R2.fq.gz",
                              truncLen=c(220, 160), maxN=0, maxEE=c(2,2),
                              truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)
```
Parameter Definitions:
- `filtered_out <-` – specifies the variable that will store the summary results within our R environment
- `filterAndTrim()` – the DADA2 function we are calling, with the following parameters set within it
- `fwd=` – specifying the path to the forward reads, here “Primer-trimmed-R1.fq.gz”
- `filt=` – specifying the path to where the output forward reads will be written
- `rev=` – specifying the path to the reverse reads, here “Primer-trimmed-R2.fq.gz”; only applicable if paired-end
- `filt.rev=` – specifying the path to where the output reverse reads will be written; only applicable if paired-end
- `truncLen=c(220, 160)` – specifying that the forward reads be truncated at 220 bp and the reverse reads at 160 bp (note that this parameter also functions as a minimum-length filter); would only have 1 value if not paired-end
- `maxN=0` – setting the maximum allowed Ns to 0; any reads with an N will be filtered out
- `maxEE=c(2,2)` – setting the maximum expected errors allowed to 2 for each of the forward and reverse reads; would only have 1 value if not paired-end
- `truncQ=2` – looking from the lower-quality end of each read, truncate at the first base with a quality score lower than 2
- `rm.phix=TRUE` – filter out reads with exact kmers matching the PhiX genome
- `compress=TRUE` – gzip-compress the output filtered reads
- `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
Input Data:
- fastq, compressed or uncompressed (primer-trimmed reads)
Output Data:
- fastq, compressed or uncompressed (filtered reads)
- tsv (per sample read counts before and after filtering)
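The read-count tsv noted above can be written directly from the `filtered_out` object, since `filterAndTrim()` returns a matrix of input and output read counts per sample; a minimal sketch (the output file name is illustrative only):

```R
# filtered_out holds "reads.in" and "reads.out" columns, one row per sample
write.table(filtered_out, "filtered-read-counts.tsv", sep="\t", quote=FALSE, col.names=NA)
```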
```bash
fastqc -o filtered_fastqc_output/ filtered*.fastq.gz
```
Parameter Definitions:
- `-o` – the output directory to store results
- `filtered*.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
Input Data:
- fastq, compressed or uncompressed (filtered reads)
Output Data:
- fastqc.html (FastQC output html summary)
- fastqc.zip (FastQC output data)
```bash
multiqc -o filtered_multiqc_output filtered_fastqc_output
```
Parameter Definitions:
- `-o` – the output directory to store results
- `filtered_fastqc_output` – the directory holding the output data from the fastqc run, provided as a positional argument
Input Data:
- fastqc.zip (FastQC output data)
Output Data:
- multiqc_report.html (multiqc output html summary)
- multiqc_data (directory containing multiqc output data)
The following is run in an R environment.
These example commands as written assume paired-end data, with notes included on what would differ if working with single-end data. The taxonomy reference database used below is an example only, suitable for the example 16S dataset (GLDS-200) used here; others designed for use with DECIPHER can be found here: http://www2.decipher.codes/Downloads.html
```R
forward_errors <- learnErrors(fls="Filtered-R1.fq.gz", multithread=TRUE)
```
Parameter Definitions:
- `forward_errors <-` – specifies the variable that will store the results within our R environment
- `learnErrors()` – the DADA2 function we are calling, with the following parameters set within it
- `fls=` – the path to the forward filtered reads
- `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
```R
reverse_errors <- learnErrors(fls="Filtered-R2.fq.gz", multithread=TRUE)
```
Parameter Definitions:
- same as above, but providing reverse filtered reads; not needed if data are single-end
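Though not a required step of this pipeline, the fit of each learned error model can be checked visually with DADA2's `plotErrors()`:

```R
# Observed error rates (points) should track the fitted model (black line);
# the red line shows rates expected from nominal Q-scores alone
plotErrors(forward_errors, nominalQ=TRUE)
plotErrors(reverse_errors, nominalQ=TRUE)
```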
```R
forward_seqs <- dada(derep="Filtered-R1.fq.gz", err=forward_errors, pool="pseudo", multithread=TRUE)
```
Parameter Definitions:
- `forward_seqs <-` – specifies the variable that will store the results within our R environment
- `dada()` – the DADA2 function we are calling, with the following parameters set within it
- `derep=` – the path to the forward filtered reads
- `err=` – the object holding the error profile for the forward reads created in the step above; if not working with paired-end data, this would be the only error-profile object, and the following “reverse_seqs” object would not be created
- `pool="pseudo"` – setting the method of incorporating information from multiple samples
- `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
```R
reverse_seqs <- dada(derep="Filtered-R2.fq.gz", err=reverse_errors, pool="pseudo", multithread=TRUE)
```
Parameter Definitions:
- same as above, but providing reverse filtered reads and reverse error-profile object; not needed if data are single-end
```R
merged_contigs <- mergePairs(dadaF=forward_seqs, derepF="Filtered-R1.fq.gz", dadaR=reverse_seqs, derepR="Filtered-R2.fq.gz")
```
Parameter Definitions:
- `merged_contigs <-` – specifies the variable that will store the results within our R environment
- `mergePairs()` – the DADA2 function we are calling, with the following parameters set within it
- `dadaF=` – specifying the object holding the forward-read inferred sequences
- `derepF=` – specifying the path to the filtered forward reads
- `dadaR=` – specifying the object holding the reverse-read inferred sequences
- `derepR=` – specifying the path to the filtered reverse reads
```R
seqtab <- makeSequenceTable(merged_contigs)
```
Parameter Definitions:
- If working with single-end data, the “forward_seqs” object, rather than “merged_contigs”, would be provided to the `makeSequenceTable()` function here
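As a quick, optional sanity check borrowed from the DADA2 tutorial, the dimensions of the sequence table and the length distribution of the inferred sequences can be inspected at this point:

```R
# rows are samples, columns are inferred sequences (ASVs)
dim(seqtab)

# length distribution of inferred sequences; lengths far outside the expected
# amplicon size can indicate non-specific amplification
table(nchar(getSequences(seqtab)))
```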
```R
seqtab.nochim <- removeBimeraDenovo(unqs=seqtab, method="consensus", multithread=TRUE)
```
Parameter Definitions:
- `seqtab.nochim <-` – specifies the variable that will store the results within our R environment
- `removeBimeraDenovo()` – the DADA2 function we are calling, with the following parameters set within it
- `unqs=` – specifying the “seqtab” object created above
- `method=` – specifying the method for putative-chimera identification and removal
- `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
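A common follow-up check, from the DADA2 tutorial rather than a requirement of this pipeline, is the fraction of total reads surviving chimera removal:

```R
# most reads should be retained; a much lower fraction can indicate
# leftover primers in the reads
sum(seqtab.nochim) / sum(seqtab)
```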
Creating a DNAStringSet object from the ASVs:
```R
dna <- DNAStringSet(getSequences(seqtab.nochim))
```
Downloading the reference R taxonomy object:
```R
download.file(url="http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData",
              destfile="SILVA_SSU_r138_2019.RData")
```
Parameter Definitions:
- `download.file()` – the function we are calling, with the following parameters set within it
- `url=` – specifying the url address of the file to download
- `destfile=` – specifying the path/name of the file after downloading
Loading taxonomy object:
```R
load("SILVA_SSU_r138_2019.RData")
```
Classifying sequences:
```R
tax_info <- IdTaxa(test=dna, trainingSet=trainingSet, strand="both", processors=NULL)
```
Parameter Definitions:
- `tax_info <-` – specifies the variable that will store the results within our R environment
- `IdTaxa()` – the DECIPHER function we are calling, with the following parameters set within it
- `test=` – specifying the “dna” object created above holding the sequences we want to classify
- `trainingSet=` – specifying the reference database we downloaded and loaded above
- `strand="both"` – specifying to check taxonomy assignment in both orientations
- `processors=NULL` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
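Optionally, the distribution of classifications and their confidences can be visualized with the plot method DECIPHER provides for the Taxa objects `IdTaxa()` returns:

```R
# plots the taxonomic assignments held in the "tax_info" object created above
plot(tax_info)
```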
Giving sequences more manageable names (e.g., ASV_1, ASV_2, …):
```R
asv_seqs <- colnames(seqtab.nochim)
asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")

for (i in 1:dim(seqtab.nochim)[2]) {
  asv_headers[i] <- paste(">ASV", i, sep="_")
}
```
Making then writing a fasta of final ASV seqs:
```R
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fa")
```
Making then writing a count table:
```R
asv_tab <- t(seqtab.nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)
write.table(asv_tab, "ASVs_counts.tsv", sep="\t", quote=FALSE, col.names=NA)
```
Creating table of taxonomy and setting any that are unclassified as "NA":
```R
ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")

tax_tab <- t(sapply(tax_info, function(x) {
  m <- match(ranks, x$rank)
  taxa <- x$taxon[m]
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
}))

colnames(tax_tab) <- ranks
rownames(tax_tab) <- gsub(pattern=">", replacement="", x=asv_headers)
write.table(tax_tab, "Taxonomy.tsv", sep="\t", quote=FALSE, col.names=NA)
```
Generating then writing a biom-format file:
```R
biom_object <- make_biom(data=asv_tab, observation_metadata=tax_tab)
write_biom(biom_object, "Taxonomy_and_counts.biom")
```
Making then writing a combined taxonomy and count table:

```R
# merging by row names (the ASV IDs) keeps taxonomy and counts aligned
tax_and_count_tab <- merge(tax_tab, asv_tab, by="row.names")
write.table(tax_and_count_tab, "Taxonomy_and_counts.tsv", sep="\t", quote=FALSE, row.names=FALSE)
```
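The read-count tracking table listed in the outputs below can be assembled from the objects created above; here is a minimal sketch following the DADA2 tutorial's tracking approach (object names match the single-sample examples above; with multiple samples, sapply() calls over the per-sample objects would be used instead):

```R
# helper: total number of reads represented in a dada/merger object
getN <- function(x) sum(getUniques(x))

track <- cbind(filtered_out,                           # reads in/out from filtering
               denoised     = getN(forward_seqs),      # reads surviving denoising
               merged       = getN(merged_contigs),    # read pairs successfully merged
               non_chimeric = rowSums(seqtab.nochim))  # reads remaining after chimera removal
write.table(track, "read_count_tracking.tsv", sep="\t", quote=FALSE, col.names=NA)
```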
Input Data:
- fastq, compressed or uncompressed (filtered reads)
Output Data:
- fasta (inferred sequences)
- count_table.tsv (count table)
- taxonomy_table.tsv (taxonomy table)
- taxonomy_and_counts.tsv (combined taxonomy and count table)
- biom (count and taxonomy table in biom format)
- read_count_tracking.tsv (read counts at each processing step)