title | author | date |
---|---|---|
Aggregating QC metrics with MultiQC |
Will Gammerdinger, Meeta Mistry |
December 11, 2023 |
Approximate time: 30 minutes
- Run MultiQC
- Evaluate read alignment
- Intepret read QC metrics within
MultiQC
HTML report
The goal of this lesson is to show you how to combine numerical stats from multiple QC runs into a single HTML report. To do this aggregation, we will introduce you to MultiQC - a general use tool, perfect for summarizing the output from numerous bioinformatics tools. Rather than having to sift through multiple reports from individual samples and different tools, we can use MultiQC to evaluate quality metrics conveniently within a single file!
One nice feature of MultiQC
is that it accepts many different file formats. It figures out which format was submitted and tailors the report to that type of analysis. For this workflow we will combine the following QC stats:
- FastQC
- Alignment QC from Picard
We have already discussed in great detail the FASTQC html report in a previous lesson. But we haven't yet looked at the output from Picard CollectAlignmentSummaryMetrics
.
Once your scripts from the previous lesson have finished running, we can take a look to see what kind of output was generated. In your reports directory you should see files with the extension .CollectAlignmentSummaryMetrics.txt
.
Let's use less
to view it:
less ~/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
This is a tab-delimited file which contains a header component, followed by a table with many columns. Each column lists a different metric and the associated value for this syn3_normal sample. It is difficult to view this in the terminal and so you can use the screenshot below to see the contents:
To find out more about individual metrics we encourage you to peruse the Picard documentation. This file format is not ideal for viewing and assessing manually. We would benefit from visualizing the results rather than reading through this line by line.
Now, we could use MultiQC to just visualize the data from Picard AlignmentSummaryMetrics
but that wouldn't be a very good use of the tool. Ideally, the more data we can aggregate, the easier it is to evaluate the quality of our samples with data collated across multiple results and multiple samples.
MultiQC
would be relatively quick to just run from the command-line, but it's best practice to write our steps to scripts so that we always have a record of what we did and how we created our reports. We will start by making sure we are in our scripts directory and writing a sbatch
script in vim
for submission:
cd ~/variant_calling/scripts/
vim multiqc_alignment_metrics_normal_tumor.sbatch
First, we will add our shebang line, description and sbatch
directives
#!/bin/bash
# This sbatch script is for collating alignment metrics from FastQC and Picard using MultiQC
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:10:00
#SBATCH -c 1
#SBATCH --mem 1G
#SBATCH -o multiqc_alignment_metrics_%j.out
#SBATCH -e multiqc_alignment_metrics_%j.err
Next, we will load our modules:
# Load modules
module load gcc/9.2.0
module load multiqc/1.21
NOTE:
MultiQC
version 1.12 requiresgcc/9.2.0
on the O2 cluster.
Next, we will assign our variables:
# Assign variables
REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/
NORMAL_SAMPLE_NAME=syn3_normal
TUMOR_SAMPLE_NAME=syn3_tumor
REFERENCE=GRCh38.p7
NORMAL_PICARD_METRICS=${REPORTS_DIRECTORY}picard/${NORMAL_SAMPLE_NAME}/${NORMAL_SAMPLE_NAME}_${REFERENCE}.CollectAlignmentSummaryMetrics.txt
TUMOR_PICARD_METRICS=${REPORTS_DIRECTORY}picard/${TUMOR_SAMPLE_NAME}/${TUMOR_SAMPLE_NAME}_${REFERENCE}.CollectAlignmentSummaryMetrics.txt
NORMAL_FASTQC_1=${REPORTS_DIRECTORY}fastqc/${NORMAL_SAMPLE_NAME}_1_fastqc.zip
NORMAL_FASTQC_2=${REPORTS_DIRECTORY}fastqc/${NORMAL_SAMPLE_NAME}_2_fastqc.zip
TUMOR_FASTQC_1=${REPORTS_DIRECTORY}fastqc/${TUMOR_SAMPLE_NAME}_1_fastqc.zip
TUMOR_FASTQC_2=${REPORTS_DIRECTORY}fastqc/${TUMOR_SAMPLE_NAME}_2_fastqc.zip
OUTPUT_DIRECTORY=${REPORTS_DIRECTORY}/multiqc/
We also need to add the output directory:
# Create directory for output
mkdir -p $OUTPUT_DIRECTORY
Then, we will add the command to run MultiQC
:
# Run MultiQC
multiqc \
$NORMAL_PICARD_METRICS \
$TUMOR_PICARD_METRICS \
$NORMAL_FASTQC_1 \
$NORMAL_FASTQC_2 \
$TUMOR_FASTQC_1 \
$TUMOR_FASTQC_2 \
--outdir $OUTPUT_DIRECTORY
Click here to see what our final sbatch
code script for running multiqc
should look like
#!/bin/bash # This sbatch script is for collating alignment metrics from FastQC and Picard using MultiQC
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:10:00 #SBATCH -c 1 #SBATCH --mem 1G #SBATCH -o multiqc_alignment_metrics_%j.out #SBATCH -e multiqc_alignment_metrics_%j.err
# Load modules module load gcc/9.2.0 module load multiqc/1.21
# Assign variables REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/ NORMAL_SAMPLE_NAME=syn3_normal TUMOR_SAMPLE_NAME=syn3_tumor REFERENCE=GRCh38.p7 NORMAL_PICARD_METRICS=${REPORTS_DIRECTORY}picard/${NORMAL_SAMPLE_NAME}/${NORMAL_SAMPLE_NAME}_${REFERENCE}.CollectAlignmentSummaryMetrics.txt TUMOR_PICARD_METRICS=${REPORTS_DIRECTORY}picard/${TUMOR_SAMPLE_NAME}/${TUMOR_SAMPLE_NAME}_${REFERENCE}.CollectAlignmentSummaryMetrics.txt NORMAL_FASTQC_1=${REPORTS_DIRECTORY}fastqc/${NORMAL_SAMPLE_NAME}_1_fastqc.zip NORMAL_FASTQC_2=${REPORTS_DIRECTORY}fastqc/${NORMAL_SAMPLE_NAME}_2_fastqc.zip TUMOR_FASTQC_1=${REPORTS_DIRECTORY}fastqc/${TUMOR_SAMPLE_NAME}_1_fastqc.zip TUMOR_FASTQC_2=${REPORTS_DIRECTORY}fastqc/${TUMOR_SAMPLE_NAME}_2_fastqc.zip OUTPUT_DIRECTORY=${REPORTS_DIRECTORY}/multiqc/
# Create directory for output mkdir -p $OUTPUT_DIRECTORY
# Run MultiQC multiqc \ $NORMAL_PICARD_METRICS \ $TUMOR_PICARD_METRICS \ $NORMAL_FASTQC_1 \ $NORMAL_FASTQC_2 \ $TUMOR_FASTQC_1 \ $TUMOR_FASTQC_2 \ --outdir $OUTPUT_DIRECTORY
Like the previous step, we will need to check to ensure that the previous Picard
step for collecting metrics for each sample is down before we can submit this script. To do this, we will check out squeue
:
squeue --me
If it is complete, we can go ahead and run our script.
sbatch multiqc_alignment_metrics_normal_tumor.sbatch
This job should finish fairly quickly.
As we have previously discussed with FASTQC, O2 is not designed to render HTML files. And like the FASTQC report, we will need a browser, such as Safari, Chrome, Firefox, etc., on our local computer to view the HTML report. Thus, we will need to download the HTML report from the cluster to our local computers and we are going to use FileZilla
to help us download the report file.
In the right hand panel, navigate to where the HTML files are located on O2 ~/variant_calling/reports/multiqc/
. Then decide where you would like to copy those files to on your computer and move to that directory on the left hand panel.
Once you have found the HTML output (multiqc_report.html
) for MultiQC
copy it over by double clicking it or drag it over to right hand side panel. Once you have the HTML file copied over to your computer, you can leave the FileZilla
interface. You can then locate the HTML file on your computer and open the HTML report up in a browser (Chrome, Firefox, Safari, etc.).
Now we can evalute all of our FastQC
and alignments metrics at once. The first table we are presented with gives us an overview of our sequencing and alignment.
A few quick takeaways from this table is that it gives us an overview of our alignments:
1) We had an alignment rate of 99% for both normal and tumor which is very good.
2) The level of duplicates is low (<10%).
3) The GC-content of our sequencing is 49%. The average GC content of the human genome is ~41%, but GC-content is higher in genic regions than intergenic regions. Given that our sequencing represents whole exome sequencing rather than whole genome sequencing, a moderately elevated GC-content compared to the genome average seems reasonable. If we see GC-content that drasatically differs (more than ~10%) from our expectation then that could be a reason to pause and look for reasons for this divergence.
4) We can also see that we have ~49 million fragments for each sample, which should provide more than adequate depth for variant calling for human WES.
The next figure in the report is a chart of the aligned reads. You can click on the "Percentages" tab if you'd rather see the alignments as a percentage. As we already covered in the General Statistics section, we are seeing an alignment rate of ~99%, which is very good.
As we continue down the report, we can see that the FastQC plots are present and aggregrated. We won't go in these in detail, as we have already explored them in a previous lesson. What we will highlight is that all samples are aggregated in the plots.
Use the sequence quality figure below as an example. Rather than a single line, we now observe lines representing each sample. If you hover over the line with your mouse, you will see a text box appear to with a sample name label and quality score at each base. The advantage to having all samples in a single plot is that you can more easily spot trends across samples within a dataset.
The QC for this dataset looks pretty good. We have a high alignment rate to our reference genome. Our read qualities are good and the GC-content is within the range we would expect for our sample. There doesn't appear to be many duplicated or overpresented sequences. All of these signs point to having a high-quality dataset, which perhaps was expected from a in silico generated dataset.
1) Rarely will one single metric tell you that there is something wrong about the data. Small deviations away from the "ideal" are normal and should mostly only be a concern if there are multiple deviations with moderate impact. Many of these metrics are somewhat redundant, so any problematic deviation should likely show up in multiple diagnostics. For example, if the expected GC content for your sample is 41%, but your data comes back as 43%, that single metric on it's own is likely not too problematic. However, if the GC content comes back as 60%, there's only 40% alignment to the reference genome and there appears to be a multi-modal distribution in the GC content distribution, then you should definitely pause and evaluate sources of variation that could be creating this pattern.
2) A poor QC report DOES NOT mean that you need to discard all of the data immediately. Oftentimes, there are still salvagable data in a dataset that fails QC on some metrics. Perhaps it means you will need to remove adapter contamination or other contaminants. While it is unfortunate to have to discard reads and weaken your depth for finding variants, having clean data will substanially help the analyses more accuarately call variants. Of course, some datasets are beyond salvagable, but these are generally rare.
3) As with any NGS QC analysis, be aware of the biological and technical aspects of your sample. Perhaps your organism of interest or sample has some peculiar biological/technical aspect as this may or likely will come through in the QC analysis. This biological or technical aspect could skew some QC metrics or create patterns that we haven't shown here. For instance, our GC metrics were a bit elevated compared to the human genome, but we recalled that we are working from exome data and the human exome is more GC-rich than the rest of the genome, so our elevate GC percentages were reasonable.
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.