Pipeline to run basic RNA-seq analysis on paired-end data.
This is a package of Python and R scripts that enable reading, processing and analysis of Omics' datasets. This package implements the Snakemake management workflow system and is currently implemented to work with the cluster management and job scheduling system SLURM. This snakemake workflow utilizes conda installations to download and use packages for further analysis, so please ensure that you have installed miniconda prior to use.
Please add an issue to the Omics-QC-pipeline repository. We would appreciate if your issue included sample code/files (as appropriate) so that we can reproduce your bug/issue.
We welcome contributors! For your pull requests, please include the following:
- Sample code/file that reproducibly causes the bug/issue
- Documented code providing fix
- Unit tests evaluating added/modified methods.
Locate raw files:
- After sequencing, your raw fastq files are placed in
/path/to/sequencing/files
.
$ cd /path/to/raw/data
$ ls -alh
Check md5sum.
$ md5sum –c md5sum.txt > md5sum_out.txt
Move your files into the archive to be stored.
$ mv /path/to/raw/data /path/to/archive
Check md5sum again to ensure your sequencing files are not corrupted.
$ md5sum –c md5sum.txt > md5sum_out.txt
Unzip all fastq files.
$ gunzip –d sample.fastq.gz
$ ctrl+z
$ bg
Clone this repository into your working directory.
$ git clone https://github.com/ohsu-cedar-comp-hub/Bulk-RNA-seq-pipeline-PE.git
Create a samples/raw
directory, a logs
directory and a data
directory (if they do not exist) in your wdir()
.
$ mkdir logs
$ mkdir data
$ mkdir samples
$ cd samples
$ mkdir raw
Symbollically link the fastq files of your samples to the wdir/samples/raw
directory using a bash script loop in your terminal.
$ ls -1 /path/to/data/LIB*R1*fastq | while read fastq; do
R1=$( basename $fastq | cut -d _ -f 2 | awk '{print $1"_R1.fq"}' )
R2=$( basename $fastq | cut -d _ -f 2 | awk '{print $1"_R2.fq"}' )
echo $R1 : $R2
ln -s $fastq ./$R1
ln -s ${fastq%R1_001.fastq}R2_001.fastq ./$R2
done
Upload your metadata file to the data
directory, with the correct formatting:
- Columns should read:
StudyID Column2 Column3 ...
- Each row should be a sample, with subsequent desired information provided (RNA extraction date, etc.)
- Edit omic_config.yaml to include only columns included in this metadata file:
- This includes
meta_columns_to_plot
andpca labels
- This includes
- All values in this file should be tab-separated
Edit the omic_config.yaml
in your wdir()
:
- Change the
project_id
to a unique project identifier - Add appropriate contrasts based on your samples under the
[diffexp][contrasts]
section - Add the path to your metadata file for the
omic_meta_data
andsamples
parameters - Change
base_dir
to your current working directory - Ensure you have the correct
assembly
specified- Current options for this are: hg19, hg38.89 (ensembl v89) and hg38.90 (ensembl v90)
Do a dry-run of snakemake to ensure proper execution before submitting it to the cluster (in your wdir).
$ snakemake -np --verbose
Once your files are symbolically linked, you can submit the job to exacloud via your terminal window.
$ sbatch submit_snakemake.sh
To see how the job is running, look at your queue.
$ squeue -u your_username
- Trimming
- Trimming of paired-end reads was performed using the trimming tool
sickle
- The output is located in
samples/trimmed/
- Trimming of paired-end reads was performed using the trimming tool
- Quality Analysis
- Trimmed reads were subject to
fastqc
quality analysis - The output is located in
samples/fastqc/{sample}/{samples}_t_fastqc.zip
- Trimmed reads were subject to
- Alignment
- Trimmed reads were aligned to the hg38 genome assembly using
STAR
- We included a two pass mode flag in order to increase the number of aligned reads
- Output is placed in
samples/star/{sample}_bam/
- Output directory includes:
Aligned.sortedByCoord.out.bam
,ReadsPerGene.out.tab
, andLog.final.out
- Output directory includes:
- We extracted the statistics from the
STAR
run, and placed them in a table, summarizing the results across all samples from theLog.final.out
output of STAR- Output is
results/tables/{project_id}_STAR_mapping_statistics.txt
- Output is
- Trimmed reads were aligned to the hg38 genome assembly using
- Summarizing output
htseq
is used to extract the gene counts from each sample- We summarize these results into 1 table, which includes the gene counts across all samples
- The output is located in
data/{project_id}_counts.txt
- RSEQC Quality check
RSEQC
was used to check the quality of the reads by using a collection of commands from theRSEQC
package:- Insertion Profile
- Inner Distance
- Clipping Profile
- Read distribution
- Read GC
- For more information on these, visit: http://dldcc-web.brc.bcm.edu/lilab/liguow/CGI/rseqc/_build/html/index.html#usage-information
- Output directory:
rseqc/
- QA/QC scripts to analyze the data as a whole
- The purpose of this analysis is to identify potential batch effects and outliers in the data
- The outputs to this are located in the
results
directory, and are distributed amongst 4 subdirectories, numbered1 through 4
1
- A boxplot of the raw log2-transformed gene counts across all samples
- A boxplot of the loess-transformed gene counts across all samples
- A scatter plot comparing raw gene counts to loess-transformed gene counts
- A density plot of raw log2-transformed gene counts across all samples
- A density plot of loess-transformed gene counts across all samples
- A scatter plot of the standard deviation of raw log2-transformed gene counts across all samples
- A scatter plot of the standard deviation of loess-transformed gene counts across all samples
2
- A heatmap of all raw log2-transformed gene counts across samples
- A heatmap of all loess-transformed gene counts across samples
- These are generated to look for any batch effects in the data, due to date of extraction, or other factors
- An MDS Plot for all samples, generated with the raw log2-transformed gene counts
- An MDS Plot for all samples, generated with the loess-transformed gene counts
- These are generated to look for outliers in the data
3
- p-value histograms for each contrast specified in the
omic_config.yaml
- q-value QC plot arrays for each contrast specified in the
omic_config.yaml
- p-value histograms for each contrast specified in the
4
- A Heatmap which looks at genes with a high FC and low q-value (very significant)
- Takes genes with a FC>1.3, and ranks those by q-value. From this, a heatmap is generated for the top 50, 100 and 200 genes in this list
- An MDS Plot which looks at the same subsets of genes as the Heatmap described above
- A Heatmap which looks at genes with a high FC and low q-value (very significant)
- Initializing the DESeq2 object
- Here, we run
DESeq2
on the genecounts table, which generates an RDS object and rlog- This includes the DE analysis across all samples
- Output is located in the
results/diffexp/ directory
- From the dds object generated, we extract the normalized counts and generate a table with the results
- Output is
results/tables/{project_id}_normed_counts.txt
- Output is
- Here, we run
- Generating plots
- From the RDS object, we generate a collection of informative plots. These include:
- PCA Plot
- Standard Deviation from the Mean Plot
- Heatmap
- Variance Heatmap
- Distance Plot
- From the RDS object, we generate a collection of informative plots. These include:
- Differential Expression Analysis
- We perform Differential Expression (DE) analysis for each contrast listed in the
omic_config.yaml
- Our output consists of DE gene count tables and a variety of plots
- A table is generated for genes that are differentially expressed for each contrast
- The output is placed in
results/diffexp/{contrast}.diffexp.tsv
- The output is placed in
- MA Plots are generated for each contrast
- p-histograms are generated for each contrast
- A table is generated for genes that are differentially expressed for each contrast
- We perform Differential Expression (DE) analysis for each contrast listed in the
- Differential Expression Plots
- We use the output from DESeq2 to generate two types of plots:
- Gene Ontology (GO) plots:
- A
tree graph
describing the GO ID relationship for significantly up/downregulated genes in a given comparison- Output is located in
results/diffexp/GOterms
- Output is located in
- A
bar graph
describing the enrichment and significance of GO IDs for up/downregulated genes in a given comparison
- A
- Volcano plots:
- A
volcano plot
describing the distribution of up/downregulated genes in a given comparison- Output is located in
results/diffexp
- Output is located in
- A
- Gene Ontology (GO) plots:
- We use the output from DESeq2 to generate two types of plots: