To standardlize the process, we strongly suggest to start with universal file naming format. In this pipeline, we will use filename convention as:
[PD|AD|HD|HC]_[subjectID]_[cellType]_<b#>_<rep#>.[R1|R2].<fq|fastq>.gz
where,
- Beginning with patient type, PD="Parkinson's disease", AD="Alzhimer's disease", HD="Huntington's disease" and HC="Healthy controls", in 2-letter abbreviation;
- Following with subjectID (NOT lane# or sequencing date), which is unique per subject;
- Following with cell type or tissue short name, e.g. SNDA="Substantial nigra dopamine neuron", MCPY="Motor cortex pyramidal neuron" etc.;
b#
andrep#
are optional, in format of (if any);- batch number:
b1
,b2
etc. - replication number (capitalized one for biological rep), e.g.
rep1
is for technique replicate 1,Rep1
for biological replicate 1 - Use R1 or R2 to tell the two mates of pair-end sequencing data;
- Use zipped fastq;
For each project (e.g. PDMap), it should have a folder named with project short name and followed by the type of data, such as PDMap_RNAseq, HDPredict_smallRNA etc.
- rawfiles
- raw fastq files from sequencing;
- use soft links for external location or unformated file name
- log file (readme.txt or Excel)
- filtered
- filtered files (e.g. adaptor removal/clip)
- run_output
- sub-folder per sample (e.g. sample1, sample2 etc.)
- output of Tophat/Cufflinks/htseq-count runs
- uniq subfolder for the runs for unique reads only
- status indication file (.status*) for tracking the progress
- sub-folder per sample (e.g. sample1, sample2 etc.)
- for_display
- files used for display on UCSC / IGV, such as *.bam, *.bam.bai, *.bw, *.gtf etc.
- use soft links to the output files
- results
- result files for integrative analysis, e.g. differential analysis by combining all samples.
- Install programs:
tophat
,cufflinks
,bowtie
,bedtools
,samtools
,fastq-mcf
,fastqc
, andhtseq-count
; - Install Jim Kent's executable programms: http://hgdownload.cse.ucsc.edu/admin/exe/;
- Install R and bioconductor packages:
DESeq2
,MatrixEQTL
,SPIA
etc. - Add path of the executable programs to the
$PATH
;
Each sub-module has its own status hidden file (e.g. .status.$modulename.adaptorremoval). To check if any sample not completed at any step, use script below:
$grep touch ~/pipeline/modules/_RNAseq.sh | grep -v "#" | sed 's/.*modulename\.//g' | while read i; do echo $i `ll ~/neurogen/rnaseq_PD/run_output/*/.status._RNAseq.sh.$i | wc -l`; done
RNAseq.pipeline.sh
- Usage:
RNAseq.pipeline.sh /data/neurogen/rnaseq_PD/rawfiles
- Function: Main script for submitting RNAseq data analysis jobs to high-properformance computing cluster in batch. For now, it's configured to support only HPC environment with LSF job scheduler.
- Input: absolute path of folder for the raw fastq files
- Output: Tophat/Cufflinks/htseq-count/callSNP etc. runs for each sample using both multiple and unique mappers
- Usage: _RNAseq.sh HC_BN10-39_2.R1.fastq.gz HC_BN10-39_2.R2.fastq.gz
- Function: Routine steps for pair-end RNAseq data
- Input: a pair of fastq files from PE RNAsequencing
- Output: BAM/SAM (from alignment), BED/GTF (from assembly) etc.
List of sub-moduels in the core script _RNAseq.sh:
$ grep touch ~/pipeline/modules/_RNAseq.sh | grep -v "#" | sed 's/.*modulename\.//g'
adaptorremoval
shortReadsExtract
shortReadsExtract_stemloop
fastqc
mapping
circRNA
sam2bam
bam2stat
sam2bw
rpm_vs_coverage
cufflinks.multi.denovo
cufflinks
htseqcount
metaexon
uniq
uniq.sam2bw
uniq.bam2stat
cufflinks.denovo
uniq.bam2annotation
uniq.normalize
uniq.cufflinks.rpkm
uniq.cuffquant.rpkm
uniq.htseqcount
uniq.metaexon
makelinks