Illumina Read Processing and Alignment

Initial QC and Trimming

Initial sequencing results were visualised with FastQCv0.12.1 and MultiQCv1.13. Reads were trimmed using TrimGalore v0.6.10.

ref=/dir/to/reference.fasta
dir=/dir/to/reads/
reports=/dir/to/fastq_out/
trim_out=/dir/to/trimmmed_reads/

for samp in ${dir}*R1.fastq.gz
    do
    base=$(basename ${samp} _R1.fastq.gz)
    echo "Running Trim_galore for ${base}..."
    trim_galore --paired \
        --nextseq 28 \
        --2colour 20 \
        --cores 8 \
        --fastqc_args "--nogroup --outdir ${reports} --threads 32" \
        --length 50 \
        --output_dir ${trim_out} \
        --clip_R1 20 \
        --clip_R2 20 \
        --three_prime_clip_R1 5 \
        --three_prime_clip_R2 5 \
        --retain_unpaired \
        --length_1 55 \
        --length_2 55 \
        ${dir}${base}_R1.fastq.gz \
        ${dir}${base}_R2.fastq.gz
done

Alignment

For population analyses, Illumina short-reads were aligned, PCR-duplicates removed, and alignment statistics estimated in the same manner for both fairy tern populations and kakī. All manipulation of alignment files was performed with SAMtools v1.16.

REF=/dir/to/reference.fasta
INPUT=/dir/to/reads/
OUTPUT=/dir/to/alignments/

for LIB in lib1 lib2 LIC001 LIC002
    do
    for SAMP in ${INPUT}${LIB}/*_R1.fq.gz
        do
        BASE=$(basename ${SAMP} _${LIB}_R1.fq.gz)
        INFOLINE=$(zcat ${SAMP} | head -n 1)
        INSTRUMENT=`echo ${INFOLINE} | cut -d ':' -f1`
        INSTRUMENTRUN=`echo ${INFOLINE} | cut -d ':' -f2`
        FLOWCELL=`echo ${INFOLINE} | cut -d ':' -f3`
        LANE=`echo ${INFOLINE} | cut -d ':' -f4`
        INDEX=`echo ${INFOLINE} | cut -d ':' -f10`

        RGID="ID:${INSTRUMENT}_${INSTRUMENTRUN}_${FLOWCELL}_${LANE}_${INDEX}"
        RGPL="PL:Illumina"
        RGPU="PU:${FLOWCELL}.${LANE}"
        RGLB="LB:${BASE}_${LIB}"
        RGSM="SM:${BASE}"

        printf "BEGAN ALIGNING READS FOR ID:${BASE}, LIB:${LIB} AT "
        date
        bwa mem -M -R @RG'\t'$RGID'\t'$RGPL'\t'$RGPU'\t'$RGLB'\t'$RGSM -t 46 $REF $SAMP ${INPUT}${LIB}/${BASE}_${LIB}_R2.fq.gz | samtools sort -@ 46 -o ${OUTPUT}${BASE}.bam
        printf "FINISHED ALIGNING READS FOR ID:${BASE}, LIB:${LIB} AT "
        date
    done
done

Files were then merged for each of the datasets, the reference sample (SP01) is the only individual sequenced at two different facilites.

samtools merge -@ 16 \
    -o ${DIR}SP01_merged.bam \
    ${DIR}SP01_lib1.bam ${DIR}SP01_lib2.bam \
    ${DIR}SP01_LIC001.bam ${DIR}SP01_LIC002.bam

This for loop below was used to merge all other samples. The sequence batches were run simultaneously by changing the suffix of the file (i.e., LIC002, lib2) in the initiation of the for loop and for the creation of the $BASE variable.

for BAM in ${DIR}*_lib2.bam
    do
    BASE=$(basename $bam _lib2.bam)
    echo "MERGING BAMS FOR $BASE..."
    samtools merge -@64 -o ${DIR}${BASE}_merged.bam ${DIR}${BASE}*.bam
done

All alignments were then sorted and PCR duplicates marked and removed using SAMtools v.1.16.

for BAM in ${DIR}*_merged.bam
    do
    BASE=$(basename $BAM _merged.bam)
    echo "NSORTING AND FIXING MATE PAIRS FOR ${BASE}\n"
    samtools sort -@64 -n ${BAM} | samtools fixmate -@64 -m -c - ${DIR}${BASE}.fixmate.bam
    samtools sort -@64 -o ${DIR}bam/${BASE}.fixmate.sorted.bam ${DIR}${BASE}.fixmate.bam
    printf "MARKING AND REMOVING PCR DUPLICATES FOR ${BASE} AT "
    date
    samtools markdup -@64 --write-index ${DIR}${BASE}.fixmate.sorted.bam ${DIR}${BASE}_markdup.bam
    samtools markdup -r -@64 --write-index ${DIR}${BASE}.fixmate.sorted.bam ${DIR}${BASE}_nodup.bam
    printf "\nFINISHED MARKING AND REMOVING DUPLICATES FOR ${BASE} AT "
    date
done

Once BAMs were processed, autosomal chromosomes were extracted for population analyses. Mean mapping quality and alignment depth was then estimated using QualiMap v2.3.

cut -f1,2 SP01_5kb_ragtag.fa.fai | grep "CM020" | grep -v CM020462.1_RagTag | grep -v CM020463.1_RagTag | awk '{print $1"\t1\t"$2} > SP01_autosomes.bed

for SAMP in ${DIR}*_markdup.bam
    do
    BASE=$(basename $SAMP _markdup.bam)
    printf "EXTRACTING AUTOSOMES FOR ${BASE} AT "
    date
    samtools view -@64 -L SP01_genome/SP01_autosomes.bed \
        --write-index -o ${DIR}markdup/${BASE}_markdup_autosomes.bam ${SAMP}
    samtools view -@64 -L SP01_genome/SP01_autosomes.bed \
        --write-index -o ${DIR}nodup/${BASE}_nodup_autosomes.bam ${DIR}nodup/${BASE}_nodup.bam
    printf "FINISHED CONVERTING AND INDEXING FILES FOR ${BASE}, NOW ESTIMATING STATS AT "
    date
    mosdepth --threads 24 --fast-mode --by 50 ${DIR}mosdepth/${BASE}_markdup_autosomes ${SAMP}
    mosdepth --threads 24 --fast-mode --by 50 ${DIR}mosdepth/${BASE}_nodup_autosomes ${SAMP}
    qualimap bamqc -bam ${DIR}markdup/${BASE}_markdup_autosomes.bam -outdir ${DIR}qualimap/${BASE}_markdup_autosomes -outformat PDF:HTML --java-mem-size=3G
    qualimap bamqc -bam ${DIR}nodup/${BASE}_nodup_autosomes.bam -outdir ${DIR}qualimap/${BASE}_nodup_autosomes -outformat PDF:HTML --java-mem-size=3G
    echo "FINISHED PROCESSING ${BASE}!"
done

For ease of comparisons, mosdepth outputs were also plotted with:

python ~/anaconda3/envs/mosdepth/scripts/plot-dist.py ${data}nodup_bam_stats/*.global.dist.txt

Three additional fairy tern scaffolds CM020459.1, CM020460.1, and CM02061.1 were excluded from downstream analyses as these scaffolds consistently had read depths much higher than expected and likely represents improperly assembled representations of these chromosomes. A new bedfile, cleverly named SP01_autosomes2.bed, was used to remove these scaffolds prior to analyses.

for SAMP in ${DIR}*_autosomes.bam
    do
    BASE=$(basename $SAMP _autosomes.bam)
    printf "EXTRACTING AUTOSOMES FOR ${BASE} AT "
    date
    samtools view -@64 -L SP01_genome/SP01_autosomes2.bed \
        --write-index -o ${DIR}markdup/${BASE}_markdup_autosomes2.bam ${SAMP}
    samtools view -@64 -L SP01_genome/SP01_autosomes2.bed \
        --write-index -o ${DIR}nodup/${BASE}_nodup_autosomes2.bam ${DIR}nodup/${BASE}_nodup.bam
    echo "FINISHED PROCESSING ${BASE}!"
done

The BAM files originally filtered for autosomomal scaffolds were subsequently overwritten with the newly filtered bam files. Thus all files denoted as *_{markdup,nodup}_autosomes.bam contain only 22 scaffolds for fairy terns.

A similar process was performed for kakī alignments, with some minor differences. Because the kakī genome was assembled and scaffolded without the use of a chromosomally assembled genome, we identified scaffolds > 2Mb in length that demonstrated sequence coverage consistent with larger scaffolds. This indicated that Scaffolds 1-28 likely represent relatively well assembled chromosomes. These, exlucding the Z sex chromosome (Scaffold 4) were used in all downstream analyses.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

02_pop_sequence_QC_and_alignment.md

02_pop_sequence_QC_and_alignment.md

Illumina Read Processing and Alignment

Initial QC and Trimming

Alignment

Files

02_pop_sequence_QC_and_alignment.md

Latest commit

History

02_pop_sequence_QC_and_alignment.md

File metadata and controls

Illumina Read Processing and Alignment

Initial QC and Trimming

Alignment