
Welcome to the annotate_my_genomes wiki!

The pipeline will:

  • Use as input a GTF file coming from StringTie using long/hybrid sequencing read settings (for documentation, please see http://ccb.jhu.edu/software/stringtie/ and the documentation in this repository).
  • Complement the annotation using UCSC/NCBI annotations, based on genome coordinates from UCSC genomes. A reconciled GTF file (final_annotated.gtf) is generated with annotated gene names and the corresponding StringTie-assembled transfrags (transcripts), as follows:
gene_id "YF5"; transcript_id "STRG.40.1" ---> 40 = assembled transfrag; 1 = isoform number.

I) Manual Installation of Dependencies

Mandatory Dependencies:

(1) Obtaining and installing latest Anaconda (Anaconda3 2020.02)

sudo apt-get install libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
chmod 755 Anaconda3-2020.02-Linux-x86_64.sh
./Anaconda3-2020.02-Linux-x86_64.sh
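
After restarting the shell, the installation can be verified by printing the conda version (any recent version string indicates success):

conda --version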

(2) Installing GNU Parallel

  • via Ubuntu/debian repository:
sudo apt-get install -y parallel
  • via conda (conda-forge channel):
conda install -c conda-forge parallel
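
To confirm that GNU Parallel works, a trivial invocation can be used (it simply echoes each argument, one job per argument):

parallel echo ::: 1 2 3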

(3) Obtaining and installing StringTie (v2.0 release needed)

git clone https://github.com/gpertea/stringtie
cd stringtie
make release
sudo cp stringtie /usr/local/bin/

(4) Obtaining and installing gffcompare and gffread

# In the same directory, do
git clone https://github.com/gpertea/gclib
git clone https://github.com/gpertea/gffcompare
git clone https://github.com/gpertea/gffread
cd gffcompare  
make release
sudo cp gffcompare trmap /usr/local/bin/
cd ..
cd gffread
make release
sudo cp gffread /usr/local/bin/
cd ..

(5) Installing up-to-date ncbi-blast+ version (>=v2.7.1)

If you already have an ncbi-blast+ version newer than 2.7.1, your binaries will work; note that the GAWN pipeline strongly recommends v2.7.1 or later.

# To remove older ncbi-blast+ binaries from your system 
sudo apt-get remove ncbi-blast+

# Installing ncbi-blast+ version > 2.7
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz      # 2021-10-26 linux version
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-macosx.tar.gz     # 2021-10-26 macOSX version

gunzip ncbi-blast-2.X.X+-x64-linux.tar.gz
tar -xvf ncbi-blast-2.X.X+-x64-linux.tar
cd ncbi-blast-2.X.X+/
sudo cp ./bin/* /usr/local/bin/
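
Since GAWN expects v2.7.1 or later, it is worth verifying which binaries are now picked up; both of the following print the installed BLAST+ version:

blastn -version
makeblastdb -version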

(6) Installing GMAP genomic aligner program

(for documentation, please see http://research-pub.gene.com/gmap/)

sudo apt-get install gmap

(7) Obtaining and Installing BEDTools

Complete instructions can be found at https://bedtools.readthedocs.io/en/latest/content/installation.html. Users with privileges can install it with sudo:

sudo apt-get install bedtools

(8) Installing dependencies for FEELnc: FlExible Extraction of LncRNAs

(for documentation, please see https://github.com/tderrien/FEELnc)

## debian requirements
sudo apt-get install libcurl4 libcurl4-openssl-dev -y
sudo apt-get install libxml-dom-xpath-perl

## Perl Requirements: Parallel and BioPerl (with sudo privileges)

sudo perl -MCPAN -e shell
install Parallel::ForkManager
install BioPerl
quit

# Installing KmerInShort 

git clone --recursive https://github.com/rizkg/KmerInShort
cd KmerInShort
mkdir build;  cd build;  cmake ..;  make -j 8
sudo cp KmerInShort /usr/local/bin/

# Installing fasta_ushuffle

git clone https://github.com/agordon/fasta_ushuffle.git
cd fasta_ushuffle
make
sudo cp ushuffle /usr/local/bin/
sudo cp fasta_ushuffle /usr/local/bin/

Clone the FEELnc git repository and install it:

git clone https://github.com/tderrien/FEELnc.git
cd FEELnc
# Export PERL5LIB and FEELNCPATH variables
export FEELNCPATH=${PWD}
export PERL5LIB=$PERL5LIB:${FEELNCPATH}/lib/ #order is important to avoid &Bio::DB::IndexedBase::_strip_crnl error with bioperl >=v1.7
export PATH=$PATH:${FEELNCPATH}/scripts/
export PATH=$PATH:${FEELNCPATH}/utils/

# for LINUX
#----------
export PATH=$PATH:${FEELNCPATH}/bin/LINUX/
# or
cp ${FEELNCPATH}/bin/LINUX/* ~/bin/
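
As a quick sanity check that the exported variables work, one of the FEELnc modules can be invoked with its help flag (a minimal check only; see the FEELnc documentation for real usage):

FEELnc_filter.pl --help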

(9) Obtaining and installing up-to-date SAMtools with htslib (version >= 1.9)

Older samtools versions can also work, but users need up-to-date versions of these three packages. Users can first install htslib v1.9 and then samtools and bcftools v1.9, respectively. To download these packages, see http://www.htslib.org/download/. Installation can be accomplished by downloading the three packages, decompressing them, and doing the following:

# Debian requirement
sudo apt-get install -y libbz2-dev 
# Installation
wget https://github.com/samtools/htslib/releases/download/1.10.2/htslib-1.10.2.tar.bz2
bzip2 -d htslib-1.10.2.tar.bz2
tar -xvf htslib-1.10.2.tar
rm htslib-1.10.2.tar
cd htslib-1.10.2    # and similarly for samtools
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
# when building samtools, also copy the binary itself:
sudo cp samtools /usr/local/bin/

# Similarly to htslib, samtools and bcftools can be downloaded as follows:

wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
wget https://github.com/samtools/bcftools/releases/download/1.10.2/bcftools-1.10.2.tar.bz2

and installed in the same way as htslib. Then, type

samtools

in a terminal to check that version 1.10 (using htslib v1.10) is installed.
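
For instance, samtools-1.10 can be built following the same pattern used for htslib above (a sketch; bcftools-1.10.2 builds the same way):

bzip2 -d samtools-1.10.tar.bz2
tar -xvf samtools-1.10.tar
rm samtools-1.10.tar
cd samtools-1.10
sudo ./configure --prefix=/usr/local/bin
sudo make
sudo make install
sudo cp samtools /usr/local/bin/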

(10) Obtaining and installing TransDecoder (latest version= 5.5.0)

Complete information about installation can be found here: https://github.com/TransDecoder/TransDecoder

  • Via conda
conda install -c bioconda transdecoder
  • Debian:
sudo apt install transdecoder

Optional Dependencies (to run all the downstream analyses):

(1) Obtaining and installing R (>=3.6.0) and associated packages

See https://cloud.r-project.org/ for R installation on linux/ubuntu/windows/(Mac) OS X. R version 3.2.3 comes by default in Ubuntu 16.04 LTS, but users with newer Ubuntu distributions must upgrade R. One way to accomplish this is through conda:

conda config --set channel_priority false        # To make conda install the newest version of a package
conda config --prepend channels conda-forge      # adding conda-forge package
conda install r-base=4.0.2                       # for the latest R version, through conda-forge channel (4.0.2)
conda install r-base=3.6.3                       # for a specific R version

or via Ubuntu packages, removing old R distributions and adding a newer R repository:

# Removing R from system
sudo apt-get remove r-base-core

# Editing sources.list 
sudo su
echo "deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

# Installing R (version 3.6.3)
sudo apt update; sudo apt install r-base
exit
R

The above will install R version 3.6.3 (2020-02-29) -- "Holding the Windsock".

The following R packages are required for the script usages:

  • car >=3.5.0
  • edgeR
  • EnhancedVolcano

To install these packages, do the following:

# Debian requirements (for the R curl library)
sudo apt-get install libcurl4 libcurl4-openssl-dev -y

In R:
R
# car
install.packages("car")

# edgeR and EnhancedVolcano packages can be installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("edgeR")
BiocManager::install("EnhancedVolcano")

(2) Installing Salmon

Salmon is an ultra-fast transcript quantification tool for RNA-seq data. More information about installation can be found here: https://salmon.readthedocs.io/en/latest/. This tool can be installed through bioconda as follows:

conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n salmon salmon

This will install the latest salmon in its own conda environment. The environment can then be activated via:

conda activate salmon

(3) Installing prinseq

PRINSEQ (PReprocessing and INformation of SEQuence data) can be installed from here: http://prinseq.sourceforge.net/ or through bioconda as follows:

conda install -c bioconda prinseq

(4) Installing seqkit

SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation (https://bioinf.shenwei.me/seqkit/), can be installed from its GitHub releases as follows:

wget https://github.com/shenwei356/seqkit/releases/download/v0.12.1/seqkit_linux_386.tar.gz
gunzip seqkit_linux_386.tar.gz
tar -xvf seqkit_linux_386.tar
sudo cp seqkit /usr/local/bin/

(5) Installing rnaQUAST

rnaQUAST is a tool designed for quality evaluation and assessment of de novo transcriptome assemblies. For more information, please visit http://cab.spbu.ru/software/rnaquast/. Installation can be done via bioconda:

conda install -c bioconda rnaquast
# Testing installation
rnaQUAST.py --test

(6) Obtaining and installing EMBOSS toolkit (Open Source software for molecular biology)

Complete instructions can be found at the webpage: https://ssbio.readthedocs.io/en/latest/instructions/emboss.html. One way to install it is the following:

sudo apt-get install emboss

(7) Obtaining and installing Clustal Omega (DNA/Protein alignment program)

A description can be found at the webpage: http://www.clustal.org/omega/. One way to install it is the following:

sudo apt-get install clustalo

(8) Obtaining and installing Cufflinks: http://cole-trapnell-lab.github.io/cufflinks/

sudo apt-get install -y cufflinks

II) Obtaining a StringTie GTF file for annotation

As an example, we provide all the steps to obtain a suitable GTF file for annotation from different sequencing technologies.

Example 1: Alignment of long sequencing reads with the minimap2 aligner against the galGal6 genome from UCSC, using 15 threads.

# Installing minimap2
git clone https://github.com/lh3/minimap2
cd minimap2 && make
sudo cp minimap2 /usr/local/bin/

# Convert PacBio subreads (bam files) to fastq
samtools bam2fq m54027_190807_082031.subreads.bam > reads.fastq

# Download Gallus gallus v6 fasta file (use genome-download program from this repository, located in ./bin folder)
./genome-download galGal6

# Aligning with minimap2, using 15 threads
minimap2 -ax splice -t 15 galGal6.fa reads.fastq > aln_galGal6.sam
samtools view -S -b aln_galGal6.sam -@ 15 > aln_galGal6.bam
samtools sort aln_galGal6.bam -@ 15 > aln_galGal6.sorted.bam
samtools index aln_galGal6.sorted.bam -@ 15

# Obtaining GTF (transcripts.gtf) from the above alignment using StringTie (-L: long read settings)
stringtie -p 1 -L -v -a 4 -o transcripts.gtf aln_galGal6.sorted.bam

# Obtaining GTF (transcripts.gtf) from the above alignment using StringTie (-L: long read settings) using Reference GTF as guide
stringtie -p 1 -L -v -a 4 -o transcripts.gtf -G galGal6_ncbiRefSeq.gtf aln_galGal6.sorted.bam

Example 2: Short- and long-read sequencing. If users also sequenced with Illumina, short reads can be aligned using hisat2 (https://ccb.jhu.edu/software/hisat2/manual.shtml) and then merged with the minimap2 alignments as follows:

# Install hisat2
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip
unzip hisat2-2.0.4-Linux_x86_64.zip
sudo cp hisat2-2.0.4/hisat2* /usr/local/bin/

# build hisat2 index with 15 threads
hisat2-build galGal6.fa -p 15 ./galGal6

# align short illumina reads against the galGal6 genome from UCSC, using 15 threads
hisat2 -x ./galGal6 -p 15 -1 41-A3_S1_R1_001.fastq.gz -2 41-A3_S1_R2_001.fastq.gz | samtools view -bS - > 41.bam

# merge PacBio alignments (aln_galGal6.bam) with illumina alignments (41.bam), using 15 threads:
samtools merge PacBio_Illumina_merged.bam aln_galGal6.bam 41.bam -@ 15 -f

# Sort and Index, using 15 threads
samtools sort -o PacBio_Illumina_merged.sorted.bam PacBio_Illumina_merged.bam -@ 15
samtools index PacBio_Illumina_merged.sorted.bam

1) Obtaining GTF (transcripts.gtf) from the above alignment using StringTie
stringtie -p 1 -v -a 4 -o transcripts.gtf PacBio_Illumina_merged.sorted.bam

2) If the above fails, users can increase -j and -c parameters (useful for large BAM file processing)
stringtie -p 1 -j 2 -c 2 -v -a 4 -o transcripts.gtf PacBio_Illumina_merged.sorted.bam

3) Transcript assembly using Reference GTF as guide
stringtie -p 1 -j 2 -c 2 -v -a 4 -o transcripts.gtf -G galGal6_ncbiRefSeq.gtf PacBio_Illumina_merged.sorted.bam

Example 3: Using StringTie --mix (Recommended)

  • The new StringTie --mix method enables highly accurate transcriptome assembly (https://doi.org/10.1371/journal.pcbi.1009730).
  • We encourage users to apply this new transcriptome assembly method to hybrid RNA sequencing data. This can be done as follows:
## Obtaining GTF (StringTie-mix.gtf) from PacBio and Illumina alignments with StringTie mix (using Reference GTF as guide)

# Sort and Index Illumina and PacBio bam files, using 15 threads
samtools sort -o 41.sorted.bam 41.bam -@ 15 && samtools index 41.sorted.bam                              # Illumina 
samtools sort -o aln_galGal6.sorted.bam aln_galGal6.bam -@ 15 && samtools index aln_galGal6.sorted.bam   # PacBio 

# Run StringTie mix, using --mix flag
stringtie --mix -G galGal6_ncbiRefSeq.gtf -o StringTie-mix.gtf 41.sorted.bam aln_galGal6.sorted.bam                    

Important:

StringTie can overlap transcripts across multiple genes, assigning the same gene_id to multiple genes: https://github.com/gpertea/stringtie/issues/270. We recommend that, after running StringTie, users merge the resulting GTF with the reference GTF and fix the gene_id field.

For example, using:

  • StringTie-mix.gtf as resulting GTF from StringTie and
  • galGal6_ncbiRefSeq.gtf as reference GTF:

Users can do:

stringtie --merge -G galGal6_ncbiRefSeq.gtf StringTie-mix.gtf -l STRG > merge.gtf
wget https://gist.githubusercontent.com/moble/e2b3e89fc0220e4363765fadc8c1dd73/raw/80acd1577ccd6a485ebe0c21aae4a6326241617a/unique_gene_id.py
python unique_gene_id.py merge.gtf

The resulting merge.unique_gene_id.gtf file is now ready to be inputted into the pipeline.
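
As a quick sanity check (a sketch using the file names from this example), users can compare the number of distinct gene_id values before and after the fix; the count is expected to increase, since loci that wrongly shared a gene_id receive distinct ones:

grep -o 'gene_id "[^"]*"' merge.gtf | sort -u | wc -l                    # before the fix
grep -o 'gene_id "[^"]*"' merge.unique_gene_id.gtf | sort -u | wc -l     # after the fix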

III) Usage examples

Simplest Usage mode

To download reference genome sequences (UCSC), use the genome-download program from this repository. To check UCSC genome names, please visit: https://genome.ucsc.edu/cgi-bin/hgGateway

(Optional) Edit the NCPUS value in the gawn_config.sh file in the annotate_my_genomes folder. The default is 10.

  • For a mouse assembly using "target.gtf" in the genome_1 folder, using 15 threads for cpu processing:
mkdir output1
./genome-download mm10
./annotate-my-genomes -a target.gtf -r mm10.gtf -g mm10.fa -c /path/to/gawn_config.sh -t 15 -o output1
  • For a rabbit assembly using "target.gtf" in the genome_1 folder, using 15 threads for cpu processing:
mkdir output2
./genome-download oryCun2
./annotate-my-genomes -a target.gtf -r oryCun2.gtf -g oryCun2.fa -c /path/to/gawn_config.sh -t 15 -o output2

Adding NCBI annotations

Users can add annotations from NCBI by providing the three outputs of the ./genome-download program to ./add-ncbi-annotation. As an example, the pipeline will work as follows (chicken assembly, inside the test folder):

# Downloading galGal6 genome and correspondent UCSC/NCBI GTF annotations
./genome-download galGal6

# Running the pipeline on StringTie.gtf, using NCBI GTF (galGal6_ncbiRefSeq.gtf), UCSC GTF (galGal6.gtf), genome (galGal6.fa) and 15 threads for processing:
mkdir output3
./add-ncbi-annotation -a StringTie.gtf -n galGal6_ncbiRefSeq.gtf -r galGal6.gtf -g galGal6.fa -c /path/to/gawn_config.sh -t 15 -o output3

final_annotated.gtf (located in output_files_NCBI) will contain the merged NCBI-updated annotation (in UCSC coordinates).

IV) Downstream analysis using outputs:

(1) Gene quantification procedure examples using output GTF file (final_annotated.gtf):

sudo apt-get install build-essential python2.7-dev python-numpy python3-matplotlib python-pysam python-htseq
  • Gene-level quantification using "final_annotated.gtf" GTF file
htseq-count --stranded=no --format bam condition1.bam condition2.bam final_annotated.gtf > gene_counts

To obtain a count table suitable for R, remove the five summary rows that htseq-count appends at the end (__no_feature, __ambiguous, __too_low_aQual, __not_aligned and __alignment_not_unique):

tac gene_counts | sed "1,5{d}" | tac > count_table
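
The count table can then be loaded into R (a minimal sketch assuming the two-sample htseq-count command above; column names are illustrative):

# Open R
R
counts <- read.table("count_table", header = FALSE, row.names = 1)   # gene ids as row names
colnames(counts) <- c("condition1", "condition2")                    # one column per BAM file
head(counts)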

(2) Transcriptome and gene metrics

To obtain basic transcriptome metrics, users will need the final_annotated.gtf file (output from this pipeline) and the reference genome in fasta format as inputs to the transcriptome_metrics.sh script. The metrics can be obtained as follows:

bash /path/to/annotate_my_genomes/additional_scripts/transcriptome_metrics.sh -f final_annotated.gtf -g /path/to/genome.fasta

A subdirectory called transcriptome_metrics will contain the gene_metrics.txt and transcriptome_metrics.txt files, with the classification of novel and reconciled coding/non-coding genes and transcripts, respectively.

(3) Detailed transcriptome annotation

Please see: https://github.com/cfarkas/annotate_my_genomes/blob/master/README.md#d-post-processing-add-ncbi-annotation-outputs

(4) I need the transcript sequences matching each gene, and I also want to validate conserved regions with qPCR. What can I do?

A gene list in tabular format can also be used to:

  • Extract transcript sequences belonging to each gene.
  • Align the transcript sequences in order to obtain consensus sequences.

As an example for the chicken genome, we will use the final_annotated.gtf file and a user-provided gene list in tabular format (such as gene_list.tab) in a given subdirectory, and we will execute the following commands to obtain all conserved transcript sequences from a chicken transcriptome.

mkdir conserved_chicken_transcripts && cd conserved_chicken_transcripts
cp /path/to/annotate_my_genomes/bash_scripts/get-transcripts.sh ./
cp /path/to/final_annotated.gtf ./

# Get gene list in tabular format (gene_id.tab) from final_annotated.gtf. In this example we will obtain all genes in the transcriptome:
perl -lne 'print "@m" if @m=(/((?:gene_id)\s+\S+)/g);' final_annotated.gtf > gene_id.tab
sed -i 's/gene_id//g' gene_id.tab && sed -i 's/"//g' gene_id.tab && sed -i 's/;//g' gene_id.tab
awk '{print $1}' gene_id.tab > gene_id.1.tab
awk '!seen[$0]++' gene_id.1.tab > gene_id.tab && rm gene_id.1.tab       # keep each gene_id once

# Download Gallus gallus v6 genome
genome-download galGal6

# Generate "commands" file using provided list of genes 
awk '{print "./get-transcripts.sh -f final_annotated.gtf -g galGal6.fa -i " $0}' gene_id.tab > commands

# Execute "commands" file
bash commands
  • {gene_id}.fa contains the transcripts per gene in fasta format. {gene_id}.gtf contains the transcripts per gene in GTF format.

  • {gene_id}.cons files contain conserved regions within transcripts and could be suitable for PCR primer picking. Users can go to https://www.ncbi.nlm.nih.gov/tools/primer-blast/ , paste these sequences and pick appropriate primers, specifying the genome to discard off-targets. Additionally, users can compare against a precomputed primer list for each gene here: https://gecftools.epfl.ch/getprime

(5) Annotate and identify homologs in novel proteins from transcriptome

The following steps will annotate novel proteins and identify paralogs among the novel genes identified by the pipeline.

1) Obtaining parsed_results.tab file

Users can download the up-to-date proteome of the species of interest employing the download_proteome_uniprot.pl script, providing the taxid corresponding to the species under study. Then, by using novel-prot.fa (generated in the pipeline), the novel predicted proteins can be blasted against the reference proteome to find paralogs/missed genes. Taxonomy proteome numbers can be found here: https://www.uniprot.org/proteomes/

As an example, for a chicken transcriptome:

## Download the Gallus gallus proteome (taxid 9031)
perl /path/to/annotate_my_genomes/additional_scripts/download_proteome_uniprot.pl 9031

## making blast database
makeblastdb -in 9031.fasta -dbtype 'prot' -out 9031

## Blast novel protein sequences (using 15 threads, max targets= 1, output format=xml)
sed 's/gene=.*//' novel-prot.fa > novel-prot.formatted.fa                                     # format novel-prot.fa first
blastp -db 9031 -max_hsps 1 -max_target_seqs 1 -out blast_results -query novel-prot.formatted.fa -num_threads 15 -outfmt 5

## Parsing blast results (min_identity 90, min_alignment_length 100). Thanks to: https://github.com/sandyjmacdonald/blast_parser. 
# requires BioPython installed: pip3 install biopython 
python /path/to/annotate_my_genomes/additional_scripts/blast_parser.py -i blast_results -e 1e-20 -p 90 -a 100 > parsed_results.txt

## Removing queries with no hits
awk -F'\t' '$2!=""' parsed_results.txt > parsed_results.tab && rm parsed_results.txt
  • parsed_results.tab will contain the proteins matched to the provided reference (barring genes missing from the reference annotation). Proteins with identity close to 100% and located on the same chromosome are good paralog candidates. This file will be used as input in the following steps.

2) Obtaining final_annotated.format.bed file

Now, final_annotated.gtf needs to be converted to BED format as follows:

Users will need cufflinks installed: sudo apt-get install -y cufflinks

# Formatting final_annotated.gtf (to obtain coordinates in bed format)

wget https://raw.githubusercontent.com/TransDecoder/TransDecoder/master/util/gtf_to_bed.pl && chmod 755 gtf_to_bed.pl

gtf_to_bed.pl final_annotated.gtf > final_annotated.bed                                            # gtf_to_bed.pl from TransDecoder
sed 's/;/\t/'g final_annotated.bed > final_annotated.format.bed                                    # convert ";" to TAB
sed -i 's/ID=/\t/'g final_annotated.format.bed                                                     # convert "ID=" to TAB
awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' final_annotated.format.bed > final_annotated.format2.bed  # extract first five columns from bed file
mv final_annotated.format2.bed final_annotated.format.bed                                          # rename files
  • final_annotated.format.bed will contain the annotations in BED format and will be used as input in the following steps.

3) Obtaining Ref_Transcript_Annotation.csv file

To obtain Ref_Transcript_Annotation.csv, check here: https://github.com/cfarkas/annotate_my_genomes/blob/master/README.md#d-post-processing-add-ncbi-annotation-outputs

4) Obtaining eggNOG-mapper annotations

An excellent tool to identify orthologs is eggNOG-mapper v2 (see Cantalapiedra et al., 2021: https://www.biorxiv.org/content/10.1101/2021.06.03.446934v2). This tool performs functional annotation, orthology assignment, and domain prediction, among other characterizations of novel proteins. Users can submit the novel-prot.fa file here: http://eggnog-mapper.embl.de/ and obtain the annotation of these predicted proteins using curated evidence. The output out.emapper.annotations.xlsx will be used as input in the following steps.
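
Alternatively, eggNOG-mapper can be run locally (a sketch assuming a bioconda installation; note that a local run writes a tab-separated out.emapper.annotations file, whereas the .xlsx file used below comes from the web service):

conda install -c bioconda eggnog-mapper
download_eggnog_data.py                              # downloads the eggNOG annotation databases
emapper.py -i novel-prot.fa --output out --cpu 15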

5) Run homolog_parser.py

Finally, a python parser called homolog_parser.py will integrate the parsed_results.tab, final_annotated.format.bed, Ref_Transcript_Annotation.csv and out.emapper.annotations.xlsx inputs, and will return:

  • Novel_protein_with_coordinates.bed and Reference_annotation_with_coordinates.bed files, containing Novel and Reference transcripts annotated with genomic coordinates in BED format.
  • eggNOG-mapper-blastp-intersections.csv, a complete annotation table of novel proteins, characterized by eggNOG-mapper and blastp analysis against UniProt.
  • eggNOG-mapper-blastp-with_coordinates.bed, the latter annotation table, filtered and formatted as a BED file.

homolog_parser.py can be run as follows:

python /path/to/annotate_my_genomes/additional_scripts/homolog_parser.py --Ref Ref_Transcript_Annotation.csv --blastp parsed_results.tab --bed final_annotated.format.bed --eggnog out.emapper.annotations.xlsx

The following requirements need to be installed prior to running the script:

conda install -c anaconda pandas
conda install --channel conda-forge --channel bioconda pybedtools
pip install xlrd==1.2.0

Finally, the gene IDs assigned to novel proteins can be double-checked in NCBI and/or Ensembl to discover genes missing from current annotations or novel paralogs (if a novel gene shares its gene symbol with a gene located within the same locus).

(6) rnaQUAST analysis of my transcriptome

rnaQUAST (https://github.com/ablab/rnaquast) will compare the assembled transcriptome (i.e. NCBI_transcripts.fa coming from final_annotated.gtf) against a reference genome (i.e. galGal6.fa) and a reference GTF (i.e. galGal6_ncbiRefSeq.gtf), and will output useful metrics such as the sensitivity/specificity of the input assembly. The reference GTF and reference genome can be obtained by using the genome-download program as depicted here: https://github.com/cfarkas/annotate_my_genomes#pipeline-outline.

# to execute a full analysis using 16 threads, do the following
rnaQUAST.py --transcripts NCBI_transcripts.fa --reference galGal6.fa --gtf galGal6_ncbiRefSeq.gtf -t 16 -o ./comparisons

"comparisons" folder will contains all the analysis.

(7) I want to plot my differentially expressed genes (DEGs) between two conditions (with replicates)

In this case, we will replicate the above analysis using triplicate gene counts from days 12 and 16 of the GSE114129 study (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114129), also provided in this repo as "gene_counts_GSE114129":

# SHELL
htseq-count -t transcript --stranded=no --format bam E12_1.bam E12_2.bam E12_3.bam E16_1.bam E16_2.bam E16_3.bam final_annotated.gtf > gene_counts

# Open R
R
data <- read.table("gene_counts_GSE114129", header=F)
data1 <- data[,-1]
rownames(data1) <- data[,1]
colnames(data1) <- c("E12_1","E12_2","E12_3","E16_1","E16_2","E16_3")
head(data1)

##### Differential expression #####

library(edgeR)
library(EnhancedVolcano)

d <- DGEList(counts=data1, group=c(1,1,1,2,2,2), lib.size=colSums(data1))
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
et <- exactTest(d)
out <- topTags(et, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05 & abs(out$table$logFC) >= 0.5
out[keep,]
write.table(out, file = "edgeR-12_vs_16.csv", sep = ",", col.names = NA, qmethod = "double")

##### Plotting #####
pdf("DEGs_plotting.pdf", height=11, width=15)
EnhancedVolcano(out$table,
  lab = rownames(out$table),
  x = 'logFC',
  y = 'FDR',
  xlim = c(-8, 8),
  title = 'E12 versus E16',
  pCutoff = 10e-04,
  FCcutoff = 1.5,
  transcriptLabSize = 4.0,
  transcriptPointSize = 2.0)
dev.off()

(8) Differential exon usage analysis with DEXSeq

We will present a custom differential exon usage analysis between two replicates of chicken subcommissural organ transcriptomes at the HH23/HH30 stages, employing DEXSeq (https://bioconductor.org/packages/release/bioc/html/DEXSeq.html). For simplicity, we will name the HH23 samples 41-42 and the HH30 samples 71-72 (4 and 7 indicate days of development). Prior to quantifying the Illumina BAM files, we need to convert the GTF file to GFF with a custom python script from DEXSeq (provided in this repository as dexseq_prepare_annotation_fixed.py) as follows:

## Obtaining gff file for counting by filtering unstranded transcripts in GTF file
awk 'BEGIN{OFS="\t"}{if($7!=".") print $0}' final_annotated.gtf > filtered.gtf

## Obtaining GFF script for counting
python dexseq_prepare_annotation_fixed.py filtered.gtf filtered.gff

Now we can count reads per exon as follows:

# Counting Reads with DEXSeq counting script (also provided in this repository)

python ./dexseq_count.py filtered.gff 41.sorted.bam DEXSeq_41.txt --format=bam
python ./dexseq_count.py filtered.gff 42.sorted.bam DEXSeq_42.txt --format=bam
python ./dexseq_count.py filtered.gff 71.sorted.bam DEXSeq_71.txt --format=bam
python ./dexseq_count.py filtered.gff 72.sorted.bam DEXSeq_72.txt --format=bam

The analysis is performed in R by loading the DEXSeq library and the quantified exon files (in .txt format) as follows:

R
library(DEXSeq)
sampleTable = data.frame(row.names = c( "DEXSeq_41.txt", "DEXSeq_42.txt", "DEXSeq_71.txt", "DEXSeq_72.txt"),
condition = c("HH23", "HH23", "HH30", "HH30"),
libType = c( "single-end", "single-end", "single-end", "single-end") )
sampleTable
suppressPackageStartupMessages( library( "DEXSeq" ) )
countFiles = list.files(pattern=".txt$", full.names=TRUE)

countFiles
flattenedFile = list.files(pattern="gff$", full.names=TRUE)
flattenedFile

dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile )
colData(dxd)
head( counts(dxd), 5 )
split( seq_len(ncol(dxd)), colData(dxd)$exon )
head( featureCounts(dxd), 5 )
head( rowRanges(dxd), 3 )
sampleAnnotation( dxd )

dxd = estimateSizeFactors( dxd )
dxd = estimateDispersions( dxd )
plotDispEsts( dxd )
dxd = testForDEU( dxd )
dxd = estimateExonFoldChanges( dxd, fitExpToVar="condition")
dxr1 = DEXSeqResults( dxd )
dxr1
elementMetadata(dxr1)$description
table ( dxr1$padj < 0.1 )
table ( tapply( dxr1$padj < 0.1, dxr1$groupID, any ) )
write.csv( dxr1, file="HH23_vs_HH30.csv" )
#to plot, use groupID names#

# For figure 4
pdf("chr21_CDC42-.pdf", height=6, width=15)
plotDEXSeq( dxr1, "chr21_CDC42-", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
dev.off()
pdf("chr2_MAP4+.pdf", height=6, width=15)
plotDEXSeq( dxr1, "chr2_MAP4+", legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )
dev.off()


##### Plotting #####
library(EnhancedVolcano)
pdf("DEXSeq_HH23_vs_HH30_plotting.pdf", height=16, width=15)
EnhancedVolcano(dxr1,
  lab = rownames(dxr1),
  x = 'log2fold_HH23_HH30',
  y = 'padj',
  xlim = c(-8, 8),
  title = 'HH23 vs HH30',
  pCutoff = 0.001,
  FCcutoff = 1,
  transcriptLabSize = 4.0,
  transcriptPointSize = 2.0)
dev.off()

The latter commands give typical plots of differential exon usage (see the chr21_CDC42- and chr2_MAP4+ pdf files), including a plot of the global differential exon usage within transcriptomes, provided by the EnhancedVolcano package.

(9) Ex90N50 value of my transcriptome:

By definition, the Ex90N50 value refers to the contig N50 value calculated over the set of transcripts representing 90% of the expression data. Using an assembled transcriptome (i.e. NCBI_transcripts.fa) and illumina reads, it can be calculated as follows:

  1. Use salmon to quantify reads mapped to the transcriptome:
# Activate the proper conda environment
conda activate salmon
#
salmon index -t NCBI_transcripts.fa -i NCBI_index -p 40
salmon quant -i ./NCBI_index/ -l A -1 forward.illumina.fastq.gz -2 reverse.illumina.fastq.gz -o NCBI_quantification
  2. Then, to find the transcripts that account for 90% of the expressed transcripts, do:
cd NCBI_quantification/
sort -k 5 -nr quant.sf > sorted
awk 'NR==FNR{a = a + $5;next} {c = ($5/a)*100;print $1"\t"$5"\t"c }' sorted sorted > percentages
awk '{s+=$3} END {print s}' percentages
awk '\
    BEGIN { \
        s = 0; \
    } \
    { \
        s += $3; \
        if (s >= 90) { \
            print $0; \
            exit; \
        } \
    }' percentages

The latter will output the last transcript that should be accounted for in the Ex90N50 calculation. For example:

STRG.10135.3    1675.780        0.00157262

The third column of this line (0.00157262) can be used to filter the 90% of the expressed transcriptome as follows:

awk '$3>=0.00157262 {print $1}' percentages > percentages_filtered
  3. Last, intersect the transcriptome (fasta) with the "percentages_filtered" file and calculate the N50 as follows:
seqkit fx2tab NCBI_transcripts.fa > NCBI_transcripts.tab
grep -w -F -f percentages_filtered NCBI_transcripts.tab > transcripts_filtered.tab
seqkit tab2fx transcripts_filtered.tab > Ex90N50.fa
prinseq-lite.pl -fasta Ex90N50.fa -stats_all

N50 value (in this case Ex90N50) will be reported along with other statistics.
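
Alternatively, seqkit can report the N50 directly through its extended statistics (an equivalent check on the same filtered fasta; the -a flag prints all statistics, including N50):

seqkit stats -a Ex90N50.fa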
