ANNEXA wiki

tderrien edited this page Dec 20, 2024 · 3 revisions

WIKI page for ANNEXA

ANNEXA report description

See an example report.

ANNEXA reports QC plots at three levels of annotation: gene, transcript and exon.

Nomenclature of input files:

  • bamsample: corresponds to the input bam file(s) (e.g. 501Mel_1-3_OSS_CM-R_R1.sorted.bam)
  • refannot: corresponds to the input reference annotation used to launch ANNEXA (e.g. gencode.v46.annotation.gtf.gz)
  • refgenome: corresponds to the input reference genome used to launch ANNEXA (e.g. Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa)

⚠️ Make sure the genome assembly version matches the transcriptome file, including the chromosome naming (with or without the chr prefix).
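One way to check chromosome naming before launching the workflow is to compare the sequence names found in the two files. The sketch below is not part of ANNEXA: the function names are hypothetical, and it only assumes standard FASTA header lines and tab-separated GTF records.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def fasta_seqnames(lines):
    """Sequence names from FASTA headers: text after '>' up to the first whitespace."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_seqnames(lines):
    """Sequence names from the first column of GTF records (comments skipped)."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def naming_report(fasta_lines, gtf_lines):
    """Names shared by both files, and names present only in the annotation."""
    genome = fasta_seqnames(fasta_lines)
    annot = gtf_seqnames(gtf_lines)
    return {
        "shared": sorted(genome & annot),
        "only_in_annotation": sorted(annot - genome),
    }


# Tiny in-memory demo: the annotation uses "3" where the genome would use "chr3".
fa = [">chr1 primary assembly\n", "ACGT\n", ">chr2\n", "ACGT\n"]
gtf = [
    "#!genome-build example\n",
    'chr1\tsrc\tgene\t1\t10\t.\t+\t.\tgene_id "g1";\n',
    '3\tsrc\tgene\t1\t10\t.\t+\t.\tgene_id "g2";\n',
]
print(naming_report(fa, gtf))
# {'shared': ['chr1'], 'only_in_annotation': ['3']}
```

On real inputs, the two line iterables can come from `open_text(refgenome)` and `open_text(refannot)`. A non-empty `only_in_annotation` list suggests a naming or assembly mismatch worth checking.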

GENE characterization

  • Page 3: Number of genes

Number of genes per biotype (lncRNAs/mRNAs) and source (from refannot, i.e. known, or novel).

  • Page 4: Gene length distribution

Distribution of gene lengths per biotype (lncRNAs/mRNAs) and source. Gene length is computed by summing the exon lengths of the longest isoform.
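The gene length definition above can be sketched as follows. This is an illustration, not ANNEXA's own code: the function name and input layout are hypothetical, "longest isoform" is assumed to mean the isoform with the largest summed exon length, and coordinates are assumed to be GTF-style (1-based, inclusive).

```python
from collections import defaultdict


def gene_lengths(exons):
    """Return {gene_id: summed exon length of its longest isoform}.

    exons: iterable of (gene_id, transcript_id, start, end) tuples with
    1-based inclusive coordinates; transcript ids are assumed unique.
    """
    tx_len = defaultdict(int)
    tx_gene = {}
    for gene_id, tx_id, start, end in exons:
        tx_len[tx_id] += end - start + 1  # inclusive coordinates
        tx_gene[tx_id] = gene_id

    lengths = {}
    for tx_id, length in tx_len.items():
        gene_id = tx_gene[tx_id]
        lengths[gene_id] = max(lengths.get(gene_id, 0), length)
    return lengths


exons = [
    ("geneA", "tx1", 100, 199),  # 100 bp
    ("geneA", "tx1", 300, 349),  # 50 bp, so tx1 = 150 bp
    ("geneA", "tx2", 100, 399),  # 300 bp, so tx2 = 300 bp
    ("geneB", "tx3", 10, 19),    # 10 bp
]
print(gene_lengths(exons))
# {'geneA': 300, 'geneB': 10}
```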

  • Page 5: Proportion of mono versus multi-isoform genes

Proportion of genes with 1 (striped) versus at least 2 (not striped) isoforms, according to biotype and source.

  • Page 6: Distribution of gene counts

Distribution of gene expression per biotype (lncRNAs/mRNAs) and source across all bamsample files (sum of gene_count).

  • Page 7: Breadth of expression

Number of genes expressed in N samples, where a gene counts as expressed in a sample if gene_count > 1.
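As a minimal sketch of this breadth-of-expression tally, assuming a per-gene list of counts with one value per sample (the function name and input layout are hypothetical, not ANNEXA's internal format):

```python
from collections import Counter


def breadth_of_expression(counts, min_count=1):
    """Return {n_samples: n_genes}.

    counts: {gene_id: [gene_count per sample]}.
    A gene is expressed in a sample when its count is strictly above
    min_count (gene_count > 1 with the default).
    """
    breadth = Counter()
    for per_sample in counts.values():
        n_expressed = sum(1 for c in per_sample if c > min_count)
        breadth[n_expressed] += 1
    return dict(sorted(breadth.items()))


counts = {
    "g1": [5, 0, 2],   # expressed in 2 samples
    "g2": [1, 1, 1],   # expressed in 0 samples (1 is not > 1)
    "g3": [10, 3, 4],  # expressed in 3 samples
}
print(breadth_of_expression(counts))
# {0: 1, 2: 1, 3: 1}
```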

  • Page 8: Distribution of gene counts with respect to breadth of expression

Number of novel genes (log scale) by number of isoforms and number of samples with gene_count > 1.

  • Page 9: Number of 5' and/or 3' gene extensions

Number of known genes (by biotype) extended by novel isoforms at the 3' end (1st stripe), at the 5' end (2nd stripe), or at both the 5' and 3' ends (two stripes).

  • Page 10: Distribution of gene extension lengths

Distribution of gene extension lengths (at the genomic level) for genes from the input annotation that have novel isoforms.
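To make "extension at the genomic level" concrete, the sketch below measures how far novel isoforms push a known gene's boundaries, taking strand into account. It is an assumption about how such lengths can be derived, not ANNEXA's actual implementation, and the function name is hypothetical.

```python
def gene_extensions(known, novel_spans):
    """Return (ext_5prime, ext_3prime) in bp for one known gene.

    known: (start, end, strand) of the gene in the input annotation.
    novel_spans: iterable of (start, end) for its novel isoforms.
    Coordinates are genomic, with start <= end.
    """
    start, end, strand = known
    spans = list(novel_spans)
    new_start = min([start] + [s for s, _ in spans])
    new_end = max([end] + [e for _, e in spans])

    left = start - new_start  # growth towards lower coordinates
    right = new_end - end     # growth towards higher coordinates

    # On the minus strand the 5' end sits at the higher coordinate.
    return (left, right) if strand == "+" else (right, left)


novel = [(900, 1800), (1200, 2150)]
print(gene_extensions((1000, 2000, "+"), novel))  # (100, 150)
print(gene_extensions((1000, 2000, "-"), novel))  # (150, 100)
```

A gene with a non-zero value in only one position falls in the 5'-only or 3'-only category of page 9; a gene with two non-zero values is extended at both ends.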

TRANSCRIPT level

  • Page 12: Number of transcripts

Number of transcripts, according to biotype and source: transcripts already in the input annotation (darker), novel isoforms of known genes (intermediate) and novel transcripts from novel genes (lighter).

  • Page 13: Transcript length distribution

Same as page 4, at the transcript level.

  • Page 14: Proportion of mono- versus multi-exonic transcripts

Proportion of transcripts with 1 (striped) versus >= 2 exons (not striped), according to source and biotype.
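The mono- versus multi-exonic split can be sketched by counting exon records per transcript. The function name and input layout below are hypothetical; the input is assumed to hold one transcript_id per exon feature of the annotation.

```python
from collections import Counter


def exonic_proportions(exon_transcript_ids):
    """Return the fraction of mono- and multi-exonic transcripts.

    exon_transcript_ids: iterable with one transcript_id per exon record.
    """
    exons_per_tx = Counter(exon_transcript_ids)
    n_tx = len(exons_per_tx)
    if n_tx == 0:
        raise ValueError("no exon records provided")
    mono = sum(1 for n in exons_per_tx.values() if n == 1)
    return {"mono": mono / n_tx, "multi": (n_tx - mono) / n_tx}


ids = ["tx1", "tx1", "tx2", "tx3", "tx3", "tx3", "tx4"]
print(exonic_proportions(ids))
# {'mono': 0.5, 'multi': 0.5}
```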

EXON level

  • Page 16: Number of exons

Number of exons per biotype (lncRNAs/mRNAs) and source (from refannot, i.e. known, or novel).

  • Page 17: Exon length distribution

Same as page 4, at the exon level.

Filtering operations

With the ANNEXA --filter option, the workflow provides two sets of .gtf annotations:

  • FULL (or exhaustive): includes all novel transcripts without any filtering.
  • FILTER (or stringent): includes novel transcripts respecting filtering options.

For the latter, all novel transcripts are filtered based on an NDR cutoff (Bambu) and/or a TFK cutoff (Bambu and StringTie). ANNEXA filtering options are summarized in the following figure:

[Figure: ANNEXA_filter_operation]

NB: The --bambu_singleexon option, which is TRUE by default, may include many single-exon transcripts (SETs) from Bambu and, potentially, a high number of false positives in the FULL annotation set.
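The "and/or" combination of the two cutoffs can be illustrated with a small predicate. This is a sketch under stated assumptions, not ANNEXA's code: the function name, the `require_both` switch and the cutoff values are hypothetical, and it assumes that a lower NDR and a higher TFK score both indicate a more reliable transcript. Check the ANNEXA documentation for the actual option names and comparison rules.

```python
def passes_filter(ndr, tfk, ndr_cutoff, tfk_cutoff, require_both=False):
    """Decide whether a novel transcript is kept in the FILTER set.

    Assumptions (not confirmed by the wiki): a transcript passes the NDR
    test when ndr < ndr_cutoff and the TFK test when tfk >= tfk_cutoff.
    require_both=True models "and"; the default models "or".
    """
    ndr_ok = ndr < ndr_cutoff
    tfk_ok = tfk >= tfk_cutoff
    return (ndr_ok and tfk_ok) if require_both else (ndr_ok or tfk_ok)


# Placeholder cutoffs chosen only for the example.
print(passes_filter(0.1, 0.05, ndr_cutoff=0.2, tfk_cutoff=0.2))
# True (passes the NDR test only, "or" mode)
print(passes_filter(0.1, 0.05, ndr_cutoff=0.2, tfk_cutoff=0.2, require_both=True))
# False (fails the TFK test, "and" mode)
```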