ANNEXA wiki
See an example of report.
ANNEXA reports QC plots at 3 levels of annotation: gene, transcript and exon.
Nomenclature of input files:
- bamsample: corresponds to the input bam file(s) (e.g. 501Mel_1-3_OSS_CM-R_R1.sorted.bam)
- refannot: corresponds to the input reference annotation used to launch ANNEXA (e.g. gencode.v46.annotation.gtf.gz)
- refgenome: corresponds to the input reference genome used to launch ANNEXA (e.g. Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa)
- Page 3: Number of genes
Number of genes per biotype (lncRNAs/mRNAs) and source (known, i.e. from refannot, or novel).
- Page 4: Gene length distribution
Distribution of gene lengths per biotype (lncRNAs/mRNAs) and source. Gene length is computed by summing the exon lengths of the longest isoform.
- Page 5: Proportion of mono versus multi-isoform genes
Proportion of genes with 1 (strides) versus at least 2 (no stride) isoforms, according to biotype and source.
- Page 6: Distribution of gene counts
Distribution of gene expression per biotype (lncRNAs/mRNAs) and source across all bamsample files (sum of gene_count).
- Page 7: Breadth of expression
Number of genes expressed in N samples (gene_count > 1).
- Page 8: Distribution of gene counts with respect to breadth of expression
Number of novel genes (log scale) based on isoform number and number of samples with gene_count > 1.
- Page 9: Number of 5' and/or 3' gene extensions
Number of known genes (per biotype) extended by novel isoforms at the 3'-end (1st stride), at the 5'-end (2nd stride) or at both the 5' and 3'-ends (two strides).
- Page 10: Distribution of gene extension lengths
Distribution of gene extension lengths (at the genomic level) for genes from the input annotation that have novel isoforms.
- Page 12: Number of transcripts
Number of transcripts, according to biotype and source: transcripts already in the input annotation (darker), novel isoforms of known genes (intermediate) and novel transcripts from novel genes (lighter).
- Page 13: Transcript length distribution
Same as page 4, at the transcript level.
- Page 14: Proportion of mono versus multi exonic transcripts
Proportion of transcripts with 1 exon (stride) versus >= 2 exons (no stride), according to source and biotype.
- Page 16: Number of exons
Number of exons per biotype (lncRNAs/mRNAs) and source (known, i.e. from refannot, or novel).
- Page 17: Exon length distribution
Same as page 4, at the exon level.
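
Two of the gene-level quantities described above (gene length on page 4 and breadth of expression on page 7) can be illustrated with a minimal Python sketch. This is not ANNEXA's own code: the function names `gene_lengths` and `breadth_of_expression` are hypothetical, the input is assumed to be standard tab-separated GTF lines with quoted `gene_id` and `transcript_id` attributes, and the count table is assumed to be a plain dictionary of per-sample gene_count values.

```python
import re
from collections import Counter, defaultdict


def gene_lengths(gtf_lines):
    """Return {gene_id: length}, taking the summed exon length of the
    longest isoform of each gene (hypothetical helper, not ANNEXA code)."""
    tx_len = defaultdict(int)  # (gene_id, transcript_id) -> summed exon length
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GTF header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to the length
        start, end = int(fields[3]), int(fields[4])
        # Parse quoted attributes such as: gene_id "G1"; transcript_id "T1";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        key = (attrs["gene_id"], attrs["transcript_id"])
        tx_len[key] += end - start + 1  # GTF coordinates are 1-based, inclusive

    lengths = {}
    for (gene_id, _), length in tx_len.items():
        lengths[gene_id] = max(lengths.get(gene_id, 0), length)
    return lengths


def breadth_of_expression(counts, min_count=1):
    """counts: {gene_id: {sample: gene_count}} (assumed layout).
    Return {n_samples: n_genes}, where a gene is considered expressed
    in a sample when gene_count > min_count."""
    breadth = Counter()
    for per_sample in counts.values():
        n_expressed = sum(1 for c in per_sample.values() if c > min_count)
        breadth[n_expressed] += 1
    return dict(breadth)


if __name__ == "__main__":
    example_gtf = [
        'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
        'chr1\tsrc\texon\t201\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
        'chr1\tsrc\texon\t1\t150\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    ]
    print(gene_lengths(example_gtf))
    print(breadth_of_expression({"G1": {"s1": 5, "s2": 0}, "G2": {"s1": 2, "s2": 3}}))
```

The length is taken per transcript first and only then maximized per gene, so overlapping exons from different isoforms are never added together.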
With the ANNEXA --filter option, the workflow provides 2 sets of .gtf annotations:
- FULL (or exhaustive): includes all novel transcripts without any filtering.
- FILTER (or stringent): includes novel transcripts respecting filtering options.
For the latter, all novel transcripts are filtered based on an NDR cutoff (Bambu tool) and/or a TFK cutoff (Bambu and StringTie).
ANNEXA filtering options can be summarized in the following figure:
NB: The option --bambu_singleexon, which is TRUE by default, may include many single-exon transcripts (SETs) from Bambu and potentially a high number of false positives in the FULL annotation set.