15. Understanding the (more complex) Output
The following descriptions use the output files from the O104:H4 tutorial as examples.
(See the RedDog Tutorial. Some output lines have been edited to improve readability.)
a) Run statistics - summary: <reference>_AllStats.txt
2011C-3493_full_AllStats.txt #(header and first entry)
Isolate Cover%_CP003289 Cover%_CP003290 Cover%_CP003291 Cover%_CP003292 Depth_CP003289 Depth_CP003290 Depth_CP003291 Depth_CP003292 Mapped%_CP003289 Mapped%_CP003290 Mapped%_CP003291 Mapped%_CP003292 Mapped%_Total Total_Reads Insert_Mean Insert_StDev Length_Max Base_Qual_Mean Base_Qual_StDev A_% T_% G_% C_% N_%
11-4632_C3 99.3629360507 99.7368539935 97.2782516135 98.9670755326 37.0178053113 36.2147637327 35.9647353768 19.8708414873 97.0259499571 1.62915023095 1.31985026242 0.02309954955 99.99805 2000000 499.9758 33.2385 100 40.0000 0.0000 24.6243 24.7370 25.2986 25.3391 0.0011
Reports on all replicons found in the reference file
Cover%_<replicon>: percentage of bases of the reference with at least one read mapped.
Depth_<replicon>: average depth of reads for bases with at least one read.
Mapped%_<replicon>: percentage of the total reads mapped to each replicon.
Mapped%_Total: percentage of the total reads mapped to any replicon.
Total_Reads: total number of reads (mapped and unmapped).
Insert_Mean (and _StDev): estimated size of the gap between paired-end reads.
Length_Max: longest read length.
Base_Qual_Mean (and _StDev): average quality scores for the read set.
A_%, T_%, G_%, C_%, and N_%: percentage of each nucleotide in the read set.
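As a minimal sketch of how these columns could be read programmatically, the following parses the table into one dictionary per isolate. It assumes the file is whitespace-delimited (for example, tab-separated) with the header on a single line and no spaces inside isolate names; `parse_stats_table` is a hypothetical helper, not part of RedDog.

```python
def parse_stats_table(text):
    """Parse a whitespace-delimited stats table into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
        # First column is the isolate name; the rest are treated as numbers.
        record = {header[0]: fields[0]}
        record.update(
            {name: float(value) for name, value in zip(header[1:], fields[1:])}
        )
        records.append(record)
    return records


# Trimmed to five of the columns shown above, using the example entry's values.
example = """\
Isolate Cover%_CP003289 Depth_CP003289 Mapped%_Total Total_Reads
11-4632_C3 99.3629360507 37.0178053113 99.99805 2000000
"""

records = parse_stats_table(example)
# Collect the per-replicon coverage columns for the first isolate.
cover = {k: v for k, v in records[0].items() if k.startswith("Cover%_")}
print(records[0]["Isolate"], cover)
```

This sketch converts every column after the isolate name to a number, so it would need adjusting for tables that contain a non-numeric column.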
b) Run statistics by replicon: <reference>_<replicon>_RepStats.txt
2011C-3493_full_CP003289_RepStats.txt
Isolate Cover%_CP003289 Depth_CP003289 Mapped%_CP003289 Mapped%_Total Total_Reads SNPs Hets_Removed Indels Ingroup/Fail
11-4632_C3 99.3629360507 37.0178053113 97.0259499571 99.99805 2000000 88 16 3 i
Report for each replicon in the reference (phylogeny run), or specified replicon(s) (pangenome)
Cover%_<replicon>: percentage of bases of the reference sequence with at least one read mapped.
Depth_<replicon>: average depth of reads for bases with at least one read.
Mapped%_<replicon>: percentage of the total reads mapped to each replicon.
Mapped%_Total: percentage of the total reads mapped to any replicon.
Total_Reads: total number of reads (mapped and unmapped).
SNPs: total SNP count (does not include conservation filtering).
Hets_Removed: number of heterozygous calls filtered from the SNP set (a high count relative to SNPs could indicate contamination).
Indels: number of indel variants called.
Ingroup/Fail (pangenome) or Ingroup/Outgroup/Fail (phylogeny): any isolate that fails one or more filters (percentage cover, depth or percentage mapped) is set to ‘f’. Otherwise it is set to ‘i’ (ingroup), unless (for phylogeny runs) its number of SNPs is more than 2 SD above the mean SNP count for all isolates that did not fail, in which case it is marked as an outgroup.
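The Ingroup/Outgroup/Fail rule can be sketched as follows. This is an illustration of the description above, not RedDog's own code. Several details are assumptions: the threshold values passed in, the use of the sample standard deviation, the `'o'` label for outgroup isolates, and all isolates other than 11-4632_C3 (which are invented for the example).

```python
from statistics import mean, stdev


def classify_isolates(stats, min_cover, min_depth, min_mapped, phylogeny=True):
    """Label each isolate 'f' (fail), 'i' (ingroup) or 'o' (outgroup, assumed label)."""
    passed = {
        name: s
        for name, s in stats.items()
        if s["cover"] >= min_cover
        and s["depth"] >= min_depth
        and s["mapped"] >= min_mapped
    }
    labels = {name: "f" for name in stats if name not in passed}

    # The outgroup cutoff uses only the isolates that did not fail.
    snp_counts = [s["snps"] for s in passed.values()]
    if phylogeny and len(snp_counts) >= 2:
        cutoff = mean(snp_counts) + 2 * stdev(snp_counts)
    else:
        cutoff = float("inf")

    for name, s in passed.items():
        labels[name] = "o" if s["snps"] > cutoff else "i"
    return labels


def row(cover, depth, mapped, snps):
    return {"cover": cover, "depth": depth, "mapped": mapped, "snps": snps}


stats = {
    "11-4632_C3": row(99.36, 37.02, 97.03, 88),  # values from the example entry
    # Everything below is invented purely to exercise the rule.
    "iso_a": row(99.1, 40.0, 96.0, 90),
    "iso_b": row(99.0, 35.0, 95.5, 85),
    "iso_c": row(98.9, 42.0, 96.2, 92),
    "iso_d": row(99.2, 38.0, 97.1, 87),
    "iso_e": row(99.4, 36.0, 96.8, 89),
    "divergent": row(98.5, 39.0, 95.0, 900),
    "low_cover": row(40.0, 12.0, 30.0, 15),
}

# Threshold values here are placeholders, not RedDog defaults.
labels = classify_isolates(stats, min_cover=50.0, min_depth=10.0, min_mapped=50.0)
print(labels)
```

Note that with only a handful of passing isolates, a single divergent isolate inflates the standard deviation enough that it may not exceed the cutoff, so the rule is most informative on larger isolate sets.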
c) SNP allele matrix: <reference>_<replicon>_alleles_var[_cons0.95].csv
Pos,Ref,11-02030,11-02033-1, … Ec11-5538,Ec11-5603,Ec11-6006,GOS1,GOS2,H112180280,H112180282,H112180283,H112180540,H112180541,LB226692,ON2011,TY-2482
9878,G,G, … C,G,G,G,G,G,G,G,G,G,G
15186,C,C, … C,C,C,C,C,C,C,G,C,C,C
Pos: position of the SNP in the reference sequence.
Ref: SNP call in the reference sequence.
Remaining columns: the allele called at that position in each isolate.
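A short sketch of reading the allele matrix and listing, for each SNP position, the isolates that carry a non-reference allele. The isolate names `isoA` to `isoC` are hypothetical stand-ins, because the example rows above are elided. Ignoring any call that is not A, C, G or T is also an assumption about how uncalled positions are written.

```python
import csv
import io

CALLED = {"A", "C", "G", "T"}


def variant_isolates(csv_text):
    """Map each SNP position to the isolates whose call differs from Ref."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    isolates = header[2:]  # columns after Pos and Ref
    result = {}
    for row in reader:
        pos, ref, calls = int(row[0]), row[1], row[2:]
        result[pos] = [
            iso
            for iso, call in zip(isolates, calls)
            if call in CALLED and call != ref
        ]
    return result


# Positions and reference bases from the example; isolate columns are invented.
matrix = "Pos,Ref,isoA,isoB,isoC\n9878,G,G,C,G\n15186,C,C,C,G\n"

variants = variant_isolates(matrix)
print(variants)
```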
d) Percentage cover by gene: <reference>_CoverMatrix.csv
replicon__gene,01-09591,04-8351, … H112180540,H112180541,LB226692,ON2011,TY-2482,ON2010
CP003289__O3K_00005,100.0,100.0, … 100.0,100.0,100.0,100.0,100.0,100.0
CP003289__O3K_00010,100.0,100.0, … 95.0090744102,100.0,100.0,100.0,100.0,100.0
replicon__gene: source of the gene; reference sequence and gene tag separated by ‘__’.
If the genes in the GenBank file do not have a locus tag, a tag based on the gene’s position is used (these tags are not added to the GenBank file!).
Each cell in the matrix gives the percentage of the gene’s bases covered by at least one read.
The DepthMatrix.csv has exactly the same format, except cell values are the average depth of reads for the gene (for bases with at least one read).
The PresenceAbsence.csv also has the same format, but cell values are either ‘1’ for present genes (coverage >= 95% and depth >= 5) or ‘0’ for absent genes (coverage < 95% or depth < 5).
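The relationship between the three matrices can be sketched as below: a gene is called present when its coverage is at least 95% and its depth at least 5. The function name, the isolate names and the depth values are hypothetical. The coverage values 100.0 and 95.0090744102 are taken from the example rows.

```python
def presence_absence(cover, depth, min_cover=95.0, min_depth=5.0):
    """Return 1/0 per gene and isolate from matching cover and depth matrices."""
    return {
        gene: {
            isolate: int(
                cover[gene][isolate] >= min_cover
                and depth[gene][isolate] >= min_depth
            )
            for isolate in cover[gene]
        }
        for gene in cover
    }


# Matrices keyed by replicon__gene, then by isolate (isolate names invented).
cover = {
    "CP003289__O3K_00005": {"isoA": 100.0, "isoB": 100.0},
    "CP003289__O3K_00010": {"isoA": 95.0090744102, "isoB": 100.0},
}
depth = {
    "CP003289__O3K_00005": {"isoA": 36.5, "isoB": 3.2},  # invented depths
    "CP003289__O3K_00010": {"isoA": 30.1, "isoB": 41.7},
}

calls = presence_absence(cover, depth)
print(calls)
```

Here `isoB` is scored absent for O3K_00005 despite full coverage, because its depth falls below 5.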
e) SNP consequences: <reference>_<replicon>_alleles_var[_cons0.95]_consequences.txt
SNP ref alt change gene ancestralCodon derivedCodon ancestralAA derivedAA product ntInGene codonInGene posInCodon
9878 G C ns O3K_25622 CTC GTC L V SogL protein 1516 506 1
15186 C G ns O3K_25647 CGG CCG R P lipopolysaccharide core heptose(II)-phosphate phosphatase 548 183 2
For each SNP called in the SNP allele matrix, there will be an entry in the consequences table; these consequences are also added to a new version of the GenBank file for the reference sequence. A ‘change’ can be synonymous (s), non-synonymous (ns) or intergenic. If the SNP occurs in a gene, further information is provided (as shown).
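The last three columns follow from ntInGene by simple arithmetic, and the s/ns call follows from translating the two codons. The sketch below checks both against the example rows. It assumes the standard amino acid assignments (the bacterial translation table shares them), and the function names are hypothetical rather than RedDog's own.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_position(nt_in_gene):
    """Convert a 1-based nucleotide offset into (codonInGene, posInCodon)."""
    codon_in_gene = (nt_in_gene - 1) // 3 + 1
    pos_in_codon = (nt_in_gene - 1) % 3 + 1
    return codon_in_gene, pos_in_codon


def classify_change(ancestral_codon, derived_codon):
    """Return ('s' or 'ns', ancestral amino acid, derived amino acid)."""
    ancestral_aa = CODON_TABLE[ancestral_codon]
    derived_aa = CODON_TABLE[derived_codon]
    change = "s" if ancestral_aa == derived_aa else "ns"
    return change, ancestral_aa, derived_aa


# The two example rows: (ntInGene, ancestralCodon, derivedCodon).
for nt, ancestral, derived in [(1516, "CTC", "GTC"), (548, "CGG", "CCG")]:
    print(nt, codon_position(nt), classify_change(ancestral, derived))
```

In both example rows the codon change is the complement of the ref/alt pair (G to C on the reference appears as C to G in the codon), which is consistent with these genes lying on the reverse strand.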