This repository contains code used in the Master's Thesis titled:
'Functional and computational benchmarking of pangenome graph construction tools and development of a graph-based bacterial gene caller.'
- python3
- biopython
- numpy
- R version 3+
Analyse the proportion of genic read bases aligned by Graphaligner.
python analyse_gaf_genic.py gaffile blastfile reads.fa outfile.txt
Input/Output:
gaffile
: graphical alignment file (GAF) produced by Graphalignerblastfile
: BLAST output file in tabular format generated from exact alignment of gene sequences from Lees et al. (aligned files are '*_dna.fa') to respective simulated genomes (blastn -outfmt 6 -perc_identity 100 -qcov_hsp_perc 100
)reads.fa
: read sequences used in alignment in FASTA format, generated by Nanosim-Houtfile.txt
: output summary file
Analyse the proportion of total read bases aligned by GraphAligner.
python analyse_gaf_total.py gaffile outfile.txt
Input/Output
gaffile
: graphical alignment file produced by Graphaligneroutfile.txt
: output summary file
Analyse node length and degree from a GFA file.
python analyse_nodes.py gfafile outpref
Input/Output
gfafile
: GFA file produced from graph constructionoutpref
: output prefix for distribution and summary files
Analyse unitig frequencies from a Bifrost graph.
python analyse_unitig.py gfa_dist.txt colours.tsv output
Input/Output
gfa_dist.txt
: Distribution file generated from analyse_nodes.pycolours.tsv
: TSV file produced from Bifrost query by querying constituent unitigs against GFA itselfoutput
: output file of results
R script for analysis of output files from analyse_untig.py, generating unitig frequency plot.
To use, specify input directory at #input directory
, with files in .txt format.
This script contains two functions for analysing ORF calls made by a gene caller.
Checks presence of a called ORF in forward and reverse complements of a set of reference source sequences.
(within python REPL)
check_ORF_in_ref(ref_fasta_for, ref_fasta_rev, query_fasta, outfasta)
Input/Output
ref_fasta_for
: Multi-FASTA of reference source sequences (forward strand)ref_fasta_rev
: Multi-FASTA of reference source sequences (reverse strand)query_fasta
: ORF calls in FASTA formatoutfasta
: output FASTA containing ORFs not present in forward or reverse sequences
Checks presence of a known gene in longer called ORFs.
(within python REPL)
check_ref_in_query(ref_fasta, query_fasta, outfile)
Input/Output
ref_fasta
: Multi-FASTA of known genesquery_fasta
: ORF calls in FASTA formatoutfile
: output FASTA containing known genes not found in any called ORFs
Compares known genes against called ORFs by Prodigal or ggCaller in S. pneumoniae capsular biosynthetic loci (CBL). Prints recall and precision, and returns list of unmatched sequences.
python compare_gene_calls.py reference_genes gene_calls caller_type group
Input/Output
reference_genes
: known genes in FASTA formatgene_calls
: ORF calls by Prodigal or ggCaller in FASTA formatcaller_type
: specify which gene caller used (ggCaller = ggc, Prodigal = prod)group
: CBL group used in comparison.
Creates FASTA of unitigs from GFA files
(within python REPL)
gfa_to_fasta(gfafile)
Input/Output
gfafile
: GFA file generated from graph construction- output is FASTA with same file prefix as gfafile
R script for analysis of RTAB file generated by Panaroo.
For usage, specify input directory at #input directory
, with files in .RTAB format.