
Outputs

SionBayliss edited this page Feb 27, 2019 · 8 revisions

PIRATE produces a number of output files. These are summarised below:

  • PIRATE.pangenome_summary.txt - short summary of the number and frequency of genes in the pangenome.

  • PIRATE.log - PIRATE log file.

  • PIRATE.gene_families.ordered.tsv - tabular summary of all gene families. One entry per gene family. Families that have been separated at the paralog splitting stage are denoted with an underscore and a number (e.g. g0001_1 and g0001_2). The file with the suffix .ordered.tsv has been ordered on syntenic regions in the pangenome graph.

  • PIRATE.unique_alleles.tsv - tabular summary of all unique alleles of each gene family. Unique alleles are defined as novel MCL sub-clusters of loci at higher % identity thresholds.

  • binary_presence_absence.fasta/nwk - a tree generated by fasttree from binary gene_family presence-absence data and the fasta file used to create it.

  • pangenome.gfa - GFA network file representing all unique connections between gene families (extracted from the GFF files). Can be loaded and visualised in Bandage.

  • modified_gffs directory - GFF3 files which have been standardised for PIRATE (see above). Loci in gene_families/unique allele files correspond to the annotation in these files.

  • [optional -r] PIRATE_plots.pdf - summary plots of the PIRATE pangenome.

  • [optional -a] core_alignment/pangenome_alignment.fasta - gene-by-gene nucleotide alignments of the core and full pangenome created using MAFFT. Loci are ordered using the PIRATE.gene_families.ordered.tsv file. If the pangenome was created from translated CDS then the resulting alignments were reverse-translated from the amino acid sequence to retain the codon structure of the genes. Note - if a genome has a gene dosage/copy number of >1 for the gene family then the sequence is replaced with ?s in the alignment.

  • [optional -a] core_alignment/pangenome_alignment.gff - Annotation containing the position of the gene family within the corresponding fasta file and associated gene/product annotation.

  • [optional -a] feature_sequences directory - a directory containing all amino acid and nucleotide sequences for each gene family (aligned using MAFFT).
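
Because multicopy families are replaced with ?s in the alignments, it can be useful to check how much of each sequence is masked before building a tree. The sketch below is one minimal way to do that in Python. It assumes each FASTA record in the alignment corresponds to one genome; the toy records and the function names are invented for illustration and are not part of PIRATE.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(sequence, mask_char="?"):
    """Fraction of alignment positions occupied by the masking character."""
    return sequence.count(mask_char) / len(sequence) if sequence else 0.0


# Toy alignment standing in for an alignment file (invented data).
example = StringIO(">genome_A\nATGC????\n>genome_B\nATGCATGC\n")
fractions = {name: masked_fraction(seq) for name, seq in read_fasta(example)}
```

To run this on real output, replace the StringIO object with an open file handle to the alignment FASTA.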

PIRATE.*.tsv file format

PIRATE.gene_families.tsv and PIRATE.unique_alleles.tsv share the same file format and column headers:

1/ allele_name - a unique identifier for the allele (MCL clustering).

2/ gene_family - a unique identifier for the gene family. If the family name contains a numeric suffix (e.g. g0001_1/g0001_2) then the family contained paralogs and has been split into 1 or more related gene families.

3/ consensus_gene_name - the most frequent gene name from the original GFF3 file annotation within the cluster (NAs omitted).

4/ consensus_product - the most frequent product information from the original GFF3 file annotation within the cluster (NAs omitted).

5/ threshold - the highest threshold at which all loci within the allele/family clustered together. This measures, in percentage identity, how dissimilar the most divergent locus is from its nearest neighbour, i.e. it is a rough proxy for how similar the loci contained within the allele/gene family are to one another.

6/ alleles_at_max_threshold - the number of unique alleles at the highest homology threshold used in the analysis. This can be used as a rough proxy for the diversity contained within the gene_family.

7/ number_genomes - the number of genomes in which the gene family/allele is present.

8-10/ average/min/max_dose - summary statistics for copy number (dosage) per genome. This value has been corrected for fission/fusion loci, i.e. three loci in a single fusion cluster are considered a single gene.

11-12/ genomes_containing_fission/duplication - total number of genomes containing one or more fission/fusion or multicopy loci.

13-14/ number_of_fission/duplicated_loci - total number of fission/fusion or multicopy loci in all genomes per family/allele.

15/ no_loci - total number of loci in gene family/allele.

16/ products - counts of unique product annotations assigned to loci in the family/allele (ordered: highest -> lowest).

17/ gene_names - counts of unique gene names assigned to loci in the family/allele (ordered: highest -> lowest).

18-20/ min/max/average_length (bp) - summary stats of the length of the gene in base pairs for each locus within the family/allele.

21-22/ synteny_cluster/synteny_cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.tsv.

23+/ genome_names - one column per genome which contains the gene family. Rows contain the locus tags of the loci in each genome. Loci encased in brackets and separated by a colon have been assigned as a fusion cluster by PIRATE (e.g. (example_001:example_002)).
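
The genome columns and the paralog suffix convention described above can be unpacked with a few lines of Python. This is a hedged sketch, not PIRATE code: the function names are hypothetical, and the semicolon used to separate multiple copies within one genome cell is an assumption, since this page only documents the bracket and colon notation for fusion clusters.

```python
import re


def parse_locus_cell(cell, copy_sep=";"):
    """Split one genome cell into gene copies; each copy is a list of locus tags.

    A copy holding more than one tag is a fission/fusion cluster, e.g. "(a:b)".
    copy_sep is an ASSUMPTION: the separator between multicopy loci is not
    documented on this page.
    """
    copies = []
    for token in cell.split(copy_sep):
        token = token.strip()
        if not token:
            continue
        if token.startswith("(") and token.endswith(")"):
            # Bracketed, colon-separated loci form a single fusion cluster.
            copies.append(token[1:-1].split(":"))
        else:
            copies.append([token])
    return copies


def split_family_name(family):
    """Return (base_name, paralog_index); the index is None for unsplit families."""
    match = re.fullmatch(r"(.+)_(\d+)", family)
    if match:
        return match.group(1), int(match.group(2))
    return family, None


# A fusion cluster counts as one gene, mirroring the dosage correction.
fusion = parse_locus_cell("(example_001:example_002)")
dose = len(fusion)
base, index = split_family_name("g0001_2")
```

Counting the parsed copies rather than the raw locus tags follows the dosage correction described for columns 8-10, where all loci in one fusion cluster are treated as a single gene.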
