Outputs
PIRATE produces a number of output files, summarised below:
-
PIRATE.pangenome_summary.txt - short summary of the number and frequency of genes in the pangenome.
-
PIRATE.log - PIRATE log file.
-
PIRATE.gene_families.ordered.tsv - tabular summary of all gene families, with one entry per gene family. Families that have been separated at the paralog splitting stage are denoted with an underscore and a number (e.g. g0001_1 and g0001_2). The file with the suffix .ordered.tsv has been ordered on syntenic regions in the pangenome graph.
-
PIRATE.unique_alleles.tsv - tabular summary of all unique alleles of each gene family. Unique alleles are defined as novel MCL sub-clusters of loci at higher % identity thresholds.
-
binary_presence_absence.fasta/nwk - a tree generated by FastTree from binary gene family presence-absence data, and the fasta file used to create it.
-
pangenome.gfa - GFA network file representing all unique connections between gene families (extracted from the GFF files). Can be loaded and visualised in Bandage.
-
modified_gffs directory - GFF3 files which have been standardised for PIRATE (see above). Loci in gene_families/unique allele files correspond to the annotation in these files.
-
[optional -r] PIRATE_plots.pdf - summary plots of the PIRATE pangenome.
-
[optional -a] core_alignment/pangenome_alignment.fasta - gene-by-gene nucleotide alignments of the core and full pangenome created using MAFFT. Loci are ordered using the PIRATE.gene_families.ordered.tsv file. If the pangenome was created from translated CDS then the resulting alignments were reverse-translated from the amino acid sequence to retain the codon structure of the genes. Note - if a genome has a gene dosage/copy number of >1 for the gene family then the sequence is replaced with ?s in the alignment.
-
[optional -a] core_alignment/pangenome_alignment.gff - Annotation containing the position of the gene family within the corresponding fasta file and associated gene/product annotation.
-
[optional -a] feature_sequences directory - a directory containing all amino acid and nucleotide sequences for each gene family (aligned using MAFFT).
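As a quick way to inspect pangenome.gfa without Bandage, the sketch below counts how many distinct neighbours each gene family has in the graph. It assumes the file follows the standard GFA1 layout (tab-separated `S` segment records and `L` link records); the function name `gfa_degrees` and the example records are hypothetical and not part of PIRATE.

```python
from collections import defaultdict


def gfa_degrees(lines):
    """Return {segment_name: number_of_distinct_neighbours} for GFA1 lines.

    ASSUMPTION: records are tab-separated, 'S' lines carry the segment name
    in field 2, and 'L' lines carry the two linked segments in fields 2 and 4.
    """
    neighbours = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            # Touch the key so unlinked segments are reported with degree 0.
            _ = neighbours[fields[1]]
        elif fields[0] == "L" and len(fields) >= 4:
            left, right = fields[1], fields[3]
            neighbours[left].add(right)
            neighbours[right].add(left)
    return {name: len(adjacent) for name, adjacent in neighbours.items()}


# Hypothetical records, for illustration only.
example = [
    "H\tVN:Z:1.0\n",
    "S\tg0001\t*\n",
    "S\tg0002\t*\n",
    "S\tg0003\t*\n",
    "L\tg0001\t+\tg0002\t+\t0M\n",
    "L\tg0002\t+\tg0003\t+\t0M\n",
]
degrees = gfa_degrees(example)
```

For a real run, pass an open file handle instead of the example list, e.g. `gfa_degrees(open("pangenome.gfa"))`. Families with many neighbours sit at branch points in the graph.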
PIRATE.gene_families.tsv and PIRATE.unique_alleles.tsv share the same file format and column headers:
1/ allele_name - a unique identifier for the allele (MCL clustering).
2/ gene_family - a unique identifier for the gene family. If the family name contains a numeric suffix (e.g. g0001_1/g0001_2) then the family contained paralogs and has been split into 1 or more related gene families.
3/ consensus_gene_name - the most frequent gene name from the original GFF3 file annotation within the cluster (NAs omitted).
4/ consensus_product - the most frequent product information from the original GFF3 file annotation within the cluster (NAs omitted).
5/ threshold - the highest threshold at which all loci within the allele/family clustered together. This measures how dissimilar the most divergent locus is from its nearest neighbour in percentage identity, i.e. it is a rough proxy for how similar the loci contained within the allele/gene family are to one another.
6/ alleles_at_max_threshold - the number of unique alleles at the highest homology threshold used in the analysis. This can be used as a rough proxy for the diversity contained within the gene_family.
7/ number_genomes - the number of genomes in which the gene family/allele is present.
8-10/ average/min/max_dose - summary statistics for copy number (dosage) per genome. This value has been corrected for fission/fusion loci, i.e. three loci in a single fusion cluster are considered a single gene.
11-12/ genomes_containing_fission/duplication - total number of genomes containing one or more fission/fusion or multicopy loci.
13-14/ number_of_fission/duplicated_loci - total number of fission/fusion or multicopy loci in all genomes per family/allele.
15/ no_loci - total number of loci in gene family/allele.
16/ products - counts of unique product annotations assigned to loci in the family/allele (ordered: highest -> lowest).
17/ gene_names - counts of unique gene names assigned to loci in the family/allele (ordered: highest -> lowest).
18-20/ min/max/average_length (bp) - summary statistics of gene length in base pairs across the loci within the family/allele.
21-22/ synteny_cluster/synteny_cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.tsv.
23+/ genome_names - one column per genome which contains the gene family. Rows contain the locus tags of each locus per genome. Loci encased in brackets and separated by a colon have been assigned as a fusion cluster by PIRATE (e.g. (example_001:example_002) ).
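The columns above can be read with any tab-separated parser. The following is a minimal sketch, not PIRATE code: it lists gene families present in every genome using the gene_family and number_genomes columns, and splits a genome-column cell into loci, unpacking the bracketed fusion clusters. It assumes the header names match those listed above and that multiple entries within one cell are separated by a semicolon; the function names and the demo table are hypothetical.

```python
import csv
import io


def parse_loci(cell):
    """Split one genome-column cell into groups of locus tags.

    A fusion cluster such as '(example_001:example_002)' becomes one group
    with two tags; any other entry becomes a group with a single tag.
    ASSUMPTION: multiple entries in a cell are separated by ';'.
    """
    groups = []
    for entry in cell.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if entry.startswith("(") and entry.endswith(")"):
            groups.append(entry[1:-1].split(":"))
        else:
            groups.append([entry])
    return groups


def core_families(handle, n_genomes):
    """Return gene_family IDs whose number_genomes equals n_genomes."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["gene_family"]
        for row in reader
        if int(row["number_genomes"]) == n_genomes
    ]


# Hypothetical two-genome table with only a subset of the real columns.
demo = io.StringIO(
    "allele_name\tgene_family\tnumber_genomes\tgenomeA\tgenomeB\n"
    "a0001\tg0001\t2\tA_001\t(B_001:B_002)\n"
    "a0002\tg0002\t1\tA_007\t\n"
)
core = core_families(demo, n_genomes=2)
fusion = parse_loci("(example_001:example_002)")
```

On a real file, replace `demo` with `open("PIRATE.gene_families.tsv", newline="")` and set `n_genomes` to the number of genomes in the analysis. Each group returned by `parse_loci` counts as one gene, which mirrors how dosage is corrected for fission/fusion loci.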