Usage
PIRATE accepts GFF3 annotation files that contain the matching nucleotide sequence at the end of the file. This is the format produced by Prokka. Before running, PIRATE verifies the input files and discards any that do not follow the accepted GFF3 format or do not have a .gff extension. GFF3 files obtained from other sources, such as RAST or the NCBI, may sometimes cause problems because they do not always adhere to the accepted format. It is recommended that the nucleotide FASTA is downloaded (for example with ncbi-genome-download) and annotated with Prokka. If this is not possible, for instance because you wish to retain the reference genome naming scheme, check that the FASTA header matches the first field in the annotation and that the file contains locus_tag and/or ID fields.
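The checks recommended above for GFF3 files from other sources can be approximated before running PIRATE. The following is a minimal Python sketch, not PIRATE's own validation code: the function name check_gff3 is hypothetical, and it assumes a standard nine-column GFF3 layout with key=value attributes and a ##FASTA marker before the sequence.

```python
def check_gff3(filename, text):
    """Return a list of problems found in a GFF3 file supplied as a string.

    Hypothetical helper: it mirrors the manual checks described in the text,
    not the validation that PIRATE itself performs.
    """
    problems = []
    if not filename.endswith(".gff"):
        problems.append("file does not have a .gff extension")

    seqids = set()     # first field of every annotation line
    fasta_ids = set()  # first word of every FASTA header
    in_fasta = False

    for line in text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True
            continue
        if in_fasta:
            if line.startswith(">"):
                words = line[1:].split()
                if words:
                    fasta_ids.add(words[0])
            continue
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment/directive
        cols = line.split("\t")
        if len(cols) != 9:
            problems.append("annotation line does not have 9 tab-separated fields")
            continue
        seqids.add(cols[0])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "ID" not in attrs and "locus_tag" not in attrs:
            problems.append(
                f"feature at {cols[0]}:{cols[3]}-{cols[4]} has no ID or locus_tag"
            )

    if not in_fasta:
        problems.append("no ##FASTA section at the end of the file")
    for seqid in sorted(seqids - fasta_ids):
        problems.append(f"no FASTA header matches sequence '{seqid}'")
    return problems


# Made-up example record, for illustration only.
good = "\n".join([
    "##gff-version 3",
    "contig1\tProdigal\tCDS\t1\t9\t.\t+\t0\tID=abc_00001;locus_tag=abc_00001",
    "##FASTA",
    ">contig1",
    "ATGAAATAA",
])
print(check_gff3("sample.gff", good))  # prints []
```

An empty list means none of these particular problems were found. It does not guarantee that PIRATE will accept the file.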
PIRATE renames locus_tag and ID to adhere to a standardised format (name of genome[underscore]locus number). The previous nomenclature is retained in the modified GFF3 files in the "modified_gffs" directory, under the previous_ID and previous_locustag fields. It can be transferred to the output files using the subsample_outputs.pl script with the field of interest, e.g. --prev_locustag.
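To inspect the renaming by hand, the old and new names can be read back from a modified GFF3 file. This is a minimal Python sketch, assuming that previous_ID and previous_locustag are stored as ordinary key=value attributes in the ninth column. The function name previous_names and the example identifiers are hypothetical.

```python
def previous_names(gff_text, field="previous_locustag"):
    """Map each new ID to its previous name, read from GFF3 attributes.

    Assumes the modified GFF3 keeps 'ID' and the chosen previous_* field
    as key=value pairs in column 9 (an assumption, not confirmed here).
    """
    mapping = {}
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more annotation lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "ID" in attrs and field in attrs:
            mapping[attrs["ID"]] = attrs[field]
    return mapping


# Made-up record; the identifier layout is illustrative only.
example = (
    "genomeA\tprokka\tCDS\t1\t90\t.\t+\t0\t"
    "ID=genomeA_00001;locus_tag=genomeA_00001;"
    "previous_ID=ABC_0001;previous_locustag=ABC_0001\n"
    "##FASTA\n>genomeA\nATG"
)
print(previous_names(example))
```

For transferring the old names into the PIRATE output tables, the subsample_outputs.pl script remains the supported route.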
The core functionality of PIRATE is invoked using the PIRATE command. A number of additional scripts are provided (in the scripts and tools directories) for converting or analysing the outputs.
PIRATE -i /path/to/directory/containing/gffs/
PIRATE input/output:
-i|--input input directory containing gffs [mandatory]
-o|--output output directory in which to create PIRATE folder
[default: input_dir/PIRATE]
Global:
-s|--steps % identity thresholds to use for pangenome construction
[default: 50,60,70,80,90,95,98]
-f|--features choose features to use for pangenome construction.
Multiple may be entered, separated by a comma [default: CDS]
-n|--nucl CDS are not translated to AA sequence [default: off]
-k|--pan-opt additional arguments to pass to pangenome_construction
--pan-off don't run pangenome tool [assumes PIRATE has been previously
run and resulting files are present in output folder]
Paralog classification:
--para-off switch off paralog identification [default: off]
Output:
-a|--align align all genes and produce core/pangenome alignments
[default: off]
-r|--rplots plot summaries using R [requires dependencies]
Usage:
-t|--threads number of threads/cores used by PIRATE [default: 2]
-q|--quiet switch off verbose
-z retain intermediate files [0 = none, 1 = retain pangenome
files (default - re-run using --pan-off), 2 = all]
-c|--check check installation and run on example files
-h|--help usage information
Run PIRATE over the default range of amino acid % identity thresholds (50,60,70,80,90,95,98), classify paralogs and produce output tables in a PIRATE folder inside the input directory.
PIRATE -i /path/to/gff/files/
PIRATE will run over a user-specified range of thresholds (-s 50,70,90,95), classify paralogs and produce an output folder in the specified directory (-o). Individual gene sequences are aligned with MAFFT and a core gene alignment is produced (-a). Graphical summaries are produced if the optional R dependencies have been installed (-r).
PIRATE -i /path/to/gff/files/ -s "50,70,90,95" -o /path/to/output_directory/ -a -r
Paralog classification can take a long time for a large number of samples with open pangenomes. In that case, first run PIRATE with paralog classification switched off (--para-off).
PIRATE -i /path/to/gff/files/ --para-off
Run PIRATE on a pangenome created by a previous PIRATE run without recreating the pangenome (--pan-off). Note that the thresholds selected (-s) should 'exactly' match the original thresholds.
PIRATE -i /path/to/gff/files/ -o /path/to/previous/output_directory/ --pan-off
Create a pangenome on CDS features using nucleotide identity rather than amino acid identity (-n).
PIRATE -i /path/to/gff/files/ -n
Run PIRATE on the tRNA and rRNA features in the input GFF3 files (-f). By default, this will run on nucleotides rather than amino acids.
PIRATE -i /path/to/gff/files/ -f "rRNA,tRNA"
PIRATE allows finer control of the parameters used for pangenome construction by passing options directly to pangenome_construction.pl using the --pan-opt option (written as -k in the examples below). The applicable options are listed below:
Clustering options:
-p|--perc single % identity threshold to use for pangenome
construction [default: 98]
-s|--steps multiple % id thresholds to use for pangenome
construction, comma separated
[default: 50,60,70,80,90,95,98]
-n|--nucl create pangenome on nucleotide sequence
[default: amino acid]
CDHIT options:
--cd-low cdhit lowest percentage id [default: 98]
--cd-step cdhit step size [default: 0.5]
--cd-core-off don't extract core families during cdhit clustering
[default: on]
BLAST options:
-e|--evalue e-value used for blast hit filtering [default: 1E-6]
-d|--diamond use diamond instead of BLAST - incompatible
with --nucl [default: off]
--hsp_prop remove BLAST hsps that are < hsp_prop proportion
of query length/query hsp length [default: off]
MCL options:
-f|--flat mcl inflation value [default: 1.5]
Create a pangenome using diamond (faster) rather than BLAST for homology searching (-k and --diamond).
PIRATE -i /path/to/gff/files/ -k "--diamond"
Create a pangenome by initially clustering the input FASTA file with cdhit using a step size of 1% (--cd-step) down to 95% identity (--cd-low), over a % identity threshold range of 90,91,92,93,94,95 (-s).
PIRATE -i /path/to/gff/files/ -s "90,91,92,93,94,95" -k "--cd-step 1 --cd-low 95"
Create a pangenome for a collection of highly similar genomes. Initially cluster with cdhit at 100% only (--cd-low), over a range of high thresholds (-s). Use a stringent homology e-value cutoff (-e) and exclude hits whose HSPs are not greater than 50% of the length of the query (input) sequence (--hsp_prop).
PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 -e 1E-12 --hsp_prop 0.5"
A more complex example. Create a pangenome including a range of sequence features (-f), using nucleotide sequence homology (implied by non-CDS features), over a closely related range of % identity thresholds (-s), setting the cut-off for cdhit (-k with --cd-low) and using stringent homology parameters (-k with -e). Finally, align all sequence features (-a) and produce R plots (-r).
PIRATE -i /path/to/gff/files/ -f "tRNA,rRNA,CDS" -s "95,96,97,98" -k "--cd-low 98 -e 1E-12" -a -r
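When PIRATE is launched from a script rather than typed at the shell, the main pitfall is that the quoted string after -k must reach PIRATE as a single argument. The sketch below builds the argument list in Python using only flags that appear in this section. The function pirate_command and its parameter names are hypothetical, and nothing is executed.

```python
import shlex


def pirate_command(input_dir, output_dir=None, features=None, steps=None,
                   pan_opt=None, align=False, rplots=False, threads=None):
    """Build a PIRATE argument list (hypothetical helper; it runs nothing)."""
    cmd = ["PIRATE", "-i", input_dir]
    if output_dir:
        cmd += ["-o", output_dir]
    if features:
        cmd += ["-f", ",".join(features)]
    if steps:
        cmd += ["-s", ",".join(str(s) for s in steps)]
    if pan_opt:
        # Kept as ONE list element so it arrives as a single argument,
        # equivalent to the double-quoted string in the shell examples.
        cmd += ["-k", pan_opt]
    if align:
        cmd.append("-a")
    if rplots:
        cmd.append("-r")
    if threads:
        cmd += ["-t", str(threads)]
    return cmd


cmd = pirate_command(
    "/path/to/gff/files/",
    features=["tRNA", "rRNA", "CDS"],
    steps=[95, 96, 97, 98],
    pan_opt="--cd-low 98 -e 1E-12",
    align=True,
    rplots=True,
)
# shlex.join (Python 3.8+) renders a shell-safe string for logging.
print(shlex.join(cmd))
# To run it, one could call: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run without shell=True avoids a second round of shell quoting, which is why the -k value is stored unquoted.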