SionBayliss edited this page Feb 27, 2019 · 7 revisions

Input format

PIRATE accepts GFF3 annotation files that contain the matching nucleotide sequence at the end of the file. This is the format produced by Prokka. Before running, PIRATE verifies each input file and discards any that do not follow the accepted GFF3 format or do not have a .gff extension. GFF3 files obtained from other sources, such as RAST or the NCBI, may sometimes cause problems because they do not always adhere to the accepted format. It is recommended that the nucleotide FASTA is downloaded (use ncbi-genome-download) and annotated with Prokka. If this is not possible, for instance because you wish to retain the reference genome naming scheme, check that the FASTA header matches the first field in the annotation and that the file contains locus_tag and/or ID fields.
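The checks recommended above can be approximated before running PIRATE. The following is a minimal sketch, not PIRATE's own validation: check_gff is a hypothetical helper, and it assumes tab-separated feature lines, a ##FASTA marker before the sequence, and that only CDS features need an ID or locus_tag.

```shell
# Hypothetical pre-flight check for one GFF3 file (not part of PIRATE).
check_gff() {
    # PIRATE only accepts files with a .gff extension.
    case "$1" in
        *.gff) ;;
        *) echo "$1: missing .gff extension"; return 1 ;;
    esac
    awk -F'\t' '
        /^##FASTA/ { in_fasta = 1; next }        # sequence section starts here
        in_fasta && /^>/ {                       # FASTA header: keep the first word
            name = substr($0, 2); sub(/[ \t].*/, "", name)
            fasta[name] = 1; next
        }
        in_fasta || /^#/ || NF < 9 { next }      # skip sequence, comments, short lines
        { contigs[$1] = 1 }                      # first field of every feature line
        $3 == "CDS" && $9 !~ /(^|;)(ID|locus_tag)=/ { untagged++ }
        END {
            bad = 0
            if (!in_fasta) { print "no ##FASTA section"; bad = 1 }
            for (c in contigs) if (!(c in fasta)) { print "no FASTA record for " c; bad = 1 }
            if (untagged > 0) { print untagged " CDS without ID or locus_tag"; bad = 1 }
            exit bad
        }
    ' "$1"
}
```

Calling check_gff on a file prints any problems found and returns a non-zero status if the file is likely to be rejected or mishandled; a silent zero status only means these particular checks passed.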

Locus Tags/IDs

PIRATE renames locus_tag and ID to adhere to a standardised format (name of genome[underscore]locus number). The previous nomenclature is retained in the modified GFF3 files present in the "modified_gffs" directory under previous_ID and previous_locustag fields. The old nomenclature can be transferred to the output files using the subsample_outputs.pl script and the field of interest e.g. --prev_locustag.

Usage

The core functionality of PIRATE is invoked using the PIRATE command. A number of additional scripts are provided (in the scripts and tools directories) for converting or analysing the outputs.

	PIRATE -i /path/to/directory/containing/gffs/ 

 PIRATE input/output:
 -i|--input 	input directory containing gffs [mandatory]
 -o|--output 	output directory in which to create PIRATE folder 
 		        [default: input_dir/PIRATE]

 Global:
 -s|--steps	    % identity thresholds to use for pangenome construction
  		        [default: 50,60,70,80,90,95,98]
 -f|--features	choose features to use for pangenome construction. 
 		        Multiple may be entered, separated by a comma [default: CDS]
 -n|--nucl	    CDS are not translated to AA sequence [default: off]
 --pan-opt	    additional arguments to pass to pangenome_construction	
 --pan-off	    don't run pangenome tool [assumes PIRATE has been previously
  		        run and resulting files are present in output folder]

 Paralog classification:
 --para-off	    switch off paralog identification [default: off]

 Output:
 -a|--align	    align all genes and produce core/pangenome alignments 
 		        [default: off]
 -r|--rplots	plot summaries using R [requires dependencies]

 Usage:
 -t|--threads	number of threads/cores used by PIRATE [default: 2]
 -q|--quiet	    switch off verbose
 -z		        retain intermediate files [0 = none, 1 = retain pangenome 
 		        files (default - re-run using --pan-off), 2 = all]
 -c|--check	    check installation and run on example files
 -h|--help 	    usage information
 

Basic examples

Run PIRATE over the default range of amino acid % identity thresholds (50,60,70,80,90,95,98), classify paralogs and produce output tables in a PIRATE folder inside the input directory.

PIRATE -i /path/to/gff/files/

Run PIRATE over a user-specified range of thresholds (-s 50,70,90,95), classify paralogs and produce an output folder in the specified directory (-o). Individual gene sequences are aligned with MAFFT and a core gene alignment is produced (-a). Graphical summaries are produced if the optional R dependencies have been installed (-r).

PIRATE -i /path/to/gff/files/ -s "50,70,90,95" -o /path/to/output_directory/ -a -r

Paralog classification can take a long time for a large number of samples with open pangenomes. In that case, first run PIRATE with paralog classification switched off (--para-off).

PIRATE -i /path/to/gff/files/ --para-off 

Run PIRATE on a pangenome created by a previous PIRATE run without recreating the pangenome (--pan-off). Note that the thresholds selected (-s) should 'exactly' match the original thresholds.

PIRATE -i /path/to/gff/files/ -o /path/to/previous/output_directory/ --pan-off 
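Because the thresholds have to match between runs, it can help to keep them in a single variable. The sketch below is a dry run under stated assumptions: the RUN, STEPS, GFFS and OUT variables are illustrative names, the paths are placeholders, and adding -a on the second run is only one example of what a --pan-off rerun might be used for.

```shell
# Dry-run sketch: RUN defaults to echo, so the commands are only printed.
# Set RUN to an empty string (e.g. RUN= sh script.sh) to execute them.
RUN=${RUN-echo}
STEPS="50,70,90,95"                 # example thresholds, shared by both runs
GFFS=/path/to/gff/files/            # placeholder paths
OUT=/path/to/output_directory/

# Initial run: build the pangenome.
$RUN PIRATE -i "$GFFS" -o "$OUT" -s "$STEPS"

# Later run: reuse the pangenome (--pan-off) with the same -s, adding alignments (-a).
$RUN PIRATE -i "$GFFS" -o "$OUT" -s "$STEPS" --pan-off -a
```

Defining the threshold string once avoids the two runs drifting apart if the list is edited later.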

Create a pangenome on CDS features using nucleotide identity rather than amino acid identity (-n).

PIRATE -i /path/to/gff/files/ -n 

Run PIRATE on tRNA and rRNA features in the input GFF3 files (-f). By default, this will run on nucleotides rather than amino acids.

PIRATE -i /path/to/gff/files/ -f "rRNA,tRNA"

Advanced examples

PIRATE allows finer control of the parameters used for pangenome construction by passing options directly to pangenome_construction.pl using the --pan-opt option (the examples below use the short form -k). The applicable options are listed below:

    Clustering options:
    -p|--perc       single % identity threshold to use for pangenome 
                    construction [default: 98]
    -s|--steps      multiple % id thresholds to use for pangenome 
                    construction, comma separated 
                    [default: 50,60,70,80,90,95,98]
    -n|--nucl       create pangenome on nucleotide sequence 
                    [default: amino acid]

    CDHIT options: 
    --cd-low        cdhit lowest percentage id [default: 98]
    --cd-step       cdhit step size [default: 0.5]
    --cd-core-off   don't extract core families during cdhit clustering 
                    [default: on]

    BLAST options:
    -e|--evalue     e-value used for blast hit filtering [default: 1E-6]
    -d|--diamond    use diamond instead of BLAST - incompatible 
                    with --nucl [default: off]
    --hsp_prop      remove BLAST hsps that are < hsp_prop proportion
                    of query length/query hsp length [default: off]

    MCL options:
    -f|--flat       mcl inflation value [default: 1.5]

Create a pangenome using diamond (faster) rather than BLAST for homology searching (-k and --diamond).

PIRATE -i /path/to/gff/files/ -k "--diamond"

Create a pangenome by initially clustering the input FASTA file with cdhit using a step size of 1% (--cd-step) until 95% identity (--cd-low), over a % identity threshold range of 90,91,92,93,94,95 (-s).

PIRATE -i /path/to/gff/files/ -s "90,91,92,93,94,95" -k "--cd-step 1 --cd-low 95"

Create a pangenome for a collection of highly similar genomes. Initially cluster with cdhit at 100% only (--cd-low), over a range of high thresholds (-s). Use a stringent homology e-value cutoff (-e) and exclude hits whose HSPs do not cover more than 50% of the length of the query or input sequence (--hsp_prop).

PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 -e 1E-12 --hsp_prop 0.5"

A more complicated example. Create a pangenome including a range of sequence features (-f), using nucleotide sequence homology (implied by non-CDS features), over a closely related range of % identity thresholds (-s), setting the lowest cd-hit identity cut-off (-k and --cd-low) and stringent homology parameters (-k and -e). Finally, align all sequence features (-a) and produce R plots (-r).

PIRATE -i /path/to/gff/files/ -f "tRNA,rRNA,CDS" -s "95,96,97,98" -k "--cd-low 98 -e 1E-12" -a -r