Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

README.md options #63

Open
haruosuz opened this issue Jan 10, 2021 · 7 comments
Open

README.md options #63

haruosuz opened this issue Jan 10, 2021 · 7 comments

Comments

@haruosuz
Copy link

haruosuz commented Jan 10, 2021

I have questions about options in https://github.com/SionBayliss/PIRATE/blob/master/README.md

Usage

Basic examples

  • -a|--align
  • -r|--rplots

Running PIRATE with the options -a -r generated the core_alignment.fasta and pangenome_alignment.fasta files, and PIRATE_plots.pdf file in which Tree in page 8 (Number of gene families per sample) and page 9 (Pangenome cluster presence/absence) is drawn using the binary_presence_absence.nwk file. Is it possible to generate PIRATE_plots.pdf in which Tree is drawn using a Newick file obtained by performing fasttree on core_alignment.fasta?

The core_alignment.fasta file contained many "N" (e.g. "TNNNNNNNNNA") although input nucleotide sequence data contained only alphabet of "ACGT".

  • --pan-opt
--pan-opt	    additional arguments to pass to pangenome_contruction	

should be changed to the following?

-k|--pan-opt	    additional arguments to pass to pangenome_contruction	
  • -s|--steps
  • -n|--nucl

Is there any difference between the options Global: -s|--steps and Clustering options: -s|--steps?

 Global:
 -s|--steps	    % identity thresholds to use for pangenome construction
  		        [default: 50,60,70,80,90,95,98]

 -n|--nucl	    CDS are not translated to AA sequence [default: off]
    Clustering options:

    -s|--steps      multiple % id thresholds to use for pangenome 
                    construction, comma seperated 
                    [default: 50,60,70,80,90,95,98]
    -n|--nucl       create pangenome on nucleotide sequence 
                    [default: amino acid]

Advanced examples

  • -f|--flat

In this paper (https://academic.oup.com/gigascience/article/8/10/giz119/5584409), "A default MCL inflation value of 2" was used for intra-species clustering (Figure 3. complete Staphylococcus aureus genomes), while "an MCL inflation value of 6" was used for intra-species clustering (Figure 4. Pseudomonas complete genomes) and inter-species comparisons (Figure 5. Prochlorococcus marinus draft genomes). The MCL inflation value of 2 or 6 was chosen based on the previous studies for these bacterial taxa?

I presume a default MCL inflation value was changed from 2 to 1.5.

 MCL options:
    -f|--flat       mcl inflation value [default: 1.5]
  • -e|--evalue
PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 --e 1E-12 --hsp-len 0.5"

should be changed to the following?

PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 -e 1E-12 --hsp-len 0.5"
PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 --evalue 1E-12 --hsp-len 0.5"

https://github.com/SionBayliss/PIRATE#piratetsv-file-format

PIRATE.*.tsv file format

21-22/ synteny_cluster/synteny_cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.tsv.

should be changed to the following?

21-22/ cluster/cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.ordered.tsv.

Support Scripts

Subset Outputs

Subsample PIRATE.gene_families.ordered.tsv file and rename loci in output. Allows for recalculation of number of genomes gene_families are present in PIRATE.gene_families.ordered.tsv

should be changed to the following?

Subsample PIRATE.*.tsv files and rename loci in output. Allows for recalculation of number of genomes gene_families are present in PIRATE.*.tsv

PIRATE.gene_families.tsv
PIRATE.gene_families.ordered.tsv
PIRATE.unique_alleles.tsv
PIRATE.genomes_per_allele.tsv

Subset alignments

PIRATE/tools/subsetting/subset_alignments.pl --help
Usage:
     subset_alignments.pl -i /path/to/PIRATE.gene_families.tab -f /path/to/sequence/alignments/

# should be changed to the following?

     subset_alignments.pl -i /path/to/PIRATE.gene_families.tsv -f /path/to/sequence/alignments/
subset_alignments.pl -i /path/to/PIRATE.gene_families.tab[PIRATE.unique_alleles.tsv] -f /path/to/PIRATE/feature_sequences/ -o ./path/to/output_directory/

# should be changed to the following?

subset_alignments.pl -i /path/to/PIRATE.unique_alleles.tsv -f /path/to/PIRATE/feature_sequences/ -o ./path/to/output_directory/

Running the command printed the following messages and generated empty output_directory.

 - 0 of 0 genes to be processed
 - 100 % clusters added to output

identify representative sequences for gene families/alleles

Identify the representative sequence for each cluster in a PIRATE.*.tsv file. The file can be found at PIRATE/scripts/representative_sequences.pl.

PIRATE/scripts/representative_sequences.pl should be the following file?

PIRATE/tools/subsetting/select_representative
PIRATE/scripts/select_representative.pl

Unique gene sequences

unique_sequences.pl -i /path/to/gene.fas -p /path/to/PIRATE.gene_families.tab -o /path/to/output_dir/

should be the following?

unique_sequences.pl -i /path/to/PIRATE/feature_sequences/g000000002.aa.fasta -p /path/to/PIRATE.gene_families.tsv -o /path/to/output_dir/
unique_sequences.pl -i /path/to/PIRATE/feature_sequences/g000000002.nucleotide.fasta -p /path/to/PIRATE.gene_families.tsv -o /path/to/output_dir/

Convert to roary file

PIRATE_to_roary.pl -i /path/to/PIRATE.*.tsv/ -o /path/to/output_file.csv

should be the following?

PIRATE_to_roary.pl -i /path/to/PIRATE.*.tsv -o /path/to/output_file.csv

Convert to binary presence-absence or count

Convert PIRATE.*.tsv to "gene/allele presence-absence" and "paralog presence-absence" tsv files can be done with the same command?

# gene/allele presence-absence
PIRATE_to_Rtab.pl -i /path/to/PIRATE.*.tsv/ -o /path/to/output_file.tsv

# paralog presence-absence
PIRATE_to_Rtab.pl -i /path/to/PIRATE.*.tsv/ -o /path/to/output_file.tsv

should be the following?

PIRATE_to_Rtab.pl -i /path/to/PIRATE.*.tsv -o /path/to/output_file.tsv
@SionBayliss
Copy link
Owner

SionBayliss commented Jan 11, 2021

Thanks for the feedback on the README.

  • "Is it possible to generate PIRATE_plots.pdf in which Tree is drawn using a Newick file obtained by performing fasttree on core_alignment.fasta?"
    This can be done by passing a tree, in this case the tree generated by fasttree, to plot_PIRATE.R as an optional second argument e.g.
Rscript plot_PIRATE.R path/to/PIRATE_results_dir/ path/to/example_tree.nwk 
  • "The core_alignment.fasta file contained many "N" (e.g. "TNNNNNNNNNA") although input nucleotide sequence data contained only alphabet of "ACGT"."
    Ns are added to represent sequence/genes missing in individual isolates e.g. if genomeA does not have a copy of gene1 then the length of the alignment for gene1 will be represented by Ns.

  • "Is there any difference between the options Global: -s|--steps and Clustering options: -s|--steps? "
    No

Subset alignments functionality changed during development and will not function on unique_alleles, I will update the README accordingly.

All the best,
Sion

@haruosuz
Copy link
Author

haruosuz commented Jan 13, 2021

  • plot_PIRATE.R should be plot_summary.R?
$git clone https://github.com/SionBayliss/PIRATE
$find PIRATE -name "plot_PIRATE.R"
$find PIRATE -name "*.R"
PIRATE/scripts/plot_summary.R

@haruosuz
Copy link
Author

haruosuz commented Mar 4, 2021

I would be grateful if you could provide examples for Subset alignments.

Running the command printed the following messages and generated empty output_directory.

input=${DIRECTORY}/PIRATE.gene_families.tsv
fasta=${DIRECTORY}/feature_sequences/
output=output_directory
PIRATE/tools/subsetting/subset_alignments.pl -i "${input}" -f "${fasta}" -o "${output}"

 - 0 of 0 genes to be processed
 - 100 % clusters added to output

@SionBayliss
Copy link
Owner

This looks like a bug. I will try and fix it when I get a spare minute or two.

Thanks,
Sion

@haruosuz
Copy link
Author

haruosuz commented Apr 12, 2021

https://github.com/SionBayliss/PIRATE/blob/master/README.md#output-files

  • [optional -a] feature_sequences directory - a directory containing all amino acid and nucleotide sequences for each gene family (aligned using MAFFT).

Running PIRATE with -a|--align option generates a feature_sequences/ directory missing amino acid and nucleotide sequences for some gene families.

In the following case, the feature_sequences/ directory contained amino acid and nucleotide sequences for only 104 of the 139 gene families; i.e., amino acid and nucleotide sequences for 35 gene families (g0001, g0002, g0003, g0004, g0006, g0010, ...) are missing in the feature_sequences/ directory.

$tail -n +2 PIRATE.gene_families.tsv | wc -l
     139

$ls feature_sequences/*.aa.fasta | wc -l
     104

$cat PIRATE.gene_families.tsv | cut -f2 | sort | head
g0001
g0002
g0003
g0004
g0005
g0006
g0007
g0008
g0009
g0010

$ls feature_sequences/*.aa.fasta | head
feature_sequences/g0005.aa.fasta
feature_sequences/g0007.aa.fasta
feature_sequences/g0008.aa.fasta
feature_sequences/g0009.aa.fasta
feature_sequences/g0011.aa.fasta
feature_sequences/g0014.aa.fasta
feature_sequences/g0015.aa.fasta
feature_sequences/g0016.aa.fasta
feature_sequences/g0017.aa.fasta
feature_sequences/g0018.aa.fasta

5931919.zip

In another case, running the following command generated the feature_sequences/ directory which contained amino acid and nucleotide sequences for only 189 of the 202 gene families.

PIRATE --input ${dir_gff} --output ${dir_pirate} --steps "0,10,20,30,40,50,60,70,80,90,95,98" --pan-opt "--cd-low 98 --evalue 1E-6 --hsp-len 0 --flat 1.5" --para-off --align --rplots --threads $(getconf _NPROCESSORS_ONLN) -z 2

@SionBayliss
Copy link
Owner

That is most likely due to PIRATE excluding genes with high copy numbers. If genes are highly fragmented then alignment will likely be problematic. This can be adjusted accordingly using the --dosage option for align_feature_sequences.pl and create_pangenome_alignment.pl. The default is 1.25.

You can raise the threshold, but I would question the utility of including these genes in an alignment for certain applications (e.g. core genome trees).

@haruosuz
Copy link
Author

haruosuz commented Mar 3, 2022

I was wondering if descriptions for output files such as genome2loci.tab and loci_list.tab could be added below?
https://github.com/SionBayliss/PIRATE/blob/master/README.md#output-files

The number of loci (CDS) can be obtained from the output files?

https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC6785682/

The pangenome comprised 2,858,820 loci clustered into 102,425 gene clusters of which 1,841 (1.8%) were considered core (present in >95% of isolates) (Fig. 4A).
The pangenome comprised 91,593 loci clustered into 8,325 gene clusters of which 867 (10.41%) were considered core (present in >95% of isolates) (Fig. 5A).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants