PIRATE.pangenome_summary.txt #87

Open
haruosuz opened this issue Mar 2, 2024 · 9 comments

@haruosuz

haruosuz commented Mar 2, 2024

I ran PIRATE with 322 genomes (gff files) as input, but the <PIRATE.pangenome_summary.txt> file indicates 321 genomes. Is there any way to investigate why the count decreased from 322 to 321? The confirmation details are as follows:

cat PIRATE.log | grep "32. "

 - 322 files in input directory.
 - 322 gff files passed QC and will be analysed by PIRATE - completed in: 2s
 - Loci file contains 21143 loci from 322 genomes.
322 genomes processed.
 - 322 samples found in file headers
 - 1415 genes of 1415 total genes were present in 322 isolates
 - 1415 clusters from 322 genomes (dosage threshold <= 1.1) used for graphing.
# 1415 gene families in 321 genomes.

@SionBayliss
Owner

Hi @haruosuz, that is a little odd. Typically, poorly formatted GFF files are removed at the initial stage. Were your samples annotated with Prokka? I would suggest you check the GFF file for the removed sample to ensure it is formatted correctly and contains CDS/genes (i.e. is not empty). If it looks normal, feel free to email me the files and I will check to see if there is anything odd going on (perhaps include a handful of the files that worked as contrasts).
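
The check suggested above (does each GFF file actually contain CDS features?) could be scripted roughly as follows. This is only a sketch: `count_cds` is a hypothetical helper, not part of PIRATE, and it assumes standard tab-delimited GFF3 with the feature type in column 3, files ending in `.gff`, and an optional embedded `##FASTA` section.

```shell
# Hypothetical helper (not part of PIRATE): print the number of CDS features
# in every *.gff file of a directory, one "count<TAB>file" line per file.
# Assumes tab-delimited GFF3 with the feature type in column 3; reading stops
# at a "##FASTA" line so embedded sequence is not scanned.
count_cds() {
    dir=$1
    for gff in "$dir"/*.gff; do
        [ -e "$gff" ] || continue   # the glob matched nothing
        n=$(awk -F '\t' '/^##FASTA/ { exit } $3 == "CDS" { c++ } END { print c + 0 }' "$gff")
        printf '%s\t%s\n' "$n" "$gff"
    done
}

# Example (directory name is a placeholder): list files without any CDS.
# count_cds input_gffs | awk -F '\t' '$1 == 0'
```

A file reported with a count of 0 would be the first candidate to inspect by hand.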

@haruosuz
Author

haruosuz commented Mar 7, 2024

Thank you for your reply. The 322 genomes were annotated with DFAST, and none of the 322 GFF files is empty. The <PIRATE.gene_families.ordered.tsv> file has 1415 rows and 344 columns (i.e., 344 - 22 = 322 genomes). Is there any way to identify which of the 322 GFF files was excluded from the <PIRATE.pangenome_summary.txt> file? The exclusion is suggested by the line "# 1415 gene families in 321 genomes." in that file.

@SionBayliss
Owner

You can check the headers in the PIRATE.gene_families.tsv file and compare them to your input sample list.

@haruosuz
Author

Thank you for your reply.

The following command produced no output, indicating that there is no difference between the genomes listed in the header of the PIRATE.gene_families.tsv file and the input sample list in the "genome_list.txt" file:

diff <(head -n 1 PIRATE.gene_families.tsv | tr "\t" "\n" | tail -n +21) <(cat genome_list.txt | sort)

The discrepancy in the numbers (322 vs. 321 genomes) remains unexplained. Here are the commands and their outputs:

$ wc -l genome_list.txt
     322 genome_list.txt

$ head -n 1 PIRATE.pangenome_summary.txt
# 1415 gene families in 321 genomes.
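
One further way to narrow this down might be to count, for each sample column, how many gene families have a non-empty cell; a sample with a count of 0 would be a candidate for the genome missing from the summary. The sketch below is not PIRATE functionality: `families_per_sample` is a hypothetical helper, and it assumes a tab-delimited table with one header row, sample columns starting at column 21 (the offset used by the `diff` command above), and empty cells for absent genes.

```shell
# Hypothetical helper (not part of PIRATE): for each sample column of a
# PIRATE.gene_families.tsv-style table, print "sample<TAB>count", where count
# is the number of rows with a non-empty cell in that column.
# Assumptions: tab-delimited, one header row, sample columns start at
# column 21 unless a different first column is given as the second argument.
families_per_sample() {
    awk -F '\t' -v first="${2:-21}" '
        NR == 1 { for (i = first; i <= NF; i++) name[i] = $i; ncol = NF; next }
        { for (i = first; i <= ncol; i++) if ($i != "") count[i]++ }
        END { for (i = first; i <= ncol; i++) printf "%s\t%d\n", name[i], count[i] + 0 }
    ' "$1"
}

# Example: show samples that have no gene family at all.
# families_per_sample PIRATE.gene_families.tsv | awk -F '\t' '$2 == 0'
```

If every sample has a non-zero count, the 321 in the summary is more likely a counting issue in the summary step than a genuinely dropped genome.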

@SionBayliss
Owner

So it found all your input genome files but is saying there is an additional one at one internal step? Are you sure you don't have a line including just whitespace in the genome_list.txt file?

@haruosuz
Author

I ran PIRATE with 322 genomes (gff files) as input. While the <PIRATE.gene_families.tsv> file contains 322 genomes, the <PIRATE.pangenome_summary.txt> file indicates only 321 genomes.

The <genome_list.txt> file, generated by PIRATE, contains no empty lines, as shown below:

$ cat genome_list.txt | wc -l
     322
$ cat genome_list.txt | grep -v "^$" | wc -l
     322
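
Note that `grep -v "^$"` only filters completely empty lines; a line holding just spaces or tabs, trailing whitespace, or a duplicated name would pass it. A slightly broader check could look like the sketch below, where `check_list` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: report problems in a sample list that
# `grep -v "^$"` would not catch.
check_list() {
    list=$1
    # Lines that are empty or contain only spaces/tabs.
    blank=$(grep -c '^[[:space:]]*$' "$list" || true)
    # Lines ending in whitespace, including a carriage return from
    # Windows line endings (whitespace-only lines are counted here too).
    trailing=$(grep -c '[[:space:]]$' "$list" || true)
    # Names that occur more than once.
    dups=$(sort "$list" | uniq -d | wc -l | tr -d ' ')
    printf 'blank=%s trailing=%s duplicates=%s\n' "$blank" "$trailing" "$dups"
}

# Example:
# check_list genome_list.txt
```

All three counts being 0 would rule out the sample list itself as the source of the off-by-one.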

@SionBayliss
Owner

Hi Haruo,

Can you share the genomes with me so that I can track down the bug? I cannot replicate it on my end.

S

@haruosuz
Author

Dear @SionBayliss
Attached is a directory containing the FASTA files for 322 genomes.
fna_322_genomes.tgz

@SionBayliss
Owner

I will see if I can get to this over the next few weeks and feed back to you.
