Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report gene and model coverage from hmmsearch #2186

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mschecht
Copy link
Contributor

I think users will find it valuable to quickly explore the distribution of gene and model coverage values of anvi-run-hmms --domain-hits-table. To do this I implemented anvi-script-filter-hmm-hits-table --report-gene-and-model-coverage which will add two columns to the hmmsearch domtblout (model_coverage and gene_coverage) and report it as hmm_domtabl_alignment_coverage.tsv in the same directory as the domtblout.

You can test the --report-gene-and-model-coverage and see the output file here:

rm -rf metagenomics-full-test; anvi-self-test --suite metagenomics-full -o metagenomics-full-test

 head metagenomics-full-test/hmm_domtabl_alignment_coverage.tsv

@mschecht mschecht requested a review from meren December 11, 2023 22:29
@mschecht mschecht self-assigned this Dec 11, 2023
@meren
Copy link
Member

meren commented Dec 17, 2023

Hey @mschecht,

While testing this I run into a problem related to the parsing of the dom table. Here is a reproducible workflow using this contigs-db:

# run hmms
anvi-run-hmms -c P_MARINUS_MIT9301-contigs.db \
              -I Bacteria_71 \
              --domain-hits-table \
              --just-do-it \
              --hmmer-output-dir HMM_OUTPUT

# filter stuff using the dom output table and get an error
anvi-script-filter-hmm-hits-table -c P_MARINUS_MIT9301-contigs.db \
                                  --domain-hits-table HMM_OUTPUT/hmm.domtable \
                                  --hmm-source Bacteria_71 \                                  
                                  --report-gene-and-model-coverage \
                                  --min-model-coverage 0.9

HMM profiles .................................: 9 sources have been loaded: Bacteria_71 (71 genes, domain: bacteria), Archaea_76 (76 genes, domain: archaea), Ribosomal_RNA_23S (2
                                                genes, domain: None), Ribosomal_RNA_28S (1 genes, domain: None), Ribosomal_RNA_5S (5 genes, domain: None), Ribosomal_RNA_16S (3
                                                genes, domain: None), Ribosomal_RNA_12S (1 genes, domain: None), Protista_83 (83 genes, domain: eukarya), Ribosomal_RNA_18S (1
                                                genes, domain: None)
Database Path ................................: P_MARINUS_MIT9301-contigs.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable

Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
              error message to find out what happened: Error tokenizing data. C error:
              Expected 23 fields in line 2, saw 25

This is happening because of these lines in import_domtblout:

        colnames_coltypes_list = list(zip(*self.dom_table_columns))
        colnames_coltypes_dict = dict(zip(colnames_coltypes_list[0], colnames_coltypes_list[1]))

The first line in dom table has 23 columns, but the next one has 25 if you only consider spaces to split due to changes in the length of the description (i.e., GrpE vs Ribosome-binding factor A):

GrpE                 PF01025.19   166 15                   -            239   4.5e-45  145.5   6.9   1   1     8e-47   5.7e-45  145.1   6.4     6   165    48   208    35   209 0.96 GrpE
RBFA                 PF02033.18   105 125                  -            130   4.2e-26   83.7   1.8   1   1   7.1e-28     5e-26   83.5   1.8     1   105     6   111     6   111 0.94 Ribosome-binding factor A
UPF0054              PF02130.17   128 199                  -            179   2.3e-34  110.2   0.1   1   1   4.2e-36     3e-34  109.8   0.1     8   128    42   178    36   178 0.87 Uncharacterized protein family UPF0054
tRNA-synt_1d         PF00750.19   349 204                  -            604  1.3e-120  394.6   1.9   1   1  2.3e-122  1.7e-120  394.3   1.2    19   349   132   474   122   474 0.92 tRNA synthetases class I (R)
PGK                  PF00162.19   378 212                  -            402  5.7e-163  534.4   0.2   1   1  9.2e-165  6.5e-163  534.2   0.2     2   378    10   391     9   391 0.97 Phosphoglycerate kinase
Ribosomal_L1         PF00687.21   204 220                  -            235   9.5e-56  180.9   0.0   1   1   1.7e-57   1.2e-55  180.6   0.0     2   204    30   219    29   219 0.99 Ribosomal protein L1p/L10e family
SecE                 PF00584.20    55 223                  -             80   1.7e-14   45.8   0.1   1   1   2.8e-16     2e-14   45.5   0.1     2    54    26    78    25    79 0.91 SecE/Sec61-gamma subunits of protein translocation complex
Chorismate_synt      PF01264.21   327 242                  -            365  1.2e-144  473.2   0.1   1   1  1.9e-146  1.3e-144  473.0   0.1     1   327     9   354     9   354 0.96 Chorismate synthase
Pept_tRNA_hydro      PF01195.19   171 269                  -            198   6.2e-55  178.0   0.4   1   1   1.1e-56   7.5e-55  177.7   0.4     2   165     6   186     5   194 0.89 Peptidyl-tRNA hydrolase
AICARFT_IMPCHas      PF01808.18   296 284                  -            517  7.9e-114  372.2   2.2   1   1  1.8e-115  1.3e-113  371.5   2.1     1   295   134   450   134   451 0.94 AICARFT/IMPCHase bienzyme

If you change the description for GrpE with GrpE x x manually in the table output, then you get this error, which shows that this is due to improper parsing of the table:

anvi-script-filter-hmm-hits-table -c P_MARINUS_MIT9301-contigs.db \
                                  --domain-hits-table HMM_OUTPUT/hmm.domtable \
                                  --hmm-source Bacteria_71 \                                  
                                  --report-gene-and-model-coverage \
                                  --min-model-coverage 0.9

HMM profiles .................................: 9 sources have been loaded: Bacteria_71 (71 genes, domain: bacteria), Archaea_76 (76 genes, domain: archaea), Ribosomal_RNA_23S (2
                                                genes, domain: None), Ribosomal_RNA_28S (1 genes, domain: None), Ribosomal_RNA_5S (5 genes, domain: None), Ribosomal_RNA_16S (3
                                                genes, domain: None), Ribosomal_RNA_12S (1 genes, domain: None), Protista_83 (83 genes, domain: eukarya), Ribosomal_RNA_18S (1
                                                genes, domain: None)
Database Path ................................: P_MARINUS_MIT9301-contigs.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable

Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
              error message to find out what happened: Error tokenizing data. C error:
              Expected 25 fields in line 3, saw 26

The other instances that work with dom table knows that this is a stupid output file format with a variable number of length for the description column, and uses maxsplit argument to capture the table structure -- like here in the same file as an example:

 with open(self.domtblout) as dom_table:
            dom_table_entries = [l.strip('\n').split(maxsplit=len(self.dom_table_columns) - 1) for l in dom_table.readlines()]

FYI :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants