Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve mat #1197

Merged
merged 52 commits into from
Apr 28, 2022
Merged

Improve mat #1197

merged 52 commits into from
Apr 28, 2022

Conversation

akaviaLab
Copy link
Contributor

I haven't modified tests yet. Not sure what's necessary, but I'm thinking about making a copy of load_model which will load mat models from BIGG/BioModels, and then maybe make a test of load_model, load_model_mat and compare them.

I haven't modified the writing of mat models. If this code makes sense, I can write the inverse function.

@codecov-commenter
Copy link

codecov-commenter commented Mar 31, 2022

Codecov Report

Merging #1197 (49cc180) into devel (ca8e8bd) will increase coverage by 0.60%.
The diff coverage is 84.28%.

@@            Coverage Diff             @@
##            devel    #1197      +/-   ##
==========================================
+ Coverage   83.63%   84.23%   +0.60%     
==========================================
  Files          65       65              
  Lines        5370     5549     +179     
  Branches     1203     1287      +84     
==========================================
+ Hits         4491     4674     +183     
+ Misses        569      564       -5     
- Partials      310      311       +1     
Impacted Files Coverage Δ
src/cobra/io/mat.py 82.28% <84.28%> (+14.73%) ⬆️
src/cobra/io/sbml.py 80.45% <0.00%> (+0.35%) ⬆️
src/cobra/sampling/core.py 93.93% <0.00%> (+15.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca8e8bd...49cc180. Read the comment docs.

@cdiener
Copy link
Member

cdiener commented Mar 31, 2022

That's awesome, thanks so much!

I think all cobrapy species have annotations, including genes:

In [1]: from cobra.test import create_test_model

In [2]: mod = create_test_model("textbook")

In [3]: mod.genes[0].annotation
Out[3]: {}

Empty in this case but present.

@cdiener cdiener self-requested a review March 31, 2022 22:45
@cdiener
Copy link
Member

cdiener commented Apr 1, 2022

With the annotated Recon3D it errors:

In [5]: rec = load_matlab_model("Recon3DModel_301.mat")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 rec = load_matlab_model("Recon3DModel_301.mat")

File ~/code/cobrapy/src/cobra/io/mat.py:231, in load_matlab_model(infile_path, variable_name, inf)
    229 for possible_name in possible_names:
    230     try:
--> 231         return from_mat_struct(data[possible_name], model_id=possible_name, inf=inf)
    232     except ValueError:
    233         # TODO: use custom cobra exception to handle exception
    234         pass

File ~/code/cobrapy/src/cobra/io/mat.py:496, in from_mat_struct(mat_struct, model_id, inf)
    494         new_metabolite.formula = met_formulas[i]
    495     new_metabolites.append(new_metabolite)
--> 496 mat_annotations(new_metabolites, m, _d_replace=D_MET)
    497 model.add_metabolites(new_metabolites)
    499 new_genes = []

File ~/code/cobrapy/src/cobra/io/mat.py:368, in mat_annotations(target_list, mat_struct, _d_replace)
    364 annotation_matlab = list(
    365     set(mat_struct.dtype.names).intersection(D_REPLACE[_d_replace].keys())
    366 )
    367 annotation_lists = [[None]] * len(annotation_matlab)
--> 368 annotation_providers = [
    369     D_REPLACE[_d_replace + "_REV"][x] for x in annotation_matlab
    370 ]
    371 for i in range(len(annotation_matlab)):
    372     annotation_lists[i] = _cell_to_str_list(mat_struct[annotation_matlab[i]][0, 0])

File ~/code/cobrapy/src/cobra/io/mat.py:369, in <listcomp>(.0)
    364 annotation_matlab = list(
    365     set(mat_struct.dtype.names).intersection(D_REPLACE[_d_replace].keys())
    366 )
    367 annotation_lists = [[None]] * len(annotation_matlab)
    368 annotation_providers = [
--> 369     D_REPLACE[_d_replace + "_REV"][x] for x in annotation_matlab
    370 ]
    371 for i in range(len(annotation_matlab)):
    372     annotation_lists[i] = _cell_to_str_list(mat_struct[annotation_matlab[i]][0, 0])

KeyError: 'metKEGGID'

I think you don't want the + "_REV" in line 369.

@akaviaLab
Copy link
Contributor Author

Fixed.
If you can figure out what everything lower than
model.E m x evars
on the cobratoolbox page means, and if it is relevant to cobrapy, I'd appreciate it. I think cobrapy doesn't support any of these fields, but if I'm wrong, please let me know.

@cdiener
Copy link
Member

cdiener commented Apr 1, 2022

Oh yeah, those are additional variables used for enzyme capacity models. Not sure whether we should read them. They can be represented with optlang, but cobrapy has no methods for enzyme capacity models at this point. We may want to support model.C, model.ctrs, and model.ctrNames, which would just be added as additional constraints to the model. But this should probably be its own PR. What we may consider in this PR is to add writing annotations to Matlab as well, or did you want this in a separate PR (would be fine with me)?

@cdiener
Copy link
Member

cdiener commented Apr 1, 2022

Some of the properties don't seem to get parsed correctly. For instance, EC numbers:

In [97]: m["Recon3DModel"]["rxnECNumbers"][0][0][12]
Out[97]: array([array(['1.4.3.6'], dtype='<U7')], dtype=object)

In [98]: m["Recon3DModel"]["rxns"][0][0][12]
Out[98]: array([array(['13DAMPPOX'], dtype='<U9')], dtype=object)

In [99]: rec.reactions[12].annotation
Out[99]: {'pubmed': '15035803,8920635,9653080'}

In [100]: rec.reactions[12]
Out[100]: <Reaction 13DAMPPOX at 0x7fb42ae47ca0>

Copy link
Member

@cdiener cdiener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good. Some improvements to variable names and simplifications.

src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@akaviaLab akaviaLab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Will work on writing to matlab later.

src/cobra/io/mat.py Outdated Show resolved Hide resolved
@akaviaLab
Copy link
Contributor Author

Oh yeah, those are additional variables used for enzyme capacity models. Not sure whether we should read them. They can be represented with optlang, but cobrapy has no methods for enzyme capacity models at this point. We may want to support model.C, model.ctrs, and model.ctrNames, which would just be added as additional constraints to the model. But this should probably be its own PR. What we may consider in this PR is to add writing annotations to Matlab as well, or did you want this in a separate PR (would be fine with me)?

If we can add these as constraints, you can tell me what each one means or how to represent it in optlang, so they can be read in. It can be this PR.
The enzyme capacity that we don't have methods in cobrapy should probably be ignored for now.
As for writing annotations to matlab - I'm working on it.

@akaviaLab
Copy link
Contributor Author

Some thoughts before mission creep and/or adding stuff to this PR that doesn't belong

I've been working on comparing RECON3D SBML and RECON3D mat reading to see what's missing, and I have some helper functions I'm working on and so forth.

  1. Would you like me to try and add a function to compare models? I have a kludgy function I can tidy up. I'm using it to see where the differences are, so I can focus on what to fix. Separate PR is also an option.
  2. Would you like me to improve SBML reading? For example, the RECON3D SBML (which I think was written by cobratoolbox) has things in Notes that should really be in annotations. I can add some changes to SBML notes reading/writing to make sure these kind of annotations end up in annotations, or I can ignore them.
  3. It looks like cobratoolbox uses subsystems for groups, and only uses groups for reactions for now. Would you like me to make the matlab reader build groups based on subsystems? What about writing?

@cdiener
Copy link
Member

cdiener commented Apr 6, 2022

  1. Would you like me to try and add a function to compare models? I have a kludgy function I can tidy up. I'm using it to see where the differences are, so I can focus on what to fix. Separate PR is also an option.

I would say separate PR and maybe have it in the utils submodule?

  1. Would you like me to improve SBML reading? For example, the RECON3D SBML (which I think was written by cobratoolbox) has things in Notes that should really be in annotations. I can add some changes to SBML notes reading/writing to make sure these kind of annotations end up in annotations, or I can ignore them.

We had that discussion before but Notes in SBML are specifically meant to not be parsed and are only there for a human to look at them. No machine readable information should be stored there. So ignoring those is fine. This should be fixed on the cobratoolbox side when creating those models.

  1. It looks like cobratoolbox uses subsystems for groups, and only uses groups for reactions for now. Would you like me to make the matlab reader build groups based on subsystems? What about writing?

Yeah, the subsystem attribute itself is a leftover from legacy cobra models. Groups would definitely be the modern version for this. Probably better in a separate PR but it makes sense to read them as groups and write them from groups with specific names.

@cdiener
Copy link
Member

cdiener commented Apr 6, 2022

Oh and single annotations don't have to be in a list right now, but they can be. And maybe leave the writing to a new PR because that one is already large.

@akaviaLab
Copy link
Contributor Author

Oh and single annotations don't have to be in a list right now, but they can be.

Can single annotations be in a list? I wrote the code in a way that both single and multiple annotations will go into a list.

The PR has writing. If you want me to separate it to a different PR, I could, but I did it before I saw your comment.

@cdiener
Copy link
Member

cdiener commented Apr 7, 2022

Yes a list is fine. You mean like annotations = {"ec-number": ["1.2.3.4"]}, right? That is fine.

And no need to separate the PR. I thought you hadn't started on writing yet, but it's all good.

@akaviaLab
Copy link
Contributor Author

akaviaLab commented Apr 7, 2022 via email

@akaviaLab
Copy link
Contributor Author

Adding test cases, and nearly done.
Question - models that don't have compartment in the met_id (like raven.mat) that match with
if all(var in m.dtype.names for var in ["metComps", "comps", "compNames"]):

Are rewritten (because of create_mat_metabolite_id) as metabolites that do have compartment in the met id.
For example, in raven.mat
'm156' is rewritten as 'm156[Cytoplasm]'

It means that models that don't have compartments in met_ids do not make a seamless read, write, reread transfer.
Should this be happening? Should I change raven.mat or change the functions so it is rewritten the same as it was?

This seems to be a quirk of RAVEN format, so I don't know if it is even used apart from test cases.

@cdiener
Copy link
Member

cdiener commented Apr 20, 2022

Yeah, I agree with you that this is odd. I would also favor maintaining IDs the same. There isn't really a rule that compartment has to be in a metabolite ID. It's a convention that many IDs follow, but you usually also want a real attribute for a compartment. I would vote for removing that substitution and using the original ID. My argument would be that it would be surprising if I wrote a COBRApy model to Matlab and suddenly all my metabolites got renamed.

made failure better
added tests to fail when compartments are missing or problematic
added test to see that mat models with no genes are imported correctly
@akaviaLab
Copy link
Contributor Author

It looks like pepole who create mat files are sloppy and/or the behavior of the cobratoolbox changed over time. Specifically, the field names for annotations keep varying a bit on the capitalization
metChEBIID, metCHEBIID, and metChebiID are things that I've seen

Likewise for other fields.

Question - do we want cobrapy to be tolerant of this? If not, I'll go with the official description in https://github.com/opencobra/cobratoolbox/blob/master/docs/source/notes/COBRAModelFields.md

If so, do you have any idea on how to do case-insensitive matching?
Right now, I run
annotation_matlab = list(
set(mat_struct.dtype.names).intersection(D_REPLACE[_d_replace].keys())
)
Any ideas on how to do case-insensitive matching equivalent to this? I could do regular expression, but that seems like it would be complicated.

@cdiener
Copy link
Member

cdiener commented Apr 21, 2022

Yeah, they run no validations at all it seems. You could convert all keys in the mat struct to lower case and match on those only, maybe?

@akaviaLab
Copy link
Contributor Author

Okay. I'll output what is supposed to be the official version.

uri.akavia and others added 7 commits April 22, 2022 17:25
…ion of model fields (metChebiID, metCHEBIID, metChEBIID) - revised reading to be tolerant of this
…ssed in the CHEBI website.

removed unnecessary function create_mat_metabolite_id()
added tests - test_compare_xml_to_written_matlab_model, test_mat_model_with_long_compartment_ids, test_mat_model_wrong_caps
@akaviaLab
Copy link
Contributor Author

All done. Including tests for all kinds of weird cases.

Copy link
Member

@cdiener cdiener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, only a bunch of formatting/variable naming changes. The code is getting really hard to review at this point due to large number of commits. So I would probably recommend to leave most of those for a clean-up PR afterwards. The only thing for now would be to remove the code in the docstring that generates the wrong key model. Then we can merge and follow up with a PR to fix the wrong annotations in SBML models and another one for the variable name cleanup.

src/cobra/test/test_io/test_mat.py Show resolved Hide resolved
src/cobra/test/test_io/test_mat.py Show resolved Hide resolved
src/cobra/test/test_io/test_mat.py Outdated Show resolved Hide resolved
src/cobra/test/test_io/test_sbml.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
release-notes/next-release.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@akaviaLab akaviaLab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Some questions remaining.

src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/io/mat.py Outdated Show resolved Hide resolved
src/cobra/test/test_io/test_mat.py Show resolved Hide resolved
src/cobra/test/test_io/test_mat.py Show resolved Hide resolved
src/cobra/test/test_io/test_sbml.py Outdated Show resolved Hide resolved
src/cobra/test/test_io/test_mat.py Outdated Show resolved Hide resolved
uri.akavia added 2 commits April 26, 2022 14:26
next release corrected
removed instructions to gist
@akaviaLab
Copy link
Contributor Author

akaviaLab commented Apr 26, 2022

@cdiener - reverted the test_sbml.py changes
Will change the models and everything in a later PR. I'll add a note to the new PR to be merged after this one.
Yuck - this is not just changing the XML files, but changing how they are read. I'll open a discussion for that.

@cdiener
Copy link
Member

cdiener commented Apr 27, 2022

Actually, my bad. I read up on it and I think the lower case providers in SBML are now the de facto standard and are allowed by identifiers.org. So annotations["sbo"] is fine and the SBML parser will consolidate into this format (so transform from "SBO"). But we can discuss this more in another PR that addresses this because it also needs some adaptations for IDs.

@akaviaLab
Copy link
Contributor Author

@cdiener - I think we're done. Would you like another developer to review this?
I have the various improvements and cleanup ready to go in another PR once this is merged.

Copy link
Member

@cdiener cdiener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this in.

@cdiener cdiener merged commit f5a1b96 into opencobra:devel Apr 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve the reading of MATLAB models
3 participants