In cobra.io neither sbml.py nor sbml3.py seem to import or export notes. #4

Open
Midnighter opened this issue Jul 21, 2017 · 12 comments

@Midnighter
Member

From @ChristianLieven on July 6, 2017 16:6

Problem description

I am currently reconstructing a metabolic model, for which I am adding confidence scores, comments, and literature references in the notes attribute of reactions, metabolites and genes. The importance of confidence scores and related qualitative annotation parameters is discussed in the publications linked above.

I tried importing simple notes by adding the following notes field to the RECON1 model from BiGG.

<notes>
  <body xmlns="http://www.w3.org/1999/xhtml">
    <center><h2>This is a TEST</h2></center>
    <p>I am wondering if COBRApy is able to import this.</p>
  </body>
</notes>

I was quite surprised that the RECON1 model did not contain the confidence scores on which some of the results of this research are based.

I was not able to find the keywords 'confidence', 'score' or 'confidence_score' in cobra.io.sbml nor cobra.io.sbml3. If I saw that right the legacy import looks specifically for charge, GPR, and subsystem in the notes field but doesn't account for the confidence score.

Code Sample

You can find my modified example SBML3+FBC RECON1 file here. The modification is at R_EX_dopa_e.

Discussion

It seems like the community hasn't decided yet what exactly the notes field should contain and how it should be formatted. Personally, I'd find it most useful if there were a clever way of allowing both short, human-readable comment entries and optional, specifically related, machine-readable DOI-style literature references. In the model object, I suppose this could be a nested dictionary looking something like this:
some_model.reactions.SOME_RXN.notes = {"confidence_score":{"value":4, "reference":"some_doi"}}

Based on the referenced publications above, another useful key of the notes-field/attribute would be a simple 'comment' option, which would be limited in length (50 chars? 70 chars? 80 chars?).

some_model.metabolites.some_metabolite.notes = {"comment":{"value":"Short string outlining a hypothesis or specific decision for this metabolite", "optional_reference":"some_doi"}}
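As a rough illustration of the nested dictionary proposed above, here is a minimal Python sketch. The helper names `make_note` and `validate_notes` are hypothetical and not part of cobrapy, the 80-character limit is just one of the values floated above, and the sketch uses a single `reference` key for both entries rather than `optional_reference`.

```python
MAX_COMMENT_LENGTH = 80  # assumption: one of the 50/70/80 limits floated above


def make_note(value, reference=None):
    """Build one notes entry: a value plus an optional DOI-style reference."""
    entry = {"value": value}
    if reference is not None:
        entry["reference"] = reference
    return entry


def validate_notes(notes):
    """Return a list of problems found in a nested notes dictionary."""
    problems = []
    for key, entry in notes.items():
        if "value" not in entry:
            problems.append(f"{key}: missing 'value'")
        elif key == "comment" and len(str(entry["value"])) > MAX_COMMENT_LENGTH:
            problems.append(f"{key}: longer than {MAX_COMMENT_LENGTH} characters")
    return problems


notes = {
    "confidence_score": make_note(4, "some_doi"),
    "comment": make_note("Kept after manual curation", "some_doi"),
}
print(validate_notes(notes))
```

Keeping every entry in the same `{"value": ..., "reference": ...}` shape means a tool can check all keys with one loop instead of special-casing each one.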

I don't doubt that there could be a feasible, simple implementation on the Python side of things; however, I am unfamiliar with the options on the XML, specifically SBML, side. According to the SBML specification, a notes field is allowed to contain...

Almost any wellformed content permitted in XHTML subject to a few restrictions

...which seem pretty straightforward, namely the notes field...

must not contain an XML declaration or a DOCTYPE declaration.

Hence, I think a solution here could be to use <ul> from HTML?

What do you think?

Copied from original issue: opencobra/cobrapy#541

@Midnighter
Member Author

From @cdiener on July 6, 2017 18:54

That is a good point and one that pops up every once in a while for discussion. There is some ongoing discussion about the meaning of the SBML spec regarding the notes field. SBML only says:

It is intended to serve as a place for storing optional information intended to be seen by humans.

and comparing to annotation:

Whereas Notes is a container for content to be shown directly to humans, Annotation is a container for optional software-generated content not meant to be shown to humans.

The interpretation of the cobrapy maintainers in the past was that since notes should not be "consumed by a machine", they would not be written or read by cobrapy, except to support the SBML 2 cobra annotations. The argument was that all annotation should go into the annotation tag as described in the spec. For the particular use case of DOI annotations, this is the recommended solution. There is a MIRIAM tag for DOIs, so you can just use that. For instance, the following is valid SBML and would be read into model.metabolites.h_c.annotation in cobrapy:

<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#"
    xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
    xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
    <rdf:Description rdf:about="#M_h_c">
      <bqbiol:is>
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00080"/>
          <rdf:li rdf:resource="http://identifiers.org/doi/10.1038/nbt1156"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>
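To show what a machine can get out of such an annotation, here is a minimal standalone sketch using only Python's standard library. It is not how cobrapy parses annotations; the function name `miriam_resources` is hypothetical, and it assumes every resource URI has the form `http://identifiers.org/<collection>/<id>`.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the annotation above (unused namespaces omitted).
ANNOTATION = """
<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
    <rdf:Description rdf:about="#M_h_c">
      <bqbiol:is>
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00080"/>
          <rdf:li rdf:resource="http://identifiers.org/doi/10.1038/nbt1156"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>
"""


def miriam_resources(annotation_xml):
    """Map each identifiers.org collection to the identifiers listed for it."""
    root = ET.fromstring(annotation_xml)
    resources = {}
    for li in root.iter(f"{{{RDF_NS}}}li"):
        uri = li.get(f"{{{RDF_NS}}}resource")
        # Assumes URIs of the form http://identifiers.org/<collection>/<id>.
        rest = uri.split("identifiers.org/", 1)[1]
        collection, _, identifier = rest.partition("/")
        resources.setdefault(collection, []).append(identifier)
    return resources


print(miriam_resources(ANNOTATION))
```

Splitting only on the first slash after the collection name keeps DOIs such as `10.1038/nbt1156` intact.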

However, that only works for direct annotations and not for adding data. For instance, if I want to add some other quantity to the species or reaction (confidence scores, charge in various conditions, etc.), there is no way to do that with annotations. This is a shortcoming of SBML IMHO. So I would be in favour of reading and writing the notes field. It could be just raw text, or it could be a dictionary that is read from and written to <ul> tags as you specified, and written into a <p> tag if it's just a string. But that would depend on how others interpret the SBML spec here.
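A minimal sketch of that idea, assuming a flat dictionary and a hypothetical "key: value" convention per list item (neither is defined by SBML or implemented in cobrapy). The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def notes_to_xhtml(notes):
    """Serialize a str as <p>, or a flat dict as <ul>, inside <notes><body>."""
    root = ET.Element("notes")
    body = ET.SubElement(root, "body", {"xmlns": XHTML_NS})
    if isinstance(notes, str):
        ET.SubElement(body, "p").text = notes
    else:
        ul = ET.SubElement(body, "ul")
        for key, value in notes.items():
            # Hypothetical convention: one "key: value" pair per list item.
            ET.SubElement(ul, "li").text = f"{key}: {value}"
    return ET.tostring(root, encoding="unicode")


def xhtml_to_notes(xml_text):
    """Read back a dict from <li> items, else the text of the first <p>."""
    root = ET.fromstring(xml_text)
    items = root.findall(f".//{{{XHTML_NS}}}li")
    if items:
        notes = {}
        for item in items:
            key, _, value = (item.text or "").partition(": ")
            notes[key] = value
        return notes
    paragraph = root.find(f".//{{{XHTML_NS}}}p")
    return paragraph.text if paragraph is not None else None


xml_text = notes_to_xhtml({"confidence_score": 4, "comment": "checked by hand"})
print(xml_text)
print(xhtml_to_notes(xml_text))
```

Note that values come back as strings (`"4"`, not `4`), which is one reason a plain notes round trip is weaker than a typed annotation.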

@Midnighter
Member Author

From @ChristianLieven on July 11, 2017 14:24

#534 Referencing this issue because @draeger, @Midnighter and @hredestig came up with this solution, which I consider quite optimal:

We are not aware of any existing schema or documentation of the annotation tags used in cobra. Our suggestion is to create a new repository under the opencobra organization. That way, any member of the opencobra community (most importantly of the Matlab COBRA Toolbox) can feel free to contribute to the schema, there can be versioned releases of the schema, and for the time being it can be hosted on https://opencobra.github.io/annotations/schema or whatever is decided for the name and URL.

We would then implement in cobrapy whatever is dictated by the schema and there's a chance for other tools in the opencobra community to do the same.

@Midnighter
Member Author

From @draeger on July 19, 2017 14:0

Well, there is, of course, another way of storing confidence scores for reactions in a standard-compliant form. You could use Parameter objects for this. These are objects in the listOfParameters directly within the model and have an id, an optional name, and a value. In their id, you could prefix the id of the reaction that the confidence score refers to. However, this would again not be the best solution for storing that sort of information, because it is not obvious what these parameters are.
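A minimal sketch of the Parameter workaround, assuming a hypothetical naming convention (`confidence_score__<reaction id>`) and leaving out the SBML namespace for brevity. It only illustrates the shape of the data; as noted above, nothing tells other tools what these parameters mean.

```python
import xml.etree.ElementTree as ET

PREFIX = "confidence_score__"  # hypothetical naming convention


def confidence_parameters(scores):
    """Build a listOfParameters element from {reaction_id: score}."""
    parameters = ET.Element("listOfParameters")
    for reaction_id, score in scores.items():
        ET.SubElement(parameters, "parameter", {
            "id": f"{PREFIX}{reaction_id}",
            "name": f"Confidence score of {reaction_id}",
            "value": str(score),
            "constant": "true",
        })
    return parameters


def read_confidence_scores(parameters):
    """Recover {reaction_id: score} from parameters that follow the convention."""
    return {
        p.get("id")[len(PREFIX):]: float(p.get("value"))
        for p in parameters.iter("parameter")
        if p.get("id", "").startswith(PREFIX)
    }


params = confidence_parameters({"R_EX_dopa_e": 4})
print(ET.tostring(params, encoding="unicode"))
print(read_confidence_scores(params))
```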

@luciansmith

I fully support the idea of coming up with your own schema to store information in the 'annotation' child of SBML objects; I think this is a great idea. However, there are a couple things you've mentioned wanting that you could store in SBML packages:

  • Groups of objects (i.e. metabolites) could be stored using the 'groups' package (This was discussed in Non-integer charge #3).
  • Confidence intervals could be stored using the 'distrib' package for distributions.

The 'groups' package is released and ready to use today. The 'distrib' package has not yet been finalized, so if there's anything you need that is not yet there, it would be relatively straightforward to add it (I've been in charge of shepherding that package to completion; email me and/or the package working group at [email protected] if you have questions or requests.)

@cdanielmachado

I see pros and cons of having the notes field and the annotations field, and the fact that one is supposed to be human-readable and the other machine-readable.

The thing is... what if you want to have something that is both human-readable and machine-readable? It is very nice and convenient just to have the best of both worlds.

I currently added support for having an extended set of metabolite and reaction attributes in framed and carveme.

When reading/writing an SBML file I parse attributes in the form of "key: value" pairs which are stored in the notes field. These are then stored inside the Metabolite and Reaction objects, using an attribute called "metadata" which is just a python dictionary.

This metadata includes things like formulas, ec numbers, manual curation notes, etc. I frequently use these attributes to implement different kinds of methods (e.g.: delta G values for thermodynamic FBA).
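The "key: value" approach can be sketched roughly as follows. This is not the actual parser from framed or carveme; the function name `parse_metadata`, the assumption that each pair sits in its own <p> element, and the example keys and values are all illustrative.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_metadata(notes_xml):
    """Collect 'key: value' pairs found in the <p> elements of a notes block."""
    metadata = {}
    root = ET.fromstring(notes_xml)
    for paragraph in root.iter(f"{{{XHTML_NS}}}p"):
        key, sep, value = (paragraph.text or "").partition(":")
        # Paragraphs without a separator are treated as free text and skipped.
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    return metadata


# Illustrative notes block; keys and values are made up for this example.
NOTES = """
<notes>
  <body xmlns="http://www.w3.org/1999/xhtml">
    <p>FORMULA: H2O</p>
    <p>EC_NUMBER: 1.1.1.1</p>
    <p>Free text without a separator is skipped</p>
  </body>
</notes>
"""

metadata = parse_metadata(NOTES)
print(metadata)
```

Splitting on the first colon only means values may themselves contain colons, at the cost of forbidding them in keys.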

I think that constantly extending SBML with new attributes every time someone needs a new attribute is not very sustainable in the long term. You need to wait for a new release of the fbc package, which takes a lot of time, and in the meantime, people already came up with their own workarounds.

One possible solution (not ideal, I know) is to have these dictionaries of extended attributes, and the subset of people who want to use a particular attribute (like delta G value), or implement support for it in their simulation libraries, just come together and agree on a suitable identifier name.

@draeger

draeger commented Nov 19, 2017

@luciansmith: One comment about the confidence scores. These are not confidence intervals from a distribution. These are typically discrete numbers (often from 0 to 4) indicating the level of knowledge the model creator has that the component should be in the model. The numbers correspond to categories such as "read in a paper," "experimentally verified," "from a related organism," "computationally inferred," or similar. I, therefore, believe that the distrib package is not the right recommendation for storing this kind of information.

@cdanielmachado: I think it would also be good to create a dedicated new package for adding additional properties to model components. The SBML extension would only introduce an extension to SBase, in the sense that you can add a triple of an ontology term, some value (either qualitative or quantitative), and a third attribute for the data type of the value. For instance, an ontology term for Gibbs free energy would be one attribute, and the value would be stored as a String. The third attribute would indicate that the value is a floating-point number, so that a software package could parse it. The ontology could be continuously extended and improved, independently of the SBML extension package. In this way, we could systematically add many kinds of values. The package's specification should give best practices so that information that belongs in other (more specific) fields is not stored there. For instance, EC numbers should go into MIRIAM annotations.
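To make the term/value/type triple concrete, here is a minimal Python sketch. The class name `Property`, the type names in `PARSERS`, and the placeholder term `EXAMPLE:0000001` are assumptions for illustration; no such package or vocabulary is defined in this thread.

```python
from dataclasses import dataclass

# Hypothetical type names; the proposal does not fix a vocabulary.
PARSERS = {"float": float, "int": int, "string": str}


@dataclass(frozen=True)
class Property:
    term: str       # ontology term identifying what the value means
    value: str      # always stored as a string, as suggested above
    data_type: str  # tells software how to parse `value`

    def parsed(self):
        """Convert the stored string according to its declared data type."""
        try:
            parser = PARSERS[self.data_type]
        except KeyError:
            raise ValueError(f"unknown data type: {self.data_type!r}") from None
        return parser(self.value)


# Placeholder term and value, not a real ontology entry or measurement.
gibbs = Property(term="EXAMPLE:0000001", value="-12.5", data_type="float")
print(gibbs.parsed())
```

Because the meaning lives in the ontology term and not in the attribute name, new kinds of values can be added without changing the container.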

@matthiaskoenig

matthiaskoenig commented Nov 19, 2017 via email

@ChristianLieven

ChristianLieven commented Dec 3, 2017

Personally I would just annotate this to an evidence ontology, which has a
much more fine grained evidence handling (and especially the tree
relationship between the different confidence/evidence
http://www.evidenceontology.org/browse/

I can get behind using an evidence ontology instead of the rather arbitrary confidence scores that are floating around.

Just to get us back on track, however, my initial question was more aimed at finding the best way of connecting any annotation-information with both a human-readable note AND a machine-readable DOI. So through this schema, I'd like to consolidate a way that this can be done consistently for COBRA models. The whole reason for this is: Using memote, I want to be able to not only gather information on the number of annotations for any given model component but also provide information on the amount and quality of evidence backing up these annotations.

To take up Matthias's suggestion of ECO again, I could imagine a possible metric to be the ratio of experimental evidence to genomic context evidence for a given metabolic model. Or I could simply provide an overview of evidence types.
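A rough sketch of such a metric, assuming the evidence categories have already been extracted per model component. The function names, reaction IDs, and category labels are made up for illustration; this is not memote code.

```python
from collections import Counter


def evidence_overview(evidence_by_component):
    """Count how often each evidence category occurs across components."""
    return Counter(
        category
        for categories in evidence_by_component.values()
        for category in categories
    )


def evidence_ratio(overview, numerator, denominator):
    """Ratio of two category counts, or None when the denominator is zero."""
    if overview[denominator] == 0:
        return None
    return overview[numerator] / overview[denominator]


# Made-up reaction IDs and labels, for illustration only.
evidence = {
    "R1": ["experimental"],
    "R2": ["genomic context"],
    "R3": ["experimental", "genomic context"],
    "R4": ["experimental"],
}
overview = evidence_overview(evidence)
print(overview)
print(evidence_ratio(overview, "experimental", "genomic context"))
```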

Edit:
Ignore my comment above, I'm retracing all the things said back in July to get back into the discussion, and found that in #3 @draeger has already pointed out a suitable solution for this.

You can do something like this using MIRIAM annotations. This gives you a method to specify an online resource (such as a publication identifier) and state the relationship between the model component and that online resource. For instance, you can say IS_DESCRIBED_BY and then add the resource http://identifiers.org/pubmed/25562137 which is exactly the publication you cited above. For more information, please see http://identifiers.org or http://www.ebi.ac.uk/miriam/main/collections/MIR:00000113

@ChristianLieven

Looks like the discussion at draeger-lab/ModelPolisher#5 provided an excellent solution for this issue without necessarily needing to reinvent the wheel with a new schema.

@bdelepine

Hi all,

From what I read above, in associated issues, and in the SBML L3V2 documentation, I understand that we can annotate in <annotation> pretty much anything that refers to a concept or an external resource, with the right combination of relation element (bqmodel:is, bqbiol:isDescribedBy, etc.) and ontologies (SBOterm, evidence ontology, etc.) defined in external namespaces (rdf, bqmodel, vcard4, etc.).

But I still can't find a way to encode data, such as Gibbs free energy, in <reaction>. Other examples mentioned earlier in this thread can benefit from the use of a readily available ontology (confidence score) or already have their own dedicated SBML feature (curator name, date of modification, etc.; see History, section 6.6).

In my opinion, COBRA should not parse anything within the <notes> to respect "human-only" SBML specification, but still import/export whatever is in <notes> in a blob to make it available to users. This would allow them to hack their way when they don't want to use a separate file to store data.
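A minimal sketch of the "blob" idea using only the standard library: the notes child is returned as an unparsed string. The function name `notes_blob` is hypothetical, the reaction fragment is trimmed for illustration, and note that ElementTree rewrites namespace prefixes on output, so the string is equivalent to the original but not byte-identical.

```python
import xml.etree.ElementTree as ET


def notes_blob(element):
    """Return the serialized <notes> child of an SBML element, or None."""
    for child in element:
        # Compare local names so the sketch works with any SBML namespace.
        if child.tag.rsplit("}", 1)[-1] == "notes":
            return ET.tostring(child, encoding="unicode").strip()
    return None


# Trimmed fragment for illustration; a real reaction has more attributes.
REACTION = """
<reaction xmlns="http://www.sbml.org/sbml/level3/version1/core" id="R_EX_dopa_e">
  <notes>
    <body xmlns="http://www.w3.org/1999/xhtml"><p>Checked by hand.</p></body>
  </notes>
</reaction>
"""

blob = notes_blob(ET.fromstring(REACTION))
print(blob)
```

Users who want to hack structured data into notes can then parse the returned string themselves, while the library stays out of interpreting it.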

Note that @draeger proposed to create an SBML package that would be generic enough to solve this kind of problem.

@draeger

draeger commented Apr 18, 2018

@bdelepine, thanks for pointing out that notes aren't the right place to store machine-readable information. A few additional comments from my side:

  1. Confidence Scores. This discussion has come to a good solution already in a separate thread: SBO term for confidence score draeger-lab/ModelPolisher#5
  2. Additional Data, such as Gibbs energy values: @bgoli and @fbergmann suggested to extend the fbc package for SBML with an additional key-value-pair list where this could go in. Maybe they can provide a link to their proposal?

@bgoli

bgoli commented Apr 19, 2018

@draeger here we go: http://pysces.sourceforge.net/KeyValueData/

Note that in the "practical example" some of the terms can be written as MIRIAM URIs or are now encoded in FBC; the keys are arbitrary.

I've been using this for a few years in my tools, and it is simple to parse as an SBML annotation and extremely flexible. In general, I've found the "type" attribute to be practically redundant. One extension I'm considering is to add a "url" attribute to the element that will act as an optional/supplementary controlled key.


8 participants