Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SBML -> Store Model IDs #8

Open
tpfau opened this issue Nov 20, 2017 · 6 comments
Open

SBML -> Store Model IDs #8

tpfau opened this issue Nov 20, 2017 · 6 comments

Comments

@tpfau
Copy link

tpfau commented Nov 20, 2017

This is a spin of from #6, concerning only the SBML <-> COBRA ID conversion.
Current status is two options:
A: Define a clear conversion scheme between COBRA-ID and SBML ID
B: Add the fbc label as an extension of SBase instead of only on GeneProducts.

If B does not exist in a model, A will be used (and due to old code, some modifications will happen for backward compatability).

@tpfau
Copy link
Author

tpfau commented Jan 3, 2018

Since this is coming up again for us, I'm wondering, whether we, at least within the cobra projects can figure out a way how we store our ids independent of conversions.
From what I figured in several discussions, we will need to keep a bunch of conversions in place when reading SBML (BioModels uses __DASH__ and similar things), bigg has the R_ / M_ markers and _c nomenclature for localisation etc, while the cobra toolbox has [c] as compartment id.
So at least when reading models, we will have to potentially make some adjustments (or just read the data without any modification).

I'm not sure, when a new FBC scheme will be ready and I don't know whether a "label" field is actually useful for us. What I'm wondering is if we can set up an annotation that allows literally "anything" to be stored as id e.g.

<annotation>
<cobra:id>....Something "That Can [be] anything"..</cobra:id>
</annotation>

While I don't like this, it is the only way I can think of how we can handle "any" char...

@rmtfleming
Copy link
Member

I think we need to define what characters are and are not going to be permitted, rather than allowing everything, which is too flexible. How about we start with the set of printable ASCII characters, available in matlab using:
ascii = char(reshape(32:127,32,3)')
ascii =
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
' a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

Then excluding certain special characters, such as [ ) |, or certain phrases 'and' 'or'. Once we have a proposition, we should try to get agreement across the cobra community.

@draeger
Copy link

draeger commented Jan 3, 2018

This sounds very reasonable to me! As a side note, for this annotation to work, we will also need to define a unique namespace URI, for instance, https://github.com/opencobra/schema/sbml/v1 or so, where v1 denotes a version number so that it can be updated later on.

@Midnighter
Copy link
Member

Midnighter commented Jan 3, 2018

On the cobrapy side, if we stay with the current implementation it would also be very desirable to turn IDs into valid Python variable names. So they should follow a regex such as [a-zA-Z_][a-zA-Z0-9_]*. It's desirable for us because it facilitates quick access to the metabolite and reaction objects without causing a syntax error.

@tpfau
Copy link
Author

tpfau commented Jan 3, 2018

And here we got our first major clash 😄
Cobra Toolbox ids commonly come with '[]' and '()' as the metabolite compartment is stored in the metabolite id. So we currently always have to convert our ID into something that is SBMLID compatible.
But how does cobrapy handle the current ids? Is it using the SBML-IDs directly (which should be possible)?

With respect to restrictions:
I'm fine with whatever choice is made there, but we have to ensure that we don't forbid existing names.
Overall I personally think it is better to allow any ID and store it as a simple string than to start restricting, as this will likely lead to problems for someone. Yes, a restriction to printable ASCII characters makes sense, but other restrictions are likely to cause issues in the future.
Also, if we don't stick to the SBMLID restrictions, we still have the problem of how to store this id, and thus of conversion, and the only reason this is problematic is because of conversions, so if we don't need to convert it, we don't need to restrict, and if we need to convert to something matching @Midnighter 's regexp, we have a valid SBML ID, but have to change a lot of things in the Toolbox 😄

@tpfau
Copy link
Author

tpfau commented Jan 25, 2018

Coming back to this:
I personally don't think, that we need tough restrictions on the model IDs.
The only real restrictions I can think of from the COBRA Toolbox perspective (at the moment):
metabolite IDs are not allowed to have [] at any place outside the enclosing brackets for the compartment ID.

And even this restriction could be overcome if we switch from the implicit representation of compartments in the ID to an explicit representation of the compartment in a metComps field. We will also soon have to handle the question on how to represent multiple tissues in both SBML and cobra models. Sure, SBML just defines all compartments in different tissues as individual compartments, but we will need to have something that (apart from again storing information in IDs) clarifies which compartments form a group in which tissue. My suggestion would be to use the group extension to group the different compartments and, again store the individual compartment IDs using the COBRA ID annotations.

The benefit would be, that SBML IDs would finally become what they are supposed to: pure computational IDs without attached meaning, while all meaning would be given by annotations.
To be able to handle older models, we will of course have to have some type of conversion function for old style IDs but that would be software side implementations which are not interfering with model exchange.
For python, this would essentially mean, that you could use the SBMLID as variable name, but have __str__ and __repr__ return the displayID.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants