Check for IDSPACE conflicts on new ontology submission #1704

Open
jamesaoverton opened this issue Dec 16, 2021 · 24 comments
Labels
automated validation of principles Issues for the editorial WG pertinent to the automating the validation of the Principles.

Comments

@jamesaoverton
Member

#1703 indicates a larger problem with namespace conflicts.

New projects should be encouraged (required?) to check for conflicts. https://bioregistry.io/ is currently the easiest and most effective place to look. The simplest fix would be to update our instructions. Better would be an automated check, maybe in https://github.com/OBOFoundry/obo-nor.github.io.

Our current documentation at https://obofoundry.org/id-policy.html#allocating-idspaces points to http://identifiers.org/, but that registry does not include "EPSO" and would not have helped in this case.

@cthoyt
Collaborator

cthoyt commented Dec 16, 2021

Thanks for writing this up, James. As we've noted on biopragmatics/bioregistry#273, the Bioregistry does not and likely never will consume the full BioPortal, so we should also consider whether the OBO Foundry wants to fully respect the prefixes minted in BioPortal or not (e.g., there can be, and in some cases already are, nonsensical overlaps/conflicts with high-quality resources in the OBO Foundry).

@matentzn
Contributor

@cthoyt Do you have a list of conflicts between BioPortal and OBO by any chance?

@cthoyt
Collaborator

cthoyt commented Dec 16, 2021

Cross post of biopragmatics/bioregistry#273 (comment):

The conflicts that I've curated manually are all in this file https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/mismatch.json

The curation sheet for BioPortal (which represents all unaligned prefixes) is https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/external/bioportal/curation.tsv. Because the BioPortal API does not give direct access to most of the metadata for each entry, the only things represented in this sheet are, unfortunately, the BioPortal prefix and the BioPortal name. This makes untargeted curation awfully tricky and time-consuming.

@matentzn
Contributor

@nataled for now we need to document clearly that a new request must not conflict with anything in Bioregistry or BioPortal. I can

  • Modify the new request template with a request to manually check Bioregistry & BioPortal

before we:

  • Implement automated checks as part of the dashboard

@nataled
Contributor

nataled commented Dec 16, 2021

https://obofoundry.org/id-policy.html#allocating-idspaces currently says to check identifiers.org. Please confirm that this should be replaced with Bioregistry and BioPortal. Alternatively, the latter two can be added alongside it.

@cthoyt
Collaborator

cthoyt commented Dec 16, 2021

See previous discussion on #1519

@nataled
Contributor

nataled commented Dec 16, 2021

Clear as mud! ;)

So...
identifiers.org
bioregistry.io
bioportal.org
obofoundry.org
n2t.net
...and...? (I saw several others mentioned, without URLs, like PrefixCommons and BioContext)

I'm looking for a definitive list of resources (names and URLs), specifically for the lists themselves (not some upper-level landing page). In other words, the user should be able to go to the link we provide and see the list of prefixes. Failing that, some page that provides a search function.

@cthoyt
Collaborator

cthoyt commented Dec 16, 2021

The Bioregistry imports Identifiers.org, OBO Foundry, and N2T as well as many other resources (see here for a full list), so it can be a one-stop shop for most resources. However, it does not import all of BioPortal, so users should check there too.

Web Access

Data Dumps

Bioregistry also has several full dumps for potential contributors who want to access this information programmatically. These are all updated on a nightly basis.

BioPortal doesn't offer any first-party data dumps, but the Bioregistry generates one nightly at https://github.com/biopragmatics/bioregistry/blob/main/src/bioregistry/data/external/bioportal/raw.json

Programmatic Access

Programmatic way to check if something is in the Bioregistry:

import bioregistry

query = "EPSO"
# normalize_prefix() returns None when the Bioregistry does not know the prefix
available_in_bioregistry = bioregistry.normalize_prefix(query) is None

Programmatic way to check if something is in BioPortal:

from bioregistry.external.bioportal import get_bioportal

query = "EPSO"
# The returned dictionary is keyed by BioPortal prefix, so a membership test suffices
bioportal_dict = get_bioportal()
available_in_bioportal = query not in bioportal_dict
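
The two checks above could be folded into one helper. The following is only a minimal sketch and not part of the Bioregistry API: `find_conflicts` and the sample prefix sets are hypothetical, and the case-insensitive comparison is an assumption about what should count as a clash. In practice the prefix collections would be filled from `bioregistry` and `get_bioportal()` as shown above.

```python
from typing import Dict, Iterable, List


def find_conflicts(query: str, registries: Dict[str, Iterable[str]]) -> List[str]:
    """Return the names of the registries that already contain ``query``.

    Comparison is case-insensitive (an assumption), so "epso" clashes with "EPSO".
    """
    needle = query.strip().casefold()
    return [
        name
        for name, prefixes in registries.items()
        if needle in {prefix.casefold() for prefix in prefixes}
    ]


# Hypothetical sample data; real sets would come from the registries themselves.
registries = {
    "bioregistry": {"go", "chebi", "doid"},
    "bioportal": {"EPSO", "GO"},
}

conflicts = find_conflicts("epso", registries)
print(conflicts)  # names of the registries where the prefix is already taken
```

An empty list means the prefix looks free in every registry that was passed in; it says nothing about registries that were not checked.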

@nataled
Contributor

nataled commented Dec 16, 2021

Perfect, thanks!

@cthoyt
Collaborator

cthoyt commented Dec 16, 2021

@nataled I updated my comment above with more information that might be more actionable. Feel free to reuse part or all of it

@nataled
Contributor

nataled commented Dec 16, 2021

I have updated the documentation (which is outside the scope of this ticket). Please see this page: https://obofoundry.org/id-policy.html and look for the section Allocating IDSPACEs and the subsection Guidelines for selecting an IDSPACE

@nataled
Contributor

nataled commented Dec 16, 2021

@matentzn note that my changes should satisfy your first checkbox regarding updating the instructions (the template itself already points to the document I just revised; I see no need to add text to the template itself since that will just duplicate the information).

@matentzn
Contributor

This all looks great. I made an issue at the OBO dashboard to implement @cthoyt's checker:
OBOFoundry/OBO-Dashboard#59

Thank you both for dealing with this! Are there any remaining action items here?

@nataled
Contributor

nataled commented Dec 17, 2021

Looks like all aspects of this have either been taken care of, or have a ticket to do so.

@matentzn
Contributor

Great! Thanks, everyone, for your input!

@nataled
Contributor

nataled commented Dec 17, 2021

The only concern I have is with the 'strength' of this requirement, and its scope. Strength refers to whether the dashboard reports a clash as an ERROR, WARN, or INFO. I'm certain that a clash with another Foundry ontology would be an ERROR, for example, but not so sure about clashes with non-ontology resources that might be little-known projects. Scope refers to whether or not the ontology needs to be concerned with obsolete resources. I'm not sure these aspects have been discussed.
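
To make the strength and scope questions concrete, here is a rough sketch of how a dashboard rule might grade a clash. Everything in it is hypothetical: the `Clash` fields, the `grade_clash` function, and especially the mapping itself, which only illustrates the options and is not an agreed policy.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "ERROR"
    WARN = "WARN"
    INFO = "INFO"


@dataclass(frozen=True)
class Clash:
    prefix: str
    source: str        # registry where the clash was found, e.g. "obofoundry"
    is_ontology: bool  # strength question: is the other resource an ontology?
    is_obsolete: bool  # scope question: is the other resource obsolete?


def grade_clash(clash: Clash) -> Severity:
    """Grade a clash under one possible (not agreed) policy."""
    if clash.source == "obofoundry":
        # A clash with another Foundry ontology is always an ERROR.
        return Severity.ERROR
    if clash.is_obsolete:
        # Assumption: obsolete external resources are reported but do not block.
        return Severity.INFO
    if not clash.is_ontology:
        # Assumption: live non-ontology resources produce a warning only.
        return Severity.WARN
    return Severity.ERROR
```

A stricter policy would simply return `Severity.ERROR` in every branch; the open question is which of these branches, if any, should be softer.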

@matentzn
Contributor

I am happy to publicise this widely, but I think BioPortal and Bioregistry clashes at the very least MUST be avoided moving forward. We owe this to open science. I am happy to leave this ticket open, but unless we get a seriously strong argument for permitting namespace clashes with existing resources, used or otherwise, I think this will be an ERROR. What about this: if we don't see counterarguments on this issue by Friday, 24 December, the BioPortal/Bioregistry clash rule goes into OBO Law.

@nataled
Contributor

nataled commented Dec 17, 2021

It's basically written that way now, at least by interpretation. I'm not objecting or wavering, really, but I don't recall any discussion of nuances like those I mentioned. Perhaps an Ops call agenda item?

@matentzn added the "attn: OFOC call" label (Issue to discuss on fortnightly OBO Operations meeting) on Dec 17, 2021
@matentzn
Contributor

Ok.

Remaining action item:

  • Operations committee to sign off on mandatory non-clash rule with BioPortal and Bioregistry (ERROR status in dashboard).

@cthoyt
Collaborator

cthoyt commented Dec 17, 2021

@matentzn I'd also propose that this should require a technical check that fails on a PR with problematic content; it's always possible that people miss what's in the dashboard.
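
As a sketch of what such a check could look like: a small function that returns a non-zero exit code when any submitted prefix is already taken, so that a CI job wrapping it would fail the PR. The function name, the `taken` set, and the message format are all hypothetical, and how the set of taken prefixes is obtained is left out.

```python
import sys
from typing import Iterable, List


def check_prefixes(submitted: Iterable[str], taken: Iterable[str]) -> int:
    """Return 0 if no submitted prefix is already taken, 1 otherwise."""
    # Case-insensitive comparison is an assumption about what counts as a clash.
    taken_folded = {prefix.casefold() for prefix in taken}
    clashes: List[str] = [p for p in submitted if p.casefold() in taken_folded]
    for prefix in clashes:
        print(f"ERROR: IDSPACE '{prefix}' is already in use", file=sys.stderr)
    return 1 if clashes else 0


# Hypothetical values; a real job would read them from the PR and the registries,
# then pass the result to sys.exit() so the CI run fails on a clash.
exit_code = check_prefixes(["NEWONT"], {"EPSO", "GO"})
print(exit_code)
```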

@matentzn
Contributor

matentzn commented Dec 17, 2021

This is not just for this case here - I think I have a better idea for that which does not require a check. Basically, in order to pass the dashboard the whole config must be present - since it's already there, we should just be able to use it instead of having ontology submitters use their own. An even better idea: we require the pull request with the metadata right from the start, even before permission; then the dashboard can just pull that. This would fully automate the OBO NOR dashboard, with no need for me to intervene anymore.

@nlharris added the "automated validation of principles" label (Issues for the editorial WG pertinent to the automating the validation of the Principles.) on Dec 18, 2021
@nlharris
Contributor

Remaining action item:

  • Operations committee to sign off on mandatory non-clash rule with BioPortal and Bioregistry (ERROR status in dashboard).

We should add this to the next OBO Ops call agenda, which will be chaired by @nicolevasilevsky.

@matentzn
Contributor

Given that we have put this in our ID Policy here https://obofoundry.org/id-policy.html, and our NTR issue here, I don't think that it needs to be put in front of OBO Ops again. It has been decided. @nataled should check that this is documented accurately enough. I think it is good enough, but some stronger wording could help.

So the only remaining item here is

  • Make an OBO dashboard ticket for checking the no-clash rule.

@nicolevasilevsky removed the "attn: OFOC call" label (Issue to discuss on fortnightly OBO Operations meeting) on Feb 8, 2022
@nataled
Contributor

nataled commented Oct 23, 2024

In addition to Bioregistry and BioPortal, we will need to include a check of (probably) the NOR Dashboard. See biopragmatics/bioregistry#1212 for discussion and reason why. That should catch 99.9% of potential clashes.


6 participants