# Preliminary questions to investigate for starting a new Knetminer API #2

marco-brandizi started this conversation in Ideas.
Moved from this ticket.
## Fulltext search

We have a basic use case like this: given some keywords, search for instances of these node types: Gene, Protein, Phenotype, Trait, Publication, over fields like accession, description, abstract (for Publication only).

- Do we need Lucene or another indexing system, or can Neo4j support this?
- With which performance?
- With DB-independent features?
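If we stay within Neo4j, its built-in Lucene-backed full-text indexes might already cover this use case. A minimal sketch, assuming a recent Neo4j (4.3+); index name, fields and keywords are just for illustration:

```cypher
// Create a Lucene-backed full-text index over the node types and fields of interest
CREATE FULLTEXT INDEX knetSearch IF NOT EXISTS
FOR (n:Gene|Protein|Phenotype|Trait|Publication)
ON EACH [n.accession, n.description, n.abstract];

// Query it with user keywords (Lucene query syntax is supported)
CALL db.index.fulltext.queryNodes("knetSearch", "drought tolerance")
YIELD node, score
RETURN labels(node) AS type, node.accession AS accession, score
ORDER BY score DESC LIMIT 20;
```

This would tie us to Neo4j-specific syntax, which is relevant to the DB-independence question above.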
## The API architecture

What we can already agree on:

- […] but outside of the web or Spring (it must be usable as a library, eg, in a CLI tool)
- […] or Knetminer-specific entities, see below)
## The Data Model

What to keep from Ondex? (Reference)

- Shall we keep the 'Concept' term? It is distinct from, eg, Accession, Evidence, DataSource.
- I propose Provenance instead of DataSource (for the sake of standard terminology).
- I propose to avoid requiring Evidence everywhere. This is specific to, eg, experimental data, and it's a form of provenance.
- Do we need Accession? A possible alternative is prefix:ID identifiers (see the sketch after this list).
- Do we need name and prefName? Or just name? Do we need multiple names?
- Do we have units? How do we deal with them? Simple approaches: […]
- How do we manage a polyglot representation?
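To make the discussion concrete, here is a purely illustrative sketch of how one such entity could look in the LPG if we adopted the proposals above (Provenance in place of DataSource, a prefix:ID style accession, one preferred name plus optional alternative names). All property names and values are made up:

```cypher
// Purely illustrative: one possible shape for a 'Concept'-like node, using
// 'provenance' instead of 'dataSource', a prefix:ID style accession, a single
// preferred name and an optional list of alternative names
CREATE (:Gene {
  accession:  "ensembl:AT1G01010",
  name:       "NAC001",
  altNames:   ["ANAC001"],
  provenance: "Ensembl Plants"
});
```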
## Exchange format

Possible solutions:

- RDF with a mapping from the LPG, eg, node labels mapped onto rdf:type (roughly).
  - It's an established standard, it has well-known mapping rules, and much software is available.
  - It's unpopular and little-known.
  - It adds an extra layer of mapping, with possible mapping impedance.
  - It's for triples; it works for LPGs too, but only with conventions like the above.
  - We'll need this anyway, eg, to support SPARQL.
  - Possibly, we'll need it in both directions, eg, to import RDF data into Neo4j (see the sketch after this list).
- Representing LPGs in RDF, eg, […]
  - There are proposals to standardise this (example).
  - I can't see any particular advantage, other than it's an easy way to transport LPGs to RDF as-is.
- JSON-LD
  - Namespaces are defined separately (via @context directives and alike), so it doesn't require JSON people to know about RDF. But it requires 'very careful' […]
- Some other 'standard'.
- The JSON used for Import/Export.
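Regarding the RDF -> Neo4j direction, one candidate worth evaluating is the neosemantics (n10s) plugin, which maps triples onto LPG nodes and relations. A rough sketch, assuming the plugin is installed and using a made-up dataset URL:

```cypher
// n10s requires a uniqueness constraint on Resource.uri before initialisation
CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS
FOR (r:Resource) REQUIRE r.uri IS UNIQUE;

CALL n10s.graphconfig.init();

// Import an RDF dump as LPG nodes/relations (Turtle, in this example)
CALL n10s.rdf.import.fetch("https://knetminer.org/data/example.ttl", "Turtle");
```

Whether the LPG it produces matches our data model, or needs a further mapping layer, is part of the 'mapping impedance' question above.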
## URIs as universal identifiers

URIs are an established method to provide universal and resolvable identifiers; universal is the more important of the two for us. The main alternative is accessions, ie, pairs of context + identifier. Related topics: local identifiers, auto-increment (or auto-created) IDs, merging entities.

In Knetminer RDF/LPG datasets, we have alignment at several levels, URIs included.

Issues:

- Eg, http://knetminer.org/data/rdf/resources/bioproc_go_0006225 vs http://purl.obolibrary.org/obo/GO_0006225 (two URIs for the same GO term).
- A URI-like approach to offer universal IDs.

Possible solutions: […]
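Purely as an illustration of one possible direction: pick a canonical URI (eg, the OBO one) as the merge key, and keep dataset-specific URIs as alternatives. Labels and property names below are invented:

```cypher
// Illustrative: use a canonical URI as the merge key when loading from
// multiple sources, keeping the dataset-specific URI as an alternative
MERGE (c:BioProc { uri: "http://purl.obolibrary.org/obo/GO_0006225" })
SET c.accession = "GO:0006225",
    c.altUris = ["http://knetminer.org/data/rdf/resources/bioproc_go_0006225"];
```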
## Multiple subgraphs/datasets

In triple stores like Virtuoso, you can have multiple datasets in the same database, managed via named graphs (eg, […]). You also have a 'union graph', ie, a query that doesn't name a specific graph searches across all the named graphs.

Main features provided by NGs:

- […] performance)
- Granularity is at the triple level (equivalent to a property or plain relation in LPGs).

Issues: […]

Possible solutions:

- We mark each node and each relation with a property like 'dataset' (see also the sketch after this list):
  ```cypher
  // g1 and g2 come from different datasets, but represent the same gene,
  // hence they share the same URI
  MATCH (g1:Gene { name: "GABA1" }), (g2:Gene)-[:encodes]->(p:Protein)
  WHERE g1.uri = g2.uri
  RETURN p
  ```
- As above, but we also create a union graph.
- We manage multiple datasets in multiple databases.
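For what it's worth, a minimal sketch of how the 'dataset' property approach could work, where the 'union graph' behaviour amounts to simply dropping the dataset filter (all names are invented):

```cypher
// Purely illustrative names: each node (and relation) carries a 'dataset' marker
CREATE (:Gene { name: "GABA1", dataset: "wheat-v1" });
CREATE (:Gene { name: "GABA1", dataset: "arabidopsis-v1" });

// Search within a single dataset
MATCH (g:Gene { name: "GABA1", dataset: "wheat-v1" }) RETURN g;

// 'Union graph' behaviour: omit the dataset filter to search across all datasets
MATCH (g:Gene { name: "GABA1" }) RETURN g.dataset, g;
```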
## Triple vs LPG patterns

If we adopt standards like schema.org for the data model, there will be parts that are influenced/constrained by the triple vision. Eg, ItemList works like: ItemList -> ItemListElement (storing, eg, the index) -> node value.

- How to deal with those cases? Follow the standard in LPG too?
- Probably it's better to deviate, while keeping track of such deviations and knowing there is work to do for import/export (see the sketch below).
- How many such patterns exist?
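For instance, the same ordered list could be modelled either way; the second form is a possible LPG deviation (relation and property names invented here) that would have to be mapped back to the schema.org pattern on RDF export:

```cypher
// Triple-style (schema.org-like): the list position lives on an intermediate node
CREATE (:ItemList { name: "my list" })
  -[:itemListElement]->(:ListItem { position: 1 })
  -[:item]->(:Gene { name: "NAC001" });

// Possible LPG-style deviation: the position becomes a relationship property,
// which is simpler to query but needs extra mapping work for import/export
CREATE (:ItemList { name: "my list" })
  -[:hasItem { position: 1 }]->(:Gene { name: "NAC001" });
```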
## AOB