# Preliminary questions to investigate for starting a new Knetminer API #2

marco-brandizi started this conversation in Ideas.
Moved from this ticket.
## Fulltext search

We have a basic use case like this: given some keywords, search for instances of these node types: Gene, Protein, Phenotype, Trait, Publication, over fields like accession, description, abstract (for Publication only).

- Do we need Lucene or another indexing system, or can Neo4j support this?
- With which performance?
- With DB-independent features?
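If we stay within Neo4j, its built-in Lucene-backed full-text indexes might already cover this use case. A minimal sketch, assuming a recent Neo4j (4.3+); index name, fields and keywords are just for illustration:

```cypher
// Create a Lucene-backed full-text index over the node types and fields of interest
CREATE FULLTEXT INDEX knetSearch IF NOT EXISTS
FOR (n:Gene|Protein|Phenotype|Trait|Publication)
ON EACH [n.accession, n.description, n.abstract];

// Query it with user keywords (Lucene query syntax is supported)
CALL db.index.fulltext.queryNodes("knetSearch", "drought tolerance")
YIELD node, score
RETURN labels(node) AS type, node.accession AS accession, score
ORDER BY score DESC LIMIT 20;
```

This would tie us to Neo4j-specific syntax, which is relevant to the DB-independence question above.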
## The API architecture

What we can already agree on:

- […] but outside of the web or Spring (it must be usable as a library, eg, in a CLI tool)
- […] or Knetminer-specific entities, see below)
## The Data Model

What to keep from Ondex? (Reference)

- Shall we keep the 'Concept' term? It is distinct from, eg, Accession, Evidence, DataSource.
- I propose Provenance instead of DataSource (for the sake of standard terminology).
- I propose to avoid requiring Evidence everywhere. This is specific to, eg, experimental data, and it's a form of provenance.
- Do we need Accession? A possible alternative is prefix:ID identifiers (see the sketch after this list).
- Do we need name and prefName? Or just name? Do we need multiple names?
- Do we have units? How do we deal with them? Simple approaches: […]
- How do we manage a polyglot representation?
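To make the discussion concrete, here is a purely illustrative sketch of how one such entity could look in the LPG if we adopted the proposals above (Provenance in place of DataSource, a prefix:ID style accession, one preferred name plus optional alternative names). All property names and values are made up:

```cypher
// Purely illustrative: one possible shape for a 'Concept'-like node, using
// 'provenance' instead of 'dataSource', a prefix:ID style accession, a single
// preferred name and an optional list of alternative names
CREATE (:Gene {
  accession:  "ensembl:AT1G01010",
  name:       "NAC001",
  altNames:   ["ANAC001"],
  provenance: "Ensembl Plants"
});
```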
## Exchange format

Possible solutions:

- RDF with a mapping from the LPG, eg, node labels mapped onto rdf:type (roughly).
  - It's an established standard, it has well-known mapping rules, and much software is available.
  - It's unpopular and little-known.
  - It adds an extra layer of mapping, with possible mapping impedance.
  - It's for triples; it works for LPGs too, but only with conventions like the above.
  - We'll need this anyway, eg, to support SPARQL.
  - Possibly, we'll need it in both directions, eg, to import RDF data into Neo4j (see the sketch after this list).
- Representing LPGs in RDF, eg, […]
  - There are proposals to standardise this (example).
  - I can't see any particular advantage, other than it's an easy way to transport LPGs to RDF as-is.
- JSON-LD
  - Namespaces are defined separately (via @context directives and alike), so it doesn't require JSON people to know about RDF. But it requires 'very careful' […]
- Some other 'standard'.
- The JSON used for Import/Export.
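Regarding the RDF -> Neo4j direction, one candidate worth evaluating is the neosemantics (n10s) plugin, which maps triples onto LPG nodes and relations. A rough sketch, assuming the plugin is installed and using a made-up dataset URL:

```cypher
// n10s requires a uniqueness constraint on Resource.uri before initialisation
CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS
FOR (r:Resource) REQUIRE r.uri IS UNIQUE;

CALL n10s.graphconfig.init();

// Import an RDF dump as LPG nodes/relations (Turtle, in this example)
CALL n10s.rdf.import.fetch("https://knetminer.org/data/example.ttl", "Turtle");
```

Whether the LPG it produces matches our data model, or needs a further mapping layer, is part of the 'mapping impedance' question above.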
## URIs as universal identifiers

URIs are an established method to provide universal and resolvable identifiers; universal is the more important of the two for us. The main alternative is accessions, ie, pairs of context + identifier. Related topics: local identifiers, auto-increment (or auto-created) IDs, merging entities.

In Knetminer RDF/LPG datasets, we have alignment at several levels, URIs included.

Issues:

- Eg, http://knetminer.org/data/rdf/resources/bioproc_go_0006225 vs http://purl.obolibrary.org/obo/GO_0006225 (two URIs for the same GO term).
- A URI-like approach to offer universal IDs.

Possible solutions: […]
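Purely as an illustration of one possible direction: pick a canonical URI (eg, the OBO one) as the merge key, and keep dataset-specific URIs as alternatives. Labels and property names below are invented:

```cypher
// Illustrative: use a canonical URI as the merge key when loading from
// multiple sources, keeping the dataset-specific URI as an alternative
MERGE (c:BioProc { uri: "http://purl.obolibrary.org/obo/GO_0006225" })
SET c.accession = "GO:0006225",
    c.altUris = ["http://knetminer.org/data/rdf/resources/bioproc_go_0006225"];
```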
## Multiple subgraphs/datasets

In triple stores like Virtuoso, you can have multiple datasets in the same database, managed via named graphs (eg, […]). You also have a 'union graph', ie, a query that doesn't name a specific graph searches across all the named graphs.

Main features provided by NGs:

- […] performance)
- Granularity is at the triple level (equivalent to a property or plain relation in LPGs).

Issues: […]

Possible solutions:

- We mark each node and each relation with a property like 'dataset' (see also the sketch after this list):
  ```cypher
  // g1 and g2 come from different datasets, but represent the same gene,
  // hence they share the same URI
  MATCH (g1:Gene { name: "GABA1" }), (g2:Gene)-[:encodes]->(p:Protein)
  WHERE g1.uri = g2.uri
  RETURN p
  ```
- As above, but we also create a union graph.
- We manage multiple datasets in multiple databases.
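For what it's worth, a minimal sketch of how the 'dataset' property approach could work, where the 'union graph' behaviour amounts to simply dropping the dataset filter (all names are invented):

```cypher
// Purely illustrative names: each node (and relation) carries a 'dataset' marker
CREATE (:Gene { name: "GABA1", dataset: "wheat-v1" });
CREATE (:Gene { name: "GABA1", dataset: "arabidopsis-v1" });

// Search within a single dataset
MATCH (g:Gene { name: "GABA1", dataset: "wheat-v1" }) RETURN g;

// 'Union graph' behaviour: omit the dataset filter to search across all datasets
MATCH (g:Gene { name: "GABA1" }) RETURN g.dataset, g;
```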
## Triple vs LPG patterns

If we adopt standards like schema.org for the data model, there will be parts that are influenced/constrained by the triple vision. Eg, ItemList works like: ItemList -> ItemListElement (storing, eg, the index) -> node value.

- How to deal with those cases? Follow the standard in LPG too?
- Probably it's better to deviate, while keeping track of such deviations and knowing there is work to do for import/export (see the sketch below).
- How many such patterns exist?
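For instance, the same ordered list could be modelled either way; the second form is a possible LPG deviation (relation and property names invented here) that would have to be mapped back to the schema.org pattern on RDF export:

```cypher
// Triple-style (schema.org-like): the list position lives on an intermediate node
CREATE (:ItemList { name: "my list" })
  -[:itemListElement]->(:ListItem { position: 1 })
  -[:item]->(:Gene { name: "NAC001" });

// Possible LPG-style deviation: the position becomes a relationship property,
// which is simpler to query but needs extra mapping work for import/export
CREATE (:ItemList { name: "my list" })
  -[:hasItem { position: 1 }]->(:Gene { name: "NAC001" });
```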
## AOB