Refactoring NEO4J Schema for Fewer Data Types and Minimised Duplication #776

Arnedeklerk · 2023-07-14T16:06:10Z

Arnedeklerk
Jul 14, 2023
Maintainer

Background

I'm raising this to gather thoughts before we plan any concrete actions. My knowledge of the technical side is limited, and I'm unsure about (1) whether this would be done by us or by a partner company, (2) how much of it relates to the Neo4j converter (most of it, I think, which is why the issue is here), (3) how much of KM would need remapping afterwards, and (4) the exact details of our previous discussions on this topic. Please feel free to add to the discussion below.

Our current Neo4j schema is functional, but there is room for improvement, especially in optimisation and simplification. We've identified the following needs:

Objectives

Fewer Data Types: An excess of data types leads to complicated queries and makes it harder to integrate new features. Streamlining the schema to fewer types would simplify querying, especially for our upcoming Large Language Model (LLM) natural-language querying feature.
All Required Data: As we reduce the number of data types, the schema must still cover all necessary data fields. A key part of this task is identifying any missing fields that are critical to our application's performance. I know that the CyVerse Neo4j version is now outdated, so perhaps the data just needs updating there.
Minimised Duplication: Our schema currently contains a significant amount of redundant data, which adds unnecessary complexity. A refactoring that identifies and eliminates these duplications would make data handling more efficient. I do, however, recall Marco mentioning that this is a difficult fix because of how the data is stored or handled.
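As a rough illustration of how the first and third objectives could be explored, here is a minimal Python sketch. It assumes we already have a mapping from each node label to its property keys, however that is obtained. The function names and the sample labels and keys are invented for illustration and are not taken from our actual schema. It flags labels whose property sets are identical (possible candidates for merging) and property keys that differ only in case or punctuation (possible duplicates).

```python
import re
from collections import defaultdict


def group_labels_by_property_set(label_properties):
    """Return groups of labels whose sets of property keys are identical."""
    groups = defaultdict(list)
    for label, props in label_properties.items():
        groups[frozenset(props)].append(label)
    # Only groups with more than one label are interesting.
    return sorted(sorted(labels) for labels in groups.values() if len(labels) > 1)


def keys_differing_only_in_spelling(property_keys):
    """Return groups of keys that are equal once case and punctuation are ignored."""
    groups = defaultdict(set)
    for key in property_keys:
        normalised = re.sub(r"[^a-z0-9]", "", key.lower())
        groups[normalised].add(key)
    return sorted(sorted(keys) for keys in groups.values() if len(keys) > 1)


# Hypothetical sample data, for illustration only.
sample = {
    "Gene": ["identifier", "prefName"],
    "Protein": ["prefName", "identifier"],
    "Publication": ["AbstractHeader", "abstract_header", "identifier"],
}
merge_candidates = group_labels_by_property_set(sample)
all_keys = {key for props in sample.values() for key in props}
spelling_duplicates = keys_differing_only_in_spelling(all_keys)
```

Both checks only produce candidates. Two labels sharing the same property keys may still be semantically distinct, so the output would need manual review.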

Questions to Discuss

Is it feasible to restructure our schema?
Where do we begin? I have a Google Sheet (the Property Keys Sheet) that I hoped we could use to understand the schema, but it isn't very good and was originally intended for a different purpose. There is probably a cleaner way to extract the full schema from Neo4j, although I think it previously didn't give me all the information I had hoped for; this needs investigating.
What effects could this restructuring have on the current version of Knetminer, and what implications might it have for the future Knetminer 6.0?
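On extracting the schema directly from Neo4j, one option that might be worth trying is the built-in `db.schema.nodeTypeProperties()` procedure, which lists the property names and types seen for each node type. The sketch below is only an assumption about how we could use it. The connection details are placeholders, the helper names are made up, and I have not checked it against our database or Neo4j version. `db.schema.relTypeProperties()` is the equivalent for relationships.

```python
from collections import defaultdict

# Built-in Neo4j procedure that reports properties per node type.
SCHEMA_QUERY = """
CALL db.schema.nodeTypeProperties()
YIELD nodeLabels, propertyName, propertyTypes, mandatory
RETURN nodeLabels, propertyName, propertyTypes, mandatory
"""


def fetch_schema_rows(uri, user, password):
    """Run SCHEMA_QUERY on a live database. All arguments are placeholders."""
    from neo4j import GraphDatabase  # official driver: pip install neo4j

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return [record.data() for record in session.run(SCHEMA_QUERY)]


def summarise_schema(rows):
    """Map each label to {property name: sorted list of observed types}."""
    summary = defaultdict(dict)
    for row in rows:
        for label in row["nodeLabels"]:
            props = summary[label]  # ensures labels without properties still appear
            name = row["propertyName"]
            if name is None:
                continue
            types = set(props.get(name, []))
            types.update(row["propertyTypes"] or [])
            props[name] = sorted(types)
    return dict(summary)


# Hypothetical rows in the shape the procedure returns, for illustration only.
example_rows = [
    {"nodeLabels": ["Gene"], "propertyName": "prefName",
     "propertyTypes": ["String"], "mandatory": True},
    {"nodeLabels": ["Gene"], "propertyName": "TAXID",
     "propertyTypes": ["String", "Long"], "mandatory": False},
    {"nodeLabels": ["Concept"], "propertyName": None,
     "propertyTypes": None, "mandatory": False},
]
example_summary = summarise_schema(example_rows)
```

A property that shows up with more than one type (like the made-up `TAXID` above) would be a natural first place to look when reducing data types. Against a real instance the call would be something like `summarise_schema(fetch_schema_rows("bolt://localhost:7687", "neo4j", "<password>"))`.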

Moving Forward

More exploration and expertise on the subject are needed to define concrete suggestions and a plan of action. A discussion during one of our meetings would be a good start.

Arnedeklerk · 2023-08-02T13:26:01Z

Arnedeklerk
Aug 2, 2023
Maintainer Author

Please discuss here: https://github.com/KnetMiner/knetminer/discussions/7
