Consider ingest/updates via changes (delta) #79

Open
dr-rodriguez opened this issue Aug 13, 2024 · 0 comments

Comments

@dr-rodriguez
Collaborator

Currently, updating the database is the same as creating a new one: all records are deleted and re-ingested. This works well to ensure that deleted objects are properly handled. However, once the database reaches a certain size this becomes an expensive operation.

Instead, we may want to figure out how to handle a delta-style ingest, that is, only process the JSON documents that have been updated. This may be tricky and may require several iterations.
I do think that for Production-level databases and for testing one may want to build the entire database from scratch, so this is more about the development workflow, or about users' local copies where they may not want or need a production-ready instance. It would also apply to instances where the production-level database is so large that we only want to apply deltas.
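One way to sketch the "only process updated JSON documents" idea is to track a timestamp (or per-file hash) from the last ingest and filter the data directory against it. This is a minimal illustration, not the tool's actual API; the function name, the timestamp-tracking approach, and the directory layout are all assumptions:

```python
from pathlib import Path

def files_changed_since(data_dir: str, last_ingest: float) -> list[Path]:
    """Return JSON documents modified after the last ingest timestamp.

    Hypothetical helper: the real tool would need to persist the
    timestamp (or a per-file hash) between runs to make this reliable.
    """
    return sorted(
        p for p in Path(data_dir).rglob("*.json")
        if p.stat().st_mtime > last_ingest
    )
```

A delta ingest loop would then load and upsert only the returned files instead of wiping the database. Modification times are fragile (a fresh `git clone` resets them), which is part of why the git-based option below may be more robust.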

I can think of several aspects we may want to check out:

  1. Perform no deletions; only insert JSON files that have been produced. This is already supported, but by default saving the database writes out all JSON output. It is also unclear whether we would hit foreign key violations, particularly if reference tables have been updated.
  2. Use git diff to determine which JSON documents have changed, and do not delete or change anything else. This requires git to be installed and the data to be version controlled, both of which are likely true in development situations.
  3. Figure out a way to export only the records that have changed when saving the database. I do not know of a way to capture all changes to the DB since a connection was made; it might be architecture-dependent, and it starts getting into DB data migration tools. Several such tools exist, but our DB/Tool architecture was not built with them in mind.
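Option 2 above could be prototyped by shelling out to git and collecting the changed JSON paths. A minimal sketch, assuming git is installed and the data directory is a git working tree; the function name, default ref, and interface are illustrative, not part of the existing tool:

```python
import subprocess

def changed_json_files(repo_dir: str, since_ref: str = "HEAD~1") -> list[str]:
    """List JSON files changed in the repo since the given git ref.

    Hypothetical helper: assumes git is available on PATH and that
    repo_dir is a version-controlled data directory.
    """
    result = subprocess.run(
        ["git", "diff", "--name-only", since_ref, "--", "*.json"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line.endswith(".json")]
```

The ingest would then re-process only these paths. Note that `git diff --name-only` also reports deleted files, so a real implementation would want `--diff-filter` (or `--name-status`) to distinguish updates from deletions and handle removals explicitly.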