-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(restore-indices): always clear search and graph service before restoring indices #475
fix(restore-indices): always clear search and graph service before restoring indices #475
Conversation
This PR is stale. We will close it in 30 days if there is no comment or activity. If you want feedback but not able to get it on github please head to #contribute channel in slack at https://slack.datahubproject.io. |
bc617e0
to
65151d5
Compare
This PR is stale. We will close it in 30 days if there is no comment or activity. If you want feedback but not able to get it on github please head to #contribute channel in slack at https://slack.datahubproject.io. |
To make sure that this PR is not automatically closed - here’s a comment. |
@Masterchen09 This is nice in theory but it effectively creates problems for anyone that has to restart/resume the job for whatever reason, or if just trying the job for the first time. If this is a default that makes sense for your setup, please feel free to add it behind a flag. Otherwise, such behavior that leads to data deletion should not be enabled by default. |
@darnaut What about a second job template where the option is enabled? Using the |
So the tradeoff here is yet another duplicated template that needs to be maintained, or using the existing |
@darnaut Then we should at least document it a bit more clearly that, without cleaning-up the indices, there could be inconsistencies between the local database and the indices. I can live with modifying the arguments for the restore indices job, I just wanted to make sure that it is also consistent for other users, because I suspect that others are not aware of the potential inconsistencies...what is a data catalog without consistency? 😉 See here for a proposal regarding the documentation: datahub-project/datahub#11380 |
As the PR with the enhancements of the documentation was already merged, I will close this PR here. ;-) |
I recently played around a bit with the restore-indices job and noticed that under some circumstances the restoration might have an inconsistent result. If we say that the database is the single-point-of-truth of everything and especially when restoring the indices, then the search index should only contain entities which are also present in the database after restoring indices.
Let's assume that the search index and the database are getting out of sync and entities are deleted - for whatever reason the deletion is not processed by the mae-consumer, which means that the entities were not deleted in the search index, then there is currently no possibility with the existing restore-indices job to remove these entities in the search index - the current restore-indices job will only add missing entities, but will not delete non-existent entities.
There is the possibility to clean-up the indices before restoring the indices, however this is not enabled by default in the job template - I think to have consistent results after restoring indices the indices should always be cleaned-up before.
One drawback of this would be, that when the restore-indices job is started DataHub would not show anything, because the indices are cleaned-up...alternatively, instead of adding the "-a clean" option to the existing job template, there could be a second job template which includes this option (e.g., "datahub-clean-restore-indices-job-template"). Please suggest what's the best approach here - I am happy to change the PR accordingly. 😊
Checklist