Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(restore-indices): always clear search and graph service before restoring indices #475

Conversation

Masterchen09
Copy link
Contributor

I recently played around a bit with the restore-indices job and noticed that under some circumstances the restoration might have an inconsistent result. If we say that the database is the single-point-of-truth of everything and especially when restoring the indices, then the search index should only contain entities which are also present in the database after restoring indices.

Let's assume that the search index and the database are getting out of sync and entities are deleted - for whatever reason the deletion is not processed by the mae-consumer, which means that the entities were not deleted in the search index, then there is currently no possibility with the existing restore-indices job to remove these entities in the search index - the current restore-indices job will only add missing entities, but will not delete non-existent entities.

There is the possibility to clean-up the indices before restoring the indices, however this is not enabled by default in the job template - I think to have consistent results after restoring indices the indices should always be cleaned-up before.

One drawback of this would be, that when the restore-indices job is started DataHub would not show anything, because the indices are cleaned-up...alternatively, instead of adding the "-a clean" option to the existing job template, there could be a second job template which includes this option (e.g., "datahub-clean-restore-indices-job-template"). Please suggest what's the best approach here - I am happy to change the PR accordingly. 😊

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

Copy link

github-actions bot commented Jul 8, 2024

This PR is stale. We will close it in 30 days if there is no comment or activity. If you want feedback but not able to get it on github please head to #contribute channel in slack at https://slack.datahubproject.io.

@github-actions github-actions bot added the stale label Jul 8, 2024
@Masterchen09 Masterchen09 force-pushed the fix-restore-indices-job-cleanup branch from bc617e0 to 65151d5 Compare July 18, 2024 19:23
@github-actions github-actions bot removed the stale label Jul 19, 2024
Copy link

This PR is stale. We will close it in 30 days if there is no comment or activity. If you want feedback but not able to get it on github please head to #contribute channel in slack at https://slack.datahubproject.io.

@github-actions github-actions bot added the stale label Aug 18, 2024
@Masterchen09
Copy link
Contributor Author

To make sure that this PR is not automatically closed - here’s a comment.

@github-actions github-actions bot removed the stale label Aug 27, 2024
@darnaut
Copy link
Contributor

darnaut commented Aug 28, 2024

I think to have consistent results after restoring indices the indices should always be cleaned-up before.

@Masterchen09 This is nice in theory but it effectively creates problems for anyone that has to restart/resume the job for whatever reason, or if just trying the job for the first time. If this is a default that makes sense for your setup, please feel free to add it behind a flag. Otherwise, such behavior that leads to data deletion should not be enabled by default.

@Masterchen09
Copy link
Contributor Author

@darnaut What about a second job template where the option is enabled? Using the datahubUpgrade.restoreIndices.image.args option it is already possible to adjust the existing job template, but you do not have both options available and changing the existing job „afterwards“ (or on demand) would make a redeployment of the chart necessary…

@darnaut
Copy link
Contributor

darnaut commented Aug 29, 2024

What about a second job template where the option is enabled?

So the tradeoff here is yet another duplicated template that needs to be maintained, or using the existing args option for your use case. Unless there are other users expressing the need for this particular flavor of the restore-indices template, I would strongly suggest using the available options.

@Masterchen09
Copy link
Contributor Author

@darnaut Then we should at least document it a bit more clearly that, without cleaning-up the indices, there could be inconsistencies between the local database and the indices. I can live with modifying the arguments for the restore indices job, I just wanted to make sure that it is also consistent for other users, because I suspect that others are not aware of the potential inconsistencies...what is a data catalog without consistency? 😉

See here for a proposal regarding the documentation: datahub-project/datahub#11380

@Masterchen09
Copy link
Contributor Author

As the PR with the enhancements of the documentation was already merged, I will close this PR here. ;-)

@Masterchen09 Masterchen09 deleted the fix-restore-indices-job-cleanup branch October 24, 2024 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants