
Fix flakiness with SegmentReplicationSuiteIT #11977

Merged: 5 commits merged into opensearch-project:main on Apr 14, 2024

Conversation

@mch2 mch2 (Member) commented Jan 22, 2024

Description

This PR fixes race conditions during shutdown of SegmentReplicationSourceService, used for node-to-node segment replication, that cause store refs to still be open when checks are made. This is the cause of the flakiness in SegmentReplicationSuiteIT and a few other tests.

The issue is that we create new replications on the primary after we have cancelled existing replications but before the shard is closed. OngoingSegmentReplications has a prepareForReplication method that previously invoked getCachedCopyState, which would first fetch the CopyState from the map and incref it if it exists. This left us open to incref'ing the CopyState even though the shard is shutting down and is no longer in indexService.

To fix this, this change removes the responsibility of caching CopyState from OngoingSegmentReplications:

  1. CopyState was originally cached to prevent frequent disk reads while building segment metadata. That metadata is now cached lower down in IndexShard, so the cache is not required here. This also eliminates the need to synchronize on the object in the create/cancel methods.
  2. Change the prepareForReplication method to return a SegmentReplicationSourceHandler directly.
  3. Move responsibility for creating and closing CopyState into the handler (see the sketch after this list).
  4. Change testDropRandomNodeDuringReplication to wait until shards have recovered before deleting the index. The intent of that test is to ensure we can recover after the node drop; deleting the index while replication is running is covered by the separate testDeleteIndexWhileReplicating (see the second sketch below).
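
To make points 2-3 concrete, here is a minimal, self-contained sketch of the ownership model. The names below are simplified stand-ins for illustration, not the actual OpenSearch classes (OngoingSegmentReplications, SegmentReplicationSourceHandler, CopyState); the point is only that each handler creates its own CopyState and is solely responsible for releasing it, so nothing can incref a cached copy after the shard has gone away.

```java
// Simplified stand-ins for illustration only; not the real OpenSearch classes.
import java.io.Closeable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CopyStateSketch implements Closeable {
    // In the real CopyState this holds incref'd store/segment file references.
    @Override
    public void close() {
        // decref the store / release file references here
    }
}

final class SourceHandlerSketch implements Closeable {
    private final CopyStateSketch copyState;

    SourceHandlerSketch(CopyStateSketch copyState) {
        this.copyState = copyState; // the handler owns the state it is given
    }

    @Override
    public void close() {
        copyState.close(); // released with the handler, on completion or cancel
    }
}

final class OngoingReplicationsSketch {
    private final Map<Long, SourceHandlerSketch> handlers = new ConcurrentHashMap<>();

    // Previously this consulted a shared CopyState cache (getCachedCopyState) and
    // incref'd whatever it found; now it builds fresh state and returns the
    // handler that owns it.
    SourceHandlerSketch prepareForReplication(long replicationId) {
        SourceHandlerSketch handler = new SourceHandlerSketch(new CopyStateSketch());
        handlers.put(replicationId, handler);
        return handler;
    }

    void cancel(long replicationId) {
        SourceHandlerSketch handler = handlers.remove(replicationId);
        if (handler != null) {
            handler.close(); // closing the handler always releases its CopyState
        }
    }
}
```

With a 1-1 handler-to-CopyState relationship there is no shared cache to synchronize on, which is what makes dropping the create/cancel synchronization in point 1 safe.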

With these changes I have run the entire suite ~2500 times without failure.
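
For the test change in point 4, here is a rough sketch of the intended ordering. This is hypothetical and heavily simplified (index settings for segment replication and the helpers the real SegmentReplicationSuiteIT uses are omitted); it only shows that recovery is awaited before the delete.

```java
// Hypothetical, simplified test flow; not the actual test body.
import org.opensearch.test.OpenSearchIntegTestCase;

public class DropNodeThenDeleteSketchIT extends OpenSearchIntegTestCase {

    private static final String INDEX_NAME = "test-idx";

    public void testDropRandomNodeDuringReplication() throws Exception {
        internalCluster().startNodes(2);
        createIndex(INDEX_NAME);
        ensureGreen(INDEX_NAME);

        for (int i = 0; i < 10; i++) {
            client().prepareIndex(INDEX_NAME).setId(String.valueOf(i)).setSource("field", "value-" + i).get();
        }

        internalCluster().restartRandomDataNode();

        // The change: wait for shards to fully recover after the node drop before
        // deleting the index, so this test exercises recovery; deleting while
        // replication is in flight is covered by testDeleteIndexWhileReplicating.
        ensureGreen(INDEX_NAME);

        client().admin().indices().prepareDelete(INDEX_NAME).get();
    }
}
```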

Related Issues

Resolves #9499

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added labels on Jan 22, 2024: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Indexing:Replication (Issues and PRs related to core replication framework eg segrep), v2.11.0 (Issues and PRs related to version 2.11.0)
github-actions bot (Contributor) commented Jan 22, 2024

Compatibility status:

Checks if related components are compatible with change 13522a0

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

@peternied peternied assigned peternied and unassigned peternied Jan 31, 2024
@opensearch-trigger-bot (Contributor) commented:

This PR is stalled because it has been open for 30 days with no activity.

github-actions bot (Contributor) commented Mar 5, 2024

❕ Gradle check result for 8687594: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadNonexistentBlobThrowsNoSuchFileException
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testNoSearchIdleForAnyReplicaCount

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@peternied peternied (Member) left a comment

> With these changes I have run the entire suite ~2500 times without failure.

@mch2 This is good - and it's a shame we haven't gotten this change reviewed.

I'll try to prioritize this in my review queue. I'm not familiar with the replication space - is there anything that warrants extra attention you'd like me to focus on?

@mch2 mch2 (Member, Author) commented Mar 20, 2024

@peternied Thanks for taking a look here. @andrross offered to review this last week, and I had asked him to hold off until I revisited it since it had been a while, but I think this is good to go. I believe this will also fix the additional flakiness in #12408 that popped up after more tests started executing with SegRep, as the root cause there looks the same.

The key parts of this PR that may warrant extra attention: first, moving the cancellation to after the shard is closed (beforeIndexShardClosed -> afterIndexShardClosed) to ensure open file refs are released only after the shard is closed, preventing a race between cancellation and shard closure. Second, removing the caching of the CopyState object on the source/primary side. That object holds file refs to ensure they are not discarded until replication completes; the caching logic was buggy and hard to debug. The object is now tied 1-1 to a handler that closes that state when replication completes.
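
To make the ordering in the first point concrete, a minimal sketch with stand-in names (not the actual SegmentReplicationSourceService code):

```java
// Simplified stand-ins; the real hooks are the IndexEventListener callbacks.
interface OngoingReplications {
    void cancelForShard(String shardId, String reason);
}

final class SourceServiceSketch {

    private final OngoingReplications ongoingReplications;

    SourceServiceSketch(OngoingReplications ongoingReplications) {
        this.ongoingReplications = ongoingReplications;
    }

    // Old behavior (racy): replications were cancelled here, before the shard was
    // closed, leaving a window in which a new replication could still be prepared
    // against the closing shard.
    void beforeIndexShardClosed(String shardId) {
        // no-op after this change
    }

    // New behavior: cancel only after the shard itself has closed, so the file refs
    // held by ongoing replications are released strictly after the shard's own
    // resources, and nothing new can be prepared for a shard that no longer exists.
    void afterIndexShardClosed(String shardId) {
        ongoingReplications.cancelForShard(shardId, "shard closed");
    }
}
```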

mch2 added 3 commits April 2, 2024 16:47
This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

Signed-off-by: Marc Handalian <[email protected]>
This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

Signed-off-by: Marc Handalian <[email protected]>
@mch2 mch2 (Member, Author) commented Apr 3, 2024

@andrross Apologies for stalling here. After running both this suite and IndexActionIT I do not get any failures and think this is ready.
I've made a couple of small changes since this was first cut, but it's largely the same. The fix is to create a 1-1 relationship between CopyState and the source handler so that the state is always closed on shard closure.

github-actions bot (Contributor) commented Apr 3, 2024

✅ Gradle check result for ce239f7: SUCCESS

github-actions bot (Contributor) commented Apr 3, 2024

✅ Gradle check result for 66ede03: SUCCESS

github-actions bot (Contributor) commented Apr 3, 2024

❕ Gradle check result for 13522a0: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@mch2 mch2 (Member, Author) commented Apr 3, 2024

> ❕ Gradle check result for 13522a0: UNSTABLE
>
>   • TEST FAILURES:
>       1 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex
>
> Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Tracked in #11547.

@stephen-crawford stephen-crawford (Contributor) left a comment

I don't know much about replication, but the code looks good. I left one comment/question around the fetching of the metadata map; assuming that is all good, this should be ready to merge!

@mch2 mch2 added the backport 2.x (Backport to 2.x branch) label Apr 12, 2024
@mch2 mch2 merged commit e828c18 into opensearch-project:main Apr 14, 2024
35 of 37 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 14, 2024
* Fix SegmentReplicationSuiteIT

This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

Signed-off-by: Marc Handalian <[email protected]>

* Remove CopyState caching from OngoingSegmentReplications.

This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

Signed-off-by: Marc Handalian <[email protected]>

* Fix comment for afterIndexShardClosed method.

Signed-off-by: Marc Handalian <[email protected]>

* Fix comment on beforeIndexShardClosed

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary method from OngoingSegmentReplications

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit e828c18)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@mch2 mch2 deleted the repro branch April 16, 2024 16:08
mch2 pushed a commit that referenced this pull request Apr 16, 2024
* Fix SegmentReplicationSuiteIT

This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

* Remove CopyState caching from OngoingSegmentReplications.

This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

* Fix comment for afterIndexShardClosed method.

* Fix comment on beforeIndexShardClosed

* Remove unnecessary method from OngoingSegmentReplications

---------

(cherry picked from commit e828c18)

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x (Backport to 2.x branch), bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Indexing:Replication (Issues and PRs related to core replication framework eg segrep), skip-changelog, v2.11.0 (Issues and PRs related to version 2.11.0)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Test org.opensearch.indices.replication.SegmentReplicationSuiteIT is flaky
5 participants