
Fix flakiness with SegmentReplicationSuiteIT #11977

Merged: 5 commits merged into opensearch-project:main on Apr 14, 2024

Conversation

@mch2 mch2 (Member) commented Jan 22, 2024

Description

This PR fixes race conditions during shutdown of SegmentReplicationSourceService, used for node-to-node segment replication, that cause store refs to still be open when checks are made. This is the cause of the flakiness in SegmentReplicationSuiteIT and a few other tests.

The issue is that we create new replications on the primary after we have cancelled existing replications but before the shard is closed. OngoingSegmentReplications has a prepareForReplication method that previously invoked getCachedCopyState, which would first fetch the CopyState from the map and incref it if it exists. This left us open to incref'ing the CopyState even though the shard is shutting down and is no longer in indexService.

To fix this, this change removes the responsibility of caching CopyState from OngoingSegmentReplications:

  1. CopyState was originally cached to prevent frequent disk reads while building segment metadata. That metadata is now cached lower down in IndexShard, so the cache is not required here. This also eliminates the need to synchronize on the object in the create/cancel methods.
  2. Change the prepareForReplication method to return a SegmentReplicationSourceHandler directly.
  3. Move responsibility for creating and closing CopyState into the handler (see the sketch after this list).
  4. Change testDropRandomNodeDuringReplication to wait until shards have recovered before deleting the index. The intent of that test is to ensure we can recover after the node drop; deleting the index while replication is running is covered by the separate testDeleteIndexWhileReplicating (see the second sketch below).
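
To make points 2-3 concrete, here is a minimal, self-contained sketch of the ownership model. The names below are simplified stand-ins for illustration, not the actual OpenSearch classes (OngoingSegmentReplications, SegmentReplicationSourceHandler, CopyState); the point is only that each handler creates its own CopyState and is solely responsible for releasing it, so nothing can incref a cached copy after the shard has gone away.

```java
// Simplified stand-ins for illustration only; not the real OpenSearch classes.
import java.io.Closeable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CopyStateSketch implements Closeable {
    // In the real CopyState this holds incref'd store/segment file references.
    @Override
    public void close() {
        // decref the store / release file references here
    }
}

final class SourceHandlerSketch implements Closeable {
    private final CopyStateSketch copyState;

    SourceHandlerSketch(CopyStateSketch copyState) {
        this.copyState = copyState; // the handler owns the state it is given
    }

    @Override
    public void close() {
        copyState.close(); // released with the handler, on completion or cancel
    }
}

final class OngoingReplicationsSketch {
    private final Map<Long, SourceHandlerSketch> handlers = new ConcurrentHashMap<>();

    // Previously this consulted a shared CopyState cache (getCachedCopyState) and
    // incref'd whatever it found; now it builds fresh state and returns the
    // handler that owns it.
    SourceHandlerSketch prepareForReplication(long replicationId) {
        SourceHandlerSketch handler = new SourceHandlerSketch(new CopyStateSketch());
        handlers.put(replicationId, handler);
        return handler;
    }

    void cancel(long replicationId) {
        SourceHandlerSketch handler = handlers.remove(replicationId);
        if (handler != null) {
            handler.close(); // closing the handler always releases its CopyState
        }
    }
}
```

With a 1-1 handler-to-CopyState relationship there is no shared cache to synchronize on, which is what makes dropping the create/cancel synchronization in point 1 safe.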

With these changes I have run the entire suite ~2500 times without failure.
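
For the test change in point 4, here is a rough sketch of the intended ordering. This is hypothetical and heavily simplified (index settings for segment replication and the helpers the real SegmentReplicationSuiteIT uses are omitted); it only shows that recovery is awaited before the delete.

```java
// Hypothetical, simplified test flow; not the actual test body.
import org.opensearch.test.OpenSearchIntegTestCase;

public class DropNodeThenDeleteSketchIT extends OpenSearchIntegTestCase {

    private static final String INDEX_NAME = "test-idx";

    public void testDropRandomNodeDuringReplication() throws Exception {
        internalCluster().startNodes(2);
        createIndex(INDEX_NAME);
        ensureGreen(INDEX_NAME);

        for (int i = 0; i < 10; i++) {
            client().prepareIndex(INDEX_NAME).setId(String.valueOf(i)).setSource("field", "value-" + i).get();
        }

        internalCluster().restartRandomDataNode();

        // The change: wait for shards to fully recover after the node drop before
        // deleting the index, so this test exercises recovery; deleting while
        // replication is in flight is covered by testDeleteIndexWhileReplicating.
        ensureGreen(INDEX_NAME);

        client().admin().indices().prepareDelete(INDEX_NAME).get();
    }
}
```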

Related Issues

Resolves #9499

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added labels on Jan 22, 2024: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Indexing:Replication (Issues and PRs related to core replication framework eg segrep), v2.11.0 (Issues and PRs related to version 2.11.0)
github-actions bot (Contributor) commented Jan 22, 2024

Compatibility status:

Checks if related components are compatible with change 13522a0

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

@peternied peternied assigned peternied and unassigned peternied Jan 31, 2024
@opensearch-trigger-bot (Contributor) commented:

This PR is stalled because it has been open for 30 days with no activity.

github-actions bot (Contributor) commented Mar 5, 2024

❕ Gradle check result for 8687594: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadNonexistentBlobThrowsNoSuchFileException
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testNoSearchIdleForAnyReplicaCount

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@peternied peternied (Member) left a comment

> With these changes I have run the entire suite ~2500 times without failure.

@mch2 This is good - and it's a shame we haven't gotten this change reviewed.

I'll try to prioritize this in my review queue. I'm not familiar with the replication space - is there anything that warrants extra attention you'd like me to focus on?

@mch2 mch2 (Member, Author) commented Mar 20, 2024

@peternied Thanks for taking a look here. @andrross offered to review this last week, and I had asked him to hold off until I revisited it since it had been a while, but I think this is good to go. I believe this will also fix the additional flakiness in #12408 that popped up after more tests started executing with SegRep, as the root cause there looks the same.

The key parts of this PR that may warrant extra attention: first, moving the cancellation to after the shard is closed (beforeIndexShardClosed -> afterIndexShardClosed) to ensure open file refs are released only after the shard is closed, preventing a race between cancellation and shard closure. Second, removing the caching of the CopyState object on the source/primary side. That object holds file refs to ensure they are not discarded until replication completes; the caching logic was buggy and hard to debug. The object is now tied 1-1 to a handler that closes that state when replication completes.
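
To make the ordering in the first point concrete, a minimal sketch with stand-in names (not the actual SegmentReplicationSourceService code):

```java
// Simplified stand-ins; the real hooks are the IndexEventListener callbacks.
interface OngoingReplications {
    void cancelForShard(String shardId, String reason);
}

final class SourceServiceSketch {

    private final OngoingReplications ongoingReplications;

    SourceServiceSketch(OngoingReplications ongoingReplications) {
        this.ongoingReplications = ongoingReplications;
    }

    // Old behavior (racy): replications were cancelled here, before the shard was
    // closed, leaving a window in which a new replication could still be prepared
    // against the closing shard.
    void beforeIndexShardClosed(String shardId) {
        // no-op after this change
    }

    // New behavior: cancel only after the shard itself has closed, so the file refs
    // held by ongoing replications are released strictly after the shard's own
    // resources, and nothing new can be prepared for a shard that no longer exists.
    void afterIndexShardClosed(String shardId) {
        ongoingReplications.cancelForShard(shardId, "shard closed");
    }
}
```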

mch2 added 3 commits April 2, 2024 16:47
This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

Signed-off-by: Marc Handalian <[email protected]>
This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

Signed-off-by: Marc Handalian <[email protected]>
@mch2 mch2 (Member, Author) commented Apr 3, 2024

@andrross Apologies for stalling here. After running both this suite and IndexActionIT I do not get any failures and think this is ready.
I've made a couple of small changes since this was first cut, but it's largely the same. The fix is to create a 1-1 relationship between CopyState and the source handler so that the state is always closed on shard closure.

github-actions bot (Contributor) commented Apr 3, 2024

✅ Gradle check result for ce239f7: SUCCESS

github-actions bot (Contributor) commented Apr 3, 2024

✅ Gradle check result for 66ede03: SUCCESS

github-actions bot (Contributor) commented Apr 3, 2024

❕ Gradle check result for 13522a0: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@mch2 mch2 (Member, Author) commented Apr 3, 2024

> ❕ Gradle check result for 13522a0: UNSTABLE
>
>   • TEST FAILURES:
>       1 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex
>
> Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Tracked in #11547.

@stephen-crawford stephen-crawford (Contributor) left a comment

I don't know much about replication, but the code looks good. I left one comment/question around the fetching of the metadata map; assuming that is all good, this should be ready to merge!

@mch2 mch2 added the backport 2.x (Backport to 2.x branch) label Apr 12, 2024
@mch2 mch2 merged commit e828c18 into opensearch-project:main Apr 14, 2024
35 of 37 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 14, 2024
* Fix SegmentReplicationSuiteIT

This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

Signed-off-by: Marc Handalian <[email protected]>

* Remove CopyState caching from OngoingSegmentReplications.

This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

Signed-off-by: Marc Handalian <[email protected]>

* Fix comment for afterIndexShardClosed method.

Signed-off-by: Marc Handalian <[email protected]>

* Fix comment on beforeIndexShardClosed

Signed-off-by: Marc Handalian <[email protected]>

* Remove unnecessary method from OngoingSegmentReplications

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit e828c18)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@mch2 mch2 deleted the repro branch April 16, 2024 16:08
mch2 pushed a commit that referenced this pull request Apr 16, 2024
* Fix SegmentReplicationSuiteIT

This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

* Remove CopyState caching from OngoingSegmentReplications.

This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

* Fix comment for afterIndexShardClosed method.

* Fix comment on beforeIndexShardClosed

* Remove unnecessary method from OngoingSegmentReplications

---------

(cherry picked from commit e828c18)

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x (Backport to 2.x branch), bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Indexing:Replication (Issues and PRs related to core replication framework eg segrep), skip-changelog, v2.11.0 (Issues and PRs related to version 2.11.0)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Test org.opensearch.indices.replication.SegmentReplicationSuiteIT is flaky
5 participants