Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix grpc crash #595

Merged
merged 3 commits into from
Nov 26, 2024
Merged

Fix grpc crash #595

merged 3 commits into from
Nov 26, 2024

Conversation

Besroy
Copy link
Contributor

@Besroy Besroy commented Nov 22, 2024

If data channel and raft channel release the same data at the same time, there is a chance that send_response may be called twice, then grpc will crash with error TOO_MANY_OPERATIONS. This pr make sure data channel release data at first.

Signed-off-by: Yawen Zhang [email protected]

@Besroy Besroy requested a review from xiaoxichen November 22, 2024 02:41
@codecov-commenter
Copy link

codecov-commenter commented Nov 22, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.65%. Comparing base (1a0cef8) to head (ee90d97).
Report is 95 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #595       +/-   ##
===========================================
+ Coverage   56.51%   66.65%   +10.13%     
===========================================
  Files         108      109        +1     
  Lines       10300    10771      +471     
  Branches     1402     1470       +68     
===========================================
+ Hits         5821     7179     +1358     
+ Misses       3894     2887     -1007     
- Partials      585      705      +120     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

xiaoxichen
xiaoxichen previously approved these changes Nov 22, 2024
@Besroy Besroy requested a review from raakella1 November 22, 2024 07:19
JacksonYao287
JacksonYao287 previously approved these changes Nov 22, 2024
Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now, before we append log, we will make sure that the data corresponding to this log is flushed and the rreq state is changed to DATA_WRITTEN.

so I am think can we remove the release data op in commit phase?

@Besroy
Copy link
Contributor Author

Besroy commented Nov 22, 2024

now, before we append log, we will make sure that the data corresponding to this log is flushed and the rreq state is changed to DATA_WRITTEN.

so I am think can we remove the release data op in commit phrase?

I think remove release data in req->clear() is a strength way to avoid double release --- actually my fix cannot guarantee the sequence 'data channel release -> append log->commit release', because I found another conflict:

  1. data channel set DATA_WRITTEN at https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L456 at first
  2. check data received or not before append logs: if state==DATA_WRITTEM, will return directly at https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L545
  3. data channel release_data && handle_commit release_data at the same time

And other way to guarantee the sequence is to add lock on release_data. cc: @JacksonYao287 @xiaoxichen @raakella1

@xiaoxichen
Copy link
Collaborator

xiaoxichen commented Nov 22, 2024

@Besroy i think you mean here
https://github.com/eBay/HomeStore/blob/f83679af8ce71e71fb5b6f57df6d54047cd7d940/src/lib/replication/repl_dev/raft_repl_dev.cpp#L875?

There are 3 places we do release_data()

  1. on_push_data_reveive()
  2. handle_fetch_data_response()
  3. on_commit()

conflicts between 1/3 and 2/3 can be resolved by either remove 3, or finish release_data in data path before setting the DATA_WRITTEN flag as well as the promise.

Conflict between 1/2 will not happen as add_state_if_not_already is atomic, the latter one will return and not proceed to write phase. However there seems an issue in push_data path, we dont send response in this case, @raakella1 pls take a look as well.

@Besroy Besroy dismissed stale reviews from JacksonYao287 and xiaoxichen via 58458db November 25, 2024 08:54
Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiaoxichen xiaoxichen merged commit 6a2dfd8 into eBay:master Nov 26, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants