-
Notifications
You must be signed in to change notification settings - Fork 777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make PyErrState
thread-safe
#4671
Conversation
I noticed clippy was failing so I just pushed a fix. I'll try to get the CI green on this if there are any more issues. |
src/err/err_state.rs
Outdated
match self_state { | ||
Some(PyErrStateInner::Normalized(n)) => n, | ||
_ => unreachable!(), | ||
let normalized_state = PyErrStateInner::Normalized(state.normalize(py)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only spot where there might be a deadlock is here, if normalize
somehow leads to arbitrary Python code execution.
Is that possible? If not I think it deserves a comment explaining why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it can deadlock, I'm not sure what we can do, since at this point we haven't actually constructed any Python objects yet and we only have a handle to an FnOnce
that knows how to construct them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great observation; I've added a wrapping call to py.allow_threads
before potentially blocking on the Once
, which I think avoids the deadlock (I pushed a test which did deadlock before that change).
The algorithm makes sense to me, I agree that this ensures that normalizing an error state can't be done simultaneously in two threads. |
CodSpeed Performance ReportMerging #4671 will degrade performances by 26.66%Comparing Summary
Benchmarks breakdown
|
Huh, I can reproduce the test failure happening on CI. It's flakey, but you can trigger it with |
.expect("Cannot normalize a PyErr while already normalizing it.") | ||
}; | ||
// avoid deadlock of `.call_once` with the GIL | ||
py.allow_threads(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess somehow dropping the GIL somehow allows a race condition to happen where multiple threads try to simultaneously create a module...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's a combination with GILOnceCell
in the test_declarative_module
; we allow racing in GILOnceCell
under the condition where switching the GIL, so this module does actually attempt to get created multiple times. I think it's a bug in using GILOnceCell
for that test, but this also just makes me dislike this lazy stuff even more...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is just a fundamental issue with GILOnceCell
being racey if the code it wraps ever drops the GIL.
EDIT: jinx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #4676, if I apply that patch on this branch, the problem goes away.
Well, this passes CI, but somewhat expectedly the additional complexity added here brings performance overheads. I think we either merge and release this, or we push forward on removing #4584 and the lazy state entirely, which I think yet needs some additional development but should hopefully be simpler and faster. Given that I think 0.23 is already overdue and doesn't need more breaking changes (which #4584 would be) I think we should merge this here and look hard at optimising errors on the way to 0.24. |
Yes, I think a huge chunk of that flamegraph can be vanished away by not having this lazy layer. But that's a problem for 0.24 now, sigh. |
Will proceed to merge this so that we're one final hop closer to that release! |
* make `PyErrState` thread-safe * fix clippy * add test of reentrancy, fix deadlock * newsfragment * fix MSRV * fix nightly build --------- Co-authored-by: Nathan Goldbaum <[email protected]>
9ba571d
to
cd39af0
Compare
This PR resolves the thread-safety challenges of #4584 for us to be able to at least ship 0.23.
I don't love the complexity that this lazy state creates inside error-handling pathways, so I think in the future I will work to proceed with #4669 and further steps to remove the lazy state. But 0.23 is already breaking enough, users don't need more changes and this should be an in-place drop-in.