-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report child workflow started synchronously in Test Implementation #928
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files
... and 19 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
Pull Request Test Coverage Report for Build 2566Details
💛 - Coveralls |
} | ||
}); | ||
try { | ||
parent.get().childWorkflowStarted(a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh. any idea why it was async in the first place? since it seems to be creating decision tasks above, it seems like the start-event probably already exists, so this doesn't seem to risk deadlocking or something...
is the issue that multiple fork-join-pool executes are racing, and sometimes complete runs before this start runs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two tiers of locks for the test service:
- There's a lock in TestWorkflowService, which controls access to the Map of MutableState for every workflow.
- Every MutableState has a lock.
Typically we lock the TestWorkflowService, get the MutableState, unlock the TestWorkflowService, and then perform an action on the MutableState. The MutableState then manages its own lock internally.
Any time a MutableState performs an action on another workflow we use the ForkJoinPool to run the operation asynchronously without holding any locks for the original workflow. These calls use the TestWorkflowService, so they repeat the process above. This is particularly important for any of the methods that process Decisions, which are all called while the lock for the workflow is held. If two workflows attempted to cancel each other they could deadlock.
By running this block asynchronously we end up triggering a race between:
- This block running and updating the parent workflow that the child workflow started.
- The next decision task being scheduled , processed, and reported back to the TestService, which could complete the child workflow. This similarly prompts an async operation to report the status update to the parent workflow. We see the test failure any time this process beats the first one.
- The next decision task gets pushed to a queue before we even release the lock on the MutableState.
This block is unique for two reasons:
- Starting a workflow is the only time we hold both the TestWorkflowService lock and MutableState lock simultaneously, so we're still holding the TestWorkflowService lock.
- We've already released the MutableState lock above.
Running this synchronously somewhat incidentally ends up fixing the race condition because we're still holding the lock on the TestWorkflowService. The next decision task can't retrieve the MutableState for the child workflow until we release the lock, so we've created a strict order of events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense / fits about what I expected, yea. Thanks!
And it looks like the childWorkflowStarted
acquires its own lock in update
(mutable state lock?), so that sounds like a probably-correct fix.
I suppose a later step for all this would be to make these related test workflows share a single execution thread, since otherwise this seems prone to this kind of race for no benefit :\ though I'm gonna assume that's extremely hard without a complete rewrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - I mostly intended to approve earlier, and was asking so I could understand the rest since this seemed like a pretty good "deep in the internals" spot where everything matters and there are a lot of assumptions I'm not familiar with.
Thanks for the explanation! It's definitely stuff I'm not familiar with yet :)
By reporting child workflow started asynchronously we can observe a child workflow attempting to complete prior to its state machine registering that it started. This is an invalid state transition and causes test failures. See temporalio/sdk-java#1289 , we have the same test case still and it has similarly been flakey.
By reporting child workflow started asynchronously we can observe a child workflow attempting to complete prior to its state machine registering that it started. This is an invalid state transition and causes test failures.
See https://github.com/uber/cadence-java-client/actions/runs/11508671742/job/32037318976 as the most recent occurrence of this issue.
Temporal addressed this issue in temporalio/sdk-java#1289 due to the same test case failing..
What changed?
Why?
How did you test it?
Potential risks
Release notes
Documentation Changes