Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report child workflow started synchronously in Test Implementation #928

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

natemort
Copy link
Member

By reporting child workflow started asynchronously we can observe a child workflow attempting to complete prior to its state machine registering that it started. This is an invalid state transition and causes test failures.

See https://github.com/uber/cadence-java-client/actions/runs/11508671742/job/32037318976 as the most recent occurrence of this issue.

Temporal addressed this issue in temporalio/sdk-java#1289 due to the same test case failing..

What changed?

  • Fixed cause of test failures

Why?

How did you test it?

Potential risks

Release notes

Documentation Changes

Copy link

codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 61.93%. Comparing base (8693c37) to head (6d35286).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...rnal/testservice/TestWorkflowMutableStateImpl.java 40.00% 3 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
...rnal/testservice/TestWorkflowMutableStateImpl.java 78.33% <40.00%> (-0.08%) ⬇️

... and 19 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc0f0c4...6d35286. Read the comment docs.

@coveralls
Copy link

coveralls commented Oct 25, 2024

Pull Request Test Coverage Report for Build 2566

Details

  • 2 of 5 (40.0%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 66.637%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/main/java/com/uber/cadence/internal/testservice/TestWorkflowMutableStateImpl.java 2 5 40.0%
Files with Coverage Reduction New Missed Lines %
src/main/java/com/uber/cadence/serviceclient/WorkflowServiceTChannel.java 2 16.93%
Totals Coverage Status
Change from base Build 2565: 0.01%
Covered Lines: 12911
Relevant Lines: 19375

💛 - Coveralls

}
});
try {
parent.get().childWorkflowStarted(a);
Copy link
Member

@Groxx Groxx Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh. any idea why it was async in the first place? since it seems to be creating decision tasks above, it seems like the start-event probably already exists, so this doesn't seem to risk deadlocking or something...

is the issue that multiple fork-join-pool executes are racing, and sometimes complete runs before this start runs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two tiers of locks for the test service:

  • There's a lock in TestWorkflowService, which controls access to the Map of MutableState for every workflow.
  • Every MutableState has a lock.

Typically we lock the TestWorkflowService, get the MutableState, unlock the TestWorkflowService, and then perform an action on the MutableState. The MutableState then manages its own lock internally.

Any time a MutableState performs an action on another workflow we use the ForkJoinPool to run the operation asynchronously without holding any locks for the original workflow. These calls use the TestWorkflowService, so they repeat the process above. This is particularly important for any of the methods that process Decisions, which are all called while the lock for the workflow is held. If two workflows attempted to cancel each other they could deadlock.

By running this block asynchronously we end up triggering a race between:

  • This block running and updating the parent workflow that the child workflow started.
  • The next decision task being scheduled , processed, and reported back to the TestService, which could complete the child workflow. This similarly prompts an async operation to report the status update to the parent workflow. We see the test failure any time this process beats the first one.
    • The next decision task gets pushed to a queue before we even release the lock on the MutableState.

This block is unique for two reasons:

  • Starting a workflow is the only time we hold both the TestWorkflowService lock and MutableState lock simultaneously, so we're still holding the TestWorkflowService lock.
  • We've already released the MutableState lock above.

Running this synchronously somewhat incidentally ends up fixing the race condition because we're still holding the lock on the TestWorkflowService. The next decision task can't retrieve the MutableState for the child workflow until we release the lock, so we've created a strict order of events.

Copy link
Member

@Groxx Groxx Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense / fits about what I expected, yea. Thanks!
And it looks like the childWorkflowStarted acquires its own lock in update (mutable state lock?), so that sounds like a probably-correct fix.

I suppose a later step for all this would be to make these related test workflows share a single execution thread, since otherwise this seems prone to this kind of race for no benefit :\ though I'm gonna assume that's extremely hard without a complete rewrite.

Copy link
Member

@Groxx Groxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - I mostly intended to approve earlier, and was asking so I could understand the rest since this seemed like a pretty good "deep in the internals" spot where everything matters and there are a lot of assumptions I'm not familiar with.

Thanks for the explanation! It's definitely stuff I'm not familiar with yet :)

By reporting child workflow started asynchronously we can observe a child workflow attempting to complete prior to its state machine registering that it started. This is an invalid state transition and causes test failures.

See temporalio/sdk-java#1289 , we have the same test case still and it has similarly been flakey.
@natemort natemort merged commit c04bf6b into cadence-workflow:master Oct 29, 2024
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants