Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(async-flow): panic if a guest fulfills during replay #10160

Merged
merged 4 commits into from
Sep 28, 2024

Conversation

michaelfig
Copy link
Member

@michaelfig michaelfig commented Sep 25, 2024

closes: #9465
closes: #10148

Description

This pull request introduces several changes to the async-flow package, focusing on enhancing error handling, updating documentation, and improving test robustness. Key changes include updating panicking if the guest function settles during replay (unexpected early completion), Failure state handling, adding a panic handler, and refactoring tests to utilize a new utility function.

Security Considerations

Behaviour is more correct by guaranteeing the guest behaviour actually matches the recorded log.

Scaling Considerations

n/a

Documentation Considerations

n/a

Testing Considerations

Converted an existing test.failing to a passing test, with a minor change to use eventLoopIteration to paper over multiple turns needed for the async flow to register an early completion failure.

Upgrade Considerations

n/a

@michaelfig michaelfig requested a review from mhofman September 25, 2024 23:12
@michaelfig michaelfig self-assigned this Sep 25, 2024
@michaelfig michaelfig added the asyncFlow related to membrane-based replay and upgrade of async functions label Sep 25, 2024
@mhofman
Copy link
Member

mhofman commented Sep 25, 2024

fulfills

quick nit: settles. An early rejection is as incorrect and an early fulfillment, and actually more likely to occur.

Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handle the rejection case too :)

packages/async-flow/src/async-flow.js Outdated Show resolved Hide resolved
packages/async-flow/src/async-flow.js Outdated Show resolved Hide resolved
Copy link

cloudflare-workers-and-pages bot commented Sep 26, 2024

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: 1daa37f
Status: ✅  Deploy successful!
Preview URL: https://72a47a5e.agoric-sdk.pages.dev
Branch Preview URL: https://mfig-async-flow-early-comple.agoric-sdk.pages.dev

View logs

@michaelfig michaelfig force-pushed the mfig-async-flow-early-completion branch from fb1407a to cf3fe86 Compare September 26, 2024 00:53
@michaelfig michaelfig requested a review from a team as a code owner September 26, 2024 16:50
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, feel free to merge once the following is addressed:

  • Small documentation update to reflect in which case the flow goes from running to failed.
  • I am uncomfortable defaulting the test async flow panic handler to just log, and I think it's safe to make it fail the test. Please let m know if I'm misunderstanding something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the flow can also go from running to failed if the guest does something illegal like passing a non passable to a guest wrapper. It's not just for internal failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the diagram.

* @param {PreparationOptions} [opts]
*/
export const prepareTestAsyncFlowTools = (t, zone, opts) => {
const { panicHandler = e => t.log('Handled panic', e) } = opts || {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the default behavior be to t.fail the test? I know that's what the unhandled rejection handler ava sets up would do, but I think that's the correct default behavior, and here it'd just be more explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so you're talking about the default panicHandler. I presume you mean that the tests that had unhandled rejections, but expected them, would override the panicHandler option?

@@ -149,8 +150,12 @@ const testBadShortReplay = async (t, zone) => {

const { guestMethod } = {
async guestMethod(_gOrch7, _g1, _p3) {
t.log(' badReplay return early');
resolveStep(true);
t.log(' badShortReplay return early');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this log line before the return ?

@@ -10,6 +10,7 @@

- ***Replaying***. To start ***Replaying***, the activation first translates the saved activation arguments from host to guest, invokes the guest function, and starts the membrane replaying from its durable log. The replay is finished when the last log entry has been replayed. Once replaying is finished, the activation has caught up and transitions back to ***Running***.

- ***Failed***. If during the ***Replaying*** state the guest activation fails to exactly reproduce its previously logged behavior, it goes into the inactive ***Failed*** state, with a diagnostic explaining how the replay failed, so it can be repaired by another future upgrade. As of the next reincarnation, the failure status is cleared and it starts ***Replaying*** again, hoping not to fail this time. If replay failed because the guest async function did not reproduce its previous behavior, then the upgrade needs to replace the function with one which does. If the replay failed because of a failure of the `asyncFlow` mechanism, whether a bug or merely hitting a case that is not yet implemented, then the upgrade needs to replace the relevant part of `asyncFlow`'s mechanism.
- ***Failed***. If during the ***Replaying*** state the guest activation fails to exactly reproduce its previously logged behavior, it goes into the inactive ***Failed*** state, with a diagnostic explaining how the replay failed, so it can be repaired by another future upgrade. As of the next reincarnation, the failure status is cleared and it starts ***Replaying*** again, hoping not to fail this time. If replay failed because the guest async function did not reproduce its previous behavior, then the upgrade needs to replace the function with one which does. If a ***Replaying*** or ***Running*** guest failed because of a failure of the `asyncFlow` mechanism, whether a bug or merely hitting a case that is not yet implemented, then an upgrade needs to replace the relevant part of the `asyncFlow`'s mechanism.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to mention that a Running guest can go into failed state if it interacts with a host wrapper in an unsupported way (e.g. uses non passable or otherwise non recognized values as arguments)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "a case that is not yet implemented" cover that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded this bullet to be more explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things may be not yet implemented. But others may just be intrinsic limitations of async-flow or of our object model. Passing a non-passable value as argument is one of those purely guest error that async-flow isn't responsible for.

@mhofman
Copy link
Member

mhofman commented Sep 27, 2024

closes: #10148

I just realized, can we add an explicit test of the new panic handler? I'm thinking something like a guest that passes a locally defined function as argument, and verify that trips up the panic handler.

@michaelfig
Copy link
Member Author

I just realized, can we add an explicit test of the new panic handler? I'm thinking something like a guest that passes a locally defined function as argument, and verify that trips up the panic handler.

A guest that passes a locally defined function doesn't cause a panic, it just shows up as a 'doThrow' in the log. Do you have another way to test what you have in mind?

@mhofman
Copy link
Member

mhofman commented Sep 27, 2024

A guest that passes a locally defined function doesn't cause a panic, it just shows up as a 'doThrow' in the log

That is surprising, and not what I expected at all.

Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Small nit, as relying on precise turn counts is likely not reliable long term. That said, we likely need to update all tests to be more resilient to that, so feel free to ignore for now.

I'm satisfied that the new panic handler is exercised in some of these tests already, It would be nice to also cover the case of guest misbehavior in playing state, but that's something we can add as part of #9933

@@ -103,6 +111,8 @@ const testBadHostFirstPlay = async (t, zone) => {
// Notice that the bad call was not recorded in the log
]);
t.log('badHost firstPlay done');
// Allow panicHandler to be called.
await null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to say we should await eventLoopIteration() (aka await nextCrank()) here

@michaelfig michaelfig force-pushed the mfig-async-flow-early-completion branch from 870a055 to 1daa37f Compare September 28, 2024 16:13
@michaelfig michaelfig added the automerge:no-update (expert!) Automatically merge without updates label Sep 28, 2024
@mergify mergify bot merged commit 442f07c into master Sep 28, 2024
91 checks passed
@mergify mergify bot deleted the mfig-async-flow-early-completion branch September 28, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
asyncFlow related to membrane-based replay and upgrade of async functions automerge:no-update (expert!) Automatically merge without updates
Projects
None yet
2 participants