fix(async-flow): panic if a guest fulfills during replay #10160

michaelfig · 2024-09-25T23:12:28Z

closes: #9465
closes: #10148

Description

This pull request introduces several changes to the async-flow package, focusing on enhancing error handling, updating documentation, and improving test robustness. Key changes include updating panicking if the guest function settles during replay (unexpected early completion), Failure state handling, adding a panic handler, and refactoring tests to utilize a new utility function.

Security Considerations

Behaviour is more correct by guaranteeing the guest behaviour actually matches the recorded log.

Scaling Considerations

n/a

Documentation Considerations

n/a

Testing Considerations

Converted an existing test.failing to a passing test, with a minor change to use eventLoopIteration to paper over multiple turns needed for the async flow to register an early completion failure.

Upgrade Considerations

n/a

mhofman · 2024-09-25T23:16:08Z

fulfills

quick nit: settles. An early rejection is as incorrect and an early fulfillment, and actually more likely to occur.

mhofman

We need to handle the rejection case too :)

packages/async-flow/src/async-flow.js

cloudflare-workers-and-pages · 2024-09-26T00:52:05Z

Deploying agoric-sdk with Cloudflare Pages

Latest commit:	`1daa37f`
Status:	✅ Deploy successful!
Preview URL:	https://72a47a5e.agoric-sdk.pages.dev
Branch Preview URL:	https://mfig-async-flow-early-comple.agoric-sdk.pages.dev

View logs

mhofman

Looks good, feel free to merge once the following is addressed:

Small documentation update to reflect in which case the flow goes from running to failed.
I am uncomfortable defaulting the test async flow panic handler to just log, and I think it's safe to make it fail the test. Please let m know if I'm misunderstanding something.

mhofman · 2024-09-26T21:08:40Z

packages/async-flow/docs/async-flow-states.png

Technically the flow can also go from running to failed if the guest does something illegal like passing a non passable to a guest wrapper. It's not just for internal failures.

Updated the diagram.

mhofman · 2024-09-26T21:13:26Z

packages/async-flow/test/_utils.js

+ * @param {PreparationOptions} [opts]
+ */
+export const prepareTestAsyncFlowTools = (t, zone, opts) => {
+  const { panicHandler = e => t.log('Handled panic', e) } = opts || {};


Shouldn't the default behavior be to t.fail the test? I know that's what the unhandled rejection handler ava sets up would do, but I think that's the correct default behavior, and here it'd just be more explicit.

Okay, so you're talking about the default panicHandler. I presume you mean that the tests that had unhandled rejections, but expected them, would override the panicHandler option?

mhofman · 2024-09-26T21:14:51Z

packages/async-flow/test/async-flow-early-completion.test.js

@@ -149,8 +150,12 @@ const testBadShortReplay = async (t, zone) => {

  const { guestMethod } = {
    async guestMethod(_gOrch7, _g1, _p3) {
-      t.log('  badReplay return early');
-      resolveStep(true);
+      t.log('  badShortReplay return early');


Let's move this log line before the return ?

mhofman · 2024-09-26T21:20:35Z

packages/async-flow/docs/async-flow-states.md

@@ -10,6 +10,7 @@

 - ***Replaying***. To start ***Replaying***, the activation first translates the saved activation arguments from host to guest, invokes the guest function, and starts the membrane replaying from its durable log. The replay is finished when the last log entry has been replayed. Once replaying is finished, the activation has caught up and transitions back to ***Running***.

- ***Failed***. If during the ***Replaying*** state the guest activation fails to exactly reproduce its previously logged behavior, it goes into the inactive ***Failed*** state, with a diagnostic explaining how the replay failed, so it can be repaired by another future upgrade. As of the next reincarnation, the failure status is cleared and it starts ***Replaying*** again, hoping not to fail this time. If replay failed because the guest async function did not reproduce its previous behavior, then the upgrade needs to replace the function with one which does. If the replay failed because of a failure of the `asyncFlow` mechanism, whether a bug or merely hitting a case that is not yet implemented, then the upgrade needs to replace the relevant part of `asyncFlow`'s mechanism.
+- ***Failed***. If during the ***Replaying*** state the guest activation fails to exactly reproduce its previously logged behavior, it goes into the inactive ***Failed*** state, with a diagnostic explaining how the replay failed, so it can be repaired by another future upgrade. As of the next reincarnation, the failure status is cleared and it starts ***Replaying*** again, hoping not to fail this time. If replay failed because the guest async function did not reproduce its previous behavior, then the upgrade needs to replace the function with one which does. If a ***Replaying*** or ***Running*** guest failed because of a failure of the `asyncFlow` mechanism, whether a bug or merely hitting a case that is not yet implemented, then an upgrade needs to replace the relevant part of the `asyncFlow`'s mechanism.


I think we also need to mention that a Running guest can go into failed state if it interacts with a host wrapper in an unsupported way (e.g. uses non passable or otherwise non recognized values as arguments)

Does "a case that is not yet implemented" cover that?

I reworded this bullet to be more explicit.

Some things may be not yet implemented. But others may just be intrinsic limitations of async-flow or of our object model. Passing a non-passable value as argument is one of those purely guest error that async-flow isn't responsible for.

mhofman · 2024-09-27T00:04:06Z

closes: #10148

I just realized, can we add an explicit test of the new panic handler? I'm thinking something like a guest that passes a locally defined function as argument, and verify that trips up the panic handler.

michaelfig · 2024-09-27T17:17:10Z

I just realized, can we add an explicit test of the new panic handler? I'm thinking something like a guest that passes a locally defined function as argument, and verify that trips up the panic handler.

A guest that passes a locally defined function doesn't cause a panic, it just shows up as a 'doThrow' in the log. Do you have another way to test what you have in mind?

mhofman · 2024-09-27T17:46:06Z

A guest that passes a locally defined function doesn't cause a panic, it just shows up as a 'doThrow' in the log

That is surprising, and not what I expected at all.

mhofman

LGTM

Small nit, as relying on precise turn counts is likely not reliable long term. That said, we likely need to update all tests to be more resilient to that, so feel free to ignore for now.

I'm satisfied that the new panic handler is exercised in some of these tests already, It would be nice to also cover the case of guest misbehavior in playing state, but that's something we can add as part of #9933

mhofman · 2024-09-28T00:06:29Z

packages/async-flow/test/bad-host.test.js

@@ -103,6 +111,8 @@ const testBadHostFirstPlay = async (t, zone) => {
    // Notice that the bad call was not recorded in the log
  ]);
  t.log('badHost firstPlay done');
+  // Allow panicHandler to be called.
+  await null;


I'm tempted to say we should await eventLoopIteration() (aka await nextCrank()) here

michaelfig requested a review from mhofman September 25, 2024 23:12

michaelfig self-assigned this Sep 25, 2024

michaelfig added the asyncFlow related to membrane-based replay and upgrade of async functions label Sep 25, 2024

mhofman requested changes Sep 25, 2024

View reviewed changes

packages/async-flow/src/async-flow.js Outdated Show resolved Hide resolved

packages/async-flow/src/async-flow.js Outdated Show resolved Hide resolved

michaelfig force-pushed the mfig-async-flow-early-completion branch from fb1407a to cf3fe86 Compare September 26, 2024 00:53

michaelfig requested a review from a team as a code owner September 26, 2024 16:50

michaelfig requested review from AgoricTriage and mhofman September 26, 2024 16:50

mhofman approved these changes Sep 26, 2024

View reviewed changes

mhofman approved these changes Sep 28, 2024

View reviewed changes

michaelfig added 4 commits September 28, 2024 10:12

fix(async-flow): panic if a guest fulfills during replay

77aadf6

fix(async-flow): more robust guest completion and panic handling

caf934f

test(async-flow): use prepareTestAsyncFlowTools(t,...)

df43f33

docs(async-flow): add state change from Running to Failed

1daa37f

michaelfig force-pushed the mfig-async-flow-early-completion branch from 870a055 to 1daa37f Compare September 28, 2024 16:13

michaelfig added the automerge:no-update (expert!) Automatically merge without updates label Sep 28, 2024

mergify bot merged commit 442f07c into master Sep 28, 2024
91 checks passed

mergify bot deleted the mfig-async-flow-early-completion branch September 28, 2024 16:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(async-flow): panic if a guest fulfills during replay #10160

fix(async-flow): panic if a guest fulfills during replay #10160

michaelfig commented Sep 25, 2024 •

edited

Loading

mhofman commented Sep 25, 2024

mhofman left a comment

cloudflare-workers-and-pages bot commented Sep 26, 2024 •

edited

Loading

mhofman left a comment

mhofman Sep 26, 2024

michaelfig Sep 27, 2024

mhofman Sep 26, 2024

michaelfig Sep 27, 2024

mhofman Sep 26, 2024

mhofman Sep 26, 2024

michaelfig Sep 27, 2024

michaelfig Sep 27, 2024

mhofman Sep 27, 2024

mhofman commented Sep 27, 2024

michaelfig commented Sep 27, 2024

mhofman commented Sep 27, 2024

mhofman left a comment

mhofman Sep 28, 2024

fix(async-flow): panic if a guest fulfills during replay #10160

fix(async-flow): panic if a guest fulfills during replay #10160

Conversation

michaelfig commented Sep 25, 2024 • edited Loading

Description

Security Considerations

Scaling Considerations

Documentation Considerations

Testing Considerations

Upgrade Considerations

mhofman commented Sep 25, 2024

mhofman left a comment

Choose a reason for hiding this comment

cloudflare-workers-and-pages bot commented Sep 26, 2024 • edited Loading

Deploying agoric-sdk with Cloudflare Pages

mhofman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mhofman commented Sep 27, 2024

michaelfig commented Sep 27, 2024

mhofman commented Sep 27, 2024

mhofman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelfig commented Sep 25, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Sep 26, 2024 •

edited

Loading