suspended vats: no worker, no snapshot, no transcript, held in-between incarnations #8955

warner · 2024-02-20T20:23:07Z

What is the Problem Being Solved?

Extracting the "Suspended Vats" discussion from #8405 (comment):

"Suspended Vats"

What happens if the restart fails? Specifically, the startVat delivery (which is what runs buildRootObject, and thus prepare in ZCF-based contract vats) might fail, perhaps if liveslots notices that the vat code failed to re-define all durable Kinds, or if consistency checks in the vat code itself trigger during startup.

For a normal vat upgrade, the kernel rolls back the upgrade, leaving the vat in its old state, and rejects the promises returned by E(adminNode).upgrade() so the userspace code that asked for an upgrade can decide what to do. But if we're restarting vats to switch to a new xsnap, we have no way to return to the old version: that old heap snapshot and transcript are unusable without the old xsnap to run them.

In addition, this restart-all-vats plan might be practical for now, when we have 87 vats on chain, but not when we have a thousand.

@mhofman introduced the idea of "suspending" the vats (he originally used the term "pause", but we agreed that would conflict with a smaller-yet-overlapping feature that stops pulling items off the run-queue for a while, like when the vat has exceeded its runtime budget). "Hibernation" might be another word to use, but I'm thinking that "suspended animation" (no activity, needs significant/risky effort to revive) captures the idea pretty well.

The name refers to the state of a vat in the middle of the usual upgrade process, after the old worker has been shut down (and the old snapshot/transcript effectively deleted), but before the new worker is started (with the new vat bundle). In this state, there is no worker and no transcript.

Just like we currently "page-in" offline vats on-demand when a delivery arrives, starting a worker (from heap snapshot and transcript), we can imagine "unsuspending" suspended vats on-demand when a delivery arrives. Unlike page-in, which can be different on each validator/follower, unsuspending would happen in-consensus (all validators have the same set of suspended vats at the same time).

This leads to a vat lifetime shaped like:

- createVat starts the lifetime, and vat termination ends it
- that lifetime is broken up into "incarnations", separated by periods of suspension (perhaps just for a moment, or for months)
- each incarnation ends when the vat is upgraded, suspended, or terminated
  - ending the incarnation means deleting the heap snapshot/transcript, perhaps after a final BOYD
- each incarnation starts when the vat is upgraded, unsuspended, or created
  - starting the incarnation means sampling the current liveslots/supervisor bundles, initializing a worker, and delivering startVat
- within each incarnation, the vat might be online or offline at any given moment, different for each validator/follower
- bringing a vat online means loading a heap snapshot and replaying a transcript
- taking a vat offline means killing the worker
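
As a rough illustration of the lifetime described above, here is a minimal sketch of the states and transitions. Every name in it (`makeVatLifetime`, `phase`, `pageIn`, and so on) is hypothetical and does not correspond to existing SwingSet code; the `online` flag stands in for the per-validator worker state, which would not be part of consensus.

```javascript
// Hypothetical model of one vat's lifetime; not real SwingSet code.
function makeVatLifetime() {
  let phase = 'new'; // 'new' | 'running' | 'suspended' | 'terminated'
  let incarnation = 0; // number of incarnations started so far
  let online = false; // per-validator: is a worker currently loaded?

  const startIncarnation = () => {
    // a real kernel would sample the liveslots/supervisor bundles,
    // initialize a worker, and deliver startVat here
    incarnation += 1;
    phase = 'running';
    online = true;
  };
  const endIncarnation = () => {
    if (phase !== 'running') throw Error(`no incarnation to end while ${phase}`);
    // a real kernel would delete the heap snapshot/transcript here,
    // perhaps after a final BOYD
    phase = 'suspended';
    online = false;
  };
  const requireRunning = () => {
    if (phase !== 'running') throw Error(`vat is ${phase}, not running`);
  };

  return {
    create: () => {
      if (phase !== 'new') throw Error('vat was already created');
      startIncarnation();
    },
    suspend: endIncarnation,
    unsuspend: () => {
      if (phase !== 'suspended') throw Error(`cannot unsuspend while ${phase}`);
      startIncarnation();
    },
    upgrade: () => {
      endIncarnation();
      startIncarnation();
    },
    terminate: () => {
      if (phase === 'running') endIncarnation();
      phase = 'terminated';
    },
    // page-out/page-in only make sense inside an incarnation
    pageOut: () => {
      requireRunning();
      online = false; // kill the worker, keep snapshot and transcript
    },
    pageIn: () => {
      requireRunning();
      online = true; // load heap snapshot, replay transcript
    },
    getState: () => ({ phase, incarnation, online }),
  };
}
```

In this model a suspended vat cannot be paged in, because there is no snapshot or transcript to load: only `unsuspend` (a fresh startVat) or `upgrade` can begin the next incarnation.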

This "unsuspend" revivification would take longer than a snapshot+replay, because we have to execute the whole startVat (which, traditionally, is kind of expensive), and this delay might happen at an inconvenient time.

And it might fail (since it might be using a different xsnap/liveslots/Endo), or at least fail in different ways than a transcript replay (which is "shouldn't fail" enough that we panic the kernel if it occurs). If it does fail, since we can't return to the old state, our best option is to leave the vat in a suspended state, and set a flag that inhibits automatic unsuspension so we don't get stuck in a loop.

With suspension, our restart-all-vats process now looks like:

- the chain-halting upgrade invokes the swingset/controller API that says "I want to restart all vats"
  - that marks all vats as suspended: all heap snapshots/transcripts are deleted (in practice, transcripts are truncated, not necessarily deleted, but the effect is equivalent)
- the kernel starts, and some in-consensus set of vats are unsuspended immediately: probably all static vats, and any vats marked with the criticalVatFlag
  - if any of these vats fails to unsuspend, the kernel should panic and the chain should halt, awaiting a better xsnap/liveslots/supervisor which doesn't have the problem (this is why we need to terminate all non-restartable vats first)
    - this is similar to the worker-preload that vat-warehouse does, except that it must be in-consensus, whereas each validator could preload a different number/set of workers without consensus issues
- the remaining vats are left suspended, and will be unsuspended on-demand the first time a delivery is made to each
  - an unsuspension error in these vats will mark the vat as deliberately suspended, inhibiting automatic unsuspension until some manual process (perhaps a normal vat-upgrade) clears the flag and allows unsuspension to resume
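
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields (`isStatic`, `critical`, `suspended`, `deliberatelySuspended`, `queue`), the `startVat` callback, and both function names are hypothetical, and queueing the delivery that triggered a failed unsuspension is one possible choice rather than a settled design.

```javascript
// Hypothetical sketch; `vats` is a Map of vatID -> record, and `startVat`
// is a caller-supplied function that throws if the new incarnation cannot start.
function restartAllVats(vats, startVat) {
  // Step 1: mark every vat as suspended (snapshot/transcript are dropped).
  for (const vat of vats.values()) {
    vat.suspended = true;
  }
  // Step 2: eagerly unsuspend the in-consensus set. Any failure is fatal.
  for (const [vatID, vat] of vats) {
    if (vat.isStatic || vat.critical) {
      try {
        startVat(vatID);
      } catch (err) {
        throw Error(`kernel panic: ${vatID} failed to unsuspend: ${err.message}`);
      }
      vat.suspended = false;
    }
  }
  // Step 3: everything else stays suspended until a delivery arrives.
}

function deliverOnDemand(vats, startVat, vatID, delivery) {
  const vat = vats.get(vatID);
  if (vat.deliberatelySuspended) {
    vat.queue.push(delivery); // wait for a manual process to clear the flag
    return 'queued';
  }
  if (vat.suspended) {
    try {
      startVat(vatID);
      vat.suspended = false;
    } catch (err) {
      // do not retry automatically, to avoid an unsuspend loop
      vat.deliberatelySuspended = true;
      vat.queue.push(delivery);
      return 'queued';
    }
  }
  // a real kernel would hand `delivery` to the worker here
  return 'delivered';
}
```

The loops in `restartAllVats` still touch every vat record, but the expensive startVat work is limited to the eager set plus whichever vats later receive deliveries.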

Over time, most vats will remain in a suspended state, and only active vats will have an active transcript/heap-snapshot. Vats which are idle across multiple upgrades will not experience the intermediate versions. The kernel work will be proportional to the number of non-idle vats, rather than the total number of vats.

We might want a second flag, perhaps named restartCriticalFlag, distinct from criticalVatFlag (which means "panic the kernel if this vat is ever terminated"), to control the unsuspend-on-restart behavior. Setting this flag on a vat means more delay during restart-all-vats, but it also means we refuse to proceed without proof that it can restart (which lets us discover the problem right away).

The difference between "pausing" a vat and "suspending" one is that the "paused" flag just inhibits run-queue message delivery: there is still a worker, but each time we pull something off the run-queue for the vat, instead of delivering it, we push it off to a side-queue that will be serviced later, when the vat is unpaused. Vats which are suspended do not have any worker state, and will need a startVat to generate some.

I think we need another flag to distinguish between "suspended by a restart-all-vats event", which means we automatically start the new incarnation when a delivery arrives, and "deliberately suspended" (because of error), where we do not. Vats which are deliberately suspended are also paused. Maybe we can use combinations of two boolean flags:

- suspended=no, paused=no: deliver as usual
- suspended=yes, paused=no: startVat on demand
- suspended=yes, paused=yes: enqueue would-be deliveries, uncertain about unsuspend working
- suspended=no, paused=yes: enqueue would-be deliveries, confident about page-in working
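
The four combinations above can be read as a small decision function. The sketch below is illustrative only: `deliveryAction` and the returned action strings are made-up names, not part of any existing kernel API.

```javascript
// Hypothetical helper: maps the two proposed flags to a kernel action.
function deliveryAction({ suspended = false, paused = false } = {}) {
  if (paused) {
    // both paused rows enqueue; they differ in how risky revival is
    return suspended ? 'enqueue-unsuspend-uncertain' : 'enqueue-page-in-ok';
  }
  return suspended ? 'start-vat-on-demand' : 'deliver';
}
```

Checking `paused` first reflects that a paused vat never receives a delivery, regardless of whether it has worker state.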

Minor metering overruns would set paused=yes but not suspend the vat. This might also just be implemented by a more sophisticated kernel scheduler, with an input-queue-per-vat instead of a single merged run-queue, by just deciding to not service that vat's input queue until it has accumulated more service priority.

More severe vat errors would be dealt with by setting suspended=yes, paused=yes and deleting the worker state (leaving the durable state), which inhibits both delivery and automatic restart until someone calls E(adminNode).upgrade() to mark the vat as ready for work. This upgrade would be expected to provide code that resolves the original problem. upgrade() would clear the paused flag, and would also clear the suspended flag as a side-effect of launching the new incarnation right away.

It might make sense to skip the page-in preload for vats which are currently paused: why waste the memory and CPU when we know it will take something special to unpause them? Likewise we might preemptively page-out the worker when a vat gets paused.

We might want to introduce an E(adminNode).resume() or unpause() to let userspace (zoe? core-eval?) clear the paused flag (and deliver any queued messages) for a vat that got itself paused by overrunning its meter, sort of like paying a parking fine to get your car un-booted.
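
To make the pause/resume idea concrete, here is a minimal sketch of a per-vat side-queue. The names `makePausableVat`, `dispatch`, `resume`, and `deliverToWorker` are assumptions for illustration; this is not the adminNode API, which does not yet have such a method.

```javascript
// Hypothetical per-vat pause machinery; `deliverToWorker` is supplied by the caller.
function makePausableVat(deliverToWorker) {
  let paused = false;
  const sideQueue = [];
  return {
    pause: () => {
      paused = true;
    },
    // called each time the kernel pulls an item for this vat off the run-queue
    dispatch: item => {
      if (paused) {
        sideQueue.push(item); // hold it for later instead of delivering
        return;
      }
      deliverToWorker(item);
    },
    resume: () => {
      paused = false;
      // service the side-queue in arrival order; stop if a delivery re-pauses the vat
      while (!paused && sideQueue.length > 0) {
        deliverToWorker(sideQueue.shift());
      }
    },
    queuedCount: () => sideQueue.length,
  };
}
```

For example, a vat that overran its meter would be paused, accumulate deliveries in the side-queue, and receive them in their original order once something with access to the admin facet calls `resume()`.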

Description of the Design

TBD. Each vat will get a flag to say whether it is suspended or not, and whether it's deliberately suspended, or if the vat will be unsuspended as soon as the next message arrives for it. New APIs to suspend/unsuspend existing vats. Interaction with the #8405 restart-all-vats code. Interaction with the #3528 "pause vat" feature.

Security Considerations

Pausing a vat causes it to stop making progress, so any userspace control should be limited to the adminNode object (which already allows termination).

Scaling Considerations

We need to carefully consider any API which touches all vats, because there may be a lot of them. Restart-all-vats might do this.

Test Plan

Kernel unit tests to exercise all states and state transitions of a vat.

Upgrade Considerations

Since pause/suspend does not exist yet, the current state of all vats is "unpaused and unsuspended". Whatever tracking metadata we define needs to decode the lack of metadata as this state.

We could consider an upgrade-time metadata replacement step which writes explicit "unpaused and unsuspended" state out for all vats, rather than using the implicit "lack of metadata means.." approach.
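
A minimal sketch of the implicit-default approach, assuming a Map-like store with `get()` and a JSON-encoded value under a made-up key (`${vatID}.suspension`); the real key layout and encoding are undecided.

```javascript
// Hypothetical key layout and encoding; absent metadata decodes as
// "unpaused and unsuspended".
function readVatFlags(kvStore, vatID) {
  const raw = kvStore.get(`${vatID}.suspension`);
  if (raw === undefined) {
    return { suspended: false, paused: false };
  }
  const { suspended = false, paused = false } = JSON.parse(raw);
  return { suspended: Boolean(suspended), paused: Boolean(paused) };
}
```

With this shape, vats created before the feature existed need no migration step, while an explicit upgrade-time rewrite would simply populate the same key for every vat.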

warner added enhancement New feature or request SwingSet package: SwingSet labels Feb 20, 2024

warner mentioned this issue Feb 20, 2024

xsnap upgrade by restart-time forced upgrade of all vats by kernel #8405

Open
