suspended vats: no worker, no snapshot, no transcript, held in-between incarnations #8955

Open
warner opened this issue Feb 20, 2024 · 0 comments
Labels: enhancement (New feature or request), SwingSet (package: SwingSet)

warner commented Feb 20, 2024

What is the Problem Being Solved?

Extracting the "Suspended Vats" discussion from #8405 (comment):

"Suspended Vats"

What happens if the restart fails? Specifically, the startVat delivery (which is what runs buildRootObject, and thus prepare in ZCF-based contract vats) might fail, perhaps if liveslots notices that the vat code failed to re-define all durable Kinds, or if consistency checks in the vat code itself trigger during startup.

For a normal vat upgrade, the kernel rolls back the upgrade, leaving the vat in its old state, and rejects the promises returned by E(adminNode).upgrade() so the userspace code that asked for an upgrade can decide what to do. But if we're restarting vats to switch to a new xsnap, we have no way to return to the old version: that old heap snapshot and transcript are unusable without the old xsnap to run them.

In addition, this restart-all-vats plan might be practical for now, when we have 87 vats on chain, but not when we have a thousand.

@mhofman introduced the idea of "suspending" the vats (he originally used the term "pause", but we agreed that would conflict with a smaller-yet-overlapping feature that stops pulling items off the run-queue for a while, like when the vat has exceeded its runtime budget). "Hibernation" might be another word to use, but I'm thinking that "suspended animation" (no activity, needs significant/risky effort to revive) captures the idea pretty well.

The name refers to the state of a vat in the middle of the usual upgrade process, after the old worker has been shut down (and the old snapshot/transcript effectively deleted), but before the new worker is started (with the new vat bundle). In this state, there is no worker and no transcript.

Just like we currently "page-in" offline vats on-demand when a delivery arrives, starting a worker (from heap snapshot and transcript), we can imagine "unsuspending" suspended vats on-demand when a delivery arrives. Unlike page-in, which can be different on each validator/follower, unsuspending would happen in-consensus (all validators have the same set of suspended vats at the same time).

This leads to a vat lifetime shaped like:

  • createVat starts the lifetime, and vat termination ends it
  • that lifetime is broken up into "incarnations", separated by periods of suspension (perhaps just for a moment, or for months)
  • each incarnation ends when the vat is upgraded, suspended, or terminated
    • ending the incarnation means deleting the heap snapshot/transcript, perhaps after a final BOYD
  • each incarnation starts when the vat is upgraded, unsuspended, or created
    • starting the incarnation means sampling the current liveslots/supervisor bundles, initializing a worker, and delivering startVat
  • within each incarnation, the vat might be online or offline at any given moment, different for each validator/follower
  • bringing a vat online means loading a heap snapshot and replaying a transcript
  • bringing a vat offline means killing the worker
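
The lifetime rules above can be sketched as a small state model. This is only an illustration: the class and all of its method and field names are hypothetical, not SwingSet APIs. It assumes the incarnation counter only advances when an incarnation starts, and that page-in/page-out never touch it.

```javascript
// Hypothetical model of one vat's lifetime; none of these names are SwingSet APIs.
class VatLifetime {
  constructor() {
    this.terminated = false;
    this.incarnation = 0; // count of incarnations started so far
    this.suspended = true; // in-consensus: no worker state exists yet
    this.online = false; // local to each validator/follower
    this.startIncarnation(); // createVat starts the first incarnation
  }

  // Starting an incarnation: sample bundles, init worker, deliver startVat.
  startIncarnation() {
    if (this.terminated) throw Error('vat is terminated');
    if (!this.suspended) throw Error('an incarnation is already running');
    this.incarnation += 1;
    this.suspended = false;
    this.online = true;
  }

  // Ending an incarnation: snapshot/transcript are deleted, worker is gone.
  endIncarnation() {
    if (this.suspended) throw Error('no incarnation to end');
    this.suspended = true;
    this.online = false;
  }

  suspend() { this.endIncarnation(); }
  unsuspend() { this.startIncarnation(); }
  upgrade() {
    if (!this.suspended) this.endIncarnation();
    this.startIncarnation();
  }
  terminate() {
    if (!this.suspended) this.endIncarnation();
    this.terminated = true;
  }

  // Page-out/page-in never change the incarnation.
  pageOut() { this.online = false; }
  pageIn() {
    if (this.suspended) throw Error('suspended vats have nothing to page in');
    this.online = true; // load heap snapshot, replay transcript
  }
}
```

In this model `suspended` would be consensus state, while `online` could legitimately differ between validators.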

This "unsuspend" revivification would take longer than a snapshot+replay, because we have to execute the whole startVat (which, traditionally, is kind of expensive), and this delay might happen at an inconvenient time.

It might also fail (since it might be using a different xsnap/liveslots/Endo), or at least fail in different ways than a transcript replay (which is "shouldn't fail" enough that we panic the kernel if it occurs). If it does fail, since we can't return to the old state, our best option is to leave the vat in a suspended state, and set a flag that inhibits automatic unsuspension so we don't get stuck in a loop.

With suspension, our restart-all-vats process now looks like:

  • the chain-halting upgrade invokes the swingset/controller API that says "I want to restart all vats"
    • that marks all vats as suspended: all heap snapshots/transcripts are deleted (in practice, transcripts are truncated, not necessarily deleted, but the effect is equivalent)
  • the kernel starts, and some in-consensus set of vats are unsuspended immediately: probably all static vats, and any vats marked with the criticalVatFlag
    • if any of these vats fails to unsuspend, the kernel should panic and the chain should halt, awaiting a better xsnap/liveslots/supervisor which doesn't have the problem (this is why we need to terminate all non-restartable vats first)
      • this is similar to the worker-preload that vat-warehouse does, except that it must be in-consensus, whereas each validator could preload a different number/set of workers without consensus issues
  • the remaining vats are left suspended, and will be unsuspended on-demand the first time a delivery is made to each
    • an unsuspension error in these vats will mark the vat as deliberately suspended, inhibiting automatic unsuspension until some manual process (perhaps a normal vat-upgrade) clears the flag and allows unsuspension to resume
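
A rough sketch of that process follows, under stated assumptions. The function names, the `isStatic`/`critical`/`inhibitUnsuspend` fields, and the `startVat` callback are all hypothetical stand-ins, not kernel APIs; `startVat` is assumed to throw when the new incarnation cannot be started.

```javascript
// Hypothetical sketch; `vats` is a Map of vatID -> record, `startVat` may throw.
function restartAllVats(vats, startVat) {
  // Step 1: suspend everything (heap snapshots/transcripts are dropped).
  for (const vat of vats.values()) {
    vat.suspended = true;
  }
  // Step 2: eagerly unsuspend the in-consensus "must work" set.
  for (const [vatID, vat] of vats) {
    if (vat.isStatic || vat.critical) {
      try {
        startVat(vatID);
      } catch (err) {
        // Stand-in for a kernel panic: the chain halts until a fixed
        // xsnap/liveslots/supervisor is available.
        throw Error(`kernel panic: ${vatID} failed to unsuspend: ${err.message}`);
      }
      vat.suspended = false;
    }
  }
}

// Step 3: every other vat is unsuspended lazily when a delivery arrives.
// Returns true if the vat is ready to receive the delivery.
function unsuspendOnDemand(vats, vatID, startVat) {
  const vat = vats.get(vatID);
  if (!vat.suspended) return true;
  if (vat.inhibitUnsuspend) return false; // deliberately suspended
  try {
    startVat(vatID);
    vat.suspended = false;
    return true;
  } catch (err) {
    vat.inhibitUnsuspend = true; // do not retry on every delivery
    return false;
  }
}
```

Note that this sketch leaves an existing `inhibitUnsuspend` flag alone during the restart, on the assumption that only a manual process should clear it.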

Over time, most vats will remain in a suspended state, and only active vats will have an active transcript/heap-snapshot. Vats which are idle across multiple upgrades will not experience the intermediate versions. The kernel work will be proportional to the number of non-idle vats, rather than the total number of vats.

We might want a second flag, perhaps named restartCriticalFlag, distinct from criticalVatFlag (which means "panic the kernel if this vat is ever terminated"), to control the unsuspend-on-restart behavior. Setting this flag on a vat means more delay during restart-all-vats, but it also means we refuse to proceed without proof that it can restart (which lets us discover the problem right away).

The difference between "pausing" a vat and "suspending" one is that the "paused" flag just inhibits run-queue message delivery: there is still a worker, but each time we pull something off the run-queue for the vat, instead of delivering it, we push it off to a side-queue that will be serviced later, when the vat is unpaused. Vats which are suspended do not have any worker state, and will need a startVat to generate some.
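
The side-queue behaviour of a paused vat might look something like the sketch below. Everything here is hypothetical (`makeScheduler`, `step`, the queue shapes); it is not the kernel's run-queue code. It also assumes that unpausing delivers the held messages immediately and in order, which is one possible policy among several.

```javascript
// Hypothetical names throughout; this is not the SwingSet kernel's run-queue code.
function makeScheduler(deliver) {
  const runQueue = []; // entries: { vatID, message }
  const sideQueues = new Map(); // vatID -> messages held while paused
  const paused = new Set();

  return {
    enqueue: (vatID, message) => runQueue.push({ vatID, message }),
    pause: vatID => paused.add(vatID),
    // Assumption: unpausing services the side-queue right away, in order.
    unpause: vatID => {
      paused.delete(vatID);
      const held = sideQueues.get(vatID) || [];
      sideQueues.delete(vatID);
      for (const message of held) deliver(vatID, message);
    },
    // Pull one item off the run-queue; divert it if its vat is paused.
    step: () => {
      const item = runQueue.shift();
      if (!item) return false;
      if (paused.has(item.vatID)) {
        if (!sideQueues.has(item.vatID)) sideQueues.set(item.vatID, []);
        sideQueues.get(item.vatID).push(item.message);
      } else {
        deliver(item.vatID, item.message);
      }
      return true;
    },
  };
}
```

The worker is never touched here, which is the point of the distinction: pausing only changes where run-queue items go.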

I think we need another flag to distinguish between "suspended by a restart-all-vats event", which means we automatically start the new incarnation when a delivery arrives, and "deliberately suspended" (because of error), where we do not. Vats which are deliberately suspended are also paused. Maybe we can use combinations of two boolean flags:

  • suspended=no, paused=no : deliver as usual
  • suspended=yes, paused=no : startVat on demand
  • suspended=yes, paused=yes: enqueue would-be deliveries, uncertain about unsuspend working
  • suspended=no, paused=yes : enqueue would-be deliveries, confident about page-in working
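
The four combinations can be written as a small lookup. The function name and the returned action strings are made up for illustration; only the two flags and their meanings come from the list above.

```javascript
// Hypothetical action names; only the two flags come from the discussion.
function dispositionFor({ suspended = false, paused = false } = {}) {
  if (!paused) {
    // Unpaused vats get their deliveries, after a startVat if needed.
    return suspended ? 'startVat-on-demand' : 'deliver';
  }
  // Paused vats always enqueue; the suspended flag says how risky revival is.
  return suspended ? 'enqueue-unsuspend-uncertain' : 'enqueue-page-in-ok';
}
```

Defaulting both flags to `false` means a vat with no recorded flags is treated as "deliver as usual".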

Minor metering overruns would set paused=yes but not suspend the vat. This might also just be implemented by a more sophisticated kernel scheduler, with an input-queue-per-vat instead of a single merged run-queue, by just deciding to not service that vat's input queue until it has accumulated more service priority.

More severe vat errors would be dealt with by setting suspended=yes paused=yes, deleting the worker state (leaving the durable state), which inhibits both delivery and automatic restart until someone calls E(adminNode).upgrade() to mark the vat as ready for work. This upgrade would be expected to provide code that resolves the original problem. upgrade() would clear the paused flag, and would also clear the suspended flag as a side-effect of launching the new incarnation right away.
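
As a sketch of those two transitions: the vat record shape, `handleSevereError`, and `upgradeVat` below are hypothetical, and `upgradeVat` is only a stand-in for what E(adminNode).upgrade() might do to the flags. One extra assumption, not stated above, is that the flags are cleared only after the new incarnation has started, so a failed upgrade leaves the vat deliberately suspended.

```javascript
// Hypothetical vat record: { suspended, paused, workerState, durableState }.
function handleSevereError(vat) {
  vat.suspended = true;
  vat.paused = true;
  vat.workerState = undefined; // heap snapshot + transcript are dropped
  // durableState is deliberately left untouched
}

// Stand-in for the flag handling of E(adminNode).upgrade(); `startVat` may throw.
function upgradeVat(vat, startVat) {
  // Assumption: launch first, so a failure here leaves both flags set.
  const workerState = startVat(vat.durableState);
  vat.workerState = workerState;
  vat.paused = false;
  vat.suspended = false; // side effect of launching the new incarnation
}
```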

It might make sense to skip the page-in preload for vats which are currently paused: why waste the memory and CPU when we know it will take something special to unpause them. Likewise we might preemptively page-out the worker when a vat gets paused.

We might want to introduce an E(adminNode).resume() or unpause() to let userspace (zoe? core-eval?) clear the paused flag (and deliver any queued messages) for a vat that got itself paused by overrunning its meter, sort of like paying a parking fine to get your car un-booted.

Description of the Design

TBD. Each vat will get a flag to say whether it is suspended or not, and whether it's deliberately suspended, or if the vat will be unsuspended as soon as the next message arrives for it. New APIs to suspend/unsuspend existing vats. Interaction with the #8405 restart-all-vats code. Interaction with the #3528 "pause vat" feature.

Security Considerations

Pausing a vat causes it to stop making progress, so any userspace control should be limited to the adminNode object (which already allows termination).

Scaling Considerations

We need to carefully consider any API which touches all vats, because there may be a lot of them. Restart-all-vats might do this.

Test Plan

Kernel unit tests to exercise all states and state transitions of a vat.

Upgrade Considerations

Since pause/suspend does not exist yet, the current state of all vats is "unpaused and unsuspended". Whatever tracking metadata we define needs to decode the lack of metadata as this state.

We could consider an upgrade-time metadata replacement step which writes explicit "unpaused and unsuspended" state out for all vats, rather than using the implicit "lack of metadata means ..." approach.
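
Both approaches can be sketched against a Map-like key-value store. The `${vatID}.runState` key layout, the JSON encoding, and the function names are assumptions made for this example, not an existing kernel schema.

```javascript
// Hypothetical key layout and JSON encoding; `kvStore` is any Map-like store.
const DEFAULT_STATE = Object.freeze({ suspended: false, paused: false });

function getVatRunState(kvStore, vatID) {
  const raw = kvStore.get(`${vatID}.runState`);
  // Missing metadata decodes as "unpaused and unsuspended".
  if (raw === undefined) return { ...DEFAULT_STATE };
  return { ...DEFAULT_STATE, ...JSON.parse(raw) };
}

function setVatRunState(kvStore, vatID, state) {
  kvStore.set(`${vatID}.runState`, JSON.stringify(state));
}

// Optional upgrade-time step: write the explicit default for every vat
// that has no metadata yet, leaving existing entries alone.
function backfillRunState(kvStore, vatIDs) {
  for (const vatID of vatIDs) {
    if (kvStore.get(`${vatID}.runState`) === undefined) {
      setVatRunState(kvStore, vatID, DEFAULT_STATE);
    }
  }
}
```

The backfill touches every vat once, so it is itself one of the "API which touches all vats" cases to weigh under Scaling Considerations.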
