xsnap upgrade by restart-time forced upgrade of all vats by kernel #8405
We already do not enforce metering during normal replays, so we do in fact accept differences in metering, and we don't need to make an upgrade replay any different in that regard.
Current evidence shows that this is already the case, that transcripts from the pismo era are compatible (identical minus metering) across wide ranges of XS versions, even with dramatic changes to allocation and gc behavior. We do still need to confirm this is still the case for transcripts made in the current "vaults" era, but I don't see any reason why we would have regressed.
I don't understand what you mean by the "kernel restart" part. I assume we're not talking about simply restarting the process which hosts the kernel, because that isn't deterministic.
While I would love to get to a point where the kernel can unilaterally decide to restart any vat without involvement from any user code (to support a healthier page-out of inactive vats, for example), I am worried that not involving the vat at all brings too many complications. One case mentioned is the ability for liveslots to do some cleanup. But I also worry about the case where the vat cannot in fact restart correctly: we'd then have to tombstone that vat until it can be upgraded to something that will hopefully restart in the future.

But in general I agree, it would be preferable to be able to rely on upgrades. I think #7855 is more generic because it doesn't rely on upgrades; instead, upgrades can optimize the performance of the replay. Most of the engineering complexity of #7855 is in the pre-compute optimization. We already know how to replay transcripts, and updating transcripts with new snapshot hashes (and maybe computron usage if we want to be thorough) is pretty straightforward too. If we end up supporting cleaning of vat resources without liveslots involvement, we could actually have both: attempt a force restart of a vat, and if that fails, fall back to a replay.
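A minimal sketch of that combined strategy, with entirely hypothetical function names (neither `forceRestartVat` nor `replayIncarnation` exists in the kernel today):

```js
// Hypothetical sketch: prefer a forced restart (null upgrade), and fall back
// to replaying the incarnation's transcript under the new xsnap if it fails.
async function bringVatForward(kernel, vatID) {
  const restart = await kernel.forceRestartVat(vatID); // invented name
  if (restart.ok) {
    return 'restarted';
  }
  // Replay tolerates metering differences but requires identical syscalls.
  await kernel.replayIncarnation(vatID); // invented name
  return 'replayed';
}
```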
In a meeting last week, we agreed that Force-Restart would give us the features that we care about most. In some sense, it would provide our chain with the most satisfying workflow: a chain-halting upgrade which does all of:
Those are better than an approach which:
In addition, the support-multiple-versions approach would get unwieldy if we're successful at upgrading xsnap/XS on a monthly cadence: who knows how many versions would be in play at the same time. I generally treat SwingSet as a separate product (with its own release schedule, and non-chain customers). In the support-multiple-versions approach, our SwingSet release decision-making process includes deciding which versions to support; e.g. you might have a supported-versions table like:
However in practice, when we are preparing to release swingset-v3 and considering whether we can drop support for xsnap-v1, we would survey customers, find out which xsnap versions they still need, and try to satisfy their requirements by retaining the old ones while still encouraging them to upgrade their vats so we could drop those versions. There would be a constant tension between the swingset complication needed to maintain support and the chain-side effort needed to stop requiring the old versions. If a swingset release went ahead anyway and dropped support for a still-needed version, the chain would be unable to take advantage of that release, making it harder to deploy important fixes or features without using expensive old-release maintenance branches.

That said, force-restart is an awkward process to impose upon kernel customers. It obligates vat authors to anticipate changes in Endo/lockdown/liveslots which their deployed code must tolerate, and/or it forces Endo/lockdown/liveslots authors to anticipate compatibility requirements of deployed vat code (or swingset authors to refrain from incorporating newer versions of those components). We've had discussions (TODO ticket) about marking liveslots/supervisor bundles with version/feature indicators, and vat bundles with feature requirements, so there is at least enough metadata for something to discover an incompatibility early enough to avoid problems. But that won't magically enable old abandoned contracts to become compatible with a new liveslots (or e.g. its embedded Endo components, qv #8826) that changes some significant API.

I've been trying to find analogies with desktop operating systems (eg Linux, macOS, Windows). Creating a new vat is like launching an application. Halting the swingset kernel is like hibernating the computer. Upgrading a vat is like upgrading an application, which involves stopping it and starting it again (carrying over only the durable document state). Liveslots and the Endo environment are like dynamic libraries, provided by the OS but used by any given incarnation of a program. Vat code is like the application itself.

It's not a perfect analogy, but in this view, we might try to maximize the correspondence between a desktop OS upgrade and a chain upgrade. The chain upgrade is the (only) opportunity to replace the kernel, liveslots, lockdown/Endo, supervisor, and xsnap. It does not mandate a change of liveslots/lockdown/Endo/supervisor (since those come from bundles, tracked separately for each vat, and it's easy to store bundles in the DB and keep re-using them until the vat is upgraded). It does mandate a change of the kernel, since there's obviously only one kernel. A change of xsnap is mandated with the force-restart approach, and optional with the support-multiple-xsnaps approach.

Upgrading a desktop OS risks compatibility with existing applications (I've certainly held off upgrading macOS until I was sure my main applications would keep working). Forcibly restarting a vat (and thus switching to the new liveslots/lockdown/endo bundles) risks compatibility too: if the old vat bundles are doing something too old, or the new kernel-provided components are doing something too new, that vat might fail its restart, and then it's kinda stuck. In the desktop OS world this is managed with compatibility testing on both sides (OS vendors test popular applications against new OS versions, and application vendors test existing applications against upcoming beta/seed versions of the OS). Abandoned applications suffer the worst fates.
Tasks
We identified a couple of tasks needed to implement the restart-all-vats approach.
Then we'll need a controller API to indicate that all vats should be restarted as the kernel is brought up. This must either be a flag to …

Once the kernel starts, it needs to suspend processing of the run-queue (if there was anything leftover from the previous boot, it must not be executed until all upgrades are done). Then it needs to restart one vat at a time, probably in vatID order (it might be good to restart all static vats before doing any dynamic ones). For each vat, we do the same thing as …

"Suspended Vats" (extracted to #8955)
What happens if the restart fails? Specifically, the … For a normal vat upgrade, the kernel rolls back the upgrade, leaving the vat in its old state, and rejects the promises returned by …

In addition, this restart-all-vats plan might be practical for now, when we have 87 vats on chain, but not when we have a thousand.

@mhofman introduced the idea of "suspending" the vats (he originally used the term "pause", but we agreed that would conflict with a smaller-yet-overlapping feature that stops pulling items off the run-queue for a while, like when the vat has exceeded its runtime budget). "Hibernation" might be another word to use, but I'm thinking that "suspended animation" (no activity, needs significant/risky effort to revive) captures the idea pretty well.

The name refers to the state of a vat in the middle of the usual upgrade process, after the old worker has been shut down (and the old snapshot/transcript effectively deleted), but before the new worker is started (with the new vat bundle). In this state, there is no worker and no transcript. Just like we currently "page-in" offline vats on-demand when a delivery arrives, starting a worker (from heap snapshot and transcript), we can imagine "unsuspending" suspended vats on-demand when a delivery arrives. Unlike page-in, which can be different on each validator/follower, unsuspending would happen in-consensus (all validators have the same set of suspended vats at the same time). This leads to a vat lifetime shaped like:
This "unsuspend" revivification would take longer than a snapshot+replay, because we have to execute the whole … And it might fail (since it might be using a different xsnap/liveslots/Endo); at least it might fail in different ways than a transcript replay (which is "shouldn't fail" enough that we panic the kernel if it occurs). If it does fail, since we can't return to the old state, our best option is to leave the vat in a suspended state, and set a flag that inhibits automatic unsuspension so we don't get stuck in a loop.

With suspension, our restart-all-vats process now looks like:
Over time, most vats will remain in a suspended state, and only active vats will have an active transcript/heap-snapshot. Vats which are idle across multiple upgrades will not experience the intermediate versions. The kernel work will be proportional to the number of non-idle vats, rather than the total number of vats.

We might want a second flag, perhaps named … The difference between "pausing" a vat and "suspending" one is that the "paused" flag just inhibits run-queue message delivery: there is still a worker, but each time we pull something off the run-queue for the vat, instead of delivering it, we push it off to a side-queue that will be serviced later, when the vat is unpaused. Vats which are suspended do not have any worker state, and will need a …

I think we need another flag to distinguish between "suspended by a restart-all-vats event", which means we automatically start the new incarnation when a delivery arrives, and "deliberately suspended" (because of error), where we do not. Vats which are deliberately suspended are also paused. Maybe we can use combinations of two boolean flags:
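A minimal sketch of how those combinations might be interpreted, assuming hypothetical `suspended` and `paused` flag names on the vat's kernel-side state record (the actual flag names and table from the discussion are not reproduced here):

```js
// Hypothetical sketch: interpret two per-vat boolean flags.
// `suspended`: no worker/transcript; a new incarnation must be started to run anything.
// `paused`: deliveries are diverted to a side-queue instead of being delivered.
function vatDispositionFor(vatState) {
  const { suspended = false, paused = false } = vatState;
  if (suspended && paused) {
    // Deliberately suspended (e.g. after a severe error): do not auto-restart.
    return 'hold-deliveries';
  }
  if (suspended) {
    // Suspended by a restart-all-vats event: start the new incarnation
    // automatically when the next delivery arrives.
    return 'unsuspend-on-delivery';
  }
  if (paused) {
    // Worker exists, but push deliveries onto a side-queue until unpaused.
    return 'divert-to-side-queue';
  }
  return 'deliver-normally';
}
```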
Minor metering overruns would set … More severe vat errors would be dealt with by setting …

It might make sense to skip the page-in preload for vats which are currently paused: why waste the memory and CPU when we know it will take something special to unpause them. Likewise we might preemptively page-out the worker when a vat gets paused. We might want to introduce an …

Run-Queue Handling
We cannot guarantee that the run-queue will be empty when the worker is restarted. We do not want previously-queued deliveries to be interleaved with the restart work. And basically we want to pretend that all vats upgrade simultaneously. So we want all … The vat restarts should be executed in a loop, not by pushing …

I suspect that we'll see some pathologies in this sequence. We have some code patterns where an ephemeral publisher in VatA is being followed by a subscriber in VatB. When VatA is restarted, VatB will get a rejected promise, which will prompt it to ask the publisher for a new one, but the publisher will be gone, which ought to prompt it to ask a higher-up (durable) object for a replacement publisher. If VatB is also being restarted in this sequence, I can imagine seeing some wasted messages, which could be avoided if we restarted them in a different order. But this is probably just an inefficiency, not a functionality problem.

Lack of a final BOYD
Our normal vat-upgrade process delivers one last … An abrupt restart, without this final BOYD, will leave these objects pinned by the vat. Until we get a mark-and-sweep GC system in liveslots, the new incarnation won't have enough information to realize that they can be dropped, so this will effectively constitute a storage leak. Our current …

We talked about finding ways to let the kernel participate in this cleanup, by having liveslots store more information in the vatstore. This would unfortunately introduce more coupling between the kernel and liveslots (weakening the abstraction boundary between them), however it might help us clean up this garbage faster. One idea was to have liveslots store its memory pillar data in the vatstore (in a new …). But vats in the suspended state have no RAM pillars at all, so the current vatstore contents are complete and sufficient (they document all export and virtual-data pillars); it's just that using them requires an expensive mark-and-sweep GC pass. Our second idea was to have the kernel implement this pass, sometime during suspension, rather than liveslots. The big issue is how long it would take to sweep everything. This might be easier to tackle once we've addressed the chain stability problems and purged the enormous piles of unneeded objects, reducing the cost of this operation.

Signalling Restart Readiness
When will it be safe to trigger a restart-all-vats event? What tools can we provide to make this state visible? At last week's kernel meeting (2024-01-24), we discussed ways for vat bundles to export metadata that indicates their environmental requirements, like "I need to be run in a liveslots that gives me …". We might use this to let vats signal that they're prepared to be restarted unilaterally. The kernel could look at these flags across all vats and provide a … Some other variant of this might make it easier to determine when it's safe to deploy a liveslots/Endo/lockdown which changes the features that are available. We might hope that this signal gets set when we upgrade or terminate the last non-restartable vat, and then never gets reset again (because we never deploy a new non-restartable vat).
So its utility might be too limited to be worth deploying. The value of a which-features-are-in-use aspect would depend upon our ability to identify which such features are relevant, which is historically something that happens after deployment, not before.
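For concreteness, a sketch of the kernel-side aggregation such a restartable-flag scheme might involve (the field and function names are invented for illustration; nothing like this exists in the kernel today):

```js
// Hypothetical sketch: aggregate per-vat "safe to restart unilaterally" metadata.
// Assumes each vat's recorded bundle metadata carries an optional boolean
// `restartable` flag; the field name is made up for illustration.
function allVatsRestartable(vatEntries) {
  // vatEntries: Array<{ vatID: string, bundleMetadata?: { restartable?: boolean } }>
  const blockers = vatEntries.filter(
    ({ bundleMetadata }) => !(bundleMetadata && bundleMetadata.restartable),
  );
  return { ok: blockers.length === 0, blockers: blockers.map(({ vatID }) => vatID) };
}
```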
@siarhei-agoric and I were discussing this today, and we realized that we might be able to change the cosmos-sdk chain-halting-upgrade timing to help this out. The governance proposal that says "halt as of block 1234" would be changed to mean "open the halt window at block 1234". Internally, this would put the chain into "ready to halt" mode, which means it stops servicing the action queue (so no new inputs to kernel devices). And it invokes some special …
Then, cosmic-swingset says "ok, I'm ready to halt now", and the node exits. Validator operators would need to wait for this indication before they replace the software and restart with the new version. We might need a … Depending upon how long we think the suspend will take, …

This would require cosmos changes to allow a module to delay an upgrade-based halt. I have no idea how big a deal that would be; we plan to pull @JeancarloBarrios and/or @mhofman into the conversation.
This sounds similar to #6263, which last time I looked was really not feasible in cosmos-sdk without heavy modifications.
What is the Problem Being Solved?
We've been thinking for a long time about how we're going to deploy a non-snapshot-compatible new version of `xsnap` (#6361, …). We might do this to provide a new feature, improve performance, or fix a security bug.

The primary constraint is that any transcript replays must remain consistent. Generally, when we restart the kernel, we need to bring all active vat workers back up to their current states (whatever state they were in when the kernel last shut down). We do that by loading the most recent heap snapshot, then we replay transcript entries until the new worker has received all the same deliveries as the previous kernel had observed. While doing this replay, we satisfy syscall responses with data from the transcript, rather than executing them for real, so that the replayed vat does not influence any other vats being brought back up (or communicate with the outside world). To ensure that the syscall responses match the syscalls being made, we insist that every syscall made by the worker during replay must exactly match the syscall recorded in the transcript. If these deviate, we declare an "anachrophobia error" and panic the kernel, because something has gone very wrong.
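As a rough illustration of that replay-time check (a simplified sketch, not the actual SwingSet code; the transcript-entry shape and names are assumptions), the comparison looks roughly like:

```js
// Simplified sketch of replay-time syscall checking (not the real SwingSet
// implementation; entry shapes and names are assumptions for illustration).
function replayDelivery(worker, transcriptEntry) {
  let syscallIndex = 0;
  // During replay, each syscall the worker makes is checked against the
  // transcript, and answered from the recorded response instead of being
  // executed for real.
  const handleSyscall = vso => {
    const expected = transcriptEntry.syscalls[syscallIndex];
    if (!expected || JSON.stringify(vso) !== JSON.stringify(expected.d)) {
      // The worker diverged from recorded history: "anachrophobia".
      throw Error(`anachrophobia: syscall ${syscallIndex} does not match transcript`);
    }
    syscallIndex += 1;
    return expected.response;
  };
  return worker.deliver(transcriptEntry.d, handleSyscall);
}
```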
One such deviation would happen if we perform the replay with a version of `xsnap` that does not behave the same as the one used to record the transcript. Behavior differences are only tolerable if they do not occur during replay. For example, we could fix a bug that has not been triggered yet: the transcript will make no claims about the `xsnap` behavior when exposed to the bad input, so it could plausibly have been generated from either version. However we cannot tolerate metering differences (a few extra computrons during replay could make the difference between exhausting a metering limit or not), and many changes we might like to make to `xsnap` would change the metering behavior.

In addition, if the new version of `xsnap` is unable to (accurately) load a heap snapshot produced by the earlier version, then we cannot use replay-from-last-snapshot. This limits our ability to roll out more significant structural changes to the XS architecture.

The only safe time to change `xsnap` is when we don't have any vat transcript to replay. This happens when we upgrade the vat: we throw out the old transcript and heap snapshots, and start again in a fresh worker, carrying over only the durable `vatstore` data (presented to userspace in the `baggage` object).

So far, to make more drastic changes to `xsnap`, we've had two general approaches in mind.

Multiple Simultaneous Versions of `xsnap`
The first, proposed by me (@warner), is to give the kernel a way to host multiple versions of `xsnap` at the same time (#6596). In this approach, each vat continues to run on the same version of `xsnap` for the entire incarnation. We record some metadata about the vat, and the kernel uses that to decide which version of `xsnap` it should launch when starting the worker. When the vat is finally upgraded (or "restarted", i.e. `E(adminNode).upgradeVat()` but using the same source bundle as the original), the metadata is updated to point to the newest available `xsnap` version. This way, new vats, and updated vats, will all use the latest `xsnap`, while load-time replays of old vats will retain their consistent behavior. After upgrading the kernel, operators are responsible for upgrading all vats too. Once all upgrades are complete, the old `xsnap` will no longer be used. As long as each kernel supports two (overlapping) versions of `xsnap`, and all vats are upgraded before the next kernel upgrade, we should have no problems.

The downside of this approach is the build-time complexity of hosting multiple `xsnap` versions in the same kernel package. #6596 explores this, defining a package named `worker-v1` (to hold the first version of xsnap), and proposing a `worker-v2` for the subsequent version. Another approach which @kriskowal proposed is to take advantage of the package.json "aliases" (https://docs.npmjs.com/cli/v10/using-npm/package-spec/), where the `dependencies:` section can declare one name for use by `import` (e.g. `xsnap-v1`), and a name+version for use by the `yarn install` process (e.g. `@agoric/xsnap@…`). We'd have to experiment with it, and our use of git submodules in the `xsnap` build process might make things complicated.
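A sketch of what that aliasing might look like (the version ranges and alias names here are hypothetical; the npm alias syntax itself is `"<alias>": "npm:<package>@<range>"`):

```json
{
  "dependencies": {
    "xsnap-v1": "npm:@agoric/xsnap@^1.0.0",
    "xsnap-v2": "npm:@agoric/xsnap@^2.0.0"
  }
}
```

The kernel could then import `xsnap-v1` or `xsnap-v2` depending on the per-vat metadata.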
Replay The Whole Incarnation Under The New `xsnap`
The second approach, raised by @mhofman in #7855, is to replay the entire incarnation's transcript at kernel restart time, for every vat. This requires the new version of xsnap to produce the same syscalls, but we'd tolerate computron/metering differences (i.e. tolerate greater usage the second time around, even if that would have caused the original to be terminated). We'd hope that vat code was not so sensitive to the XS version that it might perform syscalls in a different order.
The benefit of this approach is that we wouldn't need multiple versions of `xsnap` in a single kernel. The downside is that replaying the entire incarnation is expensive, and we'd have to do it for every vat in the system, even ones that have been idle for a long time. This cost can be reduced if we manage to perform vat upgrades fairly soon before the kernel/xsnap upgrade (reducing the size of the most recent incarnation by making sure it is very new). And there are some clever precomputation tricks we might do to give validators a way to start the replay process in parallel with their normal execution, a week or two ahead of time, and then the final restart would only need to do the last few minutes of updates. But that represents a significant engineering complexity.

Third Approach: Force-Restart All Vats at Kernel Restart
A third approach came up today in a meeting with @ivanlei. If we manage to get all vats upgradable (#8104), then we could decide that each kernel restart will immediately execute a vat restart (null upgrade) on all vats, before performing any other work. We'd need the kernel to inject an `upgrade-vat` event into the run-queue, for all vats, ahead of anything else that might be on the run-queue at the time. All upgrades would complete before allowing any other messages to be delivered.

This would avoid the complexity of having multiple versions of `xsnap` simultaneously, and it would avoid the cost of replaying all vats from the start of their current incarnation.

The downsides are:
… `privateArgs` provided by whoever is telling Zoe to restart them. The kernel may not store enough information to generate the right `vatParameters` for the vat: it can look in `vatOptions` to see what parameters were used the previous time, but for e.g. contract vats that won't be correct (the contract would re-initialize things that should instead be re-used). It might be necessary to get Zoe involved: restart all static vats, then ask Zoe to restart all contract vats. However, that would risk Zoe talking with vats that have not been restarted yet, as well as being a significant layering violation.

So, it's an idea worth exploring, but not immediately obvious that it would work.
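A rough sketch of the restart-all-vats sequencing described above, using invented kernel-internal names (this is not an actual SwingSet API):

```js
// Hypothetical sketch of restart-all-vats at kernel bring-up (names invented
// for illustration; this is not the real SwingSet kernel API).
async function restartAllVats(kernel) {
  // Hold any leftover run-queue work until every upgrade has completed.
  kernel.pauseRunQueue();
  // Restart one vat at a time, in vatID order (static vats before dynamic ones).
  for (const vatID of kernel.getAllVatIDs()) {
    // A "null upgrade": same bundle, new worker, only durable state carried over.
    const result = await kernel.upgradeVat(vatID, { sameBundle: true });
    if (!result.ok) {
      // If the restart fails we cannot return to the old state; leave the vat
      // suspended and require manual intervention before it runs again.
      kernel.markSuspended(vatID, { autoUnsuspend: false });
    }
  }
  kernel.resumeRunQueue();
}
```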