Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(swingset): use "dirt" to schedule vat reap/bringOutYourDead #9169

Merged
merged 5 commits into from
Aug 12, 2024

Conversation

warner
Copy link
Member

@warner warner commented Mar 29, 2024

fix(swingset): use "dirt" to schedule vat reap/bringOutYourDead

NOTE: deployed kernels require a new upgradeSwingset() call upon
(at least) first boot after upgrading to this version of the kernel
code.

dispatch.bringOutYourDead(), aka "reap", triggers garbage collection
inside a vat, and gives it a chance to drop imported c-list vrefs that
are no longer referenced by anything inside the vat.

Previously, each vat has a configurable parameter named
reapInterval, which defaults to a kernel-wide
defaultReapInterval (but can be set separately for each vat). This
defaults to 1, mainly for unit testing, but real applications set it
to something like 1000.

This caused BOYD to happen once every 1000 deliveries, plus an extra
BOYD just before we save an XS heap-state snapshot.

This commit switches to a "dirt"-based BOYD scheduler, wherein we
consider the vat to get more and more dirty as it does work, and
eventually it reaches a reapDirtThreshold that triggers the
BOYD (which resets the dirt counter).

We continue to track dirt.deliveries as before, with the same
defaults. But we add a new dirt.gcKrefs counter, which is
incremented by the krefs we submit to the vat in GC deliveries. For
example, calling dispatch.dropImports([kref1, kref2]) would increase
dirt.gcKrefs by two.

The reapDirtThreshold.gcKrefs limit defaults to 20. For normal use
patterns, this will trigger a BOYD after ten krefs have been dropped
and retired. We choose this value to allow the #8928 slow vat
termination process to trigger BOYD frequently enough to keep the BOYD
cranks small: since these will be happening constantly (in the
"background"), we don't want them to take more than 500ms or so. Given
the current size of the large vats that #8928 seeks to terminate, 10
krefs seems like a reasonable limit. And of course we don't want to
perform too many BOYDs, so gcKrefs: 20 is about the smallest
threshold we'd want to use.

External APIs continue to accept reapInterval, and now also accept
reapGCKrefs, and neverReap (a boolean which inhibits all BOYD,
even new forms of dirt added in the future).

  • kernel config record
    • takes config.defaultReapInterval and defaultReapGCKrefs
    • takes vat.NAME.creationOptions.reapInterval and .reapGCKrefs
      and .neverReap
  • controller.changeKernelOptions() still takes defaultReapInterval
    but now also accepts defaultReapGCKrefs

The APIs available to userspace code (through vatAdminSvc) are
unchanged (partially due to upgrade/backwards-compatibility
limitations), and continue to only support setting reapInterval.
Internally, this just modifies reapDirtThreshold.deliveries.

  • E(vatAdminSvc).createVat(bcap, { reapInterval })
  • E(adminNode).upgrade(bcap, { reapInterval })
  • E(adminNode).changeOptions({ reapInterval })

Internally, the kernel-wide state records defaultReapDirtThreshold
instead of defaultReapInterval, and each vat records
.reapDirtThreshold in their vNN.options key instead of
vNN.reapInterval. The vat-level records override the kernel-wide
values. The current dirt level is recorded in vNN.reapDirt.

NOTE: deployed kernels require explicit state upgrade, with:

import { upgradeSwingset } from '@agoric/swingset-vat';
..
upgradeSwingset(kernelStorage);

This must be called after upgrading to the new kernel code/release,
and before calling buildVatController(). It is safe to call on every
reboot (it will only modify the swingstore when the kernel version has
changed). If changes are made, the host application is responsible for
commiting them, as well as recording any export-data updates (if the
host configured the swingstore with an export-data callback).

During this upgrade, the old reapCountdown value is used to
initialize the vat's reapDirt.deliveries counter, so the upgrade
shouldn't disrupt the existing schedule. Vats which used reapInterval = 'never' (eg comms) will get a reapDirtThreshold.never = true, so
they continue to inhibit BOYD. Any per-vat settings that match the
kernel-wide settings are removed, allowing the kernel values to take
precedence (as well as changes to the kernel-wide values; i.e. the
per-vat settings are not sticky).

We do not track dirt when the corresponding threshold is 'never', or
if neverReap is true, to avoid incrementing the comms dirt counters
forever.

This design leaves room for adding .computrons to the dirt record,
as well as tracking a separate snapshotDirt counter (to trigger XS
heap snapshots, ala #6786). We add reapDirtThreshold.computrons, but
do not yet expose an API to set it.

Future work includes:

  • upgrade vat-vat-admin to let userspace set reapDirtThreshold

New tests were added to exercise the upgrade process, and other tests
were updated to match the new internal initialization pattern.

We now reset the dirt counter upon any BOYD, so this also happens to
help with #8665 (doing a reapAllVats() resets the delivery counters,
so future BOYDs will be delayed, which is what we want). But we should
still change controller.reapAllVats() to avoid BOYDs on vats which
haven't received any deliveries.

closes #8980

@warner warner added the SwingSet package: SwingSet label Mar 29, 2024
@warner warner requested a review from mhofman March 29, 2024 16:49
@warner warner self-assigned this Mar 29, 2024
@warner warner force-pushed the warner/8980-boyd-scheduler branch 3 times, most recently from 62aa511 to 0fe9f39 Compare April 13, 2024 04:10
Copy link

cloudflare-workers-and-pages bot commented Apr 13, 2024

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6547c83
Status: ✅  Deploy successful!
Preview URL: https://1301a1ce.agoric-sdk.pages.dev
Branch Preview URL: https://warner-8980-boyd-scheduler.agoric-sdk.pages.dev

View logs

@warner warner force-pushed the warner/8980-boyd-scheduler branch from 0fe9f39 to 967e458 Compare April 15, 2024 15:36
@warner warner assigned mhofman and unassigned warner Apr 16, 2024
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 967e458 to 402811a Compare April 23, 2024 19:03
@mhofman
Copy link
Member

mhofman commented Apr 24, 2024

We choose this value to allow the #8928 slow vat
termination process to trigger BOYD frequently enough to keep the BOYD
cranks small: since these will be happening constantly (in the
"background"), we don't want them to take more than 500ms or so. Given
the current size of the large vats that #8928 seeks to terminate, 10
krefs seems like a reasonable limit. And of course we don't want to
perform too many BOYDs, so gcKrefs: 20 is about the smallest
threshold we'd want to use.

I am somewhat concerned that the cost of engine gc might mean 20 is too low. How do we validate we won't spend too much time in BOYD? Do we have an idea of what portion of the time spent in BOYD is related to the engine doing a gc mark and sweep, and how much of it is liveslots reacting to the found garbage? I assume liveslots usually has vatStore calls to make when garbage was found?

External APIs continue to accept reapInterval, and now also accept reapGCKrefs.

  • kernel config record

    • takes config.defaultReapInterval and defaultReapGCKrefs
    • takes vat.NAME.creationOptions.reapInterval and .reapGCKrefs
  • controller.changeKernelOptions() still takes defaultReapInterval
    but now also accepts defaultReapGCKrefs

This does not sound scalable for adding new dirt thresholds. Why not add a reapDirtThresholds record in configs / options?

The kernel will automatically upgrade both the kernel-wide and the
per-vat state upon the first reboot with the new kernel code.

I'll need to look at this. It sounds like a sort of schema migration.

as well as tracking a separate snapshotDirt counter (to trigger XS
heap snapshots, ala #6786).

I'm not sure I understand this. Is the idea to trigger snapshots every n BOYD? Snapshots should already force BYOD before snapshotting. The snapshotting process reveals gc to liveslots, which may have bugs revealing this out (#6784), so the idea has been to force a BOYD before each snapshot (#7504) but we expect BOYD to be more frequent than snapshots.

Edit: I totally misunderstood snapshortDirt: it's not meant to trigger reap, or be related to reapDirt, just will likely use a similar accounting mechanism, with likely similar dirt input.

Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly looking good.

I think we're missing a clearReapDirt for the BOYD injected when creating a snapshot.

Wondering if the schema migration should be handled more explicitly when starting the kernel instead of optimistically running it before each crank.

I'll need to do another pass and review the tests.

packages/SwingSet/src/kernel/state/vatKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/kernel.js Outdated Show resolved Hide resolved
Comment on lines 381 to 382
// the caller is responsible for ensuring that any changes made here
// get committed at a useful point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some worry that a rollback could revert the kernel state but would leave the upgraded flag in place, but it looks like this is always called outside savepoints. Why not run this at the beginning of every startCrank()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's one upgraded flag per KernelKeeper object, and each kernel has exactly one KernelKeeper (created at the start of buildKernel()), but we also create a KernelKeeper when we initialize/populate the DB, and we should not be doing an upgrade at that time, which is why maybeUpgradeKernelState() is a method and not an automatic consequence of makeKernelKeeper().

It's currently called just before each call to startCrank().. I wanted to make it more explicit, because (as you noted) it's important to call it from outside a savepoint. I was worried that adding it to startCrank() would obscure the upgrade point and make it fragile under maintenance.

But.. in looking more carefully at how/when startCrank is called (it also show up in changeKernelOptions and installBundle, to which I did not add a maybeUpgrade call), I think my PR would have a small bug. kernelKeeper.startCrank() calls kernelStorage.startCrank() in swing-store, which sets an inCrank = true flag, and calls resetCrankHash(). By not calling startCrank before maybeUpgrade (or not calling it at all, if we start with changeKernelOptions or installBundle, which sounds pretty plausible TBH), we don't include any upgrade-related kvStore changes in the crankhash. That's not the intent: upgrades are in-consensus, all nodes of a consensus kernel should upgrade at the same time, and their changes should be included in the crankhash.

So I think I need to change this as you suggest: call maybeUpgradeKernelState() from inside kernelKeeper.startCrank(), but do it specifically after that function calls kernelStorage.startCrank(). I'm not excited to conceal the fact that upgrades are happening from call sites like step() and run(), but it's more important to have these upgrade changes be captured in the hash reliably.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also create a KernelKeeper when we initialize/populate the DB, and we should not be doing an upgrade at that time, which is why maybeUpgradeKernelState() is a method and not an automatic consequence of makeKernelKeeper().

That had totally skipped my mind. Can we document somewhere these constraints?

return reapDirt;
}

function clearReapDirt() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to also call this when injecting the BOYD message in maybeSaveSnapshot

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'll see if there's a way to factor out BOYD delivery to have just one place to call it from (we actually have three code paths: processBringOutYourDead, processUpgradeVat as the last thing sent to the old incarnation, and maybeSaveSnapshot). Ideally all three would call processBringOutYourDead but that wasn't trivial. I'll see what I can do, worst case I'll just add another clearReapDirt() call like you suggested.

Comment on lines +191 to +226
// this default "reap interval" is low for the benefit of tests:
// applications should set it to something higher (perhaps 200) based
// on their expected usage

export const DEFAULT_DELIVERIES_PER_BOYD = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat concerned with a default value that is not "production ready", but this seems to be the existing behavior already, so out of scope. Is the builder for a test kernel the same as the production one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The (weak) philosophy here was that we don't know what a good value would be, so make the default a predictable and hyper-safe one, at the cost of performance. Making it convenient for tests is another consideration, although of course it's always a lazy and inadvisable choice to prioritize tests over reality.

We might even make a hand-wavy argument that this matches XS: minimizing RAM usage (ie GCing frequently) at the cost of performance.

Is the builder for a test kernel the same as the production one?

Not sure what you mean, our most important host application (cosmic-swingset) overrides this value, and in general we expect all host applictions to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was remembering that all tests used a single function to build the kernel, in which case we could have a value more appropriate for production as a default here. Anyway, as stated this not really in scope for this PR.

packages/SwingSet/src/kernel/state/vatKeeper.js Outdated Show resolved Hide resolved
@warner
Copy link
Member Author

warner commented May 10, 2024

We choose this value to allow the #8928 slow vat
termination process to trigger BOYD frequently enough to keep the BOYD
cranks small: since these will be happening constantly (in the
"background"), we don't want them to take more than 500ms or so. Given
the current size of the large vats that #8928 seeks to terminate, 10
krefs seems like a reasonable limit. And of course we don't want to
perform too many BOYDs, so gcKrefs: 20 is about the smallest
threshold we'd want to use.

I am somewhat concerned that the cost of engine gc might mean 20 is too low. How do we validate we won't spend too much time in BOYD? Do we have an idea of what portion of the time spent in BOYD is related to the engine doing a gc mark and sweep, and how much of it is liveslots reacting to the found garbage? I assume liveslots usually has vatStore calls to make when garbage was found?

We certainly need to experiment on real data and tune this. The "reapDirtThresholds.gcKrefs -> work-per-BOYD" ratio is worst-case determined by how many objects/krefs are entrained by each #8401 Zoe/ExitObject cycle, and how much work (in particular how many syscalls) Zoe does in the resulting BOYD.

My notes/predictions are in #8928 (comment) . gcKrefs: 20 and a cleanup budget of "5" ought to trigger one zoe BOYD every other empty block, which will do 500 syscalls, free 10 cycles, and take 500ms. It will take about a week to get through the cycle-breaking c-list deletions.

This might actually be too aggressive, but if so we can reduce the cleanup rate instead of reducing the gcKrefs threshold.

My worry about increasing reapDirtThresholds.gcKrefs is that it would make the BOYDs bigger (albeit less frequent). We're going to be triggering BOYDs constantly, every N blocks, and I'm worried that the slower validators/followers will fall behind. You're absolutely right that smaller+faster BOYD vs larger+slower is influenced by the non-kref-handling code, and so the "percentage of total time spent in a BOYD" equation is not independent of gcKrefs. My hunch is that it's dominated by the syscalls, rather than the engine GC, but I'll see if I can measure that.

Also, your question means I should measure the rate of non-cleanup-related gcKrefs going into a vat like zoe, and estimate what kind of BOYD rate we expect this to provoke, just in normal operation. Our baseline is one BOYD per 200 deliveries (triggered by heap snapshots), so if reapDirtThresholds.gcKrefs doesn't trigger BOYDs faster than that, we should be ok (since we'll make snapshot-triggered BOYDs reset the dirt counters, so they'll replace instead of adding).

External APIs continue to accept reapInterval, and now also accept reapGCKrefs.

  • kernel config record

    • takes config.defaultReapInterval and defaultReapGCKrefs
    • takes vat.NAME.creationOptions.reapInterval and .reapGCKrefs
  • controller.changeKernelOptions() still takes defaultReapInterval
    but now also accepts defaultReapGCKrefs

This does not sound scalable for adding new dirt thresholds. Why not add a reapDirtThresholds record in configs / options?

I started out doing it that way, but it triggered a lot of test churn, and made it hard to distinguish between "I'm setting a threshold and I don't care about property X, use your default", vs "I'm setting a threshold and I want property X to never trigger BOYD", mostly because I stuck with a type of Number | undefined, and could only get one meaning for the undefined case. In the future, when we add a new threshold type/property, we want to be backwards-compatible with the intentions of the old config, so missing properties need to mean "use your default", and I found it hard to do that when the thresholds were all glued together into a single argument record.

I'll take another look and see if there's a way to get a cleaner API, but I have to admit it's not my highest priority.

The kernel will automatically upgrade both the kernel-wide and the
per-vat state upon the first reboot with the new kernel code.

I'll need to look at this. It sounds like a sort of schema migration.

It is, for sure. I thought briefly about adding a controller.upgradeKernelState(), but I didn't like the DX of that:

  • existing host apps which don't get the memo, and fail to add the upgrade call, would break
  • to make them break cleanly, I'd need buildKernel() to check the kvStore for evidence of being out-of-date, and bail early if so (with an error pointing them at upgradeKernelState), so we don't charge ahead with bogus state
  • creating the controller also starts the kernel, and a lot of schema upgrades we can imagine (for the future) will need to be done by that point, so it's kinda too late, which means it really wants a standalone function, kind of like initializeSwingset() (which takes a kernelStorage but doesn't use/create a controller or kernel), which would make the host app responsibility be like:
import { makeSwingsetController, maybeUpgradeKernelState } from '@agoric/swingset-vat';
const { kernelStorage, hostStorage } = openSwingStore(..);
maybeUpgradeKernelState(kernelStorage);
hostStorage.commit();
const controller = makeSwingsetController(kernelStorage, ..);
..

We've had similar discussions in #8089, and I think you convinced me to make it explicit, where the host app must call a function (at the right time), and not have the library perform an automatic upgrade, but then export-data considerations led us both to give up on using an explicit call, and instead having upgrade be implicit during the openSwingStore() call.

I think we've got similar issues here: the reapDirt upgrades will cause kvStore changes, and export-data/IAVL changes, that need to be committed. The situation is better, though, because we already have the kernelStorage object, so we've got somewhere to hold the pending changes.

I'll try to think about how we could make this an explicit call.

as well as tracking a separate snapshotDirt counter (to trigger XS
heap snapshots, ala #6786).

I'm not sure I understand this. Is the idea to trigger snapshots every n BOYD? Snapshots should already force BYOD before snapshotting. The snapshotting process reveals gc to liveslots, which may have bugs revealing this out (#6784), so the idea has been to force a BOYD before each snapshot (#7504) but we expect BOYD to be more frequent than snapshots.

Edit: I totally misunderstood snapshortDirt: it's not meant to trigger reap, or be related to reapDirt, just will likely use a similar accounting mechanism, with likely similar dirt input.

Right, we'd have reapDirt and snapshotDirt counters, compared against reapDirtThreshold and snapshotDirtThreshold respectively. We'd want snapshots to be driven mostly by computrons and partially by deliveries, and whereas we'd want BOYD to be mostly driven by memory/vref things rather than CPU-usage things, but both could use the same structure, just with different thresholds.

@aj-agoric aj-agoric assigned warner and unassigned mhofman May 14, 2024
@warner
Copy link
Member Author

warner commented May 17, 2024

I went to move the upgrade code into startCrank as suggested, and to write a test that the crankhash is updated to include the upgrade changes, and realized that this branch wouldn't work at all, because the kernelKeeper requires an up-to-date kvStore as soon as it is created, but the normal kernel.start() process creates the kernelKeeper before doing any cranks. My original unit test created a kernel in isolation, and the new test I was writing used controller instead, and it couldn't get past creating the kernel before it asserted.

I thought about our plans for swing-store schema migration (concluding by this comment), and the constraints therein, and how it applies to this upgrade step. We don't have the same "where do the export-data changes show up?" problem, since makeSwingsetController() accepts a swing-store (vs #8089 where openSwingStore() must return one). We still have the ergonomics question of "when does the DB get mutated?" and the corresponding host-app obligation to commit those changes (with swingStore.hostStorage.commit()) at an appropriate time.

Each kernel reboot uses some particular version of the @agoric/swingset-vat package (and likewise of the swing-store package). Each kernel "run" (the committed results of a controller.run(), plus whatever IO/device inputs arrived first) uses some version, and one of the host-app responsibilities is that this version never regresses: every run must use the a version that is the same or newer than the version used on the previous run. That means you can upgrade your kernel between reboots, and you are not allowed to revert it to an earlier version.

In a consensus system, each "block" (or whatever they call the coordinated units of work) is executed with some particular version of the kernel/swingstore. The host-app's responsbility is to ensure that each block is executed with the same version, across all workers. You must upgrade your kernel at the same time as everybody else, even though you're allowed to make non-upgrading reboots more frequently than that.

So, I think I see three options:

  • A: add a standalone upgradeSwingset(kernelStorage) function, which looks for an explicit kvStore version key, makes upgrading changes if necessary, and always leaves it at version = CURRENT ("1" in our case). Then kernelKeeper can assert that version is as expected. In this option, makeSwingsetController() does not mutate the DB.
  • B: add an { upgrade: true } option to makeSwingsetController(), which tells it to perform an upgrade when it starts, and then have kernelKeeper assert the version as above. Then makeSwingsetController() might mutate, but only if you approve it.
  • C: have makeSwingsetController() automatically upgrade, without an extra approval flag

I prefer being very explicit about when DB changes might happen, so I don't like C. And it feels odd to have makeSwingsetController() be a source of mutation: I expect controller.run() to mutate, and of course initializeSwingset, but not a function that claims to be merely creating a control surface. B would be a compromise, to make it very clear to the caller that they must commit any changes made by makeSwingsetController(), but it sticks out like a sore thumb.

So I'm going to pick A, where we have a separate upgradeSwingset() function that must be called before makeSwingsetController(). The simplest host-app behavior would be to call it every time, in every reboot, and rely upon the fact that it won't mutate anything if we're already up-to-date. Technically, if you knew exactly which reboots might have upgraded the kernel, you could refrain from calling it most of the time, but I don't think that would make anything simpler. Just in case, I'll have upgradeSwingset() return a boolean that says whether it actually changed anything, so host-apps could include an assertion like "hey, I wasn't following an upgrade plan, nothing should have changed, bail before I commit anything".

I'm also going to move to the explicit kvStore key that says what version the schema is, and graduate from our "ad-hoc" phase . I need something for makeSwingsetController() (really makeKernelKeeper()) to assert against. The rule will be that (only) upgradeSwingset() is allowed to change this, and consistency between the version key and the rest of the data is both a pre-condition and a post-condition around that call, but we're allowed to have inconsistent data while in the middle of the call (i.e. just what you'd expect). I'll need to look at initializeSwingset and decide whether to create a v0 DB and immediately upgrade it (as is my plan with #8089), or to create a v1 DB right off the bat. It might be easier to write unit tests if there's a createV0 function available.

And then I'll need to look at how cosmic-swingset calls this and make sure I'm not creating any problems. We'll already need to handle mutation (and thus export-data callbacks being invoked) with the #8089 changes, because the constraints there obligate us to perform upgrades as part of openSwingStore(), rather than using a separate upgrade function. But since #8089 isn't done yet, the cosmic-swingset code might not be ready for it. I think it won't require anything special, I'll add the call to upgradeSwingset() just before wherever we call makeSwingsetController(), with a note that this changes the DB, and that both the export-data callbacks (headed into IAVL) and the actual state changes will get committed as part of the first block after reboot. This should only actually change anything in the first block after a chain upgrade, which will be in-consensus.

Note that this means the upgrade changes won't be included in any crank, and they won't be included in a crankhash. (To be precise, they'll get hashed into the crankhasher created by swingStore.js when the swing-store is built, but they'll get cleared by the startCrank() that the kernel calls at the top of the first real crank, so they're effectively ignored). I think that's fine, we weren't expecting that #8089 swing-store upgrade changes (which might cause an IAVL change, even if just for the version key) would be included in a crank. We still get export-data/IAVL shadows of everything, so a buggy validator which upgrades at the wrong time will fall out of consensus right away, we just won't have a slogfile crank-hash value that explains why.

@mhofman
Copy link
Member

mhofman commented May 18, 2024

I think it won't require anything special, I'll add the call to upgradeSwingset() just before wherever we call makeSwingsetController()

That sounds right. cosmic-swingset only builds the kernel as part of an init message, which is always done while processing a block at the cosmos level, so calls to the bridge are safe during that time.

the upgrade changes won't be included in any crank, and they won't be included in a crankhash

This part is a little unfortunate, but I guess acceptable. If we ever change how we export the swing-store kv data, we'll likely still have data to validate it in IAVL. The unfortunate part is that right now we would have no way to detect this difference using the slog only, and exporting and comparing swing-store and iavl trees is difficult compared to comparing slog entries (e.g. from a flight recorder). Can we maybe add a new slog entry for upgrades, which maybe captures the crankHasher state at the end even if it doesn't end up chained like a normal crank?

@warner
Copy link
Member Author

warner commented Jun 3, 2024

Can we maybe add a new slog entry for upgrades, which maybe captures the crankHasher state at the end even if it doesn't end up chained like a normal crank?

Let's see, if I call emitCrankHashes() at the end of upgradeSwingset(), it will return the accumulated crankhash, which we could write to the slog. But it will also update activityhash, which will affect all subsequent activityhashes.

On one hand, that's good, it's a more explicit way for consensus to fail if one node diverges from the rest during upgradeSwingset(): the small changes they make to kvStore will get mirrored into IAVL and cause a divergence, but they'll also be captured by the activityhash (which itself is mirrored into IAVL, and will cause divergence), in addition to the non-IAVL-stored crankhash appearing in the slog.

On the other hand, it makes the act of calling upgradeSwingset() always mutating, even if the schema was up-to-date and no changes needed to be made. Which means we could not afford to call it on every reboot, because reboots are not supposed to be part of consensus. We'd need cosmic-swingset code to tell us whether we're running an upgrade plan (i.e. we're in the first reboot after a chain-halting upgrade event), and only call upgradeSwingset() on that one reboot-per-chain-upgrade.

I guess I could split emitCrankHashes() into two pieces. The first rotates and returns the crankhash, but doesn't mutate the DB. The second would accept a crankhash, and would update/store/return the activityhash. Normal cranks would call both, but upgradeSwingset() would only call the first.

I'm trying to think through how this interacts with the mutations caused by device inputs (eg pushing messages onto the run-queue). Those don't happen inside cranks, they happen before a controller.run(). They mutate the DB, which feeds the crankhasher, but the kernel calls startCrank() at the top of run(), which means their crankhash changes are thrown away before they can affect anything else, like the activityhash. I guess I'd assumed that pre-run() changes would get folded into the activityhash by the first crank, but I was wrong. Divergence would still be caught by the IAVL shadow, but we'd have the same "where did it diverge?" debugging problem that prompted @mhofman 's request (although, I think they would show up in the flight-recorder). I guess that's an argument for making "device input methods" more first-class (have device registration create functions that the host-app can safely call at arbitrary times, have the kernel schedule their actual invocation to avoid overlap with kernel execution), so we could wrap them in a startCrank/emitCrankHashes` pair and capture their mutations in an updated activityhash.

@warner
Copy link
Member Author

warner commented Jun 10, 2024

I'm going to leave the crankhash/activityhash alone for now, I'm worried about accidentally introducing "how many times did we call upgradeSwingset()" non-determinism.

@warner warner force-pushed the warner/8980-boyd-scheduler branch from 402811a to 3200c52 Compare June 10, 2024 17:35
@warner warner requested a review from mhofman June 10, 2024 17:43
@warner
Copy link
Member Author

warner commented Jun 11, 2024

One lingering concern I've got.. we have a kernel-wide default value, and vat-creation -time options that can override it, and we combine those two values at the moment we create the vat, and store the resulting per-vat threshold in vNN.options. This PR expands the nature of the value, but doesn't change that basic model. A consequence of the model is that changing multiple vats at the same time would require doing a adminNode.changeVatOptions() on every single vat. In particular I'm thinking about how tune threshold.gcKrefs after this gets deployed.

I'm wondering if there's a decent way to record "the creator of this vat didn't override anything" as a missing per-vat threshold property, and have the kernel merge in the kernel-wide default later, either as the vatKeeper is created (so basically once per worker page-in), or on every crank (safer but more expensive, although you're probably thinking it's a drop in the bucket compared to all the other DB reads we do on each crank and you're probably right).

That would change the upgrade process to look for differences between vNN.reapInterval and the kernel-wide defaultReapInterval, and only record deliveries: into vNN.options if they were different, otherwise omitting that property. For mainnet, every vat would have their JSON.parse(v${vatID}.options).reapDirtThreshold be {}, except for comms, which would get never/never/never.

Adding new kinds of dirt would only require a change to the kernel-wide defaultReapDirtThreshold, not something to walk all vats.

@mhofman thoughts?

@warner warner marked this pull request as draft June 11, 2024 22:02
@warner
Copy link
Member Author

warner commented Jun 11, 2024

changed PR to draft for an hour while I try to implement that new thing

@warner warner force-pushed the warner/8980-boyd-scheduler branch from 3200c52 to a698f7b Compare June 12, 2024 02:15
@warner warner marked this pull request as ready for review June 12, 2024 02:15
@warner
Copy link
Member Author

warner commented Jun 12, 2024

@mhofman ok I updated the PR to minimize the per-vat recordkeeping: it only puts something in vNN.options (.reapDirtThreshold) if it needs to override the kernel-wide setttings. That mean that if we change the kernel-wide gcKrefs in the future, all vats will automatically start using the new value, which should help out a lot as we tune this thing.

@warner warner force-pushed the warner/8980-boyd-scheduler branch from a698f7b to 2e3641a Compare June 13, 2024 18:12
@warner
Copy link
Member Author

warner commented Jun 13, 2024

just updated it to slog the cleanup work (and when it's complete), and to rebase onto current trunk

@warner warner force-pushed the warner/8980-boyd-scheduler branch 2 times, most recently from fdecbe5 to 796a2d3 Compare June 14, 2024 23:07
@warner warner force-pushed the warner/8980-boyd-scheduler branch 3 times, most recently from c05be76 to 70300ac Compare June 27, 2024 21:39
@warner
Copy link
Member Author

warner commented Jun 27, 2024

Upgrade Considerations

NOTE: deployed kernels require explicit state upgrade, with:

import { upgradeSwingset } from '@agoric/swingset-vat';
..
upgradeSwingset(kernelStorage);

This must be called after upgrading to the new kernel code/release,
and before calling buildVatController(). It is safe to call on every
reboot (it will only modify the swingstore when the kernel version has
changed). If changes are made, the host application is responsible for
commiting them, as well as recording any export-data updates (if the
host configured the swingstore with an export-data callback).

(we still need an integration PR which adds cosmic-swingset support; until that lands, pre-existing kernels, like mainnet, will crash at startup. I don't think we have any CI which exercises upgrade from old kernel state)

@warner
Copy link
Member Author

warner commented Jul 10, 2024

I've updated the PR to include the necessary upgradeSwingset() call in cosmic-swingset (in its own commit). There's also a commit to make sure that we don't get export-data callbacks at a time that would cause problems (to guard against the problems we wrote up in #9655): @mhofman please take a look at those and confirm that I'm enabling/disabling the callback in all the right places, it passes the cosmic-swingset unit tests (and failed when I commented out the enabled = true lines), but I need extra eyes on it.

@warner warner requested a review from gibson042 July 10, 2024 17:10
@warner warner force-pushed the warner/8980-boyd-scheduler branch 2 times, most recently from ede07c7 to a1b9efe Compare July 11, 2024 23:51
Copy link
Member

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to ask about the interaction of various dirt-driven BOYDs, but I think this approach of allowing each metric to independently prompt them makes sense.

No blockers, but a large handful of æsthetic suggestions. Also, thanks for the thorough testing.

packages/SwingSet/src/controller/initializeKernel.js Outdated Show resolved Hide resolved
Comment on lines 46 to 47
const CURRENT_VERSION = 1;
kernelStorage.kvStore.set('version', `${CURRENT_VERSION}`);
Copy link
Member

@gibson042 gibson042 Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels wrong to modify kvStore contents before asserting that kernelKeeper.getInitialized() returns false, and for that matter it seems like there's some redundancy between "initialized" and "version" data. I can think of a few possible refactorings, but the one that appeals to me most is:

  • Remove kernelStorage.kvStore.set('version', `${CURRENT_VERSION}`) here.
  • Replace kernelKeeper.setInitialized() below with something like kernelKeeper.setInitialized(CURRENT_VERSION).
  • In kernel.js, update setInitialized accordingly (to take a version argument and write that to the kvStore, preferably at key "version" rather than "initialized").
    • If at key "version", then update upgradeSwingset accordingly (to migrate data from "initialized" to "version").
  • Also in kernel.js, move the version check from makeKernelKeeper to getInitialized:
    function getInitialized() {
      const version = kvStore.get('version');
      if (version !== `${EXPECTED_VERSION}`) {
        throw Error(
          `kernelStorage is too old (have version ${version}, need ${EXPECTED_VERSION}), please upgradeSwingset()`,
        );
      }
      return true;
    }
  • Optionally, update kernel.js to check getInitialized right after makeKernelKeeper(kernelStorage, kernelSlog) (maintaining the early error introduced by the current state of this PR).
  • Optionally, rename setInitialized to e.g. setVersion and/or getInitialized to e.g. assertInitialized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds like the best plan. I'm not jazzed about the idea of makeKernelKeeper() returning an API surface for the wrong data (e.g. if kernel.js forgot to make the getInitialized/assertInitialized/assertVersion check, and started using a non-upgraded DB), but the real problem is that that we need a kernelKeeper to populate the initial state, which of course happens before the DB initialized.

Hm, what about makeKernelKeeper(expectedVersion) ? We'd use undefined for the initial construction, then the normal kernel.js usage would pass in CURRENT_VERSION or whatever. Oh, and what if we used 'uninitialized' instead of undefined? That would make it impossible to call makeKernelKeeper(), and you'd have to be explicit about whether you're making one for initialization purposes, or for actual use?

Hm, now that makes me wonder if upgradeSwingset could go back to being a method of kernelKeeper, as I first tried. If we went that way, upgradeSwingset() would be like makeKernelKeeper(..).upgrade() .. hm, no, we wouldn't know what version to expect (it could be either 0 or 1, but shouldn't be missing entirely / uninitialized).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I went with makeKernelKeeper(expectedVersion), and I like how it flows. kernelKeeper.js exports the current version, so the kernel demands it, and won't get back an API that's using wrong-version data. And there's a way for initializeKernel to demand an uninitialized one. The host app doesn't create the kernelKeeper, so it doesn't need to know anything about versions, it just calls upgradeSwingset (either every boot, or every boot after the operator/host-app-author knows they've changed the kernel code).

packages/SwingSet/src/kernel/kernel.js Show resolved Hide resolved
if (!kvStore.has(`${vatID}.o.nextID`)) {
initializeVatState(kvStore, transcriptStore, vatID);
}
assert(kvStore.has(`${vatID}.o.nextID`), `${vatID} was not initialized`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit:

Suggested change
assert(kvStore.has(`${vatID}.o.nextID`), `${vatID} was not initialized`);
kvStore.has(`${vatID}.o.nextID`) || Fail`${vatID} was not initialized`;

packages/SwingSet/src/kernel/state/vatKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
Comment on lines 445 to 449
if (typeof value === 'number') {
assert(value > 0, `threshold[${key}] = ${value}`);
} else {
assert.equal(value, 'never', `threshold[${key}] = ${value}`);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (typeof value === 'number') {
assert(value > 0, `threshold[${key}] = ${value}`);
} else {
assert.equal(value, 'never', `threshold[${key}] = ${value}`);
}
(typeof value === 'number' && value > 0) ||
value === 'never' ||
Fail`threshold[${assert.quote(key)}] ${value} must be a positive number or "never"`;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll follow the previous pattern

packages/SwingSet/src/kernel/state/vatKeeper.js Outdated Show resolved Hide resolved
@warner
Copy link
Member Author

warner commented Jul 29, 2024

I've pushed the feedback changes. I'll squash these into better shape before landing. @mhofman if you get time to look, the first four commits are organized/review-separately, the remainder are haphazard and not worth looking at separately.

warner added 5 commits August 12, 2024 16:46
Previously, vat state initialization (e.g. setting counters to zero)
happened lazily, the first time that `provideVatKeeper()` was
called. When creating a new vat, the pattern was:

  vk = kernelKeeper.provideVatKeeper();
  vk.setSourceAndOptions(source, options);

Now, we initialize both counters and source/options explicitly, in
`recordVatOptions`, when the static/dynamic vat is first defined:

  kernelKeeper.createVatState(vatID, source, options);

In the future, this will make it easier for `provideVatKeeper` to rely
upon the presence of all vat state keys, especially `options`.

Previously, vatKeeper.getOptions() would tolerate a missing key by
returning empty options. The one place where this was
needed (terminating a barely-created vat because startVat() failed,
using getOptions().critical) was changed, so this tolerance is no
longer needed, and was removed. The tolerance caused bug #9157 (kernel
doesn't panic when it should), which continues to be present, but at
least won't cause a crash.

refs #8980
NOTE: deployed kernels require a new `upgradeSwingset()` call upon
(at least) first boot after upgrading to this version of the kernel
code. See below for details.

`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection
inside a vat, and gives it a chance to drop imported c-list vrefs that
are no longer referenced by anything inside the vat.

Previously, each vat has a configurable parameter named
`reapInterval`, which defaults to a kernel-wide
`defaultReapInterval` (but can be set separately for each vat). This
defaults to 1, mainly for unit testing, but real applications set it
to something like 1000.

This caused BOYD to happen once every 1000 deliveries, plus an extra
BOYD just before we save an XS heap-state snapshot.

This commit switches to a "dirt"-based BOYD scheduler, wherein we
consider the vat to get more and more dirty as it does work, and
eventually it reaches a `reapDirtThreshold` that triggers the
BOYD (which resets the dirt counter).

We continue to track `dirt.deliveries` as before, with the same
defaults. But we add a new `dirt.gcKrefs` counter, which is
incremented by the krefs we submit to the vat in GC deliveries. For
example, calling `dispatch.dropImports([kref1, kref2])` would increase
`dirt.gcKrefs` by two.

The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use
patterns, this will trigger a BOYD after ten krefs have been dropped
and retired. We choose this value to allow the #8928 slow vat
termination process to trigger BOYD frequently enough to keep the BOYD
cranks small: since these will be happening constantly (in the
"background"), we don't want them to take more than 500ms or so. Given
the current size of the large vats that #8928 seeks to terminate, 10
krefs seems like a reasonable limit. And of course we don't want to
perform too many BOYDs, so `gcKrefs: 20` is about the smallest
threshold we'd want to use.

External APIs continue to accept `reapInterval`, and now also accept
`reapGCKrefs`, and `neverReap` (a boolean which inhibits all BOYD,
even new forms of dirt added in the future).

* kernel config record
  * takes `config.defaultReapInterval` and `defaultReapGCKrefs`
  * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs`
    and `.neverReap`
* `controller.changeKernelOptions()` still takes `defaultReapInterval`
   but now also accepts `defaultReapGCKrefs`

The APIs available to userspace code (through `vatAdminSvc`) are
unchanged (partially due to upgrade/backwards-compatibility
limitations), and continue to only support setting `reapInterval`.
Internally, this just modifies `reapDirtThreshold.deliveries`.

* `E(vatAdminSvc).createVat(bcap, { reapInterval })`
* `E(adminNode).upgrade(bcap, { reapInterval })`
* `E(adminNode).changeOptions({ reapInterval })`

Internally, the kernel-wide state records `defaultReapDirtThreshold`
instead of `defaultReapInterval`, and each vat records
`.reapDirtThreshold` in their `vNN.options` key instead of
`vNN.reapInterval`. The vat-level records override the kernel-wide
values. The current dirt level is recorded in `vNN.reapDirt`.

NOTE: deployed kernels require explicit state upgrade, with:

```js
import { upgradeSwingset } from '@agoric/swingset-vat';
..
upgradeSwingset(kernelStorage);
```

This must be called after upgrading to the new kernel code/release,
and before calling `buildVatController()`. It is safe to call on every
reboot (it will only modify the swingstore when the kernel version has
changed). If changes are made, the host application is responsible for
commiting them, as well as recording any export-data updates (if the
host configured the swingstore with an export-data callback).

During this upgrade, the old `reapCountdown` value is used to
initialize the vat's `reapDirt.deliveries` counter, so the upgrade
shouldn't disrupt the existing schedule. Vats which used `reapInterval
= 'never'` (eg comms) will get a `reapDirtThreshold.never = true`, so
they continue to inhibit BOYD. Any per-vat settings that match the
kernel-wide settings are removed, allowing the kernel values to take
precedence (as well as changes to the kernel-wide values; i.e. the
per-vat settings are not sticky).

We do not track dirt when the corresponding threshold is 'never', or
if `neverReap` is true, to avoid incrementing the comms dirt counters
forever.

This design leaves room for adding `.computrons` to the dirt record,
as well as tracking a separate `snapshotDirt` counter (to trigger XS
heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but
do not yet expose an API to set it.

Future work includes:
* upgrade vat-vat-admin to let userspace set `reapDirtThreshold`

New tests were added to exercise the upgrade process, and other tests
were updated to match the new internal initialization pattern.

We now reset the dirt counter upon any BOYD, so this also happens to
help with #8665 (doing a `reapAllVats()` resets the delivery counters,
so future BOYDs will be delayed, which is what we want). But we should
still change `controller.reapAllVats()` to avoid BOYDs on vats which
haven't received any deliveries.

closes #8980
makeKernelKeeper() now takes the expected version, and throws if it
doesn't match. This replaces the version check in kernel.js start().
Any changes will be committed as part of the first block after
reboot. This should be in-consensus, because all nodes are supposed to
change kernel versions at the same time, and only during a
chain-halting upgrade.
The swingstore export-data callback gives us export-data records,
which must be written into IAVL by sending them over to the golang
side with swingStoreExportCallback . However, that callback isn't
ready right away, so if e.g. openSwingStore() were to invoke it, we
might lose those records. Likewise saveOutsideState() gathers the
chainSends just before calling commit, so if the callback were invoked
during commit(), those records would be left for a subsequent block,
which would break consensus if the node crashed before the next
commit.

This commit adds an `allowExportCallback` flag, to catch these two
cases. It is enabled at the start of AG_COSMOS_INIT and BEGIN_BLOCK,
and then disabled just before we flush the chain sends in
saveOutsideState() (called during COMMIT_BLOCK).

Note that swingstore is within its rights to call exportCallback
during openSwingStore() or commit(), it just happens to not do so
right now. If that changes under maintenance, this guard should
turn a corruption bug into a crash bug

refs #9655
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 1f7943e to 6547c83 Compare August 12, 2024 21:46
@warner warner added the automerge:rebase Automatically rebase updates, then merge label Aug 12, 2024
@mergify mergify bot merged commit 179572e into master Aug 12, 2024
90 checks passed
@mergify mergify bot deleted the warner/8980-boyd-scheduler branch August 12, 2024 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dirty-vat -driven BOYD scheduler
3 participants