Slow Block Saturdays: investigations and planned mitigations #10743

warner opened this issue Dec 19, 2024

Here are some notes about what we understand/believe about the current mainnet chain slowdowns.

Observations

Here's a graph of the inter-block times over the last week, as observed by my follower agreeable (strictly speaking it tracks how many blocks my follower executed, but the follower is usually up-to-date, so the only significant differences should be when I reboot it every 5-6 days and it spends 10-20 minutes catching up):

[Graph: ag0_blockheight_rate-week — inter-block times over the last week]

When the chain is running smoothly, we get a block every 6 seconds. But we frequently see elevated inter-block times, in bursts that grow steadily worse until they spike and then revert to normal. During these periods, I observe my follower spending far more time on swingset processing than normal, sometimes up to 60 seconds of work. I also observe the in-consensus blockTime deltas grow, which means enough validators are having problems that the 2/3rds vote threshold is not being met as quickly as usual.
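
For reference, here is a minimal sketch of how such inter-block deltas can be computed directly from a node's CometBFT RPC endpoint. It assumes the default `http://localhost:26657`; the 200-block window and the 15s "slow" threshold are arbitrary choices for illustration, not values from the monitoring above.

```python
import json
import re
import urllib.request
from datetime import datetime

RPC_URL = "http://localhost:26657"  # assumption: default CometBFT RPC port

def rpc(path):
    with urllib.request.urlopen(f"{RPC_URL}{path}") as resp:
        return json.load(resp)["result"]

def parse_time(ts):
    # CometBFT emits RFC3339 with nanoseconds; trim to microseconds.
    m = re.match(r"(.*?)(?:\.(\d+))?Z$", ts)
    frac = (m.group(2) or "0")[:6].ljust(6, "0")
    return datetime.fromisoformat(f"{m.group(1)}.{frac}+00:00")

latest = int(rpc("/status")["sync_info"]["latest_block_height"])
times = {}
for height in range(latest - 200, latest + 1):   # last ~200 blocks
    header = rpc(f"/block?height={height}")["block"]["header"]
    times[height] = parse_time(header["time"])

for height in sorted(times)[1:]:
    delta = (times[height] - times[height - 1]).total_seconds()
    flag = "  <-- slow" if delta > 15 else ""
    print(f"{height}  {delta:6.1f}s{flag}")
```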

We have reports from many validators of the same symptoms. We also have anecdotal evidence that many validator nodes are losing network connectivity to some peers during these periods. Some nodes do a better job of maintaining connectivity than others.

A month or two ago, we tended to see these spikes on Friday afternoons, so we've been referring to the effect as "Slow Block Fridays". The effect has since shifted (precessed?) to Saturdays. This graph shows the inter-block times over the last month; you can see some spikes with a period of maybe 7.2 days, as well as other spikes with less obvious periodicities:

[Graph: ag0_blockheight_rate-month — inter-block times over the last month]
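
As a rough way to quantify that ~7.2-day period, here is a sketch that bins slow-block events into hourly buckets and scans autocorrelation lags between 5 and 9 days. The input CSV of `(timestamp_seconds, interblock_seconds)` rows is a hypothetical export from whatever monitoring produced these graphs, and the approach needs at least a few weeks of samples to say anything useful.

```python
import csv

BIN = 3600          # one-hour bins
SLOW = 15           # arbitrary "slow block" threshold, seconds
slow_counts = {}    # hour-bin -> number of slow blocks

with open("interblock_times.csv") as f:   # hypothetical export file
    for ts, delta in csv.reader(f):
        if float(delta) > SLOW:
            b = int(float(ts)) // BIN
            slow_counts[b] = slow_counts.get(b, 0) + 1

lo, hi = min(slow_counts), max(slow_counts)
series = [slow_counts.get(b, 0) for b in range(lo, hi + 1)]
mean = sum(series) / len(series)
dev = [x - mean for x in series]
norm = sum(d * d for d in dev) or 1.0

def autocorr(lag):
    return sum(dev[i] * dev[i + lag] for i in range(len(dev) - lag)) / norm

# scan lags from 5 to 9 days (in hours) and print the strongest few
best = sorted(range(5 * 24, 9 * 24), key=autocorr, reverse=True)[:5]
for lag in best:
    print(f"lag {lag / 24:.1f} days  autocorrelation {autocorr(lag):.3f}")
```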

This problem has been growing slowly for the last several months, as shown in this graph of the entire last year:

[Graph: ag0_blockheight_rate-year — inter-block times over the last year]

Our chain processes two basic kinds of transactions: cosmos txns and swingset txns. The cosmos txns include basic balance transfers, staking/bonding/delegation changes, withdrawing staking rewards, governance proposals and votes, etc. The swingset txns include everything implemented in JavaScript: contract interactions, requesting and accepting Invitations, funding/closing Seats, all smart-wallet operations, and everything triggered by a timer/repeater wakeup.

The most frequent swingset operation by far is the oracle price-feed submission, which is implemented as a contract Invitation, so it involves a smart-wallet walletSpendAction, as well as the full suite of wallet updates, result publishing to vstorage, etc.

I did some analysis work in October which suggested that the number of price-feed submissions per unit time does not go up during the slow periods; in fact it goes down slightly (probably because one or two oracle providers react to the slow blocks by backing off their update rates). In addition, the number of computrons spent by each submission remained the same, so we weren't taking more expensive JS code paths during the slow periods.
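
That per-block computron check can be redone against a swingset slogfile with something like the sketch below. The entry types and field names used here (`cosmic-swingset-end-block-start`, `deliver-result`, `dr[2].compute`) are my recollection of the slog format, not something this issue documents, so spot-check them against a few lines of an actual slog before trusting the output.

```python
import gzip
import json
import sys
from collections import defaultdict

computrons = defaultdict(int)   # blockHeight -> total computrons
deliveries = defaultdict(int)   # blockHeight -> delivery count
current_block = None

path = sys.argv[1]
opener = gzip.open if path.endswith(".gz") else open
with opener(path, "rt") as f:
    for line in f:
        entry = json.loads(line)
        kind = entry.get("type")
        if kind == "cosmic-swingset-end-block-start":    # assumed block marker
            current_block = entry.get("blockHeight")
        elif kind == "deliver-result" and current_block is not None:
            dr = entry.get("dr") or []
            meter = dr[2] if len(dr) > 2 and isinstance(dr[2], dict) else {}
            computrons[current_block] += meter.get("compute", 0)
            deliveries[current_block] += 1

for height in sorted(computrons):
    print(f"block {height}: {deliveries[height]} deliveries,"
          f" {computrons[height]} computrons")
```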

Known Problems

We have fairly good evidence that the p2p network protocol used by Cosmos/CometBFT (to distribute loose transactions, votes, and proposed/finalized blocks) does not react well to the END_BLOCK computation step taking a long time (>20-30 seconds). We've found similar reports from other Cosmos-SDK-based chains. We believe that long computation prevents the network keepalive (PING/PONG) messages from being emitted or accepted, causing connection timeouts, which cause otherwise-functional connections to be dropped. This breaks connectivity and requires extra time to re-establish the connections, slowing processing further. This is not a root cause of the slowdown, but once we reach some threshold of slowness, it makes things even worse. (#10742)

A recent upgrade (maybe upgrade-17?) moved us to a version of cosmos-sdk that activated a new IAVL cache by default. When active, this cache does a lot of extra disk IO (enough to burn through a typical SSD's maximum write limits in under a year), and spends a lot of extra CPU time. This doesn't appear to improve performance much, and can be disabled through the config file with iavl-disable-fastnode = true. We have observed the slow execution on machines with the cache disabled, so it does not seem to be the root problem.

We have very large JS heaps in two mainnet vats, v43-walletFactory and v7-board, as measured by the XS snapshot size (sqlite3 -readonly -box ~/.agoric/data/agoric/swingstore.sqlite 'SELECT vatID, uncompressedSize FROM snapshots WHERE inUse=1 ORDER BY uncompressedSize DESC LIMIT 20').

┌───────┬──────────────────┐
│ vatID │ uncompressedSize │
├───────┼──────────────────┤
│ v43   │ 711730867        │
│ v7    │ 604348547        │
│ v1    │ 67799747         │
│ v9    │ 61611067         │
│ v14   │ 33626683         │
│ v48   │ 30966491         │
│ v63   │ 29296891         │
│ v157  │ 25818347         │
│ v24   │ 15017787         │
│ v169  │ 14928091         │
│ v170  │ 14928091         │
└───────┴──────────────────┘

As of today (18-Dec-2024), v43-walletFactory is 712MB and growing at a rate of about 9.5MB/day. v7-board is 604MB and also growing, but I have not calculated its growth rate yet.
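
To keep tracking those growth rates, a small sampler like the following can be run once a day (e.g. from cron). It only reuses the table and columns from the query above; the output CSV name is arbitrary, and the growth rate falls out of diffing successive samples.

```python
import csv
import os
import sqlite3
import time

DB = os.path.expanduser("~/.agoric/data/agoric/swingstore.sqlite")
OUT = "snapshot_sizes.csv"     # arbitrary output file name

con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)   # open read-only
rows = con.execute(
    "SELECT vatID, uncompressedSize FROM snapshots"
    " WHERE inUse=1 ORDER BY uncompressedSize DESC LIMIT 20"
).fetchall()
con.close()

now = int(time.time())
with open(OUT, "a", newline="") as f:
    writer = csv.writer(f)
    for vat_id, size in rows:
        writer.writerow([now, vat_id, size])

# growth rate: diff successive daily samples for a given vatID,
# e.g. (size_today - size_yesterday) / 1e6 gives MB/day for v43
```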

Large JS heaps cause several problems:

  • garbage collection is slower, since the "mark" phase of mark-and-sweep must look at every object
  • heap snapshot saving is slower, since it must visit every object
  • heap snapshots are larger, so compressing them takes more time
  • large heap snapshots take longer to decompress and reload into a new worker
  • heap snapshots are also written as state-sync artifacts, which have a size limit, so sufficiently-large heap snapshots will break state-sync (Raise state-sync snapshot payload size to accommodate 8 weeks of wallet factory leak #10740)
  • the host OS (linux) will experience more memory pressure, which might slow execution, or trigger swap thrash

Vats are not supposed to use a lot of RAM. For intentionally high-cardinality things, we have virtual/durable Kinds/Exos and collections, which keep most data in the vatstore DB and only spend RAM on the Presences/Representatives that are actively in use during each delivery.

For v7-board, we believe we understand the problem. It has not yet been upgraded to incorporate the passStyleOf improvements (#9781) that were included in the agoric-sdk tree as of upgrade-17 (07-Oct-2024), so every non-retired vref which passes through any marshalling code causes an item to be added to an in-RAM Map. The RecoverySets inside our six scaledPriceAuthority vats are collectively holding millions of QuotePayment objects (#8400), and most of these are keeping Map entries alive in v7-board, which costs RAM. They are also keeping c-list entries alive, but those are in the DB, not RAM.

Unknown Problems, Hypotheses

We do not understand why v43-walletFactory is growing (#10706). It should already have the passStyleOf fix (deployed in upgrade-17). We have some preliminary analysis that suggests it may have a large number of Promises outstanding, perhaps related to Vow/WatchedPromise code changes deployed at the same time.
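
One cheap signal worth collecting alongside the heap-snapshot analysis is a count of v43's kernel c-list entries, split into object refs versus promise refs. The sketch below assumes the swingstore has a `kvStore` table of `(key, value)` rows and that c-list keys look like `v43.c.ko...` / `v43.c.kp...`; both are assumptions about the schema rather than facts from this issue, so verify with `.schema` and a few sample keys first.

```python
import os
import sqlite3

DB = os.path.expanduser("~/.agoric/data/agoric/swingstore.sqlite")
con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)   # open read-only

# assumed key layout: kernel-facing c-list entries live under "v43.c."
for label, pattern in [("object c-list entries", "v43.c.ko%"),
                       ("promise c-list entries", "v43.c.kp%")]:
    (count,) = con.execute(
        "SELECT COUNT(*) FROM kvStore WHERE key LIKE ?", (pattern,)
    ).fetchone()
    print(f"v43 {label}: {count}")

con.close()
```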

We do not understand why the chain slowdown should be periodic. One theory is that the oracles operate on fixed timers (every N seconds, with a different N for each oracle), and there is some sort of beat frequency between their timers that results in load synchronization. For example, if there were only two oracles (and only one denom), and one submitted a price every 4 minutes and the other every 5 minutes, we would see two submissions land in a single block once every 20 minutes (LCM(4,5)). Maybe the combination of five oracles and other delays is creating some complex resonance that results in submissions stacking up about once a week. Because of the network timeout/connectivity problem, the chain handles 6s of execution per block for six blocks a lot better than 36s of execution in one block followed by five empty blocks.
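
A toy simulation of that theory looks like the sketch below. The per-oracle periods are entirely hypothetical; the point is just that fixed but mismatched periods produce occasional blocks where submissions stack up.

```python
from collections import Counter

BLOCK = 6                       # seconds per block when the chain is healthy
periods = [53, 61, 67, 73, 79]  # hypothetical per-oracle submission periods (s)
horizon = 14 * 24 * 3600        # simulate two weeks

per_block = Counter()
for period in periods:
    for t in range(0, horizon, period):
        per_block[t // BLOCK] += 1

stacked = Counter(per_block.values())  # how many blocks received k submissions
for k in sorted(stacked):
    print(f"{stacked[k]:8d} blocks with {k} submission(s)")

# With two oracles at 4 and 5 minutes, coincidences recur every
# LCM(240, 300) = 1200s (20 minutes); with five mismatched periods the
# coincidence pattern is longer and lumpier, which is the "resonance"
# speculated about above.
```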

Before we observed the slowness move from Fridays to Saturdays, we hypothesized that many validators are running stock Linux distributions, with automatic system-maintenance cronjobs that run at the same time (e.g. 2am Saturday morning), creating common-mode load simultaneously on all validators in a given timezone. 2am Saturday in Asia or Eastern Europe could land on Friday afternoon in Pacific time, which is roughly when we saw the slowdowns. But the shift from Fridays to Saturdays does not support a rigid cron-triggered root cause.

Another theory is that there are more cosmos-type transactions happening during the slow periods than usual. My monitoring tools mostly track swingset time, so I don't have data on how much Cosmos-SDK-only work is happening. Swingset and Cosmos work are supposed to be fairly independent (they happen in separate threads/processes), but perhaps some of the cosmos workload is making swingset work slower (DB contention? thread/core contention? elevated uncommitted IAVL writes making reads take longer?). And maybe some cosmos-only economic task (staking reward distribution?) is performed selectively on a weekly schedule. To investigate this, we need data on the cosmos txns in each block, so we can look for periodic spikes in their types or counts.
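
Per-block transaction counts are easy to pull from the CometBFT RPC /blockchain endpoint (again assuming a local node on the default port). Note this counts all txns, cosmos and swingset alike; breaking them down by message type would additionally require decoding each tx. The height range below is a hypothetical placeholder.

```python
import json
import urllib.request

RPC_URL = "http://localhost:26657"   # assumption: default CometBFT RPC port

def block_metas(min_h, max_h):
    url = f"{RPC_URL}/blockchain?minHeight={min_h}&maxHeight={max_h}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["result"]["block_metas"]

def tx_counts(start, end, step=20):   # the endpoint returns ~20 metas per call
    counts = {}
    h = start
    while h <= end:
        for meta in block_metas(h, min(h + step - 1, end)):
            counts[int(meta["header"]["height"])] = int(meta["num_txs"])
        h += step
    return counts

counts = tx_counts(17_000_000, 17_000_200)   # hypothetical height range
for height in sorted(counts):
    print(height, counts[height])
```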

Upcoming Fixes/Mitigations

Upgrade-18 will restart v43-walletFactory, which should reset the heap size back down to some minimal baseline, and should mitigate the problem for a while. But until we fix the core problem, it is likely to grow at the same rate as before. We should have about two months before it grows back up to 600MB and resumes causing us problems.

We will continue to investigate the v43-walletFactory heap snapshots, to figure out why it is growing (#10706), and will deploy a fix as soon as we have one. We intend to restart v43-walletFactory at each opportunity (upgrade-19, -20, etc) until we deploy a fix for the growth (#10493).

Upgrade-18 will also stop using the old price-feed/scaled-price-authority vats, and install new ones which don't use a RecoverySet. So even without additional fixes, the price-feed traffic should stop provoking problems of the same magnitude in v7-board.

And when we next upgrade v7-board (#10394, in upgrade-19 at the latest, but it may get pulled into u18), it will benefit in two ways:

  • by simply restarting the vat, we lose that in-RAM Map, so the heap size should reset back to some small baseline
  • by switching to the fixed passStyleOf code, we should no longer spend RAM on vrefs whose Presences have been dropped, so any long-lived objects that pass through v7-board will only consume c-list space, not RAM

We are also working on a mitigation for the p2p timeout problem (#10742): putting a queue in front of the reactor, so network keepalive handshakes can be serviced during slow execution. We think this should reduce the frequency of disconnections, removing the step-function worsening of the slowdown that occurs once execution time exceeds the timeout-triggering threshold.
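
For illustration only (the real change is inside CometBFT's Go reactor code, tracked in #10742), the shape of the mitigation is roughly this: keepalive traffic stays on a responsive path while slow application work drains from a queue. A toy asyncio server showing the idea:

```python
import asyncio

async def slow_worker(work_q, writer):
    # drains deferred application messages; deliberately slow
    while True:
        msg = await work_q.get()
        await asyncio.sleep(30)           # stand-in for a slow END_BLOCK
        writer.write(f"DONE {msg}\n".encode())
        await writer.drain()

async def handle_conn(reader, writer):
    work_q = asyncio.Queue()
    worker = asyncio.create_task(slow_worker(work_q, writer))
    try:
        while data := await reader.readline():
            msg = data.decode().strip()
            if msg == "PING":             # keepalive: answered immediately
                writer.write(b"PONG\n")
                await writer.drain()
            else:                         # app traffic: deferred to the queue
                await work_q.put(msg)
    finally:
        worker.cancel()

async def main():
    server = await asyncio.start_server(handle_conn, "127.0.0.1", 9999)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```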
