-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
auctioneer will create duplicate QuoteNotifiers until the first price is established #10660
Comments
This was fixed by PR #10615 , so it can be closed immediately. |
Reopenned because @warner noticed that we have a reoccurrence of the same issue with vaultManager.
|
It looks like the vaultManager has the same pattern, with the same bug. |
refs: #10660 ## Description in testing upgrade 18 candidates, we realized that vaultManager was subject to the same error as auctions fixed in #10615. ### Security Considerations No Security implications. ### Scaling Considerations Not fixing this would add a new notifier to the vaultManager's quote watcher for every hour that elapsed between upgrade and the oracles starting to provide price updates. ### Documentation Considerations Unnecessary. ### Testing Considerations This was detected by trawling slogs. It's not clear that we can do better than that to verify the fix. ### Upgrade Considerations It would probably be a mistake to upgrade vaults without this fix.
While doing
upgrade18
testing on the Emerynet testnet, we discovered a bug in the way the auctioneer establishes the "Quote Notifier", which resulted in a ton of duplicate notifiers being created. This caused a massive amount of work (nine hours of full blocks) to be triggered by the PushPrice oracle message.This was fixed by PR #10615 , to be incorporated into
agoric-upgrade-18-rc3
.Background
The AuctionBook is part of the Inter Protocol code which periodically holds auctions to sell off assets (like ATOM or stATOM). The primary current use is for the vaultManager to be able to liquidate assets when their value has dropped below a threshold, so that all vaults maintain a positive value. VaultManagers capture the price they will use as the threshold for liquidation shortly before each auction round starts. The auction sells each asset at a price captured at the beginning of the auction.
The price is provided by a set of "oracles" which periodically submit PushPrice transactions. These prices are submitted to a per-denomination "price feed vat", which runs an internal Notifier whose updates contain prices. That vat can also create a QuoteNotifier, whose updates contain "QuotePayments", which are ERTP Payment objects whose Amount describes the price and a timestamp reflecting the effective time of issuance of the quote. A second per-denom vat named the "scaled price authority" creates "scaled" QuotePayments, which then have their own QuoteNotifiers. Finally, there is a single vat named vat-priceAuthority which performs the denom-to-scaledPriceAuthority dispatch (and can be updated when we replace the sPA/price-feed vats), through which all requests for a QuoteNotifier are routed.
The auction book is woken up periodically (once per hour on mainnet, once every 10 minutes on emerynet) to see if any collateral needs to sold off. For various performance reasons, the auction book wants to maintain a copy of a recent QuotePayment rather than asking the price feeds every time. To support that, the auction book uses
observeNotifier
to subscribe to the scaled-price-authority'sQuoteNotifier
, and updates a local variable (actually a portion of the durablestate
of the AuctionBook exo) every time it receives an update. This way, the auction book always (or usually always) has a copy of a relatively-recent price, which can be accessed synchronously.When the auction book is first created (e.g. when the denom is added), the exo
finish()
function calls a local helper method namedobserveQuoteNotifier()
to initialize the process. This askspriceAuthority
tomakeQuoteNotifier()
, which gets routed to the scaled-price-authority vat for that denom, which then sends its ownmakeQuoteNotifier()
to the corresponding price-feed vat, which makes a Notifier and sends it back. The scaled-price-authority vat builds a new Notifier around that one (applying the scaling transform on each update) and returns it to the auction book. The auction code then usesobserveNotifier()
to attach a callback which updatesstate.updatingOracleQuote
each time a new quote is delivered.At the end of this process, we have one new Notifier in vat-price-feed, one new Notifier in vat-scaled-priceAuthority, and one new Observer in vat-auctioneer. This notifier/notifier/observer chain will then be triggered each time the price is updated, forever (or at least until the vats are killed).
In addition, each time the auction timer is triggered,
captureOraclePriceForRound()
is invoked, which can also useobserveQuoteNotifier()
to get one set up. This handles the case where the price-feed vat is terminated, and we need to establish a Notifier in its replacement. Also, observers are ephemeral, and the auctioneer vat might get upgraded, so this is the pathway by which the new auctioneer incarnation will establish a new set of notifiers and observers (although note that we do not yet claim the auctioneer vat is upgradable, and there is more work to be done before we can make that claim).The Bug
The core problem was the flag/lock used by the AuctionBook to ensure that we don't have duplicate observers/notifiers was based on the wrong criteria.
captureOraclePriceForRound()
used the truthiness ofstate.updatingOracleQuote
to decide whether to callobserveQuoteNotifier
or not.https://github.com/Agoric/agoric-sdk/blob/agoric-upgrade-18-rc2/packages/inter-protocol/src/auction/auctionBook.js#L643-L651
But
state.updatingOracleQuote
does not get set at all until the first price has been delivered:This won't happen until enough oracles have submitted a price for a round to meet the quorum requirements (two oracles, at least for Emerynet). And that can't happen until the oracles have accepted their invitations and submitted the PushPrice txns.
The Stall
On Emerynet (unlike mainnet), we don't have automatic oracles, and only do a manual PushPrice when we're doing qualification testing. So most of the denoms don't have established prices, especially after the initial
upgrade18
was deployed, which replaced the auctioneers (creating new ones, without prices in theirstate.updatingOracleQuote
).As a result, every 10 minutes,
captureOraclePriceForRound()
was called, found nostate.updatingOracleQuote
, calledobserveQuoteNotifier()
, which caused both the scaled-price-authority and the price-feed to create new QuoteNotifiers and wire them up to the internal price notifier that had never fired yet. We deployed ugprade18 to emerynet early in November, so this growth continued for several weeks without being noticed or causing visible problems, in both the current vat-auctioneer and the previous vat-auctioneer (we actually have five copies of vat-auctioneer on Emerynet, because our chain upgrades merely add new ones, and do not kill off the old ones, and they are all still subscribed to the wakeup timer).On saturday night (01-dec-2024, about 18:30 PST), we submitted a PushPrice (I think for ATOM, maybe stATOM). By this time, we had tens of thousands of these notifiers in place, all waiting for the first price to arrive. This first PushPrice didn't meet the quorum threshold, so didn't cause much activity. A few minutes later, when the second price was submitted at 18:33, that triggered the price-feed vat to fire the internal Notifier, which then triggered about 18k callbacks to all of the QuoteNotifiers in that vat. Each one needed to get an updated timestamp to create the QuotePayment amount, so that one delivery sent 18k
getCurrentTimestamp
messages to vat-timer before finishing. This one delivery took about 30 minutes to execute, resulting in one extremely long block on emerynet, with a gigantic transcript full ofsyscall.send
messages.After that, those 18k messages resulted in 18k responses, followed by 18k followups, etc. Each of the 18k observer/notifier sets required about a dozen messages to get from the initial PushPrice to the final assignment of
state.updatingOracleQuote
, so there were roughly 200k deliveries performed before the chain finally became idle again. Every individual delivery was small and completed in milliseconds, so every block was well-constrained by our 65M computron limit. But the sheer number of them meant that every block was full for the next nine hours, during which time no other actions could be performed. No new txns were taken off theactionQueue
, no timer updates were performed, we couldn't even pull new PushPrices actions from the high-priority action queue, until we finished the "catch up" run that we do at the start of each block. By the time it finished,state.updatingOracleQuote
had been updated 18k times, each with a different QuotePayment object (all with the same price and timestamp).So from the point of view of an external observer, the chain was stuck for 30 minutes (no new blocks), followed by nine hours of txns being added to blocks but not actually getting executed, after which a whole bunch of stale work was executed. And then finally the chain could behave normally again. (However any new PushPrices will trigger another nine hours of work).
The Fix
The fix, landed in #10615, was to change the lock/flag to be more accurate. We introduced a new flag, which is tested (and then set) upon entry to
observeQuoteNotifier
, and cleared if/when themakeQuoteNotifier
fails, or if/when the resulting notifier exists (finish
, meaning the observer has been told to shut down and emits its final update, orfail
, which means the observer has failed, e.g. when the price-feed vat is terminated). This should ensure that we don't initiate the creation of a new notifier while we might already be trying to create one. By clearing the flag on failure, it should also ensure that we don't inhibit notifier/observer creation when there is no chance that a creation attempt is already in place.The
state
variable will still benull
until the first update, and non-null after the update, but we no longer use it to make decisions about creating new notifiers.This flag is managed as an ephemeral Set, indexed by denom, closed over by all AuctionBooks (one per denom). This turned out to be the best way to ensure that upgrading vat-auctioneer would clear the flag (the Observers are ephemeral, so we must make sure that new auctioneer incarnations make new ones). A durable flag wouldn't be cleared, and putting the flag in per-instance state would require us to iterate over all existing auctions at incarnation startup to update their flags.
We rely upon the call to
captureOraclePriceForRound()
to trigger notifier/observer creation in new vat-auctioneer incarnations. We still have the call infinish()
, so deploying a new denomination will start the process early (before the first auction timer fires), but that doesn't help with vat-auctioneer upgrades.The text was updated successfully, but these errors were encountered: