
Add finalizer component to TXM #13638

Conversation

amit-momin
Contributor

Description

Implements a Finalizer component that, on each new head, checks whether confirmed transactions are finalized and marks them accordingly. Finality is validated using the checks below (a rough sketch of the decision flow follows the note):

  • Is the tx's receipt block number at or below the latest finalized head?
  • Does the tx's receipt block hash match the block hash at that height in the cached longest chain?
  • If the receipt is older than the cached longest-chain history, verify its block hash via BlockByHash.

Note

  • To account for finality violations, the Confirmer can unmark a transaction as finalized if it finds the receipt's block hash missing from the longest chain.
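
For illustration, here is a minimal Go sketch of the decision flow described above. The types and helper signatures are invented for the example and are not the actual TXM implementation.

package finalizersketch

import "context"

// Illustrative stand-ins for the TXM's head and receipt models.
type Hash [32]byte

type Head struct {
	Number int64
	Hash   Hash
}

type Receipt struct {
	BlockNumber int64
	BlockHash   Hash
}

// blockByHashFn abstracts the RPC fallback (e.g. an eth_getBlockByHash lookup).
type blockByHashFn func(ctx context.Context, hash Hash) (found bool, err error)

// isFinalized applies the three checks from the description to one confirmed tx.
func isFinalized(ctx context.Context, r Receipt, latestFinalized Head, cachedChain map[int64]Head, blockByHash blockByHashFn) (bool, error) {
	// 1. The receipt's block must be at or below the latest finalized head.
	if r.BlockNumber > latestFinalized.Number {
		return false, nil
	}
	// 2. If the block is still in the cached longest chain, the hashes must match.
	if head, ok := cachedChain[r.BlockNumber]; ok {
		return head.Hash == r.BlockHash, nil
	}
	// 3. Older than the cached history: verify the block hash over RPC.
	return blockByHash(ctx, r.BlockHash)
}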

Ticket:
https://smartcontract-it.atlassian.net/browse/BCI-3486

Requires:

@amit-momin amit-momin changed the title Add a finalizer component to TXM Add finalizer component to TXM Jun 20, 2024
@amit-momin amit-momin marked this pull request as ready for review June 20, 2024 18:06
@amit-momin amit-momin requested review from a team as code owners June 20, 2024 18:06

// Check if block hashes exist for receipts on-chain older than the earliest cached head
// Transactions are grouped by their receipt block hash to minimize the number of RPC calls in case transactions were confirmed in the same block
// This check is only expected to be used in rare cases if there was an issue with the HeadTracker or if the node was down for significant time
Collaborator

That is not true. This is directly affected by the HeadTracker's HistoryDepth config, which controls how many blocks we store before the latest finalized one. This is actually a very useful observation. Initially, we were aiming to have HistoryDepth set to 0, since most of HT's users were not interested in blocks past the latest finalized block. However, based on your approach above, it makes sense to set a reasonable value for that config so we can avoid excessive RPC calls in exchange for some in-memory storage. Once we introduce HT's optimizations, we could increase it further.
CAUTION: Recently we observed on certain chains that the depth from the latest block to the finalized one is quite large, so we added a config option to return an error when this happens. We should consider such cases here and perhaps handle them gracefully, or even alert the user, as transactions might not get finalized for a long period of time.

Contributor Author

Good to know! I didn't know we were considering setting HistoryDepth=0. I'm surprised none of the other callers needed a block hash check to protect against re-orgs, but I don't really know their use cases.

For the case where the depth from the latest block to the finalized one becomes quite large, I can check with CCIP/Automation to see if they care. I could foresee warning on this adding special scenarios that become hard to debug, especially if products act on them.

@dimriou dimriou requested a review from dhaidashenko June 21, 2024 12:13
// This check is only expected to be used in rare cases if there was an issue with the HeadTracker or if the node was down for significant time
var wg sync.WaitGroup
var txMu sync.RWMutex
for receiptBlockHash, txs := range receiptBlockHashToTx {
Collaborator

Why not BatchCall instead?

Contributor Author

I would prefer using a batch call, but I was trying to avoid introducing another method in the txmgr client just for this. Otherwise, using BatchCallContext directly would have introduced some chain-specific code when specifying eth_getBlockByHash. Although, I'm now reconsidering whether this component even needs to live in common. If we're pushing to degeneralize all of this code, I should put it on the EVM side, where this isn't a problem.

Contributor Author

I moved the Finalizer to the EVM code and changed this to a batch call in the latest commit
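
For reference, a minimal sketch of what such a batched lookup could look like using go-ethereum's rpc package directly; the endpoint, function name, and result handling are illustrative assumptions, not the code from this PR:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/rpc"
)

// blockHashesExist batches eth_getBlockByHash requests for a set of receipt
// block hashes and reports which hashes the node still knows about.
func blockHashesExist(ctx context.Context, client *rpc.Client, hashes []common.Hash) (map[common.Hash]bool, error) {
	reqs := make([]rpc.BatchElem, len(hashes))
	results := make([]map[string]interface{}, len(hashes))
	for i, h := range hashes {
		reqs[i] = rpc.BatchElem{
			Method: "eth_getBlockByHash",
			Args:   []interface{}{h, false}, // false: omit full transaction bodies
			Result: &results[i],
		}
	}
	if err := client.BatchCallContext(ctx, reqs); err != nil {
		return nil, err
	}
	exists := make(map[common.Hash]bool, len(hashes))
	for i := range reqs {
		// A nil result means the node returned null, i.e. it does not have the block.
		exists[hashes[i]] = reqs[i].Error == nil && results[i] != nil
	}
	return exists, nil
}

func main() {
	client, err := rpc.Dial("http://localhost:8545") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	exists, err := blockHashesExist(context.Background(), client, []common.Hash{common.HexToHash("0x01")})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(exists)
}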

wg.Add(1)
go func(hash BLOCK_HASH, txs []*txmgrtypes.Tx[CHAIN_ID, ADDR, TX_HASH, BLOCK_HASH, SEQ, FEE]) {
defer wg.Done()
if head, rpcErr := f.client.HeadByHash(ctx, hash); rpcErr == nil && head.IsValid() {
Collaborator

I would also check that the block number of the requested head is lower than the fetched latestFinalizedHead, just to be on the safe side; otherwise, we could throw an invariant violation error.

Collaborator

I might be talking about a different thing, but I agree that we need to check the fetched block against the latest finalized block; however, we should use HeadTracker's LatestAndFinalizedBlock from this PR.

The main reason is that MultiNode only provides a repeatable-read guarantee for data that is older than the finalized block fetched with LatestAndFinalizedBlock. It does not provide this guarantee for blocks fetched by LatestFinalizedHead.
Why? Because of HeadTracker's FinalityTagBypass option. We plan to remove it as part of HeadTracker's performance optimization.

Contributor Author

I added a check to ensure the fetched block's number is still earlier than or equal to the latest finalized block number. Thanks for the heads up! I'll update to use LatestAndFinalizedBlock once your PR is merged. For now, I've marked this PR as dependent on yours.
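
For illustration, the added guard amounts to a comparison along these lines (a hedged sketch with invented names, not the actual change):

package finalizersketch

import "fmt"

// checkAgainstFinalized returns an error when a block fetched by hash sits above
// the latest finalized block number; a tx confirmed in such a block must not be
// marked finalized, and the mismatch signals an invariant violation.
func checkAgainstFinalized(fetchedBlockNum, latestFinalizedBlockNum int64) error {
	if fetchedBlockNum > latestFinalizedBlockNum {
		return fmt.Errorf("invariant violation: block %d fetched by hash is above the latest finalized block %d",
			fetchedBlockNum, latestFinalizedBlockNum)
	}
	return nil
}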

@@ -181,6 +181,7 @@ type DbEthTx struct {
// InitialBroadcastAt is recorded once, the first ever time this eth_tx is sent
Collaborator

Should we modify ReapTxHistory to only delete finalized txs?

Contributor Author

Good call! I was trying to avoid touching it in this PR to avoid scope creep but the changes seemed minimal enough. I made the updates in the latest commit

f.initSync.Lock()
defer f.initSync.Unlock()
if f.isStarted {
return errors.New("Finalizer is already started")
Collaborator

isStarted and initSync do not seem to be needed; StartOnce will return an error if it's called multiple times. The same applies to StopOnce.

Contributor Author

Good catch. This got carried over from another component. I removed it in the latest commit.
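
For context, the StartOnce/StopOnce pattern being referenced looks roughly like this. This is a sketch assuming the services.StateMachine helper from chainlink-common that other TXM services embed; the finalizerSketch type and its fields are invented for the example.

package finalizersketch

import (
	"context"

	"github.com/smartcontractkit/chainlink-common/pkg/services"
)

// finalizerSketch shows how embedding services.StateMachine removes the need for
// a hand-rolled isStarted flag and initSync mutex: StartOnce and StopOnce already
// guarantee a single start/stop and feed the service's health status.
type finalizerSketch struct {
	services.StateMachine
	stopCh chan struct{}
}

func (f *finalizerSketch) Start(_ context.Context) error {
	return f.StartOnce("Finalizer", func() error {
		f.stopCh = make(chan struct{})
		// The background processing loop would be kicked off here.
		return nil
	})
}

func (f *finalizerSketch) Close() error {
	return f.StopOnce("Finalizer", func() error {
		close(f.stopCh)
		return nil
	})
}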

f.lggr.Debugw("processing latest finalized head", "block num", latestFinalizedHead.BlockNumber(), "block hash", latestFinalizedHead.BlockHash(), "earliest block num in chain", earliestBlockNumInChain)

// Retrieve all confirmed transactions, loaded with attempts and receipts
confirmedTxs, err := f.txStore.FindTransactionsByState(ctx, TxConfirmed, f.chainId)
Collaborator

Depending on the ReaperThreshold config option, we might load a lot of transactions that were already finalized, which in theory could lead to out-of-memory issues. Can't we filter them at the DB level?

Collaborator

It might also be a good idea to process transactions in batches.
On release we'll have to process all confirmed transactions at once, which might be problematic. On some chains we store txs for 7 days (168h) by default, and CCIP runs all chains on a single node.

Contributor Author

Good point. I was trying to use a generic method so that this would be easier to integrate with the in-memory store later on, but that's getting ahead of myself. I've updated it to use a more targeted query. I still need to think about batching, though. You're right about the large number of txs we'd finalize when this first runs, but I also want to avoid slowing this process down in normal use.

Contributor Author

I've added limits to the batch RPC call to respect RPCDefaultBatchSize, but I don't think we need to worry about loading all of the confirmed txs into memory. I can discuss more offline in the thread.
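
As an illustration of respecting a batch-size limit such as RPCDefaultBatchSize, a generic chunking helper might look like the following (a sketch with invented names, not the PR's code):

package finalizersketch

// callInChunks splits reqs into chunks of at most batchSize elements and hands
// each chunk to call, so a single pass never exceeds the configured batch size.
func callInChunks[T any](reqs []T, batchSize int, call func(chunk []T) error) error {
	if batchSize <= 0 {
		batchSize = len(reqs)
	}
	for start := 0; start < len(reqs); start += batchSize {
		end := start + batchSize
		if end > len(reqs) {
			end = len(reqs)
		}
		if err := call(reqs[start:end]); err != nil {
			return err
		}
	}
	return nil
}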

continue
}
receipt := tx.GetReceipt()
if receipt == nil || receipt.IsZero() || receipt.IsUnmined() {
Collaborator

Shouldn't we log an error here to signal that we expect a valid receipt for a confirmed tx?

Contributor Author

Added an assumption violation log here in the latest commit

})
}

func (f *evmFinalizer) startInternal(_ context.Context) error {
Collaborator
@dimriou dimriou Jun 26, 2024

You don't need the startInternal scheme, which is something we want to deprecate eventually. It is only used for services that need to refetch the enabled addresses via EnabledAddressesForChain. You should follow the same pattern as reaper.go.

Contributor Author

I ran into some problems with health checks when I followed the Reaper pattern. I removed startInternal but kept the StartOnce pattern, which has health status handling built in.

Collaborator

You also need to remove reaper_chain_config.go from mocks.

@@ -30,6 +31,7 @@ type TxmClient[
ctx context.Context,
attempts []TxAttempt[CHAIN_ID, ADDR, TX_HASH, BLOCK_HASH, SEQ, FEE],
) (txReceipt []R, txErr []error, err error)
HeadByHash(ctx context.Context, hash BLOCK_HASH) (HEAD, error)
Collaborator

I think the EVM TXM client uses the general EVM Client interface, so this isn't needed.

return nil
}

func (f *evmFinalizer) batchCheckReceiptHashesOnchain(ctx context.Context, blockNumToReceiptsMap map[int64][]Receipt) ([]Receipt, error) {
Collaborator

This method doesn't seem to return an error in any path. You can probably drop it.

Contributor Author

Good catch! It's removed in the latest commit

@@ -36,12 +36,13 @@ type (
Tx = txmgrtypes.Tx[*big.Int, common.Address, common.Hash, common.Hash, evmtypes.Nonce, gas.EvmFee]
TxMeta = txmgrtypes.TxMeta[common.Address, common.Hash]
TxAttempt = txmgrtypes.TxAttempt[*big.Int, common.Address, common.Hash, common.Hash, evmtypes.Nonce, gas.EvmFee]
Receipt = dbReceipt // EvmReceipt is the exported DB table model for receipts
Receipt = DbReceipt // DbReceipt is the exported DB table model for receipts
Collaborator

TODO: during clean up we need to switch to evtypes.Receipt instead

Collaborator

nit: Perhaps using a straight mock for the Finalizer tests might have been a cleaner approach.

@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Aug 6, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2024
@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Aug 6, 2024
Merged via the queue into develop with commit 2312827 Aug 6, 2024
118 checks passed
@prashantkumar1982 prashantkumar1982 deleted the BCI-3486-implement-finality-check-for-the-get-transaction-status-method branch August 6, 2024 19:51
momentmaker added a commit that referenced this pull request Aug 6, 2024
* develop:
  Add finalizer component to TXM (#13638)
  auto: adjust cron contract imports (#13927)
  Set PriceMin to match pip-35 definition (#14014)
  update solana e2e test build deps (#13978)
  fix data race in syncer/launcher (#14050)
  [KS-411] Extra validation for FeedIDs in Streams Codec (#14038)
  [TT-1262] dump pg on failure (#14029)
  ks-409 fix the mock trigger to ensure events are sent (#14047)
  update readme's with information about CL node TOML config (#14028)
  [CCIP-Merge] OCR2 plugins  [CCIP-2942] (#14043)
  [BCF - 3339] - Codec and CR hashed topics support (#14016)
  common version update to head of develop (#14030)