
Asynchronously clean up obsolete HTLC info from DB #2705

Merged
t-bast merged 5 commits into master from asynchronous-htlc-info-clean-up on Feb 14, 2024

Conversation

@t-bast (Member) commented Jul 4, 2023

When a channel is closed, we can forget the data from historical HTLCs sent and received through that channel (which is otherwise required to punish cheating attempts by our peer). When a splice transaction confirms, we can also forget the data from revoked commitment transactions that cannot be published anymore.

We previously synchronously removed that data from the DB when a closing transaction confirmed. However, this could create performance issues as the htlc_infos table can be very large for busy nodes and many concurrent writes may be happening at the same time.

We don't need to get rid of this data immediately: we only want to remove it to avoid degrading the performance of active channels that read and write to the htlc_infos table. We now mark channels as closed in a dedicated table, and run a background actor that deletes batches of obsolete htlc data at regular intervals. This ensures that the table is eventually cleaned up, without impacting the performance of active channels.
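
For illustration, here is a minimal sketch of that background clean-up loop. The table names, SQL and the `HtlcInfoCleaner` skeleton below are assumptions for this example, not the actual `RevokedHtlcInfoCleaner` added by this PR, and the `ctid` batching trick is postgres-specific:

import java.sql.Connection
import scala.concurrent.duration.FiniteDuration
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

object HtlcInfoCleaner {
  sealed trait Command
  private case object CleanBatch extends Command

  def apply(db: Connection, batchSize: Int, interval: FiniteDuration): Behavior[Command] =
    Behaviors.withTimers { timers =>
      // Wake up at regular intervals instead of deleting everything when the closing tx confirms.
      timers.startTimerWithFixedDelay(CleanBatch, interval)
      Behaviors.receiveMessage { case CleanBatch =>
        // Pick one channel that was marked as closed in the small dedicated table.
        val rs = db.createStatement().executeQuery("SELECT channel_id FROM htlc_infos_to_clean_up LIMIT 1")
        if (rs.next()) {
          val channelId = rs.getString("channel_id")
          // Delete a bounded batch of obsolete htlc data, keeping each transaction short.
          val delete = db.prepareStatement(
            "DELETE FROM htlc_infos WHERE ctid IN (SELECT ctid FROM htlc_infos WHERE channel_id = ? LIMIT ?)")
          delete.setString(1, channelId)
          delete.setInt(2, batchSize)
          if (delete.executeUpdate() == 0) {
            // Nothing left for this channel: remove the marker so we stop scheduling work for it.
            val done = db.prepareStatement("DELETE FROM htlc_infos_to_clean_up WHERE channel_id = ?")
            done.setString(1, channelId)
            done.executeUpdate()
          }
        }
        Behaviors.same
      }
    }
}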

⚠️ This is very sensitive code: if we end up removing data for a commitment that our peer may still publish, we can lose funds. Please carefully review this change, especially the last commit!

Fixes #2610, #2702 and #2740

@t-bast t-bast requested a review from pm47 July 4, 2023 13:32
@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from df48b89 to b22d899 on July 4, 2023 13:33
@codecov-commenter commented Jul 4, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (c37df26) 85.85% compared to head (ce50751) 85.94%.
Report is 7 commits behind head on master.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2705      +/-   ##
==========================================
+ Coverage   85.85%   85.94%   +0.09%     
==========================================
  Files         216      218       +2     
  Lines       18201    18410     +209     
  Branches      749      767      +18     
==========================================
+ Hits        15626    15823     +197     
- Misses       2575     2587      +12     
Files Coverage Δ
...re/src/main/scala/fr/acinq/eclair/NodeParams.scala 93.47% <100.00%> (+0.08%) ⬆️
...in/scala/fr/acinq/eclair/channel/Commitments.scala 97.01% <100.00%> (+0.19%) ⬆️
...q/eclair/channel/fsm/ChannelOpenSingleFunded.scala 93.52% <100.00%> (+0.05%) ⬆️
...inq/eclair/channel/fsm/CommonFundingHandlers.scala 91.11% <100.00%> (+0.41%) ⬆️
...inq/eclair/channel/fund/InteractiveTxBuilder.scala 91.37% <100.00%> (ø)
...src/main/scala/fr/acinq/eclair/crypto/Sphinx.scala 100.00% <100.00%> (ø)
...main/scala/fr/acinq/eclair/db/DbEventHandler.scala 93.58% <100.00%> (+0.08%) ⬆️
...la/fr/acinq/eclair/db/RevokedHtlcInfoCleaner.scala 100.00% <100.00%> (ø)
...ain/scala/fr/acinq/eclair/db/pg/PgChannelsDb.scala 99.39% <100.00%> (+0.11%) ⬆️
...scala/fr/acinq/eclair/payment/send/Recipient.scala 98.63% <100.00%> (+0.10%) ⬆️
... and 8 more

... and 16 files with indirect coverage changes

@n1bor (Contributor) commented Jul 18, 2023

+1, this happened to me again. It can even stop the node: the whole connection pool can fill up with sessions doing the delete, then the lease renewal fails and eclair shuts down.

@n1bor (Contributor) commented Jul 18, 2023

I think the interval could be much smaller, e.g. 5 seconds. All that matters is that:

  1. only one connection from the pool is tied up doing these deletes.
  2. the deletes do not time out.

Or leave it at 5 minutes, and if the last run deleted some rows, schedule another run right away (see the sketch below).

Though even left at 5 minutes this will delete about 14 million rows a day, so it should keep up!
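
A minimal sketch of that adaptive scheduling idea (a hypothetical helper, not eclair code; `batchSize` is whatever row limit each delete run uses):

import scala.concurrent.duration.{Duration, FiniteDuration}

// If the last batch was full we probably still have a backlog, so run again immediately;
// otherwise wait for the regular interval before the next batch.
def nextDelay(deletedRows: Int, batchSize: Int, interval: FiniteDuration): FiniteDuration =
  if (deletedRows >= batchSize) Duration.Zero else interval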

@t-bast (Member, Author) commented Jul 23, 2023

I think the interval could be much smaller e.g. 5 seconds.

There's no "good" default value that fits all since every node has a very different usage pattern here, but 5 seconds is most likely way too aggressive for a default value.

You should change the default for your specific node to match your usage, feel free to use 5 seconds if that's what makes sense for you.

@n1bor (Contributor) commented Jul 25, 2023

Do you really need this extra table?
What is wrong with:

delete from local.htlc_infos WHERE channel_id not in (select distinct channel_id from local.channels where not is_closed) limit XXX

And just run that once an hour (or once a day), and if it deletes anything, run it again until nothing is deleted.

I think it should be very efficient when there is nothing to delete, as there will only be a few thousand channel ids to check.

Also, I was thinking: the real bug is actually that the lease renewal / check process shares the same connection pool as the rest of the app, which is what causes the crash. Should that not be on its own connection? There might be other scenarios in which it gets starved of a connection from the pool and causes eclair to exit.

@t-bast (Member, Author) commented Jul 26, 2023

delete from local.htlc_infos WHERE channel_id not in (select distinct channel_id from local.channels where not is_closed) limit XXX

This would be very inefficient. The select clause will require iterating through the whole local.channels table (even if you only keep the non-closed channels, you have to scan all of them to know which ones aren't closed). I've actually run some tests and tried this approach initially, but it's not good enough.

Also, I was thinking: the real bug is actually that the lease renewal / check process shares the same connection pool as the rest of the app, which is what causes the crash. Should that not be on its own connection? There might be other scenarios in which it gets starved of a connection from the pool and causes eclair to exit.

I don't think so. We should generally make sure all DB operations are fast, because they are synchronous operations. If they're all fast enough, we don't need to introduce any unnecessary complexity around the lease renewal process. Also, what you suggest only applies to postgres, while the current fix also handles sqlite.

@n1bor (Contributor) commented Jul 26, 2023

I think the optimiser is smarter than that: it will get the list of distinct channel_ids from the htlc_infos and channels tables, combine them to find the channel_ids that should be deleted, then scan the index on htlc_infos.channel_id to find the rows to delete, if there are any. (Or at least I know the SAP ASE optimiser would; I have a lot less experience with the postgres optimiser.) And in the normal case where there is nothing to delete it will then be very fast, as it is just combining two lists of a few thousand channel ids.

Will try to run a test later with EXPLAIN ANALYZE to see...

@t-bast (Member, Author) commented Jul 26, 2023

This simply cannot be more efficient than a dedicated table, because you'll be fetching information from a much larger table.
Also, this means you will unnecessarily run the clean-up job even when you have nothing to clean up, which is inefficient.
With a dedicated table, you instantly know whether you have things to clean up or not.

@n1bor (Contributor) commented Aug 1, 2023

Seems you are correct: postgres cannot do a "loose index scan", so it does a complete table scan even when there is an index on the column. That takes about 1 minute on my 250M-row table.
With a hacked-together loose index scan it runs in 0.8 seconds, so that is fine, but the SQL is very nasty and I'm not sure I really understand how it works!
https://wiki.postgresql.org/wiki/Loose_indexscan

I'm surprised this is missing, as the delete we are trying to run is a very common use case in a relational database, where you are deleting data from master and related tables.
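
For reference, the trick on that wiki page is a recursive CTE that repeatedly jumps to the next distinct channel_id through the index instead of scanning every row. A rough sketch (schema and helper names are assumptions, not eclair code):

import java.sql.Connection

object LooseIndexScan {
  // Emulated "loose index scan" (see the wiki link above): each recursive step fetches the
  // smallest channel_id strictly greater than the previous one, so listing the distinct
  // channel_ids only needs a handful of index probes instead of a full table scan.
  def distinctChannelIds(db: Connection): Seq[String] = {
    val sql =
      """WITH RECURSIVE t AS (
        |  (SELECT channel_id FROM local.htlc_infos ORDER BY channel_id LIMIT 1)
        |  UNION ALL
        |  SELECT (SELECT h.channel_id
        |          FROM local.htlc_infos h
        |          WHERE h.channel_id > t.channel_id
        |          ORDER BY h.channel_id LIMIT 1)
        |  FROM t
        |  WHERE t.channel_id IS NOT NULL
        |)
        |SELECT channel_id FROM t WHERE channel_id IS NOT NULL""".stripMargin
    val rs = db.createStatement().executeQuery(sql)
    Iterator.continually(rs).takeWhile(_.next()).map(_.getString("channel_id")).toSeq
  }
}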

@rorp (Contributor) commented Aug 2, 2023

Also, I was thinking: the real bug is actually that the lease renewal / check process shares the same connection pool as the rest of the app, which is what causes the crash. Should that not be on its own connection? There might be other scenarios in which it gets starved of a connection from the pool and causes eclair to exit.

If you are not going to migrate your node to another box any time soon, you can set

eclair.db.postgres.lock-type = none

That lease stuff is just a safeguard against user errors, to prevent possible loss of funds if multiple instances of eclair start updating the same channels in parallel. You don't have to use it all the time, though.

Also, it seems to me it's safe to replace withLock with inTransaction here https://github.com/ACINQ/eclair/pull/2705/files#diff-89b3a9b451c9cd791fcf1ad2ad18309d3cfd607e90e12b1894226b445b500ca0R252, because the relevant funds are already on chain. It's even desirable to delete all those HTLCs in a distributed setting :)

One more thought about the safeguards. In addition to the lease, we can generate the Postgres user name based on a unique identifier of the machine, like a hash of the MAC address (e.g. eclair-d4671ae9), so that the DBA has to manually grant/revoke access for those users. That can add additional protection when the lease is turned off.
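
For illustration, a rough sketch of how such a machine-derived user name could be computed (hypothetical code, not an existing eclair feature):

import java.net.NetworkInterface
import java.security.MessageDigest
import scala.jdk.CollectionConverters._

object MachineScopedDbUser {
  // Derive a stable, machine-specific Postgres user name from the first available MAC address.
  def apply(prefix: String = "eclair"): String = {
    val mac = NetworkInterface.getNetworkInterfaces.asScala
      .flatMap(ni => Option(ni.getHardwareAddress))
      .toSeq.headOption
      .getOrElse(Array.emptyByteArray)
    val digest = MessageDigest.getInstance("SHA-256").digest(mac)
    s"$prefix-${digest.take(4).map("%02x".format(_)).mkString}" // e.g. eclair-d4671ae9
  }
}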

@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from b22d899 to efd305f on September 21, 2023 14:58
@pm47 (Member) left a comment

How about a stateless:

DELETE FROM local.htlc_infos
WHERE channel_id IN (
  SELECT channel_id FROM local.channels WHERE is_closed
)

And how about running this cleaning task once at startup (with a way to disable it) instead of periodically, so we don't have unpredictable performance hits on the database?

@t-bast (Member, Author) commented Sep 26, 2023

And how about running this cleaning task once at startup (with a way to disable it) instead of periodically, so we don't have unpredictable performance hits on the database?

That wouldn't fix the issue we're trying to fix, which is that if there is too much to delete, it will time out and we'll never be able to complete that deletion...

Also, I need to investigate it in more detail, but that mechanism should be quite helpful for #2740: I'll spend some time working on adding support for fixing #2740 in this PR and will see if that works well.

@pm47 (Member) commented Sep 26, 2023

That wouldn't fix the issue we're trying to fix, which is that if there is too much to delete, it will time out and we'll never be able to complete that deletion...

Right, and we can tune the batch size to limit the impact on the database 👍

pm47 previously approved these changes Sep 26, 2023
@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from efd305f to b483c25 on September 28, 2023 13:33
@t-bast (Member, Author) commented Sep 28, 2023

I took this opportunity to also fix #2740, to make sure we only needed to update the DB versions once. I rebased so the commit history contains:

  • a first commit that introduces asynchronous clean-up for closed channels (which was previously reviewed)
  • a second commit that slightly changes the data model to also perform some clean-up on splice transaction confirmation

This is highly sensitive code, as deleting the wrong data can lead to loss of funds: please review very carefully!

@t-bast t-bast requested review from remyers and pm47 September 28, 2023 13:41
@remyers (Contributor) left a comment

The PR looks good to me! I just had a few questions and suggestions.

@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from b483c25 to 125eafc on October 4, 2023 12:39
remyers previously approved these changes Oct 5, 2023
@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from 125eafc to dbeaaab on November 10, 2023 10:56
@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from dbeaaab to 53dd2a0 on November 23, 2023 15:12
When a channel is closed, we can forget the data from historical HTLCs
sent and received through that channel (which is otherwise required to
punish cheating attempts by our peer).

We previously synchronously removed that data from the DB when the closing
transaction confirmed. However, this could create performance issues as
the `htlc_infos` table can be very large for busy nodes and many concurrent
writes may be happening at the same time.

We don't need to get rid of this data immediately: we only want to remove
it to avoid degrading the performance of active channels that read and
write to the `htlc_infos` table. We now mark channels as closed in a
dedicated table, and run a background actor that deletes batches of
obsolete htlc data at regular intervals. This ensures that the table is
eventually cleaned up, without impacting the performance of active
channels.

Fixes #2610 and #2702

When a splice transaction confirms, all the revoked commitment transactions
that only applied to the previous funding transaction cannot be published
anymore, because the previous funding output has already been spent.

We can thus forget all the historical HTLCs that were included in those
commitments, because we will never need to generate the corresponding
penalty transactions.

This ensures that the growth of our DB is bounded, and will shrink every
time a splice transaction is confirmed.

Fixes #2740

- use `forgetHtlcInfo` to avoid code duplication
- test exact number of DB entries cleaned up
- use commitment index mismatch in splice test
@t-bast t-bast force-pushed the asynchronous-htlc-info-clean-up branch from 53dd2a0 to d5f4e5d on January 18, 2024 13:02
@t-bast (Member, Author) commented Jan 18, 2024

Rebased.

remyers previously approved these changes Jan 18, 2024
@remyers (Contributor) left a comment

Rebase looks good to me!

While it requires more duplication in the codecs, because we didn't
version the commitments codec, it is cleaner to have this field in
the commitment instead of having it in the funding status.
@remyers (Contributor) commented Feb 14, 2024

What's the semantic difference between WatchFundingConfirmed and WatchFundingDeeplyBuried? Is WatchFundingConfirmed supposed to be triggered when the funding tx is buried deeply enough that it cannot be removed by a short reorg?

Perhaps we should only clean up htlcs when funding txs are "deeply buried", to prevent a situation where a short reorg creates a window during which our peer could publish a revoked commitment we no longer have htlc info for. It seems like this could occur when we are the funding initiator for a splice and wait for only a default minDepthFunder of 1 block. Am I missing something?

Edit: It looks like minDepthScaled will scale the minDepth of dual funding txs to prevent the risk of reorgs. NB: we should schedule updating the references from 6.25 to 3.125 soon. :-)

Edit2: minDepthScaled will not scale channels with "small amounts" of less than 16777216 sats, but that has become a significant amount of value today.

@t-bast (Member, Author) commented Feb 14, 2024

Perhaps we should only clean up htlcs when funding txs are "deeply buried" to prevent a situation where a short reorg creates a window when our peer could publish a revoked commitment we no longer have htlc info for.

This is a good point: we must ensure we have enough confirmations before cleaning up the revoked states. We cannot use WatchFundingDeeplyBuried as-is because it is only used for the initial channel open, in order to send channel_announcement. The number of confirmations we use for WatchFundingConfirmed is always at least 3, and scaled based on the channel size. At that point we prune the older funding transactions and the corresponding commitments (see `pruneCommitments()`), so we wouldn't be able to react to a pruned revoked commitment anyway.

If we feel that 3 confirmations isn't enough, that's a parameter that can already be increased in eclair.conf!

It seems like this could occur when we are the funding initiator for a splice and wait for only a default minDepthFunder of 1 block.

I don't see when that can happen, can you point me to that code?

@remyers (Contributor) commented Feb 14, 2024

If we feel that 3 confirmations isn't enough, that's a parameter that can already be increased in eclair.conf!

It seems like this could occur when we are the funding initiator for a splice and wait for only a default minDepthFunder of 1 block.

I don't see when that can happen, can you point me to that code?

My mistake, it appears only singleFundingMinDepth can result in a 1 block min depth. It does seem like 3 should be enough.

remyers previously approved these changes Feb 14, 2024
@remyers (Contributor) left a comment

LGTM!

@pm47 (Member) left a comment

LGTM, just some bikeshedding.

@t-bast t-bast merged commit f41bd22 into master Feb 14, 2024
1 check passed
@t-bast t-bast deleted the asynchronous-htlc-info-clean-up branch February 14, 2024 13:06
@fishcakeday commented:

When I tried to run the latest git commit that included this patch, I ran into issues with the channels DB version. It doesn't seem that the migration ran, or I might have missed something.

2024-02-15 07:40:42,275 INFO  fr.acinq.eclair.Setup - connecting to database with instanceId=219fc9f9-a46b-4683-ba4d-69a570896297
2024-02-15 07:40:42,314 WARN  c.z.hikari.HikariConfig - HikariPool-1 - idleTimeout has been set but has no effect because the pool is operating as a fixed size pool.
2024-02-15 07:40:42,321 INFO  c.z.h.HikariDataSource - HikariPool-1 - Starting...
2024-02-15 07:40:42,502 INFO  c.z.h.HikariDataSource - HikariPool-1 - Start completed.
2024-02-15 07:40:43,764 ERROR fr.acinq.eclair.Boot - fatal error: Unknown version of DB channels found, version=8
java.lang.RuntimeException: Unknown version of DB channels found, version=8
        at fr.acinq.eclair.db.pg.PgChannelsDb.$anonfun$new$2(PgChannelsDb.scala:157)
        at fr.acinq.eclair.db.pg.PgChannelsDb.$anonfun$new$2$adapted(PgChannelsDb.scala:51)
        at fr.acinq.eclair.db.jdbc.JdbcUtils.using(JdbcUtils.scala:52)
        at fr.acinq.eclair.db.jdbc.JdbcUtils.using$(JdbcUtils.scala:48)
        at fr.acinq.eclair.db.pg.PgUtils$.using(PgUtils.scala:33)
        at fr.acinq.eclair.db.pg.PgChannelsDb.$anonfun$new$1(PgChannelsDb.scala:51)
        at fr.acinq.eclair.db.pg.PgChannelsDb.$anonfun$new$1$adapted(PgChannelsDb.scala:50)
        at fr.acinq.eclair.db.pg.PgUtils$.inTransactionInternal(PgUtils.scala:265)
        at fr.acinq.eclair.db.pg.PgUtils$.$anonfun$inTransaction$1(PgUtils.scala:280)
        at fr.acinq.eclair.db.jdbc.JdbcUtils.withConnection(JdbcUtils.scala:37)
        at fr.acinq.eclair.db.jdbc.JdbcUtils.withConnection$(JdbcUtils.scala:34)
        at fr.acinq.eclair.db.pg.PgUtils$.withConnection(PgUtils.scala:33)
        at fr.acinq.eclair.db.pg.PgUtils$.inTransaction(PgUtils.scala:279)
        at fr.acinq.eclair.db.pg.PgChannelsDb.<init>(PgChannelsDb.scala:50)
        at fr.acinq.eclair.db.Databases$PostgresDatabases$.apply(Databases.scala:153)
        at fr.acinq.eclair.db.Databases$.postgres(Databases.scala:354)
        at fr.acinq.eclair.db.Databases$.init(Databases.scala:274)
        at fr.acinq.eclair.Setup.<init>(Setup.scala:204)
        at fr.acinq.eclair.Boot$.delayedEndpoint$fr$acinq$eclair$Boot$1(Boot.scala:44)
        at fr.acinq.eclair.Boot$delayedInit$body.apply(Boot.scala:34)
        at scala.Function0.apply$mcV$sp(Function0.scala:42)
        at scala.Function0.apply$mcV$sp$(Function0.scala:42)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
        at scala.App.$anonfun$main$1(App.scala:98)

2024-02-15 07:40:43,774 INFO  a.a.CoordinatedShutdown - Running CoordinatedShutdown with reason [JvmExitReason]
2024-02-15 07:40:43,789 INFO  f.a.eclair.db.Databases - cancelling the pg lock renew task...
2024-02-15 07:40:43,790 INFO  f.a.eclair.db.Databases - releasing the curent pg lock...
2024-02-15 07:40:46,812 INFO  f.a.eclair.db.Databases - closing the connection pool...
2024-02-15 07:40:46,813 INFO  c.z.h.HikariDataSource - HikariPool-1 - Shutdown initiated...
2024-02-15 07:40:46,832 INFO  c.z.h.HikariDataSource - HikariPool-1 - Shutdown completed.

@pm47 (Member) commented Feb 15, 2024

I ran into issues with the channels DB version. It doesn't seem that the migration ran

Thanks for reporting, fixed in #2824.

Development

Successfully merging this pull request may close these issues.

Closing channels in Switchboard Init can cause issues due to long running deletes on htlc_infos table.
7 participants