Asynchronously clean up obsolete HTLC info from DB #2705
Conversation
Force-pushed from df48b89 to b22d899.
Codecov Report

@@           Coverage Diff            @@
##           master    #2705    +/-   ##
=========================================
+ Coverage   85.85%   85.94%    +0.09%
=========================================
  Files         216      218        +2
  Lines       18201    18410      +209
  Branches      749      767       +18
=========================================
+ Hits        15626    15823      +197
- Misses       2575     2587       +12
+1, this happened to me again. It can even stop the node: the whole connection pool fills up with sessions doing the delete, the lease renewal then fails, and eclair shuts down.
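(Aside, not from the thread: one way to confirm that the pool really is being saturated by the cleanup delete is to look at Postgres' standard `pg_stat_activity` view; a minimal sketch, Postgres-only.)

```sql
-- Count sessions per state for the current database; a pile-up of 'active'
-- sessions all running the htlc_infos delete indicates pool exhaustion.
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state;
```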
I think the interval could be much smaller, e.g. 5 seconds. All that is important is that:
Or leave it at 5 minutes, and if the last run deleted some rows, schedule another run right away. Although at that rate it would delete 14 million rows a day, so it should keep up!
There's no "good" default value that fits all, since every node has a very different usage pattern here, but 5 seconds is most likely way too aggressive as a default. You should change the default for your specific node to match your usage; feel free to use 5 seconds if that's what makes sense for you.
Do you really need this extra table? I think it should be very efficient when there is nothing to delete, as there will only be thousands of channel ids to check. Also, I was thinking: the real bug is that the lease renewal / check process shares the same connection pool as the rest of the app, and this is what causes the crash. Should that not be on its own connection? There might be other scenarios in which it gets starved of a connection from the pool and causes eclair to exit.
This would be very inefficient. The
I don't think so. We should generally make sure all DB operations are fast, because they are synchronous operations. If they're all fast enough, we don't need to introduce any unnecessary complexity around the lease renewal process. Also, what you suggest only applies to postgres, while the current fix also handles sqlite.
I think the optimiser is smarter than that: it will get the list of distinct channel_ids from the htlc_infos and channels tables, combine them to find the channel_ids that should be deleted, then scan the index on htlc_infos.channel_id to find the rows to delete, if there are any. (Or at least I know the SAP ASE optimiser would; I have a lot less experience with the postgres optimiser.) And in the normal case of nothing to delete, it will be very fast, since it is just comparing two lists of a few thousand ids. Will try to do a test later with analyse to see...
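(For reference, a sketch of the kind of test mentioned: `EXPLAIN (ANALYZE)` on a DELETE actually executes it, so the usual trick is to wrap it in a transaction and roll back; the query shown is one possible form of the delete being discussed.)

```sql
-- Inspect the actual plan and timings without keeping the deletion.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM local.htlc_infos
WHERE channel_id NOT IN (SELECT channel_id FROM local.channels);
ROLLBACK;
```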
This simply cannot be more efficient than a dedicated table, because you'll be fetching information from a much larger table.
Seems you are correct: postgres cannot do a "loose index scan", so it does a complete table scan even when there is an index on the column. It takes 1 min on my 250m row table. Surprised this is missing, as the delete we are trying to run is a very common use case in a relational database, where you are deleting data from master and related tables.
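(For reference: a loose index scan can be emulated in Postgres with a recursive CTE; a sketch, assuming an index on `local.htlc_infos(channel_id)` and the table names used in this thread.)

```sql
-- Walk the index from one distinct channel_id to the next instead of
-- scanning every row of htlc_infos.
WITH RECURSIVE distinct_channels AS (
  SELECT min(channel_id) AS channel_id FROM local.htlc_infos
  UNION ALL
  SELECT (SELECT min(channel_id) FROM local.htlc_infos WHERE channel_id > d.channel_id)
  FROM distinct_channels d
  WHERE d.channel_id IS NOT NULL
)
SELECT channel_id FROM distinct_channels WHERE channel_id IS NOT NULL;
```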
If you are not going to migrate your node to another box any time soon, you can set
That lease stuff is just a safeguard against user errors, to prevent possible funds loss if multiple instances of eclair start updating the same channels in parallel. You don't have to use it all the time, though. Also, it seems to me it's safe to replace

One more thought about the safeguards. In addition to the lease, we could generate the Postgres user name based on a unique identifier of the machine, like a hash of the MAC address, e.g.
Force-pushed from b22d899 to efd305f.
How about a stateless:
DELETE FROM local.htlc_infos
WHERE channel_id IN (
SELECT channel_id FROM local.channels WHERE is_closed
)
And how about running this cleaning task once at startup (with a way to disable it) instead of periodically, so we don't have unpredictable performance hits on the database?
That wouldn't fix the issue we're trying to fix, which is that if there is too much to delete, it will time out and we'll never be able to complete that deletion... Also, I need to investigate it in more detail, but that mechanism should be quite helpful for #2740: I'll spend some time working on this to add support for fixing #2740 in this PR and will see if that works well.
Right, and we can tune the batch size to limit the impact on the database 👍
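(A possible batched form of the stateless delete above; a sketch only, not what the PR implements, and Postgres-specific since it relies on `ctid`. A sqlite variant would need `rowid` instead.)

```sql
-- Remove at most 10 000 obsolete rows per run, so a single statement never
-- holds a connection or locks for long; repeat until 0 rows are affected.
DELETE FROM local.htlc_infos
WHERE ctid IN (
  SELECT h.ctid
  FROM local.htlc_infos h
  JOIN local.channels c ON c.channel_id = h.channel_id
  WHERE c.is_closed
  LIMIT 10000
);
```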
Force-pushed from efd305f to b483c25.
I took this opportunity to also fix #2740, to make sure we only needed to update the DB versions once. I rebased so the commit history contains:
This is highly sensitive code, as deleting the wrong data can lead to loss of funds: please review very carefully!
The PR looks good to me! I just had a few questions and suggestions.
Review comments (resolved) on:
- eclair-core/src/main/scala/fr/acinq/eclair/db/pg/PgChannelsDb.scala
- eclair-core/src/test/scala/fr/acinq/eclair/channel/states/e/NormalSplicesStateSpec.scala
- eclair-core/src/test/scala/fr/acinq/eclair/db/ChannelsDbSpec.scala
- eclair-core/src/main/scala/fr/acinq/eclair/wire/internal/channel/version4/ChannelCodecs4.scala
Force-pushed from b483c25 to 125eafc.
Force-pushed from 125eafc to dbeaaab.
Force-pushed from dbeaaab to 53dd2a0.
When a channel is closed, we can forget the data from historical HTLCs sent and received through that channel (which is otherwise required to punish cheating attempts by our peer). We previously synchronously removed that data from the DB when the closing transaction confirmed. However, this could create performance issues as the `htlc_infos` table can be very large for busy nodes and many concurrent writes may be happening at the same time. We don't need to get rid of this data immediately: we only want to remove it to avoid degrading the performance of active channels that read and write to the `htlc_infos` table. We now mark channels as closed in a dedicated table, and run a background actor that deletes batches of obsolete htlc data at regular intervals. This ensures that the table is eventually cleaned up, without impacting the performance of active channels. Fixes #2610 and #2702
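(An illustrative sketch of the approach described above; the table and column names here are hypothetical, not necessarily those used by the PR.)

```sql
-- Channels whose htlc data became obsolete are recorded here when the
-- closing transaction confirms, instead of deleting the data immediately.
CREATE TABLE IF NOT EXISTS local.obsolete_htlc_channels (
  channel_id TEXT NOT NULL PRIMARY KEY
);

-- The background actor then deletes one bounded batch at each interval:
DELETE FROM local.htlc_infos
WHERE ctid IN (
  SELECT h.ctid
  FROM local.htlc_infos h
  JOIN local.obsolete_htlc_channels o ON o.channel_id = h.channel_id
  LIMIT 50000
);

-- Once a channel has no htlc_infos rows left, its marker can be dropped:
DELETE FROM local.obsolete_htlc_channels o
WHERE NOT EXISTS (
  SELECT 1 FROM local.htlc_infos h WHERE h.channel_id = o.channel_id
);
```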
When a splice transaction confirms, all the revoked commitment transactions that only applied to the previous funding transaction cannot be published anymore, because the previous funding output has already been spent. We can thus forget all the historical HTLCs that were included in those commitments, because we will never need to generate the corresponding penalty transactions. This ensures that the growth of our DB is bounded, and will shrink every time a splice transaction is confirmed. Fixes #2740
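(A minimal SQL sketch of what "forgetting" those HTLCs amounts to, assuming `htlc_infos` rows are keyed by `(channel_id, commitment_number)`; the literal values are placeholders, and the PR routes this through the background cleaner rather than an immediate delete.)

```sql
-- After the splice confirms, commitments older than the splice can never be
-- published, so their htlc data can be dropped for this channel.
DELETE FROM local.htlc_infos
WHERE channel_id = '0123abcd'    -- placeholder channel id
  AND commitment_number < 1234;  -- commitment index at the splice (placeholder)
```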
- use `forgetHtlcInfo` to avoid code duplication
- test exact number of DB entries cleaned up
- use commitment index mismatch in splice test
Force-pushed from 53dd2a0 to d5f4e5d.
Rebased.
Rebase looks good to me!
Review comment (resolved) on:
- eclair-core/src/main/scala/fr/acinq/eclair/channel/ChannelData.scala
While it requires more duplication in the codecs, because we didn't version the commitments codec, it is cleaner to have this field in the commitment instead of having it in the funding status.
What's the semantic difference between

Perhaps we should only clean up htlcs when funding txs are "deeply buried", to prevent a situation where a short reorg creates a window in which our peer could publish a revoked commitment we no longer have htlc info for. It seems like this could occur when we are the funding initiator for a splice and wait for only a default

Edit: It looks like

Edit2:
This is a good point: we must ensure we have enough confirmations before cleaning up the revoked states. We cannot use
If we feel that 3 confirmations isn't enough, that's a parameter that can already be increased in
I don't see when that can happen; can you point me to that code?
My mistake, it appears only
LGTM!
LGTM, just some bikeshedding.
Review comment (resolved) on:
- eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/CommonFundingHandlers.scala
When I tried to run the latest git commit that included this patch, I ran into issues with the channels DB version. It doesn't seem that the migration ran, or I might have missed something.
Thanks for reporting, fixed in #2824.
Fixes a regression caused by #2705, reported by @fishcakeday.
When a channel is closed, we can forget the data from historical HTLCs sent and received through that channel (which is otherwise required to punish cheating attempts by our peer). When a splice transaction confirms, we can also forget the data from revoked commitment transactions that cannot be published anymore.

We previously synchronously removed that data from the DB when a closing transaction confirmed. However, this could create performance issues as the `htlc_infos` table can be very large for busy nodes and many concurrent writes may be happening at the same time.

We don't need to get rid of this data immediately: we only want to remove it to avoid degrading the performance of active channels that read and write to the `htlc_infos` table. We now mark channels as closed in a dedicated table, and run a background actor that deletes batches of obsolete htlc data at regular intervals. This ensures that the table is eventually cleaned up, without impacting the performance of active channels.

Fixes #2610, #2702 and #2740