routing+htlcswitch: fix stuck inflight payments #9150
Conversation
Hi YY, made a quick pass through the change set. Looks pretty cool to me! Had a couple of questions to make sure I understand the approach, and also to check whether the modification of GetAttemptResult is strictly necessary for this bug fix, as it may have implications (albeit slight) on how the ChannelRouter can be used remotely.
routing/router.go
Outdated
@@ -123,7 +123,8 @@ type PaymentAttemptDispatcher interface {
	// longer be in flight. The switch shutting down is signaled by
	// closing the channel. If the attemptID is unknown,
	// ErrPaymentIDNotFound will be returned.
	GetAttemptResult(attemptID uint64, paymentHash lntypes.Hash,
	GetAttemptResult(attempt *channeldb.HTLCAttempt,
IIUC, this PR moves to transporting the hop + session key information across the interface boundary by way of the channeldb.HTLCAttempt type. This is what allows the forwarding error decryption to be performed one layer deeper, within the Switch rather than within the ChannelRouter. This may have implications for the ability to support oblivious payments by providing optionality in server- versus client-side decryption of forwarding errors.
Maybe this is too forward-looking to be relevant, but such an opportunity is a bit more natural when you consider alternative deployment scenarios in which ChannelRouter and Switch run remotely in different processes - perhaps both controlled by the same entity (e.g. a gRPC reverse proxy making multiple nodes act as a single logical node), or controlled by different parties in the case where the client would like to build onions and send payments in a way that is private from a hosted node infrastructure provider.
Whether forwarding error decryption should be conducted server/Switch side or by the caller of GetAttemptResult could previously be signaled by the presence of the deobfuscator:
// extractResult uses the given deobfuscator to extract the payment result from
// the given network message.
func (s *Switch) extractResult(deobfuscator ErrorDecrypter, n *networkResult,
attemptID uint64, paymentHash lntypes.Hash) (*PaymentResult, error) {
...
// If the caller did not provide a deobfuscator, then we'll
// return the onion-encrypted blob that details why the HTLC was
// failed. This blob is only fully decryptable by the entity
// which built the onion packet.
if deobfuscator == nil {
return &PaymentResult{
EncryptedError: htlc.Reason,
}, nil
}
I could imagine a similar approach to optional Switch error decryption by constructing an HTLCAttempt object which does not set the fields needed to perform forwarding error decryption in the Switch, but the attempt.Circuit() call would have to tolerate that information being absent so the Switch could instead return the encrypted error blob via the result.
In our case, we control both the remote ChannelRouter and the lnd/Switch instances, so whether the forwarding error decryption occurs in the router or switch isn't a huge concern since privacy across the RPC boundary is not relevant.
From the perspective of a router client controlled separately from the entity operating the lnd/Switch instances: with the removal of any ChannelRouter error decryption logic, the client application could not re-use the ChannelRouter type, but it could at least in principle provide a different implementation of a path-finding, onion-building, and payment-lifecycle-manager component that does client-side decryption of forwarding errors.
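For concreteness, a minimal standalone sketch of that idea, with purely illustrative types and field names (not lnd's actual HTLCAttempt or PaymentResult): if the attempt lacks the session key/hops, the Switch-side extraction falls back to returning the encrypted blob to the caller.

package sketch

// Illustrative stand-ins for the real lnd types; this is not lnd's API.
type htlcAttempt struct {
	sessionKey []byte   // nil means the caller keeps the key private
	hops       []string // empty means the caller keeps the route private
}

type paymentResult struct {
	decryptedErr   error
	encryptedError []byte
}

// extractResultSketch decrypts the forwarding error only when the attempt
// carries the needed material; otherwise it hands the onion-encrypted blob
// back to the caller (e.g. a remote ChannelRouter doing client-side
// decryption).
func extractResultSketch(a *htlcAttempt, reason []byte,
	decrypt func([]byte) error) *paymentResult {

	if a.sessionKey == nil || len(a.hops) == 0 {
		return &paymentResult{encryptedError: reason}
	}

	return &paymentResult{decryptedErr: decrypt(reason)}
}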
Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way of performance or more logical code separation by burying the decryption of errors within the Switch?
Maybe we like the improvements this brings better than the (currently somewhat hypothetical) ability to allow ChannelRouter re-use in clients seeking to maintain privacy from hosted node providers. One argument for keeping the error decryption logic in the ChannelRouter could be that it creates the onion, after all, so it will still need to handle or know about the hops + session key.
Not sure I followed - if you wanna create onions then you should use SendToOnions?
Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way of performance or more logical code separation by burying the decryption of errors within the Switch?
Nope, it's not needed but an improvement - an easy way to measure it is to run the payment benchmark, which improves a bit. This is more of a logical code separation, since there's no need to re-create the error decryptor if we know it's not a fail but a settle.
One argument for keeping the error decryption logic in the ChannelRouter could be that it creates the onion, after all, so it will still need to handle or know about the hops + session key.
I don't think that's the case - all paymentLifecycle needs is to call RequestRoute and pass the route to the htlcswitch.
Anyway, I removed this change to limit the scope of this PR; I think we'll take a look at it again when working on #8834.
result *htlcswitch.PaymentResult) bool {

	// Save the keys so we know which items to delete from the map.
	attempts = append(attempts, a)
Should we only mark an attempt result for deletion if it is actually handled successfully, without error, by the call to handleAttemptResult below? If a result is not deleted from the map, it will be reprocessed when this function is called again on the next life-cycle loop iteration. On one hand retries sound good, but in this context maybe that risks a perpetually growing result map if certain results are borked and never able to be processed successfully 🤔
yeah I thought about this too - since the only error we get here is DB-related, I don't think retrying would help. This is also the current behavior. But I guess some logs would help, so will add!
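For illustration, a small sketch of the alternative raised above, with hypothetical names: only delete a result from the map once it was handled without error, and log the ones that keep failing so a permanently stuck entry is at least visible.

package sketch

import "log"

type switchResult struct{ attemptID uint64 }

// processResultsSketch handles collected switch results and only deletes the
// entries that were handled without error, so failed ones are retried on the
// next lifecycle iteration.
func processResultsSketch(results map[uint64]*switchResult,
	handle func(*switchResult) error) {

	for id, res := range results {
		if err := handle(res); err != nil {
			// Keep the entry for a retry, but make the failure
			// visible in the logs.
			log.Printf("result for attempt %d not handled: %v",
				id, err)

			continue
		}

		delete(results, id)
	}
}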
abd6e94 to 930fadf
930fadf to b631432
Nice refactor bringing the result processing into one goroutine.
I might be wrong here but afaict we also need to make sure we cancel the trackpayment subscription when canceling the payment. Right now, when encountering a switch result which would trigger a payment cancelation, we would stop all the result collectors, never being able to resolve still-inflight HTLCs (thinking of the MPP_Timeout). We would only do this after restart, so I think before canceling a payment we should make sure all htlcs are resolved, or at least also abort the trackpayment request, because it has no chance of resolving inflight htlcs once the payment_lifecycle got canceled.
routing/payment_lifecycle.go
Outdated

// refreshPayment returns the latest payment found in the control tower after
// all its attempt results are processed.
func (p *paymentLifecycle) refreshPayment() (DBMPPayment,
Nit: In this commit it does not really look like we are refreshing something, just fetching the payment; it only makes sense when looking into the following commits.
I think reading from the db to get the latest state is by definition "refreshing".
routing/payment_lifecycle.go
Outdated
}

// We successfully got a payment result back from the switch.
log.Debugf("Payment %v succeeded with pid=%v", p.identifier,
Nit: Payment => HTLC attempt succeeded...? Or are we assuming that as soon as one attempt succeeds the whole payment succeeds because the preimage was revealed? But can we be sure in an MPP case that the whole amount was settled?
yeah nice call, must be a leftover and now updated - and we always check the inflight htlcs before considering the payment as finalized or not.
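A simplified sketch (hypothetical fields, not channeldb types) of what "check the inflight htlcs before considering the payment as finalized" amounts to: a payment only counts as settled once nothing is in flight and the settled shards cover the full amount.

package sketch

type attemptState struct {
	inFlight  bool
	settled   bool
	amountSat int64
}

// paymentSettledSketch reports whether an MPP payment can be considered
// settled: every attempt must be resolved and the settled shards must add up
// to the total payment amount.
func paymentSettledSketch(attempts []attemptState, totalSat int64) bool {
	var settledSat int64
	for _, a := range attempts {
		if a.inFlight {
			return false
		}
		if a.settled {
			settledSat += a.amountSat
		}
	}

	return settledSat >= totalSat
}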
routing/payment_lifecycle.go
Outdated
func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
	result *htlcswitch.PaymentResult) (*attemptResult, error) {

	// In case of a payment failure, fail the attempt with the control
I think this can fail not only the attempt but the whole payment, so we should maybe be more detailed here in the comment.
updated the comments.
routing/payment_lifecycle.go
Outdated
	return p.handleSwitchErr(attempt, result.Error)
}

// We successfully got a payment result back from the switch.
Nit: not only a result but a positive result; the case where the payment failed is handled in handleSwitchErr?
updated the comments
I might be wrong here but afaict we also need to make sure we cancel the trackpayment subscription when canceling the payment.
Checked the trackpayment impl and realized we do send all the updates and optionally terminate the subscription here (maybe we need to take a second look at the subscribe-all case):
Lines 416 to 418 in c37baa6
if terminal {
	close(subscriber.queue.ChanIn())
}
routing/payment_lifecycle.go
Outdated

// refreshPayment returns the latest payment found in the control tower after
// all its attempt results are processed.
func (p *paymentLifecycle) refreshPayment() (DBMPPayment,
yeah, will rename it to be more verbose.
routing/payment_lifecycle.go
Outdated
func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
	result *htlcswitch.PaymentResult) (*attemptResult, error) {

	// In case of a payment failure, fail the attempt with the control
updated the comments.
routing/payment_lifecycle.go
Outdated
	return p.handleSwitchErr(attempt, result.Error)
}

// We successfully got a payment result back from the switch.
updated the comments
routing/payment_lifecycle.go
Outdated
}

// We successfully got a payment result back from the switch.
log.Debugf("Payment %v succeeded with pid=%v", p.identifier,
yeah nice call, must be leftover and now updated - and we always check the inflight htlcs before considering the payment as finalized or not.
b631432 to 7fa1e4f
because htlcs are still in flight, the payment state is not terminated so we will not close the channel here, given the fact that we still have
I think that's not the case anymore since we fixed the stuck payment here? Also, even if the stuckness wasn't fixed before, trackpayment IMO is behaving correctly because it shouldn't quit unless the payment is terminated.
So I analysed our payment behaviour and found out that we would fail a payment without waiting for all result collectors. This would also lead to the case where a trackPayment subscriber would hang indefinitely. This is now solved by waiting for all results and updating the payment db. Let me know if this approach is good @yyforyongyu. Thoughts about the current change: the reason why we should wait for all shards to resolve is the following. If we launch an MPP payment, for example, and receive the first error for one of the shards, we would mark the payment as failed, closing all the collectors in the process. This means our payment never has the chance to resolve, and the trackPayment subscribers will wait indefinitely because the payment never reached a terminal condition (still inflight payments).
d5f5a97 to 30f5826
30f5826 to 686acb9
The assessment is correct, however I don't think we want to add another state to decide whether we should exit the payment lifecycle or not, instead we should stick to
I did try to write an itest for this case but failed. Could you elaborate on this case? Also I'd like to limit the scope of this PR to focus on fixing this case only, as stated in the issue description. Given the complexity of the payment lifecycle, I think it's better to focus on one issue at a time, and open PRs for other fixes. Moving forward, I think the
lnd/routing/payment_lifecycle.go Lines 897 to 903 in 22ae3e5
We need to revisit the errors returned from the link and decide which error causes an attempt failure and which causes a payment failure. I suspect we are prematurely marking the payment as failed when we should instead mark the attempt as failed.
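To make the distinction concrete, a rough sketch with placeholder error values (not lnd's actual link errors): classify each switch failure as either payment-level (terminate the whole payment) or attempt-level (fail only the HTLC and try another shard).

package sketch

import "errors"

// Placeholder errors standing in for real link/switch failures.
var (
	errIncorrectPaymentDetails = errors.New("incorrect payment details")
	errTemporaryChannelFailure = errors.New("temporary channel failure")
)

// failureScopeSketch classifies a switch error: payment-level failures should
// terminate the whole payment, attempt-level failures should only fail the
// HTLC attempt so another shard can be tried.
func failureScopeSketch(err error) string {
	switch {
	case errors.Is(err, errIncorrectPaymentDetails):
		return "payment"
	case errors.Is(err, errTemporaryChannelFailure):
		return "attempt"
	default:
		// Be conservative with unknown errors and only fail the
		// attempt, rather than prematurely failing the payment.
		return "attempt"
	}
}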
routing/payment_lifecycle.go
Outdated
log.Errorf("Failed to find a route for payment %v: %v",
	p.identifier, err)

// The payment is now marked as failed, we'll continue
I am not sure if this comment holds tho, because in the requestRoute func we would not necessarily fail the payment, depending on the error:
var routeErr noRouteError
if !errors.As(err, &routeErr) {
return nil, fmt.Errorf("requestRoute got: %w", err)
}
oh nice yeah it was handled there already, one less commit
Not sure, I think we need to keep this exception: imagine we launch a payment and need to shard it, so we send one shard successfully, then try the next shard and run out of routes? We would then close the payment_lifecycle and remove all the result collectors again (for the first shard, which will then never have the chance to get resolved).
if it's a no-route error we'd return a nil error and continue the lifecycle due to the nil route below.
hmm, aren't we returning an error when it's not a NoRoute err?:
return nil, fmt.Errorf("requestRoute got: %w", err)
in that case we also wanna wait for all payments to resolve tho.
in that case something seriously wrong happened and we should exit; of course it relies on the API to promise what it does - if not, we need to fix it in the router. Otherwise, we will end up in an infinite loop here because the payment is not failed and we will keep sending new attempts.
Good point, makes sense to exit. My only concern here is then that we are closing the result collectors, which is unfortunate.
as long as the router and switch are active we probably should keep them alive.
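To summarize the flow debated in this thread, a condensed sketch (names and the string route are purely illustrative): a no-route failure marks the payment failed and returns an empty route so the loop keeps running and drains inflight attempts, while any other error is propagated and ends the lifecycle.

package sketch

import (
	"errors"
	"fmt"
)

var errNoRoute = errors.New("no route found")

// requestRouteSketch: on a no-route error the payment is marked failed and an
// empty route is returned, so the caller skips launching a new attempt and
// keeps waiting for inflight results; any other error is returned as-is.
func requestRouteSketch(findRoute func() (string, error),
	failPayment func(reason string)) (string, error) {

	rt, err := findRoute()
	if err == nil {
		return rt, nil
	}

	if !errors.Is(err, errNoRoute) {
		return "", fmt.Errorf("requestRoute got: %w", err)
	}

	// Mark the payment failed but keep the lifecycle alive so existing
	// attempts can still resolve.
	failPayment("no route")

	return "", nil
}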
Looking at your proposal it looks like it might solve the issue, because we would always wait in the
5439c3f to 686acb9
I actually think it's quite the opposite - if we are using another counter and read the chan, it creates more questions, e.g., what if there are multiple results being sent to it. That being said, I'm considering removing it too as it's hacky. As for the edge case, say a payment made of three HTLCs,
If we can confirm that step 2 indeed happens, then we can provide a solid fix to make sure we don't fail the payment prematurely, and then we don't need the hack. On the other hand, by using the counter approach, we won't notice this issue and won't attempt a fix.
could you elaborate why we would block at ...? Yeah, I think your example might happen, not sure how often tho, but it seems reasonable to assume. My current concern, and maybe you can jump in and mitigate this case, is the following: Our ... So I am not sure here, maybe we should really invest the time and find some proper solution rather than get this fix in. Maybe my concerns are overrated and we never run into those edge cases that often anymore, given the fact that we now sync the payment in one loop ...
I also don't know how it could happen😂, just making assumptions. If
now that you mention it, I think the fix is to notify it in the payment lifecycle, something like this.
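The commit linked as "something like this" isn't reproduced here, but the general pattern could be a non-blocking notification onto a channel the lifecycle loop selects on (a sketch with hypothetical names):

package sketch

// notifyResultCollected signals the lifecycle loop that a new switch result
// is ready without ever blocking the collector goroutine.
func notifyResultCollected(resultCollected chan struct{}) {
	select {
	case resultCollected <- struct{}{}:
	default:
		// A notification is already pending; the loop will drain all
		// buffered results when it wakes up, so dropping is fine.
	}
}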
I think this can also be a solution. Not super happy with it, because we kinda prevent the payment from being marked as failed until a restart, but at least the subscriber does not hang indefinitely. Maybe this issue should be taken care of in the payment SQL refactor I am doing, and let's keep a TODO here? However @hieblmi might need this feature for his static address functionality. I think we have all the cards on the table, just need to decide which direction to go. (cc @saubyk I think we need to discuss this one after the next standup)
!lightninglabs-deploy mute 336h
I think we are very close; the only thing we should do is always mark a payment as failed in resumePayment and wait for all active resolvers. This makes sure we do not start any more attempts but still collect all ongoing htlc attempts, and in the process also signal this to the trackPayment command.
So basically in
exitWithErr := func(err error) ([32]byte, *route.Route, error) {
log.Errorf("Payment %v with status=%v failed: %v",
p.identifier, payment.GetStatus(), err)
return [32]byte{}, nil, err
}
just mark the payment as failed and continue until all resolvers are finished, which will then exit this process.
Even when having a timeout set for the payment, this timeout should just set the payment as failed, and then when all attempts are resolved we would exit via:
if !wait {
return stepExit, nil
}
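A small sketch of the step decision suggested here (simplified, hypothetical names): once the payment is failed or timed out, stop launching attempts but keep waiting until every inflight attempt has resolved, and only then exit.

package sketch

type lifecycleStep int

const (
	stepExit lifecycleStep = iota
	stepWait
	stepLaunchAttempt
)

// decideNextStepSketch: a failed (or timed-out) payment launches no new
// attempts, but the lifecycle keeps waiting until every inflight attempt has
// resolved, and only then exits and notifies trackPayment subscribers.
func decideNextStepSketch(failed bool, numInflight int) lifecycleStep {
	switch {
	case failed && numInflight == 0:
		return stepExit
	case failed:
		return stepWait
	default:
		return stepLaunchAttempt
	}
}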
// reloadInflightAttempts is called when the payment lifecycle is resumed after
// a restart. It reloads all inflight attempts from the control tower and
// collects the results of the attempts that have been sent before.
I was thinking it is already implicitly described by inflight payments, but also ok if you keep it as is.
return stepExit, err
}

case <-p.resultCollected:
Nit: Add a comment that here we are waiting for the result from the switch.
log.Errorf("Error reporting payment success to mc: %v", err)
}

// In case of success we atomically store settle result to the DB move
Nit: store settle results to the DB and move ...
@@ -803,8 +803,8 @@ func (p *paymentLifecycle) handleSwitchErr(attempt *channeldb.HTLCAttempt,
	// case we can safely send a new payment attempt, and wait for its
	// result to be available.
	if errors.Is(sendErr, htlcswitch.ErrPaymentIDNotFound) {
I think we should differentiate between the payment not being found in the network results versus in the attempts; these are imo two different things:
lnd/htlcswitch/payment_result.go
Lines 235 to 255 in 6694002

func fetchResult(tx kvdb.RTx, pid uint64) (*networkResult, error) {
	var attemptIDBytes [8]byte
	binary.BigEndian.PutUint64(attemptIDBytes[:], pid)

	networkResults := tx.ReadBucket(networkResultStoreBucketKey)
	if networkResults == nil {
		return nil, ErrPaymentIDNotFound
	}

	// Check whether a result is already available.
	resultBytes := networkResults.Get(attemptIDBytes[:])
	if resultBytes == nil {
		return nil, ErrPaymentIDNotFound
	}

	// Decode the result we found.
	r := bytes.NewReader(resultBytes)

	return deserializeNetworkResult(r)
}
wdyt ?
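One way to express that differentiation, sketched with a hypothetical second sentinel (not an existing lnd error) and a plain map standing in for the kvdb bucket: distinguish "the result store does not exist yet" from "no result stored for this attempt ID".

package sketch

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels splitting what is currently a single
// ErrPaymentIDNotFound into its two underlying causes.
var (
	errResultStoreMissing = errors.New("network result store not found")
	errAttemptIDNotFound  = errors.New("no result for attempt ID")
)

func fetchResultSketch(store map[uint64][]byte, pid uint64) ([]byte, error) {
	if store == nil {
		return nil, errResultStoreMissing
	}

	res, ok := store[pid]
	if !ok {
		return nil, fmt.Errorf("attempt %d: %w", pid,
			errAttemptIDNotFound)
	}

	return res, nil
}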
@@ -262,6 +262,16 @@ lifecycle:
	// We found a route to try, create a new HTLC attempt to try.
	attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
	if err != nil {
		// If the error is due to we cannot register another
I think in neither case - when receiving an error here or in requestRoute - should we just exit the payment without marking it as failed and without collecting all results. So what we should do is: every error that occurs should fail the payment, and we wait for all collectors to resolve.
This commit caches the creation of the sphinx circuit and onion blob to avoid re-creating them.
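A tiny sketch of that kind of caching (hypothetical struct, not the real HTLCAttempt): build the onion blob once and reuse it on subsequent sends instead of reconstructing the sphinx circuit.

package sketch

// cachedAttempt lazily builds its onion blob once and reuses it afterwards.
type cachedAttempt struct {
	onionBlob  []byte
	buildOnion func() []byte
}

// onion returns the cached onion blob, constructing it on first use only.
func (a *cachedAttempt) onion() []byte {
	if a.onionBlob == nil {
		a.onionBlob = a.buildOnion()
	}

	return a.onionBlob
}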
To shorten the method `resumePayment` and make each step clearer.
To further shorten the lifecycle loop.
This commit refactors `collectResultAsync` such that this method is now only responsible for collecting results from the switch. A new method `processSwitchResults` is added to process these results in the same goroutine where we fetch the payment from the db, to make sure the lifecycle loop always has a consistent view of a given payment.
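A rough sketch of that pattern without any lnd types: collector goroutines only forward results onto a channel, and a single loop goroutine processes each result and then re-reads the payment state, so the lifecycle always acts on one consistent view.

package sketch

type switchResult struct{ attemptID uint64 }

// collectAsyncSketch only forwards a switch result to the lifecycle loop; all
// processing happens in the loop goroutine.
func collectAsyncSketch(results chan<- switchResult, r switchResult) {
	results <- r
}

// runLifecycleSketch is the single goroutine that processes results and then
// refreshes the payment from the DB, so no other goroutine mutates the view
// the loop is acting on.
func runLifecycleSketch(results <-chan switchResult,
	process func(switchResult), refreshPayment func()) {

	for r := range results {
		process(r)
		refreshPayment()
	}
}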
686acb9 to 5b655b1
@yyforyongyu, remember to re-request review from reviewers when ready
Fix #8975
This PR aims at fixing a scenario where the payment could be stuck. Major changes are:
- resumePayment
- update_fail_htlc

TODOs:
- htlcswitch
- routing