routing+htlcswitch: fix stuck inflight payments #9150
Conversation
Hi YY, made a quick pass through the change set. Looks pretty cool to me! Had a couple of questions to make sure I understand the approach, and also to check whether the modification of GetAttemptResult is strictly necessary for this bug fix, as it may have implications (albeit slight) on how the ChannelRouter can be used remotely.
routing/router.go
Outdated
@@ -123,7 +123,8 @@ type PaymentAttemptDispatcher interface {
	// longer be in flight. The switch shutting down is signaled by
	// closing the channel. If the attemptID is unknown,
	// ErrPaymentIDNotFound will be returned.
	GetAttemptResult(attemptID uint64, paymentHash lntypes.Hash,
	GetAttemptResult(attempt *channeldb.HTLCAttempt,
IIUC, this PR moves to transporting the hop + session key information across the interface boundary by way of the channeldb.HTLCAttempt type. This is what allows the forwarding error decryption to be performed one layer deeper, within the Switch rather than within the ChannelRouter. This may have implications for the ability to support oblivious payments by providing optionality in server- versus client-side decryption of forwarding errors.
Maybe this is too forward-looking to be relevant, but such an opportunity is a bit more natural when you consider alternative deployment scenarios in which ChannelRouter and Switch run remotely in different processes - perhaps both controlled by the same entity (e.g. a gRPC reverse proxy making multiple nodes act as a single logical node), or controlled by different parties in the case where the client would like to build onions and send payments in a way that is private from a hosted node infrastructure provider.
Whether forwarding error decryption should be conducted server/Switch side or by the caller of GetAttemptResult could previously be signaled by the presence of the deobfuscator:
// extractResult uses the given deobfuscator to extract the payment result from
// the given network message.
func (s *Switch) extractResult(deobfuscator ErrorDecrypter, n *networkResult,
attemptID uint64, paymentHash lntypes.Hash) (*PaymentResult, error) {
...
// If the caller did not provide a deobfuscator, then we'll
// return the onion-encrypted blob that details why the HTLC was
// failed. This blob is only fully decryptable by the entity
// which built the onion packet.
if deobfuscator == nil {
return &PaymentResult{
EncryptedError: htlc.Reason,
}, nil
}
I could imagine a similar approach to optional Switch error decryption by constructing an HTLCAttempt object which does not set the fields needed to perform forwarding error decryption in the Switch, but the attempt.Circuit() call would have to tolerate that information being absent so the Switch could instead return the encrypted error blob via the result.
In our case, we control both the remote ChannelRouter and the lnd/Switch instances, so whether the forwarding error decryption occurs in the router or switch isn't a huge concern since privacy across the RPC boundary is not relevant.
From the perspective of a router client controlled separately from the entity operating the lnd/Switch instances: with the removal of any ChannelRouter error decryption logic, the client application could not re-use the ChannelRouter type, but it could at least in principle provide a different implementation of a path-finding, onion-building, and payment-lifecycle-manager component that does client-side decryption of forwarding errors.
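For concreteness, a minimal standalone sketch of that idea, with purely illustrative types and field names (not lnd's actual HTLCAttempt or PaymentResult): if the attempt lacks the session key/hops, the Switch-side extraction falls back to returning the encrypted blob to the caller.

package sketch

// Illustrative stand-ins for the real lnd types; this is not lnd's API.
type htlcAttempt struct {
	sessionKey []byte   // nil means the caller keeps the key private
	hops       []string // empty means the caller keeps the route private
}

type paymentResult struct {
	decryptedErr   error
	encryptedError []byte
}

// extractResultSketch decrypts the forwarding error only when the attempt
// carries the needed material; otherwise it hands the onion-encrypted blob
// back to the caller (e.g. a remote ChannelRouter doing client-side
// decryption).
func extractResultSketch(a *htlcAttempt, reason []byte,
	decrypt func([]byte) error) *paymentResult {

	if a.sessionKey == nil || len(a.hops) == 0 {
		return &paymentResult{encryptedError: reason}
	}

	return &paymentResult{decryptedErr: decrypt(reason)}
}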
Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way of performance or more logical code separation by burying the decryption of errors within the Switch?
Maybe we like the improvements this brings better than the (currently somewhat hypothetical) ability to allow ChannelRouter re-use in clients seeking to maintain privacy from hosted node providers. One argument for keeping the error decryption logic in the ChannelRouter could be that it creates the onion, after all, so it will still need to handle or know about the hops + session key.
Not sure I followed - if you wanna create onions then you should use SendToOnions?
Is the optimization on sphinx circuit creation needed to fix the bug? Do we gain much by way of performance or more logical code separation by burying the decryption of errors within the Switch?
Nope, it's not needed but an improvement - an easy way to measure it is to run the payment benchmark, which improves a bit. This is more of a logical code separation, since there's no need to re-create the error decryptor if we know it's not a fail but a settle.
One argument for keeping the error decryption logic in the ChannelRouter could be that it creates the onion, after all, so it will still need to handle or know about the hops + session key.
I don't think that's the case - all paymentLifecycle needs is to call RequestRoute and pass the route to the htlcswitch.
Anyway, I removed this change to limit the scope of this PR; I think we'll take a look at it again when working on #8834.
result *htlcswitch.PaymentResult) bool {

	// Save the keys so we know which items to delete from the map.
	attempts = append(attempts, a)
Should we only mark an attempt result for deletion if it is actually handled successfully, without error, by the call to handleAttemptResult below? If a result is not deleted from the map, it will be reprocessed when this function is called again on the next life-cycle loop iteration. On one hand retries sound good, but in this context maybe that risks a perpetually growing result map if certain results are borked and never able to be processed successfully 🤔
yeah I thought about this too - since the only error we get here is DB-related, I don't think retrying would help. This is also the current behavior. But I guess some logs would help, so will add!
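For illustration, a small sketch of the alternative raised above, with hypothetical names: only delete a result from the map once it was handled without error, and log the ones that keep failing so a permanently stuck entry is at least visible.

package sketch

import "log"

type switchResult struct{ attemptID uint64 }

// processResultsSketch handles collected switch results and only deletes the
// entries that were handled without error, so failed ones are retried on the
// next lifecycle iteration.
func processResultsSketch(results map[uint64]*switchResult,
	handle func(*switchResult) error) {

	for id, res := range results {
		if err := handle(res); err != nil {
			// Keep the entry for a retry, but make the failure
			// visible in the logs.
			log.Printf("result for attempt %d not handled: %v",
				id, err)

			continue
		}

		delete(results, id)
	}
}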
abd6e94 to 930fadf
930fadf to b631432
Nice refactor bringing the result processing into one goroutine.
I might be wrong here but afaict we also need to make sure we cancel the trackpayment subscription when canceling the payment. Right now, when encountering a switch result which would trigger a payment cancelation, we would stop all the result collectors, never being able to resolve still-inflight HTLCs (thinking of the MPP_Timeout). We would only do this after restart, so I think before canceling a payment we should make sure all htlcs are resolved, or at least also abort the trackpayment request, because it has no chance of resolving inflight htlcs once the payment_lifecycle got canceled.
routing/payment_lifecycle.go
Outdated

// refreshPayment returns the latest payment found in the control tower after
// all its attempt results are processed.
func (p *paymentLifecycle) refreshPayment() (DBMPPayment,
Nit: In this commit it does not really look like we are refreshing something, just fetching the payment; it only makes sense when looking into the following commits.
I think reading from the db to get the latest state is by definition "refreshing".
routing/payment_lifecycle.go
Outdated
}

// We successfully got a payment result back from the switch.
log.Debugf("Payment %v succeeded with pid=%v", p.identifier,
Nit: Payment => HTLC attempt succeeded...? Or are we assuming that as soon as one attempt succeeds the whole payment succeeds because the preimage was revealed? But can we be sure in an MPP case that the whole amount was settled?
yeah nice call, must be a leftover and now updated - and we always check the inflight htlcs before considering the payment as finalized or not.
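A simplified sketch (hypothetical fields, not channeldb types) of what "check the inflight htlcs before considering the payment as finalized" amounts to: a payment only counts as settled once nothing is in flight and the settled shards cover the full amount.

package sketch

type attemptState struct {
	inFlight  bool
	settled   bool
	amountSat int64
}

// paymentSettledSketch reports whether an MPP payment can be considered
// settled: every attempt must be resolved and the settled shards must add up
// to the total payment amount.
func paymentSettledSketch(attempts []attemptState, totalSat int64) bool {
	var settledSat int64
	for _, a := range attempts {
		if a.inFlight {
			return false
		}
		if a.settled {
			settledSat += a.amountSat
		}
	}

	return settledSat >= totalSat
}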
routing/payment_lifecycle.go
Outdated
func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
	result *htlcswitch.PaymentResult) (*attemptResult, error) {

	// In case of a payment failure, fail the attempt with the control
I think this can fail not only the attempt but the whole payment, so we should maybe be more detailed here in the comment.
updated the comments.
routing/payment_lifecycle.go
Outdated
	return p.handleSwitchErr(attempt, result.Error)
}

// We successfully got a payment result back from the switch.
Nit: not only a result but a positive result; the case where the payment failed is handled in handleSwitchErr?
updated the comments
I might be wrong here but afaict we also need to make sure we cancel the trackpayment subscription when canceling the payment.
Checked the trackpayment impl and realized we do send all the updates and optionally terminate the subscription here (maybe we need to take a second look at the subscribe-all case):
Lines 416 to 418 in c37baa6
if terminal {
	close(subscriber.queue.ChanIn())
}
routing/payment_lifecycle.go
Outdated

// refreshPayment returns the latest payment found in the control tower after
// all its attempt results are processed.
func (p *paymentLifecycle) refreshPayment() (DBMPPayment,
yeah, will rename it to be more verbose.
routing/payment_lifecycle.go
Outdated
func (p *paymentLifecycle) handleAttemptResult(attempt *channeldb.HTLCAttempt,
	result *htlcswitch.PaymentResult) (*attemptResult, error) {

	// In case of a payment failure, fail the attempt with the control
updated the comments.
routing/payment_lifecycle.go
Outdated
	return p.handleSwitchErr(attempt, result.Error)
}

// We successfully got a payment result back from the switch.
updated the comments
routing/payment_lifecycle.go
Outdated
}

// We successfully got a payment result back from the switch.
log.Debugf("Payment %v succeeded with pid=%v", p.identifier,
yeah nice call, must be leftover and now updated - and we always check the inflight htlcs before considering the payment as finalized or not.
b631432 to 7fa1e4f
because htlcs are still in flight, the payment state is not terminated so we will not close the channel here, given the fact that we still have
I think that's not the case anymore since we fixed the stuck payment here? Also, even if the stuckness wasn't fixed before, trackpayment IMO is behaving correctly because it shouldn't quit unless the payment is terminated.
So I analysed our payment behaviour and found out that we would fail a payment without waiting for all result collectors. This would also lead to the case where a trackPayment subscriber would hang indefinitely. This is now solved by waiting for all results and updating the payment db. Let me know if this approach is good @yyforyongyu. Thoughts about the current change: the reason why we should wait for all shards to resolve is the following. If we launch an MPP payment, for example, and receive the first error for one of the shards, we would mark the payment as failed, closing all the collectors in the process. This means our payment never has the chance to resolve, and the trackPayment subscribers will wait indefinitely because the payment never reached a terminal condition (still inflight payments).
d5f5a97 to 30f5826
30f5826 to 686acb9
The assessment is correct, however I don't think we want to add another state to decide whether we should exit the payment lifecycle or not, instead we should stick to
I did try to write an itest for this case but failed. Could you elaborate on this case? Also I'd like to limit the scope of this PR to focus on fixing this case only, as stated in the issue description. Given the complexity of the payment lifecycle, I think it's better to focus on one issue at a time, and open PRs for other fixes. Moving forward, I think the
lnd/routing/payment_lifecycle.go Lines 897 to 903 in 22ae3e5
We need to revisit the errors returned from the link and decide which error causes an attempt failure and which causes a payment failure. I suspect we are prematurely marking the payment as failed when we should instead mark the attempt as failed.
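To make the distinction concrete, a rough sketch with placeholder error values (not lnd's actual link errors): classify each switch failure as either payment-level (terminate the whole payment) or attempt-level (fail only the HTLC and try another shard).

package sketch

import "errors"

// Placeholder errors standing in for real link/switch failures.
var (
	errIncorrectPaymentDetails = errors.New("incorrect payment details")
	errTemporaryChannelFailure = errors.New("temporary channel failure")
)

// failureScopeSketch classifies a switch error: payment-level failures should
// terminate the whole payment, attempt-level failures should only fail the
// HTLC attempt so another shard can be tried.
func failureScopeSketch(err error) string {
	switch {
	case errors.Is(err, errIncorrectPaymentDetails):
		return "payment"
	case errors.Is(err, errTemporaryChannelFailure):
		return "attempt"
	default:
		// Be conservative with unknown errors and only fail the
		// attempt, rather than prematurely failing the payment.
		return "attempt"
	}
}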
routing/payment_lifecycle.go
Outdated
log.Errorf("Failed to find a route for payment %v: %v",
	p.identifier, err)

// The payment is now marked as failed, we'll continue
I am not sure if this comment holds tho, because in the requestRoute func we would not necessarily fail the payment, depending on the error:
var routeErr noRouteError
if !errors.As(err, &routeErr) {
return nil, fmt.Errorf("requestRoute got: %w", err)
}
oh nice yeah it was handled there already, one less commit
Not sure, I think we need to keep this exception: imagine we launch a payment and need to shard it, so we send one shard successfully, then try the next shard and run out of routes? We would then close the payment_lifecycle and remove all the result collectors again (for the first shard, which will then never have the chance to get resolved).
if it's a no-route error we'd return a nil error and continue the lifecycle due to the nil route below.
hmm, aren't we returning an error when it's not a NoRoute err?:
return nil, fmt.Errorf("requestRoute got: %w", err)
in that case we also wanna wait for all payments to resolve tho.
in that case something seriously wrong happened and we should exit; of course it relies on the API to promise what it does - if not, we need to fix it in the router. Otherwise, we will end up in an infinite loop here because the payment is not failed and we will keep sending new attempts.
Good point, makes sense to exit. My only concern here is then that we are closing the result collectors, which is unfortunate.
as long as the router and switch are active we probably should keep them alive.
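To summarize the flow debated in this thread, a condensed sketch (names and the string route are purely illustrative): a no-route failure marks the payment failed and returns an empty route so the loop keeps running and drains inflight attempts, while any other error is propagated and ends the lifecycle.

package sketch

import (
	"errors"
	"fmt"
)

var errNoRoute = errors.New("no route found")

// requestRouteSketch: on a no-route error the payment is marked failed and an
// empty route is returned, so the caller skips launching a new attempt and
// keeps waiting for inflight results; any other error is returned as-is.
func requestRouteSketch(findRoute func() (string, error),
	failPayment func(reason string)) (string, error) {

	rt, err := findRoute()
	if err == nil {
		return rt, nil
	}

	if !errors.Is(err, errNoRoute) {
		return "", fmt.Errorf("requestRoute got: %w", err)
	}

	// Mark the payment failed but keep the lifecycle alive so existing
	// attempts can still resolve.
	failPayment("no route")

	return "", nil
}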
Looking at your proposal it looks like it might solve the issue, because we would always wait in the
5439c3f to 686acb9
I actually think it's quite the opposite - if we are using another counter and read the chan, it creates more questions, e.g., what if there are multiple results being sent to it. That being said, I'm considering removing it too as it's hacky. As for the edge case, say a payment made of three HTLCs,
If we can confirm that step 2 indeed happens, then we can provide a solid fix to make sure we don't fail the payment prematurely, and then we don't need the hack. On the other hand, by using the counter approach, we won't notice this issue and won't attempt a fix.
could you elaborate why we would block at ...? Yeah, I think your example might happen, not sure how often tho, but it seems reasonable to assume. My current concern, and maybe you can jump in and mitigate this case, is the following: Our ... So I am not sure here, maybe we should really invest the time and find some proper solution rather than get this fix in. Maybe my concerns are overrated and we never run into those edge cases that often anymore, given the fact that we now sync the payment in one loop ...
I also don't know how it could happen😂, just making assumptions. If
now that you mention it, I think the fix is to notify it in the payment lifecycle, something like this.
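The commit linked as "something like this" isn't reproduced here, but the general pattern could be a non-blocking notification onto a channel the lifecycle loop selects on (a sketch with hypothetical names):

package sketch

// notifyResultCollected signals the lifecycle loop that a new switch result
// is ready without ever blocking the collector goroutine.
func notifyResultCollected(resultCollected chan struct{}) {
	select {
	case resultCollected <- struct{}{}:
	default:
		// A notification is already pending; the loop will drain all
		// buffered results when it wakes up, so dropping is fine.
	}
}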
I think this can also be a solution. Not super happy with it, because we kinda prevent the payment from being marked as failed until a restart, but at least the subscriber does not hang indefinitely. Maybe this issue should be taken care of in the payment SQL refactor I am doing, and let's keep a TODO here? However @hieblmi might need this feature for his static address functionality. I think we have all the cards on the table, just need to decide which direction to go. (cc @saubyk I think we need to discuss this one after the next standup)
!lightninglabs-deploy mute 336h
I think we are very close; the only thing we should do is always mark a payment as failed in resumePayment and wait for all active resolvers. This makes sure we do not start any more attempts but still collect all ongoing htlc attempts, and in the process also signal this to the trackPayment command.
So basically in
exitWithErr := func(err error) ([32]byte, *route.Route, error) {
log.Errorf("Payment %v with status=%v failed: %v",
p.identifier, payment.GetStatus(), err)
return [32]byte{}, nil, err
}
just mark the payment as failed and continue until all resolvers are finished, which will then exit this process.
Even when having a timeout set for the payment, this timeout should just set the payment as failed, and then when all attempts are resolved we would exit via:
if !wait {
return stepExit, nil
}
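A small sketch of the step decision suggested here (simplified, hypothetical names): once the payment is failed or timed out, stop launching attempts but keep waiting until every inflight attempt has resolved, and only then exit.

package sketch

type lifecycleStep int

const (
	stepExit lifecycleStep = iota
	stepWait
	stepLaunchAttempt
)

// decideNextStepSketch: a failed (or timed-out) payment launches no new
// attempts, but the lifecycle keeps waiting until every inflight attempt has
// resolved, and only then exits and notifies trackPayment subscribers.
func decideNextStepSketch(failed bool, numInflight int) lifecycleStep {
	switch {
	case failed && numInflight == 0:
		return stepExit
	case failed:
		return stepWait
	default:
		return stepLaunchAttempt
	}
}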
// reloadInflightAttempts is called when the payment lifecycle is resumed after
// a restart. It reloads all inflight attempts from the control tower and
// collects the results of the attempts that have been sent before.
I was thinking it is already implicitly described by inflight payments, but also ok if you keep it as is.
return stepExit, err
}

case <-p.resultCollected:
Nit: Add a comment that here we are waiting for the result from the switch.
log.Errorf("Error reporting payment success to mc: %v", err)
}

// In case of success we atomically store settle result to the DB move
Nit: store settle results to the DB and move ...
@@ -803,8 +803,8 @@ func (p *paymentLifecycle) handleSwitchErr(attempt *channeldb.HTLCAttempt,
	// case we can safely send a new payment attempt, and wait for its
	// result to be available.
	if errors.Is(sendErr, htlcswitch.ErrPaymentIDNotFound) {
I think we should differentiate between the payment not being found in the network results versus in the attempts; these are imo two different things:
lnd/htlcswitch/payment_result.go
Lines 235 to 255 in 6694002

func fetchResult(tx kvdb.RTx, pid uint64) (*networkResult, error) {
	var attemptIDBytes [8]byte
	binary.BigEndian.PutUint64(attemptIDBytes[:], pid)

	networkResults := tx.ReadBucket(networkResultStoreBucketKey)
	if networkResults == nil {
		return nil, ErrPaymentIDNotFound
	}

	// Check whether a result is already available.
	resultBytes := networkResults.Get(attemptIDBytes[:])
	if resultBytes == nil {
		return nil, ErrPaymentIDNotFound
	}

	// Decode the result we found.
	r := bytes.NewReader(resultBytes)

	return deserializeNetworkResult(r)
}
wdyt ?
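One way to express that differentiation, sketched with a hypothetical second sentinel (not an existing lnd error) and a plain map standing in for the kvdb bucket: distinguish "the result store does not exist yet" from "no result stored for this attempt ID".

package sketch

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels splitting what is currently a single
// ErrPaymentIDNotFound into its two underlying causes.
var (
	errResultStoreMissing = errors.New("network result store not found")
	errAttemptIDNotFound  = errors.New("no result for attempt ID")
)

func fetchResultSketch(store map[uint64][]byte, pid uint64) ([]byte, error) {
	if store == nil {
		return nil, errResultStoreMissing
	}

	res, ok := store[pid]
	if !ok {
		return nil, fmt.Errorf("attempt %d: %w", pid,
			errAttemptIDNotFound)
	}

	return res, nil
}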
@@ -262,6 +262,16 @@ lifecycle:
	// We found a route to try, create a new HTLC attempt to try.
	attempt, err := p.registerAttempt(rt, ps.RemainingAmt)
	if err != nil {
		// If the error is due to we cannot register another
I think in neither case - when receiving an error here or in requestRoute - should we just exit the payment without marking it as failed and without collecting all results. So what we should do is: every error that occurs should fail the payment, and we wait for all collectors to resolve.
This commit caches the creation of the sphinx circuit and onion blob to avoid re-creating them.
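A tiny sketch of that kind of caching (hypothetical struct, not the real HTLCAttempt): build the onion blob once and reuse it on subsequent sends instead of reconstructing the sphinx circuit.

package sketch

// cachedAttempt lazily builds its onion blob once and reuses it afterwards.
type cachedAttempt struct {
	onionBlob  []byte
	buildOnion func() []byte
}

// onion returns the cached onion blob, constructing it on first use only.
func (a *cachedAttempt) onion() []byte {
	if a.onionBlob == nil {
		a.onionBlob = a.buildOnion()
	}

	return a.onionBlob
}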
To shorten the method `resumePayment` and make each step clearer.
To further shorten the lifecycle loop.
This commit refactors `collectResultAsync` such that this method is now only responsible for collecting results from the switch. A new method `processSwitchResults` is added to process these results in the same goroutine where we fetch the payment from the db, to make sure the lifecycle loop always has a consistent view of a given payment.
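A rough sketch of that pattern without any lnd types: collector goroutines only forward results onto a channel, and a single loop goroutine processes each result and then re-reads the payment state, so the lifecycle always acts on one consistent view.

package sketch

type switchResult struct{ attemptID uint64 }

// collectAsyncSketch only forwards a switch result to the lifecycle loop; all
// processing happens in the loop goroutine.
func collectAsyncSketch(results chan<- switchResult, r switchResult) {
	results <- r
}

// runLifecycleSketch is the single goroutine that processes results and then
// refreshes the payment from the DB, so no other goroutine mutates the view
// the loop is acting on.
func runLifecycleSketch(results <-chan switchResult,
	process func(switchResult), refreshPayment func()) {

	for r := range results {
		process(r)
		refreshPayment()
	}
}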
686acb9 to 5b655b1
@yyforyongyu, remember to re-request review from reviewers when ready
Fix #8975
This PR aims at fixing a scenario where the payment could be stuck. Major changes are:
- resumePayment
- update_fail_htlc

TODOs:
- htlcswitch
- routing