feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo() #15752

Draft · gustavogama-cll wants to merge 1 commit into develop from dpa-1371-feat-periodic-sync-node-info-job-distributor

Conversation

@gustavogama-cll (Author) commented Dec 18, 2024

There’s a behavior that we’ve observed for some time on the NOP side: they add or update a chain configuration in the Job Distributor panel, but the change is not reflected on the service itself. This leads to inefficiencies, as NOPs are unaware of it and thus need to be notified so that they may "reapply" the configuration.

After some investigation, we suspect that this is due to connectivity issues between the nodes and the job distributor instance, which causes the message with the update to be lost.

This PR attempts to solve this by adding a "retry" wrapper on top of the existing SyncNodeInfo method. We rely on avast/retry-go to implement the bulk of the retry logic. It's configured with a minimum delay of 10 seconds, a maximum delay of 30 minutes, and a total of 56 retries -- which adds up to a bit more than 24 hours.

DPA-1371
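For reference, here is a hedged reconstruction of what the wrapper might look like, pieced together from the code excerpts quoted in the review comments below; field and option names may differ slightly from the actual diff, and it assumes github.com/avast/retry-go/v4 is imported as retry.

```go
// Sketch of the retry wrapper discussed in this PR; reconstructed from the
// review excerpts below, so details may differ from the actual change.
func (s *service) syncNodeInfoWithRetry(id int64) {
	// cancel the previous context -- and, by extension, the existing goroutine --
	// so that we can start anew
	s.syncNodeInfoCancel()

	var ctx context.Context
	ctx, s.syncNodeInfoCancel = context.WithCancel(context.Background())

	retryOpts := []retry.Option{
		retry.Context(ctx),
		retry.Delay(10 * time.Second),       // minimum delay between attempts
		retry.MaxDelay(30 * time.Minute),    // cap on the backoff delay
		retry.DelayType(retry.BackOffDelay), // exponential backoff
		retry.Attempts(56),                  // adds up to a bit more than 24 hours
	}

	go func() {
		err := retry.Do(func() error { return s.SyncNodeInfo(ctx, id) }, retryOpts...)
		if err != nil {
			s.lggr.Errorw("failed to sync node info", "managerID", id, "err", err)
		}
	}()
}
```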

InsecureFastScrypt *bool
RootDir *string
ShutdownGracePeriod *commonconfig.Duration
FeedsManagerSyncInterval *commonconfig.Duration
Author:

The new config option was added at the root level because I couldn't find a better place. Happy to move it elsewhere per the maintainers' advice.

Author:

Unit tests were skipped in this draft PR, as I want to get some feedback on the approach before finishing it.

github-actions bot commented Dec 18, 2024

AER Report: CI Core ran successfully ✅

aer_workflow, commit

AER Report: Operator UI CI ran successfully ✅

aer_workflow, commit

@gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 3 times, most recently from 301972c to 83b1842 on December 18, 2024 05:51
@graham-chainlink (Collaborator) commented Dec 19, 2024

Hmm, I wonder: would this solve the connection issue?

If there is a communication issue between the node and JD, how would the auto sync help resolve it? It will try and it will fail, right?

Alternatively, would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (Not that it will solve a permanent connection issue.)


for _, manager := range managers {
s.lggr.Infow("synchronizing node info", "managerID", manager.ID)
err := s.SyncNodeInfo(ctx, manager.ID)
Collaborator:

Managers/JDs can be disabled/enabled; we should avoid syncing to disabled JDs. We can use DisabledAt to determine that, like at line 1130 of this file.
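A hedged sketch of that guard, assuming DisabledAt is a nullable timestamp on the manager record (the surrounding loop is from the diff above):

```go
for _, manager := range managers {
	// Skip managers/JDs that have been disabled; DisabledAt is assumed to be
	// a *time.Time that is non-nil once the JD has been disabled.
	if manager.DisabledAt != nil {
		s.lggr.Infow("skipping disabled job distributor", "managerID", manager.ID)
		continue
	}

	s.lggr.Infow("synchronizing node info", "managerID", manager.ID)
	if err := s.SyncNodeInfo(ctx, manager.ID); err != nil {
		s.lggr.Errorw("failed to sync node info", "managerID", manager.ID, "err", err)
	}
}
```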

@@ -1550,6 +1553,32 @@ func (s *service) isRevokable(propStatus JobProposalStatus, specStatus SpecStatu
return propStatus != JobProposalStatusDeleted && (specStatus == SpecStatusPending || specStatus == SpecStatusCancelled)
}

func (s *service) periodicallySyncNodeInfo(ctx context.Context) {
Collaborator:

I think there is also an assumption in the code that JD is connected to the core node when SyncNodeInfo is called; otherwise it returns the error "could not fetch client", which may not be a big deal, just noise. But if we could check whether the nodes are connected before syncing, that would be nice.
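A possible shape for that check, using a hypothetical IsConnected helper on the connection manager (the actual API may differ):

```go
// Hypothetical guard -- connMgr and IsConnected are illustrative names only.
if !s.connMgr.IsConnected(manager.ID) {
	s.lggr.Infow("job distributor not connected; skipping node info sync", "managerID", manager.ID)
	continue
}
```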

@gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 2 times, most recently from 57b55cc to c5d0079 on December 20, 2024 04:34
@gustavogama-cll changed the title from "feat(job-distributor): periodically sync node info with job distributors" to "feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo()" on Dec 20, 2024
@gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from c5d0079 to a1a4281 on December 20, 2024 04:37
@gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from a1a4281 to 61297ab on December 20, 2024 05:17
@gustavogama-cll (Author):

> Alternatively, would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (Not that it will solve a permanent connection issue.)

As discussed earlier today, I went ahead and implemented your suggestion. I ran a few manual tests and it seems to work as expected, though I had to add some extra logic around the context instances to get there.

I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.

@graham-chainlink (Collaborator):

> I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.

Thanks @gustavogama-cll. Yeah, the background goroutine definitely has its pros, and both approaches are valid; it's just that, for me, the retry is simpler.

Comment on lines +273 to +274
retry.Delay(5 * time.Second),
retry.Delay(10 * time.Second),
Collaborator:

Delay is configured twice.
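Since the options are applied in order, the later value presumably wins; keeping a single base delay would make the intent clearer, e.g.:

```go
retry.Delay(10 * time.Second),    // single base delay; drop the 5s option
retry.MaxDelay(30 * time.Minute),
```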

var ctx context.Context
ctx, s.syncNodeInfoCancel = context.WithCancel(context.Background())

retryOpts := []retry.Option{
Collaborator:

Is there a reason we didn't use retry.BackOffDelay?
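For reference, opting into exponential backoff with retry-go should just be one more option (BackOffDelay roughly doubles the delay on each failed attempt, bounded by MaxDelay):

```go
retry.DelayType(retry.BackOffDelay), // exponential backoff between attempts, capped by MaxDelay
```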

retry.Delay(5 * time.Second),
retry.Delay(10 * time.Second),
retry.MaxDelay(30 * time.Minute),
retry.Attempts(48 + 8), // 30m * 48 =~ 24h; plus the initial 8 shorter retries
Collaborator:

Where did you derive the 8? Where do we configure the shorter retries?
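One plausible derivation, assuming a 10-second initial delay that doubles on each attempt and is capped at 30 minutes:

10s → 20s → 40s → 80s → 160s → 320s → 640s → 1280s (≈21m), then capped at 30m

That gives 8 "shorter" retries (about 42 minutes in total) before the cap kicks in, after which 48 attempts at 30 minutes each add up to roughly 24 hours.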

}

s.syncNodeInfoCancel()
s.syncNodeInfoCancel = func() {}
Collaborator:

Hmm, won't this introduce a race condition, since each request that wants to update node info will try to set this variable?
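One way to avoid that race, sketched as an assumption (a mutex guarding the shared cancel func; syncNodeInfoCancelMu would be a new field, and this is not necessarily how the PR will address it):

```go
// Hypothetical helper: swap the shared cancel func under a mutex so concurrent
// callers don't race on s.syncNodeInfoCancel.
func (s *service) resetSyncNodeInfoContext() context.Context {
	s.syncNodeInfoCancelMu.Lock()
	defer s.syncNodeInfoCancelMu.Unlock()

	s.syncNodeInfoCancel() // stop any in-flight retry goroutine
	ctx, cancel := context.WithCancel(context.Background())
	s.syncNodeInfoCancel = cancel
	return ctx
}
```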

@@ -141,6 +143,7 @@ type service struct {
lggr logger.Logger
version string
loopRegistrarConfig plugins.RegistrarConfig
syncNodeInfoCancel context.CancelFunc
Collaborator:

I think instead of using this to pass the context, we should just have syncNodeInfoWithRetry accept a context as a parameter; each caller should have a context value to pass in.

func (s *service) syncNodeInfoWithRetry(id int64) {
// cancel the previous context -- and, by extension, the existing goroutine --
// so that we can start anew
s.syncNodeInfoCancel()
Collaborator:

I don't think we need to do this, right?

If the caller of syncNodeInfoWithRetry passes in their own context, which is scoped to a request, then we don't have to manually cancel each context. Each request should have its own retry; e.g. request A should not cancel request B's sync, which is what happens with this setup.
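A sketch of that alternative, assuming syncNodeInfoWithRetry simply takes the caller's request-scoped context and no shared cancel func is kept on the service:

```go
// Sketch only: the caller's ctx bounds the retries, so each request gets an
// independent retry loop and request A can no longer cancel request B's sync.
func (s *service) syncNodeInfoWithRetry(ctx context.Context, id int64) error {
	return retry.Do(
		func() error { return s.SyncNodeInfo(ctx, id) },
		retry.Context(ctx),
		retry.Delay(10 * time.Second),
		retry.MaxDelay(30 * time.Minute),
		retry.DelayType(retry.BackOffDelay),
		retry.Attempts(56),
	)
}
```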
