Add metric for catchup failure and increase catchup time to 2 minutes #91

Merged — 2 commits merged into master from eparker.oplogcatchup on Dec 16, 2024

Conversation

@eparker-tulip (Contributor) commented on Dec 11, 2024:

Previously the max catchup time default was 1 minute, which wasn't always enough to recover from a pod restart. This PR:

  1. doubles the max catchup time to 2 minutes;
  2. increases the dedup key TTL from 2m to 2.5m so it covers the catchup window without greatly increasing the number of keys (a 25% increase in total dedup keys, with the same create/expire rate);
  3. adds a resume_failed metric that increments each time the tailer cannot catch up from where it left off;
  4. improves logging by adding the age of the last processed entry, in seconds.

A rough sketch of these defaults and the new metric follows.
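To make those numbers concrete, here is a minimal sketch of what the changed defaults and the failure counter could look like, assuming Go with prometheus/client_golang. Only metricOplogFailedResume appears in the diff excerpts below; the other names (defaultMaxCatchUp, defaultDedupTTL) and the promauto wiring are illustrative assumptions, not necessarily the repository's actual identifiers.

```go
package config

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

const (
	// Maximum window we try to catch up after a restart (previously 1 minute).
	defaultMaxCatchUp = 2 * time.Minute

	// Dedup keys must outlive the catchup window (previously 2 minutes).
	defaultDedupTTL = 2*time.Minute + 30*time.Second
)

// Incremented each time the tailer cannot resume from where it left off
// and has to start from the end of the oplog instead.
var metricOplogFailedResume = promauto.NewCounter(prometheus.CounterOpts{
	Name: "resume_failed",
	Help: "Count of times the tailer could not catch up from the last processed oplog timestamp.",
})
```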

@torywheelwright (Member) left a comment:

What is the before/after for number of keys that we expect to exist at any instant? How about for rate of expiry? We should compare these values to current usage to understand what proportion of aggregate load this represents.
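For what it's worth, the 25% figure in the PR description follows from steady-state reasoning: if the key-creation rate stays the same, the number of live dedup keys at any instant is roughly rate × TTL, and the expiry rate equals the creation rate. A tiny illustration (the creation rate here is a made-up placeholder, not a measured value):

```go
package main

import "fmt"

func main() {
	// Hypothetical creation rate; the real value depends on oplog traffic.
	const createPerSec = 100.0

	oldTTL, newTTL := 120.0, 150.0 // seconds: 2m vs 2.5m

	// Steady state: live keys ≈ rate × TTL; expiry rate ≈ creation rate.
	oldKeys, newKeys := createPerSec*oldTTL, createPerSec*newTTL
	fmt.Printf("live keys: %.0f -> %.0f (+%.0f%%), expiry rate unchanged at %.0f/s\n",
		oldKeys, newKeys, (newKeys/oldKeys-1)*100, createPerSec)
	// Output: live keys: 12000 -> 15000 (+25%), expiry rate unchanged at 100/s
}
```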

lib/oplog/tail.go (two earlier review threads on this file were resolved; the excerpt below is the context for the next comment):
	}

	if (redisErr != nil) && (redisErr != redis.Nil) {
		log.Log.Errorw("Error querying Redis for last processed timestamp. Will start from end of oplog.",
			"error", redisErr)
	}

	// Record that we could not resume from the last processed timestamp.
	metricOplogFailedResume.Inc()
Member commented:

Two thoughts:

  • Maybe we ought to track every resume, and partition by whether it was successful or not?
  • It might be nice to make this a histogram, and make the value "how far behind we are in seconds".
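For reference, that suggestion might look roughly like the sketch below with prometheus/client_golang; the variable name metricResumeGap, the helper recordResume, and the "success"/"failure" label values are assumptions for illustration (the resume_gap_seconds definition that was actually added appears later in this conversation).

```go
package oplog

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Every resume is recorded as "how far behind we were, in seconds",
// partitioned by whether the catchup succeeded.
var metricResumeGap = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "resume_gap_seconds",
	Help:    "Gap in seconds that a tailing resume had to catch up, by outcome.",
	Buckets: []float64{1, 2.5, 5, 10, 25, 50, 100, 250, 500, 1000},
}, []string{"status"})

func recordResume(gapSeconds float64, ok bool) {
	status := "success"
	if !ok {
		status = "failure"
	}
	metricResumeGap.WithLabelValues(status).Observe(gapSeconds)
}
```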

@jgdef-tulip (Contributor) left a comment:

LGTM

@@ -109,42 +109,42 @@ func TestParseEnv(t *testing.T) {

 func checkConfigExpectation(t *testing.T, expectedConfig *oplogtoredisConfiguration) {
 	if expectedConfig.MongoURL != MongoURL() {
-		t.Errorf("Incorrect Mongo URL. Got \"%s\", Expected \"%s\"",
+		t.Errorf("Incorrect Mongo URL. Expected \"%s\", Got \"%s\"",
Contributor commented:

good catch here

Contributor (author) replied:

Thanks, I was confused for a little while at first...

Name: "resume_gap_seconds",
Help: "Histogram recording the gap in time that a tailing resume had to catchup and whether it was successful or not.",
Buckets: []float64{1, 2.5, 5, 10, 25, 50, 100, 250, 500, 1000},
}, []string{"status"})
Contributor commented:

I see the point in making this a string, but for now the value is more or less a bool, correct?

Contributor commented:

And... the histogram buckets < MaxCatchUp should always be successes, and the buckets >= MaxCatchUp should always be failures? So, assuming you knew the max catchup value at a given time and the histogram thresholds, you would be able to derive the success/failure value?

Contributor (author) replied:

Yes, it could be derived from that, but those are tunable parameters, so I think having the label makes sense: it explicitly states the action taken. Failures are noteworthy since they mean downstream Meteor instances will potentially be in an inconsistent state.

As for string vs. bool, I could go either way; I don't know whether we'd expand it in the future, so maybe a bool is enough. @torywheelwright, do you have thoughts on this?

Member commented:

My understanding is that this is a literal array of strings declaring the various label names, rather than a declaration of the type of the value (which I understand is always a float).

It is true that if you exported the configured max resume time as a different metric, you could derive whether the resume was successful or not. This is probably a better encoding strictly speaking, though it makes the query a little more complicated. I have no strong preference.

Contributor (author) replied:

For success/fail, I'd rather have it report what it actually did, for historical reference, since the thresholds can be changed (for example, I hope that in the near future, after refactoring the dedup method, we can further increase the catchup time). Since the decision logic for catching up vs. starting from the present lives within OTR, I'd rather not duplicate it on the reporting side.

@eparker-tulip merged commit a74d8ba into master on Dec 16, 2024
8 checks passed
@eparker-tulip deleted the eparker.oplogcatchup branch on December 16, 2024 at 14:47