Add metric for catchup failure and increase catchup time to 2 minutes #91

Merged — 2 commits merged into master from eparker.oplogcatchup on Dec 16, 2024

Conversation

@eparker-tulip (Contributor) commented on Dec 11, 2024:

Previously the max catchup time default was 1 minute, which wasn't always enough to recover from a pod restart. This PR:

  1. doubles the max catchup time to 2 minutes;
  2. increases the dedup key TTL from 2m to 2.5m so it covers the catchup window without greatly increasing the number of keys (a 25% increase in total dedup keys, with the same create/expire rate);
  3. adds a resume_failed metric that increments each time the tailer cannot catch up from where it left off;
  4. improves logging by adding the age of the last processed entry, in seconds.

A rough sketch of these defaults and the new metric follows.
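To make those numbers concrete, here is a minimal sketch of what the changed defaults and the failure counter could look like, assuming Go with prometheus/client_golang. Only metricOplogFailedResume appears in the diff excerpts below; the other names (defaultMaxCatchUp, defaultDedupTTL) and the promauto wiring are illustrative assumptions, not necessarily the repository's actual identifiers.

```go
package config

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

const (
	// Maximum window we try to catch up after a restart (previously 1 minute).
	defaultMaxCatchUp = 2 * time.Minute

	// Dedup keys must outlive the catchup window (previously 2 minutes).
	defaultDedupTTL = 2*time.Minute + 30*time.Second
)

// Incremented each time the tailer cannot resume from where it left off
// and has to start from the end of the oplog instead.
var metricOplogFailedResume = promauto.NewCounter(prometheus.CounterOpts{
	Name: "resume_failed",
	Help: "Count of times the tailer could not catch up from the last processed oplog timestamp.",
})
```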

@torywheelwright (Member) left a comment:

What is the before/after for number of keys that we expect to exist at any instant? How about for rate of expiry? We should compare these values to current usage to understand what proportion of aggregate load this represents.
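For what it's worth, the 25% figure in the PR description follows from steady-state reasoning: if the key-creation rate stays the same, the number of live dedup keys at any instant is roughly rate × TTL, and the expiry rate equals the creation rate. A tiny illustration (the creation rate here is a made-up placeholder, not a measured value):

```go
package main

import "fmt"

func main() {
	// Hypothetical creation rate; the real value depends on oplog traffic.
	const createPerSec = 100.0

	oldTTL, newTTL := 120.0, 150.0 // seconds: 2m vs 2.5m

	// Steady state: live keys ≈ rate × TTL; expiry rate ≈ creation rate.
	oldKeys, newKeys := createPerSec*oldTTL, createPerSec*newTTL
	fmt.Printf("live keys: %.0f -> %.0f (+%.0f%%), expiry rate unchanged at %.0f/s\n",
		oldKeys, newKeys, (newKeys/oldKeys-1)*100, createPerSec)
	// Output: live keys: 12000 -> 15000 (+25%), expiry rate unchanged at 100/s
}
```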

lib/oplog/tail.go (two earlier review threads on this file were resolved; the excerpt below is the context for the next comment):
	}

	if (redisErr != nil) && (redisErr != redis.Nil) {
		log.Log.Errorw("Error querying Redis for last processed timestamp. Will start from end of oplog.",
			"error", redisErr)
	}

	// Record that we could not resume from the last processed timestamp.
	metricOplogFailedResume.Inc()
Member commented:

Two thoughts:

  • Maybe we ought to track every resume, and partition by whether it was successful or not?
  • It might be nice to make this a histogram, and make the value "how far behind we are in seconds".
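For reference, that suggestion might look roughly like the sketch below with prometheus/client_golang; the variable name metricResumeGap, the helper recordResume, and the "success"/"failure" label values are assumptions for illustration (the resume_gap_seconds definition that was actually added appears later in this conversation).

```go
package oplog

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Every resume is recorded as "how far behind we were, in seconds",
// partitioned by whether the catchup succeeded.
var metricResumeGap = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "resume_gap_seconds",
	Help:    "Gap in seconds that a tailing resume had to catch up, by outcome.",
	Buckets: []float64{1, 2.5, 5, 10, 25, 50, 100, 250, 500, 1000},
}, []string{"status"})

func recordResume(gapSeconds float64, ok bool) {
	status := "success"
	if !ok {
		status = "failure"
	}
	metricResumeGap.WithLabelValues(status).Observe(gapSeconds)
}
```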

@jgdef-tulip (Contributor) left a comment:

LGTM

@@ -109,42 +109,42 @@ func TestParseEnv(t *testing.T) {

 func checkConfigExpectation(t *testing.T, expectedConfig *oplogtoredisConfiguration) {
 	if expectedConfig.MongoURL != MongoURL() {
-		t.Errorf("Incorrect Mongo URL. Got \"%s\", Expected \"%s\"",
+		t.Errorf("Incorrect Mongo URL. Expected \"%s\", Got \"%s\"",
Contributor commented:

good catch here

Contributor (author) replied:

Thanks, I was confused for a little while at first...

Name: "resume_gap_seconds",
Help: "Histogram recording the gap in time that a tailing resume had to catchup and whether it was successful or not.",
Buckets: []float64{1, 2.5, 5, 10, 25, 50, 100, 250, 500, 1000},
}, []string{"status"})
Contributor commented:

I see the point in making this a string, but for now the value is more or less a bool, correct?

Contributor commented:

And... the histogram buckets < MaxCatchUp should always be successes, and the buckets >= MaxCatchUp should always be failures? So, assuming you knew the max catchup value at a given time and the histogram thresholds, you would be able to derive the success/failure value?

Contributor (author) replied:

Yes, it could be derived from that, but those are tunable parameters, so I think having the label makes sense: it explicitly states the action taken. Failures are noteworthy since they mean downstream Meteor instances will potentially be in an inconsistent state.

As for string vs. bool, I could go either way; I don't know whether we'd expand it in the future, so maybe a bool is enough. @torywheelwright, do you have thoughts on this?

Member commented:

My understanding is that this is a literal array of strings declaring the various label names, rather than a declaration of the type of the value (which I understand is always a float).

It is true that if you exported the configured max resume time as a different metric, you could derive whether the resume was successful or not. This is probably a better encoding strictly speaking, though it makes the query a little more complicated. I have no strong preference.

Contributor (author) replied:

For success/fail, I'd rather have it report what it actually did, for historical reference, since the thresholds can be changed (for example, I hope that in the near future, after refactoring the dedup method, we can further increase the catchup time). Since the decision logic for catching up vs. starting from the present lives within OTR, I'd rather not duplicate it on the reporting side.

@eparker-tulip merged commit a74d8ba into master on Dec 16, 2024
8 checks passed
@eparker-tulip deleted the eparker.oplogcatchup branch on December 16, 2024 at 14:47