Add metric for catchup failure and increase catchup time to 2 minutes #91
Conversation
What is the before/after for number of keys that we expect to exist at any instant? How about for rate of expiry? We should compare these values to current usage to understand what proportion of aggregate load this represents.
lib/oplog/tail.go (Outdated)
}

if (redisErr != nil) && (redisErr != redis.Nil) {
    log.Log.Errorw("Error querying Redis for last processed timestamp. Will start from end of oplog.",
        "error", redisErr)
}

metricOplogFailedResume.Inc()
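For context, a counter like metricOplogFailedResume would typically be declared once and registered with Prometheus. Below is a minimal sketch using prometheus/client_golang's promauto helper; the namespace, metric name, and help text are assumptions, not the repository's actual declaration:

```go
package oplog

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical declaration of the counter incremented in the diff above;
// the real name and namespace in oplogtoredis may differ.
var metricOplogFailedResume = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "otr", // assumption
	Name:      "failed_resume_total",
	Help:      "Count of times tailing could not resume from the last processed timestamp.",
})
```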
Two thoughts:
- Maybe we ought to track every resume, and partition by whether it was successful or not?
- It might be nice to make this a histogram, and make the value "how far behind we are in seconds".
LGTM
@@ -109,42 +109,42 @@ func TestParseEnv(t *testing.T) {

func checkConfigExpectation(t *testing.T, expectedConfig *oplogtoredisConfiguration) {
    if expectedConfig.MongoURL != MongoURL() {
-       t.Errorf("Incorrect Mongo URL. Got \"%s\", Expected \"%s\"",
+       t.Errorf("Incorrect Mongo URL. Expected \"%s\", Got \"%s\"",
good catch here
Thanks, I was confused for a little while at first...
Name: "resume_gap_seconds", | ||
Help: "Histogram recording the gap in time that a tailing resume had to catchup and whether it was successful or not.", | ||
Buckets: []float64{1, 2.5, 5, 10, 25, 50, 100, 250, 500, 1000}, | ||
}, []string{"status"}) |
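Filling in around this fragment, here is a sketch of how the full HistogramVec declaration and an observation might fit together, assuming prometheus/client_golang; the variable name, namespace, and the recordResumeGap helper are assumptions rather than the PR's exact code:

```go
package oplog

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Assumed full declaration wrapping the fragment shown above; the variable
// name and namespace are guesses, not the PR's exact code.
var metricResumeGap = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "otr", // assumption
	Name:      "resume_gap_seconds",
	Help:      "Histogram recording the gap in time that a tailing resume had to catchup and whether it was successful or not.",
	Buckets:   []float64{1, 2.5, 5, 10, 25, 50, 100, 250, 500, 1000},
}, []string{"status"})

// recordResumeGap is a hypothetical helper: the observed value is how far
// behind we are in seconds, and the "status" label partitions successful
// resumes from failed ones.
func recordResumeGap(gap time.Duration, resumed bool) {
	status := "failure"
	if resumed {
		status = "success"
	}
	metricResumeGap.WithLabelValues(status).Observe(gap.Seconds())
}
```

With the status label, a query can count failed resumes directly rather than inferring them from bucket boundaries.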
I see the point in making this a string, but for now the value is more or less a bool, correct?
And... the histogram buckets < MaxCatchUp should always be successes, and the buckets >= MaxCatchUp should always be failures? So, assuming you knew the max catch-up value at a certain time and the histogram thresholds, you would be able to derive the success/failure value?
Yes, it could be derived from that, but those are tunable parameters, so I think having the label makes sense, since it explicitly states the action taken. Failures are noteworthy since they mean downstream Meteor instances will potentially be in an inconsistent state.
As for string/bool -- I could go either way -- I don't know whether we'd expand it in the future, so maybe bool is enough. @torywheelwright do you have thoughts on this?
My understanding is that this is a literal array of strings declaring the various label names, rather than a declaration of the type of the value (which I understand is always a float).
It is true that if you exported the configured max resume time as a different metric, you could derive whether the resume was successful or not. This is probably a better encoding strictly speaking, though it makes the query a little more complicated. I have no strong preference.
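If the configured max resume time were exported as its own metric, as suggested here, a minimal sketch might look like the following; the metric and function names are hypothetical:

```go
package oplog

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical gauge exposing the configured max catch-up window, so the
// success/failure split could be derived at query time by comparing observed
// resume gaps against this value.
var metricMaxCatchup = promauto.NewGauge(prometheus.GaugeOpts{
	Namespace: "otr", // assumption
	Name:      "max_catchup_seconds",
	Help:      "Configured maximum catch-up window, in seconds.",
})

// exportMaxCatchup would be called once at startup with the tunable window
// (2 minutes after this PR).
func exportMaxCatchup(maxCatchUp time.Duration) {
	metricMaxCatchup.Set(maxCatchUp.Seconds())
}
```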
For success/fail, I'd rather have it report what it did for historic reference since the thresholds can be changed (for example, I hope that in the near future after refactoring the dedup method we can further increase the catchup time). Since this decision logic on catchup vs start from present happens within OTR, I'd rather not duplicate it on the reporting side.
Previously the max catchup time default was 1 minute, which wasn't always enough to recover from a pod restart. This PR raises that default to 2 minutes and adds a resume_failed metric that increments each time tailing cannot catch up from where it left off.
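As a rough illustration of the decision described here, the sketch below assumes the tailer compares the gap since the last processed timestamp against the configured maximum catch-up window; the function and parameter names are illustrative, not taken from the repository:

```go
package oplog

import "time"

// Hypothetical illustration of the catch-up decision. If the gap since the
// last processed timestamp fits within the configured window (2 minutes after
// this PR), tailing resumes from that timestamp; otherwise it starts from the
// present end of the oplog and the resume is reported as a failure.
func chooseResumeTimestamp(lastProcessed time.Time, maxCatchUp time.Duration) (start time.Time, status string) {
	gap := time.Since(lastProcessed)
	if gap <= maxCatchUp {
		return lastProcessed, "success"
	}
	return time.Now(), "failure"
}
```

The returned status would then drive the "status" label on the resume_gap_seconds histogram discussed above, with a failure also incrementing the resume_failed counter.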