Add query sampling for checking #7

Open

andrewma2 wants to merge 1 commit into databricks:db_obs_m3coord_cache from andrewma2:Sampling

andrewma2 commented Aug 12, 2022

What changes are proposed in this pull request?

Implements a sampled check that compares a proportion of query results against the expected M3DB result and records metrics (plus a refactor of the config).

How is this tested?

Deployed to dev-aws-us-west-1 and verified from the emitted metrics that checks are being performed and are accurate (also verified that mismatches are registered). Other performance metrics aren't affected.

How is this feature monitored?

Code Review

For information about the code review process, e.g. how to find a reviewer or how to ping non-responsive reviewers, check the contents of go/code-review Confluence page.

Approvals

Other than the mandatory approvers enforced by the OWNER file framework (http://go/owners), this PR
requires at least one approval from another engineer.

[NEW] Shiproom

Platform & Compute Fabric:
Changes should be tracked by an approved "material change." Multiple PRs may be tracked by a single material change.

Change modifies code owned or released by a Platform or Compute Fabric team
- A "material change" covering this PR exists in http://go/engshiproom: <CHANGE_ID>

See http://go/platshiproomwiki for instructions and use http://go/lightcm-template to evaluate risk. Ask questions in #platform-shiproom.

Runtime changes:

Change targets a runtime maintenance release (i.e., targets a maintenance dbr-branch-x.x branch or has a maintenance dbr-branch-x.x label)
- Change is NOT a “material change”
- Change IS a “material change” of low / medium risk
- Change IS a “material change” of high risk needing Shiproom review

Please refer to the Runtime Shiproom Wiki: http://go/runtimeshiproomwiki

Security implications

This section is intended for the reviewers of this PR.
Please make sure you consider the content of the "What are the responsibilities of code reviewers?" section of go/pr-security-review.


Add query sampling for checking

d2baa38

andrewma2 marked this pull request as ready for review

August 12, 2022 17:48

andrewma2 requested a review from davidyuanfs

August 12, 2022 18:25

davidyuanfs reviewed

src/query/api/v1/handler/prom/read.go

+              	// Ratio of queries we make a check for
+              	DefaultCheckSampleRate float64 = 0.0
+              	// Threshold in % to determine if there's difference in results (1 means 1% diff)
+              	DefaultComparePercentThreshold float64 = 1.0

davidyuanfs Aug 15, 2022

Is this a 1% difference across all buckets?

Author

andrewma2 Aug 15, 2022

It checks the final aggregated result, so the threshold is a 1% difference in that final result.

src/query/api/v1/handler/prom/read.go

               )
+              // Compares results a, b to the specified percent threshold
+              // Results should be vectors
+              func compareResults(a, b *promql.Result, threshold float64) bool {

davidyuanfs Aug 15, 2022

What may cause the values to differ?

Author

andrewma2 Aug 15, 2022

I think there can be slight floating-point precision errors, especially when comparing values, so I thought a percentage threshold would be best.

src/query/api/v1/handler/prom/read.go

               		return
               	}
+              	// Rulemanager results are vector values (list of metric + value)
+              	// Take a random number and check if under rate so we check a proportion of the queries
+              	if rand.Float64() < float64(h.queryCheckConfig.CheckSampleRate) && res.Value.Type() == parser.ValueTypeVector {

davidyuanfs Aug 15, 2022

Not a big deal, but do we need float64? Why not float32?

src/query/api/v1/handler/prom/read.go

+              		if result.Err != nil {
+              			h.logger.Error("Comparison query failed to execute")
+              		} else {
+              			if result != nil && !compareResults(res, result, h.queryCheckConfig.ComparePercentThreshold) {

davidyuanfs Aug 15, 2022

How do you know whether this res was served from the cache rather than from M3DB, for example on a cache miss?

src/query/api/v1/handler/prom/read.go

+              		}
+              		defer query.Close()
+              		// Set context so we can default to M3DB later on
+              		result := query.Exec(context.WithValue(ctx, "UseM3DB", true))

davidyuanfs Aug 15, 2022

Rename this to m3dbQueryResult to avoid confusion between res and result.

src/query/api/v1/handler/prom/read.go

               		return
               	}
+              	// Rulemanager results are vector values (list of metric + value)
+              	// Take a random number and check if under rate so we check a proportion of the queries
+              	if rand.Float64() < float64(h.queryCheckConfig.CheckSampleRate) && res.Value.Type() == parser.ValueTypeVector {

davidyuanfs Aug 15, 2022

Can we move this check into redis_cache.go? Putting this section of code in the root HTTP method exposes storage/cache knowledge to the upper layer, which is not good.
