Exponential Histogram : observed bucket counts sometimes not matching with overall count #4450
@kothariroshni8 is there a minimal way to reproduce this? How often is "intermittent"? Without a reproduction this may be hard to track down, but I think it is important to fix. @mwear can you please look into this? I think this should definitely not be possible. edit: sorry Aaron for the errant ping |
@martinkuba and I looked at the code involving the counts and zero counts and were not able to find anything obviously wrong. We'd like to see if there is some way to narrow this down more definitively to an issue in the JS SDK. I'll paste an image below that has what I believe to be something representative of your current setup (on top). Would you be able to set up a single JS process that reports into a single collector that exports to Mimir (like the lower part of the diagram) and see if that repros? If it does, and you can provide logs (as in your original post), that would be helpful. We're just trying to rule out some of the potential complexities in the multi-process, multi-collector scenario you are currently running. |
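(For anyone trying the single-process setup described above: the reporter's actual setup code isn't included in this thread, but a rough sketch of one JS process exporting exponential histograms to a single local collector might look like the following. The package names are the stock SDK ones; the instrument name, collector URL, and intervals are placeholders, not values from this issue.)

```ts
// Hypothetical single-process repro sketch: one JS process exporting
// exponential histograms over OTLP/HTTP to one local collector.
import {
  MeterProvider,
  PeriodicExportingMetricReader,
  View,
  Aggregation,
  InstrumentType,
} from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const meterProvider = new MeterProvider({
  views: [
    // route all histogram instruments to the exponential histogram aggregation
    new View({
      instrumentName: '*',
      instrumentType: InstrumentType.HISTOGRAM,
      aggregation: Aggregation.ExponentialHistogram(),
    }),
  ],
});

meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    exportIntervalMillis: 10_000,
  })
);

// record small latencies, similar in magnitude to the buckets seen in this thread
const histogram = meterProvider
  .getMeter('expohisto-repro')
  .createHistogram('repro.duration');
setInterval(() => histogram.record(Math.random() * 0.2), 5);
```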
@dyladan Yes, it is difficult to reproduce; I tried multiple ways but had no luck.
|
Those observations are helpful. Looking at the data, I suspect that the max value might actually be correct and that the counts in the last bucket might be there as the result of an off-by-one error. The prior non-zero bucket is |
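(For readers following along, this is the base-2 boundary rule those bucket indexes refer to, as a generic sketch of the exponential-histogram mapping from the OpenTelemetry spec rather than code from this repo: at scale s, bucket i covers the half-open interval (2^(i·2^-s), 2^((i+1)·2^-s)], so a value exactly on a boundary belongs to the lower-indexed bucket, which is where an off-by-one in the index would show up.)

```ts
// Generic illustration of exponential-histogram bucket boundaries (not SDK code):
// at scale `s`, bucket index `i` covers (lowerBoundary(i, s), lowerBoundary(i + 1, s)].
const lowerBoundary = (index: number, scale: number): number =>
  Math.pow(2, index * Math.pow(2, -scale));

// e.g. at scale 4, index -49 covers roughly (0.11970041, 0.125],
// which matches one of the buckets that shows up later in this thread
console.log(lowerBoundary(-49, 4), lowerBoundary(-48, 4));
```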
I just realized that we allow … Just speculating, but maybe the behavior here is the result of two different issues. 🤔 |
I think the error is coming from the merge function in ExponentialHistogram.ts. What I tried: I patched the merge function and compared the counts after it runs.
Output Log.
Also, it mostly happens with bucket lengths 81 and 161. |
@kothariroshni8 thanks for your debugging work and isolating this to the merge operation; it's super helpful. I'll look into the merge logic in detail tomorrow. |
If you have the time, you could add the following printBuckets helper to the ExponentialHistogramAccumulation class:
printBuckets() {
console.log(`scale: %d\ncount: %d\nzero count: %d\nmin: %d\nmax: %d\nsum: %d`,
this.scale, this.count, this.zeroCount, this.min, this.max, this.sum);
const buckets = this.positive;
for (let i = 0; i < buckets.length; i++) {
const index = buckets.offset + i;
const lower = this._mapping.lowerBoundary(index).toFixed(8);
const upper = this._mapping.lowerBoundary(index + 1).toFixed(8)
const count = buckets.at(i);
console.log(`Bucket: (%d, %d], Count: %d, Index: %d`, lower, upper, count, index);
}
}
Then you could modify merge in the aggregator like this:
merge(
previous: ExponentialHistogramAccumulation,
delta: ExponentialHistogramAccumulation
): ExponentialHistogramAccumulation {
const result = delta.clone();
result.merge(previous);
// if counts not matching
console.log('merge: delta');
delta.printBuckets();
console.log('merge: previous');
previous.printBuckets();
console.log('merge: result');
result.printBuckets();
return result;
}
That will output the buckets, counts, etc. in a format very similar to what you see in the collector. I should be able to reconstruct the histograms from that if need be. |
@mwear: Please find below the bucket details you requested:
|
Is it happening when any one of the buckets has no value? |
I set up a test case to replicate the data from your original comment #4450 (comment) (which has since been edited). By the way, the current edit doesn't include the data from all three histograms: delta, previous, and result, so I can't do the same analysis on it. I'll describe my process with your original data; if you can replicate it on any output you capture, we may find something interesting or a reliable repro. Before getting into the data, I wanted to point out one detail. The count for the histogram includes the counts of all the buckets, plus the zero count. Zero doesn't get a dedicated bucket in the output. I think that is one detail missing when you are detecting a mismatch in your patched merge function. That said, the data you provided shows the counts are off by more than just the zero count. The data I'm working with is from your first comment before the edits, which I'll paste below:
In the data provided, delta had a count of 11, which matched its bucket count. Previous had a count of 86 and a zero count of 1; that does not match the expected bucket count (including the zero count) of 95. The count for result was off because of this: your data showed a count of 97 when the bucket count (including the zero count) should be 106. I filtered out the buckets with 0 items to make processing them easier (more on this soon). I copied the relevant lines and ran:
delta
previous
result
I then converted these into histograms to use in a mocha test case. Since I don't have the exact values available, I picked values that are just inside the range for each bucket. This will give us histograms with the same scale, same buckets, and same counts in each bucket, but slightly different values for max, min, and sum. This test case is available in a branch here: main...mwear:opentelemetry-js:expohisto-debug, but I'll paste it below for the purpose of discussion.
import {
ExponentialHistogramAccumulation,
} from '../../src/aggregator/ExponentialHistogram';
import * as assert from 'assert';
describe('debug', () => {
it.only('attempts to repro', () => {
const delta = new ExponentialHistogramAccumulation([0, 0], 160);
delta.updateByIncrement(0.000979, 2); // Bucket: [0.00097656, 0.0010198], Count: 2, Index: -160
delta.updateByIncrement(0.001959, 2); // Bucket: [0.00195313, 0.0020396], Count: 2, Index: -144
delta.updateByIncrement(0.002889, 2); // Bucket: [0.00288443, 0.00301213], Count: 2, Index: -135
delta.updateByIncrement(0.003909, 1); // Bucket: [0.00390625, 0.00407919], Count: 1, Index: -128
delta.updateByIncrement(0.004859, 1); // Bucket: [0.00485101, 0.00506578], Count: 1, Index: -123
delta.updateByIncrement(0.005769, 1); // Bucket: [0.00576885, 0.00602426], Count: 1, Index: -119
delta.updateByIncrement(0.007819, 1); // Bucket: [0.0078125, 0.00815839], Count: 1, Index: -112
delta.updateByIncrement(0.119709, 1); // Bucket: [0.11970041, 0.125], Count: 1, Index: -49
console.log('delta');
delta.printBuckets();
const previous = new ExponentialHistogramAccumulation([0, 0], 160);
previous.updateByIncrement(0, 1);
previous.updateByIncrement(0.000979, 29); // Bucket: [0.00097656, 0.00106495], Count: 29, Index: -80
previous.updateByIncrement(0.001959, 14); // Bucket: [0.00195313, 0.0021299], Count: 14, Index: -72
previous.updateByIncrement(0.002769, 10); // Bucket: [0.00276214, 0.00301213], Count: 10, Index: -68
previous.updateByIncrement(0.003909, 2); // Bucket: [0.00390625, 0.0042598], Count: 2, Index: -64
previous.updateByIncrement(0.004649, 3); // Bucket: [0.00464534, 0.00506578], Count: 3, Index: -62
previous.updateByIncrement(0.006569, 1); // Bucket: [0.0065695, 0.00716409], Count: 1, Index: -58
previous.updateByIncrement(0.007819, 2); // Bucket: [0.0078125, 0.00851959], Count: 2, Index: -56
previous.updateByIncrement(0.011049, 1); // Bucket: [0.01104854, 0.01204852], Count: 1, Index: -52
previous.updateByIncrement(0.018589, 3); // Bucket: [0.01858136, 0.02026312], Count: 3, Index: -46
previous.updateByIncrement(0.020269, 5); // Bucket: [0.02026312, 0.02209709], Count: 5, Index: -45
previous.updateByIncrement(0.028659, 2); // Bucket: [0.02865638, 0.03125], Count: 2, Index: -41
previous.updateByIncrement(0.037169, 2); // Bucket: [0.03716272, 0.04052624], Count: 2, Index: -38
previous.updateByIncrement(0.040529, 1); // Bucket: [0.04052624, 0.04419417], Count: 1, Index: -37
previous.updateByIncrement(0.044199, 1); // Bucket: [0.04419417, 0.04819409], Count: 1, Index: -36
previous.updateByIncrement(0.048199, 1); // Bucket: [0.04819409, 0.05255603], Count: 1, Index: -35
previous.updateByIncrement(0.052559, 2); // Bucket: [0.05255603, 0.05731275], Count: 2, Index: -34
previous.updateByIncrement(0.057319, 1); // Bucket: [0.05731275, 0.0625], Count: 1, Index: -33
previous.updateByIncrement(0.068159, 2); // Bucket: [0.06815673, 0.07432544], Count: 2, Index: -31
previous.updateByIncrement(0.074329, 1); // Bucket: [0.07432544, 0.08105247], Count: 1, Index: -30
previous.updateByIncrement(0.105119, 1); // Bucket: [0.10511205, 0.11462551], Count: 1, Index: -26
previous.updateByIncrement(0.192779, 1); // Bucket: [0.19277635, 0.2102241], Count: 1, Index: -19
previous.updateByIncrement(1.010000, 9); // Bucket: [1, 1.09050773], Count: 9, Index: 0
console.log('previous');
previous.printBuckets();
const result = delta.clone();
result.merge(previous);
console.log('result')
result.printBuckets();
assert.equal(result.count, delta.count + previous.count);
assert.equal(result.count, bucketCounts(result));
assert.equal(delta.count, bucketCounts(delta));
assert.equal(previous.count, bucketCounts(previous));
assert.equal(bucketCounts(result), bucketCounts(delta) + bucketCounts(previous));
})
});
function bucketCounts(histo: ExponentialHistogramAccumulation): number {
// zero counts do not get a dedicated bucket, but they are part of the overall
// histogram count
return histo.toPointValue().positive.bucketCounts.reduce(
(total, current) => total += current,
histo.zeroCount
);
}
In this test case, the counts matched the expected values and did not show the discrepancies from your original data. The output from the test is below:
I think you are definitely on to something with your debug process, but you should update your logic for detecting mismatched counts to add both the bucket counts and the zero count; otherwise it will find false positives. My observation from this data is that the |
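(For what it's worth, a count-consistency check along those lines might look roughly like the sketch below. It reuses the bucketCounts idea from the test above and seeds the sum with zeroCount; negative buckets are ignored because the data in this thread only has positive buckets.)

```ts
import { ExponentialHistogramAccumulation } from '../../src/aggregator/ExponentialHistogram';

// Sketch of a mismatch check that seeds the sum with zeroCount, since zero
// values are counted in `count` but never get a dedicated bucket.
function countsAreConsistent(histo: ExponentialHistogramAccumulation): boolean {
  const observed = histo
    .toPointValue()
    .positive.bucketCounts.reduce((total, current) => total + current, histo.zeroCount);
  return observed === histo.count;
}
```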
@mwear Sorry for the confusion; I shouldn't have edited the same comment. The logs I updated later are from the first occurrence of the issue. |
Also, buckets are missing for delta because its zeroCount is also equal to 2, i.e. its entire count consists of zero counts.
|
@mwear Use the code below to repro the error:
|
@kothariroshni8 thank you so much for your help. That does indeed repro and I have a tentative fix. I'll open a PR shortly. If you want to try this out early, I added a special case for a histogram with zero-length buckets to _incrementIndexBy. The lines I added were:
if (buckets.length === 0) {
buckets.indexStart = buckets.indexEnd = buckets.indexBase = index;
}
The full function looks like this:
private _incrementIndexBy(
buckets: Buckets,
index: number,
increment: number
) {
if (increment === 0) {
// nothing to do for a zero increment, can happen during a merge operation
return;
}
if (buckets.length === 0) {
buckets.indexStart = buckets.indexEnd = buckets.indexBase = index;
}
if (index < buckets.indexStart) {
const span = buckets.indexEnd - index;
if (span >= buckets.backing.length) {
this._grow(buckets, span + 1);
}
buckets.indexStart = index;
} else if (index > buckets.indexEnd) {
const span = index - buckets.indexStart;
if (span >= buckets.backing.length) {
this._grow(buckets, span + 1);
}
buckets.indexEnd = index;
}
let bucketIndex = index - buckets.indexBase;
if (bucketIndex < 0) {
bucketIndex += buckets.backing.length;
}
buckets.incrementBucket(bucketIndex, increment);
} |
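(To make the trigger condition concrete, here is a sketch of the kind of regression test this fix should satisfy. It is modelled on the earlier test case and on the observation that delta's positive buckets are empty because its whole count is zero counts; the values are illustrative, not the exact repro code that was shared.)

```ts
import { ExponentialHistogramAccumulation } from '../../src/aggregator/ExponentialHistogram';
import * as assert from 'assert';

// delta's count consists entirely of zero counts, so its positive buckets are empty
const delta = new ExponentialHistogramAccumulation([0, 0], 160);
delta.updateByIncrement(0, 2);

// previous has ordinary non-zero observations
const previous = new ExponentialHistogramAccumulation([0, 0], 160);
previous.updateByIncrement(0.000979, 5);
previous.updateByIncrement(1.01, 3);

// merging previous into a clone of delta exercises _incrementIndexBy with
// zero-length buckets, the case handled by the special case above
const result = delta.clone();
result.merge(previous);

// with the fix, the overall count matches the bucket counts plus the zero count
assert.equal(result.count, delta.count + previous.count);
assert.equal(
  result.count,
  result
    .toPointValue()
    .positive.bucketCounts.reduce((total, current) => total + current, result.zeroCount)
);
```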
When will you release this fix? @pichlermarc @mwear |
Aiming for next week. 🙂 |
What happened?
When using the Exponential Histogram aggregation for metrics, we have encountered an intermittent mismatch between the overall count and the sum of the bucket counts.
Steps to Reproduce
Our metrics pipeline runs inside K8s. In short, we have a collection of service pods, each running an OTel Collector as a sidecar agent. Services publish metrics to their local sidecar agent, which in turn publishes metrics via otlphttp to our OpenTelemetryCollector Gateway (i.e. a cluster of OTel Collector pods). Pods in our OpenTelemetryCollector Gateway deployment then publish metrics to Grafana Mimir via prometheusremotewrite.
Expected Result
No Mismatch in counts.
Actual Result
Mimir Distributor Log:
error exporterhelper/common.go:49 Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite/mimir", "dropped_items": 66, "error": "Permanent error: Permanent error: remote write returned HTTP status 500 Internal Server Error; err = %!w(<nil>): failed pushing to ingester: rpc error: code = Unknown desc = user={user}: 16504 observations found in buckets, but the Count field is 13906: histogram's observation count should be at least the number of observations found in the bucket
Additional Details
OpenTelemetry Setup Code
package.json
Relevant log output
No response