Value in latency histogram is larger than latency max value, wrong computation or wrong interpretation? #41

Open
git4ghw opened this issue May 16, 2023 · 1 comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request question Further information is requested


git4ghw commented May 16, 2023

For some latency values, the max value is lower than the maximum value in the histogram:

max Files latency = 5.30 ms
max value in Files lat hist = 5.792 ms

The same holds for the IO latency values:
max IO latency = 196 us
max value in IO lat hist = 216 us

Do I misinterpret the values, or is there something wrong with the computation of the values?

      Files latency    : [ min=405us avg=2.10ms max=5.30ms ]
      Files lat % us   : [ 1%<=430 50%<=862 75%<=4096 99%<=5792 ]
      Files lat hist   : [ 430: 1, 512: 1, 608: 1, 724: 1, 862: 1, 1024: 1, 4096: 2, 4870: 1, 5792: 1 ]
      IO latency       : [ min=6us avg=24us max=196us ]
      IO lat % us      : [ 1%<=6.8 50%<=12 75%<=22 99%<=216 ]
      IO lat hist      : [ 6.8: 9, 8.0: 21, 9.6: 14, 12: 18, 14: 5, 16: 5, 20: 2, 22: 3, 26: 6, 46: 1, 64: 2, 76: 6, 90: 1, 108: 4, 128: 1, 216: 2 ]

Assuming that the histogram values are actually measured values, this would mean that some or all of the max values are wrong, and that design decisions based on the max value would not be correct: latency-based long-distance architectures could be statistically significantly wrong, even if measured with a very high number of measurements. For the IO values (with a very small sample), the discrepancy is ~10%.

@breuner breuner added documentation Improvements or additions to documentation enhancement New feature or request question Further information is requested labels May 18, 2023
@breuner breuner self-assigned this May 18, 2023
Owner

breuner commented May 18, 2023

Thanks for reporting this, @git4ghw. This made me aware that the background for this is not explained in the built-in help, so at the very least I need to update the help.

Generally, the latency histogram uses "buckets" (each bucket representing a latency range) to count the number of IOs that fell into a certain range. Not knowing upfront how fast or slow the tested system is, the histogram has to cover a fairly wide range, and with a high resolution across the whole range there would be too many buckets. That's why the implemented approach uses buckets that get wider, and thus less accurate, as latency increases. The bucket boundaries are calculated as 2^n microseconds, where n starts at 0.25 and is increased by 0.25 for each higher latency bucket. The value shown for a bucket in the histogram is its upper boundary, not a measured latency.
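As a minimal sketch of the scheme described above, the following Python snippet lists bucket boundaries as 2^n microseconds with n growing in steps of 0.25. It reads 2^n as the upper bound of each bucket, which is an interpretation that matches the 4.87ms to 5.79ms range in the example; the function name `bucket_upper_bounds_us` and the bucket count are hypothetical and not taken from the tool's source. The labels printed by the tool may be rounded differently.

```python
def bucket_upper_bounds_us(num_buckets: int, step: float = 0.25) -> list[float]:
    """Upper bound (in microseconds) of each bucket: 2^(step * k), k = 1..num_buckets.

    Assumption: n starts at `step` and grows by `step` per bucket, as
    described in the comment above. This is not the tool's actual code.
    """
    return [2.0 ** (step * k) for k in range(1, num_buckets + 1)]


# 56 buckets is an arbitrary choice for illustration (covers up to 2^14 us).
bounds = bucket_upper_bounds_us(56)

# Print the boundaries around the 5.30 ms example (n = 11.75 .. 12.5).
for upper in bounds[46:50]:
    print(f"{upper:.1f} us")
```

Each boundary is about 19% larger than the previous one (a factor of 2^0.25), so bucket width grows in proportion to the latency it covers.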

In your example, the max latency was 5300 microseconds. log2(5300) is 12.37. This means it got into the histogram bucket that holds the range 2^12.25 - 2^12.5, or in other words the bucket that holds the range 4.87ms to 5.79ms.
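The calculation above can be reproduced with a short Python sketch. It assumes a bucket covers the half-open range (lower, upper], so a value exactly on a boundary lands in the lower bucket; that boundary handling, like the function name `bucket_range_us`, is an assumption for illustration rather than the tool's confirmed behavior.

```python
import math


def bucket_range_us(latency_us: float, step: float = 0.25) -> tuple[float, float]:
    """Return (lower, upper) bounds in microseconds of the bucket holding latency_us.

    Assumption: bucket bounds are 2^(step * k) and the first bucket ends at 2^step.
    """
    exponent = math.log2(latency_us)
    # Round the exponent up to the next multiple of `step`; never go below bucket 1.
    index = max(1, math.ceil(exponent / step))
    upper_exp = index * step
    return 2.0 ** (upper_exp - step), 2.0 ** upper_exp


for max_latency_us in (5300, 196):  # the two max values from the report
    lower, upper = bucket_range_us(max_latency_us)
    print(f"{max_latency_us} us -> bucket {lower:.1f} us .. {upper:.1f} us")
```

Under these assumptions, both reported max values fall inside the bucket whose upper bound appears as the largest histogram entry, which is why the histogram label exceeds the measured maximum.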

However, I'm not very happy with this approach, because it's rather inconvenient for humans to calculate the ranges. This could be addressed by explicitly mentioning the ranges in the output, but then the histogram output line would get correspondingly longer. Alternatively, I could think about a new approach that tries to meet your suggestion of not exceeding 10% inaccuracy.
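To put a number on the inaccuracy, here is a rough Python sketch of how far a bucket's upper bound can exceed a value inside that bucket, and what exponent step would keep that below 10%. It is only arithmetic on the 2^n scheme described above, not a proposal taken from the tool; the helper names are hypothetical.

```python
import math


def worst_case_overestimate(step: float) -> float:
    """Largest relative gap between a bucket's upper bound and a value in it.

    A value just above the lower bound 2^(n - step) is reported as 2^n,
    so the relative overestimate approaches 2^step - 1.
    """
    return 2.0 ** step - 1.0


def max_step_for_error(rel_error: float) -> float:
    """Largest exponent step whose worst-case overestimate stays within rel_error."""
    return math.log2(1.0 + rel_error)


print(f"step 0.25  -> up to {worst_case_overestimate(0.25):.1%} above the true value")
print(f"step 0.125 -> up to {worst_case_overestimate(0.125):.1%} above the true value")
print(f"10% limit  -> step must be <= {max_step_for_error(0.10):.4f}")

# Gaps actually observed in the report (histogram label vs. measured max):
print(f"Files: {5792 / 5300 - 1:.1%}, IO: {216 / 196 - 1:.1%}")
```

With a step of 0.25 the worst case is just under 19%, so the roughly 10% gaps in the report are within what the scheme allows. Halving the step to 0.125 would stay below 10%, at the cost of twice as many buckets.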

And of course I'm open to suggestions if you want to make any.
