Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric #606

jon-whit · 2023-07-28T19:04:25Z

I'm using the prometheus middleware to measure the RPC endpoint latency for my gRPC service. I've enabled the ServerHandlingTimeHistogram.

I noticed that the grpc_code label is not included in the observed _bucket metrics (see this line of code). Consequently, I cannot effectively measure my p99 request latency for requests served successfully (e.g. where grpc_code="OK"). I'd like to set some SLOs based on percentiles of successful requests, but this middleware doesn't currently allow me to do that as far as I can tell.. Was this design choice intentional, and if so why?

Would you be open to me submitting a PR to add it?

The text was updated successfully, but these errors were encountered:

jon-whit · 2023-07-28T19:59:54Z

Related to grpc-ecosystem/go-grpc-prometheus#75, but reviving since that repository is now deprecated.

johanbrandhorst · 2023-07-28T23:22:55Z

Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say if this is possible, perhaps @bwplotka can weigh in on here too.

bwplotka · 2023-07-31T09:45:25Z

Yea. I think with native histograms it would be easier decision. Wonder if it makes sense to be early adopter of native histograms here for this reason cc @beorn7

Otherwise we could add opt-in grpc_code, that's also fine.

jon-whit · 2023-08-02T16:18:34Z

Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say if this is possible, perhaps @bwplotka can weigh in on here too.

@johanbrandhorst I'm not sure I agree with that conclusion. I wouldn't say any consensus was yet achieved. There are still community members requesting it and the "consensus" was merely the response of the maintainers. But, in any case, why is it not possible to allow the client to choose whether to include it or not? For me the increase in cardinality is well worth the value I get from it. I should be able to enable this label if I want and understand that there is a tradeoff. For example, I should be able to

var serverMetrics *grpc_prometheus.ServerMetrics

serverMetrics.EnableHandlingTimeHistogram(
    ...,
    WithStatusCodeLabel(),
)

If the WithStatusCodeLabel option is provided to the handling time histogram, then that label should be emitted and a value provided for it in each observation. The default can be to omit the grpc_code label, which maintains equivalent behavior with what is standard today, but if you want that label and are willing to increase cardinality, then that should be the developer's choice.

This work ☝️ can be in addition to switching to a native histogram so as to further reduce the impact/burden if you want to enable the grpc_code label.

If I submit a PR, will you guys consider merging it? If so, I'd be happy to submit one this week.

beorn7 · 2023-08-08T09:38:57Z

It would be great to offer native histogram support, but I would recommend to also support classic histograms in parallel so that scraper that aren't ready for native histograms yet will still get a classic histogram.

johanbrandhorst · 2024-05-01T02:52:28Z

There still seems to be interest from the community to add this (see #709), I'm not sure there was a conclusion from this discussion, I will defer to you on this @bwplotka.

Tatskaari · 2024-07-16T15:34:03Z

Hey! I'd also find this metrics useful because we're trying to measure our SLO burn rate and would like to set a success criteria based on success/failure and latency. If we use two separate counters, we'd have to double count messages that are both late, and errors.

I think it would be best to leave it up to the end users to decide whether they want to keep this label by dropping the labels when they scrape them.

jon-whit mentioned this issue Jul 28, 2023

chore: update go-grpc-prometheus middleware to use fork with grpc_code openfga/openfga#895

Merged

4 tasks

miparnisari mentioned this issue Apr 29, 2024

feat: add grpc_code label to grpc_server_handling_seconds metric #709

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric #606

Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric #606

jon-whit commented Jul 28, 2023 •

edited

Loading

jon-whit commented Jul 28, 2023

johanbrandhorst commented Jul 28, 2023

bwplotka commented Jul 31, 2023

jon-whit commented Aug 2, 2023 •

edited

Loading

beorn7 commented Aug 8, 2023

johanbrandhorst commented May 1, 2024

Tatskaari commented Jul 16, 2024 •

edited

Loading

Adding grpc_code label to prometheus grpc_server_handling_seconds metric #606

Adding grpc_code label to prometheus grpc_server_handling_seconds metric #606

Comments

jon-whit commented Jul 28, 2023 • edited Loading

jon-whit commented Jul 28, 2023

johanbrandhorst commented Jul 28, 2023

bwplotka commented Jul 31, 2023

jon-whit commented Aug 2, 2023 • edited Loading

beorn7 commented Aug 8, 2023

johanbrandhorst commented May 1, 2024

Tatskaari commented Jul 16, 2024 • edited Loading

Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric #606

Adding `grpc_code` label to prometheus `grpc_server_handling_seconds` metric #606

jon-whit commented Jul 28, 2023 •

edited

Loading

jon-whit commented Aug 2, 2023 •

edited

Loading

Tatskaari commented Jul 16, 2024 •

edited

Loading