Refactor Prometheus and Add Request Level Metrics #2316
Conversation
…asing values. Computing averages on the client is an anti-pattern for Prometheus metrics and should be computed on the Prometheus server
… Grafana dashboard
…rything a lot simpler, so I squished everything back into a single file.
Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction; a few things to start:
@simon-mo Sounds good. Thanks for the feedback.
For the existing logging message --> are you referring to the
Yes!
Co-authored-by: Simon Mo <[email protected]>
I like the idea of doing it in another PR
…s to be compatible with prior versions (and adds back the gauges that compute avg tput for backwards compatibility)
The only outstanding item, I think, is the
@simon-mo requesting re-review
@simon-mo I think this is good, and should be merged
Thank you for the great work here. And thanks @NikolaBorisov for the review.
* Add vllm-online-serving
* Add prom metrics
* Update monitoring
* remove logging
* Add labels
* Use vllm directly from upstream latest to pick up vllm-project/vllm#2316
* Roll back vllm to 0.3.0
* Get patch files for metrics in vllm-project/vllm#2316
* Update llm_engine.py
* Write documents
* Add vllm-online-serving/README-ko.md
* write README.md
Summary
This PR does three things:
A) Addresses the open feature request (#1870) by refactoring and extending the initial implementation of metrics (#1890) to:
B) Creates an end-to-end example of how to monitor vLLM with Prometheus and Grafana
C) Updates the existing metric implementations to follow Prometheus best practices, namely:
- `vllm:num_requests_running` should be `vllm_num_requests_running_total`
- Averages should not be computed on the client with `Gauges`, but rather with `Counters` + PromQL `rate` (Prom Docs) -> `vllm:avg_generation_throughput_toks_per_sec` should be a `Counter` called `vllm_generation_tokens_total`, and dashboards should use PromQL `rate(vllm_generation_tokens_total[5s])` to calc tokens / second during dashboarding (see the sketch below).
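For illustration, here is a minimal sketch of the Counter + `rate` pattern using `aioprometheus`; the label name and the increment site below are illustrative assumptions, not the exact code in this PR:

```python
# Sketch of the "Counter + PromQL rate" pattern (illustrative, not the PR's exact code).
from aioprometheus import Counter

# Monotonically increasing total of generation tokens processed.
counter_generation_tokens = Counter(
    "vllm_generation_tokens_total",
    "Number of generation tokens processed.",
)

def on_engine_iteration(num_new_tokens: int) -> None:
    # Only increment the counter; never compute an average on the client.
    counter_generation_tokens.add(
        {"model_name": "mistralai/Mistral-7B-v0.1"}, num_new_tokens
    )

# Throughput is then derived on the Prometheus server / in Grafana, e.g.:
#   rate(vllm_generation_tokens_total[5s])
```

The key point is that the client only exposes raw, monotonically increasing totals; all averaging happens in PromQL.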
A) Implementation

Created / updated the following classes:
- `SequenceGroup`: added a `last_token_time` variable and (`get_last_latency` / `get_e2e_latency`) methods, which enable us to capture request-level latencies if logging is enabled.
- `LLMEngine`: added a `PrometheusLogger` and logic to create `Stats`, making a cleaner interface between the `LLMEngine` and logging-related functionality. In `_process_model_outputs`, we call `LLMEngine._get_stats` to generate `Stats` that are passed to `PrometheusLogger.log`.
- `PrometheusLogger`: holds a list of `PrometheusMetrics` and passes the `Stats` generated by the `LLMEngine` to each.
- `PrometheusMetric`: holds a metric (an `aioprometheus` collector: `Counter`, `Gauge`, or `Histogram`) and a function to extract the appropriate data from `Stats`. A sketch of this structure follows below.
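To make the relationships between these classes concrete, here is a minimal sketch under the assumption that the field names and call flow match the description above; the exact signatures in the PR may differ:

```python
# Illustrative sketch of the logging architecture described above; fields and
# method signatures are assumptions, not the PR's exact code.
from dataclasses import dataclass, field
from typing import Callable, List, Union

from aioprometheus import Counter, Gauge, Histogram


@dataclass
class Stats:
    # Snapshot produced by LLMEngine._get_stats each engine step (illustrative fields).
    num_running: int = 0
    num_waiting: int = 0
    num_generation_tokens: int = 0
    time_to_first_token: List[float] = field(default_factory=list)


@dataclass
class PrometheusMetric:
    # Pairs an aioprometheus collector with a function that pulls its value from Stats.
    collector: Union[Counter, Gauge, Histogram]
    extract: Callable[[Stats], Union[int, float, List[float]]]


class PrometheusLogger:
    def __init__(self, metrics: List[PrometheusMetric]) -> None:
        self.metrics = metrics

    def log(self, stats: Stats) -> None:
        # Called with freshly generated Stats after each engine step.
        for metric in self.metrics:
            value = metric.extract(stats)
            if isinstance(metric.collector, Gauge):
                metric.collector.set({}, value)
            elif isinstance(metric.collector, Counter):
                metric.collector.add({}, value)
            else:  # Histogram: observe each latency sample individually.
                for sample in value:
                    metric.collector.observe({}, sample)
```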
Within this framework, created a registry of `PrometheusMetrics`. Currently supported include:
- `counter_prompt_tokens` --> used with `rate()` to calculate prompt token throughput
- `counter_generation_tokens` --> used with `rate()` to calculate generation token throughput
- `gauge_scheduler_running`
- `gauge_scheduler_swapped`
- `gauge_scheduler_waiting`
- `gauge_gpu_cache_usage`
- `gauge_cpu_cache_usage`
- `histogram_time_to_first_token` --> exposes counters needed to calculate avg TTFT, P50, P90, P95, P99
- `histogram_inter_token_latency` --> exposes counters needed to calculate avg ITL, P50, P90, P95, P99
- `histogram_e2e_request_latency` --> exposes counters needed to calculate e2e request latency, P50, P90, P95, P99

See the Example for a dashboard that shows how these exposed metrics should be monitored; a sketch of the histogram pattern follows below.
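The histogram metrics follow the same principle: the client only exposes bucket counters, and percentiles are computed in PromQL. A minimal sketch, assuming `aioprometheus`; the metric name and bucket boundaries here are illustrative, not necessarily the ones in this PR:

```python
# Sketch of a latency histogram; the name and bucket boundaries are illustrative.
from aioprometheus import Histogram

histogram_time_to_first_token = Histogram(
    "vllm_time_to_first_token_seconds",
    "Histogram of time to first token in seconds.",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

def record_ttft(seconds: float) -> None:
    histogram_time_to_first_token.observe({}, seconds)

# P99 TTFT is then derived on the Prometheus side, e.g.:
#   histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))
```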
B) Example
See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.
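As a complement to the Grafana dashboard, the same PromQL can be exercised directly against Prometheus' HTTP API. A small sketch, assuming Prometheus is reachable at http://localhost:9090 (adjust for your setup):

```python
# Query Prometheus' HTTP API for generation-token throughput; the address and
# the 30s rate window are assumptions for this sketch.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "rate(vllm_generation_tokens_total[30s])"},
    timeout=5,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```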
C) Best Practices
I recognize these changes have breaking impacts on the metrics exposed to users.
Key changes include:
- Renamed metrics (`vllm:num_requests_swapped` --> `vllm_requests_stopped_total`)
- Converted the average throughput gauges (`vllm:avg_prompt_throughput_toks_per_s` / `vllm:avg_generation_throughput_toks_per_s`) to be total tokens processed counters (`vllm_prompt_tokens_total` / `vllm_generation_tokens_total`); throughput is now calculated during dashboarding, e.g. `rate(vllm_prompt_tokens_total[30s])`
My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert them if requested.
Overhead
I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. The added latency is very minor.
`benchmark_serving.py`

Client: `python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200`
Launch with System Logging:

`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests`

Launch without System Logging (adds `--disable-log-stats`):

`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests`
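To sanity-check that the new metrics are being exported while the benchmark runs, something like the following can be used; the host, port, and `/metrics` path are assumptions that may need adjusting for your deployment:

```python
# Fetch the raw Prometheus exposition text from the running server and print
# the vllm-related lines. Host, port, and /metrics path are assumptions.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm"):
        print(line)
```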
Next Steps
Next steps to finalize the PR are:
Questions
Are there any other things I need to do?