Refactor Prometheus and Add Request Level Metrics #2316
Conversation
…asing values. Computing averages on the client is an anti-pattern for Prometheus metrics and should be computed on the Prometheus server
… Grafana dashboard
…rything a lot simpler, so I squished everything back into a single file.
Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction; a few things to start:
@simon-mo Sounds good. Thanks for the feedback.
For the existing logging message --> are you referring to the
Yes!
Co-authored-by: Simon Mo <[email protected]>
I like the idea of doing it in another PR
…s to be compatible with prior versions (and adds back the gauges that compute avg tput for backwards compatibility)
The only outstanding item, I think, is the
@simon-mo requesting re-review
@simon-mo I think this is good, and should be merged
Thank you for the great work here. And thanks @NikolaBorisov for the review.
* Add vllm-online-serving
* Add prom metrics
* Update monitoring
* remove logging
* Add labels
* Use vllm directly from upstream latest to pick up vllm-project/vllm#2316
* Roll back vllm to 0.3.0
* Get patch files for metrics in vllm-project/vllm#2316
* Update llm_engine.py
* Write documents
* Add vllm-online-serving/README-ko.md
* write README.md
Summary
This PR does three things:
A) Addresses the open feature request (#1870) by refactoring and extending the initial implementation of metrics (#1890) to:
B) Creates an end-to-end example of how to monitor vLLM with Prometheus and Grafana
C) Updates the existing metric implementations to follow Prometheus best practices, namely:
- `vllm:num_requests_running` should be `vllm_num_requests_running_total`
- Averages should not be computed on the client with `Gauges`, but rather with `Counters` + PromQL `rate` (Prom Docs) -> `vllm:avg_generation_throughput_toks_per_sec` should be a `Counter` called `vllm_generation_tokens_total`, and dashboards should use PromQL `rate(vllm_generation_tokens_total[5s])` to calc tokens / second during dashboarding (see the sketch below).
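For illustration, here is a minimal sketch of the Counter + `rate` pattern using `aioprometheus`; the label name and the increment site below are illustrative assumptions, not the exact code in this PR:

```python
# Sketch of the "Counter + PromQL rate" pattern (illustrative, not the PR's exact code).
from aioprometheus import Counter

# Monotonically increasing total of generation tokens processed.
counter_generation_tokens = Counter(
    "vllm_generation_tokens_total",
    "Number of generation tokens processed.",
)

def on_engine_iteration(num_new_tokens: int) -> None:
    # Only increment the counter; never compute an average on the client.
    counter_generation_tokens.add(
        {"model_name": "mistralai/Mistral-7B-v0.1"}, num_new_tokens
    )

# Throughput is then derived on the Prometheus server / in Grafana, e.g.:
#   rate(vllm_generation_tokens_total[5s])
```

The key point is that the client only exposes raw, monotonically increasing totals; all averaging happens in PromQL.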
A) Implementation

Created / updated the following classes:
- `SequenceGroup`: added a `last_token_time` variable and (`get_last_latency` / `get_e2e_latency`) methods, which enable us to capture request-level latencies if logging is enabled.
- `LLMEngine`: added a `PrometheusLogger` and logic to create `Stats`, making a cleaner interface between the `LLMEngine` and logging-related functionality. In `_process_model_outputs`, we call `LLMEngine._get_stats` to generate `Stats` that are passed to `PrometheusLogger.log`.
- `PrometheusLogger`: holds a list of `PrometheusMetrics` and passes the `Stats` generated by the `LLMEngine` to each.
- `PrometheusMetric`: holds a metric (an `aioprometheus` collector: `Counter`, `Gauge`, or `Histogram`) and a function to extract the appropriate data from `Stats`. A sketch of this structure follows below.
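To make the relationships between these classes concrete, here is a minimal sketch under the assumption that the field names and call flow match the description above; the exact signatures in the PR may differ:

```python
# Illustrative sketch of the logging architecture described above; fields and
# method signatures are assumptions, not the PR's exact code.
from dataclasses import dataclass, field
from typing import Callable, List, Union

from aioprometheus import Counter, Gauge, Histogram


@dataclass
class Stats:
    # Snapshot produced by LLMEngine._get_stats each engine step (illustrative fields).
    num_running: int = 0
    num_waiting: int = 0
    num_generation_tokens: int = 0
    time_to_first_token: List[float] = field(default_factory=list)


@dataclass
class PrometheusMetric:
    # Pairs an aioprometheus collector with a function that pulls its value from Stats.
    collector: Union[Counter, Gauge, Histogram]
    extract: Callable[[Stats], Union[int, float, List[float]]]


class PrometheusLogger:
    def __init__(self, metrics: List[PrometheusMetric]) -> None:
        self.metrics = metrics

    def log(self, stats: Stats) -> None:
        # Called with freshly generated Stats after each engine step.
        for metric in self.metrics:
            value = metric.extract(stats)
            if isinstance(metric.collector, Gauge):
                metric.collector.set({}, value)
            elif isinstance(metric.collector, Counter):
                metric.collector.add({}, value)
            else:  # Histogram: observe each latency sample individually.
                for sample in value:
                    metric.collector.observe({}, sample)
```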
Within this framework, created a registry of `PrometheusMetrics`. Currently supported include:
- `counter_prompt_tokens` --> used with `rate()` to calculate prompt token throughput
- `counter_generation_tokens` --> used with `rate()` to calculate generation token throughput
- `gauge_scheduler_running`
- `gauge_scheduler_swapped`
- `gauge_scheduler_waiting`
- `gauge_gpu_cache_usage`
- `gauge_cpu_cache_usage`
- `histogram_time_to_first_token` --> exposes counters needed to calculate avg TTFT, P50, P90, P95, P99
- `histogram_inter_token_latency` --> exposes counters needed to calculate avg ITL, P50, P90, P95, P99
- `histogram_e2e_request_latency` --> exposes counters needed to calculate e2e request latency, P50, P90, P95, P99

See the Example for a dashboard that shows how these exposed metrics should be monitored; a sketch of the histogram pattern follows below.
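The histogram metrics follow the same principle: the client only exposes bucket counters, and percentiles are computed in PromQL. A minimal sketch, assuming `aioprometheus`; the metric name and bucket boundaries here are illustrative, not necessarily the ones in this PR:

```python
# Sketch of a latency histogram; the name and bucket boundaries are illustrative.
from aioprometheus import Histogram

histogram_time_to_first_token = Histogram(
    "vllm_time_to_first_token_seconds",
    "Histogram of time to first token in seconds.",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

def record_ttft(seconds: float) -> None:
    histogram_time_to_first_token.observe({}, seconds)

# P99 TTFT is then derived on the Prometheus side, e.g.:
#   histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))
```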
B) Example
See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.
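As a complement to the Grafana dashboard, the same PromQL can be exercised directly against Prometheus' HTTP API. A small sketch, assuming Prometheus is reachable at http://localhost:9090 (adjust for your setup):

```python
# Query Prometheus' HTTP API for generation-token throughput; the address and
# the 30s rate window are assumptions for this sketch.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "rate(vllm_generation_tokens_total[30s])"},
    timeout=5,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```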
C) Best Practices
I recognize these changes have breaking impacts on the metrics exposed to users.
Key changes include:
- Renamed metrics (`vllm:num_requests_swapped` --> `vllm_requests_stopped_total`)
- Converted the average throughput gauges (`vllm:avg_prompt_throughput_toks_per_s` / `vllm:avg_generation_throughput_toks_per_s`) to be total tokens processed counters (`vllm_prompt_tokens_total` / `vllm_generation_tokens_total`); throughput is now calculated during dashboarding, e.g. `rate(vllm_prompt_tokens_total[30s])`
My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert them if requested.
Overhead
I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. The added latency is very minor.
`benchmark_serving.py`

Client: `python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200`
Launch with System Logging:

`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests`

Launch without System Logging (adds `--disable-log-stats`):

`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests`
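To sanity-check that the new metrics are being exported while the benchmark runs, something like the following can be used; the host, port, and `/metrics` path are assumptions that may need adjusting for your deployment:

```python
# Fetch the raw Prometheus exposition text from the running server and print
# the vllm-related lines. Host, port, and /metrics path are assumptions.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm"):
        print(line)
```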
Next Steps
Next steps to finalize the PR are:
Questions
Are there any other things I need to do?