Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more Prometheus metrics #2764

Merged
merged 50 commits into from
Apr 28, 2024
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
319bc37
Add vllm:request_max_tokens
ronensc Jan 31, 2024
9d4ce95
Remove trailing dots from comments that are not sentences
ronensc Jan 31, 2024
ae7eb6e
Add vllm:request_success
ronensc Jan 31, 2024
2188daa
Remove redundant space
ronensc Jan 31, 2024
e41c15f
Add vllm:request_n
ronensc Jan 31, 2024
71ec7c3
Add vllm:prompt_tokens
ronensc Feb 5, 2024
45bd839
Add vllm:generation_tokens
ronensc Feb 5, 2024
f237c50
Add comments
ronensc Feb 5, 2024
f17a966
Rename metrics
ronensc Feb 5, 2024
8e0d8c1
Make type hint compatible with python 3.8
ronensc Feb 7, 2024
9ed04ef
Rename metrics
ronensc Feb 12, 2024
6aebd80
Merge branch 'main' into more-metrics
ronensc Feb 19, 2024
de84dac
Merge branch 'main' into more-metrics
ronensc Feb 21, 2024
35944cc
Merge branch 'main' into more-metrics
ronensc Feb 26, 2024
76cd774
Consider the value of `max_model_len` when building buckets
ronensc Feb 26, 2024
93b0796
Merge branch 'main' into more-metrics
ronensc Mar 4, 2024
3643e0c
Merge branch 'main' into more-metrics
ronensc Mar 13, 2024
60f1049
Fix too long line warning
ronensc Mar 13, 2024
95daee7
Add HTTP metrics from prometheus-fastapi-instrumentator
ronensc Mar 26, 2024
cf4acef
Merge remote-tracking branch 'origin/main' into more-metrics
ronensc Mar 26, 2024
0f8dae9
Make ruff happy
ronensc Mar 26, 2024
bce096c
Remove vllm:request_params_max_tokens
ronensc Mar 28, 2024
e15f653
Move deprecated metrics to legacy section
ronensc Mar 29, 2024
7b05baa
Add metric vllm:request_params_best_of
ronensc Apr 1, 2024
0958259
Revert to exposing /metrics using make_asgi_app()
ronensc Apr 1, 2024
5e2c246
Register 'finished_reason' label name on metric creation
ronensc Apr 1, 2024
5cc7b64
Merge branch 'main' into more-metrics
ronensc Apr 1, 2024
1eeb31d
Fix merge issues
ronensc Apr 1, 2024
4c79cbe
Merge branch 'main' into more-metrics
ronensc Apr 17, 2024
4c41a89
Fix merge issues
ronensc Apr 17, 2024
ac8435b
Add 3 panels to Grafana dashboard
ronensc Apr 17, 2024
f22abf5
Change order of deprecated metrics and add comments
ronensc Apr 19, 2024
9352ce7
Rename LABEL_NAME_FINISHED_REASON and make it a class variable of Met…
ronensc Apr 19, 2024
b2c0445
Set minimum version to prometheus-fastapi-instrumentator
ronensc Apr 19, 2024
e147575
Change finished_reason from counter to list
ronensc Apr 19, 2024
f9bc64e
Compute deprecated metrics using the newer version
ronensc Apr 19, 2024
5ded719
Rename variables. Strip '_lst' suffix.
ronensc Apr 19, 2024
dd84d51
Update naming schema Stats to have the _suffix pattern
ronensc Apr 19, 2024
e127a4c
Fix the incorrect logic for chunked prefill
ronensc Apr 19, 2024
2d36609
Restore num_prompt_tokens_iter and num_generation_tokens_iter
ronensc Apr 19, 2024
e81d95a
Refactor metrics logging methods
ronensc Apr 19, 2024
ece2ec0
Reorder metrics definition to match Stats order
ronensc Apr 19, 2024
5a658c8
Rename metric variables to match suffix convention
ronensc Apr 19, 2024
717b559
Make mypy happy
ronensc Apr 20, 2024
61fad41
Merge branch 'main' into more-metrics
robertgshaw2-neuralmagic Apr 25, 2024
f103ad8
./format
robertgshaw2-neuralmagic Apr 25, 2024
bf1a0c4
Merge branch 'main' into more-metrics
robertgshaw2-neuralmagic Apr 28, 2024
cc0d5eb
fixed chunked prefill logic
robertgshaw2-neuralmagic Apr 28, 2024
d7f493b
make linter happy
robertgshaw2-neuralmagic Apr 28, 2024
54bf260
fixed issues with chunked prefill X metrics
robertgshaw2-neuralmagic Apr 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion vllm/core/scheduler.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,7 +136,7 @@ def abort_seq_group(self, request_id: Union[str, Iterable[str]]) -> None:
for seq_group in state_queue:
if not request_ids:
# Using 'break' here may add two extra iterations,
# but is acceptable to reduce complexity .
# but is acceptable to reduce complexity.
break
if seq_group.request_id in request_ids:
# Appending aborted group into pending list.
Expand Down
45 changes: 42 additions & 3 deletions vllm/engine/llm_engine.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import copy
from collections import defaultdict
from collections import defaultdict, Counter as CollectionsCounter
import os
import time
from typing import (TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple,
Expand Down Expand Up @@ -845,19 +845,44 @@ def _get_stats(self,
# Iteration stats if we have scheduler output.
num_prompt_tokens = 0
num_generation_tokens = 0
num_prompt_tokens_lst = []
num_generation_tokens_lst = []
max_tokens = []
request_n = []
time_to_first_tokens = []
time_per_output_tokens = []
time_e2e_requests = []
finished_reason_counter = CollectionsCounter()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this one to be just after the # Finished Requests comment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason this can't just be a list of finished reasons that we process on the stat_logger side?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, and it would make the stats logging code in the engine a bit cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this one to be just after the # Finished Requests comment

If I move finished_reason_counter = CollectionsCounter() after # Finished Requests, it might be undefined when accessed later for the same reason as request_n and request_best_of. Please refer to that comment for details.

Is there any reason this can't just be a list of finished reasons that we process on the stat_logger side?

Done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah in the diff view I hadn't realised this was nested, it's ok to leave as is then

if scheduler_outputs is not None:
prompt_run = scheduler_outputs.prompt_run

# Number of Tokens.
# Number of Tokens
if prompt_run:
num_prompt_tokens = scheduler_outputs.num_batched_tokens
num_prompt_tokens_lst = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually an incorrect calculation.

This stat is used to compute the histogram of prompt lengths for each request. Because of chunked prefill, the prompt processing will occur over multiple steps of LLMEngine. So this will actually give you the distribution of chunks as opposed to the distribution of prefill lengths.

So what I would suggest is that we log the of the generation and the prompt once the seq_group is finished (e.g. in the loop where we loop through the scheduled_sequence_groups

This way we know that we are logging the full prefill length and that we are logging one item per iteration

len(seq_group.prompt_token_ids)
for seq_group in scheduler_outputs.scheduled_seq_groups
]
else:
num_generation_tokens = scheduler_outputs.num_batched_tokens
num_generation_tokens_lst = [
seq.get_output_len()
for seq_group in scheduler_outputs.scheduled_seq_groups
for seq in seq_group.get_finished_seqs()
]

# Latency Timings.
# Sampling Params
if prompt_run:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue related to chunked prefill. If our prefill has N chunked, we will create N histogram entires

max_tokens = [
seq_group.sampling_params.max_tokens
for seq_group in scheduler_outputs.scheduled_seq_groups
]
request_n = [
seq_group.sampling_params.n
for seq_group in scheduler_outputs.scheduled_seq_groups
]

# Latency Timings
time_last_iters = []
for seq_group in scheduler_outputs.scheduled_seq_groups:
# Time since last token. (n.b. updates seq_group.last_token_time)
Expand All @@ -869,15 +894,29 @@ def _get_stats(self,
time_to_first_tokens = time_last_iters if prompt_run else []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are also now incorrect b/c of chunked prefill

time_per_output_tokens = [] if prompt_run else time_last_iters

# Finished Requests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a bit hard to follow where we have 5-6 loops that loop through the same list of scheduled_seq_groups

for seq_group in scheduler_outputs.scheduled_seq_groups:
if not seq_group.is_finished():
continue
finished_reason_counter += CollectionsCounter([
SequenceStatus.get_finished_reason(seq.status)
for seq in seq_group.get_finished_seqs()
])

return Stats(
now=now,
num_running=num_running,
num_swapped=num_swapped,
num_waiting=num_waiting,
gpu_cache_usage=gpu_cache_usage,
cpu_cache_usage=cpu_cache_usage,
finished_reason_counter=finished_reason_counter,
num_prompt_tokens=num_prompt_tokens,
num_generation_tokens=num_generation_tokens,
num_prompt_tokens_lst=num_prompt_tokens_lst,
num_generation_tokens_lst=num_generation_tokens_lst,
max_tokens=max_tokens,
request_n=request_n,
time_to_first_tokens=time_to_first_tokens,
time_per_output_tokens=time_per_output_tokens,
time_e2e_requests=time_e2e_requests,
Expand Down
58 changes: 58 additions & 0 deletions vllm/engine/metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
import numpy as np
from typing import List
from dataclasses import dataclass
from collections import Counter as CollectionsCounter

logger = init_logger(__name__)

Expand All @@ -28,6 +29,8 @@ def add_global_metrics_labels(**kwargs):
"Number of prefill tokens processed.")
counter_generation_tokens = Counter("vllm:generation_tokens_total",
"Number of generation tokens processed.")
counter_request_success = Counter("vllm:request_success",
"Count of successfully processed requests.")

gauge_scheduler_running = Gauge(
"vllm:num_requests_running",
Expand All @@ -44,6 +47,22 @@ def add_global_metrics_labels(**kwargs):
"vllm:cpu_cache_usage_perc",
"CPU KV-cache usage. 1 means 100 percent usage.")

histogram_request_prompt_tokens = Histogram(
"vllm:request_prompt_tokens",
"Number of prefill tokens processed.",
buckets=[
1, 2, 5, 10, 20, 50, 100, 200, 500, 1_000, 2_000, 5_000, 10_000,
20_000, 50_000, 100_000
],
)
histogram_request_generation_tokens = Histogram(
"vllm:request_generation_tokens",
"Number of generation tokens processed.",
buckets=[
1, 2, 5, 10, 20, 50, 100, 200, 500, 1_000, 2_000, 5_000, 10_000,
20_000, 50_000, 100_000
],
)
ronensc marked this conversation as resolved.
Show resolved Hide resolved
histogram_time_to_first_token = Histogram(
"vllm:time_to_first_token_seconds",
"Histogram of time to first token in seconds.",
Expand All @@ -61,6 +80,20 @@ def add_global_metrics_labels(**kwargs):
"vllm:e2e_request_latency_seconds",
"Histogram of end to end request latency in seconds.",
buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])

histogram_max_tokens = Histogram(
"vllm:request_max_tokens",
"Histogram of the max_tokens request parameter.",
buckets=[
1, 2, 5, 10, 20, 50, 100, 200, 500, 1_000, 2_000, 5_000, 10_000,
20_000, 50_000, 100_000
],
)
histogram_request_n = Histogram(
"vllm:request_n",
"Histogram of the n request parameter.",
buckets=[1, 2, 5, 10, 20],
)
# end-metrics-definitions


Expand All @@ -77,8 +110,13 @@ class Stats:
cpu_cache_usage: float

# Raw stats from last model iteration.
finished_reason_counter: CollectionsCounter[str, int]
num_prompt_tokens: int
num_generation_tokens: int
num_prompt_tokens_lst: List[int]
num_generation_tokens_lst: List[int]
max_tokens: List[int]
request_n: List[int]
time_to_first_tokens: List[float]
time_per_output_tokens: List[float]
time_e2e_requests: List[float]
Expand Down Expand Up @@ -115,6 +153,26 @@ def _log_prometheus(self, stats: Stats) -> None:
counter_prompt_tokens.add(labels, stats.num_prompt_tokens)
counter_generation_tokens.add(labels, stats.num_generation_tokens)

# Add to request counters.
for finished_reason, count in stats.finished_reason_counter.items():
counter_request_success.add(
{
**labels,
"finished_reason": finished_reason,
}, count)

# Observe number of tokens in histograms.
for val in stats.num_prompt_tokens_lst:
histogram_request_prompt_tokens.observe(labels, val)
for val in stats.num_generation_tokens_lst:
histogram_request_generation_tokens.observe(labels, val)

# Observe sampling params in histograms.
for val in stats.max_tokens:
histogram_max_tokens.observe(labels, val)
for n in stats.request_n:
histogram_request_n.observe(labels, n)

# Observe request level latencies in histograms.
for ttft in stats.time_to_first_tokens:
histogram_time_to_first_token.observe(labels, ttft)
Expand Down
Loading