Add more Prometheus metrics #2764

Merged Apr 28, 2024 · 50 commits
The file changes shown below are from 22 of the 50 commits.

Commits:
319bc37
Add vllm:request_max_tokens
ronensc Jan 31, 2024
9d4ce95
Remove trailing dots from comments that are not sentences
ronensc Jan 31, 2024
ae7eb6e
Add vllm:request_success
ronensc Jan 31, 2024
2188daa
Remove redundant space
ronensc Jan 31, 2024
e41c15f
Add vllm:request_n
ronensc Jan 31, 2024
71ec7c3
Add vllm:prompt_tokens
ronensc Feb 5, 2024
45bd839
Add vllm:generation_tokens
ronensc Feb 5, 2024
f237c50
Add comments
ronensc Feb 5, 2024
f17a966
Rename metrics
ronensc Feb 5, 2024
8e0d8c1
Make type hint compatible with python 3.8
ronensc Feb 7, 2024
9ed04ef
Rename metrics
ronensc Feb 12, 2024
6aebd80
Merge branch 'main' into more-metrics
ronensc Feb 19, 2024
de84dac
Merge branch 'main' into more-metrics
ronensc Feb 21, 2024
35944cc
Merge branch 'main' into more-metrics
ronensc Feb 26, 2024
76cd774
Consider the value of `max_model_len` when building buckets
ronensc Feb 26, 2024
93b0796
Merge branch 'main' into more-metrics
ronensc Mar 4, 2024
3643e0c
Merge branch 'main' into more-metrics
ronensc Mar 13, 2024
60f1049
Fix too long line warning
ronensc Mar 13, 2024
95daee7
Add HTTP metrics from prometheus-fastapi-instrumentator
ronensc Mar 26, 2024
cf4acef
Merge remote-tracking branch 'origin/main' into more-metrics
ronensc Mar 26, 2024
0f8dae9
Make ruff happy
ronensc Mar 26, 2024
bce096c
Remove vllm:request_params_max_tokens
ronensc Mar 28, 2024
e15f653
Move deprecated metrics to legacy section
ronensc Mar 29, 2024
7b05baa
Add metric vllm:request_params_best_of
ronensc Apr 1, 2024
0958259
Revert to exposing /metrics using make_asgi_app()
ronensc Apr 1, 2024
5e2c246
Register 'finished_reason' label name on metric creation
ronensc Apr 1, 2024
5cc7b64
Merge branch 'main' into more-metrics
ronensc Apr 1, 2024
1eeb31d
Fix merge issues
ronensc Apr 1, 2024
4c79cbe
Merge branch 'main' into more-metrics
ronensc Apr 17, 2024
4c41a89
Fix merge issues
ronensc Apr 17, 2024
ac8435b
Add 3 panels to Grafana dashboard
ronensc Apr 17, 2024
f22abf5
Change order of deprecated metrics and add comments
ronensc Apr 19, 2024
9352ce7
Rename LABEL_NAME_FINISHED_REASON and make it a class variable of Met…
ronensc Apr 19, 2024
b2c0445
Set minimum version to prometheus-fastapi-instrumentator
ronensc Apr 19, 2024
e147575
Change finished_reason from counter to list
ronensc Apr 19, 2024
f9bc64e
Compute deprecated metrics using the newer version
ronensc Apr 19, 2024
5ded719
Rename variables. Strip '_lst' suffix.
ronensc Apr 19, 2024
dd84d51
Update naming schema Stats to have the _suffix pattern
ronensc Apr 19, 2024
e127a4c
Fix the incorrect logic for chunked prefill
ronensc Apr 19, 2024
2d36609
Restore num_prompt_tokens_iter and num_generation_tokens_iter
ronensc Apr 19, 2024
e81d95a
Refactor metrics logging methods
ronensc Apr 19, 2024
ece2ec0
Reorder metrics definition to match Stats order
ronensc Apr 19, 2024
5a658c8
Rename metric variables to match suffix convention
ronensc Apr 19, 2024
717b559
Make mypy happy
ronensc Apr 20, 2024
61fad41
Merge branch 'main' into more-metrics
robertgshaw2-neuralmagic Apr 25, 2024
f103ad8
./format
robertgshaw2-neuralmagic Apr 25, 2024
bf1a0c4
Merge branch 'main' into more-metrics
robertgshaw2-neuralmagic Apr 28, 2024
cc0d5eb
fixed chunked prefill logic
robertgshaw2-neuralmagic Apr 28, 2024
d7f493b
make linter happy
robertgshaw2-neuralmagic Apr 28, 2024
54bf260
fixed issues with chunked prefill X metrics
robertgshaw2-neuralmagic Apr 28, 2024
1 change: 1 addition & 0 deletions requirements.txt
@@ -11,6 +11,7 @@ fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
prometheus_client >= 0.18.0
prometheus-fastapi-instrumentator
pynvml == 11.5.0
triton >= 2.1.0
outlines == 0.0.34
2 changes: 1 addition & 1 deletion vllm/core/scheduler.py
@@ -139,7 +139,7 @@ def abort_seq_group(self, request_id: Union[str, Iterable[str]]) -> None:
for seq_group in state_queue:
if not request_ids:
# Using 'break' here may add two extra iterations,
# but is acceptable to reduce complexity .
# but is acceptable to reduce complexity.
break
if seq_group.request_id in request_ids:
# Appending aborted group into pending list.
43 changes: 38 additions & 5 deletions vllm/engine/llm_engine.py
@@ -1,4 +1,5 @@
import time
from collections import Counter as CollectionsCounter
from typing import Iterable, List, Optional, Tuple, Type, Union

from transformers import PreTrainedTokenizer
@@ -121,7 +122,8 @@ def __init__(
if self.log_stats:
self.stat_logger = StatLogger(
local_interval=_LOCAL_LOGGING_INTERVAL_SEC,
labels=dict(model_name=model_config.model))
labels=dict(model_name=model_config.model),
max_model_len=self.model_config.max_model_len)
self.stat_logger.info("cache_config", self.cache_config)

@classmethod
@@ -673,24 +675,42 @@ def _get_stats(self,
# Iteration stats if we have scheduler output.
num_prompt_tokens = 0
num_generation_tokens = 0
num_prompt_tokens_lst = []
num_generation_tokens_lst = []
request_n = []
time_to_first_tokens = []
time_per_output_tokens = []
time_e2e_requests = []
finished_reason_counter = CollectionsCounter()
Collaborator:

Please move this one to be just after the # Finished Requests comment

Collaborator:

Is there any reason this can't just be a list of finished reasons that we process on the stat_logger side?

Collaborator:

I don't think so, and it would make the stats logging code in the engine a bit cleaner

Contributor Author:

> Please move this one to be just after the # Finished Requests comment

If I move finished_reason_counter = CollectionsCounter() after # Finished Requests, it might be undefined when accessed later for the same reason as request_n and request_best_of. Please refer to that comment for details.

> Is there any reason this can't just be a list of finished reasons that we process on the stat_logger side?

Done.

Collaborator:

Ah, in the diff view I hadn't realised this was nested; it's OK to leave as is then.

if scheduler_outputs is not None:
prompt_run = scheduler_outputs.prompt_run

# Number of Tokens.
# Number of Tokens
if prompt_run:
num_prompt_tokens = sum(
num_prompt_tokens_lst = [
Collaborator:

This is actually an incorrect calculation.

This stat is used to compute the histogram of prompt lengths for each request. Because of chunked prefill, the prompt processing will occur over multiple steps of LLMEngine, so this will actually give you the distribution of chunk sizes as opposed to the distribution of prefill lengths.

So what I would suggest is that we log the lengths of the prompt and the generation once the seq_group is finished (e.g. in the loop where we iterate over the scheduled_seq_groups). This way we know that we are logging the full prefill length and that we are logging one item per request; see the hedged sketch below.
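
For illustration only, a minimal sketch of the suggested approach, reusing accessors that already appear in this diff (is_finished(), prompt_token_ids, get_finished_seqs(), get_output_len()); this is a hedged sketch, not necessarily the code the PR ends up with:

    # Record request-level stats only once, when the whole seq_group finishes,
    # so chunked prefill does not add one histogram entry per chunk.
    for seq_group in scheduler_outputs.scheduled_seq_groups:
        if seq_group.is_finished():
            # One observation per request: the full prompt length...
            num_prompt_tokens_lst.append(len(seq_group.prompt_token_ids))
            # ...and one generation length per finished sequence.
            num_generation_tokens_lst.extend(
                seq.get_output_len() for seq in seq_group.get_finished_seqs())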

len(seq_group.prompt_token_ids)
for seq_group in scheduler_outputs.scheduled_seq_groups)
for seq_group in scheduler_outputs.scheduled_seq_groups
]
num_prompt_tokens = sum(num_prompt_tokens_lst)
num_generation_tokens = sum(
seq_group.num_seqs()
for seq_group in scheduler_outputs.scheduled_seq_groups)
else:
num_generation_tokens = scheduler_outputs.num_batched_tokens
num_generation_tokens_lst = [
seq.get_output_len()
for seq_group in scheduler_outputs.scheduled_seq_groups
for seq in seq_group.get_finished_seqs()
]

# Sampling Params
if prompt_run:
Collaborator:

Same issue related to chunked prefill: if our prefill has N chunks, we will create N histogram entries.

request_n = [
seq_group.sampling_params.n
for seq_group in scheduler_outputs.scheduled_seq_groups
]

# Latency Timings.
# Latency Timings
time_last_iters = []
for seq_group in scheduler_outputs.scheduled_seq_groups:
# Time since last token.
@@ -704,15 +724,28 @@ def _get_stats(self,
time_to_first_tokens = time_last_iters if prompt_run else []
Collaborator:

These are also now incorrect because of chunked prefill.

time_per_output_tokens = [] if prompt_run else time_last_iters

# Finished Requests
Collaborator:

It's a bit hard to follow when we have 5-6 loops that all iterate over the same list of scheduled_seq_groups.

for seq_group in scheduler_outputs.scheduled_seq_groups:
if not seq_group.is_finished():
continue
finished_reason_counter += CollectionsCounter([
SequenceStatus.get_finished_reason(seq.status)
for seq in seq_group.get_finished_seqs()
])

return Stats(
now=now,
num_running=num_running,
num_swapped=num_swapped,
num_waiting=num_waiting,
gpu_cache_usage=gpu_cache_usage,
cpu_cache_usage=cpu_cache_usage,
finished_reason_counter=finished_reason_counter,
num_prompt_tokens=num_prompt_tokens,
num_generation_tokens=num_generation_tokens,
num_prompt_tokens_lst=num_prompt_tokens_lst,
num_generation_tokens_lst=num_generation_tokens_lst,
request_n=request_n,
time_to_first_tokens=time_to_first_tokens,
time_per_output_tokens=time_per_output_tokens,
time_e2e_requests=time_e2e_requests,
77 changes: 74 additions & 3 deletions vllm/engine/metrics.py
@@ -1,5 +1,6 @@
import time
from dataclasses import dataclass
from typing import Counter as CollectionsCounter
from typing import Dict, List

import numpy as np
@@ -19,7 +20,7 @@
# begin-metrics-definitions
class Metrics:

def __init__(self, labelnames: List[str]):
def __init__(self, labelnames: List[str], max_model_len: int):
# Unregister any existing vLLM collectors
for collector in list(REGISTRY._collector_to_names):
if hasattr(collector, "_name") and "vllm" in collector._name:
@@ -61,6 +62,22 @@ def __init__(self, labelnames: List[str]):
name="vllm:generation_tokens_total",
documentation="Number of generation tokens processed.",
labelnames=labelnames)
self.counter_request_success = Counter(
name="vllm:request_success",
documentation="Count of successfully processed requests.",
labelnames=labelnames)
Collaborator:

This isn't just counting successful responses, it's counting all finish reasons. If you could find an elegant way to implement the counters we lost when switching from aioprometheus to prometheus_client, that would be great!

Contributor Author:

A quick option to add HTTP-related metrics would be to use prometheus-fastapi-instrumentator.

This involves installing the package:
pip install prometheus-fastapi-instrumentator

Then adding the following 2 lines after the app creation:

app = fastapi.FastAPI(lifespan=lifespan)

from prometheus_fastapi_instrumentator import Instrumentator
Instrumentator().instrument(app).expose(app)

This will add the following metrics:

Metric Name | Type | Description
http_requests_total | counter | Total number of requests by method, status, and handler.
http_request_size_bytes | summary | Content length of incoming requests by handler. Only value of header is respected. Otherwise ignored.
http_response_size_bytes | summary | Content length of outgoing responses by handler. Only value of header is respected. Otherwise ignored.
http_request_duration_highr_seconds | histogram | Latency with many buckets but no API-specific labels. Made for more accurate percentile calculations.
http_request_duration_seconds | histogram | Latency with only a few buckets by handler. Made to be only used if aggregation by handler is important.

Should I add it to the PR?

Collaborator:

I like this solution, it saves us from reinventing the wheel in vLLM

Contributor Author:

done :)

Collaborator (@hmellor, Mar 28, 2024):

Nice, if we're going to be using prometheus-fastapi-instrumentator then this implementation should be removed.

Contributor Author:

Sorry, I'm not sure I'm following. Which implementation are you referring to that should be removed?
IIUC, prometheus-fastapi-instrumentator simply adds a middleware to the FastAPI app to collect the metrics specified in the table above. It uses prometheus_client under the hood, and adding other vLLM-related metrics should still be done with prometheus_client.

Collaborator:

The code highlighted in the original comment

        self.counter_request_success = Counter(
            name="vllm:request_success",
            documentation="Count of successfully processed requests.",
            labelnames=labelnames)

can be removed if we are getting these metrics from prometheus-fastapi-instrumentator instead.

Contributor Author:

Thank you for clarifying. I believe that vllm:request_success remains valuable. It includes a finished_reason label, which allows for counting requests based on their finished reason — either stop if the sequence ends with an EOS token, or length if the sequence length reaches either scheduler_config.max_model_len or sampling_params.max_tokens. I'm open to adjusting its name and description to make it more indicative.

Collaborator:

What do you think of the idea of renaming this to something like vllm:request_info and including n and best_of as labels too? This way we log a single metric from which the user can construct many different visualisations on Grafana by utilising the label filters?

Contributor Author:

While the idea of combining these metrics may seem appealing at first glance, I believe they should be kept separate for the following reasons:

  1. The metrics have different types: vllm:request_success is a Counter, while vllm:request_params_best_of and vllm:request_params_n are Histograms.

  2. Aggregating different labels lacks semantic meaning.

  3. Although merging n and best_of into the same histograms might make sense in this case, as they would share the same buckets, we may encounter scenarios where we need to introduce another metric with different bucket requirements.

  4. This situation differs from the Info metric type, where data is encoded in the label values.

self.histogram_request_prompt_tokens = Histogram(
name="vllm:request_prompt_tokens",
documentation="Number of prefill tokens processed.",
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
self.histogram_request_generation_tokens = Histogram(
name="vllm:request_generation_tokens",
documentation="Number of generation tokens processed.",
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
Collaborator:

These two could be constructed from vllm:prompt_tokens_total and vllm:generation_tokens_total using a Binary operation transform in Grafana.

It wouldn't be exactly the same, but it would prevent additional overhead in the server. I.e. if you calculate it in Grafana (and your scrape interval is 1 minute), then it'd be a histogram of how many tokens get processed/generated per minute rather than how many tokens get processed/generated per request.

Contributor Author:

Thanks for your feedback!
You are right. But wouldn't it also be beneficial to have histograms depicting the distribution of prompt length and generation length?

Collaborator:

Since this metric doesn't actually introduce any overhead (because the data from vllm:x_tokens_total is reused), these two are probably fine. It would be interesting to know how big the prompts that users provide are.

Contributor Author:

Exactly! I suggest deprecating the two vllm:x_tokens_total metrics, as they will be included as part of the histogram metrics this PR adds.

Collaborator:

I think we should keep these metrics, because a developer may not want to have to aggregate histogram data in order to get the same effect as vllm:x_tokens_total.

Contributor Author:

Prometheus histograms have this nice feature where in addition to the bucket counters, they include 2 additional counters
suffixed with _sum and _count.

_count is incremented by 1 on every observe, and _sum is incremented by the value of the observation.

Therefore, vllm:prompt_tokens_total is equivalent to vllm:request_prompt_tokens_sum,
and vllm:generation_tokens_total is equivalent to vllm:request_generation_tokens_sum.

Source:
https://www.robustperception.io/how-does-a-prometheus-histogram-work/
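
To make this concrete, here is a standalone sketch using plain prometheus_client (illustrative only; the metric name and values are made up, this is not vLLM code):

    from prometheus_client import CollectorRegistry, Histogram, generate_latest

    registry = CollectorRegistry()
    h = Histogram("request_prompt_tokens",
                  "Number of prefill tokens processed.",
                  buckets=[1, 2, 5, 10, 20, 50, 100],
                  registry=registry)

    for prompt_len in [3, 7, 42]:
        h.observe(prompt_len)

    # Besides the per-bucket counters, the exposition output now contains:
    #   request_prompt_tokens_count 3.0   (one increment per observe)
    #   request_prompt_tokens_sum 52.0    (3 + 7 + 42, i.e. the *_total equivalent)
    print(generate_latest(registry).decode())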

Collaborator (@hmellor, Mar 28, 2024):

Oh I see, thanks for explaining. In that case you could move the vllm:x_tokens_total metrics into the # Legacy metrics section.

Although I think there might be some objection to changing metrics that people are already using in dashboards.

cc @simon-mo @Yard1 @robertgshaw2-neuralmagic (not sure who to ping for metrics related things, so please tell me if I should stop)

Collaborator:

From my point of view, it's fine to duplicate metrics for backward-compatibility reasons.

Contributor Author:

Sure, I'll relocate these metrics to the legacy section. Perhaps in the future, when we're able to make breaking changes, we can consider removing them.

self.histogram_time_to_first_token = Histogram(
name="vllm:time_to_first_token_seconds",
documentation="Histogram of time to first token in seconds.",
@@ -82,6 +99,12 @@ def __init__(self, labelnames: List[str]):
documentation="Histogram of end to end request latency in seconds.",
labelnames=labelnames,
buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])
self.histogram_request_n = Histogram(
name="vllm:request_params_n",
documentation="Histogram of the n request parameter.",
labelnames=labelnames,
buckets=[1, 2, 5, 10, 20],
)
Collaborator:

If we keep what's currently called vllm:request_success, could this be included as another label and the histogram could be constructed on the Grafana side?

Contributor Author:

Great suggestion! I'll incorporate the new metrics into the Grafana dashboard demo.

Collaborator:

This can be resolved now as you've added the Grafana examples


# Legacy metrics
self.gauge_avg_prompt_throughput = Gauge(
@@ -99,6 +122,28 @@ def __init__(self, labelnames: List[str]):
# end-metrics-definitions


def build_1_2_5_buckets(max_value: int):
"""
Builds a list of buckets with increasing powers of 10 multiplied by
mantissa values (1, 2, 5) until the value exceeds the specified maximum.

Example:
>>> build_1_2_5_buckets(100)
[1, 2, 5, 10, 20, 50, 100]
"""
mantissa_lst = [1, 2, 5]
exponent = 0
buckets = []
while True:
for m in mantissa_lst:
value = m * 10**exponent
if value <= max_value:
buckets.append(value)
else:
return buckets
exponent += 1
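
# Hedged illustration (editor's note, not part of the diff): with a
# hypothetical max_model_len of 4096, the helper above stops before 5000:
#   >>> build_1_2_5_buckets(4096)
#   [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000]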


@dataclass
class Stats:
"""Created by LLMEngine for use by StatLogger."""
@@ -112,8 +157,12 @@ class Stats:
cpu_cache_usage: float

# Raw stats from last model iteration.
finished_reason_counter: CollectionsCounter[str]
num_prompt_tokens: int
num_generation_tokens: int
num_prompt_tokens_lst: List[int]
num_generation_tokens_lst: List[int]
request_n: List[int]
time_to_first_tokens: List[float]
time_per_output_tokens: List[float]
time_e2e_requests: List[float]
@@ -122,7 +171,8 @@
class StatLogger:
"""StatLogger is used LLMEngine to log to Promethus and Stdout."""

def __init__(self, local_interval: float, labels: Dict[str, str]) -> None:
def __init__(self, local_interval: float, labels: Dict[str, str],
max_model_len: int) -> None:
# Metadata for logging locally.
self.last_local_log = time.monotonic()
self.local_interval = local_interval
@@ -133,7 +183,8 @@ def __init__(self, local_interval: float, labels: Dict[str, str]) -> None:

# Prometheus metrics
self.labels = labels
self.metrics = Metrics(labelnames=list(labels.keys()))
self.metrics = Metrics(labelnames=list(labels.keys()),
max_model_len=max_model_len)

def info(self, type: str, obj: object) -> None:
if type == "cache_config":
@@ -165,6 +216,26 @@ def _log_prometheus(self, stats: Stats) -> None:
self.metrics.counter_generation_tokens.labels(**self.labels).inc(
stats.num_generation_tokens)

# Add to request counters.
for finished_reason, count in stats.finished_reason_counter.items():
self.metrics.counter_request_success.labels({
**self.labels,
"finished_reason":
finished_reason,
Collaborator:

Does this not raise an error because finished_reason was not specified when this metric was constructed?

Contributor Author:

Good catch! Although it doesn't raise an error, it behaves unexpectedly by appending the finished_reason to the model_name value.
I'll add a fix for this.

}).inc(count)
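
A hedged sketch of the fix the author mentions (the commit "Register 'finished_reason' label name on metric creation" later addresses this); the exact code here is illustrative:

    # In Metrics.__init__: register the extra label name up front.
    self.counter_request_success = Counter(
        name="vllm:request_success",
        documentation="Count of successfully processed requests.",
        labelnames=labelnames + ["finished_reason"])

    # In StatLogger._log_prometheus: pass every label as a keyword argument.
    self.metrics.counter_request_success.labels(
        **{**self.labels, "finished_reason": finished_reason}).inc(count)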

# Observe number of tokens in histograms.
for val in stats.num_prompt_tokens_lst:
self.metrics.histogram_request_prompt_tokens.labels(
**self.labels).observe(val)
for val in stats.num_generation_tokens_lst:
self.metrics.histogram_request_generation_tokens.labels(
**self.labels).observe(val)

# Observe sampling params in histograms.
for n in stats.request_n:
self.metrics.histogram_request_n.labels(**self.labels).observe(n)

# Observe request level latencies in histograms.
for ttft in stats.time_to_first_tokens:
self.metrics.histogram_time_to_first_token.labels(
9 changes: 3 additions & 6 deletions vllm/entrypoints/openai/api_server.py
@@ -11,7 +11,7 @@
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from prometheus_client import make_asgi_app
from prometheus_fastapi_instrumentator import Instrumentator

import vllm
from vllm.engine.arg_utils import AsyncEngineArgs
@@ -45,18 +45,15 @@ async def _force_log():


app = fastapi.FastAPI(lifespan=lifespan)
# Instrument the app with HTTP metrics and expose it on /metrics
Instrumentator().instrument(app).expose(app, endpoint="/metrics")


def parse_args():
parser = make_arg_parser()
return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)


Collaborator:

This is how the metrics defined in vllm/engine/metrics.py are exposed. It can't be removed.

Contributor Author:

I replaced it with the expose() method of prometheus-fastapi-instrumentator, which also exposes a /metrics endpoint.
https://github.com/vllm-project/vllm/pull/2764/files#diff-38318677b76349044192bf70161371c88fb2818b85279d8fc7f2c041d83a9544R48-R49

I noticed it also solves the /metrics/ redirection issue.
Which of the exposing methods should we use?

Collaborator:

> I replaced it with the expose() method of prometheus-fastapi-instrumentator, which also exposes a /metrics endpoint.

While this does expose a /metrics endpoint, none of the vLLM metrics will be in it because they come from make_asgi_app(), right?

Have you confirmed that /metrics still contains vLLM metrics with this code removed?

Contributor Author (@ronensc, Mar 29, 2024):

Yes, I've verified that both approaches expose all metrics. The only discrepancy I've noticed is that expose() from prometheus-fastapi-instrumentator exposes metrics on /metrics, whereas make_asgi_app() exposes them on /metrics/. However, I'll revert to using the make_asgi_app() approach. I find the other method somewhat hacky, as it involves the prometheus-fastapi-instrumentator middleware handling the metrics endpoint. This could look weird if multiple middlewares are in use.
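
For reference, a sketch of the approach being reverted to; it mirrors the lines removed in this diff (see the commit "Revert to exposing /metrics using make_asgi_app()"):

    from prometheus_client import make_asgi_app

    # Add prometheus asgi middleware to route /metrics requests
    metrics_app = make_asgi_app()
    app.mount("/metrics", metrics_app)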

@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
err = openai_serving_chat.create_error_response(message=str(exc))