Exclude yield/reply time from first token latency metric (#973)
While the metrics are fine for a small number of requests, when the megaservice
is handling many (hundreds of) _parallel_ requests, it was reporting a
clearly (~10%) larger first token latency than what the client receiving
the tokens from the megaservice measured.

Taking the time before the token is yielded means that the reported first
token latency can be slightly shorter than it actually is. However,
testing with ChatQnA shows the latencies to be clearly closer to the ones
seen by the client (within a couple of percent) and typically smaller
(i.e. logical).

PS. Doing the metrics timing after yielding the token meant that the time
spent sending the reply to the client and waiting for that to complete was
also included in the token time. I suspect that with a lot of parallel
requests, processing had often switched to the threads handling other
megaservice requests, and getting control back to the yielding thread for
the timing could be delayed much longer than sending the response to the
client actually took.
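
To see the generator mechanics behind this, here is a minimal standalone sketch (made-up names, not the orchestrator code): a generator is suspended at `yield` until the consumer asks for the next item, so a timing call placed after the `yield` also measures whatever the consumer did in between, such as sending the chunk to the client.

```python
import time


def timed_after(tokens):
    # Old placement: the timing statement after "yield" runs only once the
    # consumer resumes the generator, so the measured span also includes
    # the consumer's per-token work.
    start = time.time()
    for token in tokens:
        yield token
        print(f"after-yield:  {time.time() - start:.3f}s")
        start = time.time()


def timed_before(tokens):
    # New placement: timing before "yield" measures only the generator's
    # own work; at worst a token is timed slightly too early.
    start = time.time()
    for token in tokens:
        print(f"before-yield: {time.time() - start:.3f}s")
        yield token
        start = time.time()


def slow_consumer(gen):
    for _ in gen:
        time.sleep(0.05)  # stands in for sending a chunk to the client


slow_consumer(timed_after("abc"))   # prints ~0.050s per token
slow_consumer(timed_before("abc"))  # prints ~0.000s per token
```

Under load, the gap between the `yield` and the generator's resumption can grow far beyond the actual send time, which matches the inflated first token latency described above.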

Signed-off-by: Eero Tamminen <[email protected]>
eero-t authored Dec 6, 2024
1 parent 3328ea3 commit 5663e16
Showing 1 changed file with 2 additions and 2 deletions.
comps/cores/mega/orchestrator.py (2 additions & 2 deletions)
@@ -237,8 +237,8 @@ def generate():
                     )
                     token_start = time.time()
                 else:
-                    yield chunk
                     token_start = self.metrics.token_update(token_start, is_first)
+                    yield chunk
                     is_first = False
             self.metrics.request_update(req_start)
             self.metrics.pending_update(False)
@@ -306,7 +306,7 @@ def token_generator(self, sentence: str, token_start: float, is_first: bool, is_
         suffix = "\n\n"
         tokens = re.findall(r"\s?\S+\s?", sentence, re.UNICODE)
         for token in tokens:
-            yield prefix + repr(token.replace("\\n", "\n").encode("utf-8")) + suffix
             token_start = self.metrics.token_update(token_start, is_first)
+            yield prefix + repr(token.replace("\\n", "\n").encode("utf-8")) + suffix
         if is_last:
             yield "data: [DONE]\n\n"
