Are the "encoded"/"decoded" speeds the overall speeds for prompt processing and token generation?
I get prompt processing speeds for both the draft and the target model, but for target token generation it reports "inf tokens per second".
Here's my output from llama-speculative:
encoded 10086 tokens in 169.297 seconds, speed: 59.576 t/s
decoded 920 tokens in 134.800 seconds, speed: 6.825 t/s
n_draft = 16
n_predict = 920
n_drafted = 2288
n_accept = 776
accept = 33.916%
draft:
llama_perf_context_print: load time = 500.18 ms
llama_perf_context_print: prompt eval time = 108017.60 ms / 10371 tokens (10.42 ms per token, 96.01 tokens per second)
llama_perf_context_print: eval time = 33938.96 ms / 2145 runs (15.82 ms per token, 63.20 tokens per second)
llama_perf_context_print: total time = 304112.98 ms / 12516 tokens
target:
llama_perf_sampler_print: sampling time = 45.16 ms / 920 runs (0.05 ms per token, 20371.11 tokens per second)
llama_perf_context_print: load time = 1693.00 ms
llama_perf_context_print: prompt eval time = 265149.43 ms / 12517 tokens (21.18 ms per token, 47.21 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs (0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 304613.20 ms / 12518 tokens
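A quick arithmetic check against the numbers above. This is a sketch, not llama.cpp code, and the helper `tokens_per_second` is hypothetical. It suggests that the "encoded"/"decoded" speeds are plain token counts divided by wall-clock seconds (10086 / 169.297 and 920 / 134.800), that `accept` is `n_accept / n_drafted`, and that the target's "inf" is just a division by an eval time of 0.00 ms (IEEE 754 floating-point division by zero in C/C++ gives +inf). One possible explanation, not confirmed by this output alone, is that the target verifies drafted tokens in multi-token batches that the perf counters book under "prompt eval" rather than "eval"; that would be consistent with the target's prompt eval count (12517 tokens) being larger than the 10086 encoded tokens.

```python
import math


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput as tokens / wall-clock seconds (hypothetical helper)."""
    # C/C++ float division by zero yields +inf (or NaN for 0/0) under IEEE 754;
    # Python raises ZeroDivisionError instead, so mimic the C result explicitly.
    if seconds == 0.0:
        return math.inf if n_tokens > 0 else math.nan
    return n_tokens / seconds


# Figures copied from the llama-speculative output above.
enc = tokens_per_second(10086, 169.297)        # "encoded" line
dec = tokens_per_second(920, 134.800)          # "decoded" line
accept = 776 / 2288 * 100                      # n_accept / n_drafted, in percent
tgt_eval = tokens_per_second(1, 0.00 / 1000)   # target "eval time": 0.00 ms / 1 run

print(f"{enc:.3f} {dec:.3f} {accept:.3f} {tgt_eval}")
# prints: 59.576 6.825 33.916 inf
```

All three reported figures are reproduced from the raw counts, which supports reading "decoded ... t/s" as end-to-end generation speed for the whole speculative loop rather than a per-model figure.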