Could Someone Help me to Interpret Speed for Speculative Decoding? #10716

chigkim · 2024-12-08T06:01:22Z

Are the "encoded" and "decoded" speeds the overall speeds for prompt processing and token generation, respectively?
I get a prompt processing speed for both the draft and the target model, but the target's token generation speed is reported as "inf tokens per second".
Here's my output from llama-speculative:

encoded 10086 tokens in  169.297 seconds, speed:   59.576 t/s
decoded   920 tokens in  134.800 seconds, speed:    6.825 t/s

n_draft   = 16
n_predict = 920
n_drafted = 2288
n_accept  = 776
accept    = 33.916%
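
As a sanity check, the summary figures above can be reproduced from the raw counts. The short Python sketch below assumes that each "speed" is tokens divided by elapsed seconds and that "accept" is n_accept divided by n_drafted. Both are inferred from the numbers in this log, not confirmed against the llama.cpp source, and the variable names are only illustrative.

```python
# Raw counts and timings copied from the llama-speculative summary above.
n_encoded, t_encode_s = 10086, 169.297
n_decoded, t_decode_s = 920, 134.800
n_drafted, n_accept = 2288, 776

# Assumption: each "speed" is simply tokens divided by wall-clock seconds.
encode_tps = n_encoded / t_encode_s
decode_tps = n_decoded / t_decode_s

# Assumption: "accept" is the share of drafted tokens that were accepted.
accept_pct = 100.0 * n_accept / n_drafted

print(f"encode: {encode_tps:.3f} t/s")
print(f"decode: {decode_tps:.3f} t/s")
print(f"accept: {accept_pct:.3f}%")
```

Under these assumptions the three values round to 59.576 t/s, 6.825 t/s, and 33.916%, which matches the printed summary.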

draft:

llama_perf_context_print: load time = 500.18 ms
llama_perf_context_print: prompt eval time = 108017.60 ms / 10371 tokens (10.42 ms per token, 96.01 tokens per second)
llama_perf_context_print: eval time = 33938.96 ms / 2145 runs (15.82 ms per token, 63.20 tokens per second)
llama_perf_context_print: total time = 304112.98 ms / 12516 tokens

target:

llama_perf_sampler_print: sampling time = 45.16 ms / 920 runs (0.05 ms per token, 20371.11 tokens per second)
llama_perf_context_print: load time = 1693.00 ms
llama_perf_context_print: prompt eval time = 265149.43 ms / 12517 tokens (21.18 ms per token, 47.21 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs (0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 304613.20 ms / 12518 tokens
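
The per-line "tokens per second" values can be recomputed from each line's milliseconds and token count, which also shows where "inf" could come from. This is a rough sketch: `tokens_per_second` is a hypothetical helper, not a llama.cpp function, and returning infinity for a zero elapsed time is an assumption that mirrors what floating-point division by 0.00 ms would produce.

```python
import math

def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput from elapsed milliseconds; inf when no time was recorded."""
    if elapsed_ms == 0.0:
        # Python raises on division by zero, so return infinity explicitly.
        return math.inf
    return 1000.0 * n_tokens / elapsed_ms

# Values copied from the draft and target perf lines above.
draft_prompt = tokens_per_second(108017.60, 10371)
draft_eval = tokens_per_second(33938.96, 2145)
target_prompt = tokens_per_second(265149.43, 12517)
target_eval = tokens_per_second(0.00, 1)

print(f"draft prompt eval:  {draft_prompt:.2f} t/s")
print(f"draft eval:         {draft_eval:.2f} t/s")
print(f"target prompt eval: {target_prompt:.2f} t/s")
print(f"target eval:        {target_eval} t/s")
```

The first three values agree with the log to two decimals. The last is infinite only because the target's eval line records 0.00 ms for a single run, so it does not measure the target's generation speed.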
