
# LLM throughput benchmark

## The benchmark

You crave perfect code suggestions, but you don't know whether they fit your latency requirements?

We ran our tests on the following hardware:

The laptop hardware setup includes an Intel(R) Core(TM) i7-12700H as the CPU

with the following LLMs (cf. Ollama hub):

and the following quantization formats: q3_K_M, q4_K_M, q5_K_M.

This benchmark was performed over 5 iterations on 4 different sequences, including on a laptop, to better reflect the performance that typical users can expect.
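
The reported metrics are ingestion (prompt evaluation) and generation throughput, both in tokens per second. As a point of reference, similar figures can be derived from the timing fields returned by Ollama's /api/generate endpoint; here is a minimal sketch, assuming curl and jq are available and an Ollama instance is listening on its default port 11434:

```shell
# Minimal sketch: derive ingestion/generation throughput (tok/s) from Ollama's timing fields.
# Durations are reported in nanoseconds, hence the 1e9 factor.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "tinyllama:1.1b-chat-v1-q4_0", "prompt": "Hello there!", "stream": false}' \
  | jq '{ingestion_tok_s: (.prompt_eval_count / .prompt_eval_duration * 1e9),
         generation_tok_s: (.eval_count / .eval_duration * 1e9)}'
```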

## Run it on your hardware

### Local setup

Quite simply, start the Docker services:

```shell
docker compose up -d --wait
```
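
If you want to check that the services came up healthy before pulling a model, standard Docker Compose commands will do:

```shell
# List the services defined in the compose file along with their status
docker compose ps
# Tail the logs of the Ollama service (service name taken from the commands below)
docker compose logs -f ollama
```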

Pull the model you want:

```shell
docker compose exec -T ollama ollama pull MODEL
```

And run the evaluation:

```shell
docker compose exec -T evaluator python evaluate.py MODEL
```
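
For instance, using one of the model tags from the results table at the bottom of this page (any tag from the Ollama hub works the same way):

```shell
# Example end-to-end run: pull a small model, then benchmark it
docker compose exec -T ollama ollama pull tinyllama:1.1b-chat-v1-q4_0
docker compose exec -T evaluator python evaluate.py tinyllama:1.1b-chat-v1-q4_0
```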

### Remote instance

Start the evaluator only:

```shell
docker compose up -d evaluator --wait
```

And run the evaluation by targeting your remote instance:

```shell
docker compose exec -T evaluator python evaluate.py MODEL --endpoint http://HOST:PORT
```
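
For example, assuming the remote Ollama instance is reachable at 192.168.1.20 on Ollama's default port 11434 (both values are placeholders to adapt to your setup):

```shell
# Hypothetical endpoint - replace host/port with those of your remote Ollama instance
docker compose exec -T evaluator python evaluate.py tinyllama:1.1b-chat-v1-q4_0 --endpoint http://192.168.1.20:11434
```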

All script arguments can be checked using `python scripts/ollama/evaluate_perf.py --help`.

## Others

Here are the results for other LLMs that have only been evaluated on the laptop GPU:

| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| tinyllama:1.1b-chat-v1-q4_0 | 2014.63 tok/s (±12.62) | 227.13 tok/s (±2.26) |
| dolphin-phi:2.7b-v2.6-q4_0 | 684.07 tok/s (±3.85) | 122.25 tok/s (±0.87) |
| dolphin-mistral:7b-v2.6 | 291.94 tok/s (±0.4) | 60.56 tok/s (±0.15) |