You crave perfect code suggestions, but you don't know whether they fit your latency requirements?
We ran our tests on the following hardware:
- NVIDIA GeForce RTX 3060 (mobile)*
- NVIDIA GeForce RTX 3070 (Scaleway GPU-3070-S)
- NVIDIA A10 (Lambda Cloud gpu_1x_a10)
- NVIDIA A10G (AWS g5.xlarge)
- NVIDIA L4 (Scaleway L4-1-24G)
*The laptop hardware setup includes an Intel(R) Core(TM) i7-12700H CPU.*

We ran the benchmark with the following LLMs (cf. the Ollama hub):
- Deepseek Coder 6.7b - instruct (Ollama, Hugging Face)
- OpenCodeInterpreter 6.7b (Ollama, Hugging Face, paper)
- Dolphin Mistral 7b (Ollama, Hugging Face, paper)
- Coming soon: StarChat v2 (Hugging Face, paper)
and the following quantization formats: q3_K_M, q4_K_M, q5_K_M.
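Each model and quantization format pair corresponds to a single tag on the Ollama hub. As an illustration, if you have the Ollama CLI installed locally, pulling two such combinations could look like this (the exact tag names are assumptions, so verify them on the Ollama hub):

```shell
# Example tags combining a model with a quantization suffix (verify on the Ollama hub)
ollama pull deepseek-coder:6.7b-instruct-q4_K_M
ollama pull dolphin-mistral:7b-v2.6-q5_K_M
```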
This benchmark was performed over 5 iterations on 4 different sequences, including on a laptop, to better reflect the performance common users can expect.
Quite simply, start the Docker services:

```shell
docker compose up -d --wait
```
Pull the model you want:

```shell
docker compose exec -T ollama ollama pull MODEL
```

And run the evaluation:

```shell
docker compose exec -T evaluator python evaluate.py MODEL
```
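For instance, a full local run with one of the models above could look like this (the exact Ollama tag, deepseek-coder:6.7b-instruct-q4_K_M, is an assumption; check the Ollama hub for the tag you want):

```shell
# Start the services, pull one model, then benchmark it
docker compose up -d --wait
docker compose exec -T ollama ollama pull deepseek-coder:6.7b-instruct-q4_K_M
docker compose exec -T evaluator python evaluate.py deepseek-coder:6.7b-instruct-q4_K_M
```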
If your Ollama instance runs on a remote machine, start the evaluator only:

```shell
docker compose up -d evaluator --wait
```

And run the evaluation by targeting your remote instance:

```shell
docker compose exec -T evaluator python evaluate.py MODEL --endpoint http://HOST:PORT
```
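For example, with a remote Ollama server reachable at 192.168.1.50 on Ollama's default port 11434 (both values are placeholders for your own setup):

```shell
# Benchmark a model served by a remote Ollama instance (hypothetical host, default Ollama port)
docker compose exec -T evaluator python evaluate.py deepseek-coder:6.7b-instruct-q4_K_M --endpoint http://192.168.1.50:11434
```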
All script arguments can be checked using `python scripts/ollama/evaluate_perf.py --help`.
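If you are working through Docker as above, a sketch of the equivalent call (assuming the script is exposed as evaluate.py inside the evaluator container, as in the earlier commands) is:

```shell
# List the available arguments from inside the evaluator container
docker compose exec -T evaluator python evaluate.py --help
```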
Here are the results for other LLMs that have only been evaluated on the laptop GPU:
| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| tinyllama:1.1b-chat-v1-q4_0 | 2014.63 tok/s (±12.62) | 227.13 tok/s (±2.26) |
| dolphin-phi:2.7b-v2.6-q4_0 | 684.07 tok/s (±3.85) | 122.25 tok/s (±0.87) |
| dolphin-mistral:7b-v2.6 | 291.94 tok/s (±0.4) | 60.56 tok/s (±0.15) |