You crave perfect code suggestions, but you don't know whether they fit your latency requirements?
We ran our tests on the following hardware:
- NVIDIA GeForce RTX 3060 (mobile)*
- NVIDIA GeForce RTX 3070 (Scaleway GPU-3070-S)
- NVIDIA A10 (Lambda Cloud gpu_1x_a10)
- NVIDIA A10G (AWS g5.xlarge)
- NVIDIA L4 (Scaleway L4-1-24G)
*The laptop hardware setup includes an Intel(R) Core(TM) i7-12700H CPU.*

We ran the benchmark with the following LLMs (cf. the Ollama hub):
- Deepseek Coder 6.7b - instruct (Ollama, Hugging Face)
- OpenCodeInterpreter 6.7b (Ollama, Hugging Face, paper)
- Dolphin Mistral 7b (Ollama, Hugging Face, paper)
- Coming soon: StarChat v2 (Hugging Face, paper)
and the following quantization formats: q3_K_M, q4_K_M, q5_K_M.
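Each model and quantization format pair corresponds to a single tag on the Ollama hub. As an illustration, if you have the Ollama CLI installed locally, pulling two such combinations could look like this (the exact tag names are assumptions, so verify them on the Ollama hub):

```shell
# Example tags combining a model with a quantization suffix (verify on the Ollama hub)
ollama pull deepseek-coder:6.7b-instruct-q4_K_M
ollama pull dolphin-mistral:7b-v2.6-q5_K_M
```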
This benchmark was performed over 5 iterations on 4 different sequences, including on a laptop, to better reflect the performance common users can expect.
Quite simply, start the Docker services:

```shell
docker compose up -d --wait
```
Pull the model you want:

```shell
docker compose exec -T ollama ollama pull MODEL
```

And run the evaluation:

```shell
docker compose exec -T evaluator python evaluate.py MODEL
```
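For instance, a full local run with one of the models above could look like this (the exact Ollama tag, deepseek-coder:6.7b-instruct-q4_K_M, is an assumption; check the Ollama hub for the tag you want):

```shell
# Start the services, pull one model, then benchmark it
docker compose up -d --wait
docker compose exec -T ollama ollama pull deepseek-coder:6.7b-instruct-q4_K_M
docker compose exec -T evaluator python evaluate.py deepseek-coder:6.7b-instruct-q4_K_M
```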
If your Ollama instance runs on a remote machine, start the evaluator only:

```shell
docker compose up -d evaluator --wait
```

And run the evaluation by targeting your remote instance:

```shell
docker compose exec -T evaluator python evaluate.py MODEL --endpoint http://HOST:PORT
```
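For example, with a remote Ollama server reachable at 192.168.1.50 on Ollama's default port 11434 (both values are placeholders for your own setup):

```shell
# Benchmark a model served by a remote Ollama instance (hypothetical host, default Ollama port)
docker compose exec -T evaluator python evaluate.py deepseek-coder:6.7b-instruct-q4_K_M --endpoint http://192.168.1.50:11434
```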
All script arguments can be checked using `python scripts/ollama/evaluate_perf.py --help`.
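If you are working through Docker as above, a sketch of the equivalent call (assuming the script is exposed as evaluate.py inside the evaluator container, as in the earlier commands) is:

```shell
# List the available arguments from inside the evaluator container
docker compose exec -T evaluator python evaluate.py --help
```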
Here are the results for other LLMs that have only been evaluated on the laptop GPU:
| Model | Ingestion mean (std) | Generation mean (std) |
|---|---|---|
| tinyllama:1.1b-chat-v1-q4_0 | 2014.63 tok/s (±12.62) | 227.13 tok/s (±2.26) |
| dolphin-phi:2.7b-v2.6-q4_0 | 684.07 tok/s (±3.85) | 122.25 tok/s (±0.87) |
| dolphin-mistral:7b-v2.6 | 291.94 tok/s (±0.4) | 60.56 tok/s (±0.15) |