I have reverted the upstream llama.cpp change that made thread yielding conditional; with this revert, the worker threads always yield.
This improves prompt processing performance for me on my CPU, which has Intel E-cores, and matches the older, faster build I published back when Mixtral was first released.
The improvement may only apply to Intel CPUs with this hybrid architecture, but I'd recommend trying it anyway in case it helps other CPUs too (except Apple, which apparently is unaffected).
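For context, the change in question affects how worker threads wait for each other while spinning at a synchronization point. Below is a minimal, self-contained sketch of the difference between conditional and unconditional yielding; it is not the actual llama.cpp/ggml code, and all names in it (wait_for_others, n_ready, and so on) are made up for illustration.

// Hypothetical sketch only, not the real ggml synchronization code.
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

static std::atomic<int> n_ready{0};

// Spin-wait barrier: each worker waits here until all threads have arrived.
static void wait_for_others(int n_threads) {
    n_ready.fetch_add(1, std::memory_order_acq_rel);
    while (n_ready.load(std::memory_order_acquire) < n_threads) {
        // Upstream behaviour (as described above): yield only under some
        // condition, otherwise busy-spin. On a hybrid P-core/E-core CPU the
        // busy-spin variant can starve threads that still have work to do.
        //
        // Reverted behaviour: always yield, letting the OS scheduler run
        // another thread on this core while we wait.
        std::this_thread::yield();
    }
}

int main() {
    const int n_threads = 8;
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&] {
            // ... per-thread compute work would happen here ...
            wait_for_others(n_threads);
        });
    }
    for (auto & t : workers) t.join();
    std::printf("all %d threads reached the barrier\n", n_threads);
    return 0;
}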
Before (upstream conditional yield): Process:9.33s (22.2ms/T = 45.14T/s), Generate:24.02s (174.0ms/T = 5.75T/s), Total:33.34s (4.14T/s)
After (always yield): Process:8.80s (18.3ms/T = 54.52T/s), Generate:3.18s (158.9ms/T = 6.29T/s)
Prompt processing is about 1.25x faster on Mixtral, and generation is about 1.1x faster on my i5-13400F (partially offloading the same number of layers in both runs).
This is a global change, so it might also benefit the CPU-resident layers of larger models such as 70Bs.