Releases: kalomaze/koboldcpp
Faster CPU Prompt Processing (v1.57, CUDA 12)
I have reverted the upstream llama.cpp change that made thread yielding conditional; in this build, threads always yield.
This improves prompt processing performance for me on my CPU which has Intel E-cores, and matches the old faster build I published back when Mixtral was initially released.
The improvement might only apply to Intel CPUs with the hybrid architecture, but I'd recommend trying it in case it also helps other CPUs (except for Apple, which apparently is unaffected).
Process:9.33s (22.2ms/T = 45.14T/s), Generate:24.02s (174.0ms/T = 5.75T/s), Total:33.34s (4.14T/s)
Process:8.80s (18.3ms/T = 54.52T/s), Generate:3.18s (158.9ms/T = 6.29T/s)
My prompt processing is about 1.25x faster on Mixtral, and generation is about 1.1x faster, on my i5-13400F (I am partially offloading the same number of layers in both runs).
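As a quick sanity check, the speedup can be recomputed from the two timing lines quoted above. This is a minimal sketch that assumes the first line is the older build and the second is this build (the notes do not label them); the regex and the helper name `stage_speeds` are illustrative, not part of koboldcpp.

```python
import re

# Matches one stage of a timing line, e.g. "Process:9.33s (22.2ms/T = 45.14T/s)".
# The "Total" entry has no ms/T field, so it is skipped on purpose.
STAGE = re.compile(r"(\w+):([\d.]+)s \(([\d.]+)ms/T = ([\d.]+)T/s\)")


def stage_speeds(line):
    """Return {stage name: tokens per second} for each stage found in the line."""
    return {name: float(tps) for name, _secs, _ms, tps in STAGE.findall(line)}


# Assumption: first line = before the change, second line = after.
before = stage_speeds(
    "Process:9.33s (22.2ms/T = 45.14T/s), Generate:24.02s (174.0ms/T = 5.75T/s), Total:33.34s (4.14T/s)"
)
after = stage_speeds(
    "Process:8.80s (18.3ms/T = 54.52T/s), Generate:3.18s (158.9ms/T = 6.29T/s)"
)

for stage in ("Process", "Generate"):
    print(f"{stage}: {after[stage] / before[stage]:.2f}x")
```

For this single pair of runs the ratios come out a little under the rounded figures quoted above, which is expected run-to-run noise for a one-off comparison.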
This is a global change; it might benefit larger models like 70bs for CPU layers.
koboldcpp-1.57 - CUDA 12.3 build
I have merged in the llama.cpp PR (not yet merged upstream) that makes Mixtral prompt processing faster. It should give roughly a 1.25x prompt processing speedup for all CPU layers.
Quadratic Sampling Test Build (koboldcpp)
Replacement for the last idea (Smooth Sampling) with a different scaling mechanism.
The idea behind it is to simplify sampling as much as possible and remove as many extra variables as is reasonable.
The design I've been testing (on Toppy 7b so far) is "quadratic sampling". The way that it works is:
- We transform each logit based on a quadratic function with a scaling factor & a reference value (h). A higher scaling factor will generally be more deterministic.
- Logits closer to the reference value (which is the maximum logit) will be boosted in score, so that the top tokens become more evenly distributed, in order to avoid repetition and improve vocabulary usage
- Because we are using the top logit as the reference value, the modifications should theoretically scale somewhat well across different models which have different "scales" (e.g. Yi 34b with its 64k vocab)
- We inherently penalize small logits in the process of making the top ones more even, leading to a more coherent distribution overall without having to resort to cutting out tokens completely.
Things that are not implemented yet:
- Target logit multiplier. This is so you can slightly shift the probabilities "backwards" to make it a bit less deterministic in a more natural way compared to the Repetition Penalty (e.g. a 0.975x multiplier).
- I don't sort the tokens before deriving the maximum logit. I don't think this causes any issues at the moment (somehow?) but I will fix that just in case for the next build.
So far, values between 0.2-0.5 seem optimal.
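The transform described above can be sketched roughly as follows. The exact formula is not given in these notes, so the form `h - scale * (x - h)**2` is an assumption about what "a quadratic function with a scaling factor and a reference value (h)" means, and `quadratic_transform` is a hypothetical name used only for illustration.

```python
def quadratic_transform(logits, scale):
    """Apply an assumed quadratic transform around the maximum logit.

    h is the reference value (the maximum logit). Using max() means the
    logits do not need to be sorted first.
    """
    h = max(logits)
    # Logits within 1/scale of h move closer to h; logits further away
    # are pushed down more than linearly.
    return [h - scale * (x - h) ** 2 for x in logits]


logits = [5.0, 4.5, 1.0]
print(quadratic_transform(logits, 0.3))
```

With a scale of 0.3, the gap between the top two logits shrinks while the distant third logit drops further, which matches the behaviour described in the list: the top tokens become more even and small logits are penalized without being cut off.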
Smooth Sampling Test Build (koboldcpp)
Dynamic Temperature sampling is a unique concept, but it always peeved me that:
- We basically are forced to use truncation strategies like Min P or Top K, as a dynamically chosen temperature by itself isn't enough to prevent the long tail end of the distribution from being selected.
- These rely on a harsh cutoff and don't allow for "smooth" changes to the distribution.
So, I had an idea: why don't we scale the token scores (the raw ones, before they are turned into probabilities) based on how far they are from the minimum and maximum logit values respectively?
Enter Smooth Sampling:
[5.0 smoothing factor, 0.25 temperature, Toppy 7b q8_0]
What this new option allows for can be summed up as:
- You can increase the variability of good choices without a high Temperature
- This can be used to make lower Temperatures more creative, too
I'm a fan of this approach because it lets you simplify the sampling process into two options: Temperature itself + the Smoothing factor. Technically, a good combination should work well without truncation samplers like Min P. I've only tested it on 7b so far, but it should work nicely across bigger models, especially more confident ones that otherwise get stuck in repetition. (Looking at you, Mixtral Instruct.)
How to Use
In the ExtStuff.txt file, there's a single option for the smoothing factor.
The higher this value is, the more variable your outputs will be.
I recommend using no truncation samplers (that means Top K = 0, Top P = 1.0, Min P = 0, etc) and finding a good balance between a lowered (or slightly lowered) Temperature (between 0.7-1.0) and the smoothing factor (between 2.5-7.5).
The lower the temperature, the higher you'll be able to turn up the smoothing factor.
Technical Breakdown
- Logits are initially normalized to a range between 0 and 1 based on the initial min and max logits.
- A sigmoid function is applied to each normalized logit. This compresses values near the minimum logit closer to 0 and values near the maximum logit closer to 1 to increase the variance while preventing low probability outliers. The k value decides the steepness of the curve (I call this the "smoothing factor")
- After applying the sigmoid function, the "smoothed" logits are scaled back up to match the original logit range. i.e., they are "denormalized"
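The three steps above can be sketched as below. This is a minimal illustration, not the code from the build: centring the sigmoid at 0.5 is an assumption (the notes only say that k controls the steepness), and `smooth_logits` is a hypothetical name.

```python
import math


def smooth_logits(logits, k):
    """Normalize to [0, 1], apply a sigmoid with steepness k, then denormalize."""
    lo, hi = min(logits), max(logits)
    if hi == lo:
        # All logits are equal, so there is nothing to smooth.
        return list(logits)
    smoothed = []
    for x in logits:
        n = (x - lo) / (hi - lo)                      # step 1: normalize
        s = 1.0 / (1.0 + math.exp(-k * (n - 0.5)))    # step 2: sigmoid (centre 0.5 is assumed)
        smoothed.append(lo + s * (hi - lo))           # step 3: scale back to the logit range
    return smoothed


print(smooth_logits([0.0, 2.0, 8.0, 10.0], k=5.0))
```

In this sketch the ordering of tokens is preserved and the gap between the top logits shrinks, so good choices end up closer together before Temperature is applied.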
Visualization
Koboldcpp 1.54 + Dynamic Temp (Fixed UI Build)
This custom koboldcpp build adds Dynamic Temp UI support for SillyTavern (once they merge this pull request):
SillyTavern/SillyTavern#1666
The PR mentioned also adds support for the DynaTemp implementation seen in text-generation-webui, once the changes from the dev branch get merged upstream.
Koboldcpp 1.54 + Dynamic Temp (UI Build)
I have decided to submit a PR for Dynamic Temp to be added to the mainline koboldcpp rather than as a janky side build.
Once this is merged, it will have some notable changes:
- No more manual override settings or .txt files; it's a properly integrated sampler within kobold itself (thanks for the help @AAbushady!)
- UI is included in kobold lite to control it like how you would any other sampler, will look into SillyTavern integration too
- The older methods (Greedy Dynamic Temp & HHI Dynamic Temp) have been removed
I am considering keeping the exponent value as a configurable setting as well if that is in demand.
The source code of this release is correct this time for the dynatemp-ui changes (the dynatemp-pr-upstream branch is also identical as of writing).
Do keep in mind this release will not support Dynamic Temp directly in SillyTavern until ST's UI is updated to expect it.
CUDA 12.3 build of koboldcpp-1.53
A release that compiles the latest koboldcpp with CUDA 12.3 instead of 11.7 for speed improvements on modern NVIDIA cards [koboldcpp_mainline_cuda12.exe].
There is a Dynamic Temp + Noisy supported version included as well [koboldcpp_dynatemp_cuda12.exe].
Both are up to date with the latest koboldcpp changes and have other small improvements from upstream.
NOTE: Dynamic Temp branch source code has been moved to the dynatemp-fix branch for those building from source.
Faster Mixtral Prompt Processing for Koboldcpp (+ DynaTemp)
JohannesGaessler made two PRs recently for mainline llama.cpp:
- Faster prompt processing for full CUDA offloading (GPU) (this is merged in llama.cpp)
- Faster prompt processing for partial CUDA offloading (CPU+GPU) (also merged now)
I have merged these changes experimentally into my custom koboldcpp build and also removed my debug print statements for regular users. It seems to work just fine, and as expected, Mixtral prompt processing speeds are much better on my hardware:
- 40ms per token (25 tokens/s) compared to 130ms per token (7 t/s), i.e. over a 3x speedup.
My CPU only has 4 cores and I have an RTX 3060, so I am likely bottlenecked to some degree; if your CPU is fast, or if you can offload more layers than 13/33, you'll probably get even better speedups.
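For reference, the conversion behind those figures is straightforward, reading both timings as milliseconds per token. The helper name `ms_to_tps` is just for illustration.

```python
def ms_to_tps(ms_per_token):
    """Convert milliseconds per token into tokens per second."""
    return 1000.0 / ms_per_token


old_ms, new_ms = 130.0, 40.0
print(f"old: {ms_to_tps(old_ms):.1f} t/s")    # old: 7.7 t/s
print(f"new: {ms_to_tps(new_ms):.1f} t/s")    # new: 25.0 t/s
print(f"speedup: {old_ms / new_ms:.2f}x")     # speedup: 3.25x
```

The quoted tokens-per-second values are rounded, so small differences from these numbers are expected.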
You might need to adjust your settings
To get the speed benefits, it's very important that you set BLAS batch size to 512 and that you set your BLAS thread count to your available CPU core count. Ironically, it'll be slower than how it was before if you were turning off batching, but it should be much faster once this is adjusted.
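If you launch from the command line rather than the GUI, the settings above could be assembled along these lines. This is a sketch under assumptions: the flag names `--blasbatchsize` and `--blasthreads` are how I understand koboldcpp's options, so check `--help` on your build, and the model filename is a placeholder. Note that `os.cpu_count()` reports logical CPUs, which can be higher than the physical core count.

```python
import os
from typing import List, Optional


def koboldcpp_args(model_path: str, cores: Optional[int] = None) -> List[str]:
    """Build an argument list with BLAS batch size 512 and one BLAS thread per core.

    Assumption: flag names follow koboldcpp's CLI as of roughly this version.
    """
    n = cores or os.cpu_count() or 1
    return [
        "--model", model_path,
        "--blasbatchsize", "512",
        "--blasthreads", str(n),
    ]


# "model.gguf" is a hypothetical path; pass your physical core count explicitly if you know it.
print(" ".join(koboldcpp_args("model.gguf", cores=4)))
```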
Dynamic Temp overrides and Noisy sampling are both still included for those who use them.
EDIT: Noisy Sampling hotfix was added. Before, it was not activating when the value was set higher than 0. Nothing else was affected and the build seems to be stable.
Dynamic Temp (updated, December 15)
Another maintenance release of my custom koboldcpp build; this one adds support for Mixtral.
Noisy sampling is still included.
Override values have changed for the different Dynamic Temp options.
- 3.9 Temp for Greedy Dynamic Temp
- 2.2 Temp for HHI Dynamic Temp
- 1.84 Temp for Entropy Dynamic Temp (this one remains the same as before)
I still think Entropy Dynamic Temp is the best one. 0.0 minTemp and 2.0 maxTemp seem to scale well as a default, but the maxTemp can go higher, especially with a decent minP setting.
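To make the minTemp and maxTemp settings more concrete, here is one way an entropy-based dynamic temperature could be computed. These notes do not spell out the formula, so this is an assumed reading: the temperature is interpolated between minTemp and maxTemp by the normalized entropy of the token distribution, with an optional exponent. The function name and the `exponent` default are hypothetical.

```python
import math


def entropy_dynamic_temp(logits, min_temp=0.0, max_temp=2.0, exponent=1.0):
    """Pick a temperature from the normalized entropy of softmax(logits).

    Assumed mapping: confident (low-entropy) distributions get a temperature
    near min_temp, flat (high-entropy) ones get a temperature near max_temp.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(logits))        # entropy of a uniform distribution
    normalized = entropy / max_entropy if max_entropy > 0.0 else 0.0
    normalized = min(max(normalized, 0.0), 1.0)
    return min_temp + (max_temp - min_temp) * normalized ** exponent


print(entropy_dynamic_temp([1.0, 1.0, 1.0, 1.0]))     # flat distribution
print(entropy_dynamic_temp([100.0, 0.0, 0.0, 0.0]))   # very confident distribution
```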
EDIT: Hotfix for latest koboldcpp build that has improved prompt processing is now included. I recommend BLAS Batch size 128 for Mixtral at the moment, until further improvements are made to prompt processing.
Custom MoE Routing (llama.cpp)
For Mixtral's MoE setup, the default number of experts routed per token is 2.
This is a custom build of llama.cpp, based on the Mixtral PR, that lets you customize the number of experts routed per token, from 1 expert (fastest) to 8 experts (slowest).
The experts.txt file lets you change this number from the default of 2.
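To illustrate what changing this number does, here is a minimal sketch of top-k expert routing: the router scores every expert, the top n are kept, and their scores are softmaxed into mixing weights. This is a generic illustration rather than the llama.cpp code, and `route_token` and `n_experts_used` are hypothetical names. The format of experts.txt is not shown here because it is not described in these notes.

```python
import math


def route_token(router_logits, n_experts_used=2):
    """Return [(expert index, weight), ...] for the top n experts by router score."""
    if not 1 <= n_experts_used <= len(router_logits):
        raise ValueError("n_experts_used must be between 1 and the number of experts")
    # Indices of the highest-scoring experts, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:n_experts_used]
    # Softmax over the selected experts only, so the weights sum to 1.
    m = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


scores = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9]   # made-up scores for 8 experts
print(route_token(scores, n_experts_used=2))
print(route_token(scores, n_experts_used=1))
```

Each selected expert has to run for every token, which is why 1 expert is the fastest setting and 8 is the slowest.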