
Smooth Sampling Test Build (koboldcpp)

@kalomaze kalomaze released this 19 Jan 03:30

Dynamic Temperature sampling is a unique concept, but it always peeved me that:

  • We're basically forced to use truncation strategies like Min P or Top K, since a dynamically chosen temperature by itself isn't enough to keep the long tail of the distribution from being sampled.
  • These rely on a harsh cutoff and don't allow for "smooth" changes to the distribution.

So, I had an idea: why don't we scale the token scores (the raw logits, before they're turned into probabilities) based on how far they are from the minimum and maximum logit values?

Enter Smooth Sampling:

[Image: 5.0 smoothing factor, 0.25 temperature, Toppy 7b q8_0]

What this new option allows for can be summed up as:

  • You can increase the variability of good choices without a high Temperature
  • This can be used to make lower Temperatures more creative, too

I'm a fan of this approach because it simplifies the sampling process down to two options: Temperature itself + the smoothing factor. Technically, a good combination should work well without truncation samplers like Min P. I've only tested it on 7b so far, but it should work nicely across bigger models, especially more confident ones that otherwise get stuck in repetition. (Looking at you, Mixtral Instruct.)

How to Use

In the ExtStuff.txt file, there's a single option for the smoothing factor.

The higher this value is, the more variable your outputs will be.

I recommend disabling all truncation samplers (that means Top K = 0, Top P = 1.0, Min P = 0, etc.) and finding a good balance between a lowered (or at least slightly lowered) Temperature (between 0.7-1.0) and the smoothing factor (between 2.5-7.5).

The lower the temperature, the higher you'll be able to turn up the smoothing factor.

Technical Breakdown

  • Logits are first normalized to a range between 0 and 1 based on the initial min and max logits.
  • A sigmoid function is applied to each normalized logit. This compresses values near the minimum logit toward 0 and values near the maximum logit toward 1, which increases the variance among good choices while preventing low-probability outliers. The k value decides the steepness of the curve (I call this the "smoothing factor").
  • After the sigmoid is applied, the "smoothed" logits are scaled back up to match the original logit range, i.e., they are "denormalized".
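The three steps above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actual koboldcpp code; in particular, the function name is mine, and I'm assuming the sigmoid is centered at the midpoint of the normalized range (0.5), which the notes don't state explicitly.

```python
import math

def smooth_logits(logits, k=5.0):
    """Sketch of the smoothing transform: normalize, sigmoid, denormalize.

    Assumption: the sigmoid is centered at 0.5 in normalized space, i.e.
    s(x) = 1 / (1 + exp(-k * (x - 0.5))), with k as the smoothing factor.
    """
    lo, hi = min(logits), max(logits)
    if hi == lo:
        return list(logits)  # degenerate case: all logits identical
    span = hi - lo
    out = []
    for x in logits:
        norm = (x - lo) / span                                # 1. normalize to [0, 1]
        smoothed = 1.0 / (1.0 + math.exp(-k * (norm - 0.5)))  # 2. sigmoid with steepness k
        out.append(lo + smoothed * span)                      # 3. denormalize to original range
    return out
```

Because the sigmoid is strictly monotonic, the transform preserves the ranking of tokens; it only reshapes the gaps between them, squeezing tail logits toward the minimum and top logits toward the maximum.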

Visualization

[Image: visualization of the sigmoid smoothing curve]