
Smooth Sampling Test Build (koboldcpp)

@kalomaze kalomaze released this 19 Jan 03:30

Dynamic Temperature sampling is a unique concept, but it always peeved me that:

  • We're basically forced to use truncation strategies like Min P or Top K, since a dynamically chosen temperature by itself isn't enough to keep the long tail of the distribution from being sampled.
  • These rely on a harsh cutoff and don't allow for "smooth" changes to the distribution.

So, I had an idea: why don't we scale the token scores (the raw logits, before they're turned into probabilities) based on how far they are from the minimum and maximum logit values?

Enter Smooth Sampling:

[Image: 5.0 smoothing factor, 0.25 temperature, Toppy 7b q8_0]

What this new option allows for can be summed up as:

  • You can increase the variability of good choices without a high Temperature
  • This can be used to make lower Temperatures more creative, too

I'm a fan of this approach because it simplifies the sampling process down to two options: Temperature itself + the smoothing factor. Technically, a good combination should work well without truncation samplers like Min P. I've only tested it on 7b so far, but it should work nicely across bigger models, especially more confident ones that otherwise get stuck in repetition. (Looking at you, Mixtral Instruct.)

How to Use

In the ExtStuff.txt file, there's a single option for the smoothing factor.

The higher this value is, the more variable your outputs will be.

I recommend disabling all truncation samplers (that means Top K = 0, Top P = 1.0, Min P = 0, etc.) and finding a good balance between a lowered (or at least slightly lowered) Temperature (between 0.7-1.0) and the smoothing factor (between 2.5-7.5).

The lower the temperature, the higher you'll be able to turn up the smoothing factor.

Technical Breakdown

  • Logits are first normalized to a range between 0 and 1 based on the initial min and max logits.
  • A sigmoid function is applied to each normalized logit. This compresses values near the minimum logit toward 0 and values near the maximum logit toward 1, which increases the variance among good choices while preventing low-probability outliers. The k value decides the steepness of the curve (I call this the "smoothing factor").
  • After the sigmoid is applied, the "smoothed" logits are scaled back up to match the original logit range, i.e., they are "denormalized".
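The three steps above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actual koboldcpp code; in particular, the function name is mine, and I'm assuming the sigmoid is centered at the midpoint of the normalized range (0.5), which the notes don't state explicitly.

```python
import math

def smooth_logits(logits, k=5.0):
    """Sketch of the smoothing transform: normalize, sigmoid, denormalize.

    Assumption: the sigmoid is centered at 0.5 in normalized space, i.e.
    s(x) = 1 / (1 + exp(-k * (x - 0.5))), with k as the smoothing factor.
    """
    lo, hi = min(logits), max(logits)
    if hi == lo:
        return list(logits)  # degenerate case: all logits identical
    span = hi - lo
    out = []
    for x in logits:
        norm = (x - lo) / span                                # 1. normalize to [0, 1]
        smoothed = 1.0 / (1.0 + math.exp(-k * (norm - 0.5)))  # 2. sigmoid with steepness k
        out.append(lo + smoothed * span)                      # 3. denormalize to original range
    return out
```

Because the sigmoid is strictly monotonic, the transform preserves the ranking of tokens; it only reshapes the gaps between them, squeezing tail logits toward the minimum and top logits toward the maximum.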

Visualization

[Image: visualization of the sigmoid smoothing curve]