llama : remove Tail-Free sampling #10071

ggerganov · 2024-10-28T06:56:59Z

Supersedes #9604

ggml-ci

p-e-w · 2024-10-28T13:52:01Z

Yes please. Despite its reasonable-sounding theoretical foundations, TFS has not stood the test of time and the collective experience of tens of thousands of power users has shown that it is soundly beaten by Min-P in practice, at a fraction of its complexity.

FWIW, I believe that Typical and Mirostat should meet the same fate. Judging from presets posted in forums and model cards, they're very rarely used nowadays, and Mirostat in particular is difficult to build an intuition for. The lesson appears to be that samplers using entropy/perplexity as a control metric are overlooking one or multiple key properties that strongly determine output quality. A combination of Min-P and XTC can do anything those samplers can do and more, and do it in a way that is comprehensible to a human operator thinking about probability distributions.

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used. Min-P fixes the conceptual flaws of both of them and is flat out superior. AFAIK, those samplers originate from the early OpenAI days and were then incorporated into HF Transformers, after which everyone else cargo culted them into every frontend and backend. Their continued availability (and the fact that they are presented at the very top of most frontends' sampler settings) really hurts the experience of new users, who probably shouldn't be using any sampler other than Temperature, Min-P, XTC, and DRY.

ggerganov · 2024-10-28T14:02:25Z

@p-e-w Thanks for the insights. I will use this opportunity to ask you about what do you think is the best sampling strategy when the LLM is used for fill-in-the-middle (FIM) tasks (e.g. code completion). Do you have any experience and recommendations? My intuition is that one would generally want greedy sampling in this case (since we are not looking for variety and/or creativity), but with some conservative way to terminate early as soon as the LLM is not "confident" about the next token. Do you agree and if yes, what would be a good approach to model this during sampling?

p-e-w · 2024-10-28T14:55:57Z

@ggerganov

Although I too am very interested in FIM, I don't have any practical experience with it yet. That being said, coding appears to be the primary application of FIM, and with coding one would very rarely want anything other than the model's "best guess", i.e. greedy sampling. Note that paradoxically, greedily sampling the most likely token at each position does not guarantee that the most likely sequence of tokens is being sampled, but this cannot be fixed with the way we currently do sampling.

The termination strategy is indeed the crux of the matter. While the FIM models I've seen do generate EOS (or a special end-of-infill token) when the infill is "complete", for coding, you often don't want them to go that far. Intuitively, you may want the model to infill a unit of code, for example an indentation block. As such, I expect that a hybrid heuristic composed of bracket/indentation matching and a maximum output length could work in practice.

Stopping when the confidence is low (e.g. measured by some entropy function on the distribution) might be a double-edged sword. Consider this infill task:

int i = 0;
int j = 0;

for ([FIM]

if (j < i) {
   ...
}

It's likely that there will be one or two loops involving i and j. But which comes first, and what exactly is expected? That's hard to predict with high confidence, especially given that the order of j and i is reversed in the following conditional, and it's also possible that the loop uses a different index variable k, etc. Thus the entropy of the distribution is likely to be quite high for such common patterns. But I don't think that means we should just stop.

Instead, a frontend might use a block-based termination strategy (e.g. stop when encountering a } and all generated { have been closed), combined with a stateful system that samples differently when the user chooses to regenerate, indicating that they aren't satisfied with the output. One approach could be greedily sampling on the first try, and each time the user regenerates that particular FIM output, applying successively stronger XTC configurations (e.g. by lowering xtc_threshold) and greedily sampling from the resulting distribution. This will steer the model away from regenerating the solution the user has already rejected, and IMO is a promising strategy for combining the intelligence of greedy sampling with the ability to get multiple distinct solutions.

JohannesGaessler · 2024-10-28T15:18:30Z

I think better methods for evaluating samplers are sorely needed. The in my opinion biggest difficulty lies in acquiring enough data to make statistically significant statements. I see two ways that could maybe work:

Do blind testing with users where they are given two completions from two different random presets for a task and have to decide which completion is better. This would be a huge amount of work however and it would be difficult to account for diversity in outputs since users would not necessarily notice if the outputs for multiple seeds were the same.
Use a model to generate multiple completions for a task with different seeds. Then ask (possibly another) model to rate the completions both in terms of quality and diversity. Establish which sampling methods incur the smallest quality loss for a given target diversity. Would be less work but dependent on the ability of the model that is doing the rating. Also I am implicitly assuming that there is an inherent tradeoff between quality and diversity of outputs (though increasing temperature will inevitably lead to a uniform token distribution which I would classify as having maximum diversity and minimum quality).

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used.

Deprecation warnings would be fine I guess but if we advertise our server as OpenAI compatible we definitely should not remove them.

MaggotHATE · 2024-10-28T15:32:31Z

2. Use a model to generate multiple completions for a task with different seeds. Then ask (possibly another) model to rate the completions both in terms of quality and diversity.

This reminds me of the main method described in Chain-of-Thought Reasoning without Prompting paper. Sampling multiple paths with later evaluation might become a useful inference strategy in future, let alone testing existing samplers.

strawberrymelonpanda · 2024-10-29T09:25:34Z

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated

Personally I'd like to see them turned off by default in the llama-cli defaults.

I benchmark models / settings against a personal real-world use-case test (coding, translation, math, RAG), and setting top_k to 0 and top_p to 1.0 improved my top benchmark scores in multiple models.

ngxson · 2024-10-31T10:35:51Z

I'm adding breaking change label because this PR touches llama.h (also added to #9289)

ivanstepanovftw · 2024-11-01T00:19:32Z

I will vote to disable all samplers, or at least set them to default values that will not cut any token.

As the person who first implemented the TFS sampler in llama.cpp, I would say that Min P, XTC, and temperature samplers are all you need.

ref: ggerganov/llama.cpp#10071

* feat: sync llama.cpp * fix: fix submodule update - as part of llama.cpp sync * chore: remove unnecessary comment * chore(example): revert unnecessary changes * feat: sync llama.cpp * fix: remove tfs_z ref: ggerganov/llama.cpp#10071 * fix(cpp): skip gpu device if n_gpu_layers <= 0 ref: ggerganov/llama.cpp#10132 --------- Co-authored-by: Jhen-Jie Hong <[email protected]>

mechanicmuthu · 2024-11-02T13:08:56Z

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used. Min-P fixes the conceptual flaws of both of them and is flat out superior. AFAIK, those samplers originate from the early OpenAI days and were then incorporated into HF Transformers, after which everyone else cargo culted them into every frontend and backend. Their continued availability (and the fact that they are presented at the very top of most frontends' sampler settings) really hurts the experience of new users, who probably shouldn't be using any sampler other than Temperature, Min-P, XTC, and DRY.

Top-k along with grammar is a poweful combination to get small models to give better results in Yes/no answers. If this is a maintainability issue, only then removing these make sense. Otherwise they are well understood and not everyone just wants to chat with llm. We are looking at programmatic call use.

ggml-ci

llama : remove Tail-Free sampling

ec2be7b

ggml-ci

github-actions bot added script Script related testing Everything test related examples python python script changes server labels Oct 28, 2024

ggerganov merged commit 8d8ff71 into master Oct 29, 2024
62 checks passed

ggerganov deleted the gg/remove-tfs branch October 29, 2024 08:42

ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Oct 31, 2024

ngxson mentioned this pull request Oct 31, 2024

changelog : libllama API #9289

Open

martindevans mentioned this pull request Oct 31, 2024

November Binary Update SciSharp/LLamaSharp#962

Merged

8 tasks

jhen0409 added a commit to a-ghorbani/llama.rn that referenced this pull request Nov 2, 2024

fix: remove tfs_z

1211095

ref: ggerganov/llama.cpp#10071

jhen0409 referenced this pull request in mybigday/llama.rn Nov 4, 2024

fix(android): remove tfs_z param

a61a857

Vali-98 mentioned this pull request Nov 4, 2024

0.8.0a generation crashes the app Vali-98/ChatterUI#97

Closed

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

llama : remove Tail-Free sampling (ggerganov#10071)

44df859

ggml-ci

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

llama : remove Tail-Free sampling (ggerganov#10071)

10d0baf

ggml-ci

pminev mentioned this pull request Dec 4, 2024

Update submodules alpaca-core/ilib-llama.cpp#1

Closed

1 task

rick-github mentioned this pull request Dec 27, 2024

BUG: Error when creating Modelfile - Error: unknown parameter 'tfs_z' ollama/ollama#8252

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama : remove Tail-Free sampling #10071

llama : remove Tail-Free sampling #10071

ggerganov commented Oct 28, 2024

p-e-w commented Oct 28, 2024

ggerganov commented Oct 28, 2024

p-e-w commented Oct 28, 2024

JohannesGaessler commented Oct 28, 2024

MaggotHATE commented Oct 28, 2024

strawberrymelonpanda commented Oct 29, 2024 •

edited

Loading

ngxson commented Oct 31, 2024

ivanstepanovftw commented Nov 1, 2024 •

edited

Loading

mechanicmuthu commented Nov 2, 2024

llama : remove Tail-Free sampling #10071

llama : remove Tail-Free sampling #10071

Conversation

ggerganov commented Oct 28, 2024

p-e-w commented Oct 28, 2024

ggerganov commented Oct 28, 2024

p-e-w commented Oct 28, 2024

JohannesGaessler commented Oct 28, 2024

MaggotHATE commented Oct 28, 2024

strawberrymelonpanda commented Oct 29, 2024 • edited Loading

ngxson commented Oct 31, 2024

ivanstepanovftw commented Nov 1, 2024 • edited Loading

mechanicmuthu commented Nov 2, 2024

strawberrymelonpanda commented Oct 29, 2024 •

edited

Loading

ivanstepanovftw commented Nov 1, 2024 •

edited

Loading