
llama : remove Tail-Free sampling #10071

Merged
merged 1 commit into master from gg/remove-tfs on Oct 29, 2024

Conversation

ggerganov (Owner)

Supersedes #9604

github-actions bot added the script (Script related), testing (Everything test related), examples, python (python script changes), and server labels on Oct 28, 2024
p-e-w commented Oct 28, 2024

Yes please. Despite its reasonable-sounding theoretical foundations, TFS has not stood the test of time and the collective experience of tens of thousands of power users has shown that it is soundly beaten by Min-P in practice, at a fraction of its complexity.

FWIW, I believe that Typical and Mirostat should meet the same fate. Judging from presets posted in forums and model cards, they're very rarely used nowadays, and Mirostat in particular is difficult to build an intuition for. The lesson appears to be that samplers using entropy/perplexity as a control metric are overlooking one or multiple key properties that strongly determine output quality. A combination of Min-P and XTC can do anything those samplers can do and more, and do it in a way that is comprehensible to a human operator thinking about probability distributions.

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used. Min-P fixes the conceptual flaws of both of them and is flat out superior. AFAIK, those samplers originate from the early OpenAI days and were then incorporated into HF Transformers, after which everyone else cargo culted them into every frontend and backend. Their continued availability (and the fact that they are presented at the very top of most frontends' sampler settings) really hurts the experience of new users, who probably shouldn't be using any sampler other than Temperature, Min-P, XTC, and DRY.
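To make the comparison concrete, here is a rough sketch of what Min-P filtering amounts to (simplified C++, not the actual llama.cpp implementation): the cutoff scales with the probability of the top token, so the filter adapts to how confident the model is, which is exactly what a fixed Top-P mass or Top-K count cannot do.

#include <algorithm>
#include <vector>

struct TokenProb { int id; float p; };

// Keep only tokens whose probability is at least min_p times the probability
// of the most likely token, then renormalize the survivors.
std::vector<TokenProb> min_p_filter(std::vector<TokenProb> cand, float min_p) {
    float p_max = 0.0f;
    for (const auto & c : cand) p_max = std::max(p_max, c.p);
    const float cutoff = min_p * p_max;   // cutoff scales with the model's confidence
    cand.erase(std::remove_if(cand.begin(), cand.end(),
                              [cutoff](const TokenProb & c) { return c.p < cutoff; }),
               cand.end());
    float sum = 0.0f;
    for (const auto & c : cand) sum += c.p;
    for (auto & c : cand) c.p /= sum;     // renormalize
    return cand;
}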

ggerganov (Owner, Author)

@p-e-w Thanks for the insights. I will use this opportunity to ask what you think is the best sampling strategy when the LLM is used for fill-in-the-middle (FIM) tasks (e.g. code completion). Do you have any experience and recommendations? My intuition is that one would generally want greedy sampling in this case (since we are not looking for variety and/or creativity), but with some conservative way to terminate early as soon as the LLM is not "confident" about the next token. Do you agree, and if yes, what would be a good approach to model this during sampling?

p-e-w commented Oct 28, 2024

@ggerganov

Although I too am very interested in FIM, I don't have any practical experience with it yet. That being said, coding appears to be the primary application of FIM, and with coding one would very rarely want anything other than the model's "best guess", i.e. greedy sampling. Note that paradoxically, greedily sampling the most likely token at each position does not guarantee that the most likely sequence of tokens is being sampled, but this cannot be fixed with the way we currently do sampling.
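To illustrate with made-up numbers: if token A has probability 0.5 but its best continuation only 0.3 (joint probability 0.15), while token B has probability 0.4 followed by a continuation of probability 0.9 (joint 0.36), greedy decoding picks A even though the sequence starting with B is more than twice as likely.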

The termination strategy is indeed the crux of the matter. While the FIM models I've seen do generate EOS (or a special end-of-infill token) when the infill is "complete", for coding, you often don't want them to go that far. Intuitively, you may want the model to infill a unit of code, for example an indentation block. As such, I expect that a hybrid heuristic composed of bracket/indentation matching and a maximum output length could work in practice.

Stopping when the confidence is low (e.g. measured by some entropy function on the distribution) might be a double-edged sword. Consider this infill task:

int i = 0;
int j = 0;

for ([FIM]

if (j < i) {
   ...
}

It's likely that there will be one or two loops involving i and j. But which comes first, and what exactly is expected? That's hard to predict with high confidence, especially given that the order of j and i is reversed in the following conditional, and it's also possible that the loop uses a different index variable k, etc. Thus the entropy of the distribution is likely to be quite high for such common patterns. But I don't think that means we should just stop.
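For concreteness, the "entropy function" I have in mind above is just the Shannon entropy of the candidate distribution. A minimal sketch (not an existing llama.cpp API):

#include <cmath>
#include <vector>

// Shannon entropy (in nats) of a token probability distribution.
// Low entropy = the model strongly prefers a few tokens; high entropy = it is unsure.
float shannon_entropy(const std::vector<float> & probs) {
    float h = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) h -= p * std::log(p);
    }
    return h;
}

A frontend could compare this value against a threshold, but as the example above shows, a high value does not necessarily mean the infill should be abandoned.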

Instead, a frontend might use a block-based termination strategy (e.g. stop when encountering a } and all generated { have been closed), combined with a stateful system that samples differently when the user chooses to regenerate, indicating that they aren't satisfied with the output. One approach could be greedily sampling on the first try, and each time the user regenerates that particular FIM output, applying successively stronger XTC configurations (e.g. by lowering xtc_threshold) and greedily sampling from the resulting distribution. This will steer the model away from regenerating the solution the user has already rejected, and IMO is a promising strategy for combining the intelligence of greedy sampling with the ability to get multiple distinct solutions.
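To sketch the shape of that idea (all function names and constants here are hypothetical frontend-side pseudocode, not an existing API):

#include <algorithm>
#include <string>

// Stop once every '{' the model has generated is matched by a '}', or when a
// maximum number of tokens has been produced (safety net). Real code would
// also need to ignore braces inside strings and comments.
bool should_stop(const std::string & generated, int n_tokens, int max_tokens) {
    int depth = 0;
    bool opened = false;
    for (char c : generated) {
        if (c == '{') { depth++; opened = true; }
        if (c == '}') { depth--; }
        if (opened && depth <= 0) return true;   // the opened block has been closed
    }
    return n_tokens >= max_tokens;
}

// On each regeneration, make XTC more aggressive by lowering its threshold,
// so the "obvious" top tokens the user already rejected get pruned away.
float xtc_threshold_for_retry(int n_retries) {
    const float t = 1.0f - 0.15f * (float) n_retries;   // 1.0 effectively disables XTC
    return std::max(t, 0.1f);
}

The threshold schedule is arbitrary; the point is only that each rejection makes XTC remove more of the top choices the user has already seen, while sampling stays greedy over whatever distribution remains.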

JohannesGaessler (Collaborator)

I think better methods for evaluating samplers are sorely needed. In my opinion, the biggest difficulty lies in acquiring enough data to make statistically significant statements. I see two ways that could maybe work:

  1. Do blind testing with users where they are given two completions from two different random presets for a task and have to decide which completion is better. This would be a huge amount of work, however, and it would be difficult to account for diversity in outputs, since users would not necessarily notice if the outputs for multiple seeds were the same.
  2. Use a model to generate multiple completions for a task with different seeds. Then ask (possibly another) model to rate the completions both in terms of quality and diversity. Establish which sampling methods incur the smallest quality loss for a given target diversity. This would be less work, but dependent on the ability of the model that is doing the rating. Also, I am implicitly assuming that there is an inherent tradeoff between quality and diversity of outputs (though increasing temperature will inevitably lead to a uniform token distribution, which I would classify as having maximum diversity and minimum quality).

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used.

Deprecation warnings would be fine, I guess, but if we advertise our server as OpenAI-compatible we definitely should not remove them.

MaggotHATE (Contributor)

2. Use a model to generate multiple completions for a task with different seeds. Then ask (possibly another) model to rate the completions both in terms of quality and diversity.

This reminds me of the main method described in the Chain-of-Thought Reasoning without Prompting paper. Sampling multiple paths with later evaluation might become a useful inference strategy in the future, not to mention its use for testing existing samplers.

ggerganov merged commit 8d8ff71 into master on Oct 29, 2024
62 checks passed
ggerganov deleted the gg/remove-tfs branch on October 29, 2024 at 08:42
strawberrymelonpanda (Contributor) commented Oct 29, 2024

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated

Personally, I'd like to see them turned off in the llama-cli defaults.

I benchmark models / settings against a personal real-world use-case test (coding, translation, math, RAG), and setting top_k to 0 and top_p to 1.0 improved my top benchmark scores in multiple models.
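For anyone who wants to try the same thing, the invocation looks roughly like this (assuming current llama-cli flag names; the model path and the remaining values are placeholders, not a recommendation):

./llama-cli -m model.gguf --top-k 0 --top-p 1.0 --min-p 0.05 --temp 0.8 -p "Your prompt"

With top_k at 0 and top_p at 1.0, both samplers pass every token through unchanged, so only Min-P and temperature shape the distribution.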

ngxson added the breaking change label (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) on Oct 31, 2024
ngxson (Collaborator) commented Oct 31, 2024

I'm adding the breaking change label because this PR touches llama.h (also added to #9289).

ivanstepanovftw (Collaborator) commented Nov 1, 2024

I would vote to disable all samplers, or at least set their defaults to values that do not cut any tokens.

As the person who first implemented the TFS sampler in llama.cpp, I would say that Min P, XTC, and temperature samplers are all you need.

jhen0409 added a commit to a-ghorbani/llama.rn that referenced this pull request Nov 2, 2024
jhen0409 added a commit to mybigday/llama.rn that referenced this pull request Nov 2, 2024
* feat: sync llama.cpp

* fix: fix submodule update - as part of llama.cpp sync

* chore: remove unnecessary comment

* chore(example): revert unnecessary changes

* feat: sync llama.cpp

* fix: remove tfs_z

ref: ggerganov/llama.cpp#10071

* fix(cpp): skip gpu device if n_gpu_layers <= 0

ref: ggerganov/llama.cpp#10132

---------

Co-authored-by: Jhen-Jie Hong <[email protected]>
mechanicmuthu

Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used. Min-P fixes the conceptual flaws of both of them and is flat out superior. AFAIK, those samplers originate from the early OpenAI days and were then incorporated into HF Transformers, after which everyone else cargo culted them into every frontend and backend. Their continued availability (and the fact that they are presented at the very top of most frontends' sampler settings) really hurts the experience of new users, who probably shouldn't be using any sampler other than Temperature, Min-P, XTC, and DRY.

Top-K along with grammar is a powerful combination for getting small models to give better results on yes/no answers. Removing these only makes sense if they are a maintainability issue. Otherwise they are well understood, and not everyone just wants to chat with an LLM. We are looking at programmatic use.

jhen0409 referenced this pull request in mybigday/llama.rn Nov 4, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024