
YaRN Support #272

Open
grimulkan opened this issue Sep 2, 2023 · 8 comments

Comments

@grimulkan

Any thoughts/plans about YaRN support for the positional embeddings?
https://github.com/jquesnelle/yarn

I don't actually see it beat regular linear scaling with fine-tuning in the paper, but presumably it extends beyond the fine-tuned context length without breaking and performs better than regular PI at shorter contexts.

I already see GPTQ quants from TheBloke for some models trained with YaRN. I'm not sure how those are supposed to work without changing the way the positions are calculated.

I don't think what the authors call NTK-by-parts is supported by exllama either (YaRN is just a slight modification of it), so maybe there's something about it that makes it tricky to integrate?

This is all still static RoPE (scale, base, alpha, etc. are set once at load time).
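
For reference, here's a rough sketch of what I mean by a static table (illustrative Python, not exllama's actual code; `scale` and `base` just stand in for the linear/alpha knobs set at load time):

```python
import torch

def build_rope_table(head_dim, max_seq_len, base=10000.0, scale=1.0):
    # Per-dimension rotation speeds: theta_i = base^(-2i/d)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Linear/PI scaling compresses positions by `scale`; the NTK/alpha variants
    # would instead change `base` and leave the positions alone.
    positions = torch.arange(max_seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)  # (max_seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

# e.g. 2x linear compression, computed once at load time and never touched again
cos, sin = build_rope_table(head_dim=128, max_seq_len=8192, scale=2.0)
```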

@niceblue88

I second this a lot. Being able to scale to a large context while keeping model VRAM use low is the most valuable part; almost all use cases become more useful as context size grows. However, in exllama I think the positional embeddings are tightly bound to the great optimisations for speed and minimal VRAM, which makes it harder to make them dynamic without losing those optimisations. Perhaps they could be broken out to allow for this?

@turboderp
Owner

turboderp commented Sep 9, 2023

I'm still not sure what "dynamic" positional encodings actually means, and how you would use them with cached keys.

@grimulkan
Author

I am not sure we need them to be dynamic. YaRN works both ways? The static version I described above still computes the positional table once at the start, just like exllama does today, as far as I understand.

@grimulkan
Author

By ‘dynamic’, the paper means something that changes the RoPE scaling depending on the actual context size (it only compresses once the context exceeds the original pre-trained size). This is optional. They have this to say on caching under the dynamic implementation:

> Some care has to be taken when using Dynamic Scaling with kv-caching [6], as in some implementations, the RoPE embeddings are cached. The correct implementation should cache the kv-embeddings before applying RoPE, as the RoPE embedding of every token changes when s changes.
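
To make the dynamic part concrete, a rough sketch (my own illustrative code, not the paper's): the factor s is derived from the current context length, so the table has to be rebuilt, and previously cached rotations go stale, whenever s grows.

```python
import torch

def dynamic_scale(current_len, original_max_len=4096):
    # Only compress once the context exceeds the original pre-trained size
    return max(1.0, current_len / original_max_len)

def rope_table(head_dim, seq_len, base=10000.0, original_max_len=4096):
    s = dynamic_scale(seq_len, original_max_len)  # changes as seq_len grows
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / s  # dynamic-linear variant
    angles = torch.outer(positions, inv_freq)
    return torch.cos(angles), torch.sin(angles)
```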

My crude understanding (each can be used with or without the dynamic aspect above; rough sketch after the list):

kaiokendev linear: always compress positions by some factor. Works well when fine-tuned.
NTK-alpha or CodeLlama-style base change: change the scale with a specific nonlinear equation (modifying either the exponent or the base). Not as good when fine-tuned (though using a giant base like Meta did kind of works; alpha doesn't fine-tune as well as linear).
NTK-by-parts (not supported by transformers/exllama): uses a different formula depending on which hidden-state dimension the position is being computed for. Presumably fine-tunes better than linear and still extrapolates?
YaRN: the above, plus scaling attention by a constant factor that depends on the compression ratio.
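
The sketch: my reading of the by-parts/YaRN frequency calculation (the alpha/beta thresholds and the 0.1·ln(s)+1 attention factor are the paper's LLaMA defaults; everything else is illustrative, so take it with a grain of salt):

```python
import math
import torch

def yarn_inv_freq(head_dim, scale, base=10000.0, original_max_len=4096,
                  alpha=1.0, beta=32.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # How many full rotations each dimension completes over the original context
    rotations = original_max_len * inv_freq / (2 * math.pi)
    # Per-dimension ramp: 0 = fully interpolate (divide by s), 1 = leave as-is
    ramp = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return inv_freq * ramp + (inv_freq / scale) * (1.0 - ramp)

def yarn_attention_factor(scale):
    # YaRN's extra piece: scale attention by a constant that depends on s
    return 0.1 * math.log(scale) + 1.0 if scale > 1.0 else 1.0
```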

@turboderp
Owner

turboderp commented Sep 9, 2023

> The correct implementation should cache the kv-embeddings before applying RoPE, as the RoPE embedding of every token changes when s changes.

This is the part that doesn't make sense to me. I may just be failing to wrap my head around it, but as far as I can tell this would only work for the first layer. The state that exits the first layer is computed as a function of (among other things) those positional embeddings, and the keys and values in turn are produced from that state. So while you can save the keys/values from layer 2 without the embeddings from layer 2, they will still depend on the embeddings from layer 1. And so on throughout the model.
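
To spell that out with a sketch (my own illustration of the pre-RoPE caching the quote describes, hypothetical names): even if every layer caches its keys unrotated and re-applies RoPE on read, the hidden states that produced the cached keys of layer 2 onward were themselves computed through layer 1's attention using the old embeddings, so re-rotating the cache only fully fixes the first layer.

```python
import torch

class PreRopeKCache:
    def __init__(self):
        self.k_raw = []  # unrotated keys, one (heads, head_dim) tensor per token

    def append(self, k):
        self.k_raw.append(k)

    def rotated(self, cos, sin):
        # cos/sin: (seq, head_dim // 2) table built with the *current* scale s,
        # so old entries get re-rotated -- but see the caveat above.
        k = torch.stack(self.k_raw)               # (seq, heads, head_dim)
        k1, k2 = k[..., 0::2], k[..., 1::2]
        c, s_ = cos.unsqueeze(1), sin.unsqueeze(1)
        out = torch.empty_like(k)
        out[..., 0::2] = k1 * c - k2 * s_
        out[..., 1::2] = k1 * s_ + k2 * c
        return out
```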

@grimulkan
Author

I see. Does that also interfere with methods that change the position embeddings per hidden dimension, like YaRN?

@turboderp
Owner

I'm not sure what those do exactly, especially since the default RoPE implementation already adapts to the hidden dimension of the model. But the hidden dimension of the model is constant regardless of context size or position, so there shouldn't be an issue with the K/V cache.

@grimulkan
Author

grimulkan commented Sep 9, 2023

From my limited understanding, the authors claim that NTK-alpha scaling effectively extrapolates some dimensions, unlike linear scaling, which never does. This, they say, is why it is hard to fine-tune (at best those dimensions probably become a no-op). The modification basically makes sure no dimension is extrapolated, using a piecewise calculation, which the simple exponential angle-scaling equation of NTK doesn't do by default.

Both vary by hidden dimension; NTK-by-parts just uses a different equation.

That said, I haven't checked, but I think Meta's method of using a giant fixed base also effectively avoids the extrapolation, and the authors don't cover that (edit: OK, they apparently do, and claim better scaling by experiment). Like YaRN, CodeLlama also extrapolates well after fine-tuning, and is already supported by exllama.
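
As a back-of-the-envelope check of the giant-base point (CodeLlama uses a RoPE base of 1e6 instead of the usual 1e4; the numbers are purely illustrative and I haven't verified this is exactly the paper's sense of "avoiding extrapolation"): raising the base slows the low-frequency dimensions down, so even over a 16k context they stay within a smaller angle range than they covered at the original length with base 1e4.

```python
import math

def rotations_of_slowest_dim(context_len, head_dim=128, base=10000.0):
    # Slowest dimension: theta = base^(-(d-2)/d); rotations = len * theta / (2*pi)
    slowest_inv_freq = base ** (-(head_dim - 2) / head_dim)
    return context_len * slowest_inv_freq / (2 * math.pi)

for base in (10000.0, 1000000.0):
    print(f"base={base:.0e}: {rotations_of_slowest_dim(16384, base=base):.4f} rotations over 16k")
```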
