YaRN Support #272
Comments
I second this a lot. Being able to scale to large contexts while keeping model VRAM use low is the most valuable part; nearly all use cases become more useful as context size increases. However, in exllama I think the positional embeddings are tightly bound to the optimisations for speed and minimal VRAM, which makes it harder to make them dynamic without losing those optimisations. Perhaps they could be broken out to allow for this?
I'm still not sure what "dynamic" positional encodings actually mean, and how you would use them with cached keys.
I am not sure we need them to be dynamic. YaRN works both ways? The static version I described above still computes the positional table once at the start, just like exllama does today, as far as I understand.
By ‘dynamic’, the paper means something that changes the RoPE scaling depending on the actual context size (it only compresses once the context exceeds the original pre-trained size). This is optional. They have this to say about caching under the dynamic implementation:
My crude understanding (each of these can be used with or without the dynamic aspect above): kaiokendev linear: always compress positions by some factor. Works well when fine-tuned.
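A minimal sketch of that static linear interpolation as I understand it, in plain PyTorch (the function and the `linear_scale` parameter are my names, not exllama's API):

```python
import torch

def rope_table(max_seq_len, head_dim, base=10000.0, linear_scale=1.0):
    # Standard RoPE inverse frequencies, one per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # "kaiokendev linear" interpolation: compress every position by a fixed
    # factor, e.g. linear_scale=4.0 maps positions 0..8191 into 0..2047.
    positions = torch.arange(max_seq_len).float() / linear_scale
    angles = torch.outer(positions, inv_freq)
    return angles.cos(), angles.sin()

# Static: built once at load time, like exllama's current table.
cos, sin = rope_table(max_seq_len=8192, head_dim=128, linear_scale=4.0)
```

The dynamic variant would just pick the scale (and rebuild this table) from the current sequence length instead of a fixed load-time value.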
This is the part that doesn't make sense to me. I may just be failing to wrap my head around it, but as far as I can tell this would only work for the first layer. The state that exits the first layer is computed as a function of (among other things) those positional embeddings, and the keys and values in turn are produced from that state. So while you can save the keys/values from layer 2 without the embeddings from layer 2, they will still depend on the embeddings from layer 1. And so on throughout the model.
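To make that concrete, a toy sketch (random weights and my own simplified names, not exllama code) showing that layer-2 keys, projected from layer-1 output, already change when layer-1's positional scaling changes, before any layer-2 RoPE is applied:

```python
import torch

torch.manual_seed(0)
dim, seq = 8, 4

def rope_rotate(x, scale):
    # Minimal RoPE: rotate consecutive dim pairs by position / scale.
    d = x.shape[-1]
    pos = torch.arange(x.shape[0]).float() / scale
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, d, 2).float() / d))
    ang = torch.outer(pos, inv_freq)
    cos, sin = ang.cos(), ang.sin()
    out = torch.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

h = torch.randn(seq, dim)
wq, wk, wv, wk2 = (torch.randn(dim, dim) for _ in range(4))

def layer1_output(scale):
    q, k, v = h @ wq, h @ wk, h @ wv
    q, k = rope_rotate(q, scale), rope_rotate(k, scale)
    attn = torch.softmax(q @ k.T / dim ** 0.5, dim=-1) @ v
    return h + attn  # hidden state that feeds layer 2

# Layer-2 keys, with no layer-2 RoPE applied yet:
k2_a = layer1_output(scale=1.0) @ wk2
k2_b = layer1_output(scale=2.0) @ wk2
print(torch.allclose(k2_a, k2_b))  # False: cached layer-2 K already encodes
                                   # the positional scaling used in layer 1
```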
I see. Does that also mess with methods that change the position embeddings by hidden dimension, like YaRN?
I'm not sure what those do exactly, especially since the default RoPE implementation already adapts to the hidden dimension of the model. But the hidden dimension of the model is constant regardless of context size or position, so there shouldn't be an issue with the K/V cache.
From my limited understanding, the authors claim that NTK-alpha scaling effectively extrapolates some dimensions, unlike linear scaling, which never does. This, they say, is why it is hard to fine-tune (probably at best those dims become a no-op). The modification basically makes sure no dimension is extrapolated, using a piecewise calculation, which the simple exponential angle-scaling equation of NTK doesn't do by default. Both change by hidden dimension; this just uses a different equation. That said, I haven't checked, but Meta's method of using a giant fixed base I think also effectively avoids the extrapolation, and the authors don't cover that (edit: OK, they apparently do, and claim better scaling by experiment). Like YaRN, CodeLlama also extrapolates well after fine-tuning, and is supported by exllama already.
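A rough sketch of that piecewise ("NTK-by-parts") frequency calculation as I read the paper; the `alpha`/`beta` cut-offs and all names here are illustrative, and YaRN's extra attention-temperature factor is left out:

```python
import math
import torch

def by_parts_inv_freq(head_dim, base=10000.0, scale=4.0,
                      orig_ctx=2048, alpha=1.0, beta=32.0):
    # Standard RoPE frequencies, one per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Number of full rotations each dimension completes over the original
    # pre-trained context length (high-frequency dims rotate many times).
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # Ramp: 1 = leave the dimension untouched, 0 = fully interpolate
    # (divide by the scale), linear blend in between. Because the
    # low-frequency dims are always interpolated, nothing is extrapolated.
    gamma = torch.clamp((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return gamma * inv_freq + (1.0 - gamma) * inv_freq / scale
```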
Any thoughts/plans about YaRN support for the positional embeddings?
https://github.com/jquesnelle/yarn
I don't actually see them beat regular linear scaling w/ fine-tuning in the paper, but presumably it extends beyond the fine-tuned context length without breaking and performs better than regular PI for shorter contexts.
I see GPTQ quants from TheBloke for some models trained with YaRN already. Not sure how those are supposed to work without changing the way the positions are calculated.
I don't think what the authors call NTK-by-parts is supported by exllama either (YaRN is just a slight modification), so maybe there's something about this that makes it tricky to integrate?
This is all still static RoPE (set scale, base, alpha whatever at load time).
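For reference, "static" here means everything gets folded into the frequency table once at load time, along these lines (parameter names are illustrative, not exllama's exact options):

```python
import torch

def static_inv_freq(head_dim, base=10000.0, linear_scale=1.0, ntk_alpha=1.0):
    # NTK-alpha scaling folds the extension into a larger effective base.
    if ntk_alpha != 1.0:
        base = base * ntk_alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Linear scaling compresses positions; equivalently, divide the
    # frequencies by the same factor here.
    return inv_freq / linear_scale
```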