The reason it does not work is somewhat involved. Block masks could in principle work with gemma 2; nothing specific prevents them from working. The problem comes from flex attention, which I did not manage to implement for gemma 2, so gemma 2 is not using flex attention at the moment.
However, the way torchtune automatically checks whether flex attention is available, combined with the fact that packed datasets automatically create block masks specific to flex attention, creates an incompatibility for now (this incompatibility does not exist if your version of torch does not include flex attention).
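To make the clash concrete, here is a minimal sketch, not torchtune's actual code, of the pattern described above: the availability check is a simple import probe, and the packed sample returns a flex-attention `BlockMask` whenever that probe succeeds, even though a gemma 2 style attention layer can only consume a dense mask.

```python
# Minimal sketch (not torchtune's actual code) of the incompatibility: when
# flex attention is importable, the packed sample hands back a BlockMask, but
# a model whose attention layers only call SDPA with a dense mask (gemma 2
# here) cannot consume it.
import torch

try:
    from torch.nn.attention.flex_attention import create_block_mask
    _SUPPORTS_FLEX = True
except ImportError:
    _SUPPORTS_FLEX = False


def packed_mask(doc_ids: torch.Tensor):
    """Build the attention mask for one packed sample.

    doc_ids[i] is the document index of token i; a token may only attend to
    earlier tokens of the same document.
    """
    seq_len = doc_ids.shape[0]
    if _SUPPORTS_FLEX:
        # Flex-attention path: build a BlockMask from a mask_mod closure.
        def mask_mod(b, h, q_idx, kv_idx):
            return (doc_ids[q_idx] == doc_ids[kv_idx]) & (q_idx >= kv_idx)

        return create_block_mask(
            mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len,
            device=str(doc_ids.device),
        )
    # Fallback: a dense boolean mask usable by F.scaled_dot_product_attention.
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=doc_ids.device)
    )
    return same_doc & causal
```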
Here are a few directions I see to fix this problem:
- Make flex attention usage a parameter in each recipe and disable it for gemma 2 for now (a sketch follows this list). I feel that hiding from users which attention implementation will actually run is not ideal.
- Implement a working flex attention path for gemma 2, but this would probably have to wait for PyTorch to resolve the underlying issue.
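For the first option, here is a hedged sketch of what an explicit backend switch could look like; the `use_flex_attention` flag, the `AttentionSettings` dataclass, and the model-name check are hypothetical, not existing torchtune parameters:

```python
# Illustrative only: a recipe-level switch for the attention backend, so the
# choice is visible in the config rather than silently auto-detected.
from dataclasses import dataclass


@dataclass
class AttentionSettings:
    use_flex_attention: bool = True  # hypothetical recipe/config flag


def resolve_attention_backend(settings: AttentionSettings, model_name: str) -> str:
    """Return the attention backend the recipe should use for this model."""
    if settings.use_flex_attention and not model_name.startswith("gemma2"):
        return "flex_attention"
    # gemma 2 has no working flex attention path yet, so force SDPA for it.
    return "sdpa"


# resolve_attention_backend(AttentionSettings(), "gemma2_2b")  -> "sdpa"
```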
On a related topic, did you manage to fine-tune gemma 2 successfully? I have been running the same (custom) pretraining script with both gemma 2 2B and llama 3.2 3B; it yields good generations with llama 3.2 and catastrophic ones with gemma 2...
When running gemma2 with packed=True, I got the error below. The error message should be more informative.
Also, if gemma2 is not compatible with packed=True, we should fix it and/or remove the option from the config.
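One possible stopgap along those lines (purely illustrative, not current torchtune behavior) is a fail-fast check so the problem surfaces at config validation time with a readable message:

```python
# Hypothetical guard: reject packed=True for gemma 2 models up front instead
# of letting a flex-attention BlockMask error surface deep in the forward pass.
def validate_packed_support(model_name: str, packed: bool) -> None:
    if packed and model_name.startswith("gemma2"):
        raise ValueError(
            f"packed=True is not supported for {model_name} because it relies "
            "on flex attention; set packed=False in the dataset config."
        )
```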