I'm still learning about running models locally. Could I ask how you decide which version of each model you will run?
I see different versions like Q5_K_S and Q4_K_M. I understand the main driver is memory when choosing between 7B, 13B, 34B, etc., but how do you decide which quantization is right?
I'm on a 32GB M2 Max MacBook Pro.
If you look at tools like LM Studio, they mark some quantizations as recommended.
They say anything ending in _0, like codellama-13b-instruct.Q4_0.gguf, is a legacy quantization method.
For _K_S they don't give an opinion, but _K_M variants do tend to be shown as recommended.
They also mention that Q2 and Q3 models come with a loss of quality.
So for me the sweet spot seems to be Q4_K_M and Q5_K_M.
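A rough back-of-envelope check for whether a size/quant combination fits is parameters × bits-per-weight ÷ 8, plus some headroom for the KV cache and the OS. Here's a minimal sketch in Python, assuming approximate bits-per-weight figures for the K-quants (actual GGUF file sizes vary a little by model architecture):

```python
# Back-of-envelope sizing: parameters * bits-per-weight / 8 bytes.
# The bits-per-weight values below are approximations for llama.cpp K-quants,
# not exact figures -- real GGUF files vary slightly by architecture.
PARAMS = {"7B": 7e9, "13B": 13e9, "34B": 34e9}
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7}

for size, n_params in PARAMS.items():
    for quant, bpw in BITS_PER_WEIGHT.items():
        gb = n_params * bpw / 8 / 1e9
        print(f"{size} {quant}: ~{gb:.1f} GB of weights (plus KV cache overhead)")
```

On those numbers, even a 34B at Q4_K_M should fit in 32GB of unified memory, while 34B at Q5_K_M starts getting tight once you add context.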
I might try writing some kind of evaluation script to compare those based on quality of response and time taken.
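Roughly the kind of thing I mean, as a minimal sketch using llama-cpp-python (the GGUF paths and prompts are just placeholders, and response quality would still have to be judged by eye or with a separate scoring step):

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder paths -- swap in whichever GGUF files you actually have.
MODELS = {
    "Q4_K_M": "models/codellama-13b-instruct.Q4_K_M.gguf",
    "Q5_K_M": "models/codellama-13b-instruct.Q5_K_M.gguf",
}
PROMPTS = [
    "Write a Python function that reverses a linked list.",
    "Explain the difference between a mutex and a semaphore.",
]

for name, path in MODELS.items():
    # n_gpu_layers=-1 offloads all layers (uses Metal on Apple Silicon).
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    for prompt in PROMPTS:
        start = time.perf_counter()
        out = llm(prompt, max_tokens=256)
        elapsed = time.perf_counter() - start
        tokens = out["usage"]["completion_tokens"]
        print(f"[{name}] {tokens / elapsed:.1f} tok/s over {elapsed:.1f}s")
        print(out["choices"][0]["text"][:200])  # eyeball quality manually
    del llm  # free the model before loading the next quant
```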