A while ago, I requested a feature (GGUF as directory and quantization parameters GUI #7251) to allow a partial requantization of a model without requantizing the same tensors a second time when they are already in the desired ggml_type.
Now that we have a split feature to convert HF weights into a split GGUF, and a keep-split feature to retain the split during quantization, wouldn't it be beneficial to extend this so that the conversion split can create one file per group of tensors (attn_v, attn_q, attn_k, attn_output, ffn_down, ffn_up, ffn_gate, token embeddings, and output.weight)?
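To make the idea concrete, here is a minimal sketch of the bookkeeping such a feature implies: read an existing GGUF with the gguf-py package, bucket its tensors by group name, and report which groups would actually need requantizing to reach a target ggml_type (the rest could be skipped or reused). The file name and the per-group target types below are hypothetical examples, not part of any existing tool.

```python
# Sketch: group the tensors of a GGUF by tensor-group name and report which
# groups would need requantizing to reach a desired ggml_type.
# Assumes the gguf-py package (pip install gguf); file name and target types
# per group are hypothetical.
from collections import defaultdict

from gguf import GGUFReader, GGMLQuantizationType

# Hypothetical target: keep most groups at Q4_K but bump attn_v and ffn_down to Q6_K.
TARGET_TYPES = {
    "attn_v": GGMLQuantizationType.Q6_K,
    "ffn_down": GGMLQuantizationType.Q6_K,
}
DEFAULT_TYPE = GGMLQuantizationType.Q4_K

def group_of(tensor_name: str) -> str:
    # "blk.12.attn_v.weight" -> "attn_v"; "token_embd.weight" -> "token_embd"
    parts = tensor_name.split(".")
    return parts[2] if parts[0] == "blk" else parts[0]

reader = GGUFReader("llama-70b-q4_k_m.gguf")  # hypothetical input file
groups = defaultdict(list)
for tensor in reader.tensors:
    groups[group_of(tensor.name)].append(tensor)

for name, tensors in sorted(groups.items()):
    target = TARGET_TYPES.get(name, DEFAULT_TYPE)
    already_ok = all(t.tensor_type == target for t in tensors)
    action = "skip (already in target type)" if already_ok else f"requantize to {target.name}"
    print(f"{name}: {len(tensors)} tensors -> {action}")
```

If the conversion split wrote one shard per such group, a requant run would presumably only have to rewrite and re-upload the shards whose group is not already in the target type, which is the saving argued for below.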
For quantizing big models like 70B and beyond, this would save a lot of time, compute, and disk space (as well as the bandwidth needed to upload a new GGUF for only the affected tensor groups), and it would not be very invasive for llama.cpp because the split feature already exists.
Moreover, looking at the number of different quantized GGUFs piling up on HF for the same model, often with massive overlap in tensor quants between the adjacent higher and lower quant levels of a given model, it would also be common sense to spare the HF service by optimizing the disk space taken by GGUF quants instead of accumulating massively redundant quantized files.