A while ago, I requested a feature (GGUF as directory and quantization parameters GUI #7251) to allow a partial requantization of a model without requantizing the same tensors a second time when they are already in the desired ggml_type.
Now that we have a split feature to convert HF weights into a split GGUF, and a keep-split feature to retain the split during quantization, wouldn't it be beneficial to extend this so that the conversion split can create one file per group of tensors (attn_v, attn_q, attn_k, attn_output, ffn_down, ffn_up, ffn_gate, token embeddings, and output.weight)?
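To make the idea concrete, here is a minimal sketch of the bookkeeping such a feature implies: read an existing GGUF with the gguf-py package, bucket its tensors by group name, and report which groups would actually need requantizing to reach a target ggml_type (the rest could be skipped or reused). The file name and the per-group target types below are hypothetical examples, not part of any existing tool.

```python
# Sketch: group the tensors of a GGUF by tensor-group name and report which
# groups would need requantizing to reach a desired ggml_type.
# Assumes the gguf-py package (pip install gguf); file name and target types
# per group are hypothetical.
from collections import defaultdict

from gguf import GGUFReader, GGMLQuantizationType

# Hypothetical target: keep most groups at Q4_K but bump attn_v and ffn_down to Q6_K.
TARGET_TYPES = {
    "attn_v": GGMLQuantizationType.Q6_K,
    "ffn_down": GGMLQuantizationType.Q6_K,
}
DEFAULT_TYPE = GGMLQuantizationType.Q4_K

def group_of(tensor_name: str) -> str:
    # "blk.12.attn_v.weight" -> "attn_v"; "token_embd.weight" -> "token_embd"
    parts = tensor_name.split(".")
    return parts[2] if parts[0] == "blk" else parts[0]

reader = GGUFReader("llama-70b-q4_k_m.gguf")  # hypothetical input file
groups = defaultdict(list)
for tensor in reader.tensors:
    groups[group_of(tensor.name)].append(tensor)

for name, tensors in sorted(groups.items()):
    target = TARGET_TYPES.get(name, DEFAULT_TYPE)
    already_ok = all(t.tensor_type == target for t in tensors)
    action = "skip (already in target type)" if already_ok else f"requantize to {target.name}"
    print(f"{name}: {len(tensors)} tensors -> {action}")
```

If the conversion split wrote one shard per such group, a requant run would presumably only have to rewrite and re-upload the shards whose group is not already in the target type, which is the saving argued for below.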
For quantizing big models like 70B and beyond, this would save a lot of time, compute, and disk space (as well as the bandwidth needed to upload a new GGUF for only the affected tensor groups), and it would not be very invasive for llama.cpp because the split feature already exists.
Moreover, looking at the number of different quantized GGUFs piling up on HF for the same model, often with massive overlap in tensor quants between the adjacent higher and lower quant levels of a given model, it would also be common sense to spare the HF service by optimizing the disk space taken by GGUF quants instead of accumulating massively redundant quantized files.