extract_lora.py improvements and fixes #333
Conversation
I also removed the

One thing that might work and be acceptable for archival purposes:

```python
if base_type != "to_save" and finetuned_size[0] > base_size[0]:
    assert base_size[1] == finetuned_size[1], f"Column dimension mismatch in layer '{base_name}': {base_size} != {finetuned_size}"
    base_type = "to_save"
else:
    assert base_size == finetuned_size, f"Dimension mismatch in layer '{base_name}': {base_size} != {finetuned_size}"

module_details.append((base_type, base_name))
```

This would then just save the tensors in full in the LoRA in the same way as the norms are saved. This sidesteps all the hassle with the

I also thought of a way this could be improved:

```python
# NOTE: Should check only embedding/output types, but no clear way to test for "output" type...
if base_type != "to_save" and finetuned_size[0] > base_size[0]:
```

since it's possible to accurately detect an embedding layer without looking at the name:

```python
if isinstance(module, torch.nn.Embedding):
    module_details.append(("embedding", name, module.weight.size()))
```

and AFAIK, the output tensors are always the same dimensions. So after the loop in

Alternatively this could use the logic you have in
LoRD author and feature owner here. Thank you @jukofyork, this is great! Couple of things I'm considering:

If we can get those changes in I think this should be good to merge.
Hey! Yeah, I did actually have the code all working to save the

I still found it a bit odd that these extra (n_new_tokens, vocab_size) dense tensors were also in the sparse embeddings - I would have expected the sparse embeddings to be the base model's dimensions and then these dense bits extra, rather than having both. There weren't really many examples of these files either, so I couldn't really see what was in the sparse embeddings for these positions.

I'm still none the wiser why the embeddings have a different name and are transposed either - all I can assume is that this is for other types of non-language models that might have different embeddings, or possibly something to do with the way they are indexed into instead of multiplied?
Feel free to make or propose any changes you see fit :)
I forgot to add: I think to make this really bulletproof, the best option would be to try creating some real LoRAs for the especially gnarly cases like
Just saw this on reddit:

It would be pretty cool if we could get this to the stage of being useful for archival purposes - I have 3x 4TB external drives that I keep squirrelling away models on!

I did notice when the code was still in to cut off based on the fraction of variance explained, it was very clear which models had been created by applying LoRAs and even which tensors had been targeted by the LoRAs.

If we assume these are going to eventually be quantized, then we can use the same MSE criterion used for quantization:

https://en.m.wikipedia.org/wiki/Rate%E2%80%93distortion_theory

for: Q =

and then use a rank large enough so that the maximum error in Q is around 1/2 the expected maximum quantization error of, say,

https://en.m.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem

There are several LLM quantization papers that have concluded that outlier weights/activations are very important, so minimising the maximum MSE error is definitely the criterion to use.

It would take some experimentation to figure out exactly the link between (row-wise or block-wise) quantization noise and (tensor-wise) SVD noise in terms of maximum MSE, but I'm pretty sure even with the threshold set well below the Nyquist rate for safety, the storage requirements would be an order of magnitude less for full-finetuned models and several orders of magnitude less for LoRA-trained models.
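To make that concrete, here is a rough sketch of picking the rank from a maximum-error budget instead of a fixed value (the function and threshold are illustrative, not part of this PR):

```python
import torch


def choose_rank_by_max_error(delta: torch.Tensor, max_error: float, max_rank: int) -> int:
    # Pick the smallest rank whose truncated SVD keeps the maximum absolute
    # reconstruction error of the delta below max_error (e.g. ~1/2 of the
    # expected quantization step).  Naive O(rank) scan, for clarity only.
    U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
    for rank in range(1, min(max_rank, S.numel()) + 1):
        approx = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
        if (delta.float() - approx).abs().max().item() <= max_error:
            return rank
    return max_rank
```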
@jukofyork can you add me to your repo? I have some changes I would like to push before closing this PR
Done.
Just having a look through the changes and noticed this:

```python
if module == pretrained_model.get_input_embeddings():
    # if isinstance(module, torch.nn.Embedding):
    module_details.append(("embedding", name, module.weight.size()))
elif module == pretrained_model.get_output_embeddings():
    # if isinstance(module, torch.nn.Embedding):
    module_details.append(("output", name, module.weight.size()))
```

Might be worth testing this on a model where there is no separate output, e.g.:

https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/model.safetensors.index.json

IIRC, it just uses the transpose of the `model.embed_tokens.weight` matrix instead.
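For the tied-embeddings case, a guard along these lines might work (a sketch only; `tie_word_embeddings` is the usual transformers config flag, but whether `get_output_embeddings()` returns the tied module varies by architecture):

```python
output_module = pretrained_model.get_output_embeddings()
weights_tied = (
    output_module is None
    or getattr(pretrained_model.config, "tie_word_embeddings", False)
)

if module == pretrained_model.get_input_embeddings():
    module_details.append(("embedding", name, module.weight.size()))
elif not weights_tied and module == output_module:
    # Only record a separate "output" entry when the lm_head really has its
    # own weights (i.e. not Command-R style tied embeddings).
    module_details.append(("output", name, module.weight.size()))
```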
It's been a couple of months since I last looked at this, but this is the code I had originally to calculate and save the "new_embeddings.safetensors" file for when the vocab gets extended:

```python
import json
import os

from safetensors.torch import save_file
from torch.nn.functional import pad
from tqdm import tqdm

lora_weights = {}
new_embedding_weights = {}
rank_pattern = {}

# Embedding weights
for layer_name in tqdm(embedding_names):
    base_weight = base_loader.get_tensor(f"{layer_name}.weight")
    finetuned_weight = finetuned_loader.get_tensor(f"{layer_name}.weight")

    # Zero-pad the base embeddings when the finetuned model has extra tokens,
    # and keep the extra rows to save densely in "new_embeddings.safetensors".
    padding_size = finetuned_weight.shape[0] - base_weight.shape[0]
    if padding_size > 0:
        base_weight = pad(base_weight, (0, 0, 0, padding_size), "constant", 0)
        #finetuned_weight = finetuned_weight[:base_weight.shape[0]]
        extra_weight = finetuned_weight[-padding_size:, :]
        new_embedding_weights["input_embeddings"] = extra_weight
        print(
            f"[P] input_dims=({base_weight.shape}), "
            f"output_dims=({finetuned_weight.shape}), "
            f"padded_dims=({extra_weight.shape})"
        )

    #padding_size = new_weight.shape[0] - base_weight.shape[0]
    #if padding_size > 0:
    #    new_weight = new_weight[:base_weight.shape[0], :base_weight.shape[1]]
    #    new_weight = new_weight[:base_weight.shape[0]] ???

    lora_embedding_A, lora_embedding_B = decompose_delta_weight(
        finetuned_weight.T, base_weight.T, desired_rank, min_fraction_variance, device=device
    )
    lora_weights[f"base_model.model.{layer_name}.lora_embedding_A"] = lora_embedding_A.to(
        "cpu"
    ).contiguous()
    lora_weights[f"base_model.model.{layer_name}.lora_embedding_B"] = lora_embedding_B.to(
        "cpu"
    ).contiguous()

    # Add the final rank to the rank_pattern dictionary only if it is different from the desired_rank
    final_rank = lora_embedding_A.shape[0]
    if final_rank != desired_rank:
        rank_pattern[layer_name] = final_rank

    print(
        f"[E] {layer_name}: final_rank={final_rank}, "
        f"input_dims=({base_weight.shape}), "
        f"output_dims=({lora_embedding_A.shape}, {lora_embedding_B.shape})"
    )

# ... (rest of the script elided) ...

with open(os.path.join(out_path, "adapter_config.json"), "w") as f:
    json.dump(lora_config, f, indent=2)

save_file(lora_weights, os.path.join(out_path, "adapter_model.safetensors"))

if new_embedding_weights:
    save_file(new_embedding_weights, os.path.join(out_path, "new_embeddings.safetensors"))
```

but then when I saw

I also never really found out why the extra weights of the vocab seem to get added to both the low-rank weights and the dense weights in the "new_embeddings.safetensors" file (from examination of the few real LoRAs on huggingface that have this file), and as a result I wasn't 100% sure how they should be set (my best guess was to set the low-rank part to whatever was found via truncated SVD, then find the deltas of the truncated low-rank weights vs the actual original weights, and then use these deltas in the dense "new_embeddings.safetensors" file, but I never got as far as that IIRC).
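For reference, that best guess would look roughly like this, assuming `decompose_delta_weight()` returns factors with `delta.T ≈ B @ A` (matching the loop above) and that `padding_size` new tokens were appended; untested:

```python
# Reconstruct the truncated-SVD approximation of the embedding delta and put
# the residual of the newly added rows into new_embeddings.safetensors.
approx_delta = (lora_embedding_B @ lora_embedding_A).T        # back to (vocab, hidden)
approx_weight = base_weight + approx_delta.to(base_weight.dtype)
residual = finetuned_weight[-padding_size:, :] - approx_weight[-padding_size:, :]
new_embedding_weights["input_embeddings"] = residual.to("cpu").contiguous()
```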
So when finetuning extra vocab into a model and including "embed_tokens" in the LoRA, I'm not sure about

Regarding vocab order, I think in the case of a model like Mistral v0.3 where tokens are prepended, the SVD would simply produce an approximation of the transformation to get from the original (extended) embeddings to the finetuned ones, disregarding token order and simply operating on weight values. I have a feeling this is fine given a high enough rank.

A more thorough solution would be to reorder the rows of the base embeddings matrix before running SVD so the token order between the finetuned model and the base is identical. But Mistral is an edge case in that regard and I'm not sure it's worth the extra complexity, as the current implementation works fine for non-esoteric cases.
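For completeness, a rough sketch of what that reordering could look like, assuming both tokenizers are available (purely illustrative, not proposing to add it):

```python
import torch


def reorder_base_embedding_rows(base_weight, base_tokenizer, finetuned_tokenizer):
    # Permute the base embedding rows into the finetuned model's token order
    # before computing the delta, so the SVD compares matching tokens.  Tokens
    # that only exist in the finetuned vocab keep a zero row here (they would
    # be handled by the dense new-embeddings path instead).
    base_vocab = base_tokenizer.get_vocab()  # token -> base row index
    finetuned_vocab = finetuned_tokenizer.get_vocab()
    reordered = torch.zeros(
        len(finetuned_vocab), base_weight.shape[1], dtype=base_weight.dtype
    )
    for token, finetuned_index in finetuned_vocab.items():
        base_index = base_vocab.get(token)
        if base_index is not None and base_index < base_weight.shape[0]:
            reordered[finetuned_index] = base_weight[base_index]
    return reordered
```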
@jukofyork I'm merging this and closing the PR. If you have more thoughts / code related to archival and compression feel free to open up another issue or PR. Thank you for your contribution!
Thanks - I think the next thing I was going to look at is trying to set/infer the rank automatically.
Changes made:

- `validate_and_combine_details()`, etc.
- `low_rank_decomposition()` now distributes `sqrt(diag(S))` between both `L` and `R` to improve numerical precision (see the sketch after this list).
- `max_rank` parameter, and reduces the rank down as necessary per-tensor and then puts any with rank less than the maximum in `rank_pattern` and `alpha_pattern` as needed.
- `lora_embedding_A` and `lora_embedding_B` tensors.
- `--skip-undecomposable` to turn off).
- `transformers.pytorch_utils.Conv1D` as a linear type that can be decomposed (if any LLMs use them?).
- `finetuned_size[0] > base_size[0]`, indicating extra tokens have been added (this may or may not work depending on the model and desired use for the LoRA). It would be best to limit this to just the embed and output tensors, but I can't see any reliable way to test for output tensors other than by name...
- `--verbose` option to print debugging information about tensor sizes, module types/names, etc.

See #328 for more details.
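A minimal sketch of the `sqrt(diag(S))` split mentioned in the list above (the function name matches the script, but the body here is illustrative rather than the exact implementation):

```python
import torch


def low_rank_decomposition(delta: torch.Tensor, rank: int):
    # Truncated SVD of the weight delta, with the square root of the singular
    # values folded into both factors so neither side carries all the scale.
    U, S, Vh = torch.linalg.svd(delta.float(), full_matrices=False)
    sqrt_s = torch.sqrt(S[:rank])
    L = U[:, :rank] * sqrt_s                 # shape (out_features, rank)
    R = sqrt_s.unsqueeze(1) * Vh[:rank, :]   # shape (rank, in_features)
    return L, R                              # delta ≈ L @ R
```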
I have successfully used this to take and reapply a full-rank LoRA of `Mistral-7B-Instruct-v0.2`:

The `cp ./Mistral-7B-Instruct-v0.2/config.json ./Mistral-7B-Instruct-v0.2/tokenizer_config.json ./mistral-test` step was just to make sure the correct `"rope_theta": 1000000.0` setting was used and the correct "[INST]" prompt template for a fair test...

lora-cache mistral-test.yaml:

and it produced a few hundred identical tokens before very slightly diverging due to either the SVD or the `bfloat16` -> `float16` conversion (see here).

I haven't tested the changes much beyond this, but since `get_model_details()` mainly tests against `torch.nn.Embedding`, `torch.nn.Linear`, etc., it should be fairly robust against tensor name-changes in other models.

Feel free to make any changes you think are needed and hope it helps :)