Question about SDXL conversion to diffusers in convert_from_ckpt.py #8238

SeungHwa92 · 2024-05-23T08:19:52Z

Hi, I'm trying to port my kohya code (for training SDXL LoRA) to diffusers.
While doing so, I found that the output of text_encoder_2 is different.
I also found that when a kohya model is converted to a diffusers model, one of the weights is transposed.

https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py#L969

Could you please tell me the reason for this?

asomoza · 2024-05-23T15:56:40Z

asomoza
May 23, 2024
Maintainer

How are you comparing the outputs of text_encoder_2?

Also cc @sayakpaul @DN6

5 replies

SeungHwa92 May 24, 2024
Author

I converted a Civitai model to diffusers using the following command:

python convert_original_stable_diffusion_to_diffusers.py \
    --checkpoint_path CIVITAI_CKPT_PATH \
    --dump_path CONVERTED_PATH \
    --from_safetensors

I checked the output of text_encoder_2 using the following code:

strings = ['hello world ! How are you ?']

tokenizers = load_tokenizers()

# load kohya models
load_stable_diffusion_format, kohya_text_encoder1, kohya_text_encoder_2, kohya_vae, kohya_unet, logit_scale, ckpt_info = library.sdxl_train_util._load_target_model(CIVITAI_CKPT_PATH, None, 'v1', torch.float)

# load diffusers models
diffusers_pipeline = diffusers.StableDiffusionXLPipeline.from_pretrained(CONVERTED_PATH)
diffusers_text_encoder_2 = diffusers_pipeline.text_encoder_2

# tokenize
tokenizer_2 = tokenizers[1]
token_ids2 = tokenizer_2(strings, padding="max_length", truncation=True, max_length=77, return_tensors="pt").input_ids

# infer kohya, diffusers text_encoder_2
kohya_output_list = kohya_text_encoder_2(token_ids2, output_hidden_states=True, return_dict=False)
diffusers_output = diffusers_text_encoder_2(token_ids2, output_hidden_states=True, return_dict=False)

print(torch.allclose(kohya_output_list[0], diffusers_output[0]))  # False
print(torch.allclose(kohya_output_list[1], diffusers_output[1]))  # True
for i in range(len(kohya_output_list[2])):
    print(torch.allclose(kohya_output_list[2][i], diffusers_output[2][i]))  # all True

kohya_state_dict = kohya_text_encoder_2.state_dict()
diffusers_state_dict = diffusers_text_encoder_2.state_dict()
for k in kohya_state_dict.keys():
    if not torch.allclose(kohya_state_dict[k], diffusers_state_dict[k]):
        print(f'{k} weight is different !')
        if torch.allclose(kohya_state_dict[k], diffusers_state_dict[k].T):
            print(f'{k} weight is transposed.')

# Output:
# text_projection.weight weight is different !
# text_projection.weight weight is transposed.

I found that text_projection.weight is transposed at https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py#L969

SeungHwa92 May 24, 2024
Author

I would like to ask about another difference between kohya and diffusers.
I was checking whether the values derived from text_encoder_2 (prompt_embeds and pool_prompt_embeds) are the same in kohya and diffusers.
The prompt_embeds are the same, but the pool_prompt_embeds have different values.
I found that pool_prompt_embeds is calculated differently in the two code bases.

The code that calculates pool_prompt_embeds in kohya and in diffusers is below.

The difference starts here:
kohya uses text_encoder_2_output['last_hidden_state'] (i.e. text_encoder_2_output[1]),
but diffusers uses text_encoder_2_output['text_embeds'] (i.e. text_encoder_2_output[0]).

# diffusers prompt embedding function https://github.com/huggingface/diffusers/blob/5cd45c24bf616f09c818455184f3d1c3a3cebe00/examples/dreambooth/train_dreambooth_lora_sdxl.py#L934
def encode_prompt(text_encoders, tokenizers, prompt, text_input_ids_list=None):
    prompt_embeds_list = []

    for i, text_encoder in enumerate(text_encoders):
        if tokenizers is not None:
            tokenizer = tokenizers[i]
            text_input_ids = tokenize_prompt(tokenizer, prompt)
        else:
            assert text_input_ids_list is not None
            text_input_ids = text_input_ids_list[i]

        prompt_embeds = text_encoder(
            text_input_ids.to(text_encoder.device), output_hidden_states=True, return_dict=False
        )

        # We are only ALWAYS interested in the pooled output of the final text encoder
        pooled_prompt_embeds = prompt_embeds[0]
        prompt_embeds = prompt_embeds[-1][-2]
        bs_embed, seq_len, _ = prompt_embeds.shape
        prompt_embeds = prompt_embeds.view(bs_embed, seq_len, -1)
        prompt_embeds_list.append(prompt_embeds)

    prompt_embeds = torch.concat(prompt_embeds_list, dim=-1)
    pooled_prompt_embeds = pooled_prompt_embeds.view(bs_embed, -1)
    return prompt_embeds, pooled_prompt_embeds



# kohya prompt embedding function https://github.com/kohya-ss/sd-scripts/blob/bfb352bc433326a77aca3124248331eb60c49e8c/library/train_util.py#L4505
def get_hidden_states_sdxl(
    max_token_length: int,
    input_ids1: torch.Tensor,
    input_ids2: torch.Tensor,
    tokenizer1: CLIPTokenizer,
    tokenizer2: CLIPTokenizer,
    text_encoder1: CLIPTextModel,
    text_encoder2: CLIPTextModelWithProjection,
    weight_dtype: Optional[str] = None,
    accelerator = None,
):
    # input_ids: b,n,77 -> b*n, 77
    b_size = input_ids1.size()[0]
    input_ids1 = input_ids1.reshape((-1, tokenizer1.model_max_length))  # batch_size*n, 77
    input_ids2 = input_ids2.reshape((-1, tokenizer2.model_max_length))  # batch_size*n, 77

    # text_encoder1
    enc_out = text_encoder1(input_ids1, output_hidden_states=True, return_dict=True)
    hidden_states1 = enc_out["hidden_states"][11]

    # text_encoder2
    enc_out = text_encoder2(input_ids2, output_hidden_states=True, return_dict=True)
    hidden_states2 = enc_out["hidden_states"][-2]  # penultimate layer

    # pool2 = enc_out["text_embeds"]
    unwrapped_text_encoder2 = text_encoder2 if accelerator is None else accelerator.unwrap_model(text_encoder2)
    pool2 = pool_workaround(unwrapped_text_encoder2, enc_out["last_hidden_state"], input_ids2, tokenizer2.eos_token_id)

    # b*n, 77, 768 or 1280 -> b, n*77, 768 or 1280
    n_size = 1 if max_token_length is None else max_token_length // 75
    hidden_states1 = hidden_states1.reshape((b_size, -1, hidden_states1.shape[-1]))
    hidden_states2 = hidden_states2.reshape((b_size, -1, hidden_states2.shape[-1]))

    if max_token_length is not None:
        # bs*3, 77, 768 or 1024
        # encoder1: merge the three <BOS>...<EOS> chunks back into a single <BOS>...<EOS>
        states_list = [hidden_states1[:, 0].unsqueeze(1)]  # <BOS>
        for i in range(1, max_token_length, tokenizer1.model_max_length):
            states_list.append(hidden_states1[:, i : i + tokenizer1.model_max_length - 2])  # from after <BOS> to before <EOS>
        states_list.append(hidden_states1[:, -1].unsqueeze(1))  # <EOS>
        hidden_states1 = torch.cat(states_list, dim=1)

        # v2: merge the three <BOS>...<EOS> <PAD> ... chunks back into a single <BOS>...<EOS> <PAD> ...; honestly not sure this implementation is right
        states_list = [hidden_states2[:, 0].unsqueeze(1)]  # <BOS>
        for i in range(1, max_token_length, tokenizer2.model_max_length):
            chunk = hidden_states2[:, i : i + tokenizer2.model_max_length - 2]  # from after <BOS> to before the last token
            # this causes an error:
            # RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
            # if i > 1:
            #     for j in range(len(chunk)):  # batch_size
            #         if input_ids2[n_index + j * n_size, 1] == tokenizer2.eos_token_id:  # empty, i.e. the <BOS> <EOS> <PAD> ... pattern
            #             chunk[j, 0] = chunk[j, 1]  # copy the value of the next <PAD>
            states_list.append(chunk)  # from after <BOS> to before <EOS>
        states_list.append(hidden_states2[:, -1].unsqueeze(1))  # either <EOS> or <PAD>
        hidden_states2 = torch.cat(states_list, dim=1)

        # for the pooled output, use the first of the n chunks
        pool2 = pool2[::n_size]

    if weight_dtype is not None:
        # this is required for additional network training
        hidden_states1 = hidden_states1.to(weight_dtype)
        hidden_states2 = hidden_states2.to(weight_dtype)

    return hidden_states1, hidden_states2, pool2

# kohya prompt embedding function https://github.com/kohya-ss/sd-scripts/blob/bfb352bc433326a77aca3124248331eb60c49e8c/library/train_util.py#L4462C1-L4502C25
def pool_workaround(
    text_encoder: CLIPTextModelWithProjection, last_hidden_state: torch.Tensor, input_ids: torch.Tensor, eos_token_id: int
):
    r"""
    workaround for CLIP's pooling bug: it returns the hidden states for the max token id as the pooled output
    instead of the hidden states for the EOS token
    If we use Textual Inversion, we need to use the hidden states for the EOS token as the pooled output

    Original code from CLIP's pooling function:

    \# text_embeds.shape = [batch_size, sequence_length, transformer.width]
    \# take features from the eot embedding (eot_token is the highest number in each sequence)
    \# casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
    pooled_output = last_hidden_state[
        torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
        input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
    ]
    """

    # input_ids: b*n,77
    # find index for EOS token

    # Following code is not working if one of the input_ids has multiple EOS tokens (very odd case)
    # eos_token_index = torch.where(input_ids == eos_token_id)[1]
    # eos_token_index = eos_token_index.to(device=last_hidden_state.device)

    # Create a mask where the EOS tokens are
    eos_token_mask = (input_ids == eos_token_id).int()

    # Use argmax to find the first index of the EOS token for each element in the batch (argmax returns the first maximal position)
    eos_token_index = torch.argmax(eos_token_mask, dim=1)  # this will be 0 if there is no EOS token, it's fine
    eos_token_index = eos_token_index.to(device=last_hidden_state.device)

    # get hidden states for EOS token
    pooled_output = last_hidden_state[torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), eos_token_index]

    # apply projection: projection may be of different dtype than last_hidden_state
    pooled_output = text_encoder.text_projection(pooled_output.to(text_encoder.text_projection.weight.dtype))
    pooled_output = pooled_output.to(last_hidden_state.dtype)

    return pooled_output



strings = ['hello world ! How are you ?']

tokenizers = load_tokenizers()

# load kohya models
load_stable_diffusion_format, kohya_text_encoder, kohya_text_encoder_2, kohya_vae, kohya_unet, logit_scale, ckpt_info = library.sdxl_train_util._load_target_model(CIVITAI_CKPT_PATH, None, 'v1', torch.float)

# load diffusers models
diffusers_pipeline = diffusers.StableDiffusionXLPipeline.from_pretrained(CONVERTED_PATH)
diffusers_text_encoder = diffusers_pipeline.text_encoder
diffusers_text_encoder_2 = diffusers_pipeline.text_encoder_2

# Transpose text_projection.weight for same output
diffusers_text_encoder_2.text_projection.weight.data = diffusers_text_encoder_2.text_projection.weight.data.T.contiguous() 

# tokenize
input_ids1 = tokenizers[0](strings, padding="max_length", truncation=True, max_length=77, return_tensors="pt").input_ids
input_ids2 = tokenizers[1](strings, padding="max_length", truncation=True, max_length=77, return_tensors="pt").input_ids


kohya_prompt_embeds1, kohya_prompt_embeds2, kohya_pool_prompt_embeds = get_hidden_states_sdxl(
    max_token_length=77,
    input_ids1=input_ids1,
    input_ids2=input_ids2,
    tokenizer1=tokenizers[0],
    tokenizer2=tokenizers[1],
    text_encoder1=kohya_text_encoder,
    text_encoder2=kohya_text_encoder_2,
    weight_dtype=torch.float,
)
kohya_prompt_embeds = torch.cat([kohya_prompt_embeds1, kohya_prompt_embeds2], dim=2)

diffusers_prompt_embeds, diffusers_pooled_prompt_embeds = encode_prompt(
    text_encoders=[diffusers_text_encoder, diffusers_text_encoder_2],
    tokenizers=tokenizers,
    prompt=strings,
)

print('prompt embeds are same :', torch.allclose(kohya_prompt_embeds, diffusers_prompt_embeds))  # True
print('pool prompt embeds are same :',torch.allclose(kohya_pool_prompt_embeds, diffusers_pooled_prompt_embeds))  # False

                                            
kohya_text_encoder_2_output = kohya_text_encoder_2(input_ids2, output_hidden_states=True, return_dict=True)
diffusers_text_encoder_2_output = diffusers_text_encoder_2(input_ids2, output_hidden_states=True, return_dict=False)

print('check text_encoder_2 outputs are same')
print('text_embeds is index 0 in list:', torch.allclose(kohya_text_encoder_2_output['text_embeds'], diffusers_text_encoder_2_output[0]))  # True
print('last_hidden_state is index 1 in list:', torch.allclose(kohya_text_encoder_2_output['last_hidden_state'], diffusers_text_encoder_2_output[1]))  # True

asomoza May 25, 2024
Maintainer

Thanks for the detailed explanation.

As a first clarification, the sd-scripts repo is focused on experimental features and not so much on maintaining compatibility or sticking to the original specs of the models.

That repo is newer than diffusers, so all the changes you mention were made on top of diffusers and not the other way around. The best person to explain what those changes are for is probably the author of that repo rather than the diffusers team, although I do see some comments explaining them.

As for the first question, it probably doesn't matter whether the weight is transposed as long as the inference code uses it accordingly; it's just a matter of how it's used. Sadly, I don't have the time to read through the whole kohya repo to look for it, but I suggest you also look at the inference code and not just the encoding parts.
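
A minimal sketch of why the layout only matters in combination with how the weight is applied. `torch.nn.Linear` stores its weight as (out_features, in_features) and computes x @ weight.T. Assuming the original checkpoint keeps text_projection as a plain matrix P that is applied as x @ P (an assumption; nothing in this thread confirms it), the weight has to be transposed once when it is loaded into a linear layer. Plain Python lists stand in for tensors so the sketch runs without PyTorch, and `matmul`, `transpose` and `linear` are hypothetical helpers written for this illustration, not diffusers or kohya functions.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def linear(x, weight):
    """Mimic a bias-free linear layer whose weight is stored as (out, in): y = x @ weight.T."""
    return matmul(x, transpose(weight))


# Hypothetical 2 -> 3 projection stored as a raw (in, out) matrix and applied as x @ P.
P = [[1, 2, 3],
     [4, 5, 6]]
x = [[1, -1]]

raw = matmul(x, P)                    # parameter-style usage: x @ P
converted = linear(x, transpose(P))   # weight transposed once at conversion time
assert raw == converted               # same output, different storage layout

# With a square matrix, skipping the transpose raises no shape error,
# but the output silently changes.
P_sq = [[1, 2],
        [3, 4]]
x_sq = [[1, 0]]
right = matmul(x_sq, P_sq)            # intended result: x @ P
wrong = linear(x_sq, P_sq)            # untransposed weight in a linear layer: x @ P.T
assert right != wrong
```

With a square matrix, as in the last lines of the sketch, loading the weight untransposed raises no shape error; only the output changes, which is why comparing outputs (or inspecting the inference code) is the way to tell which convention a loader expects.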

As for the second question, the answer is almost the same. Also, the code you're looking at is a simplified version of the encode_prompt function, used just as a fast test when training, so I suggest you look at the original function. Most of the differences are there to deal with textual inversion, which the function in the LoRA training script doesn't care about.
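
The textual-inversion point can be illustrated with the two pooling rules quoted in the pool_workaround docstring above: the CLIP rule takes the hidden state at the position of the largest token id, while pool_workaround takes it at the position of the EOS token. The following is a minimal sketch with plain Python lists in place of tensors; the token ids (49407 as EOS, 49408 as an added token, 0 as padding) and both helper names are assumptions made for illustration only.

```python
EOS_ID = 49407  # assumed EOS id; in the real code this comes from tokenizer.eos_token_id


def pool_index_argmax(input_ids):
    """Position of the largest token id (first occurrence), as in the quoted CLIP pooling code."""
    return max(range(len(input_ids)), key=lambda i: input_ids[i])


def pool_index_first_eos(input_ids, eos_id):
    """Position of the first EOS token via an EOS mask; returns 0 if there is no EOS."""
    mask = [int(token == eos_id) for token in input_ids]
    return max(range(len(mask)), key=lambda i: mask[i])


# Standard vocabulary only: EOS is the largest id, so both rules agree.
plain = [49406, 320, 1125, EOS_ID, 0, 0]
assert pool_index_argmax(plain) == pool_index_first_eos(plain, EOS_ID) == 3

# Hypothetical added token with an id above EOS (e.g. a textual inversion token):
# the argmax rule now points at the added token instead of EOS.
with_added = [49406, 49408, 1125, EOS_ID, 0, 0]
assert pool_index_argmax(with_added) == 1
assert pool_index_first_eos(with_added, EOS_ID) == 3
```

For a prompt made only of standard vocabulary tokens, both rules in this sketch pick the same position, so the index rule alone would not change the pooled output for such a prompt.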

If you still want explanations about this, I pinged the people who may know, but sadly this takes time and not everyone has time for it. What I can say is that diffusers always stays true to the spec of the model and the papers, most of the time in collaboration with the authors and/or with reviews from them.

SeungHwa92 May 25, 2024
Author

As you mention, this difference may just be an engineering detail.

I was just wondering whether I could get some insight, in case this is a known thing.

I think it's up to me to find out the meaning of these details.

Thanks for your response, even though this was a very long question.

vitrun Oct 22, 2024

Hey @SeungHwa92, I noticed the same thing and was wondering about it too. Did you manage to figure it out?
