
SDPA not implemented error #9

Closed · SpoSer23 opened this issue Nov 10, 2024 · 9 comments
@SpoSer23

I'm trying to apply the LCKV caching technique to a Qwen2 LLM, and I noticed that you haven't implemented SDPA through torch. So when loading the model I tried attn_implementation="eager", but it still doesn't work: when I inherit from the model, it uses its own config, which reports that SDPA support is True.

Do you have any idea how to solve this problem?
This is the error log from the console:

ValueError Traceback (most recent call last)
Cell In[83], line 34
32 model_class, tokenizer_class = MODEL_CLASSES[model_type]
33 tokenizer = tokenizer_class.from_pretrained(model_name_or_path)
---> 34 model = model_class.from_pretrained(model_name_or_path, attn_implementation="eager")
35 model.eval() # Set model to evaluation mode
37 # Prepare Input

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:3886, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3880 config = cls._autoset_attn_implementation(
3881 config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map
3882 )
3884 with ContextManagers(init_contexts):
3885 # Let's make sure we don't run the init function of buffer modules
-> 3886 model = cls(config, *model_args, **model_kwargs)
3888 # make sure we use the model's config since the init call might have copied it
3889 config = model.config

Cell In[75], line 4, in LCKVQwen2ForCausalLM.__init__(self, config)
2 def __init__(self, config):
3 Qwen2ForCausalLM.__init__(self, config)
----> 4 self.model = LCKVQwen2Model(config)
6 # Initialize weights and apply final processing
7 self.post_init()

Cell In[74], line 3, in LCKVQwen2Model.__init__(self, config)
2 def __init__(self, config: LCKVQwen2Config):
----> 3 Qwen2Model.__init__(self, config)
4 self.layers = nn.ModuleList([LCKVQwen2DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)])
5 self.parser = LayerTypeParser(config.layer_types)

File /opt/conda/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:864, in Qwen2Model.__init__(self, config)
863 def __init__(self, config: Qwen2Config):
--> 864 super().__init__(config)
865 self.padding_idx = config.pad_token_id
866 self.vocab_size = config.vocab_size

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:1404, in PreTrainedModel.__init__(self, config, *inputs, **kwargs)
1398 raise ValueError(
1399 f"Parameter config in {self.__class__.__name__}(config) should be an instance of class "
1400 "PretrainedConfig. To create a model from a pretrained model use "
1401 f"model = {self.__class__.__name__}.from_pretrained(PRETRAINED_MODEL_NAME)"
1402 )
1403 # Save config and origin of the pretrained weights if given in model
-> 1404 config = self._autoset_attn_implementation(
1405 config, torch_dtype=torch.get_default_dtype(), check_device_map=False
1406 )
1407 self.config = config
1409 self.name_or_path = config.name_or_path

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:1581, in PreTrainedModel._autoset_attn_implementation(cls, config, use_flash_attention_2, torch_dtype, device_map, check_device_map)
1572 cls._check_and_enable_flash_attn_2(
1573 config,
1574 torch_dtype=torch_dtype,
(...)
1577 check_device_map=check_device_map,
1578 )
1579 elif requested_attn_implementation in [None, "sdpa"] and not is_torch_xla_available():
1580 # use_flash_attention_2 takes priority over SDPA, hence SDPA treated in this elif.
-> 1581 config = cls._check_and_enable_sdpa(
1582 config,
1583 hard_check_only=False if requested_attn_implementation is None else True,
1584 )
1586 if (
1587 torch.version.hip is not None
1588 and config._attn_implementation == "sdpa"
1589 and torch.cuda.device_count() > 1
1590 ):
1591 logger.warning_once(
1592 "Using the SDPA attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends."
1593 )

File /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:1776, in PreTrainedModel._check_and_enable_sdpa(cls, config, hard_check_only)
1774 if hard_check_only:
1775 if not cls._supports_sdpa:
-> 1776 raise ValueError(
1777 f"{cls.name} does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet."
1778 " Please request the support for this architecture: huggingface/transformers#28005. If you believe"
1779 ' this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument attn_implementation="eager" meanwhile. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")'
1780 )
1781 if not is_torch_sdpa_available():
1782 raise ImportError(
1783 "PyTorch SDPA requirements in Transformers are not met. Please install torch>=2.1.1."
1784 )

ValueError: LCKVQwen2Model does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: huggingface/transformers#28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument attn_implementation="eager" meanwhile. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")

why-in-Shanghaitech self-assigned this Nov 11, 2024
@why-in-Shanghaitech (Member)

Hi! Thank you for your question. I'm a bit confused, since this problem does not exist in Llama, which also supports SDPA...

The model loading seems correct, and the first call to _autoset_attn_implementation sets the config correctly. But the internal __init__ call still sees sdpa in config._attn_implementation_internal.

All I could come up with are two ad-hoc solutions:
Option 1: Force the config to use the eager attention implementation in cell [74]:

def __init__(self, config: LCKVQwen2Config):
+    config._attn_implementation = "eager"
    Qwen2Model.__init__(self, config)
    self.layers = nn.ModuleList([LCKVQwen2DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)])
    self.parser = LayerTypeParser(config.layer_types)

Option 2: Cheat by using the eager implementation but registering it as sdpa. Modify models/modeling_lckv.py (or the corresponding place in your code):

LCKV_LLAMA_ATTENTION_CLASSES = {
    "eager": LCKVLlamaAttention,
    "flash_attention_2": LCKVLlamaFlashAttention2,
+    "sdpa": LCKVLlamaAttention,
}
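For the Qwen2 port in your notebook, the analogous mapping would look roughly like this (just a sketch; I'm guessing the class names LCKVQwen2Attention / LCKVQwen2FlashAttention2, so adjust them to whatever you actually defined):

LCKV_QWEN2_ATTENTION_CLASSES = {
    "eager": LCKVQwen2Attention,
    "flash_attention_2": LCKVQwen2FlashAttention2,
+    "sdpa": LCKVQwen2Attention,
}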

I hope this works. To find the exact cause, I may need to see more of the code in your notebook.

Anyway, I will put an LCKV SDPA implementation on the agenda.

@SpoSer23 (Author)

I found the exact cause of the problem: I had forgotten to delete the model already loaded in the kernel. Resetting the kernel and rerunning fixed that, but now I face another problem: the custom config is read but not applied, and the parent class of the custom config is used instead. Can you tell me what to do in this situation?

P.S.: I tried making the custom config inherit from PretrainedConfig, but I still get the same error.

The error looks like this:

AttributeError                            Traceback (most recent call last)
Cell In[15], line 34
     32 model_class, tokenizer_class = MODEL_CLASSES[model_type]
     33 tokenizer = tokenizer_class.from_pretrained(model_name_or_path)
---> 34 model = model_class.from_pretrained(model_name_or_path, attn_implementation="eager")
     35 model.eval()  # Set model to evaluation mode
     37 # Prepare Input

File c:\Users\kareem\anaconda3\envs\ml_env\lib\site-packages\transformers\modeling_utils.py:3832, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3826 config = cls._autoset_attn_implementation(
   3827     config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map
   3828 )
   3830 with ContextManagers(init_contexts):
   3831     # Let's make sure we don't run the init function of buffer modules
-> 3832     model = cls(config, *model_args, **model_kwargs)
   3834 # make sure we use the model's config since the __init__ call might have copied it
   3835 config = model.config

Cell In[14], line 4
      2 def __init__(self, config: LCKVQwen2Config):
      3     Qwen2ForCausalLM.__init__(self, config)
----> 4     self.model = LCKVQwen2Model(config)
      6     # Initialize weights and apply final processing
      7     self.post_init()

Cell In[13], line 4
      2 def __init__(self, config: LCKVQwen2Config):
      3     Qwen2Model.__init__(self, config)
----> 4     self.layers = nn.ModuleList([LCKVQwen2DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)])
      5     self.parser = LayerTypeParser(config.layer_types)
      7     # Initialize weights and apply final processing

Cell In[13], line 4
      2 def __init__(self, config: LCKVQwen2Config):
      3     Qwen2Model.__init__(self, config)
----> 4     self.layers = nn.ModuleList([LCKVQwen2DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)])
      5     self.parser = LayerTypeParser(config.layer_types)
      7     # Initialize weights and apply final processing

Cell In[11], line 8
      6 def __init__(self, config: LCKVQwen2Config, layer_idx: int):
      7     super().__init__(config, layer_idx)
----> 8     self.self_attn = LCKV_QWEN2_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)

Cell In[10], line 7
      5 def __init__(self, config: LCKVQwen2Config, layer_idx: Optional[int] = None):
      6     super().__init__(config, layer_idx)
----> 7     self.layer_type = LayerTypeParser(config.layer_types)[layer_idx]
      8     self.sliding_window = config.sliding_window if self.layer_type.use_sliding_window else None
     10     # Some layers may not need to compute key-value pairs

File c:\Users\kareem\anaconda3\envs\ml_env\lib\site-packages\transformers\configuration_utils.py:264, in PretrainedConfig.__getattribute__(self, key)
    262 if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
    263     key = super().__getattribute__("attribute_map")[key]
--> 264 return super().__getattribute__(key)

AttributeError: 'Qwen2Config' object has no attribute 'layer_types'

@why-in-Shanghaitech (Member)

Thanks for the reply! LCKV did this by adding a config_class attribute to all the LCKV classes:

class LCKVLlamaPreTrainedModel(LlamaPreTrainedModel):
    config_class = LCKVLlamaConfig

You may also do this:

model_class, tokenizer_class = MODEL_CLASSES[model_type]
tokenizer = tokenizer_class.from_pretrained(model_name_or_path)

+ config = LCKVQwen2Config.from_pretrained(model_name_or_path)
+ # do some configurations...

- model = model_class.from_pretrained(model_name_or_path, attn_implementation="eager")
+ model = model_class.from_pretrained(model_name_or_path, config=config, attn_implementation="eager")
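Concretely, for your Qwen2 classes that would be something along these lines (a sketch; keep whatever class bodies you already have):

class LCKVQwen2Model(Qwen2Model):
    # tell from_pretrained to parse the checkpoint config as LCKVQwen2Config
    # instead of the plain Qwen2Config
    config_class = LCKVQwen2Config
    ...

class LCKVQwen2ForCausalLM(Qwen2ForCausalLM):
    config_class = LCKVQwen2Config
    ...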

@SpoSer23 (Author)

Thanks for your help, I'm grateful. It solved the config attribute access problems. Now I have a question about prepare_inputs_for_generation in the CausalLM class: it returns a NoneType object, so I added a return statement for model_inputs, but then the model responds with random gibberish.

I hope you can help me with this problem.
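Roughly, my override now looks like this (sketching from memory, not the exact notebook code):

def prepare_inputs_for_generation(self, input_ids, **kwargs):
    # build the inputs via the parent implementation and, crucially, return them;
    # without the return statement the method yields None, which causes the error below
    model_inputs = Qwen2ForCausalLM.prepare_inputs_for_generation(self, input_ids, **kwargs)
    return model_inputs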

If you want to see the whole code and check for yourself, I can give you access to the notebook in a reply by email.

AttributeError                            Traceback (most recent call last)
Cell In[19], line 43
     40 input_ids = input_ids.to(model.device)
     42 # Generate Text
---> 43 output_sequences = model.generate(
     44     input_ids=input_ids,
     45     max_length=length + input_ids.shape[-1],
     46     temperature=temperature,
     47     top_k=k,
     48     top_p=p,
     49     max_new_tokens = 256,
     50     repetition_penalty=repetition_penalty,
     51     do_sample=True,
     52     num_return_sequences=num_return_sequences,
     53     use_cache=True
     54 )
     56 # Display Generated Texts
     57 generated_texts = []

File ~\anaconda3\Lib\site-packages\torch\utils\_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\transformers\generation\utils.py:2047, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2039     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2040         input_ids=input_ids,
   2041         expand_size=generation_config.num_return_sequences,
   2042         is_encoder_decoder=self.config.is_encoder_decoder,
   2043         **model_kwargs,
   2044     )
   2046     # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2047     result = self._sample(
   2048         input_ids,
   2049         logits_processor=prepared_logits_processor,
   2050         stopping_criteria=prepared_stopping_criteria,
   2051         generation_config=generation_config,
   2052         synced_gpus=synced_gpus,
   2053         streamer=streamer,
   2054         **model_kwargs,
   2055     )
   2057 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2058     # 11. prepare beam search scorer
   2059     beam_scorer = BeamSearchScorer(
   2060         batch_size=batch_size,
   2061         num_beams=generation_config.num_beams,
   (...)
   2066         max_length=generation_config.max_length,
   2067     )

File ~\anaconda3\Lib\site-packages\transformers\generation\utils.py:3003, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   3000 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   3002 # prepare variable output controls (note: some models won't accept all output controls)
-> 3003 model_inputs.update({"output_attentions": output_attentions} if output_attentions else {})
   3004 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   3006 # forward pass to get next token

AttributeError: 'NoneType' object has no attribute 'update'

@Mostafa-Emad77

Hi! I solved this CausalLM error by returning the model_inputs and setting the layer types config default to None, and it works fine. However, I then ran into a RuntimeError caused by a tensor size mismatch when the causal mask is added to the attention weights. I was wondering if you could help me solve this problem here.

This is the layer types config I tried:

layer_types: str = "0_1_2_3_4_5_6_7_23_23_23_23_23_23_23_23_16_17_18_19_20_21_22_23"
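For context, I set it roughly along the lines of the from_pretrained suggestion above (a sketch; in the notebook it may just be the default value on LCKVQwen2Config):

config = LCKVQwen2Config.from_pretrained(model_name_or_path)
config.layer_types = "0_1_2_3_4_5_6_7_23_23_23_23_23_23_23_23_16_17_18_19_20_21_22_23"
model = model_class.from_pretrained(model_name_or_path, config=config, attn_implementation="eager")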

This is the RuntimeError I receive after changing the layer types config:

RuntimeError                              Traceback (most recent call last)
Cell In[18], line 43
     40 input_ids = input_ids.to(model.device)
     42 # Generate Text
---> 43 output_sequences = model.generate(
     44     input_ids=input_ids,
     45     max_length=length + input_ids.shape[-1],
     46     temperature=temperature,
     47     top_k=k,
     48     top_p=p,
     49     max_new_tokens = 2048,
     50     repetition_penalty=repetition_penalty,
     51     do_sample=True,
     52     num_return_sequences=num_return_sequences,
     53     use_cache=True
     54 )
     56 # Display Generated Texts
     57 generated_texts = []

File ~\anaconda3\Lib\site-packages\torch\utils\_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\transformers\generation\utils.py:2047, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2039     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2040         input_ids=input_ids,
   2041         expand_size=generation_config.num_return_sequences,
   2042         is_encoder_decoder=self.config.is_encoder_decoder,
   2043         **model_kwargs,
   2044     )
   2046     # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2047     result = self._sample(
   2048         input_ids,
   2049         logits_processor=prepared_logits_processor,
   2050         stopping_criteria=prepared_stopping_criteria,
   2051         generation_config=generation_config,
   2052         synced_gpus=synced_gpus,
   2053         streamer=streamer,
   2054         **model_kwargs,
   2055     )
   2057 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2058     # 11. prepare beam search scorer
   2059     beam_scorer = BeamSearchScorer(
   2060         batch_size=batch_size,
   2061         num_beams=generation_config.num_beams,
   (...)
   2066         max_length=generation_config.max_length,
   2067     )

File ~\anaconda3\Lib\site-packages\transformers\generation\utils.py:3007, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   3004 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   3006 # forward pass to get next token
-> 3007 outputs = self(**model_inputs, return_dict=True)
   3009 if synced_gpus and this_peer_finished:
   3010     continue  # don't waste resources running the code we don't need

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~\anaconda3\Lib\site-packages\transformers\models\qwen2\modeling_qwen2.py:1167, in Qwen2ForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
   1164 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1166 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1167 outputs = self.model(
   1168     input_ids=input_ids,
   1169     attention_mask=attention_mask,
   1170     position_ids=position_ids,
   1171     past_key_values=past_key_values,
   1172     inputs_embeds=inputs_embeds,
   1173     use_cache=use_cache,
   1174     output_attentions=output_attentions,
   1175     output_hidden_states=output_hidden_states,
   1176     return_dict=return_dict,
   1177     cache_position=cache_position,
   1178 )
   1180 hidden_states = outputs[0]
   1181 if labels is None and not is_torchdynamo_compiling():

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

Cell In[15], line 94, in LCKVQwen2Model.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, cache_position)
     86 use_sequential = (
     87     self.config.use_sequential
     88     or inputs_embeds.shape[1] <= self.config.forward_passes + self.config.backward_passes
     89     and self.parser.attends_top()
     90 )
     92 if use_sequential:
---> 94     iteration_outputs = self._modeling_sequential(
     95         hidden_states,
     96         attention_mask=causal_mask,
     97         position_ids=position_ids,
     98         past_key_values=past_key_values,
     99         output_attentions=output_attentions,
    100         use_cache=use_cache,
    101         cache_position=cache_position,
    102         position_embeddings=position_embeddings,
    103         output_hidden_states=output_hidden_states,
    104     )
    106 else:
    107 
    108     # we need to do forward passes based on a plan if the input is a prompt
    109     plan = self.parser.iteration_plan(self.config.forward_passes, self.config.backward_passes)

Cell In[15], line 311, in LCKVQwen2Model._modeling_sequential(self, hidden_states, attention_mask, position_ids, past_key_values, output_attentions, use_cache, cache_position, position_embeddings, output_hidden_states)
    305 m_cache_position = cache_position[i:i+1] if cache_position is not None else None
    306 m_position_embeddings = (
    307     position_embeddings[0][:, i:i+1],
    308     position_embeddings[1][:, i:i+1]
    309 )
--> 311 outputs = self._iterate_layers(
    312     m_hidden_states,
    313     attention_mask=m_attention_mask,
    314     position_ids=m_position_ids,
    315     past_key_values=past_key_values,
    316     output_attentions=output_attentions,
    317     use_cache=use_cache,
    318     cache_position=m_cache_position,
    319     position_embeddings=m_position_embeddings,
    320     output_hidden_states=output_hidden_states
    321 )
    323 last_hidden_state.append(outputs.last_hidden_state)
    325 if output_hidden_states:

Cell In[15], line 188, in LCKVQwen2Model._iterate_layers(self, hidden_states, attention_mask, position_ids, past_key_values, output_attentions, use_cache, cache_position, position_embeddings, output_hidden_states, layer_slice)
    176     layer_outputs = self._gradient_checkpointing_func(
    177         decoder_layer.__call__,
    178         hidden_states,
   (...)
    185         position_embeddings,
    186     )
    187 else:
--> 188     layer_outputs = decoder_layer(
    189         hidden_states,
    190         attention_mask=attention_mask,
    191         position_ids=position_ids,
    192         past_key_value=past_key_values,
    193         output_attentions=output_attentions,
    194         use_cache=use_cache,
    195         cache_position=cache_position,
    196         position_embeddings=position_embeddings,
    197     )
    199 hidden_states = layer_outputs[0]
    201 if use_cache:

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~\anaconda3\Lib\site-packages\transformers\models\qwen2\modeling_qwen2.py:702, in Qwen2DecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position, position_embeddings, **kwargs)
    699 hidden_states = self.input_layernorm(hidden_states)
    701 # Self Attention
--> 702 hidden_states, self_attn_weights, present_key_value = self.self_attn(
    703     hidden_states=hidden_states,
    704     attention_mask=attention_mask,
    705     position_ids=position_ids,
    706     past_key_value=past_key_value,
    707     output_attentions=output_attentions,
    708     use_cache=use_cache,
    709     cache_position=cache_position,
    710     position_embeddings=position_embeddings,
    711 )
    712 hidden_states = residual + hidden_states
    714 # Fully Connected

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

Cell In[12], line 64, in LCKVQwen2Attention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position, position_embeddings, **kwargs)
     62 if attention_mask is not None:  # no matter the length, we just slice it
     63     causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
---> 64     attn_weights = attn_weights + causal_mask
     66 # diagonal mask from the right bottom corner
     67 if self.layer_type.attends_top:

RuntimeError: The size of tensor a (8) must match the size of tensor b (7) at non-singleton dimension 3

@why-in-Shanghaitech (Member)

> Hi! I solved this CausalLM error by returning the model_inputs and setting the layer types config default to None, and it works fine. However, I then ran into a RuntimeError caused by a tensor size mismatch when the causal mask is added to the attention weights. I was wondering if you could help me solve this problem here.
>
> This is the layer types config I tried:
>
> layer_types: str = "0_1_2_3_4_5_6_7_23_23_23_23_23_23_23_23_16_17_18_19_20_21_22_23"
>
> This is the RuntimeError I receive after changing the layer types config:
>
> ...

Yes... I can reproduce this bug. I haven't tested generation with the eager implementation yet... I'll look into it.

@why-in-Shanghaitech (Member)

I have just pushed a bugfix. Hopefully it works.

@Mostafa-Emad77

Thank you, the bugfix worked, and I can now configure the middle layers to use the layers ahead.

@SpoSer23 (Author)

Thanks, I just tried returning the model_inputs, and with the bugfix different layer configs work just fine.

why-in-Shanghaitech added a commit that referenced this issue Nov 19, 2024
When initializing the model without explicitly declaring which attention implementation to use, the original implementation throws an error. This is because the Llama init function changes the attn implementation to sdpa, which is not implemented in LCKV yet. We fix it by passing a copy of the config to the Llama init function.
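In outline, the fix follows this pattern (a sketch of what the commit message describes, with illustrative names, not the exact diff):

import copy

def __init__(self, config: LCKVLlamaConfig):
    # give LlamaModel.__init__ its own copy: it may switch the attention
    # implementation to "sdpa", and that change must not leak into the
    # config object that the LCKV attention layers read
    LlamaModel.__init__(self, copy.deepcopy(config))
    ...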