Error running inference on CogVLM2 when distributing it on multiple GPUs: Expected all tensors to be on the same device, but found at least two devices #31676
Closed · 2 of 4 tasks
ghazalsaheb opened this issue on Jun 28, 2024 · 4 comments
Who can help?
@ArthurZucker @amyeroberts @Narsil @muellerzr @SunMarc

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    device_map='auto'
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    history = []
    while True:
        query = input("Human:")
        if query == "clear":
            break
        input_by_model = model.build_conversation_input_ids(
            tokenizer,
            query=query,
            history=history,
            images=[image],
            template_version='chat'
        )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to("cuda"),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to("cuda"),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to("cuda"),
            'images': [[input_by_model['images'][0].to("cuda").to(TORCH_TYPE)]],
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
Human input when running the code: "please describe this image"
Expected behavior
The model should be distributed across multiple GPU cards and run inference with the inputs placed on a single card, generating a caption for each human prompt. Instead, I get the error below.
(I also tried defining my own device map instead of using 'auto', similar to here, but it gives the same error; see the sketch after the traceback.)
> RuntimeError Traceback (most recent call last)
> Cell In[1], line 77
> 72 gen_kwargs = {
> 73 "max_new_tokens": 2048,
> 74 "pad_token_id": 128002,
> 75 }
> 76 with torch.no_grad():
> ---> 77 outputs = model.generate(**inputs, **gen_kwargs)
> 78 outputs = outputs[:, inputs['input_ids'].shape[1]:]
> 79 response = tokenizer.decode(outputs[0])
>
> File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
> 112 @functools.wraps(func)
> 113 def decorate_context(*args, **kwargs):
> 114 with ctx_factory():
> --> 115 return func(*args, **kwargs)
>
> File /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1622, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
> 1614 input_ids, model_kwargs = self._expand_inputs_for_generation(
> 1615 input_ids=input_ids,
> 1616 expand_size=generation_config.num_return_sequences,
> 1617 is_encoder_decoder=self.config.is_encoder_decoder,
> 1618 **model_kwargs,
> 1619 )
> 1621 # 13. run sample
> -> 1622 result = self._sample(
> 1623 input_ids,
> 1624 logits_processor=prepared_logits_processor,
> 1625 logits_warper=logits_warper,
> 1626 stopping_criteria=prepared_stopping_criteria,
> 1627 pad_token_id=generation_config.pad_token_id,
> 1628 output_scores=generation_config.output_scores,
> 1629 output_logits=generation_config.output_logits,
> 1630 return_dict_in_generate=generation_config.return_dict_in_generate,
> 1631 synced_gpus=synced_gpus,
> 1632 streamer=streamer,
> 1633 **model_kwargs,
> 1634 )
> 1636 elif generation_mode == GenerationMode.BEAM_SEARCH:
> 1637 # 11. prepare beam search scorer
> 1638 beam_scorer = BeamSearchScorer(
> 1639 batch_size=batch_size,
> 1640 num_beams=generation_config.num_beams,
> (...)
> 1645 max_length=generation_config.max_length,
> 1646 )
>
> File /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:2791, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
> 2788 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
> 2790 # forward pass to get next token
> -> 2791 outputs = self(
> 2792 **model_inputs,
> 2793 return_dict=True,
> 2794 output_attentions=output_attentions,
> 2795 output_hidden_states=output_hidden_states,
> 2796 )
> 2798 if synced_gpus and this_peer_finished:
> 2799 continue # don't waste resources running the code we don't need
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
> 1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
> 1531 else:
> -> 1532 return self._call_impl(*args, **kwargs)
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
> 1536 # If we don't have any hooks, we want to skip the rest of the logic in
> 1537 # this function, and just call forward.
> 1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
> 1539 or _global_backward_pre_hooks or _global_backward_hooks
> 1540 or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541 return forward_call(*args, **kwargs)
> 1543 try:
> 1544 result = None
>
> File /opt/conda/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
> 163 output = old_forward(*args, **kwargs)
> 164 else:
> --> 165 output = old_forward(*args, **kwargs)
> 166 return module._hf_hook.post_forward(module, output)
>
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/modeling_cogvlm.py:649, in CogVLMForCausalLM.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, labels)
> 646 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
> 648 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
> --> 649 outputs = self.model(
> 650 input_ids=input_ids,
> 651 images=images,
> 652 token_type_ids=token_type_ids,
> 653 attention_mask=attention_mask,
> 654 position_ids=position_ids,
> 655 past_key_values=past_key_values,
> 656 inputs_embeds=inputs_embeds,
> 657 use_cache=use_cache,
> 658 output_attentions=output_attentions,
> 659 output_hidden_states=output_hidden_states,
> 660 return_dict=return_dict,
> 661 )
> 663 hidden_states = outputs[0]
> 664 logits = self.lm_head(hidden_states)
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
> 1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
> 1531 else:
> -> 1532 return self._call_impl(*args, **kwargs)
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
> 1536 # If we don't have any hooks, we want to skip the rest of the logic in
> 1537 # this function, and just call forward.
> 1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
> 1539 or _global_backward_pre_hooks or _global_backward_hooks
> 1540 or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541 return forward_call(*args, **kwargs)
> 1543 try:
> 1544 result = None
>
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/modeling_cogvlm.py:390, in CogVLMModel.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
> 388 assert len(input_ids) == len(images), f"{len(input_ids)} {len(images)}"
> 389 inputs_embeds = self.embed_tokens(input_ids)
> --> 390 images_features = self.encode_images(images)
> 391 images_features = rearrange(images_features, 'b n d -> (b n) d')
> 392 images_features = images_features.to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)
>
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/modeling_cogvlm.py:362, in CogVLMModel.encode_images(self, images)
> 359 images.append(image)
> 361 images = torch.stack(images)
> --> 362 images_features = self.vision(images)
> 363 return images_features
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
> 1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
> 1531 else:
> -> 1532 return self._call_impl(*args, **kwargs)
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
> 1536 # If we don't have any hooks, we want to skip the rest of the logic in
> 1537 # this function, and just call forward.
> 1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
> 1539 or _global_backward_pre_hooks or _global_backward_hooks
> 1540 or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541 return forward_call(*args, **kwargs)
> 1543 try:
> 1544 result = None
>
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/visual.py:130, in EVA2CLIPModel.forward(self, images)
> 128 def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
> 129 x = self.patch_embedding(images)
> --> 130 x = self.transformer(x)
> 131 x = x[:, 1:]
> 133 b, s, h = x.shape
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
> 1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
> 1531 else:
> -> 1532 return self._call_impl(*args, **kwargs)
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
> 1536 # If we don't have any hooks, we want to skip the rest of the logic in
> 1537 # this function, and just call forward.
> 1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
> 1539 or _global_backward_pre_hooks or _global_backward_hooks
> 1540 or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541 return forward_call(*args, **kwargs)
> 1543 try:
> 1544 result = None
>
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/visual.py:94, in Transformer.forward(self, hidden_states)
> 92 def forward(self, hidden_states):
> 93 for layer_module in self.layers:
> ---> 94 hidden_states = layer_module(hidden_states)
> 95 return hidden_states
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
> 1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
> 1531 else:
> -> 1532 return self._call_impl(*args, **kwargs)
>
> File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
> 1536 # If we don't have any hooks, we want to skip the rest of the logic in
> 1537 # this function, and just call forward.
> 1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
> 1539 or _global_backward_pre_hooks or _global_backward_hooks
> 1540 or _global_forward_hooks or _global_forward_pre_hooks):
> -> 1541 return forward_call(*args, **kwargs)
> 1543 try:
> 1544 result = None
>
> File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm2-llama3-chat-19B/2bf7de6892877eb50142395af14847519ba95998/visual.py:83, in TransformerLayer.forward(self, hidden_states)
> 81 mlp_input = hidden_states
> 82 mlp_output = self.post_attention_layernorm(self.mlp(mlp_input))
> ---> 83 output = mlp_input + mlp_output
> 84 return output
>
> RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!
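
For reference, a minimal sketch of how a manual device map can be built with accelerate so that whole transformer layers are never split across GPUs. The no_split_module_classes names are assumptions: TransformerLayer is the vision layer class visible in the traceback above, while CogVLMDecoderLayer is a guess at the language decoder layer name in the remote modeling code. This is a sketch, not a verified fix:

# Sketch only: build the device map ourselves instead of relying on
# device_map='auto', forbidding accelerate from splitting individual layers.
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
with init_empty_weights():
    # Instantiate the architecture without allocating real weights.
    empty_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer"],  # assumed class names
    dtype=TORCH_TYPE,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    device_map=device_map,
).eval()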
CogVLM uses custom code from the hub when you set trust_remote_code=True and the model is not yet added to transformers. There is an open PR here to port the model to transformers, which is in progress afaik, cc @NielsRogge
For the device mismatch issue, please open an issue in the THUDM/cogvlm2-llama3-chat-19B hub repo.
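
In the meantime, a possible workaround sketch (untested, and it assumes the vision tower's submodules share a common prefix such as model.vision in model.hf_device_map; inspect the map to confirm the real module names) is to pin every vision submodule to a single device and reload with the corrected map:

# Untested sketch: inspect how device_map='auto' split the model, then rebuild
# the map so every module of the vision tower ends up on the same GPU.
print(model.hf_device_map)  # transformers stores the resolved map here

fixed_map = dict(model.hf_device_map)
vision_keys = [k for k in fixed_map if ".vision" in k or k.endswith("vision")]  # assumed prefix
if vision_keys:
    target = fixed_map[vision_keys[0]]
    for k in vision_keys:
        fixed_map[k] = target

# Reload with the corrected map instead of 'auto'.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    device_map=fixed_map,
).eval()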
@zucchini-nlp I see, thanks. By THUDM/cogvlm2-llama3-chat-19B you mean here?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.