
When the model is on the CPU, the device of the tensor returned by encode is cuda #2694

Open
secsilm opened this issue May 30, 2024 · 5 comments

Comments

@secsilm

secsilm commented May 30, 2024

Let's look at the code directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2").to('cpu').eval()
model.encode(['test'], convert_to_tensor=True).device  # device(type='cuda', index=0)

To make sure the model is on the CPU:

for name, p in model.named_parameters():
    print(name, p.device)

Output:

0.auto_model.embeddings.word_embeddings.weight cpu
0.auto_model.embeddings.position_embeddings.weight cpu
0.auto_model.embeddings.token_type_embeddings.weight cpu
0.auto_model.embeddings.LayerNorm.weight cpu
0.auto_model.embeddings.LayerNorm.bias cpu
0.auto_model.encoder.layer.0.attention.self.query.weight cpu
0.auto_model.encoder.layer.0.attention.self.query.bias cpu
0.auto_model.encoder.layer.0.attention.self.key.weight cpu
0.auto_model.encoder.layer.0.attention.self.key.bias cpu
0.auto_model.encoder.layer.0.attention.self.value.weight cpu
0.auto_model.encoder.layer.0.attention.self.value.bias cpu
0.auto_model.encoder.layer.0.attention.output.dense.weight cpu
0.auto_model.encoder.layer.0.attention.output.dense.bias cpu
0.auto_model.encoder.layer.0.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.0.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.0.intermediate.dense.weight cpu
0.auto_model.encoder.layer.0.intermediate.dense.bias cpu
0.auto_model.encoder.layer.0.output.dense.weight cpu
0.auto_model.encoder.layer.0.output.dense.bias cpu
0.auto_model.encoder.layer.0.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.0.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.1.attention.self.query.weight cpu
0.auto_model.encoder.layer.1.attention.self.query.bias cpu
0.auto_model.encoder.layer.1.attention.self.key.weight cpu
0.auto_model.encoder.layer.1.attention.self.key.bias cpu
0.auto_model.encoder.layer.1.attention.self.value.weight cpu
0.auto_model.encoder.layer.1.attention.self.value.bias cpu
0.auto_model.encoder.layer.1.attention.output.dense.weight cpu
0.auto_model.encoder.layer.1.attention.output.dense.bias cpu
0.auto_model.encoder.layer.1.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.1.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.1.intermediate.dense.weight cpu
0.auto_model.encoder.layer.1.intermediate.dense.bias cpu
0.auto_model.encoder.layer.1.output.dense.weight cpu
0.auto_model.encoder.layer.1.output.dense.bias cpu
0.auto_model.encoder.layer.1.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.1.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.2.attention.self.query.weight cpu
0.auto_model.encoder.layer.2.attention.self.query.bias cpu
0.auto_model.encoder.layer.2.attention.self.key.weight cpu
0.auto_model.encoder.layer.2.attention.self.key.bias cpu
0.auto_model.encoder.layer.2.attention.self.value.weight cpu
0.auto_model.encoder.layer.2.attention.self.value.bias cpu
0.auto_model.encoder.layer.2.attention.output.dense.weight cpu
0.auto_model.encoder.layer.2.attention.output.dense.bias cpu
0.auto_model.encoder.layer.2.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.2.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.2.intermediate.dense.weight cpu
0.auto_model.encoder.layer.2.intermediate.dense.bias cpu
0.auto_model.encoder.layer.2.output.dense.weight cpu
0.auto_model.encoder.layer.2.output.dense.bias cpu
0.auto_model.encoder.layer.2.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.2.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.3.attention.self.query.weight cpu
0.auto_model.encoder.layer.3.attention.self.query.bias cpu
0.auto_model.encoder.layer.3.attention.self.key.weight cpu
0.auto_model.encoder.layer.3.attention.self.key.bias cpu
0.auto_model.encoder.layer.3.attention.self.value.weight cpu
0.auto_model.encoder.layer.3.attention.self.value.bias cpu
0.auto_model.encoder.layer.3.attention.output.dense.weight cpu
0.auto_model.encoder.layer.3.attention.output.dense.bias cpu
0.auto_model.encoder.layer.3.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.3.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.3.intermediate.dense.weight cpu
0.auto_model.encoder.layer.3.intermediate.dense.bias cpu
0.auto_model.encoder.layer.3.output.dense.weight cpu
0.auto_model.encoder.layer.3.output.dense.bias cpu
0.auto_model.encoder.layer.3.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.3.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.4.attention.self.query.weight cpu
0.auto_model.encoder.layer.4.attention.self.query.bias cpu
0.auto_model.encoder.layer.4.attention.self.key.weight cpu
0.auto_model.encoder.layer.4.attention.self.key.bias cpu
0.auto_model.encoder.layer.4.attention.self.value.weight cpu
0.auto_model.encoder.layer.4.attention.self.value.bias cpu
0.auto_model.encoder.layer.4.attention.output.dense.weight cpu
0.auto_model.encoder.layer.4.attention.output.dense.bias cpu
0.auto_model.encoder.layer.4.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.4.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.4.intermediate.dense.weight cpu
0.auto_model.encoder.layer.4.intermediate.dense.bias cpu
0.auto_model.encoder.layer.4.output.dense.weight cpu
0.auto_model.encoder.layer.4.output.dense.bias cpu
0.auto_model.encoder.layer.4.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.4.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.5.attention.self.query.weight cpu
0.auto_model.encoder.layer.5.attention.self.query.bias cpu
0.auto_model.encoder.layer.5.attention.self.key.weight cpu
0.auto_model.encoder.layer.5.attention.self.key.bias cpu
0.auto_model.encoder.layer.5.attention.self.value.weight cpu
0.auto_model.encoder.layer.5.attention.self.value.bias cpu
0.auto_model.encoder.layer.5.attention.output.dense.weight cpu
0.auto_model.encoder.layer.5.attention.output.dense.bias cpu
0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.5.intermediate.dense.weight cpu
0.auto_model.encoder.layer.5.intermediate.dense.bias cpu
0.auto_model.encoder.layer.5.output.dense.weight cpu
0.auto_model.encoder.layer.5.output.dense.bias cpu
0.auto_model.encoder.layer.5.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.5.output.LayerNorm.bias cpu
0.auto_model.pooler.dense.weight cpu
0.auto_model.pooler.dense.bias cpu

sentence-transformers: 2.2.2

@tomaarsen
Collaborator

Hello!

This is a bug in Sentence Transformers < 2.3 that caused .to() to be ignored: #2351.
In essence, the model internally tracked a "target device", i.e. the device chosen at initialization. If CUDA was available, this "target device" was set to CUDA, even if the user later moved the model to the CPU. Then, when encode or fit was called, the code would forcibly move the model back to this "target device".
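
For context, here is a minimal sketch of that old behavior (a hypothetical simplification for illustration, not the actual library code):

import torch
from torch import nn

class OldStyleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)  # stand-in for the real transformer stack
        # The "target device" is fixed once, at initialization time.
        self._target_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def encode(self, x):
        # Even if the user called .to('cpu') after loading, encode() forcibly
        # moves the model back to the remembered target device.
        self.to(self._target_device)
        return self.linear(x.to(self._target_device))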

You should be able to fix this by updating your sentence-transformers version:

pip install -U sentence-transformers
  • Tom Aarsen

@secsilm
Author

secsilm commented May 31, 2024

I upgraded to 3.0.0, and the returned tensor is on the CPU now. But the model still uses the GPU even after I move it to the CPU (2.2.2 was fine):

[screenshot: GPU memory is still allocated after moving the model to the CPU]

@tomaarsen
Collaborator

Indeed, the GPU memory is used for two reasons:

  1. The model is moved to the GPU before it is moved back to the CPU. This is because the device parameter in SentenceTransformer was not specified, which means the automatically chosen device is the strongest one available: "cuda" (i.e. the GPU) in your case.
  2. When torch is compiled with CUDA support, it incurs a GPU memory overhead as soon as CUDA is initialized.

Each of these comes with a recommendation; you can use one or the other:

  1. Pass device="cpu" when you load the model, i.e. SentenceTransformer("all-MiniLM-L6-v2", device="cpu"), rather than moving it after it has been loaded.
  2. Reinstall torch, but this time without CUDA support. This will also save you some disk space. The PyTorch Getting Started widget is quite helpful: just select "CPU" as your compute platform.

If you follow recommendation 2, you no longer need recommendation 1, as "cpu" will then be the automatically chosen device. You can then simply load with SentenceTransformer("all-MiniLM-L6-v2") again.
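
With recommendation 1, for example, the model never touches the GPU and encode should return a CPU tensor (a quick sketch, assuming sentence-transformers >= 2.3):

from sentence_transformers import SentenceTransformer

# Load directly on the CPU so the model is never placed on the GPU first.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

embeddings = model.encode(["test"], convert_to_tensor=True)
print(embeddings.device)  # expected: cpu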

  • Tom Aarsen

@secsilm
Author

secsilm commented May 31, 2024

I understand that the model needs to be loaded onto the GPU first and then moved to the CPU, but why isn't the GPU memory released after moving to the CPU?

@tomaarsen
Collaborator

The model doesn't need to be loaded on the GPU at all; you can load it on the CPU straight away with SentenceTransformer("all-MiniLM-L6-v2", device="cpu").

And the memory is released; only the torch overhead from initializing CUDA remains. If you load the model on CUDA and then time.sleep(1000), you'll probably see higher memory usage than the 280MB, as I think the 280MB is just from torch/CUDA.
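
One rough way to see this (a sketch, assuming a CUDA build of torch and an available GPU) is to compare PyTorch's own allocator statistics before and after moving the model:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # lands on cuda when available
print(torch.cuda.memory_allocated())             # non-zero: the weights live on the GPU

model.to("cpu")
torch.cuda.empty_cache()                         # hand cached blocks back to the driver
print(torch.cuda.memory_allocated())             # ~0: the model's memory is released
# nvidia-smi will still report a few hundred MB: the CUDA context overhead, not the model.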

  • Tom Aarsen
