
When the model is on the CPU, the device of the tensor returned by encode is cuda #2694

Open
secsilm opened this issue May 30, 2024 · 5 comments

Comments

@secsilm

secsilm commented May 30, 2024

Let's look at the code directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2").to('cpu').eval()
model.encode(['test'], convert_to_tensor=True).device  # device(type='cuda', index=0)

To make sure the model is on the CPU:

for name, p in model.named_parameters():
    print(name, p.device)

Output:

0.auto_model.embeddings.word_embeddings.weight cpu
0.auto_model.embeddings.position_embeddings.weight cpu
0.auto_model.embeddings.token_type_embeddings.weight cpu
0.auto_model.embeddings.LayerNorm.weight cpu
0.auto_model.embeddings.LayerNorm.bias cpu
0.auto_model.encoder.layer.0.attention.self.query.weight cpu
0.auto_model.encoder.layer.0.attention.self.query.bias cpu
0.auto_model.encoder.layer.0.attention.self.key.weight cpu
0.auto_model.encoder.layer.0.attention.self.key.bias cpu
0.auto_model.encoder.layer.0.attention.self.value.weight cpu
0.auto_model.encoder.layer.0.attention.self.value.bias cpu
0.auto_model.encoder.layer.0.attention.output.dense.weight cpu
0.auto_model.encoder.layer.0.attention.output.dense.bias cpu
0.auto_model.encoder.layer.0.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.0.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.0.intermediate.dense.weight cpu
0.auto_model.encoder.layer.0.intermediate.dense.bias cpu
0.auto_model.encoder.layer.0.output.dense.weight cpu
0.auto_model.encoder.layer.0.output.dense.bias cpu
0.auto_model.encoder.layer.0.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.0.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.1.attention.self.query.weight cpu
0.auto_model.encoder.layer.1.attention.self.query.bias cpu
0.auto_model.encoder.layer.1.attention.self.key.weight cpu
0.auto_model.encoder.layer.1.attention.self.key.bias cpu
0.auto_model.encoder.layer.1.attention.self.value.weight cpu
0.auto_model.encoder.layer.1.attention.self.value.bias cpu
0.auto_model.encoder.layer.1.attention.output.dense.weight cpu
0.auto_model.encoder.layer.1.attention.output.dense.bias cpu
0.auto_model.encoder.layer.1.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.1.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.1.intermediate.dense.weight cpu
0.auto_model.encoder.layer.1.intermediate.dense.bias cpu
0.auto_model.encoder.layer.1.output.dense.weight cpu
0.auto_model.encoder.layer.1.output.dense.bias cpu
0.auto_model.encoder.layer.1.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.1.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.2.attention.self.query.weight cpu
0.auto_model.encoder.layer.2.attention.self.query.bias cpu
0.auto_model.encoder.layer.2.attention.self.key.weight cpu
0.auto_model.encoder.layer.2.attention.self.key.bias cpu
0.auto_model.encoder.layer.2.attention.self.value.weight cpu
0.auto_model.encoder.layer.2.attention.self.value.bias cpu
0.auto_model.encoder.layer.2.attention.output.dense.weight cpu
0.auto_model.encoder.layer.2.attention.output.dense.bias cpu
0.auto_model.encoder.layer.2.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.2.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.2.intermediate.dense.weight cpu
0.auto_model.encoder.layer.2.intermediate.dense.bias cpu
0.auto_model.encoder.layer.2.output.dense.weight cpu
0.auto_model.encoder.layer.2.output.dense.bias cpu
0.auto_model.encoder.layer.2.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.2.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.3.attention.self.query.weight cpu
0.auto_model.encoder.layer.3.attention.self.query.bias cpu
0.auto_model.encoder.layer.3.attention.self.key.weight cpu
0.auto_model.encoder.layer.3.attention.self.key.bias cpu
0.auto_model.encoder.layer.3.attention.self.value.weight cpu
0.auto_model.encoder.layer.3.attention.self.value.bias cpu
0.auto_model.encoder.layer.3.attention.output.dense.weight cpu
0.auto_model.encoder.layer.3.attention.output.dense.bias cpu
0.auto_model.encoder.layer.3.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.3.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.3.intermediate.dense.weight cpu
0.auto_model.encoder.layer.3.intermediate.dense.bias cpu
0.auto_model.encoder.layer.3.output.dense.weight cpu
0.auto_model.encoder.layer.3.output.dense.bias cpu
0.auto_model.encoder.layer.3.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.3.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.4.attention.self.query.weight cpu
0.auto_model.encoder.layer.4.attention.self.query.bias cpu
0.auto_model.encoder.layer.4.attention.self.key.weight cpu
0.auto_model.encoder.layer.4.attention.self.key.bias cpu
0.auto_model.encoder.layer.4.attention.self.value.weight cpu
0.auto_model.encoder.layer.4.attention.self.value.bias cpu
0.auto_model.encoder.layer.4.attention.output.dense.weight cpu
0.auto_model.encoder.layer.4.attention.output.dense.bias cpu
0.auto_model.encoder.layer.4.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.4.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.4.intermediate.dense.weight cpu
0.auto_model.encoder.layer.4.intermediate.dense.bias cpu
0.auto_model.encoder.layer.4.output.dense.weight cpu
0.auto_model.encoder.layer.4.output.dense.bias cpu
0.auto_model.encoder.layer.4.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.4.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.5.attention.self.query.weight cpu
0.auto_model.encoder.layer.5.attention.self.query.bias cpu
0.auto_model.encoder.layer.5.attention.self.key.weight cpu
0.auto_model.encoder.layer.5.attention.self.key.bias cpu
0.auto_model.encoder.layer.5.attention.self.value.weight cpu
0.auto_model.encoder.layer.5.attention.self.value.bias cpu
0.auto_model.encoder.layer.5.attention.output.dense.weight cpu
0.auto_model.encoder.layer.5.attention.output.dense.bias cpu
0.auto_model.encoder.layer.5.attention.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.5.attention.output.LayerNorm.bias cpu
0.auto_model.encoder.layer.5.intermediate.dense.weight cpu
0.auto_model.encoder.layer.5.intermediate.dense.bias cpu
0.auto_model.encoder.layer.5.output.dense.weight cpu
0.auto_model.encoder.layer.5.output.dense.bias cpu
0.auto_model.encoder.layer.5.output.LayerNorm.weight cpu
0.auto_model.encoder.layer.5.output.LayerNorm.bias cpu
0.auto_model.pooler.dense.weight cpu
0.auto_model.pooler.dense.bias cpu

sentence-transformers: 2.2.2

@tomaarsen
Collaborator

Hello!

This is a bug in Sentence Transformers < 2.3 that caused .to() to be ignored: #2351.
In essence, the model internally tracked a "target device", i.e. the device chosen at initialization. If CUDA was available, this "target device" was set to CUDA, even if the user later moved the model to the CPU. Then, when encode or fit was called, the code would forcibly move the model back to this "target device".
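
For context, here is a minimal sketch of that old behavior (a hypothetical simplification for illustration, not the actual library code):

import torch
from torch import nn

class OldStyleEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)  # stand-in for the real transformer stack
        # The "target device" is fixed once, at initialization time.
        self._target_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def encode(self, x):
        # Even if the user called .to('cpu') after loading, encode() forcibly
        # moves the model back to the remembered target device.
        self.to(self._target_device)
        return self.linear(x.to(self._target_device))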

You should be able to fix this by updating your sentence-transformers version:

pip install -U sentence-transformers
  • Tom Aarsen

@secsilm
Author

secsilm commented May 31, 2024

I upgraded to 3.0.0, and the returned tensor is on the CPU now. But the model still uses the GPU even after I move it to the CPU (2.2.2 was fine):

[screenshot: GPU memory is still allocated after moving the model to the CPU]

@tomaarsen
Collaborator

Indeed, the GPU memory is used for two reasons:

  1. The model is moved to the GPU before it is moved back to the CPU. This is because the device parameter in SentenceTransformer was not specified, which means the automatically chosen device is the strongest one available: "cuda" (i.e. the GPU) in your case.
  2. When torch is compiled with CUDA support, it incurs a GPU memory overhead as soon as CUDA is initialized.

Each of these comes with a recommendation; you can use one or the other:

  1. Pass device="cpu" when you load the model, i.e. SentenceTransformer("all-MiniLM-L6-v2", device="cpu"), rather than moving it after it has been loaded.
  2. Reinstall torch, but this time without CUDA support. This will also save you some disk space. The PyTorch Getting Started widget is quite helpful: just select "CPU" as your compute platform.

If you follow recommendation 2, you no longer need recommendation 1, as "cpu" will then be the automatically chosen device. You can then simply load with SentenceTransformer("all-MiniLM-L6-v2") again.
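
With recommendation 1, for example, the model never touches the GPU and encode should return a CPU tensor (a quick sketch, assuming sentence-transformers >= 2.3):

from sentence_transformers import SentenceTransformer

# Load directly on the CPU so the model is never placed on the GPU first.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

embeddings = model.encode(["test"], convert_to_tensor=True)
print(embeddings.device)  # expected: cpu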

  • Tom Aarsen

@secsilm
Author

secsilm commented May 31, 2024

I understand that the model needs to be loaded onto the GPU first and then moved to the CPU, but why isn't the GPU memory released after moving to the CPU?

@tomaarsen
Collaborator

The model doesn't need to be loaded on the GPU at all; you can load it on the CPU straight away with SentenceTransformer("all-MiniLM-L6-v2", device="cpu").

And the memory is released; only the torch overhead from initializing CUDA remains. If you load the model on CUDA and then time.sleep(1000), you'll probably see higher memory usage than the 280MB, as I think the 280MB is just from torch/CUDA.
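
One rough way to see this (a sketch, assuming a CUDA build of torch and an available GPU) is to compare PyTorch's own allocator statistics before and after moving the model:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # lands on cuda when available
print(torch.cuda.memory_allocated())             # non-zero: the weights live on the GPU

model.to("cpu")
torch.cuda.empty_cache()                         # hand cached blocks back to the driver
print(torch.cuda.memory_allocated())             # ~0: the model's memory is released
# nvidia-smi will still report a few hundred MB: the CUDA context overhead, not the model.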

  • Tom Aarsen
