Memory leak in SentenceTransformer.encode during the first ~10000 predictions #1795

Dobiasd opened this issue Dec 23, 2022 · 32 comments


Dobiasd commented Dec 23, 2022

The following minimal example repeatedly calls SentenceTransformer.encode on random strings of fixed length (12345) and fixed number of strings (200), and it records the memory usage.

For the first ~50 calls (~10000 predictions), the memory usage grows enormously.

# memleak.py

import random
import string

import psutil
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))


print('iteration,memory_usage_in_MiB', flush=True)
for iteration in range(99999999):
    model.encode([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    print(f'{iteration},{memory_usage_in_MiB}', flush=True)

Output:

iteration,memory_usage_in_MiB
0,1329.22265625
1,1431.140625
2,1509.2265625
3,1641.55859375
4,1699.109375
5,1779.36328125
[...]
10,2250.69921875
[...]
20,3121.921875
[...]
30,4033.1875
[...]
40,4917.00390625
41,5006.48046875
42,5102.65625
43,5186.4453125
44,5276.37890625
45,5378.58203125
46,5486.60546875
47,5546.50390625
48,5648.64453125
49,5731.9296875
50,5749.0390625
51,5765.81640625
52,5776.52734375
53,5752.5390625
54,5752.39453125
55,5765.01953125
56,5783.08203125
57,5758.75
58,5752.390625
59,5794.265625
60,5752.83984375
61,5776.9140625
62,5764.89453125
63,5794.5703125
64,5795.8515625
65,5789.98046875
66,5795.84375
67,5783.55859375
[...]

The larger the input strings, the higher the memory usage, but it always plateaus at roughly this point.

I'm not using a GPU, and the behavior can be reproduced with the following Dockerfile:

FROM python:3.10.9
RUN pip install sentence-transformers==2.2.2 psutil==5.9.4

# download model
RUN python -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')"

# Disable the Docker cache from this stage on, see https://stackoverflow.com/a/58801213/1866775
ADD "https://www.random.org/cgi-bin/randbyte?nbytes=10&format=h" skipcache

ADD ./memleak.py /
RUN python /memleak.py

Is this a memory leak or intended behavior?

@friedhelm739

Hi @Dobiasd, did you find any solution? I'm hitting this issue too, but I don't see any upper bound; it keeps occupying as much memory as it can. =.=
I'm also running my scripts in Docker on CPU and using the encode function, but with my own model.


Dobiasd commented Jan 10, 2023

No, I did not find a solution.

My workaround is to have a memory limit on the affected Kubernetes pods, regularly have them be OOMKilled because of the memleak, and let the clients retry their requests. 😐


JoanFM commented Mar 15, 2023

Hey @Dobiasd ,

Have you tried setting the inner torch models to eval mode, or using inference mode, to see if that solves the issue? Maybe the model is storing some information in preparation for a backward (training) pass?


Dobiasd commented Mar 15, 2023

@JoanFM No, I have not yet tried those things. Can you help me by showing me how to do them?


JoanFM commented Mar 15, 2023

Try this:

import random
import string

import psutil
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

model = model.eval()
import torch

with torch.no_grad():
    print('iteration,memory_usage_in_MiB', flush=True)
    for iteration in range(99999999):
        model.encode([random_string(12345) for _ in range(200)])
        memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
        print(f'{iteration},{memory_usage_in_MiB}', flush=True)

In theory, only one of these changes may be enough.
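
For reference, here is a rough sketch of the inference-mode variant as well (untested; it assumes a torch version that provides torch.inference_mode):

import random
import string

import psutil
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2').eval()


def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))


print('iteration,memory_usage_in_MiB', flush=True)
with torch.inference_mode():  # like no_grad, but also disables autograd tracking entirely
    for iteration in range(99999999):
        model.encode([random_string(12345) for _ in range(200)])
        memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
        print(f'{iteration},{memory_usage_in_MiB}', flush=True)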


Dobiasd commented Mar 15, 2023

Thanks a lot! I just tested with your version, but sadly it's still leaking (output). 😐


JoanFM commented Mar 16, 2023

Hey @Dobiasd,

with this change, did you get the OOM in Kubernetes?


Dobiasd commented Mar 16, 2023

No, so far I only tested with the minimal example (dockerized) as shown in my original post.


JoanFM commented Mar 16, 2023

But do you get an OOM?


Dobiasd commented Mar 16, 2023

The memory consumption grows and grows, as shown in the linked output.

@patelrajnath

Having the same issue when training the model with teacher embeddings. After a while it grows so much that the container kills the process.


Lena810 commented Jun 29, 2023

Having the same issue. I used psutil to check the memory info, and I found that the memory leak may occur in BertModel. Unfortunately, I have no idea how to determine which line leads to the memory leak. I hope someone can help us!


rossbg commented Jul 26, 2023

Looks like this happens when you pass an array of data for encoding.

If you call model.encode many times with one element only (by using an outer loop) there's no memory spike.

At least that's what my tests show.


fahminlb33 commented Aug 17, 2023

I recently stumbled upon this problem, and based on @rossbg's advice I partitioned the data before calling encode.

Batch function

def batched(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

Encode process

import numpy as np
import tqdm
from sentence_transformers import SentenceTransformer

# create SentenceTransformer and set max_seq_length
embedding_model = SentenceTransformer("indobenchmark/indobert-large-p2")
embedding_model.max_seq_length = 512

# prepare dataset and calculate total iteration for tqdm
dataset = []  # Total data: 20336
embedding_chunks = []
max_batch = np.ceil(len(dataset) / 128)

# process each batch
for cb in tqdm.tqdm(batched(dataset, 128), total=max_batch):
  embedding_chunks.append(embedding_model.encode(cb, batch_size=128))

# stack all embeddings into one
all_embeddings = np.vstack(embedding_chunks)

Memory usage on Colab using V100 GPU


Based on my experiments, you can fine-tune max_seq_length and batch_size to fit your memory.

Ref:


secsilm commented Nov 2, 2023

Looks like this happens when you pass an array of data for encoding.

If you call model.encode many times with one element only (by using an outer loop) there's no memory spike.

At least that's what my tests show.

I tested your method; memory is still growing.

code:

# memleak.py

import random
import string

import psutil
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')


def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))


print('iteration,memory_usage_in_MiB', flush=True)
for iteration in range(99999999):
    a = [model.encode(random_string(12345)) for _ in range(200)]  # <- CHANGE HERE
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    print(f'{iteration:02d}, {memory_usage_in_MiB:.2f}', flush=True)

output:

iteration,memory_usage_in_MiB
00, 3375.28
01, 3466.03
02, 3555.23
03, 3642.89
04, 3734.67
05, 3828.00
06, 3916.68
07, 4005.63


rossoft commented Jan 30, 2024

I have seen the same growing memory issue with the same model, 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'.

@tomaarsen
Collaborator

Hello!

Although I can reproduce the results from this issue, the problem disappears if we change the input to actual words sampled with NLTK:

from nltk.corpus import words

import random
import string

import psutil
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def random_words(length: int) -> str:
    return " ".join(random.sample(words.words(), k=length))

print('iteration,memory_usage_in_MiB', flush=True)
for iteration in range(20):
    model.encode([random_words(2000) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    print(f'{iteration},{memory_usage_in_MiB}', flush=True)

Output:

iteration,memory_usage_in_MiB
0,1187.4140625
1,1210.42578125
2,1210.85546875
3,1210.57421875
4,1212.6640625
5,1213.4765625
6,1214.90234375
7,1213.07421875
8,1213.1328125
9,1212.6953125
10,1213.95703125
11,1216.171875
12,1214.828125
13,1214.55078125
14,1214.09765625
15,1214.12109375
16,1216.42578125
17,1215.5078125
18,1214.8828125
19,1215.9765625

In short, I'm struggling to see a real memory leak at this time. I'd love for you to prove me wrong, though - I would love to reduce memory issues for my users.

  • Tom Aarsen


GingerNg commented Feb 6, 2024

When using 'sentence-transformers/all-MiniLM-L6-v2', still with random_string(), the issue disappears too. Very interesting.

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

iteration,memory_usage_in_MiB
0,4851.43359375
1,4851.43359375
2,4851.46875
3,4851.46875
4,4851.46875
5,4851.46875
6,4851.46875
7,4851.46875
8,4851.46875
9,4851.46875
10,4851.46875
11,4851.46875
12,4851.46875
13,4851.46875
14,4851.46875
15,4851.46875


Dobiasd commented Feb 6, 2024

@tomaarsen Ah, I guess the "leak" disappears when using this fixed set of words (instead of random strings) because the set of possible tokens is limited that way. 👍

To give some context: I ran into this problem while we were processing chat messages from a large online (international) user base. These users tend to produce so many different words (including typos, etc.) that the memory usage does not stop growing at a reasonable amount.


rossoft commented Feb 6, 2024

It may be that memory grew for me because I was using random strings with very long words.

@tomaarsen
Collaborator

Interesting. I would love to avoid the memory issues with these odd edge cases as well. I remember a similar case where someone tried to do sentence segmentation on Wikipedia edits, but it would sometimes stop working - it ended up being caused by someone who edited a sequence of "aaaaaaa..." with a length of 10k, and the segmenter couldn't handle that 😄


0xtotem commented Jun 6, 2024

Any news on this?

I get a CUDA out-of-memory error at some point when encoding text in a loop, even if I reset torch's cache:

for s in corpus:
    embed = embedder.encode(s)
    torch.cuda.empty_cache()            # no effect
    ...

s can be any size (always smaller than the maximum context length of the model used), yet at some random point I get an out-of-memory error. If I restart the script, I can continue for a while until the next error.
For instance, on a Tesla T4: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU :x


tomaarsen commented Jun 6, 2024

@0xtotem
Could you share some details on your embedding model, your GPU's maximum VRAM, and the kind of data that you're using? Is it just "normal" text, or perhaps also code and/or nonsense? I'd like to try and reproduce this for you.

  • Tom Aarsen


0xtotem commented Jun 10, 2024

Hi @tomaarsen, my GPU has 16 GB of VRAM, and I am processing text that may contain code.
Please note that when I encounter such an error, restarting the script is enough to process that same piece of text.

For context, I'm encoding pieces in a loop and storing the result on disk (cache) for later usage.
Thanks for your help.

@TheOnlyWayUp

Hey @tomaarsen, I'm having this issue too.

I'm using an A100 PCIe with 80 GB of VRAM and the Alibaba-NLP/gte-large-en-v1.5 model. I've been having this issue with other models as well.

My data consists of subtitles, so it could be a tokenization issue.


test.json
test2.json

I haven't checked the data for cleanliness or rating; these are roughly the first 128 rows. (I split them based on the number of characters; each file should be around 600k.)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
import json

with open('./test.json') as handler:
  data_one = json.load(handler)
with open('./test2.json') as handler:
  data_two = json.load(handler)

batches = [data_one, data_two]

for batch in batches:
  embeddings = model.encode([episode["caption"] for episode in batch], batch_size=32, show_progress_bar=True)

Usually, the first iteration works perfectly, and it OOMs when it reaches the second file. If I switch the order of the files, I get the same result: it works on the first file and OOMs on the second one.

I've tried torch.cuda.empty_cache, gc.collect, running it with no_grad, and running each batch in a subprocess, but still no luck.

Regarding the subprocesses: normally, memory frees up when I kill the interpreter, but this doesn't happen with subprocesses (loading the model in a subprocess, encoding, and exiting the subprocess) - and I still OOM.

I'm ready to try fixes, thanks Tom!


TheOnlyWayUp commented Jun 13, 2024

I reduced the batch_size to 16 and reduced the maximum number of characters per iteration to 300k. OOMed on the 3rd batch.

Oh? It works when I skip batches 3, 4, and 5, but then it OOMs on batch 10 again.
Batch 10's data: test3.json

Edit: Of note, after an OOM I need to restart the interpreter. Otherwise, it'll continue to OOM even on inputs that would succeed on a freshly started interpreter. Could this be a cleanup issue?

@hello2mao

+1


tomaarsen commented Jun 18, 2024

@TheOnlyWayUp I experimented with your code, and it seems like you're experiencing an OOM because of the really large maximum sequence length of 8192 tokens that the gte model allows. In Sentence Transformers, each batch is padded to the longest sequence in that batch, up to max_seq_length, after which we truncate. In test.json with a batch size of 32, you end up with these shapes for the inputs:

torch.Size([32, 6482])
torch.Size([32, 2774])

test2.json has these:

torch.Size([32, 4377])
torch.Size([32, 2697])
torch.Size([5, 393])

and test3.json has this one:

torch.Size([28, 8192])

As you can imagine, each of these has wildly different memory requirements, and some of them ([32, 6482], [28, 8192]) are likely to require more VRAM than you have. In other words, this seems like a normal and expected OOM, not a memory leak.

I can't explain why it happens somewhat arbitrarily though.
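
If you want to check these shapes yourself, here's a rough sketch (it walks the file in order; encode may group texts slightly differently because it sorts them by length internally):

import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

with open('./test.json') as handler:
    batch = json.load(handler)

texts = [episode["caption"] for episode in batch]
batch_size = 32
for start in range(0, len(texts), batch_size):
    # each sub-batch is padded to its longest sequence, capped at model.max_seq_length
    features = model.tokenize(texts[start:start + batch_size])
    print(features["input_ids"].shape)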

Either way, if you want to use the model with the massive sequence length, then you can create some extremely long dummy text and encode with that. That way you'll know for sure that the tokenizer is reaching the maximum sequence length. Then you can tune your batch size to the highest it'll go without OOM. Because your batch is as large as it can possibly be, this should be your upper limit of memory usage, e.g.:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

data = ["a " * 100_000] * 1000

embeddings = model.encode(data, batch_size=32, show_progress_bar=True)
print(embeddings.shape)

For me, this gives torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU, because it tried to calculate attention for a [32, 8192] batch.

My real recommendation is to reduce the maximum sequence length used by the model. Keep in mind that it's currently trying to compress 8192 tokens into an embedding of 1024 values: it'll always lose a lot of context. I'm rather confident that you'll get roughly equivalent performance with a sequence length of 2048 or even 512 (plus, it'll be MUCH faster):

from sentence_transformers import SentenceTransformer
import json

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
model.max_seq_length = 2048

data = ["a " * 100_000] * 1000

embeddings = model.encode(data, batch_size=32, show_progress_bar=True)
print(embeddings.shape)

Edit: To note, after an OOM, I need to restart the interpreter. Else, it'll continue to OOM even when it otherwise wouldn't (on a freshly started interpreter). Could be a cleanup issue?

This is standard when getting a CUDA OOM, you always have to restart the interpreter then.

  • Tom Aarsen

@tomaarsen
Collaborator

@0xtotem

s can be any size, always smaller than the maximum context length of the model used, at some random point I get an out of memory.

I just realized that this could be the issue: the number of tokens can suddenly be bigger than what you've seen before. You can verify whether this is the case by feeding the model a batch of very long texts at your batch size and seeing if that results in an OOM. E.g.:

data = ["a " * 100_000] * 1000
embeddings = model.encode(data, batch_size=32, show_progress_bar=True)

  • Tom Aarsen


tomaarsen commented Jun 18, 2024

@Dobiasd
I just experimented with your first script again, and I've discovered that you get the exact same memory growth if we use model.tokenize instead of model.encode. In other words: this seems to be a tokenization issue. The tokenization time also grows, from ~0.16s for the first call to ~1.5s after ~20 calls.

After some more digging, it looks like https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 uses a BertTokenizerFast that is equivalent to XLMRobertaTokenizer from xlm-roberta-base, but was somehow remapped to the BertTokenizer format. However, the Fast tokenizer versions (based on Rust) keep growing in memory usage, whereas the non-Fast ones don't.
I can't load the https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 tokenizer in a non-Fast way (it needs a vocab.txt file then, it seems), but I can load the (seemingly) equivalent xlm-roberta-base tokenizer as non-fast, and then the memory usage stays consistent.
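
A minimal sketch of that kind of comparison (assuming the random_string helper from the original report, and sentencepiece installed for the slow tokenizer):

import random
import string

import psutil
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
slow_tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', use_fast=False)


def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))


print('iteration,memory_usage_in_MiB', flush=True)
for iteration in range(100):
    texts = [random_string(12345) for _ in range(200)]
    model.tokenize(texts)  # fast (Rust-based) tokenizer: memory keeps growing
    # slow_tokenizer(texts, truncation=True)  # slow tokenizer: memory stays flat
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    print(f'{iteration},{memory_usage_in_MiB}', flush=True)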

I think I'll open an issue on https://github.com/huggingface/tokenizers to see if this is expected behaviour or a bug.
Edit: I've reported it here. It's certainly a memory leak, although not in Sentence Transformers.

  • Tom Aarsen


TheOnlyWayUp commented Jun 19, 2024

Hey @tomaarsen, thank you for the detailed response! The code snippets are especially helpful; I'll use them to tune my batch sizes in the future.

The model card had a PyTorch example, so I tried it just to check. This is what I stuck with:

import torch.nn.functional as F
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, unpad_inputs=True, use_memory_efficient_attention=True).to(device)

def embed(texts):
    with torch.inference_mode():
        # Tokenize the input texts
        batch_dict = tokenizer(texts, max_length=8192, padding=True, truncation=True, return_tensors='pt').to(device)

        outputs = model(**batch_dict)
        embeddings = outputs.last_hidden_state[:, 0]
        return embeddings
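
For illustration, the helper can then be called on a small batch like this (placeholder strings):

texts = ["first subtitle text ...", "second subtitle text ..."]
vectors = embed(texts)
print(vectors.shape)  # (number of texts, hidden size)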

The snippet took my memory footprint from 78 GB to 27 GB. And no OOMs, so I was able to embed my dataset in its entirety (even with the weird shapes).

It might be due to the options (use_memory_efficient_attention, and I disabled mixed precision) and the use of xformers.

Does this line up with everything mentioned in this thread? The VRAM drop was a pleasant surprise.

Thanks!
- TheOnlyWayUp

@tomaarsen
Collaborator

@TheOnlyWayUp Since Sentence Transformers v3.0.0 it's possible to pass kwargs to AutoModel used internally in Sentence Transformers, so I think you can reproduce your performance above with:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True, model_kwargs={"unpad_inputs": True, "use_memory_efficient_attention": True})

(But also, if it ain't broke, don't fix it 😄 Just spreading the word about the new model_kwargs feature.)

Does this line up with everything mentioned in this thread?
I believe so, I think you should be good :)

  • Tom Aarsen
