Memory leak for large strings #1539

Closed

noamgai21 opened this issue May 23, 2024 · 28 comments · Fixed by #1675 or #1676

@noamgai21

noamgai21 commented May 23, 2024

This snippet will cause memory usage to rise indefinitely:

from transformers import AutoTokenizer
import gc

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
refresh_every = 100000

for i in range(100000):
  s = f'{i} {i} ' * 10000
  tokenizer.encode(s)
  gc.collect()
  if i % 100 == 0:
    print(i)
  if i % refresh_every == 0:
    tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

If you set refresh_every to 100000 (as it is in the snippet), the memory usage keeps rising. This Colab notebook crashes after about 15 minutes of execution.

If you set refresh_every to 100, the memory consumption will be stable.
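
For anyone trying to quantify this, here is a small variation of the snippet above that also prints the process RSS via psutil (the psutil reporting is an addition for illustration, not part of the original report; the growth it shows is the one described above):

import psutil
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

for i in range(100000):
    s = f'{i} {i} ' * 10000
    tokenizer.encode(s)
    if i % 100 == 0:
        # Resident set size in MiB; with a single long-lived tokenizer this climbs steadily.
        rss_mib = psutil.Process().memory_info().rss / (1024 * 1024)
        print(f"{i}: {rss_mib:.1f} MiB")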

@noamgai21
Author

Related to #1495

@tomaarsen
Member

tomaarsen commented Jun 18, 2024

Hello!

I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

import random
import string
import time
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Outputs

00: 353.12MB, 0.35s
01: 421.64MB, 0.51s
02: 492.77MB, 0.68s
03: 571.88MB, 0.93s
04: 623.66MB, 1.02s
05: 710.28MB, 1.35s
06: 803.41MB, 1.31s
07: 859.77MB, 1.43s
08: 912.55MB, 1.69s
09: 1014.13MB, 1.78s
10: 1081.04MB, 1.95s
11: 1133.04MB, 2.29s
12: 1208.43MB, 2.56s
13: 1413.81MB, 2.65s
14: 1495.07MB, 2.83s
15: 1575.66MB, 3.00s
16: 1646.78MB, 3.19s
17: 1720.24MB, 3.57s
18: 1793.95MB, 3.82s
19: 1862.75MB, 4.02s
20: 1939.91MB, 4.21s
21: 2008.09MB, 4.71s
22: 2084.01MB, 5.04s
23: 2157.63MB, 5.26s
24: 2228.05MB, 5.56s
25: 2304.84MB, 6.13s
26: 2374.40MB, 6.50s
27: 2445.36MB, 6.68s
28: 2517.31MB, 7.38s
29: 2590.93MB, 7.91s
30: 2432.09MB, 8.19s
31: 2645.64MB, 8.56s
32: 2720.85MB, 8.81s
33: 2801.12MB, 9.73s
34: 2874.08MB, 10.14s
35: 2949.19MB, 11.18s
36: 3017.41MB, 11.28s
37: 3094.99MB, 12.76s
38: 3164.58MB, 14.09s
39: 3232.37MB, 13.26s
40: 3309.48MB, 15.10s

This is rather severe: not only is there massive growth in memory usage, but the tokenization speed also becomes much, much lower.

Notes

The memory usage is much more reasonable if the strings:

  1. are not arbitrary, e.g. repeated "abc"
  2. contain spaces, e.g. by adding + " " to the list of choices (see the variant sketched below).
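
A hedged sketch of the second point, changing only the character set from the reproduction above (the function name is illustrative):

import random
import string

def random_string_with_spaces(length: int) -> str:
    # With " " among the choices, whitespace pre-tokenization yields many short pieces,
    # and memory usage stays far more reasonable than with unbroken random strings.
    return ''.join(random.choices(string.ascii_uppercase + string.digits + " ", k=length))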

@n1t0 @Narsil @ArthurZucker

  • Tom Aarsen

@ArthurZucker
Collaborator

I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background.

@SilasMarvin

+1 on facing this issue. Happy to help in any way to get this fixed!

@kczimm

kczimm commented Jun 26, 2024

FWIW, it appears to leak even if TOKENIZERS_PARALLELISM=0.
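
For reference, a common way to set that flag is via the environment before the tokenizer is used (sketch only; as noted above, the leak is observed either way):

import os
os.environ["TOKENIZERS_PARALLELISM"] = "0"  # disable the Rust-side parallelism

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)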

@ArthurZucker
Collaborator

Ah, then it might be the interface between Rust and Python.

@ArthurZucker
Collaborator

I'll try to follow https://dora-rs.ai/blog/rust-python/

@ArthurZucker
Collaborator

If anyone has a fix, feel free to open a PR!

@gau-nernst

I'm also getting a memory leak: memory keeps growing until the program crashes. I think this should be a high-priority bug to fix.

@ArthurZucker
Collaborator

Will investigate. Do you have a reproducer as well? It would help figure out the extent of the bug.

@gau-nernst

Using a concrete dataset

from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

filepath = hf_hub_download("roneneldan/TinyStories", "TinyStoriesV2-GPT4-train.txt", repo_type="dataset")
stories = open(filepath).read().split("\n<|endoftext|>\n")
print(len(stories))  # 2,717,700

outputs = []

chunk_size = 10_000
for i in range(0, len(stories), chunk_size):
    chunk = stories[i : min(i + chunk_size, len(stories))]

    # memory increases 1GB every 2-3s
    outputs.append(tokenizer(chunk, return_attention_mask=False))

    # memory increases at a much slower rate, but might still be abnormal
    # outputs.append(tokenizer(chunk, return_attention_mask=False).input_ids)

The final data is 587,316,317 tokens, which fits in memory (as int64, it is ~4GB). In the end I switched to sentencepiece to tokenize the data instead.
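
As a quick sanity check on that size estimate (plain arithmetic, nothing library-specific):

tokens = 587_316_317
size_gib = tokens * 8 / 1024**3   # 8 bytes per int64 token id
print(f"{size_gib:.1f} GiB")      # ~4.4 GiB, in line with the ~4GB figure above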

@CSEEduanyu

The same problem also occurs with tokenizers 0.19.1 (Rust). Is there any progress? @ArthurZucker

@ArthurZucker
Collaborator

ArthurZucker commented Aug 23, 2024

I would recommend first testing with https://github.com/huggingface/tokenizers/releases (0.20.0), but I will investigate.

@gau-nernst

I can confirm my snippet above still has the memory leak with tokenizers==0.20.0.

@ArthurZucker
Collaborator

This is actually most probably related to the cache_capacity 😅

00: 641.44MB, 0.82s
01: 677.50MB, 0.82s
02: 744.95MB, 0.82s
03: 806.34MB, 0.81s
04: 875.70MB, 0.82s
05: 938.81MB, 0.81s
06: 1002.64MB, 0.82s
07: 1067.64MB, 0.81s
08: 1137.48MB, 0.81s
09: 1210.08MB, 0.82s
10: 1278.47MB, 0.82s
11: 1347.39MB, 0.82s
12: 1417.44MB, 0.82s
13: 1480.31MB, 0.82s
14: 1548.58MB, 0.82s
15: 1615.86MB, 0.84s
16: 1696.23MB, 0.82s
17: 1760.56MB, 0.81s
18: 1832.53MB, 0.82s
19: 1896.66MB, 0.81s
20: 1960.56MB, 0.82s
21: 1532.59MB, 0.84s
22: 1551.64MB, 0.83s
23: 1572.08MB, 0.82s
24: 1589.92MB, 0.82s
25: 1610.05MB, 0.82s
26: 1629.97MB, 0.81s
27: 1652.41MB, 0.82s
28: 1675.92MB, 0.82s
29: 1701.94MB, 0.82s
30: 1724.75MB, 0.81s
31: 1747.91MB, 0.82s
32: 1774.03MB, 0.82s
33: 1799.89MB, 0.82s
34: 1825.55MB, 0.82s
35: 1851.94MB, 0.83s
36: 1881.30MB, 0.82s
37: 1907.56MB, 0.82s
38: 1937.47MB, 0.82s
39: 1969.55MB, 0.82s
40: 2007.11MB, 0.82s
41: 1582.36MB, 0.84s
42: 1601.08MB, 0.83s
43: 1618.77MB, 0.83s
44: 1636.91MB, 0.82s
45: 1657.00MB, 0.81s
46: 1677.72MB, 0.83s
47: 1699.81MB, 0.82s
48: 1722.53MB, 0.85s
49: 1745.86MB, 0.88s
50: 1768.61MB, 0.84s
51: 1794.33MB, 0.85s
52: 1819.61MB, 0.83s
53: 1845.17MB, 0.85s
54: 1869.97MB, 0.85s
55: 1895.31MB, 0.89s
56: 1920.86MB, 0.85s
57: 1946.27MB, 0.85s
58: 1971.42MB, 0.85s
59: 1999.50MB, 0.91s
60: 2028.67MB, 0.85s
61: 1602.22MB, 0.88s
62: 1622.88MB, 0.84s
63: 1640.98MB, 0.83s
64: 1660.52MB, 0.83s
65: 1679.39MB, 0.84s
66: 1701.47MB, 0.83s
67: 1722.67MB, 0.83s
68: 1745.97MB, 0.83s
69: 1769.48MB, 0.83s
70: 1792.70MB, 0.82s
71: 1817.48MB, 0.83s
72: 1843.30MB, 0.83s
73: 1867.92MB, 0.83s
74: 1892.97MB, 0.83s
75: 1919.30MB, 0.82s
76: 1944.92MB, 0.83s
77: 1969.39MB, 0.83s
78: 1994.12MB, 0.84s
79: 2019.91MB, 0.83s
80: 2053.34MB, 0.83s
81: 1623.53MB, 0.85s
82: 1641.69MB, 0.84s
83: 1661.28MB, 0.83s
84: 1679.92MB, 0.83s
85: 1698.00MB, 0.83s
86: 1718.06MB, 0.85s
87: 1741.38MB, 0.83s
88: 1763.98MB, 0.83s
89: 1787.50MB, 0.83s
90: 1812.05MB, 0.83s
91: 1836.77MB, 0.83s
92: 1861.23MB, 0.84s
93: 1886.31MB, 0.82s
94: 1912.08MB, 0.82s
95: 1937.00MB, 0.82s
96: 1962.55MB, 0.82s
97: 1987.83MB, 0.83s
98: 2013.03MB, 0.82s
99: 2038.19MB, 0.83s
100: 2063.67MB, 0.83s
101: 1637.56MB, 0.86s
102: 1657.77MB, 0.83s
103: 1676.17MB, 0.83s
104: 1694.66MB, 0.83s
105: 1713.62MB, 0.84s
106: 1735.89MB, 0.83s
107: 1757.58MB, 0.83s
108: 1781.53MB, 0.83s
109: 1804.66MB, 0.84s
110: 1828.02MB, 0.82s
111: 1852.81MB, 0.83s
112: 1877.17MB, 0.83s
113: 1902.03MB, 0.84s
114: 1929.05MB, 0.83s
115: 1953.89MB, 0.83s
116: 1980.75MB, 0.84s
117: 2004.72MB, 0.83s
118: 2029.92MB, 0.83s
119: 2054.84MB, 0.82s
120: 2086.91MB, 0.83s
121: 1657.25MB, 0.87s
122: 1676.05MB, 0.83s
123: 1694.41MB, 0.83s
124: 1713.56MB, 0.83s
125: 1732.42MB, 0.84s
126: 1750.59MB, 0.85s
127: 1774.39MB, 0.83s
128: 1797.36MB, 0.84s
129: 1819.55MB, 0.82s
130: 1843.45MB, 0.82s
131: 1868.30MB, 0.85s
132: 1893.30MB, 0.84s
133: 1918.31MB, 0.84s

@ecnuycxie

This is actually most probably related to the cache_capacity 😅

The increase in memory usage has been bothering me for several days. Looking forward to your fix for this issue ~ @ArthurZucker

@Narsil
Collaborator

Narsil commented Nov 6, 2024

@gau-nernst

The increasing RAM is perfectly normal: in your program you keep the entire Encoding object, which contains much more than just the token ids. It contains offsets, word_ids, the special tokens mask, etc.

The commented version grows more slowly because you're only keeping the ids there; again, perfectly normal.
It still grows faster than just the ids because the tokenizer keeps a cache to speed up tokenization.

If you want a well-behaved program: instead of accumulating everything, keep dumping these into files; that is the only way to reliably avoid OOMing on large datasets.
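
A minimal sketch of that streaming approach, reusing the TinyStories reproduction from earlier in the thread (the tokens.bin path, int32 dtype, and flat on-disk layout are illustrative choices, not part of the library):

import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

filepath = hf_hub_download("roneneldan/TinyStories", "TinyStoriesV2-GPT4-train.txt", repo_type="dataset")
stories = open(filepath).read().split("\n<|endoftext|>\n")

chunk_size = 10_000
with open("tokens.bin", "wb") as f:
    for i in range(0, len(stories), chunk_size):
        chunk = stories[i : i + chunk_size]
        ids = tokenizer(chunk, return_attention_mask=False).input_ids
        # Write this chunk's token ids to disk and let them go out of scope,
        # instead of accumulating every Encoding in memory.
        np.concatenate([np.asarray(seq, dtype=np.int32) for seq in ids]).tofile(f)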

There's no "leak" per say in all the given programs (it will flatten out given enough time). It's just tokenizer with BPE uses a cache to speed up tokenization in most "normal" (read regular language) situation. Filling up the cache is otherwise normal.
The best we can do is avoid OOMing on these cache allocations (since they are not necessary technically) and give a bit more control for power users (which could want more cache for instance, or clearing it between datasets).

Would you agree, @ArthurZucker?

@ArthurZucker
Collaborator

avoid OOMing on these cache allocations (since they are not technically necessary) and give a bit more control to power users (who might want a larger cache, for instance, or to clear it between datasets).

Aligned! 🤗

@gau-nernst

@Narsil I have since moved to sentencepiece since it was more reliable for me.

I know that my computer can hold the full tokenized data in memory. In fact, sentencepiece did it just fine. You might not want to call it "memory leak", but the excessive memory usage is definitely a bug. "it will flatten out given enough time" will not happen if the program crashes before that 🤣

@Narsil
Collaborator

Narsil commented Nov 6, 2024

excessive memory usage is definitely a bug.

Excessive according to whom?
Caching is extremely successful at speeding things up in a variety of tokenizers, so trading more memory for speed is perfectly valid.

sentencepiece since it was more reliable for me.

That's perfectly fine.

will not happen if the program crashes before that 🤣

Indeed but it's never allocating more than a few GB (10k items at most; unless you're sending humongous blobs, it shouldn't be that big). On modern hardware that should be just fine.
If you're instead sending entire stories (which seems to be the case from a glance), then yes, the caching potentially caches way too much. We can definitely add some defense against such use cases (since cache hits should be fairly low there anyway).

In any case, we're adding more fine-grained control over this caching for users.

@gau-nernst

@Narsil I hope I didn't trigger you. I was giving constructive feedback from a user's perspective. It was not only me observing this bug; other people have faced this problem as well, as is evident in this thread.

Indeed but it's never allocating more than a few GB

Re-running my example above with the latest version of tokenizers (0.20.3), the script consumes all of my 80GB of RAM and I have to terminate the program. And to reiterate, the tokenized data in int64 is only ~4GB, so even with extra data like attention masks, it shouldn't consume that much. And if we want to ask "excessive compared to whom?": sentencepiece didn't have this problem, so it's excessive compared to sentencepiece. Also, this script uses chunking, so it's not sending one big text. And even with one big text, the library should be able to handle some form of it (e.g. be able to tokenize one big string worth 1GB of tokens).

We can definitely add some defense against such use cases

I agree. There should be sensible defaults to prevent excessive memory usage in the first place. I'm not too familiar with the internal implementation, but perhaps you could investigate what a sensible maximum cache size is that doesn't hinder performance too much, because there are probably diminishing returns in maintaining an ever-growing cache.

@tomaarsen
Member

I think extra cache options would be quite valuable. Tokenization becoming an order of magnitude slower over time is a realistic problem for users who tokenize massive amounts of (sometimes unconventional) texts in the same setting, which is very common for my users, e.g. for retrieval systems.

  • Tom Aarsen

@Narsil
Collaborator

Narsil commented Nov 6, 2024

tokenization becoming an order of magnitude slower over time

This isn't the case on any of the systems I've tried (with your exact use case). Can you confirm which version of tokenizers you're using and on what system?
This would indeed be a massive bug, but either it's already been fixed or we're following a wrong lead. The cache has zero impact on this (unless you're hitting swap, which is unlikely).

I hope I didn't trigger you

I'm not. But without actual consumption numbers, an opinion about how something should behave is, well, just an opinion.
80GB is definitely way too much RAM consumption. As I said, the cache is bounded by number of items, not total memory size. I implemented something which should alleviate that, #1676, plus more control on the cache: #1675.
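
Independently of those PRs, the BPE model constructor in the Python bindings accepts a cache_capacity argument, so a bounded cache can be requested up front. A minimal sketch (the vocab.json/merges.txt paths and the 1,000-entry capacity are placeholders, and this only applies when the tokenizer's model is BPE):

from tokenizers import Tokenizer
from tokenizers.models import BPE

# Load an existing BPE vocabulary but cap the internal word cache at 1,000 entries
# instead of relying on the default capacity.
model = BPE.from_file("vocab.json", "merges.txt", cache_capacity=1_000)
tokenizer = Tokenizer(model)

print(tokenizer.encode("hello world").ids)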

@Narsil
Collaborator

Narsil commented Nov 6, 2024

After testing, it seems the slowdown only happens with the cache ON and on Windows (no WSL). @tomaarsen

I could reproduce it on Windows. Both PRs fix it (one is automatic; with the other, users need to adjust the cache size or clear it regularly).
Something is wrong with Windows regarding speed. I can imagine big collisions in the hashmap (which would make lookups slower) or something very wrong along those lines.

@Narsil Narsil reopened this Nov 6, 2024
@Narsil
Collaborator

Narsil commented Nov 6, 2024

Reopening because I think #1676 should land too before calling this fixed (since it's a zero-effort fix for users).

@ArthurZucker
Collaborator

Thanks @tomaarsen and @gau-nernst both for the feedback! I think we ended up with a good solution that should help everyone. It was long overdue, as many people have complained about this 🤗

@yucc-leon

Has this been fixed in the latest version?

@ArthurZucker
Collaborator

Yes! 😉
