Memory leak for large strings #1539
Related to #1495
Hello! I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and it affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

```python
import random
import string
import time

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")
```

Outputs

This is rather severe: not only is there massive growth in memory usage, but the tokenization speed also becomes much, much lower.

Notes

The memory usage is much more reasonable if the strings: […]
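Since the report singles out long sequences without any spaces, the following sketch (not part of the original issue; the `word_len` split and iteration count are arbitrary assumptions) runs the same measurement either on space-free strings or on the same characters broken into short space-separated chunks, so the two memory curves can be compared directly.

```python
# Hypothetical measurement sketch (not from the original report): compare RSS growth
# for space-free random strings vs. the same characters split into short words.
# Run each variant in a fresh process for a clean baseline.
import random
import string

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
CHARS = string.ascii_uppercase + string.digits


def random_text(length, word_len=None):
    """One space-free blob if word_len is None, else space-separated chunks."""
    s = "".join(random.choices(CHARS, k=length))
    if word_len is None:
        return s
    return " ".join(s[i:i + word_len] for i in range(0, len(s), word_len))


WORD_LEN = None  # set to e.g. 10 to measure the space-separated variant instead
for iteration in range(30):
    tokenizer.encode_batch([random_text(12345, WORD_LEN) for _ in range(200)])
    rss_mib = psutil.Process().memory_info().rss / (1024 * 1024)
    print(f"{iteration:02d}: {rss_mib:.2f}MB")
```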
I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background.
+1 on facing this issue. Happy to help in any way to get this fixed!
FWIW, it appears to leak even if […]
Ah, then it might be the interface between Rust and Python.
I'll try to follow https://dora-rs.ai/blog/rust-python/.
If anyone has a fix, feel free to open a PR!
I'm also getting a memory leak. Memory keeps growing until the program crashes. I think this should be treated as a high-priority bug.
Will investigate. Do you have a reproducer as well? It would help figure out the extent of the bug.
Using a concrete dataset:

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
filepath = hf_hub_download("roneneldan/TinyStories", "TinyStoriesV2-GPT4-train.txt", repo_type="dataset")
stories = open(filepath).read().split("\n<|endoftext|>\n")
print(len(stories))  # 2,717,700

outputs = []
chunk_size = 10_000
for i in range(0, len(stories), chunk_size):
    chunk = stories[i : min(i + chunk_size, len(stories))]
    # memory increases 1GB every 2-3s
    outputs.append(tokenizer(chunk, return_attention_mask=False))
    # memory increases at a much slower rate, but might still be abnormal
    # outputs.append(tokenizer(chunk, return_attention_mask=False).input_ids)
```

The final data is 587,316,317 tokens, which fits in memory (using int64, it is ~4GB). In the end I switched to sentencepiece to tokenize the data instead.
Using tokenizers version 0.19.1 (Rust) has the same problem. Is there any progress? @ArthurZucker
I would recommend first testing with https://github.com/huggingface/tokenizers/releases (0.20.0), but I will investigate.
I can confirm my snippet above still has a memory leak with tokenizers==0.20.0.
This is actually most probably related to the […]

```
00: 641.44MB, 0.82s
01: 677.50MB, 0.82s
02: 744.95MB, 0.82s
03: 806.34MB, 0.81s
04: 875.70MB, 0.82s
05: 938.81MB, 0.81s
06: 1002.64MB, 0.82s
07: 1067.64MB, 0.81s
08: 1137.48MB, 0.81s
09: 1210.08MB, 0.82s
10: 1278.47MB, 0.82s
11: 1347.39MB, 0.82s
12: 1417.44MB, 0.82s
13: 1480.31MB, 0.82s
14: 1548.58MB, 0.82s
15: 1615.86MB, 0.84s
16: 1696.23MB, 0.82s
17: 1760.56MB, 0.81s
18: 1832.53MB, 0.82s
19: 1896.66MB, 0.81s
20: 1960.56MB, 0.82s
21: 1532.59MB, 0.84s
22: 1551.64MB, 0.83s
23: 1572.08MB, 0.82s
24: 1589.92MB, 0.82s
25: 1610.05MB, 0.82s
26: 1629.97MB, 0.81s
27: 1652.41MB, 0.82s
28: 1675.92MB, 0.82s
29: 1701.94MB, 0.82s
30: 1724.75MB, 0.81s
31: 1747.91MB, 0.82s
32: 1774.03MB, 0.82s
33: 1799.89MB, 0.82s
34: 1825.55MB, 0.82s
35: 1851.94MB, 0.83s
36: 1881.30MB, 0.82s
37: 1907.56MB, 0.82s
38: 1937.47MB, 0.82s
39: 1969.55MB, 0.82s
40: 2007.11MB, 0.82s
41: 1582.36MB, 0.84s
42: 1601.08MB, 0.83s
43: 1618.77MB, 0.83s
44: 1636.91MB, 0.82s
45: 1657.00MB, 0.81s
46: 1677.72MB, 0.83s
47: 1699.81MB, 0.82s
48: 1722.53MB, 0.85s
49: 1745.86MB, 0.88s
50: 1768.61MB, 0.84s
51: 1794.33MB, 0.85s
52: 1819.61MB, 0.83s
53: 1845.17MB, 0.85s
54: 1869.97MB, 0.85s
55: 1895.31MB, 0.89s
56: 1920.86MB, 0.85s
57: 1946.27MB, 0.85s
58: 1971.42MB, 0.85s
59: 1999.50MB, 0.91s
60: 2028.67MB, 0.85s
61: 1602.22MB, 0.88s
62: 1622.88MB, 0.84s
63: 1640.98MB, 0.83s
64: 1660.52MB, 0.83s
65: 1679.39MB, 0.84s
66: 1701.47MB, 0.83s
67: 1722.67MB, 0.83s
68: 1745.97MB, 0.83s
69: 1769.48MB, 0.83s
70: 1792.70MB, 0.82s
71: 1817.48MB, 0.83s
72: 1843.30MB, 0.83s
73: 1867.92MB, 0.83s
74: 1892.97MB, 0.83s
75: 1919.30MB, 0.82s
76: 1944.92MB, 0.83s
77: 1969.39MB, 0.83s
78: 1994.12MB, 0.84s
79: 2019.91MB, 0.83s
80: 2053.34MB, 0.83s
81: 1623.53MB, 0.85s
82: 1641.69MB, 0.84s
83: 1661.28MB, 0.83s
84: 1679.92MB, 0.83s
85: 1698.00MB, 0.83s
86: 1718.06MB, 0.85s
87: 1741.38MB, 0.83s
88: 1763.98MB, 0.83s
89: 1787.50MB, 0.83s
90: 1812.05MB, 0.83s
91: 1836.77MB, 0.83s
92: 1861.23MB, 0.84s
93: 1886.31MB, 0.82s
94: 1912.08MB, 0.82s
95: 1937.00MB, 0.82s
96: 1962.55MB, 0.82s
97: 1987.83MB, 0.83s
98: 2013.03MB, 0.82s
99: 2038.19MB, 0.83s
100: 2063.67MB, 0.83s
101: 1637.56MB, 0.86s
102: 1657.77MB, 0.83s
103: 1676.17MB, 0.83s
104: 1694.66MB, 0.83s
105: 1713.62MB, 0.84s
106: 1735.89MB, 0.83s
107: 1757.58MB, 0.83s
108: 1781.53MB, 0.83s
109: 1804.66MB, 0.84s
110: 1828.02MB, 0.82s
111: 1852.81MB, 0.83s
112: 1877.17MB, 0.83s
113: 1902.03MB, 0.84s
114: 1929.05MB, 0.83s
115: 1953.89MB, 0.83s
116: 1980.75MB, 0.84s
117: 2004.72MB, 0.83s
118: 2029.92MB, 0.83s
119: 2054.84MB, 0.82s
120: 2086.91MB, 0.83s
121: 1657.25MB, 0.87s
122: 1676.05MB, 0.83s
123: 1694.41MB, 0.83s
124: 1713.56MB, 0.83s
125: 1732.42MB, 0.84s
126: 1750.59MB, 0.85s
127: 1774.39MB, 0.83s
128: 1797.36MB, 0.84s
129: 1819.55MB, 0.82s
130: 1843.45MB, 0.82s
131: 1868.30MB, 0.85s
132: 1893.30MB, 0.84s
133: 1918.31MB, 0.84s
```
The increase in memory usage has been bothering me for several days. Looking forward to your fix for this issue. @ArthurZucker
The increasing RAM is perfectly normal: in your program you keep the entire outputs list in memory. The commented version grows less fast because you're only keeping the ids there, which is again perfectly normal. If you want a well-behaved program, instead of accumulating everything, keep dumping the results into files; this is the only way to reliably avoid OOMing on large datasets.

There's no "leak" per se in any of the given programs (it will flatten out given enough time). It's just that the BPE tokenizer uses a cache to speed up tokenization in most "normal" (read: regular language) situations. Filling up the cache is otherwise normal. Would you agree @ArthurZucker?
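As a concrete illustration of the "keep dumping these into files" suggestion, here is a minimal sketch based on the TinyStories example earlier in the thread. The output file name, the uint32 dtype, and the flattening of ids are assumptions made for the example, not something prescribed in this thread.

```python
# Sketch of the "dump to files" approach suggested above; the file name, uint32
# dtype, and flattening strategy are illustrative choices.
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
filepath = hf_hub_download("roneneldan/TinyStories", "TinyStoriesV2-GPT4-train.txt", repo_type="dataset")
stories = open(filepath).read().split("\n<|endoftext|>\n")

chunk_size = 10_000
with open("tokens.bin", "wb") as f:
    for i in range(0, len(stories), chunk_size):
        chunk = stories[i : i + chunk_size]
        ids = tokenizer(chunk, return_attention_mask=False).input_ids
        # Flatten this chunk's ids and append them to disk; nothing accumulates
        # in the Python process, so peak RAM no longer scales with dataset size.
        np.fromiter((t for seq in ids for t in seq), dtype=np.uint32).tofile(f)
        # If per-story boundaries matter, the per-sequence lengths could be
        # written to a second file alongside the ids.
```

This removes the accumulation factor the comment above points at; whether the cache itself should also be bounded is the separate question discussed below.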
Aligned! 🤗
@Narsil I have since moved to sentencepiece since it was more reliable for me. I know that my computer can hold the full tokenized data in memory; in fact, sentencepiece handled it just fine. You might not want to call it a "memory leak", but the excessive memory usage is definitely a bug. "It will flatten out given enough time" will not happen if the program crashes before that 🤣
Excessive according to whom?
That's perfectly fine.
Indeed, but it's never allocating more than a few GB (10k items at most; unless you're sending humongous blobs, it shouldn't be that big). On modern hardware that should be just fine. In any case, we're adding more fine-grained control over this caching for users.
@Narsil I hope I didn't trigger you. I was giving constructive feedback from a user perspective. It was not only me observing this bug; other people have faced this problem as well, as is evident in this thread.
Re-running my example above with the latest version of tokenizers (0.20.3), the script consumes all of my 80GB of RAM and I have to terminate the program. To reiterate, the tokenized data in int64 is only ~4GB, so even with extra data like attention masks it shouldn't consume that much. And if we want to ask "excessive according to whom?": sentencepiece didn't have this problem, so it's excessive compared to sentencepiece. Also, this script uses chunking, so it's not sending one big text. And even with one big text, the library should be able to handle some form of it (e.g. tokenizing a single string worth 1GB of tokens).
I agree that there should be sensible defaults to prevent excessive memory usage in the first place. I'm not too familiar with the internal implementation, but perhaps you could investigate what a sensible maximum cache size is that doesn't hinder performance too much, because there are probably diminishing returns in maintaining an ever-growing cache.
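For intuition only, here is a pseudocode-level illustration of what a capped cache buys (this is not the actual tokenizers implementation; the class, its `capacity` parameter, and the `compute` callback are invented for the example). Once the cap is reached, the cache simply stops inserting, so memory is bounded while frequently repeated words still get the speedup. This is consistent with the "10k items at most" remark earlier; the catch in this issue is presumably that individual cached items can be enormous when the input has no spaces.

```python
# Illustrative capacity-capped word cache (not the tokenizers internals): once
# `capacity` entries exist, new words are still tokenized but no longer cached,
# so memory stays bounded while common words keep benefiting from the cache.
from typing import Callable, Dict, List


class CappedCache:
    def __init__(self, capacity: int, compute: Callable[[str], List[int]]):
        self.capacity = capacity
        self.compute = compute  # the real (slow) word -> token ids function
        self.store: Dict[str, List[int]] = {}

    def get(self, word: str) -> List[int]:
        hit = self.store.get(word)
        if hit is not None:
            return hit
        ids = self.compute(word)
        if len(self.store) < self.capacity:  # stop growing once the cap is hit
            self.store[word] = ids
        return ids
```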
I think extra cache options would be quite valuable. Tokenization becoming an order of magnitude slower over time is a realistic problem for users who tokenize massive amounts of (sometimes unconventional) text in the same setting, which is very common for my users, e.g. for retrieval systems.
This isn't the case on all the systems I've tried (with your exact use case). Can you confirm which version of tokenizers you're using and on what system?
I'm not. But without actual consumption numbers, having opinions about how something should behave is, well, just an opinion.
After testing, it seems the slowdown only happens with the cache ON and on Windows (not under WSL), @tomaarsen. I could reproduce it on Windows. Both PRs fix it (one is automatic; with the other, users need to adjust the cache size or clear it regularly).
Reopening because I think #1676 should land too before calling it fixed (since it's a zero-effort fix for users).
Thanks @tomaarsen and @gau-nernst for the feedback! I think we ended up with a good solution that should help everyone. It was long overdue, as many people had complained about this 🤗
Has this been fixed in the latest version?
Yes! 😉
This snippet will cause memory usage to rise indefinitely:

If you set `refresh_every` to 100000 (as it is in the snippet), the memory usage will keep on rising; this Colab notebook crashes after about 15 minutes of execution. If you set `refresh_every` to 100, the memory consumption is stable.
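The snippet itself is not preserved above, but the pattern it describes looks roughly like the hedged reconstruction below (the model name, string length, and the reload-from-pretrained strategy are assumptions). Re-creating the tokenizer every `refresh_every` iterations discards its internal cache, which matches the observation that a small value keeps memory stable while a large one lets it grow for a long time first.

```python
# Hypothetical reconstruction of the described pattern (the original snippet is
# not shown above): reloading the tokenizer every `refresh_every` iterations
# drops its internal cache, so memory stays stable when refresh_every is small
# (e.g. 100) and keeps rising for a long time when it is large (e.g. 100000).
import random
import string

from tokenizers import Tokenizer


def random_string(length: int) -> str:
    return "".join(random.choices(string.ascii_uppercase + string.digits, k=length))


refresh_every = 100  # the crashing configuration reportedly used 100000
tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
for iteration in range(1_000_000):
    if iteration and iteration % refresh_every == 0:
        tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")  # fresh cache
    tokenizer.encode(random_string(5_000))
```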