What happened?
When a vector segment is accessed (e.g. with include=["embeddings"] or through a query), its file handles are opened and the segment is loaded into memory. The loaded vector index is then added to the local segment manager's _vector_instances_file_handle_cache via self._vector_instances_file_handle_cache.set(collection_id, instance). The LRU cache has a callback that evicts items when its capacity overflows. The capacity is bound to the NOFILE kernel parameter (the allowed number of open files): each cached item (vector segment) keeps 5 files open, so the cache can hold up to NOFILE / 5 segments. Items are only evicted from the cache on overflow; there is no way to evict them manually.
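A rough sketch of the mechanism described above (the class and constant names are illustrative, not Chroma's exact internals):

```python
import resource
from collections import OrderedDict

FILE_HANDLES_PER_SEGMENT = 5  # each persistent vector segment keeps ~5 files open


class EvictOnlyOnOverflowLRU:
    """Illustrative LRU cache: entries only leave when capacity is exceeded."""

    def __init__(self, capacity, callback):
        self._capacity = capacity
        self._callback = callback          # e.g. lambda _, v: v.close_persistent_index()
        self._items = OrderedDict()

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        while len(self._items) > self._capacity:
            k, v = self._items.popitem(last=False)
            self._callback(k, v)           # the only exit path for cached segments


# Capacity is derived from the NOFILE soft limit; with NOFILE = 209715
# this allows 41943 segments to sit in the cache indefinitely.
soft_nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
capacity = soft_nofile // FILE_HANDLES_PER_SEGMENT
```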
This eviction strategy is not a memory leak in and of itself, but it becomes a problem when NOFILE is high. On a regular Docker container (tested on macOS) the limit is 209715, which divided by 5 gives 41943 items (vector segments) that can persist in the cache, each keeping a reference to its segment and preventing it from being garbage collected. Looking further into the eviction callback, callback=lambda _, v: v.close_persistent_index(), and tracing it through the persistent HNSW segment: closing the index closes the HNSW file handles and releases the memory held by HNSW. Unfortunately, that is not enough to free all the memory the vector segment occupies, because the subscription to the embedding queue remains active and keeps a reference to the segment and its HNSW index. While that alone might not fully qualify as a memory leak, the fact that deleting a collection leaves hanging references in the embedding queue subscriptions and in the LRU cache (especially when the cache is very large) is a proper memory leak. As an added bonus, everything inside the BF index is leaked as well.
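A minimal sketch of why closing the persistent index is not enough: the segment object stays reachable through the embedding queue's subscriber list, so neither it nor its in-memory BF index can be garbage collected (all names below are hypothetical, not Chroma's actual classes):

```python
class EmbeddingQueue:
    def __init__(self):
        self._subscribers = []           # strong references to subscribed segments

    def subscribe(self, segment):
        self._subscribers.append(segment)

    def unsubscribe(self, segment):      # never called on collection delete
        self._subscribers.remove(segment)


class PersistentVectorSegment:
    def __init__(self, queue):
        self._bf_index = {}              # in-memory brute-force index
        self._hnsw_handles_open = True
        queue.subscribe(self)            # keeps `self` alive as long as the queue lives

    def close_persistent_index(self):
        # Releases only the HNSW file handles; the segment object, its BF index,
        # and the queue subscription all remain in memory.
        self._hnsw_handles_open = False


queue = EmbeddingQueue()
segment = PersistentVectorSegment(queue)
segment.close_persistent_index()
del segment    # still referenced by queue._subscribers -> never collected
```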
The leak is simple to reproduce by continuously creating new collections, adding some vectors (e.g. 199, just to keep things interesting with both the HNSW file handles and the BF index), deleting the collection, and observing Chroma's memory consumption over a moderate period of time, e.g. 1h.
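A reproduction sketch using the public chromadb client API; the host/port are assumptions for a locally running Chroma server, and the comment about HNSW vs. BF placement assumes default collection settings:

```python
import random
import time
import uuid

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)

while True:
    name = f"leak-repro-{uuid.uuid4().hex[:8]}"
    collection = client.create_collection(name)
    # 199 vectors so that (with default batching) some end up in the persistent
    # HNSW index and some remain in the BF index, as described above.
    collection.add(
        ids=[str(i) for i in range(199)],
        embeddings=[[random.random() for _ in range(1536)] for _ in range(199)],
    )
    collection.get(include=["embeddings"])   # forces the HNSW file handles open
    client.delete_collection(name)           # segment stays referenced -> leak
    time.sleep(1)
```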
As an added bonus, the LRU cache is not thread safe, similar to #3334, which is easily demonstrated by running the above reproduction scenario in parallel with a small NOFILE.
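A sketch of the parallel variant, reusing the client and helpers from the previous snippet; the worker and iteration counts are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor


def cycle(worker_id, iterations=50):
    for _ in range(iterations):
        name = f"leak-repro-{worker_id}-{uuid.uuid4().hex[:8]}"
        col = client.create_collection(name)
        col.add(
            ids=[str(i) for i in range(199)],
            embeddings=[[random.random() for _ in range(1536)] for _ in range(199)],
        )
        col.get(include=["embeddings"])   # open the HNSW file handles
        client.delete_collection(name)


with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(cycle, wid) for wid in range(8)]
    for f in futures:
        f.result()   # surface any errors from the racing LRU cache
```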
The screenshots below demonstrate the effect of the defect when running Chroma (latest main as of 19-Dec-2024). With the instance limited to 4GB, creating and deleting collections of 199 docs (1536-dim embeddings) with 1 sec of sleep between cycles took about 25 minutes to run out of memory.
Versions
Chroma 0.4.0+, any Python version, any OS
Relevant log output
No response