[Bug]: Local Segment Manager memory leak #3336

Open
tazarov opened this issue Dec 19, 2024 · 1 comment
Labels: bug, by-chroma, Local Chroma


tazarov commented Dec 19, 2024

What happened?

When a vector segment is accessed (e.g. with include=["embeddings"] or via a query), its file handles are opened and the index is loaded into memory. The loaded vector index is then added to the local segment manager's _vector_instances_file_handle_cache via self._vector_instances_file_handle_cache.set(collection_id, instance). The LRU cache has a callback that evicts items when the capacity overflows. The capacity of the LRU cache is bound to the NOFILE kernel parameter (the number of allowed open files): the capacity in items (vector segments) is the NOFILE limit divided by the number of files each segment keeps open, which is 5. Items are only evicted from the cache upon overflow; there is no way to manually evict them.
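
For concreteness, here is a minimal sketch of the capacity/eviction behaviour described above; only the eviction callback expression is quoted from Chroma's code, everything else is illustrative:

```python
import resource

# Illustrative sketch, not Chroma's actual code: the cache capacity (in
# vector segments) is derived from the NOFILE limit, assuming each persistent
# HNSW segment keeps roughly 5 files open.
FILES_PER_SEGMENT = 5
soft_nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
capacity = soft_nofile // FILES_PER_SEGMENT  # e.g. 209715 // 5 == 41943


class LRUCache:
    """Minimal LRU that only evicts on overflow and has no manual eviction."""

    def __init__(self, capacity, callback):
        self._capacity = capacity
        self._callback = callback
        self._data = {}  # insertion-ordered dict doubles as LRU bookkeeping

    def set(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        while len(self._data) > self._capacity:
            old_key = next(iter(self._data))
            self._callback(old_key, self._data.pop(old_key))


# Eviction only closes the persistent index; nothing else about the segment
# is released, and nothing is evicted until the (huge) capacity is exceeded.
cache = LRUCache(capacity, callback=lambda _, v: v.close_persistent_index())
```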

This eviction strategy is not a memory leak in and of itself, but when NOFILE is too high the cache becomes effectively unbounded. On a regular Docker container (tested on macOS) the limit is 209715, which divided by 5 gives 41,943 items (vector segments) that will persist in the cache and keep a reference to their segment, preventing it from being garbage collected. Looking further into the eviction callback - callback=lambda _, v: v.close_persistent_index() - and tracing it through the persistent HNSW segment, closing the index closes the HNSW file handles and releases the memory held by HNSW. Unfortunately that is not enough to free all the memory the vector segment occupies: the subscription to the embedding queue remains active and keeps a reference to the HNSW index. While that might still not fully qualify as a memory leak, the fact that deleting a collection leaves hanging references in the embedding queue subscriptions and in the LRU cache (especially when it is too large) is a proper memory leak. As an added bonus, everything inside the BF (brute-force) index is leaked as well.
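
To illustrate the dangling-subscription part, here is a toy example of the reference pattern (the class and method names are placeholders, not Chroma's actual types): a queue that holds a subscriber callback bound to the segment keeps the segment, and everything it references, alive even after the persistent index is closed.

```python
import gc
import weakref


class EmbeddingQueue:
    """Placeholder: holds subscriber callbacks and never lets go of them."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, callback):
        self._subscriptions.append(callback)


class VectorSegment:
    """Placeholder standing in for the local vector segment."""

    def __init__(self, queue):
        self._index = bytearray(10_000_000)  # stand-in for HNSW + BF memory
        queue.subscribe(self._on_embeddings)  # bound method -> strong reference

    def _on_embeddings(self, records):
        pass

    def close_persistent_index(self):
        pass  # closes file handles only; the Python object stays referenced


queue = EmbeddingQueue()
segment = VectorSegment(queue)
probe = weakref.ref(segment)

segment.close_persistent_index()
del segment
gc.collect()
print(probe() is not None)  # True: the subscription keeps the segment alive
```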

The leak is simple to reproduce by continuously creating new collections, adding some vectors (e.g. 199, just to keep things interesting with the HNSW file handles and the BF index), deleting the collection, and observing Chroma's memory consumption over a moderate period of time, e.g. 1h.
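
A reproduction sketch along those lines (the host/port, the include=["embeddings"] read, and the 1 s sleep are assumptions about the test setup, not part of the report):

```python
import random
import time
import uuid

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)  # assumed local server

DIMS = 1536
DOCS = 199  # enough to exercise both the HNSW file handles and the BF index

while True:
    name = f"leak-repro-{uuid.uuid4().hex[:8]}"
    collection = client.create_collection(name)
    collection.add(
        ids=[str(i) for i in range(DOCS)],
        embeddings=[[random.random() for _ in range(DIMS)] for _ in range(DOCS)],
    )
    collection.get(include=["embeddings"])  # forces the vector segment to load
    client.delete_collection(name)
    time.sleep(1)
```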

As an added bonus, the LRU cache is not thread safe (similar to #3334), which is easily demonstrable by running the above reproduction scenario in parallel with a small NOFILE.
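
A parallel variant of the same loop is enough to surface the race when the server is started with a small NOFILE (e.g. ulimit -n 256) so evictions actually happen; the worker body is the create/add/get/delete cycle from the sketch above:

```python
from concurrent.futures import ThreadPoolExecutor


def churn(iterations: int = 200) -> None:
    for _ in range(iterations):
        ...  # create a collection, add ~199 embeddings, get(include=["embeddings"]), delete it


with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(churn) for _ in range(8)]
    for future in futures:
        future.result()  # surface any exceptions raised by the racing workers
```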

The screenshots below demonstrate the effect of the defect when running Chroma (latest main as of 19-Dec-2024). With the instance limited to 4GB, creating and deleting collections of 199 docs (1536-dim embeddings) with 1 sec of sleep between each cycle, it took 25 mins to run out of memory.

[Screenshots: Chroma memory consumption over the reproduction run]

Versions

Chroma 0.4.0+, any python version, any OS

Relevant log output

No response


tazarov commented Dec 19, 2024

A similar leak also existed for the distributed segment manager, but it was resolved with #3243.
