Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: sqlite3.OperationalError: table embeddings_queue already exists #1209

Closed
pseudotensor opened this issue Oct 6, 2023 · 14 comments
Closed
Labels
bug Something isn't working

Comments

@pseudotensor
Copy link

pseudotensor commented Oct 6, 2023

What happened?

Traceback (most recent call last):
  File "/home/jon/h2ogpt2/src/gpt_langchain.py", line 5001, in update_user_db
    return _update_user_db(file, db1s=db1s,
  File "/home/jon/h2ogpt2/src/gpt_langchain.py", line 5229, in _update_user_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding,
  File "/home/jon/h2ogpt2/src/gpt_langchain.py", line 166, in get_db
    api = chromadb.PersistentClient(path=persist_directory)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/__init__.py", line 106, in PersistentClient
    return Client(settings)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/__init__.py", line 145, in Client
    system.start()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/config.py", line 269, in start
    component.start()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py", line 93, in start
    self.initialize_migrations()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/migrations.py", line 128, in initialize_migrations
    self.apply_migrations()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/migrations.py", line 156, in apply_migrations
    self.apply_migration(cur, migration)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py", line 210, in apply_migration
    cur.executescript(migration["sql"])
sqlite3.OperationalError: table embeddings_queue already exists

Versions

chromadb==0.4.10

Relevant log output

See this randomly when multiple databases are being accessed by different users.  A simple stress test with only 3 clients at same time hits it randomly as well.

Is it not allowed to have multiple clients access in different threads the same database?

@pseudotensor pseudotensor added the bug Something isn't working label Oct 6, 2023
@HammadB
Copy link
Collaborator

HammadB commented Oct 6, 2023

It is recommended to use a single client https://docs.trychroma.com/usage-guide as mentioned under "Use a single-client at a time". A single client is thread-safe. Could you amend your design so that the client is shared between threads?

We should add better error messaging here.

@pseudotensor
Copy link
Author

That makes it more difficult to use or design, because then an additional global state has to be maintained for each such database that multiple users would access. It would be better if chroma handled this itself, especially as it fails under this situation. Why make the user of chroma manage the client state when chroma could do it?

@HammadB
Copy link
Collaborator

HammadB commented Oct 6, 2023

We are actively working on the concept of "Database" as a namespace. I am not sure what your application logic is, but if you share more context perhaps I can suggest an interim workaround.

@pseudotensor
Copy link
Author

Hi, the project is https://github.com/h2oai/h2ogpt . Chroma is key and central to the project. I moved 2 weeks ago to latest chroma, and handled migration directly within h2oGPT so old databases use old chroma by default, since migration is slow.

As for h2oGPT, the design is as follows:

  • Gradio (so every user may come and go)
  • Each user can have their own database, but also there are shared databases too (see for context: [Feature Request]: control duckdb threads #869 (comment))
  • All these users may access the shared database. I maintain a database object from langchain, and each user could access the db object at any time independently in different threads.

@HammadB
Copy link
Collaborator

HammadB commented Oct 6, 2023

I see, could collection-level namespacing work for you if we allow'ed something like client.get_collection(where={collection_metadata_you_filter_on: "id"}?

Let me also do some digging if we can unblock this specific way of working. It basically amounts to making that migration idempotent if there aren't any other logical issues.

@tazarov
Copy link
Contributor

tazarov commented Oct 12, 2023

@HammadB, maybe we can serialise access at SegmentAPI level?

@pseudotensor, is it safe to assume that Gradio uses threading as opposed to sub-process for your workload?

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Oct 12, 2023
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Oct 12, 2023
@tazarov
Copy link
Contributor

tazarov commented Oct 12, 2023

@pseudotensor, try pulling and running your tests against this - https://github.com/amikos-tech/chroma-core/tree/feature/local-client-thread-safety

I used the following to test multithreading with PersistentClients

import random
import threading
from time import sleep
import chromadb
import shutil

# Delete the database if it exists
try:
    shutil.rmtree("chroma-test")
except:
    pass

def worker():
    client = chromadb.PersistentClient(path="chroma-test")
    client.heartbeat()
    sleep(random.randint(1,10))
    client.get_or_create_collection("test")
    sleep(random.randint(1, 10))
    try:
        client.delete_collection("test")
    except ValueError:
        pass




# Creating threads
thread1 = threading.Thread(target=worker)
thread2 = threading.Thread(target=worker)
thread3 = threading.Thread(target=worker)
thread4 = threading.Thread(target=worker)
# Starting threads
thread1.start()
thread2.start()
thread3.start()
thread4.start()
# Wait for both threads to complete
thread1.join()
thread2.join()
thread3.join()
thread4.join()

@HammadB
Copy link
Collaborator

HammadB commented Oct 12, 2023

The client is thread safe - that is different than having multiple clients.

@pseudotensor
Copy link
Author

@HammadB, maybe we can serialise access at SegmentAPI level?

@pseudotensor, is it safe to assume that Gradio uses threading as opposed to sub-process for your workload?

Hi, Yes gradio uses asyncio and I launch threads to manage access multiple models connected to distant inference servers. So at any one time the chroma package may be asked to access any number of databases by any number of threads per database.

@pseudotensor
Copy link
Author

@pseudotensor, try pulling and running your tests against this - https://github.com/amikos-tech/chroma-core/tree/feature/local-client-thread-safety

I used the following to test multithreading with PersistentClients

import random
import threading
from time import sleep
import chromadb
import shutil

# Delete the database if it exists
try:
    shutil.rmtree("chroma-test")
except:
    pass

def worker():
    client = chromadb.PersistentClient(path="chroma-test")
    client.heartbeat()
    sleep(random.randint(1,10))
    client.get_or_create_collection("test")
    sleep(random.randint(1, 10))
    try:
        client.delete_collection("test")
    except ValueError:
        pass




# Creating threads
thread1 = threading.Thread(target=worker)
thread2 = threading.Thread(target=worker)
thread3 = threading.Thread(target=worker)
thread4 = threading.Thread(target=worker)
# Starting threads
thread1.start()
thread2.start()
thread3.start()
thread4.start()
# Wait for both threads to complete
thread1.join()
thread2.join()
thread3.join()
thread4.join()

I think you are recommending a client per user perhaps? But even then, a single user might access (via asyncio leading into threads) multiple databases. So it is possible to create a client for every one case, but it feels like something chroma should manage.

That is, if I have a database object from langchain, it seems that object itself should be thread safe, not just some particular client instance that langchain does not directly give users access to.

@tazarov
Copy link
Contributor

tazarov commented Oct 13, 2023

@pseudotensor, we dug a little deeper on this to fully understand it, as the PR I hastily raised was a heavy-handed serialization which is a bit blunt.

The bottom line is if you have a setup like this:

image

You should be fine with the change we plan.


EDIT: Worth asking whether you keep references to the user's Chroma Persistent Client across the user session, which may comprise multiple requests.

HammadB pushed a commit that referenced this issue Oct 24, 2023
…tp clients (#1270)

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
	 - Removed mutable default settings and headers, as this seems

## Test plan
*How are these changes tested?*

- [x] Tests pass locally with `pytest` for python

## Documentation Changes
N/A

This issue partly resolves - #1209 (for multiple local persistent
clients with different persist paths)
@HammadB
Copy link
Collaborator

HammadB commented Oct 25, 2023

@pseudotensor We just shipped #1244 - which removes the restriction on multiple clients and also introduced tenant/database abstractions. I think this should help with your use case.

@pseudotensor
Copy link
Author

@HammadB Thanks! I'll review and see if I understand the changes and how it helps

@HammadB
Copy link
Collaborator

HammadB commented Dec 4, 2023

Closing this as stale - as the underlying issue - multiple persistent clients is solved /supported now.

@HammadB HammadB closed this as completed Dec 4, 2023
@Cirr0e Cirr0e mentioned this issue Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants