[Bug]: sqlite3.OperationalError: table embeddings_queue already exists #1209

pseudotensor · 2023-10-06T09:41:03Z

What happened?

Traceback (most recent call last):
  File "/home/jon/h2ogpt2/src/gpt_langchain.py", line 5001, in update_user_db
    return _update_user_db(file, db1s=db1s,
  File "/home/jon/h2ogpt2/src/gpt_langchain.py", line 5229, in _update_user_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding,
  File "/home/jon/h2ogpt2/src/gpt_langchain.py", line 166, in get_db
    api = chromadb.PersistentClient(path=persist_directory)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/__init__.py", line 106, in PersistentClient
    return Client(settings)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/__init__.py", line 145, in Client
    system.start()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/config.py", line 269, in start
    component.start()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py", line 93, in start
    self.initialize_migrations()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/migrations.py", line 128, in initialize_migrations
    self.apply_migrations()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/migrations.py", line 156, in apply_migrations
    self.apply_migration(cur, migration)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py", line 210, in apply_migration
    cur.executescript(migration["sql"])
sqlite3.OperationalError: table embeddings_queue already exists

Versions

chromadb==0.4.10

Relevant log output

I see this randomly when multiple databases are being accessed by different users. A simple stress test with only 3 clients at the same time hits it randomly as well.

Is it not allowed to have multiple clients access the same database from different threads?


HammadB · 2023-10-06T14:41:25Z

It is recommended to use a single client https://docs.trychroma.com/usage-guide as mentioned under "Use a single-client at a time". A single client is thread-safe. Could you amend your design so that the client is shared between threads?

We should add better error messaging here.
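A minimal sketch of what sharing one client between threads could look like on the application side. The registry, `get_shared_client`, and the `factory` parameter are hypothetical names for illustration, not part of Chroma; only `chromadb.PersistentClient(path=...)` comes from the traceback above.

```python
import os
import threading
from typing import Any, Callable, Dict, Optional

# Hypothetical process-wide registry: one client per persist directory.
_clients: Dict[str, Any] = {}
_clients_lock = threading.Lock()


def _default_factory(path: str) -> Any:
    # Imported lazily so the registry itself does not require chromadb.
    import chromadb

    return chromadb.PersistentClient(path=path)


def get_shared_client(path: str, factory: Optional[Callable[[str], Any]] = None) -> Any:
    """Return the single client for `path`, creating it on first use."""
    key = os.path.abspath(path)  # treat "db" and "./db" as the same database
    with _clients_lock:
        client = _clients.get(key)
        if client is None:
            client = (factory or _default_factory)(path)
            _clients[key] = client
        return client
```

Every thread would then call `get_shared_client(persist_directory)` instead of constructing its own `PersistentClient`, so the migrations run only once per directory within the process.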

pseudotensor · 2023-10-06T19:52:31Z

That makes it more difficult to use or design, because then an additional global state has to be maintained for each such database that multiple users would access. It would be better if chroma handled this itself, especially as it fails under this situation. Why make the user of chroma manage the client state when chroma could do it?

HammadB · 2023-10-06T19:58:40Z

We are actively working on the concept of "Database" as a namespace. I am not sure what your application logic is, but if you share more context perhaps I can suggest an interim workaround.

pseudotensor · 2023-10-06T20:11:36Z

Hi, the project is https://github.com/h2oai/h2ogpt . Chroma is key and central to the project. I moved 2 weeks ago to latest chroma, and handled migration directly within h2oGPT so old databases use old chroma by default, since migration is slow.

As for h2oGPT, the design is as follows:

- Gradio (so every user may come and go)
- Each user can have their own database, but there are also shared databases (see for context: [Feature Request]: control duckdb threads #869 (comment))
- All these users may access the shared database. I maintain a database object from langchain, and each user could access the db object at any time independently in different threads.

HammadB · 2023-10-06T20:23:49Z

I see, could collection-level namespacing work for you if we allowed something like client.get_collection(where={collection_metadata_you_filter_on: "id"})?

Let me also do some digging if we can unblock this specific way of working. It basically amounts to making that migration idempotent if there aren't any other logical issues.
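To illustrate what "idempotent" means here, a small sketch using only the standard `sqlite3` module. The table definition is a made-up stand-in, since the actual Chroma migration SQL is not shown in this thread; the point is only the difference between `CREATE TABLE` and `CREATE TABLE IF NOT EXISTS` when the same script is applied twice.

```python
import sqlite3

# Hypothetical schema: the real migration SQL is not quoted in this issue.
NON_IDEMPOTENT = "CREATE TABLE embeddings_queue (seq_id INTEGER PRIMARY KEY);"
IDEMPOTENT = "CREATE TABLE IF NOT EXISTS embeddings_queue (seq_id INTEGER PRIMARY KEY);"


def apply_twice(sql: str) -> str:
    """Run the same migration script twice and report what happened."""
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.executescript(sql)  # first client applies the migration
        try:
            cur.executescript(sql)  # second client repeats it
        except sqlite3.OperationalError as exc:
            return f"failed: {exc}"
        return "ok"
    finally:
        conn.close()


first = apply_twice(NON_IDEMPOTENT)
second = apply_twice(IDEMPOTENT)
print(first)
print(second)
```

The first call reproduces the same class of `OperationalError` as in the traceback; the second succeeds because repeating the statement is a no-op.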

tazarov · 2023-10-12T15:40:21Z

@HammadB, maybe we can serialise access at SegmentAPI level?

@pseudotensor, is it safe to assume that Gradio uses threading as opposed to sub-process for your workload?
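The idea of serialising access can be sketched generically as a proxy that holds one lock around every method call. This is an illustration only: `SerializedProxy` and `Counter` are hypothetical names, and nothing here reflects how SegmentAPI is actually implemented.

```python
import threading
import time


class SerializedProxy:
    """Forward every method call to `target` while holding one lock."""

    def __init__(self, target):
        self._target = target
        self._lock = threading.RLock()

    def __getattr__(self, name):
        # Only reached for names not set on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def locked_call(*args, **kwargs):
            with self._lock:
                return attr(*args, **kwargs)

        return locked_call


class Counter:
    """Deliberately racy stand-in for a component that is not thread-safe."""

    def __init__(self):
        self.value = 0

    def increment(self):
        current = self.value
        time.sleep(0)  # yield to other threads between read and write
        self.value = current + 1


proxy = SerializedProxy(Counter())
threads = [
    threading.Thread(target=lambda: [proxy.increment() for _ in range(100)])
    for _ in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The trade-off is that all calls through the proxy run one at a time, which is safe but removes any concurrency inside the wrapped object.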

Refs: chroma-core#1209, chroma-core#1234

tazarov · 2023-10-12T19:27:14Z

@pseudotensor, try pulling and running your tests against this - https://github.com/amikos-tech/chroma-core/tree/feature/local-client-thread-safety

I used the following to test multithreading with PersistentClients

import random
import shutil
import threading
from time import sleep

import chromadb

# Delete the database if it exists
shutil.rmtree("chroma-test", ignore_errors=True)


def worker():
    # Each thread deliberately creates its own PersistentClient on the same path
    client = chromadb.PersistentClient(path="chroma-test")
    client.heartbeat()
    sleep(random.randint(1, 10))
    client.get_or_create_collection("test")
    sleep(random.randint(1, 10))
    try:
        client.delete_collection("test")
    except ValueError:
        # Another thread already deleted the collection
        pass


# Create and start four worker threads
threads = [threading.Thread(target=worker) for _ in range(4)]
for thread in threads:
    thread.start()
# Wait for all threads to complete
for thread in threads:
    thread.join()
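One caveat with plain `threading.Thread`: an exception raised inside a worker is printed but does not stop or fail the script. A possible variant, sketched below with the hypothetical helper `run_workers`, uses `concurrent.futures` from the standard library to collect every exception so a test can assert on them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_workers(worker: Callable[[], None], count: int = 4) -> List[BaseException]:
    """Run `worker` in `count` threads and return every exception raised."""
    errors: List[BaseException] = []
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = [pool.submit(worker) for _ in range(count)]
        for future in futures:
            exc = future.exception()  # blocks until this worker finishes
            if exc is not None:
                errors.append(exc)
    return errors
```

With the `worker` function from the script above, `errors = run_workers(worker)` would return an empty list on a clean run and the raised exceptions otherwise.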

HammadB · 2023-10-12T20:13:06Z

The client is thread safe - that is different than having multiple clients.

pseudotensor · 2023-10-13T05:46:58Z

> @HammadB, maybe we can serialise access at SegmentAPI level?
>
> @pseudotensor, is it safe to assume that Gradio uses threading as opposed to sub-process for your workload?

Hi, yes, Gradio uses asyncio, and I launch threads to manage access to multiple models connected to distant inference servers. So at any one time the chroma package may be asked to access any number of databases by any number of threads per database.

pseudotensor · 2023-10-13T05:49:12Z

> @pseudotensor, try pulling and running your tests against this - https://github.com/amikos-tech/chroma-core/tree/feature/local-client-thread-safety

I think you are recommending a client per user perhaps? But even then, a single user might access (via asyncio leading into threads) multiple databases. So it is possible to create a client for each case, but it feels like something chroma should manage.

That is, if I have a database object from langchain, it seems that object itself should be thread safe, not just some particular client instance that langchain does not directly give users access to.

tazarov · 2023-10-13T07:01:41Z

@pseudotensor, we dug a little deeper on this to fully understand it, as the PR I hastily raised was a heavy-handed serialization which is a bit blunt.

The bottom line is if you have a setup like this:

You should be fine with the change we plan.

EDIT: Worth asking whether you keep references to the user's Chroma Persistent Client across the user session, which may comprise multiple requests.

…tp clients (#1270)

## Description of changes

*Summarize the changes made by this PR.*
- Improvements & Bug fixes
  - Removed mutable default settings and headers, as this seems

## Test plan

*How are these changes tested?*
- [x] Tests pass locally with `pytest` for python

## Documentation Changes

N/A

This issue partly resolves - #1209 (for multiple local persistent clients with different persist paths)

HammadB · 2023-10-25T16:40:28Z

@pseudotensor We just shipped #1244 - which removes the restriction on multiple clients and also introduced tenant/database abstractions. I think this should help with your use case.

pseudotensor · 2023-10-25T17:15:56Z

@HammadB Thanks! I'll review and see if I understand the changes and how it helps

HammadB · 2023-12-04T19:52:47Z

Closing this as stale, since the underlying issue (multiple persistent clients) is solved/supported now.

pseudotensor added the bug label Oct 6, 2023

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Oct 12, 2023

feat: Thread-safety for persistent and ephemeral clients

4c0417c

Refs: chroma-core#1209, chroma-core#1234

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Oct 12, 2023

feat: Local clients thread safety

759b306

Refs: chroma-core#1209, chroma-core#1234

tazarov mentioned this issue Oct 12, 2023

[ENH]: Local clients thread safety #1240

Closed

1 task

tazarov mentioned this issue Oct 22, 2023

[BUG]: Removed mutable default values in Ephemeral, Persistent and Http clients #1270

Merged

1 task

HammadB closed this as completed Dec 4, 2023

Cirr0e mentioned this issue Dec 17, 2024

[Bug]: #3292

Closed
