[Bug]: #3292

4AM-GodVZz · 2024-12-12T10:53:05Z

What happened?

I deploy the Chroma vector service behind an HTTP interface. When too much vector data accumulates, I need to delete chroma.sqlite3 and the other files in the persist directory. However, after deleting the files, the next call to the interface fails with the following error: OperationalError: attempt to write a readonly database

import json
import requests
import re
from log import LOGGER
from flask import request, Flask
import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from urllib.parse import urlparse


DB_PATH = "/opt/model/db"
DEVICE = 'cuda'
BERT_MODEL = "/opt/model/models--infgrad--stella-base-zh-v3-1792d"
EMB_FUNC = HuggingFaceEmbeddings(model_name=BERT_MODEL,
                                 model_kwargs={'device': DEVICE},
                                 encode_kwargs={'normalize_embeddings': True}
                                 )

def add_files_to_db(raw_docs, collection_name):
    documents = [Document(page_content=raw_docs, metadata={'source': collection_name})]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
    docs = text_splitter.split_documents(documents)
    db = Chroma.from_documents(docs, EMB_FUNC, persist_directory=DB_PATH, collection_name=collection_name)
    db.persist()

def query_db(db: Chroma, query: str = ""):
    query_emb = EMB_FUNC.embed_query(query)
    docs = db.similarity_search_by_vector(query_emb, k = 15)
    docs_dict = [doc.page_content for doc in docs]
    return docs_dict


app = Flask(__name__)
@app.route('/retrieval', methods=['POST'])
def coder_write():
    try:
        param = request.get_json()
    except:
        param = request.form
    query = param['query']
    parse_doc = param['doc']
    flag = param['flag']
    LOGGER.info(f"flag value: {flag}")
    session_id = param['session_id']
    collection_name = session_id
    if flag == '1':
        add_files_to_db(parse_doc, collection_name)
        db = Chroma(persist_directory=DB_PATH, embedding_function=EMB_FUNC, collection_name=collection_name)
        docs = query_db(db, query)
        LOGGER.info(f"docs after similarity search: {docs}")
    else:
        db = Chroma(persist_directory=DB_PATH, embedding_function=EMB_FUNC, collection_name=collection_name)
        docs = query_db(db, query)
        LOGGER.info(f"docs after similarity search (second Q&A round): {docs}")

    search = [(i, j) for i, j in enumerate(docs)]
    r = ''
    for i, j in search:
        r += f"""\n（{i+1}）{j}"""
    result = f"{r}"
    LOGGER.info(f"final result returned by retrieval: {result}")
    return {"status": 1, "result": result}


if __name__ == "__main__":
    IP = '0.0.0.0'
    app.run(port='35010', host = IP, threaded = False)

Versions

Chroma 0.5.23

Relevant log output

File /opt/miniconda/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py:243, in SqlEmbeddingsQueue.submit_embeddings(self, collection_id, embeddings)
    240 # The returning clause does not guarantee order, so we need to do reorder
    241 # the results. https://www.sqlite.org/lang_returning.html
    242 sql = f"{sql} RETURNING seq_id, id"  # Pypika doesn't support RETURNING
--> 243 results = cur.execute(sql, params).fetchall()
    244 # Reorder the results
    245 seq_ids = [cast(SeqId, None)] * len(
    246     results
    247 )  # Lie to mypy: https://stackoverflow.com/questions/76694215/python-type-casting-when-preallocating-list

OperationalError: attempt to write a readonly database

tazarov · 2024-12-16T13:52:20Z

@4AM-GodVZz, how do you run the above Flask app? If you use multiple workers, two concurrent calls may end up running in different workers. Chroma is not process-safe, and uvicorn and gunicorn run each worker as a separate process, so concurrent calls will likely end in the error above.

The root cause is that sqlite3 is not process-safe and one process holds an exclusive lock on the DB.

If, on the other hand, you are not running with multiple workers, is it possible that another process (outside of the Flask app) is accessing Chroma's directory, DB_PATH?

Cirr0e · 2024-12-17T04:33:43Z

Hey! I see what's happening here. The issue occurs because of how Chroma handles database access and file permissions. Let me help you resolve this.

First, there are a few important things to note:

Chroma is thread-safe but not process-safe
After deleting the SQLite files, proper cleanup and initialization are crucial
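
To see why deleting the files matters, here is a minimal sketch that uses only Python's built-in sqlite3 module rather than Chroma itself. It assumes a POSIX system and SQLite's default rollback-journal mode, and the helper name write_after_delete is made up for illustration. It shows one plausible way the reported error can arise: the process still holds a connection to a database file that has since been deleted. Whether this is exactly what happens inside Chroma is not confirmed in this thread.

```python
import os
import shutil
import sqlite3
import tempfile
from typing import Optional


def write_after_delete() -> Optional[str]:
    """Delete a SQLite file under an open connection, then try to write.

    Returns the error message if the write fails, or None if it succeeds.
    (Hypothetical helper, for illustration only.)
    """
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo.sqlite3")

    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, body TEXT)")
        conn.commit()

        # Simulate wiping the persist directory while the service is still
        # running and still holds its connection (POSIX allows unlinking
        # a file that is open).
        os.remove(path)

        try:
            conn.execute("INSERT INTO items (body) VALUES ('x')")
            conn.commit()
        except sqlite3.OperationalError as exc:
            return str(exc)
        return None
    finally:
        conn.close()
        shutil.rmtree(workdir, ignore_errors=True)


if __name__ == "__main__":
    print(write_after_delete())
```

If the write fails here, restarting the service after deleting the files (so that every connection is reopened against the new files) is the simplest thing to try before changing any code.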

Here's how to fix this:

First, ensure proper cleanup when deleting the database:

import shutil
import os

def clean_chroma_db():
    if os.path.exists(DB_PATH):
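        # Best effort: flush and drop this process's reference to the store.
        # Note: this does not guarantee that open SQLite connections are closed.
        try:
            db = Chroma(persist_directory=DB_PATH, embedding_function=EMB_FUNC)
            db.persist()
            del db
        except Exception:
            pass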
        # Remove the directory
        shutil.rmtree(DB_PATH)
        # Ensure the base directory exists
        os.makedirs(DB_PATH, exist_ok=True)

Modify your Flask application to use a single client instance:

# At the module level
_db_client = None

def get_db_client():
    global _db_client
    if _db_client is None:
        _db_client = Chroma(persist_directory=DB_PATH, 
                           embedding_function=EMB_FUNC)
    return _db_client

@app.route('/retrieval', methods=['POST'])
def coder_write():
    # ... your existing code ...
    if flag == '1':
        add_files_to_db(parse_doc, collection_name)
        db = get_db_client()  # Use the singleton client
        docs = query_db(db, query)
    else:
        db = get_db_client()  # Use the singleton client
        docs = query_db(db, query)
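
One caveat with the singleton above: the original handler opens a different collection per session_id, while a single shared Chroma object is bound to one collection. A possible variation is to cache one store per collection name. The sketch below is deliberately generic so that it runs without Chroma installed; the class name PerCollectionCache and its factory argument are hypothetical, not part of Chroma or LangChain.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class PerCollectionCache(Generic[T]):
    """Create at most one store object per collection name (hypothetical helper)."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory
        self._items: Dict[str, T] = {}
        self._lock = threading.Lock()

    def get(self, collection_name: str) -> T:
        # The lock keeps two threads from building the same store twice.
        with self._lock:
            if collection_name not in self._items:
                self._items[collection_name] = self._factory(collection_name)
            return self._items[collection_name]

    def clear(self) -> None:
        # Call this after wiping the persist directory so that stale
        # store objects are not reused.
        with self._lock:
            self._items.clear()


# Assumed wiring in the Flask app, using the names from the original report:
#
#   stores = PerCollectionCache(
#       lambda name: Chroma(persist_directory=DB_PATH,
#                           embedding_function=EMB_FUNC,
#                           collection_name=name))
#   db = stores.get(collection_name)
```

Clearing the cache only drops the Python references. It is an assumption, not something confirmed here, that this alone is enough to release the underlying SQLite connection.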

Important considerations for deployment:

If you're using multiple workers (gunicorn/uvicorn), set workers=1
For scaling, consider using a connection pool or a separate Chroma server instance
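
If the app is served with gunicorn, the single-worker advice might look like the config sketch below. The thread does not say how the app is actually launched (the posted script uses Flask's built-in server), so the file name gunicorn.conf.py and the app:app module path mentioned afterwards are assumptions.

```python
# gunicorn.conf.py (hypothetical file; gunicorn config files are plain Python)

# Address and port taken from the app.run(...) call in the original report.
bind = "0.0.0.0:35010"

# One worker process, so only one process opens the Chroma persist directory.
workers = 1

# Mirrors threaded=False in the original script. Raising this relies on the
# claim above that Chroma is thread-safe.
threads = 1
```

Assuming the Flask object lives in app.py as app, this would be started with: gunicorn -c gunicorn.conf.py app:app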

The key changes here:

Using a singleton pattern for the Chroma client
Proper cleanup when deleting the database
Ensuring proper file permissions after recreation

Based on similar issues (see chromadb#1441), this approach should resolve the readonly database error.

If you're still seeing issues, could you let me know:

Are you running multiple workers in your Flask deployment?
What are the file permissions on your DB_PATH directory after deletion?

References:

Issue #1441, [Bug]: sqlite3.OperationalError: attempt to write a readonly database (process safety and client management)
Issue #1209, [Bug]: sqlite3.OperationalError: table embeddings_queue already exists (multiple client access patterns)
Chroma documentation on client usage: https://docs.trychroma.com/usage-guide

Let me know if this helps or if you need any clarification!

4AM-GodVZz added the bug label Dec 12, 2024

itaismith closed this as completed Dec 17, 2024
