[Bug]: #3292

Closed
4AM-GodVZz opened this issue Dec 12, 2024 · 2 comments
Labels
bug Something isn't working

Comments


4AM-GodVZz commented Dec 12, 2024

What happened?

When I deploy the Chroma vector service behind an HTTP interface, too much vector data accumulates, so I need to delete chroma.sqlite3 and the other files in the persist_directory. However, after deleting the files, calling the interface again produces the following error: OperationalError: attempt to write a readonly database

import json
import requests
import re
from log import LOGGER
from flask import request, Flask
import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from urllib.parse import urlparse


DB_PATH = "/opt/model/db"
DEVICE = 'cuda'
BERT_MODEL = "/opt/model/models--infgrad--stella-base-zh-v3-1792d"
EMB_FUNC = HuggingFaceEmbeddings(model_name=BERT_MODEL,
                                 model_kwargs={'device': DEVICE},
                                 encode_kwargs={'normalize_embeddings': True}
                                 )

def add_files_to_db(raw_docs, collection_name):
    documents = [Document(page_content=raw_docs, metadata={'source': collection_name})]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
    docs = text_splitter.split_documents(documents)
    db = Chroma.from_documents(docs, EMB_FUNC, persist_directory=DB_PATH, collection_name=collection_name)
    db.persist()

def query_db(db: Chroma, query: str = ""):
    query_emb = EMB_FUNC.embed_query(query)
    docs = db.similarity_search_by_vector(query_emb, k = 15)
    docs_dict = [doc.page_content for doc in docs]
    return docs_dict


app = Flask(__name__)
@app.route('/retrieval', methods=['POST'])
def coder_write():
    try:
        param = request.get_json()
    except:
        param = request.form
    query = param['query']
    parse_doc = param['doc']
    flag = param['flag']
    LOGGER.info(f"flag value: {flag}")
    session_id = param['session_id']
    collection_name = session_id
    if flag == '1':
        add_files_to_db(parse_doc, collection_name)
        db = Chroma(persist_directory=DB_PATH, embedding_function=EMB_FUNC, collection_name=collection_name)
        docs = query_db(db, query)
        LOGGER.info(f"docs after similarity search: {docs}")
    else:
        db = Chroma(persist_directory=DB_PATH, embedding_function=EMB_FUNC, collection_name=collection_name)
        docs = query_db(db, query)
        LOGGER.info(f"docs after similarity search (second round of Q&A): {docs}")

    search = [(i, j) for i, j in enumerate(docs)]
    r = ''
    for i, j in search:
        r += f"""\n{i+1}{j}"""
    result = f"{r}"
    LOGGER.info(f"final result returned by retrieval: {result}")
    return {"status": 1, "result": result}


if __name__ == "__main__":
    IP = '0.0.0.0'
    app.run(port='35010', host = IP, threaded = False)

Versions

Chroma 0.5.23

Relevant log output

File /opt/miniconda/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py:243, in SqlEmbeddingsQueue.submit_embeddings(self, collection_id, embeddings)
    240 # The returning clause does not guarantee order, so we need to do reorder
    241 # the results. https://www.sqlite.org/lang_returning.html
    242 sql = f"{sql} RETURNING seq_id, id"  # Pypika doesn't support RETURNING
--> 243 results = cur.execute(sql, params).fetchall()
    244 # Reorder the results
    245 seq_ids = [cast(SeqId, None)] * len(
    246     results
    247 )  # Lie to mypy: https://stackoverflow.com/questions/76694215/python-type-casting-when-preallocating-list

OperationalError: attempt to write a readonly database
4AM-GodVZz added the bug label Dec 12, 2024
Contributor

tazarov commented Dec 16, 2024

@4AM-GodVZz, how do you run the above Flask app? If you use multiple workers, two concurrent calls may run in different workers. Chroma is not process-safe, and uvicorn and gunicorn workers are separate processes, so concurrent calls will likely end up in the error above.

The root cause is sqlite3 not being process-safe and one process holding an exclusive lock on the DB.

If, on the other hand, you are not running with multiple workers, is it possible that another process (outside of the Flask app) is accessing Chroma's directory, DB_PATH?


Cirr0e commented Dec 17, 2024

Hey! I see what's happening here. The issue occurs because of how Chroma handles database access and file permissions. Let me help you resolve this.

First, there are a few important things to note:

  1. Chroma is thread-safe but not process-safe
  2. After deleting the SQLite files, proper cleanup and initialization are crucial
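
One plausible mechanism behind point 2, offered as an assumption rather than something confirmed in this thread: if the process still holds an open SQLite connection when the database file is deleted, later writes through that connection can fail with this same message. The sketch below tries to reproduce that with the standard `sqlite3` module alone on a POSIX system. The file name `demo.sqlite3`, the table `t`, and the function name are made up for illustration; nothing here touches Chroma.

```python
import os
import sqlite3
import tempfile


def write_after_delete():
    """Return the error text raised when writing through a connection
    whose database file was deleted underneath it, or None if no error."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite3")  # hypothetical file name
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.execute("INSERT INTO t VALUES (1)")
            conn.commit()

            # Simulate deleting the database file while the process
            # that opened it is still running (POSIX allows this).
            os.remove(path)

            try:
                conn.execute("INSERT INTO t VALUES (2)")
                conn.commit()
            except sqlite3.OperationalError as exc:
                return str(exc)
            return None
        finally:
            conn.close()


if __name__ == "__main__":
    print(write_after_delete())
```

If this is what happens in the Flask app, restarting the process after wiping the directory (so that no stale connection survives) would be the simplest thing to try.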

Here's how to fix this:

  1. First, ensure proper cleanup when deleting the database:
import shutil
import os

def clean_chroma_db():
    if os.path.exists(DB_PATH):
        # Best-effort flush of pending data before removal. Note that this
        # opens a new client; it does not close connections held elsewhere.
        try:
            db = Chroma(persist_directory=DB_PATH, embedding_function=EMB_FUNC)
            db.persist()
            del db
        except Exception:
            pass
        # Remove the directory
        shutil.rmtree(DB_PATH)
        # Ensure the base directory exists
        os.makedirs(DB_PATH, exist_ok=True)
  2. Modify your Flask application to reuse a single client instance per collection:
# At the module level: one Chroma wrapper per collection, created lazily
_db_clients = {}

def get_db_client(collection_name):
    # Pass collection_name through; otherwise the default collection is queried
    if collection_name not in _db_clients:
        _db_clients[collection_name] = Chroma(persist_directory=DB_PATH,
                                              embedding_function=EMB_FUNC,
                                              collection_name=collection_name)
    return _db_clients[collection_name]

@app.route('/retrieval', methods=['POST'])
def coder_write():
    # ... your existing code ...
    if flag == '1':
        add_files_to_db(parse_doc, collection_name)
    # Both branches reuse the cached client for this collection
    db = get_db_client(collection_name)
    docs = query_db(db, query)
  3. Important considerations for deployment:
  • If you're using multiple workers (gunicorn/uvicorn), set workers=1
  • For scaling, consider using a connection pool or a separate Chroma server instance
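
For the first bullet, a minimal sketch of a gunicorn configuration file. It assumes the Flask app lives in a module named `app.py`; that module name is an assumption, not something the issue states. `workers`, `threads`, and `bind` are standard gunicorn settings, and the port matches the one in the original script.

```python
# gunicorn.conf.py (sketch); start with: gunicorn -c gunicorn.conf.py app:app
# "app:app" assumes the Flask object `app` is defined in app.py.

# One worker process, so only a single process ever opens DB_PATH.
workers = 1

# One thread per worker, mirroring threaded=False in the original app.run call.
threads = 1

# Same host and port as the original script.
bind = "0.0.0.0:35010"
```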

The key changes here:

  • Using a singleton pattern for the Chroma client
  • Proper cleanup when deleting the database
  • Ensuring proper file permissions after recreation
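
The singleton idea can be sketched without Chroma at all. The version below is an illustration under assumptions: `make_client` is a hypothetical stand-in for the `Chroma(...)` constructor, and the cache is keyed by collection name because the original app uses one collection per `session_id`. The lock only matters if the server handles requests on several threads.

```python
import threading

_clients = {}
_clients_lock = threading.Lock()


def make_client(collection_name):
    """Hypothetical stand-in for Chroma(persist_directory=..., collection_name=...)."""
    return {"collection": collection_name}


def get_client(collection_name, factory=make_client):
    """Return the cached client for a collection, creating it on first use."""
    with _clients_lock:
        if collection_name not in _clients:
            _clients[collection_name] = factory(collection_name)
        return _clients[collection_name]


def reset_clients():
    """Drop all cached clients, e.g. right after the on-disk data was deleted."""
    with _clients_lock:
        _clients.clear()
```

Calling `reset_clients()` after wiping the directory means the next request builds a fresh client instead of reusing one created before the deletion. Whether dropping the wrapper object is enough to release Chroma's own SQLite connection is not established in this thread.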

Based on similar issues (see chromadb#1441), this approach should resolve the readonly database error.

If you're still seeing issues, could you let me know:

  1. Are you running multiple workers in your Flask deployment?
  2. What are the file permissions on your DB_PATH directory after deletion?
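
To answer the second question, something like the following could be run against the directory. It is a sketch using only the standard library; `describe_path` is a name made up here, and the `chroma.sqlite3` file name is taken from the issue description.

```python
import os
import stat


def describe_path(path):
    """Summarize existence, mode bits, and write access for a path."""
    if not os.path.exists(path):
        return {"exists": False}
    st = os.stat(path)
    return {
        "exists": True,
        "mode": stat.filemode(st.st_mode),     # symbolic mode string
        "owner_uid": st.st_uid,
        "writable": os.access(path, os.W_OK),  # for the current user
    }


if __name__ == "__main__":
    db_path = "/opt/model/db"  # DB_PATH from the issue
    print(describe_path(db_path))
    print(describe_path(os.path.join(db_path, "chroma.sqlite3")))
```

Run it as the same user that runs the Flask app, since `os.access` reports access for the current user only.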

Let me know if this helps or if you need any clarification!
