I have checked that my calculated embeddings are deterministic. Therefore, when querying the collection with the same embedding, I would expect the nearest result to be the id that was just added (the distance should be zero or very near zero). I am querying like this:
After running this test for each of my embeddings, I found that 56 of approximately 50,000 embeddings return an incorrect file path.
Furthermore, this does not appear to be random: no matter how many times I rerun a query with an embedding that returns an incorrect result, it returns the same incorrect result each time.
I am unsure why this is happening and whether it is a potential bug.
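One way to narrow this down is to compare the index's answer against an exact, exhaustive search over the same vectors. If the exhaustive search ranks the added id first at distance zero while the collection query does not, the miss comes from the approximate index rather than from the embeddings. The sketch below is only an illustration: `squared_l2`, `exact_top_k`, and the toy vectors are hypothetical names and values, and it assumes the stored embeddings have already been exported into a plain dict and that the collection uses squared L2 distance.

```python
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k closest (id, distance) pairs by exhaustive search."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy vectors standing in for embeddings exported from the collection
# (hypothetical ids and values, not real data).
stored = {
    "doc-a": [0.1, 0.2, 0.3],
    "doc-b": [0.4, 0.1, 0.3],
    "doc-c": [0.9, 0.8, 0.7],
}

# Query with a vector that is itself in the store: it should rank first.
top = exact_top_k(stored["doc-a"], stored, k=2)
print(top)
```

An exhaustive scan over roughly 50,000 vectors is slow in pure Python but still feasible as a one-off check for the handful of failing ids.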
Here is my full test code:
import chromadb
from ollama import Client

# Parameters
file_path = 'ProgramData/5/63.txt'
model_name = "llama3.2"
COLLECTION_NAME = "POJ_DATASET_ollama_embedding"

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_collection(name=COLLECTION_NAME)

# Read the file content
with open(file_path, "r") as file:
    text_content = file.read()

# Generate embedding for file text
ollama_client = Client(host='http://localhost:11434')
embedding = ollama_client.embed(model=model_name, input=text_content)['embeddings']

# Add embedding to collection
try:
    collection.add(
        embeddings=embedding,
        ids=[file_path]
    )
    print("Added to collection")
    doc_count = collection.count()
    print(f"Total documents in collection after add: {doc_count}")
except Exception as e:
    print(f"Error adding to collection: {e}")

# Query the collection with the same embedding
try:
    results = collection.query(
        query_embeddings=embedding,
        n_results=5,
    )
except Exception as e:
    print(f"Error querying collection: {e}")
    raise  # results is undefined past this point, so stop here

result_paths = results["ids"][0]
result_distances = results["distances"][0]
print(f"Query path: {file_path}")
print(f"Result paths: {result_paths}")
print(f"Result distances: {result_distances}")
Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]
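The output above covers a single query. To tally how many ids fail the same self-match check across a whole collection, a small harness along these lines could be used. It is a sketch under assumptions: `find_self_match_failures` and `fake_top_ids` are hypothetical names, and the stand-in index deliberately cannot reach one id so that the failure path is exercised. With Chroma, the callable could instead wrap `collection.query(query_embeddings=[vector], n_results=1)["ids"][0]`.

```python
from typing import Callable, Dict, List, Sequence


def find_self_match_failures(
    embeddings: Dict[str, Sequence[float]],
    top_ids_for: Callable[[Sequence[float]], List[str]],
) -> List[str]:
    """Return ids whose own embedding does not come back as the top hit."""
    failures = []
    for doc_id, vector in embeddings.items():
        top_ids = top_ids_for(vector)
        if not top_ids or top_ids[0] != doc_id:
            failures.append(doc_id)
    return failures


# Toy embeddings (hypothetical values).
embeddings = {
    "doc-a": [0.0, 0.0],
    "doc-b": [1.0, 0.0],
    "doc-c": [0.0, 1.0],
}

# Stand-in index that can never return "doc-b", imitating a recall miss.
reachable = {k: v for k, v in embeddings.items() if k != "doc-b"}


def fake_top_ids(vector: Sequence[float]) -> List[str]:
    """Exhaustive nearest neighbour over the reachable subset only."""
    ranked = sorted(
        reachable,
        key=lambda i: sum((x - y) ** 2 for x, y in zip(vector, reachable[i])),
    )
    return ranked[:1]


failures = find_self_match_failures(embeddings, fake_top_ids)
print(failures)
```

Keeping the search behind a callable means the same harness can be pointed at the real collection or at an exact search, and the two failure lists compared.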
I think I'm having the same issue.
I loaded my vector DB with 60,000+ docs and their embeddings, using a custom embedding function.
When inspecting the DB, the embeddings look normal and .query returns accurate results with correct distances.
After compressing the folder (I'm using a persistent client) and transferring it to my local machine, all my embeddings are missing.
I have updated my Chroma version to 0.5.18 but this still results in the same output as before:
Total documents in collection after add: 51752
Query path: ProgramData/5/63.txt
Result paths: ['ProgramData/31/2034.txt', 'ProgramData/51/396.txt', 'ProgramData/22/544.txt', 'ProgramData/71/68.txt', 'ProgramData/48/666.txt']
Result distances: [0.3231008052825928, 0.3234265446662903, 0.3266177773475647, 0.3289269804954529, 0.3314428925514221]
As for your other suggestion, I already have ef_search set to 100. My collection is set up as:
What happened?
I have populated a Chroma collection with approximately 50,000 embeddings, which are pre-calculated using llama3.2 and then added as follows:
Versions
Python 3.11.5
chromadb 0.5.11
llama3.2
macOS Sonoma 14.1.1
Relevant log output