From fc4c8b547444efafa8ddf75fbd53a9b8e1a7eabe Mon Sep 17 00:00:00 2001
From: Anton Troynikov
Date: Wed, 4 Oct 2023 09:01:33 -0700
Subject: [PATCH] [BUG]: Chat with your documents example exhibits flaky retrieval (#1203)

## Description of changes

*Summarize the changes made by this PR.*

In https://github.com/chroma-core/chroma/issues/1115 @BChip noticed flaky retrieval performance.

The issue was difficult to replicate because of nondeterminism inherent in the HNSW graph construction on loading, but I was able to track it down through repeated testing.

The issue is caused by ingesting the empty lines in the documents, which make up 50% of the lines in each file. Every empty line produces the same embedding, which sometimes causes the HNSW graph to be degenerate.

The fix is to skip the empty lines. We should consider how to mitigate this in the future, since it is not easy to detect after the fact and is likely to be something users run into.

## Test plan

Failures no longer occur after manual invocation.

## Documentation Changes

N/A
---
 examples/chat_with_your_documents/load_data.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/examples/chat_with_your_documents/load_data.py b/examples/chat_with_your_documents/load_data.py
index 574a2127d10..b9ffdbb116a 100644
--- a/examples/chat_with_your_documents/load_data.py
+++ b/examples/chat_with_your_documents/load_data.py
@@ -22,6 +22,9 @@ def main(
             ):
                 # Strip whitespace and append the line to the documents list
                 line = line.strip()
+                # Skip empty lines
+                if len(line) == 0:
+                    continue
                 documents.append(line)
                 metadatas.append({"filename": filename, "line_number": line_number})
 
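
Reviewer note: the fix and the mitigation question raised in the description can be tried in isolation with the minimal sketch below. It is not the code in `load_data.py`. The helper names `load_lines` and `duplicate_fraction` are hypothetical, and the sketch only uses the standard library, so it does not touch Chroma or the HNSW index. It assumes that identical document text produces identical embeddings, as the description reports for empty lines.

```python
from collections import Counter
from pathlib import Path


def load_lines(documents_directory):
    """Read every file in a directory and return its non-empty lines with metadata.

    Hypothetical helper for illustration; mirrors the loop touched by the patch.
    """
    documents = []
    metadatas = []
    for path in sorted(Path(documents_directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("r") as file:
            # Number lines from 1, counting empty lines too, so that
            # line_number still refers to the position in the source file.
            for line_number, line in enumerate(file, 1):
                line = line.strip()
                # Skip empty lines: they would all share one embedding
                if len(line) == 0:
                    continue
                documents.append(line)
                metadatas.append({"filename": path.name, "line_number": line_number})
    return documents, metadatas


def duplicate_fraction(documents):
    """Return the share of documents whose text repeats an earlier document.

    Hypothetical pre-ingestion check: a high value suggests many identical
    embeddings would be inserted into the index.
    """
    if not documents:
        return 0.0
    counts = Counter(documents)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(documents)
```

Calling `duplicate_fraction` on the list of documents before adding them to a collection gives a rough signal of the problem described above. The threshold at which to warn or abort is a judgment call that this sketch does not settle.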