From f842fbea4974fdd340b7c05fb0f70a8628a53e31 Mon Sep 17 00:00:00 2001
From: Simon Willison
Date: Sun, 3 Sep 2023 19:10:42 -0700
Subject: [PATCH] Mention brute-force approach, link to vector indexing issue

Refs #216. Closes #214
---
 docs/embeddings/cli.md        | 2 ++
 docs/embeddings/python-api.md | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/embeddings/cli.md b/docs/embeddings/cli.md
index 4bda9f6b..81f5d46f 100644
--- a/docs/embeddings/cli.md
+++ b/docs/embeddings/cli.md
@@ -285,6 +285,8 @@ llm-docs/plugins/index.md
 
 The `llm similar` command searches a collection of embeddings for the items that are most similar to a given or item ID.
 
+This currently uses a slow brute-force approach which does not scale well to large collections. See [issue 216](https://github.com/simonw/llm/issues/216) for plans to add a more scalable approach via vector indexes provided by plugins.
+
 To search the `quotations` collection for items that are semantically similar to `'computer science'`:
 
 ```bash
diff --git a/docs/embeddings/python-api.md b/docs/embeddings/python-api.md
index cfe338b7..adc91223 100644
--- a/docs/embeddings/python-api.md
+++ b/docs/embeddings/python-api.md
@@ -116,7 +116,9 @@ if Collection.exists(db, "entries"):
 (embeddings-python-similar)=
 ## Retrieving similar items
 
-Once you have populated a collection of embeddings you can retrieve the entries that are most similar to a given string using the `similar()` method:
+Once you have populated a collection of embeddings you can retrieve the entries that are most similar to a given string using the `similar()` method.
+
+This method uses a brute force approach, calculating distance scores against every document. This is fine for small collections, but will not scale to large collections. See [issue 216](https://github.com/simonw/llm/issues/216) for plans to add a more scalable approach via vector indexes provided by plugins.
 
 ```python
 for entry in collection.similar("hound"):