Release 0.9
Refs #192, #209, #211, #213, #215, #217, #218, #219, #222

Closes #205
simonw committed Sep 4, 2023
1 parent e6e1da3 commit 5efb300
Showing 5 changed files with 81 additions and 7 deletions.
30 changes: 30 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,35 @@
# Changelog

(v0_9)=
## 0.9 (2023-09-03)

The big new feature in this release is support for **embeddings**.

{ref}`Embedding models <embeddings>` take a piece of text - a word, sentence, paragraph or even a whole article - and convert it into an array of floating point numbers. [#185](https://github.com/simonw/llm/issues/185)

This embedding vector can be thought of as representing a position in many-dimensional space, where the distance between two vectors represents how semantically similar the two pieces of content are to each other, according to the model that produced them.

Embeddings can be used to find **related documents**, and also to implement **semantic search** - where a user can search for a phrase and get back results that are semantically similar to that phrase even if they do not share any exact keywords.
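The usual measure of "distance" between two embedding vectors is cosine similarity. As a standalone sketch (this is not LLM's own implementation, and real embedding vectors have hundreds or thousands of dimensions, not three):

```python
from math import sqrt

def cosine_similarity(a, b):
    # Higher values mean the two vectors point in a more similar
    # direction, i.e. the texts they represent are more related.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only
dog = [0.9, 0.1, 0.0]
hound = [0.8, 0.2, 0.0]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(dog, hound) > cosine_similarity(dog, car))  # → True
```

"dog" and "hound" land close together in the vector space, while "car" lands far from both - that closeness is what powers related-document lookup and semantic search.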

LLM now provides both CLI and Python APIs for working with embeddings. Embedding models are defined by plugins, so you can install additional models using the {ref}`plugins mechanism <installing-plugins>`.

The first two embedding models supported by LLM are:

- OpenAI's [ada-002](https://platform.openai.com/docs/guides/embeddings) embedding model, available via an inexpensive API if you set an OpenAI key using `llm keys set openai`.
- The [sentence-transformers](https://www.sbert.net/) family of models, available via the new [llm-sentence-transformers](https://github.com/simonw/llm-sentence-transformers) plugin.

See {ref}`embeddings-cli` for detailed instructions on working with embeddings using LLM.

The new commands for working with embeddings are:

- **{ref}`llm embed <embeddings-cli-embed>`** - calculate embeddings for content and return them to the console or store them in a SQLite database.
- **{ref}`llm embed-multi <embeddings-cli-embed-multi>`** - run bulk embeddings for multiple strings, using input from a CSV, TSV or JSON file, data from a SQLite database or data found by scanning the filesystem. [#215](https://github.com/simonw/llm/issues/215)
- **{ref}`llm similar <embeddings-cli-similar>`** - run similarity searches against your stored embeddings - starting with a search phrase or finding content related to a previously stored vector. [#190](https://github.com/simonw/llm/issues/190)
- **{ref}`llm embed-models <embeddings-cli-embed-models>`** - list available embedding models.
- **{ref}`llm embed-db <help-embed-db>`** - commands for inspecting and working with the default embeddings SQLite database.
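A typical workflow combining these commands might look like the following sketch - it assumes an OpenAI key has been set with `llm keys set openai`, and the collection name `items` and file `items.csv` are illustrative:

```shell
# Embed a single string and print the resulting vector
llm embed -m ada-002 -c 'my happy hound'

# Bulk-embed rows from a CSV file into a collection called "items",
# storing the original content alongside each vector
llm embed-multi items -m ada-002 -i items.csv --store

# Search the stored embeddings for content similar to a phrase
llm similar items -c 'hound'
```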

There's also a new {ref}`llm.Collection <embeddings-python-collections>` class for creating and searching collections of embeddings from Python code, and a {ref}`llm.get_embedding_model() <embeddings-python-api>` interface for embedding strings directly. [#191](https://github.com/simonw/llm/issues/191)
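A minimal sketch of that Python API, assuming the `llm` and `sqlite-utils` packages are installed and an OpenAI key has been configured - the database and collection names here are illustrative:

```python
import llm
import sqlite_utils

# Embed a single string directly - returns a list of floats
model = llm.get_embedding_model("ada-002")
vector = model.embed("my happy hound")

# Build a collection of embeddings in a SQLite database and search it
db = sqlite_utils.Database("embeddings.db")
collection = llm.Collection("phrases", db, model=model)
collection.embed("hound", "my happy hound", store=True)
for entry in collection.similar("hound", number=3):
    print(entry.id, entry.score)
```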

(v0_8_1)=
## 0.8.1 (2023-08-31)

10 changes: 5 additions & 5 deletions docs/embeddings/cli.md
@@ -3,7 +3,7 @@

LLM provides command-line utilities for calculating and storing embeddings for pieces of content.

(embeddings-llm-embed)=
(embeddings-cli-embed)=
## llm embed

The `llm embed` command can be used to calculate embedding vectors for a string of content. These can be returned directly to the terminal, stored in a SQLite database, or both.
@@ -110,7 +110,7 @@ llm similar phrases -c 'hound'
{"id": "hound", "score": 0.8484683588631485, "content": "my happy hound", "metadata": {"name": "Hound"}}
```

(embeddings-llm-embed-multi)=
(embeddings-cli-embed-multi)=
## llm embed-multi

The `llm embed` command embeds a single string at a time.
@@ -130,7 +130,7 @@ All three mechanisms support these options:
- `--store` to store the original content in the embeddings table in addition to the embedding vector
- `--prefix` to prepend a prefix to the stored ID of each item

(embeddings-llm-embed-multi-csv-etc)=
(embeddings-cli-embed-multi-csv-etc)=
### Embedding data from a CSV, TSV or JSON file

You can embed data from a CSV, TSV or JSON file using the `-i/--input` option.
@@ -188,7 +188,7 @@ llm embed-multi items \
--store
```

(embeddings-llm-embed-multi-sqlite)=
(embeddings-cli-embed-multi-sqlite)=
### Embedding data from a SQLite database

You can embed data from a SQLite database using `--sql`, optionally combined with `--attach` to attach an additional database.
@@ -213,7 +213,7 @@ llm embed-multi docs \
-m ada-002
```

(embeddings-llm-embed-multi-directories)=
(embeddings-cli-embed-multi-directories)=
### Embedding data from files in directories

LLM can embed the content of every text file in a specified directory, using the file's path and name as the ID.
2 changes: 1 addition & 1 deletion docs/embeddings/writing-plugins.md
@@ -37,7 +37,7 @@ class SentenceTransformerModel(llm.EmbeddingModel):
results = self._model.encode(texts)
return (list(map(float, result)) for result in results)
```
Once installed, the model provided by this plugin can be used with the {ref}`llm embed <embeddings-llm-embed>` command like this:
Once installed, the model provided by this plugin can be used with the {ref}`llm embed <embeddings-cli-embed>` command like this:

```bash
cat file.txt | llm embed -m sentence-transformers/all-MiniLM-L6-v2