A tiny tool to help generate sentences for Clozemaster language learning using large language models (LLMs).
Wraps around the OpenAI API.
I like using Clozemaster. I like the language resources on DW's Learn German site. I like adding words from the word lists in the DW Learn German lessons, but the words are often too specialized to have existing clozes.
I can trawl the web to find examples. (Thank you, search bars on German public radio websites!) And sometimes searching for a good example is a helpful part of the learning itself. But sometimes I also just want to add sentences to Clozemaster.
The idea here is to prompt an LLM for each word in a list and request a response that's already in the CSV format Clozemaster accepts for batch uploads.
I've experimented with two approaches:
- A one-shot approach that uses a `STARTING_MESSAGE` to prompt ChatGPT with an example prompt and response (sketched below)
- A fine-tuning approach that uses a completion model (e.g. OpenAI "curie") fine-tuned on a number of existing example cloze creations. This requires an appropriate training data set.
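Here's a minimal sketch of the one-shot chat approach, assuming the pre-1.0 `openai` Python package; the system message is illustrative, not the exact `STARTING_MESSAGE` clozify ships with:

```python
# One-shot chat approach: a system message describes the CSV format,
# then each vocabulary word becomes one chat completion request.
# (Illustrative only; not clozify's actual STARTING_MESSAGE.)
import openai

STARTING_MESSAGE = {
    "role": "system",
    "content": (
        "For each German word you receive, reply with exactly one CSV line: "
        '"<German sentence using the word>","<English translation>","<word>"'
    ),
}

def cloze_for(word: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[STARTING_MESSAGE, {"role": "user", "content": word}],
    )
    return response["choices"][0]["message"]["content"]
```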
You're gonna need an OpenAI account and API key (assumed to be available in the standard `OPENAI_API_KEY` env var). It's gonna cost ya once you run through the free credits.
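For example, in a POSIX shell:

$ export OPENAI_API_KEY="sk-..."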
$ poetry install
This installs the dependencies in a poetry environment, as well as a simple CLI named `clozify`.
Verify installation:
$ poetry shell
$ clozify --help
Usage: clozify [OPTIONS] COMMAND [ARGS]...
Use LLMs to generate cloze sentences.
Options:
--help Show this message and exit.
Commands:
chat Generate clozes using a chat model
complete Generate clozes using a completion model
finetune Start completion model fine-tuning from training data
prep Prepare training data for model fine-tuning.
`clozify chat` uses a list of vocabulary words and a chat completion request. Example:
$ echo "regional" > vocab.txt
$ echo "Waschbär" >> vocab.txt
$ clozify chat -f vocab.txt -o out_clozes.csv
response for regional received, total usage 104
response for Waschbär received, total usage 107
wrote 2 responses to out_clozes.csv
$ cat out_clozes.csv
"Die regionale Küche ist sehr vielfältig und lecker.","The regional cuisine is very diverse and delicious.","regional"
"Der Waschbär ist ein nachtaktives Tier.","The raccoon is a nocturnal animal.","Waschbär"
`clozify complete` uses a list of vocabulary words and definitions, and a fine-tuned completion model request. Example:
$ cat my_inputs.csv
word,defn
"Ausrede, -n (f.)","eine Erklärung, die aber nicht ganz wahr ist und die eine Entschuldigung für etwas sein soll"
"Bank (f.)","Unternehmen, das Geld- und Kreditgeschäfte betreibt und den Zahlungsverkehr vermittelt"
"Bank (f.)","Sitzgelegenheit aus Holz, Stein o. Ä., die mehreren Personen nebeneinander Platz bietet"
$ clozify complete -f my_inputs.csv -m 'curie:ft-personal-2023-01-01-01-01-01' -o my_clozes.csv
response for Ausrede, -n (f.) received, total usage 80
response for Bank (f.) received, total usage 77
response for Bank (f.) received, total usage 83
$ cat my_clozes.csv
"Ich muss diese Ausrede nicht erfinden.","I don't have to come up with an excuse.","Ausrede"
"Sie müssen nur ein Konto bei einer deutschen Bank haben.","You only need a bank account in Germany.","Bank"
"Ein Bank ist eine Möbelstück, die zur Sitzgelegenheit dient.","A bench is a piece of furniture that serves as a seating device.","Bank"
`clozify finetune` starts a model fine-tuning job, assuming a training data set is available (see "Data prep", below).
$ clozify finetune --help
Usage: clozify finetune [OPTIONS] CSV_FILE TRAINING_DATA_OUTPUT
Start completion model fine-tuning from training data
Start model fine-tuning using data in CSV_FILE written to
TRAINING_DATA_OUTPUT in the format that is uploaded for fine-tuning. Details
of the FineTune job are printed.
Options:
--help Show this message and exit.
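For reference, legacy OpenAI fine-tuning consumes JSONL prompt/completion pairs, so the conversion amounts to something like the sketch below. The CSV column names ("word", "defn", "cloze") and the prompt/completion templates are assumptions for illustration:

```python
# Convert training CSV rows into the JSONL prompt/completion format that
# legacy OpenAI fine-tuning uploads expect. Column names and templates
# are assumptions, not necessarily clozify's exact ones.
import csv
import json

def write_training_data(csv_file: str, training_data_output: str) -> None:
    with open(csv_file, newline="") as src, open(training_data_output, "w") as out:
        for row in csv.DictReader(src):
            record = {
                "prompt": f"{row['word']}: {row['defn']}\n\n###\n\n",
                "completion": f" {row['cloze']}\n",
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```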
Helper functions are included to extract clozes and vocabulary lists. Running these requires installing the optional "prep" group of dependencies into the poetry environment.
The steps include:
- Wrangle vocab and cloze data into a standard format
- Generate embeddings for the vocab and cloze words
- Join the vocab to the cloze sentences based on embedding similarities (see the sketch below)
- Manually review the join and fix any errors
These steps are available through the CLI tool via the `prep` subcommand.
$ clozify prep --help
Usage: clozify prep [OPTIONS] COMMAND [ARGS]...
Prepare training data for model fine-tuning.
Options:
--help Show this message and exit.
Commands:
embed Get embeddings for the word or cloze in the input
fetch Get vocabulary from a course
fix Update candidate training data based on manual review
match Join cloze and vocab data based on embedding similarities
parse Extract clozes from scraped json data
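The embed and match steps boil down to embedding both lists and joining on cosine similarity. A minimal sketch, assuming the pre-1.0 `openai` package and the text-embedding-ada-002 model; clozify's actual matching logic may differ:

```python
# Embed vocab words and cloze sentences, then pair each vocab word with
# its most similar cloze by cosine similarity. (Illustrative sketch.)
import numpy as np
import openai

def embed(texts: list[str]) -> np.ndarray:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])

def match(vocab: list[str], clozes: list[str]) -> list[tuple[str, str]]:
    v, c = embed(vocab), embed(clozes)
    # Normalize rows so dot products are cosine similarities.
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    c /= np.linalg.norm(c, axis=1, keepdims=True)
    best = (v @ c.T).argmax(axis=1)  # best cloze index for each vocab word
    return [(word, clozes[i]) for word, i in zip(vocab, best)]
```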
This relies on machine translation, so all the limitations there apply. The output might have subtle issues with grammar, idiomatic usage, etc. (the second "Bank" example above is a case in point). The assumption is that the output will receive manual human review for these issues before being added to a flashcard set.
The output also appears to struggle when the input word is inflected in the generated sentence (e.g. "regional" vs. "regionale" above): the cloze word in the final column may not match the form that actually appears. This affects how Clozemaster ingests the list of sentences and identifies the cloze. Some postprocessing rules could probably clean this up, but again, manual human review can also catch it.
Cleaning of the output might be necessary if the CSV format gets mangled in some responses.
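As an illustration (not part of clozify), a simple postprocessing pass could flag rows for review, both mangled CSV rows and rows where the target word doesn't appear verbatim in the sentence:

```python
# Flag output rows that need manual attention: wrong column count
# (mangled CSV) or a target word that doesn't appear verbatim in the
# sentence (often because it was inflected).
import csv

def flag_rows(path: str) -> list[list[str]]:
    with open(path, newline="") as f:
        return [
            row
            for row in csv.reader(f)
            if len(row) != 3 or row[2].lower() not in row[0].lower()
        ]
```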
There is no content filtering applied for potentially objectionable material.
Certain parameters are hard-coded (model, temperature, etc.).
Environment managed with poetry v1.4.
Autoformatting checks managed with pre-commit.
This was inspired by Matt Webb's description of his Braggoscope project (about page, blog writeup), which uses ChatGPT to extract data from In Our Time shownotes and perform some other tasks (embeddings).
Robin Sloan's gloss also provided a nudge:
> Where the GPT-alikes are concerned, a question that’s emerging for me is: What could I do with a universal function — a tool for turning just about any X into just about any Y with plain language instructions?