Merge pull request #120 from ceruleandeep/captioning

Image Captioning docs
SillyTavern · Nov 30, 2024 · a77bfca · a77bfca
2 parents c956056 + c61631b
commit a77bfca
Show file tree

Hide file tree

Showing 3 changed files with 247 additions and 5 deletions.
diff --git a/Usage/User_Settings/uicustomization.md b/Usage/User_Settings/uicustomization.md
@@ -18,9 +18,17 @@ Theme files allow you to save, share, and reuse your UI customizations. You can
 All the settings in this section are saved to the current theme. If you switch themes, the settings will be replaced by the settings of the new theme.
 
 ### Display Settings
-These fundamental display options affect how characters and messages are presented in the chat interface. Choose the style that best fits your preferences and provides the most comfortable reading experience.
-* **Avatar Style**: Choose between Circle, Square, or Rectangle
-* **Chat Style**: Select from Flat, Bubbles, or Document layouts
+These display options affect how characters and messages are presented in the chat interface.
+
+#### Avatar Style
+Choose between Circle, Square, or Rectangle.
+
+#### Chat Style
+| Style        | Description                                                                                                                                                    | [Slash command](/For_Contributors/st-script.md#ui-styling) |
+|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
+| **Flat**     | Clean and continuous "chat log" style, a flat canvas for your AI interactions to come to life.                                                                 | `/flat`<br>`/default`                                      |
+| **Bubbles**  | "Instant messenger" style with distinct bubbles for each message, delightful rounded corners, and a subtle 3D effect.                                          | `/bubble`<br>`/bubbles`                                    |
+| **Document** | Compact, document-like appearance with click-to-edit mode and a text-focused layout. Hides avatars, timestamps, and message control buttons for past messages. | `/single`<br>`/story`                                      |
 
 ### Theme Colors
 Customize the color scheme of every UI element to create your perfect theme. Colors can be selected using a color picker, and include transparency options where applicable.

diff --git a/extensions/captioning.md b/extensions/captioning.md
@@ -0,0 +1,210 @@
+---
+templating: false
+---
+
+# Image Captioning
+
+Image Captioning allows SillyTavern to automatically generate text descriptions for images used in chats. 
+
+Use Image Captioning when you want your AI character to "see" and respond to visual content in your conversations.
+
+- Create captions for images you upload or paste into messages
+- Add context to existing images in the chat history
+- Use various sources for generation, including local models, cloud APIs, and crowdsourced networks
+
+There are options that require no setup, no money, and no GPU. There are also options that require some or all of those things. Choose the one that fits your needs and resources.
+
+The image captioning extension is built-in to SillyTavern and does not need to be installed separately.
+
+## Quick start
+
+1. Set up:
+    - Open the **Image Captioning** panel in the **<i class="fa-solid fa-cubes"></i> Extensions** panel
+    - Choose a captioning source (most likely "Local" or "Multimodal")
+    - For "Multimodal" ensure you've set up the connection in the **<i class="fa-solid fa-plug"></i> API Connections** tab
+2. Generate a caption:
+    - Choose "**Generate Caption**" from the **<i class="fa-solid fa-magic-wand-sparkles"></i> Extensions** popup menu
+    - Select an image file when prompted
+    - Wait for the caption to be generated
+3. Review and send:
+    - The captioned image will be inserted into your message
+    - See the caption using the image tooltip
+    - Click **<i class="fa-solid fa-paper-plane"></i> Send** to see what your character thinks of the image!
+
+## Panel controls
+
+### Source Selection
+
+Choose the source for image captioning. Supported options:
+
+| Source                           | Description                                                                                                                                                                                                  |
+|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Multimodal](#multimodal-source) | **Cloud**: OpenAI, Anthropic, Google, MistralAI, and others. <br>**Local**: Ollama, llama.cpp, KoboldCpp, Text Generation WebUI, and vLLM. <br>Supports custom prompts so you can ask your images questions. |
+| [Local](#local-source)           | Uses [transformers.js](https://huggingface.co/docs/transformers.js/en/index) running locally inside your SillyTavern server. Zero setup!                                                                     |
+| Horde                            | Uses the [AI Horde](https://aihorde.net/) network, a crowdsourced distributed network of image generation models. Nothing to download, configure, or pay for. Variable response times.                       |
+| Extras                           | The Extras project was discontinued in April 2024 and is not maintained or supported.                                                                                                                        |
+
+### Caption Configuration
+- **Caption Prompt**: Enter a custom prompt for captioning. The default prompt is "What's in this image?"
+- **Ask every time**: Toggle to request a custom prompt for each image caption
+
+### Message Template
+- **Message Template**: Customize the caption message template. Use `{{caption}}` macro to insert the generated caption. The default template is `[{{user}} sends {{char}} a picture that contains: {{caption}}]`
+
+### Auto-captioning
+- **Automatically caption images**: Toggle to enable automatic captioning of images pasted or attached to messages
+- **Edit captions before saving**: Toggle to allow editing captions before they are saved
+
+## Captioning images
+
+All the ways to caption images in SillyTavern:
+
+* Choose "**Generate Caption**" from the **<i class="fa-solid fa-magic-wand-sparkles"></i> Extensions** popup menu and select an image file when prompted
+* Click the <i class="fa-solid fa-envelope-open-text"></i> **Caption** icon at the top of an image already in a message
+* Paste an image directly into the chat input with [auto-captioning](#auto-captioning) enabled
+* Attach an image file to a message using the <i class="fa-solid fa-paperclip"></i> **Embed File or Image** button in the actions of a message.
+* Send a message with an embedded image
+* Use the `/caption` [slash command](#slash-command-caption)
+
+## Auto-Captioning
+The auto-captioning feature allows you to automatically generate captions for images as they are added to the chat, without manually triggering the captioning process each time.
+
+To enable, select the "Automatically caption images" checkbox in the Image Captioning panel. You can also choose to edit captions before they are saved by checking the "Edit captions before saving" box.
+
+Once enabled, auto-captioning will trigger in the following scenarios:
+
+- When an image is pasted directly into the chat input.
+- When an image file is attached to a message.
+- When a message with an embedded image is sent.
+
+The system will use your selected captioning source (Local, Extras, Horde, or Multimodal) and the configured settings to generate a caption for the image.
+
+### Editing captions before saving (Refine Mode)
+
+If you've enabled the "Edit captions before saving" option:
+1. After an image is added, a popup will appear with the generated caption.
+2. You can review and edit the caption as needed.
+3. Click "OK" to apply the caption, or "Cancel" to discard the caption without saving.
+
+### Caption sending
+The generated (and optionally edited) caption will be automatically inserted into the prompt using the Message Template you've configured. By default, it will be sent in this format:
+
+```
+[BaronVonUser sends Seraphina a picture that contains: ...]
+```
+
+
+## Slash Command: /caption
+The extension provides a `/caption` slash command to use in the chatbox or in scripts. 
+
+### Usage
+
+```
+/caption [quiet=true|false]? [mesId=number]? [prompt]
+```
+
+- `prompt` (optional): A custom prompt for the captioning model. Only supported by multimodal sources.
+- `quiet=true|false`: If set to true, suppresses sending a captioned message to the chat. Default is false.
+- `mesId=number`: Specifies a message ID to caption an image from an existing message instead of uploading a new one.
+
+If no `mesId` is provided, the command will prompt you to upload an image. When `quiet` is false (default), a new message with the captioned image will be sent to the chat. The generated caption can be used as input for other commands.
+
+### Examples
+Caption a new image with the default settings:
+
+```
+/caption
+```
+
+Caption a new image with a custom prompt:
+
+```
+/caption Describe the main colours and shapes in this image
+```
+
+Caption an image from message #5 without sending a new message:
+
+```
+/caption mesId=5 quiet=true
+```
+
+Caption an image from message #10 with a custom prompt then [generate a new image](/extensions/Stable-Diffusion.md) based on the caption:
+
+```
+/caption mesId=10 Describe this image using comma-separated keywords | /imagine 
+```
+
+## Local source
+
+You can change the model in [config.yaml](/Administration/config-yaml.md#extensions-configuration). The key is called `extras.captioningModel` because reasons. Enter the Hugging Face model ID you want to use. The default is `Xenova/vit-gpt2-image-captioning`. 
+
+You can use any model that supports image captioning (`VisionEncoderDecoderModel` or "image-to-text" pipeline). The model needs be to compatible with the transformers.js library. That is, it needs ONNX weights. Look for models with the `ONNX` and `image-to-text` tags, or that have a folder called `onnx` full of `.onnx` files. 
+
+## Multimodal source
+
+### General configuration
+
+- **Model**: Choose the model for image captioning. Options vary based on the selected API.
+- **Allow reverse proxy**: Toggle to allow using a reverse proxy if defined and valid (OpenAI, Anthropic, Google, Mistral)
+
+API keys and endpoint URLs for captioning sources are managed in the [API Connections](/Usage/API_Connections/index.md) panel. Set the connection up in API Connections first, then select it as your captions source in Captioning.
+
+!!! warning Set it up in the API Connections panel first
+One last time: configure the API key/address/port in **<i class="fa-solid fa-plug"></i> API Connections** and use the connection in Captioning.
+
+You can still use Claude for chats and Google AI Studio for image captioning, or whatever. Just set them *both* up in the 'API Connections' tab first. Then flip your Chat Completion source to Claude and your Captioning source to Google AI Studio.
+!!!
+
+For most local backends, you will need to set some options in the model backend rather than in SillyTavern. If your backend can only run one model at a time and doesn't support automatic switching, you are unfortunately going to have a hard time using the same backend for chat and captioning with different models. 
+
+Even if you run two instances of the backend on different ports, API Connections only allows one active configuration per backend type. But what if I told you... that you can probably connect to your backend in both Text Completion and Chat Completion modes? Now you can have two connections to the same backend type.
+
+### Sources
+
+To use one of these caption sources, select Multimodal in the Source dropdown.
+
+* "I want the best captioning possible, and I don't mind paying for it": Anthropic
+* "I don't want to pay anything or run anything": Google AI Studio free tier
+* "I want to caption images locally and have it just work": Ollama
+* "I want to keep the dream of local AI alive": [KoboldCpp](#koboldcpp)
+* "I want to complain when it doesn't work": ~~Extras~~
+
+| API Provider                      | Description                                                                                                                                                                   |
+|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| 01.AI (Yi)                        | Cloud, paid, yi-vision                                                                                                                                                        |
+| Anthropic                         | Cloud, paid, Claude models with vision capabilities: claude-3-5-sonnet/haiku, claude-3-opus/sonnet                                                                            |
+| Custom (OpenAI-compatible)        | For custom OpenAI-compatible APIs, uses currently configured model in API Connections tab                                                                                     |
+| Google AI Studio                  | Cloud, free tier then paid, Gemini Flash/Pro                                                                                                                                  |
+| Groq                              | Cloud, llama-3.2-vision in 11B/90B, LLaVA                                                                                                                                     |
+| KoboldCpp                         | Local, must configure model in KoboldCpp                                                                                                                                      |
+| llama.cpp                         | Local, must configure model in llama.cpp                                                                                                                                      |
+| MistralAI                         | Cloud, paid, pixtral-large, pixtral-12B                                                                                                                                       |
+| Ollama                            | Local, can switch between available models and download [additional vision models](https://ollama.com/search?c=vision) within Captioning after configuring in API Connections |
+| OpenAI                            | Cloud, paid, GPT-4 Vision, 4-turbo, 4o, 4o-mini                                                                                                                               |
+| OpenRouter                        | Cloud, paid (maybe free options), many models, pick from what's available within Captioning after configuring in API connections                                              |
+| Text Generation WebUI (oobabooga) | Local, must configure model in ooba                                                                                                                                           |
+| vLLM                              | Local                                                                                                                                                                         |
+
+### KoboldCpp
+
+For general information on installing and using [KoboldCpp](https://github.com/LostRuins/koboldcpp), see the [KoboldCpp documentation](https://github.com/LostRuins/koboldcpp/wiki).
+
+To use KoboldCpp for multimodal captioning:
+
+* get a multimodal-capable model, trained to process text and image prompts at the same time.
+* also get the multimodal projections for the model. These weights allow the model to understand how the text and image parts of the input relate to each other.
+* load the model and projections in the KoboldCpp launch GUI or command line interface.
+
+The original and classic local multimodal model is LLaVA. GGUF-format files for the model and projections are available from [Mozilla/llava-v1.5-7b-llamafile](https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile). To load them from the command line, set the model and projections with the `--model` and `--mmproj` flags. For example:
+
+```shell
+./koboldcpp \
+--model="models/llava-v1.5-7b-Q4_K.gguf" \
+--mmproj="models/ llava-v1.5-7b-mmproj-Q4_0.gguf" \
+... other flags ...
+```
+
+Some LLaVA finetunes you can try: [xtuner/llava-llama-3-8b-v1_1-gguf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf), [xtuner/llava-phi-3-mini-gguf](https://huggingface.co/xtuner/llava-phi-3-mini-gguf).
+
+You can use multimodal projections for the base model that your particular finetune was built from. Projections for some common base models are available from [koboldcpp/mmproj](https://huggingface.co/koboldcpp/mmproj/tree/main).
+
diff --git a/extensions/index.md b/extensions/index.md
@@ -8,10 +8,34 @@ order: 35
 # Extensions
 
 SillyTavern comes with many extensions that can be enabled or disabled in the Extensions panel. Extensions can add new
-features, change the behaviour of existing features, or provide additional content for your AI to use.
+features, change the behaviour of existing features, or provide additional content for your AI to use. More extensions
+can be installed from the "Download Extensions & Assets" menu in the Extensions panel.
+
+## Extensions panel
+
+To open or close the Extensions panel, choose **<i class="fa-solid fa-cubes fa-fw"></i> Extensions** in the top bar.
+
+- **<i class="fa-solid fa-cubes"></i> Manage extensions**: Activate, deactivate, and update extensions
+- **Download Extensions & Assets**: Install [more extensions](#installable-extensions), characters, sounds, and backgrounds from the SillyTavern repository
+- **Notify on extension updates**: Check to be notified when there are updates available for installed extensions
+- **<i class="fa-solid fa-cloud-arrow-down"></i> Install extension**: Import an extension from a Git repository URL
+
+!!! warning
+Using third-party extensions can have unintended side effects and may pose security risks. 
+Always make sure you trust the source before importing an extension via **<i class="fa-solid fa-cloud-arrow-down"></i> Install extension**. 
+We are not responsible for any damage caused by third-party extensions.
+!!!
 
 ## Built-in extensions
 
+These extensions are built into SillyTavern and do not need to be installed. They can be enabled or disabled in the Extensions panel.
+
+:::callout
+**[Image Captioning](captioning.md)**
+
+Generates text from images so your AI can "see" and respond to visual content in your conversations
+:::
+
 :::callout
 **[Image Generation](Stable-Diffusion.md)**
 
@@ -43,7 +67,7 @@ Voice narration for your chat messages via ElevenLabs, Silero, your system TTS,
 :::
 
 :::callout
-**[Quick Reply](/For_Contributors/st-script.md)**
+**[Quick Reply](/For_Contributors/st-script.md#quick-replies-script-library-and-auto-execution)**
 
 Reply to chat messages with a single click, run commands and STscripts, and more
 :::