A simple, fast text-to-speech inference server for Fish Speech 1.5 and below, written in pure Rust.
Features:
- Simple inference: OpenAI-compatible server with streaming audio and WAV
- Backwards compatibility: now the only project still supporting Fish Speech 1.4 and 1.2 SFT
- Reliable installation: Compiles to a single ~15MB static binary, with no Python environment or torch.compile cache failures
- Hardware: Supports Nvidia, Apple Silicon, and CPU.
- OS: Linux or macOS are highly recommended. Windows and WSL work but aren't officially supported yet: installation is much harder and performance is slower on the same hardware.
- System: a working Rust installation to compile from source (see the official docs). I'll try to remove this requirement ASAP.
Note
Yes, compiling from source isn't fun. An official Docker image, Homebrew, Linux packaging, and Python interop layer are on the roadmap.
If you want your platform supported, please feel free to raise an issue and I'll get on it!
First, clone this repo to the folder you want:
```bash
git clone https://github.com/EndlessReform/fish-speech.rs.git
cd fish-speech.rs
```
Save the Fish Speech checkpoints to `./checkpoints`. I recommend using `huggingface-cli`:
```bash
# If it's not already on your system (macOS)
brew install huggingface-cli
# For Windows or Linux (assumes working Python):
# pip install -U "huggingface_hub[cli]"
# or just skip this and download the folder manually

mkdir -p checkpoints/fish-speech-1.5
# NOTE: the official weights are not compatible
huggingface-cli download jkeisling/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5
```
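If you'd rather script the download, the same thing works through the `huggingface_hub` Python API (the package behind `huggingface-cli`); a minimal sketch:

```python
# pip install -U huggingface_hub
from huggingface_hub import snapshot_download

# As above: use the converted jkeisling weights, not the official ones
snapshot_download(
    repo_id="jkeisling/fish-speech-1.5",
    local_dir="checkpoints/fish-speech-1.5",
)
```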
If you're using an older Fish version:

- For Fish 1.4, please use `jkeisling/fish-speech-1.4`: the official weights are not compatible.
- For Fish 1.2 SFT, feel free to use the official weights.
Now compile from source. For Apple Silicon GPU support:
```bash
cargo build --release --bin server --features metal
```
For Nvidia:
```bash
cargo build --release --bin server --features cuda
```
For extra performance on Nvidia, you can enable flash attention. If you're compiling for the first time or after a major update, though, this may take quite a while: up to 15 minutes and 16GB of RAM on a decent CPU. You have been warned!
```bash
mkdir -p ~/.candle
CANDLE_FLASH_ATTN_BUILD_DIR=$HOME/.candle cargo build --release --features flash-attn --bin server
```
This will compile your binary to `./target/release/server`.
Just start the binary! If you're getting started for the first time, run:
```bash
# Default voice to get you started
./target/release/server --voice-dir voices-template
```
Options:

- `--port`: Defaults to 3000.
- `--checkpoint`: Directory for the checkpoint folder. Defaults to `checkpoints/fish-speech-1.5`.
- `--voice-dir`: Directory for speaker prompts (more on this below).
- `--fish-version`: `1.5`, `1.4`, or `1.2`. Defaults to `1.5`.
- `--temp`: Temperature for the language model backbone. Default: 0.7.
- `--top_p`: Top-p sampling for the language model backbone. Default: 0.8; set it to 1 to turn it off.
This server supports OGG audio (streaming) and WAV audio output.
You can use any OpenAI-compatible client. Here's an example Python request:
```python
from openai import OpenAI

client = OpenAI(
    # The OpenAI client insists on some API key; the local server
    # presumably ignores it, so any placeholder works
    api_key="none",
    base_url="http://localhost:3000/v1",
)

audio = client.audio.speech.create(
    input="Hello world!",
    voice="default",
    response_format="wav",
    model="tts-1",
)

# Write the returned audio to disk
temp_file = "temp.wav"
audio.stream_to_file(temp_file)
```
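For streaming output instead of a buffered WAV, you can call the same OpenAI-style `/v1/audio/speech` endpoint over plain HTTP and consume the body in chunks. A sketch with `requests`; the `response_format` value `"ogg"` is my assumption for the OGG streaming mode, so check against the server if it rejects the request:

```python
import requests

# Stream the OGG response to disk chunk by chunk instead of
# buffering the whole file in memory
resp = requests.post(
    "http://localhost:3000/v1/audio/speech",
    json={
        "model": "tts-1",
        "voice": "default",
        "input": "Hello world!",
        "response_format": "ogg",  # assumed value for OGG streaming
    },
    stream=True,
)
resp.raise_for_status()
with open("hello.ogg", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
```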
To clone a voice, you'll need a WAV file and a transcription. Suppose you want to add speaker `alice`, who says "Hello world" in the file `fake.wav`.

Make a POST request to the `/v1/audio/encoding` endpoint, with:

- `fake.wav` as the file body
- `id` and `prompt` as URL-encoded query parameters
Example with curl:
```bash
curl -X POST "http://localhost:3000/v1/audio/encoding?id=alice&prompt=Hello%20world" \
  -F "[email protected]" \
  --output alice.npy
```
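Or the same request from Python, if you'd rather script it (a sketch using `requests`; the multipart field name `file` mirrors the curl example above):

```python
import requests

# Upload the reference audio plus its transcription, then save the
# encoded tokens the server returns as a .npy file
with open("fake.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/v1/audio/encoding",
        params={"id": "alice", "prompt": "Hello world"},
        files={"file": ("fake.wav", f, "audio/wav")},
    )
resp.raise_for_status()
with open("alice.npy", "wb") as out:
    out.write(resp.content)
```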
You can check that your voice was added by hitting the `/v1/voices` debug endpoint:
```bash
curl http://localhost:3000/v1/voices
# Returns ['default', 'alice']
```
This will return a `.npy` file with your input audio as encoded tokens. After this, the `alice` voice will be usable as long as the server is running, and will be deleted on shutdown.

If you want to save this voice, you can use the returned `.npy` file to add it to the voices on startup: see below.
Note
Yes, this sucks. Persisting encoded voices to disk is a top priority.
Open up the `voices-template/index.json` file. It should look something like:
```json
{
  "speakers": {
    "default": "When I heard the release demo, I was shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference."
  }
}
```
The voices directory contains:

- This index, mapping speaker names to the text in their speaker conditioning audio
- Speaker conditioning files, where the filename is the name of the speaker and the audio is the speaker conditioning value (e.g. `default.npy`)
Take the `.npy` file you got from runtime encoding and rename it to your speaker ID: e.g. if the ID is "alice", rename it to `alice.npy`.
Then move `alice.npy` to your voices folder. In `index.json`, add the key:
```json
{
  "speakers": {
    "default": "When I heard the release demo, I was shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference.",
    "alice": "I–I hardly know, sir, just at present–at least I know who I WAS when I got up this morning, but I think I must have been changed several times since then."
  }
}
```
Restart the server and your new voice should be good to go.
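If you add voices often, the manual steps above are easy to script. A minimal sketch (the paths and `voices-template` directory are just this section's example values; point them at your own `--voice-dir`):

```python
import json
import shutil

VOICES_DIR = "voices-template"  # wherever your --voice-dir points
SPEAKER_ID = "alice"
TRANSCRIPT = (
    "I–I hardly know, sir, just at present–at least I know who I WAS "
    "when I got up this morning, but I think I must have been changed "
    "several times since then."
)

# 1. Copy the encoded tokens into the voices folder, named after the speaker
shutil.copy(f"{SPEAKER_ID}.npy", f"{VOICES_DIR}/{SPEAKER_ID}.npy")

# 2. Register the transcript under the same speaker ID in index.json
with open(f"{VOICES_DIR}/index.json") as f:
    index = json.load(f)
index["speakers"][SPEAKER_ID] = TRANSCRIPT
with open(f"{VOICES_DIR}/index.json", "w") as f:
    json.dump(index, f, indent=2, ensure_ascii=False)
```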
For now, we're keeping compatibility with the official Fish Speech inference CLI scripts. (Inference server and Python bindings coming soon!)
```bash
# Saves to fake.npy by default
cargo run --release --features metal --bin encoder -- -i ./tests/resources/sky.wav
```
For earlier versions, you'll need to specify version and checkpoints manually:
```bash
cargo run --release --bin encoder -- \
  --input ./tests/resources/sky.wav \
  --output-path fake.npy \
  --fish-version 1.2 \
  --checkpoint ./checkpoints/fish-speech-1.2-sft
```
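Either way, the encoder writes a plain NumPy array, so you can sanity-check the output before generation (the exact shape and dtype are version-dependent, so treat the printed values as informational):

```python
import numpy as np

# Load the encoded prompt tokens and print their dimensions
tokens = np.load("fake.npy")
print(tokens.shape, tokens.dtype)
```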
For Fish 1.5 (default):
```bash
# Switch to --features cuda for Nvidia GPUs
cargo run --release --features metal --bin llama_generate -- \
  --text "That is not dead which can eternal lie, and with strange aeons even death may die." \
  --prompt-text "When I heard the release demo, I was shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference." \
  --prompt-tokens fake.npy
```
For earlier versions, you'll have to specify version and checkpoint explicitly. For example, for Fish 1.2:
```bash
cargo run --release --features metal --bin llama_generate -- \
  --text "That is not dead which can eternal lie, and with strange aeons even death may die." \
  --fish-version 1.2 \
  --checkpoint ./checkpoints/fish-speech-1.2-sft
```
For additional speed, compile with Flash Attention support.
Warning
The candle-flash-attention dependency can take more than 10 minutes to compile even on a good CPU, and can require more than 16 GB of memory! You have been warned.
Also, as of October 2024, the bottleneck is actually elsewhere (in inefficient memory copies and kernel dispatch), so on already-fast hardware (like an RTX 4090) this currently has less of an impact.
```bash
# Cache the Flash Attention build.
# Leave your computer, have a cup of tea, go touch grass, etc.
mkdir -p ~/.candle
CANDLE_FLASH_ATTN_BUILD_DIR=$HOME/.candle cargo build --release --features flash-attn --bin llama_generate
```
```bash
# Then run with the flash-attn feature enabled
cargo run --release --features flash-attn --bin llama_generate -- \
  --text "That is not dead which can eternal lie, and with strange aeons even death may die." \
  --prompt-text "When I heard the release demo, I was shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference." \
  --prompt-tokens fake.npy
```
For Fish 1.5 (default):
```bash
# Switch to --features cuda for Nvidia GPUs
cargo run --release --features metal --bin vocoder -- -i out.npy -o fake.wav
```
For earlier models, please specify the version. 1.2 example:
```bash
cargo run --release --bin vocoder -- \
  -i out.npy -o fake.wav \
  --fish-version 1.2 \
  --checkpoint ./checkpoints/fish-speech-1.2-sft
```
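To quickly confirm the vocoder produced a valid WAV file, you can inspect it with Python's standard-library `wave` module:

```python
import wave

# Print the sample rate and duration of the generated audio
with wave.open("fake.wav", "rb") as w:
    rate = w.getframerate()
    seconds = w.getnframes() / rate
    print(f"{rate} Hz, {seconds:.2f} s")
```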
Warning
This codebase is licensed under the original Apache 2.0 license. Feel free to use it however you want. However, the Fish Speech weights are CC-BY-NC-SA-4.0 and for non-commercial use only!
Please support the original authors by using the official API for production.
The model is licensed under the CC-BY-NC-SA-4.0 license; the source code is released under the Apache 2.0 license.
Massive thanks also go to:

- All `candle_examples` maintainers for highly useful code snippets across the codebase
- WaveyAI's mel spec for the STFT implementation
Fish Speech V1.5 is a leading text-to-speech (TTS) model trained on more than 1 million hours of audio data in multiple languages.
Supported languages:
- English (en) >300k hours
- Chinese (zh) >300k hours
- Japanese (ja) >100k hours
- German (de) ~20k hours
- French (fr) ~20k hours
- Spanish (es) ~20k hours
- Korean (ko) ~20k hours
- Arabic (ar) ~20k hours
- Russian (ru) ~20k hours
- Dutch (nl) <10k hours
- Italian (it) <10k hours
- Polish (pl) <10k hours
- Portuguese (pt) <10k hours
Please refer to the Fish Speech GitHub for more info. A demo is available at Fish Audio.
If you found this repository useful, please consider citing this work:
```bibtex
@misc{fish-speech-v1.4,
  title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
  author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
  year={2024},
  eprint={2411.01156},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2411.01156},
}
```