personthe/PersonBarkGUI

🚀 BARK INFINITY 🎶 🌈✨🚀

⚡ Low GPU memory? No problem. CPU offloading. ⚡

Open In Colab Basic Colab Notebook

🌠 The Past: 🌠

Bark Infinity started as a humble 💻 command line wrapper, a CLI 💬. Built from simple keyword commands, it was a proof of concept 🧪, a glimmer of potential 💡.

🌟 The Present: 🌟

Bark Infinity evolved 🧬, expanding across dimensions 🌐. Infinite Length 🎵🔄, Infinite Voices 🔊🌈, and a true high point in human history: 🌍 Infinite Awkwardness 🕺. But for some people, the time-tested command line interface was not a good fit. Many couldn't even try Bark 😞, struggling with CUDA gods 🌩 and being left with cryptic error messages 🧐 and a chaotic computer 💾. Many people felt very… UN INFINITE.

🔜🚀 The Future: 🚀

🚀 Bark Infinity 🐾 was born in the command line, and Bark Infinity grew within the command line. We live in the era where old fashioned command line applications are wrapped in ✨fancy Gradio UIs🌈 and 🖱️One Click Installers. We all must adapt to a changing world, right? Or do we?

bark_test_webui

pip

!git clone https://github.com/JonathanFly/bark.git
%cd bark
!pip install -r requirements-pip.txt
!pip install encodec rich-argparse

🎉 Mamba/Conda Install 🎉

(I created a requirements-pip.txt file as well, but haven't tested a full pip route. However you should be able to install with that too.)

  1. Go here: https://github.com/conda-forge/miniforge#mambaforge
  2. Download this: https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Windows-x86_64.exe
     a. Install the Mambaforge for your OS, not specifically Windows. OSX for OSX etc.
     b. Don't install Mambaforge-pypy3. (It might work but it's not what I tested.) Install the one above that, just plain Mambaforge.
  3. Install. Then start the miniforge 'Miniforge Prompt' Terminal which is a new program it installed. You will always use this program for Bark.
  4. You should see a terminal that says "(base)". Do not move forward until you see that.
  5. Type this:
mamba update mamba
mamba install git

Your terminal still says (base).

  6. This step is most of the installation time, the "mamba env create -f environment-cuda.yml" line. Type:

git clone https://github.com/JonathanFly/bark.git
cd bark
mamba env create -f environment-cuda.yml 

Okay, stop here and see if something went wrong. When it's done it should say somewhere: "To activate this environment, use conda activate bark-infinity-oneclick". Then type:

mamba activate bark-infinity-oneclick

Note I typed "mamba" not "conda", even though the message said the word conda.

  7. Okay now instead of (base) you should see (bark-infinity-oneclick). Do not move on if you still see (base) on your screen.
  8. Type:
pip install encodec
pip install rich-argparse

Now if you type 'dir' you should see 'bark_webui.py' in the list of files. If you don't, something might have gone wrong in step 6 where you type 'cd bark'.

  9. To start Bark, always make sure you start 'Miniforge Prompt', that the prompt shows (bark-infinity-oneclick) and not (base), and that you are in the /bark directory that has the bark_webui.py file.

Are you done? Maybe not. You can try skipping this step, but something in the libraries is bugged, so you probably need a step 10.

  10. (you can try skipping this if you want)
mamba uninstall pysoundfile
pip install soundfile

Okay you are done. Just type:

python bark_perform.py

or

python bark_webui.py
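
If Bark refuses to start, the two checks the steps above keep repeating (right environment, right directory) can be scripted. This is a minimal sketch, not part of Bark Infinity: `preflight` is a hypothetical helper, and it assumes that activating the environment sets the `CONDA_DEFAULT_ENV` variable to the environment name.

```python
import os
from pathlib import Path

EXPECTED_ENV = "bark-infinity-oneclick"


def preflight(cwd=".", env=None):
    """Return a list of problems that would likely stop Bark from starting."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: conda/mamba activation exports CONDA_DEFAULT_ENV.
    active = env.get("CONDA_DEFAULT_ENV", "")
    if active != EXPECTED_ENV:
        problems.append(f"active environment is {active or 'unknown'!r}, expected {EXPECTED_ENV!r}")
    # The web UI script should be in the current directory.
    if not (Path(cwd) / "bark_webui.py").is_file():
        problems.append("bark_webui.py not found; are you in the bark directory?")
    return problems


for problem in preflight():
    print("Problem:", problem)
```

Running it from the Miniforge Prompt prints nothing when both checks pass.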

To restart later, start Miniforge Prompt. Then activate bark-infinity-oneclick (you can set it up to activate automatically as well), and then:

Option 1: Using commands

mamba activate bark-infinity-oneclick
cd bark
python bark_webui.py

Option 2: Run bark-webui.bat from Windows Explorer as a normal, non-administrator user.

(If you do not have an NVIDIA GPU use environment-cpu.yml instead of environment-cuda.yml)

I dipped my toes back into Twitter a bit: twitter.com/jonathanfly

🌟 (OLD NOT UPDATED) Main Features 🌟

1. INFINITY VOICES 🔊🌈

Discover cool new voices and reuse them. Performers, musicians, sound effects, two party dialog scenes. Save and share them. Every audio clip saves a speaker.npz file with the voice. To reuse a voice, move the generated speaker.npz file (named the same as the .wav file) to the "prompts" directory inside "bark" where all the other .npz files are.
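
Reusing a voice comes down to a file copy, which can be sketched as below. `install_voice` is a hypothetical helper, not part of Bark Infinity, and the paths in the comment are placeholders. It copies rather than moves, so the original file stays next to its .wav.

```python
import shutil
from pathlib import Path


def install_voice(npz_path, prompts_dir):
    """Copy a generated speaker .npz file into the prompts directory.

    Returns the file name without the .npz extension, which is the name
    to try with --history_prompt.
    """
    src = Path(npz_path)
    if src.suffix != ".npz":
        raise ValueError(f"expected a .npz file, got {src.name}")
    dest_dir = Path(prompts_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest_dir / src.name)
    return src.stem


# Placeholder paths; point prompts_dir at the "prompts" directory inside "bark":
# voice = install_voice("my_output/my_clip.npz", "path/to/bark/prompts")
```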

🔊 With random celebrity appearances!

(I accidentally left a bunch of voices in the repo, some of them are pretty good. Use --history_prompt 'en_fiery' for the same voice as the audio sample right after this sentence.)

whoami.mp4

2. INFINITY LENGTH 🎵🔄

Any length prompt and audio clips. Sometimes the final result is seamless, sometimes it's stable (but usually not both!).

🎵 Now with Slowly Morphing Rick Rolls! Can you even spot the seams in the most earnest Rick Rolls you've ever heard in your life?

but_are_we_strangers_to_love_really.mp4

🕺 Confused Travolta Mode 🕺

Confused Travolta GIF confused_travolta

Can your text-to-speech model stammer and stall like a student answering a question about a book they didn't read? Bark can. That's the human touch. The semantic touch. You can almost feel the awkward silence through the screen.

💡 But Wait, There's More: Travolta Mode Isn't Just A Joke 💡

Are you tired of telling your TTS model what to say? Why not take a break and let your TTS model do the work for you. With enough patience and Confused Travolta Mode, Bark can finish your jokes for you.

almost_a_real_joke.mp4

Truly we live in the future. It might take 50 tries to get a joke and it's probably an accident, but all 49 failures are also very amusing so it's a win/win. (That's right, I set a single function flag to False in Bark and raved about the amazing new feature. Everything here is small potatoes really.)

reaching_for_the_words.mp4

BARK INFINITY is possible because Bark is such an amazingly simple and powerful model that even I could poke around easily.

For music, I recommend using --split_by_lines and making sure you use a multiline string as input. You'll generally get better results if you manually split your text, which I neglected to provide an easy way to do because I stayed up too late listening to 100 different Bark versions of a scene in Andor and failed "Why was 6 afraid of 7" jokes.
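
Manual splitting can be as simple as the sketch below, which turns a multiline string into one prompt per non-empty line. `split_lyrics` is a hypothetical helper for illustration and not something Bark Infinity ships.

```python
def split_lyrics(text):
    """Return the non-empty lines of a multiline string, without padding."""
    return [line.strip() for line in text.splitlines() if line.strip()]


lyrics = """
    In the jungle, the mighty jungle,
    the lion barks tonight
"""

for segment in split_lyrics(lyrics):
    print(segment)
```

Each printed line is then a candidate for its own text prompt.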

📝 Command Line Options 📝 (Some of these parameters are not implemented.)

Type --help or use the GUI

Usage: bark_perform.py [-h] [--text_prompt TEXT_PROMPT] [--list_speakers LIST_SPEAKERS] [--dry_run DRY_RUN]
                       [--history_prompt HISTORY_PROMPT] [--prompt_file PROMPT_FILE]
                       [--split_input_into_separate_prompts_by {word,line,sentence,string,random,rhyme,pos,regex}]
                       [--split_input_into_separate_prompts_by_value SPLIT_INPUT_INTO_SEPARATE_PROMPTS_BY_VALUE]
                       [--always_save_speaker ALWAYS_SAVE_SPEAKER] [--output_iterations OUTPUT_ITERATIONS]
                       [--output_filename OUTPUT_FILENAME] [--output_dir OUTPUT_DIR] [--hoarder_mode HOARDER_MODE]
                       [--extra_stats EXTRA_STATS] [--text_use_gpu TEXT_USE_GPU] [--text_use_small TEXT_USE_SMALL]
                       [--coarse_use_gpu COARSE_USE_GPU] [--coarse_use_small COARSE_USE_SMALL]
                       [--fine_use_gpu FINE_USE_GPU] [--fine_use_small FINE_USE_SMALL]
                       [--codec_use_gpu CODEC_USE_GPU] [--force_reload FORCE_RELOAD] [--text_temp TEXT_TEMP]
                       [--waveform_temp WAVEFORM_TEMP] [--confused_travolta_mode CONFUSED_TRAVOLTA_MODE]
                       [--silent SILENT] [--seed SEED] [--stable_mode_interval STABLE_MODE_INTERVAL]
                       [--single_starting_seed SINGLE_STARTING_SEED]
                       [--split_character_goal_length SPLIT_CHARACTER_GOAL_LENGTH]
                       [--split_character_max_length SPLIT_CHARACTER_MAX_LENGTH]
                       [--add_silence_between_segments ADD_SILENCE_BETWEEN_SEGMENTS]
                       [--split_each_text_prompt_by {word,line,sentence,string,random,rhyme,pos,regex}]
                       [--split_each_text_prompt_by_value SPLIT_EACH_TEXT_PROMPT_BY_VALUE]
                       [--extra_confused_travolta_mode EXTRA_CONFUSED_TRAVOLTA_MODE]
                       [--semantic_history_starting_weight SEMANTIC_HISTORY_STARTING_WEIGHT]
                       [--semantic_history_future_weight SEMANTIC_HISTORY_FUTURE_WEIGHT]
                       [--semantic_prev_segment_weight SEMANTIC_PREV_SEGMENT_WEIGHT]
                       [--coarse_history_starting_weight COARSE_HISTORY_STARTING_WEIGHT]
                       [--coarse_history_future_weight COARSE_HISTORY_FUTURE_WEIGHT]
                       [--coarse_prev_segment_weight COARSE_PREV_SEGMENT_WEIGHT]
                       [--fine_history_starting_weight FINE_HISTORY_STARTING_WEIGHT]
                       [--fine_history_future_weight FINE_HISTORY_FUTURE_WEIGHT]
                       [--fine_prev_segment_weight FINE_PREV_SEGMENT_WEIGHT]
                       [--custom_audio_processing_function CUSTOM_AUDIO_PROCESSING_FUNCTION]
                       [--use_smaller_models USE_SMALLER_MODELS] [--semantic_temp SEMANTIC_TEMP]
                       [--semantic_top_k SEMANTIC_TOP_K] [--semantic_top_p SEMANTIC_TOP_P]
                       [--semantic_min_eos_p SEMANTIC_MIN_EOS_P]
                       [--semantic_max_gen_duration_s SEMANTIC_MAX_GEN_DURATION_S]
                       [--semantic_allow_early_stop SEMANTIC_ALLOW_EARLY_STOP]
                       [--semantic_use_kv_caching SEMANTIC_USE_KV_CACHING] [--semantic_seed SEMANTIC_SEED]
                       [--semantic_history_oversize_limit SEMANTIC_HISTORY_OVERSIZE_LIMIT]
                       [--coarse_temp COARSE_TEMP] [--coarse_top_k COARSE_TOP_K] [--coarse_top_p COARSE_TOP_P]
                       [--coarse_max_coarse_history COARSE_MAX_COARSE_HISTORY]
                       [--coarse_sliding_window_len COARSE_SLIDING_WINDOW_LEN]
                       [--coarse_kv_caching COARSE_KV_CACHING] [--coarse_seed COARSE_SEED]
                       [--coarse_history_time_alignment_hack COARSE_HISTORY_TIME_ALIGNMENT_HACK]
                       [--fine_temp FINE_TEMP] [--fine_seed FINE_SEED] [--render_npz_samples RENDER_NPZ_SAMPLES]
                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
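
When calling the script from another program, it can help to build the argument list in code. The sketch below uses only flag names from the usage text above (`--text_prompt`, `--history_prompt`, `--output_dir`). `build_command` is a hypothetical helper, and since some parameters are not implemented, treat the result as something to try rather than a guaranteed invocation.

```python
import shlex
import sys


def build_command(text_prompt, history_prompt=None, output_dir=None):
    """Assemble an argument list for bark_perform.py from a few options."""
    cmd = [sys.executable, "bark_perform.py", "--text_prompt", text_prompt]
    if history_prompt is not None:
        cmd += ["--history_prompt", history_prompt]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


cmd = build_command("Hello from Bark Infinity.", history_prompt="en_fiery")
print(shlex.join(cmd))

# To actually run it from inside the bark directory:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than one long string avoids quoting problems when the prompt contains spaces or quotes.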

🐶 Bark

Twitter

Examples • Suno Studio Waitlist • Updates • How to Use • Installation • FAQ



Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints, which are ready for inference and available for commercial use.

⚠ Disclaimer

Bark was developed for research purposes. It is not a conventional text-to-speech model but instead a fully generative text-to-audio model, which can deviate in unexpected ways from provided prompts. Suno does not take responsibility for any output generated. Use at your own risk, and please act responsibly.

🎧 Demos

Open in Spaces Open on Replicate Open In Colab

🚀 Updates

2023.05.01

  • ©️ Bark is now licensed under the MIT License, meaning it's now available for commercial use!

  • ⚡ 2x speed-up on GPU. 10x speed-up on CPU. We also added an option for a smaller version of Bark, which offers additional speed-up with the trade-off of slightly lower quality.

  • 📕 Long-form generation, voice consistency enhancements and other examples are now documented in a new notebooks section.

  • 👥 We created a voice prompt library. We hope this resource helps you find useful prompts for your use cases! You can also join us on Discord, where the community actively shares useful prompts in the #audio-prompts channel.

  • 💬 Growing community support and access to new features here:

  • 💾 You can now use Bark with GPUs that have low VRAM (<4GB).

2023.04.20

  • 🐶 Bark release!

🐍 Usage in Python

🪑 Basics

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
     Hello, my name is Suno. And, uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
  
# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)
pizza.webm

🌎 Foreign Language


Bark supports various languages out-of-the-box and automatically determines language from input text. When prompted with code-switched text, Bark will attempt to employ the native accent for the respective languages. English quality is best for the time being, and we expect other languages to further improve with scaling.

text_prompt = """
    추석은 내가 가장 좋아하는 명절이다. 나는 며칠 동안 휴식을 취하고 친구 및 가족과 시간을 보낼 수 있습니다.
"""
audio_array = generate_audio(text_prompt)
suno_korean.webm

Note: since Bark recognizes languages automatically from input text, it is possible to use, for example, a German history prompt with English text. This usually leads to English audio with a German accent.

🎶 Music

Bark can generate all types of audio, and, in principle, doesn't see a difference between speech and music. Sometimes Bark chooses to generate text as music, but you can help it out by adding music notes around your lyrics.

text_prompt = """
    ♪ In the jungle, the mighty jungle, the lion barks tonight ♪
"""
audio_array = generate_audio(text_prompt)
lion.webm

🎤 Voice Presets

Bark supports 100+ speaker presets across supported languages. You can browse the library of speaker presets here, or in the code. The community also often shares presets in Discord.

Bark tries to match the tone, pitch, emotion and prosody of a given preset, but does not currently support custom voice cloning. The model also attempts to preserve music, ambient noise, etc.

text_prompt = """
    I have a silky smooth voice, and today I will tell you about 
    the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_1")
sloth.webm

Generating Longer Audio

By default, generate_audio works well with around 13 seconds of spoken text. For an example of how to do long-form generation, see this example notebook.
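
One common way to work within that limit is to break the script into short chunks, generate each one separately, and join the audio afterwards. The sketch below covers only the chunking step, using the standard library. `chunk_text` and its 200-character budget are assumptions for illustration, not values taken from Bark or the example notebook.

```python
import re


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = "Hello, my name is Suno. I like pizza. I also play tic tac toe."
for chunk in chunk_text(script, max_chars=40):
    print(chunk)
```

Each chunk could then be passed to generate_audio with the same history_prompt, and the resulting arrays concatenated, for example with numpy.concatenate.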

Click to toggle example long-form generations (from the example notebook)
dialog.webm
longform_advanced.webm
longform_basic.webm

💻 Installation

pip install git+https://github.com/suno-ai/bark.git

or

git clone https://github.com/suno-ai/bark
cd bark && pip install . 

Note: Do NOT use 'pip install bark'. It installs a different package, which is not managed by Suno.

🛠️ Hardware and Inference Speed

Bark has been tested and works on both CPU and GPU (pytorch 2.0+, CUDA 11.7 and CUDA 12.0).

On enterprise GPUs and PyTorch nightly, Bark can generate audio in roughly real-time. On older GPUs, default Colab, or CPU, inference time might be significantly slower. For older GPUs or CPU you might want to consider using smaller models. Details can be found in our tutorial sections here.

The full version of Bark requires around 12GB of VRAM to hold everything on GPU at the same time. To use a smaller version of the models, which should fit into 8GB VRAM, set the environment flag SUNO_USE_SMALL_MODELS=True.
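
The memory figures in this README can be turned into a small helper that picks flags for a given card. This is a sketch under assumptions: `bark_env_for_vram` is hypothetical, the 12GB and 8GB thresholds come from the paragraph above, and `SUNO_OFFLOAD_CPU` is the flag named in the FAQ for smaller cards. Environment values must be strings, and the flags are assumed to be read when bark is imported, so set them first.

```python
import os


def bark_env_for_vram(vram_gb):
    """Pick environment flags from the VRAM figures quoted in this README."""
    env = {}
    if vram_gb < 12:
        # Below the ~12GB needed for the full models: use the small ones.
        env["SUNO_USE_SMALL_MODELS"] = "True"
    if vram_gb < 8:
        # Below the ~8GB the small models should fit in: also offload to CPU.
        env["SUNO_OFFLOAD_CPU"] = "True"
    return env


# Example: a 6GB card. Do this before `import bark`.
os.environ.update(bark_env_for_vram(6))
print({k: v for k, v in os.environ.items() if k.startswith("SUNO_")})
```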

If you don't have hardware available or if you want to play with bigger versions of our models, you can also sign up for early access to our model playground here.

⚙️ Details

Bark is a fully generative text-to-audio model developed for research and demo purposes. It follows a GPT-style architecture similar to AudioLM and Vall-E and uses a quantized audio representation from EnCodec. It is not a conventional TTS model, but instead a fully generative text-to-audio model capable of deviating in unexpected ways from any given script. Unlike previous approaches, the input text prompt is converted directly to audio without the intermediate use of phonemes. It can therefore generalize to arbitrary instructions beyond speech such as music lyrics, sound effects or other non-speech sounds.

Below is a list of some known non-speech sounds, but we are finding more every day. Please let us know if you find patterns that work particularly well on Discord!

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]
  • — or ... for hesitations
  • ♪ for song lyrics
  • CAPITALIZATION for emphasis of a word
  • [MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively
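
A few of the cues in the list above can be applied with small string helpers. The sketch below is illustrative only: `sing`, `emphasize` and `with_speaker` are hypothetical names, and Bark may or may not follow any given cue.

```python
import re


def sing(lyrics):
    """Wrap lyrics in music notes to nudge Bark toward singing."""
    return f"♪ {lyrics.strip()} ♪"


def emphasize(text, word):
    """Upper-case every whole-word occurrence of `word` for emphasis."""
    return re.sub(rf"\b{re.escape(word)}\b", word.upper(), text)


def with_speaker(text, tag):
    """Prefix a prompt with a speaker tag such as [MAN] or [WOMAN]."""
    return f"[{tag}] {text}"


prompt = with_speaker(emphasize("I really like pizza. [laughs]", "really"), "WOMAN")
print(prompt)
print(sing("In the jungle, the mighty jungle"))
```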

Supported Languages

Language Status
English (en) ✅
German (de) ✅
Spanish (es) ✅
French (fr) ✅
Hindi (hi) ✅
Italian (it) ✅
Japanese (ja) ✅
Korean (ko) ✅
Polish (pl) ✅
Portuguese (pt) ✅
Russian (ru) ✅
Turkish (tr) ✅
Chinese, simplified (zh) ✅
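
The language codes in this table resemble the prefix of the preset shown earlier, "v2/en_speaker_1". Assuming the same naming pattern holds for the other languages (an assumption, so check the preset library before relying on it), a preset name can be built like this. `preset_name` is a hypothetical helper.

```python
SUPPORTED_LANGUAGES = {
    "en": "English", "de": "German", "es": "Spanish", "fr": "French",
    "hi": "Hindi", "it": "Italian", "ja": "Japanese", "ko": "Korean",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "tr": "Turkish",
    "zh": "Chinese, simplified",
}


def preset_name(lang, index):
    """Build a preset name following the "v2/en_speaker_1" pattern."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"v2/{lang}_speaker_{index}"


# Assumed name of a German preset, e.g. for English text with a German accent.
print(preset_name("de", 3))
```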

Requests for future language support here or in the #forums channel on Discord.

🙏 Appreciation

  • nanoGPT for a dead-simple and blazing fast implementation of GPT-style models
  • EnCodec for a state-of-the-art implementation of a fantastic audio codec
  • AudioLM for related training and inference code
  • Vall-E, AudioLM and many other ground-breaking papers that enabled the development of Bark

© License

Bark is licensed under the MIT License.

Please contact us at [email protected] to request access to a larger version of the model.

📱 Community

🎧 Suno Studio (Early Access)

We're developing a playground for our models, including Bark.

If you are interested, you can sign up for early access here.

โ“ FAQ

How do I specify where models are downloaded and cached?

  • Bark uses Hugging Face to download and store models. You can find more info here.

Bark's generations sometimes differ from my prompts. What's happening?

  • Bark is a GPT-style model. As such, it may take some creative liberties in its generations, resulting in higher-variance model outputs than traditional text-to-speech approaches.

What voices are supported by Bark?

  • Bark supports 100+ speaker presets across supported languages. You can browse the library of speaker presets here. The community also shares presets in Discord. Bark also supports generating unique random voices that fit the input text. Bark does not currently support custom voice cloning.

Why is the output limited to ~13-14 seconds?

  • Bark is a GPT-style model, and its architecture/context window is optimized to output generations with roughly this length.

How much VRAM do I need?

  • The full version of Bark requires around 12GB of memory to hold everything on GPU at the same time. However, even smaller cards down to ~2GB work with some additional settings. Simply add the following code snippet before your generation:
import os
# environment variable values must be strings, not booleans
os.environ["SUNO_OFFLOAD_CPU"] = "True"
os.environ["SUNO_USE_SMALL_MODELS"] = "True"

My generated audio sounds like a 1980s phone call. What's happening?

  • Bark generates audio from scratch. It is not meant to create only high-fidelity, studio-quality speech. Rather, outputs could be anything from perfect speech to multiple people arguing at a baseball game recorded with bad microphones.
