
Sound generation

HugoFara edited this page Sep 23, 2024 · 1 revision

Ambient sound generation

This part covers the generation of ambient sounds, such as birds chirping or waves crashing. The project does not feature voices or music, as that is not its purpose.

Text-to-text-sound

This is the easiest approach. First, we take the initial user prompt that describes the landscape. Then, we pass this prompt to a text-to-text model, such as llama3, with an instruction similar to "Describe the sounds that could appear in this landscape, and format the answer in a Python list". Finally, we pass each description from the list to a text-to-sound system. So far, I have tried llama3 with AudioGen, with successful results.
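The steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `generate_text` and `generate_sound` are hypothetical callables standing in for the language model (e.g. llama3) and the text-to-sound system (e.g. AudioGen), and `parse_sound_list` assumes the model's answer contains a single flat Python list of strings.

```python
import ast
import re
from typing import Any, Callable, Dict, List

# Instruction quoted from the description above.
INSTRUCTION = (
    "Describe the sounds that could appear in this landscape, "
    "and format the answer in a Python list."
)


def parse_sound_list(response: str) -> List[str]:
    """Extract the first flat Python list of strings from a model response.

    Language models often wrap the list in extra prose, so we search for the
    first bracketed span and parse it safely with ast.literal_eval.
    """
    match = re.search(r"\[.*?\]", response, re.DOTALL)
    if match is None:
        return []
    try:
        value = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, list):
        return []
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]


def landscape_to_sounds(
    prompt: str,
    generate_text: Callable[[str], str],   # hypothetical wrapper around the LLM
    generate_sound: Callable[[str], Any],  # hypothetical wrapper around text-to-sound
) -> Dict[str, Any]:
    """Run the text-to-text-sound pipeline: one generated sound per description."""
    response = generate_text(f"{INSTRUCTION}\n\nLandscape: {prompt}")
    return {desc: generate_sound(desc) for desc in parse_sound_list(response)}
```

Passing the models in as callables keeps the sketch independent of any specific library, so either model can be swapped without touching the pipeline. Returning an empty list on a malformed answer lets the caller decide whether to retry the language model.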

That said, this approach had limited success in a few respects. First, the text-to-sound model often struggled to generate natural-sounding audio. Second, the combination of sounds could feel unnatural: for instance, hearing "the rumbling of thunder" at the same volume as "leaves rustling" is quite strange.
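One possible way to mitigate the volume issue, shown here only as an illustration and not as something the project implements, is to mix the generated clips with a per-sound gain. In the sketch below, the function names and the gain values are assumptions, and clips are assumed to be plain lists of float samples in the range [-1, 1] at a common sample rate.

```python
from typing import List, Sequence, Tuple


def db_to_gain(db: float) -> float:
    """Convert a level in decibels to a linear amplitude factor."""
    return 10 ** (db / 20)


def mix(tracks: Sequence[Tuple[Sequence[float], float]]) -> List[float]:
    """Mix (samples, gain_db) pairs into a single mono track.

    Shorter clips are treated as silent once they end. If the sum exceeds
    full scale, the whole mix is scaled down so that the peak is 1.0.
    """
    if not tracks:
        return []
    length = max(len(samples) for samples, _ in tracks)
    mixed = [0.0] * length
    for samples, gain_db in tracks:
        gain = db_to_gain(gain_db)
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    return mixed
```

For example, `mix([(thunder, 0.0), (leaves, -20.0)])` would keep the thunder at its original level and play the leaves at one tenth of their amplitude. How to choose the gain for each description is left open here.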

Image-to-sound

The text-to-image model can get a bit creative, and exploiting this creativity would be a great improvement. The idea is to use the initial prompt together with the generated image to produce a new description of the landscape. This description would then be passed to the language model to create the list of sound descriptions.
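As a rough sketch of this idea, the new description could be built by combining the prompt with a caption of the generated image. The `caption_image` callable is a hypothetical stand-in for whatever image-to-text model would be used; the page does not name one, and the wording of the combined description is likewise an assumption.

```python
from typing import Any, Callable


def enriched_description(
    prompt: str,
    image: Any,
    caption_image: Callable[[Any], str],  # hypothetical image-to-text model
) -> str:
    """Combine the user's prompt with what is actually visible in the image."""
    caption = caption_image(image).strip()
    if not caption:
        # Fall back to the original prompt if the captioner returns nothing.
        return prompt
    return f"{prompt}\nDetails visible in the generated image: {caption}"


def sound_list_request(description: str) -> str:
    """Build the language-model request from the enriched description."""
    return (
        "Describe the sounds that could appear in this landscape, "
        "and format the answer in a Python list.\n\n"
        f"Landscape: {description}"
    )
```

The resulting request can then go through the same language-model and text-to-sound steps as the text-only approach, so any element the image model added (a waterfall, a flock of birds) has a chance to show up in the sound list.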
