Questions and documentation #7

Open
BradKML opened this issue Apr 15, 2021 · 3 comments

BradKML commented Apr 15, 2021

  1. Which model does Yukarin use for its training?
  2. Are there any target voice training document specifications?
  3. Would public voice datasets help with training?
  4. Does this project work with English datasets?
  5. Why is the example page's voice so "robotic"/"compressed"?

BradKML commented Apr 15, 2021

And in regards to Speech Upsampling or Speech Superresolution:

SinisterSpatula commented Jul 1, 2021

  1. Which model does Yukarin use for its training?

I'm not sure, but I'm guessing it's a GAN: it has a generator and a discriminator that are trained adversarially. I'm new to this stuff, though, and just playing with it as a hobby and learning.

  2. Are there any target voice training document specifications?

It's working for me with 24,000 Hz, 16-bit WAV files made in Audacity. Each clip in an audio pair should be around 15 seconds or less (going slightly over seems okay, as long as your system has enough RAM).
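
For anyone who wants to check a folder of clips against the format described above, here is a minimal sketch using only Python's standard `wave` module. The constants and the function names `describe_wav` and `check_wav` are illustrative assumptions based on this comment, not anything defined by the Yukarin project, and the 15-second limit is a rule of thumb rather than a hard requirement.

```python
import wave

# Assumed targets, taken from the comment above; not official project requirements.
EXPECTED_RATE_HZ = 24000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 2 bytes per sample = 16-bit audio
MAX_SECONDS = 15.0


def describe_wav(path):
    """Return (sample_rate_hz, sample_width_bytes, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    return rate, width, duration


def check_wav(path):
    """Return a list of problems found; an empty list means the file matches the targets."""
    rate, width, duration = describe_wav(path)
    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {8 * width} bits, expected 16 bits")
    if duration > MAX_SECONDS:
        problems.append(f"duration is {duration:.1f} s, longer than {MAX_SECONDS} s")
    return problems
```

Calling `check_wav("clip.wav")` on each exported file (the filename is a placeholder) and printing any non-empty result should be enough to catch clips exported at the wrong sample rate or bit depth before training.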

  3. Would public voice datasets help with training?

You could use those if you like. I tried out one that I think is called JSV, and it worked well; I just removed any very short clips. Eventually I switched to using audiobooks and used Audacity to label the segments with a minimum length of 6 seconds (short clips can cause the process to crash). You just need to build a parallel dataset of audio: matching recordings of your own voice and the target voice.
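
As a rough sketch of the clip filtering described above, the following pairs up recordings and drops any pair where either side is too short. It assumes a hypothetical layout in which the source and target folders contain WAV files with identical filenames; that layout, the function names, and the 6-second default are assumptions for illustration, not something the Yukarin tooling is known to require.

```python
import wave
from pathlib import Path

MIN_SECONDS = 6.0  # assumption: the minimum segment length mentioned above


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def usable_pairs(source_dir, target_dir, min_seconds=MIN_SECONDS):
    """Pair WAV files that share a filename in both folders, skipping short clips."""
    pairs = []
    for source in sorted(Path(source_dir).glob("*.wav")):
        target = Path(target_dir) / source.name
        if not target.exists():
            continue  # no parallel recording exists for this clip
        if min(wav_seconds(source), wav_seconds(target)) < min_seconds:
            continue  # one side is too short, so drop the whole pair
        pairs.append((source, target))
    return pairs
```

Dropping the whole pair, rather than only the short file, keeps the two sides of the dataset aligned one to one.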

  4. Does this project work with English datasets?

Yes, I can confirm it does. If you want to hear a sample, I'll be sharing my English results in the Yukarin Discord. I had decent results with 212 audio pairs (some phonemes were silent or missing, and the audio was more wobbly) and noticeably better, very good results with 512. I might try 1,000 in the future.

  5. Why is the example page's voice so "robotic"/"compressed"?

It might be because the page was only showing the output of stage 1 training; I'm unsure. To me, the second stage of training (pix2pix, I think, where it generates a higher-quality sound by turning the audio into a picture) seems to really bring the quality and naturalness back. I learned not to judge it too much on the stage 1 quality; wait for the second stage to truly appreciate what it can do. It's very impressive IMO. I have not tried the real-time conversion yet, but I'm going to soon. It could be that the real-time conversion has lower quality to speed up processing. I'm hoping I can achieve the quality I've seen in my test output WAV files without too much delay, but I'll be finding out soon.

Those repositories you linked are all very cool and interesting; however, this was the only series of projects that seemed to offer real-time conversion. Does anyone know if it's possible to adapt any of those other projects to run in real time? Or did I miss one that actually does offer real-time conversion?
