About the speech rate of generated voice #2

Open
Charlottecuc opened this issue Mar 30, 2022 · 7 comments

Comments


Charlottecuc commented Mar 30, 2022

Hi. I tested the model with the inference Jupyter notebook you provided. It's amazing that the model can still generate a good voice even when a Mandarin source file is fed as input.
However, I noticed that if the speech rate of the source is slow while the speech rate of the target is very fast, the speech rate of the generated voice will also be fast. I was wondering whether it is possible to tune the speech rate so that the generated voice has the same speech rate as the source. Or is the difference in speech rate caused by the mismatch of source language (Mandarin vs. English, for the pretrained ASR model)?
Also, I noticed that if I run inference with a noisy source file (e.g., with air-conditioning noise in the background), there will also be noise in the generated voice. Is there a way to remove the noise? Or could you give any advice on noise-robust training/inference?

Thank you very much~ :)

insunhwang89 (Owner) commented Mar 30, 2022

Thank you for your interest in our research. You asked about two things.

  1. Can the speech rate be adjusted according to the speed of the given target speech?
  • Our model cannot control rhythm. Rhythm is the property related to the speed of speech, so the speech generated by our model follows the speaking rate of the source speech. From the target speech, only speech features are extracted and used.

  • Additionally, we are doing further research on rhythm. (Rhythm control is an issue that has not yet been resolved in voice conversion.)

  2. When noisy speech is used as input, noise appears in the generated speech.
  • We did not conduct a separate experiment on noise, but I can share our experience. Since noisy speech data was not used when training the vocoder, it is inevitably vulnerable to noise. To address this, the vocoder should be fine-tuned on noisy data (a rough sketch of such augmentation is shown below), or a model that is robust to noise should be used instead.
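As an illustration of the fine-tuning suggestion above, here is a minimal sketch of mixing a background-noise recording into clean vocoder training clips at a chosen SNR. The file names and the 15 dB SNR are placeholders, not values from this repository, and the clips are assumed to be mono:

```python
import numpy as np
import soundfile as sf  # assumed available for audio I/O

def mix_noise(clean, noise, snr_db):
    """Mix a noise clip into a clean clip at the given signal-to-noise ratio (dB)."""
    # Tile or trim the noise so it matches the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical file names; in practice this would run over the whole
# vocoder training set before (or during) fine-tuning.
clean, sr = sf.read("clean_clip.wav")
noise, _ = sf.read("air_conditioner.wav")
sf.write("noisy_clip.wav", mix_noise(clean, noise, snr_db=15.0), sr)
```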

Charlottecuc (Author) commented Mar 31, 2022

Thank you~


skol101 commented Jun 21, 2022

@Charlottecuc @intory89 Then does it make sense to introduce audiomentations during vocoder training?

There was a suggestion in yl4579/StarGANv2-VC#21 that it is the VC model that should be fed corrupted inputs, not the vocoder.
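For reference, a rough sketch of what that augmentation could look like with the audiomentations package is below. The noise directory, file name, and probabilities are placeholders, not values from this repository:

```python
import soundfile as sf
from audiomentations import Compose, AddGaussianNoise, AddBackgroundNoise

# Augmentation pipeline: synthetic Gaussian noise plus real background recordings.
# "noise_dir/" is a placeholder for a folder of noise recordings (e.g., air conditioning).
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    AddBackgroundNoise(sounds_path="noise_dir/", p=0.5),
])

samples, sample_rate = sf.read("train_clip.wav")  # hypothetical training clip
noisy = augment(samples=samples, sample_rate=sample_rate)
```

Whether this belongs in the vocoder's data loader or in the VC model's training pipeline is exactly the question raised in the linked issue.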


skol101 commented Nov 24, 2022

@Charlottecuc is right here -- the model doesn't follow the speed of the source speech for UNSEEN speakers.

insunhwang89 (Owner) commented:

Our model did not consider rhythm among the speaker characteristics. Please refer to SpeechSplit for related research.

Superman-Valencia commented:

Hi, I also tried training the model on a Mandarin dataset, so I have a couple of questions.
Did you revise any part of the model?
Does the dataset need a corresponding transcript?

