About the speech rate of generated voice #2

Open
Charlottecuc opened this issue Mar 30, 2022 · 7 comments

Comments


Charlottecuc commented Mar 30, 2022

Hi. I tested the model with the inference Jupyter notebook you provided. It's amazing that the model can still generate a good voice even when a Mandarin source file is fed as input.
However, I noticed that if the speech rate of the source is slow while the speech rate of the target is very fast, the speech rate of the generated voice will also be fast. I was wondering whether it is possible to tune the speech rate so that the generated voice has the same speech rate as the source. Or is the difference in speech rate caused by the mismatch of source language (Mandarin vs. English, for the pretrained ASR model)?
Also, I noticed that if I run inference with a noisy source file (e.g., with air-conditioning noise in the background), there will also be noise in the generated voice. Is there a way to remove the noise? Or could you give any advice on noise-robust training/inference?

Thank you very much~ :)

insunhwang89 (Owner) commented Mar 30, 2022

Thank you for your interest in our research. You asked about two things.

  1. Can the speech rate be adjusted according to the speed of the given target speech?
  • Our model cannot control rhythm. Rhythm is the property related to the speed of speech, so the speech generated by our model follows the speaking rate of the source speech. From the target speech, only speech features are extracted and used.

  • Additionally, we are doing further research on rhythm. (Rhythm control is an issue that has not yet been resolved in voice conversion.)

  2. When noisy speech is used as input, noise appears in the generated speech.
  • We did not conduct a separate experiment on noise, but I can share our experience. Since noisy speech data was not used when training the vocoder, it is inevitably vulnerable to noise. To address this, the vocoder should be fine-tuned on noisy data (a rough sketch of such augmentation is shown below), or a model that is robust to noise should be used instead.
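As an illustration of the fine-tuning suggestion above, here is a minimal sketch of mixing a background-noise recording into clean vocoder training clips at a chosen SNR. The file names and the 15 dB SNR are placeholders, not values from this repository, and the clips are assumed to be mono:

```python
import numpy as np
import soundfile as sf  # assumed available for audio I/O

def mix_noise(clean, noise, snr_db):
    """Mix a noise clip into a clean clip at the given signal-to-noise ratio (dB)."""
    # Tile or trim the noise so it matches the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical file names; in practice this would run over the whole
# vocoder training set before (or during) fine-tuning.
clean, sr = sf.read("clean_clip.wav")
noise, _ = sf.read("air_conditioner.wav")
sf.write("noisy_clip.wav", mix_noise(clean, noise, snr_db=15.0), sr)
```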

Charlottecuc (Author) commented Mar 31, 2022

Thank you~


skol101 commented Jun 21, 2022

@Charlottecuc @intory89 Then does it make sense to introduce audiomentations during vocoder training?

There was a suggestion in yl4579/StarGANv2-VC#21 that it is the VC model that should be fed corrupted inputs, not the vocoder.
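For reference, a rough sketch of what that augmentation could look like with the audiomentations package is below. The noise directory, file name, and probabilities are placeholders, not values from this repository:

```python
import soundfile as sf
from audiomentations import Compose, AddGaussianNoise, AddBackgroundNoise

# Augmentation pipeline: synthetic Gaussian noise plus real background recordings.
# "noise_dir/" is a placeholder for a folder of noise recordings (e.g., air conditioning).
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    AddBackgroundNoise(sounds_path="noise_dir/", p=0.5),
])

samples, sample_rate = sf.read("train_clip.wav")  # hypothetical training clip
noisy = augment(samples=samples, sample_rate=sample_rate)
```

Whether this belongs in the vocoder's data loader or in the VC model's training pipeline is exactly the question raised in the linked issue.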


skol101 commented Nov 24, 2022

@Charlottecuc is right here -- the model doesn't follow the speed of the source speech for UNSEEN speakers.

insunhwang89 (Owner) commented:

Our model did not consider rhythm among the speaker characteristics. Please refer to SpeechSplit for related research.

Superman-Valencia commented:

Hi, I also tried training the model on a Mandarin dataset, so I have a couple of questions.
Did you revise any part of the model?
Does the dataset need a corresponding transcript?

