Thanks for the great work! I've been working with the model as a backbone for Mandarin TTS.
Recently I noticed a couple of problems with the pretrained text aligner, and I wonder whether they could disturb the generation of the ground-truth (gt) durations and the training of the duration predictor (I've been seeing phrase-break defects in my synthesized speech):
It seems the AuxiliaryASR was trained with a different symbol dictionary, one that includes the "sos", "eos" and "unk" tokens. Normally this would not be a problem, but since the aligner is fine-tuned in the first stage using TMA, I reckon there could be confusion in the ASRS2S decoding?
https://github.com/yl4579/StyleTTS/blob/main/Utils/ASR/models.py#L128
Here the text input is randomly masked with "unk" tokens, whose index is set to 3, yet in the TTS model a 3 in the text input points to the comma (","). I guess this isn't much of a problem overall, but I wonder whether it suggests a mismatch in text processing between the pretraining (as AuxiliaryASR) and fine-tuning (in the TTS model) stages.
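To make the concern concrete, here is a minimal sketch of the index collision. It is not the StyleTTS code: `TTS_SYMBOLS`, `random_mask`, `encode` and `decode` are hypothetical names, and the symbol table is made up except for the one fact stated above, that index 3 is the comma on the TTS side while the masking step uses 3 for "unk".

```python
import random

# Hypothetical TTS symbol table. Only "index 3 is the comma" comes from the
# issue text; every other entry is an assumption for illustration.
TTS_SYMBOLS = ["$", ";", ":", ",", ".", " ", "a", "b", "c", "d", "e"]
UNK_INDEX = 3  # index the ASR masking step uses for "unk", per the issue


def encode(text, symbols):
    """Map each character to its index in the symbol table."""
    index = {s: i for i, s in enumerate(symbols)}
    return [index[ch] for ch in text]


def decode(token_ids, symbols):
    """Map indices back to symbols using the given table."""
    return "".join(symbols[t] for t in token_ids)


def random_mask(token_ids, mask_prob, unk_index, rng):
    """Replace each token id with unk_index with probability mask_prob."""
    return [unk_index if rng.random() < mask_prob else t for t in token_ids]


if __name__ == "__main__":
    rng = random.Random(0)
    ids = encode("abc de", TTS_SYMBOLS)
    masked = random_mask(ids, 0.5, UNK_INDEX, rng)
    print("original:", decode(ids, TTS_SYMBOLS))
    # Under the TTS table every masked position reads as a comma.
    print("masked  :", decode(masked, TTS_SYMBOLS))
```

If the two stages really do share index 3 for different symbols, the masked positions would be indistinguishable from genuine commas during fine-tuning, which is the mismatch I am asking about.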
The gt durations derived from the text aligner seem problematic at phrase breaks (where the text contains a corresponding blank): the aligner tends to assign a long duration to the last phoneme before the break rather than to the break (the blank in the text) itself. For example, in an utterance of "abc de" with a 10-frame pause between "abc" and "de", the derived gt durations would probably have about 10 frames on "c" and only 2 or so on " ", whereas we would expect the 10 frames of pause to be assigned to " ".
I figure this is an inherent problem with the ASR-based alignment approach, since an ASR model is not meant to identify blanks and the CTC loss treats blanks in its own particular way. But it seems crucial to have correct gt durations for phrase breaks in order to expand phonemes into frames correctly.
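The "abc de" example can be sketched as follows. This is only an illustration under assumptions: `durations_from_path` and `expand` are hypothetical helpers, a monotonic frame-to-symbol path stands in for whatever the aligner actually produces, and the frame counts are made up (3 frames per phoneme, with 11 and 2 in place of "10 and 2 or so" so that both paths cover the same 25 frames).

```python
from collections import Counter


def durations_from_path(path, num_tokens):
    """Count how many frames a monotonic alignment path assigns to each token."""
    counts = Counter(path)
    return [counts.get(i, 0) for i in range(num_tokens)]


def expand(tokens, durations):
    """Repeat each token by its duration to get a frame-level sequence."""
    return [tok for tok, d in zip(tokens, durations) for _ in range(d)]


tokens = list("abc de")  # index 2 is "c", index 3 is the blank " "

# What the aligner seems to produce: the pause is absorbed by "c".
observed_path = [0] * 3 + [1] * 3 + [2] * 11 + [3] * 2 + [4] * 3 + [5] * 3
# What one would expect: the pause frames belong to " ".
expected_path = [0] * 3 + [1] * 3 + [2] * 3 + [3] * 10 + [4] * 3 + [5] * 3

observed = durations_from_path(observed_path, len(tokens))
expected = durations_from_path(expected_path, len(tokens))

if __name__ == "__main__":
    print("observed:", observed)
    print("expected:", expected)
    # Frame-level expansion driven by the observed durations stretches "c"
    # across the silence instead of holding the blank.
    print("".join(expand(tokens, observed)))
```

Both paths have the same total length, so the difference only shows up in which symbol the pause frames are attributed to, and that is what the duration predictor would then learn.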
I hope I'm not getting anything wrong here. Is there a way to fix these potential problems?