
Questions regarding the pretrained text aligner #80

Open
gnitoah opened this issue Jan 2, 2025 · 0 comments
gnitoah commented Jan 2, 2025

Hi,

Thanks for the great work! I've been working with the model as a backbone for Mandarin TTS.

Recently I noticed a couple of problems with the pretrained text aligner, and I wonder whether they could interfere with the generation of the ground-truth (gt) durations and the training of the duration predictor (I've seen defects around phrase breaks in my synthesized speech):

  1. It seems the AuxiliaryASR was trained with a different symbol dictionary, one that includes the "sos", "eos", and "unk" tokens. Normally this would not cause a problem, but since the aligner is fine-tuned in the first stage using TMA, I suspect it could cause confusion in ASRS2S decoding:
    https://github.com/yl4579/StyleTTS/blob/main/Utils/ASR/models.py#L128
    Here part of the text input is randomly masked by filling it with "unk" tokens, whose index is set to 3, yet for the TTS model a 3 in the text input points to the comma (","). I guess this won't be a major problem overall, but it may indicate a mismatch in text processing between the pretraining stage (as AuxiliaryASR) and the fine-tuning stage (inside the TTS model).

  2. The gt durations derived from the text aligner look problematic around phrase breaks (where the text contains a blank): the aligner tends to assign the long pause to the last phoneme before the break rather than to the break (the blank) itself. For example, in an utterance of "abc de" with a 10-frame pause between "abc" and "de", the derived gt durations will likely put about 10 frames on "c" and only 2 on " ", whereas we would expect the 10 pause frames to be assigned to " ".
    I figure this is an inherent problem with the ASR-based alignment approach, since an ASR model is not meant to identify silence, and the CTC loss handles blanks in its own particular way. But correct gt durations for phrase breaks seem crucial for correctly expanding phonemes into frames.
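To make the first point concrete, here is a minimal sketch of the index collision. The two symbol tables below are assumptions for illustration only (the real tables live in the repo's text utilities); the point is that the same integer 3 decodes to "unk" under an ASR-style dictionary but to "," under a TTS-style one, so masking with index 3 during ASRS2S decoding silently injects commas from the TTS model's point of view:

```python
import random

# Hypothetical, simplified symbol tables illustrating the mismatch in point 1.
asr_symbols = ["<pad>", "<sos>", "<eos>", "<unk>"]  # "<unk>" -> index 3
tts_symbols = ["$", ";", ":", ","]                  # ","     -> index 3

UNK_INDEX = 3

def random_mask(text_ids, mask_prob, rng):
    """Replace roughly mask_prob of the input ids with the "unk" index,
    mimicking the random masking applied before ASRS2S decoding."""
    return [UNK_INDEX if rng.random() < mask_prob else t for t in text_ids]

rng = random.Random(0)
masked = random_mask([10, 11, 12, 13], mask_prob=1.0, rng=rng)

# Under the ASR dictionary the masked ids mean "<unk>",
# but read through the TTS dictionary the same ids mean ",".
print([asr_symbols[i] for i in masked])  # ['<unk>', '<unk>', '<unk>', '<unk>']
print([tts_symbols[i] for i in masked])  # [',', ',', ',', ',']
```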
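For the second point, one possible workaround (a hedged sketch, not anything from the repository) is a post-processing heuristic: detect trailing low-energy frames aligned to the phoneme just before a blank, and reattribute them to the blank. The function name, the energy-threshold silence test, and the toy numbers below are all assumptions:

```python
def reassign_pause_frames(durations, tokens, frame_energy, blank=" ", thresh=0.1):
    """durations[i] = number of frames aligned to tokens[i].
    If tokens[i+1] is a blank, trailing frames of tokens[i] whose energy
    falls below `thresh` are moved onto the blank. Boundaries are computed
    from the original durations; a corrected copy is returned."""
    out = list(durations)
    start = 0
    for i, d in enumerate(durations):
        end = start + d
        if i + 1 < len(tokens) and tokens[i + 1] == blank:
            # Walk backwards over this token's frames while they look silent.
            moved = 0
            for f in range(end - 1, start - 1, -1):
                if frame_energy[f] < thresh:
                    moved += 1
                else:
                    break
            out[i] -= moved
            out[i + 1] += moved
        start = end
    return out

# Toy example mirroring "abc de": "c" got 12 frames, of which the last 10
# are near-silent, while the following blank got only 2.
tokens = ["a", "b", "c", " ", "d", "e"]
durs = [3, 3, 12, 2, 3, 3]
energy = [1.0] * 8 + [0.0] * 10 + [1.0] * 8  # 26 frames in total
print(reassign_pause_frames(durs, tokens, energy))  # [3, 3, 2, 12, 3, 3]
```

This only patches the symptom after alignment; the total frame count per utterance is preserved, so the expanded phoneme sequence still matches the mel length.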

I hope I'm not getting anything wrong here. Is there a way to fix these potential problems?
