
Questions regarding the pretrained text aligner #80

Open
gnitoah opened this issue Jan 2, 2025 · 0 comments
gnitoah commented Jan 2, 2025

Hi,

Thanks for the great work! I've been working with the model as a backbone for Mandarin TTS.

Recently I noticed a couple of problems with the pretrained text aligner, and I wonder whether they could interfere with the generation of the ground-truth (gt) durations and the training of the duration predictor (I've seen defects around phrase breaks in my synthesized speech):

  1. It seems the AuxiliaryASR was trained with a different symbol dictionary, one that includes the "sos", "eos", and "unk" tokens. Normally this would not cause a problem, but since the aligner is fine-tuned in the first stage using TMA, I suspect it could cause confusion in ASRS2S decoding:
    https://github.com/yl4579/StyleTTS/blob/main/Utils/ASR/models.py#L128
    Here part of the text input is randomly masked by filling it with "unk" tokens, whose index is set to 3, yet for the TTS model a 3 in the text input points to the comma (","). I guess this won't be a major problem overall, but it may indicate a mismatch in text processing between the pretraining stage (as AuxiliaryASR) and the fine-tuning stage (inside the TTS model).

  2. The gt durations derived from the text aligner look problematic around phrase breaks (where the text contains a blank): the aligner tends to assign the long pause to the last phoneme before the break rather than to the break (the blank) itself. For example, in an utterance of "abc de" with a 10-frame pause between "abc" and "de", the derived gt durations will likely put about 10 frames on "c" and only 2 on " ", whereas we would expect the 10 pause frames to be assigned to " ".
    I figure this is an inherent problem with the ASR-based alignment approach, since an ASR model is not meant to identify silence, and the CTC loss handles blanks in its own particular way. But correct gt durations for phrase breaks seem crucial for correctly expanding phonemes into frames.
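To make the first point concrete, here is a minimal sketch of the index collision. The two symbol tables below are assumptions for illustration only (the real tables live in the repo's text utilities); the point is that the same integer 3 decodes to "unk" under an ASR-style dictionary but to "," under a TTS-style one, so masking with index 3 during ASRS2S decoding silently injects commas from the TTS model's point of view:

```python
import random

# Hypothetical, simplified symbol tables illustrating the mismatch in point 1.
asr_symbols = ["<pad>", "<sos>", "<eos>", "<unk>"]  # "<unk>" -> index 3
tts_symbols = ["$", ";", ":", ","]                  # ","     -> index 3

UNK_INDEX = 3

def random_mask(text_ids, mask_prob, rng):
    """Replace roughly mask_prob of the input ids with the "unk" index,
    mimicking the random masking applied before ASRS2S decoding."""
    return [UNK_INDEX if rng.random() < mask_prob else t for t in text_ids]

rng = random.Random(0)
masked = random_mask([10, 11, 12, 13], mask_prob=1.0, rng=rng)

# Under the ASR dictionary the masked ids mean "<unk>",
# but read through the TTS dictionary the same ids mean ",".
print([asr_symbols[i] for i in masked])  # ['<unk>', '<unk>', '<unk>', '<unk>']
print([tts_symbols[i] for i in masked])  # [',', ',', ',', ',']
```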
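For the second point, one possible workaround (a hedged sketch, not anything from the repository) is a post-processing heuristic: detect trailing low-energy frames aligned to the phoneme just before a blank, and reattribute them to the blank. The function name, the energy-threshold silence test, and the toy numbers below are all assumptions:

```python
def reassign_pause_frames(durations, tokens, frame_energy, blank=" ", thresh=0.1):
    """durations[i] = number of frames aligned to tokens[i].
    If tokens[i+1] is a blank, trailing frames of tokens[i] whose energy
    falls below `thresh` are moved onto the blank. Boundaries are computed
    from the original durations; a corrected copy is returned."""
    out = list(durations)
    start = 0
    for i, d in enumerate(durations):
        end = start + d
        if i + 1 < len(tokens) and tokens[i + 1] == blank:
            # Walk backwards over this token's frames while they look silent.
            moved = 0
            for f in range(end - 1, start - 1, -1):
                if frame_energy[f] < thresh:
                    moved += 1
                else:
                    break
            out[i] -= moved
            out[i + 1] += moved
        start = end
    return out

# Toy example mirroring "abc de": "c" got 12 frames, of which the last 10
# are near-silent, while the following blank got only 2.
tokens = ["a", "b", "c", " ", "d", "e"]
durs = [3, 3, 12, 2, 3, 3]
energy = [1.0] * 8 + [0.0] * 10 + [1.0] * 8  # 26 frames in total
print(reassign_pause_frames(durs, tokens, energy))  # [3, 3, 2, 12, 3, 3]
```

This only patches the symptom after alignment; the total frame count per utterance is preserved, so the expanded phoneme sequence still matches the mel length.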

I hope I'm not getting anything wrong here. Is there a way to fix these potential problems?
