Hi @cschaefer26,

thanks for your great repository.

Unfortunately I get really bad results, and I think the reason is bad alignment.

I am training the models on a German dataset containing 900 samples, each between 5 and 30 seconds long. The sampling rate is 22050 Hz and the audio is 16-bit mono. I ran your preprocessing step.

My TensorBoard looks like this (as you can see, there is no alignment).

What is the reason for this, and how can I solve it?

Any help is greatly appreciated, thanks in advance!
Hi, could you show the attention score? The generated attention does not matter; what's used for duration extraction is the ground-truth-aligned one. 900 samples is quite few for building attention with Tacotron - what language are the samples in, and are you using phonemes? For a small dataset like this, you could pretrain a Tacotron model on a different dataset until attention is built up and then continue training on the smaller dataset. It could also make sense to set trim_long_silences=True and vad_max_silence_length=6 or so to shorten the silent parts in the audios, which helps attention build up.
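For context, the idea behind trim_long_silences is to run a voice-activity detector over each waveform and drop silent stretches that exceed a maximum length, so the model sees tighter speech boundaries. Below is a minimal, self-contained sketch of that idea using the webrtcvad package; the function name and parameters mirror the settings mentioned above, but this is only an illustration under those assumptions, not the repository's actual implementation.

```python
# Illustration only: VAD-based trimming of long silences.
# Not the repo's code; webrtcvad and the parameter names are assumptions.
import numpy as np
import webrtcvad


def trim_long_silences(wav: np.ndarray,
                       sample_rate: int = 16000,
                       vad_window_ms: int = 30,
                       max_silence_frames: int = 6) -> np.ndarray:
    """Remove silent stretches longer than max_silence_frames VAD windows."""
    vad = webrtcvad.Vad(3)  # 3 = most aggressive speech/non-speech split
    samples_per_window = sample_rate * vad_window_ms // 1000

    # webrtcvad expects 16-bit mono PCM frames of 10/20/30 ms.
    pcm = (np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16)

    voiced_flags = []
    for start in range(0, len(pcm) - samples_per_window + 1, samples_per_window):
        frame = pcm[start:start + samples_per_window]
        voiced_flags.append(vad.is_speech(frame.tobytes(), sample_rate))

    # Keep all voiced windows; keep silent windows only while the current
    # silent run is shorter than max_silence_frames.
    keep = np.zeros(len(pcm), dtype=bool)
    silent_run = 0
    for i, voiced in enumerate(voiced_flags):
        start = i * samples_per_window
        if voiced:
            silent_run = 0
            keep[start:start + samples_per_window] = True
        else:
            silent_run += 1
            if silent_run <= max_silence_frames:
                keep[start:start + samples_per_window] = True

    # Keep any trailing samples that did not fill a full VAD window.
    keep[len(voiced_flags) * samples_per_window:] = True
    return wav[keep]
```

With vad_max_silence_length=6 and 30 ms windows, anything beyond roughly 180 ms of continuous silence would be cut, which is the kind of tightening that tends to help attention form.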