max_len doesn't crop samples properly #290
Comments
Hello @yl4579. Could you explain it, please?
I'm afraid @yl4579 left the community around the time this repository was last updated. He and most of the initial contributors no longer respond to questions. You might find some answers if you also post this in Discussions; however, the community seems to have moved on to their own versions of StyleTTS2, including some commercial forks that don't contribute back, which is a shame really. But that's the state of things right now.
I think there's nothing wrong with the code itself and it's working as intended. The purpose of that line is probably not to take the longest sample in the batch, but rather to ensure that no sample in your batch goes beyond that threshold. The author's previous works behave in a similar way. I've tried doing it the other way, padding or trimming all the samples so that they are always at max_len; as one would expect, this drastically increases memory consumption if you use a max_len close to 10 seconds of audio. Unless I'm confused about what you're trying to say, it's not a good idea to do that.
Hi. It seems that max_len doesn't work properly.
mel_len should be computed with mel_input_length_all.max(), not mel_input_length_all.min().
As a result, the crop length is driven by the shortest sample in the batch. With this formula, max_len is used only when the minimum length in the batch is greater than max_len.
For example, if max_len == 400, the maximum mel length in the batch is 600, and the minimum is 92, this formula gives mel_len = min(92, 400) = 92.
Thus, no sample in the clipped batch is longer than 92 frames, because every sample is cropped to mel_len.
It means that we always train on segments as short as the shortest sample in the batch. Here are some shapes, for example:
27600 / 300 = 92 (300 is the hop length)
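The arithmetic in this example can be checked with a small sketch. It is a simplified stand-in, not the repository's training code: the helper name crop_length and the batch of lengths are hypothetical, and any extra scaling the real script applies is ignored.

```python
def crop_length(lengths, max_len, reduce=min):
    """Return the crop length for a batch, capped at max_len.

    `reduce` picks which sample drives the crop: min (the behavior
    described in this issue) or max (the proposed alternative).
    """
    return min(reduce(lengths), max_len)


# Hypothetical mel lengths (in frames) for one batch.
lengths = [92, 250, 600]
max_len = 400

print(crop_length(lengths, max_len, reduce=min))  # 92
print(crop_length(lengths, max_len, reduce=max))  # 400
```

With min, the shortest sample decides the crop length for the whole batch; with max, the cap max_len applies as soon as any sample exceeds it.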
Also, random_start crops the beginning of samples that are shorter than max_len, and padding is used instead.
Moreover, we skip many samples.
To fix it, we should crop only the samples whose length is greater than max_len.
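A minimal sketch of that proposal, assuming each sample is a plain sequence of frames. The function crop_sample is hypothetical and is not part of StyleTTS2; it only illustrates cropping samples that exceed max_len while leaving shorter ones untouched.

```python
import random


def crop_sample(frames, max_len, rng=random):
    """Crop to max_len frames from a random start; keep shorter samples whole."""
    if len(frames) <= max_len:
        return frames
    # random.randint is inclusive on both ends, so the window always fits.
    start = rng.randint(0, len(frames) - max_len)
    return frames[start:start + max_len]


# Hypothetical batch: one short sample (92 frames) and one long one (600 frames).
batch = [list(range(92)), list(range(600))]
cropped = [crop_sample(sample, 400) for sample in batch]

print([len(sample) for sample in cropped])  # [92, 400]
```

The cropped samples still differ in length, so they would need padding (and probably masking) before being stacked into one tensor, which is where the memory cost mentioned in the comments comes from.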
Did I find a bug, or am I misunderstanding something?