Hi @cschaefer26,
Nice work! I'm using your repo. However, when aligning longer audio (> 1 minute) with its character (phone) sequence at inference time, the number of predicted values in the duration file (.npy) does not match the number of characters (phones) I pass in along with the audio file. What could be the problem here? I want to use a pretrained model (trained on a Bangla dataset of audio and phoneme sequences) for phoneme duration prediction, so accuracy is a major concern for me. A minimal sanity check I run after extraction is shown below.
Note: for training I used longer audio files of 10-15 seconds with their corresponding transcriptions (phoneme sequences), and I customized your code (preprocess.py and extract_durations.py) to run inference on a single audio file and its transcription.
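For reference, this is roughly the check I use to compare the predicted durations with the input phonemes; the file paths are placeholders for my own data layout, and I assume the phoneme sequence is stored space-separated in a text file:

```python
import numpy as np

# Placeholder paths, adjust to your own data layout.
duration_path = 'durations/sample_0001.npy'
phoneme_path = 'phonemes/sample_0001.txt'

durations = np.load(duration_path)  # one value per phoneme
with open(phoneme_path, encoding='utf-8') as f:
    phonemes = f.read().split()     # assumes space-separated phonemes

print(f'phonemes: {len(phonemes)}, durations: {len(durations)}')
if len(phonemes) != len(durations):
    print('Mismatch: some input symbols were probably dropped during preprocessing.')
```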
Hi, did you ensure that all the audio files were preprocessed before training? Because the preprocessing builds up a phoneme set from the training data, I'd suspect that you are applying the model to new files containing unknown phonemes that get filtered out (that's just a guess).
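If it helps, a quick check could look something like this; the pickle path is just an example, it depends on how you stored the phoneme set during preprocessing:

```python
import pickle

# Example only: load whatever phoneme/symbol set your preprocessing step produced.
with open('data/phoneme_set.pkl', 'rb') as f:
    known_phonemes = set(pickle.load(f))

# Phoneme sequence of the new utterance you want to align (space-separated text file).
with open('phonemes/new_utterance.txt', encoding='utf-8') as f:
    new_phonemes = f.read().split()

unknown = sorted({p for p in new_phonemes if p not in known_phonemes})
print('symbols that would be filtered out:', unknown)
```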
Hi @cschaefer26, your guess is correct: I applied the model to new files containing unknown phonemes. Thanks for your reply. However, when I align an audio file that contains intermediate silences (which are inherent to the recording) with its phoneme sequence, the accuracy of the predicted phone durations is quite low, because the intermediate silence gets merged into the durations of the neighbouring phones. Any suggestions, please?
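For context, I am currently experimenting with detecting the silent spans up front, roughly like this; the top_db value, the hop length, and the idea of later inserting a dedicated pause token into the phoneme sequence are just my own guesses:

```python
import librosa

# Rough sketch: find the non-silent spans so the silences can be handled explicitly,
# e.g. by inserting a pause token into the phoneme sequence at the corresponding frames.
wav, sr = librosa.load('audio/new_utterance.wav', sr=22050)
intervals = librosa.effects.split(wav, top_db=40)  # (start, end) sample indices of speech

hop_length = 256  # should match the hop length used for mel extraction
for start, end in intervals:
    print(f'speech from frame {start // hop_length} to {end // hop_length}')
```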