different results output during training compared to test.py #15
I'm trying to reproduce some of the results I obtained during training by using the test.py script. Continuing to dig into this, but wondering if anyone else has come across the same issue?
Comments
So I think I have tracked it down to the helper function (decoder_helper) in the decoder part of Tacotron, but I'm still a bit of a novice so I don't really understand how to fix it. I've isolated it here by running test.py with the training data and selectively turning off specific areas that utilize the
I opened an issue regarding the same problem and was told that in evaluation, unlike training, the output of the decoder is fed back as the decoder input at the next time step, so the result can be different. I closed the issue after that.
My concern is that the InferenceHelper might not be doing what we think... Maybe the author can help explain why the EmbeddingHelper (either greedy or sampling) wasn't used? Essentially I'm struggling to understand how it can go from sounding very good during training to unintelligible during inference... Edit: Of course it's not using an embedding layer... the output is a float 😝
@onyedikilo Ok, now I see your closed issue and how you ran into the same problem. This level of drop in audio quality just feels wrong. If you (or anyone else) looking into this find anything, I would be super interested to know.
Hi, I ran into the same problem. Do you all have any suggestions? Thanks a lot.
Hi @jpdz, I have stepped away from working on this for the moment, but will return in a few weeks. I don't yet fully understand what the CustomHelper function used in the decoder actually does. In the paper, it says it passes the information at timestep t to timestep t+1, but to me it looks like at timestep t+1 it can see the entire input... but I'm probably missing something. One idea I had is to use the standard decoder, which has an embedding layer. One could multiply the mel spectrogram coefficients by a large number and then cast to an int. This would allow the use of an embedding layer, which is implemented in TensorFlow and has some documentation. I will report back after trying this, but it will likely take me a month to get to it.
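To make that quantization idea concrete, here is a rough, hypothetical sketch; it only illustrates the suggestion above and is not anything the repo does. The scale factor and resulting "vocabulary" size are arbitrary assumptions.

```python
# Hypothetical sketch of the quantization idea above: scale the continuous mel
# values and cast to int so a standard embedding-based decoder could look them
# up. The scale factor (1000) and the clipping range are arbitrary assumptions.
import numpy as np

mel_frame = np.random.uniform(0.0, 1.0, size=(80,)).astype(np.float32)  # stand-in mel frame
tokens = np.clip((mel_frame * 1000).astype(np.int32), 0, 999)           # integer ids for an embedding lookup
print(tokens[:5])
```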
Hey, very sorry I'm only just getting back to you all on this. Although I agree it's annoying, it does make sense that there's a big drop-off in quality, even on the same prompt, when running train.py vs test.py.

As in the original paper, the repo does not use scheduled sampling (although it is a configurable parameter in tacotron.py). This means that in training, the decoder is given the ground truth input at every time step. When we test we don't have access to the ground truth, so the next input at each time step is the output of the previous time step. This will be much noisier since we are unlikely to have perfectly synthesized the previous time step.

I highly encourage you to try changing the scheduled sampling parameter and see if it improves performance, particularly on smaller datasets. I ran a few experiments with it but didn't have the compute to explore it properly. With scheduled sampling probability 1, the output is fed to the input at every time step in both training and testing, so the above problem should go away. The downside is that training will be more difficult, so you may want to reduce the dropout value concurrently.

I wrote InferenceHelper because the TensorFlow seq2seq API is geared towards NLP and so only provides inference helpers which sample from an embedding -- they pick the most likely next discrete word. Here the decoder outputs a continuous function (the mel filters), so we directly pass the previous output into the next input. That's all that the InferenceHelper class does in next_inputs_fn. Each time step does have access to the whole sequence, but that logic is handled in the attention mechanism (AttentionWrapper).

I'll put this in the README this week, but together with the above, the best way to tell if your model is generalizing is to look for monotonicity in the attention plots. You can see these in TensorBoard under the images tab.
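For readers unfamiliar with the seq2seq helper machinery, here is a minimal sketch of the feed-the-previous-output-back mechanism described above, written against TensorFlow 1.x's `tf.contrib.seq2seq.CustomHelper`. It is an illustration only, not the repo's actual InferenceHelper; `batch_size` and `output_dim` are placeholder names.

```python
# Minimal sketch (TF 1.x) of feeding the previous continuous decoder output
# back in as the next input. Illustrative only -- not the repo's InferenceHelper.
# `output_dim` would be something like num_mels * r.
import tensorflow as tf

def make_inference_helper(batch_size, output_dim):
    def initialize_fn():
        # Start decoding from an all-zeros <GO> frame; nothing is finished yet.
        finished = tf.tile([False], [batch_size])
        go_frame = tf.zeros([batch_size, output_dim])
        return finished, go_frame

    def sample_fn(time, outputs, state):
        # Outputs are continuous mel frames, so there is nothing discrete to
        # sample; return dummy sample ids.
        return tf.zeros([batch_size], dtype=tf.int32)

    def next_inputs_fn(time, outputs, state, sample_ids):
        # The key step: this time step's raw output becomes the next input,
        # instead of an embedding lookup of a predicted word id.
        finished = tf.tile([False], [batch_size])
        return finished, outputs, state

    return tf.contrib.seq2seq.CustomHelper(initialize_fn, sample_fn, next_inputs_fn)
```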
@barronalex Hi, I tried changing the scheduled sampling parameter to 0.5 and trained for three days on one GPU; however, the results are not good. BTW, I am a little confused about the sampling parameter: what's the difference between 0.5 and 1? Thanks a lot.
Which dataset are you training on? I just uploaded some weights trained on Nancy with r=2 and scheduled sampling 0.5, which might be a good starting point. With scheduled sampling 0.5, we use the ground truth as the next decoder input half the time, and the previous output half the time. With scheduled sampling 1, we always use the previous output and never the ground truth. This means you should get the same results for training and testing on the same input with scheduled sampling 1, but it will be harder to train the model.
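In TF 1.x terms, that per-step coin flip corresponds to something like the helper below. This is only to make the parameter concrete; the repo may implement the switch differently, and the tensor shapes here are placeholder assumptions.

```python
# Hedged illustration of what the scheduled sampling probability means at each
# decoder step, using tf.contrib.seq2seq.ScheduledOutputTrainingHelper (TF 1.x).
# The repo may implement this differently; shapes below are assumptions.
import tensorflow as tf

num_mels, r = 80, 2
ground_truth_frames = tf.placeholder(tf.float32, [None, None, num_mels * r])  # teacher-forcing frames
frame_lengths = tf.placeholder(tf.int32, [None])                              # valid decoder lengths

helper = tf.contrib.seq2seq.ScheduledOutputTrainingHelper(
    inputs=ground_truth_frames,
    sequence_length=frame_lengths,
    sampling_probability=0.5)  # 0.0 = always ground truth, 1.0 = always previous output
```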
@barronalex I am also using the Nancy dataset, based on your previous code with scheduled sampling 0.5. It has been training for two weeks and still doesn't converge. Did you get some nice results? Thanks a lot!
So on the training set it still sounds poor and there's no alignment? I ended up getting better results with r=2 rather than r=5 and so maybe try that or just pull the repo, restore my weights and continue training? The alignment with the weights I posted is quite good but it could use more training to remove some of the noise. The samples have been updated too so you can get a sense of their quality from that.
@barronalex Thank you so much! I will have a look at it!
Hi @barronalex, the audio clips do indeed sound much better. Are these from inference or from during training? I look forward to getting back to this in a few more weeks after wrapping up some other projects. Thanks again for the extra work and for uploading your examples.
No worries at all! Sorry it's been a while. Those clips are from inference on unseen examples (mostly taken from Arctic and the paper examples). It sounds much better during training.
@barronalex I listened to your updated results. They sound good, I think. I am now using your model and have begun to continue training it.
Which version of TensorFlow are you running?
I have solved this problem: when I run test.py, I merge the two commands into one command.
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [400] rhs shape= [160]
Secondly, I can successfully retrain the model from global step 0; however, when it comes to saving samples during training, a problem occurs. Did I make some mistakes? Thanks a lot.
It seems like you might have the spectrograms saved with r=5. It should work if you rerun 'preprocess.py nancy' with r=2 (which is now the default in audio.py) and then try training again. It's not the best design currently that you have to rerun it, so I'll try and fix that soon.
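For anyone hitting the same mismatch: assuming 80 mel bins as in the paper, the tensor widths line up with the reduction factor, which is presumably why spectrograms preprocessed with r=5 clash with an r=2 setup.

```python
# Hedged explanation of the [400] vs [160] mismatch above, assuming 80 mel
# bins: the decoder emits r frames per step, so the relevant width is 80 * r.
num_mels = 80
print(num_mels * 5)  # 400 -> data/graph built with r=5
print(num_mels * 2)  # 160 -> expected with r=2
```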
@barronalex That's the problem, thanks a lot!