different results output during training compared to test.py #15

Open
mrelich opened this issue Jul 11, 2017 · 19 comments

mrelich commented Jul 11, 2017

I'm trying to reproduce some of the results I obtained during training by using the test.py script, but the outputs come out noticeably different. I'm continuing to dig into this, but I'm wondering if anyone else has come across the same issue?

mrelich commented Jul 11, 2017

So I think I have tracked it down to the helper function (decoder_helper) in the decoder part of tacotron, but I'm still a bit of a novice so I don't really understand how to fix it. I isolated it by running test.py with the training data and selectively turning off specific areas that use the train bool. Going to keep digging, but I think this will need to be fixed for everyone who wants to actually use the models they train 😄

@onyedikilo

I opened an issue regarding the same problem, and I was told that in evaluation, unlike training, the output of the decoder is fed back in as the decoder input at the next time step, so the results can differ. I closed the issue after that.

mrelich commented Jul 11, 2017

My concern is that the InferenceHelper might not be doing what we think... Maybe the author can help out and explain why the EmbeddingHelper (either greedy or sampling) wasn't used?

Essentially I'm struggling to understand how it can go from sounding very good during training to unintelligible during inference...

Edit: Of course it's not using an embedding layer... the output is a float 😝

mrelich commented Jul 11, 2017

@onyedikilo Ok, now I see your closed issue and how you ran into the same problem. This level of drop in audio quality just feels wrong. If you (or anyone else) looking into this finds anything, I would be super interested to know.

jpdz commented Jul 20, 2017

Hi, I ran into the same problem. Do you all have any suggestions? Thanks a lot.

mrelich commented Jul 20, 2017

Hi @jpdz, I have stepped away from working on this for the moment, but will return in a few weeks. I don't yet fully understand what the CustomHelper used in the decoder actually does. The paper says it passes the information at timestep t to timestep t+1, but to me it looks like at timestep t+1 it can see the entire input... I'm probably missing something, though.

One idea I had is to use the standard decoder, which has an embedding layer. One could multiply the mel spectrogram coefficients by a large number and then cast them to ints. This would allow the use of an embedding layer, which is implemented in TensorFlow and has some documentation. I will report back after trying this, but it will likely take me a month to get to it. Roughly what I have in mind is sketched below.
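
A toy numpy sketch of that quantization idea (the bucket count and scaling here are made up, purely for illustration):

```python
import numpy as np

# Hypothetical sketch: scale the (normalized) mel coefficients, cast to ints,
# and use the resulting ids for an embedding lookup.
NUM_BUCKETS = 1024                                      # arbitrary "vocabulary" size
mel = np.random.uniform(-1.0, 1.0, size=(80,))          # one frame of mel coefficients

# Map floats in [-1, 1] to integer ids in [0, NUM_BUCKETS - 1].
ids = np.clip(((mel + 1.0) / 2.0 * (NUM_BUCKETS - 1)).astype(np.int32),
              0, NUM_BUCKETS - 1)
# These ids could then index an embedding matrix of shape [NUM_BUCKETS, embed_dim].
```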

@barronalex (Owner)

Hey, very sorry I'm only just getting back to you all on this.

Although I agree it's annoying, it does make sense that there's a big drop-off in quality, even on the same prompt, when running train.py vs test.py.

As in the original paper, the repo does not use scheduled sampling (although it is a configurable parameter in tacotron.py).

This means that in training, the decoder is given the ground-truth input at every time step. When we test, we don't have access to the ground truth, so the next input at each time step is the output of the previous time step. This will be much noisier, since we are unlikely to have perfectly synthesized the previous time step.
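
In pseudocode, the difference is roughly this (illustrative only, not the repo's actual decoder loop):

```python
# Illustrative only: teacher forcing (training) vs. free-running (inference).
def decode(decoder_step, ground_truth, initial_input, train):
    outputs, prev = [], initial_input
    for t in range(len(ground_truth)):
        out = decoder_step(prev)          # one decoder time step
        outputs.append(out)
        # Training feeds the ground-truth frame; inference feeds our own
        # (noisier) prediction, so the two modes can diverge quickly.
        prev = ground_truth[t] if train else out
    return outputs
```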

I highly encourage you to try changing the scheduled sampling parameter and see if it improves performance, particularly on smaller datasets. I ran a few experiments with it but didn't have the compute to explore it properly. With scheduled sampling probability 1, the output is fed back as the input at every time step in both training and testing, so the above problem should go away. The downside is that training will be more difficult, so you may want to reduce the dropout value concurrently.

I wrote InferenceHelper because the TensorFlow seq2seq API is geared towards NLP and so only provides inference helpers that sample from an embedding -- they pick the most likely next discrete word. Here the decoder outputs a continuous function (the mel filters), so we directly pass the previous output into the next input. That's all the InferenceHelper class does in next_inputs_fn.
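
For reference, here is a stripped-down sketch of that idea against the TF 1.x tf.contrib.seq2seq API (not the repo's exact code; the names and zero start frame are placeholders):

```python
import tensorflow as tf

# Sketch of a continuous-output inference helper: the next decoder input is
# simply the previous decoder output.
def make_inference_helper(batch_size, output_dim):
    def initialize_fn():
        finished = tf.tile([False], [batch_size])
        start_inputs = tf.zeros([batch_size, output_dim])   # "go" frame of zeros
        return finished, start_inputs

    def sample_fn(time, outputs, state):
        # No discrete sampling for continuous outputs; return dummy ids.
        return tf.zeros([batch_size], dtype=tf.int32)

    def next_inputs_fn(time, outputs, state, sample_ids):
        finished = tf.tile([False], [batch_size])            # stopping handled elsewhere
        return finished, outputs, state                      # feed the output straight back in

    return tf.contrib.seq2seq.CustomHelper(initialize_fn, sample_fn, next_inputs_fn)
```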

Each time step does have access to the whole sequence, but that logic is handled in the attention mechanism (AttentionWrapper).

I'll put this in the README this week, but together with the above, the best way to tell if your model is generalizing is to look for monotonicity in the attention plots. You can see these in TensorBoard under the Images tab.
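
For example, something like the following puts the alignment matrices in that tab (assuming `alignments` has shape [batch, decoder_steps, encoder_steps]; not necessarily how the repo wires it up):

```python
import tensorflow as tf

# Alignments collected from the attention mechanism, one matrix per utterance.
# A model that is generalizing well shows a roughly monotonic diagonal.
alignments = tf.placeholder(tf.float32, [None, None, None], name='alignments')
tf.summary.image('attention_alignment', tf.expand_dims(alignments, -1), max_outputs=1)
```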

jpdz commented Jul 28, 2017

@barronalex Hi, I tried changing the scheduled sampling parameter to 0.5 and trained for three days on one GPU, but the results are not good. BTW, I am a little confused about the sampling parameter: what's the difference between 0.5 and 1? Thanks a lot.

barronalex reopened this Aug 3, 2017
@barronalex (Owner)

Which dataset are you training on? I just uploaded some weights trained on Nancy with r=2 and scheduled sampling 0.5, which might be a good starting point.

With scheduled sampling 0.5, we use the ground truth as the next decoder input half the time, and the previous output the other half. With scheduled sampling 1, we always use the previous output and never the ground truth. This means you should get the same results for training and testing on the same input with scheduled sampling 1, but the model will be harder to train.
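
If it helps, this is roughly what that corresponds to with TF 1.x's ScheduledOutputTrainingHelper (a sketch; the repo's own wiring may differ):

```python
import tensorflow as tf

# sampling_probability is the chance that a decoder time step reads back its own
# previous output instead of the ground-truth frame:
# 0.0 = pure teacher forcing, 0.5 = half and half, 1.0 = always feed back the
# previous output (matching test-time behaviour).
decoder_targets = tf.placeholder(tf.float32, [None, None, 80])  # [batch, T, mel_dim]
target_lengths = tf.placeholder(tf.int32, [None])               # [batch]

helper = tf.contrib.seq2seq.ScheduledOutputTrainingHelper(
    inputs=decoder_targets,
    sequence_length=target_lengths,
    sampling_probability=0.5)
```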

jpdz commented Aug 3, 2017

@barronalex I am also using the Nancy dataset, with your previous code and scheduled sampling 0.5. It has been training for two weeks and still doesn't converge. Did you get some nice results? Thanks a lot!

@barronalex (Owner)

So on the training set it still sounds poor and there's no alignment?

I ended up getting better results with r=2 rather than r=5, so maybe try that, or just pull the repo, restore my weights, and continue training?

The alignment with the weights I posted is quite good but it could use more training to remove some of the noise.

The samples have been updated too so you can get a sense of their quality from that.

jpdz commented Aug 3, 2017

@barronalex Thank you so much! I will have a look at it!

mrelich commented Aug 3, 2017

Hi @barronalex, the audio clips do indeed sound much better. Are these from inference or during training?

I look forward to getting back to this in a few more weeks after wrapping up some other projects. Thanks again for the extra work and for uploading your examples.

@barronalex (Owner)

No worries at all! Sorry it's been a while.

Those clips are from inference on unseen examples (mostly taken from Arctic and the paper examples). It sounds much better during training.

jpdz commented Aug 4, 2017

@barronalex I listened to your updated results; they sound good, I think. I am now using your model and have begun to continue training it.
However, when I run test.py with your weights, it shows this problem:
Traceback (most recent call last):
File "test.py", line 89, in
test(model, config, prompts)
File "test.py", line 31, in test
model = model(config, batch_inputs, train=False)
File "/disk5/tacotron2/models/tacotron.py", line 190, in init
self.seq2seq_output, self.output = self.inference(inputs, train)
File "/disk5/tacotron2/models/tacotron.py", line 131, in inference
encoded = ops.CBHG(pre_out, speaker_embed, K=16, c=[128,128,128], gru_units=128)
File "/disk5/tacotron2/models/ops.py", line 60, in CBHG
) for k in range(1, K+1)]
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/convolutional.py", line 376, in conv1d
return layer.apply(inputs)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 492, in apply
return self.call(inputs, *args, **kwargs)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 428, in call
self._assert_input_compatibility(inputs)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 540, in _assert_input_compatibility
str(x.get_shape().as_list()))
ValueError: Input 0 of layer conv1d_1 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 128]
Any suggestions? Thanks a lot~

@barronalex (Owner)

Which version of TensorFlow are you running?

jpdz commented Aug 4, 2017

I have solved that problem; it happened because when I ran test.py I had merged the two commands into one command.
However, I still have a problem when I try to continue training the model based on what you have trained. It seems it fails to load your trained model and continue training it:
Caused by op u'save/Assign_13', defined at:
File "train.py", line 126, in
train(model, config)
File "train.py", line 47, in train
saver = tf.train.Saver(max_to_keep=3, keep_checkpoint_every_n_hours=3)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in init
self.build()
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
restore_sequentially=self._restore_sequentially)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 155, in restore
self.op.get_shape().is_fully_defined())
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 271, in assign
validate_shape=validate_shape)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
use_locking=use_locking, name=name)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [400] rhs shape= [160]
[[Node: save/Assign_13 = Assign[T=DT_FLOAT, _class=["loc:@decoder/decoder/attention_wrapper/output_projection_wrapper/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/decoder/attention_wrapper/output_projection_wrapper/bias/Adam_1, save/RestoreV2_13/_233)]]

Secondly, I can successfully retrain the model from global step 0; however, when it comes to saving samples during training, a problem occurs:
saving weights
saving sample
Traceback (most recent call last):
File "train.py", line 126, in
train(model, config)
File "train.py", line 94, in train
ideal = audio.invert_spectrogram(inputs['stft'][0]*stft_std + stft_mean)
File "/disk5/tacotron2/audio.py", line 68, in invert_spectrogram
spec = reshape_frames(spec, forward=False)
File "/disk5/tacotron2/audio.py", line 30, in reshape_frames
signal = np.reshape(signal, (-1, int(signal.shape[1]/r)))
File "/disk5/vir/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 232, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)
File "/disk5/vir/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 348500 into shape (2562)

Did I make some mistakes? Thanks a lot.

@barronalex (Owner)

It seems like you might have the spectrograms saved with r=5. It should work if you rerun 'preprocess.py nancy' with r=2 (which is now the default in audio.py) and then try the training again.

It's not the best design that you currently have to rerun it, so I'll try to fix that soon.
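
For what it's worth, the reshape error above is what you'd expect when the saved frames were grouped with a different r than audio.py now assumes; a toy numpy illustration (dimensions made up):

```python
import numpy as np

# Frames are grouped r at a time during preprocessing, so the per-row width of
# the saved spectrogram depends on r. Data saved with r=5 can't be un-reshaped
# assuming r=2.
n_bins, n_frames = 1025, 10
saved_r, new_r = 5, 2

spec = np.zeros((n_frames // saved_r, n_bins * saved_r))     # saved with r=5
try:
    # Roughly what reshape_frames(..., forward=False) attempts with the new r.
    spec.reshape(-1, int(spec.shape[1] / new_r))
except ValueError as e:
    print(e)   # e.g. cannot reshape array of size 10250 into shape (2562)
```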

jpdz commented Aug 4, 2017

@barronalex That's the problem, thanks a lot!
