Training Loss explodes when training on NTU-Dataset 120 #4

Open

shubhMaheshwari opened this issue Jan 14, 2022 · 8 comments

@shubhMaheshwari commented Jan 14, 2022

Hey @DegardinBruno

Great work. Thanks for sharing your code!

While training on NTU-120, the generator loss is exploding. We only changed the batch size from 32 to 380.

[Epoch 297/1200] [Batch 162/165] [D loss: 0.198222] [G loss: 4918.882324]
[Epoch 297/1200] [Batch 163/165] [D loss: 0.205873] [G loss: 4918.882324]
[Epoch 297/1200] [Batch 164/165] [D loss: 0.223392] [G loss: 4918.882324]

Do you know why this could be happening?

@DegardinBruno (Owner)

Hello @shubhMaheshwari! Thanks!

GANs are very sensitive to hyperparameters, even to the batch size!
Since we are using a Wasserstein loss with gradient penalty, large batch sizes can affect the training!

Can you try a smaller batch size like 32, 64, or 128, and then come back to us with your results?
P.S. For better gradients, remember to always use a power of 2 as the batch size (2, 4, 8, 16, 32, 64, ...), which has already been shown to work well with GAN architectures.
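
For context, a minimal sketch of a WGAN-GP penalty term (illustrative only, not this repository's exact code): the penalty is averaged over per-sample gradient norms, so the batch statistics directly shape the critic's gradients, which is why the batch size matters more than in a plain GAN.

```python
# Illustrative WGAN-GP gradient penalty (not the repository's exact implementation).
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # One random interpolation coefficient per sample, broadcast over the remaining dims
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = critic(interpolates)
    grads = torch.autograd.grad(
        outputs=d_out,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True,
        retain_graph=True,
    )[0]
    grads = grads.reshape(grads.size(0), -1)
    # Penalise the deviation of each sample's gradient norm from 1, averaged over the batch
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```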

@shubhMaheshwari (Author)

Hey @DegardinBruno,
We tried again with batch size 32, but the generator loss is still exploding.

[Epoch 648/1200] [Batch 1959/1969] [D loss: -0.036133] [G loss: -240803.156250]
[Epoch 648/1200] [Batch 1960/1969] [D loss: 0.047026] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1961/1969] [D loss: 0.007517] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1962/1969] [D loss: -0.058064] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1963/1969] [D loss: 0.203749] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1964/1969] [D loss: -0.175149] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1965/1969] [D loss: 0.327590] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1966/1969] [D loss: -0.348924] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1967/1969] [D loss: -0.372265] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1968/1969] [D loss: -0.245794] [G loss: -526808.625000]

We only made the following changes to the code:

-parser.add_argument("--n_classes", type=int, default=60, help="number of classes for datas
+parser.add_argument("--n_classes", type=int, default=120, help="number of classes for data
-parser.add_argument("--checkpoint_interval", type=int, default=10000, help="interval betwe
+parser.add_argument("--checkpoint_interval", type=int, default=500, help="interval between
-parser.add_argument("--data_path", type=str, default="/media/degar/Data/PhD/Kinetic-GAN/Br
-parser.add_argument("--label_path", type=str, default="/media/degar/Data/PhD/Kinetic-GAN/B
+parser.add_argument("--data_path", type=str, default="/ssd_scratch/cvit/sai.shashank/data/
+parser.add_argument("--label_path", type=str, default="/ssd_scratch/cvit/sai.shashank/data

@DegardinBruno (Owner) commented Jan 17, 2022

@shubhMaheshwari I will repeat the experiments with the information you provided and come back to you with an answer.
Just some questions:

  • Which benchmark are you using, cross-setup or cross-subject?
  • Which mapping network depth did you define in the generator? If you are using the default (4), increase it to at least 8; the default is tuned for NTU-60, and NTU-120 has many more distinct subjects in the training data (check the paper for details). A minimal sketch follows right after this list.
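
For illustration, a minimal sketch of a StyleGAN-style mapping network with configurable depth (illustrative only; in this repository the depth is presumably what the --mlp_dim flag in the full command later in this thread controls):

```python
# Illustrative StyleGAN-style mapping network; depth is the number of fully
# connected layers mapping z -> w (hypothetical code, not the repository's).
import torch.nn as nn

def mapping_network(latent_dim=512, depth=8):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)
```

A deeper mapping (e.g. depth 8 instead of 4) gives the generator more capacity to disentangle the larger variety of subjects in NTU-120.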

Btw, Kinetic-GAN's loss on NTU-120 for the cross-setup benchmark should behave similarly to the plot below, but values may vary due to random initialization.
[loss curve plot]

@shubhMaheshwari (Author) commented Jan 17, 2022

  1. We are using cross-subject.
  2. We are using mapping network depth = 4.

Can you provide a single command to train on NTU-120, similar to the one provided in the README?

python kinetic-gan.py  --data_path path_train_data.npy  --label_path path_train_labels.pkl  --dataset ntu_or_h36m  # check kinetic-gan.py file

Thanks
Shubh

@DegardinBruno (Owner)

Just one small thing: can you show me your loss evolution?
Just run this; it will save a PDF plot in the respective exp folder:

python visualization/plot_loss.py --batches 1970 --runs kinetic-gan --exp -1

Can you provide a single command to train on NTU-120, similar to the one provided in the README?

Here is the entire command that I am running:

python kinetic-gan.py --b1 0.5 --b2 0.999 --batch_size 32 --channels 3 --checkpoint_interval 10000 --data_path /home/degardin/DATASETS/st-gcn/NTU-120/xsub/train_data.npy  --dataset ntu --label_path /home/degardin/DATASETS/st-gcn/NTU-120/xsub/train_label.pkl --lambda_gp 10 --latent_dim 512 --lr 0.0002 --mlp_dim 8 --n_classes 120 --n_cpu 8 --n_critic 5 --n_epochs 1200 --sample_interval 5000 --t_size 64 --v_size 25

@DegardinBruno (Owner)

@shubhMaheshwari This is my loss at this moment. It is normal for it to be high at the beginning and then drop quickly as the model first learns to generate the human structure before learning to synthesise human motion:

[loss curve plot]

@shubhMaheshwari (Author) commented Jan 26, 2022

[loss curve plot]
@DegardinBruno This is the loss curve we are getting. We didn't make any changes to the code.

@DegardinBruno (Owner)

@shubhMaheshwari did you download the data from our server?
I repeated the experiments a second time and nothing seems different from normal.
Also, which torch version are you using?

Try NTU-60 xsub (the code base's default settings) to see if the same thing happens, or even fewer classes like 5 or 10 (the feeder is ready for that as well); a manual way to build such a subset is sketched below.
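
For reference, a minimal sketch of how such a class subset could be built offline (hypothetical helper, not the repository's feeder, and assuming the st-gcn-style layout of an .npy data array plus a pickled (sample_names, labels) pair; verify against the actual feeder format):

```python
# Hypothetical helper: keep only samples whose label is < n_classes and write
# new (hypothetical) output files that can be passed via --data_path/--label_path.
import pickle
import numpy as np

def subset_classes(data_path, label_path, n_classes=10):
    data = np.load(data_path, mmap_mode="r")
    with open(label_path, "rb") as f:
        names, labels = pickle.load(f)
    labels = np.asarray(labels)
    keep = np.flatnonzero(labels < n_classes)  # indices of samples in the first n_classes
    np.save("train_data_subset.npy", np.asarray(data[keep]))
    with open("train_label_subset.pkl", "wb") as f:
        pickle.dump(([names[i] for i in keep], labels[keep].tolist()), f)
```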

Feel free to reach out to me.
