Training Loss explodes when training on NTU-Dataset 120 #4

Open

shubhMaheshwari opened this issue Jan 14, 2022 · 8 comments

@shubhMaheshwari commented Jan 14, 2022

Hey @DegardinBruno

Great work. Thanks for sharing your code!

While training on NTU-120, the generator loss is exploding. We only changed the batch size from 32 to 380.

[Epoch 297/1200] [Batch 162/165] [D loss: 0.198222] [G loss: 4918.882324]
[Epoch 297/1200] [Batch 163/165] [D loss: 0.205873] [G loss: 4918.882324]
[Epoch 297/1200] [Batch 164/165] [D loss: 0.223392] [G loss: 4918.882324]

Do you know why this could be happening?

@DegardinBruno (Owner)

Hello @shubhMaheshwari! Thanks!

GANs are very sensitive to hyperparameters, even to the batch size!
Since we are using a Wasserstein loss with gradient penalty, large batch sizes can affect the training!

Can you try a smaller batch size like 32, 64, or 128, and then come back to us with your results?
P.S. For better gradients, remember to always use a power of 2 as the batch size (2, 4, 8, 16, 32, 64, ...), which has already been shown to work well with GAN architectures.
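
For context, a minimal sketch of a WGAN-GP penalty term (illustrative only, not this repository's exact code): the penalty is averaged over per-sample gradient norms, so the batch statistics directly shape the critic's gradients, which is why the batch size matters more than in a plain GAN.

```python
# Illustrative WGAN-GP gradient penalty (not the repository's exact implementation).
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # One random interpolation coefficient per sample, broadcast over the remaining dims
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = critic(interpolates)
    grads = torch.autograd.grad(
        outputs=d_out,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True,
        retain_graph=True,
    )[0]
    grads = grads.reshape(grads.size(0), -1)
    # Penalise the deviation of each sample's gradient norm from 1, averaged over the batch
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```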

@shubhMaheshwari (Author)

Hey @DegardinBruno,
We tried again with batch size 32, but the generator loss is still exploding.

[Epoch 648/1200] [Batch 1959/1969] [D loss: -0.036133] [G loss: -240803.156250]
[Epoch 648/1200] [Batch 1960/1969] [D loss: 0.047026] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1961/1969] [D loss: 0.007517] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1962/1969] [D loss: -0.058064] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1963/1969] [D loss: 0.203749] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1964/1969] [D loss: -0.175149] [G loss: -439268.718750]
[Epoch 648/1200] [Batch 1965/1969] [D loss: 0.327590] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1966/1969] [D loss: -0.348924] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1967/1969] [D loss: -0.372265] [G loss: -526808.625000]
[Epoch 648/1200] [Batch 1968/1969] [D loss: -0.245794] [G loss: -526808.625000]

We only made the following changes to the code:

-parser.add_argument("--n_classes", type=int, default=60, help="number of classes for datas
+parser.add_argument("--n_classes", type=int, default=120, help="number of classes for data
-parser.add_argument("--checkpoint_interval", type=int, default=10000, help="interval betwe
+parser.add_argument("--checkpoint_interval", type=int, default=500, help="interval between
-parser.add_argument("--data_path", type=str, default="/media/degar/Data/PhD/Kinetic-GAN/Br
-parser.add_argument("--label_path", type=str, default="/media/degar/Data/PhD/Kinetic-GAN/B
+parser.add_argument("--data_path", type=str, default="/ssd_scratch/cvit/sai.shashank/data/
+parser.add_argument("--label_path", type=str, default="/ssd_scratch/cvit/sai.shashank/data

@DegardinBruno (Owner) commented Jan 17, 2022

@shubhMaheshwari I will repeat the experiments with the information you provided and come back to you with an answer.
Just some questions:

  • Which benchmark are you using, cross-setup or cross-subject?
  • Which mapping network depth did you define in the generator? If you are using the default (4), increase it to at least 8; the default is tuned for NTU-60, and NTU-120 has many more distinct subjects in the training data (check the paper for details). A minimal sketch follows right after this list.
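
For illustration, a minimal sketch of a StyleGAN-style mapping network with configurable depth (illustrative only; in this repository the depth is presumably what the --mlp_dim flag in the full command later in this thread controls):

```python
# Illustrative StyleGAN-style mapping network; depth is the number of fully
# connected layers mapping z -> w (hypothetical code, not the repository's).
import torch.nn as nn

def mapping_network(latent_dim=512, depth=8):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)
```

A deeper mapping (e.g. depth 8 instead of 4) gives the generator more capacity to disentangle the larger variety of subjects in NTU-120.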

Btw, Kinetic-GAN's loss on NTU-120 for the cross-setup benchmark should behave similarly to the plot below, but values may vary due to random initialization.
[loss curve plot]

@shubhMaheshwari (Author) commented Jan 17, 2022

  1. We are using cross-subject.
  2. We are using mapping network depth = 4.

Can you provide a single command to train on NTU-120, similar to the one provided in the README?

python kinetic-gan.py  --data_path path_train_data.npy  --label_path path_train_labels.pkl  --dataset ntu_or_h36m  # check kinetic-gan.py file

Thanks
Shubh

@DegardinBruno (Owner)

Just one small thing: can you show me your loss evolution?
Just run this; it will save a PDF plot in the respective exp folder:

python visualization/plot_loss.py --batches 1970 --runs kinetic-gan --exp -1

Can you provide a single command to train on NTU-120, similar to the one provided in the README?

Here is the entire command that I am running:

python kinetic-gan.py --b1 0.5 --b2 0.999 --batch_size 32 --channels 3 --checkpoint_interval 10000 --data_path /home/degardin/DATASETS/st-gcn/NTU-120/xsub/train_data.npy  --dataset ntu --label_path /home/degardin/DATASETS/st-gcn/NTU-120/xsub/train_label.pkl --lambda_gp 10 --latent_dim 512 --lr 0.0002 --mlp_dim 8 --n_classes 120 --n_cpu 8 --n_critic 5 --n_epochs 1200 --sample_interval 5000 --t_size 64 --v_size 25

@DegardinBruno (Owner)

@shubhMaheshwari This is my loss at this moment. It is normal for it to be high at the beginning and then drop quickly as the model first learns to generate the human structure before learning to synthesise human motion:

[loss curve plot]

@shubhMaheshwari (Author) commented Jan 26, 2022

[loss curve plot]
@DegardinBruno This is the loss curve we are getting. We didn't make any changes to the code.

@DegardinBruno (Owner)

@shubhMaheshwari did you download the data from our server?
I repeated the experiments a second time and nothing seems different from normal.
Also, which torch version are you using?

Try NTU-60 xsub (the code base's default settings) to see if the same thing happens, or even fewer classes like 5 or 10 (the feeder is ready for that as well); a manual way to build such a subset is sketched below.
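
For reference, a minimal sketch of how such a class subset could be built offline (hypothetical helper, not the repository's feeder, and assuming the st-gcn-style layout of an .npy data array plus a pickled (sample_names, labels) pair; verify against the actual feeder format):

```python
# Hypothetical helper: keep only samples whose label is < n_classes and write
# new (hypothetical) output files that can be passed via --data_path/--label_path.
import pickle
import numpy as np

def subset_classes(data_path, label_path, n_classes=10):
    data = np.load(data_path, mmap_mode="r")
    with open(label_path, "rb") as f:
        names, labels = pickle.load(f)
    labels = np.asarray(labels)
    keep = np.flatnonzero(labels < n_classes)  # indices of samples in the first n_classes
    np.save("train_data_subset.npy", np.asarray(data[keep]))
    with open("train_label_subset.pkl", "wb") as f:
        pickle.dump(([names[i] for i in keep], labels[keep].tolist()), f)
```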

Feel free to reach out to me.
