
PyTorch Implementation of MoCoGAN

Usage

We are using this dataset, which you need to extract and place all the files in a directory named data.

$ python3 main.py --epochs 40000

NOTE: on a Colab notebook, use the following commands:

!git clone link-to-repo
%run main.py --epochs 40000
usage: main.py [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
               [--pre-train PRE_TRAIN] [--img_size IMG_SIZE] [--data DATA] 
               [--channel CHANNEL] [--hidden HIDDEN] [--dc DC] [--de DE]
               [--lr LR] [--beta BETA] [--trained_path TRAINED_PATH] [--T T]

Start training MoCoGAN.....

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        set batch_size
  --epochs EPOCHS       set num of iterations
  --pre-train PRE_TRAIN
                        set 1 when you use pre-trained models
  --img_size IMG_SIZE   set the input image size of frame
  --data DATA           set the path for the directory containing the dataset
  --channel CHANNEL     set the no. of channels of the frame
  --hidden HIDDEN       set the hidden layer size for gru
  --dc DC               set the size of motion vector
  --de DE               set the size of randomly generated epsilon
  --lr LR               set the learning rate
  --beta BETA           set the beta for the optimizer
  --trained_path TRAINED_PATH
                        set the path where the trained models are saved
  --T T                 set the no. of frames to be selected

Contributed by:

References

  • Title: MoCoGAN: Decomposing Motion and Content for Video Generation
  • Authors: Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
  • Link: https://arxiv.org/pdf/1707.04993.pdf
  • Year: 2017

Summary

Introduction

Visual signals in a video can be divided into content and motion. Content specifies which objects are in the video, while motion describes their dynamics. Based on this observation, the MoCoGAN framework was proposed. It generates a video by mapping a sequence of randomly generated vectors to a sequence of video frames, where each random vector consists of a content part and a motion part.

To learn motion and content in an unsupervised manner, we introduce an adversarial learning scheme utilizing both an image discriminator and a video discriminator.

GANs

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two ‘adversarial’ models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. Both G and D can be non-linear mapping functions, such as multi-layer perceptrons.
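For reference, this is the standard two-player minimax objective from the original GAN paper, which MoCoGAN extends:

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]
```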

Motion And Content Decomposition GAN

In MoCoGAN, we assume a latent space of images Z_I ≡ R^d where each point z ∈ Z_I represents an image, and a video of K frames is represented by a path of length K in the latent space, [z^(1), ..., z^(K)]. By adopting this formulation, videos of different lengths can be generated by paths of different lengths. We further assume that Z_I is decomposed into a content subspace Z_C and a motion subspace Z_M. The content subspace models motion-independent appearance in videos, while the motion subspace models motion-dependent appearance in videos.
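Concretely, each frame's latent code concatenates the video's content code (sampled once per video) with that frame's motion code:

```latex
z^{(k)} = \left[ z_C,\; z_M^{(k)} \right], \quad k = 1, \dots, K,
\qquad z_C \in Z_C \equiv \mathbb{R}^{d_C}, \quad
z_M^{(k)} \in Z_M \equiv \mathbb{R}^{d_M}, \quad d = d_C + d_M
```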

Framework

For a video, the content vector z_C is sampled once and fixed. Then, a series of random variables [e^(1), ..., e^(K)] is sampled and mapped to a series of motion codes [z_M^(1), ..., z_M^(K)] via the recurrent neural network R_M. We implement R_M using a one-layer GRU network. A generator G_I produces a frame, x̃^(k), using the content and motion vectors {z_C, z_M^(k)}. The discriminators, D_I and D_V, are trained on real and fake images and videos, respectively, sampled from the training set v and the generated set ṽ. The function S_1 samples a single frame from a video; S_T samples T consecutive frames.
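A minimal sketch of this sampling procedure; the name MotionRNN and the dimensions d_E = 10, d_M = 10, d_C = 50 are illustrative assumptions (they correspond to the --de and --dc flags, not necessarily this repo's defaults):

```python
import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    """R_M: a one-layer GRU that turns i.i.d. noise e^(1..K)
    into temporally correlated motion codes z_M^(1..K)."""
    def __init__(self, d_E=10, d_M=10):
        super().__init__()
        self.gru = nn.GRUCell(d_E, d_M)

    def forward(self, K, batch=1):
        h = torch.zeros(batch, self.gru.hidden_size)
        codes = []
        for _ in range(K):
            e = torch.randn(batch, self.gru.input_size)  # e^(k) ~ N(0, I)
            h = self.gru(e, h)                           # z_M^(k)
            codes.append(h)
        return torch.stack(codes, dim=1)                 # (batch, K, d_M)

K, d_C = 16, 50
z_C = torch.randn(1, d_C).expand(K, d_C)   # content code: sampled once, fixed
z_M = MotionRNN()(K).squeeze(0)            # motion codes: one per frame
z = torch.cat([z_C, z_M], dim=1)           # (K, d_C + d_M): per-frame inputs to G_I
```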

We train MoCoGAN using the alternating gradient update algorithm, as in standard GAN training. In one step, we update D_I and D_V while fixing G_I and R_M. In the alternating step, we update G_I and R_M while fixing D_I and D_V, playing a minimax game with value function F_V(D_I, D_V, G_I, R_M).

In this objective function, the first and second terms train the image discriminator to output 1 for frames sampled from real videos and 0 for frames sampled from generated videos. Similarly, the third and fourth terms train the video discriminator.
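Written out (paraphrasing the objective from the paper, with v a real video and ṽ a generated one):

```latex
\max_{G_I, R_M} \; \min_{D_I, D_V} \; F_V(D_I, D_V, G_I, R_M)
  = \mathbb{E}_v\!\left[-\log D_I(S_1(v))\right]
  + \mathbb{E}_{\tilde v}\!\left[-\log\left(1 - D_I(S_1(\tilde v))\right)\right]
  + \mathbb{E}_v\!\left[-\log D_V(S_T(v))\right]
  + \mathbb{E}_{\tilde v}\!\left[-\log\left(1 - D_V(S_T(\tilde v))\right)\right]
```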

Implementation and Model Architecture

We train this model on the Weizmann Action database.

  • We train our model for 40,000 iterations
  • We use BCE (binary cross-entropy) loss with a learning rate of 0.0002; a sketch of one training step follows this list
  • We test the model by generating videos from a randomly sampled set of ε and z_C
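One alternating update with these settings. The networks here are tiny stand-ins so the snippet runs end to end; they are not the repo's actual G_I/D_I, and the Adam betas (0.5, 0.999) are the conventional GAN choice, assumed rather than taken from this code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder generator and image discriminator (NOT the repo's architectures)
G = nn.Sequential(nn.Linear(60, 3 * 96 * 96), nn.Tanh())
D = nn.Sequential(nn.Linear(3 * 96 * 96, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_D = optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_G = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

real = torch.rand(8, 3 * 96 * 96) * 2 - 1   # stand-in for real frames in [-1, 1]
fake = G(torch.randn(8, 60))                # frames generated from z = [z_C, z_M]

# Step 1: update the discriminator while the generator is fixed
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Step 2: update the generator while the discriminator is fixed
g_loss = bce(D(G(torch.randn(8, 60))), torch.ones(8, 1))
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```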

Generator

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
   ConvTranspose2d-1            [-1, 512, 6, 6]       1,105,920
       BatchNorm2d-2            [-1, 512, 6, 6]           1,024
              ReLU-3            [-1, 512, 6, 6]               0
   ConvTranspose2d-4          [-1, 256, 12, 12]       2,097,152
       BatchNorm2d-5          [-1, 256, 12, 12]             512
              ReLU-6          [-1, 256, 12, 12]               0
   ConvTranspose2d-7          [-1, 128, 24, 24]         524,288
       BatchNorm2d-8          [-1, 128, 24, 24]             256
              ReLU-9          [-1, 128, 24, 24]               0
  ConvTranspose2d-10           [-1, 64, 48, 48]         131,072
      BatchNorm2d-11           [-1, 64, 48, 48]             128
             ReLU-12           [-1, 64, 48, 48]               0
  ConvTranspose2d-13            [-1, 3, 96, 96]           3,072
             Tanh-14            [-1, 3, 96, 96]               0
================================================================
Total params: 3,863,424
Trainable params: 3,863,424
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 6.75
Params size (MB): 14.74
Estimated Total Size (MB): 21.49
----------------------------------------------------------------
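Read back from the parameter counts and output shapes, the generator is a DCGAN-style stack of transposed convolutions. The 60-channel 1×1 input is inferred from the 1,105,920-parameter first layer (60 · 512 · 6 · 6) and corresponds to the concatenated [z_C, z_M] vector; treat this as a reconstruction, not the repo's exact code:

```python
import torch
import torch.nn as nn

# Kernel sizes and strides inferred from the summary's parameter counts
G_I = nn.Sequential(
    nn.ConvTranspose2d(60, 512, 6, stride=1, padding=0, bias=False),   # 1x1  -> 6x6
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1, bias=False),  # 6x6  -> 12x12
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1, bias=False),  # 12x12 -> 24x24
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1, bias=False),   # 24x24 -> 48x48
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1, bias=False),     # 48x48 -> 96x96
    nn.Tanh(),
)
frame = G_I(torch.randn(1, 60, 1, 1))  # -> (1, 3, 96, 96)
```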

Image Discriminator

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 64, 48, 48]           3,072
         LeakyReLU-2           [-1, 64, 48, 48]               0
            Conv2d-3          [-1, 128, 24, 24]         131,072
         LeakyReLU-4          [-1, 128, 24, 24]               0
            Conv2d-5          [-1, 256, 12, 12]         524,288
         LeakyReLU-6          [-1, 256, 12, 12]               0
            Conv2d-7            [-1, 512, 6, 6]       2,097,152
         LeakyReLU-8            [-1, 512, 6, 6]               0
            Conv2d-9              [-1, 1, 1, 1]          18,432
          Sigmoid-10              [-1, 1, 1, 1]               0
================================================================
Total params: 2,774,016
Trainable params: 2,774,016
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.11
Forward/backward pass size (MB): 4.22
Params size (MB): 10.58
Estimated Total Size (MB): 14.91
----------------------------------------------------------------
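The same reading applied to the image discriminator D_I; the LeakyReLU slope of 0.2 is an assumption (the summary does not record it):

```python
import torch
import torch.nn as nn

# Reconstruction of D_I from the summary; bias-free convs match the param counts
D_I = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1, bias=False),     # 96 -> 48
    nn.LeakyReLU(0.2, inplace=True),                          # slope 0.2 assumed
    nn.Conv2d(64, 128, 4, stride=2, padding=1, bias=False),   # 48 -> 24
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, stride=2, padding=1, bias=False),  # 24 -> 12
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 512, 4, stride=2, padding=1, bias=False),  # 12 -> 6
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(512, 1, 6, stride=1, padding=0, bias=False),    # 6 -> 1
    nn.Sigmoid(),
)
p = D_I(torch.randn(1, 3, 96, 96))  # -> (1, 1, 1, 1), probability of "real"
```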

Video Discriminator

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv3d-1        [-1, 64, 8, 48, 48]          12,288
         LeakyReLU-2        [-1, 64, 8, 48, 48]               0
            Conv3d-3       [-1, 128, 4, 24, 24]         524,288
       BatchNorm3d-4       [-1, 128, 4, 24, 24]             256
         LeakyReLU-5       [-1, 128, 4, 24, 24]               0
            Conv3d-6       [-1, 256, 2, 12, 12]       2,097,152
       BatchNorm3d-7       [-1, 256, 2, 12, 12]             512
         LeakyReLU-8       [-1, 256, 2, 12, 12]               0
            Conv3d-9         [-1, 512, 1, 6, 6]       8,388,608
      BatchNorm3d-10         [-1, 512, 1, 6, 6]           1,024
        LeakyReLU-11         [-1, 512, 1, 6, 6]               0
           Linear-12                    [-1, 1]          18,433
          Sigmoid-13                    [-1, 1]               0
================================================================
Total params: 11,042,561
Trainable params: 11,042,561
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.69
Forward/backward pass size (MB): 26.86
Params size (MB): 42.12
Estimated Total Size (MB): 70.67
----------------------------------------------------------------
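And the video discriminator D_V, which consumes a clip of T = 16 frames (consistent with the 1.69 MB input size) using 4×4×4 spatio-temporal convolutions; again a reconstruction from the summary, with the LeakyReLU slope assumed:

```python
import torch
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    """D_V reconstructed from the summary: Conv3d stack plus a linear head."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, 4, stride=2, padding=1, bias=False),    # (16,96,96) -> (8,48,48)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(64, 128, 4, stride=2, padding=1, bias=False),  # -> (4,24,24)
            nn.BatchNorm3d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(128, 256, 4, stride=2, padding=1, bias=False), # -> (2,12,12)
            nn.BatchNorm3d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(256, 512, 4, stride=2, padding=1, bias=False), # -> (1,6,6)
            nn.BatchNorm3d(512), nn.LeakyReLU(0.2, inplace=True),
        )
        self.fc = nn.Sequential(nn.Linear(512 * 6 * 6, 1), nn.Sigmoid())

    def forward(self, v):            # v: (batch, 3, T, H, W)
        h = self.conv(v)
        return self.fc(h.flatten(1)) # -> (batch, 1), probability of "real"

p = VideoDiscriminator()(torch.randn(1, 3, 16, 96, 96))
```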

Results

Some samples of the generated videos are as follows:

(animated GIF samples of generated videos)