
PyTorch Implementation of MoCoGAN

Usage

We are using this dataset, which you need to extract and place all the files in a directory named data.

$ python3 main.py --epochs 40000

NOTE: on a Colab notebook, use the following commands:

!git clone link-to-repo
%run main.py --epochs 40000
usage: main.py [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
               [--pre-train PRE_TRAIN] [--img_size IMG_SIZE] [--data DATA] 
               [--channel CHANNEL] [--hidden HIDDEN] [--dc DC] [--de DE]
               [--lr LR] [--beta BETA] [--trained_path TRAINED_PATH] [--T T]

Start training MoCoGAN.....

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        set batch_size
  --epochs EPOCHS       set num of iterations
  --pre-train PRE_TRAIN
                        set 1 when you use pre-trained models
  --img_size IMG_SIZE   set the input image size of frame
  --data DATA           set the path for the directory containing the dataset
  --channel CHANNEL     set the no. of channels of the frame
  --hidden HIDDEN       set the hidden layer size for gru
  --dc DC               set the size of motion vector
  --de DE               set the size of randomly generated epsilon
  --lr LR               set the learning rate
  --beta BETA           set the beta for the optimizer
  --trained_path TRAINED_PATH
                        set the path where the trained models are saved
  --T T                 set the no. of frames to be selected

Contributed by:

References

  • Title: MoCoGAN: Decomposing Motion and Content for Video Generation
  • Authors: Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
  • Link: https://arxiv.org/pdf/1707.04993.pdf
  • Year: 2017

Summary

Introduction

Visual signals in a video can be divided into content and motion. Content specifies which objects are in the video, while motion describes their dynamics. Based on this observation, the MoCoGAN framework was proposed. It generates a video by mapping a sequence of randomly generated vectors to a sequence of video frames, where each random vector consists of a content part and a motion part.

To learn motion and content in an unsupervised manner, we introduce an adversarial learning scheme utilizing both an image discriminator and a video discriminator.

GANs

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two ‘adversarial’ models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. Both G and D can be non-linear mapping functions, such as multi-layer perceptrons.
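For reference, this is the standard two-player minimax objective from the original GAN paper, which MoCoGAN extends:

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]
```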

Motion And Content Decomposition GAN

In MoCoGAN, we assume a latent space of images Z_I ≡ R^d where each point z ∈ Z_I represents an image, and a video of K frames is represented by a path of length K in the latent space, [z^(1), ..., z^(K)]. By adopting this formulation, videos of different lengths can be generated by paths of different lengths. We further assume that Z_I is decomposed into a content subspace Z_C and a motion subspace Z_M. The content subspace models motion-independent appearance in videos, while the motion subspace models motion-dependent appearance in videos.
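Concretely, each frame's latent code concatenates the video's content code (sampled once per video) with that frame's motion code:

```latex
z^{(k)} = \left[ z_C,\; z_M^{(k)} \right], \quad k = 1, \dots, K,
\qquad z_C \in Z_C \equiv \mathbb{R}^{d_C}, \quad
z_M^{(k)} \in Z_M \equiv \mathbb{R}^{d_M}, \quad d = d_C + d_M
```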

Framework

For a video, the content vector z_C is sampled once and fixed. Then, a series of random variables [e^(1), ..., e^(K)] is sampled and mapped to a series of motion codes [z_M^(1), ..., z_M^(K)] via the recurrent neural network R_M. We implement R_M using a one-layer GRU network. A generator G_I produces a frame, x̃^(k), using the content and motion vectors {z_C, z_M^(k)}. The discriminators, D_I and D_V, are trained on real and fake images and videos, respectively, sampled from the training set v and the generated set ṽ. The function S_1 samples a single frame from a video; S_T samples T consecutive frames.
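A minimal sketch of this sampling procedure; the name MotionRNN and the dimensions d_E = 10, d_M = 10, d_C = 50 are illustrative assumptions (they correspond to the --de and --dc flags, not necessarily this repo's defaults):

```python
import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    """R_M: a one-layer GRU that turns i.i.d. noise e^(1..K)
    into temporally correlated motion codes z_M^(1..K)."""
    def __init__(self, d_E=10, d_M=10):
        super().__init__()
        self.gru = nn.GRUCell(d_E, d_M)

    def forward(self, K, batch=1):
        h = torch.zeros(batch, self.gru.hidden_size)
        codes = []
        for _ in range(K):
            e = torch.randn(batch, self.gru.input_size)  # e^(k) ~ N(0, I)
            h = self.gru(e, h)                           # z_M^(k)
            codes.append(h)
        return torch.stack(codes, dim=1)                 # (batch, K, d_M)

K, d_C = 16, 50
z_C = torch.randn(1, d_C).expand(K, d_C)   # content code: sampled once, fixed
z_M = MotionRNN()(K).squeeze(0)            # motion codes: one per frame
z = torch.cat([z_C, z_M], dim=1)           # (K, d_C + d_M): per-frame inputs to G_I
```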

We train MoCoGAN using the alternating gradient update algorithm, as in standard GAN training. In one step, we update D_I and D_V while fixing G_I and R_M. In the alternating step, we update G_I and R_M while fixing D_I and D_V, playing a minimax game with value function F_V(D_I, D_V, G_I, R_M).

In this objective function, the first and second terms train the image discriminator to output 1 for frames sampled from real videos and 0 for frames sampled from generated videos. Similarly, the third and fourth terms train the video discriminator.
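Written out (paraphrasing the objective from the paper, with v a real video and ṽ a generated one):

```latex
\max_{G_I, R_M} \; \min_{D_I, D_V} \; F_V(D_I, D_V, G_I, R_M)
  = \mathbb{E}_v\!\left[-\log D_I(S_1(v))\right]
  + \mathbb{E}_{\tilde v}\!\left[-\log\left(1 - D_I(S_1(\tilde v))\right)\right]
  + \mathbb{E}_v\!\left[-\log D_V(S_T(v))\right]
  + \mathbb{E}_{\tilde v}\!\left[-\log\left(1 - D_V(S_T(\tilde v))\right)\right]
```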

Implementation and Model Architecture

We train this model on the Weizmann Action database.

  • We train our model for 40,000 iterations
  • We use BCE (binary cross-entropy) loss with a learning rate of 0.0002; a sketch of one training step follows this list
  • We test the model by generating videos from a randomly sampled set of ε and z_C
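One alternating update with these settings. The networks here are tiny stand-ins so the snippet runs end to end; they are not the repo's actual G_I/D_I, and the Adam betas (0.5, 0.999) are the conventional GAN choice, assumed rather than taken from this code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder generator and image discriminator (NOT the repo's architectures)
G = nn.Sequential(nn.Linear(60, 3 * 96 * 96), nn.Tanh())
D = nn.Sequential(nn.Linear(3 * 96 * 96, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_D = optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_G = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

real = torch.rand(8, 3 * 96 * 96) * 2 - 1   # stand-in for real frames in [-1, 1]
fake = G(torch.randn(8, 60))                # frames generated from z = [z_C, z_M]

# Step 1: update the discriminator while the generator is fixed
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Step 2: update the generator while the discriminator is fixed
g_loss = bce(D(G(torch.randn(8, 60))), torch.ones(8, 1))
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```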

Generator

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
   ConvTranspose2d-1            [-1, 512, 6, 6]       1,105,920
       BatchNorm2d-2            [-1, 512, 6, 6]           1,024
              ReLU-3            [-1, 512, 6, 6]               0
   ConvTranspose2d-4          [-1, 256, 12, 12]       2,097,152
       BatchNorm2d-5          [-1, 256, 12, 12]             512
              ReLU-6          [-1, 256, 12, 12]               0
   ConvTranspose2d-7          [-1, 128, 24, 24]         524,288
       BatchNorm2d-8          [-1, 128, 24, 24]             256
              ReLU-9          [-1, 128, 24, 24]               0
  ConvTranspose2d-10           [-1, 64, 48, 48]         131,072
      BatchNorm2d-11           [-1, 64, 48, 48]             128
             ReLU-12           [-1, 64, 48, 48]               0
  ConvTranspose2d-13            [-1, 3, 96, 96]           3,072
             Tanh-14            [-1, 3, 96, 96]               0
================================================================
Total params: 3,863,424
Trainable params: 3,863,424
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 6.75
Params size (MB): 14.74
Estimated Total Size (MB): 21.49
----------------------------------------------------------------
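Read back from the parameter counts and output shapes, the generator is a DCGAN-style stack of transposed convolutions. The 60-channel 1×1 input is inferred from the 1,105,920-parameter first layer (60 · 512 · 6 · 6) and corresponds to the concatenated [z_C, z_M] vector; treat this as a reconstruction, not the repo's exact code:

```python
import torch
import torch.nn as nn

# Kernel sizes and strides inferred from the summary's parameter counts
G_I = nn.Sequential(
    nn.ConvTranspose2d(60, 512, 6, stride=1, padding=0, bias=False),   # 1x1  -> 6x6
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1, bias=False),  # 6x6  -> 12x12
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1, bias=False),  # 12x12 -> 24x24
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1, bias=False),   # 24x24 -> 48x48
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1, bias=False),     # 48x48 -> 96x96
    nn.Tanh(),
)
frame = G_I(torch.randn(1, 60, 1, 1))  # -> (1, 3, 96, 96)
```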

Image Discriminator

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 64, 48, 48]           3,072
         LeakyReLU-2           [-1, 64, 48, 48]               0
            Conv2d-3          [-1, 128, 24, 24]         131,072
         LeakyReLU-4          [-1, 128, 24, 24]               0
            Conv2d-5          [-1, 256, 12, 12]         524,288
         LeakyReLU-6          [-1, 256, 12, 12]               0
            Conv2d-7            [-1, 512, 6, 6]       2,097,152
         LeakyReLU-8            [-1, 512, 6, 6]               0
            Conv2d-9              [-1, 1, 1, 1]          18,432
          Sigmoid-10              [-1, 1, 1, 1]               0
================================================================
Total params: 2,774,016
Trainable params: 2,774,016
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.11
Forward/backward pass size (MB): 4.22
Params size (MB): 10.58
Estimated Total Size (MB): 14.91
----------------------------------------------------------------
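The same reading applied to the image discriminator D_I; the LeakyReLU slope of 0.2 is an assumption (the summary does not record it):

```python
import torch
import torch.nn as nn

# Reconstruction of D_I from the summary; bias-free convs match the param counts
D_I = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1, bias=False),     # 96 -> 48
    nn.LeakyReLU(0.2, inplace=True),                          # slope 0.2 assumed
    nn.Conv2d(64, 128, 4, stride=2, padding=1, bias=False),   # 48 -> 24
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, stride=2, padding=1, bias=False),  # 24 -> 12
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 512, 4, stride=2, padding=1, bias=False),  # 12 -> 6
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(512, 1, 6, stride=1, padding=0, bias=False),    # 6 -> 1
    nn.Sigmoid(),
)
p = D_I(torch.randn(1, 3, 96, 96))  # -> (1, 1, 1, 1), probability of "real"
```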

Video Discriminator

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv3d-1        [-1, 64, 8, 48, 48]          12,288
         LeakyReLU-2        [-1, 64, 8, 48, 48]               0
            Conv3d-3       [-1, 128, 4, 24, 24]         524,288
       BatchNorm3d-4       [-1, 128, 4, 24, 24]             256
         LeakyReLU-5       [-1, 128, 4, 24, 24]               0
            Conv3d-6       [-1, 256, 2, 12, 12]       2,097,152
       BatchNorm3d-7       [-1, 256, 2, 12, 12]             512
         LeakyReLU-8       [-1, 256, 2, 12, 12]               0
            Conv3d-9         [-1, 512, 1, 6, 6]       8,388,608
      BatchNorm3d-10         [-1, 512, 1, 6, 6]           1,024
        LeakyReLU-11         [-1, 512, 1, 6, 6]               0
           Linear-12                    [-1, 1]          18,433
          Sigmoid-13                    [-1, 1]               0
================================================================
Total params: 11,042,561
Trainable params: 11,042,561
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.69
Forward/backward pass size (MB): 26.86
Params size (MB): 42.12
Estimated Total Size (MB): 70.67
----------------------------------------------------------------
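And the video discriminator D_V, which consumes a clip of T = 16 frames (consistent with the 1.69 MB input size) using 4×4×4 spatio-temporal convolutions; again a reconstruction from the summary, with the LeakyReLU slope assumed:

```python
import torch
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    """D_V reconstructed from the summary: Conv3d stack plus a linear head."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, 4, stride=2, padding=1, bias=False),    # (16,96,96) -> (8,48,48)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(64, 128, 4, stride=2, padding=1, bias=False),  # -> (4,24,24)
            nn.BatchNorm3d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(128, 256, 4, stride=2, padding=1, bias=False), # -> (2,12,12)
            nn.BatchNorm3d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(256, 512, 4, stride=2, padding=1, bias=False), # -> (1,6,6)
            nn.BatchNorm3d(512), nn.LeakyReLU(0.2, inplace=True),
        )
        self.fc = nn.Sequential(nn.Linear(512 * 6 * 6, 1), nn.Sigmoid())

    def forward(self, v):            # v: (batch, 3, T, H, W)
        h = self.conv(v)
        return self.fc(h.flatten(1)) # -> (batch, 1), probability of "real"

p = VideoDiscriminator()(torch.randn(1, 3, 16, 96, 96))
```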

Results

Some samples of the generated videos are as follows:

(animated GIF samples of generated videos)