Accumulated Gradient in Tensorflow

A typical deep learning framework takes a training batch as input, passes it forward, computes the loss and its gradient by back-propagation, and finally updates the weights. In this procedure, the input batch has to fit in GPU memory. On a GPU with limited memory, this constraint has multiple drawbacks. First, gradients from small batches are not stable; accordingly, a small learning rate is required, which slows down network convergence. Second, small batches introduce noisy, contradicting gradients near the optimal point, which limits the accuracy the network can reach.

Accumulated gradient is a neural network framework feature. It is a workaround to enable big batches on limited-memory GPUs. Instead of updating the weights after every batch, gradients across multiple batches are accumulated. After multiple forward and backward passes, the accumulated gradient is used to update the network weights once. This gives the illusion of using big batches on limited-memory GPUs.

Caffe supports accumulated gradient through the iter_size parameter. For example, batch_size = 10 and iter_size = 5 simulate an input batch of size 50, which might not fit in GPU memory in a single pass.
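To see why this simulation works, here is a minimal sketch in plain Python, with no framework involved. The tiny one-weight linear model, the synthetic data, and the helper name `mean_grad` are made up for illustration. It checks that averaging the gradients of 5 mini-batches of size 10 matches the gradient of a single batch of size 50, assuming the loss is a mean over samples and all mini-batches have the same size.

```python
import random

def mean_grad(w, xs, ys):
    """Gradient of the mean of 0.5 * (w*x - y)^2 with respect to the scalar weight w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
batch_size, iter_size = 10, 5            # same numbers as the Caffe example
n = batch_size * iter_size               # effective batch size: 50
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]
w = 0.5

# Gradient of one big batch of 50 samples
big_batch_grad = mean_grad(w, xs, ys)

# Accumulate the gradients of 5 small batches, each scaled by 1/iter_size
accum_grad = 0.0
for i in range(iter_size):
    start, stop = i * batch_size, (i + 1) * batch_size
    accum_grad += mean_grad(w, xs[start:stop], ys[start:stop]) / iter_size

# The two gradients agree up to floating-point rounding
print(abs(big_batch_grad - accum_grad) < 1e-9)
```

The equivalence is exact only for losses that average over samples; layers that compute per-batch statistics, such as batch normalization, still see the small batches.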

PyTorch supports accumulated gradient by updating the network weights once after computing the gradient multiple consecutive times. The following snippet illustrates the main idea:

while training:  # Main training loop (`training` stands for your stopping condition)

    for i in range(config.caffe_iter_size):
        inputs, target = next(train_loader_iter)  ## Small batch
        output = model(inputs)  ## Feed-forward
        loss = criterion(output, target) / config.caffe_iter_size  ## Compute the normalized loss
        loss.backward()  ## Back-propagate; gradients are summed into .grad, giving the average

    # After caffe_iter_size iterations, update network weights **once**
    optimizer.step()
    # Zero the gradient to re-accumulate again
    optimizer.zero_grad()

It is important to normalize the gradients across individual batches. The following snippet from the Caffe source code shows the normalization step:

template <typename Dtype>
void SGDSolver<Dtype>::ApplyUpdate() {
  Dtype rate = GetLearningRate();
  if (this->param_.display() && this->iter_ % this->param_.display() == 0) {
    LOG_IF(INFO, Caffe::root_solver()) << "Iteration " << this->iter_
        << ", lr = " << rate;
  }
  ClipGradients();
  for (int param_id = 0; param_id < this->net_->learnable_params().size();
       ++param_id) {
    Normalize(param_id);  /***** Normalize across iter_size *****/
    Regularize(param_id);
    ComputeUpdateValue(param_id, rate);
  }
  this->net_->Update();

  // Increment the internal iter_ counter -- its value should always indicate
  // the number of times the weights have been updated.
  ++this->iter_;
}

template <typename Dtype>
void SGDSolver<Dtype>::Normalize(int param_id) {
  if (this->param_.iter_size() == 1) { return; }
  // Scale gradient to counterbalance accumulation.
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
  
  /***** <<===== Notice: Scale by 1/iter_size *****/
  const Dtype accum_normalization = Dtype(1.) / this->param_.iter_size();
  switch (Caffe::mode()) {
  case Caffe::CPU: {
    caffe_scal(net_params[param_id]->count(), accum_normalization,
        net_params[param_id]->mutable_cpu_diff());
    break;
  }
  case Caffe::GPU: {
#ifndef CPU_ONLY
    caffe_gpu_scal(net_params[param_id]->count(), accum_normalization,
        net_params[param_id]->mutable_gpu_diff());
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}

Without normalization, the gradient direction is still correct, but its magnitude is iter_size times larger, which leads to big jumps and unstable training.
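A toy illustration of this instability, in plain Python. This is a sketch and not Caffe code: the helper `run_sgd` and the 1-D quadratic loss 0.5 * (w - target)^2 are hypothetical, and every small batch is assumed to produce the same gradient so that the effect of the scaling is isolated.

```python
def run_sgd(lr, iter_size, normalize, steps=20, w=0.0, target=1.0):
    """Minimize 0.5 * (w - target)^2, accumulating the gradient over iter_size batches."""
    for _step in range(steps):
        accum = 0.0
        for _batch in range(iter_size):
            accum += w - target          # gradient of one small batch
        if normalize:
            accum /= iter_size           # scale by 1/iter_size, like Caffe's Normalize()
        w -= lr * accum                  # one weight update per iter_size batches
    return w

w_norm = run_sgd(lr=0.5, iter_size=5, normalize=True)
w_raw = run_sgd(lr=0.5, iter_size=5, normalize=False)

print(abs(w_norm - 1.0))   # small: the normalized run approaches the target
print(abs(w_raw - 1.0))    # large: the step is 5x too big, so the iterates diverge
```

In this setup the unnormalized run behaves exactly like a normalized run with a learning rate 5 times larger, which is why lowering the learning rate can compensate for a missing normalization step.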

This nice accumulated gradient feature boosts performance by a couple of percentage points. These few points are essential to replicate state-of-the-art results. Unfortunately, I am not aware of any official documentation on how to use such a feature in Tensorflow. This Stack Overflow post shares the main idea for manually implementing accumulated gradient in Tensorflow: create shadow variables for the trainable variables to hold their gradients. Gradients are accumulated in these shadow variables, which are then used when updating the network weights.
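The shadow-variable pattern can be sketched without any framework. The class below is a hypothetical helper written for illustration, not a Tensorflow API: it keeps one zero-initialized buffer per parameter, adds each batch's gradient to it, and applies a plain SGD update from the buffer before resetting it. Dividing by `iter_size` is an added assumption that mirrors Caffe's normalization.

```python
class GradAccumulator:
    """One shadow buffer per trainable parameter (illustrative sketch only)."""

    def __init__(self, params):
        self.params = params                              # dict: name -> float weight
        self.shadow = {name: 0.0 for name in params}      # shadow variables, start at zero

    def accumulate(self, grads):
        """Add one batch's gradients to the shadow buffers."""
        for name, grad in grads.items():
            self.shadow[name] += grad

    def apply(self, lr, iter_size):
        """Update the weights once from the averaged gradient, then reset the buffers."""
        for name in self.params:
            self.params[name] -= lr * self.shadow[name] / iter_size
            self.shadow[name] = 0.0


params = {"w": 1.0, "b": 0.0}
acc = GradAccumulator(params)
for batch_grads in [{"w": 0.2, "b": -0.4}, {"w": 0.4, "b": 0.0}]:
    acc.accumulate(batch_grads)       # two small batches
acc.apply(lr=0.1, iter_size=2)        # a single weight update
print(params)
```

The three methods correspond to the three groups of ops a graph-mode implementation needs: accumulate ops, a train op that reads the buffers, and zero ops.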

In this repo, I use this idea but normalize the gradient across batches. I think a small learning rate might eliminate the need for the normalization step. Yet, I prefer to normalize and use a big learning rate. In this repo, learning_rate=0.1 even though I am using a pretrained network. The following snippet re-illustrates accumulated gradient in Tensorflow with normalization.

import tensorflow as tf  # Tensorflow 1.x graph-mode API

# Assumption: accumulate over all trainable variables of the graph
trainable_vars = tf.trainable_variables()
## Creation of shadow variables with the same shape as the trainable ones
# initialized with 0s
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in trainable_vars]
# Ops that reset the shadow variables to zero
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    global_step = tf.Variable(0, name='global_step', trainable=False)
    learning_rate = tf_utils.poly_lr(global_step)
    # MomentumOptimizer requires a momentum argument; 0.9 is an assumed value
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    grads = optimizer.compute_gradients(model.train_loss, trainable_vars)
    # Add each gradient to the matching shadow variable
    ## (works because accum_vars and grads are in the same order)
    accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(grads)]
    iter_size = config.caffe_iter_size
    # Define the training step: apply the accumulated gradient, normalized by iter_size
    train_op = optimizer.apply_gradients([(accum_vars[i] / iter_size, gv[1]) for i, gv in enumerate(grads)],
                                         global_step=global_step)
   
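while training:  ## Main loop (`training` stands for your stopping condition)
    for mini_batch in range(config.caffe_iter_size):
        words, lbl = next(train_iter)
        feed_dict = {.......}
        sess.run(accum_ops, feed_dict)  # accumulate this mini-batch's gradient
    # train_op only reads accum_vars, so all caffe_iter_size mini-batches must be
    # accumulated first; the last feed_dict is reused in case update_ops need the placeholders
    sess.run(train_op, feed_dict)
    sess.run(zero_ops)  # reset the shadow variables for the next round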
