Accumulated Gradient in Tensorflow
Typical deep learning frameworks take a training batch as input, pass it forward, compute the loss and its gradient, and finally update the weights through back-propagation. In this procedure, the input batch has to fit in GPU memory. On a GPU with limited memory, this constraint has multiple drawbacks. First, gradients from small batches are not stable; accordingly, a small learning rate is required, which slows the network's convergence. Second, small batches introduce noisy, contradicting gradients near the optimal point, which limits the accuracy the network can reach.
Accumulated gradient is a neural network framework feature. It is a workaround to enable big batches on limited-memory GPUs. Instead of updating the weights after every batch, gradients across multiple batches are accumulated. After multiple forward and backward passes, the accumulated gradient is used to update the network weights once. This gives the illusion of using big batches on limited-memory GPUs.
Accumulated gradient is supported in Caffe using the iter_size parameter. For example, if batch_size = 10 and iter_size = 5, this simulates an input batch of size 50, which might not fit the GPU memory in a single pass.
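As a rough check of the batch_size = 10, iter_size = 5 example, the sketch below compares the gradient of one batch of 50 samples with the average of five gradients from batches of 10. The scalar linear model, the random data, and the helper name mse_gradient are illustrative assumptions, not part of Caffe or of this repo.

```python
import random

def mse_gradient(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

random.seed(0)
batch_size, iter_size = 10, 5
# 50 hypothetical (x, y) samples
data = [(random.uniform(-1, 1), random.uniform(-1, 1))
        for _ in range(batch_size * iter_size)]
w = 0.3

# Gradient of one big batch of 50 samples.
big_batch_grad = mse_gradient(w, data)

# The same 50 samples fed as 5 mini-batches of 10;
# each mini-batch gradient is scaled by 1 / iter_size before being added.
accumulated_grad = 0.0
for i in range(iter_size):
    mini_batch = data[i * batch_size:(i + 1) * batch_size]
    accumulated_grad += mse_gradient(w, mini_batch) / iter_size

# The two gradients agree up to floating-point rounding.
print(abs(big_batch_grad - accumulated_grad) < 1e-9)
```

The match relies on the loss being an average over samples and on equal-sized mini-batches. Layers that compute per-batch statistics, such as batch normalization, still only see 10 samples at a time.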
Accumulated gradient is supported in PyTorch by updating the network weights once after computing the gradient multiple consecutive times. The following snippet illustrates the main idea:
while Training:  # Main Training Loop
    for i in range(config.caffe_iter_size):
        input, target = next(train_loader_iter)  ## Small batch
        output = model(input)  ## Feed-forward
        loss = criterion(output, target) / config.caffe_iter_size  ## compute loss
        loss.backward()  ## Accumulate the average gradient
    # After caffe_iter_size iterations, update network weights **once**
    optimizer.step()
    # Zero the gradient to re-accumulate again
    optimizer.zero_grad()
It is important to normalize the gradients across individual batches. The following snippet from the Caffe source code shows the normalization step:
template <typename Dtype>
void SGDSolver<Dtype>::ApplyUpdate() {
  Dtype rate = GetLearningRate();
  if (this->param_.display() && this->iter_ % this->param_.display() == 0) {
    LOG_IF(INFO, Caffe::root_solver()) << "Iteration " << this->iter_
        << ", lr = " << rate;
  }
  ClipGradients();
  for (int param_id = 0; param_id < this->net_->learnable_params().size();
       ++param_id) {
    Normalize(param_id);  /***** Normalize across iter_size *****/
    Regularize(param_id);
    ComputeUpdateValue(param_id, rate);
  }
  this->net_->Update();
  // Increment the internal iter_ counter -- its value should always indicate
  // the number of times the weights have been updated.
  ++this->iter_;
}

template <typename Dtype>
void SGDSolver<Dtype>::Normalize(int param_id) {
  if (this->param_.iter_size() == 1) { return; }
  // Scale gradient to counterbalance accumulation.
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
  /***** <<===== Notice: Scale by 1/iter_size *****/
  const Dtype accum_normalization = Dtype(1.) / this->param_.iter_size();
  switch (Caffe::mode()) {
  case Caffe::CPU: {
    caffe_scal(net_params[param_id]->count(), accum_normalization,
        net_params[param_id]->mutable_cpu_diff());
    break;
  }
  case Caffe::GPU: {
#ifndef CPU_ONLY
    caffe_gpu_scal(net_params[param_id]->count(), accum_normalization,
        net_params[param_id]->mutable_gpu_diff());
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
Without normalization, the gradient direction is still correct, but its magnitude is roughly iter_size times larger, which leads to big jumps and unstable training.
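The effect of skipping the normalization can be seen on a toy problem. The sketch below minimizes f(w) = 0.5 * w^2 with accumulated gradients, once with the 1/iter_size scaling and once without. The function run_sgd and the values lr=0.5, iter_size=5 are hypothetical, chosen only to make the instability visible.

```python
def run_sgd(normalize, lr=0.5, iter_size=5, steps=20):
    """Minimize f(w) = 0.5 * w**2 with accumulated gradients; return the final |w|."""
    w = 1.0
    for _ in range(steps):
        accum = 0.0
        for _ in range(iter_size):
            # The gradient of 0.5 * w**2 is w; in this toy setup it is
            # identical for every mini-batch.
            accum += w
        grad = accum / iter_size if normalize else accum
        w -= lr * grad  # one weight update per iter_size accumulations
    return abs(w)

normalized = run_sgd(normalize=True)     # |w| shrinks by a factor of 0.5 per update
unnormalized = run_sgd(normalize=False)  # |w| grows by a factor of 1.5 per update

print(normalized < 1e-5, unnormalized > 1e3)
```

With the same learning rate, the normalized run converges toward the minimum while the unnormalized run overshoots further on every update. Dividing the learning rate by iter_size would have the same effect as normalizing in this plain SGD setting.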
This nice accumulated gradient feature boosts performance by a couple of percentage points. These few points are essential to replicate state-of-the-art results. Unfortunately, I am not aware of any official documentation for using this feature in Tensorflow. This stackoverflow post shares the main idea for manually implementing accumulated gradient in Tensorflow: create shadow variables for the trainable variables to hold their gradients. Gradients are accumulated in these shadow variables, which are then used when updating the network weights.
In this repo, I use this idea but normalize the gradient across batches. I think a small learning rate might eliminate the need for the normalization step. Yet, I prefer to normalize and use a big learning rate. In this repo, learning_rate=0.1 despite the fact that I am using a pretrained network. The following snippet re-illustrates accumulated gradient in Tensorflow with normalization.
## Creation of shadow variables with the same shape as the trainable ones
# initialized with 0s
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in trainable_vars]
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    global_step = tf.Variable(0, name='global_step', trainable=False)
    learning_rate = tf_utils.poly_lr(global_step)
    # momentum is a required argument of MomentumOptimizer; 0.9 is an assumed value
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    grads = optimizer.compute_gradients(model.train_loss, trainable_vars)
    # Adds to each element from the list you initialized earlier with zeros its gradient
    ## (works because accum_vars and grads are in the same order)
    accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(grads)]
    iter_size = config.caffe_iter_size
    # Define the training step (part with variable value update)
    train_op = optimizer.apply_gradients([(accum_vars[i] / iter_size, gv[1]) for i, gv in enumerate(grads)],
                                         global_step=global_step)

while Training:  ## Main Loop
    # Accumulate the gradient of every mini-batch, including the last one:
    # train_op only reads accum_vars and does not add the current batch's gradient.
    for mini_batch in range(config.caffe_iter_size):
        words, lbl = next(train_iter)
        feed_dict = {.......}
        sess.run(accum_ops, feed_dict)
    # Apply the averaged gradient once; the last feed_dict is reused because
    # train_op is built under control_dependencies(update_ops).
    sess.run(train_op, feed_dict)
    sess.run(zero_ops)
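The shadow-variable pattern above does not depend on Tensorflow itself. As a minimal framework-free sketch, the hypothetical GradientAccumulator class below mirrors the roles of accum_vars, accum_ops, train_op and zero_ops with plain Python floats and a plain SGD update (no momentum). The parameter and gradient values are made up for illustration.

```python
class GradientAccumulator:
    """Framework-free stand-in for the accum_vars / accum_ops / train_op / zero_ops pattern."""

    def __init__(self, params, iter_size):
        self.params = params                 # list of floats (the "trainable variables")
        self.iter_size = iter_size
        self.accum = [0.0 for _ in params]   # shadow variables, initialized with zeros

    def accumulate(self, grads):
        """Role of accum_ops: add one mini-batch gradient to the shadow variables."""
        for i, g in enumerate(grads):
            self.accum[i] += g

    def apply(self, lr):
        """Role of train_op: update the weights with the gradient averaged over iter_size."""
        for i in range(len(self.params)):
            self.params[i] -= lr * self.accum[i] / self.iter_size

    def zero(self):
        """Role of zero_ops: reset the shadow variables before the next accumulation."""
        self.accum = [0.0 for _ in self.params]


acc = GradientAccumulator(params=[1.0, -2.0], iter_size=2)
acc.accumulate([0.5, 1.0])   # gradient from mini-batch 1
acc.accumulate([1.5, 3.0])   # gradient from mini-batch 2
acc.apply(lr=0.1)            # one weight update using the averaged gradient [1.0, 2.0]
acc.zero()
print(acc.params)
```

The order of calls matters: every mini-batch gradient is accumulated before apply, and zero is called only after the update, otherwise the next update would reuse stale gradients.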