Custom Optimizer in TensorFlow

How to customize optimizers to speed up and improve the process of finding a (local) minimum of the loss function using TensorFlow.

By Benoit Descamps, BigData Republic.

Introduction

Neural Networks play a very important role when modeling unstructured

# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

tf.train.GradientDescentOptimizer(learning_rate) creates an object of the class GradientDescentOptimizer which, as the name says, implements the gradient descent algorithm.

The method minimize() is called with the “cost” as its parameter and consists of two steps: compute_gradients() followed by apply_gradients().
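
The two-step structure of minimize() can be illustrated without TensorFlow. The following is a minimal sketch in plain Python; the class TinyGradientDescent, its finite-difference gradient and the quadratic cost are assumptions made for illustration and are not part of the TensorFlow API.

```python
class TinyGradientDescent:
    """Toy optimizer (hypothetical, not TensorFlow) for a single float variable."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_gradients(self, cost, x, h=1e-6):
        # Step 1: estimate d(cost)/dx with a central finite difference.
        return (cost(x + h) - cost(x - h)) / (2.0 * h)

    def apply_gradients(self, grad, x):
        # Step 2: use the gradient to produce the updated variable.
        return x - self.learning_rate * grad

    def minimize(self, cost, x):
        # minimize() simply chains the two steps.
        grad = self.compute_gradients(cost, x)
        return self.apply_gradients(grad, x)


def cost(x):
    return (x - 3.0) ** 2  # minimum at x = 3


opt = TinyGradientDescent(learning_rate=0.1)
x = 0.0
for _ in range(100):
    x = opt.minimize(cost, x)
print(x)  # close to 3.0
```

A custom optimizer only needs to change the second step, which is why the rest of this post concentrates on apply_gradients().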

For this post, and the implementation of AddSign and PowerSign, we must have a closer look at this last step apply_gradients().

This method relies on the (new) Optimizer subclass, which we will create, to implement the following methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().

_create_slots() and _prepare() create and initialise additional variables, such as momentum.

_apply_dense() and _apply_sparse() implement the actual Ops, which update the variables. Ops are generally written in C++. Without having to change the C++ header yourself, you can still return a Python wrapper of some Ops through these methods.

This is done as follows:

def _create_slots(self, var_list):
    # Create slots for allocation and later management of additional
    # variables associated with the variables to train,
    # for example the first and second moments.
    for v in var_list:
        self._zeros_slot(v, "m", self._name)
        self._zeros_slot(v, "v", self._name)

def _apply_dense(self, grad, var):
    # Define your favourite variable update. For example, here we apply
    # gradient descent by subtracting the gradient times the
    # learning rate (defined in __init__) from the variable.
    var_update = state_ops.assign_sub(var, self.learning_rate * grad)

    # The trick is now to pass the Ops to control_flow_ops.group(), together
    # with any computation on the slots you wish to keep track of.
    m = self.get_slot(var, "m")
    v = self.get_slot(var, "v")
    m_t = ...  # do something with m and grad
    v_t = ...  # do something with v and grad
    return control_flow_ops.group(*[var_update, m_t, v_t])
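
The role of slots can also be sketched in plain Python. In the hedged example below, SlotMomentum is a hypothetical stand-in for an Optimizer subclass: variables are plain floats stored in a dict, and each one gets a slot "m" that holds its momentum between updates. The update rule is classical momentum, chosen only for illustration.

```python
class SlotMomentum:
    """Hypothetical toy optimizer that keeps one slot ("m") per variable."""

    def __init__(self, learning_rate=0.1, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self._slots = {}  # maps (variable name, slot name) -> value

    def _create_slots(self, var_names):
        # Allocate a zero-initialised accumulator for every variable.
        for name in var_names:
            self._slots[(name, "m")] = 0.0

    def _apply_dense(self, grad, name, variables):
        # Update the slot first, then use it to update the variable.
        m_t = self.momentum * self._slots[(name, "m")] + grad
        self._slots[(name, "m")] = m_t
        variables[name] -= self.learning_rate * m_t


variables = {"w": 5.0}
opt = SlotMomentum(learning_rate=0.1, momentum=0.5)
opt._create_slots(variables)

for _ in range(200):
    grad = 2.0 * variables["w"]  # gradient of w**2
    opt._apply_dense(grad, "w", variables)

print(variables["w"])  # close to 0.0, the minimum of w**2
```

The slot survives between calls to _apply_dense(), which is what allows an optimizer to carry state such as a moving average from one step to the next.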

Let us now put everything together and show the implementation of PowerSign and AddSign.

First, you need the following modules for adding Ops,

from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.framework import ops
# The Optimizer class in this module defines the API to add Ops to train a model.
from tensorflow.python.training import optimizer
import tensorflow as tf

Let us now implement AddSign and PowerSign. Both optimizers are actually very similar and make use of the sign of the momentum m-hat and gradient g-hat for the update.

PowerSign

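For PowerSign, the variables w_(n+1) at the (n+1)-th epoch are obtained from w_n as follows (written here to match the code below, with learning rate \eta and base \alpha):

$$w_{n+1} = w_n - \eta \, \alpha^{f_n \, \mathrm{sign}(\hat{g}) \, \mathrm{sign}(\hat{m})} \, \hat{g}$$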

The decay-rate f_n in the following code is set to 1. I will not discuss this here, and I refer to the paper [1] for more details.

class PowerSign(optimizer.Optimizer):
    """Implementation of PowerSign.
    See [Bello et al., 2017](https://arxiv.org/abs/1709.07417)
    @@__init__
    """
    def __init__(self, learning_rate=0.001,alpha=0.01,beta=0.5, use_locking=False, name="PowerSign"):
        super(PowerSign, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._alpha = alpha
        self._beta = beta
        
        # Tensor versions of the constructor arguments, created in _prepare().
        self._lr_t = None
        self._alpha_t = None
        self._beta_t = None

    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._alpha_t = ops.convert_to_tensor(self._alpha, name="alpha_t")
        self._beta_t = ops.convert_to_tensor(self._beta, name="beta_t")

    def _create_slots(self, var_list):
        # Create a slot for the moving average m.
        for v in var_list:
            self._zeros_slot(v, "m", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        alpha_t = math_ops.cast(self._alpha_t, var.dtype.base_dtype)
        beta_t = math_ops.cast(self._beta_t, var.dtype.base_dtype)

        eps = 1e-7 #cap for moving average
        
        m = self.get_slot(var, "m")
        m_t = m.assign(tf.maximum(beta_t * m + eps, tf.abs(grad)))

        # Update var by subtracting lr * grad * alpha^(sign(grad) * sign(m_t)).
        var_update = state_ops.assign_sub(var, lr_t*grad*tf.exp( tf.log(alpha_t)*tf.sign(grad)*tf.sign(m_t)))
        # Create an op that groups multiple operations.
        # When this op finishes, all ops in input have finished.
        return control_flow_ops.group(*[var_update, m_t])

    def _apply_sparse(self, grad, var):
        raise NotImplementedError("Sparse gradient updates are not supported.")
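
To make the update rule concrete, here is a minimal sketch that repeats the arithmetic of _apply_dense above on plain Python floats. The helper names sign and power_sign_step are hypothetical and the input values are chosen only for illustration.

```python
import math


def sign(x):
    # Returns -1.0, 0.0 or 1.0, like tf.sign does for a scalar.
    return float((x > 0) - (x < 0))


def power_sign_step(w, grad, m, lr=0.001, alpha=0.01, beta=0.5, eps=1e-7):
    """One scalar update following the arithmetic of _apply_dense above."""
    m_t = max(beta * m + eps, abs(grad))
    scale = math.exp(math.log(alpha) * sign(grad) * sign(m_t))
    w_t = w - lr * grad * scale
    return w_t, m_t


w, m = 1.0, 0.0
w_pos, m_pos = power_sign_step(w, grad=2.0, m=m)
w_neg, m_neg = power_sign_step(w, grad=-2.0, m=m)
print(w_pos, m_pos)  # w decreases by roughly 2e-05
print(w_neg, m_neg)  # w increases by roughly 0.2
```

In this listing m_t is a maximum over non-negative quantities, so with m starting at zero sign(m_t) evaluates to 1 and the scaling factor is alpha for positive gradients and 1/alpha for negative ones.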

AddSign

AddSign is very similar to PowerSign as seen below,

class AddSign(optimizer.Optimizer):
    """Implementation of AddSign.
    See [Bello et al., 2017](https://arxiv.org/abs/1709.07417)
    @@__init__
    """

    def __init__(self, learning_rate=1.001,alpha=0.01,beta=0.5, use_locking=False, name="AddSign"):
        super(AddSign, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._alpha = alpha
        self._beta = beta
        
        # Tensor versions of the constructor arguments, created in _prepare().
        self._lr_t = None
        self._alpha_t = None
        self._beta_t = None
      
    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._alpha_t = ops.convert_to_tensor(self._alpha, name="alpha_t")
        self._beta_t = ops.convert_to_tensor(self._beta, name="beta_t")

    def _create_slots(self, var_list):
        # Create a slot for the moving average m.
        for v in var_list:
            self._zeros_slot(v, "m", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        beta_t = math_ops.cast(self._beta_t, var.dtype.base_dtype)
        alpha_t = math_ops.cast(self._alpha_t, var.dtype.base_dtype)
    
        eps = 1e-7 #cap for moving average
        
        m = self.get_slot(var, "m")
        m_t = m.assign(tf.maximum(beta_t * m + eps, tf.abs(grad)))
        
        var_update = state_ops.assign_sub(var, lr_t*grad*(1.0+alpha_t*tf.sign(grad)*tf.sign(m_t) ) )
        #Create an op that groups multiple operations
        #When this op finishes, all ops in input have finished
        return control_flow_ops.group(*[var_update, m_t])

    def _apply_sparse(self, grad, var):
        raise NotImplementedError("Sparse gradient updates are not supported.")
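
The same kind of sketch works for AddSign. The function add_sign_step below is a hypothetical scalar version of the _apply_dense method above, and the learning rate of 0.1 and the cost w**2 are assumptions picked so that the toy problem converges quickly.

```python
def sign(x):
    # Returns -1.0, 0.0 or 1.0, like tf.sign does for a scalar.
    return float((x > 0) - (x < 0))


def add_sign_step(w, grad, m, lr=0.001, alpha=0.01, beta=0.5, eps=1e-7):
    """One scalar update following the arithmetic of _apply_dense above."""
    m_t = max(beta * m + eps, abs(grad))
    scale = 1.0 + alpha * sign(grad) * sign(m_t)
    return w - lr * grad * scale, m_t


w, m = 1.0, 0.0
for _ in range(100):
    grad = 2.0 * w  # gradient of w**2
    w, m = add_sign_step(w, grad, m, lr=0.1)
print(w)  # close to 0.0, the minimum of w**2
```

Compared with PowerSign, the sign term enters additively, so the gradient is rescaled by a factor close to 1 rather than by a power of alpha.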

Performance testing the Optimizers

The Rosenbrock function is a famous performance test for optimization algorithms. The function is non-convex and, with the parameters used in the script below, is defined as

$$f(x, y) = (1 - x)^2 + 100 \, (y - x^2)^2$$

The resulting shape is plotted in figure (1) below. As we can see, it has a minimum at x = 1 and y = 1.
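
The location of the minimum can be checked numerically. The sketch below assumes the standard form of the Rosenbrock function with a = 1 and b = 100, which is the same expression that the TensorFlow script in this section builds for y; the helper names are illustrative.

```python
def rosenbrock(x, y):
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2


def rosenbrock_grad(x, y):
    # Partial derivatives obtained by differentiating the expression above.
    df_dx = -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2)
    df_dy = 200.0 * (y - x ** 2)
    return df_dx, df_dy


print(rosenbrock(1.0, 1.0))       # 0.0
print(rosenbrock(0.0, 0.0))       # 1.0
print(rosenbrock_grad(1.0, 1.0))  # both components vanish at the minimum
```

Both terms of the function are squares, so the value 0 reached at (1, 1) cannot be improved upon.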

The following script records, every MAX_STEP epochs, the Euclidean distance between the true minimum and the minimum approximated by a given optimizer.

import numpy as np

def RosenbrockOpt(optimizer, MAX_EPOCHS=4000, MAX_STEP=100):
    '''
    Returns the distance to the minimum (1, 1), recorded every MAX_STEP epochs.
    '''
    x1_data = tf.Variable(initial_value=tf.random_uniform([1], minval=-3, maxval=3, seed=0), name='x1')
    x2_data = tf.Variable(initial_value=tf.random_uniform([1], minval=-3, maxval=3, seed=1), name='x2')
    # Rosenbrock function: y = (1 - x1)^2 + 100 * (x2 - x1^2)^2
    y = tf.add(tf.pow(tf.subtract(1.0, x1_data), 2.0),
               tf.multiply(100.0, tf.pow(tf.subtract(x2_data, tf.pow(x1_data, 2.0)), 2.0)), 'y')
    global_step_tensor = tf.Variable(0, trainable=False, name='global_step')

    train = optimizer.minimize(y, global_step=global_step_tensor)

    sess = tf.Session()
    init = tf.global_variables_initializer()
    sess.run(init)

    minx = 1.0
    miny = 1.0

    distance = []
    xx_ = sess.run(x1_data)
    yy_ = sess.run(x2_data)
    # Print the starting point and its distance to the minimum.
    print(0, xx_, yy_, np.sqrt((minx - xx_)**2 + (miny - yy_)**2))
    for step in range(MAX_EPOCHS):
        _, xx_, yy_, zz_ = sess.run([train, x1_data, x2_data, y])
        if step % MAX_STEP == 0:
            print(step + 1, xx_, yy_, zz_)
            distance += [np.sqrt((minx - xx_)**2 + (miny - yy_)**2)]
    sess.close()
    return distance
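
The bookkeeping in this script can be reproduced without a TensorFlow session. The following sketch swaps in plain gradient descent for the optimizer and logs the distance to (1, 1) in the same way; the function track_distance, the starting point (0, 0) and the learning rate of 0.001 are assumptions for illustration, not the settings used for the results in this post.

```python
import math


def rosenbrock_grad(x, y):
    # Gradient of (1 - x)^2 + 100 * (y - x^2)^2.
    return (-2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2),
            200.0 * (y - x ** 2))


def track_distance(x, y, lr=0.001, max_epochs=2000, log_every=100):
    """Runs plain gradient descent and logs the distance to (1, 1)."""
    distances = []
    for step in range(max_epochs):
        gx, gy = rosenbrock_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        if step % log_every == 0:
            distances.append(math.sqrt((1.0 - x) ** 2 + (1.0 - y) ** 2))
    return distances


distances = track_distance(x=0.0, y=0.0)
print(len(distances))  # 20
print(distances[0], distances[-1])  # the last distance is smaller than the first
```

Plotting such a list against the epoch number gives the kind of convergence curve used to compare the optimizers.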

A performance comparison of each optimizer is plotted below for a run of 4000 epochs.

While the performance varies heavily with the choice of hyperparameters, the extremely fast convergence of PowerSign is worth noting.

Below, the coordinates of the approximations have been plotted for several epochs.

Epoch   RMSProp (x, y, z)       AddSign (x, y, z)       PowerSign (x, y, z)
0       (-2.39, -1.57, 4.26)    (-2.39, -1.57, 4.26)    (-2.39, -1.57, 4.26)
501     (0.66, 0.43, 0.13)      (0.41, 0.17, 0.34)      (0.97, 0.95, 0.0)
1001    (0.83, 0.67, 0.05)      (0.55, 0.30, 0.21)      (0.98, 0.96, 0.00)
2001    (0.93, 0.85, 0.03)      (0.69, 0.48, 0.09)      (0.98, 0.96, 0.00)
3001    (0.96, 0.92, 0.02)      (0.78, 0.60, 0.05)      (0.98, 0.97, 0.00)

Final Discussion:

TensorFlow allows us to create our own optimizers. Recent progress in research has delivered two promising new optimizers, PowerSign and AddSign.

The fast early convergence of PowerSign makes it an interesting optimizer to combine with others such as Adam.

References:

  1. Additional information on PowerSign and AddSign is available in the arXiv paper “Neural Optimizer Search with Reinforcement Learning”, Bello et al., https://arxiv.org/abs/1709.07417.
  2. Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
  3. unpublished
  4. I have found a lot of useful information through this Stack Overflow post, which I have attempted to bundle into this post.

Original. Reposted with permission.
