By Stathis Vafeias, AimBrain.
For most machine learning practitioners, designing a neural network is an art form. Usually, it begins with a common architecture, and then parameters are tweaked until a good combination of layers, activation functions, regularisers, and optimisation parameters is found. Guided by popular architectures (VGG, Inception, ResNets, DenseNets and others), one iterates through variations of the network until it achieves the desired balance of speed and accuracy. But as the available processing power increases, it makes sense to begin automating this network optimisation process.
In shallow models like Random Forests and SVMs, we are already able to automate the tweaking process through hyper-parameter optimisation. Popular toolkits like scikit-learn provide methods for searching the hyper-parameter space. In its simplest form, the hyper-parameter search is performed through a grid search over all candidate parameter combinations or through random sampling from a parameter distribution. These approaches face two problems: a) they waste resources searching in bad parameter regions, and b) they are inefficient at handling a dynamic set of parameters, so it is hard to alter the structure of the solver (i.e. the architecture of a DNN). More efficient methods like Bayesian optimisation deal with (a) but still suffer from (b). It is also hard to explore models in parallel in the Bayesian optimisation setting.
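To make the two basic search strategies concrete, here is a minimal sketch in plain Python. The scoring function toy_score, the parameter names learning_rate and depth, and their ranges are all made up for illustration; in practice the score would come from training and validating a real model.

```python
import itertools
import random


def toy_score(params):
    # Hypothetical stand-in for "train a model, return its validation score".
    # By construction it peaks at learning_rate=0.1 and depth=4.
    return 1.0 - abs(params["learning_rate"] - 0.1) - 0.05 * abs(params["depth"] - 4)


def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(samplers, score_fn, n_trials, rng):
    """Draw n_trials parameter sets from the samplers and return the best one."""
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in samplers.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 8]}
samplers = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
    "depth": lambda rng: rng.randint(2, 8),  # inclusive on both ends
}

best_grid, best_grid_score = grid_search(grid, toy_score)
best_random, best_random_score = random_search(samplers, toy_score, 20, random.Random(0))
```

Note that both functions assume a fixed set of parameter names. Neither has a natural way to add or remove a layer, which is problem (b) above.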
While the idea of automatically identifying the best models is not new, the recent large increase in processing power makes it more achievable than ever before, especially if the models we want to optimise are computationally hungry (e.g. DNNs).
The time has come! And it’s so important that even Google decided to include it in its annual Google I/O conference (at around the 16:00 minute mark), and I’m sure many others in the industry are doing so too. It is already a spotlighted project in our team @ AimBrain.
Get in touch twitter.com/techabilly | offbit.github.io
Check out what we do in AimBrain @ http://aimbrain.com
One way to think of hyper-parameter optimisation is to pose it as a meta-learning problem.
Can we make an algorithm that will explore which networks perform better at a given problem?
Note: Maybe meta-learning is a bit of an abuse of the term; let’s not confuse it with approaches like learning to learn (e.g. the really interesting “Learning to learn by gradient descent by gradient descent” from DeepMind, https://arxiv.org/abs/1606.04474). So allow me to use the term meta-learning loosely.
Our goal is to define how many hidden layers (green) the network will have and the parameters of each one.
The goal is to explore both the architecture and the parameter space of the model in order to optimise its performance on a given dataset. This problem is complicated by nature and the rewards are sparse. When we say that the reward is sparse, we mean that we need to train a network to a sufficient point and then evaluate it; we only get a score once the train-evaluate cycle is done. That score tells us how the whole system has performed. This type of reward is not a differentiable function! Remind you of something? Yes, it is a typical Reinforcement Learning scenario!
Quoting Wikipedia on RL:
Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximise some notion of cumulative reward.
Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
The agent in our scenario is a model and the environment is the dataset we use to train and evaluate. The interpreter is the process that analyses the reward of each episode and sets the state of the agent (or the policy; in our setting, it sets the parameters of the network).
Typically, reinforcement learning problems are modelled as a Markov decision process. The goal is to optimise the total reward of an agent, and at each step you are called to make a decision: either a) optimise the outcome given the model that you have created so far (exploitation) or b) explore a new action (exploration). The agent forms a policy, which improves the more it interacts with the environment.
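The exploit-or-explore decision can be illustrated with an epsilon-greedy rule on a multi-armed bandit. This is a deliberately simplified sketch, not a full Markov decision process and not the method used later in this post; the function name, the Gaussian reward model, and the arm means are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(true_means, n_steps, epsilon, rng):
    """Run an epsilon-greedy agent on a bandit with Gaussian rewards.

    true_means is hidden from the agent; it only sees sampled rewards.
    """
    n_arms = len(true_means)
    counts = [0] * n_arms          # how often each arm was pulled
    estimates = [0.0] * n_arms     # running mean reward per arm
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            # b) explore: try a random action
            arm = rng.randrange(n_arms)
        else:
            # a) exploit: pick the action that looks best so far
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy(
    true_means=[0.0, 0.5, 2.0], n_steps=2000, epsilon=0.1, rng=random.Random(0)
)
```

With a small epsilon the agent spends most of its steps on whatever currently looks best, and only occasionally gathers information about the alternatives.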
Note: this topic is outside the scope of this post; the best introductory book on the topic is probably “Reinforcement Learning: An Introduction” by R. Sutton and A. Barto.
An alternative approach to solving the reinforcement learning scenario is through evolutionary algorithms. Inspired by biological evolution, an evolutionary algorithm searches the solution space by creating a population of solutions. It then evaluates each solution and evolves the population according to the fitness (score) of each solution. Evolution involves selection and mutation of the fittest members of the population. As a result, the population evolves to increase its overall fitness and produce viable solutions to the problem.
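The evaluate-select-mutate cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: genomes are fixed-length lists of floats, the fitness function sphere is a toy objective, and names such as evolve and n_parents are made up here. Evolving network architectures would need a richer genome and a fitness that trains and evaluates a model.

```python
import random


def sphere(genome):
    # Toy fitness: higher is better, with the optimum at the all-zero genome.
    return -sum(gene * gene for gene in genome)


def evolve(fitness, genome_length, population_size, n_generations, n_parents, rng):
    """Evolve a population of float lists and return the best genome found."""
    population = [
        [rng.uniform(-1.0, 1.0) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(n_generations):
        # Evaluate: rank every solution by its fitness.
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        # Select: keep the fittest members as parents (they survive unchanged).
        parents = ranked[:n_parents]
        # Mutate: refill the population with perturbed copies of the parents.
        children = []
        while len(parents) + len(children) < population_size:
            parent = rng.choice(parents)
            children.append([gene + rng.gauss(0.0, 0.1) for gene in parent])
        population = parents + children
    return max(population, key=fitness), best_per_generation


best, history = evolve(
    sphere, genome_length=5, population_size=20,
    n_generations=30, n_parents=5, rng=random.Random(0),
)
```

Because the parents are carried over unchanged, the best fitness in history never decreases from one generation to the next.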
The cycle of an evolutionary algorithm
The illustration demonstrates the cycle of evolution. The two parts needed to design an evolutionary algorithm are a) the selection process and b) the crossover/mutation strategy to follow.
Selection: the way parents are picked. Common practice is to pick the k best plus some random individuals for diversity. More advanced selection techniques involve creating different subgroups of the population (usually referred to as species) and then selecting the top k within each species, which helps to preserve diversity in the solution space. Another popular method is tournament selection, where randomly selected individuals compete in a tournament and the winners are the individuals selected to pass on their genes.
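Two of the selection schemes mentioned above, sketched in plain Python. The function names and the toy population are assumptions for illustration; fitness is any function that maps an individual to a score where higher is better.

```python
import random


def select_top_k_plus_random(population, fitness, k, n_random, rng):
    """Keep the k fittest individuals plus a few random others for diversity."""
    ranked = sorted(population, key=fitness, reverse=True)
    elite, rest = ranked[:k], ranked[k:]
    lucky = rng.sample(rest, min(n_random, len(rest)))
    return elite + lucky


def tournament_select(population, fitness, tournament_size, rng):
    """Pick tournament_size individuals at random and return the fittest."""
    contestants = rng.sample(population, tournament_size)
    return max(contestants, key=fitness)


# Toy population: one-gene genomes whose fitness is simply the gene value.
population = [[i] for i in range(10)]
rng = random.Random(0)

parents = select_top_k_plus_random(population, sum, k=3, n_random=2, rng=rng)
winner = tournament_select(population, sum, tournament_size=4, rng=rng)
```

A smaller tournament_size gives weaker individuals a better chance of winning, which trades selection pressure for diversity.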
Crossover: the way two (or more) parents are mixed to produce an offspring. This is highly dependent on the way the problem is structured. A common approach is to describe each parent with a list of elements (usually numerical values) and then select random parts from each parent to compose a new list (genome).
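For list-based genomes, two common ways of picking "random parts from each parent" are uniform crossover and one-point crossover. The sketch below assumes both parents have the same length (at least two genes for the one-point variant); the function names are illustrative.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Build a child by taking each gene from either parent at random."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return [rng.choice((a, b)) for a, b in zip(parent_a, parent_b)]


def one_point_crossover(parent_a, parent_b, rng):
    """Take the head of parent_a and the tail of parent_b, split at a random point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length, at least 2")
    point = rng.randint(1, len(parent_a) - 1)  # both parents always contribute
    return parent_a[:point] + parent_b[point:]


mother = [1, 2, 3, 4]
father = [10, 20, 30, 40]
rng = random.Random(0)

child_uniform = uniform_crossover(mother, father, rng)
child_one_point = one_point_crossover(mother, father, rng)
```

Genomes of different lengths, such as networks with different numbers of layers, need a crossover that is designed around the encoding, which is why this step depends so much on how the problem is structured.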
Mutation: the process of altering the genome randomly. It’s a major exploration factor and helps maintain the diversity of the population.
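A minimal sketch of mutation for a genome of numerical values, assuming Gaussian noise applied independently to each gene. The parameters mutation_rate and sigma are illustrative names, not something prescribed by the text above.

```python
import random


def mutate(genome, mutation_rate, sigma, rng):
    """Return a copy of the genome where each gene is perturbed with
    probability mutation_rate by Gaussian noise of standard deviation sigma."""
    return [
        gene + rng.gauss(0.0, sigma) if rng.random() < mutation_rate else gene
        for gene in genome
    ]


genome = [0.5, -1.0, 2.0, 0.0]
rng = random.Random(0)

unchanged = mutate(genome, mutation_rate=0.0, sigma=0.1, rng=rng)
mutated = mutate(genome, mutation_rate=0.5, sigma=0.1, rng=rng)
```

A higher mutation_rate or sigma pushes the search towards exploration, while lower values keep children close to their parents.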