By Ahmed Gad, Menoufia University
This article will help you understand why we need the learning rate and whether it is useful for training an artificial neural network. Using very simple Python code for a single-layer perceptron, we will change the learning rate value to see what it does.
The learning rate is an obstacle for newcomers to artificial neural networks. I have been asked many times about the effect of the learning rate on the training of artificial neural networks (ANNs). Why do we use a learning rate? What is the best value for it? In this article, I will try to make things simpler with an example that shows how the learning rate helps to train an ANN. I will start by explaining the example and its Python code before working with the learning rate.
A very simple example is used to avoid complexity and let us focus on the learning rate. A single numerical input is applied to a single-layer perceptron. If the input is 250 or smaller, its value is returned as the output of the network. If the input is larger than 250, it is clipped to 250. The following table shows the 6 samples used for training.
The architecture of the ANN used is shown in the next figure. There are just input and output layers. The input layer has just a single neuron for our single input. The output layer has just a single neuron for generating the output. The output layer neuron is responsible for mapping the input to the correct output. There is also a bias applied to the output layer neuron with weight b and value +1. There is also a weight W for the input.
The equation and the graph of the activation function used in this example are shown in the next figure. When the input is less than or equal to 250, the output is the same as the input. Otherwise, it is clipped to 250.
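As a minimal sketch of the activation function and the single-neuron forward pass described above, something like the following could be used. The name predict and the limit parameter are illustrative assumptions and are not taken from the original listing.

```python
def activation_function(x, limit=250):
    """Return x unchanged when it is at most `limit`; otherwise clip it to `limit`."""
    return limit if x > limit else x


def predict(bias_weight, input_weight, x):
    """Forward pass of the single output neuron; the bias input is fixed at +1."""
    return activation_function(bias_weight * 1 + input_weight * x)


print(activation_function(60))    # below the limit, returned unchanged
print(activation_function(300))   # above the limit, clipped
print(predict(0.05, 0.1, 60))     # weighted sum of bias and input, then activation
```

With b=0.05 and W=0.1, an input of 60 gives a weighted sum of about 6.05, which is below the limit and so passes through unchanged.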
Implementation using Python
The Python code implementing the entire network is shown below. We will go through all of it to make it as easy as possible, then focus on changing the learning rate to find out how it affects network training.
The equation uses the weights of the current step (n) to generate the weights of the next step (n+1). This is the equation we will use to see how the learning rate affects the learning process.
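A weight update of this kind could be sketched as follows, assuming the form "new weights = current weights + learning rate * (desired - predicted) * input" that the worked numbers later in this article follow. The function signature and argument names are assumptions made for illustration, not the original listing.

```python
import numpy as np


def update_weights(weights, desired, predicted, x, learning_rate):
    """Compute the step n+1 weights from the step n weights.

    The same scalar, learning_rate * (desired - predicted) * x, is added to
    both the bias weight and the input weight, as in the article's arithmetic.
    """
    return weights + learning_rate * (desired - predicted) * x


weights = np.array([0.05, 0.1])  # [bias weight b, input weight W]
new_weights = update_weights(weights, desired=60, predicted=6.05, x=60,
                             learning_rate=0.00001)
print(new_weights)
```

Keeping the learning rate as an explicit argument makes it easy to try different values later.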
Finally, we need to combine all of these pieces to make the network learn. This is done by the training_loop(inpt, weights) function, defined from line 20 to 31. It runs a training loop that maps the inputs to their outputs with the least possible prediction error.
The loop does three operations:
1. Output Prediction.
2. Error Calculation.
3. Updating Weights.
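The three operations above can be sketched as one loop. This is a simplified illustration, not the original listing: the signature differs from training_loop(inpt, weights), the fixed iteration count is an assumption, and the six training samples are hypothetical values chosen to follow the clipping rule (only the inputs 60 and 40 appear in the text).

```python
import numpy as np


def activation_function(x, limit=250):
    return limit if x > limit else x


def training_loop(inputs, desired, weights, learning_rate, iterations):
    """Cycle through the samples, repeating the three operations each time."""
    error = None
    for i in range(iterations):
        idx = i % len(inputs)                 # wrap around the training samples
        x, d = inputs[idx], desired[idx]
        # 1. Output prediction (the bias input is +1)
        predicted = activation_function(weights[0] * 1 + weights[1] * x)
        # 2. Error calculation
        error = abs(d - predicted)
        # 3. Updating weights
        weights = weights + learning_rate * (d - predicted) * x
    return error, weights


# Hypothetical training samples: output equals input, clipped at 250
inputs = np.array([60, 40, 100, 300, -50, 310])
desired = np.array([60, 40, 100, 250, -50, 250])

error, weights = training_loop(inputs, desired, np.array([0.05, 0.1]),
                               learning_rate=0.00001, iterations=2000)
print(error, weights)
```

The function returns the error of the last sample seen together with the final weights, so the caller can judge how well the loop has settled.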
Now that we have the idea of the example and its Python code, let us see how the learning rate helps to get the best results.
In the example discussed above, line 13 holds the weight update equation, in which the learning rate is used. First, let us assume that we do not use the learning rate at all. The equation then becomes: new weights = current weights + (desired output - predicted output) * input.
Let us see the effect of removing the learning rate. In the first iteration of the training loop, the network has the following values: b=0.05, W=0.1, input=60, and desired output=60.
The predicted output, which is the result of the activation function in line 25, will be activation_function(0.05(+1) + 0.1(60)) = 6.05.
In line 26, the prediction error will be calculated by getting the difference between the desired and the predicted output. The error will be abs(60-6.05)=53.95.
Then in line 27 the weights will get updated according to the above equation. The new weights will be [0.05, 0.1] + (53.95)*60 = [0.05, 0.1] + 3237 = [3237.05, 3237.1].
The new weights are very different from the previous ones. Each weight increased by 3,237, which is too large. But let us continue and make the next prediction.
In the next iteration, the network has these values: b=3237.05, W=3237.1, input=40, and desired output=40. The predicted output will be activation_function(3237.05(+1) + 3237.1(40)) = 250, because the weighted sum of 132,721.05 is clipped. The prediction error will be abs(40 - 250) = 210. The error is very large, larger than the previous one, so we have to update the weights again. According to the above equation, the new weights will be [3237.05, 3237.1] + (-210)*40 = [3237.05, 3237.1] + (-8400) = [-5162.95, -5162.9].
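The two iterations worked through above can be checked with a short script. It is only a sketch: the starting weights and the two samples come from the text, while the variable names and the history list are assumptions added for illustration.

```python
import numpy as np


def activation_function(x, limit=250):
    return limit if x > limit else x


weights = np.array([0.05, 0.1])   # [b, W], starting values from the text
samples = [(60, 60), (40, 40)]    # (input, desired output) pairs from the text

history = []
for x, desired in samples:
    predicted = activation_function(weights[0] * 1 + weights[1] * x)
    error = abs(desired - predicted)
    # Update WITHOUT a learning rate: the raw error times the input
    weights = weights + (desired - predicted) * x
    history.append((predicted, error, weights.copy()))
    print(f"input={x} predicted={predicted:.2f} error={error:.2f} "
          f"new weights={weights}")
```

The weights jump from small positive values to large positive ones and then to large negative ones within two steps, which is the abrupt swing described in the text.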
The next table summarizes the results of the first three iterations:
As we go through more iterations, the results get worse. The magnitude of the weights changes rapidly, and sometimes their signs change too. They move from a very large positive value to a very large negative value. How can we stop these large and abrupt changes in the weights? How can we scale down the value by which the weights are updated?
If we look at the value by which the weights change in the previous table, it is very large. This means that the network changes its weights at high speed. It is like someone who makes large moves in a short time: at one moment the person is in the far east, and a very short time later that person is in the far west. We just need to slow it down.
If we can scale this value down, everything will be all right. But how?
Going back to the code that generates this value, we find that it comes from the update equation. Specifically, it is the part that multiplies the error by the input: (desired output - predicted output) * input.
We can scale this part by multiplying it by a small value such as 0.1. So, rather than generating 3237.0 as the update value in the first iteration, it will be reduced to just 323.7. We can scale this value down even further by decreasing the scale value to, say, 0.001. Using 0.001, the value will be just 3.237.
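The effect of the scaling factor on the first update can be seen with a few lines of arithmetic. This is only an illustration; the variable names and the particular list of scale values are assumptions.

```python
# Raw update of the first iteration: (desired - predicted) * input
raw_update = (60 - 6.05) * 60

# The same update after multiplying by progressively smaller scale values
scaled = {scale: raw_update * scale for scale in (1.0, 0.1, 0.001, 0.00001)}

for scale, value in scaled.items():
    print(f"scale={scale:<8} update={value:.5f}")
```

Each tenfold reduction of the scale value reduces the update by the same factor, so the step size can be made as small as needed.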
Now we can see it. This scaling value is the learning rate. Choosing a small value for the learning rate makes the weight updates smaller and avoids abrupt changes. As the value gets larger, the changes get faster and the results get worse.
But what is the best value for the learning rate?
No single value can be called the best learning rate. The learning rate is a hyperparameter, and a hyperparameter has its value determined by experiment: we try different values and use the one that gives the best results. There are some techniques that help you select hyperparameter values.
For our problem, I found that a value of 0.00001 works fine. After training the network with that learning rate, we can run a test. The following table shows the predictions for 4 new testing samples. The results are now much better with the learning rate in place.
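To try this yourself, a sketch along the following lines could train the perceptron with a learning rate of 0.00001 and then predict a few unseen inputs. The training samples, the test inputs, the 2000 iterations, and the helper names are all assumptions made for illustration; they are not the samples from the tables in this article.

```python
import numpy as np


def activation_function(x, limit=250):
    return limit if x > limit else x


def predict(weights, x):
    return activation_function(weights[0] * 1 + weights[1] * x)


def train(inputs, desired, weights, learning_rate, iterations=2000):
    for i in range(iterations):
        idx = i % len(inputs)
        error = desired[idx] - predict(weights, inputs[idx])
        weights = weights + learning_rate * error * inputs[idx]
    return weights


# Hypothetical training samples: output equals input, clipped at 250
train_inputs = np.array([60, 40, 100, 300, -50, 310])
train_desired = np.array([60, 40, 100, 250, -50, 250])

weights = train(train_inputs, train_desired, np.array([0.05, 0.1]),
                learning_rate=0.00001)

# Hypothetical unseen test inputs and the outputs the clipping rule expects
test_inputs = [20, 150, 260, 400]
test_expected = [20, 150, 250, 250]
predictions = [predict(weights, x) for x in test_inputs]

for x, expected, predicted in zip(test_inputs, test_expected, predictions):
    print(f"input={x} expected={expected} predicted={predicted:.2f}")
```

Under these assumptions the predictions should land close to the expected outputs, though not exactly on them, because both weights receive the same update and so cannot reach the ideal values independently.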
Bio: Ahmed Gad received his B.Sc. degree, with a grade of excellent with honors, in information technology from the Faculty of Computers and Information (FCI), Menoufia University, Egypt, in July 2015. For being ranked first in his faculty, he was recommended to work as a teaching assistant in one of the Egyptian institutes in 2015 and then in 2016 to work as a teaching assistant and a researcher in his faculty. His current research interests include deep learning, machine learning, artificial intelligence, digital signal processing, and computer vision.
Original. Reposted with permission.