The database

How to train a neural network to code by itself?

Let’s admit it, it would be quite crazy: a developer training a neural network to replace them and code in their place! OK, let’s do that.

Prerequisites

  • TensorFlow and basic deep learning skills.
  • The GitHub repository of the project.
  • In this article, I will give a very quick reminder of recurrent neural networks. However, if you don’t know much about the subject, these two resources seem like a good place to start: a video and an article. :)

I will not explain every part of the project in this article. Instead, I will go into detail on the essential points you need in order to understand it. Take the time to run each of the small pieces of code given in order to understand the logic. It’s important; one learns by doing.

This time it’s for real, let’s go!

Like any supervised training, we are going to need a dataset for our network. We will base it on C code (if the language is too easy, it’s no fun). Our training data will be C source files from the Linux repository on GitHub. I have already extracted the .c files that interest us into the project.

First problem: how to represent our data?

A neural network only processes numbers. Everything else is unknown to it. Thus, each character of our dataset must be mapped to a number (one number per character).

Here, for example, the character “=” is assigned the number 7. We will later represent each number with one-hot encoding so that the network converges better during backpropagation.

The three important variables to remember here are vocab_to_int, int_to_vocab and encoded. The first two let us switch easily from a character to an int and back. The last one is our whole dataset in encoded form (only ints instead of characters).
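As a minimal sketch of this step, assuming the dataset has been read into a single string, the three variables could be built as follows. The sample text below is a stand-in and not the real Linux sources, so the integer assigned to each character will differ from the project’s (“=” will not necessarily map to 7).

```python
# Stand-in for the concatenated .c files (assumption: the real project reads them from disk).
text = "int main(void)\n{\n\treturn 0;\n}\n"

# Sorted so that the mapping is reproducible from one run to the next.
vocab = sorted(set(text))

# Character -> integer and integer -> character lookup tables.
vocab_to_int = {char: index for index, char in enumerate(vocab)}
int_to_vocab = {index: char for char, index in vocab_to_int.items()}

# The whole dataset as a list of integers.
encoded = [vocab_to_int[char] for char in text]

# Decoding the integers gives back the original text.
decoded = "".join(int_to_vocab[index] for index in encoded)
print(encoded[:10])
print(decoded == text)
```

Sorting the vocabulary is a design choice of this sketch: iterating over a bare set would give a different mapping on each run, which would make a saved model unusable.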

Let’s create a simple batch consisting of two sequences where each sequence will consist of 10 numbers. This batch will serve as an example for the rest of this article.

Batch Inputs : 
[20 6 58 27 6 27 97 86 56 49]
[ 36 32 32 37 27 12 94 60 89 101]
Batch Targets :
[ 6 58 27 6 27 97 86 56 49 57]
[ 32 32 37 27 12 94 60 89 101 77]

That’s what the batch looks like. You can also display the inputs translated back into characters:

['/', '*', '\n', ' ', '*', ' ', 'C', 'o', 'p', 'y']
['2', '0', '0', '4', ' ', 'E', 'v', 'g', 'e', 'n']
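The targets above are simply the inputs shifted one character to the left. Here is a hedged sketch of how such a batch could be cut out of encoded. The helper make_batch and its parameters are hypothetical names chosen for this illustration. For simplicity it takes the rows one after another, whereas in the batch above the two rows come from different places in the data, so the project slices its batches differently.

```python
def make_batch(encoded, n_seqs, seq_len, start=0):
    """Cut n_seqs consecutive windows of seq_len integers out of `encoded`.

    Hypothetical helper: each target row is its input row shifted by one
    position, so the network learns to predict the next character.
    """
    inputs, targets = [], []
    for row in range(n_seqs):
        begin = start + row * seq_len
        end = begin + seq_len
        if end + 1 > len(encoded):
            raise ValueError("not enough data for this batch")
        inputs.append(encoded[begin:end])
        targets.append(encoded[begin + 1:end + 1])
    return inputs, targets


# Toy data: the integers 0..24 stand in for encoded characters.
encoded = list(range(25))
inputs, targets = make_batch(encoded, n_seqs=2, seq_len=10)
print(inputs)
print(targets)
```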

Now we have some first values to work with. We want our network to predict the next character to be typed given the n preceding characters, not just the previous one. Indeed, if I tell my network that the last letter typed is an “e”, the number of possible continuations is huge. However, if I tell it that the last characters typed are “w”, “h”, “i”, “l” and “e”, then it’s much more obvious that the next character will be a “(”.

We must therefore create a neural network that takes into account the order in which the characters were typed. To do this, we need to use a recurrent neural network.

To illustrate the last example, a classic classifier (on the left of the diagram) takes the preceding letter and passes it through the hidden layer, represented in blue, to deduce an output. A recurrent neural network is architecturally different: each cell (represented in red) is connected not only to the inputs, but also to the cell at time t-1. To solve our problem, we will use LSTM (long short-term memory) cells.

Feel free to spend time understanding the principle of recurrent neural networks in order to fully understand the code that will follow.

In this article we will go through the 5 main parts of the model in detail: the placeholders serving as inputs to our model; the initialization of the LSTM cells used to create the RNN; the output layer connected to each cell; the operation used to measure the model error; and finally the training operation.

I) Graph inputs

The batch consists of two inputs of size 10, so the shape expected for our input is [2, 10]. Since each entry of the batch is associated with a single output, we can define the same shape for our targets. Finally, we define a placeholder for the probability value used later for dropout.

II) LSTM

Let’s study each part of this code:

  • create_cell() is used to create an LSTM cell composed of 4 hidden neurons. This function also adds dropout to the cell output before returning the cell to us.
  • tf.contrib.rnn.MultiRNNCell is used to easily instantiate our RNN. We pass it a list of cells built with create_cell() because we want an RNN consisting of several layers (three in this example).
  • initial_state: since each cell of an RNN depends on the previous state, we must create an initial state filled with zeros that will serve as input for the first entries of our batch.
  • x_one_hot transforms our batch into one-hot encoding.
  • cell_outputs gives us the output of each cell of our RNN. Here, each output will consist of 4 values (the number of hidden neurons).
  • final_state returns the state of our last cell, which can be used during training as the initial state for the next batch (assuming that the next batch is the logical continuation of the previous one).
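To make the x_one_hot step concrete, here is a plain-Python sketch of one-hot encoding a batch. It is not the TensorFlow code of the project: one_hot, one_hot_batch and num_classes are illustrative names, and the vocabulary size of 5 is chosen only to keep the output readable.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def one_hot_batch(batch, num_classes):
    """Encode a [n_seqs, seq_len] batch as [n_seqs, seq_len, num_classes]."""
    return [[one_hot(index, num_classes) for index in sequence]
            for sequence in batch]


batch = [[0, 3, 1], [4, 4, 2]]   # 2 sequences of 3 integers
x_one_hot = one_hot_batch(batch, num_classes=5)
print(x_one_hot[0][1])           # encoding of the integer 3: [0.0, 0.0, 0.0, 1.0, 0.0]
```

With the batch used in this article, the same operation would turn the [2, 10] input into a [2, 10, vocabulary size] array.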

III) Graph outputs

The values at the output of our cells are stored in a three-dimensional array of shape [number of sequences, sequence size, number of neurons], that is, [2, 10, 4]. We no longer need to separate the outputs by sequence, so we reshape the output into an array of shape [20, 4], stored in the seq_out_reshape variable.

Finally, we apply a simple linear operation, tf.matmul(..) + b, followed by a softmax in order to represent our outputs as probabilities.
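The shapes involved are easier to follow with a small plain-Python sketch. It uses random numbers in place of real cell outputs, and the names vocab_size, W and rng as well as the vocabulary size of 6 are assumptions made for this illustration; the project does the same thing with TensorFlow operations.

```python
import math
import random


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


n_seqs, seq_len, n_hidden, vocab_size = 2, 10, 4, 6  # vocab_size is illustrative
rng = random.Random(0)

# Fake cell outputs with shape [2, 10, 4].
cell_outputs = [[[rng.uniform(-1, 1) for _ in range(n_hidden)]
                 for _ in range(seq_len)] for _ in range(n_seqs)]

# Flatten the first two dimensions: [2, 10, 4] -> [20, 4].
seq_out_reshape = [step for sequence in cell_outputs for step in sequence]

# Linear layer: weights W of shape [4, vocab_size] and bias b of shape [vocab_size].
W = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(n_hidden)]
b = [0.0] * vocab_size

# Equivalent of matmul(seq_out_reshape, W) + b, giving shape [20, vocab_size].
logits = [[sum(row[k] * W[k][j] for k in range(n_hidden)) + b[j]
           for j in range(vocab_size)] for row in seq_out_reshape]
probabilities = [softmax(row) for row in logits]

print(len(seq_out_reshape), len(seq_out_reshape[0]))  # 20 4
print(len(probabilities), len(probabilities[0]))      # 20 6
```

Each of the 20 rows is now a probability distribution over the vocabulary, one per character position in the batch.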

IV) Loss

In order to apply our error operation, the targets of our batch must be represented in the same way and with the same dimensions as the output values of the model. We use tf.one_hot to give our targets the same encoding as our inputs. Then we reshape the array (tf.reshape()) to the same dimensions as the linear output: tf.matmul(..) + b. We can now use the loss function to calculate the error of the model.
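As an illustration, and assuming the error is the usual softmax cross-entropy (the text does not name the function explicitly), the computation on one-hot targets can be sketched in plain Python. The probability values and the 3-character vocabulary below are made up for the example.

```python
import math


def cross_entropy(probabilities, target_one_hot):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p)
                for p, t in zip(probabilities, target_one_hot) if t > 0)


# Two predictions over a 3-character vocabulary (illustrative values).
predictions = [[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]]
targets = [0, 1]  # indices of the correct characters

# Same encoding and same shape as the predictions: [2, 3].
targets_one_hot = [[1.0 if i == t else 0.0 for i in range(3)] for t in targets]

losses = [cross_entropy(p, t) for p, t in zip(predictions, targets_one_hot)]
loss = sum(losses) / len(losses)  # mean over all positions
print(losses, loss)
```

The first prediction puts 0.7 on the correct character and gets a small loss; the second puts only 0.1 on it and is penalized much more.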

V) Train

We simply apply an AdamOptimizer to minimize the error.

We finally reach what I think is one of the most rewarding parts: the results of the training. For this run, I used the following parameters:
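For readers curious about what an Adam optimizer does at each step, here is a sketch of the standard Adam update rule applied to a single scalar parameter. It is a toy written in plain Python, not the TensorFlow implementation; adam_step is a hypothetical name, the beta and epsilon values are common defaults assumed here, and the learning rate of 0.0005 is the one used for the training run described in this article.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem: minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)
```

With such a small learning rate, x only moves a little toward 0 in 100 steps, which gives an idea of why the real training takes hours.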

  • Size of a sequence: 100
  • Size of a batch: 200
  • Number of neurons per cell: 512
  • Depth of RNN: 2
  • Learning rate: 0.0005
  • Dropout: 0.5

The results presented below were obtained after about two hours of training on my GPU (GeForce GTX 1060).

Let’s start with the evolution of the error:

Finally, let’s look at what type of code our model is capable of generating:

It’s pretty cool to see that this model has clearly understood the general structure of a program: a function, parameters, initialization of variables, conditions, etc.

Note that there is absolutely no function named “super_fold” in our dataset. So I have a hard time understanding the purpose of this function; perhaps this model is more intelligent than me...