Entity Embeddings

Structured Deep Learning

  • Fast
  • No Domain Knowledge Required
  • High Performing

This post focuses on a less widely known application area of deep learning: structured data.

In machine learning, deep learning, or any kind of predictive modeling task, data comes before the algorithm/methodology. This is the main reason why classical machine learning requires a lot of feature engineering for tasks such as image classification and NLP, where the "unusual" data can't be fed directly into a logistic regression or a random forest model. In contrast, deep learning handles these tasks remarkably well without any of that nasty and time-consuming feature engineering, which usually demands domain knowledge, creativity, and a lot of trial and error. Of course domain expertise and clever feature engineering are still very valuable, but the techniques I will mention throughout this post are enough for you to aim for the top 3 in a Kaggle competition (http://blog.kaggle.com/2016/01/22/rossmann-store-sales-winners-interview-3rd-place-cheng-gui/) without any prior knowledge of the subject.

Because of its ability to learn complex features on its own (e.g. the convolutional layers of CNNs), deep learning is abundantly applied to all kinds of image, text, and audio problems. These are without a doubt very important problems for the advancement of AI, and there must be a very good reason why every year the top researchers in the field compete to classify cats, dogs, and ships better than the previous year. But these are rarely the cases we see in industry. Companies work with databases of structured data, and these are the domains that shape everyday lives.

Let's define structured data to be clearer for the rest of this post. You can think of rows as individual data points or observations, and columns as fields that each represent a single attribute of an observation. For example, data from an online retail store might have rows as sales made by customers and columns as item bought, quantity, price, time stamp, and so on.

Below we have online seller data, with rows as individual sales and columns describing each particular sale.

Let's talk about how we can leverage neural networks for structured data tasks. At a theoretical level it's trivial: create a fully connected network with any desired architecture and use the "rows" as inputs. Given a loss function, after a couple of dot products and backpropagation we end up with a trained network that can then be used to make predictions.
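Just to make that concrete, here is a minimal sketch in PyTorch (the layer sizes and data are made up purely for illustration):

import torch
import torch.nn as nn

# assume each row has 10 numerical features and we predict a single value
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.randn(128, 10)            # a batch of 128 "rows"
y = torch.randn(128, 1)             # their targets
loss = nn.MSELoss()(model(x), y)    # forward pass and loss
loss.backward()                     # backpropagation through the whole network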

Even though this seems very straightforward, there are major reasons why people prefer tree-based methods over neural networks when it comes to structured data. This can be understood from the algorithm's point of view, by exploring how algorithms actually see and treat our data.

The main separation between structured and unstructured data is that with unstructured data, even though it's "unusual", we usually deal with single entities in a single unit: pixels, voxels, audio frequencies, radar backscatters, sensor measurements, and so on. With structured data, by contrast, we often need to deal with many different types of data under two main groups: numerical and categorical. Categorical data requires processing prior to training, because most algorithms, as well as neural networks, can't handle it directly yet.

There are various options for encoding categorical variables, such as label/numerical encoding and one-hot encoding. These techniques are problematic both in terms of memory and in how faithfully they represent the categorical levels. The memory problem is the more obvious one; the representation problem can be illustrated with an example.

Let's say we have day-of-week information as a column. If we one-hot encode this variable we assume an equal distance between every pair of levels, and if we arbitrarily label encode it we impose an arbitrary distance between them.

Both approaches assume that the difference between each pair of days is the same, but we know that in reality this isn't true, and our algorithm should know it too!
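Here is a tiny sketch of the two encodings (a toy pandas example, values made up) to see where those assumptions come from:

import pandas as pd

dow = pd.Series(['Mon', 'Tue', 'Wed', 'Sun'], dtype='category')

label_encoded = dow.cat.codes   # Mon=0, Sun=1, Tue=2, Wed=3: alphabetical, i.e. arbitrary ordering
one_hot = pd.get_dummies(dow)   # 4 binary columns: every pair of days is equally far apart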

“The continuous nature of neural networks limits their applicability to categorical variables. Therefore, naively applying neural networks on structured data with integer representation for category variables does not work well” [1]

Tree algorithms do not require any assumptions about the continuity of categorical variables, since they can isolate levels by splitting as needed, but with neural networks this is not the case. This is where entity embeddings come to help. An entity embedding maps discrete values into a multi-dimensional space in which values with similar function outputs end up close to each other. For example, if we were embedding the states of a country for a sales problem, states that are similar in terms of sales would be closer to each other in this projected space.

Since we don't want to make arbitrary assumptions about the levels of our categorical variables, we will learn a better representation of each level in Euclidean space. This representation is nothing but the dot product of the one-hot encoded data and a matrix of learnable weights.
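This equivalence between a dot product with one-hot vectors and a simple row lookup is easy to verify (a toy sketch, sizes picked arbitrarily):

import torch

m, D = 7, 3                                 # 7 days, 3-dimensional embedding
W = torch.randn(m, D, requires_grad=True)   # learnable embedding matrix

day = 2                                     # index of some day of the week
one_hot = torch.zeros(m)
one_hot[day] = 1.0

via_dot_product = one_hot @ W               # dot product of one-hot data with the weights...
via_lookup = W[day]                         # ...is exactly row `day` of the matrix
assert torch.allclose(via_dot_product, via_lookup)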

Embeddings are very widely used in NLP, where each word is represented as a vector. Two famous embedding examples are GloVe and word2vec. We can see how powerful embeddings are in figure 4 [2]. These vectors are readily available for you to download and use however fits your goal, which is actually very cool considering the information they hold.

Even though embeddings can be applied in different contexts, in either a supervised or an unsupervised manner, our main objective here is to understand how to do these projections for categorical variables.

Even though entity embeddings have a different name, they are not much different from the word embeddings we just saw. After all, the only thing we care about is having a higher-dimensional vector representation of our grouped data; this may be words, days of the week, countries, and many other things you can imagine. This transition from word embeddings to meta-data (categorical, in our case) embeddings enabled Yoshua Bengio et al. to win a Kaggle competition (https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i) back in 2015 with a single, simple, automated approach, which is not usually the case for winning solutions.

“To deal with the discrete meta-data, consisting of client ID, taxi ID, date and time information, we learn embeddings jointly with the model for each of these information. This is inspired by neural language modeling approaches [2], in which each word is mapped to a vector space of fixed size (the vector is called the embedding of the word).”[3]

Step by step, we will explore how to learn these features within a neural network. Define a fully connected neural network and separate the numerical and categorical variables.

For each categorical variable:

1. Initialize a random embedding matrix of size m x D.

m: number of unique levels of the categorical variable (Monday, Tuesday, …)

D: desired dimension of the representation, a hyperparameter that can be between 1 and m-1 (with D = 1 you effectively get a label encoding; with D = m and an identity matrix you would recover one-hot encoding)

2. For each forward pass through the neural network, do a lookup of the given level (e.g. Monday for "dow") in the embedding matrix, which gives us a 1 x D vector.

3. Append this 1 x D vector to our input vector (the numerical features). Think of this process as augmenting the input matrix: for each row, we look up and append an embedding vector for every categorical variable being embedded.

4. While doing backpropagation, we also update these embedding vectors in the direction of the gradient that minimizes our loss function.

Usually the inputs themselves are not updated, but for embedding matrices we have a special case where we allow the gradient to flow all the way back to these mapped features and hence optimize them.

We can think of this as a process that allows our categorical variables to be represented better and better at every iteration.
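Putting the four steps together, here is a bare-bones sketch in PyTorch; the single categorical column, its sizes, and all the names are my own illustrative assumptions, not code from any particular library:

import torch
import torch.nn as nn

m, D, n_num = 7, 4, 10                      # 7 levels, 4-dim embedding, 10 numerical features
emb = nn.Embedding(m, D)                    # step 1: random m x D embedding matrix
fc = nn.Sequential(nn.Linear(D + n_num, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(list(emb.parameters()) + list(fc.parameters()), lr=0.01)

cat_idx = torch.randint(0, m, (64,))        # a batch of 64 rows: the level of each row as an integer
x_num = torch.randn(64, n_num)              # the numerical features of those rows
y = torch.randn(64, 1)

emb_vecs = emb(cat_idx)                     # step 2: lookup, gives a 64 x D matrix
x = torch.cat([x_num, emb_vecs], dim=1)     # step 3: append embeddings to the numerical input
loss = nn.MSELoss()(fc(x), y)
loss.backward()                             # step 4: gradients flow back into the embedding matrix
opt.step()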

Note: a rule of thumb is to drop categorical variables with very high cardinality. If a variable has a unique level for, say, 90% of the observations, it won't be a very predictive variable and we may very well get rid of it.
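If you want to check this programmatically, a small helper like the one below would do (a sketch; the 90% threshold and the column list are assumptions you should adapt):

import pandas as pd

def drop_high_cardinality(df: pd.DataFrame, cat_cols, threshold=0.9):
    # drop categorical columns whose levels are unique for almost every row (e.g. IDs)
    too_unique = [c for c in cat_cols if df[c].nunique() / len(df) > threshold]
    return df.drop(columns=too_unique)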

We could very well implement the above-mentioned architecture in our favorite framework (preferably a dynamic one) by doing the lookup ourselves and setting requires_grad=True on our embedding vectors so that they are learned. But all of these steps, and more, are already done for us in Fast.ai. Beyond making structured deep learning easy, this library also provides many state-of-the-art features such as differential learning rates, SGDR, cyclical learning rates, and a learning rate finder. These are all things we would like to take advantage of. You can read further about these topics in these very cool blog posts:

https://medium.com/@bushaev/improving-the-way-we-work-with-learning-rate-5e99554f163b

https://medium.com/@surmenok/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

https://medium.com/@markkhoffmann/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e

teleported.in (“Learning rate (LR) is one of the most important hyperparameters to be tuned and holds key to faster and effective…”)

In this part we will take a look at how we can skip implementing all of the steps mentioned above by hand and build a neural network that works well with structured data.

For this purpose we will take a look at an active Kaggle competition, https://www.kaggle.com/c/mercari-price-suggestion-challenge/. In this challenge we are trying to predict the price of an item sold by an online seller. This is a very suitable example for entity embeddings, since the data is mostly categorical, with relatively high (but not extreme) cardinality, and there isn't much else.

Data:

~1.4 M rows

  • item_condition_id: condition of the item (cardinality: 5)
  • category_name: category name (cardinality: 1287)
  • brand_name: name of the brand (cardinality: 4809)
  • shipping: whether shipping is included in price (cardinality: 2)

Important note: I won't use a validation set in this example since I've already found my best model parameters, but you should always do hyperparameter tuning with a validation set.

Step 1:

Fill missing values with their own level, since missingness itself is important information.

train.category_name = train.category_name.fillna('missing').astype('category')
train.brand_name = train.brand_name.fillna('missing').astype('category')
train.item_condition_id = train.item_condition_id.astype('category')
test.category_name = test.category_name.fillna('missing').astype('category')
test.brand_name = test.brand_name.fillna('missing').astype('category')
test.item_condition_id = test.item_condition_id.astype('category')

Step 2:

Preprocess the data and scale the numerical columns, since neural networks like normalized, i.e. equally scaled, data. If you don't scale your data, one feature may be over-emphasized by the network, since it's all about dot products and gradients. It would be better if we scaled both training and test data by the training statistics, but this shouldn't matter much here. Think of dividing each pixel by 255; same logic.
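If you did want to scale explicitly by training statistics (instead of relying on what proc_df does below), a sketch with scikit-learn would look like this; num_cols is a placeholder for whatever numerical columns your data has:

from sklearn.preprocessing import StandardScaler

def scale_by_train_stats(train_df, test_df, num_cols):
    # fit the scaler on the training data only, then apply it to both sets
    scaler = StandardScaler().fit(train_df[num_cols])
    train_df[num_cols] = scaler.transform(train_df[num_cols])
    test_df[num_cols] = scaler.transform(test_df[num_cols])
    return train_df, test_df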

I've combined the training and test data since we want the same levels to get the same encodings.
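For completeness, combined and the list of categorical columns used below could be built roughly like this; the placeholder target for the test set and the variable names are my assumptions, not code from the original notebook:

import pandas as pd

cats = ['item_condition_id', 'category_name', 'brand_name', 'shipping']
test['price'] = 0                                # placeholder target so both frames have the same columns
combined = pd.concat([train, test], sort=False)  # same categorical levels end up with the same encodings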

combined_x, combined_y, nas, _ = proc_df(combined, 'price', do_scale=True)

Step 3:

Create the model data object. path is where Fast.ai stores models and activations.

path = '../data/'
md = ColumnarModelData.from_data_frame(path, test_idx, combined_x, combined_y, cat_flds=cats, bs=128)

Step 4:

Decide D (the dimension of the embedding). cat_sz is a list of (col_name, cardinality + 1) tuples, one for each categorical column.

# We said that D (the dimension of the embedding) is a hyperparameter
# Here is Jeremy Howard's rule of thumb
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
# [(6, 3), (1312, 50), (5291, 50), (3, 2)]
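cat_sz itself isn't shown above; one way it could have been built (an assumption that mirrors what the comment describes) is:

# (column name, cardinality + 1) for every categorical column;
# the +1 leaves room for a missing/unknown level
cat_sz = [(c, combined[c].nunique() + 1) for c in cats]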

Step 5:

Create a learner; this is the core object of the Fast.ai library.

# params: embedding sizes, number of numerical cols, embedding dropout, output size, layer sizes, layer dropouts
m = md.get_learner(emb_szs, len(combined_x.columns) - len(cats),
                   0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)
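y_range isn't defined above either; it simply bounds the network's output. A sketch of how it might be set, assuming the target is the (log) price and following the usual Fast.ai pattern:

import numpy as np

max_y = np.max(combined_y)      # assumption: combined_y holds the (log) prices
y_range = (0, max_y * 1.2)      # let predictions go a bit above the largest value seen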

Step 6:

This part is explained in much more detail in the other posts I mentioned above.

Take advantage of full Fast.ai joy.

We pick our learning rate from a point just before the loss starts increasing…

# find best lr
m.lr_find()
# plot learning rate vs. loss
m.sched.plot()

Fit

We can see what we get with just 3 epochs:
lr = 0.0001
m.fit(lr, 3, metrics=[lrmse])

Fit more

m.fit(lr, 3, metrics=[lrmse], cycle_len=1)

And some more…

m.fit(lr, 2, metrics=[lrmse], cycle_len=1)

So, these brief but effective steps can take you to the top ~10% within minutes, with nothing more needed. If you really want to aim high, I would suggest exploiting the item_description column and turning it into multiple categorical variables. Then leave the job to entity embeddings, and of course don't forget to stack and ensemble :)

This was my very first blog post I hope you enjoyed it! I must admit this thing is kind of addictive, so I might come back very soon…

I appreciate all the claps :)

Bio: I am a graduate student in USF's Master's in Analytics program. I've been applying machine learning for 3 years and am currently practicing deep learning with Fast.ai.

Linkedin: https://www.linkedin.com/in/kerem-turgutlu-12906b65/en

References:

[1] Cheng Guo, Felix Berkhahn (2016). Entity Embeddings of Categorical Variables. Retrieved from https://arxiv.org/abs/1604.06737.

[2] TensorFlow Tutorials: https://www.tensorflow.org/tutorials/word2vec

[3] Yoshua Bengio, et al. Artificial Neural Networks Applied to Taxi Destination Prediction. Retrieved from https://arxiv.org/pdf/1508.00021.pdf.