Sponsored Post by Dataiku.com.
Getting Up Close and Personal with Algorithms
We hear the term “machine learning” a lot these days, usually in the context of predictive analysis and artificial intelligence. Machine learning is, more or less, a way for computers to learn things without being explicitly programmed. But how does that actually happen?
The answer is, in one word, algorithms. Algorithms are sets of rules that a computer can follow. Think about how you learned to do long division: you take the divisor and divide it into the first digits of the dividend, subtract the subtotal, and continue with the next digits until you’re left with a remainder. That’s an algorithm, and it’s exactly the sort of procedure we can program into a computer, which can perform these calculations much, much faster than we can.
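To make that concrete, here is a minimal Python sketch of the pencil-and-paper long-division procedure (the function name and sample numbers are just for illustration):

```python
def long_division(dividend: int, divisor: int):
    # Work through the dividend digit by digit, as on paper.
    quotient_digits = []
    remainder = 0
    for digit in str(dividend):
        remainder = remainder * 10 + int(digit)       # bring down the next digit
        quotient_digits.append(remainder // divisor)  # how many times it fits
        remainder = remainder % divisor               # subtract the subtotal
    quotient = int("".join(map(str, quotient_digits)))
    return quotient, remainder

print(long_division(1234, 7))  # (176, 2), because 7 * 176 + 2 == 1234
```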
We've put together a brief summary of the top algorithms used in predictive analysis. Read on for more detail on each of them.
What Does Machine Learning Look Like?
In machine learning, our goal is either prediction or clustering. Today, we’re going to focus on prediction (we’ll cover clustering in a future article). Prediction is a process where, from a set of input variables, we estimate the value of an output variable. For example, using a set of characteristics of a house, we can predict its sale price.

Prediction problems are divided into two main categories:
- Regression problems, where the variable to predict is numerical (e.g., the price of a house)
- Classification problems, where the variable to predict is a “Yes/No” answer (for example, predicting whether a certain piece of equipment will experience a mechanical failure), as sketched in the example below
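As a quick illustration of the two categories, here is a minimal scikit-learn sketch (all of the data and numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented toy data: each row is a set of input variables.
X = np.array([[50.0, 2], [80.0, 3], [120.0, 4], [200.0, 6]])

# Regression: the output variable is numerical (e.g., a house's sale price).
y_price = np.array([150.0, 220.0, 330.0, 540.0])
reg = LinearRegression().fit(X, y_price)
print(reg.predict([[100.0, 3]]))  # -> a number

# Classification: the output variable is a Yes/No answer (1 = "Yes").
y_class = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[100.0, 3]]))  # -> a class label, 0 or 1
```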
With this in mind, let’s check out the most prominent and common algorithms used in machine learning historically and today.
Our algorithms come in three groups: linear models, tree-based models, and neural networks.
Linear Model Approach
A linear model uses simple formulas to find the “best fit” line through a set of data points. One risk with these models is overfitting: matching the data on which the model has been trained so closely that it comes at the expense of the ability to generalize to previously unseen data. For this reason, linear regression (along with logistic regression, which we’ll get to in a second) is often “regularized” in machine learning, which means the model incurs certain penalties on the size of its coefficients to prevent overfit.
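As a sketch of what regularization looks like in practice, scikit-learn’s Ridge model is linear regression plus an L2 penalty on the coefficient sizes (the toy data is invented, and alpha controls the penalty strength):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Invented toy data: features and a numerical target.
X = np.array([[50.0, 2], [80.0, 3], [120.0, 4], [200.0, 6]])
y = np.array([150.0, 220.0, 330.0, 540.0])

# Ridge is plain linear regression plus an L2 penalty on the coefficients.
# Larger alpha = stronger penalty = simpler model, less prone to overfit.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)  # the learned "best fit" formula
print(model.predict([[100.0, 3]]))    # prediction for an unseen point
```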
Another drawback of linear models is that, since they’re so simple, they tend to have trouble predicting more complex behaviors when the input variables are not independent.
Logistic regression is simply the adaptation of linear regression to classification problems (discussed above). The drawbacks of logistic regression are the same as those of linear regression.
Because of its S shape, the logistic function squashes any input into a value between 0 and 1, which makes it very good for classification problems: it introduces a threshold effect.
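Here is a sketch of the logistic (sigmoid) function and scikit-learn’s LogisticRegression on invented one-feature data (note that scikit-learn regularizes logistic regression by default, via the C parameter):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # The logistic function: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-5), sigmoid(0), sigmoid(5))  # ~0.007, 0.5, ~0.993

# Invented toy data: one feature, binary target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(C=1.0).fit(X, y)  # C tunes the regularization
print(clf.predict_proba([[3.5]]))  # probabilities for class 0 and class 1
print(clf.predict([[3.5]]))        # thresholded into a Yes/No answer
```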
Tree-Based Model Approach
When you hear tree-based, think decision trees, i.e., a sequence of branching operations.
A decision tree is a graph that uses a branching method to show each possible outcome of a decision. For example, if you’re ordering a salad, you first decide the type of lettuce, then the toppings, then the dressing. We can represent all of the possible outcomes in a decision tree.
To train a decision tree, we take the training set (that is, the data set we use to train the model) and find which attribute best “splits” it with regard to the target. For example, in a fraud detection case, we could find that the attribute that best predicts the risk of fraud is the country. After this first split, we have two subsets that give the best prediction if we only know that first attribute. Then we iterate on the second-best attribute for each subset, resplitting each one, and continue until we have used enough attributes to satisfy our needs.
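Here is a minimal scikit-learn sketch of training a decision tree on invented fraud-style data (the feature names and numbers are made up; export_text prints the splits the tree learned):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [country_code, amount] -> fraud? (1 = yes)
X = np.array([[0, 20], [0, 500], [1, 30], [1, 700], [0, 50], [1, 900]])
y = np.array([0, 1, 0, 1, 0, 1])

# The tree finds the attribute (and split point) that best separates the
# target, then recursively splits each resulting subset.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["country", "amount"]))
```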
A random forest is the average of many decision trees, each of which is trained with a random sample of the data. Each single tree in the forest is weaker than a full decision tree, but by putting them all together, we get better overall performance thanks to diversity.
Random forest is a very popular algorithm in machine learning today. It is very easy to train, and it tends to perform quite well. Its downside is that it can be slow to output predictions relative to other algorithms, so you might not use it when you need lightning-fast predictions.
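A minimal sketch of a random forest in scikit-learn, reusing the invented fraud-style data from the decision tree example above (100 trees, each fit on a bootstrap sample of the rows and a random subset of the features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented toy data, as above.
X = np.array([[0, 20], [0, 500], [1, 30], [1, 700], [0, 50], [1, 900]])
y = np.array([0, 1, 0, 1, 0, 1])

# 100 shallow trees, each trained on a random sample of the data;
# the forest averages their votes.
forest = RandomForestClassifier(n_estimators=100, max_depth=2).fit(X, y)
print(forest.predict([[1, 400]]))        # the majority vote
print(forest.predict_proba([[1, 400]]))  # the averaged vote across trees
```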
Gradient boosting, like random forest, is also made from “weak” decision trees. The big difference is that in gradient boosting, the trees are trained one after another. Each subsequent tree is trained primarily on the data that was wrongly predicted by the previous trees. This allows gradient boosting to focus less on the easy-to-predict cases and more on the difficult ones.
Gradient boosting is also pretty fast to train and performs very well. However, small changes in the training data set can create radical changes in the model, so it may not produce the most explainable results.
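And a corresponding sketch with scikit-learn’s GradientBoostingClassifier, again on the invented toy data; here the trees are fit sequentially, each one correcting the ensemble’s remaining errors:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Invented toy data, as above.
X = np.array([[0, 20], [0, 500], [1, 30], [1, 700], [0, 50], [1, 900]])
y = np.array([0, 1, 0, 1, 0, 1])

# Trees are built one after another; each new tree focuses on the errors
# of the ensemble so far. learning_rate shrinks each tree's contribution.
boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=2).fit(X, y)
print(boost.predict([[1, 400]]))
```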
Neural Network Approach
Neural networks take their name from a biological phenomenon: interconnected neurons that exchange messages with each other. This idea has been adapted to the world of machine learning as the artificial neural network (ANN). Deep learning, which you’ve heard a lot about, is just several layers of neural networks stacked one after the other.
ANNs are a family of models that learn to perform cognitive tasks in a way loosely inspired by the human brain. No other family of algorithms handles extremely complex tasks, such as image recognition, as well as neural networks do. However, just like the human brain, they take a very long time to train and require a lot of power (just think about how much we eat to keep our brains working!).
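As a minimal sketch of an ANN, here is scikit-learn’s small multi-layer perceptron on the classic XOR pattern, which no single linear model can fit (the data and settings are just for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Invented toy data: XOR, a pattern no single linear model can fit.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 10)
y = np.array([0, 1, 1, 0] * 10)

# One hidden layer of 8 neurons; deep learning is this idea scaled up to
# many stacked layers and far more data.
net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=0).fit(X, y)
print(net.predict([[0.0, 1.0], [1.0, 1.0]]))  # class predictions
```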
We hope this has been helpful in shedding some light on machine learning. If you're interested in learning more, you can read about the machine learning features within Dataiku DSS, or even check out a very cool application of machine learning to predict crime rates in London.