By Rose Martin, Dataquest.io
Predictive models are extremely useful for forecasting future outcomes and estimating metrics that are impractical to measure. For example, __data science workflow might look like. A lot of the time, we’ll start with a question we want to answer, and do something like the following:
- Collect some data relevant to the problem (more is almost always better).
- Clean, augment, and preprocess the data into a convenient form, if needed.
- Conduct an exploratory analysis of the data to get a better sense of it.
- Using what you find as a guide, construct a model of some aspect of the data.
- Use the model to answer the question you started with, and validate your results.
Linear regression is one of the simplest and most common supervised machine learning algorithms that data scientists use for predictive modeling. In this post, we’ll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure.
We'll use R in this blog post to explore this data set and learn the basics of linear regression. If you're unfamiliar with R, we recommend our R Fundamentals and R Programming: Intermediate courses from our R data Analyst path. It will also help to have some very basic statistics knowledge, but if you know what a mean and standard deviation are, you'll be able to follow along. If you want to practice building the models and visualizations yourself, we’ll be using the following R packages:
This package contains a wide variety of practice data sets. We'll be using one of them, "trees", to learn about building linear regression models.
We'll use this popular data visualization package to build plots of our models.
This package extends the functionality of
ggplot2. We'll be using it to create a plot matrix as part of our initial exploratory data visualization.
We'll use this package for visualizing more complex linear regression models with multiple predictors.
How do they measure tree volume, anyway?
The trees data set is included in base R’s
datasets package, and it’s going to help us answer this question. Since we’re working with an existing (clean) data set, steps 1 and 2 above are already done, so we can skip right to some preliminary exploratory analysis in step 3. What does this data set look like?
This data set consists of 31 observations of 3 numeric variables describing black cherry trees:
- The trunk girth (in)
- height (ft)
- volume (ft3)
These metrics are useful information for foresters and scientists who study the ecology of trees. It's fairly simple to measure tree heigh and girth using basic forestry tools, but measuring tree volume is a lot harder. If you don’t want to actually cut down and dismantle the tree, you have to resort to some technically challenging and time-consuming activities like climbing the tree and making precise measurements. It would be useful to be able to accurately predict tree volume from height and/or girth.
To decide whether we can make a predictive model, the first step is to see if there appears to be a relationship between our predictor and response variables (in this case girth, height, and volume). Let’s do some exploratory data visualization. We’ll use the
ggpairs()function from the
GGallypackage to create a plot matrix to see how the variables relate to one another.
ggpairs() function gives us scatter plots for each variable combination, as well as density plots for each variable and the strength of correlations between variables. If you’ve used
ggplot2 before, this notation may look familiar:
GGally is an extension of
ggplot2that provides a simple interface for creating some otherwise complicated figures like this one.
As we look at the plots, we can start getting a sense of the data and asking questions. The correlation coefficients provide information about how close the variables are to having a relationship; the closer the correlation coefficient is to 1, the stronger the relationship is. The scatter plots let us visualize the relationships between pairs of variables. Scatter plots where points have a clear visual pattern (as opposed to looking like a shapeless cloud) indicate a stronger relationship.
Which predictor variables seem related to the response variable?
From looking at the
ggpairs() output, girth definitely seems to be related to volume: the correlation coefficient is close to 1, and the points seem to have a linear pattern. There may be a relationship between height and volume, but it appears to be a weaker one: the correlation coefficient is smaller, and the points in the scatter plot are more dispersed.
What is the shape of the relationship between the variables?
The relationship appears to be linear; from the scatter plot, we can see that the tree volume increases consistently as the tree girth increases.
Is the relationship strong, or is noise in the data swamping the signal?
The relationship between height and volume isn’t as clear, but the relationship between girth and volume seems strong.
Now that we have a decent overall grasp of the data, we can move on to step 4 and do some predictive modeling.
Forming a hypothesis
A hypothesis is an educated guess about what we think is going on with our data. In this case, let’s hypothesize that cherry tree girth and volume are related. Every hypothesis we form has an opposite: the “null hypothesis” (H). Here, our null hypothesis is that girth and volume aren’t related.
In statistics, the null hypothesis is the one we use our data to support or reject; we can’t ever say that we “prove” a hypothesis. We call the hypothesis that girth and volume are related our “alternative” hypothesis (Ha).
H : There is no relationship between girth and volume
Ha: There is some relationship between girth and volume
Our linear regression model is what we will use to test our hypothesis. If we find strong enough evidence to reject H, we can then use the model to predict cherry tree volume from girth.
Building blocks of a linear regression model
Linear regression describes the relationship between a response variable (or dependent variable) of interest and one or more predictor (or independent) variables. It helps us to separate the signal (what we can learn about the response variable from the predictor variable) from the noise (what we can’t learn about the response variable from the predictor variable). We’ll dig deeper into how the model does this as we move along.
Let’s dive right in and build a linear model relating tree volume to girth. R makes this straightforward with the base function
lm() function fits a line to our data that is as close as possible to all 31 of our observations. More specifically, it fits the line in such a way that the sum of the squared difference between the points and the line is minimized; this method is known as “minimizing least squares.”
Even when a linear regression model fits data very well, the fit isn’t perfect. The distances between our observations and their model-predicted value are called residuals.
Mathematically, can we write the equation for linear regression as:
Y ≈ β0 + β1X + ε
- The Y and X variables are the response and predictor variables from our data that we are relating to eachother
- β0 is the model coefficient that represents the model intercept, or where it crosses the y axis
- β1 is the model coefficient that represents the model slope, the number that gives information about the steepness of the line and its direction (positive or negative)
- ε is the error term that encompasses variability we cannot capture in the model (what X cannot tell us about Y)
In the case of our example:
Tree Volume ≈ Intercept + Slope(Tree Girth) + Error
lm() function estimates the intercept and slope coefficients for the linear model that it has fit to our data. With a model in hand, we can move on to step 5, bearing in mind that we still have some work to do to validate the idea that this model is actually an appropriate fit for the data.
Can we use this model to make predictions?
Whether we can use our model to make predictions will depend on:
- Whether we can reject the null hypothesis that there is no relationship between our variables.
- Whether the model is a good fit for our data.
Let’s call the output of our model using
summary(). The model output will provide us with the information we need to test our hypothesis and assess how well the model fits our data.
Let’s walk through the output to answer each of these questions.
Is the hypothesis supported?
Coefficients: Estimate and Std. Error:
- The intercept in our example is the expected tree volume if the value of girth was zero. Of course we cannot have a tree with negative volume, but more on that later.
- The slope in our example is the effect of tree girth on tree volume. We see that for each additional inch of girth, the tree volume increases by 5.0659 ft3.
- The coefficient standard errors tell us the average variation of the estimated coefficients from the actual average of our response variable.
- This is a test statistic that measures how many standard deviations the estimated coefficient is from zero.
- This number is the p-value, defined as the probability of observing any value equal or larger than t if H is true. The larger the t statistic, the smaller the p-value. Generally, we use 0.05 as the cutoff for significance; when p-values are smaller than 0.05, we reject H.
We can reject the null hypothesis in favor of believing there to be a relationship between tree width and volume.