**By Perceptive Analytics**

Imagine that you were to build a model to predict the mileage of a car from different attributes of the car. How will you do it?

The simplest approach would be to pick the one attribute that affects mileage the most (which attribute to choose can be a question for endless debate) and build a regression model on it to predict the mileage. But is this the right approach? No, because the mileage of a car depends on multiple factors and not just a single one. So, let’s go a step further and expand our model to make it more robust by including other attributes of the car.

In the second approach, we will identify various attributes of a car such as horsepower, capacity, engine type, engine variant, cylinders, etc. All of these will form the predictor variables (also known as independent variables) of our model, and the mileage will be the response variable (also known as the dependent variable).

What’s the difference between the first and second model?

In the second model, we have multiple factors or variables contributing to the final output variable. Intuitively, the accuracy of this model should be higher. Right?

The first model is called a simple linear regression model, while the second is called a multiple linear regression model. In the latter, the assumption is that multiple independent variables all affect the output variable directly. But what if one of the independent variables itself depends on other independent variables? For example, mileage depends on horsepower, capacity, engine type, engine variant and cylinders; but what if horsepower in turn depends on capacity, engine type and cylinders?

In such a scenario the model becomes more complex, and this is where path analysis comes in handy. Path analysis is an extension of multiple regression that allows for the analysis of more complicated models. It is helpful when there are multiple intermediate dependent variables, for example when Z depends on Y, which in turn depends on X. It can also compare different models to determine which one best fits the data.
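To make the Z-depends-on-Y-depends-on-X idea concrete, here is a minimal sketch in base R. The simulated data, the coefficients 0.8 and 0.5, and all object names are assumptions made for illustration only. It fits one regression per dependent variable and splits the effect of X on Z into a direct part and an indirect part that runs through Y.

```r
# Simulated chain X -> Y -> Z (names and coefficients are illustrative)
set.seed(42)
n <- 5000
X <- rnorm(n)
Y <- 0.8 * X + rnorm(n)   # Y depends on X
Z <- 0.5 * Y + rnorm(n)   # Z depends on Y only

# One regression per dependent variable
fit_y <- lm(Y ~ X)
fit_z <- lm(Z ~ Y + X)    # X is included to estimate its direct effect on Z

a_path   <- unname(coef(fit_y)["X"])  # X -> Y
b_path   <- unname(coef(fit_z)["Y"])  # Y -> Z
direct   <- unname(coef(fit_z)["X"])  # X -> Z, not through Y
indirect <- a_path * b_path           # X -> Y -> Z

# Total effect of X on Z from a single regression
total <- unname(coef(lm(Z ~ X))["X"])

round(c(direct = direct, indirect = indirect, total = total), 3)
```

With least-squares fits, the total effect equals the direct effect plus the indirect effect, which is the kind of decomposition a path model makes explicit.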

Path analysis was formerly known as ‘causal modeling’; however, after strong criticism, people now avoid the term because statistical techniques alone cannot establish causal relationships. Causal relationships can only be established through experimental designs. Path analysis can be used to disprove a model that suggests a causal relationship among variables; however, it cannot be used to prove that a causal relationship exists among them.

Let’s understand the terminology used in path analysis. We don’t refer to variables as independent or dependent here; rather, we call them exogenous or endogenous. Exogenous variables (the independent variables of the regression world) have arrows starting from them but none pointing towards them. Endogenous variables have at least one arrow pointing towards them. The reason for this nomenclature is that the factors that influence exogenous variables lie outside the system, while the factors that influence endogenous variables lie within it. In the above image, X is an exogenous variable, while Y and Z are endogenous variables. A typical path diagram is shown below.

In the above figure, A, B, C, D and E are exogenous variables; while, I and O are endogenous variables. ‘d’ is a disturbance term which is analogous to residuals in regression.
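The arrow-based definitions can be applied mechanically. The sketch below assumes a hypothetical edge list for the simple chain X → Y → Z mentioned earlier (the `edges` data frame is invented for illustration and is not the diagram from the figure). It classifies each variable by whether any arrow points at it.

```r
# Hypothetical edge list for the chain X -> Y -> Z
edges <- data.frame(
  from = c("X", "Y"),
  to   = c("Y", "Z"),
  stringsAsFactors = FALSE
)

all_vars   <- union(edges$from, edges$to)
endogenous <- unique(edges$to)               # at least one arrow points at them
exogenous  <- setdiff(all_vars, endogenous)  # no incoming arrows

exogenous
endogenous
```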

Now, let’s go through the assumptions that we need to consider before we use path analysis. Since path analysis is an extension of multiple regression, most of the assumptions of multiple regression hold true for path analysis as well.

- All relations among the variables should be linear.
- Endogenous variables should be continuous. For ordinal data, there should be at least five categories.
- There should be no interaction among variables. If there is an interaction, a separate term or variable can be added that reflects the interaction between the two variables.
- Disturbance terms are uncorrelated, i.e. the covariance among the disturbance terms is zero.
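As a small illustration of the interaction point, the following sketch adds an explicit product variable and checks whether it improves the fit. It uses the built-in mtcars data; treating hp and wt as the interacting pair, and the names `cars` and `hp_wt`, are assumptions chosen purely for the example.

```r
# mtcars ships with base R; the choice of hp and wt is an assumption
cars <- mtcars
cars$hp_wt <- cars$hp * cars$wt   # explicit interaction variable

fit_main <- lm(mpg ~ hp + wt, data = cars)
fit_int  <- lm(mpg ~ hp + wt + hp_wt, data = cars)

# F test: does the interaction variable improve on the main-effects fit?
anova(fit_main, fit_int)
```

Because the interaction is stored as an ordinary column, it can be referred to by name like any other variable when a model is written out.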

Now, let’s move a step ahead and understand the implementation of path analysis in R. We will first try it out on a toy example and then use a standard dataset available in R.

```r
install.packages("lavaan")
install.packages("OpenMx")
install.packages("semPlot")
install.packages("GGally")
install.packages("corrplot")

library(lavaan)
library(semPlot)
library(OpenMx)
library(GGally)
library(corrplot)
```

Now, let’s create our own dataset and try out path analysis. Please note that the rationale for doing this exercise is to develop intuition to understand path analysis.

```r
# Let's create our own dataset and play around that first
set.seed(11)
a = 0.5
b = 5
c = 7
d = 2.5
x1 = rnorm(20, mean = 0, sd = 1)
x2 = rnorm(20, mean = 0, sd = 1)
x3 = runif(20, min = 2, max = 5)
Y = a*x1 + b*x2
Z = c*x3 + d*Y
data1 = cbind(x1, x2, x3, Y, Z)
head(data1, n = 10)
```

```
> head(data1, n = 10)
               x1          x2       x3           Y         Z
 [1,] -0.59103110 -0.68251762 2.152597 -3.70810366  5.797922
 [2,]  0.02659437 -0.01585819 3.488896 -0.06599378 24.257289
 [3,] -1.51655310 -0.44260479 3.524391 -2.97130048 17.242488
 [4,] -1.36265335  0.35255750 2.707776  1.08146082 21.658085
 [5,]  1.17848916  0.07317058 4.441204  0.95509749 33.476170
 [6,] -0.93415132  0.00715880 3.257310 -0.43128166 21.722969
 [7,]  1.32360565 -0.18760011 2.574199 -0.27619773 17.328901
 [8,]  0.62491779 -0.76570065 3.946699 -3.51604433 18.836781
 [9,] -0.04572296 -0.22105682 4.439842 -1.12814558 28.258531
[10,] -1.00412058 -0.98358859 2.676505 -5.42000323  5.185524
```

Now that we have created this dataset, let’s look at the correlation matrix for these variables. This will tell us which variables are correlated with each other and how strongly.

```r
cor1 = cor(data1)
corrplot(cor1, method = 'square')
```

The above chart shows us that Y is very strongly correlated with x2, while Z is strongly correlated with x2 and Y. The impact of x1 on Y is not as strong as that of x2.

```r
semPaths(fit1, 'std', layout = 'circle')
```

The above plot shows us that Z is strongly dependent on Y and weakly dependent on x3 and x1. Y is strongly dependent on x2 and weakly dependent on x1. This matches the intuition we built from the correlation matrix earlier, and it illustrates how path analysis can be used to confirm a hypothesized structure.

The values shown on the lines are path coefficients. Path coefficients are standardized regression coefficients, similar to the beta coefficients of multiple regression. These path coefficients should be statistically significant, which can be checked from the summary output (we will see this in the next example).
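To see what "standardized regression coefficient" means on the toy data, here is a sketch that uses plain `lm()` instead of lavaan. It does not reproduce the model behind `fit1`, whose specification is not shown above. It assumes the two equations used to generate the data (Y from x1 and x2, Z from x3 and Y), and the names `toy`, `fit_Y`, `fit_Z`, `std_Y` and `std_Z` are illustrative.

```r
# Rebuild the toy data with the same seed and coefficients as above
set.seed(11)
x1 = rnorm(20, mean = 0, sd = 1)
x2 = rnorm(20, mean = 0, sd = 1)
x3 = runif(20, min = 2, max = 5)
Y = 0.5*x1 + 5*x2
Z = 7*x3 + 2.5*Y
toy = data.frame(x1, x2, x3, Y, Z)

# Assumed structural equations (the ones used to generate the data)
fit_Y = lm(Y ~ x1 + x2, data = toy)
fit_Z = lm(Z ~ x3 + Y, data = toy)

# Standardize every column and refit: the slopes are now
# standardized coefficients, comparable across predictors
toy_std = as.data.frame(scale(toy))
std_Y = coef(lm(Y ~ x1 + x2, data = toy_std))[-1]
std_Z = coef(lm(Z ~ x3 + Y, data = toy_std))[-1]

round(coef(fit_Y), 3)   # unstandardized slopes
round(std_Y, 3)         # standardized slopes for Y
round(std_Z, 3)         # standardized slopes for Z
```

Because the toy data contain no noise term, the unstandardized fits recover the generating coefficients exactly. Each standardized slope is the raw slope rescaled by the ratio of the predictor's standard deviation to the outcome's.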

Let’s move to our second example. In this example, we will use the standard dataset ‘mtcars’ available in R.

```r
# Let's take second example where we take standard dataset 'mtcars' available in R
data2 = mtcars
head(data2, n = 10)
```

```
> head(data2, n = 10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
```

```r
# Two regression equations, one per line: mpg and hp are endogenous
model2 = 'mpg ~ hp + gear + cyl + disp + carb + am + wt
          hp ~ cyl + disp + carb'
fit2 = cfa(model2, data = data2)
summary(fit2)
```

```
> summary(fit2)
lavaan (0.6-1) converged normally after  62 iterations

  Number of observations                            32

  Estimator                                         ML
  Model Fit Test Statistic                       7.901
  Degrees of freedom                                 3
  P-value (Chi-square)                           0.048

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  mpg ~
    hp               -0.022    0.016   -1.388    0.165
    gear              0.586    1.247    0.470    0.638
    cyl              -0.848    0.710   -1.194    0.232
    disp              0.006    0.012    0.512    0.609
    carb             -0.472    0.620   -0.761    0.446
    am                1.624    1.542    1.053    0.292
    wt               -2.671    1.267   -2.109    0.035
  hp ~
    cyl               7.717    6.554    1.177    0.239
    disp              0.233    0.087    2.666    0.008
    carb             20.273    3.405    5.954    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .mpg               5.011    1.253    4.000    0.000
   .hp              644.737  161.184    4.000    0.000
```

In the above summary output, we can see that wt is a significant variable for mpg at the 5 percent level, while disp and carb are significant variables for hp. hp itself is not a significant variable for mpg. We will examine this model with a path diagram drawn using the semPlot package.
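As a rough cross-check that needs no extra packages, the same two equations can be fitted one at a time with `lm()`. This is a sketch, not the article's method: the names `eq_mpg` and `eq_hp` are illustrative. For a model of this shape the least-squares point estimates are expected to be close to the lavaan estimates, while the standard errors and p-values will differ somewhat because `lm()` reports small-sample t tests rather than z tests.

```r
# One ordinary regression per endogenous variable (mtcars ships with base R)
eq_mpg <- lm(mpg ~ hp + gear + cyl + disp + carb + am + wt, data = mtcars)
eq_hp  <- lm(hp ~ cyl + disp + carb, data = mtcars)

# Coefficient tables: estimate, standard error, t value, p-value
round(coef(summary(eq_mpg)), 3)
round(coef(summary(eq_hp)), 3)
```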

```r
semPaths(fit2, 'std', 'est', curveAdjacent = TRUE, style = "lisrel")
```

The above plot shows that mpg is strongly dependent on wt, while hp is strongly dependent on disp and carb (semPaths abbreviates these labels to dsp and crb). There is only a weak relation between hp and mpg. The same inference was drawn from the summary output above.

The semPaths function can draw the above chart in many different ways. You can go through the documentation for semPaths and explore the available options.

There are a few considerations that you should keep in mind while doing path analysis. Path analysis is very sensitive to the omission or addition of variables in the model. Omitting a relevant variable or adding an extra one may significantly change the results. Also, path analysis is a technique for testing models, not for building them. If you were to use path analysis to build models, you could end up with an endless number of candidate models, and choosing the right one might not be possible. So, path analysis should be used to test a specific model or to compare a few candidate models and choose the best one.
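Comparing a small number of pre-specified models can be sketched in base R as follows. This is only loosely analogous to a model comparison in lavaan. The two candidate equations for hp and the names `hp_small`, `hp_big` and `cmp` are assumptions made for illustration: the first equation is the one used in the mtcars example, the second adds paths from gear, am and wt.

```r
# Two candidate equations for hp (mtcars ships with base R)
hp_small <- lm(hp ~ cyl + disp + carb, data = mtcars)
hp_big   <- lm(hp ~ cyl + disp + carb + gear + am + wt, data = mtcars)

cmp <- anova(hp_small, hp_big)   # F test on the 3 extra paths
cmp

AIC(hp_small, hp_big)            # lower AIC indicates the preferred model
```

The point is that both candidates are written down before looking at the results, so the comparison tests stated hypotheses instead of searching over every possible combination of paths.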

There are numerous other ways you can use path analysis. We would love to hear your experiences of using path analysis in different contexts. Please share your examples and experiences in the comments section below.

This article was contributed by **Perceptive Analytics**. Prudhvi Potluri, Chaitanya Sagar and Saneesh Veetil contributed to this article.

**Perceptive Analytics** provides data analytics, business intelligence and reporting services to the e-commerce, retail, healthcare and pharmaceutical industries. Its client roster includes Fortune 500 and NYSE-listed companies in the USA and India.
