By Steve Poulin, CompassRed
Survival analysis refers to a class of statistical techniques that measure the effect of predictors on the time until an event, rather than the probability of an event occurring. As the name indicates, this technique has roots in the field of medical research for evaluating the effect of drugs or medical procedures on time until death. However, there are many less morbid applications of this technique, such as the following business analytics examples that I’ve observed during my 20 years as a data scientist:
- Time until product failure
- Time until a warranty claim
- Time until a process reaches a critical level
- Time from initial sales contact to a sale
- Time from employee hire to either termination or quit
- Time from a salesperson hire to their first sale
The most commonly used survival analysis techniques are Kaplan-Meier and Cox Regression. The Kaplan-Meier test is already widely used within the pharmaceutical industry for clinical drug trials, comparing the effects of drugs and their placebos on either time to recovery or time to death. In an article for The New Yorker, Malcolm Gladwell includes an interesting description of the critical role of Kaplan-Meier tests in the search for effective cancer treatments. This type of test determines whether there is a statistically significant difference between the survival times of two or more groups (the comparison itself is usually made with a log-rank test on the Kaplan-Meier survival curves). For clinical drug trials, a successful test typically means that the group taking the new drug has a shorter time to recovery, or a longer time until death, than the group taking a placebo, and it determines whether the trial can move to the next stage.
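To make the Kaplan-Meier idea concrete, here is a minimal sketch of the survival-curve calculation in plain Python. The function name `kaplan_meier` and the two small groups of durations are hypothetical, invented only for illustration, and the sketch covers the curve itself rather than the significance test that compares groups.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: how long each case was observed
    events:    1 if the event occurred at that time, 0 if the case is censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # every case leaves the risk set at its duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Multiply by the share of at-risk cases that did not have the event at t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve

# Hypothetical data: months observed, and whether the event was seen (1) or censored (0)
group_a = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
group_b = kaplan_meier([2, 4, 5, 6, 6], [1, 0, 1, 1, 0])

for label, curve in (("A", group_a), ("B", group_b)):
    for t, s in curve:
        print(f"group {label}: P(no event beyond month {t}) = {s:.2f}")
```

Censored cases never lower the curve directly, but they stay in the at-risk count until the time they drop out, which is how they contribute information.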
The other very popular survival analysis technique, and the newer of the two, was first described by Sir David Cox in an article published in 1972. It is formally known as the Cox Model, but is often referred to as Cox Regression. A measure of its popularity among academic researchers can be found in the Web of Science, a citation indexing service, which documents that over 31,000 published articles have cited the Cox Regression technique since 1980.
The advantage of Cox Regression over Kaplan-Meier is that it can accommodate any number of predictors, rather than group membership only. As with all regression techniques, analysis using Cox Regression offers two potential benefits: predictor ranking, with each predictor’s effect measured above and beyond the other predictors’ effects, and the ability to make predictions from the regression results. Predictor rankings enable the analyst to identify the factors that have the most influence on time to an event, and the regression results can be used to estimate the amount of time until an event for the specific profile of any subject.
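For readers who want to see what sits underneath a Cox Regression, the following is a minimal sketch that fits a single-predictor model by maximizing the partial likelihood with Newton's method. It is an illustration under simplifying assumptions (one predictor, no tied event times), not how any particular statistical package implements it, and the function name and the eight data points are hypothetical.

```python
import math

def fit_cox_one_predictor(durations, events, x, max_iter=25, tol=1e-10):
    """Estimate the coefficient of one predictor in a Cox model.

    Assumes no tied event times. Returns the coefficient (log hazard ratio).
    """
    beta = 0.0
    for _ in range(max_iter):
        grad = 0.0
        hess = 0.0
        for t_i, e_i, x_i in zip(durations, events, x):
            if not e_i:
                continue  # censored cases only appear inside other cases' risk sets
            # Risk set: every case still under observation at time t_i
            risk_x = [x_j for t_j, x_j in zip(durations, x) if t_j >= t_i]
            weights = [math.exp(beta * x_j) for x_j in risk_x]
            s0 = sum(weights)
            s1 = sum(x_j * w for x_j, w in zip(risk_x, weights))
            s2 = sum(x_j * x_j * w for x_j, w in zip(risk_x, weights))
            mean = s1 / s0
            grad += x_i - mean            # observed predictor minus its expected value
            hess -= s2 / s0 - mean ** 2   # minus the weighted variance in the risk set
        if hess == 0:
            break
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta

# Hypothetical data: x = 1 marks products built on an older production line
durations = [2, 3, 5, 7, 8, 10, 12, 15]
events    = [1, 1, 1, 1, 1, 0, 1, 0]
x         = [1, 1, 0, 1, 0, 0, 1, 0]

beta = fit_cox_one_predictor(durations, events, x)
print(f"coefficient = {beta:.3f}, hazard ratio = {math.exp(beta):.2f}")
```

With several predictors the same idea applies, but the gradient becomes a vector and the second derivative a matrix, which is one reason to rely on an established package in practice.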
The interpretation of Cox Regression results depends on whether the event of interest is positive (e.g. a sale) or negative (e.g. product failure), and whether a predictor tends to lengthen the time until an event or shorten it. For instance, in an analysis of the amount of time until a sale, a positive effect means that a one-unit increase in the predictor makes a quick sale more likely, and a negative effect means that a one-unit increase typically lengthens the amount of time until a sale (and increases the cost of the sales effort). For an analysis of the amount of time until product failure, a positive effect means that a one-unit increase in the predictor usually shortens the time until failure (most likely a bad thing, unless you want to sell more of the product!), and a negative effect means that a one-unit increase in the predictor, on average, results in longer product life.
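One way to read these effects is to convert each coefficient into a hazard ratio by exponentiating it: a ratio above 1 means the event tends to arrive sooner, and a ratio below 1 means it tends to arrive later. The short sketch below shows the arithmetic; the predictor names and coefficient values are made up for illustration and do not come from any real analysis.

```python
import math

def hazard_ratio(coefficient, unit_change=1.0):
    """Multiplicative change in the event rate for a given change in the predictor."""
    return math.exp(coefficient * unit_change)

# Hypothetical coefficients from a time-to-sale analysis
coefficients = {
    "discount_offered": 0.40,             # positive effect
    "has_competitor_contract": -0.25,     # negative effect
}

for name, coef in coefficients.items():
    hr = hazard_ratio(coef)
    direction = "shortens" if hr > 1 else "lengthens"
    print(f"{name}: hazard ratio {hr:.2f}, a one-unit increase {direction} the time until a sale")
```

Whether "shortens" is good news depends on the event: it is welcome for time until a sale and unwelcome for time until product failure.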
There are other analytical techniques that can predict time until an event, but survival analysis techniques have the unique advantage of including cases that have experienced the event and cases that have not. Although these latter cases do not have a date for the target event, they are an integral part of the analysis. In the language of survival analysis, they are known as censored cases.
Censored cases are an integral part of survival analysis because the target field is the amount of time a case is at risk of experiencing the event, whether the target event occurred or not. For censored cases, the date of the analysis is used in place of the date the target event occurred. This date represents how long a case was known to be at risk of experiencing the target event, even though the event did not occur. For some analyses, such as those of cancer risk, the beginning date of the period of risk is difficult to determine. However, in most business analyses this date is easy to determine, such as the date someone becomes a sales prospect or the date a product was manufactured.
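The preparation step described above can be sketched in a few lines of Python. The function name `time_at_risk`, the sample prospects, and the analysis date are all hypothetical; the point is only to show how each case ends up with a duration and a flag marking whether the event was observed.

```python
from datetime import date

def time_at_risk(start_date, event_date, analysis_date):
    """Return (days_at_risk, event_observed) for one case.

    event_date is None when the event has not happened by the analysis date,
    in which case the case is censored at the analysis date.
    """
    if event_date is not None and event_date <= analysis_date:
        return (event_date - start_date).days, 1
    return (analysis_date - start_date).days, 0

analysis_date = date(2024, 3, 31)  # hypothetical date the analysis is run

# Hypothetical sales prospects: (date they became a prospect, date of sale or None)
prospects = [
    (date(2024, 1, 1), date(2024, 1, 31)),   # sale observed
    (date(2024, 3, 1), None),                # no sale yet: censored
]

for start, sale in prospects:
    days, observed = time_at_risk(start, sale, analysis_date)
    status = "event" if observed else "censored"
    print(f"started {start}: {days} days at risk ({status})")
```

The resulting pairs of duration and event flag are the inputs that survival analysis techniques expect for each case.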
Most major proprietary statistical software packages (SPSS, SAS, etc.) and R, the open-source statistical software, include these survival analysis techniques. The use of the Cox Regression technique includes the usual regression caveats, such as the use of scale predictors and avoiding multicollinearity. Both the Cox Regression and the Kaplan-Meier tests represent predictive analytics techniques that can deliver business value in terms of increasing revenue and reducing costs.
Bio: Dr. Steve Poulin is the Principal Data Scientist & Manager of Predictive Analytics at CompassRed, a data analytics firm in Wilmington, DE. He has provided data analytics training and consultation for over 250 organizations since 1997.