By Michael Schmidt, PhD, Chief Scientist at DataRobot.
Most machine learning algorithms today are not time-aware and are not easily applied to time series and forecasting problems. Leveraging advanced algorithms like XGBoost, or even linear models, typically requires substantial data preparation and feature engineering – for example, creating lagged features, detrending the target, and detecting periodicity. The required preprocessing becomes more difficult in the common case where the problem requires predicting a window of multiple future time points. As a result, most practitioners fall back on classical methods, such as ARIMA or trend analysis, which are time-aware but less expressive. This article covers best practices for solving this challenge by introducing a general framework for developing time series models, generating features and preprocessing the data, and exploring the potential to automate this process in order to apply advanced machine learning algorithms to almost any time series problem.
Time series forecasting is one of the hardest problems in data science. It involves predicting future events and extrapolating how a potentially complex system evolves. In traditional machine learning problems, we often assume prediction data will resemble the training data. However, time series problems are intrinsically dynamic and evolving – particularly for the non-stationary signals we will discuss later. This amplifies the sensitivity to overfitting and can also make it challenging for some models to find predictive signals to begin with.
Most advanced machine learning algorithms that solve these challenges today (e.g., XGBoost) are not time-aware. They typically look at one row at a time when forming predictions. In order to use these methods for forecasting, we need to derive informative features, based on past and present data in time. This article introduces a framework for this feature engineering process and details how to build more powerful models for time series prediction.
It’s worth mentioning that there are several time series specialized algorithms like ARIMA, exponential smoothing, and various decomposition and trend methods. The framework and feature engineering we discuss below can benefit these algorithms, as well, by making it more practical to incorporate multi-variate data and richer features that also vary over time.
Time Series Framework
When building a time series model, we need to define how features should be created and how the model will be used. Below, we introduce a general time series framework to encode this information, which will also enable us to automate this process later on.
The Forecast Point defines an arbitrary point in time that a prediction is being made. The Feature Derivation Window (FDW) defines a rolling window, relative to the Forecast Point, which can be used to derive descriptive features. Finally, the Forecast Window (FW) defines the range of future values we wish to predict, called Forecast Distances (FDs).
The time series framework captures the business logic of how the model will be used. It encodes how much history is required to make new predictions (e.g., data reaching back at least 28 days), how recent the available data is (e.g., up to 7 days old), and which forecast distances are needed (e.g., 2 to 7 days). The forecast window also gives us an objective way to measure the total accuracy of the model for training: total error can be measured by averaging over all potential forecast points in the data and over the accuracy at each forecast distance in the window.
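As a concrete sketch, the example numbers above (an FDW reaching from 28 down to 7 days before the forecast point, and forecast distances of 2 to 7 days) can be turned into explicit date windows. The helper name and defaults below are invented for illustration:

```python
from datetime import date, timedelta

# Hypothetical helper: given a forecast point and the framework settings,
# return the Feature Derivation Window and Forecast Window as date ranges.
def framework_windows(forecast_point, fdw_start=28, fdw_end=7,
                      fd_min=2, fd_max=7):
    fdw = (forecast_point - timedelta(days=fdw_start),   # oldest usable history
           forecast_point - timedelta(days=fdw_end))     # most recent usable history
    fw = (forecast_point + timedelta(days=fd_min),       # first forecast distance
          forecast_point + timedelta(days=fd_max))       # last forecast distance
    return fdw, fw

fdw, fw = framework_windows(date(2018, 6, 30))
print(fdw)  # (datetime.date(2018, 6, 2), datetime.date(2018, 6, 23))
print(fw)   # (datetime.date(2018, 7, 2), datetime.date(2018, 7, 7))
```

Encoding the framework this explicitly makes it easy to validate that training data never uses history outside the FDW, and that every forecast distance in the FW has a labeled example.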
Time Series Features
Based on the time series framework defined above, we can generate a number of different time series features that can be useful to predict different forecast distances.
- Various lags inside the FDW
- Rolling mean, min, max, etc. statistics
- Bollinger bands and statistics
- Rolling entropy, or rolling majority, for categorical features
- Rolling text statistics for text features
These features can be derived from the target variable or any of its covariates.
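The lag and rolling-statistic features above can be sketched with pandas; the sample data, column names, and window sizes here are invented for the example:

```python
import pandas as pd

# Toy daily target series (invented values)
df = pd.DataFrame({"sales": [10, 12, 11, 15, 14, 18, 17, 20]})

# Lag features: the target's value some number of steps back
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)

# Rolling statistics over a trailing window; shifting first ensures the
# window only uses past values, never the row's own target
df["roll_mean_3"] = df["sales"].shift(1).rolling(3).mean()
df["roll_max_3"] = df["sales"].shift(1).rolling(3).max()

print(df.tail(3))
```

The `shift(1)` before `rolling(...)` is the key design choice: it keeps the feature derivation window strictly in the past relative to each row, which is what prevents target leakage.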
In order to transform the dataset into a form that an ordinary machine learning model can use, we need to generate examples for each forecast distance, with corresponding features derived from the relative feature derivation window.
For each forecast distance (FD):
1. Calculate a rolling statistic for each row based on the relative FDW rows
2. Generate examples for each row to predict the FD’s future value based on these features
3. Repeat for each candidate rolling feature
This produces a potentially larger dataset with examples for predicting each forecast distance and various new features. However, we can now train models that are not time-aware on individual rows to produce accurate forecasts. The forecast distance itself can be used as a feature, or we could instead build a separate model for each distance.
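The expansion step above can be sketched as follows; the function name and toy data are invented, and the idea is simply to pair each row's features with the target value FD steps ahead, once per forecast distance:

```python
import pandas as pd

# For each forecast distance, copy the feature rows and attach the
# target FD steps in the future; FD itself becomes a feature column.
def expand_forecast_distances(df, target, fds):
    frames = []
    for fd in fds:
        part = df.copy()
        part["fd"] = fd
        part["y"] = df[target].shift(-fd)  # future value at this distance
        frames.append(part)
    out = pd.concat(frames, ignore_index=True)
    return out.dropna(subset=["y"])        # drop rows with no future label

df = pd.DataFrame({"sales": [10, 12, 11, 15]})
df["lag_1"] = df["sales"].shift(1)
expanded = expand_forecast_distances(df, "sales", fds=[1, 2])
print(expanded)
```

In practice, rows with missing feature values would also be dropped or imputed; the sketch only drops rows that lack a future label.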
Time Series Treatments
In addition to creating new features, we can also consider ways to transform the target variable in order to maximize the predictive accuracy and stability of predictions. For example, we can use basic statistical tests to check whether the target is stationary or non-stationary, whether it is periodic, and whether it follows an exponential trend. Based on these tests, we might choose to transform the target using:
- Log-transformation (for exponential trends and multiplicative models)
- Periodic differencing (to make an integrated model with stationary targets)
- Simple naive difference (using the most recent FDW row)
At prediction time, these treatments can be reversed, resulting in more accurate predictions in the original scale.
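Two of these reversible treatments can be sketched in a few lines of numpy; the series and period below are invented for illustration:

```python
import numpy as np

# Toy exponential-looking target series (invented values)
y = np.array([100.0, 110.0, 125.0, 150.0, 180.0, 220.0])

# Log-transformation: fit the model on log(y), reverse with exp()
y_log = np.log(y)
y_back = np.exp(y_log)

# Periodic differencing with period 2: model y[t] - y[t - period],
# then reverse at prediction time by adding back the value one period earlier
period = 2
y_diff = y[period:] - y[:-period]
y_rec = y_diff + y[:-period]
```

Because both treatments are exactly invertible, the model can be trained on the better-behaved transformed series while predictions are still reported in the original scale.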
The time series framework, treatments, and candidate features provide a way to systematically transform an original dataset into a dataset that we can use to train arbitrary machine learning models for forecasting.
However, there are still some challenges remaining:
- How to select the best algorithm on these features?
- How to properly partition and validate time series models during training?
- How to scale to large datasets?
- How to handle multi-series (e.g., cross-sectional time series) data?
- How to select the best features among a large number of potential features?
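One common answer to the partitioning question above is rolling-origin backtesting, where each fold trains on all history up to a cutoff and validates on the block of time that follows. The helper below is a hypothetical sketch, not a reference to any specific library:

```python
# Rolling-origin (backtest) splits: each fold trains on rows before a
# cutoff and validates on the next `horizon` rows, so validation data
# is always strictly in the future relative to its training data.
def rolling_origin_splits(n_rows, n_folds, horizon):
    splits = []
    for k in range(n_folds, 0, -1):
        cut = n_rows - k * horizon
        train = list(range(cut))
        valid = list(range(cut, cut + horizon))
        splits.append((train, valid))
    return splits

splits = rolling_origin_splits(n_rows=10, n_folds=2, horizon=2)
# fold 1: train rows 0-5, validate rows 6-7
# fold 2: train rows 0-7, validate rows 8-9
```

Averaging validation error across folds (and across forecast distances within each fold) gives the objective accuracy measure described in the framework section.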
Fully automating this process is difficult because of the wide range of datasets and business constraints and the technical detail required for individual steps, features, and decisions. However, companies like DataRobot today are focusing on automating these best practices for businesses and individual users.
In this article, we covered the challenges involved in time series forecasting and how to apply advanced machine learning algorithms to these problems. We introduced a general time series framework for defining how the model is used, showed how to generate predictive history features and basic time series treatments, and finally explored the potential to automate this process. Time series modeling is a rich subject, with many other considerations that could be incorporated. The practices described here will help you get the maximum accuracy when predicting future events.
Bio: Michael Schmidt, PhD, is the Chief Scientist at DataRobot and has been featured in the Forbes list of the world’s top 7 data scientists. He has won awards for research in AI, with publications ranking in the 99th percentile of all tracked research. In 2012, Michael founded Nutonian and led the development of Eureqa, a machine learning application and service used by over 80,000 users globally (later acquired by DataRobot). In 2015, he was selected by MIT for the “35 Innovators Under 35” award. Michael has also appeared in several media outlets such as the New York Times, NPR’s RadioLab, the Science Channel, and Communications of the ACM. Most recently, his work focuses on automated machine learning, feature engineering, and advanced time series prediction.