In the words of Hannah Montana, everybody makes mistakes. And, respectfully, Hannah Montana never even tried to use machine learning (ML) methods for time-series forecasting. If she did, she would have made many more mistakes.
Machine learning forecasting models are flexible, easy to implement, and generally provide better results than simple statistical models. However, because they are easy to implement, they are also easy to implement poorly. Here are three mistakes to avoid when using ML models for time-series forecasting.
1. Ignoring Data Quality
The adage “garbage in, garbage out” may be cliche, but as David Foster Wallace wrote: Cliches earned their status as cliches because they’re so obviously true. Even so, it is easy for a data scientist to overlook the most obvious input to their forecasting algorithm – the data. If there are issues with the data – whether unexplainable outliers, missing data points, or flaws in the method of data capture – and these issues remain uncorrected, then any machine learning algorithm is going to produce poor results.
A pertinent example of this can be found in retail transactional data from the coronavirus period. The combination of labor and supply shortages, warehouse and store closures, and unusual demand patterns caused strange, spiky point-of-sale and shipping data in supply chains during this period. Many machine learning and statistical algorithms would try to discover a trend where one does not exist, and incorrectly predict that demand for toilet paper will go to the moon and stay there, or that nobody will ever purchase dress pants again, even though the change in demand was prompted by an anomalous event. Even simple corrections or smoothing of the data from the coronavirus period can provide a more stable series for the algorithm to learn from, leading to better predictions in the future window.
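One simple correction of this kind can be sketched in a few lines of Python. This is only an illustration, not a recommended production method: the function name clip_anomalous_window, the threshold k, and the sample numbers are all made up for the example. It assumes the forecaster already knows which observations fall inside the anomalous period, and it clips those values to a band built from the median and the median absolute deviation (MAD) of the earlier history.

```python
from statistics import median


def clip_anomalous_window(sales, start, end, k=3.0):
    """Clip sales[start:end] to median +/- k * MAD of the history before `start`.

    `start`, `end`, and `k` are hypothetical parameters chosen by the forecaster.
    """
    history = sales[:start]
    if not history:
        raise ValueError("need at least one observation before the window")
    center = median(history)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(x - center) for x in history)
    low, high = center - k * mad, center + k * mad
    cleaned = list(sales)  # leave the caller's data untouched
    for i in range(start, min(end, len(sales))):
        cleaned[i] = min(max(sales[i], low), high)
    return cleaned


# Invented monthly unit sales with a spike and a crash at positions 8 and 9.
monthly_units = [100, 104, 98, 102, 101, 99, 103, 100, 420, 15, 97, 102]
cleaned = clip_anomalous_window(monthly_units, start=8, end=10)
print(cleaned)
```

Only the two flagged observations change; everything outside the window is passed through as-is. Whether clipping, interpolation, or some other correction is appropriate depends on the series.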
2. Thoughtlessly Engineering Features
The popular metaphor for machine learning forecasting algorithms is that of the “black box,” where data is funneled into an unknowable, unobservable algorithm, the black box contemplates, the black box learns, and the black box predicts. This is, of course, deeply unsatisfactory.
A more appropriate metaphor for an ML algorithm is a patron at a restaurant. While the chef could throw all the raw ingredients at the patron and say, “Here is your deconstructed entree; now, give me my Michelin star,” it is much more effective if the chef actually cooks the raw ingredients, meticulously assembles the cooked ingredients, and plates a spectacularly balanced entree. This allows the patron to pick out the superstar ingredients in the main dish, while also appreciating the complementary flavors and textures provided by the sides.
In ML for time-series prediction, making a Michelin-caliber meal means engineering features from factors both exogenous (e.g., promotional events, holidays) and endogenous (e.g., lags, seasonal components). It is not enough to simply hand an ML model historical sales; the forecaster needs to define variables for the model – this was the quantity of Item A’s sales 12 months ago, there was a promotion on Item A in Month 3, there was a promotion on an item similar to Item A in Month 5, etc.
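As a rough sketch of what defining those variables might look like, the Python below turns a plain list of monthly sales into one row of features per month. Every name here (build_feature_rows, the dictionary keys, the promotion month sets) is invented for illustration, the months are zero-based list positions, and the month-of-year feature assumes the series starts in January.

```python
def build_feature_rows(sales, promo_months, similar_promo_months, lag=12, window=3):
    """Turn a monthly sales list into one feature dict per forecastable month."""
    rows = []
    # Skip the first months, which have no lagged value or full trailing window.
    for t in range(max(lag, window), len(sales)):
        recent = sales[t - window:t]  # the `window` months just before month t
        rows.append({
            "month_index": t,
            # Endogenous features, derived from the series itself.
            f"lag_{lag}": sales[t - lag],
            f"mean_last_{window}": sum(recent) / window,
            "month_of_year": t % 12 + 1,  # assumes position 0 is January
            # Exogenous features, supplied by the forecaster.
            "promo": int(t in promo_months),
            "similar_item_promo": int(t in similar_promo_months),
            "target": sales[t],
        })
    return rows


# Invented data: 26 months of steadily rising sales for a hypothetical Item A.
sales = list(range(100, 126))
rows = build_feature_rows(sales, promo_months={14}, similar_promo_months={16})
print(rows[0])
```

Each row can then be handed to whatever model the forecaster prefers, with "target" as the value to predict.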
Machine learning forecasting models are a bit like humans – when they have access to better information, they tend to make better decisions. Instead of simply generating a binary flag for when an item has a promotion, one could do further analysis to determine whether the item’s sales have historically been promotion-sensitive, and then bin items into multiple categories – Promotion Sensitive, Somewhat Promotion Sensitive, or Not Promotion Sensitive. Similarly, if the promotion is a deep discount on a normally expensive item, one would expect its sales to get a significantly greater boost than if an equivalent percentage discount were applied to an inexpensive item. Instead of expecting an ML model to learn these relationships on its own, representing this information explicitly gives the model greater clarity when determining what information is important. If the forecaster generates many interesting and novel features regarding promotions, the ML model can determine which of these features are most relevant to predicting sales. However, if the ML model only has access to an “Item Promoted” flag and past sales data, then it will try to apply a uniform promotional lift across all items and promotional events, resulting in forecasts that make little sense.
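Two of these richer promotion features can be sketched as follows. This is a hedged illustration rather than a prescribed method: the lift ratio, the thresholds of 1.2 and 2.0, and the function names are assumptions made up for the example, and a real analysis would want to account for seasonality and trend before comparing promoted and non-promoted months.

```python
from statistics import mean


def promo_sensitivity(sales, promo_flags, low=1.2, high=2.0):
    """Bin an item by the ratio of average promoted to average non-promoted sales.

    The `low` and `high` cutoffs are arbitrary illustrative thresholds.
    """
    promoted = [s for s, p in zip(sales, promo_flags) if p]
    baseline = [s for s, p in zip(sales, promo_flags) if not p]
    if not promoted or not baseline or mean(baseline) == 0:
        return "unknown"  # not enough history to judge
    lift = mean(promoted) / mean(baseline)
    if lift >= high:
        return "promotion_sensitive"
    if lift >= low:
        return "somewhat_promotion_sensitive"
    return "not_promotion_sensitive"


def discount_amount(list_price, discount_pct):
    """Express a discount in currency, so 25% off a $400 item differs from 25% off a $4 item."""
    return list_price * discount_pct


# Invented histories for two hypothetical items.
strong = promo_sensitivity([80, 82, 500, 78, 490, 80], [0, 0, 1, 0, 1, 0])
weak = promo_sensitivity([80, 82, 85, 78], [0, 0, 1, 0])
print(strong, weak, discount_amount(400.0, 0.25), discount_amount(4.0, 0.25))
```

Both outputs can be added as columns alongside the plain promotion flag, leaving the model to decide which of them actually helps.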
3. Training with Class Imbalances
Consider a world in which the only automobiles are red sedans. Assume these are self-driving cars, and that the traffic infrastructure – roads, lights, sensors, etc. – is designed to function for red sedans, as this is the only automobile that exists. Suddenly, a blue truck appears, as if from the ether. The infrastructure does not know how to handle the blue truck, and chaos ensues.
Machine learning forecasting models are like this traffic infrastructure. If they only see red sedans, they only know how to learn patterns that exist within red sedans. And even if they were able to see a few blue trucks, they would still mostly learn what it’s like to be a red sedan. This is an example of one of the most common mistakes forecasters make with machine learning – training with class imbalances.
In order to learn the effect of a certain event (e.g., a promotion), the ML model needs to have a sufficient number of observations that have a promotion. Not only that, there needs to be a roughly equal number of observations that do not have a promotion. These act like a control group, allowing the ML model to differentiate between sales of promoted items and sales of non-promoted items. If this condition is not met, the ML model will largely ignore whether an item was on promotion and will instead predict using the features of the time series itself – last year’s value, the moving average of the last six months of sales, etc. Class imbalances during training often produce forecasts which fail a simple test of forecast value-add: could a fifth grader produce a better forecast?
Consider an item which reacts strongly and consistently to promotion, selling around 500 units every month it’s promoted, and about 80 units every month it’s not promoted. If these promotions are run inconsistently (e.g., during Months 3 & 4 one year; and Months 5 & 7 the next year), and every data point is used during training, the ML model will not recognize the exact spike in sales one expects during a promotion. It will instead predict a modest increase in promotion months and become very good at learning about non-promotional observations, since there are ten months of non-promotional data for every two months of promotional data.
How the forecaster corrects this imbalance in practice is application-specific, but it is often as simple as having two models: one which is used only to predict non-promotional observations (and is trained with every non-promotional observation available), and another which is used only to predict observations which have a promotion (and is trained on every promotional observation available, along with a roughly equal number of non-promotional observations).
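The two-model setup can be sketched by building the two training sets separately. The code below is a minimal illustration under stated assumptions: the row format, the function name split_training_sets, and the choice to pick the non-promotional controls by simple random sampling are all invented for the example, and fitting the models themselves is left out. The data mirrors the item above, with 500 units in promoted months and 80 otherwise.

```python
import random


def split_training_sets(rows, seed=0):
    """Build one training set per model from rows that carry a 0/1 'promo' key."""
    promo = [r for r in rows if r["promo"]]
    non_promo = [r for r in rows if not r["promo"]]
    # Pair the promotional rows with an equal number of randomly chosen
    # non-promotional rows, which serve as the control group.
    rng = random.Random(seed)  # seeded so the sample is reproducible
    controls = rng.sample(non_promo, min(len(promo), len(non_promo)))
    return {
        "non_promo_model": non_promo,      # every non-promotional observation
        "promo_model": promo + controls,   # balanced between the two classes
    }


# Two years of monthly data; zero-based positions 2, 3 are Months 3 & 4 of
# year one, and positions 16, 18 are Months 5 & 7 of year two.
promo_positions = {2, 3, 16, 18}
rows = [
    {"month_index": t, "promo": int(t in promo_positions),
     "units": 500 if t in promo_positions else 80}
    for t in range(24)
]
sets = split_training_sets(rows)
print(len(sets["non_promo_model"]), len(sets["promo_model"]))
```

At prediction time, each future month would be routed to the model matching its planned promotion status.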
One may notice that these three common mistakes are made before any machine learning forecasting models are actually run. This is because it’s impossible to make mistakes when optimizing, selecting, and deploying an ML model. (I’m an SME in wishful thinking!)
Unfortunately, one can make many more mistakes in the machine learning forecasting pipeline. Future entries will concern mistakes made in this critical second stage of machine learning.