There are two dominant camps as to how the quality of a forecast should be judged. The traditional camp—originating from and still favored by academia—looks only at the intrinsic qualities of a forecast, such as accuracy and bias. The second camp—rising more recently from industry—objects that academic metrics do not fully reflect the impact on the business. This camp promotes metrics such as Forecast Value Add (FVA) and more exotic correlations with downstream processes. We forecast not for the purpose of forecasting but to improve the business, and this camp aims to measure that improvement, such as the effect of the forecast on inventory levels or expediting costs.
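To make the second camp's approach concrete, here is a minimal sketch of Forecast Value Add, assuming its common definition: the error of a naive baseline minus the error of the forecast being judged. All numbers and the choice of MAPE as the error metric are hypothetical; positive FVA means the forecasting process adds value over doing nothing sophisticated.

```python
# Hypothetical illustration of Forecast Value Add (FVA).
# FVA = error of a naive baseline - error of the evaluated forecast.

def mape(actuals, forecasts):
    """Mean absolute percentage error."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

actuals        = [100, 105, 98, 110]
naive_forecast = [95, 100, 105, 98]    # e.g., last observed value carried forward
stat_forecast  = [102, 103, 100, 107]  # the forecast being judged

fva = mape(actuals, naive_forecast) - mape(actuals, stat_forecast)
print(f"FVA: {fva:+.1%}")  # positive: the forecast beats the naive baseline
```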
Here we make the case that while the concerns of both camps may be necessary, they are not sufficient in practice. This blog highlights two additional aspects of a forecast that are often ignored but are critical to a business's forecasting process: stability and desirability.
None of the metrics used by either school to judge the quality of a forecast considers change over time. If we generate a forecast this month, it may be accurate. If we generate another forecast next month, it may also be accurate. But how much did it change? Perhaps both had an error of around 20%, but one was positively biased and the other negatively biased. The change in forecasted quantities could then be as large as 40%. Figure 1 shows an example of two such forecasts with equal error but opposite bias.
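The arithmetic behind that 40% figure can be spelled out in a few lines. The numbers below are hypothetical, but they show how two forecasts that score identically on error can still differ from each other by twice that error:

```python
# Hypothetical illustration: two consecutive forecasts with equal error
# but opposite bias, and the resulting cycle-to-cycle change.
actual = 100.0

forecast_jan = actual * 1.20  # +20% bias -> 120.0
forecast_feb = actual * 0.80  # -20% bias -> 80.0

# Both forecasts have the same absolute percentage error against actuals:
ape_jan = abs(forecast_jan - actual) / actual  # 0.20
ape_feb = abs(forecast_feb - actual) / actual  # 0.20

# But the change between the two forecasts is twice that:
change = abs(forecast_feb - forecast_jan) / actual  # 0.40
print(f"Error each cycle: {ape_jan:.0%} / {ape_feb:.0%}; "
      f"change between cycles: {change:.0%}")
```

An accuracy metric computed per cycle is blind to this swing; only comparing the two cycles against each other reveals it.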
Forecast instability occurs when consecutive forecasts for the same period have dramatically different values. It is not uncommon for forecasts to change upwards of 10% when measured across entire years and entire portfolios from one forecast cycle to the next. Obviously, for individual time series, the changes are even greater. This is a significant contributing factor to the bullwhip effect that businesses experience.
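Unlike desirability (discussed next), instability is straightforward to quantify. Below is a minimal sketch of one possible instability metric; the function name and the data are hypothetical, and other definitions (e.g., relative to actuals, or weighted by volume) are equally valid:

```python
# A simple instability metric (one of many possible definitions):
# the mean absolute change between two consecutive forecast cycles
# for the same target periods, relative to the earlier cycle.

def forecast_instability(prev_cycle, curr_cycle):
    """Mean absolute relative change between two forecast vintages."""
    assert len(prev_cycle) == len(curr_cycle)
    changes = [abs(c - p) / p for p, c in zip(prev_cycle, curr_cycle) if p != 0]
    return sum(changes) / len(changes)

# Forecasts for the same four future months, generated one month apart:
march_vintage = [100, 110, 120, 115]
april_vintage = [95, 125, 100, 130]
print(f"Instability: {forecast_instability(march_vintage, april_vintage):.1%}")
```

Tracking such a number per item and per portfolio over successive cycles is what exposes the churn that per-cycle accuracy metrics hide.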
While stability can be measured objectively, a subjective aspect of forecast quality is its desirability. When measured at agreed-upon lags, a generated forecast may have all the right metric values but not satisfy the customer’s expectations. Figure 2 below shows a few examples.
In this case, the examples are clearly undesirable given the smooth historical pattern, even over a short time range. Most automated forecasts will not pick up on seasonality from such a brief history and could produce any of the output examples. While an expert may find this understandable—even justifiable—the customer is unlikely to agree. Automated forecasting is an area where the risk of overfitting (finding patterns where there are none) and the risk of underfitting (failing to find existing patterns) clash.
The examples included are anecdotal. In practice, the situation is typically not so clear. But the examples highlight that the quality of a forecast should not just be taken at face value. In business, each forecast is but one iteration in an endless cycle of iterations. What came before and what comes after matters, and ignoring that turns most quality assessments into academic exercises, even those that measure impact on the business.
Similarly, not all aspects of quality can easily be measured numerically. Often a visual inspection will expose issues hidden by the numbers. That is not to say such issues cannot be mitigated by better algorithms. There will always be cases where the customer expects more, and often they will be right. We humans perceive patterns even when they are pure coincidence. Forecast algorithms tend to err on the safe side. For desirability, maybe they should err on the daring side?