- Tags:: 📚Books, ✒️SummarizedBooks, Time series, Forecasting
- Author:: Rob J Hyndman and George Athanasopoulos
- Liked:: 8
- Link:: Forecasting: Principles and Practice (3rd ed) (otexts.com)
- Source date:: 2021-05-01
- Finished date:: 2021-12-01
- Cover::
The forecaster's bible, referenced all over StackExchange, and available for free, which is no small thing.
It has roughly three parts: the first half (chapters 1 to 6) is model-agnostic material about the series themselves, the second is a set of methods (chapters 7 to 10), and the last one (chapters 11 to 13) covers "advanced problems".
Little ML; it sticks to classical methods. For ML, we have this: Statistical and Machine Learning forecasting methods: Concerns and ways forward (plos.org), and still pending: Business Forecasting: The Emerging Role of Artificial Intelligence and Machine Learning | Wiley… but it looks rather mediocre for ML.
I really missed a comparison of models… but apparently it is a complicated topic. I looked for it elsewhere: 📜 'Horses for Courses' in demand forecasting. The feeling is that you have to throw a bit of everything at a forecast and "see what sticks". Here the auto-ML offered by libraries like PyCaret does add value. Let's see if yours truly finally decides to implement multiple seasonality.
2. Time series graphics
Seasonal plots
2.3 Time series patterns
Many people confuse cyclic behaviour with seasonal behaviour, but they are really quite different. If the fluctuations are not of a fixed frequency then they are cyclic; if the frequency is unchanging and associated with some aspect of the calendar, then the pattern is seasonal.
3. Time series decomposition
3.1 Transformations and adjustments
The purpose of these adjustments and transformations is to simplify the patterns in the historical data by removing known sources of variation, or by making the pattern more consistent across the whole data set. Simpler patterns are usually easier to model and lead to more accurate forecasts.
3.2 Time series components
An alternative to using a multiplicative decomposition is to first transform the data until the variation in the series appears to be stable over time, then use an additive decomposition. When a log transformation has been used, this is equivalent to using a multiplicative decomposition because y_t = S_t × T_t × R_t is equivalent to log y_t = log S_t + log T_t + log R_t.
There, S_t is the seasonal component, T_t the trend-cycle, and R_t the remainder.
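A minimal sketch of that workflow in Python (statsmodels, not the book's R code; the series here is made up): log-transform, decompose additively, and exponentiate the components to read them multiplicatively.

```python
# Sketch (not from the book): log + additive decomposition instead of a
# multiplicative one. Assumes a strictly positive monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2015-01", periods=96, freq="MS")
y = pd.Series(
    np.linspace(100, 200, 96)                              # trend
    * (1 + 0.2 * np.sin(2 * np.pi * np.arange(96) / 12))   # seasonality
    * np.random.lognormal(0, 0.02, 96),                    # noise
    index=idx,
)

dec = seasonal_decompose(np.log(y), model="additive", period=12)
seasonal_factor = np.exp(dec.seasonal)   # multiplicative seasonal component
trend_level = np.exp(dec.trend)          # multiplicative trend-cycle
```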
3.3. Moving averages
- Centred moving averages: moving averages of moving averages.
The most common use of centred moving averages is for estimating the trend-cycle from seasonal data. Consider the 2×4-MA: when applied to quarterly data, each quarter of the year is given equal weight, as the first and last terms apply to the same quarter in consecutive years. Consequently, the seasonal variation will be averaged out and the resulting trend-cycle estimate will have little or no seasonal variation remaining. A similar effect would be obtained using a 2×8-MA or a 2×12-MA to quarterly data.
(That is, the smoothing effect we apply to obtain the trend and cycles)
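A minimal sketch (mine, not the book's R code) of the 2×4-MA as a single weighted moving average with weights [1/8, 1/4, 1/4, 1/4, 1/8]:

```python
# Sketch (not from the book): centred 2x4-MA for estimating the trend-cycle
# of quarterly data. Equivalent to a 4-MA followed by a 2-MA.
import numpy as np
import pandas as pd

def two_by_four_ma(y: pd.Series) -> pd.Series:
    weights = np.array([1, 2, 2, 2, 1]) / 8.0
    smoothed = np.convolve(y.to_numpy(), weights, mode="valid")
    # Two observations are lost at each end of the series.
    return pd.Series(smoothed, index=y.index[2:-2])
```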
3.4 Classical decomposition
While classical decomposition is still widely used, it is not recommended, as there are now several much better methods.
3.6. STL decomposition
The good one. PROs:
- Handles any seasonality period.
- The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
- The smoothness of the trend-cycle can also be controlled by the user.
- It can be robust to outliers.
CONs:
- Only handles additive components. For multiplicative, you need to take logs.
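A minimal STL sketch in Python (statsmodels; the book uses R's feasts). The parameter values are illustrative.

```python
# Sketch (not from the book): STL with a user-controlled seasonal window.
# Reuses the monthly series `y` from the decomposition sketch above;
# logs are taken because STL only handles additive components.
import numpy as np
from statsmodels.tsa.seasonal import STL

res = STL(
    np.log(y),
    period=12,    # monthly seasonality
    seasonal=13,  # seasonal smoother window (odd); larger = steadier seasonal component
    robust=True,  # downweight outliers
).fit()
trend, seasonal, remainder = res.trend, res.seasonal, res.resid
```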
4. Time series features
4.2 ACF features
PACF plots are particularly useful, because they remove previous autocorrelation effects so you see the new information added by each lag.
Also remember that these are linear correlations, so lag plots (which can reveal nonlinear relationships) are more informative.
4.3 STL features
From 3.6 STL decomposition:
STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships.
- Strength of trend: F_T = max(0, 1 − Var(R_t) / Var(T_t + R_t)), and similarly for seasonality with S_t in place of T_t.
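A quick way to compute both strengths (sketch, using the `res` object from the STL snippet above):

```python
# Sketch (not from the book): strength of trend / seasonality from STL components.
import numpy as np

f_trend = max(0.0, 1.0 - np.var(res.resid) / np.var(res.trend + res.resid))
f_seasonal = max(0.0, 1.0 - np.var(res.resid) / np.var(res.seasonal + res.resid))
```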
4.4 Other features
coef_hurst will calculate the Hurst coefficient of a time series which is a measure of “long memory.” A series with long memory will have significant autocorrelations for many lags.
feat_spectral will compute the (Shannon) spectral entropy of a time series, which is a measure of how easy the series is to forecast. A series which has strong trend and seasonality (and so is easy to forecast) will have entropy close to 0. A series that is very noisy (and so is difficult to forecast) will have entropy close to 1.
shift_kl_max finds the largest distributional shift (based on the Kullback-Leibler divergence) between two consecutive sliding windows of the time series. This is useful for finding sudden changes in the distribution of a time series.
You may discover stuff by plotting time series features in a pairwise fashion
5. The forecaster’s toolbox
5.3 Fitted values and residuals.
- Innovation residuals: residuals on the transformed scale (if there is no transformation, innovation residuals = residuals).
5.4. Residual diagnostics.
- A good forecasting model has:
- Uncorrelated innovation residuals.
- Zero mean (otherwise they are biased).
- Prediction intervals are easier to compute if the residuals also have:
- Constant variance: homoscedasticity
- Normally distributed.
5.5. Distributional forecasts and prediction intervals.
- If residuals are not normal, we can bootstrap them (this only assumes they are uncorrelated with constant variance): simulate many future sample paths using y_{T+1} = ŷ_{T+1|T} + e_{T+1}, where the unknown future error e_{T+1} is replaced by a sample from the past residuals (and the process is repeated for later horizons).
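A minimal sketch (mine) of the idea for the simplest case, a naive (random walk) forecast, where each simulated future path is built by cumulatively adding resampled one-step residuals:

```python
# Sketch (not from the book's code): bootstrapped prediction intervals for a
# naive forecast. `y` is a 1-D numpy array of observations.
import numpy as np

def naive_bootstrap_intervals(y, h=12, n_sims=2000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    residuals = np.diff(y)                       # one-step naive residuals y_t - y_{t-1}
    draws = rng.choice(residuals, size=(n_sims, h), replace=True)
    paths = y[-1] + np.cumsum(draws, axis=1)     # simulated future sample paths
    alpha = (1 - level) / 2
    return np.quantile(paths, alpha, axis=0), np.quantile(paths, 1 - alpha, axis=0)
```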
5.6. Forecasting using transformations.
- Bias adjustments
One issue with using mathematical transformations such as Box-Cox transformations is that the back-transformed point forecast will not be the mean of the forecast distribution. In fact, it will usually be the median of the forecast distribution (assuming that the distribution on the transformed space is symmetric). For many purposes, this is acceptable, although the mean is usually preferable. For example, you may wish to add up sales forecasts from various regions to form a forecast for the whole country. But medians do not add up, whereas means do.
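For reference, the bias-adjusted back-transformation for a Box-Cox transform (as I remember it from the book's §5.6): for λ = 0, exp(w)·[1 + σ_h²/2]; otherwise, (λw + 1)^(1/λ)·[1 + σ_h²(1−λ) / (2(λw + 1)²)], where w is the point forecast on the transformed scale and σ_h² the h-step forecast variance on that scale.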
5.8 Evaluating point forecast accuracy
A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Consequently, the RMSE is also widely used.
- From 8.2:
sometimes different accuracy measures will suggest different forecasting methods, and then a decision is required as to which forecasting method we prefer to use
Percentage errors, like the MAPE or sMAPE, make no sense for units with no meaningful zero (e.g., imagine a 100% error on a Celsius scale… what does it really mean?). The symmetric MAPE (sMAPE) appeared in order to avoid the heavier penalty on negative errors (forecast higher than reality) that the MAPE has: the MAPE has no upper limit for negative errors but cannot exceed 100% for positive errors (i.e., forecasting 0). The authors do not recommend using either.
But they do recommend scaled errors: MASE and RMSSE, which are simply the errors scaled by the in-sample errors of a (seasonal) naive forecast.
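A minimal sketch of both (mine, not the book's code); `m` is the seasonal period (m=1 gives plain naive scaling):

```python
# Sketch (not from the book): MASE and RMSSE, scaling test errors by the
# in-sample one-step (seasonal) naive errors computed on the training data.
import numpy as np

def mase(train, actual, forecast, m=1):
    scale = np.mean(np.abs(train[m:] - train[:-m]))     # naive in-sample MAE
    return np.mean(np.abs(actual - forecast)) / scale

def rmsse(train, actual, forecast, m=1):
    scale = np.mean((train[m:] - train[:-m]) ** 2)       # naive in-sample MSE
    return np.sqrt(np.mean((actual - forecast) ** 2) / scale)
```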
6. Judgmental forecasts
- Deals with the same material as 📖 Noise. A Flaw in Human Judgement
- Only make large adjustments:
Judgmental adjustments are most effective when there is significant additional information at hand or strong evidence of the need for an adjustment. We should only adjust when we have important extra information which is not incorporated in the statistical model. Hence, adjustments seem to be most accurate when they are large in size. Small adjustments (especially in the positive direction promoting the illusion of optimism) have been found to hinder accuracy, and should be avoided.
- And document them:
In particular, having to document and justify adjustments will make it more challenging to override the statistical forecasts, and will guard against unnecessary adjustments.
7. Time series regression models
7.1 The linear model.
- Assumptions we made (apart from the linear relationship):
- Errors have mean zero, they are not autocorrelated and they are unrelated to predictor variables.
- If the errors are normally distributed with constant variance, we can easily produce prediction intervals.
7.2. Least squares estimation.
- R² is the proportion of variation in the forecast variable explained by the regression model. Don't use it to select predictors! It can lead to overfitting (since it is measured on training data).
7.3. Evaluating the regression model.
- Residual plots (ACF, against predictors…) to verify the assumptions and thus, that there is no more info we can exploit.
- Careful with non-stationary time series (those without constant mean, variance, e.g., a trended series) as regressors: there may be no real relationship (time acts as a confounding factor).
7.4 Some useful predictors.
- Time, to capture trends
- Fourier series for seasonality (see the sketch after this list).
- Dummy variables
Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the “dummy variable trap”, because it will cause the regression to fail. There will be one too many parameters to estimate when an intercept is also included. The general rule is to use one fewer dummy variables than categories. So for quarterly data, use three dummy variables; for monthly data, use 11 dummy variables; and for daily data, use six dummy variables, and so on.
Note that this is multicollinearity: correlation between predictors (or linear combinations of predictors). This is better explained in 📖 Introductory Statistics and Analytics. A Resampling Perspective.
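A minimal sketch of Fourier terms as seasonal predictors (mine, in Python; the book does this in R):

```python
# Sketch (not from the book's code): K sine/cosine pairs for a seasonal
# period m, to be used as regressors instead of m-1 seasonal dummies.
import numpy as np
import pandas as pd

def fourier_terms(n_obs: int, m: float, K: int) -> pd.DataFrame:
    t = np.arange(1, n_obs + 1)
    cols = {}
    for k in range(1, K + 1):
        cols[f"sin_{k}"] = np.sin(2 * np.pi * k * t / m)
        cols[f"cos_{k}"] = np.cos(2 * np.pi * k * t / m)
    return pd.DataFrame(cols)
```

Fewer terms (smaller K) give a smoother seasonal pattern; with K = m/2 you can represent any seasonal pattern of period m, just like the dummies.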
7.5 Selecting predictors.
Apart from CV (à la Prophet), there are different scores, such as adjusted R² or the well-known Akaike's Information Criterion (AIC), with variations for small training samples (AICc) or that penalise extra predictors more heavily (BIC). Note that with large enough T, minimising the AIC is equivalent to minimising the CV error. Not clear if there is anything one can do with small data: CV becomes hard, and AIC theoretical guarantees may not hold: https://stats.stackexchange.com/questions/139175/aic-versus-cross-validation-in-time-series-the-small-sample-case
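For reference (from the book's §7.5, as I recall): for regression, AIC = T·log(SSE/T) + 2(k+2), where k is the number of predictors; AICc adds a small-sample correction, and BIC replaces the 2(k+2) penalty with (k+2)·log(T), which penalises extra predictors more heavily.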
7.7 Forecasting with a nonlinear trend
… not recommended that quadratic or higher order trends be used in forecasting. When they are extrapolated, the resulting forecasts are often unrealistic. (…) A better approach is to use the piecewise specification introduced above and fit a piecewise linear trend which bends at some point in time.
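In the book's piecewise specification, this just means two trend predictors: x_{1,t} = t and x_{2,t} = (t − τ)_+ = max(0, t − τ), so the fitted slope is β_1 before the knot τ and β_1 + β_2 after it.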
7.8 Correlation, causation, and forecasting
- Multicollinearity: correlation between predictors (or linear combinations of predictors).
- Multicollinearity is not a problem in general (with the exception of the "dummy variable trap"):
if the future values of your predictor variables are within their historical ranges, there is nothing to worry about — multicollinearity is not a problem except when there is perfect correlation.
8. Exponential smoothing
8.1. Simple exponential smoothing (SES).
Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older.
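Explicitly (book, §8.1): ŷ_{T+1|T} = α·y_T + α(1−α)·y_{T−1} + α(1−α)²·y_{T−2} + …, with 0 ≤ α ≤ 1. Larger α puts more weight on recent observations; α = 1 gives the naive forecast.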
8.2 Methods with trend.
- Holt’s linear trend method and damped trend methods
- Two smoothing equations: one for the level and one for the trend.
The forecasts generated by Holt’s linear method display a constant trend (increasing or decreasing) indefinitely into the future. Empirical evidence indicates that these methods tend to over-forecast, especially for longer forecast horizons. Motivated by this observation, Gardner & McKenzie (1985) introduced a parameter that “dampens” the trend to a flat line some time in the future. Methods that include a damped trend have proven to be very successful, and are arguably the most popular individual methods when forecasts are required automatically for many series
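The damped trend changes the forecast equation to ŷ_{t+h|t} = ℓ_t + (φ + φ² + … + φ^h)·b_t with 0 < φ < 1, so as h grows the forecasts level off towards ℓ_t + b_t·φ/(1−φ) instead of growing linearly forever.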
8.3 Methods with seasonality. Known as the Holt-Winters method.
- Exponential smoothing can accommodate trends and seasonality; the seasonal component is a weighted average of the previous seasons. Seasonality can be additive or multiplicative here too.
(From 8.2) The trend equation shows that b_t is a weighted average of the estimated trend at time t based on (ℓ_t − ℓ_{t−1}) and b_{t−1}, the previous estimate of the trend.
The seasonal equation shows a weighted average between the current seasonal index, (y_t − ℓ_{t−1} − b_{t−1}), and the seasonal index of the same season last year (i.e., m time periods ago).
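For reference, the additive Holt-Winters equations (book, §8.3) are:
ŷ_{t+h|t} = ℓ_t + h·b_t + s_{t+h−m(k+1)}
ℓ_t = α·(y_t − s_{t−m}) + (1−α)·(ℓ_{t−1} + b_{t−1})
b_t = β*·(ℓ_t − ℓ_{t−1}) + (1−β*)·b_{t−1}
s_t = γ·(y_t − ℓ_{t−1} − b_{t−1}) + (1−γ)·s_{t−m}
where m is the seasonal period and k is the integer part of (h−1)/m (so the seasonal indices come from the last observed year).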
8.5. Innovations state space models for exponential smoothing
The previous methods are, in reality, statistical models that can generate forecast intervals, once an error distribution is introduced. For example, simple exponential smoothing, ETS(A,N,N): the smoothing equation ℓ_t = α·y_t + (1−α)·ℓ_{t−1} can be rewritten as ℓ_t = ℓ_{t−1} + α·e_t, where e_t = y_t − ℓ_{t−1} is the one-step error. Assuming e_t = ε_t ~ NID(0, σ²), then:
y_t = ℓ_{t−1} + ε_t
ℓ_t = ℓ_{t−1} + α·ε_t
which is the fully specified statistical model. The first one is the measurement equation, and the second the state equation.
ETS(.,.,.) for (Error, Trend, Seasonal), where
- Error = {A, M}
- Trend = {N, A, A_d}
- Seasonal = {N, A, M}, where N: none, A: additive, M: multiplicative, A_d: additive damped trend.
- The naming:
Each model consists of a measurement equation that describes the observed data, and some state equations that describe how the unobserved components or states (level, trend, seasonal) change over time. Hence, these are referred to as state space models.
9. ARIMA models
9.1 Stationarity and differencing
- A series with aperiodic cycles (but no trend or seasonality) can still be stationary.
- A way to remove trend and seasonality is to use the differenced series (the differences between consecutive values). This can be done multiple times, but only as many times as necessary.
- The differenced series should remain interpretable, and applying more differences than needed can introduce false dynamics or autocorrelations that don't really exist.
- Differences at lag 1 are called "first differences" to distinguish them from seasonal differences.
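For reference: the first difference is y'_t = y_t − y_{t−1}, while the seasonal difference is y'_t = y_t − y_{t−m} (the change between one observation and the corresponding one from the previous season).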
9.5. Non-seasonal ARIMA models
- ARIMA: AutoRegressive Integrated Moving Average.
- ARIMA(p,d,q): where d is the degree of differencing. E.g., second-order differencing means y''_t = y'_t − y'_{t−1} = y_t − 2·y_{t−1} + y_{t−2}. AR and MA modelling are not redundant: AR tries to predict the signal when it has autocorrelation; MA (the weighted averaging of past error terms) tries to reduce prediction error by attending to the noise term of the series: [Question] What is the essence of Combining AR and MA models into ARMA or ARIMA ? : statistics (reddit.com)
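For reference, the book writes the non-seasonal ARIMA(p,d,q) model on the differenced series y'_t as y'_t = c + φ_1·y'_{t−1} + … + φ_p·y'_{t−p} + θ_1·ε_{t−1} + … + θ_q·ε_{t−q} + ε_t, where ε_t is white noise; the AR part uses past values and the MA part uses past forecast errors.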
9.9. Seasonal ARIMA models
Also known as SARIMA in the literature. It includes additional terms for a single seasonal component, written ARIMA(p,d,q)(P,D,Q)_m, where m is the seasonal period.
- Note that it allows the seasonality pattern to change over time.
10. Dynamic regression models
Regression models whose errors are allowed to follow an ARIMA process. We forecast the regression part of the model and the ARIMA part of the model, and combine the results.
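In the book's notation: y_t = β_0 + β_1·x_{1,t} + … + β_k·x_{k,t} + η_t, where η_t follows an ARIMA model with white-noise innovations ε_t; it is ε_t, not η_t, that should look like white noise in the residual diagnostics.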
10.5 Dynamic harmonic regression
Similar to Prophet, it adds Fourier terms to the regression, to allow for multiple (or long) seasonalities.
The ARIMA() function will allow a seasonal period up to m = 350, but in practice will usually run out of memory whenever the seasonal period is more than about 200.
The only real disadvantage (compared to a seasonal ARIMA (SARIMA) model) is that the seasonality is assumed to be fixed — the seasonal pattern is not allowed to change over time. But in practice, seasonality is usually remarkably constant so this is not a big disadvantage except for long time series.
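A minimal DHR sketch in Python (statsmodels, not the book's R code), reusing the illustrative fourier_terms() helper from the Chapter 7 notes; the orders and periods are made up:

```python
# Sketch (not from the book): regression on Fourier terms with ARMA errors,
# via SARIMAX's exogenous regressors. Assumes an hourly pandas Series `y`.
from statsmodels.tsa.statespace.sarimax import SARIMAX

m, K, h = 168, 6, 48                               # weekly seasonality, 48-step horizon
X = fourier_terms(len(y), m=m, K=K)                # in-sample Fourier regressors
fit = SARIMAX(y, exog=X.to_numpy(), order=(2, 0, 1)).fit(disp=False)

X_future = fourier_terms(len(y) + h, m=m, K=K).iloc[-h:]   # regressors for the horizon
forecast = fit.forecast(steps=h, exog=X_future.to_numpy())
```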
11. Forecasting hierarchical and grouped time series.
11.2 Single level approaches
- Bottom-up approach
An advantage of this approach is that we are forecasting at the bottom level of a structure, and therefore no information is lost due to aggregation. On the other hand, bottom-level data can be quite noisy and more challenging to model and forecast.
- Top-down approach: we can learn proportions to disaggregate from the data
Proportions based on forecasts rather than historical data can be used (G. Athanasopoulos et al., 2009). Consider a one-level hierarchy. We first generate h-step-ahead forecasts for all of the series. We don't use these forecasts directly, as they are not coherent (they don't add up correctly). Let's call these "initial" forecasts. We calculate the proportion of each h-step-ahead initial forecast at the bottom level to the aggregate of all the h-step-ahead initial forecasts at this level. We refer to these as the forecast proportions, and we use them to disaggregate the top-level h-step-ahead initial forecast in order to generate coherent forecasts for the whole of the hierarchy.
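A toy numeric sketch (mine, not the book's) of those forecast proportions for a one-level hierarchy; the numbers are made up:

```python
# Sketch (not from the book): top-down disaggregation with forecast proportions.
import numpy as np

bottom_fc = np.array([120.0, 80.0, 50.0])      # h-step "initial" bottom-level forecasts
total_fc = 300.0                               # h-step forecast of the aggregate
proportions = bottom_fc / bottom_fc.sum()      # forecast proportions
coherent_bottom = proportions * total_fc       # coherent bottom-level forecasts
# coherent_bottom now sums exactly to total_fc (coherence by construction).
```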
11.3 Forecast reconciliation
The above approaches can be generalised to use all available information: the Minimum Trace (MinT) optimal reconciliation approach. To use it in practice, different simplifications can be made, with different properties (e.g., when there are many bottom-level series compared to the series length, we use a method called mint_shrink, etc.).
In summary, unlike any other existing approach, the optimal reconciliation forecasts are generated using all the information available within a hierarchical or a grouped structure. This is important, as particular aggregation levels or groupings may reveal features of the data that are of interest to the user and are important to be modelled. These features may be completely hidden or not easily identifiable at other levels.
11.5 Reconciled distributional forecasts
We can also compute prediction intervals: either assuming normality or bootstrapping.
12. Advanced forecasting methods
12.1 Complex seasonality
Higher-frequency data often shows multiple seasonalities at the same time. However, most of the methods seen in this book (ETS, SARIMA…) do not handle that. What can we do?
- STL (see the sketch after this list)
- Apply STL decomposition, specifying multiple seasonal periods.
- Forecast the seasonal components using a seasonal naive method.
- Forecast the seasonally adjusted series using ETS.
- Dynamic Harmonic Regression
- TBATS (which, for some reason, is explained in version 2 of the book, but not in the latest…)
- Allows seasonalities to change over time (but it is computationally expensive).
- 📜 Forecasting at scale (Prophet paper) (also models seasonality with Fourier terms as DHR).
Prophet has the advantage of being much faster to estimate than the DHR models we have considered previously, and it is completely automated. However, it rarely gives better forecast accuracy than the alternative approaches, as these two examples have illustrated.
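A minimal sketch (Python/statsmodels, not the book's R code) of the STL option listed above; MSTL needs a recent statsmodels version, and `y` is assumed to be an hourly pandas Series:

```python
# Sketch (not from the book): multiple-seasonality STL decomposition, seasonal
# naive forecast of the seasonal part, ETS forecast of the adjusted series.
import numpy as np
from statsmodels.tsa.seasonal import MSTL
from statsmodels.tsa.holtwinters import ExponentialSmoothing

h = 48
res = MSTL(y, periods=(24, 168)).fit()          # daily + weekly seasonality
seasonal = res.seasonal.sum(axis=1)             # combined seasonal component
adjusted = y - seasonal                         # seasonally adjusted series

# Seasonal naive: repeat the last full week of the combined seasonal component.
seas_fc = np.tile(seasonal.to_numpy()[-168:], 1 + h // 168)[:h]

# ETS (additive trend, no seasonality) for the seasonally adjusted series.
ets_fc = ExponentialSmoothing(adjusted, trend="add").fit().forecast(h)

forecast = ets_fc.to_numpy() + seas_fc
```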
13. Some practical forecasting issues
- Very long and very short time series
With short series, there is not enough data to allow some observations to be withheld for testing purposes, and even time series cross validation can be difficult to apply. The AICc is particularly useful here, because it is a proxy for the one-step forecast out-of-sample MSE. Choosing the model with the minimum AICc value allows both the number of parameters and the amount of noise to be taken into account.
What tends to happen with short series is that the AICc suggests simple models because anything with more than one or two parameters will produce poor forecasts due to the estimation error.