Prediction competitions are now so widespread that it is often forgotten how controversial they were when first held, and how influential they have been over the years.
To keep this exercise manageable, I will restrict attention to time series forecasting competitions — where only the history of the data is available when producing forecasts.
The earliest non-trivial study of time series forecast accuracy was probably by David Reid as part of his PhD at the University of Nottingham (1969). Building on his work, Paul Newbold and Clive Granger conducted a study of forecast accuracy involving 106 time series (JRSSA, 1974). Although they did not invite others to participate, they did start the discussion on what forecasting methods are the most accurate for different types of time series. They presented the ideas to the Royal Statistical Society, and the subsequent discussion reveals some of the erroneous thinking of the time.
One important feature of the results was the empirical demonstration that forecast combinations improve accuracy. (A similar result had been demonstrated as far back as Galton, 1907.) Yet one discussant (GJA Stern) stated
Another (Maurice Priestley) said
This reveals a view commonly held (even today) that there is some single model that describes the data generating process, and that the job of a forecaster is to find it. This seems patently absurd to me — real data comes from much more complicated, non-linear, non-stationary processes than any model we might dream up — and George Box famously dismissed it saying “All models are wrong but some are useful”.
There was also a strong bias against automatic forecasting procedures. For example, Gwilym Jenkins said
Makridakis & Hibon (1979)
Five years later, Spyros Makridakis and Michèle Hibon put together a collection of 111 time series and compared many more forecasting methods. They also presented the results to the Royal Statistical Society. The resulting JRSSA (1979) paper seems to have caused quite a stir, and the discussion published along with the paper is entertaining, and at times somewhat shocking.
Maurice Priestley was in attendance again and was clinging to the view that there was a true model waiting to be discovered:
Makridakis and Hibon replied
Many of the discussants seem to have been enamoured with ARIMA models.
Then Chatfield got personal:
Again, Makridakis & Hibon responded:
In response to the hostility and charge of incompetence, Makridakis & Hibon followed up with a new competition involving 1001 series. This time, anyone could submit forecasts, making this the first true forecasting competition as far as I am aware. They also used multiple forecast measures to determine the most accurate method.
The 1001 time series were taken from demography, industry and economics, and ranged in length between 9 and 132 observations. All the data were either non-seasonal (e.g., annual), quarterly or monthly. Curiously, all the data were positive, which made it possible to compute mean absolute percentage errors, but was not really reflective of the population of real data.
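To see why positive data matters for this measure, here is a minimal sketch of the mean absolute percentage error. The function name `mape` and the example numbers are made up for illustration and are not taken from the competition. Each error is divided by the actual value, so the measure is undefined when an actual value is zero and becomes unstable when actual values are close to zero.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent.

    Illustrative helper (hypothetical name); not the competition's code.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # Division by the actual value is what restricts MAPE to data
        # that stay away from zero.
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical values: both forecasts are off by 10% of the actual.
print(mape([100.0, 200.0], [110.0, 180.0]))
```

With data that are all positive and well away from zero, as in this competition, the division is always safe.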
The results of their 1979 paper were largely confirmed. The four main findings (taken from Makridakis & Hibon, 2000) were:
- Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
- The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
- The accuracy when various methods are being combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.
- The accuracy of the various methods depends upon the length of the forecasting horizon involved.
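The finding about combinations can be illustrated with a small sketch. The numbers below are invented for the purpose and are not competition results; `mae` and `combine` are hypothetical helper names. The point is only the mechanism: when two methods err in opposite directions, a simple average cancels part of the error. By the triangle inequality, the mean absolute error of an equal-weight average can never exceed the average of the individual mean absolute errors.

```python
def mae(actual, forecast):
    """Mean absolute error of a set of point forecasts."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def combine(*forecasts):
    """Equal-weight average of several forecast vectors."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


# Made-up data for illustration only.
actual = [102.0, 98.0, 105.0, 110.0]
method_a = [100.0, 101.0, 103.0, 115.0]
method_b = [106.0, 96.0, 109.0, 106.0]

combined = combine(method_a, method_b)

print("MAE of method A:   ", mae(actual, method_a))
print("MAE of method B:   ", mae(actual, method_b))
print("MAE of combination:", mae(actual, combined))
```

In this toy example the combination beats both individual methods, but that is not guaranteed in general; only the bound relative to the average of the individual errors always holds.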
The paper describing the competition (Makridakis et al, JF, 1982) had a profound effect on forecasting research. It caused researchers to:
- focus attention on what models produced good forecasts, rather than on the mathematical properties of those models;
- consider how to automate forecasting methods;
- be aware of the dangers of over-fitting;
- treat forecasting as a different problem from time series analysis.
These now seem like common sense to forecasters, but they were revolutionary ideas in 1982.
In 1998, Makridakis & Hibon ran their third competition (the second was not strictly time series forecasting), intending to take account of new methods developed since their first competition nearly two decades earlier. They wrote
It is brave of any academic to claim that their work is “a final attempt”!
This competition involved 3003 time series, all taken from business, demography, finance and economics, and ranging in length between 14 and 126 observations. Again, the data were all either non-seasonal (e.g., annual), quarterly or monthly, and all were positive.
In the published results, Makridakis & Hibon (2000) claimed that the M3 competition supported the findings of their earlier work. Yet the best two methods were not obviously “simple”.
One was the “Theta” method which was described in a highly complicated and confusing manner. Later, Hyndman and Billah (2003) showed that the Theta method was equivalent to an average of a linear regression and simple exponential smoothing with drift, so it turned out to be relatively simple after all. But Makridakis & Hibon could not have known that in 2000.
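To give a sense of how simple the "simple exponential smoothing with drift" component is, here is a minimal sketch. It assumes the smoothing parameter, the drift and the initial level are supplied by hand; in practice they would be estimated from the data. The function name `ses_with_drift` and its arguments are hypothetical, and this is not the Theta method's estimation procedure, only an illustration of the kind of recursion involved.

```python
def ses_with_drift(y, alpha, drift, level0, horizon):
    """Forecast with simple exponential smoothing plus a constant drift.

    Assumed recursion (illustrative only):
        one-step forecast = previous level + drift
        new level         = one-step forecast + alpha * error
    The h-step forecast is the final level plus h times the drift.
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    level = level0
    for obs in y:
        one_step = level + drift
        error = obs - one_step
        level = one_step + alpha * error
    return [level + drift * h for h in range(1, horizon + 1)]


# Hypothetical series that grows by exactly 2 each period.
print(ses_with_drift([10.0, 12.0, 14.0], alpha=0.5, drift=2.0,
                     level0=8.0, horizon=3))
```

Setting `drift` to zero reduces this to ordinary simple exponential smoothing, whose forecasts are flat at the last level.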
The other method that performed extremely well in the M3 competition was the commercial software package ForecastPro. The algorithm used is not public, but enough information has been revealed that we can be sure it is not simple. The algorithm selects between an exponential smoothing and ARIMA model based on some state space approximations and a BIC calculation (Goodrich, 2000).
Neural network competitions
There was only one submission that used neural networks in the M3 competition, and it did relatively poorly. To encourage additional submissions, Sven Crone organized a subsequent competition (the NN3) in 2006 involving 111 of the monthly M3 series. Over 60 algorithms were submitted, although none outperformed the original M3 contestants. The paper describing the competition results (by Crone, Hibon and Nikolopoulos) was published in the IJF in 2011.
This supports the general consensus in forecasting, that neural networks (and other highly non-linear and nonparametric methods) are not well suited to time series forecasting due to the relatively short nature of most time series. The longest series in this competition was only 126 observations long. That is simply not enough data to fit a good neural network model.
There were some follow-up competitions, but as far as I know none of the results have ever been published.
Kaggle time series competitions
Few Kaggle competitions have involved time series forecasting; mostly they are about cross-sectional prediction or classification. However, there have been some notable exceptions.
- George Athanasopoulos and I organized a Tourism forecasting competition in 2010. There was a follow-up part 2 later in the same year. The best methods were described in papers published by the IJF in 2011. Coincidentally, both the winners were from my own home city, Melbourne! One of them (Jeremy Howard) went on to become President and Chief Scientist at Kaggle for a few years, and is now running fast.ai. The other (Phil Brierley) is better known for winning the Kaggle Heritage Health Prize and for running my local Data Science Meetup.
- Recently, Oren Anava and Vitaly Kuznetsov organized a Web traffic competition. Here the task was to forecast future web traffic for approximately 145,000 Wikipedia articles. A paper describing the best methods is currently in progress.
One of the great benefits of the Kaggle platform (and others like it) is that it provides a leaderboard and allows multiple submissions. This has been found to lead to much better results as teams compete against each other over the duration of the competition. George Athanasopoulos and I discussed this important feature in a 2011 IJF paper.
Makridakis is now at it again with the M4 competition. This time there are 100,000 time series, and many more participants. New features of this competition are:
- Weekly, daily and hourly data are included, along with annual, quarterly and monthly data.
- Participants are invited to submit prediction intervals as well as point forecasts.
- There is a strong emphasis on reproducibility (a problem with earlier competitions), and competitors will be required to post their code on GitHub.
The M4 competition is certainly not the end of time series competitions! There are many features of time series forecasting that have not been studied under competition conditions.
No previous time series competition has explored forecast distribution accuracy (as distinct from point forecast accuracy). The M4 competition is the first to make a start in this direction with prediction interval accuracy being measured, but it is much richer to measure the whole forecast distribution. This was done, for example, in the GEFCom2014 and GEFCom2017 competitions for energy demand forecasting.
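One common way to score a whole forecast distribution is to issue forecasts for many quantiles and average the pinball (quantile) loss across them. The sketch below assumes that scoring rule; the function names and the numbers are hypothetical, and it is an illustration rather than any competition's official scoring code.

```python
def pinball_loss(actual, quantile_forecast, tau):
    """Pinball loss for a single forecast of the tau-th quantile."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    diff = actual - quantile_forecast
    # Under-forecasts are weighted by tau, over-forecasts by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def mean_pinball_loss(actual, quantile_forecasts):
    """Average pinball loss over a dict mapping tau to a quantile forecast."""
    losses = [pinball_loss(actual, q, tau)
              for tau, q in quantile_forecasts.items()]
    return sum(losses) / len(losses)


# Made-up forecast distribution summarised by three quantiles.
forecast_quantiles = {0.1: 8.0, 0.5: 10.0, 0.9: 13.0}
print(mean_pinball_loss(11.0, forecast_quantiles))
```

A point forecast only gets credit for being close to the outcome, whereas this kind of score also rewards a distribution whose spread is well calibrated.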
No competition has involved large-scale multivariate time series forecasting. While many of the time series in the competitions are probably related to each other, this information has not been provided. Again, the GEFCom competitions have been ground-breaking in this respect also, by requiring true multivariate forecasts to be provided for the energy demand in different regions of the US.
I know of no large-scale forecasting competition for finance data (e.g., stock prices or returns), yet this would seem to be of great interest judging by the number of submissions to the IJF I receive every week.
The data from many of these competitions are available as R packages.
A useful discussion of forecasting competitions and their history is provided by Fildes, R., & Ord, K. (2002). Forecasting competitions: their role in improving forecasting practice and research. In M. Clements & D. Hendry (Eds.), A companion to economic forecasting (pp. 322–353). Oxford, Blackwell.