Ensembles: Why multiple AI models are better than one

The wisdom of the crowd

The Melbourne Cricket Ground (MCG) runs a guessing game during AFL matches in which visitors estimate the crowd size before the official number is published. Visitors tweet their estimates to #MCGCrowd, and the winner gets a prize (tweeting multiple guesses is cheating).

Although individual guesses ranged from 10,000 to over 60,000, as a whole the guesses centred on the true number (see Figure 1).

Fig. 1: Histogram of crowd size guesses for the Carlton vs Western Bulldogs game (constructed from sample of 88 tweets). The dashed light blue and dark blue lines represent the median and mean guess. The solid red line represents the official MCG crowd size published after the game.

The median crowd size estimate was 34,849. The mean crowd size estimate was 35,153. The standard deviation was 8,089. The difference between the mean and median is due to a couple of tweets that made extreme guesses a little over 60,000.

The official crowd size reported by the MCG after the game was 35,157. Thus the median estimate was off by 308 people, and the mean estimate by just 4. In my sample, only 3 tweets did better than the median estimate, and no-one tweeted a better estimate than the mean. If, rather than posting your own estimate, you had waited until just before the competition closed and then posted the mean estimate, you would have won. (I’ll admit that in this case the numbers worked out better than usual: the mean estimate is likely to be better than a randomly selected individual estimate, but it isn’t usually better than all the individual estimates.)
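The arithmetic above can be checked directly from the reported summary statistics. The sketch below also estimates how far off the mean would typically be, using the standard error of the mean. That last step assumes the 88 guesses are independent and unbiased, which is an assumption of this sketch rather than something the sample establishes.

```python
import math

# Summary statistics reported above for the sample of 88 tweets.
OFFICIAL = 35_157
MEDIAN_GUESS = 34_849
MEAN_GUESS = 35_153
STD_DEV = 8_089
N_GUESSES = 88

median_error = abs(OFFICIAL - MEDIAN_GUESS)
mean_error = abs(OFFICIAL - MEAN_GUESS)

# Assumption (not from the data): guesses are independent and unbiased,
# so the mean of n guesses has standard error sd / sqrt(n).
standard_error = STD_DEV / math.sqrt(N_GUESSES)

print(f"median off by {median_error}")
print(f"mean off by {mean_error}")
print(f"standard error of the mean: about {standard_error:.0f}")
```

Under those assumptions the mean would typically miss by several hundred people (roughly 860), so an error of just 4 is a lucky draw, which matches the caveat above.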

This effect is known as the wisdom of the crowd. The classic articles are by Francis Galton (1907)[1], who examined the surprising accuracy of the median estimate in a competition to guess the weight of a slaughtered ox, and, less gruesomely, Treynor (1987)[2], who examined the mean estimate in a competition to guess the number of beans in a jar. Treynor suggested that this same principle underpins the stock market: different investors use different techniques to estimate the attributes of a company’s stock, and the errors of each technique tend to cancel each other out, leading to a good average estimate. Treynor further warns that prominent opinion pieces published about the market (e.g. articles about what Warren Buffett would do) cause the market to become unstable because they create shared errors in people’s value estimates.
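One way to see why shared errors matter is to look at the spread of an average when the individual errors are correlated. The sketch below uses a deliberately simple model that is an assumption of this example, not Treynor's own analysis: every estimate has the same standard deviation `sigma`, and every pair of estimates has the same error correlation `rho`. The function name `std_of_average` is made up for illustration.

```python
import math


def std_of_average(sigma: float, n: int, rho: float) -> float:
    """Standard deviation of the average of n estimates.

    Illustrative assumptions: each estimate has standard deviation `sigma`
    and each pair of estimates has error correlation `rho`
    (0 = independent errors, 1 = one fully shared error).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    # Var(mean) = sigma^2 * (rho + (1 - rho) / n)
    variance = sigma ** 2 * (rho + (1.0 - rho) / n)
    return math.sqrt(variance)


for rho in (0.0, 0.1, 0.5, 1.0):
    print(f"rho={rho:.1f}: std of average = {std_of_average(1.0, 100, rho):.3f}")
```

With independent errors, averaging 100 estimates cuts the spread tenfold. With any shared component, the spread can never fall below `sigma * sqrt(rho)` however many estimates are averaged, because the shared part of the error does not cancel.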

Ensembles in theory

Data scientists have learned to take a similar approach when solving machine learning problems. Rather than relying on any individual model, they take their best models and report the average (or have the models participate in a vote if a binary outcome is needed). This is known as model ensembling. The textbook example is the winners of the $1 million Netflix Prize, a competition to predict user ratings of videos. The winning team was actually a merger of three teams who combined all their models (although the way they ensembled them was more sophisticated than simply taking the average: the models were trained together to complement each other, and their outputs were treated as predictors for yet another machine learning problem). The resulting ensemble consisted of approximately 500 models[3,4], depending on whether you count model variants as separate models[5,6,7].
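A minimal sketch of the two simple combination rules described above, averaging and voting, is shown below. It is not the blending scheme used by the Netflix Prize winners; the function names and the example inputs are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Hashable, Optional, Sequence


def average_ensemble(predictions: Sequence[float]) -> float:
    """Combine numeric predictions from several models by averaging."""
    return mean(predictions)


def majority_vote(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label chosen by more than half of the models.

    Returns None when no label has a strict majority (e.g. a tie).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical outputs of three models for one user/video pair.
print(average_ensemble([3.5, 4.0, 4.5]))    # 4.0
print(majority_vote([True, True, False]))   # True
print(majority_vote([True, False]))         # None (tie)
```

Both functions expect at least one prediction. Returning `None` on a tie is a design choice of this sketch; a real system would need an explicit tie-breaking rule.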

Ensembles in practice

The sad part of this story is that, despite offering a better than 10% improvement (in terms of RMSE) over the algorithm Netflix was using at the time, the prize-winning algorithm was never actually put into production (credit to Techdirt and Wired for being quick to pick up on this):

“We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment” — Netflix Tech Blog

Defective models just add noise, so selecting which models to include in an ensemble requires a judgement call. The right decision can differ between the confines of a data scientist’s hard drive and a real-world production system, because in production the models have to be supported long term, without downtime, on data streams that will inevitably contain all kinds of unanticipated edge cases and that change fundamentally in nature over time. Current data science practices focus on optimizing prediction performance for fixed datasets, much as traditional software engineering practices focused on optimizing computational performance for fixed customer requirements. Real applications require the ability to quickly remodel to adapt to changing information sources and evaluation criteria.

Mean estimates are deeply flawed in the presence of unreliable models: all it takes is for one model to report a ridiculously high estimate, like guessing the crowd size is 10^9 (one billion), and the entire ensemble is off. Even in the case of a vote, we must be careful to ensure that more than 50% of the models are behaving appropriately.
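The effect of a single misbehaving model is easy to reproduce. The sketch below uses made-up predictions from five hypothetical models, then replaces one of them with a wildly high estimate of one billion. The median is included only for contrast with the mean; it is not being proposed here as the right ensembling rule.

```python
from statistics import mean, median

# Hypothetical crowd-size predictions from five models (made-up numbers).
healthy = [33_000, 34_500, 35_000, 35_500, 37_000]
# Same ensemble, but the last model fails and reports one billion.
faulty = healthy[:-1] + [10**9]

for name, preds in (("healthy", healthy), ("faulty", faulty)):
    print(f"{name}: mean = {mean(preds):,.0f}, median = {median(preds):,.0f}")

# healthy: mean = 35,000, median = 35,000
# faulty: mean = 200,027,600, median = 35,000
```

One bad model drags the mean out by several orders of magnitude, while the median does not move. Like a vote, though, the median only holds up while fewer than half of the models are misbehaving.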

So think twice before adding half-baked models to the ensemble, especially if they are going into a production system!

The next post in this series can be found here.


  1. Galton, F., 1907. Vox populi (The wisdom of crowds). Nature, 75(7), pp.450-451.
  2. Treynor, J.L., 1987. Market efficiency and the bean jar experiment. Financial Analysts Journal, 43(3), pp.50-53.
  3. [Blog] Chen, E., 2011. Winning the Netflix Prize: A Summary.
  4. [Presentation] Jahrer, M. and Töscher, A., 2009. Blending Techniques, slide 2.
  5. Koren, Y., 2009. The BellKor Solution to the Netflix Grand Prize.
  6. Töscher, A., Jahrer, M. and Bell, R.M., 2009. The BigChaos solution to the Netflix Grand Prize.
  7. Piotte, M. and Chabbert, M., 2009. The Pragmatic Theory solution to the Netflix Grand Prize.

Header image courtesy of woodleywonderworks under CC BY 2.0.

Thanks to Antonio Giardina, Joshua Asbury and Shannon Pace for proofreading and providing suggestions.