The Google Flu project is back: tripping over the same stone twice?
Google has announced GraphCast, an artificial intelligence engine for better weather prediction. Google Flu was a total mess. What now?
In 2009, Nature published a paper describing a new method for predicting flu outbreaks based on the search terms people use when they query Google. Terms such as "fever," "headache," "flu symptoms," and "pharmacies near me" could be used to track the spread of flu in the United States. Not only could these search frequencies and their geographic locations be used to predict doctor visits, but the method was faster and cheaper than the epidemiological tracking methods employed by the Centers for Disease Control and Prevention (CDC).
The paper generated tremendous excitement and was covered by nearly every major newspaper and media outlet. Technology evangelists touted the results as an example of how big data would change the world. University professors discussed the paper in their courses. Data-analytics startups slipped the Nature article into their pitch decks.
When you have Google-scale data, "the numbers speak for themselves." (Chris Anderson, Wired editor)
The method worked reasonably well for a couple of years, but before long the results started to fail, not by a little, but by overestimating flu prevalence by more than a factor of two. Over time, the predictions continued to get worse. The results became so bad that Google killed the project and took down the Flu Trends website.
In retrospect, the study was doomed from the start. There was no theory about which search terms constituted relevant predictors of flu, and that left the algorithm highly susceptible to spurious correlations. For example, "high school basketball" was one of the top 100 predictors of a flu outbreak, simply because searches for "flu" and "high school basketball" both peak in the winter. Like the famous correlation between a country's chocolate consumption and its Nobel prizes, such spurious correlations fall apart as soon as you rely on them to predict the future.
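To see how little it takes to manufacture such a correlation, here is a minimal sketch with entirely synthetic, made-up series: two signals that have nothing to do with each other, both simply peaking in winter, end up strongly correlated.

```python
# Sketch of a spurious seasonal correlation, using synthetic data.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(104)                        # two years of weekly data
winter_peak = np.cos(2 * np.pi * weeks / 52)  # annual cycle peaking each winter

# Hypothetical series: they share the seasonal cycle and nothing else.
flu_cases = 100 + 40 * winter_peak + rng.normal(0, 5, weeks.size)
basketball_searches = 500 + 200 * winter_peak + rng.normal(0, 25, weeks.size)

r = np.corrcoef(flu_cases, basketball_searches)[0, 1]
print(f"correlation: {r:.2f}")  # ~0.9, despite no causal link whatsoever
```

An algorithm scanning thousands of queries with no theory to guide it will happily pick up the basketball series.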
If the Google Flu Trends algorithm had only had to predict flu cases for its first two years, we would still be writing about its triumph. When asked to extend beyond that period, it failed. Sound familiar? Yes, this is overfitting: the machine probably latched onto irrelevant nuances of that particular time period. This is where the scientific method can help. It is designed to produce a theory that focuses on the key elements driving the spread of the flu while ignoring the inconsequential. Search terms can be good indicators of those key elements, but we need a theory to help us generalize beyond two years of predictions. Without theory, data-driven predictions rest on mere correlations.
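A minimal synthetic sketch of this failure mode (not the actual Flu Trends model, and with made-up numbers): an overly flexible model fit to two years of seasonal data looks excellent in-sample and collapses on the third year.

```python
# Sketch of overfitting to a time window, using synthetic data.
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(156)                          # three years of weekly data
x = weeks / 52.0                                # time in years
true_signal = 100 + 40 * np.cos(2 * np.pi * x)  # annual flu-like cycle
observed = true_signal + rng.normal(0, 8, weeks.size)

train, test = x < 2.0, x >= 2.0  # fit on years 1-2, judge on year 3

# A high-degree polynomial has enough freedom to memorize the training window.
coeffs = np.polyfit(x[train], observed[train], deg=15)
predicted = np.polyval(coeffs, x)

def rmse(mask):
    return np.sqrt(np.mean((predicted[mask] - observed[mask]) ** 2))

print(f"train RMSE: {rmse(train):.1f}")  # close to the noise level
print(f"test  RMSE: {rmse(test):.1f}")   # explodes outside the fitted window
```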
When venturing into the black box, also consider the following. Many of the most complicated algorithms use dozens, hundreds, or even thousands of variables when making predictions. Google Flu Trends was based on the forty-five search queries that best predicted flu outbreaks. A machine learning system designed to detect cancer could analyze thousands of different genes. That might sound like a good thing: just add more variables and the predictions improve, right? Well, not exactly. The more variables you add, the more training data you need. We talked earlier about the cost of getting good training data. If you want to incorporate ten thousand genes into your model, good luck finding the millions of example patients you will need to have any chance of making reliable predictions.
This problem with adding more variables is known as the curse of dimensionality. If you feed enough variables into your black box, you will eventually find a combination that works well, but it may work purely by chance. As you increase the number of variables you use to make your predictions, you need exponentially more data to distinguish true predictive ability from luck. Without that data, you are likely to soon discover that the apparent success was the result of a fortuitous alignment in the data, nothing more: ask the winning variable to keep predicting for the next three months and its success rate will drop precipitously.
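The curse in miniature, again with purely synthetic data: scan ten thousand noise-only candidate "genes" for the one that best predicts an outcome in fifty patients, and the winner looks impressive right up until it meets fresh patients. All names and numbers here are illustrative assumptions.

```python
# Sketch of the curse of dimensionality, using synthetic data:
# with enough variables and few samples, pure noise produces a "winner".
import numpy as np

rng = np.random.default_rng(2)
n_patients, n_genes = 50, 10_000
genes = rng.normal(size=(n_patients, n_genes))  # pure noise, no signal
outcome = rng.normal(size=n_patients)           # also pure noise

# In-sample: compute every gene's correlation with the outcome, keep the best.
corrs = (genes - genes.mean(0)).T @ (outcome - outcome.mean())
corrs /= genes.std(0) * outcome.std() * n_patients
best = np.argmax(np.abs(corrs))
print(f"best in-sample |r|: {abs(corrs[best]):.2f}")  # impressively large

# Out-of-sample: the same gene on new patients reverts to nothing.
new_genes = rng.normal(size=(n_patients, n_genes))
new_outcome = rng.normal(size=n_patients)
r_new = np.corrcoef(new_genes[:, best], new_outcome)[0, 1]
print(f"same gene, new data |r|: {abs(r_new):.2f}")   # ~0
```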
Researchers have not stopped trying to use data to help physicians and solve health problems, nor should they. At Microsoft Research, for example, scientists are using Bing search queries to detect people with undiagnosed pancreatic cancer, hopefully learning from Google Flu Trends' mistakes. As long as it is collected legally, with consent and with respect for privacy, data is valuable for understanding the world. The problem is the hype: the notion that something magical will emerge if we can accumulate data on a large enough scale. We just need to remind ourselves that big data is not better; it's just bigger. And it certainly doesn't speak for itself.
Now, in 2023, Google announced that an artificial intelligence based on weather data, rather than on physics equations and laws, predicts the weather better than the European Centre for Medium-Range Weather Forecasts (ECMWF), and again they published their work in Science. Sounds familiar, right?
In 2014, TED Conferences and the XPRIZE Foundation announced a prize to be awarded "to the first artificial intelligence to come to this stage and give a TED Talk compelling enough to win a standing ovation from the audience". Again, Chris Anderson dixit (this time TED's curator, who happens to share a name with the Wired editor).


