Contributed by: Tine Van Calster, Wilfried Lemahieu, Bart Baesens
This article first appeared in Data Science Briefings, the DataMiningApps newsletter.
Search engines have become the go-to means of acquiring many kinds of knowledge, from answers to run-of-the-mill questions to detailed descriptions of, or general information about, any topic that interests the user. Users may search spontaneously, or they may be following a recent trend picked up from the media or from their own personal environment. By analysing search terms and their frequency, we get a sense of what many people are interested in and how they might use this information in the future. This raises the question of whether such data holds predictive power that researchers can use to build forecasting models. This column gives a short introduction to the subject, together with some tested applications.
Google Trends is one of the applications that Google Inc. offers to its users. It is linked to Google Search and tracks search terms and their frequencies, which can be explored through graphs and statistics. Every term can be viewed as a relative monthly time series from 2004 to the present month. For some search terms, the application also produces a forecast, which might prove interesting for predictive models. The frequency data is presented on a relative scale: the highest peak in the time series receives a value of 100, and every other data point is expressed relative to that peak. Additionally, users can view the data by country or region, and can look up specific terms or browse them by ready-made categories.
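The relative scale described above can be illustrated with a minimal Python sketch. The function name `scale_to_peak` and the monthly volumes are hypothetical and chosen only for illustration; Google Trends does not expose raw counts, and its actual processing may involve additional steps that are not shown here.

```python
def scale_to_peak(raw_counts):
    """Rescale a series so that its highest value becomes 100."""
    peak = max(raw_counts)
    if peak == 0:
        # A series without any searches stays at zero.
        return [0 for _ in raw_counts]
    return [round(100 * value / peak) for value in raw_counts]


# Hypothetical monthly search volumes (illustrative, not real data).
monthly_volume = [120, 300, 240, 600, 450]
print(scale_to_peak(monthly_volume))  # [20, 50, 40, 100, 75]
```

Because every value is relative to the peak of the selected period, the same month can receive a different score when the time window or the region changes.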
Now that we have an idea of this new data, the most obvious type of predictive analytics to turn to is forecasting, as we expect the search-term time series to indicate a certain trend. Forecasting also has many applications, such as sales, demand or stock market prediction, and many types of models, from time series analysis to artificial neural networks. In terms of methodology, multivariate time series analysis seems a natural fit for this type of data, as we are essentially trying to predict an ongoing trend from past information in combination with external factors, such as the search terms. Many researchers have indeed used this method in their case studies, with a preference for (seasonal) ARIMAX models, that is, ARIMA models extended with exogenous variables.
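To make the idea of combining past values with an external search series concrete, here is a deliberately simplified Python sketch. It is not a full (seasonal) ARIMAX model: it fits only one autoregressive lag plus one exogenous regressor by ordinary least squares, with no differencing, moving-average or seasonal terms. All function names and the synthetic data are assumptions made for illustration; a real study would more likely rely on a dedicated statistics library.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_arx1(target, search_index):
    """Least-squares fit of y[t] = c + phi * y[t-1] + beta * x[t]."""
    rows = [[1.0, target[t - 1], search_index[t]] for t in range(1, len(target))]
    observed = target[1:]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def forecast_next(coefficients, last_target, next_search_index):
    """One-step-ahead forecast from the fitted coefficients."""
    c, phi, beta = coefficients
    return c + phi * last_target + beta * next_search_index


# Synthetic, noise-free data generated with c=2, phi=0.5, beta=0.3 (illustrative only).
search_index = [40, 55, 70, 60, 90, 100, 80, 65]
target = [10.0]
for x in search_index[1:]:
    target.append(2.0 + 0.5 * target[-1] + 0.3 * x)

coefficients = fit_arx1(target, search_index)
print([round(value, 4) for value in coefficients])
```

On this noise-free series the fit should recover coefficients close to the ones used to generate it. With real data, the estimate for `beta` gives a first indication of how much the search series contributes once the previous value of the target is already accounted for.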
For Google Trends data to be relevant, we need use cases where people turn to the internet for help or information. Several applications have already emerged, such as house prices and sales, unemployment rates and vacation destinations. Even the field of epidemiology has used search data to predict increases in influenza cases. This last example is known as Google Flu Trends, which received significant media coverage when it was first published. Every application uses other external factors for its predictions besides the search terms, but these variables depend strongly on the case study itself. While the prediction of housing prices benefits strongly from adding the house price index, the prediction of holiday destinations depends on marketing campaigns and special offers from well-known travel agencies. Google Trends data is therefore rarely used as the only external factor in a forecasting model.
Despite promising results in many fields, several authors have pointed out some limitations. Firstly, it is not certain that search terms dramatically improve predictability in every case. Generally speaking, however, they do add to the accuracy of predictions, and since the terms are so readily available, they have often proven to be worth the trouble of adding. Secondly, search terms are highly affected by the “celebrity effect”, which is caused by media attention or campaigns around a specific topic. This increases the search volume for the related terms, but might not influence the variable that we are predicting, such as the number of people with the flu. In some cases, however, media attention does lead to an increase in the variable as well, such as in sales forecasting.
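One way to check the first limitation for a given case is to compare the forecast error of a model without search terms against one that includes them, on data held out from fitting. The Python sketch below assumes both sets of forecasts already exist; the function names and all numbers are made up purely for illustration, and mean absolute error is just one possible error measure.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and forecast values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(actual, baseline_forecast, augmented_forecast):
    """Share of the baseline error removed by the model that uses search terms."""
    baseline_error = mean_absolute_error(actual, baseline_forecast)
    augmented_error = mean_absolute_error(actual, augmented_forecast)
    return (baseline_error - augmented_error) / baseline_error


# Made-up hold-out values and forecasts (illustrative only).
actual = [100, 110, 120, 130]
baseline = [90, 100, 130, 140]    # hypothetical model without search terms
augmented = [95, 105, 125, 135]   # hypothetical model with search terms

print(relative_improvement(actual, baseline, augmented))  # 0.5
```

A value near zero, or a negative one, would suggest that the search terms are not worth including for that particular case.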
In short, the use of Google Trends or other search query data for predictive modelling has been tried and tested on several use cases, and has proven successful in most of them. This type of data offers many possibilities, as it is easily accessible and reflects the wants and needs of the general public. In particular, the ability to use regional data allows models to be fine-tuned to very specific situations, in combination with other external influences.
Endnotes
- “Google Trends Help.” Google Help. Google Inc., 2015. Web. 4 November 2015.
- Polgreen, Philip M., et al. “Using internet searches for influenza surveillance.” Clinical Infectious Diseases 47.11 (2008): 1443-1448.
- Goel, Sharad, et al. “Predicting consumer behavior with Web search.” Proceedings of the National Academy of Sciences 107.41 (2010): 17486-17490.
- Choi, Hyunyoung, and Hal Varian. “Predicting the present with Google Trends.” Economic Record 88.s1 (2012): 2-9.
- Wu, Lynn, and Erik Brynjolfsson. “The future of prediction: How Google searches foreshadow housing prices and sales.” Economic Analysis of the Digital Economy. University of Chicago Press, 2014.