BIG DATA, GOOGLE, AND INFECTIOUS DISEASE PREDICTION: A STATISTICAL PERSPECTIVE

SHIHAO YANG – HARVARD UNIVERSITY

ABSTRACT

Big data generated on the Internet present a great opportunity to track social and economic activity, including real-time disease surveillance. Real-time predictions of infectious disease drawn from these data could help public health officials make timely decisions that save lives. We propose a theoretically grounded method that yields robust and accurate real-time tracking of infectious diseases. Our method significantly outperforms all previous Internet-based tracking models, including Google Flu Trends and Google Dengue Trends.

In the case of flu, we introduce ARGO (AutoRegressive with GOogle data), a real-time digital flu detection method that combines time series information with Google search data. ARGO is derived from a hidden Markov structure of the data-generating mechanism. Trained on a sliding two-year window with an L1 penalty, ARGO incorporates new information as it becomes available and automatically selects and reweights the most useful Google search queries. We extended ARGO to track dengue fever, with great success in five tropical countries. ARGO was then further extended to incorporate cloud-based electronic health records and to generate forecasts weeks ahead. Our latest development extends the method to track infectious disease across space: thanks to the ubiquity of Internet search data, ARGO can now predict disease not only in time but also in space. The upgraded ARGO pools spatial-temporal information, making it flexible, self-correcting, robust, and scalable.
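
To make the combination of autoregressive terms, search data, a sliding window, and an L1 penalty concrete, the following is a minimal sketch under stated assumptions, not the ARGO implementation itself. The abstract does not specify the number of lags, the penalty strength, or the solver, so `n_lags`, `lam`, the plain coordinate-descent lasso, and all function names here are hypothetical choices made for illustration. The default window of 104 weeks stands in for the two-year window, assuming weekly data.

```python
from typing import List, Tuple


def soft_threshold(z: float, t: float) -> float:
    """Soft-thresholding operator that produces exact zeros under an L1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_fit(X: List[List[float]], y: List[float], lam: float,
              n_iter: int = 200) -> Tuple[float, List[float]]:
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1.

    The intercept b0 is not penalised. Returns (b0, coefficients).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = [y[i] - b0 for i in range(n)]
    for _ in range(n_iter):
        # Re-centre the residuals by updating the intercept.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return b0, beta


def argo_style_nowcast(y: List[float], searches: List[List[float]],
                       n_lags: int = 3, window: int = 104,
                       lam: float = 0.01) -> Tuple[float, List[float]]:
    """Estimate the not-yet-reported value y[T], where T = len(y).

    y        : reported disease activity for weeks 0..T-1
    searches : search-query volumes for weeks 0..T (one list per week);
               week T is assumed to be available before y[T] is reported
    Only the most recent `window` weeks are used for training, so refitting
    each week lets the selected queries and their weights change over time.
    """
    T = len(y)
    start = max(n_lags, T - window)
    # Each row holds the n_lags most recent y values, then that week's queries.
    X = [y[t - n_lags:t][::-1] + searches[t] for t in range(start, T)]
    b0, beta = lasso_fit(X, y[start:T], lam)
    x_now = y[T - n_lags:T][::-1] + searches[T]
    return b0 + sum(b * x for b, x in zip(beta, x_now)), beta
```

Calling `argo_style_nowcast(y, searches)` once per week, each time with the newest report appended to `y`, mimics the sliding-window retraining described above; coefficients that the penalty drives to exactly zero correspond to search queries dropped for that week.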