Bayesian Sparse Regression for Large-scale Observational Health Data

Akihiko Nishimura – Johns Hopkins University


The growing availability of large healthcare databases presents opportunities to investigate how patients’ responses to treatments vary across subgroups. Even with the large cohort sizes found in these databases, however, low incidence rates make it difficult to identify causes of treatment effect heterogeneity among a large number of clinical covariates. Sparse regression provides a potential solution. The Bayesian approach is particularly attractive in our setting, where the signals are weak and heterogeneity across databases is substantial. Applications of Bayesian sparse regression to large-scale data sets, however, have been hampered by the lack of scalable computational techniques. We adapt ideas from numerical linear algebra and computational physics to tackle the critical bottleneck in computing posteriors under Bayesian sparse regression. For linear and logistic models, we develop the conjugate gradient sampler for high-dimensional Gaussians along with the theory of prior-preconditioning. For more general regression and survival models, we develop the curvature-adaptive Hamiltonian Monte Carlo to efficiently sample from high-dimensional log-concave distributions. We demonstrate the scalability of our method on an observational study involving n = 1,065,745 patients and p = 15,779 clinical covariates, designed to compare the effectiveness of the most common first-line hypertension treatments. The large cohort size allows us to detect evidence of treatment effect heterogeneity previously unreported by clinical trials.
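
To make the conjugate gradient sampler mentioned above more concrete, here is a minimal sketch rather than the speaker's implementation. It assumes the conditional posterior of the coefficients is Gaussian with precision Phi = X^T Omega X + D^{-1}, where Omega is a diagonal weight matrix and D is a diagonal prior covariance. Under that assumption, a draw can be obtained by perturbing the right-hand side with Gaussian noise and solving one linear system by conjugate gradient, using the prior covariance D as the preconditioner. The function names (`pcg_solve`, `sample_gaussian_cg`) and argument names (`omega`, `prior_var`) are hypothetical, chosen only for illustration.

```python
import numpy as np

def pcg_solve(matvec, b, precond_diag, tol=1e-10, max_iter=1000):
    """Solve A x = b by preconditioned conjugate gradient.

    `matvec(v)` returns A @ v; `precond_diag` holds the diagonal of the
    approximate inverse of A that is applied to each residual.
    """
    x = np.zeros_like(b)
    r = b.copy()                      # residual for the zero initial guess
    z = precond_diag * r              # preconditioned residual
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            break
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = precond_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def sample_gaussian_cg(X, omega, y, prior_var, rng):
    """Draw beta ~ N(Phi^{-1} X^T Omega y, Phi^{-1}) without factorizing Phi.

    Assumed model: Phi = X^T diag(omega) X + diag(1 / prior_var).
    """
    n, p = X.shape
    eta = rng.standard_normal(n)
    delta = rng.standard_normal(p)
    # b has mean X^T Omega y and covariance Phi, so Phi^{-1} b has the
    # target mean and covariance Phi^{-1}.
    b = X.T @ (omega * y) + X.T @ (np.sqrt(omega) * eta) + delta / np.sqrt(prior_var)
    # Phi is only ever applied to vectors, never formed explicitly.
    matvec = lambda v: X.T @ (omega * (X @ v)) + v / prior_var
    # Prior-preconditioning: the prior covariance stands in for Phi^{-1}.
    return pcg_solve(matvec, b, precond_diag=prior_var)
```

Each iteration costs two matrix-vector products with X, so the sketch never forms or factorizes the p-by-p precision matrix. Heuristically, when the sparsity prior shrinks most coefficients toward zero, the preconditioned system has only a few large eigenvalues, which is the situation in which conjugate gradient tends to converge in few iterations.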

Papers: The talk is mainly based on this paper, but also touches on this paper.