A classical statistical idea is to introduce data perturbations and examine their impact on a statistical procedure. I will discuss some recent progress by my group on using a simple data perturbation method, data splitting (DS), to control the false discovery rate (FDR) when fitting generalized linear and index models. The DS procedure splits the data into two halves at random and computes, for each feature, a statistic reflecting the consistency of the two sets of parameter estimates (e.g., regression coefficients). FDR control is achieved by exploiting the fact that these statistics are distributed symmetrically around zero for null features. We further propose Multiple Data Splitting (MDS) to stabilize the selection result and boost power. DS and MDS are conceptually straightforward, easy to implement, and applicable to a wide class of linear and nonlinear models. Interestingly, their specializations in GLMs yield scale-free procedures that circumvent difficulties caused by the non-traditional asymptotic behavior of MLEs in moderate dimensions and of debiased Lasso estimates in high dimensions. For index models, our earlier LassoSIR algorithm (Lin, Zhao and Liu 2019) fits the DS framework quite well. I will also discuss some applications and open questions.

The presentation is based on joint work with Chenguang Dai, Buyu Lin, Xin Xing, and Zhigen Zhao.

Reference Papers: Paper 1, Paper 2, Paper 3, Paper 4


Jun Liu is Professor of Statistics at Harvard University, with a courtesy appointment at the Harvard School of Public Health. Dr. Liu received his BS degree in mathematics in 1985 from Peking University and his PhD in statistics in 1991 from the University of Chicago. He held Assistant, Associate, and full Professor positions at Stanford University from 1994 to 2003. Dr. Liu received the NSF CAREER Award in 1995 and the Mitchell Award in 2000. In 2002, he won the prestigious COPSS Presidents' Award (given annually to one individual under age 40). He was selected as a Medallion Lecturer in 2002, a Bernoulli Lecturer in 2004, and a Kuwait Lecturer at Cambridge University in 2008; and he was elected Fellow of the Institute of Mathematical Statistics in 2004, Fellow of the American Statistical Association in 2005, and Fellow of the International Society for Computational Biology in 2022. He was awarded the Morningside Gold Medal in Applied Mathematics in 2010 (given once every three years to an individual of Chinese descent under age 45). He was honored with the Outstanding Achievement Award and the Pao-Lu Hsu Award (given once every three years) by the International Chinese Statistical Association in 2012 and 2016, respectively. In 2017, he received the Jerome Sacks Award for Outstanding Cross-Disciplinary Research.

Dr. Liu and his collaborators introduced the statistical missing-data formulation and Gibbs sampling strategies for biological sequence motif analysis in the early 1990s. The resulting algorithms for protein sequence alignment, gene regulation analysis, and genetic studies have been adopted by many researchers as standard computational biology tools. Dr. Liu has made fundamental contributions to statistical computing and Bayesian modeling: he pioneered sequential Monte Carlo (SMC) methods and invented novel Markov chain Monte Carlo (MCMC) techniques, and his theoretical and methodological studies of SMC and MCMC algorithms have had a broad impact across many areas. Dr. Liu has also pioneered novel Bayesian modeling techniques for discovering nonlinear and interactive effects in high-dimensional data and has led the development of theory and methods for sufficient dimension reduction in high dimensions. Dr. Liu has served on numerous government grant review panels and the editorial boards of leading statistical journals, including as co-editor of JASA from 2011 to 2014. He has co-authored more than 280 research articles published in leading scientific journals and books and has mentored more than 35 PhD students and 30 postdoctoral fellows.