403 Academic Research Building
265 South 37th Street
Philadelphia, PA 19104
Research Interests: Statistical machine learning, data-driven decision-making, crowdsourcing, post-selection inference, network analysis, nonparametric Bayes, equity ownership, education in data science
After receiving her Ph.D. in Mathematics/Statistics from Cornell University, Linda taught at UCLA for one year. She joined the Wharton School in 1994. She holds a BS degree from the Mathematics department of Nankai University, China.
Linda’s research covers statistical machine learning, data-driven decision-making, crowdsourcing, post-selection inference, network analysis, nonparametric Bayes, equity ownership, and education in data science. Ongoing projects include equity networks, inference for high-dimensional data, data with measurement errors, and post-model-selection inference. Linda also enjoys teaching very much.
Zhao, L. H. (2000) Bayesian aspects of some nonparametric problems, The Annals of Statistics, 28, 532–552
Brown, L. D., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S. and Zhao, L. H. (2005) Statistical analysis of a telephone call center: A queueing-science perspective, Journal of the American Statistical Association, 100, 36-50
Cai, T., Low, M. and Zhao, L.H. (2007) Trade-offs between global and local risks in nonparametric function estimation, Bernoulli, 13, 1-19
Berk, R., Brown, L.D. and Zhao, L. (2010) Statistical inference after model selection, Journal of Quantitative Criminology, 26, 217-236
Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L. and Moy, L. (2010) Learning from crowds, Journal of Machine Learning Research, 11, 1297–1322
Brown, L. D., Cai, T., Zhang, R., Zhao, L. H. and Zhou, H. (2010) The root-unroot algorithm for density estimation as implemented via wavelet block thresholding, Probability Theory and Related Fields, 146, 401-433
Raykar, V. and Zhao, L. (2010) Nonparametric prior for adaptive sparsity, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: 629-636
Nagaraja, C. H., Brown, L.D. and Zhao, L. (2010) An autoregressive approach to house price modeling, The Annals of Applied Statistics, 5, 124-149.
Berk, R., Brown, L.D., Buja, A., Zhang, K. and Zhao, L. H. (2013) Valid post-selection inference, The Annals of Statistics, 41, 802-837
Harrison, A., Meyer, M., Wang, P., Zhao, L. and Zhao, M. (2018) Can a Tiger Change Its Stripes? Reform of Chinese State-Owned Enterprises in the Penumbra of the State, a Vox article
Buja, A., Brown, L.D., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., Zhao, L. (2019). Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34(4), 523-544.
Buja, A., Brown, L.D., Berk, R., Kuchibhotla, A., George, E., and Zhao, L., (2019). Models as Approximations II: A Model-Free Theory of Parametric Regression, Statistical Science, 34(4), 545-565.
Buja, A., Kuchibhotla, A., Berk, R., Tchetgen Tchetgen, E., George, E., and Zhao, L., (2019). Models as Approximations – Rejoinder, Statistical Science, 34(4), 606-620.
Cai, J., Mandelbaum, A., Nagaraja, C., Shen, H. and Zhao, L. (2019) Statistical Theory Powering Data Science, Statistical Science, 34(4), 669-691
Kuchibhotla, A., Buja, A., Brown, L.D., Cai, J., George, E., and Zhao, L., (2020) Valid Post-selection Inference in Model-free Linear Regression, Annals of Statistics, 48(5), 2953–2981.
Azriel, D., Brown, L., Sklar, M., Berk, R., Buja, A. and Zhao, L. (2021) Semi-Supervised linear regression, Journal of the American Statistical Association
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2021), Uniform-in-Submodel Bounds for Linear Regression in a Model Free Framework, Econometric Theory, (in press).
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference, Econometric Theory, (to appear).
Abstract: For the last two decades, high-dimensional data and methods have proliferated throughout the literature. The classical technique of linear regression, however, has not lost its touch in applications. Most high-dimensional estimation techniques can be seen as variable selection tools that lead to a smaller set of variables to which the classical linear regression technique applies. In this paper, we prove estimation error and linear representation bounds for the linear regression estimator uniformly over (many) subsets of variables. Based on deterministic inequalities, our results provide “good” rates when applied to both independent and dependent data. These results are useful in correctly interpreting the linear regression estimator obtained after exploring the data and also in post-model-selection inference. All the results are derived under no model assumptions and are non-asymptotic in nature.
Richard A. Berk, Andreas Buja, Lawrence D. Brown, Edward I. George, Arun Kumar Kuchibhotla, Weijie Su, Linda Zhao (2020), Assumption Lean Regression, American Statistician, (in press).
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), Valid Post-selection Inference in Model-free Linear Regression, Annals of Statistics, 48 (5), pp. 2953-2981.
Abstract: This paper provides multiple approaches to performing valid post-selection inference in an assumption-lean regression analysis. To the best of our knowledge, this is the first work that provides valid post-selection inference for regression analysis in such general settings, which include independent and m-dependent random variables.
Andreas Buja, Arun Kumar Kuchibhotla, Richard A. Berk, Edward I. George, Eric Tchetgen Tchetgen, Linda Zhao (2020), Models as Approximations—Rejoinder, Statistical Science, 34 (4), pp. 606-620.
Junhui Cai, Avishai Mandelbaum, Chaitra H. Nagaraja, Haipeng Shen, Linda Zhao (2020), Statistical Theory Powering Data Science, Statistical Science, 34 (4), pp. 669-691.
Xian Gu, Iftekhar Hasan, Linda Zhao, Yun Zhu (Working), Who Runs China? A Story Told by Machine Learning.
Andreas Buja, Lawrence D. Brown, Richard A. Berk, Edward I. George, Emil Pitkin, Mikhail Traskin, Kai Zhang, Linda Zhao (2019), Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34 (4), pp. 523-544.
Andreas Buja, Lawrence D. Brown, Arun Kumar Kuchibhotla, Richard A. Berk, Edward I. George, Linda Zhao (2019), Models as Approximations II: A Model-Free Theory of Parametric Regression, Statistical Science, 34 (4), pp. 545-565.
Franklin Allen, Junhui Cai, Xian Gu, QJ Qian, Linda Zhao, Wu Zhu (Working), Ownership Network and Firm Growth: What Do Five Million Companies Tell About Chinese Economy.
Independent Study Projects (ISPs) require extensive independent work and a considerable amount of writing. ISPs in Finance are intended to give students the opportunity to study a particular topic in Finance in greater depth than is covered in the curriculum. The application for an ISP should outline a plan of study that requires at least as much work as a typical course in the Finance Department that meets twice a week. Applications for FNCE 8990 ISPs will not be accepted after the THIRD WEEK OF THE SEMESTER. ISPs must be supervised by a Standing Faculty member of the Finance Department.
Further development of the material in STAT 1110, in particular the analysis of variance, multiple regression, non-parametric procedures and the analysis of categorical data. Data analysis via statistical packages. This course may be taken concurrently with the prerequisite with instructor permission.
Written permission of instructor and the department course coordinator required to enroll in this course.
With the advent of the internet age, data are being collected at unprecedented scale in almost all realms of life, including business, science, politics, and healthcare. Data mining—the automated extraction of actionable insights from data—has revolutionized each of these realms in the 21st century. The objective of the course is to teach students the core data mining skills of exploratory data analysis, selecting an appropriate statistical methodology, applying the methodology to the data, and interpreting the results. The course will cover a variety of data mining methods including linear and logistic regression, penalized regression (including lasso and ridge regression), tree-based methods (including random forests and boosting), and deep learning. Students will learn the conceptual basis of these methods as well as how to apply them to real data using the programming language R. This course may be taken concurrently with the prerequisite with instructor permission.
Modern Data Mining: Statistics or Data Science has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and Ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods presented in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.
This course provides the fundamental methods of statistical analysis, the art and science of extracting information from data. The course will begin with a focus on the basic elements of exploratory data analysis, probability theory and statistical inference. With this as a foundation, it will proceed to explore the use of the key statistical methodology known as regression analysis for solving business problems, such as the prediction of future sales and the response of the market to price changes. The use of regression diagnostics and various graphical displays supplements the basic numerical summaries and provides insight into the validity of the models. Specific important topics covered include least squares estimation, residuals and outliers, tests and confidence intervals, correlation and autocorrelation, collinearity, and randomization. The presentation relies upon computer software for most of the needed calculations, and the resulting style focuses on construction of models, interpretation of results, and critical evaluation of assumptions.
Written permission of instructor, the department MBA advisor and course coordinator required to enroll.
This seminar will be taken by doctoral candidates after the completion of most of their coursework. Topics vary from year to year and are chosen from advanced probability, statistical inference, robust methods, and decision theory, with principal emphasis on applications.
Dissertation
On February 14, Analytics at Wharton, Wharton Customer Analytics, Penn Engineering, and Wharton Statistics collaborated to host the first Women in Data Science Conference (WiDS) at Penn. Among the impressive roster of PhD students, industry professionals, and professors that presented on a variety of topics were three Wharton undergrads who…
Wharton Stories - 03/06/2020