403 Academic Research Building
265 South 37th Street
Philadelphia, PA 19104
Links: Personal Website
After receiving her Ph.D. in Mathematics/Statistics from Cornell University, Linda taught at UCLA for one year. She joined the Wharton School in 1994. She obtained a BS degree from the Mathematics Department of Nankai University, China.
Linda’s research covers Bayesian analysis, nonparametric analysis, and numerical computation. She publishes mainly in leading international journals. Current ongoing projects include forecasting house prices, inference for high-dimensional data, data with measurement errors, and post-model-selection inference. Linda also greatly enjoys teaching.
Zhao, L. H. (2000) Bayesian aspects of some nonparametric problems, The Annals of Statistics, 28, 532–552
Mao, V. and Zhao, L. H. (2003) Free knot polynomial splines with confidence intervals, Journal of the Royal Statistical Society, Series B, 65, 901-919
Brown, L. D., Wang, Y. and Zhao, L. H. (2003) On the statistical equivalence at suitable frequencies of GARCH and stochastic volatility models with the corresponding diffusion model, Statistica Sinica, 993-1013
Brown, L. D., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S. and Zhao, L. H. (2005) Statistical analysis of a telephone call center: A queueing-science perspective, Journal of the American Statistical Association, 100, 36-50
Cai, T., Low, M. and Zhao, L.H. (2007) Trade-offs between global and local risks in nonparametric function estimation, Bernoulli, 13, 1-19
Berk, R., Brown, L.B. and Zhao, L. (2010) Statistical inference after model selection, Journal of Quantitative Criminology, 26, 217-236
Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L. and Moy, L. (2010) Learning from crowds, Journal of Machine Learning Research, 11, 1297–1322
Brown, L. D., Cai, T., Zhang, R., Zhao, L. H. and Zhou, H. (2010) The root-unroot algorithm for density estimation as implemented via wavelet block thresholding, Probability Theory and Related Fields, 146, 401-433
Raykar, V. and Zhao, L. (2010) Nonparametric prior for adaptive sparsity, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: 629-636
Nagaraja, C. H., Brown, L.D. and Zhao, L. (2010) An autoregressive approach to house price modeling, to appear in The Annals of Applied Statistics
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), Valid Post-selection Inference in Assumption-lean Linear Regression, Annals of Statistics, (to appear).
Abstract: This paper provides multiple approaches to performing valid post-selection inference in an assumption-lean regression analysis. To the best of our knowledge, this is the first work that provides valid post-selection inference for regression analysis in such general settings, which include independent and m-dependent random variables.
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference, Econometric Theory, (to appear).
Abstract: For the last two decades, high-dimensional data and methods have proliferated throughout the literature. The classical technique of linear regression, however, has not lost its touch in applications. Most high-dimensional estimation techniques can be seen as variable selection tools which lead to a smaller set of variables where classical linear regression technique applies. In this paper, we prove estimation error and linear representation bounds for the linear regression estimator uniformly over (many) subsets of variables. Based on deterministic inequalities, our results provide “good” rates when applied to both independent and dependent data. These results are useful in correctly interpreting the linear regression estimator obtained after exploring the data and also in post model-selection inference. All the results are derived under no model assumptions and are non-asymptotic in nature.
Richard A. Berk, Andreas Buja, Lawrence D. Brown, Edward I. George, Arun Kumar Kuchibhotla, Weijie Su, Linda Zhao (2020), Assumption Lean Regression, The American Statistician, (in press).
Junhui Cai, Avishai Mandelbaum, Chaitra H. Nagaraja, Haipeng Shen, Linda Zhao (2020), Statistical Theory Powering Data Science, Statistical Science, 34 (4), pp. 669-691.
Andreas Buja, Arun Kumar Kuchibhotla, Richard A. Berk, Edward I. George, Eric Tchetgen Tchetgen, Linda Zhao (2020), Models as Approximations—Rejoinder, Statistical Science, 34 (4), pp. 606-620.
Xian Gu, Iftekhar Hasan, Linda Zhao, Yun Zhu (Working), Who Runs China? A Story Told by Machine Learning.
Andreas Buja, Lawrence D. Brown, Richard A. Berk, Edward I. George, Emil Pitkin, Mikhail Traskin, Kai Zhang, Linda Zhao (2019), Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34 (4), pp. 523-544.
Andreas Buja, Lawrence D. Brown, Arun Kumar Kuchibhotla, Richard A. Berk, Edward I. George, Linda Zhao (2019), Models as Approximations II: A Model-Free Theory of Parametric Regression, Statistical Science, 34 (4), pp. 345-365.
Franklin Allen, Junhui Cai, Xian Gu, QJ Qian, Linda Zhao, Wu Zhu (Working), Ownership Network and Firm Growth: What Do Five Million Companies Tell About Chinese Economy.
Ann Harrison, Marshall Meyer, Will Peichun Wang, Linda Zhao, Minyuan Zhao (Under Review), Can a Tiger Change Its Stripes? Reform of Chinese State-Owned Enterprises in the Penumbra of the State.
Modern Data Mining: Statistics, or data science, has been evolving rapidly to keep pace with the modern world. While classical multiple regression and logistic regression remain the major tools, we go beyond them to methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as K-nearest neighbors (KNN), random forests, support vector machines, principal component analysis (PCA), and the bootstrap are also covered. Text mining, especially through PCA, is another topic of the course. While learning these techniques, we keep in mind that our goal is to tackle real problems. Not only do we work through a large collection of interesting, challenging real-life data sets, but we also learn to use the free, powerful software R with each of the methods covered in class. Prerequisite: two courses at the Statistics 400 or 500 level, or permission of the instructor.
STAT571401 (Syllabus)
STAT701401 (Syllabus)
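As a rough illustration of the penalized-regression methods named in the Modern Data Mining description (the course itself uses R; this Python/NumPy sketch, with made-up data and tuning values, is mine, not course material), ridge regression has a closed-form solution while the LASSO can be fit by coordinate descent with soft-thresholding, which drives some coefficients exactly to zero:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso(X, y, lam, n_iter=200):
    # Coordinate descent: each coefficient is updated by soft-thresholding
    # its univariate least-squares solution against the current residual.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta

# Synthetic data: only coordinates 0 and 3 truly matter
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(100)

b_ridge = ridge(X, y, lam=1.0)     # shrinks all coefficients a little
b_lasso = lasso(X, y, lam=10.0)    # zeros out the inactive coefficients
print(b_ridge)
print(b_lasso)
```

The contrast the course draws between the two methods shows up directly: ridge keeps every coefficient nonzero but small, while the LASSO performs variable selection by setting the inactive coefficients to exactly zero.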
Independent Study Projects require extensive independent work and a considerable amount of writing. ISPs in Finance are intended to give students the opportunity to study a particular topic in Finance in greater depth than is covered in the curriculum. The application for an ISP should outline a plan of study that requires at least as much work as a typical course in the Finance Department that meets twice a week. Applications for FNCE 899 ISPs will not be accepted after the THIRD WEEK OF THE SEMESTER. ISPs must be supervised by a standing faculty member of the Finance Department.
Data summaries and descriptive statistics; introduction to a statistical computer package; probability: distributions, expectation, variance, covariance, portfolios, central limit theorem; statistical inference for univariate data; statistical inference for bivariate data: inference for intrinsically linear simple regression models. This course has a business focus but is also appropriate for students in the College. This course may be taken concurrently with the prerequisite with instructor permission.
Continuation of STAT 101. A thorough treatment of multiple regression, model selection, analysis of variance, linear logistic regression; introduction to time series. Business applications. This course may be taken concurrently with the prerequisite with instructor permission.
Introduction to concepts in probability. Basic statistical inference procedures of estimation, confidence intervals and hypothesis testing directed towards applications in science and medicine. The use of the JMP statistical package. Knowledge of high school algebra is required for this course.
Further development of the material in STAT 111, in particular the analysis of variance, multiple regression, non-parametric procedures and the analysis of categorical data. Data analysis via statistical packages. This course may be taken concurrently with the prerequisite with instructor permission.
Written permission of instructor and the department course coordinator required to enroll in this course.
Graphical displays; one- and two-sample confidence intervals; one- and two-sample hypothesis tests; one- and two-way ANOVA; simple and multiple linear least-squares regression; nonlinear regression; variable selection; logistic regression; categorical data analysis; goodness-of-fit tests. A methodology course. This course does not have business applications but has significant overlap with STAT 101 and 102. This course may be taken concurrently with the prerequisite with instructor permission.
Modern Data Mining: Statistics, or data science, has been evolving rapidly to keep pace with the modern world. While classical multiple regression and logistic regression remain the major tools, we go beyond them to methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as K-nearest neighbors (KNN), random forests, support vector machines, principal component analysis (PCA), and the bootstrap are also covered. Text mining, especially through PCA, is another topic of the course. While learning these techniques, we keep in mind that our goal is to tackle real problems. Not only do we work through a large collection of interesting, challenging real-life data sets, but we also learn to use the free, powerful software R with each of the methods covered in class. This course may be taken concurrently with the prerequisite with instructor permission.
Graphical displays; one- and two-sample confidence intervals; one- and two-sample hypothesis tests; one- and two-way ANOVA; simple and multiple linear least-squares regression; nonlinear regression; variable selection; logistic regression; categorical data analysis; goodness-of-fit tests. A methodology course.
This is a course in econometrics for graduate students. The goal is to prepare students for empirical research by studying econometric methodology and its theoretical foundations. Students taking the course should be familiar with elementary statistical methodology and basic linear algebra, and should have some programming experience. Topics include conditional expectation and linear projection, asymptotic statistical theory, ordinary least squares estimation, the bootstrap and jackknife, instrumental variables and two-stage least squares, specification tests, systems of equations, generalized least squares, and introduction to use of linear panel data models.
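One of the topics listed above, the bootstrap, can be sketched in a few lines (an illustrative toy with simulated data, assumptions mine, not the course's actual material): resample (x, y) pairs with replacement and refit OLS on each resample to estimate the sampling variability of the slope.

```python
import numpy as np

# Simulated regression data: y = 1 + 2x + noise
rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

def ols_slope(x, y):
    # Slope of the least-squares line y = a + b*x
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

# Pairs bootstrap: refit the slope on resampled (x, y) pairs
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)   # sample indices with replacement
    boot.append(ols_slope(x[idx], y[idx]))

se_boot = np.std(boot)   # bootstrap standard error of the slope
print(se_boot)
```

With this design the bootstrap standard error should land near the textbook analytic value sigma / (sqrt(n) * sd(x)), which is the kind of agreement the asymptotic theory in the course makes precise.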
Modern Data Mining: Statistics, or data science, has been evolving rapidly to keep pace with the modern world. While classical multiple regression and logistic regression remain the major tools, we go beyond them to methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as K-nearest neighbors (KNN), random forests, support vector machines, principal component analysis (PCA), and the bootstrap are also covered. Text mining, especially through PCA, is another topic of the course. While learning these techniques, we keep in mind that our goal is to tackle real problems. Not only do we work through a large collection of interesting, challenging real-life data sets, but we also learn to use the free, powerful software R with each of the methods covered in class. Prerequisite: two courses at the Statistics 400 or 500 level, or permission of the instructor.
Written permission of instructor, the department MBA advisor and course coordinator required to enroll.
This graduate course will cover the modeling and computation required to perform advanced data analysis from the Bayesian perspective. We will cover fundamental topics in Bayesian probability modeling and implementation, including recent advances in both optimization and simulation-based estimation strategies. Key topics covered in the course include hierarchical and mixture models, Markov Chain Monte Carlo, hidden Markov and dynamic linear models, tree models, Gaussian processes and nonparametric Bayesian strategies.
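The simulation-based estimation the course description mentions can be illustrated with the simplest Markov Chain Monte Carlo routine, random-walk Metropolis, applied to a toy posterior (a normal mean with known variance and a wide normal prior; the data, prior, and step size here are my own illustrative choices, not course content):

```python
import numpy as np

# Toy data: 50 draws from N(3, 1)
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=50)

def log_post(mu):
    # Log posterior up to a constant: N(0, 10^2) prior times N(mu, 1) likelihood
    return -mu**2 / (2 * 100.0) - 0.5 * np.sum((data - mu) ** 2)

# Random-walk Metropolis: propose a nearby value, accept with the
# Metropolis probability min(1, posterior ratio)
samples = []
mu = 0.0
for _ in range(5000):
    prop = mu + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
        mu = prop
    samples.append(mu)

posterior_mean = np.mean(samples[1000:])   # discard burn-in
print(posterior_mean)
```

Because the prior is nearly flat relative to 50 observations, the chain's posterior mean should sit very close to the sample mean of the data; the course builds from toys like this up to hierarchical, mixture, and dynamic models where MCMC is genuinely needed.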
Written permission of instructor and the department course coordinator required to enroll.
On February 14, Analytics at Wharton, Wharton Customer Analytics, Penn Engineering, and Wharton Statistics collaborated to host the first Women in Data Science Conference (WiDS) at Penn. Among the impressive roster of PhD students, industry professionals, and professors that presented on a variety of topics were three Wharton undergrads who…
Wharton Stories - 03/06/2020