# Linda Zhao

- Professor of Statistics

**Primary Email:** lzhao@wharton.upenn.edu

**Office Phone:** (215) 898-8228

**Office Address:** 470 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, PA 19104

**Links:**
Personal Website

After receiving her Ph.D. in Mathematics/Statistics from Cornell University, Linda taught at UCLA for one year. She joined the Wharton School in 1994. She holds a BS degree from the Mathematics department of Nankai University, China.

Linda’s research covers Bayesian analysis, nonparametric analysis, and numerical computation. She publishes mainly in leading international journals. Ongoing projects include forecasting house prices, inference for high-dimensional data, data with measurement errors, and post-model-selection inference. Linda also enjoys teaching very much.

Zhao, L. H. (2000) Bayesian aspects of some nonparametric problems, The Annals of Statistics, 28, 532–552

Mao, V. and Zhao, L. H. (2003) Free knot polynomial splines with confidence intervals, Journal of the Royal Statistical Society, Series B, 65, 901-919

Brown, L. D., Wang, Y. and Zhao, L. H. (2003) On the statistical equivalence at suitable frequencies of GARCH and stochastic volatility models with the corresponding diffusion model, Statistica Sinica, 993-1013

Brown, L. D., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S. and Zhao, L. H. (2005) Statistical analysis of a telephone call center: A queueing-science perspective, Journal of the American Statistical Association, 100, 36-50

Cai, T., Low, M. and Zhao, L.H. (2007) Trade-offs between global and local risks in nonparametric function estimation, Bernoulli, 13, 1-19

Berk, R., Brown, L. D. and Zhao, L. (2010) Statistical inference after model selection, Journal of Quantitative Criminology, 26, 217-236

Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L. and Moy, L. (2010) Learning from crowds, Journal of Machine Learning Research, 11, 1297–1322

Brown, L. D., Cai, T., Zhang, R., Zhao, L. H. and Zhou, H. (2010) The root-unroot algorithm for density estimation as implemented via wavelet block thresholding, Probability Theory and Related Fields, 146, 401-433

Raykar, V. and Zhao, L. (2010) Nonparametric prior for adaptive sparsity, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: 629-636

Nagaraja, C. H., Brown, L. D. and Zhao, L. (2010) An autoregressive approach to house price modeling, to appear in The Annals of Applied Statistics

Richard A. Berk, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2018), **Working with Misspecified Regression Models**, *Journal of Quantitative Criminology*, (in press).

Daniel McCarthy, Kai Zhang, Lawrence D. Brown, Richard A. Berk, Andreas Buja, Edward I. George, Linda Zhao (2018), **Calibrated Percentile Double Bootstrap For Robust Linear Regression Inference**, *Statistica Sinica*, (in press).

Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (Working), **A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference**.

**Abstract:** For the last two decades, high-dimensional data and methods have proliferated throughout the literature. The classical technique of linear regression, however, has not lost its touch in applications. Most high-dimensional estimation techniques can be seen as variable selection tools which lead to a smaller set of variables where the classical linear regression technique applies. In this paper, we prove estimation error and linear representation bounds for the linear regression estimator uniformly over (many) subsets of variables. Based on deterministic inequalities, our results provide “good” rates when applied to both independent and dependent data. These results are useful in correctly interpreting the linear regression estimator obtained after exploring the data and also in post model-selection inference. All the results are derived under no model assumptions and are non-asymptotic in nature.

Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Richard A. Berk, Linda Zhao, Edward I. George (Working), **Valid Post-selection Inference in Assumption-lean Linear Regression**.

**Abstract:** This paper provides multiple approaches to perform valid post-selection inference in an assumption-lean regression analysis. To the best of our knowledge, this is the first work that provides valid post-selection inference for regression analysis in such general settings that include independent, m-dependent random variables.

Andreas Buja, Richard A. Berk, Lawrence D. Brown, Edward I. George, Arun Kumar Kuchibhotla, Linda Zhao (2016), **Models as Approximations, Part II: A General Theory of Model-Robust Regression**, *Statistical Science*, (submitted).

Andreas Buja, Richard A. Berk, Lawrence D. Brown, Edward I. George, Emil Pitkin, Mikhail Traskin, Linda Zhao, Kai Zhang (2016), **Models as Approximations, Part I: A Conspiracy of Nonlinearity and Random Regressors in Linear Regression**, *Statistical Science*, (revision submitted).

David Azriel, Lawrence D. Brown, Michael Sklar, Richard A. Berk, Andreas Buja, Linda Zhao (Under Review), **Semi-Supervised Linear Regression**.

Richard A. Berk, Lawrence D. Brown, Andreas Buja, Edward I. George, Emil Pitkin, Kai Zhang, Linda Zhao (2014), **Misspecified Mean Function Regression: Making Good Use of Regression Models That Are Wrong**, *Sociological Methods & Research*, 43 (3), pp. 422-445.

**Abstract:** There are over three decades of largely unrebutted criticism of regression analysis as practiced in the social sciences. Yet, regression analysis broadly construed remains for many the method of choice for characterizing conditional relationships. One possible explanation is that the existing alternatives sometimes can be seen by researchers as unsatisfying. In this paper, we provide a different formulation. We allow the regression model to be incorrect and consider what can be learned nevertheless. To this end, the search for a correct model is abandoned. We offer instead a rigorous way to learn from regression approximations. These approximations, not “the truth,” are the estimation targets. We provide estimators that are asymptotically unbiased and standard errors that are asymptotically correct even when there are important specification errors. Both can be obtained easily from popular statistical packages.

Kai Zhang, Lawrence D. Brown, Edward I. George, Linda Zhao (2014), **Uniform Correlation Mixture of Bivariate Normal Distributions and Hypercubically Contoured Densities That Are Marginally Normal**, *American Statistician*, 68 (3), pp. 183-187.

Richard A. Berk, Emil Pitkin, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2014), **Covariance Adjustments for the Analysis of Randomized Field Experiments**, *Evaluation Review*, 37 (3-4), pp. 170-196.

**Abstract:** It has become common practice to analyze randomized experiments using linear regression with covariates. Improved precision of treatment effect estimates is the usual motivation. In a series of important articles, David Freedman showed that this approach can be badly flawed. Recent work by Winston Lin offers partial remedies, but important problems remain.

### STAT471 - Modern Data Mining

Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and Ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class.

STAT471401 ( Syllabus )

STAT471402 ( Syllabus )

### STAT571 - Modern Data Mining

Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and Ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class.

STAT571401 ( Syllabus )

STAT571402 ( Syllabus )

### STAT701 - Modern Data Mining

Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and Ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class.

STAT701401 ( Syllabus )

STAT701402 ( Syllabus )

### STAT101 - Introductory Business Statistics

Data summaries and descriptive statistics; introduction to a statistical computer package; Probability: distributions, expectation, variance, covariance, portfolios, central limit theorem; statistical inference of univariate data; Statistical inference for bivariate data: inference for intrinsically linear simple regression models. This course will have a business focus, but is not inappropriate for students in the college.

### STAT102 - Introductory Business Statistics

Continuation of STAT 101. A thorough treatment of multiple regression, model selection, analysis of variance, linear logistic regression; introduction to time series. Business applications.

### STAT111 - Introductory Statistics

Introduction to concepts in probability. Basic statistical inference procedures of estimation, confidence intervals and hypothesis testing directed towards applications in science and medicine. The use of the JMP statistical package.

### STAT112 - Introductory Statistics

Further development of the material in STAT 111, in particular the analysis of variance, multiple regression, non-parametric procedures and the analysis of categorical data. Data analysis via statistical packages.

### STAT431 - Statistical Inference

Graphical displays; one- and two-sample confidence intervals; one- and two-sample hypothesis tests; one- and two-way ANOVA; simple and multiple linear least-squares regression; nonlinear regression; variable selection; logistic regression; categorical data analysis; goodness-of-fit tests. A methodology course. This course does not have business applications but has significant overlap with STAT 101 and 102.

### STAT471 - Modern Data Mining

### STAT511 - Statistical Inference

Graphical displays; one- and two-sample confidence intervals; one- and two-sample hypothesis tests; one- and two-way ANOVA; simple and multiple linear least-squares regression; nonlinear regression; variable selection; logistic regression; categorical data analysis; goodness-of-fit tests. A methodology course.

### STAT520 - Applied Econometrics I

This is a course in econometrics for graduate students. The goal is to prepare students for empirical research by studying econometric methodology and its theoretical foundations. Students taking the course should be familiar with elementary statistical methodology and basic linear algebra, and should have some programming experience. Topics include conditional expectation and linear projection, asymptotic statistical theory, ordinary least squares estimation, the bootstrap and jackknife, instrumental variables and two-stage least squares, specification tests, systems of equations, generalized least squares, and introduction to use of linear panel data models.

### STAT571 - Modern Data Mining

### STAT701 - Modern Data Mining

### STAT899 - Independent Study

### STAT927 - Bayesian Statistical Theory and Methods

This graduate course will cover the modeling and computation required to perform advanced data analysis from the Bayesian perspective. We will cover fundamental topics in Bayesian probability modeling and implementation, including recent advances in both optimization and simulation-based estimation strategies. Key topics covered in the course include hierarchical and mixture models, Markov Chain Monte Carlo, hidden Markov and dynamic linear models, tree models, Gaussian processes and nonparametric Bayesian strategies.

### STAT995 - Dissertation

### STAT999 - Independent Study
