Linda Zhao

  • Professor of Statistics and Data Science
  • Academic Director of the Dual Master's Degree in Statistics

Contact Information

  • Office Address:

    403 Academic Research Building
    265 South 37th Street
    Philadelphia, PA 19104

Research Interests: Statistical machine learning, data-driven decision-making, crowdsourcing, post-selection inference, network analysis, nonparametric Bayes, equity ownership, education in data science

Links: Personal Website

Overview

After earning her Ph.D. in Mathematics/Statistics from Cornell University, Linda taught at UCLA for one year. She joined the Wharton School in 1994. She holds a BS degree from the Department of Mathematics at Nankai University, China.

Linda’s research covers statistical machine learning, data-driven decision-making, crowdsourcing, post-selection inference, network analysis, nonparametric Bayes, equity ownership, and education in data science. Current ongoing projects include equity networks, inference for high-dimensional data, data with measurement errors, and post-model-selection inference. Linda also enjoys teaching very much.

Selected Publications

Zhao, L. H. (2000) Bayesian aspects of some nonparametric problems, The Annals of Statistics, 28, 532–552

Brown, L. D., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S. and Zhao, L. H. (2005) Statistical analysis of a telephone call center: A queueing-science perspective, Journal of the American Statistical Association, 100, 36-50

Cai, T., Low, M. and Zhao, L.H. (2007) Trade-offs between global and local risks in nonparametric function estimation, Bernoulli, 13, 1-19

Berk, R., Brown, L.D. and Zhao, L. (2010) Statistical inference after model selection, Journal of Quantitative Criminology, 26, 217-236

Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L. and Moy, L. (2010) Learning from crowds, Journal of Machine Learning Research, 11, 1297–1322

Brown, L. D., Cai, T., Zhang, R., Zhao, L. H. and Zhou, H. (2010) The root-unroot algorithm for density estimation as implemented via wavelet block thresholding, Probability Theory and Related Fields, 146, 401-433

Raykar, V. and Zhao, L. (2010) Nonparametric prior for adaptive sparsity, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: 629-636

Nagaraja, C. H., Brown, L.D. and Zhao, L. (2010) An autoregressive approach to house price modeling, The Annals of Applied Statistics, 5, 124-149

Berk, R., Brown, L.D., Buja, A., Zhang, K. and Zhao, L. H. (2013) Valid post-selection inference, The Annals of Statistics, 41, 802-837

Harrison, A., Meyer, M., Wang, P., Zhao, L. and Zhao, M. (2018) Can a Tiger Change Its Stripes? Reform of Chinese State-Owned Enterprises in the Penumbra of the State, a Vox article

Buja, A., Brown, L.D., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K. and Zhao, L. (2019) Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34(4), 523-544

Buja, A., Brown, L.D., Berk, R., Kuchibhotla, A., George, E. and Zhao, L. (2019) Models as Approximations II: A Model-Free Theory of Parametric Regression, Statistical Science, 34(4), 545-565

Buja, A., Kuchibhotla, A., Berk, R., Tchetgen Tchetgen, E., George, E. and Zhao, L. (2019) Models as Approximations: Rejoinder, Statistical Science, 34(4), 606-620

Cai, J., Mandelbaum, A., Nagaraja, C., Shen, H. and Zhao, L. (2019) Statistical Theory Powering Data Science, Statistical Science, 669-691

Kuchibhotla, A., Buja, A., Brown, L.D., Cai, J., George, E. and Zhao, L. (2019) Valid Post-selection Inference in Model-free Linear Regression, The Annals of Statistics, 48(5), 2953–2981

Azriel, D., Brown, L., Sklar, M., Berk, R., Buja, A. and Zhao, L. (2021) Semi-supervised linear regression, Journal of the American Statistical Association

Teaching

Current Courses (Spring 2024)

  • STAT4710 - Modern Data Mining

    With the advent of the internet age, data are being collected at unprecedented scale in almost all realms of life, including business, science, politics, and healthcare. Data mining—the automated extraction of actionable insights from data—has revolutionized each of these realms in the 21st century. The objective of the course is to teach students the core data mining skills of exploratory data analysis, selecting an appropriate statistical methodology, applying the methodology to the data, and interpreting the results. The course will cover a variety of data mining methods including linear and logistic regression, penalized regression (including lasso and ridge regression), tree-based methods (including random forests and boosting), and deep learning. Students will learn the conceptual basis of these methods as well as how to apply them to real data using the programming language R. This course may be taken concurrently with the prerequisite with instructor permission.

    STAT4710401 ( Syllabus )

    STAT4710402 ( Syllabus )

  • STAT5710 - Modern Data Mining

    Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.

    STAT5710401 ( Syllabus )

    STAT5710402 ( Syllabus )

All Courses

  • FNCE8990 - Independent Study

    Independent Study Projects (ISPs) require extensive independent work and a considerable amount of writing. ISPs in Finance are intended to give students the opportunity to study a particular topic in Finance in greater depth than is covered in the curriculum. The application for an ISP should outline a plan of study that requires at least as much work as a typical course in the Finance Department that meets twice a week. Applications for FNCE 8990 ISPs will not be accepted after the THIRD WEEK OF THE SEMESTER. ISPs must be supervised by a Standing Faculty member of the Finance Department.

  • STAT1120 - Introductory Statistics

    Further development of the material in STAT 1110, in particular the analysis of variance, multiple regression, non-parametric procedures and the analysis of categorical data. Data analysis via statistical packages. This course may be taken concurrently with the prerequisite with instructor permission.

  • STAT3990 - Independent Study

    Written permission of instructor and the department course coordinator required to enroll in this course.

  • STAT4710 - Modern Data Mining

    With the advent of the internet age, data are being collected at unprecedented scale in almost all realms of life, including business, science, politics, and healthcare. Data mining—the automated extraction of actionable insights from data—has revolutionized each of these realms in the 21st century. The objective of the course is to teach students the core data mining skills of exploratory data analysis, selecting an appropriate statistical methodology, applying the methodology to the data, and interpreting the results. The course will cover a variety of data mining methods including linear and logistic regression, penalized regression (including lasso and ridge regression), tree-based methods (including random forests and boosting), and deep learning. Students will learn the conceptual basis of these methods as well as how to apply them to real data using the programming language R. This course may be taken concurrently with the prerequisite with instructor permission.

  • STAT5710 - Modern Data Mining

    Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.

  • STAT6130 - Regr Analysis For Bus

    This course provides the fundamental methods of statistical analysis, the art and science of extracting information from data. The course will begin with a focus on the basic elements of exploratory data analysis, probability theory and statistical inference. With this as a foundation, it will proceed to explore the use of the key statistical methodology known as regression analysis for solving business problems, such as the prediction of future sales and the response of the market to price changes. The use of regression diagnostics and various graphical displays supplements the basic numerical summaries and provides insight into the validity of the models. Specific important topics covered include least squares estimation, residuals and outliers, tests and confidence intervals, correlation and autocorrelation, collinearity, and randomization. The presentation relies upon computer software for most of the needed calculations, and the resulting style focuses on construction of models, interpretation of results, and critical evaluation of assumptions.

  • STAT7010 - Modern Data Mining

    Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.

  • STAT8990 - Independent Study

    Written permission of instructor, the department MBA advisor and course coordinator required to enroll.

  • STAT9910 - Sem in Adv Appl of Stat

    This seminar will be taken by doctoral candidates after the completion of most of their coursework. Topics vary from year to year and are chosen from advanced probability, statistical inference, robust methods, and decision theory, with principal emphasis on applications.

  • STAT9950 - Dissertation

  • STAT9990 - Independent Study

    Written permission of instructor and the department course coordinator required to enroll.

Activity

Wharton Magazine

Data: Voting Wait Times, Pre-IPO Confidentiality, and More
Wharton Magazine - 10/16/2020

Wharton Stories

Predicting Random Forest Fires in California

On February 14, Analytics at Wharton, Wharton Customer Analytics, Penn Engineering, and Wharton Statistics collaborated to host the first Women in Data Science Conference (WiDS) at Penn. Among the impressive roster of PhD students, industry professionals, and professors that presented on a variety of topics were three Wharton undergrads who…

Wharton Stories - 03/06/2020