Linda Zhao

  • Professor of Statistics and Data Science
  • Academic Director of the Dual Master's Degree in Statistics

Contact Information

  • Office Address:

    403 Academic Research Building
    265 South 37th Street
    Philadelphia, PA 19104

Research Interests: Statistical machine learning, data-driven decision-making, bandits, reinforcement learning, crowdsourcing, post-selection inference, network analysis, nonparametric Bayes, revenue management, equity ownership, education in data science

Links: Personal Website

Overview

After earning her Ph.D. in Mathematics/Statistics from Cornell University, Linda taught at UCLA for one year. She joined the Wharton School in 1994. She holds a BS degree from the Mathematics department of Nankai University, China.

Linda’s research covers statistical machine learning, data-driven decision-making, crowdsourcing, post-selection inference, network analysis, nonparametric Bayes, equity ownership, and education in data science. Current ongoing projects include equity networks, inference for high-dimensional data, data with measurement errors, and post-model-selection inference. Linda also enjoys teaching very much.

Selected Publications

Zhao, L. H. (2000) Bayesian aspects of some nonparametric problems, The Annals of Statistics, 28, 532–552

Brown, L. D., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S. and Zhao, L. H. (2005) Statistical analysis of a telephone call center: A queueing-science perspective, Journal of the American Statistical Association, 100, 36-50

Cai, T., Low, M. and Zhao, L.H. (2007) Trade-offs between global and local risks in nonparametric function estimation, Bernoulli, 13, 1-19

Berk, R., Brown, L.D. and Zhao, L. (2010) Statistical inference after model selection, Journal of Quantitative Criminology, 26, 217-236

Raykar, V., Yu, S., Zhao, L., Valadez, G., Florin, C., Bogoni, L. and Moy, L. (2010) Learning from crowds, Journal of Machine Learning Research, 11, 1297–1322

Brown, L. D., Cai, T., Zhang, R., Zhao, L. H. and Zhou, H. (2010) The root-unroot algorithm for density estimation as implemented via wavelet block thresholding, Probability Theory and Related Fields, 146, 401-433

Raykar, V. and Zhao, L. (2010) Nonparametric prior for adaptive sparsity, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: 629-636

Nagaraja, C. H., Brown, L.D. and Zhao, L. (2010) An autoregressive approach to house price modeling, The Annals of Applied Statistics, 5, 124-149.

Berk, R., Brown, L.D., Buja, A., Zhang, K. and Zhao, L. H. (2013) Valid post-selection inference, The Annals of Statistics, 41, 802-837

Harrison, A., Meyer, M., Wang, P., Zhao, L. and Zhao, M. (2018) Can a Tiger Change Its Stripes? Reform of Chinese State-Owned Enterprises in the Penumbra of the State, a Vox article

Buja, A., Brown, L.D., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K. and Zhao, L. (2019) Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34(4), 523-544.

Buja, A., Brown, L.D., Berk, R., Kuchibhotla, A., George, E. and Zhao, L. (2019) Models as Approximations II: A Model-Free Theory of Parametric Regression, Statistical Science, 34(4), 545-565.

Buja, A., Kuchibhotla, A., Berk, R., Tchetgen Tchetgen, E., George, E. and Zhao, L. (2019) Models as Approximations – Rejoinder, Statistical Science, 34(4), 606-620.

Cai, J., Mandelbaum, A., Nagaraja, C., Shen, H. and Zhao, L. (2019) Statistical Theory Powering Data Science, Statistical Science, 669-691

Kuchibhotla, A., Buja, A., Brown, L.D., Cai, J., George, E. and Zhao, L. (2019) Valid Post-selection Inference in Model-free Linear Regression, Annals of Statistics, 48(5), 2953–2981.

Azriel, D., Brown, L., Sklar, M., Berk, R., Buja, A. and Zhao, L. (2021) Semi-Supervised linear regression, Journal of the American Statistical Association

Research

  • Junhui Cai, Ran Chen, Martin J. Wainwright, Linda Zhao (Work In Progress), Personalized reinforcement learning: With applications to recommender systems.

    Abstract: Reinforcement learning (RL) has achieved remarkable success across various domains; however, its applicability is often hampered by challenges in practicality and interpretability. Many real-world applications, such as in healthcare and business settings, have large and/or continuous state and action spaces and demand personalized solutions. In addition, the interpretability of the model is crucial to decision-makers so as to guide their decision-making process while incorporating their domain knowledge. To bridge this gap, we propose a personalized reinforcement learning framework that integrates personalized information into the state-transition and reward-generating mechanisms. We develop an online RL algorithm for our framework. Specifically, our algorithm learns the embeddings of the personalized state-transition distribution in a Reproducing Kernel Hilbert Space (RKHS) by balancing the exploitation-exploration trade-off. We further provide a regret bound for the algorithm and demonstrate its effectiveness in recommender systems.

  • Junhui Cai, Ran Chen, Martin J. Wainwright, Linda Zhao (Under Revision), Doubly high-dimensional contextual bandits: An interpretable model with applications to assortment/pricing.

    Abstract: Key challenges in running a retail business include how to select products to present to consumers (the assortment problem), and how to price products (the pricing problem) to maximize revenue or profit. Instead of considering these problems in isolation, we propose a joint approach to assortment-pricing based on contextual bandits. Our model is doubly high-dimensional, in that both context vectors and actions are allowed to take values in high-dimensional spaces. In order to circumvent the curse of dimensionality, we propose a simple yet flexible model that captures the interactions between covariates and actions via a (near) low-rank representation matrix. The resulting class of models is reasonably expressive while remaining interpretable through latent factors, and includes various structured linear bandit and pricing models as particular cases. We propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and we prove bounds on its regret. Simulation results show that this method has lower regret than state-of-the-art methods applied to various standard bandit and pricing models. We also illustrate the gains achievable using our method through two case studies on real-world assortment-pricing problems, one for an industry-leading instant noodles company and one for a smaller beauty start-up. In each case, we show both the gains in revenue achievable by our bandit methods and the interpretability of the latent factor models that are learned.

  • Junhui Cai, Xian Gu, Linda Zhao, Wu Zhu, “State ownership in China: An equity network perspective”. In The Arc of the Chinese Economy, edited by Hanming Fang and Marshall Meyer (Cambridge University Press, 2025).

    Abstract: State ownership is the pillar of China’s economy. One cannot understand China’s economy without understanding state ownership. Existing measures of state-owned enterprises (SOEs), largely self-reported, are limited to industrial firms covered by the Annual Industrial Survey (AIS). We provide a new lens by constructing a novel dynamic equity ownership network of all 40 million registered firms in China. Based on the network, we propose a new dynamic SOE metric. Our analysis reveals systematic and large-scale discrepancies between our method and the existing measures, with ours identifying a notably larger pool of SOEs. By the end of 2017, state capital had increased to 31% among all the in-network firms, while the total capital of all SOEs, including partial SOEs, had climbed to 85%. Our findings suggest that state ownership exhibits both decentralization and indirect control trends over time, offering new insights for future research.

  • Franklin Allen, Junhui Cai, Xian Gu, Jun “QJ” Qian, Linda Zhao, Wu Zhu, Centralization or decentralization? The evolution of state-ownership in China.

    Abstract: In this paper, we anatomize the state sector and its role in the Chinese economy. We propose a measure of Chinese SOEs (and partial SOEs) based on firm-to-firm equity investment relationships. We are the first to identify all SOEs among the more than 40 million registered Chinese firms. Our measure captures a significantly larger number of SOEs than the existing measure. By 2017, the aggregated capital of all (partial) SOEs had climbed to 85%, and the total state capital in all SOEs had increased to 31%, of total capital in the economy. State ownership shows parallel trends of decentralization (authoritarian hierarchy) and indirect control (ownership hierarchy) over time. In addition, we find that mixed ownership is associated with higher firm growth and performance, while hierarchical distance to governments is associated with better firm performance but lower growth. Drawing a stark distinction between SOEs and privately-owned enterprises (POEs) could lead to misperceptions of the role of state ownership in the Chinese economy.

  • Franklin Allen, Junhui Cai, Xian Gu, Jun “QJ” Qian, Linda Zhao, Wu Zhu (Under Revision), Ownership network and firm growth: What do forty million companies tell about the Chinese economy?

    Abstract: The finance–growth nexus has been a central question in understanding the unprecedented success of the Chinese economy. Using unique data on all the registered firms in China, we build extensive firm-to-firm equity ownership networks. Entering a network and increasing network centrality leads to higher firm growth, and the effect of global centralities strengthens over time. The RMB 4 trillion stimulus launched by the Chinese government in 2008 partially “crowded out” the positive network effects. Equity ownership networks and bank credit tend to act as substitutes for state-owned enterprises, but as complements for private firms in promoting growth.

  • Junhui Cai, Ran Chen, Dan Yang, Wu Zhu, Linda Zhao, Haipeng Shen (Under Review), Network regression and supervised centrality estimation.

    Abstract: Centrality in a network is a popular metric for agents' network positions and is often used in regression models to capture the network effect on an outcome variable of interest. In empirical studies, researchers often adopt a two-stage procedure that first estimates the centrality and then infers the network effect using the estimated centrality. Despite its prevalence, this two-stage procedure lacks theoretical backing and can fail in both estimation and inference. We therefore propose a unified framework under which we prove the shortcomings of the two-stage procedure in centrality estimation and the undesirable consequences in the regression. We then propose a novel supervised network centrality estimation (SuperCENT) methodology that simultaneously yields superior estimates of the centrality and the network effect and provides valid and narrower confidence intervals than those from the two-stage procedure. We showcase the superiority of SuperCENT in predicting the currency risk premium based on the global trade network.

  • Junhui Cai, Ran Chen, Qitao Huang, Martin J. Wainwright, Linda Zhao, Wu Zhu (Work In Progress), Optimal assortment and pricing via generalized MNL models with Poisson arrival.

    Abstract: We consider the joint dynamic assortment and pricing problem under the multinomial logit (MNL) model with an unknown time-varying Poisson arrival rate and customer preferences. In each period, the seller chooses the assortment and product prices jointly to maximize her revenue. The arrival rate of customers depends, through unknown parameters, on the assortment and prices offered. We propose an efficient algorithm in which the seller chooses the assortment and product prices jointly while learning the underlying arrival and preference parameters from the realized arrivals and customer choices. We show that our algorithm is asymptotically efficient in the sense that the regret is bounded by \sqrt{T}\log(T) with high probability 1-O(1/T), and we provide a matching lower bound (up to \log(T)) showing the optimality of our algorithm.

  • Richard A. Berk, Andreas Buja, Lawrence D. Brown, Edward I. George, Arun Kumar Kuchibhotla, Weijie Su, Linda Zhao (2021), Assumption Lean Regression, American Statistician, 75 (1), pp. 76-84.

  • Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2021), Uniform-in-Submodel Bounds for Linear Regression in a Model Free Framework, Econometric Theory (in press).

  • Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference, Econometric Theory (to appear).

    Abstract: For the last two decades, high-dimensional data and methods have proliferated throughout the literature. The classical technique of linear regression, however, has not lost its touch in applications. Most high-dimensional estimation techniques can be seen as variable selection tools which lead to a smaller set of variables where classical linear regression technique applies. In this paper, we prove estimation error and linear representation bounds for the linear regression estimator uniformly over (many) subsets of variables. Based on deterministic inequalities, our results provide “good” rates when applied to both independent and dependent data. These results are useful in correctly interpreting the linear regression estimator obtained after exploring the data and also in post model-selection inference. All the results are derived under no model assumptions and are non-asymptotic in nature.
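As a rough illustration of the "uniformly over (many) subsets of variables" idea in the abstract above, the following Python sketch fits least squares on every submodel up to a given size and records the largest distance between the estimate and its population target. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the no-intercept setup, the covariance matrix, the coefficient vector, and the simulated Gaussian data are all hypothetical choices made for illustration.

```python
from itertools import combinations

import numpy as np


def max_submodel_error(X, y, Sigma, Gamma, max_size):
    """Largest Euclidean distance between the OLS estimate and its target,
    taken over all variable subsets M with 1 <= |M| <= max_size.

    Assumed (hypothetical) setup: no intercept, Sigma = E[x x^T] and
    Gamma = E[x y], so the target for submodel M is
    solve(Sigma[M, M], Gamma[M]), the best linear predictor using only M.
    """
    p = X.shape[1]
    worst = 0.0
    for size in range(1, max_size + 1):
        for M in combinations(range(p), size):
            idx = list(M)
            # Least squares restricted to the columns in M.
            beta_hat, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            # Population target for the same submodel.
            beta_target = np.linalg.solve(Sigma[np.ix_(idx, idx)], Gamma[idx])
            worst = max(worst, float(np.linalg.norm(beta_hat - beta_target)))
    return worst


def simulate(n, seed=0):
    """Draw illustrative data: centered Gaussian covariates, a linear signal,
    and unit-variance noise (all choices are assumptions for this sketch)."""
    rng = np.random.default_rng(seed)
    p = 5
    # Covariance with entries 0.5 ** |i - j|.
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    beta = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + rng.standard_normal(n)
    Gamma = Sigma @ beta  # E[x y] when the noise is independent of x
    return X, y, Sigma, Gamma


if __name__ == "__main__":
    for n in (200, 2000, 20000):
        X, y, Sigma, Gamma = simulate(n)
        print(n, max_submodel_error(X, y, Sigma, Gamma, max_size=2))
```

In the demo loop, the printed maximum error should tend to shrink as n grows, although the exact values depend on the seed. Submodels that omit relevant covariates are misspecified here, which is why each estimate is compared with its own submodel target rather than with the full coefficient vector.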

Teaching

Current Courses (Spring 2025)

  • STAT4710 - Modern Data Mining

    With the advent of the internet age, data are being collected at unprecedented scale in almost all realms of life, including business, science, politics, and healthcare. Data mining—the automated extraction of actionable insights from data—has revolutionized each of these realms in the 21st century. The objective of the course is to teach students the core data mining skills of exploratory data analysis, selecting an appropriate statistical methodology, applying the methodology to the data, and interpreting the results. The course will cover a variety of data mining methods including linear and logistic regression, penalized regression (including lasso and ridge regression), tree-based methods (including random forests and boosting), and deep learning. Students will learn the conceptual basis of these methods as well as how to apply them to real data using the programming language R. This course may be taken concurrently with the prerequisite with instructor permission.

    STAT4710401 ( Syllabus )

    STAT4710402 ( Syllabus )

  • STAT5710 - Modern Data Mining

    Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.

    STAT5710401 ( Syllabus )

    STAT5710402 ( Syllabus )

  • STAT9916 - Sem In Adv Appl Of Stat

    This seminar will be taken by doctoral candidates after the completion of most of their coursework. Topics vary from year to year and are chosen from advanced probability, statistical inference, robust methods, and decision theory with principal emphasis on applications.

    STAT9916301 ( Syllabus )

All Courses

  • FNCE8990 - Independent Study

    Independent Study Projects (ISPs) require extensive independent work and a considerable amount of writing. ISPs in Finance are intended to give students the opportunity to study a particular topic in Finance in greater depth than is covered in the curriculum. The application for an ISP should outline a plan of study that requires at least as much work as a typical course in the Finance Department that meets twice a week. Applications for FNCE 8990 ISPs will not be accepted after the THIRD WEEK OF THE SEMESTER. ISPs must be supervised by a Standing Faculty member of the Finance Department.

  • STAT3990 - Independent Study

    Written permission of instructor and the department course coordinator required to enroll in this course.

  • STAT4710 - Modern Data Mining

    With the advent of the internet age, data are being collected at unprecedented scale in almost all realms of life, including business, science, politics, and healthcare. Data mining—the automated extraction of actionable insights from data—has revolutionized each of these realms in the 21st century. The objective of the course is to teach students the core data mining skills of exploratory data analysis, selecting an appropriate statistical methodology, applying the methodology to the data, and interpreting the results. The course will cover a variety of data mining methods including linear and logistic regression, penalized regression (including lasso and ridge regression), tree-based methods (including random forests and boosting), and deep learning. Students will learn the conceptual basis of these methods as well as how to apply them to real data using the programming language R. This course may be taken concurrently with the prerequisite with instructor permission.

  • STAT5710 - Modern Data Mining

    Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.

  • STAT6130 - Regr Analysis For Bus

    This course provides the fundamental methods of statistical analysis, the art and science of extracting information from data. The course will begin with a focus on the basic elements of exploratory data analysis, probability theory and statistical inference. With this as a foundation, it will proceed to explore the use of the key statistical methodology known as regression analysis for solving business problems, such as the prediction of future sales and the response of the market to price changes. The use of regression diagnostics and various graphical displays supplements the basic numerical summaries and provides insight into the validity of the models. Specific important topics covered include least squares estimation, residuals and outliers, tests and confidence intervals, correlation and autocorrelation, collinearity, and randomization. The presentation relies upon computer software for most of the needed calculations, and the resulting style focuses on construction of models, interpretation of results, and critical evaluation of assumptions.

  • STAT7010 - Modern Data Mining

    Modern Data Mining: Statistics, or Data Science, has been evolving rapidly to keep up with the modern world. While classical multiple regression and logistic regression techniques continue to be the major tools, we go beyond them to include methods built on top of linear models, such as LASSO and ridge regression. Contemporary methods such as KNN (K nearest neighbors), Random Forest, Support Vector Machines, Principal Component Analysis (PCA), the bootstrap, and others are also covered. Text mining, especially through PCA, is another topic of the course. While learning all the techniques, we keep in mind that our goal is to tackle real problems. Not only do we go through a large collection of interesting, challenging real-life data sets, but we also learn how to use the free, powerful software "R" in connection with each of the methods covered in the class. Prerequisite: two courses at the statistics 4000 or 5000 level or permission from instructor.

  • STAT8990 - Independent Study

    Written permission of instructor, the department MBA advisor and course coordinator required to enroll.

  • STAT9910 - Sem in Adv Appl of Stat

    This seminar will be taken by doctoral candidates after the completion of most of their coursework. Topics vary from year to year and are chosen from advanced probability, statistical inference, robust methods, and decision theory with principal emphasis on applications.

  • STAT9916 - Sem in Adv Appl of Stat

    This seminar will be taken by doctoral candidates after the completion of most of their coursework. Topics vary from year to year and are chosen from advanced probability, statistical inference, robust methods, and decision theory with principal emphasis on applications.

  • STAT9950 - Dissertation

    Dissertation

Activity

Wharton Magazine

Data: Voting Wait Times, Pre-IPO Confidentiality, and More
Wharton Magazine - 10/16/2020

Wharton Stories

Predicting Random Forest Fires in California

On February 14, Analytics at Wharton, Wharton Customer Analytics, Penn Engineering, and Wharton Statistics collaborated to host the first Women in Data Science Conference (WiDS) at Penn. Among the impressive roster of PhD students, industry professionals, and professors that presented on a variety of topics were three Wharton undergrads who…

Wharton Stories - 03/06/2020