Andreas Buja

Liem Sioe Liong/First Pacific Company Professor Emeritus of Statistics and Data Science

Contact Information

Primary Email:
buja.at.wharton@gmail.com

office Address:
325 Academic Research Building
265 South 37th Street
Philadelphia, PA 19104

Research Interests: data visualization, multivariate statistics, nonparametric statistics

Links: CV

Overview

Education

PhD, Swiss Federal Institute of Technology (ETHZ), 1980

Academic Positions Held

Wharton: 2002-2021 (name Liem Sioe Liong/ First Pacific Company Professor, 2003).
Previous appointment: University of Washington, Seattle. Visiting appointment: Stanford University

Other Positions

Member, Technical Staff, Bellcore/Telcordia, 1987-94
Member, Technical Staff, AT&T Bell Labs, 1994-96
Technology Consultant, AT&T Labs, 1996-2001

Professional Leadership

Editor, Journal of Computational and Graphical Statistics, 1997-2001
Advisory Editor, Journal of Computational and Graphical Statistics, 2001-present

For more information, go to My Personal Page

Research

Andreas Buja and Wolfgang Rolke (Work In Progress), Calibration for Simultaneity: (Re)sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data.
Abstract: We survey and illustrate a Monte Carlo technique for carrying out simple simultaneous inference with arbitrarily many statistics. Special cases of the technique have appeared in the literature, but there exists widespread unawareness of the simplicity and broad applicability of this solution to simultaneous inference. The technique, here called “calibration for simultaneity" or CfS , consists of 1) limiting the search for coverage regions to a one-parameter family of nested regions, and 2) selecting from the family that region whose estimated coverage probability has the desired value. Natural one-parameter families are almost always available. CfS applies whenever inference is based on a single distribution, for example: 1) fixed distributions such as Gaussians when diagnosing distributional assumptions, 2) conditional null distributions in exact tests with Neyman structure, in particular permutation tests, 3) bootstrap distributions for bootstrap standard error bands, 4) Bayesian posterior distributions for high-dimensional posterior probability regions, or 5) predictive distributions for multiple prediction intervals. CfS is particularly useful for estimation of any type of function, such as empirical Q-Q curves, empirical CDFs, density estimates, smooths, generally any type of _t, and functions estimated from functional data. A special case of CfS is equivalent to p-value adjustment (Westfall and Young, 1993). Conversely, the notion of a p-value can be extended to any simultaneous coverage problem that is solved with a one-parameter family of coverage regions.
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2021), Uniform-in-Submodel Bounds for Linear Regression in a Model Free Framework, Econometric Theory, (in press) ().
Richard A. Berk, Andreas Buja, Lawrence D. Brown, Edward I. George, Arun Kumar Kuchibhotla, Weijie Su, Linda Zhao (2021), Assumption Lean Regression, American Statistician, 75 (1), pp. 76-84.
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference, Econometric Theory, (to appear) ().
Abstract: For the last two decades, high-dimensional data and methods have proliferated throughout the literature. The classical technique of linear regression, however, has not lost its touch in applications. Most high-dimensional estimation techniques can be seen as variable selection tools which lead to a smaller set of variables where classical linear regression technique applies. In this paper, we prove estimation error and linear representation bounds for the linear regression estimator uniformly over (many) subsets of variables. Based on deterministic inequalities, our results provide “good” rates when applied to both independent and dependent data. These results are useful in correctly interpreting the linear regression estimator obtained after exploring the data and also in post model-selection inference. All the results are derived under no model assumptions and are non-asymptotic in nature.
Richard A. Berk, Matthew Olson, Andreas Buja, Aurelie Ouss (2020), Using Recursive Partitioning to Find and Estimate Heterogenous Treatment Effects in Randomized Clinical Trials, Journal of Experimential Criminology, (to appear) ().
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao (2020), Valid Post-selection Inference in Model-free Linear Regression, Annals of Statistics, 48 (5), pp. 2953-2981.
Abstract: This paper provides multiple approaches to perform valid post-selection inference in an assumption-lean regression analysis. To the best of our knowledge, this is the first work that provides valid post-selection inference for regression analysis in such a general settings that include independent, m-dependent random variables.
Andreas Buja, Arun Kumar Kuchibhotla, Richard A. Berk, Edward I. George, Eric Tchetgen Tchetgen, Linda Zhao (2020), Models as Approximations—Rejoinder, Statistical Science, 34 (4), pp. 606-620.
Andreas Buja, Lawrence D. Brown, Richard A. Berk, Edward I. George, Emil Pitkin, Mikhail Traskin, Kai Zhang, Linda Zhao (2019), Models as Approximations I: Consequences Illustrated with Linear Regression, Statistical Science, 34 (4), pp. 523-544.
Andreas Buja, Lawrence D. Brown, Arun Kumar Kuchibhotla, Richard A. Berk, Edward I. George, Linda Zhao (2019), Models as Approximations II: A Model-Free Theory of Parametric Regression, Statistical Science, 34 (4), pp. 345-365.
Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai (Working), All of Linear Regression.

Teaching

All Courses

STAT3990 - Independent Study
Written permission of instructor and the department course coordinator required to enroll in this course.
STAT4700 - Data Analy & Stat Comp
This course will introduce a high-level programming language, called R, that is widely used for statistical data analysis. Using R, we will study and practice the following methodologies: data cleaning, feature extraction; web scrubbing, text analysis; data visualization; fitting statistical models; simulation of probability distributions and statistical models; statistical inference methods that use simulations (bootstrap, permutation tests). Prerequisite: Waiving the Statistics Core completely if prerequisites are not met. This course may be taken concurrently with the prerequisite with instructor permission.
STAT5030 - Data Analy & Stat Comp
This course will introduce a high-level programming language, called R, that is widely used for statistical data analysis. Using R, we will study and practice the following methodologies: data cleaning, feature extraction; web scrubbing, text analysis; data visualization; fitting statistical models; simulation of probability distributions and statistical models; statistical inference methods that use simulations (bootstrap, permutation tests). Prerequisite: Two courses at the statistics 4000 or 5000 level.
STAT7700 - Data Analy & Stat Comp
This course will introduce a high-level programming language, called R, that is widely used for statistical data analysis. Using R, we will study and practice the following methodologies: data cleaning, feature extraction; web scrubbing, text analysis; data visualization; fitting statistical models; simulation of probability distributions and statistical models; statistical inference methods that use simulations (bootstrap, permutation tests). Prerequisite: Two courses at the statistics 4000 or 5000 level.
STAT9260 - Multivariate Analy: Meth
This is a course that prepares PhD students in statistics for research in multivariate statistics and data visualization. The emphasis will be on a deep conceptual understanding of multivariate methods to the point where students will propose variations and extensions to existing methods or whole new approaches to problems previously solved by classical methods. Topics include: principal component analysis, canonical correlation analysis, generalized canonical analysis; nonlinear extensions of multivariate methods based on optimal transformations of quantitative variables and optimal scaling of categorical variables; shrinkage- and sparsity-based extensions to classical methods; clustering methods of the k-means and hierarchical varieties; multidimensional scaling, graph drawing, and manifold estimation.
STAT9610 - Statistical Methodology
This is a course that prepares 1st year PhD students in statistics for a research career. This is not an applied statistics course. Topics covered include: linear models and their high-dimensional geometry, statistical inference illustrated with linear models, diagnostics for linear models, bootstrap and permutation inference, principal component analysis, smoothing and cross-validation.
STAT9950 - Dissertation
Dissertation

Awards and Honors

Keynote speaker, Classification Society Conference, Milwaukee, WI, USA, 2013
Infovis best paper award for the article “Graphical inference for infovis” by Wickham, H., Cook, D., Hofmann, H., and Buja, A. IEEE Transactions on Visualization and Computer Graphics (Proc. InfoVis’10)., 2010
Journal of Marketing, finalist for the Harold H. Maynard Award and featured blog article of the October Issue, 2007
Keynote speaker, SIAM Conference on Datamining (SDM06), Bethesda, MD, USA, 2006
Fellow, Institute of Mathematical Statistics, 2006
IMS Medallion lecture, Joint Statistical Meetings, New York, 2002
Keynote speaker, European Meeting of the Psychometric Society, Leiden, 1995
Fellow, American Statistical Association, 1994
Award Medal for diploma thesis in mathematics, Swiss Federal Institute of Technology, 1975

Activity

Latest Research

Andreas Buja and Wolfgang Rolke (Work In Progress), Calibration for Simultaneity: (Re)sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data.

All Research

In the News

Different Worlds: Do Recommender Systems Fragment Consumers’ Interests?

The rise of computer-driven recommendation systems designed to help consumers navigate a growing ocean of choice is prompting concerns that the hyperpersonalization of information sources will lead to harmful divisions throughout society. A new study on consumer purchasing patterns in the music industry suggests the opposite. The paper, by Wharton researchers Kartik Hosanagar, Andreas Buja and Daniel M. Fleder, is titled, "Will the Global Village Fracture into Tribes: Recommender Systems and their Effects on the Consumer." …Read More

Knowledge at Wharton - 8/31/2011

All News

Andreas Buja

Contact Information

Overview

Education

Academic Positions Held

Other Positions

Professional Leadership

Research

Teaching

All Courses

STAT3990 - Independent Study

STAT4700 - Data Analy & Stat Comp

STAT5030 - Data Analy & Stat Comp

STAT7700 - Data Analy & Stat Comp

STAT9260 - Multivariate Analy: Meth

STAT9610 - Statistical Methodology

STAT9950 - Dissertation

Awards and Honors

In the News

Knowledge at Wharton

Activity

Latest Research

In the News