REDUCING ML BIAS USING TRUNCATED STATISTICS

CONSTANTINOS DASKALAKIS – MASSACHUSETTS INSTITUTE OF TECHNOLOGY

ABSTRACT:

Machine learning techniques are invaluable for extracting insights from large volumes of data. A key theoretical and practical assumption of most methods, however, is that they have access to independent samples from the entire distribution of relevant data. As a result, these methods often fail catastrophically when given biased data that breaks this assumption. In this talk, we focus on bias due to censoring or truncation, where samples falling outside an “observation window” are unreliable or cannot be observed because of, for example, measurement errors, legal or privacy constraints, or biased data collection. We present a general framework based on stochastic gradient descent for regression and classification from truncated samples. While the framework is broadly applicable, we also instantiate it to obtain computationally and statistically efficient methods for truncated density estimation and for truncated linear, logistic, and probit regression in high dimensions. We also provide experiments that illustrate the practicality of our framework on synthetic and real data.
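To make the effect of truncation concrete, here is a minimal one-dimensional sketch, not the framework presented in the talk. It assumes unit-variance Gaussian data, a known observation window [a, ∞), and a hand-picked step-size schedule; the function names are hypothetical. Under these assumptions the per-sample gradient of the truncated negative log-likelihood is the model's truncated mean minus the observed sample, which has a closed form in one dimension.

```python
import math
import random


def truncated_mean(mu, a):
    """Mean of N(mu, 1) conditioned on x >= a (closed form in 1-D)."""
    z = a - mu
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(N(mu, 1) >= a)
    return mu + pdf / tail


def sgd_truncated_mle(samples, a, steps=50_000, seed=0):
    """SGD on the truncated negative log-likelihood with respect to mu."""
    rng = random.Random(seed)
    mu = sum(samples) / len(samples)  # start from the biased naive mean
    for t in range(1, steps + 1):
        x = rng.choice(samples)           # one observed (truncated) sample
        grad = truncated_mean(mu, a) - x  # gradient of the per-sample loss
        mu -= grad / (1.0 + 0.3 * t)      # decaying step size (assumed schedule)
    return mu


def demo(true_mu=0.0, a=0.0, n=5_000, seed=1):
    """Draw n samples from N(true_mu, 1), keeping only those with x >= a."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.gauss(true_mu, 1.0)
        if x >= a:
            samples.append(x)
    naive = sum(samples) / n
    return naive, sgd_truncated_mle(samples, a)


if __name__ == "__main__":
    naive, corrected = demo()
    print(f"naive mean: {naive:.3f}, truncation-aware estimate: {corrected:.3f}")
```

With the true mean at 0 and the window starting at 0, the naive sample mean lands near 0.8, while the truncation-aware estimate should settle close to 0. In higher dimensions or for general observation windows the truncated mean has no closed form, so a sketch like this would need to estimate that term, for example by sampling from the current model restricted to the window.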

BIO:

Constantinos (a.k.a. “Costis”) Daskalakis is a Professor of Computer Science at MIT and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). He works on computation theory and its interface with game theory, economics, probability theory, statistics, and machine learning. He holds a Diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece, and a PhD in Computer Science from UC Berkeley. He has been honored with the ACM Doctoral Dissertation Award, the Kalai Prize from the Game Theory Society, the SIAM Outstanding Paper Prize, the ACM Grace Murray Hopper Award, the Simons Investigator Award, the Bodossaki Foundation Distinguished Young Scientists Award, and the Nevanlinna Prize from the International Mathematical Union.