Privacy in Machine Learning and Statistical Inference
The results of learning and statistical inference reveal information about the data they use. This talk discusses the possibilities and limitations of fitting machine learning and statistical models while protecting the privacy of individual records.

I will begin by explaining what makes this problem difficult, using recent breaches as examples, including the memorization of individual training examples. I will then present differential privacy, a rigorous definition of privacy in statistical databases that is now widely studied and increasingly used to analyze and design deployed systems.

Finally, I will present recent algorithmic results on a fundamental problem: differentially private mean estimation. We give an efficient and (nearly) sample-optimal algorithm for estimating the mean of “nicely” distributed data sets. When the data come from a Gaussian or sub-Gaussian distribution, the new algorithm matches the sample complexity of the best nonprivate algorithm.
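The algorithm in the talk is more involved, but the basic template of differentially private mean estimation can be illustrated with the classical Laplace mechanism on bounded data. This sketch is for intuition only and is not the algorithm from the joint work; the function name and the assumption that records lie in a known range [lower, upper] are mine.

```python
import random

def dp_mean(data, epsilon, lower=0.0, upper=1.0):
    """Estimate the mean of `data` with epsilon-differential privacy.

    Illustrative Laplace mechanism, assuming records lie in
    [lower, upper]; NOT the (nearly) sample-optimal algorithm
    discussed in the talk.
    """
    n = len(data)
    # Clamp each record to the known range so the sensitivity bound holds.
    clipped = [min(max(x, lower), upper) for x in data]
    true_mean = sum(clipped) / n
    # Changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    # Laplace(scale = sensitivity / epsilon) noise, sampled as the
    # difference of two i.i.d. exponentials with rate epsilon/sensitivity.
    rate = epsilon / sensitivity
    noise = random.expovariate(rate) - random.expovariate(rate)
    return true_mean + noise
```

The key privacy step is the clamping: without a bound on each record, a single individual could shift the mean arbitrarily, and no finite amount of noise would suffice. Handling unbounded, "nicely" distributed data without paying for a worst-case range is exactly where the more sophisticated algorithms come in.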

The last part of the talk is based on joint work with Gavin Brown and Sam Hopkins that shared the Best Student Paper award at COLT 2023.