While machine learning is many times faster than humans at finding patterns in scientific data, the task of validating these patterns as “meaningful” is still left to the scientist (who presumably would perform further experiments) or to ad-hoc methods such as visualization. To effectively accelerate scientific discovery with machine learning, human validation must be replaced with automated validation to the extent possible. In this talk, I will present instances in which unsupervised learning tasks can be augmented with data-driven artefact removal or stability guarantees.

In the case of clustering, I will introduce a new framework for proving that a clustering is approximately “correct” without requiring the user to know anything about the data distribution. This framework has some similarities to PAC bounds in supervised learning. Unlike PAC bounds, the bounds for clustering can be calculated exactly by solving a convex program and can be of direct practical utility.

In the case of non-linear dimension reduction by manifold learning, I will demonstrate some of my group’s contributions to making the output of ML algorithms reproducible and interpretable. At the core of this work is the notion of augmenting the algorithm output with an estimated Riemannian metric, i.e. with the information that allows the output to preserve the original data geometry.

Joint work with Dominique Perrault-Joncas, James McQueen, Yu-chia Chen, Samson Koelle, and Hanyu Zhang.