The Science of Measurement in Machine Learning

Jacob Steinhardt – University of California, Berkeley

Abstract

In machine learning, we are obsessed with datasets and metrics: progress in areas as diverse as natural language understanding, object recognition, and reinforcement learning is tracked by numerical scores on agreed-upon benchmarks. However, other ways of measuring ML models are underappreciated, and can unlock important insights.

In this talk, I’ll discuss three important quantities beyond test accuracy: datapoint-level variation, similarity between representations, and robustness. In each case, forming a good measurement is itself hard: there are many similarity measures and many forms of robustness, and both similarity and variation are often dominated by statistical noise. We address these issues and make several new discoveries (sketched in code after the list below):
* As models get larger, overall accuracy increases, yet many individual predictions get worse.
* Models of different depths appear to perform similar computations in a similar order.
* Few interventions consistently improve robustness, but data augmentation and pre-training help in many settings.
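
The first finding rests on a per-datapoint comparison that is easy to get wrong: a single pair of training runs confounds genuine per-example differences with run-to-run statistical noise. The sketch below illustrates one way to separate the two, by averaging per-example correctness over several independently seeded runs; the arrays and threshold are hypothetical stand-ins, not the talk's actual code.

```python
import numpy as np

def fraction_worse(small_runs, large_runs, threshold=0.5):
    """Estimate the fraction of datapoints where the larger model is worse.

    small_runs, large_runs: float arrays of shape (n_seeds, n_examples)
    holding per-example correctness (1.0 = correct) across independently
    seeded training runs. Averaging over seeds separates a stable
    per-datapoint effect from run-to-run statistical noise.
    """
    p_small = small_runs.mean(axis=0)  # per-example accuracy, small model
    p_large = large_runs.mean(axis=0)  # per-example accuracy, large model
    # A datapoint "gets worse" if the small model reliably solves it
    # but the large model reliably does not.
    worse = (p_small > threshold) & (p_large < threshold)
    return worse.mean()

# Illustrative usage with random stand-in data (5 seeds, 1000 examples).
rng = np.random.default_rng(0)
small = rng.binomial(1, 0.70, size=(5, 1000)).astype(float)
large = rng.binomial(1, 0.75, size=(5, 1000)).astype(float)
print(f"fraction of examples that got worse: {fraction_worse(small, large):.3f}")
```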
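For the second finding, the abstract does not commit to a specific similarity measure; one standard choice from the literature is linear centered kernel alignment (CKA), sketched below under the assumption that each representation is an (n_examples, n_features) activation matrix.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n, d1), Y: (n, d2) are activations for the same n inputs from
    two layers or two models. Returns a similarity in [0, 1] that is
    invariant to orthogonal transformations and isotropic scaling.
    """
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Illustrative usage: compare a layer to a noisy linear transform of itself.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
Y = X @ rng.standard_normal((64, 32)) + 0.1 * rng.standard_normal((500, 32))
print(f"linear CKA: {linear_cka(X, Y):.3f}")
```

Comparing such scores layer by layer across models of different depths is one way to make the claim about "similar computations in a similar order" quantitative.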

Beyond these specific observations, I will tie measurement to historical trends in science and draw lessons from the successes of biology and physics in the mid-20th century.