Theoretical Foundations of Pretrained Models

Qi Lei – Princeton University

Abstract

A pre-trained model is any model that is trained on broad data at scale and can be adapted (e.g., fine-tuned) to a wide range of downstream tasks. The rise of pre-trained models (e.g., BERT, GPT-3, CLIP, Codex, MAE) has transformed applications in various domains and aligns with how humans learn. Humans and animals first form concepts or impressions from different data domains and modalities. The learned concepts then help them learn specific tasks with minimal external instruction. Accordingly, we argue that, through the lens of deep representation learning, a pre-trained model follows a similar two-step procedure: 1) learn a data representation that filters out irrelevant information from the training tasks; 2) transfer the data representation to downstream tasks with few labeled samples and simple models.

This talk establishes some theoretical understanding of pre-trained models under different settings, ranging from supervised pretraining, meta-learning, and self-supervised learning to domain adaptation or domain generalization. I will discuss sufficient (and sometimes necessary) conditions for pre-trained models to work, based on the statistical relation between training and downstream tasks. The theoretical analyses partly answer how pre-trained models work and when they fail, guide technical decisions for future work, and inspire new methods.

I will mostly talk about the following three papers:

Few-shot learning via learning the representation, provably (ICLR 2021)

Predicting what you already know helps: Provable self-supervised learning (NeurIPS 2021)

How fine-tuning allows for effective meta-learning (NeurIPS 2021)

I also intend to briefly cover A Theory of Label Propagation for Subpopulation Shift (ICML 2021).