Next-Symbol Prediction Without Mixing: Optimal Rates, Algorithms, and Hardness

YIHONG WU – YALE UNIVERSITY

ABSTRACT

Consider the following “ChatGPT-style” problem: observing a time series X_1,…,X_n, one is tasked with predicting the next (unseen) symbol X_{n+1}. Mathematically, this boils down to estimating the conditional distribution P_{X_{n+1}|X_1,…,X_n}, which informs downstream tasks such as sampling for text generation or estimating the most likely realizations for autocomplete. This is a well-defined but non-standard statistical problem, in that the estimand is random and data-dependent, unless the data are i.i.d., in which case the problem reduces to density estimation. For dependent data, most of the existing literature focuses on parameter estimation, which relies on favorable mixing-time (spectral gap) conditions that are in fact not needed for prediction. (Indeed, a chain that moves at a glacial pace is easy to predict, but estimating its transition probabilities is impossible.) Dispensing with mixing assumptions allows for an assumption-free framework for studying next-symbol prediction, but it also creates new technical challenges due to the lack of concentration for sample-path statistics.
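
To make the setup concrete, here is a minimal sketch (not the estimator from the talk) of how one might estimate P_{X_{n+1}|X_n} for a finite-state Markov chain from a single sample path, using add-alpha smoothed empirical transition counts; the function name and the slowly mixing two-state example chain below are illustrative assumptions only.

import numpy as np

def next_symbol_estimate(x, k, alpha=1.0):
    """Estimate P(X_{n+1} = . | X_n) from one path x of a k-state Markov chain,
    using add-alpha smoothed empirical transition counts.
    Illustrative baseline only; not the procedure analyzed in the talk."""
    counts = np.zeros((k, k))
    for prev, nxt in zip(x[:-1], x[1:]):
        counts[prev, nxt] += 1
    row = counts[x[-1]] + alpha      # add-alpha (Laplace) smoothing of the last row
    return row / row.sum()           # estimated next-symbol distribution

# Example: a chain that moves at a glacial pace is easy to predict,
# since it rarely leaves its current state, even though its transition
# probabilities are hard to estimate from a short path.
rng = np.random.default_rng(0)
P = np.array([[0.99, 0.01], [0.01, 0.99]])   # very slow mixing
x = [0]
for _ in range(999):
    x.append(rng.choice(2, p=P[x[-1]]))
print(next_symbol_estimate(x, k=2))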

In this talk, we introduce information-theoretic techniques, in particular those rooted in universal compression, to determine the optimal rate of prediction risk for Markov chains, which is achieved by computationally efficient algorithms. Extensions to models with long-term memory, including hidden Markov models, and the associated computational challenges will also be discussed. This is based on joint work with Yanjun Han (NYU), Soham Jana (Notre Dame), and Tianze Jiang (Princeton).
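
For a flavor of the connection to universal compression, the sketch below applies a Krichevsky–Trofimov-style (add-1/2) sequential rule per state of the chain: the cumulative log-loss it incurs equals the length of a universal code for the sequence. This is only an illustrative link, under assumed notation, and not the optimal procedure from the talk.

import numpy as np

def kt_sequential_predictions(x, k):
    """Sequentially predict X_{t+1} from X_1..X_t with a per-state
    Krichevsky-Trofimov (add-1/2) rule; illustrative only."""
    counts = np.zeros((k, k))
    preds, log_loss = [], 0.0
    for t in range(len(x) - 1):
        state = x[t]
        p = (counts[state] + 0.5) / (counts[state].sum() + 0.5 * k)
        preds.append(p)
        log_loss += -np.log2(p[x[t + 1]])   # codelength charged for the next symbol
        counts[state, x[t + 1]] += 1
    return preds, log_loss                  # total log-loss = universal codelength

# Usage on a short binary sequence:
x = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
_, total_bits = kt_sequential_predictions(x, k=2)
print(total_bits)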

Related Papers: