Large Scale Prediction with Decision Trees

Jason Klusowski – Princeton University

Abstract

Decision trees are one of the most elemental methods for classification problems. They are often deployed in contexts where many explanatory variables are observed and where a high importance is placed on the simplicity and interpretability of the fitted model, such as business and medicine. In this talk, I will show that trees trained with C4.5 methodology are consistent for certain nonparametric classification models, even when the number of explanatory variables grows sub-exponentially with the sample size, under both weak and strong forms of sparsity. These statistical guarantees highlight the crucial role that data dependent splits play in determining the adaptive properties of the tree.