A Causal Perspective on When Domain Adaptation Algorithms Succeed or Fail

Yuansi Chen – Duke University

Abstract

Modern large-scale datasets are often collected using multiple methods and from multiple data sources. This data heterogeneity makes it difficult for classical algorithms that assume i.i.d. data to achieve good prediction performance on new unseen data. Domain adaptation (DA) arises as an important problem in which the source data used to train a model differs from the target (new unseen) data used to test the model. Recent advances in DA have mainly been application-driven, especially in the fields of image recognition and text classification. While DA is empirically beneficial in these applications, it is also known to fail if applied blindly. Motivated by the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models (SCM) that enables analysis and comparison of the prediction performance of DA methods. In particular, we prove that under linear SCM, the popular DA method called DIP is guaranteed to have a low target error when the prediction problem is anti-causal without label distribution perturbation. However, DIP fails to outperform the estimator trained solely on the source data when these assumptions are not met. We show that better DA methods exist in the presence of multiple heterogeneous source datasets.

Link to Paper: https://arxiv.org/abs/2010.15764