Statistical Machine Learning for Genetics and Health: Multi-modality, Interpretability, Mechanism

Bianca Dumitrascu – University of Cambridge


Genomic and medical data are available at unprecedented scales. This is due, in part, to improvements and developments in data collection, high-throughput sequencing, and imaging technologies. How can we extract lower-dimensional representations from these high-dimensional data in a way that retains fundamental biological properties across different scales? Three main challenges arise in this context: how to aggregate information across different experimental modalities, how to enforce that such representations are interpretable, and how to leverage prior dynamical knowledge to provide new insight into mechanism. I will present my work on developing statistical machine learning models and algorithms to address these challenges. First, I will present a generative model for learning representations that jointly model information from gene expression and tissue morphology in a population setting. Then, I will describe a method for making multi-modal representations interpretable using a label-aware compressive classification approach for gene panel selection in single-cell data. Finally, I will discuss inference methods for models that encode mechanistic assumptions, a need that arises naturally in gene regulatory networks, predator-prey systems, and electronic health records. Throughout this work, recent advances in machine learning and statistics are harnessed to bridge two worlds: the world of real, messy biological data and that of methodology and computation. This talk describes the importance of domain knowledge and data-centric modeling in motivating new statistical avenues and introduces new ideas that touch upon improving experimental design in biomedical contexts.

My talk is built around three pillars; the relevant papers are as follows: