Abhishek Chakrabortty

  • Postdoctoral Researcher

Contact Information

  • Office Address:

    431-1 Jon M. Huntsman Hall,
    3730 Walnut Street,
    Philadelphia, PA 19104

Research Interests: Semi-supervised learning; High dimensional statistics; Semi-parametric inference; Causal inference and missing data; Biomedical applications.

Links: CV, Awards, Publications & Preprints, Working Papers, Teaching

Overview

Note: I moved to Texas A&M University in August 2019. To contact me, please use my new email addresses, abhishek@stat.tamu.edu or achakrabortty@tamu.edu, rather than my Wharton email address.

This is a temporary webpage and will not be active or updated after August 2019.

Education

Ph.D. in Biostatistics, Harvard University, 2016.

Master of Statistics (M. Stat.), Indian Statistical Institute, 2011.

Bachelor of Statistics (B. Stat.), Indian Statistical Institute, 2009.

Biosketch

Since Fall 2019, I have been a tenure-track Assistant Professor in the Department of Statistics at Texas A&M University.

Previously, I was a Postdoctoral Researcher at the Department of Statistics and the Department of Biostatistics, Epidemiology & Informatics (DBEI), University of Pennsylvania, where I was mentored by Prof. T. Tony Cai and Prof. Hongzhe Li.

I received my Ph.D. from the Department of Biostatistics, Harvard University, where I was advised by Prof. Tianxi Cai, and my Bachelor's and Master's degrees in Statistics from the Indian Statistical Institute, Kolkata, India. I was born and raised in Kolkata, India, and I am also a proud alumnus of South Point High School, Kolkata.

More details on my academic career can be found in my Curriculum Vitae (CV).

I am broadly interested in the various aspects of modern statistical inference and learning, including theory, methods and algorithms, and their applications in the analysis of large datasets arising in various scientific disciplines in the big data era.

Feel free to explore my website for more details on my research, as well as my teaching activities and my awards and distinctions. Please email me if you have any comments, suggestions or questions regarding my work. They are most welcome!

 


Research

My research focuses on developing efficient and robust statistical methods, with provable theoretical guarantees and scalable implementations, for analyzing the large-scale, complex and high dimensional datasets encountered in modern studies, especially biomedical studies. The goal is to draw useful statistical inference and to address meaningful scientific or policy-related questions.

Research Interests

My research interests broadly lie at the interface of semi-parametric inference, high dimensional statistics and statistical learning in semi-supervised settings and weakly supervised settings, with applications in the statistical analysis of large and complex observational datasets arising in modern biomedical studies. Some of my specific methodological and application based research interests are listed below.

Methodology: Semi-supervised inference; Semi-parametric inference with high dimensional data; Missing data and causal inference; High dimensional inference; Regularized estimation; Non-asymptotic performance guarantees.

Applications: Discovery research using electronic medical records (EMR) data; Automated phenotyping; Personalized medicine (treatment selection, treatment effects estimation, risk prediction, comparative effectiveness, etc.).

Other interests: Concentration inequalities and tail bounds; Empirical processes; Debiasing and sample-splitting; Model misspecification; Non-parametric regression; Sufficient dimension reduction.

Publications and Preprints

Some of my publications and preprints are listed below. Please consult my CV for a more comprehensive list, including working papers and other publications. See also my Google Scholar profile.

 

  • Stephanie F. Chan, Boris P. Hejblum, Abhishek Chakrabortty, Tianxi Cai (2019), Semi-Supervised Estimation of Covariance with Application to Phenome-Wide Association Studies with Electronic Medical Records Data, Statistical Methods in Medical Research, to appear.

    Abstract: Electronic medical records (EMRs) data are valuable resources for discovery research. They contain detailed phenotypic information on individual patients, opening opportunities for simultaneously studying multiple phenotypes. A useful tool for such simultaneous assessment is the Phenome-wide association study (PheWAS), which relates a genomic or biological marker of interest to a wide spectrum of disease phenotypes, typically defined by the diagnostic billing codes. One challenge arises when the biomarker of interest is expensive to measure on the entire EMR cohort. Performing PheWAS based on supervised estimation using only subjects who have marker measurements may yield limited power. In this paper, we focus on the setting where the marker is measured on a small fraction of the patients while a few surrogate markers such as historical measurements of the biomarker are available on a large number of patients. We propose an efficient semi-supervised estimation procedure to estimate the covariance between the biomarker and the billing code, leveraging the surrogate marker information. We employ surrogate marker values to impute the missing outcome via a two-step semi-non-parametric approach and demonstrate that our proposed estimator is always more efficient than the supervised counterpart without requiring the imputation model to be correct. We illustrate the proposed procedure by assessing the association between the C-reactive protein (CRP) and some inflammatory diseases with an EMR study of inflammatory bowel disease performed with the Partners HealthCare EMR where CRP was only measured for a small fraction of the patients due to budget constraints.

  • Abhishek Chakrabortty, Preetam Nandy, Hongzhe Li (Under Review), Inference for Individual Mediation Effects and Interventional Effects in Sparse High-Dimensional Causal Graphical Models.

    Abstract: We consider the problem of identifying intermediate variables (or mediators) that regulate the effect of a treatment on a response variable. While there has been significant research on this topic, little work has been done when the set of potential mediators is high-dimensional. A further complication arises when the potential mediators are interrelated. In particular, we assume that the causal structure of the treatment, the potential mediators and the response is a directed acyclic graph (DAG). High-dimensional DAG models have previously been used for the estimation of causal effects from observational data. In particular, methods called IDA and joint-IDA have been developed for estimating the effect of single interventions and the effect of multiple simultaneous interventions respectively. In this paper, we propose an IDA-type method, called MIDA, for estimating mediation effects from high-dimensional observational data. Although IDA and joint-IDA estimators have been shown to be consistent in certain sparse high-dimensional settings, their asymptotic properties such as convergence in distribution and inferential tools in such settings remained unknown. In this paper, we prove high-dimensional consistency of MIDA for linear structural equation models with sub-Gaussian errors. More importantly, we derive distributional convergence results for MIDA in similar high-dimensional settings, which are applicable to IDA and joint-IDA estimators as well. To the best of our knowledge, these are the first distributional convergence results facilitating inference for IDA-type estimators. These results have been built on our novel theoretical results regarding uniform bounds for linear regression estimators over varying subsets of high-dimensional covariates, which may be of independent interest. 
    Finally, we empirically demonstrate the usefulness of our asymptotic theory in the identification of large mediation effects and we illustrate a practical application of MIDA in genomics with a real dataset.

  • David Cheng, Abhishek Chakrabortty, Ashwin N. Ananthakrishnan, Tianxi Cai (Under Revision), Estimating Average Treatment Effects with a Double-Index Propensity Score.

    Abstract: We consider estimating average treatment effects (ATE) of a binary treatment in observational data when data-driven variable selection is needed to select relevant covariates from a moderately large number of available covariates X. To leverage covariates among X predictive of the outcome for efficiency gain while using regularization to fit a parametric propensity score (PS) model, we consider a dimension reduction of X based on fitting both working PS and outcome models using adaptive LASSO. A novel PS estimator, the Double-index Propensity Score (DiPS), is proposed, in which the treatment status is smoothed over the linear predictors for X from both the initial working models. The ATE is estimated by using the DiPS in a normalized inverse probability weighting (IPW) estimator, which is found to maintain double-robustness and also local semiparametric efficiency with a fixed number of covariates p. Under misspecification of working models, the smoothing step leads to gains in efficiency and robustness over traditional doubly-robust estimators. These results are extended to the case where p diverges with sample size and working models are sparse. Simulations show the benefits of the approach in finite samples. We illustrate the method by estimating the ATE of statins on colorectal cancer risk in an electronic medical record (EMR) study and the effect of smoking on C-reactive protein (CRP) in the Framingham Offspring Study.

  • Abhishek Chakrabortty, Matey Neykov, Raymond J. Carroll, Tianxi Cai (Under Review), Surrogate Aided Unsupervised Recovery of Sparse Signals in Single Index Models for Binary Outcomes.

    Abstract: We consider the recovery of regression coefficients, denoted by β0, for a single index model (SIM) relating a binary outcome Y to a set of possibly high dimensional covariates X, based on a large but unlabeled dataset U, with Y never observed. On U, we fully observe X and additionally, a surrogate S which, while not being strongly predictive of Y throughout the entirety of its support, can forecast it with high accuracy when it assumes extreme values. Such datasets arise naturally in modern studies involving large databases such as electronic medical records (EMR) where Y, unlike (X, S), is difficult and/or expensive to obtain. In EMR studies, an example of Y and S would be the true disease phenotype and the count of the associated diagnostic codes respectively. Assuming another SIM for S given X, we show that under sparsity assumptions, we can recover β0 proportionally by simply fitting a least squares LASSO estimator to the subset of the observed data on (X, S) restricted to the extreme sets of S, with Y imputed using the surrogacy of S. We obtain sharp finite sample performance bounds for our estimator, including deterministic deviation bounds and probabilistic guarantees. We demonstrate the effectiveness of our approach through multiple simulation studies, as well as by application to real data from an EMR study conducted at the Partners HealthCare Systems.

  • Arun Kumar Kuchibhotla and Abhishek Chakrabortty (Under Review), Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression.

    Abstract: Concentration inequalities form an essential toolkit in the study of high-dimensional statistical methods. Most of the relevant statistics literature in this regard is, however, based on the assumptions of sub-Gaussian/sub-exponential random vectors. In this paper, we first bring together, via a unified exposition, various probability inequalities for sums of independent random variables under much weaker exponential type (sub-Weibull) tail assumptions. These results extract a part sub-Gaussian tail behavior of the sum in finite samples, matching the asymptotics governed by the central limit theorem, and are compactly represented in terms of a new Orlicz quasi-norm -- the Generalized Bernstein-Orlicz norm -- that typifies such kind of tail behaviors. We illustrate the usefulness of these inequalities through the analysis of four fundamental problems in high-dimensional statistics. In the first two problems, we study the rate of convergence of the sample covariance matrix in terms of the maximum elementwise norm and the maximum $k$-sub-matrix operator norm which are key quantities of interest in bootstrap procedures and high-dimensional structured covariance matrix estimation. The third example concerns the restricted eigenvalue condition, required in high dimensional linear regression, which we verify for all sub-Weibull random vectors under only marginal (not joint) tail assumptions on the covariates. To our knowledge, this is the first unified result obtained in such generality. In the final example, we consider the Lasso estimator for linear regression and establish its rate of convergence to be generally $\sqrt{k\log p/n}$, for $k$-sparse signals, under much weaker tail assumptions (on the errors as well as the covariates) than those in the existing literature. The common feature in all our results is that the convergence rates under most exponential tails match the usual ones obtained under sub-Gaussian assumptions. 
    Finally, we also establish a high-dimensional central limit theorem with a concrete rate bound for sub-Weibulls, as well as tail bounds for suprema of empirical processes. All our results are finite sample.

  • Abhishek Chakrabortty and Tianxi Cai (2018), Efficient and Adaptive Linear Regression in Semi-Supervised Settings, The Annals of Statistics, 46 (4), pp. 1541-1572. 10.1214/17-AOS1594

    Abstract: We consider the linear regression problem under semi-supervised settings wherein the available data typically consists of: (i) a small or moderate sized ‘labeled’ data, and (ii) a much larger sized ‘unlabeled’ data. Such data arises naturally from settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large databases like electronic medical records (EMR). Supervised estimators like the ordinary least squares (OLS) estimator utilize only the labeled data. It is often of interest to investigate if and when the unlabeled data can be exploited to improve estimation of the regression parameter in the adopted linear model. In this paper, we propose a class of ‘Efficient and Adaptive Semi-Supervised Estimators’ (EASE) to improve estimation efficiency. The EASE are two-step estimators adaptive to model mis-specification, leading to improved (optimal in some cases) efficiency under model mis-specification, and equal (optimal) efficiency under a linear model. This adaptive property, often unaddressed in the existing literature, is crucial for advocating ‘safe’ use of the unlabeled data. The construction of EASE primarily involves a flexible ‘semi-non-parametric’ imputation, including a smoothing step that works well even when the number of covariates is not small; and a follow up ‘refitting’ step along with a cross-validation (CV) strategy both of which have useful practical as well as theoretical implications towards addressing two important issues: under-smoothing and over-fitting. We establish asymptotic results including consistency, asymptotic normality and the adaptive properties of EASE. We also provide influence function expansions and a ‘double’ CV strategy for inference. The results are further validated through extensive simulations, followed by application to an EMR study on auto-immunity.

  • Abhishek Chakrabortty and Tianxi Cai (Working), A Unified Framework for Robust and Adaptive Z-Estimation in Semi-Supervised Settings.

  • Sheng Yu, Abhishek Chakrabortty, Katherine P. Liao et al. (2017), Surrogate-Assisted Feature Extraction for High-Throughput Phenotyping, Journal of the American Medical Informatics Association, 24 (e1), pp. 143-149. 10.1093/jamia/ocw135

    Abstract: Objective: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. Methods: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype’s International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. Results: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn’s disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. Conclusion: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces over-fitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts.

  • Abhishek Chakrabortty (2016), Robust Semi-Parametric Inference in Semi-Supervised Settings, Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

    Abstract: In this dissertation, we consider semi-parametric estimation problems under semi-supervised (SS) settings, wherein the available data consists of a small or moderate sized labeled data (L), and a much larger unlabeled data (U). Such data arises naturally in settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large electronic databases. It is often of interest in such SS settings to investigate if and when U can be exploited to improve the estimation efficiency, compared to supervised estimators based on L only. In Chapter 1, we propose a class of Efficient and Adaptive Semi-Supervised Estimators (EASE) for SS linear regression. These are semi-non-parametric imputation based two-step estimators adaptive to model mis-specification, leading to improved efficiency under model mis-specification, and equal (optimal) efficiency when the linear model holds. This adaptive property is crucial for advocating safe use of U. We provide asymptotic results establishing our claims, along with methods for inference based on EASE, followed by simulations and application to real data. In Chapter 2, we provide a unified framework for SS M/Z-estimation problems, based on general estimating equations, and propose a family of EASE estimators that are always as efficient as the supervised estimator and more efficient whenever U is actually informative for the parameter of interest. For a subclass of these problems, we also provide a flexible semi-non-parametric imputation strategy for constructing such EASE estimators. We provide asymptotic results establishing our claims, as well as methods for inference, followed by simulations and application to real data. In Chapter 3, we consider regressing a binary outcome (Y) on a set of (possibly high dimensional) covariates (X) based on a large but unlabeled data with observations only for X, and additionally, a surrogate (S) which can predict Y with high accuracy but only when it assumes extreme values. Assuming Y and S both follow single index models versus X, we show that under sparsity assumptions, we can recover the regression parameter of Y versus X through a simple least squares LASSO estimator fitted to the subset of the data restricted to the extreme sets of S with Y imputed using the surrogacy of S. We provide sharp finite sample performance guarantees for our estimator, followed by simulations and application to real data.

  • Abhishek Chakrabortty (2011), Association Mapping of Discrete Phenotypes Using Poisson Regression, Master's Thesis, Indian Statistical Institute, Kolkata, India.

    Abstract: In this dissertation, we study the extent of association of a given marker locus with the unknown causal locus of a discrete phenotype, a count variable with no prespecified range of variation, based on a Poisson regression model. Assuming the causal and the marker loci are both bi-allelic and conform to the Hardy-Weinberg frequencies, and using the notion of the genetic distance (GD) between two loci as the primary criterion for their association, we develop the model set-up, addressing necessary convergence issues of the maximum likelihood estimates of the parameters obtained by Fisher scoring, and prove the equivalence of the significance of the GD to that of the slope parameter of the model. With this equivalence established, the extent of association is then studied by subjecting the slope parameter to standard asymptotic tests of significance such as the log-likelihood ratio test, the Wald test and the score test. Extensive simulation studies are implemented, based on which we present a comparative study of the asymptotic tests used, in terms of their empirical powers and levels at different sample sizes and parameter combinations. We next present a detailed analysis where we compare the performance of our model-based tests with that of ANOVA, one of the most common approaches used for analyzing association mapping problems. Finally, we shift our attention to a new set-up where we have family-based data instead of population data. We propose a modified model and two new model-based tests for investigating the association based on such data, and follow it up with extensive simulation studies where we analyze the empirical levels and the powers achieved by these tests under various parameter combinations. Possible improvements of the results and potential areas of further research, including the issue of stratification and extension of the models to multiple marker loci, are also discussed in the concluding sections of the report.

Teaching

Instructor at the Department of Statistics, Texas A&M University:

  1. Fall 2019 – STAT 651 (Statistics in Research I).

Instructor at the Department of Biostatistics, Harvard University:

  1. Summer 2015 – Operational Mathematics (a math camp on college level theory of linear algebra and real analysis, as part of the Biostatistics Summer Preparatory Courses for incoming doctoral students).

Teaching Assistant (TA) at the Department of Biostatistics, Harvard University for the following graduate level courses:

  1. Spring 2013 – BIOSTAT 244 (Analysis of Failure Time Data); Instructor: Dr. Judith Lok.
  2. Fall 2013 – BIOSTAT 235 (Advanced Regression and Statistical Learning); Instructor: Dr. Robert Gray.
  3. Spring 2014 – BIOSTAT 244 (Analysis of Failure Time Data); Instructor: Dr. Judith Lok.
  4. Fall 2014 – BIOSTAT 235 (Advanced Regression and Statistical Learning); Instructor: Dr. Robert Gray.

Teaching awards:

  1. Certificate of Teaching Excellence (Spring 2013, Fall 2013 and Spring 2014), awarded by the Harvard Graduate School of Arts and Sciences.
  2. Certificate of Distinction in Teaching (2012-13), awarded by the Harvard School of Public Health.

Awards and Honors

(1) IMS New Researcher Travel Award (2019) and IMS Travel Award (2016), awarded by the Institute of Mathematical Statistics (IMS).

(2) Joshua E. Neimark Memorial Travel Assistance Award (2014), awarded by the American Association for the Advancement of Science (AAAS).

(3) NBHM Postgraduate Scholarship (2009-11) in mathematical sciences, awarded by the National Board for Higher Mathematics (NBHM), Government of India.

(4) Certificates of Distinction and Excellence in Teaching (Spring 2013, Fall 2013 and Spring 2014), awarded by the Harvard Graduate School of Arts and Sciences (GSAS).

(5) Certificate of Distinction in Teaching (2012-13), awarded by the Harvard School of Public Health (HSPH) and the Department of Biostatistics, Harvard University.

(6) Finalist for the American Statistical Association (ASA) Non-parametric Statistics Section Student Paper Award (2015).

(7) Awards for semestral performances (2006-11) at Indian Statistical Institute (ISI).

Miscellaneous

Working Papers (2019+)

(1) Abhishek Chakrabortty, Jiarui Lu, T. Tony Cai and Hongzhe Li (2019+). High Dimensional M-Estimation with Missing Outcomes: A General Semi-Parametric Framework. Manuscript in preparation. (Working draft) (Slides)

(2) Abhishek Chakrabortty and Arun Kumar Kuchibhotla (2019+). Tail Bounds for Canonical U-Statistics and U-Processes with Unbounded Kernels. Manuscript in preparation. (Working draft)

(3) Abhishek Chakrabortty, T. Tony Cai and Hongzhe Li (2019+). High Dimensional Semi-Supervised Regression: Robust and Adaptive Inference and the Multi-fold Benefits of Unlabeled Data. Manuscript in preparation.

Other Publications

(1) Sian Y. Lim, Sara R. Schoenfeld, Abhishek Chakrabortty et al. (2016). Improving Predictive Value of Gout Case Definitions in Electronic Medical Records Utilizing Natural Language Processing: A Novel Informatics Approach. Arthritis and Rheumatology 2016, 68 (Suppl. 10). (Link)

(2) Saurabh Ghosh and Abhishek Chakrabortty (2014). A Poisson Regression Model for Association Mapping of Count Phenotypes. Molecular Cytogenetics, 7 (Suppl. 1): O1. (Link)

(3) Bhaswar B. Bhattacharya, Abhishek Chakrabortty, Shirshendu Ganguly and Shyamalendu Sinha (2009). Visual Cryptographic Schemes for Color Images with Low Pixel Expansion. In Proceedings of the 9th National Workshop on Cryptology 2009 (Surat, India): 64-69. (PDF)
