Advancing cancer research using modern approaches to missingness in real-world data - Dr. Rebecca Hubbard

Dr. Rebecca Hubbard

DCEG BB Seminar

October 27, 2022 | 10:30 AM – 11:30 AM



Rebecca A. Hubbard, Ph.D., Professor of Statistics, Department of Biostatistics, Epidemiology, and Informatics

University of Pennsylvania School of Medicine


Real-world data (RWD), including electronic health records, represent an enormous research resource of particular value for studying rare diseases, generating timely evidence in settings where limited or no treatment options are available, and investigating questions about health equity. However, RWD have many limitations, including complex patterns of missing data induced by the irregularity of interaction between patients and the healthcare system. Novel approaches to handling missing data, including machine learning-based imputation methods, have been touted as a potential solution to this problem, but evaluation of their performance in the context of real-world comparative effectiveness research (CER) is lacking. In RWD-based CER, missingness can be handled in multiple ways. Multiple imputation (MI) can be used to impute variables with missingness prior to estimation of a propensity score. Alternatively, propensity score calibration (PSC) transforms this missing data problem into a measurement error problem. The PSC approach has the potential to alleviate the computational burden of MI in large RWD databases. I will present a comparative evaluation of standard and novel methods for addressing missing data in the context of a real-world study of the comparative effectiveness of immunotherapy and chemotherapy for treatment of advanced urothelial cancer. Using plasmode simulation grounded in this context, we compare the performance of traditional and machine learning-based imputation methods, as well as MI and PSC. We identify missing data settings in which modern approaches have promise and those in which the greater flexibility of these methods potentially results in overfitting and poor statistical performance. I will conclude with reflections on how we can embrace modern advances in data and methodology while preserving time-tested principles that ensure research rigor and reproducibility.


Danping Liu, Ph.D., Investigator, Biostatistics Branch
