Missing Data Imputation in RWD: Exploration of Multiple Techniques on Open-Source Data


Project Scope 

We aim to publish a white paper that explores multiple models available for missing data imputation and shares with the wider group the potential of each model and its efficiency in dealing with different kinds of missing data. Model efficiency will be compared using a single open-source simulated dataset.
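As a rough illustration of what such a comparison could look like, the sketch below hides a fraction of values in a small synthetic dataset, fills them in with two simple techniques, and scores each technique against the known true values. It is only a sketch under stated assumptions: the data are generated in plain Python rather than taken from the project's chosen dataset, the missingness is assumed to be completely at random, and every function and variable name is hypothetical. The two techniques (mean imputation and single-predictor regression imputation) are stand-ins chosen for brevity, not the models the project has selected.

```python
import math
import random


def simulate(n, seed=0):
    """Generate (x, y) pairs where y depends linearly on x plus noise (assumed toy data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(50.0, 10.0)
        y = 2.0 * x + rng.gauss(0.0, 5.0)
        rows.append((x, y))
    return rows


def mask_mcar(rows, rate, seed=1):
    """Set y to None with probability `rate`, independently of x and y (MCAR assumption)."""
    rng = random.Random(seed)
    return [(x, None if rng.random() < rate else y) for x, y in rows]


def impute_mean(masked):
    """Replace each missing y with the mean of the observed y values."""
    observed = [y for _, y in masked if y is not None]
    mean_y = sum(observed) / len(observed)
    return [mean_y if y is None else y for _, y in masked]


def impute_regression(masked):
    """Replace each missing y with the prediction of a least-squares line fitted on observed rows."""
    obs = [(x, y) for x, y in masked if y is not None]
    n = len(obs)
    mean_x = sum(x for x, _ in obs) / n
    mean_y = sum(y for _, y in obs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in masked]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error, computed only over the entries that were masked."""
    errors = [
        (t[1] - value) ** 2
        for t, m, value in zip(truth, masked, imputed)
        if m[1] is None
    ]
    return math.sqrt(sum(errors) / len(errors))


truth = simulate(1000)
masked = mask_mcar(truth, rate=0.2)

rmse_mean = rmse_on_missing(truth, masked, impute_mean(masked))
rmse_reg = rmse_on_missing(truth, masked, impute_regression(masked))

print(f"mean imputation RMSE:       {rmse_mean:.2f}")
print(f"regression imputation RMSE: {rmse_reg:.2f}")
```

Because the true values are known in simulated data, the same masking step could be repeated with other missingness patterns, for example making the probability of masking depend on x, to see how each technique's error changes.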

Project Statement 

Assessing whether real-world data (RWD) is fit to support or generate real-world evidence (RWE) requires certain processes to be studied, with the ICH E9 guideline also taken into account. One of the core areas is the handling of missing data. In practice, we cannot expect all dependent variables in RWD to be complete. Sponsors should use strategies to account for missing data, correct for redundant data and resolve any inconsistencies in the data. To help make such data fit for regulatory submission and meet regulatory expectations, we shall explore several imputation techniques on a single open-source simulated dataset (the Synthea database).

Project Impact 

The results of the project will help the industry evaluate which model is best suited to missing data imputation under different missingness scenarios. The document Real-World Data: Assessing Registries to Support Regulatory Decision-Making for Drug and Biological Products, in its subsection 'Considerations When Linking a Registry to Another Registry or Another Data System', states that sponsors should use strategies to account for missing data, correct for redundant data, resolve any inconsistencies in the data, and address other potential problems, such as protecting patient privacy while transferring data securely. However, these strategies have not yet been distinguished or discussed with any regulatory body. We aim to be the first in the industry to list all potential challenges and limitations in missing data imputation. In addition, relying entirely on open-source technology should be cost-effective, and it is a direction that the industry as a whole is supporting and moving towards.

Project Leads (Email)

Likhita Kolli, GSK: likhita.x.kolli@gsk.com

Nicola Newton, PHUSE Project Assistant: nicky@phuse.global

CURRENT STATUS Q4 2024

  • Targets: decide on the data source and ask the statisticians to share the techniques to explore for each data type. Split into teams to work on project set-up (Apache Spark) and create a GitHub repository.

Objectives & Deliverables (Timelines)

White Paper: Q3 2025
