Machine learning (ML) continues to make inroads in many industry applications. That said, in at least four focus areas (healthcare, finance, insurance, and critical infrastructure), data privacy requirements and data sparsity limit the estimation and deployment of ML-enabled algorithms. Recent advances in data synthesis show promise for addressing these challenges. If realistic data can be synthesized in a way that still reflects, to a sufficient degree, the granular properties of the underlying ground-truth data, ML can be extended in substantial and scalable ways. Success here would materially improve decision-support systems in industries that still struggle to deploy industrial-strength ML-enabled models.
Early approaches (Sklar, 1959) combine univariate marginal distributions with a copula that captures the dependence between variables. Others have extended this probabilistic approach to remedy a range of issues with Sklar's approach (Kamthe et al., 2021). Another interesting (and popular) thread focuses on generative adversarial networks (GANs) (Goodfellow et al., 2020) and their extensions, such as the conditional generative adversarial net (CGAN) (Fu et al., 2019).
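As a rough illustration of the marginals-plus-copula idea, the Python sketch below fits a Gaussian copula to two columns and samples synthetic pairs from it. The choice of a Gaussian copula, the use of empirical marginals, and the names normal_scores and synthesize_pairs are assumptions made for this example only; they are not taken from the cited works.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(values):
    """Map values to standard-normal scores through their empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), as inv_cdf requires.
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def synthesize_pairs(data, n_samples, seed=0):
    """Sample synthetic (x, y) pairs from a Gaussian copula fitted to data.

    Hypothetical helper for illustration; assumes at least two rows.
    """
    rng = random.Random(seed)
    n = len(data)
    xs = [row[0] for row in data]
    ys = [row[1] for row in data]

    # Dependence: correlation of the normal scores (the copula parameter).
    zx, zy = normal_scores(xs), normal_scores(ys)
    rho = sum(a * b for a, b in zip(zx, zy)) / math.sqrt(
        sum(a * a for a in zx) * sum(b * b for b in zy))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot

    # Marginals: empirical quantiles of each column.
    sorted_x, sorted_y = sorted(xs), sorted(ys)

    synthetic = []
    for _ in range(n_samples):
        # Draw correlated normals, map them to uniforms, then to quantiles.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        synthetic.append((sorted_x[min(int(u1 * n), n - 1)],
                          sorted_y[min(int(u2 * n), n - 1)]))
    return synthetic
```

Because the sketch resamples observed values for each marginal, it preserves the per-column distributions and the rank dependence, but it offers no privacy guarantee on its own; a practical generator would likely fit smooth marginals and add explicit privacy controls.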
In this track, we will review some of the more promising approaches to synthetic data generation. We will also describe applications of these approaches to show how data privacy and data sparsity can be addressed in practice.