From Real to Synthetic Data: Ensuring Quality
Jan 15, 2025
Nora Amama-BenHassun
BIOSTATNET2025
Abstract
The growing reluctance to share original datasets and the increasing demand to comply with privacy regulations have motivated the adoption of synthetic data. Synthetic data replicates the statistical properties of the original datasets while ensuring that individual-level information or sensitive variables are not disclosed (Nowok et al., 2016; Raab et al., 2017). Evaluating the quality of synthetic data, however, requires developing and refining validation metrics (Snoke et al., 2018; Raab et al., 2021). This assessment is what establishes the usability and reliability of synthetic datasets.
This research introduces some existing validation metrics implemented in tools such as the synthpop package. The focus is on synthetic tabular data, with an emphasis on presenting a comprehensive list of statistically grounded validation metrics that can serve as a foundation for the development of new ones. To address the challenges of validating synthetic data, the research highlights methodologies tailored to specific domains, such as energy, which pose unique challenges. Synthetic data offers opportunities to accelerate model training while complying with privacy regulations. By developing robust metrics, the goal is to provide a practical framework for validating high-quality synthetic datasets that meet the needs of sensitive fields. These metrics will be presented to demonstrate their relevance and potential impact, ultimately addressing significant gaps in the literature on synthetic data validation in the energy sector.
As highlighted by Raab (2022), validation metrics can be categorized into three key dimensions: resemblance, utility, and privacy. Resemblance metrics, such as Propensity Score Mean-Squared Error (pMSE) or Kolmogorov-Smirnov Statistic (SPECKS), evaluate the similarity in the statistical distributions between the synthetic and original datasets. Utility metrics include measures like the Voas-Williamson Utility Measure (VW), which assess the suitability of synthetic data for specific analytical tasks, such as machine learning or statistical modeling. Privacy metrics assess whether sensitive information from the original data can be reconstructed or identified.
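To make the two propensity-based resemblance metrics concrete, the following is a minimal Python sketch, not the synthpop implementation. It assumes that propensity scores (the estimated probability that a record is synthetic) have already been obtained from some classifier fitted on the stacked original and synthetic data. It also assumes the commonly used definitions: pMSE as the mean squared difference between each score and the synthetic share of the combined data, and SPECKS as the Kolmogorov-Smirnov distance between the scores of original and synthetic records. The function names `pmse` and `specks` are hypothetical.

```python
from bisect import bisect_right


def pmse(scores, n_synthetic):
    """Mean squared difference between propensity scores and the synthetic share.

    `scores` holds one propensity score per record of the combined
    (original + synthetic) data; `n_synthetic` is the number of synthetic records.
    """
    n_total = len(scores)
    c = n_synthetic / n_total  # expected score if the two datasets are indistinguishable
    return sum((p - c) ** 2 for p in scores) / n_total


def specks(scores_original, scores_synthetic):
    """Kolmogorov-Smirnov distance between the two empirical score distributions."""
    a = sorted(scores_original)
    b = sorted(scores_synthetic)
    points = sorted(set(a) | set(b))
    # Largest gap between the two empirical CDFs, evaluated at every observed score.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Illustrative scores only (made up, not results from any dataset).
indistinguishable = pmse([0.5, 0.5, 0.5, 0.5], n_synthetic=2)  # 0.0
fully_separated = specks([0.0, 0.0], [1.0, 1.0])               # 1.0
```

Under these definitions, values near zero for both metrics suggest that the classifier cannot tell synthetic records from original ones, while larger values point to distributional differences.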
Date
Jan 15, 2025 6:00 PM — 9:00 PM
Event
Location
ADEIT Fundación Universidad-Empresa de la Universidad de València
Plaza Virgen de la Paz, 3, Ciutat Vella, València 46001