
UTILITY MEETS PRIVACY: Paper published in IEEE Access
17 December 2024
Protecting sensitive data has long been a concern, and it has become even more pressing with the growing collection and analysis of digital data. Sensitive information is defined as information that can be used to identify an individual, directly or indirectly. At the same time, sharing data for analytics, and for machine learning in particular, is becoming critical to innovative business and social applications.
Sensitive data is collected and safeguarded by governments, the military, intelligence agencies, and healthcare and industrial organizations. In healthcare especially, machine learning promises groundbreaking benefits, but privacy concerns stand in the way of data sharing and hold back researchers and industry practitioners.
Synthetic data is therefore increasingly recognized as a viable way to create artificial analogs of sensitive datasets, enabling data sharing for analytical purposes without compromising privacy or violating data protection regulations. The basic premise is that synthetic records do not correspond to real persons or subjects, which reduces the risk of disclosing sensitive information. For synthetic data to be useful, however, its utility must be evaluated: the accuracy a model achieves on a given analytical task when trained on synthetic data is compared with the accuracy it achieves when trained on the real dataset. The closer the two results, the higher the synthetic data's utility. In his study, Julian Höllig evaluates a broad range of health datasets and synthesizers to obtain large-scale, comparable evaluation results. He also introduces a novel privacy score that integrates a privacy measure into the evaluation and quantifies the trade-off between utility and privacy.
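To make the evaluation idea concrete, the following is a minimal Python sketch of one common form of utility check: the same model is trained once on real data and once on synthetic data, both are tested on a held-out set of real records, and the two accuracies are compared. The classifier choice, the relative-utility ratio, and the weighted trade-off score are illustrative assumptions for this sketch, not the method or score defined in the paper.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    def relative_utility(X_real_train, y_real_train,
                         X_syn_train, y_syn_train,
                         X_test, y_test):
        """Relative utility: accuracy of a model trained on synthetic
        data divided by accuracy of the same model trained on real
        data, both evaluated on held-out real data (1.0 = no loss)."""
        model_real = RandomForestClassifier(random_state=0)
        model_real.fit(X_real_train, y_real_train)
        acc_real = accuracy_score(y_test, model_real.predict(X_test))

        model_syn = RandomForestClassifier(random_state=0)
        model_syn.fit(X_syn_train, y_syn_train)
        acc_syn = accuracy_score(y_test, model_syn.predict(X_test))

        return acc_syn / acc_real

    def tradeoff_score(utility, privacy, alpha=0.5):
        """Toy weighted combination of a utility score and a privacy
        score, both assumed to lie in [0, 1]. The weighting is an
        illustrative assumption, not the paper's formulation."""
        return alpha * utility + (1 - alpha) * privacy

A synthesizer that preserves utility well yields a relative utility near 1.0; combining that with a privacy measure into a single score, as sketched above, makes the trade-off between the two explicit and comparable across datasets and synthesizers.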
More about this paper: https://ieeexplore.ieee.org/document/10918632
Image source: AdobeStock/Anatthaphon (Generated with AI)