Create privacy-preserving synthetic data for machine learning with SmartNoise

The COVID-19 pandemic demonstrates the tremendous significance of sufficient and relevant data for research, causal analysis, government action, and medical progress. However, for understandable data protection considerations, individuals and decision-makers are often very reluctant to share personal or sensitive data. To ensure sustainable progress, we need new practices that enable insights from personal data while reliably protecting individuals’ privacy.

Pioneered by Microsoft Research and their collaborators, differential privacy is the gold standard for protecting data in applications that prepare and publish statistical analyses. Differential privacy provides a mathematically measurable privacy guarantee to individuals by adding a carefully tuned amount of statistical noise to sensitive data or computations. It offers significantly higher privacy protection levels than commonly used disclosure limitation practices like data anonymization. Anonymization is increasingly vulnerable to re-identification attacks, especially as more data about individuals becomes publicly available.
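To make "a carefully tuned amount of statistical noise" concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It uses only the Python standard library and is not the SmartNoise API; the function names `laplace_noise` and `dp_count` and the sample ages are made up for illustration, and the sketch assumes that neighboring datasets differ by adding or removing one person.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the items that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 38, 71, 45, 60]  # made-up illustration data
noisy_count = dp_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40 or older: {noisy_count:.2f}")
```

A smaller epsilon gives a larger noise scale and therefore stronger protection. Production libraries also harden the sampling step against floating-point side channels, which this sketch does not attempt.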

SmartNoise is jointly developed by Microsoft and Harvard’s Institute for Quantitative Social Science (IQSS) and the School of Engineering and Applied Sciences (SEAS) as part of the Open Differential Privacy (OpenDP) initiative. The platform’s initial version was launched in May 2020 and comprises mechanisms for providing differentially private results to users of analytical queries to protect the underlying dataset. The SmartNoise system includes differentially private algorithms, techniques for managing privacy budgets for subsequent queries, and other capabilities.

Workflow of a user submitting a query to a database that is protected by the SmartNoise system. After the query is processed by the privacy module and the budget store, the user receives differentially private results (e.g. counts, averages).
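The budget store in this workflow can be pictured as a simple accountant. The sketch below assumes basic sequential composition, where the epsilon values of successive queries add up; the class `PrivacyBudget` and its methods are hypothetical names for illustration, not part of SmartNoise, which may use more refined accounting.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon spent under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return max(self.total - self.spent, 0.0)

    def charge(self, epsilon: float) -> None:
        """Reserve `epsilon` for one query, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
for query_epsilon in (0.4, 0.4, 0.4):
    try:
        budget.charge(query_epsilon)
        print(f"query allowed, remaining budget: {budget.remaining:.2f}")
    except RuntimeError as err:
        print(f"query refused: {err}")
```

Refusing the query outright once the budget is spent is one possible policy; a system could equally answer with a smaller epsilon and correspondingly more noise.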

Check out the SmartNoise website to learn more. The code for the updated open source differential privacy platform is available on GitHub.

Privacy-preserving synthetic data

With the new release of SmartNoise, we are adding several synthesizers that allow creating differentially private datasets derived from unprotected data.

A differentially private synthetic dataset is generated from a statistical model based on the original dataset. The synthetic dataset represents a “fake” sample derived from the original data while retaining as many statistical traits as possible. The essential advantage of the synthesizer approach is that the differentially private dataset can be analyzed any number of times without increasing the privacy risk. Therefore, it enables collaboration between several parties, the democratization of knowledge, and open dataset initiatives.
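As a rough illustration of the idea, the sketch below builds the simplest possible "statistical model", a noisy histogram over a single categorical column, and samples synthetic records from it. This is a deliberately naive stand-in, not one of the SmartNoise synthesizers; the function `synthesize_categorical`, the category names, and the data are invented, and the sketch assumes the list of categories is fixed in advance rather than read from the sensitive data.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def synthesize_categorical(records, categories, epsilon, n_samples, rng):
    """Sample synthetic records from a Laplace-noised histogram.

    One person affects exactly one bin by 1, so noise with scale
    1 / epsilon on every bin makes the histogram differentially private.
    Sampling from it afterwards is post-processing and costs no extra budget.
    """
    counts = Counter(records)
    noisy = [
        max(counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
        for c in categories
    ]
    if sum(noisy) == 0:  # all bins clamped to zero: fall back to uniform
        noisy = [1.0] * len(categories)
    return rng.choices(categories, weights=noisy, k=n_samples)


rng = random.Random(7)
categories = ["employed", "unemployed", "retired"]  # assumed public domain
original = ["employed"] * 900 + ["unemployed"] * 90 + ["retired"] * 10
synthetic = synthesize_categorical(original, categories, 1.0, 1000, rng)
print(Counter(synthetic).most_common())
```

Because the privacy cost is paid once, when the noisy histogram is built, the resulting `synthetic` list can be queried repeatedly without further charges.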

While the synthetic dataset embodies the original data’s essential properties, it is mathematically impossible to preserve the full data value and guarantee record-level privacy at the same time. Usually, we can’t perform arbitrary statistical analysis and machine learning tasks on the synthesized dataset to the same extent as is possible with the original data. Therefore, the type of downstream task should be considered before the data is synthesized.

For instance, the workflow for producing a synthetic dataset for supervised machine learning with SmartNoise looks as follows:

High-level workflow showing how a dataset is synthesized for a machine learning task with SmartNoise: The original tabular dataset consists of features and labels. The QUAIL method combines a synthesizer and a differentially private classifier to generate a new differentially private dataset that retains the statistical properties of the original data.

Various techniques exist to generate differentially private synthetic data, including approaches based on deep neural networks, auto-encoders, and generative adversarial models.

The new release of SmartNoise includes the following data synthesizers:

[Image: table of the data synthesizers included in the new SmartNoise release]

Check out our research paper to learn more about synthesizers and their performance in machine learning scenarios.

Learn more about differential privacy

Data protection in companies, government authorities, research institutions, and other organizations is a joint effort that involves many roles, including analysts, data scientists, data privacy officers, decision-makers, regulators, and lawyers.

To make the highly effective but not always intuitive concept of differential privacy accessible to a broad audience, we have released a comprehensive whitepaper about the technique and its practical applications. In the paper, you can learn about the underestimated risks of common data anonymization practices, the idea behind differential privacy, and how to use SmartNoise in practice. Furthermore, we assess different levels of privacy protection and their impact on statistical results.

The following example compares the distribution of 50,000 income data points to differentially private histograms of the same data, each generated with a different privacy budget (controlled by the parameter epsilon).

Comparison of histograms for California income distribution. Original (unprotected) histogram plus three differentially private versions, each with a different privacy parameter. Overall, the histograms are very consistent. Minor deviations are visible when the level of protection is the highest (high amount of random noise).

Lower epsilon values lead to a higher degree of protection and are therefore associated with more intense statistical noise. Even in the aggressive privacy setting with an epsilon value of 0.05, the differentially private distribution reflects the original histogram quite well. It turns out that the error can be lowered further by increasing the amount of data.
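The relationship between epsilon, dataset size, and error can be explored with a small simulation. The sketch below is an assumption-laden stand-in for the whitepaper example: it draws made-up incomes from a log-normal distribution instead of using the California data, adds Laplace noise to each histogram bin, and reports the total absolute error relative to the number of records. The names `dp_histogram` and `relative_error` and all parameter values are illustrative only.

```python
import random
from bisect import bisect_right


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, bin_edges, epsilon, rng):
    """Return (true_counts, noisy_counts) for half-open bins [lo, hi)."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        idx = bisect_right(bin_edges, v) - 1
        if 0 <= idx < len(counts):  # values outside the edges are dropped
            counts[idx] += 1
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
    return counts, noisy


def relative_error(true_counts, noisy_counts):
    """Total absolute bin error divided by the number of binned records."""
    total_abs = sum(abs(t - n) for t, n in zip(true_counts, noisy_counts))
    return total_abs / sum(true_counts)


rng = random.Random(42)
edges = [i * 20_000 for i in range(11)]  # ten bins from 0 to 200,000
incomes = [rng.lognormvariate(10.5, 0.6) for _ in range(50_000)]  # simulated

for eps in (1.0, 0.25, 0.05):
    for n in (500, 50_000):
        true_counts, noisy = dp_histogram(incomes[:n], edges, eps, rng)
        err = relative_error(true_counts, noisy)
        print(f"epsilon={eps:<5} n={n:<6} relative error={err:.4f}")
```

The noise added to each bin depends only on epsilon, not on the number of records, which is why the same noise becomes proportionally smaller as the dataset grows.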

Accompanying the whitepaper, several Jupyter notebooks are available for you to experience SmartNoise and other differential privacy technologies in practice and adapt them to your use cases. The demo scenarios range from protecting personal data against privacy attacks and providing basic statistics to advanced machine learning and deep learning applications.


To make the differential privacy concept generally understandable, we refrain from discussing the underlying mathematical concepts. Rather, we seek to keep the technical descriptions at a high level. Nonetheless, we recommend that readers have background knowledge of machine learning concepts.

Join the SmartNoise Early Adopter Acceleration Program

We have introduced the SmartNoise Early Adopter Acceleration Program to support the usage and adoption of SmartNoise and OpenDP. This collaborative program with the SmartNoise team aims to accelerate the adoption of differential privacy in solutions today that will open up data and offer insights that benefit society.

If you have a project that would benefit from using differential privacy, we invite you to apply.

