by Rebecca Balebako and Yuval Zukerman
Generative AI is all around us, and it is voracious for data. It promises improvements to business, health, and many other areas of human life. At the same time, AI experts, regulators, and even the average person on the street have raised concerns that generative AI could create privacy violations. Because adoption has been so rapid, privacy advocates and regulators worry that generative AI will demand ever more personal data and potentially expose it, and that its growing use means privacy leaks can occur at greater scale and with fewer controls.
Generated synthetic data to the rescue?
At the same time, some companies and researchers are working on using generative AI to protect privacy by generating synthetic data. Synthetic data is a simulated dataset based on actual data, generated to have similar statistical properties to the original. With the growing availability and power of GenAI, these teams now use the technology to produce synthetic data.
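To make the concept concrete, here is a deliberately minimal sketch of the idea: fit a simple statistical model (here, just a mean and covariance) to an original table and sample new records from it. Real GenAI-based synthesizers are far more sophisticated; the dataset and column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy "original" dataset: two numeric attributes per person.
original = pd.DataFrame({
    "age":    [40, 20, 35, 52, 28, 61],
    "income": [72000, 31000, 58000, 90000, 45000, 88000],
})

# Fit a very simple statistical model: the empirical mean and covariance.
mean = original.mean().to_numpy()
cov = np.cov(original.to_numpy(), rowvar=False)

# Sample synthetic records that share those first- and second-order statistics.
rng = np.random.default_rng(seed=0)
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=len(original)),
    columns=original.columns,
)
print(synthetic.round(1))
```

Nothing about this step is automatically privacy-protective: outliers and rare combinations in the original data can still leak through the fitted model, which is exactly the problem the rest of this article is about.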
If generated correctly and with attention, synthetic data offers some real benefits to privacy. In theory, it should be impossible to link such data to the original data. In most jurisdictions, if you can’t link the data back to personal data, it can be considered “anonymous.” For example, GDPR, the main privacy regulation in the EU, defines anonymous data as “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable” [1]. Anonymous data does not fall under GDPR. From a regulatory perspective, organizations can store and retain “anonymous” data indefinitely. In addition, anonymous data could be shared freely, allowing a company to use it to train multiple purpose-built models. For example, by reusing anonymous data, we can imagine developing new health solutions or financial fraud detection.
The idea that generative models can synthesize anonymous data and alleviate all privacy issues promises a quick and easy solution. On its face, this sounds simple and tempting, but is it too good to be true? Is generated synthetic data snake oil or a revolutionary solution to safeguard our data? In many cases, synthetic data overpromises and underdelivers when it comes to being anonymous and relieving all privacy concerns.
Not all synthetic data is the same, and not all is privacy protective.
Not all synthetic data is privacy-preserving data! To help illustrate, I created this Venn diagram 🙂
As I mentioned a moment ago, to protect privacy, the data must be “generated correctly and with attention to privacy”. Yup, you need to apply some form of privacy protection to your data before even generating synthetic data, and you need to understand which statistical properties you most need.
However, figuring out which protection to apply and which statistical properties to preserve can be difficult. Even experts in statistical data generation have found this difficult. For example, the US Census Bureau had been applying statistical techniques to provide valuable census data while still protecting against disclosure. After testing, they found the techniques were more vulnerable than anticipated. For the 2020 census release, the Census Bureau switched to a formal method called differential privacy to protect their data [2].
Concerns about synthetic data have been raised elsewhere, too. Researchers have empirically demonstrated the risks of naively generating synthetic data. When they attempted to measure the chance of disclosure, they identified generated synthetic datasets with worse privacy protection than non-synthetic disclosure control approaches [3]. Other work found that, when measuring how linkable a synthetic dataset was, the most protective datasets were also the ones with strict privacy measures applied [4]. So, what must you do to generate privacy-preserving synthetic datasets?
How to protect privacy?
If simply generating synthetic data doesn’t anonymize your dataset, what do you need to do to produce anonymous data? I’m sorry to say it, but you have to start making some trade-offs and balance data precision against privacy. You cannot get both 100% statistical similarity and privacy protection when generating synthetic data. Put another way, it is impossible to get fully private, anonymous data that retains all the statistical properties of your original dataset (including outliers) at full utility and fidelity [5]. Even with synthetic data, you still need to make some tradeoffs and take some basic steps to future-proof your privacy program. Anyone who tells you otherwise is selling you snake oil.
Your organization will need to find the data protection method that is right for its needs. Your decision about the privacy-utility-fidelity tradeoff will be based on your risk appetite, your data usage, and context.
Step 1: Log all your privacy data transformations and annotate your datasets
Don’t let doing the right thing backfire because you didn’t document what you did. I’ve seen organizations that anonymized their dataset but didn’t document exactly what they did. They just labeled it anonymous. A few years later, after some team turnover, no one could say with confidence how that data had been transformed or how close it was to the original set. The team was then forced to delete this hard-to-collect, labeled training dataset because they could not guarantee the privacy promises or the statistical properties. Don’t let this happen to you. Whenever you apply privacy protection to a model or dataset, describe in detail what you did.
Synthetic data generation and privacy are both rapidly changing spaces. The definitions of identifiable and anonymous have been changing over time. What you do today might be considered vulnerable in the future, and you will want to at least know what you did.
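What that documentation looks like matters less than that it exists and travels with the data. As one minimal, hypothetical sketch (the dataset names, fields, and values below are all invented, not a standard schema), you could write a small provenance record next to every transformed dataset:

```python
import json
from datetime import datetime, timezone

# A hypothetical provenance record kept alongside a transformed dataset.
transformation_log = {
    "dataset": "claims_2023_synthetic_v2",
    "source_dataset": "claims_2023_raw",
    "created": datetime.now(timezone.utc).isoformat(),
    "transformations": [
        {"step": "drop_direct_identifiers", "columns": ["name", "phone_number"]},
        {"step": "generalize", "column": "timestamp", "granularity": "day"},
        {"step": "differential_privacy", "mechanism": "laplace", "epsilon": 1.0},
    ],
    "privacy_claim": "pseudonymous; DP applied to released aggregates only",
    "reviewed_by": "privacy-team@example.com",
}

# Write the record next to the dataset so it survives team turnover.
with open("claims_2023_synthetic_v2.provenance.json", "w") as f:
    json.dump(transformation_log, f, indent=2)
```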
Step 2: Choose a privacy-protective method
Anonymization, too, is a term that has changed meaning over the past few decades. This article aims to give you an intuitive understanding of the options without burying you in details. Here are some examples of different dataset transformations that have all been called “anonymizing” (a brief code sketch of a couple of them follows the list).
- PII has been replaced with other identifiers, such as cryptographically randomized keys, and the links between the new identifiers and the old are stored in separate places.
- PII has been removed completely.
- Outliers and PII have been removed to make it harder to link outside information to the dataset and learn about someone (see k-anonymity).
- Outliers and PII have been removed, and information has been made less granular (e.g., coarser timestamps or locations).
- Noise has been added to the data: each value receives a small random perturbation.
- Differential Privacy (DP) has been applied to add noise to aggregate data so that there is a mathematical guarantee that it is very difficult to guess whether someone is in the dataset. Differential privacy was formalized in 2006 [6] and has lots of active research on attacks and best practices.
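As promised above, here is a toy sketch of two of these transformations, pseudonymization and noise addition. The records are made up, and neither step on its own comes close to the guarantees of differential privacy.

```python
import secrets
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "name":  ["Rebecca", "Joe"],
    "phone": ["412-100-100", "617-100-100"],
    "age":   [40, 20],
})

# Pseudonymization: swap direct identifiers for random keys, and store the
# key-to-person lookup table somewhere separate and better protected.
key_map = {name: secrets.token_hex(8) for name in records["name"]}
pseudonymized = records.drop(columns=["name", "phone"]).assign(
    person_key=[key_map[n] for n in records["name"]]
)

# Noise addition: apply a small random perturbation to the remaining values.
rng = np.random.default_rng(seed=0)
pseudonymized["age"] = pseudonymized["age"] + rng.normal(0, 0.5, size=len(pseudonymized))

print(pseudonymized)
```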
Each of these methods offers different levels of privacy protection. I’ve listed them roughly in order, from weakest to strongest protection. Differential Privacy (DP) has for some years been considered the gold standard in removing linkability, and some privacy experts (including me) will not call a dataset “anonymous” unless they know that a company correctly applied DP. We consider everything else pseudonymous.
To best understand the impact of differential privacy, let’s look at an example.
The table below gives you samples of what a structured dataset could look like with different transformations. Notice how you gradually lose some information with each transformation.
| Identifiable data | Data with PII removed | PII removed and noise added | Aggregated using differential privacy |
| --- | --- | --- | --- |
| Name=Rebecca, PhoneNumber=412-100-100, Age=40 | Age = 40 | Age = 40.2 | Average Age = 30.09 |
| Name=Joe, PhoneNumber=617-100-100, Age=20 | Age = 20 | Age = 19.8 | |
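For the last column of that table, here is a minimal sketch of how a differentially private average can be produced with the Laplace mechanism. The clipping bounds and epsilon below are illustrative choices, and in practice you would use a vetted DP library rather than rolling your own.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Differentially private mean via the Laplace mechanism.

    Values are clipped to [lower, upper] so the sum has bounded sensitivity
    (upper - lower); the privacy budget epsilon is split between a noisy sum
    and a noisy count.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    noisy_sum = clipped.sum() + rng.laplace(0, (upper - lower) / (epsilon / 2))
    noisy_count = len(clipped) + rng.laplace(0, 1 / (epsilon / 2))
    return noisy_sum / max(noisy_count, 1.0)

ages = np.array([40, 20])
print(dp_mean(ages, lower=0, upper=100, epsilon=1.0))
```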
You will need to make trade-offs like these between utility and privacy.
A few tools claim to measure how good your privacy protections are. One example is Anonymeter, which claims to “quantify different types of privacy risks in synthetic tabular datasets” [7]. Running such a tool on your synthetic data will likely help you understand how strong your protections are, but some privacy experts think these metrics are incomplete. I am curious to see what happens in this space in the future.
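I won’t reproduce any particular tool’s API here, but as a rough illustration of the kind of question such tools ask, here is a crude, hypothetical proximity check: what fraction of synthetic records sit suspiciously close to some real record? This is a toy proxy, not a substitute for Anonymeter-style attack simulations or a formal guarantee.

```python
import numpy as np

def fraction_near_real(original, synthetic, threshold):
    """Fraction of synthetic rows whose nearest original row is within
    `threshold` (Euclidean distance) -- a crude proxy for linkability."""
    # Pairwise distances, shape (n_synthetic, n_original).
    dists = np.linalg.norm(synthetic[:, None, :] - original[None, :, :], axis=-1)
    return float((dists.min(axis=1) < threshold).mean())

original = np.array([[40.0, 72000.0], [20.0, 31000.0]])   # (age, income)
synthetic = np.array([[40.2, 71950.0], [33.0, 50000.0]])
print(fraction_near_real(original, synthetic, threshold=100.0))
```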
Why generate synthetic data at all?
You’ve seen that you still need to apply data protections to your training data. At this point, you may ask: if you have to do that hard work of adding privacy protections during or before generation anyway, why bother with the extra step of synthesizing data at all? Generating data adds complexity to your process, requires computing power, and makes it harder to trace concerns about the model back to the underlying data.
So when is generating data likely worth it? Generating synthetic data is a great option when you have long-tailed data or anomalies. Examples include hunting for fraudulent behavior when most of the data reflects legitimate activity, or studying rare diseases in a health dataset. When the information you care about is rare, generating more of it synthetically may offer real benefits.
Synthetic datasets are also useful for testing security and privacy vulnerabilities. For example, privacy red teams may choose to mimic a motivated attacker accessing user data to cause harm; many attacks can be demonstrated on a synthetic dataset without causing privacy violations to the people in the original dataset.
In comparison, you may find that DP-generated models are ideal when the margins are less relevant to your core decision-making. In these cases, DP offers high privacy protections without sacrificing much utility. Furthermore, if you are in a highly regulated environment, or if your reputation for trust and safety is critical, you should consider DP-generated synthetic data.
Overall, generated synthetic data, when combined with privacy-protective techniques, can offer many promises for new and improved uses of data in a privacy-protective manner. However, synthetic generation cannot be done naively; the privacy protections must still be weighed and documented. Since privacy is a shifting landscape, you may need to reevaluate the privacy protections you apply over time. I recommend carefully defining what privacy protections you use and annotating any models with a clear description of what was done to protect them. While tools like model cards and a model registry can help you do this, it is still up to you to actually do the documentation work. Three years from now, you’ll be happy you did.
Learn more and references
1. https://edps.europa.eu/system/files/2021-04/21-04-27_aepd-edps_anonymisation_en_5.pdf
2. https://www2.census.gov/library/publications/decennial/2020/census-briefs/c2020br-03.pdf
3. Ruiz, N., Muralidhar, K., and Domingo-Ferrer, J. (2018). On the privacy guarantees of synthetic data: a reassessment from the maximum-knowledge attacker perspective. In Privacy in Statistical Databases (PSD 2018), Valencia, Spain, September 26–28, 2018, pp. 59–74. Springer International Publishing.
4. Giomi, M., Boenisch, F., Wehmeyer, C., and Tasnádi, B. (2023). A Unified Framework for Quantifying Privacy Risk in Synthetic Data. Proceedings on Privacy Enhancing Technologies 2023(2), 312–328.
5. https://aws.amazon.com/blogs/machine-learning/how-to-evaluate-the-quality-of-the-synthetic-data-measuring-from-the-perspective-of-fidelity-utility-and-privacy
6. Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference (TCC), Springer. doi:10.1007/11681878_14. Full version in Journal of Privacy and Confidentiality, 7(3), 17–51. doi:10.29012/jpc.v7i3.405.
7. https://github.com/statice/anonymeter
8. Llugiqi, M. and Mayer, R. (2022). An Empirical Analysis of Synthetic-Data-Based Anomaly Detection. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction (CD-MAKE), pp. 306–327. Cham: Springer International Publishing.