Are good intentions enough to justify de-anonymizing a dataset?
Imagine you are a privacy engineer faced with this conversation:
Manager: “Hey, we have a bunch of data on how people do X, and we are going to share it with Y. We removed all the identifiers, so it should be good enough, right?”
Privacy Engineer: “Well, unless you are using differential privacy with the correct settings, it could probably leak information about individuals. Even with differential privacy, it might still be possible to make sensitive inferences about groups of people.”
Manager: “We didn’t use differential privacy but I think what we did is enough. Can you prove to me that there is a problem?”
So, you’ve been asked to re-identify a de-identified dataset before it is shared! It’s great that the manager came to you before releasing it!
Sharing data for good reasons
There are many perfectly good reasons to share a dataset, whether for public service decisions, identifying diseases, or a million other ways to better understand the world for good.
However, there is a large body of work showing that even when identifiers are removed, it is often possible to re-identify people in a dataset (see the Learn more section). In the privacy engineering community, the gold standard for protecting individual data in shared datasets is differential privacy (DP). However, DP is still not widely adopted.
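If you have not seen DP in practice, here is a minimal sketch of one of its building blocks, the Laplace mechanism, applied to a counting query. The records, the query, and the epsilon value are all invented for illustration; choosing a real epsilon is exactly the “correct settings” problem the privacy engineer raised above.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so we add noise drawn from
    Laplace(0, 1/epsilon) to the true count.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative use: a noisy count of a sensitive attribute.
records = [{"language": "Swahili"}, {"language": "English"}]
print(dp_count(records, lambda r: r["language"] == "Swahili", epsilon=0.5))
```

The point of the noise is that the published count barely depends on any single person, which is a much stronger guarantee than simply deleting identifiers.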
How are shared datasets protected if DP is not used? Commonly, strong identifiers are removed, such as names, email addresses, or government IDs. Sometimes the data is fuzzed: timestamps have their seconds removed, IP addresses are broadened, or location data is made less specific. These are all helpful steps. They raise the hurdle for anyone who wants to re-identify the data, but they do not fully “anonymize” it against someone who is motivated to re-identify it.
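To make those steps concrete, here is a sketch of what such a pipeline often looks like. The field names, the granularity choices, and the record itself are invented for illustration, not a standard:

```python
import ipaddress
from datetime import datetime

def fuzz_record(record):
    """Apply common (but incomplete) de-identification steps to one record."""
    out = dict(record)
    # Remove strong identifiers outright.
    for field in ("name", "email", "government_id"):
        out.pop(field, None)
    # Truncate the timestamp to the minute (drop seconds).
    ts = datetime.fromisoformat(out["timestamp"])
    out["timestamp"] = ts.replace(second=0, microsecond=0).isoformat()
    # Broaden the IP address to its /24 network.
    out["ip"] = str(ipaddress.ip_network(out["ip"] + "/24", strict=False))
    # Coarsen the location to roughly 1 km by rounding coordinates.
    out["lat"], out["lon"] = round(out["lat"], 2), round(out["lon"], 2)
    return out

record = {"name": "A. Person", "email": "a@example.com",
          "timestamp": "2024-05-01T09:13:42", "ip": "198.51.100.23",
          "lat": 46.9481, "lon": 7.4474}
print(fuzz_record(record))
```

Note what survives: a minute-level timestamp, a /24 network, and a neighborhood-sized location. In combination, such quasi-identifiers can still single out an individual, which is exactly why these steps raise the hurdle without clearing it.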
There might be good reasons to try to de-anonymize datasets
Re-identifying datasets is not as hard as it should be. Attackers do it all the time, but you are not an attacker. You might want to re-identify a dataset for one of the following reasons:
- You are a privacy expert, and someone has asked for your help in protecting their data before they share it (like this example: Privacy, Ethics, and Data Access: A Case Study of the Fragile Families Challenge).
- You are a data, math, or privacy academic and want to publish a paper that updates the field's knowledge of what is re-identifiable. The goal is to encourage better protection of sensitive data.
- You are running a Privacy Red Team exercise and want to demonstrate that the dataset is vulnerable. The goal is to encourage better data protection before the data is shared.
Whether to test the privacy protections by de-anonymizing the data
So, you are a good person, and you want to share data for a good reason. Some identifiers have been removed, and now you want to know how good the privacy protections are. Is it worth trying to re-identify a dataset?
Slow your roll! For data and math nerds (myself included), this is a fun challenge. It’s like a puzzle or a game; a game that can have real consequences for real people. You are trying to undo the protections put in place. At the same time, maybe you can bring about change and better protections. How do you decide whether it is worth doing?
There are four things you need to do before you re-identify the dataset:
1. Understand the goals of sharing the dataset, and what else the data could be used for. Talk to the owners of the dataset if you are not a domain expert.
What are the goals of sharing the data? Is it to cure a disease? Is it to increase marketing?
Who will the data be shared with? Medical researchers? The whole world? An authoritarian government with a history of jailing dissidents? Are there legal contracts in place, or other types of protection?
These have different benefits and risks to the people in the dataset. Weigh them appropriately.
2. Be really clear on what you mean by re-identifying a dataset. Here is a non-exhaustive list of what might count as “re-identifying” a dataset.
- Do you want to identify groups of people, for example, all Swahili speakers in Minnesota in the dataset?
- Do you want to be able to link two different rows in the same dataset as belonging to the same person so you can track activities, even if you don’t know who that person is?
- Do you just need to pick out one specific person, e.g., figure out which row in the dataset belongs to the mayor of Boston?
- Or do you want to be able to re-link the data to PII? For example, do you want to be able to match the protected dataset to another that has full names or email addresses? (There is a sketch of this kind of linkage below.)
A crisp definition of what you plan to do is really important for Step 4, when you look at the risks and benefits of doing the re-identification.
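To see why the last definition is worth taking seriously, here is a hedged sketch of a classic linkage attack in the spirit of Sweeney's work: joining a “de-identified” dataset to a public, named dataset on quasi-identifiers (ZIP code, birth date, sex). All records and field names are invented for illustration:

```python
# Link "de-identified" rows back to named people via quasi-identifiers.
# All data here is invented for illustration.

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(deidentified, public):
    """Return de-identified rows that match exactly one named person."""
    matches = []
    for row in deidentified:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        candidates = [p for p in public
                      if tuple(p[q] for q in QUASI_IDENTIFIERS) == key]
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((row, candidates[0]["name"]))
    return matches

deidentified = [{"zip": "02138", "birth_date": "1945-07-31",
                 "sex": "F", "diagnosis": "..."}]
public = [{"zip": "02138", "birth_date": "1945-07-31",
           "sex": "F", "name": "Jane Doe"}]
print(link(deidentified, public))
```

Sweeney famously estimated that around 87% of the US population is uniquely identified by just ZIP code, birth date, and sex, so unique joins like this one are common, not exotic.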
Now do two risk-benefit analyses (Steps 3 and 4 below). You do not need to get bogged down in an in-depth quantitative analysis with hard numbers; it may be enough to simply brainstorm the risks and benefits.
3. What are the risks and benefits of sharing the data as is? You likely already have many ideas here based on your work in Step 1. Now you need to frame what you learned as risks and benefits to the people in the dataset.
Benefits: For example, is there a chance that sharing the data as is will save thousands of lives?
Risk: What will actually happen to people if the data about them is released and re-identified? Are the data and the people well protected through other mechanisms (such as contracts or limited access to the data)?
4. What are the risks and benefits of re-identifying the data? You have already decided what you mean by re-identify in Step 2.
Risk: If you are successful, you (and your team) will have access to the re-identified data. Do the people in the dataset want that? Can they trust you? Do you have a code of ethics or rules of engagement? What protections do you have in place for this data?
Benefits: What are the benefits of you re-identifying the data? What will change for the better if you do? Will it change the state of knowledge? Will it change how the dataset is released? Are the people releasing the dataset open to using more stringent privacy protections? If you re-identify the dataset, will the data still be released as-is due to other pressures?
You should probably spend the most time on the first step, understanding the goals of sharing the data; that should inform all the subsequent steps. It will allow you to more easily recognize the risks and benefits of sharing the data versus the risks and benefits of re-identifying it. What is the data subjects' tolerance for those risks, and their appetite for those benefits? For example, as a data subject, I might personally be willing to share sensitive data with medical researchers if it means better treatment for specific diseases.
Now put it all together. Once you have clearly framed the problem and understand the risks and benefits, you can decide whether to proceed. Only move forward if the re-identification will, on balance, yield positive outcomes.
Learn more
Explanations and primers on re-identification:
- The Wikipedia page on re-identification includes lots of examples of previous work.
- A very good introduction to a technical area in a legal journal: “Re-Identification of ‘Anonymized’ Data” by Boris Lubarsky, GEO. L. TECH. REV. 202 (April 2017).
Legal aspects of re-identification:
- The only work I can find addressing the legal aspects of researchers who want to re-identify data is “Re-Identification Attacks and Data Protection Law” by Cedric Lauradoux, Teodora Curelariu, and Alexandre Lodie, February 6, 2023.
Technical aspects of the risk of re-identification:
- It’s a few years old by this point, but a very thoughtful explanation: “A Precautionary Approach to Big Data Privacy” by Arvind Narayanan, Joanna Huey, and Edward W. Felten, March 19, 2015.
- A recent technical explanation for thinking about what you mean by re-identifying data: “SoK: Managing Risks of Linkage Attacks on Data Privacy” by Jovan Powar and Alastair Beresford (University of Cambridge).
Introduction to differential privacy:
- I highly recommend “A friendly, non-technical introduction to differential privacy” on the “Ted is writing things” blog.
Introduction to Privacy Red Teams:
- I have a course on this: https://www.privacyengineer.ch/product/redteam/
- If you want a 3-minute introduction, see my blog post.