When classification models leak private information

I was talking to a friend the other day about privacy leaks from AI, and they wanted to know whether a machine learning model itself protects privacy. They argued that if you start with a raw training set and use it to build a model that only classifies data, the privacy-sensitive information should be removed. The classification model aggregates information and doesn’t return identifying information, so building the model may already be privacy-protective. My answer to the question of whether it protects privacy is “Yes, that is better than sharing the raw training set.” However, the important question is whether it does enough to protect privacy. The answer to that is “it depends.”

If your definition of privacy is that the model doesn’t spit out identifying information about someone, then many classification models do protect privacy. However, for most people, privacy is more than that. There are many types of information that people consider sensitive, even if they are not unique identifiers, and classification models can reveal that kind of information too. Let’s walk through a few examples of how classification models can reveal personal information and potentially cause harm to people.

Imagine a model that detects spam, and classifies a phone message as either spam or not spam.  The output of the model is binary, and not personally identifying information like an address or social security number.  Can we rest assured that the model will protect privacy?  Is it safe because it is outputting harmless information like “true” versus “false” or  “spam” versus “not spam”?

Not necessarily. We still have to worry about harm caused by classification models. Let’s start with a somewhat ridiculous thought experiment. Imagine that the spam classifier treats every phone message as spam unless it comes from Taylor Swift. Only Taylor Swift gets through this spam detector. If you were an attacker and you knew that’s how the classifier worked, you could simply make up phone numbers and send them to the classifier. Once the classifier says “Not Spam”, you know that you’ve guessed Taylor Swift’s number. So, with some effort, the classifier leaked personal information. Does it matter? Well, if you use this to start harassing Taylor, you are creating harm. And the point is, many adversaries will make the effort to learn things about people and to cause harm.
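The thought experiment can be sketched in a few lines of Python. This is only a toy illustration, not a real spam filter: the phone number, the classify function, and the find_allowed_number helper are all made up for the example, and it assumes the attacker can query the classifier as often as they like.

```python
from typing import Callable, Iterable, Optional

# Hypothetical secret: the single number the toy classifier lets through.
_ALLOWED_NUMBER = "555-0142"


def classify(sender_number: str) -> str:
    """Toy classifier: everything is spam except one allowlisted sender."""
    return "not spam" if sender_number == _ALLOWED_NUMBER else "spam"


def find_allowed_number(
    oracle: Callable[[str], str], candidates: Iterable[str]
) -> Optional[str]:
    """Send guessed numbers to the classifier until one is accepted."""
    for number in candidates:
        if oracle(number) == "not spam":
            return number  # the binary output just revealed the secret
    return None


# The attacker only sees "spam" / "not spam", yet recovers the number.
candidates = (f"555-{i:04d}" for i in range(10_000))
leaked = find_allowed_number(classify, candidates)
print(leaked)
```

The attacker never sees anything other than a binary label, but the number of guesses needed is bounded by the size of the candidate space, so the label alone is enough to recover the secret.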

I used Taylor Swift as an example, but the same holds true for any person who doesn’t want to be cold-called and harassed.  If I had used your child or grandmother instead of Taylor Swift in that example, you might have a strong emotional reaction.  

You can expand this example to discuss privacy harms with a more useful classifier. What if your spam classifier is based on location, so that any message coming from a certain country is considered spam? An attacker can now use the classifier to get information about where people are. In many situations, revealing whether someone is located in Nigeria versus Switzerland won’t be sensitive and won’t cause harm. But some people might have valid reasons for not wanting to share that information widely. Imagine someone engaged in an LGBTQ+ event or activity. They might not want the government to know if they are located in one of the many countries that criminalize such behavior. These examples might seem unrealistic for your model, but it is important to consider whether the people you are classifying have any way to control who learns what about them from your model.
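The location example works the same way. The sketch below is a simplified illustration under stated assumptions: the country codes, the sender names, and the lookup table are placeholders invented for the example, and it assumes the attacker knows the classifier’s rule and can see the label assigned to a given sender’s message.

```python
# Hypothetical rule: messages from these placeholder country codes are spam.
BLOCKED_COUNTRIES = {"XX"}

# Hidden from the attacker: where each (made-up) sender actually is.
_SENDER_LOCATION = {"alice": "XX", "bob": "YY"}


def classify(sender: str) -> str:
    """Toy location-based classifier."""
    country = _SENDER_LOCATION[sender]
    return "spam" if country in BLOCKED_COUNTRIES else "not spam"


def infer_in_blocked_country(sender: str) -> bool:
    """Attacker's inference: a 'spam' label implies a blocked location."""
    return classify(sender) == "spam"


for sender in ("alice", "bob"):
    print(sender, infer_in_blocked_country(sender))
```

Here the label is a direct proxy for the sensitive attribute, so anyone who can observe the output learns something about where the sender is.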

Classifiers can also cause harm in an automated and powerful way by revealing how people are behaving.   What if you have an image classifier that detects whether someone is a woman and whether someone is wearing a shawl over their hair?  Some nation-states might use this information to arrest or harm someone.  We’ve seen cases of governments around the world making it illegal for women to either wear (France) or not wear (Iran) head coverings.  In some cases, women face real life-threatening harm when this behavior is revealed. 

Women wearing headscarves

You can expand these examples to think of harms from classifiers that are not binary. Classifiers that predict income, ethnicity, religion, or family status produce more than two possible labels, and those labels are themselves sensitive information.

So model developers, even if your model is “only” a classification model, you still need to think about how people will use it and what it will reveal. Think about how motivated adversaries could try to use your model to learn about people.  Once you have a clear sense of the problems and threats, then you can choose the appropriate protection so your model is responsible and does not cause harm.   A responsible AI engineer or privacy engineer can help you think through these risks, test your system, and figure out how to mitigate the threats.

About the author 


Dr. Rebecca Balebako is a certified privacy professional (CIPP/E, CIPT, Fellow of Information Privacy) who has helped multiple organizations improve their privacy through research, analysis, and engineering.
