Posted on: January 5, 2017

Pseudonymization vs. Anonymization and How They Help With GDPR

Pseudonymization and Anonymization are two distinct terms that are often confused in the data security world. With the advent of GDPR, it is important to understand the difference, since anonymized data and pseudonymized data fall under very different categories in the regulation.

Pseudonymization and Anonymization are different in one key aspect. Anonymization irreversibly destroys any way of identifying the data subject. Pseudonymization substitutes the identity of the data subject in such a way that additional information is required to re-identify the data subject.

You can think about it in terms of authors. Let’s say we have 10 books written by “Anonymous”. We have no way of knowing whether all 10 books were written by the same person, or by 2, 3, 4, or 10 different people. Now let’s say we have 10 books written by Mark Twain. We know that all 10 books were written by the same person, even if we don’t know that Mark Twain is actually Sam Clemens. Clemens wrote under a pseudonym, while the other authors in our example were anonymous.

In practice, let’s look at tokenization. Tokenization provides a consistent token for each unique name and requires access to additional information (our static lookup tables/code books) to re-identify the data:

[Figure: Pseudonymization vs. Anonymization]

Here, with the pseudonymized data, we may not know the identity of the data subject, but we can correlate entries with specific subjects (records 1 and 7 reference the same person, records 2 and 5 reference the same person, records 3 and 4 reference the same person). If we have access to re-identify the data via the token lookup tables, then we can get back to the real identity. With the anonymized data, however, we only know that there are 7 records and there is no method to re-identify the data.
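The behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not how Protegrity's product works: the names, the token format, and the in-memory code-book dictionary are all invented for the example, and a real tokenization system generates and stores tokens very differently.

```python
import secrets

# Hypothetical code book: maps each unique name to a consistent token.
# Re-identification is possible only with access to this table.
code_book = {}

def pseudonymize(name):
    """Return the same token every time for the same name (reversible via code_book)."""
    if name not in code_book:
        code_book[name] = "tok_" + secrets.token_hex(4)
    return code_book[name]

def anonymize(_name):
    """Destroy the identity: every record gets the same placeholder, nothing is retained."""
    return "ANONYMOUS"

records = ["Alice", "Bob", "Carol", "Carol", "Bob", "Dave", "Alice"]

tokens = [pseudonymize(n) for n in records]
# Records 1 and 7 carry the same token, so we can tell they refer to one person...
assert tokens[0] == tokens[6]

# ...and anyone holding the code book can reverse the substitution.
reverse = {tok: name for name, tok in code_book.items()}
assert reverse[tokens[0]] == "Alice"

anon = [anonymize(n) for n in records]
# With the anonymized data, we only know that there are 7 records.
assert len(anon) == 7 and len(set(anon)) == 1
```

The key property is consistency: the same input always yields the same token, which preserves the ability to correlate and (with the code book) to re-identify, while the anonymized column preserves neither.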

Pseudonymization is a method to substitute identifiable data with a reversible, consistent value. Anonymization is the destruction of the identifiable data.

With Anonymization, we must also be concerned about “indirect re-identification”. If we return to our author example above, an analysis of the writing style of our anonymous authors might allow us to indirectly identify them. We might not be able to recover a name, but we might be able to tell that specific books were written by the same person, because of their unique writing style. If that author has also written something under their own name, we might be able to identify the individual completely by comparing the anonymous writing style with known authors’ styles.


As an example, let’s say an organization retains records of a customer’s purchase history but anonymizes the name, address, and other directly identifying fields. Since humans are creatures of habit, it may still be possible to identify a record indirectly.

Every morning, Monday through Friday, Bob goes to the same coffee shop and buys the same coffee and scone for breakfast. He always uses his debit card. On Friday night, he always withdraws $200 from the ATM next to his office, because it’s poker night with his buddies.

Even if the organization has “anonymized” Bob’s personally identifiable data (destroyed his name, address, etc.), his behavior allows us to indirectly re-identify him (all of these transactions reference the same person, because we can identify his predictable behavior). Therefore, the data set has not been properly anonymized.
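A sketch of how such indirect re-identification might work, using invented transaction records (the merchants, items, and field names are all hypothetical; real linkage attacks draw on far richer signals than this):

```python
from collections import Counter

# Hypothetical "anonymized" transaction log: the name is gone, but
# merchant, item, and weekday survive in each record.
transactions = [
    {"merchant": "Corner Coffee", "item": "coffee+scone", "weekday": d}
    for d in ["Mon", "Tue", "Wed", "Thu", "Fri"]
] + [
    {"merchant": "Office ATM", "item": "withdraw $200", "weekday": "Fri"},
    {"merchant": "Corner Coffee", "item": "latte", "weekday": "Sat"},  # a different customer
]

# Count how often each (merchant, item) pair recurs across the log.
patterns = Counter((t["merchant"], t["item"]) for t in transactions)

# A pair that recurs every weekday is almost certainly one habitual individual,
# so those five records can be linked to a single person despite the missing name.
habitual = [p for p, n in patterns.items() if n >= 5]
# → [("Corner Coffee", "coffee+scone")]
```

Nothing in the data names Bob, yet his predictable behavior groups his records back together, which is exactly why the data set above is not properly anonymized.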

To properly anonymize this data, we might have to use additional methods to ‘hide’ individual behavior. For example, we might only store records based on some kind of grouping.

“50 people went to this coffee shop every morning.”
“100 people got money from this ATM every Friday.”
“A total of $100,000 was taken from this ATM on Friday.”
“30 people bought scones today.”

Now the data has been anonymized, because we have no way of seeing Bob’s predictable pattern of behavior.
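The grouping idea can be sketched as a simple aggregation (again with invented data): instead of storing per-visit records, we store only a count per location, so no individual pattern survives.

```python
from collections import Counter

# Hypothetical per-visit records, already stripped of names.
visits = (["Corner Coffee"] * 50) + (["Office ATM"] * 100)

# Retain only aggregate counts per location; individual visit patterns vanish.
aggregate = Counter(visits)

print(aggregate["Corner Coffee"])  # 50 — "50 people went to this coffee shop"
print(aggregate["Office ATM"])     # 100 — "100 people got money from this ATM"
```

Once only the aggregates are kept and the raw visits are discarded, there is no record left from which Bob's weekday routine could be reconstructed.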

Protegrity tokenization is an excellent form of pseudonymization.

Anonymization is an exercise that should be undertaken by expert statisticians, data scientists, and similar specialists, and tailored to the sort of data retained by the individual organization.
