Data anonymisation

How to get rid of the ‘poison pill’ in big data

Most data privacy regulations restrict the processing of personal data without informed and explicit consent. In big data contexts, however, how data will be used is often unclear at the time it is collected, making valid consent hard to obtain. For this reason, personal data are a ‘poison pill’ in many big data pools.

Like the ‘copyleft’ effect in open-source licensing, personal data can put big data-based businesses at risk.

But data privacy law restrictions don’t apply to anonymous or depersonalised data, so anonymising data is often the key to unlocking its value.

What is anonymisation?

In the EU, under recital 26 of Directive 95/46/EC and according to the draft EU data privacy regulation, data are considered anonymous if the data subject is no longer identifiable and if the data are retained in a form that makes identifying the data subject impossible by all means reasonably likely to be used, whether by the data controller alone or in collaboration with another party. This requires more than mere pseudonymisation.

What is the challenge in anonymising data?

For anonymous data, reidentification of individuals must be impossible. This requires a reduction of information and is difficult to achieve while retaining the inherent value of the data. Deleting an individual’s name and other unique identifiers, called pseudonymisation, isn’t enough, because combining truncated data sets with other data in a big data pool may lead to reidentification of individuals and so re-personalise the data. Effective anonymisation must prevent everybody from singling out individuals in a data set, from linking records within or between data sets, and from inferring any information about an individual.

To prevent reidentification, it will often be necessary to take additional measures such as randomisation and generalisation. Randomisation means diluting the veracity of data (say, by switching certain attributes) and adding uncertainty, so the data can no longer be linked to an individual. Common methods include noise addition, permutation and differential privacy.
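The two simplest randomisation methods mentioned above, noise addition and permutation, can be sketched in a few lines. This is an illustrative toy only; the record structure, field names and noise scale are assumptions, and real deployments need a careful privacy analysis:

```python
import random

def add_noise(records, field, scale=2.0):
    """Noise addition: perturb a numeric attribute so no record
    still carries its exact original value. Field name and scale
    are illustrative, not a calibrated privacy guarantee."""
    noisy = []
    for rec in records:
        perturbed = dict(rec)
        perturbed[field] = rec[field] + random.gauss(0, scale)
        noisy.append(perturbed)
    return noisy

def permute(records, field):
    """Permutation: shuffle one attribute across records, breaking
    the link between the attribute value and the individual while
    keeping the overall distribution intact."""
    values = [rec[field] for rec in records]
    random.shuffle(values)
    return [dict(rec, **{field: v}) for rec, v in zip(records, values)]
```

Note that permutation preserves the exact statistical distribution of the attribute, while noise addition trades some accuracy for uncertainty about any single record.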

Generalisation means modifying the scale or order of magnitude of certain identifiers in a data set – location by region rather than precise co-ordinates, timing by month rather than by week or day, for example – aggregating data sets, or applying k-anonymity, l-diversity or t-closeness. For more on these approaches, we recommend the Article 29 Working Party’s opinion on anonymisation techniques.
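A minimal sketch of generalisation and a k-anonymity check follows. The quasi-identifiers (postcode, age) and the coarsening rules are assumptions chosen for illustration:

```python
from collections import Counter

def generalise(record):
    """Generalisation: coarsen quasi-identifiers -- a two-digit
    postcode prefix instead of the full code, an age band instead
    of the exact age. Field names are illustrative."""
    return {
        "region": record["postcode"][:2],        # coarse location
        "age_band": (record["age"] // 10) * 10,  # decade, not exact age
    }

def is_k_anonymous(records, k):
    """A data set is k-anonymous if every combination of generalised
    quasi-identifier values is shared by at least k records, so no
    individual can be singled out within their group."""
    groups = Counter(
        tuple(sorted(generalise(r).items())) for r in records
    )
    return all(count >= k for count in groups.values())
```

A higher k means larger groups and stronger protection against singling out, at the cost of coarser, less useful data.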

In short, effective anonymisation is tricky. It requires robust governance and group policies on generating, processing and retaining/deleting raw data as well as scrutiny to prevent infection of a valuable big data pool by the poison pill of personal data.

What are the typical pitfalls?

Anonymisation as such constitutes processing of personal data

Anonymisation of personal data is a form of processing and requires consent or alternative justification under data privacy law. For that reason, anonymising personal data cannot cure a non-compliant collection of personal data.

Anonymising personal data is an ongoing challenge, not a one-off exercise

Anonymised data have to be retained in a form in which identification of the data subject is impossible. Anonymous data combined with other data sets in a shared big data environment might allow presumed-anonymous data to be re-personalised, at which point the applicable restrictions on the processing of personal data would suddenly kick in again. This is most relevant in group data-sharing arrangements and in M&A. New technology or analytics tools may also open routes to re-personalise anonymous data and may require further measures to keep data sets anonymous.

Raw data must be deleted

As long as the data sets as they existed before the application of anonymisation techniques (raw data) are kept somewhere by someone, it is possible to re-personalise the anonymised data sets by reference to the raw data. Hence, effective anonymisation requires deletion of the respective raw data. In many cases, however, this does not, or does not significantly, affect the (statistical) usability of the data for big data purposes.

Pseudonymisation or encryption does not render personal data anonymous

As long as the key to pseudonymised or encrypted data is kept somewhere by someone, the data are – at best – indirectly personal, but not anonymous. One of the biggest misconceptions is that pseudonymised data are anonymous data in legal terms.
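The point can be made concrete with a toy pseudonymisation routine. The names, fields and key handling here are illustrative assumptions; what matters is that the mapping table produced alongside the output is exactly the ‘key’ the law cares about:

```python
import hashlib

def pseudonymise(records, secret_key):
    """Replace direct identifiers with keyed hash tokens. The key
    table built here is why pseudonymisation is not anonymisation:
    anyone holding it can re-identify every record. Field names
    are illustrative."""
    key_table = {}
    out = []
    for rec in records:
        token = hashlib.sha256(
            (secret_key + rec["name"]).encode()
        ).hexdigest()[:12]
        key_table[token] = rec["name"]  # the route back to the person
        out.append(
            {**{k: v for k, v in rec.items() if k != "name"},
             "token": token}
        )
    return out, key_table

# Re-identification is trivial for anyone holding the key table:
#   original_name = key_table[pseudonymised_record["token"]]
```

Only once the key table (and the raw data) are verifiably destroyed, and no reasonably likely means of reversal remain, could such data even be argued to be anonymous.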