In 2005, Harvard University launched a high-impact study of the human genome, the Personal Genome Project, for which it recruited 2,500 volunteers to donate genetic material. The volunteers were assured from the outset that the information they provided would remain anonymous, an important guarantee because genetic material can reveal a person's predisposition to diseases and other conditions. The volunteers also supplied information on their consumption of alcohol and drugs, among other personal data, all of it sensitive content that could harm these individuals' lives if disclosed.
The Personal Genome Project team deleted names, identity document numbers, and any other "personal" donor data. To preserve some information on the geographic distribution of the data, volunteers' full addresses were withheld and only their zip codes retained. At the time, the project manager expressed misgivings about the risk that someone could re-identify the records and access the private data associated with each individual. That risk became a reality much faster than expected when the Harvard Data Privacy Lab, led by Latanya Sweeney, succeeded in re-identifying 43% of the donors in a sample it examined in an experiment designed to test the reliability of these procedures. Date of birth, zip code, and gender proved sufficient to recover individuals' names by cross-referencing electoral rolls and other sources (Sweeney et al., 2013). Indeed, in an earlier paper, Sweeney (2000) estimated that 87% of the US population could be correctly identified from just these three pieces of information.
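The linkage attack just described can be sketched in a few lines of code. The example below is illustrative only: the records, field names, and data sets are invented, not drawn from the Personal Genome Project. It joins a "de-identified" table against a public voter roll on the quasi-identifier triple (date of birth, zip code, gender) and reports every record that matches exactly one named individual.

```python
# Illustrative linkage attack: all records below are fabricated.
# A "de-identified" table still carries quasi-identifiers that can
# be matched against a public, named data set such as a voter roll.

medical = [  # direct identifiers removed, quasi-identifiers kept
    {"dob": "1965-04-12", "zip": "02138", "sex": "F", "diagnosis": "A"},
    {"dob": "1971-09-30", "zip": "02139", "sex": "M", "diagnosis": "B"},
]

voter_roll = [  # public record: names are present
    {"name": "Jane Doe", "dob": "1965-04-12", "zip": "02138", "sex": "F"},
    {"name": "John Roe", "dob": "1980-01-01", "zip": "02139", "sex": "M"},
]

def link(medical, voter_roll):
    """Re-identify records whose (dob, zip, sex) triple matches
    exactly one entry in the named public data set."""
    index = {}
    for v in voter_roll:
        key = (v["dob"], v["zip"], v["sex"])
        index.setdefault(key, []).append(v["name"])
    hits = []
    for m in medical:
        names = index.get((m["dob"], m["zip"], m["sex"]), [])
        if len(names) == 1:  # unique match => re-identification
            hits.append((names[0], m["diagnosis"]))
    return hits

print(link(medical, voter_roll))  # [('Jane Doe', 'A')]
```

Note that the attack requires no access to the deleted names: uniqueness of the quasi-identifier combination in the external source is enough, which is precisely why Sweeney's three attributes are so powerful.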
This case raises several questions that we address in this paper. Is this a one-off example, or can records generally be re-identified from a handful of data points by consulting external databases? Which types of data enable a database to be de-anonymized, as date of birth and zip code did in the case above? Can clear recommendations be formulated, and best practices identified, for anonymizing data sets so as to protect individuals' identity and privacy?
To address these questions, we distinguish between two key concepts: de-identification and anonymization. De-identification is the removal of elements that associate a record with an individual, such as personal identification codes, device codes (IP addresses, MAC addresses), and biometric identifiers. Anonymization, in contrast, entails eliminating any possibility of associating records with the individuals to whom they refer. These actions are not in binary opposition: they lie on a continuum that runs from fully identified records to fully anonymized records.
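The distinction can be made concrete with a minimal sketch of de-identification. The field names below are assumptions chosen for illustration; the point is that stripping direct identifiers leaves the quasi-identifiers in place, so the result is de-identified but, as the Personal Genome Project case shows, not anonymized.

```python
# Minimal sketch of de-identification (NOT anonymization).
# Field names are illustrative assumptions, not a standard schema.

DIRECT_IDENTIFIERS = {"name", "ssn", "ip_address", "mac_address"}

def de_identify(record):
    """Return a copy of the record with direct identifiers removed.
    Quasi-identifiers (dob, zip, sex) are deliberately left intact,
    so linkage against external data sets remains possible."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

record = {"name": "Jane Doe", "ssn": "000-00-0000",
          "dob": "1965-04-12", "zip": "02138", "sex": "F"}
print(de_identify(record))
# {'dob': '1965-04-12', 'zip': '02138', 'sex': 'F'}
```

Moving further along the continuum toward anonymization would require additional transformations of the surviving fields, for example generalizing the zip code or coarsening the date of birth, which is exactly the trade-off the rest of this paper examines.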
As we will show in this article, anonymizing a data set requires both technical insight and careful analysis. We put forward a high-level methodology for sharing information without violating privacy, that is, for guaranteeing the anonymization of databases.