data de-identification [English]

Syndetic Relationships

InterPARES Definition

n. ~ A process to protect confidential information in a dataset by removing or obfuscating identifiable information (US) or personal data (EU) that could allow an individual to be identified.


  • Geber et al. 2009 (†688): Data de-identification is the removing of all, some, or portions of identifiers (e.g., name, address, Social Security number or Social Insurance number) from the data prior to use in testing or production environments or release to third parties. While this has been the predominant method of data sanitization or obfuscation, it is important to realize that de-identified data may be subject to re-identification by utilizing categorical (i.e., demographic characteristics) or numerical data. Because of the risks associated with data being re-identified and the relatively small additional overhead of applying masking over de-identification alone, depending on the intended use of the data, often data masking is a superior option for non-production use or analysis of data at a non-aggregate level (i.e., analyzing individual records rather than sets of records). (†1600)
  • Malin, et al. 2003 (†722 p.1): Until recently, it was believed that if data looked anonymous, it was anonymous. Tables, in which each row of information related to a person, were shared somewhat freely provided none of the columns included explicit identifiers, such as name, address, or Social Security number. This kind of “de-identified” data can often be linked to other tables that do include explicit identifiers (“identified data”) to re-identify people by name. Fields appearing in both de-identified and identified tables link the two, thereby relating names to the subjects of the de-identified data. (†1647)
  • Ohm 2010 (†712 p.1716): Privacy lawyers tend to refer to release-and-forget anonymization techniques using two other names: deidentification and the removal of personally identifiable information (PII). Deidentification has taken on special importance in the health privacy context. Regulations implementing the privacy provisions of the Health Insurance Portability and Accountability Act (HIPAA) expressly use the term, exempting health providers and researchers who deidentify data before releasing it from all of HIPAA’s many onerous privacy requirements. (†1630)
  • Richardson, et al. 2015 (†714 p.85): The science of de-identification continues to advance, and data de-identification has become an accepted form of protecting the confidentiality of personal information under federal regulation. At the same time, re-identification studies have continued to focus on data disclosures that fail to meet any modern standard of de-identification. Thus, while public health organizations may lack specific guidance on how to de-identify data in a way permissible under their applicable state confidentiality laws, they can reasonably rely on the efficacy of modern de-identification techniques, so long as the governing confidentiality standard allows for the disclosure of data that does not identify an individual. (†1634)
  • Richardson, et al. 2015 (†714 p.84): In recent years, researchers have studied techniques to re-identify purportedly confidential datasets. These studies often report startling high success rates, and have caused some scholars to question the efficacy of de-identification entirely. For example, an often cited 2000 study found 87% of the U.S. population could be uniquely identified by their combination of gender, date of birth, and zip code. Even when new researchers replicated the study to reflect a growing population, they still found 63% of the population uniquely identifiable using these variables. Out of context, these numbers are startling. In reality, however, these unique and exact combinations of gender, date of birth, and zip code would never be present in a de-identified dataset. Such combinations are either generalized or removed entirely, drastically reducing the risk of re-identification. In the latter study above, researchers found the risk of unique identification dropped sharply when given slightly more abstract data. When they replaced an individual’s full date of birth with only the month and year, only 4.2% of the population remained uniquely identifiable, and when they also replaced zip code with county, just 0.2% remained uniquely identifiable. More impressively, data de-identified using the HIPAA safe harbor method is said to present only a .04% risk of unique identification. Still, the majority of re-identification studies continue to target data that is not truly de-identified, leading to what some call “the myth of easy re-identification.” While academics and scientists debate de-identification’s merits, however, a more pertinent question has been neglected: is sharing de-identified data legal? (†1635)
  • Wikipedia (†387 s.v. de-identification): The process used to prevent a person’s identity from being connected with information.... [C]ommon strategies for de-identifying datasets are deleting or masking personal identifiers, such as name and social security number, and suppressing or generalizing quasi-identifiers, such as date of birth and zip code. (†1653)
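
Illustrative sketch

The citations above describe three mechanics: deleting explicit identifiers, linking the remaining fields to an identified table, and generalizing quasi-identifiers to reduce uniqueness. The Python sketch below illustrates them on a toy dataset. It is not drawn from any of the cited sources. The records, field names, the split into direct and quasi-identifiers, and the helper functions (remove_direct_identifiers, generalize, unique_fraction, link) are all hypothetical, and the generalization rule (month and year of birth, three-digit zip prefix) is only one possible choice, not the HIPAA safe harbor rule.

```python
from collections import Counter

# Hypothetical toy records; every name and value is invented for illustration.
RECORDS = [
    {"name": "Ann Lee", "ssn": "000-00-0001", "gender": "F",
     "dob": "1980-03-14", "zip": "02139", "diagnosis": "asthma"},
    {"name": "Bo Chan", "ssn": "000-00-0002", "gender": "F",
     "dob": "1980-03-02", "zip": "02139", "diagnosis": "flu"},
    {"name": "Cy Diaz", "ssn": "000-00-0003", "gender": "M",
     "dob": "1975-11-30", "zip": "02141", "diagnosis": "asthma"},
    {"name": "Di Evans", "ssn": "000-00-0004", "gender": "M",
     "dob": "1975-11-05", "zip": "02141", "diagnosis": "diabetes"},
]

# Assumed classification of fields for this example only.
DIRECT_IDENTIFIERS = ("name", "ssn")
QUASI_IDENTIFIERS = ("gender", "dob", "zip")


def quasi_key(record):
    """Return the record's combination of quasi-identifier values."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def remove_direct_identifiers(record):
    """Drop explicit identifiers; quasi-identifiers are left untouched."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def generalize(record):
    """Coarsen quasi-identifiers: birth month and year, 3-digit zip prefix."""
    out = dict(record)
    out["dob"] = record["dob"][:7]          # "YYYY-MM-DD" -> "YYYY-MM"
    out["zip"] = record["zip"][:3] + "**"   # "02139" -> "021**"
    return out


def unique_fraction(records):
    """Share of records whose quasi-identifier combination occurs only once."""
    counts = Counter(quasi_key(r) for r in records)
    return sum(1 for r in records if counts[quasi_key(r)] == 1) / len(records)


def link(deidentified, identified):
    """Map row index -> name where exactly one identified row matches."""
    index = {}
    for row in identified:
        index.setdefault(quasi_key(row), []).append(row["name"])
    matches = {}
    for i, row in enumerate(deidentified):
        names = index.get(quasi_key(row), [])
        if len(names) == 1:
            matches[i] = names[0]
    return matches


# A hypothetical public table that carries names plus the same quasi-identifiers.
IDENTIFIED = [
    {"name": r["name"], "gender": r["gender"], "dob": r["dob"], "zip": r["zip"]}
    for r in RECORDS
]

stripped = [remove_direct_identifiers(r) for r in RECORDS]
generalized = [generalize(r) for r in stripped]

print(unique_fraction(stripped), link(stripped, IDENTIFIED))
print(unique_fraction(generalized),
      link(generalized, [generalize(r) for r in IDENTIFIED]))
```

In this toy data, removing only names and Social Security numbers leaves every record unique on gender, date of birth, and zip code, so each row links back to a name. After generalization, each combination is shared by two records and the linkage no longer yields a single match. Real datasets need a proper risk assessment rather than this kind of spot check.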