InterPARES Trust AI - Artificial Intelligence

data anonymization [English]

Syndetic Relationships

InterPARES Definition

n. ~ A process to protect personal information by removing or obfuscating personally identifiable information (US) or personal data (EU) that relates to an individual or that could allow the individual to be identified.

General Notes

Data anonymization is often used to make restricted datasets more widely accessible while protecting sensitive or personally identifiable information (US) or personal data (EU). It may use a combination of techniques, such as deleting, encrypting, or obfuscating specific data elements that are unique to an individual and could directly identify that person, such as name, Social Security Number, or email address. Effective data anonymization is particularly challenging because a group of seemingly unrelated elements can identify individuals indirectly. For example, birth date, sex, and ZIP code together can identify many individuals, even in large datasets (Sweeney 2000). Unlike data obfuscation, data anonymization does not require that the resulting dataset have the same functionality as the original dataset.
Some sources distinguish anonymization from pseudonymization. In this view, anonymization is limited to a specific dataset and makes it impossible to link information about an individual to other datasets. Pseudonymization instead creates an artificial identifier for each individual, which can be used to link information about that individual across multiple datasets.
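
The distinction between the two approaches can be illustrated with a minimal sketch in Python. It is an illustration under stated assumptions, not a method prescribed by any of the cited sources: the field names, the sample records, and the helper functions `anonymize` and `pseudonymize` are hypothetical, and a keyed hash (HMAC-SHA256) is only one possible way to derive an artificial identifier.

```python
import hashlib
import hmac

# Hypothetical names for the fields that directly identify an individual.
DIRECT_IDENTIFIERS = {"name", "ssn", "email"}


def anonymize(record):
    """Drop direct identifiers, leaving nothing that links the record elsewhere."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def pseudonymize(record, secret_key, id_field="email"):
    """Replace direct identifiers with a keyed pseudonym that is stable across datasets."""
    pseudonym = hmac.new(
        secret_key, record[id_field].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    result = anonymize(record)
    result["pseudonym"] = pseudonym
    return result


# Two made-up records about the same person, held in two different datasets.
record_a = {"name": "Jane Doe", "ssn": "000-00-0000",
            "email": "jane@example.org", "diagnosis": "asthma"}
record_b = {"name": "J. Doe", "email": "jane@example.org", "purchase": "inhaler"}

key = b"demo-key-not-for-production"  # assumption: a secret held by the data custodian
anon = anonymize(record_a)
pseudo_a = pseudonymize(record_a, key)
pseudo_b = pseudonymize(record_b, key)
```

Here `anon` keeps only the diagnosis and cannot be joined to the second dataset, while `pseudo_a` and `pseudo_b` carry the same pseudonym and can therefore be linked. Removing direct identifiers in this way does not by itself guarantee anonymity, because the remaining elements may still identify individuals indirectly.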

Other Definitions

Health Informatics 2008 (†703 s.v. "anonymized data"): Data from which the patient cannot be identified by the recipient of the information. [SOURCE: General Medical Council Confidentiality Guidance]

Citations

[UK] Minister of State 2012 (†663 p. 7): Data relating to a specific individual where the identifiers have been removed to prevent identification of that individual. (†1514)
EPIC 2015 (†680 ): In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period. Netflix “anonymized” the data before releasing it by removing usernames. Still, Netflix assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends. Researchers used this information to uniquely identify individual Netflix users. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. (†1556)
Geber et al. 2009 (†688 ): Data anonymization allows for the maintaining of exact values of data or retaining precise data value distribution, and therefore allows the data to precisely represent production data because it is unaltered, yet it is anonymous because it is unreadable. At a high level, this is accomplished by applying one-way cryptographic hashing to data elements. Data anonymization is utilized to perform a variety of analytical and business intelligence functions on data, including marketing data analysis, fraud detection, and consolidation of customer data. (†1601)
Geber et al. 2009 (†688 ): Often analyses of data for business intelligence, such as purchasing trends, services usage patterns, and customer satisfaction results, can be performed at an aggregate level or without a need to know precisely which individual is attributed to which values. With an understanding of the analyses intended to be performed, appropriate data masking techniques can be selected to produce extremely similar results (within an acceptable tolerance percentage) that the same analysis operations would produce for production data. Therefore, mitigating the risks of allowing analysts, particularly those located overseas, to handle large amounts of production data becomes practical through data masking. For functions requiring a high degree of precision and the comparison of individual data elements, data anonymization may be a viable option. (†1605)
Ohm 2010 (†679 p. 1707): The reverse of anonymization is reidentification or deanonymization. A person, known in the scientific literature as an adversary, reidentifies anonymized data by linking anonymized records to outside information, hoping to discover the true identity of the data subjects. (†1555)
Sedayo 2012 (†681 p. 1): Data anonymization makes data worthless to others, while still allowing Intel IT to process it in a useful way. [We conducted a proof of concept that] was successful in demonstrating that data anonymization can work and that obscured data is still useful for analysis. (†1558)
Sedayo 2012 (†681 p. 2): Anonymization is a technique that enterprises can use to increase the security of data in the public cloud while still allowing the data to be analyzed and used. Data anonymization is the process of changing data that will be used or published in a way that prevents the identification of key information. (†1559)
Sweeney 2000 (†678 p. 2): It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on [5-digit ZIP, gender, date of birth]. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only [place, gender, date of birth], where place is basically the city, town, or municipality in which the person resides. And even at the county level, [county, gender, date of birth] are likely to uniquely identify 18% of the U.S. population. In general, few characteristics are needed to uniquely identify a person. (†1551)
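
The kind of uniqueness measurement Sweeney describes can be sketched in a few lines of Python. This is a hedged illustration only: the function name `unique_fraction`, the field names, and the four sample records are invented for the example and do not reproduce Sweeney's data or method in detail.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Made-up records with no direct identifiers (no name, SSN, or email).
records = [
    {"zip": "02138", "sex": "F", "birth_date": "1960-01-15"},
    {"zip": "02138", "sex": "F", "birth_date": "1960-01-15"},
    {"zip": "02138", "sex": "M", "birth_date": "1971-07-04"},
    {"zip": "02139", "sex": "F", "birth_date": "1985-11-30"},
]

fraction_all = unique_fraction(records, ["zip", "sex", "birth_date"])
fraction_sex = unique_fraction(records, ["sex"])
```

In this toy dataset half of the records are unique on the combination of ZIP code, sex, and birth date, compared with a quarter on sex alone. A record that is unique on such a combination can be reidentified by anyone who holds an outside dataset containing the same elements alongside a name.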