InterPARES Trust AI - Artificial Intelligence

data sanitization [English]

Syndetic Relationships

RT: data masking; data obfuscation; media sanitization
SYN: sanitized data

InterPARES Definition

n. ~ 1. Computing · A process to destroy data so that it is irretrievable. – 2. Computing · A process to remove or obfuscate confidential or restricted data in a dataset. – 3. Computing · A process to ensure that data contains no malicious code before it is processed.

General Notes

Data sanitization¹ commonly connotes the irretrievable destruction of information, using media sanitization techniques that range from overwriting data to physically destroying the media. Data sanitization² is sometimes used more generally to mean removing or masking personal or confidential information in a dataset. Data sanitization³ is exemplified by checking data submitted over the web for SQL injection attacks and escaping that data before it is processed.
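
As a rough illustration of senses 2 and 3, the sketch below uses only the Python standard library. The function names (mask_email, find_user, demo), the users table, and the sample values are hypothetical and do not come from any cited source. For sense 3 it binds untrusted input as a query parameter, a different technique from manually escaping the data, though it serves the same purpose. The masking function is a toy, not a vetted anonymization scheme: an unsalted digest of a short value can be guessed.

```python
import hashlib
import sqlite3


def mask_email(email: str) -> str:
    """Sense 2 (sketch): hide the local part of an address behind a short digest.

    Assumption: the domain is not sensitive and may be kept for testing.
    """
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return f"user_{digest}@{domain}"


def find_user(conn: sqlite3.Connection, name: str) -> list:
    """Sense 3 (sketch): pass untrusted input as a bound parameter.

    The "?" placeholder keeps the input out of the SQL text, so it is
    compared as a plain value and never parsed as SQL.
    """
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


def demo():
    """Build a throwaway in-memory table and query it with safe and hostile input."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    injected = "alice' OR '1'='1"  # classic injection attempt
    return find_user(conn, "alice"), find_user(conn, injected)
```

In this sketch the injection string matches no row, because it is treated as a literal name instead of being interpreted as part of the query.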

Citations

CMU 2011 (†706 p. 17): Media sanitization is a process by which data is irreversibly removed from media or the media is permanently destroyed. The following table defines baseline controls for sanitization and disposal of media that records and/or stores Institutional Data. (†1620)
Net 2000 2010 (†700 p. 22): Given the legal and organizational operating environment of today, many test and development databases will require some form of sanitization in order to render the informational content anonymous (†1595)
Net 2000 2010B (†701 p. 2): Data Sanitization is the process of disguising sensitive information in test and development databases by overwriting it with realistic looking but false data of a similar type. (†1596)
Rajalaxmi and Natarajan 2012 (†724 p. 934): Data sanitization approaches hide the sensitive knowledge by modifying the original database. Usually, these approaches hide either frequent itemsets or utility itemsets, but not both. Also, frequent itemset hiding considers the presence or absence of items, whereas utility itemset hiding deals with internal and external utility of items. When support and utility of the itemsets are combined, it produces itemsets with high utility and support. When the data owner intends to hide sensitive utility and frequent itemsets, it is not possible to use frequent itemset hiding approaches since they reveal certain sensitive itemsets even after sanitization. (†1651)
Rajalaxmi and Natarajan 2012 (†724 p. 936): ... There are subtle differences between data perturbation and data sanitization. First, data perturbation mainly focuses on individual data privacy whereas data sanitization methods aim to protect sensitive knowledge. In data perturbation, data utility is measured with the accurate aggregate statistical information while data sanitization measures data utility based on the ability to discover non-sensitive patterns. Also, data perturbation techniques have assumptions about the data distribution whereas data sanitization does not consider the distribution. (†1652)
UCR 2011 (†705): The process of deliberately, permanently, and irreversibly removing or destroying the data stored on a memory device. A device that has been sanitized has no usable residual data and even advanced forensic tools should not ever be able [to] recover erased data. (†1619)
Wikipedia (†387 s.v. "sanitization (classified information)"): Sanitization is the process of removing sensitive information from a document or other message (or sometimes encrypting it), so that the document may be distributed to a broader audience. When the intent is secrecy protection, such as in dealing with classified information, sanitization attempts to reduce the document's classification level, possibly yielding an unclassified document. When the intent is privacy protection, it is often called data anonymization. Originally, the term sanitization was applied to printed documents; it has since been extended to apply to computer media and the problem of data remanence as well. (†1618)