data masking [English]


InterPARES Definition

n. ~ A process to conceal restricted information using a variety of techniques to replace or alter the data.

General Notes

Techniques include substituting the information with random characters, with NULL values, or with functionally equivalent values, as well as encrypting it.
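
A minimal Python sketch of the substitution techniques named above (illustrative only; the function names, the sample record, and the pool of replacement names are hypothetical, not from the source):

```python
import random
import string

def mask_random(value: str) -> str:
    """Replace each character with a random one of the same class,
    keeping separators so the original formatting survives."""
    def swap(ch: str) -> str:
        if ch.isdigit():
            return random.choice(string.digits)
        if ch.isalpha():
            return random.choice(string.ascii_letters)
        return ch
    return "".join(swap(ch) for ch in value)

def mask_null(value):
    """Substitute the value with NULL (None in Python)."""
    return None

def mask_equivalent(value, substitutes):
    """Substitute a functionally equivalent value drawn from a pool
    of realistic-looking replacements."""
    return random.choice(substitutes)

# Hypothetical record for illustration only.
record = {"name": "Ada Lovelace", "ssn": "078-05-1120"}
masked = {
    "name": mask_equivalent(record["name"], ["Jane Roe", "John Doe"]),
    "ssn": mask_random(record["ssn"]),
    "notes": mask_null(record.get("notes")),
}
print(masked)  # e.g. {'name': 'John Doe', 'ssn': '392-61-8845', 'notes': None}
```

The remaining technique, encryption, would in practice be applied with an established library rather than hand-rolled code.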

Citations

  • Gartner IT Glossary (†298 s.v. "redaction tools"): Redaction tools have been included as part of document imaging products since their introduction. In electronic documents, redaction refers to the permanent removal of information, not the masking or obfuscating of data. (†1591)
  • Geber et al. 2009 (†688): Data masking allows us to generate faux, yet representative, data for use in the full Systems Development Life Cycle (“SDLC”)–which includes application development, unit testing, systems testing, user acceptance testing, and performance testing–or for specific business intelligence purposes (e.g., statistical analyses, profitability analysis). ¶ Consequently, data masking allows for maintaining: · Representative data in volume; data quantity and size used in testing or data analysis matches what is found in production systems. This is particularly important for performance testing. · Representative data in value distribution; when data is used for testing purposes, it need only have the same or similar values found in production data, without revealing or corresponding to individuals’ data. When data is used for analysis purposes, often it is not the values belonging to an individual record that are of interest; instead the data at the aggregate level may be analyzed using statistical techniques. ¶ This allows for the maintaining of data utility while protecting against: · Identity Disclosure, which occurs when an individual record can be tied to a particular entity; the identity of an individual can thus be inferred from the data. · Value Disclosure, which occurs when the value of a confidential attribute for a particular entity (the value of one or more variables) can be inferred from the data. (†1599)
  • Geber et al. 2009 (†688): Often analyses of data for business intelligence, such as purchasing trends, services usage patterns, and customer satisfaction results, can be performed at an aggregate level or without a need to know precisely which individual is attributed to which values. With an understanding of the analyses intended to be performed, appropriate data masking techniques can be selected to produce extremely similar results (within an acceptable tolerance percentage) that the same analysis operations would produce for production data. Therefore, mitigating the risks of allowing analysts, particularly those located overseas, to handle large amounts of production data becomes practical through data masking. For functions requiring a high degree of precision and the comparison of individual data elements, data anonymization may be a viable option. (†1604)
  • Net 2000 2010 (†700 p. 2): Data Masking is the replacement of existing sensitive information in test or development databases with information that looks real but is of no use to anyone who might wish to misuse it. In general, the users of the test, development or training databases do not need to see the actual information as long as what they are looking at looks real and is consistent. (†1593)
  • Net 2000 2010 (†700 p. 8): Masking data, besides being the generic term for the process of data anonymization, means replacing certain fields with a mask character (such as an X). This effectively disguises the data content while preserving the same formatting on front end screens and reports. [For example, after masking, a credit card number might appear as 4346 XXXX XXXX 5379.] (†1594) [A code sketch of this technique appears after the citations.]
  • Wikipedia (†387 s.v. "data masking"): Data masking or data obfuscation is the process of hiding original data with random characters or data. The main reason for applying masking to a data field is to protect data that is classified as personal identifiable data, personal sensitive data or commercially sensitive data, however the data must remain usable for the purposes of undertaking valid test cycles. It must also look real and appear consistent. (†1592)
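
The Net 2000 citation (†1594) above describes the mask-character variant: the middle characters of a field are replaced with a mask character such as X, while the outer digits and the separators are preserved so the field still formats correctly on screens and reports. A minimal Python sketch of that idea (the function name and the sample card number are invented for illustration):

```python
def mask_with_character(value: str, keep: int = 4, mask: str = "X") -> str:
    """Replace all but the first and last `keep` digits with the mask
    character, leaving spaces and other separators untouched."""
    digit_positions = [i for i, ch in enumerate(value) if ch.isdigit()]
    middle = set(digit_positions[keep:-keep])  # indices of the inner digits
    return "".join(mask if i in middle else ch for i, ch in enumerate(value))

print(mask_with_character("4346 8721 0904 5379"))  # -> 4346 XXXX XXXX 5379
```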