data obfuscation [English]


Syndetic Relationships

InterPARES Definition

n. ~ A process to protect personal or confidential information by replacing or altering restricted data elements (or combinations of data elements) with functionally equivalent values.

General Notes

Data obfuscation requires that the resulting dataset have the same functionality as the original. It is often used when individuals developing or testing a system are not authorized to access the data itself. While encryption would prevent those individuals from accessing the data, it would change the format in such a way that the data's functionality is lost. For example, code that checks data values for unallowable characters or patterns (a wildcard in a string, a malformed identification number) would not work with encrypted data, which may have a different format. As much as possible, the obfuscated data should be realistic; for example, names should be replaced with common names rather than random text strings.
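
The point about format checks can be illustrated with a minimal Python sketch. The identification-number format (three digits, a hyphen, four digits), the sample values, and the function name is_valid_id are assumptions made for illustration only. The XOR-plus-Base64 step is a stand-in for ciphertext, not real encryption; it only shows that the encoded value no longer has the original format.

```python
import base64
import re

# Hypothetical format rule: three digits, a hyphen, then four digits.
ID_PATTERN = re.compile(r"\d{3}-\d{4}")

def is_valid_id(value: str) -> bool:
    """Return True when the whole value matches the assumed ID format."""
    return ID_PATTERN.fullmatch(value) is not None

original = "123-4567"

# Obfuscated stand-in: a different, fictitious value in the same format.
obfuscated = "805-1192"

# Stand-in for ciphertext (NOT real encryption): XOR each byte with a
# fixed key, then Base64-encode the result so it can be stored as text.
key = 0x5A
ciphertext = base64.b64encode(
    bytes(b ^ key for b in original.encode("ascii"))
).decode("ascii")

print(is_valid_id(original))    # the real value passes the check
print(is_valid_id(obfuscated))  # the obfuscated value passes too
print(is_valid_id(ciphertext))  # the encoded value fails the check
```

Under these assumptions, validation code written for the real data can be exercised against the obfuscated value but not against the encoded one.
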
   One technique, shuffling, randomizes and recombines the values of different data elements so that (for example) each original name is associated with an address and phone number drawn from other records. Another technique alters numerical values, such as wages and dates, randomly within a range of the original.
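
Shuffling and numeric variance can be sketched in Python as follows. The record layout, the sample people, the 10% variance range, and the helper names shuffle_column and vary_number are all hypothetical, chosen only to illustrate the two techniques; this is not a prescribed implementation.

```python
import random

def shuffle_column(records, field, rng):
    """Return copies of the records with one field's values reassigned at random."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

def vary_number(value, fraction, rng):
    """Return the value altered randomly within +/- fraction of the original."""
    return value * (1 + rng.uniform(-fraction, fraction))

# Fictitious sample records (assumed layout).
people = [
    {"name": "Ann Lee", "phone": "555-0101", "wage": 52000.0},
    {"name": "Raj Patel", "phone": "555-0102", "wage": 61000.0},
    {"name": "Maria Gomez", "phone": "555-0103", "wage": 48000.0},
]

rng = random.Random(42)  # seeded only so that repeated runs match

# Shuffle phone numbers across records, then vary each wage by up to 10%.
masked = shuffle_column(people, "phone", rng)
masked = [
    {**r, "wage": round(vary_number(r["wage"], 0.10, rng), 2)}
    for r in masked
]

for row in masked:
    print(row)
```

A plain random shuffle can leave some values in their original position, particularly in small datasets, so a production tool would likely need an additional check for that case.
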

Other Definitions

  • Gartner IT Glossary (†298 s.v. "application obfuscation"): A set of technologies used to protect an application and its embedded intellectual property (IP) from application-level intrusions, reverse engineering and hacking attempts. Application obfuscation tools protect the application code as the increasing use of intermediate language representations (such as Java and .NET) enables hackers to easily reverse-engineer IP embedded in software.
  • Greenway 2009 (†514 ): Data obfuscation enables the hiding of sensitive data from insiders (e.g. application developers and testers) while keeping the obfuscated data realistic and therefore testable.

Citations

  • Bakken, et al. 2004 (†515 ): In some domains, the need for data privacy and data sharing conflict. Data obfuscation addresses this dilemma by extending several existing technologies and defining obfuscation properties that quantify the technologies' usefulness and privacy preservation. (†807)
  • Edgar 2004 (†694 p. 4-7, passim.): There are a variety of Data Sanitization techniques available: NULL'ing out, masking data, substitution, shuffling records, number variance, gibberish generation, and encryption/decryption. (†1581)
  • Geber et al. 2009 (†688 ): Through an understanding of business processes, data flows, and the application of advanced data obfuscation – including data masking, data deidentification, and data anonymization – your organization or client can achieve a great number of business goals and continue to perform current functions with out using the actual, real PII and sensitive data of customers and employees, thus severely reducing risk and liability. ¶ Techniques for data obfuscation include data masking, data de-identification, and data anonymization. (†1598)
  • Geber et al. 2009 (†688 ): Data obfuscation is most often utilized for generating data for development and test environments, as well as many data analysis functions. Data obfuscation is usually applied in batch; a masked copy of a database or databases is created for later use. (†1603)
  • Greenway 2009 (†514 ): Data obfuscation techniques must satisfy a basic rule: the obfuscated data should satisfy the same business rules as the real data. . . . ¶  Data obfuscation is the concealment of meaning in data or information usage, making it confusing and harder to interpret. ¶ Obfuscation is essentially the technique used to de-identify data. ¶ The terms obfuscation and data encryption are often intermixed although they are fundamentally different. Encryption prevents non-authorised users from understanding the data. Typically, encryption can be applied when the 'data is at rest', in order to protect the data against data loss; encryption can also be applied 'in transit', which protects the information from being compromised during transmission. However, with encryption, authorised users can still have access to the underlying data. Data obfuscation protects individual's data in non-production environments by replacing it with representative but fictitious data. In the event of a data loss involving obfuscated data, a non-authorised user may be able to read the data (including field headings), however it will not reflect any individual's details. (†805)
  • Greenway 2009 (†514 ): Data obfuscation (which is also sometimes referred to as data anonymisation, data masking, data privacy, data scrambling) - the test data is built from a sub-set of the production data that has been subject to a number of techniques designed to obscure the origin of the data. Specifically those techniques must prevent personally identifiable information or sensitive information from being identified from data. The techniques must not allow the original data to be re-created by reverse engineering. (†1607)
  • ISACA Glossary (†743 s.v. obfuscation): The deliberate act of creating source or machine code that is difficult for humans to understand. (†1789)
  • NIST 2013 (†734 p. F-180): Organizations use a combination of hardware and software techniques for tamper resistance and detection. Organizations employ obfuscation and self-checking, for example, to make reverse engineering and modifications more difficult, time-consuming, and expensive for adversaries. (†1829)
  • Parameswaran and Blough (†516 p. 1): Data Obfuscation (DO) techniques distort data in order to hide information. One application area for DO is privacy preservation. Many data obfuscation techniques have been suggested and implemented for privacy preserving data mining applications. However, existing approaches are either not robust to privacy attacks or they do not preserve data clusters, thereby making it difficult to apply data mining techniques (†809)
  • Wikipedia (†387 s.v. "data masking"): Data masking or data obfuscation is the process of hiding original data with random characters or data. ¶ The main reason for applying masking to a data field is to protect data that is classified as personal identifiable data, personal sensitive data or commercially sensitive data, however the data must remain usable for the purposes of undertaking valid test cycles. . . . ¶ The primary concern from a corporate governance perspective is that personnel conducting work in these non-production environments are not always security cleared to operate with the information contained in the production data. This practice represents a security hole where data can be copied by unauthorised personnel and security measures associated with standard production level controls can be easily bypassed. This represents an access point for a data security breach. ¶ Data masking techniques [described in detail in the entry] include substitution, shuffling, number and data variance, encryption, nulling out or deletion, and masking out. (†806)
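
Two of the simpler techniques named in the citations above, nulling out and masking out, can be sketched in Python. The function names, the record layout, the choice to keep the last four characters visible, and the fictitious sample values are assumptions for illustration only.

```python
def null_out(record, fields):
    """Return a copy of the record with the named fields set to None."""
    return {k: (None if k in fields else v) for k, v in record.items()}

def mask_out(value, visible=4, mask_char="X"):
    """Mask all but the last `visible` alphanumeric characters.

    Separators such as hyphens are kept so the layout stays recognizable.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - visible, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)

# Fictitious sample record (assumed layout).
record = {
    "name": "Ann Lee",
    "card": "4111-1111-1111-1234",
    "notes": "call after 5pm",
}

cleaned = null_out(record, {"notes"})
cleaned["card"] = mask_out(cleaned["card"])
print(cleaned)
```

Both techniques discard information rather than replace it with realistic values, so a masked field may no longer pass checks such as a digits-only rule.
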