n. ~ The quality of data with regard to accuracy and truthfulness.
In the context of big data, the veracity of publicly available datasets should not be assumed. Ballard (2014, 8) gives two examples: a compilation of information about individuals that is incorrect because its data was not verified against other authoritative datasets, and customer-review datasets that contain false or malicious reports.
- Ballard 2014 (†528 p. 7): Veracity or "data trustworthiness" has the following three characteristics:
  · The quality or cleanliness/consistency/accuracy of the data
  · The provenance or source of the data, along with its lineage over time
  · The intended usage because the usage can dramatically affect what is considered an acceptable level of trust or quality
  · Where did the data come from?
  · Did the data originate internally within the organization or externally?
  · Is it publicly available data, such as phone numbers, or is it behavioral data from a data aggregator?
  · Is the data from a transaction that can be audited and proven?
  · Is the data truth or opinion?
  · Is the data an intentional fabrication?
  · Is the raw data usable as is, such as in the case of fraud detection, where identifying the aberrations are the focus, or does it require standardization and cleansing?
  · What governance methods does an organization use to vet and rank veracity and classify its dimensions? As more data sources move from internal to acquired externally, this issue has become more pressing.
  · How do you classify the trust factor? Organizations seriously must consider classification as part of the governance process. (†841)
- Ballard 2014 (†528 p. 8): [Under discussion of veracity] Publicly available data is another area that warrants further examination. Public data does not necessarily translate into accurate and reliable data, especially when multiple sources are combined. For example, a man who died in 1979 is listed by a website that advertises itself as a “People Search Engine”. The website used consolidated public data to determine that the man is 92 years old and living in Florida, although he never lived there and his wife had remarried and later was widowed again. He also gained an incorrect middle initial with his move, which just happens to be the initial of his wife. Had that website managed its master data effectively and cross-referenced it with the Social Security Death Index (SSDI), it would have concluded that he died in 1979. The same holds true for social data, whether from social media or product or service reviews. Extracting sentiment through text analytics is a useful activity in terms of discerning reactions to, for example, products and events. In doing so, analysts must recognize there is a good likelihood that some or all of the postings or reviews could be false, either paid for by companies or individuals seeking to improve their ratings, or to disparage competitors. Gartner Group estimates that by 2014, false reviews constitute 10-15% of all reviews. Case in point: A travel website with over 200 million unique visitors per month recently removed over 100 reviews that were created by a hotel chain executive. The executive created positive reviews of his properties while creating negative reviews of his rivals. (†2704)
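The cross-referencing step that Ballard describes, checking a consolidated record against an authoritative dataset before trusting it, might be sketched roughly as follows. This is an illustration only: the record layout, the sample name and birth year, the `DEATH_INDEX` lookup table, and the `check_veracity` function are all hypothetical, and a real index such as the SSDI would need far more careful identity matching than an exact name and birth year.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class PersonRecord:
    """A consolidated record, as an aggregator might publish it (hypothetical layout)."""
    name: str
    birth_year: int
    listed_as_living: bool


# Hypothetical stand-in for an authoritative death index:
# maps (name, birth_year) to the recorded year of death.
DEATH_INDEX: Dict[Tuple[str, int], int] = {
    ("John Q. Example", 1922): 1979,
}


def check_veracity(
    record: PersonRecord,
    death_index: Dict[Tuple[str, int], int],
) -> Optional[str]:
    """Return a description of the conflict, or None if no conflict is found.

    Finding no conflict does not prove the record is accurate; it only means
    this one authoritative source does not contradict it.
    """
    death_year = death_index.get((record.name, record.birth_year))
    if death_year is not None and record.listed_as_living:
        return (
            f"{record.name}: listed as living, but the authoritative "
            f"index records a death in {death_year}"
        )
    return None


if __name__ == "__main__":
    aggregated = PersonRecord("John Q. Example", 1922, listed_as_living=True)
    print(check_veracity(aggregated, DEATH_INDEX))
```

Returning a description of the conflict rather than silently correcting the record keeps the discrepancy visible, so that whoever governs the data can decide which source to trust.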