big data (p. 2): Big data is defined as “Extracting insight from an immense volume, variety, and velocity of data, in context, beyond what was previously possible.” (This is the IBM definition). Because data has become so voluminous, complex, and accelerated in nature, traditional computing methods no longer suffice. . . . Big data represents the confluence of having new types of available information, new technical capability, and processing capacity, and the desire and belief that it can be used to solve problems that were previously impossible. At the same time, many of the concepts of big data are not new. (†837)
big data (p. 3): The origins of the three key dimensions of “big data” (volume, variety, and velocity) were first described by Gartner Group’s Doug Laney in 2001 in a research paper, where he conveyed the impact of e-commerce on data volumes, increased collaboration, and the desire to more effectively use information as a resource. [http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf] (†839)
big data (p. 4): Big data is most frequently described according to the three dimensions or characteristics that Doug Laney first described: volume, variety, and velocity. More recently, experienced big data practitioners have come to appreciate that veracity is also a key element, that is, the “trust factor” behind the data. Although technologists do not always acknowledge the final “V” (value), gaining insight needs to translate into business value in order for any big data effort to be truly successful. (†840)
data governance (p. 26): The core disciplines of data governance cover data quality management, Information Lifecycle Management, and information security and privacy. (†843)
information governance (p. 16): Information Governance is the glue that drives value and mitigates risk. There are several key areas where Information Governance for big data is critical, such as metadata management, security and privacy, data integration and data quality, and master data management. It is interesting to note that big data innovators recognize the importance of governance to the success of their projects. (†842)
veracity (p. 7): Veracity or "data trustworthiness" has the following three characteristics:
· The quality or cleanliness/consistency/accuracy of the data
· The provenance or source of the data, along with its lineage over time
· The intended usage because the usage can dramatically affect what is considered an acceptable level of trust or quality
· Where did the data come from?
· Did the data originate internally within the organization or externally?
· Is it publicly available data, such as phone numbers, or is it behavioral data from a data aggregator?
· Is the data from a transaction that can be audited and proven?
· Is the data truth or opinion?
· Is the data an intentional fabrication?
· Is the raw data usable as is, such as in the case of fraud detection, where identifying the aberrations is the focus, or does it require standardization and cleansing?
· What governance methods does an organization use to vet and rank veracity and classify its dimensions? As more data sources move from internal to acquired externally, this issue has become more pressing.
· How do you classify the trust factor? Organizations seriously must consider classification as part of the governance process. (†841)
veracity (p. 8): [Under discussion of veracity] Publicly available data is another area that warrants further examination. Public data does not necessarily translate into accurate and reliable data, especially when multiple sources are combined. For example, a man who died in 1979 is listed by a website that advertises itself as a “People Search Engine”. The website used consolidated public data to determine that the man is 92 years old and living in Florida, although he never lived there and his wife had remarried and later was widowed again. He also gained an incorrect middle initial with his move, which just happens to be the initial of his wife. Had that website managed its master data effectively and cross-referenced it with the Social Security Death Index (SSDI), it would have concluded that he died in 1979.
The same holds true for social data, whether from social media or product or service reviews. Extracting sentiment through text analytics is a useful activity in terms of discerning reactions to, for example, products and events. In doing so, analysts must recognize there is a good likelihood that some or all of the postings or reviews could be false, either paid for by companies or individuals seeking to improve their ratings, or to disparage competitors. Gartner Group estimates that by 2014, false reviews constitute 10-15% of all reviews. Case in point: A travel website with over 200 million unique visitors per month recently removed over 100 reviews that were created by a hotel chain executive. The executive created positive reviews of his properties while creating negative reviews of his rivals. (†2704)