InterPARES Trust AI - Artificial Intelligence

data mining [English]

Other Languages

minería de datos (Spanish)
mineração de dados (Portuguese)

Syndetic Relationships

RT: big data; text mining
SF: knowledge discovery

InterPARES Definition

n. ~ An approach to discover implicit patterns, often non-obvious, in very large data sets (big data) through a variety of techniques of analysis, categorization, clustering and correlation.
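
Illustration (not part of the InterPARES definition): the correlation technique named above can be sketched in a few lines of Python. The column names, sample values, and the 0.9 threshold are hypothetical, chosen only to show how a non-obvious relationship between fields might be surfaced. Real data mining works on far larger data sets and with more robust statistics.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_correlations(columns, threshold=0.9):
    """Return every pair of columns whose |r| meets the threshold."""
    found = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            found.append((name_a, name_b, round(r, 3)))
    return found


# Hypothetical sample data: three fields from an imagined sales table.
columns = {
    "visits": [1, 2, 3, 4, 5, 6],
    "purchases": [2, 4, 6, 8, 10, 12],
    "returns": [5, 1, 4, 2, 6, 3],
}
result = strong_correlations(columns)
print(result)
```

Only the visits/purchases pair passes the threshold here; the other pairs are close to uncorrelated and are filtered out.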

Other Definitions

Dilly 1995 (†480 1.1): The term data mining has been stretched beyond its limits to apply to any form of data analysis. Some of the numerous definitions of Data Mining, or Knowledge Discovery in Databases are: ¶ Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analysing changes, and detecting anomalies. · William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus ¶ Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database. · Marcel Holsheimer & Arno Siebes (1994)
Gartner IT Glossary (†298 s.v. "data mining"): The process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques.
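
Illustration (not part of the quoted definitions): one of the approaches listed above, detecting anomalies, can be sketched with a simple statistical rule. The example below assumes a z-score test with a hypothetical threshold of 2.0 and made-up readings. It is one of many possible techniques, not the method of any source cited here.

```python
from statistics import mean, pstdev


def find_anomalies(values, z_threshold=2.0):
    """Return (index, value) pairs lying more than z_threshold
    population standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > z_threshold
    ]


# Hypothetical readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 50, 9, 10]
anomalies = find_anomalies(readings)
print(anomalies)
```

A z-score rule is easy to read but sensitive to the outliers it is trying to find; robust alternatives (for instance, median-based measures) are often preferred on noisy databases.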

Citations

Alexander 2012 (†479): Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. (†688)
Dilly 1995 (†480 1.1): Data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer which is responsible for finding the patterns by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places as the data mining software extracts patterns not previously discernable or so obvious that no-one has noticed them before. (†686)
Dilly 1995 (†480 1.2.4): Knowledge Discovery in Databases (KDD) or Data Mining, and the part of Machine Learning (ML) dealing with learning from examples overlap in the algorithms used and the problems addressed. The main differences are: · KDD is concerned with finding understandable knowledge, while ML is concerned with improving performance of an agent. So training a neural network to balance a pole is part of ML, but not of KDD. However, there are efforts to extract knowledge from neural networks which are very relevant for KDD. · KDD is concerned with very large, real-world databases, while ML typically (but not always) looks at smaller data sets. So efficiency questions are much more important for KDD. · ML is a broader field which includes not only learning from examples, but also reinforcement learning, learning with teacher, etc. ¶ KDD is that part of ML which is concerned with finding understandable knowledge in large sets of real-world examples. When integrating machine learning techniques into database systems to implement KDD some of the databases require: · more efficient learning algorithms because realistic databases are normally very large and noisy. It is usual that the database is often designed for purposes different from data mining and so properties or attributes that would simplify the learning task are not present nor can they be requested from the real world. Databases are usually contaminated by errors so the data mining algorithm has to cope with noise whereas ML has laboratory type examples i.e. as near perfect as possible. · more expressive representations for both data, e.g. tuples in relational databases, which represent instances of a problem domain, and knowledge, e.g. rules in a rule-based system, which can be used to solve users' problems in the domain, and the semantic information contained in the relational schemata. ¶ Practical KDD systems are expected to include three interconnected phases · Translation of standard database information into a form suitable for use by learning facilities; · Using machine learning techniques to produce knowledge bases from databases; and · Interpreting the knowledge produced to solve users' problems and/or reduce data spaces. Data spaces being the number of examples. (†687)
ISO 14873 (draft) 2012 (†504 p. 3): Computational process that extracts patterns by analysing quantitative data from different perspectives and dimensions, categorizing it, and summarizing potential relationships and impacts. (†782)
Kurian 2013 (†576 s.v. data mining): Process of extracting usable information from raw data through the use of algorithms and statistical models to form predictive models that can identify significant patterns of use. (†1091)
Law 2011 (†581 s.v. data mining): 1. The process of using sophisticated software to identify commercially useful statistical patterns or relationships in online databases 2. The extraction of information from a data warehouse to assist managerial decision making. The information obtained in this way helps organizations gain a better understanding of their customers and can be used to improve customer support and marketing activities. (†1136)
NIST 2013 (†734 p. B-6): Data Mining/Harvesting - An analytical process that attempts to find correlations or patterns in large data sets for the purpose of data or knowledge discovery. (†1839)
Text and Data Mining 2014 (†521 p. 10): Text and data mining involves the deployment of a set of continuously evolving research techniques which have become available as a result of widely distributed access to massive, networked computing power and exponentially increasing digital data sets, enabling almost anyone who has the right level of skills and access to assemble vast quantities of data, whether as text, numbers, images or in any other form, and to explore that data in search of new insights and knowledge. [Note: This definition accords broadly with the one proposed by the Publishing Research Consortium (2013): ‘Data mining is an analytical process that looks for trends and patterns in data sets that reveal new insights. These new insights are implicit, previously unknown and potentially useful pieces of information. The data, whether it is made up of words or numbers or both, is stored in relational databases. It may be helpful to think of this process as database mining or as some refer to it ‘knowledge discovery in databases’. Data mining is well established in fields such as astronomy and genetics.’] (†825)
Text and Data Mining 2014 (†521 p. 13): There is a bundle of controversial issues arising from concerns about data privacy and protection, currently leading to new policy initiatives in Europe, which may cause further divergence between the European and American landscape for text and data mining. This follows high level tensions over access to mobile phone calls and other data by American intelligence agencies. One likely impact is that data held in North America, including data of European origin, will attract less rigorous levels of protection compared with data held in Europe. (†828)
Wikipedia (†387 s.v. "unstructured data"): Techniques such as data mining and text analytics and noisy-text analytics provide different methods to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. Unstructured Information Management Architecture (UIMA) provides a common framework for processing this information to extract meaning and create structured data about the information. (†619)
Witten 2005 (†523): Just as data mining can be loosely described as looking for patterns in data, text mining is about looking for patterns in text. However, the superficial similarity between the two conceals real differences. Data mining can be more fully characterized as the extraction of implicit, previously unknown, and potentially useful information from data [Witten and Frank, 2000]. The information is implicit in the input data: it is hidden, unknown, and could hardly be extracted without recourse to automatic techniques of data mining. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. It’s not hidden at all–most authors go to great pains to make sure that they express themselves clearly and unambiguously–and, from a human point of view, the only sense in which it is “previously unknown” is that human resource restrictions make it infeasible for people to read the text themselves. The problem, of course, is that the information is not couched in a manner that is amenable to automatic processing. Text mining strives to bring it out of the text in a form that is suitable for consumption by computers directly, with no need for a human intermediary. (†832)
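
Illustration (not part of the cited sources): the three interconnected phases of a practical KDD system described in Dilly 1995 (1.2.4), namely translation, learning, and interpretation, can be sketched as follows. The learner is a simple one-rule classifier in the spirit of 1R, swapped in here only because it yields understandable rules. The function names, the records, and the "play" target field are all hypothetical.

```python
from collections import Counter, defaultdict


def translate(rows, target):
    """Phase 1: turn database-style records into (features, label) examples."""
    return [
        ({k: v for k, v in row.items() if k != target}, row[target])
        for row in rows
    ]


def learn_one_rule(examples):
    """Phase 2: build a tiny knowledge base. For each attribute, map every
    value to its majority label, then keep the attribute with fewest errors."""
    best = None
    for attr in examples[0][0].keys():
        counts = defaultdict(Counter)
        for features, label in examples:
            counts[features[attr]][label] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(
            sum(c.values()) - c[rules[value]] for value, c in counts.items()
        )
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: label}, training errors)


def interpret(knowledge, record, default=None):
    """Phase 3: apply the learned rules to a new record."""
    attr, rules, _ = knowledge
    return rules.get(record.get(attr), default)


# Hypothetical records standing in for tuples in a relational table.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
]
examples = translate(rows, target="play")
knowledge = learn_one_rule(examples)
prediction = interpret(knowledge, {"outlook": "rainy", "windy": "no"})
print(knowledge[0], knowledge[1], prediction)
```

The learned knowledge base is a plain value-to-label table, which a person can read directly; this reflects the emphasis on understandable knowledge that the citation uses to distinguish KDD from machine learning more broadly.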