text mining [English]


Other Languages

Syndetic Relationships

InterPARES Definition

(also text data mining) n. ~ A process to analyze natural-language text, typically unstructured, for a variety of purposes, including summarization, document clustering, text categorization, language identification, authorship attribution, and extracting data elements (names, dates, abbreviations, acronyms) (Witten et al., 2005, 10).

Citations

  • Anawis 2014 (†525): Text mining can be defined as the analysis of semi-structured or unstructured text data. The goal is to turn text information into numbers so that data mining algorithms can be applied. It arose from the related fields of data mining, artificial intelligence, statistics, databases, library science, and linguistics. ¶ There are seven specialties within text mining that have different objectives. These can be decided by answers to the questions shown in the decision tree in Figure 2. These specialties are: 1. Information retrieval: storage and retrieval of documents 2. Document Clustering: group and categorize documents using data mining clustering algorithms 3. Document Classification: group and categorize documents based on labeled examples 4. Web mining: understand relationships of hyper linkages of documents on the web 5. Information Extraction: identify specific facts and relationships of unstructured text 6. Natural language processing: understanding language structure, such as parts of speech 7. Concept extraction: group words into similar semantic groups (†834)
  • Gartner IT Glossary (†298 s.v. text mining): The process of extracting information from collections of textual data and utilizing it for business objectives. (†984)
  • Text and Data Mining 2014 (†521 p. 10): Text and data mining involves the deployment of a set of continuously evolving research techniques which have become available as a result of widely distributed access to massive, networked computing power and exponentially increasing digital data sets, enabling almost anyone who has the right level of skills and access to assemble vast quantities of data, whether as text, numbers, images or in any other form, and to explore that data in search of new insights and knowledge. [Note: This definition accords broadly with the one proposed by the Publishing Research Consortium (2013): ‘Data mining is an analytical process that looks for trends and patterns in data sets that reveal new insights. These new insights are implicit, previously unknown and potentially useful pieces of information. The data, whether it is made up of words or numbers or both, is stored in relational databases. It may be helpful to think of this process as database mining or as some refer to it ‘knowledge discovery in databases’. Data mining is well established in fields such as astronomy and genetics.’] (†826)
  • Wikipedia (†387 s.v. "text mining"): Also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). ¶ Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods. (†829)
  • Witten 2005 (†523 p. 1): The phrase “text mining” is generally used to denote any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract probably useful (although only probably correct) information [Sebastiani, 2002]. (†830)
  • Witten 2005 (†523 p. 2): Just as data mining can be loosely described as looking for patterns in data, text mining is about looking for patterns in text. However, the superficial similarity between the two conceals real differences. Data mining can be more fully characterized as the extraction of implicit, previously unknown, and potentially useful information from data [Witten and Frank, 2000]. The information is implicit in the input data: it is hidden, unknown, and could hardly be extracted without recourse to automatic techniques of data mining. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. It’s not hidden at all–most authors go to great pains to make sure that they express themselves clearly and unambiguously–and, from a human point of view, the only sense in which it is “previously unknown” is that human resource restrictions make it infeasible for people to read the text themselves. The problem, of course, is that the information is not couched in a manner that is amenable to automatic processing. Text mining strives to bring it out of the text in a form that is suitable for consumption by computers directly, with no need for a human intermediary. (†831)
  • Witten et al. 2005 (†524 p. 10): Text mining is a burgeoning new field that attempts to glean meaningful information from natural-language text. It may be loosely characterized as the process of analyzing text to extract information that is useful for particular purposes. It most commonly targets text whose function is communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling – even if success is only partial. "Text mining" (sometimes called "text data mining") defies tight definitions but encompasses a wide range of activities: text summarization; document retrieval; document clustering; text categorization; language identification; authorship ascription; identifying phrases, phrase structures, and key phrases; extracting "entities" such as names, dates, and abbreviations; locating acronyms and their definitions; filling predefined templates with extracted information; and even learning rules from such templates. (†833)
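
Illustration

Several of the citations above describe text mining as turning text into numbers so that data mining algorithms can be applied, and as extracting data elements such as acronyms. The following is a minimal sketch of both ideas using only the Python standard library. It does not come from any of the cited sources: the function names (tokenize, term_document_matrix, extract_acronyms), the sample sentences, and the simple regular-expression rules are illustrative assumptions, and practical systems use far more careful tokenization and extraction.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Assumption: words are plain runs of a-z; real tokenizers handle
    punctuation, numbers, and other scripts much more carefully.
    """
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocabulary, rows) with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is every distinct term seen in any document, sorted
    # so that the column order is reproducible.
    vocabulary = sorted(set().union(*counts))
    rows = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, rows


def extract_acronyms(text):
    """Find runs of two or more capital letters.

    Assumption: this is a crude stand-in for acronym extraction and will
    also match any other all-capitals word.
    """
    return sorted(set(re.findall(r"\b[A-Z]{2,}\b", text)))


# Hypothetical sample documents, written for this sketch.
docs = [
    "Text mining looks for patterns in text.",
    "Data mining looks for patterns in data.",
]
vocabulary, rows = term_document_matrix(docs)
acronyms = extract_acronyms("Natural language processing (NLP) and IT glossaries")

if __name__ == "__main__":
    print(vocabulary)
    for row in rows:
        print(row)
    print(acronyms)
```

Each row of the resulting matrix is a numeric representation of one document, which is the form that clustering or classification algorithms typically expect as input.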