Reflections from the Texas Hill Country: Text Mining

Definition of text-mining – “The discovery by computer of new, previously unknown information by automatically extracting information from different written resources.” – Hearst, M. What is Text Mining?; www.berkeley.edu/~hearst/textmining.html.

Wikipedia Definition: “Text mining, also known as intelligent text analysis, text data mining, unstructured data management, or knowledge discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge (usually converted to metadata elements) from unstructured text (i.e. free text) stored in electronic form. This can be achieved either through added markup in XML, Atom or RDF formats or through the analysis of common phraseologies indicating certain relationships.”

Text mining is similar to data mining, but it can handle unstructured as well as structured data. Examples of unstructured data include email, full-text documents, and HTML files.

Humans find it easy to handle unstructured communications because we can recognize and apply linguistic patterns, which lets us cope with slang and spelling variations and extract context from text. Computers do not do this easily, but they can process data at much higher speed than we can. For text mining to be successful, applications must combine a human being’s linguistic capabilities with the speed and accuracy of a computer.

How a text mining program works:
1. The program starts with a collection of documents.
2. It retrieves one of the documents.
3. It analyzes the text, using tools such as information extraction, clustering, or summarization.
4. It may place the information in a management information system.
5. The system yields useful knowledge.
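
The five steps above can be sketched in a few lines of Python. This is only a toy illustration, not any particular product: the `analyze` and `mine` functions, the sample documents, and the plain dictionary standing in for the management information system are all my own assumptions, and the "analysis" is nothing more than a word count.

```python
from collections import Counter

def analyze(text):
    """Stand-in for step 3: count the words longer than three letters."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return Counter(w for w in words if len(w) > 3)

def mine(documents):
    """Walk the collection, analyze each document, and store the results."""
    store = {}  # stand-in for the management information system (step 4)
    for name, text in documents.items():  # step 2: retrieve one document
        store[name] = analyze(text)       # step 3: analyze, step 4: store
    return store                          # step 5: results ready for querying

# Step 1: a hypothetical two-document collection.
documents = {
    "memo.txt": "Text mining finds patterns. Mining text takes patience.",
    "note.txt": "Clustering groups similar documents together.",
}
store = mine(documents)
print(store["memo.txt"].most_common(2))
```

A real system would replace `analyze` with one of the techniques described below.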

We are now teaching computers our natural language using natural language processing, which may utilize these technologies:
1. Information extraction
2. Topic tracking
3. Summarization
4. Categorization
5. Clustering
6. Concept Linkage
7. Information visualization
8. Question answering

Topic Tracking

You set up a profile at a site such as Yahoo Alerts, and the tool uses the documents you search for to predict other documents you may be interested in retrieving.
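
To make the idea concrete, here is a minimal sketch of profile-based prediction. I do not know how Yahoo Alerts works internally, so treat this as an assumption: the profile is simply the set of words from documents the user has already viewed, and a candidate document is scored by how many of its words appear in that profile. The function names and sample titles are made up for illustration.

```python
def words(text):
    """Lowercase the text and return its distinct words without punctuation."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()}

def build_profile(viewed_documents):
    """Collect every word from the documents the user has already viewed."""
    profile = set()
    for text in viewed_documents:
        profile |= words(text)
    return profile

def score(profile, candidate):
    """Fraction of the candidate's words that already appear in the profile."""
    candidate_words = words(candidate)
    if not candidate_words:
        return 0.0
    return len(candidate_words & profile) / len(candidate_words)

viewed = ["HCI degrees at Stanford", "Masters degrees in HCI"]
profile = build_profile(viewed)

candidates = ["New HCI degrees announced", "Hill Country wildflower season"]
ranked = sorted(candidates, key=lambda c: score(profile, c), reverse=True)
print(ranked[0])
```

The candidate that shares the most vocabulary with the profile is ranked first.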

Summarization

Text summarization reduces the length of a document, but retains its main ideas and overall meaning. It is difficult to train computers to analyze semantics and interpret meaning. Microsoft Word has a summarization tool built into it that uses the strategy of sentence extraction.
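
Sentence extraction can be sketched roughly as follows. This is not the algorithm Microsoft Word uses (I have not seen it); it is a simple assumed variant that scores each sentence by the average frequency of its words in the whole document and keeps the highest-scoring sentences in their original order. It makes no attempt to filter out common words such as "the".

```python
import re
from collections import Counter

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def sentence_score(sentence):
        # Average document-wide frequency of the words in this sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=sentence_score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]

text = ("Text mining extracts information from text. "
        "The weather was pleasant. "
        "Text mining tools save time.")
print(summarize(text, max_sentences=2))
```

Sentences that repeat the document's most frequent words rise to the top, which is why the off-topic sentence about the weather is dropped.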

Categorization

Categorization involves identifying the main themes of a document, without attempting to process the actual information. A categorization tool counts the words that appear, and uses the counts to identify the main topics in the document. These tools often rely upon a thesaurus that predefines the topics and the relationships among the terms it categorizes.
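
A minimal sketch of counting words against a thesaurus might look like this. The `THESAURUS` mapping and its two topics are hypothetical and far smaller than anything a real tool would use.

```python
import re
from collections import Counter

# Hypothetical thesaurus: each term is mapped to a predetermined topic.
THESAURUS = {
    "degree": "education", "masters": "education", "university": "education",
    "interface": "hci", "usability": "hci", "interaction": "hci",
}

def categorize(text):
    """Count thesaurus terms in the text and rank topics by those counts."""
    topic_counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        topic = THESAURUS.get(word)
        if topic:
            topic_counts[topic] += 1
    return topic_counts.most_common()

doc = ("Usability and interaction design shape every interface. "
       "A masters degree helps.")
print(categorize(doc))
```

Note that the code never tries to understand the sentences; it only counts terms, which matches the description above.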

Clustering

Clustering tools group similar documents on the fly. For example, I went to the Clusty search engine and typed “HCI degrees” in the search box. The right side of the page showed the results in the usual format used by most search engines. The top 183 of at least 51,055 hits were displayed. On the left side of the page, however, Clusty clustered 187 results by topics:
Human-Computer (187)
Masters (19) – Masters degrees, not limited to HCI degrees
Stanford (12) – HCI-related degrees from Stanford University
Research (17) – Links to schools and other organizations that have done research on HCI
Several other categories I will not bother to list

Another clustering search engine is Vivisimo. If you go to the home page of this search engine and type “HCI degrees” in the search box, the clustered results are very similar to those retrieved by Clusty. On the left side of the page, one sees the following categories:
Human, Computer Interaction (45)
Computer Science (19)
Master’s Degrees (16)
Stanford (12)
Several more I will not list
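
I do not know the algorithms Clusty or Vivisimo use, but the general idea of grouping results on the fly can be sketched with a simple greedy pass. In this assumed version, each title joins the first cluster whose first member shares enough words with it (Jaccard similarity), or starts a new cluster otherwise. The threshold of 0.3 and the sample titles are arbitrary choices of mine.

```python
def tokens(text):
    """Split a title into a set of lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Shared words divided by total distinct words."""
    return len(a & b) / len(a | b)

def cluster(titles, threshold=0.3):
    """Greedy single-pass clustering: join the first similar cluster or start one."""
    clusters = []  # each cluster is a list of titles
    for title in titles:
        for group in clusters:
            if jaccard(tokens(title), tokens(group[0])) >= threshold:
                group.append(title)
                break
        else:
            clusters.append([title])
    return clusters

titles = [
    "masters degrees in hci",
    "hci masters degrees online",
    "stanford hci research group",
    "stanford hci research lab",
]
for group in cluster(titles):
    print(group)
```

No categories are defined in advance; the groups emerge from the results themselves, which is what distinguishes clustering from categorization.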

Concept Linkage

Concept linkage encourages browsing for information rather than searching for it. These tools connect related documents by identifying their shared concepts. The Yahoo Directory is a good example of a search engine that helps you search through related categories when you are not sure what you need to type in the search box.
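
Connecting documents through shared concepts can be sketched as below. I am assuming the concepts have already been extracted for each document, which is the hard part in practice; the page names and concept sets are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def link_by_concept(doc_concepts):
    """Return {(doc_a, doc_b): shared concepts} for every linked pair."""
    index = defaultdict(set)  # concept -> documents that mention it
    for doc, concepts in doc_concepts.items():
        for concept in concepts:
            index[concept].add(doc)

    links = defaultdict(set)
    for concept, docs in index.items():
        for a, b in combinations(sorted(docs), 2):
            links[(a, b)].add(concept)
    return dict(links)

# Hypothetical concepts already extracted from three pages.
doc_concepts = {
    "page1": {"hci", "degrees"},
    "page2": {"hci", "usability"},
    "page3": {"degrees", "stanford"},
}
links = link_by_concept(doc_concepts)
print(links)
```

A browsing tool could use these links to suggest page2 and page3 to someone reading page1, even though page2 and page3 share nothing with each other.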

Information Visualization

A good example of a search engine based on information visualization is the KartOO metasearch engine. When I searched for HCI degrees in KartOO, the left side of the screen displayed categories of information much as Vivisimo and Clusty did. On the right side of the screen, however, the links are shown in map format. When you move the mouse over a link, the program draws lines to other related web pages.

Question Answering

Some search engines, such as Ask.com, formerly Ask Jeeves, let the user type a question in the search box and then display the ten sites they consider to have the most relevant information. These tools utilize multiple techniques, such as information extraction and question categorization.
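
Question categorization, one of the techniques mentioned above, can be illustrated with a very small sketch that guesses the expected answer type from the opening words of a question. This is not how Ask.com does it; the `ANSWER_TYPES` table is a hypothetical stand-in for a much richer classifier.

```python
# Hypothetical cue phrases mapped to the kind of answer the user expects.
ANSWER_TYPES = {
    "who": "person",
    "when": "date",
    "where": "place",
    "how many": "number",
}

def categorize_question(question):
    """Guess the expected answer type from the question's opening words."""
    q = question.strip().lower()
    # Check longer cues first so a multi-word cue is not shadowed by a shorter one.
    for cue in sorted(ANSWER_TYPES, key=len, reverse=True):
        if q.startswith(cue + " "):
            return ANSWER_TYPES[cue]
    return "other"

print(categorize_question("Who coined the term text mining?"))
```

Knowing the answer type lets a tool favor pages that contain a name, a date, a place, or a number, as appropriate.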

Sources:

Fan, W., Wallace, L., Rich, S., Zhang, Z. Tapping the power of text mining. In Communications of the ACM 49, 9 (Sept. 2006), 73-82.

Wikipedia. Text Mining. http://en.wikipedia.org/wiki/Text_mining. Accessed: August 28, 2006.

Resources on World Wide Web:

Text-Mining.org
http://www.text-mining.org/index.jsp

Data Mining Conferences
http://www.kmining.com/info_conferences.html

Text Mining: Science Digs Deeper
http://www.firstauthor.org/research_tools.html#TextMining

Text Mining Tools:

Ultimate Research Assistant
http://www.hoskinson.net/ultimate.research.assistant/

Inxight
http://www.inxight.com/

ClearForest
http://www.clearforest.com/

Convera
http://www.convera.com/

Megaputer
http://www.megaputer.com/

Search Engines and Web Tools

KartOO
http://www.kartoo.com/

Clusty
http://clusty.com/

Vivisimo
http://vivisimo.com/

Yahoo Alerts
www.alerts.yahoo.com


Monday, August 28, 2006
