This book gives an overview of what data mining is and the tools available to perform it: Market Basket Analysis, Memory Based Reasoning, Automatic Cluster Detection, Link Analysis, Decision Trees and Artificial Neural Networks. Genetic Algorithms are also included; while not a data mining tool themselves, they are being used to train neural nets.
In each case the authors describe the principles behind the tool, its strengths and weaknesses, and the applications where it is applicable. The authors give tips on the data preparation each tool requires, such as data massaging (which is required for neural nets), and indicate where it is important to select training sets that have approximately equal proportions of good and bad outcomes in order for the tool to predict correctly.
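One simple way to obtain a training set with roughly equal proportions of each outcome is to undersample the larger class. The following is a minimal sketch of that idea only; the review does not say which balancing method the authors recommend, and the function name, the `label_of` helper and the sample data are all invented for illustration.

```python
import random

def balance_by_undersampling(records, label_of, seed=0):
    """Return a shuffled sample with equal counts of each outcome class.

    records  : list of training records (any type)
    label_of : hypothetical helper mapping a record to its outcome label
    """
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(label_of(rec), []).append(rec)
    # Every class is cut down to the size of the smallest class.
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Invented data: 95 "good" outcomes and 5 "bad" ones.
data = [("good", i) for i in range(95)] + [("bad", i) for i in range(5)]
sample = balance_by_undersampling(data, label_of=lambda rec: rec[0])
```

Here `sample` holds five records of each outcome. Undersampling throws data away, so with very small minority classes other approaches (such as weighting) may be preferable.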
The descriptions include simple examples that give an overview of how each tool works. But as the title indicates, this book is for users who are considering using data mining tools. It does not describe how to use particular applications, nor does it include code examples (pseudo or actual) for readers interested in developing their own tools.
The book is easy to read and includes many examples from the authors' experience of data mining in the real world. Further information on topics covered in the book can be found at the authors' web site, www.data-miners.com.
4 Sep 2001
Discusses the data mining process within the context of creating business, namely 1. Identifying the problem, 2. Analysing the data, 3. Taking action, 4. Measuring the outcome.
The first & third stages are business issues, the others relate to the data mining tools.
The virtuous cycle combines selecting the right data mining tool and data and integrating them into the business.
explainability of what the model is doing, and how easy the model is to apply. The chapter provides an introduction to each data mining tool.
Memory Based Reasoning (MBR) mimics how people make decisions based on their past experience, by identifying previous cases and applying the information from those cases. MBR uses a distance function, a combination function and a chosen number of neighbours to find the most similar existing cases for classification or prediction. Unlike other types of data mining tools, MBR is readily applicable to analysing text.
The chapter describes how to choose the distance and combination functions and the number of neighbours to be used. Included is a description of how MBR was used to classify news stories.
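The three ingredients named above (a distance function, a combination function and the number of neighbours) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: Euclidean distance and majority voting are assumed choices, and the function names and sample cases are invented.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance function: straight-line distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def majority_vote(labels):
    """Combination function: the most common label among the neighbours."""
    return Counter(labels).most_common(1)[0][0]

def mbr_classify(cases, query, k=3, distance=euclidean, combine=majority_vote):
    """Classify `query` from the k most similar known cases.

    cases : list of (features, label) pairs
    """
    nearest = sorted(cases, key=lambda case: distance(case[0], query))[:k]
    return combine([label for _, label in nearest])

# Invented cases: two groups of points with known labels.
known = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
]
result = mbr_classify(known, (1.1, 1.0), k=3)
```

Because the distance and combination functions are passed in as parameters, a different measure (for example one suited to text) could be swapped in without changing `mbr_classify`.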
Clustering performs undirected knowledge discovery, or unsupervised learning, by identifying groups of records that are similar to each other. It is rarely used by itself, as we are generally not interested in the clusters themselves but rather in what the items in each cluster have in common.
Clustering is a good tool to use at the start of a new data mining project when faced with a large, complex set of data that may have a lot of internal structures.
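To make the idea of automatic cluster detection concrete, here is a minimal sketch using k-means. The review does not say which clustering algorithm the book describes, so k-means is an assumption, as are the naive initialisation, the function name and the sample points.

```python
import math

def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters; returns (centroids, assignments)."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

# Invented data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

The algorithm only returns group membership; working out what the members of each group have in common is still left to the analyst, which is the point made above.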
Link analysis is based on a branch of mathematics called graph theory. It is able to identify relationships between data items. Most data mining applications are unable to take advantage of this kind of information; on the other hand, link analysis is not applicable to all types of data, nor is it able to solve all problems.
The chapter includes brief examples of how link analysis was used by telephone companies to identify who has home fax machines, and how cellular telephone customers can be segmented for the purpose of selling new services.
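As a rough illustration of the graph view that link analysis relies on, the sketch below turns a list of call records into a graph and counts each node's connections. It does not reproduce the telephone company analyses mentioned above; the call records and names are invented.

```python
from collections import defaultdict

# Invented call records: (caller, callee) pairs.
calls = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("E", "C")]

def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph

graph = build_graph(calls)
# Degree: the number of distinct parties each node is linked to.
degree = {node: len(neighbours) for node, neighbours in graph.items()}
most_connected = max(degree, key=degree.get)
```

Properties of this kind (who is linked to whom, and how densely) are the relationship information that record-at-a-time tools do not see.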
Decision trees are powerful and popular tools for classification and prediction. Their attractiveness stems in part from the fact that the decisions are based on rules that can be represented in English or SQL.
The chapter includes an introduction to how decision trees work and to the CART, CHAID and C4.5 algorithms used for building the trees. The authors provide a case study of how they used decision trees as part of a decision support system for the credit card division of a bank, and show how decision trees can be applied to time-series data, with an example in which they were used to simulate a coffee roaster.
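The point that a tree's decisions can be expressed as rules in SQL can be sketched as follows. The tree here is built by hand rather than by CART, CHAID or C4.5, and its field names, thresholds and labels are invented for illustration; they are not taken from the book's case study.

```python
# A tiny hand-built tree; field names, thresholds and labels are invented.
tree = {
    "test": ("income", 30000),  # split: income < 30000 ?
    "yes": {"label": "decline"},
    "no": {
        "test": ("years_at_address", 2),
        "yes": {"label": "refer"},
        "no": {"label": "approve"},
    },
}

def classify(node, record):
    """Follow the splits from the root until a leaf label is reached."""
    while "label" not in node:
        field, threshold = node["test"]
        node = node["yes"] if record[field] < threshold else node["no"]
    return node["label"]

def rules_as_sql(node, conditions=()):
    """Return one (label, WHERE clause) pair per leaf of the tree."""
    if "label" in node:
        return [(node["label"], " AND ".join(conditions) or "1 = 1")]
    field, threshold = node["test"]
    return (
        rules_as_sql(node["yes"], conditions + (f"{field} < {threshold}",))
        + rules_as_sql(node["no"], conditions + (f"{field} >= {threshold}",))
    )

rules = rules_as_sql(tree)
decision = classify(tree, {"income": 45000, "years_at_address": 5})
```

Each path from the root to a leaf becomes one readable condition, for example `income >= 30000 AND years_at_address < 2` for the "refer" leaf.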
Constructing Intelligent Agents with JAVA (Chapter 5) also contains information on decision trees and includes code to implement one.
Neural networks are popular because they have a proven track record in many data mining and decision support systems. They are very powerful, general purpose tools capable of performing prediction, classification and clustering, and have been applied across a wide range of industries. Their drawback, though, is that they cannot explain why a solution is valid.
In addition to explaining what a neural network comprises and how it works, this chapter details how to select the training data, and how to prepare a wide range of data types and time-series for use by a neural net.
Also covered are Self Organising Maps (SOMs), also known as Kohonen (Feature) Maps. Constructing Intelligent Agents with JAVA (Chapter 5) also contains information on neural nets and includes code to implement a back propagation neural net and a Kohonen map.
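As a rough illustration of the kind of data preparation a neural net typically needs, the sketch below rescales a numeric field to the range 0 to 1 and encodes a categorical field as 0/1 flags. These two techniques are common practice and are assumed here; the review does not list the specific transformations the book recommends, and the function names and sample values are invented.

```python
def min_max_scale(values):
    """Map numeric values linearly onto the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant field: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each categorical value as a list of 0/1 flags, one per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

# Invented sample fields.
ages = [20, 35, 50]
scaled_ages = min_max_scale(ages)

regions = ["north", "south", "north"]
encoded_regions = one_hot(regions)
```

Here `scaled_ages` runs from 0.0 for the youngest to 1.0 for the oldest, and each region becomes a pair of flags in alphabetical category order.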