Data Mining for Geotechnical and Mining Engineers
It is now customary to arrange any set of data in a computer spreadsheet in which the rows represent different cases or tests and the columns represent the values of the parameters measured or calculated for each case. In some cases, however, it is very hard, or even impossible, to find patterns or models in a set of data. The degree of difficulty usually increases with both the number of parameters (the number of columns) and the number of data points (the number of rows). One is also quite likely to deal with data that are biased, deficient or inaccurate in some way. This is a particularly troublesome difficulty, because any model incorporating such deficiencies or inaccuracies contains noise, and such models cannot be useful until they pass through an appropriate filtering process. Most often, though, even the recognition of which data are bad is itself unknown and needs to be discovered, investigated and justified.
Successful models are the result of good, reliable and accurate data. For example, suppose we had not yet discovered the well-known electrical relations V = IR and P = RI², but, after some experiments, we had measured the currents I = 10, 20 and 30 A, the resistances R = 5, 10 and 15 Ω, the voltages V = 50, 200 and 450 V, and the powers P = 0.5, 4 and 13.5 kW. From this “good” data one can readily derive the relationship between the power (P), the resistance (R) and the current (I), recovering the well-established equation P = RI², even though the three parameters (I, R, V) are not independent and are potential sources of noise if the data were inaccurate, biased or disturbed. The reader is invited to disturb this data in any fashion and experience the extreme difficulty of rediscovering the above relationships, even though they are no longer unknown to us.
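As a minimal sketch of this derivation (assuming Python with NumPy; the numbers are the measurements quoted above, converted to watts), the power law P = cR^aI^b can be linearised by taking logarithms and fitted by least squares:

```python
# A sketch only: "rediscovering" P = R*I^2 from the measurements above.
# Taking logarithms turns the power law P = c * R^a * I^b into a linear
# model: log P = log c + a*log R + b*log I.
import numpy as np

I = np.array([10.0, 20.0, 30.0])        # currents, A
R = np.array([5.0, 10.0, 15.0])         # resistances, ohm
P = np.array([500.0, 4000.0, 13500.0])  # powers, W (0.5, 4 and 13.5 kW)

# Least-squares fit of log P on [1, log R, log I].
X = np.column_stack([np.ones_like(I), np.log(R), np.log(I)])
coef, _, rank, _ = np.linalg.lstsq(X, np.log(P), rcond=None)
print("fitted (log c, a, b):", coef, " rank of design matrix:", rank)

# The fit is exact, yet the rank is less than 3: because R = I/2 throughout
# this data set, R and I are perfectly correlated, so the exponents (a, b)
# are not unique -- any pair with a + b = 3 (with c adjusted to match)
# reproduces the data.
assert np.allclose(P, R * I**2)         # the true relation does hold
```

Note that the rank deficiency reported by the fit is exactly the dependence between parameters warned about above, and a small disturbance of any one column destroys the exactness of the fit, which is the experiment suggested to the reader.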
The problem is that most often we do not know how good, bad, sufficient or deficient our data are. This is sufficient reason to insist that any experimental data be validated professionally before they are used in data mining for predictive modelling.
Data mining is the process of extracting a category, pattern or model from an existing data set in order to predict either other existing data or data not yet observed. Linear regression, i.e. fitting a straight line to a set of data, is the simplest data mining method; in this case there are no interactions between the independent parameters, and any curvature in the fitted line is a sign of either non-linearity or interactions among the parameters. Traditional statistical regression models are not appropriate for discrete, descriptive or item-based data. Consider, for example, data points (e.g. the percentage of the time one shopping item (dairy) is sold together with another item (meat) in a supermarket) distributed as two distinct regions or clusters in an x-y coordinate system. A classical mathematical regression function cannot represent such discontinuous sets or clustering behaviour. For discontinuous or item data, cluster analysis and decision trees are the two common techniques used to form subsets, groups or categories with common behaviour or properties.
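A hedged sketch of the supermarket example (assuming Python with NumPy and scikit-learn; the percentages below are invented purely for illustration): two separated groups of (dairy, meat) points that no single regression line can represent are separated cleanly by k-means cluster analysis.

```python
# A sketch of the supermarket example: two distinct groups of
# (% dairy, % meat) points, separated by k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dairy_heavy = rng.normal(loc=[70.0, 10.0], scale=3.0, size=(20, 2))  # mostly dairy
meat_heavy = rng.normal(loc=[15.0, 60.0], scale=3.0, size=(20, 2))   # mostly meat
X = np.vstack([dairy_heavy, meat_heavy])  # columns: % dairy, % meat

# k-means assigns each point a cluster membership rather than a fitted value.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", km.cluster_centers_)
print("memberships:", km.labels_)
```

Each point receives a cluster membership rather than a fitted y-value, which is what makes the technique suitable for discontinuous or item data.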
Data mining techniques fall into two categories: inferential and non-inferential. Hypothesis testing and inference from sample to population form the main framework of inferential data mining, which has its foundations in traditional statistical theory. Discriminant analysis (group regression), linear regression, analysis of variance, logistic and multinomial (categorical) regression, and time series analysis all belong to this category. The key difference between inferential and non-inferential techniques is whether hypotheses need to be specified beforehand. In the non-inferential methods this is normally not required, as each is semi- or fully automated in its search for a model. Cluster analysis, market basket (association) analysis, link analysis, memory-based reasoning, decision trees and neural networks all belong to the non-inferential techniques. There is no predefined outcome in cluster analysis, decision trees or market basket analysis: all three use categorical or continuous predictors to create cluster, tree or basket (association) memberships for the data points. Link analysis creates linkages between sets of items, while memory-based reasoning can accept all types of data (including text) to predict an outcome. Among these, neural networks are one of the most popular and powerful techniques, able to use both categorical and continuous predictors to predict categorical and continuous outcome variables. The reader may refer to Berry et al. (1997) for more information on all these models.
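As a concrete sketch of one non-inferential technique named above, the following fits a decision tree with scikit-learn. The predictors and labels (hypothetical rock-quality designation and joint spacing values against a stable/unstable outcome) are invented purely for illustration, but they show how a tree creates category memberships without any predefined hypothesis.

```python
# A sketch of a decision tree, one of the non-inferential techniques above.
# The rock-mass data (RQD % and joint spacing in m, with a stable/unstable
# label) are hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[90, 1.2], [85, 0.9], [40, 0.2], [35, 0.1], [75, 0.8], [30, 0.3]]
y = [1, 1, 0, 0, 1, 0]  # 1 = stable, 0 = unstable

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["RQD", "joint_spacing"]))
print("new case:", tree.predict([[60, 0.5]]))  # classify an unseen case
```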
Inferential techniques and neural networks are perhaps the most relevant to almost all engineering disciplines. Skipping the inferential methods (Alehossein and Hood, 1996), we elaborate on neural networks.