Friday, 6 March 2015

Principles of Data Mining and the Scientific Method

This post is intended to serve as an overview of the principles of data mining, and the important stages in the process. Various aspects of these stages will be addressed in detail in future posts.

One can think of data mining as the scientific method re-branded for a new age in which large volumes of data are available, with additional emphasis placed on the treatment of such data. In the scientific method one observes certain phenomena, and then tests hypotheses proposed to explain these observations (or data sets). In the mature sciences (e.g. physics, chemistry) the scientific method has produced rigorous mathematical representations of reality. For example, we now know that the gravitational acceleration ($a$) of an object at the surface of a much more massive body of mass $M$ and radius $R$ is $a = G M / R^2$, where the gravitational constant is $G = 6.67384 \times 10^{-11}$ m$^3$ kg$^{-1}$ s$^{-2}$. There is no need to continually re-estimate $G$ from data. Substituting in the mass and radius of the Earth produces a gravitational acceleration of $a = 9.81$ m s$^{-2}$. However, in many new complex fields including social science, finance, business intelligence and genetics, there are no such fundamental mathematical representations as yet, and one must rely on data mining techniques to build models from the observations in order to make predictions of these systems.
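As a quick check of the arithmetic, the substitution can be sketched in a few lines of Python. The value of the gravitational constant is the one quoted above; the mass and mean radius of the Earth are not given in this post, so the values below are assumed, commonly quoted figures.

```python
# Gravitational acceleration at the surface of a body: a = G * M / R**2
G = 6.67384e-11      # gravitational constant quoted in the text, m^3 kg^-1 s^-2
M_EARTH = 5.972e24   # ASSUMED mass of the Earth in kg (not given in the text)
R_EARTH = 6.371e6    # ASSUMED mean radius of the Earth in m (not given in the text)


def surface_gravity(mass_kg, radius_m, g_const=G):
    """Return the gravitational acceleration in m s^-2."""
    return g_const * mass_kg / radius_m ** 2


a = surface_gravity(M_EARTH, R_EARTH)
print(f"a = {a:.2f} m s^-2")
```

The result is close to 9.8 m s^-2; the second decimal place depends on which mass and radius are assumed.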

There are six key stages in the data mining process: 1) pose the question; 2) collect / generate the data; 3) check and clean the data; 4) build and validate the model; 5) use this model to make predictions and/or optimise the system; and 6) report and visualise the results. This process is illustrated below, and is a modified version of that in [1]. The process is illustrated in a sequential manner, but it is in fact iterative. If problems are encountered in downstream stages, then you may have to return to earlier stages to either: build an alternate model; perform additional data checking; collect more data; or pose an alternate question if your original question is inappropriate or unanswerable. I will now provide more details on each of these stages.



It may seem obvious, but the first stage in the process is to pose a question to be answered or, more specifically, a hypothesis that can be tested. It is important to be as precise as possible, as this will define the effort and investment required for each of the forthcoming stages in the data mining process. It is also important to do as much background reading as possible on previous work in the field, to ensure you are not reinventing the wheel. From my own personal experience, in today's research and commercial environments the problem is typically not that we don't have sufficient data, but rather that we don't have sufficient questions.

The second stage is to collect and store the appropriate type, quality and quantity of data required to answer the question at hand. The data may be collected from observations of the environment (e.g. global atmospheric temperature measurements) and/or generated by numerical simulation (e.g. general circulation models of the climate). In either case all errors, uncertainties and caveats should be documented.

The third stage involves the collating, checking and cleaning of the data. Tasks at this stage include:
  • Integrate / wrangle the data from various sources into a consistent data structure and check that the data from different sources is in fact consistent.
  • Ensure the data has the appropriate type (e.g. integer, float, text, images, video). For example the average number of children per family may be a float (e.g. 2.4) but the number of children in a given family must be an integer (e.g. 2).
  • Check that the data is in fact realisable. For example you cannot have a negative amount of rainfall.
  • Check that the data is "timely", that is, collected from a period appropriate to answer the question at hand.
  • Remove repeated and redundant data.
  • Detect and remove outliers / anomalies from the database. This is a large field and will be discussed at a future time.
  • Flag samples with any missing values and either remove the entire sample or augment the sample with an appropriate estimate of the missing value. This is also a large field in itself and will be the subject of a future post.
Given that enough storage space is available, it is good practice to keep a copy of the raw data before any checking, cleaning or compression is undertaken. This way if a bug is found in any of the downstream data processing codes, the analysis can be repeated from the source data.
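A minimal sketch of a few of these checks is given below, using only the Python standard library. The rainfall records, field names and the `clean` helper are all hypothetical and chosen purely for illustration. Here samples with missing or unusable values are flagged and set aside rather than estimated, and the raw records are left untouched.

```python
import copy
import math

# HYPOTHETICAL raw records: daily rainfall (mm) merged from two sources.
raw_records = [
    {"station": "A", "day": 1, "rain_mm": 3.2},
    {"station": "A", "day": 1, "rain_mm": 3.2},    # repeated sample
    {"station": "A", "day": 2, "rain_mm": -1.0},   # not realisable
    {"station": "B", "day": 1, "rain_mm": None},   # missing value
    {"station": "B", "day": 2, "rain_mm": "4.5"},  # stored as text
    {"station": "B", "day": 3, "rain_mm": 0.0},
]


def clean(records):
    """Return (cleaned, flagged) without modifying the input records."""
    cleaned, flagged, seen = [], [], set()
    for rec in copy.deepcopy(records):  # work on a copy of the raw data
        key = (rec["station"], rec["day"])
        if key in seen:  # keep only the first record per (station, day)
            continue
        seen.add(key)
        value = rec["rain_mm"]
        if value is None:
            flagged.append((key, "missing"))
            continue
        try:
            value = float(value)  # enforce a numeric type
        except (TypeError, ValueError):
            flagged.append((key, "bad type"))
            continue
        if math.isnan(value) or value < 0.0:  # rainfall cannot be negative
            flagged.append((key, "not realisable"))
            continue
        rec["rain_mm"] = value
        cleaned.append(rec)
    return cleaned, flagged


cleaned, flagged = clean(raw_records)
print(len(cleaned), "clean samples;", len(flagged), "flagged samples")
```

Because `clean` operates on a deep copy, `raw_records` can be reprocessed from scratch if a bug is later found in the cleaning rules.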

The next phase is to develop models representing the system from which the data was collected. These models take many forms, depending on the question you are aiming to answer. For example:
  • If you are interested in identifying groups of customers with similar purchasing patterns then clustering methods would be the most appropriate.
  • If your project requires image or voice recognition then deep learning methods are at present among the most effective approaches.
  • If you are looking to extrapolate company earnings in a hypothetical future economy then statistical regression may be the most appropriate approach.
  • If you need to determine the parameters for models that are very computationally expensive, then you can minimise a response surface model of the simulation error, rather than minimising the error of the expensive model directly.
I will provide worked examples of each of these applications in future posts. Regardless of the approach, it is good practice to build the model using a subset of the data (the training set), and verify the model on the remaining data not used during training (the test set). If the model does not perform adequately on the test data, then one may need to adopt a more complex model if it is underfitting the data, or collect more data if it is overfitting. This will be discussed in more detail in a future post on multi-dimensional linear regression.
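The idea of training on one subset and verifying on the held-out remainder can be sketched as follows. The data set is synthetic and assumed purely for illustration (a straight line plus Gaussian noise), the 70/30 split is an arbitrary choice, and a one-dimensional least-squares fit stands in for whatever model is actually appropriate.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# SYNTHETIC data, assumed for illustration: y = 2x + 1 plus Gaussian noise.
xs = [i / 10 for i in range(100)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.5)) for x in xs]

# Hold out 30% of the samples as a test set.
random.shuffle(data)
n_train = int(0.7 * len(data))
train, test = data[:n_train], data[n_train:]


def fit_line(samples):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(samples, slope, intercept):
    """Mean squared error of the fitted line on the given samples."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in samples) / len(samples)


slope, intercept = fit_line(train)  # the model only ever sees the training set
train_err = mse(train, slope, intercept)
test_err = mse(test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MSE={train_err:.3f} test MSE={test_err:.3f}")
```

A test error much larger than the training error would suggest overfitting, while large errors on both sets would suggest underfitting.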

Once a model is built and verified it can then be used to make predictions and/or optimise the system design. The most appropriate optimisation method depends on the dimensionality and nonlinearity of the parameter space, and the computational cost required to evaluate the model. Typical available optimisation methods include: gradient-based search; genetic algorithms; evolutionary methods; stochastic optimisation; swarm optimisation; and response surface modelling, to name but a few. I will demonstrate the application of response surface models in the following post.
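As an illustration of the simplest of these, the sketch below applies a gradient-based search to a cheap, smooth, two-parameter cost function. The `cost` function, the step size and the tolerances are all hypothetical choices made for this example; an expensive or strongly nonlinear model would likely call for one of the other methods listed.

```python
def cost(p):
    """HYPOTHETICAL model cost with its minimum at p = (3, -1)."""
    x, y = p
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2


def gradient(f, p, h=1e-6):
    """Central finite-difference estimate of the gradient of f at p."""
    grad = []
    for i in range(len(p)):
        up = list(p)
        down = list(p)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def gradient_descent(f, start, step=0.1, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient until the update becomes negligible."""
    p = list(start)
    for _ in range(max_iter):
        g = gradient(f, p)
        new_p = [pi - step * gi for pi, gi in zip(p, g)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


best = gradient_descent(cost, start=[0.0, 0.0])
print("estimated optimum:", [round(v, 4) for v in best])
```

Each iteration costs two model evaluations per parameter, which is why this approach becomes impractical when a single evaluation is expensive.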

Visualisation of the original data and/or model predictions is an efficient way to report and communicate the results of your analysis. I have already discussed the visualisation of time varying three-dimensional data sets in a previous post. There are also a variety of techniques available for visualising higher-dimensional multi-variate data sets, including: parallel coordinates; radial visualisation; sun burst; and matrix scatter plots. An example would be visualising how the GDP of an economy varies with employment, population, education, water and food availability, etc.


The following posts will provide further details on the various aspects and facets of data mining highlighted here.

References:
[1] Kantardzic, M., 2003, Data mining: concepts, models, methods, and algorithms, Wiley-IEEE Press.