This
post is intended to serve as an overview of the principles of data
mining, and the important stages in the process. Various aspects of
these stages will be addressed in detail in future posts.
One can think of data mining as the scientific method re-branded for a new age in which large volumes of data are available, with additional emphasis placed on the treatment of such data. In the scientific method one observes certain phenomena, and then tests hypotheses proposed to explain these observations (or data sets). In the mature sciences (e.g. physics, chemistry) the scientific method has produced rigorous mathematical representations of reality. For example, we now know that the acceleration due to gravity (a) acting on an object by another object of much larger mass (M) and radius (R) is a = GM/R², where the gravitational constant is G = 6.67384 × 10⁻¹¹ m³ kg⁻¹ s⁻². There is no need to continually re-estimate G from data. Substituting in the mass and radius of the Earth produces a gravitational acceleration of a = 9.81 m s⁻².
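As a quick check, the short Python sketch below substitutes values for the Earth's mass and radius into this expression; the mass and radius values used here are assumptions for illustration:

```python
# Acceleration due to gravity at the Earth's surface, a = G * M / R**2.
G = 6.67384e-11   # gravitational constant, m^3 kg^-1 s^-2
M = 5.972e24      # mass of the Earth, kg (assumed value)
R = 6.371e6       # mean radius of the Earth, m (assumed value)

a = G * M / R**2
print(f"a = {a:.2f} m/s^2")  # approximately 9.8 m/s^2
```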
However, in many newer and more complex fields, including social science, finance, business intelligence and genetics, there are no such fundamental mathematical representations as yet, and one must rely on data mining techniques to build models from the observations in order to make predictions about these systems.
There are six key stages in the data mining process: 1) pose the question; 2) collect / generate the data; 3) check and clean the data; 4) build and validate the model; 5) use this model to make predictions and/or optimise the system; and 6) report and visualise the results. This process is illustrated below, and is a modified version of that in [1]. The process is illustrated in a sequential manner, but it is in fact iterative. If problems are encountered in downstream stages, then you may have to return to earlier stages to: build an alternative model; perform additional data checking; collect more data; or pose an alternative question if your original question is inappropriate or unanswerable. I will now provide more details on each of these stages.
It may seem obvious, but the first stage in the process is to pose a question to be answered, or more specifically, to pose a hypothesis that can be tested. It is important to be as precise as possible, as this will define the effort and investment required for each of the forthcoming stages in the data mining process. It is also important to do as much background reading as possible on previous work done in the field, so as to ensure you are not reinventing the wheel. From my own personal experience, in today's research and commercial environments the problem is typically not that we don't have sufficient data, but rather that we don't have sufficient questions.
The second stage is to collect and store the appropriate type, quality and quantity of data required to answer the question at hand. The data may be collected from observations of the environment (e.g. global atmospheric temperature measurements) and/or generated by numerical simulation (e.g. general circulation models of the climate). In either case, all errors, uncertainties and caveats should be documented.
The third stage involves collating, checking and cleaning the data. Aspects that should be checked include the following (a minimal sketch of some of these checks is given after the list):
- Integrate / wrangle the data from various sources into a consistent data structure and check that the data from different sources is in fact consistent.
- Ensure the data has the appropriate type (e.g. integer, float, text, images, video). For example the average number of children per family may be a float (e.g. 2.4) but the number of children in a given family must be an integer (e.g. 2).
- Check that the data is in fact realisable. For example you cannot have a negative amount of rainfall.
- Check that the data is "timely", that is, collected from a period appropriate to answer the question at hand.
- Remove repeated and redundant data.
- Detect and remove outliers / anomalies from the database. This is a large field and will be discussed at a future time.
- Flag samples with any missing values and either remove the entire sample or augment the sample with an appropriate estimate of the missing value. This is also a large field in itself and will be the subject of a future post.
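The sketch below illustrates a few of these checks using pandas, on a hypothetical table of daily rainfall observations; the file name, column names and date range are assumptions for illustration only:

```python
import pandas as pd

# Load a hypothetical table of daily rainfall observations (file name assumed).
df = pd.read_csv("rainfall_observations.csv", parse_dates=["date"])

# Ensure each column has the appropriate type.
df["station_id"] = df["station_id"].astype(int)
df["rainfall_mm"] = df["rainfall_mm"].astype(float)

# Check the data is realisable: rainfall cannot be negative.
unrealisable = df["rainfall_mm"] < 0
print(f"{unrealisable.sum()} unrealisable rows found")
df = df[~unrealisable]

# Check the data is timely: keep only the period relevant to the question.
df = df[(df["date"] >= "2000-01-01") & (df["date"] <= "2020-12-31")]

# Remove repeated / redundant records.
df = df.drop_duplicates()

# Flag samples with missing values; here they are simply dropped.
print(f"{df.isna().any(axis=1).sum()} rows with missing values")
df = df.dropna()
```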
Given that enough storage space is available, it is good practice to keep a copy of the raw data before any checking, cleaning or compression is undertaken. This way, if a bug is found in any of the downstream data processing code, the analysis can be repeated from the source data.
The next phase is to develop models representing the system from which the data was collected. The form of these models is wide and varied, and depends upon the question you are aiming to answer. For example:
- If you are interested in identifying groups of customers with similar purchasing patterns, then clustering methods would be the most appropriate (a brief sketch is given after this list).
- If your project requires image or voice recognition then deep learning methods are at present the optimal solution.
- If you are looking to extrapolate company earnings in a hypothetical future economy then statistical regression may be the most appropriate approach.
- If you need to determine the parameters for models that are very computationally expensive, then one can minimise a response surface model of the simulation error as opposed to the model directly.
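For the first case, a minimal clustering sketch using scikit-learn's KMeans is given below; the customer features and the number of clusters are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer purchase features: [annual spend, visits per month].
X = np.array([[250.0, 2], [2400.0, 12], [300.0, 3],
              [2600.0, 10], [180.0, 1], [2900.0, 14]])

# Scale the features so each contributes equally to the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Group the customers into two clusters with similar purchasing patterns.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(model.labels_)   # cluster assignment for each customer
```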
I will provide worked examples of each of these applications in future posts. Regardless of the approach, it is good practice to build the model using a subset of the data (the training set), and verify the model on the remaining data not used during the model training process (the test set). If the model does not perform adequately on the test data, then one may need to either adopt a more complex model or collect more data, depending on whether the model is under- or over-fitting the data. This is discussed in more detail in a future post on multi-dimensional linear regression.
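A minimal sketch of this train/test procedure, using scikit-learn and synthetic data (the linear model and the data itself are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear relationship between one input and one output.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# Build the model on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Verify the model on the data not used during training.
print("test R^2:", r2_score(y_test, model.predict(X_test)))
```

A poor test score relative to the training score is a sign of over-fitting; a poor score on both is a sign of under-fitting.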
Once a model is built and verified, it can then be used to make predictions and/or optimise the system design. The most appropriate optimisation method depends on the dimensionality and nonlinearity of the parameter space, and the computational cost required to evaluate the model. Typical optimisation methods include: gradient-based search; genetic algorithms; evolutionary methods; stochastic optimisation; swarm optimisation; and response surface modelling, to name but a few. I will demonstrate the application of response surface models in the following post.
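As a small illustration of gradient-based search, the sketch below minimises a simple two-parameter stand-in for a model of the system using scipy; the objective function is an assumption for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed stand-in for a model of the system: cost as a function of two
# design parameters, with a minimum at (3, -2).
def model_cost(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2 + 1.0

# Gradient-based search from an initial guess; BFGS estimates the gradient
# numerically when it is not supplied.
result = minimize(model_cost, x0=np.array([0.0, 0.0]), method="BFGS")
print(result.x)    # approximately [3, -2]
print(result.fun)  # approximately 1.0
```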
Visualisation of the original data and/or model predictions is an efficient way to report and communicate the results of your analysis. I have already discussed the visualisation of time-varying three-dimensional data sets in a previous post. There are also a variety of techniques available for visualising even higher-dimensional multi-variate data sets. An example would be visualising how the GDP of an economy varies with employment, population, education, water and food availability, etc. Techniques available to visualise such high-dimensional data sets include: parallel coordinates; radial visualisation; sunburst charts; and matrix scatter plots.
The following posts will provide further details on the various aspects and facets of data mining highlighted here.
References:
[1] Kantardzic, M., 2003, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press.