Friday 6 March 2015

Calibration of Computationally Expensive Simulations via Response Surface Optimisation

In this post I will provide a general definition of optimisation, and an overview of some popular algorithms. I will then go through an example of the use of response surface optimisation to determine the turbulence model parameters that minimise the error between a computationally expensive fluid flow simulation of a submarine and the associated experimental measurements. We first presented this work in [1].


Optimisation is the means of determining the design parameters that minimise a specified cost function, under a set of constraints. For example, if we want to minimise the noise of a fan, then the cost function would be the decibel level of the noise generated by the fan, and the design parameters would include the width, length, thickness and material type of the fan blades. There may also be constraints placed on the weight and sharpness of the components for occupational health and safety reasons. Constraints limit the allowable combinations of design parameters, and typically result in a lower-performance design than would have been achieved had the constraints not been in place. The cost function can also be multi-objective. For example, in addition to minimising the noise of the fan, one may also want to minimise the time, energy and financial cost required to manufacture the components. Typically the designer must determine the relative importance of each of the objectives in order to combine them into a single cost function.
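As a minimal sketch of this weighting idea, assuming hypothetical placeholder functions for the noise and manufacturing cost of a candidate fan design (the functional forms and weights below are purely illustrative):

```python
import numpy as np

# Hypothetical objectives as functions of the design parameters
# (e.g. blade width, length, thickness); purely illustrative forms.
def noise_level(x):
    return float(np.sum(x**2))       # stand-in for a decibel estimate

def manufacture_cost(x):
    return float(np.sum(np.abs(x)))  # stand-in for a time/energy/dollar estimate

# Designer-chosen weights expressing the relative importance of each objective.
w_noise, w_cost = 0.7, 0.3

def combined_cost(x):
    return w_noise * noise_level(x) + w_cost * manufacture_cost(x)

print(combined_cost(np.array([0.5, 1.2, 0.3])))
```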



Optimisation methods can be classed into two broad types: gradient-based methods and non-gradient-based methods. If the design / parameter space for a given cost function is convex (having only one minimum), then gradient-based methods are an extremely efficient way of determining the optimal solution. In these methods one estimates how the cost function changes with the design parameters, and this information is used to select the next combination of parameters to evaluate. However, if the cost function is non-convex and highly non-linear, then gradient-based methods have the potential to get stuck in local minima and produce sub-optimal designs. In this case non-gradient-based methods are preferred.
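As an illustrative sketch of a gradient-based method (the quadratic cost function, step size and starting point below are arbitrary choices), a simple steepest descent loop with finite-difference gradient estimates might look like:

```python
import numpy as np

def cost(x):
    # Simple convex cost function used purely for illustration.
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2

def grad(x, eps=1e-6):
    # Finite-difference estimate of how the cost changes with each parameter.
    g = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (cost(x + dx) - cost(x - dx)) / (2.0 * eps)
    return g

x = np.array([3.0, 2.0])   # initial guess for the design parameters
step = 0.1                 # step size along the negative gradient
for _ in range(100):
    x = x - step * grad(x)

print(x)  # converges towards the minimum at (1.0, -0.5)
```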



One of the more popular non-gradient-based methods is evolutionary optimisation. These methods attempt to evolve an optimal design through natural selection, or survival of the fittest. This is achieved by breeding a set of parents to produce a set of children, and keeping only the fittest children (those with a low cost function) for further breeding [2]. Each parent and child has a DNA-like gene sequence defined by the design input parameters. Breeding is undertaken on the basis of mutation and cross-over (gene swapping). There are two main incarnations that utilise this approach: genetic algorithms (GA) and evolutionary algorithms (EA). GA were developed first, with the parameters encoded into binary numbers, which are essentially treated as genes in the reproductive process. In the mutation process one of the digits in the binary coding is randomly changed from a 1 to a 0 or vice versa. In the cross-over process two genes are split at a random point and interchanged. EA are philosophically the same; however, the cross-over and mutation steps are performed on the design parameters directly, without any binary encoding and decoding. In either case the breeding process is continued over multiple generations until a sufficiently good solution is found or the computational resources are exhausted.
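A minimal sketch of an evolutionary algorithm operating directly on the design parameters, in the EA spirit described above; the cost function, population sizes and mutation scale are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(x):
    # Non-convex cost function used purely for illustration.
    return float(np.sum(x**2) + 2.0 * np.sin(5.0 * x[0]))

n_parents, n_children, n_generations = 10, 40, 50
parents = rng.uniform(-5.0, 5.0, size=(n_parents, 2))  # initial population

for _ in range(n_generations):
    children = []
    for _ in range(n_children):
        # Cross-over: each parameter is inherited from one of two randomly chosen parents.
        a, b = parents[rng.integers(n_parents, size=2)]
        child = np.where(rng.random(2) < 0.5, a, b)
        # Mutation: perturb the parameters directly (no binary encoding required).
        child = child + rng.normal(0.0, 0.1, size=2)
        children.append(child)
    # Survival of the fittest: keep the lowest-cost children as the new parents.
    children = np.array(children)
    order = np.argsort([cost(c) for c in children])
    parents = children[order[:n_parents]]

print(parents[0], cost(parents[0]))  # best design found
```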



Evolutionary approaches can achieve a very good solution in a reasonable time; however, they require many concurrent evaluations of the cost function [2]. This is not feasible for cost functions that are very computationally expensive. In this case Response Surface Models (RSM) can be used to build an estimate of the parameter space from a limited number of cost function evaluations. The minimum of the RSM is then obtained either numerically or analytically, and the cost function is re-evaluated for the estimated optimal parameter set. The RSM is updated with this additional cost function evaluation, and the new minimum is determined. This process is repeated until there is a sufficiently small difference between the RSM estimate of the cost function minimum and the value obtained from the cost function itself. The only subjective choice is the functional form of the RSM. Popular choices are splines and Kriging [3]; the latter is adopted in the following example. Kriging has the added feature of providing error estimates of the RSM throughout the design space.
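A minimal sketch of such a response surface loop using a Kriging (Gaussian process) surrogate, here via scikit-learn; the expensive_cost function, parameter bounds and grid sizes are placeholders for an actual expensive simulation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_cost(x):
    # Placeholder for an expensive simulation followed by an error calculation.
    return (x[0] - 0.3)**2 + (x[1] - 0.7)**2

# Initial design: a coarse grid of cost function evaluations over [0, 1]^2.
grid = np.linspace(0.0, 1.0, 5)
X = np.array([[a, b] for a in grid for b in grid])
y = np.array([expensive_cost(x) for x in X])

# Dense set of candidate points over which the surrogate minimum is searched.
dense = np.linspace(0.0, 1.0, 101)
candidates = np.array([[a, b] for a in dense for b in dense])

for _ in range(10):
    # Fit the Kriging surrogate to all evaluations so far (small nugget for stability).
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2),
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X, y)
    # Locate the minimum of the response surface and re-evaluate the true cost there.
    mean = gp.predict(candidates)
    x_new = candidates[np.argmin(mean)]
    y_new = expensive_cost(x_new)
    # Stop once the surrogate minimum and the true cost agree closely enough.
    if abs(np.min(mean) - y_new) < 1e-4:
        break
    X = np.vstack([X, x_new])
    y = np.append(y, y_new)

print(x_new, y_new)
```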



As mentioned at the beginning of the post, the optimisation task here is to determine the turbulence model parameters that minimise the error between an expensive computational fluid dynamics (CFD) simulation of a submarine, and associated experimental measurements (lift force, drag force and yaw moment). Slices of the velocity field are illustrated with respect to the submarine body below.



For this particular turbulence model there are two key parameters, βi and α; we therefore have a two-dimensional parameter space. Reasonable limits for these two parameters are set, with 9 evenly spaced values selected along each dimension, producing 9 x 9 = 81 design candidates. The CFD simulation was run 81 times - once for each of the design candidates - with the lift force, drag force and yaw moment calculated from the flow field. The multi-objective cost function is the average squared error between the forces calculated from the CFD and the forces measured experimentally. The initial RSM of the simulation error is illustrated below, where blue contours indicate low error and red contours indicate high error. The symbols in the figure represent the initial 81 evaluations of the cost function.
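As a small sketch of the set-up for this case (the parameter limits, experimental values and the stand-in for the CFD run below are all hypothetical placeholders), the initial grid and multi-objective cost function might be assembled as:

```python
import numpy as np

# Hypothetical experimental measurements: lift force, drag force, yaw moment.
F_exp = np.array([1.20, 0.45, 0.08])

def cfd_forces(beta_i, alpha):
    # Stand-in for running the expensive CFD simulation with the given
    # turbulence model parameters and integrating the three quantities.
    return np.array([1.20 + (beta_i - 0.08), 0.45 + (alpha - 0.5), 0.08])

def cost(beta_i, alpha):
    # Average squared error between the simulated and measured quantities.
    return float(np.mean((cfd_forces(beta_i, alpha) - F_exp)**2))

# 9 evenly spaced values per parameter within hypothetical limits: 81 candidates.
betas = np.linspace(0.07, 0.09, 9)
alphas = np.linspace(0.40, 0.60, 9)
candidates = [(b, a) for b in betas for a in alphas]
errors = [cost(b, a) for b, a in candidates]
```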

The minimum of this RSM is determined, along with the associated values of βi and α. An additional CFD simulation is run with these particular values of βi and α. The cost function (the error in the forces between the CFD simulation and the measurements) is evaluated and the RSM is updated. This process continues until the difference between the RSM estimate of the cost function and the evaluated cost function is sufficiently small. The final RSM, along with the additional cost function evaluations, is illustrated below.
In addition to determining the optimal solution, the RSM also provides an understanding of how the cost function changes throughout the design space, which evolutionary methods do not. RSM are also capable of answering multiple questions at once. For example, in a multi-objective optimisation study, if the relative importance of the objectives changes, then a new RSM can be built using the existing simulation results. Further refinement of the RSM would in all likelihood be required to find the optimal solution for the new cost function. If one were to use an evolutionary method, then the optimisation process would have to be run from scratch.

The RSM optimisation method can also be used to determine the optimal hyper-parameters in the data modelling process, such as the regularisation parameters used in regression and classification studies.
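For example, the surrogate loop sketched earlier could be wrapped around a cross-validated error; a minimal sketch of such a cost function for the regularisation strength of a ridge regression, on placeholder data, is:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # placeholder features
y = X @ rng.normal(size=10) + rng.normal(size=200)   # placeholder targets

def cost(log10_alpha):
    # Cross-validated mean squared error as a function of the regularisation
    # strength; this is the quantity a response surface model would approximate.
    model = Ridge(alpha=10.0**log10_alpha)
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    return -float(np.mean(scores))

print(cost(-2.0), cost(0.0), cost(2.0))
```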

References:
[1] Chng, T. S., Widjaja, R., Kitsios, V. & Ooi, A., 2007, RANS turbulence model optimisation based on surrogate management framework, Australasian Fluid Mechanics Conference, Queensland University, Brisbane, Australia, 3-7 December.
[2] Fogel, D., 1994, An introduction to simulated evolutionary optimization, IEEE transactions on neural networks, Vol. 5, pp 3-14.
[3] Jones, D. R., 2001, A taxonomy of global optimization methods based on response surfaces, J. Global Opt., Vol. 21, pp 345-383.

Principles of Data Mining and the Scientific Method

This post is intended to serve as an overview of the principles of data mining, and the important stages in the process. Various aspects of these stages will be addressed in detail in future posts.

One can think of data mining as the scientific method re-branded for a new age in which large volumes of data are available, with additional emphasis placed on the treatment of such data. In the scientific method one observes certain phenomena, and then tests hypotheses proposed to explain these observations (or data sets). In the mature sciences (e.g. physics, chemistry) the scientific method has produced rigorous mathematical representations of reality. For example, we now know that the acceleration due to gravity (a) acting on an object by another object of much larger mass (M) and radius (R) is a = G M / R^2, where the gravitational constant is G = 6.67384 × 10^-11 m^3 kg^-1 s^-2. There is no need to continually re-estimate G from data. Substituting in the mass and radius of the Earth produces a gravitational acceleration of a = 9.81 m s^-2. However, in many newer and more complex fields, including social science, finance, business intelligence and genetics, there are no such fundamental mathematical representations as yet, and one must rely on data mining techniques to build models from the observations in order to make predictions of these systems.
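As a quick check of this arithmetic (using approximate values for the Earth's mass and radius):

```python
G = 6.67384e-11   # gravitational constant, m^3 kg^-1 s^-2
M = 5.972e24      # approximate mass of the Earth, kg
R = 6.371e6       # approximate mean radius of the Earth, m

a = G * M / R**2
print(a)          # approximately 9.8 m s^-2
```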

There are six key stages in the data mining process: 1) pose the question; 2) collect / generate the data; 3) check and clean the data; 4) build and validate the model; 5) use the model to make predictions and/or optimise the system; and 6) report and visualise the results. This process is illustrated below, and is a modified version of that in [1]. The process is illustrated in a sequential manner, but it is in fact iterative. If problems are encountered in downstream stages, then you may have to return to earlier stages to either: build an alternative model; perform additional data checking; collect more data; or pose an alternative question if your original question is inappropriate or unanswerable. I will now provide more details on each of these stages.



It may seem obvious, but the first stage in the process is to pose a question to be answered, or more specifically, a hypothesis that can be tested. It is important to be as precise as possible, as this will define the effort and investment required for each of the forthcoming stages in the data mining process. It is also important to do as much background reading as possible on previous work in the field, so as to ensure you are not reinventing the wheel. From my own personal experience, in today's research and commercial environments the problem is typically not that we don't have sufficient data, but rather that we don't have sufficient questions.

The second stage is to collect and store the appropriate type, quality and quantity of data required to answer the question at hand. The data may be collected from observations of the environment (e.g. global atmospheric temperature measurements) and/or generated by numerical simulation (e.g. general circulation models of the climate). In either case all errors, uncertainties and caveats should be documented.

The third stage involves the collating, checking and cleaning of the data. Aspects that should be checked include:
  • Integrate / wrangle the data from various sources into a consistent data structure and check that the data from different sources is in fact consistent.
  • Ensure the data has the appropriate type (e.g. integer, float, text, images, video). For example the average number of children per family may be a float (e.g. 2.4) but the number of children in a given family must be an integer (e.g. 2).
  • Check that the data is in fact realisable. For example you cannot have a negative amount of rainfall.
  • Check that the data is "timely", that is, collected from a period appropriate to answer the question at hand.
  • Remove repeated and redundant data.
  • Detect and remove outliers / anomalies from the database. This is a large field and will be discussed at a future time.
  • Flag samples with any missing values and either remove the entire sample or augment the sample with an appropriate estimate of the missing value. This is also a large field in itself and will be the subject of a future post.
Given that enough storage space is available, it is good practice to keep a copy of the raw data before any checking, cleaning or compression is undertaken. This way if a bug is found in any of the downstream data processing codes, the analysis can be repeated from the source data.
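A minimal sketch of a few of the above checks using pandas, on a hypothetical rainfall data set:

```python
import pandas as pd

# Hypothetical raw data with a duplicate, an unrealisable value and a missing value.
raw = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "rainfall_mm": [3.2, 3.2, -1.0, 0.0, None],
})

clean = raw.copy()               # keep the raw data untouched
clean = clean.drop_duplicates()  # remove repeated records

# Flag samples with missing values (here simply dropped; imputation is another option).
clean = clean.dropna(subset=["rainfall_mm"])

# Check the data is realisable: rainfall cannot be negative.
clean = clean[clean["rainfall_mm"] >= 0]

# Ensure the appropriate data type.
clean["rainfall_mm"] = clean["rainfall_mm"].astype(float)

print(clean)
```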

The next phase is to develop models representing the system from which the data was collected. The form of these models is wide and varied and dependent upon the question which you are aiming to answer. For example:
  • If you are interested in identifying groups of customers with similar purchasing patterns then clustering methods would be the most appropriate.
  • If your project requires image or voice recognition then deep learning methods are at present the optimal solution.
  • If you are looking to extrapolate company earnings in a hypothetical future economy then statistical regression may be the most appropriate approach.
  • If you need to determine the parameters for models that are very computationally expensive, then one can minimise a response surface model of the simulation error as opposed to the model directly.
I will provide worked examples of each of these applications in future posts. Regardless of the approach, it is good practice to build the model using a subset of the data (the training set), and to verify the model on the remaining data not used during the training process (the test set). If the model does not perform adequately on the test data, then one may need either to adopt a more complex model or to collect more data, depending on whether the model is under-fitting or over-fitting. This is discussed in more detail in a future post on multi-dimensional linear regression.
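A minimal sketch of this train/validate split using scikit-learn, on placeholder data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                               # placeholder features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=500)   # placeholder targets

# Build the model on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Verify on the data not used during training; a large gap between the two scores
# suggests over-fitting, while two poor scores suggest under-fitting.
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```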

Once a model is built and verified, it can then be used to make predictions and/or optimise the system design. The most appropriate optimisation method depends on the dimensionality and non-linearity of the parameter space, and on the computational cost required to evaluate the model. Typical optimisation methods include: gradient-based search; genetic algorithms; evolutionary methods; stochastic optimisation; swarm optimisation; and response surface modelling, to name but a few. I will demonstrate the application of response surface models in the following post.

Visualisation of the original data and/or model predictions is an efficient way to report and communicate the results of your analysis. I have already discussed the visualisation of time-varying three-dimensional data sets in a previous post. There are also a variety of techniques available for visualising higher-dimensional multi-variate data sets, for example visualising how the GDP of an economy varies with employment, population, education, water and food availability, etc. Such techniques include: parallel coordinates; radial visualisation; sunburst charts; and matrix scatter plots.
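A minimal sketch of one such technique, parallel coordinates, using pandas on a hypothetical country-level data set:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical multi-variate data set: one row per country, one column per indicator.
df = pd.DataFrame({
    "group":      ["high GDP", "high GDP", "low GDP", "low GDP"],
    "employment": [0.95, 0.93, 0.80, 0.78],
    "education":  [0.90, 0.88, 0.60, 0.55],
    "water":      [0.99, 0.98, 0.70, 0.65],
    "food":       [0.97, 0.96, 0.75, 0.72],
})

# Each vertical axis is one variable; each line is one sample (country).
parallel_coordinates(df, class_column="group")
plt.show()
```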


The following posts will provide further details on the various aspects of data mining highlighted here.

References:
[1] Kantardzic, M., 2003, Data mining: concepts, models, methods, and algorithms, Wiley-IEEE Press.