Tuesday 22 April 2014

Koalas vs Possums - Modelling Biological Growth using Recursive Bayesian Estimation

In this week's post I will be discussing how to track biological population growth using a simple numerical model coupled with incomplete, noisy measurements. First I will provide an overview of the numerical population model, and then I will illustrate how to incorporate the measurements so that the numerical model better represents the real world.

The example I am using to illustrate this is the predator-prey problem. In the classic predator-prey problem there is one type of predator and one type of prey, for example foxes and rabbits. Here, however, we will be using two predators and one prey to achieve the required level of complexity. The selected elements of the model have a somewhat Australian flavour. The two types of predators are koalas and possums. You may not think that these two animals are the most vicious of predators, however, you might think otherwise if you were the chosen prey ... gum leaves! The numerical model calculates the populations of the koalas, possums and gum leaves based on certain assumptions. We need to provide the rates at which an average koala and an average possum eat gum leaves. For both the koalas and possums, we also need to provide a threshold amount of gum leaves below which they die and above which they breed. We can estimate these rates and thresholds from the data itself, but for the moment we'll assume that we know what they are. A sketch of a model of this form is given below, and the details of the numerical model can be found in reference [1] at the end of this post.
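To make the structure concrete, here is a minimal sketch of a two-predator / one-prey model of this kind. The exact equations and coefficients used in this post follow reference [1]; the functional form, parameter values and variable names below are illustrative assumptions only.

```python
import numpy as np

def step(koalas, possums, leaves, dt=0.01,
         eat_k=0.010, eat_p=0.012,      # assumed feeding rates (leaves per animal per time)
         thresh_k=50.0, thresh_p=50.0,  # assumed leaf thresholds: breed above, die below
         growth=0.5, leaf_capacity=200.0):
    """One explicit Euler step of an illustrative two-predator / one-prey model.
    Gum leaves grow logistically and are eaten by both predators; each predator
    population grows when the leaf count is above its threshold and declines below it."""
    d_leaves = growth * leaves * (1.0 - leaves / leaf_capacity) \
               - eat_k * koalas * leaves - eat_p * possums * leaves
    d_koalas = 0.05 * koalas * (leaves - thresh_k) / thresh_k
    d_possums = 0.05 * possums * (leaves - thresh_p) / thresh_p
    return (koalas + dt * d_koalas,
            possums + dt * d_possums,
            leaves + dt * d_leaves)

# integrate forward in time from an illustrative initial population
state = (10.0, 10.0, 100.0)   # koalas, possums, gum leaves
history = [state]
for _ in range(20000):        # 200 time units at dt = 0.01
    state = step(*state)
    history.append(state)
```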

For a particular selection of rates and thresholds we calculate the evolution of the koala, possum and gum leaf populations versus time, as illustrated in the plot below. The specific numbers do not mean much; the plot is just intended to be an illustration of the approach. Here you can see a very regular and periodic pattern. The number of gum leaves (green line) initially increases, which is then followed by an increase in the populations of both the koalas and possums as there is now more food available. When there are so many koalas and possums that they are eating the gum leaves faster than the gum leaves can grow, the number of gum leaves decreases. When the gum leaves get below a certain threshold value the animals start to die because there is not enough food available. With there now being fewer koalas and possums, the gum leaves begin to grow back. When there is enough food the animals begin to breed again, and so the cycle continues.
However, if we slightly change the rate at which the possums eat the gum leaves then we get a very different evolution of all three populations, as you can see from the plot below. It is clear that the system is now more complicated and less periodic / regular. This is one illustration of the butterfly effect, where a very small change to the system can produce very different results. The overall pattern of the population growth, however, is the same, with the growth in the animal populations following a growth in the number of gum leaves, and the decay in the animal populations following a decay in the number of gum leaves. The predator-prey model has no random components in it at all, yet it produces changes in the populations which on the surface appear to be random. They are in fact not random, but chaotic. This type of chaotic behaviour can only be produced by a system with at least three degrees of freedom (koalas, possums and gum leaves in this case). This comparison illustrates how important it is to get the input parameters of the model right. A small change to the rate at which possums eat gum leaves produced qualitatively similar behaviour, but quantitatively very different results.
We will now augment the simple model used above with measurements of the true system using Recursive Bayesian Estimation. Recursive Bayesian Estimation is the general term given to a variety of methods used to incorporate measurements into a numerical model as soon as those measurements become available. In this application we'll be using the Ensemble Kalman Filter; interested readers can find the mathematical details on Wikipedia. It essentially involves running an ensemble of many numerical models at the same time, each starting with slightly different conditions (e.g. different populations), and in some cases slightly different model parameters (e.g. possum feeding rates). A comparison of the instances of the numerical model gives an indication of the natural variability (or error) in the system. When a measurement becomes available it is compared to what the numerical model suggests the measurement should be, and the model is then pushed in the direction of the measurement. If the measurement has no noise in it at all, then the model is set to the measurement exactly. If there is some error in the measurement, then the numerical model is pushed only part of the way towards the measurement, with the extent of the correction determined by a comparison of the natural variability and the measurement error.
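As a concrete illustration, below is a minimal sketch of a single analysis (update) step of an Ensemble Kalman Filter in the stochastic, perturbed-observations form. The function and variable names are my own, and the covariance matrix is built explicitly for clarity rather than efficiency.

```python
import numpy as np

def enkf_update(ensemble, obs, H, obs_err_std, rng=None):
    """One Ensemble Kalman Filter analysis step (perturbed-observations form).
    ensemble    : (n_members, n_state) forecast ensemble, e.g. columns = [koalas, possums, leaves]
    obs         : (n_obs,) measured values, e.g. noisy koala and possum counts
    H           : (n_obs, n_state) observation operator selecting what is measured
    obs_err_std : standard deviation of the measurement noise
    """
    rng = np.random.default_rng() if rng is None else rng
    n_members = ensemble.shape[0]
    X = ensemble - ensemble.mean(axis=0)               # ensemble anomalies
    P = X.T @ X / (n_members - 1)                      # sample forecast covariance
    R = (obs_err_std ** 2) * np.eye(len(obs))          # observation error covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
    # perturb the observations so the analysis ensemble retains the correct spread
    perturbed = obs + rng.normal(0.0, obs_err_std, size=(n_members, len(obs)))
    innovations = perturbed - ensemble @ H.T           # measurement minus model prediction
    return ensemble + innovations @ K.T                # analysis (corrected) ensemble
```

In the full algorithm this update alternates with forecast steps, in which each ensemble member is propagated forward by the numerical population model until the next measurement arrives.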

Here the Ensemble Kalman Filter is used to estimate the population growth properties from partial, noisy measurements of the "true" system. The "true" system is the second case discussed above, but here we are just focusing on the first 100 time units of it. The measurements are "partial" in that we have only measured the koala and possum populations, and do not have a measurement of the number of leaves. They are "noisy" in that I have added a random number to the measurements of the "true" system.
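For reference, here is a sketch of how such partial, noisy measurements can be generated from a "true" state. The observation operator H simply selects the koala and possum components, and the noise level is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# observation operator: we measure koalas and possums, but not the gum leaves
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

obs_err_std = 2.0   # assumed measurement noise level

def observe(true_state):
    """Turn a true state [koalas, possums, leaves] into a partial, noisy measurement."""
    return H @ np.asarray(true_state) + rng.normal(0.0, obs_err_std, size=2)
```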

In the plot below the "true" koala population is represented by the red line, and the noisy measurements of it are represented by the small black boxes. You can see that in some cases the black boxes are shifted away from the red line, due to the noise in the measurements. Between measurements, the numerical model calculates the population until a new measurement comes along to correct it. The numerical model estimate in each of the following plots is given by the black line.
Likewise, as can be seen from the plot below, the numerical model estimates the possum population, which is corrected when the measurements become available.
The Ensemble Kalman Filter uses the measurements of the koala and possum populations not only to correct the koala and possum population estimates from the numerical model, but also to correct the gum leaf population. This illustrates a key feature of the approach: one can estimate something that cannot be measured (the gum leaf population) through its relationship (the numerical model) to something that can be measured (the koala and possum populations).
In addition to estimating the population of the gum leaves from the measurements of the animal populations, you can also estimate the model parameters from the data itself. For example, we can estimate the rate at which the possums eat the gum leaves, which we identified earlier as significantly affecting the evolution of the populations. In the plot below you can see that the estimate of the possum feeding rate approaches the true value as more measurements of the animal populations become available. Each of the stair-step changes in the estimate coincides with a new measurement. The more parameters you wish to estimate from the data, the more numerical simulations need to be run simultaneously in the ensemble.
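A common way to achieve this is state augmentation: each ensemble member carries its own value of the uncertain parameter as an extra state component, and the same filter update that corrects the populations also corrects the parameter through its correlation with the measured quantities. A minimal sketch, with illustrative numbers, is below.

```python
import numpy as np

# Augmented state for each ensemble member: [koalas, possums, leaves, possum_feed_rate].
# The feeding rate has no dynamics of its own (the forecast carries it along unchanged),
# but the filter update adjusts it because, across the ensemble, it is correlated
# with the measured koala and possum populations.
n_members = 100
rng = np.random.default_rng(2)

ensemble = np.column_stack([
    rng.normal(10.0, 1.0, n_members),     # koalas
    rng.normal(10.0, 1.0, n_members),     # possums
    rng.normal(100.0, 10.0, n_members),   # gum leaves (unmeasured)
    rng.normal(0.012, 0.002, n_members),  # uncertain possum feeding rate
])

# observation operator: still only the koala and possum populations are measured
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

# best estimate of the feeding rate is the ensemble mean of the augmented component
feed_rate_estimate = ensemble[:, 3].mean()
```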
This fun example of Recursive Bayesian Estimation has illustrated a powerful approach for fusing together numerical models and real world data. It has implications for any application in which a numerical model is required to simulate reality, particularly when the numerical model parameters are not well known. This is particularly the case in socio-economic and complex biological systems.

References:
[1] Eirola, T., Osipov, A.V. & Söderbacka, G., 1996, Chaotic regimes in a dynamical system of the type many predators one prey, Helsinki University of Technology, Institute of Mathematics, Research Reports A368.

Sunday 6 April 2014

Flow over Aircraft and Wind Turbines - Capturing Increasing Complexity using Principal Components

In my previous post I discussed the important role that turbulence plays in the evolution of the atmosphere and ocean. Here I'll discuss turbulence in the context of the flow over aircraft wings and wind turbines. Specifically I'll show how the complexity of the flow increases as the object moves faster. I'll also show how to represent this complexity in a simple way using the data mining technique of "principal component analysis", also known as "proper orthogonal decomposition" in engineering, "singular value decomposition" in mathematics, or "empirical orthogonal functions" in geophysics.

Arguably the most important quantity in the field of fluid dynamics is the Reynolds number, which is essentially the ratio of the momentum (inertial) forces to the viscous forces. It is high for objects moving fast in a fluid of low viscosity (eg: air), and low for objects moving slowly in a very viscous fluid (eg: honey). Flows with higher Reynolds numbers are more complex and have a greater range of scales, that is, the largest vortex in the flow is significantly bigger than the smallest vortex in the flow. I'll illustrate below how the complexity of the flow over an aerofoil (representative of an idealised aircraft wing or wind turbine blade) increases with Reynolds number.
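For reference, the Reynolds number is commonly written as follows, where $U$ is a characteristic velocity (e.g. the speed of the oncoming flow), $L$ is a characteristic length (e.g. the chord of the aerofoil) and $\nu$ is the kinematic viscosity of the fluid:

$$ \mathrm{Re} = \frac{U\,L}{\nu} $$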

The configuration that I am looking at is an aerofoil at an angle of 18 degrees, with the flow moving from left to right in the movies below. All of the movies illustrated below are generated by post-processing the data resulting from computational fluid dynamics simulations, using conceptually the same approach as that discussed in my previous post on simulations of the atmosphere and ocean. The volume of fluid surrounding the aerofoil is broken down into a series of grid boxes and the Navier-Stokes equations are solved at each position to determine the velocity and pressure throughout the fluid volume.

In fact the flow around an aerofoil at very low Reynolds numbers (or very slow moving aerofoils) does not change with time. The fluid is moving, but its velocity and pressure at each position are not changing. It is not until the aerofoil reaches a certain critical Reynolds number (or speed) that the flow begins to change with time, as illustrated in the movie below. Here the flow is unsteady and two-dimensional. It is coloured by vorticity, which is a measure of the rotation of the fluid. Red is rotating in the counter-clockwise direction and blue is rotating in the clockwise direction.


I will now use principal component analysis to break down the flow into a series of "modes", which can be added together with varying weights to reconstruct each instant in time. One can also think of the method as a form of information compression; it has also been used for facial feature detection. For this particular flow 94% of the energy can be represented by the first two modes (or "facial features") illustrated below. Further details of this flow and its stability properties can be found in my Journal of Computational Physics paper in reference [1] and in my PhD thesis in reference [2] listed below.
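In practice the decomposition can be computed from a matrix of flow "snapshots" using the singular value decomposition. The sketch below is illustrative only: the array sizes are placeholders, and random numbers stand in for the actual simulation data.

```python
import numpy as np

# Snapshot matrix: each column is the flow field (e.g. vorticity at every grid point)
# at one instant in time. Sizes and data here are placeholders for illustration.
n_points, n_snapshots = 10000, 200
snapshots = np.random.default_rng(3).standard_normal((n_points, n_snapshots))

# subtract the time-mean flow, then take the singular value decomposition
mean_flow = snapshots.mean(axis=1, keepdims=True)
fluctuations = snapshots - mean_flow
modes, sing_vals, weights = np.linalg.svd(fluctuations, full_matrices=False)

# energy captured by each mode, and the cumulative fraction
energy = sing_vals ** 2
energy_fraction = np.cumsum(energy) / energy.sum()
print(f"first two modes capture {energy_fraction[1]:.0%} of the fluctuating energy")

# reconstruct the first snapshot using only the first two modes
k = 2
approx = mean_flow[:, 0] + modes[:, :k] @ (sing_vals[:k] * weights[:k, 0])
```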

mode 1
mode 2
As the Reynolds number increases the flow transitions from an unsteady two-dimensional flow to an unsteady three-dimensional flow, illustrated in the movie below. The movie shows three-dimensional surfaces defining the boundaries of the complex vortex structures. They are coloured by rotation in the flow direction. The total data set is 17.5 GB.


Here the first two modes represent the two-dimensional aspects of the flow and capture only 65% of the energy.

mode 1
mode 2
The next two modes capture the three-dimensional aspects of the flow. Further details of this flow and the associated modes can be found in my PhD thesis [2].
mode 3
mode 4

As the Reynolds number is increased further the flow becomes even more complex, with many smaller vortices being generated, as illustrated in the movie below. This total data set is 35 GB.


Here the first two modes now represent only 45% of the total energy. What is interesting here is that the flow is so complex that the large scale vortices are hidden amongst the forest of small scale vortices. The principal components, however, are able to extract the large features from the data. Further details on this flow and the modal decomposition can be found in my Journal of Fluid Mechanics paper in reference [3].
mode 1
mode 2
It is clear that as the Reynolds number increases and the flow becomes more complex, the first two modes represent less and less of the total energy. This also means that more modes are required to represent a given percentage of the total energy.
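An illustrative helper for that last point, assuming the singular values from the decomposition sketched earlier are available:

```python
import numpy as np

def modes_required(sing_vals, target=0.94):
    """Smallest number of modes whose cumulative energy reaches the target fraction."""
    energy_fraction = np.cumsum(sing_vals**2) / np.sum(sing_vals**2)
    return int(np.searchsorted(energy_fraction, target) + 1)

# e.g. with illustrative singular values: the flatter the spectrum, the more modes needed
print(modes_required(np.array([10.0, 8.0, 3.0, 2.0, 1.0]), target=0.9))
```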

Principal Component Analysis has many applications and can be used to extract key features from any data set be it physical, biological or socio-economic.

References:
[1] Kitsios, V., Rodríguez, D., Theofilis, V., Ooi, A. & Soria, J., 2009, BiGlobal stability analysis in curvilinear coordinates for massively separated lifting bodies, Journal of Computational Physics, Vol. 228, pp 7181-7196. [link]
[2] Kitsios, V., 2010, Recovery of fluid mechanical modes in unsteady separated flows, PhD Thesis, The University of Melbourne. [PDF]
[3] Kitsios, V., Cordier, L., Bonnet, J.-P., Ooi, A. & Soria, J., 2011, On the coherent structures and stability properties of a leading edge separated aerofoil with turbulent recirculation, Journal of Fluid Mechanics, Vol. 683, pp 395-416. [link]

Saturday 5 April 2014

The Physical Earth System and the Role of Turbulence

It is important to understand the Earth system, both for understanding past climates and also for making future predictions. Numerical simulations of the Earth system are undertaken using high performance computing facilities over different time periods for differing purposes.
  • Numerical Weather Prediction simulates 1-7 days of a virtual Earth, and can be used to determine what the weather will be over the following week. This is important for planning public events, renewable energy power plant management and operation, and early warnings of extreme events (eg: bushfire, flood, strong winds).
  • Seasonal Prediction simulates 1-2 years of a virtual Earth. This type of simulation estimates, for example, whether the following summer will be hotter and/or wetter than this year.
  • Climate Projections are simulated over 100s of years, and estimate the properties of future climates under various scenarios of human activity.
  • Paleo-climate Simulations are undertaken over 100,000s of years, and are used to understand previous significant changes to the Earth, and also to test if current climate simulations can replicate past observed trends.
The Earth system comprises various components, including the:
  • Atmosphere - moves heat, dust, sea spray, air pollution and other aerosols throughout the Earth.
  • Ocean - moves heat, salt, water pollution and aquatic life throughout the Earth.
  • Ice sheet - solid ice fixed to the Earth's sea floor (eg: at Antarctica). It melts and freezes as a result of natural variability and human activities. 
  • Sea ice - frozen ice that floats on the ocean surface (eg: encircling Antarctica), which also grows and recedes due to natural variability and human activities.
  • Vegetation - on land (eg: forests), and in the ocean (eg: coral reefs).
  • Biology - large scale biological growth can affect the Earth system, for example rapid growth in the phytoplankton population can change the chemical composition of the ocean.
  • Carbon cycle - the exchange of carbon throughout the Earth. For example burning vegetation releases carbon into the atmosphere, some of which is then dissolved into the ocean.
In computer simulations of the Earth system, the atmosphere and ocean are subdivided into a series of three-dimensional grid boxes, with the atmospheric wind, oceanic currents, rain and other quantities solved at each grid box location. The interactions of the atmosphere and ocean with the remaining components of the Earth system are approximated by some form of model. These interactions affect the atmospheric and oceanic circulation, which in turn affects all of the Earth system components.

In order to correctly represent the ice sheet, sea ice, vegetation, biology and carbon cycle, as a first step the atmospheric and oceanic circulation must be correctly simulated. Turbulence plays an essential role in correctly simulating the atmosphere and ocean. In the atmosphere, for example, the vortices (or eddies) range from the largest ones ~1000km in horizontal diameter down to the smallest approximately 1mm in diameter, and everything in between. These different sizes of vortices are what we refer to as "scales of turbulence". To make things even more complicated, each of these vortices of different sizes interacts with the others.

The difficulty in simulating the atmosphere and ocean is that even if we were to pool together all of the most powerful super-computers, it would still be impossible to simulate all of the scales of turbulence. In practice the grid we use to simulate the atmosphere is broken down into grid boxes ~100km in the horizontal direction, and we model the influence of the vortices smaller than ~100km on the larger vortices that are explicitly represented by the grid. This type of model is called a "subgrid turbulence model", the development of which has been the focus of my own research.
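To give a flavour of what a subgrid turbulence model looks like, below is a minimal sketch of the classic deterministic Smagorinsky closure from reference [1], in which the unresolved vortices act as an additional, flow-dependent viscosity on the resolved flow. This is not the stochastic approach developed in my own work (references [2-4]); the two-dimensional form and parameter names here are illustrative.

```python
import numpy as np

def smagorinsky_eddy_viscosity(dudx, dudy, dvdx, dvdy, grid_spacing, C_s=0.17):
    """Classic Smagorinsky (1963) eddy viscosity for a resolved 2D velocity field.
    The influence of the unresolved vortices is modelled as an extra viscosity
    proportional to the magnitude of the local strain rate of the resolved flow:
    nu_t = (C_s * grid_spacing)**2 * |S|, with |S| = sqrt(2 S_ij S_ij).
    """
    S11 = dudx
    S22 = dvdy
    S12 = 0.5 * (dudy + dvdx)
    strain_magnitude = np.sqrt(2.0 * (S11**2 + S22**2 + 2.0 * S12**2))
    return (C_s * grid_spacing) ** 2 * strain_magnitude
```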

If the influence of the small unresolved vortices on the large resolved ones is not modelled correctly, then this can produce resolution-dependent results, which has been a significant and long standing problem since the earliest climate simulations [1]. I recently solved this problem by developing subgrid turbulence models for both the atmosphere and ocean using stochastic data mining techniques [2,3,4]. I generated high resolution (very small grid box sizes) atmospheric and oceanic data sets, and determined the model structure from this data. Simulations adopting these scaling laws are shown to produce resolution independent statistics, with orders of magnitude improvement in computational efficiency. I have also more recently successfully applied this technique to numerical simulations of boundary layer flows [5].

Here is an animation of the winds generated from the simulations of the atmosphere. Red is fast wind and blue is slow wind. You can get more details in my Journal of the Atmospheric Sciences paper in reference [2] listed below.


Below is an animation of the currents generated from the simulations of the ocean. Here we are looking at the southern hemisphere. You can get more details in my Ocean Modelling journal paper reference [3].


As a side note, when you are in a plane and the pilot announces that they are approaching some turbulence, they are referring to regions in the atmosphere that are more turbulent / complex / disordered. My following post will discuss how fluids flow over aircraft wings.

References:
[1] Smagorinsky, J., 1963, General circulation experiments with the primitive equations. I. The basic experiment. Mon. Wea. Rev., 91, 99–164.
[2] Kitsios, V., Frederiksen, J.S. & Zidikheri, M.J., 2012, Subgrid model with scaling laws for atmospheric simulations, Journal of the Atmospheric Sciences, 69, pp 1427-1445. [link]
[3] Kitsios, V., Frederiksen, J.S. & Zidikheri, M.J., 2013, Scaling laws for parameterisations of subgrid eddy-eddy interactions in simulations of oceanic circulations, Ocean Modelling, 68, pp 88-105. [link]
[4] Kitsios, V., Frederiksen, J.S. & Zidikheri, M.J., 2013, Theoretical comparison of subgrid turbulence in atmospheric and oceanic quasigeostrophic models, Nonlinear Processes in Geophysics, 23, pp 95-105. [link]
[5] Kitsios, V., Sillero, J.A., Frederiksen, J.S. & Soria, J., 2015, Proposed stochastic parameterisations of subgrid turbulence in large eddy simulations of turbulent channel flow, Journal of Turbulence, 16, pp 729-741. [link]