how the world works - a data scientist's perspective: Supervised machine learning

To recap from the previous post, supervised machine learning aims to build predictive models linking a series of inputs to a series of outputs, as opposed to unsupervised machine learning, which builds descriptive models representing the data at hand. There are two types of supervised machine learning: regression, which predicts a continuous number for a given input sample, and classification, which predicts the class (or group) to which a particular sample belongs. Here we will concentrate on classification.

In the previous post, I built a regression model predicting the average life expectancy of a country from various socio-economic input factors. As an example of classification, I will use this same data set to determine whether a particular country in a particular year is in the OECD (that is, the group of countries belonging to the Organisation for Economic Co-operation and Development). The fields (with the associated variable name in parentheses) are listed below; the last, the OECD flag, is the output to be predicted from the others:

Life expectancy (lifeExp)
GDP per capita (gdpPC)
Total per capita spend on health (healthPC)
Government per capita spend on health (healthPCGov)
Births per capita (birth)
Deaths per capita (death)
Deaths of children under 5 years of age per capita (deathU5)
Years women spend in school (womenSchool)
Years men spend in school (menSchool)
Population growth rate (popGrowth)
Population (pop)
Immigration rate (immigration)
Flag stating whether a country in a particular year is in the OECD (oecd)

The data was downloaded from the Gapminder and OECD websites. Data wrangling was done using pandas, the machine learning was undertaken using scikit-learn, and the visualisations were developed using matplotlib and seaborn.

For clarity I have visualised the data for the key factors determined in the previous regression study against the OECD flag variable. If a country in a particular year is in the OECD the variable is given a value of oecd=1; if it is not in the OECD it has a value of oecd=0. Let us first consider the subplot below highlighted by the black box, which is the per capita death rate of children under the age of 5 (deathU5) versus the OECD flag. One can see that only countries with a low under-5 death rate are in the OECD; however, there are also certain countries with low under-5 death rates that are not in the OECD. By inspecting the remaining subplots one can also say that, in general, countries in the OECD have higher life expectancies, lower birth rates and spend more money on health care. The supervised machine learning task is to use these input fields (and the others listed above, but not visualised here) to predict whether or not a particular country in a particular year is in the OECD.

There are various classification algorithms that one could use, including logistic regression, support vector machines (also known as large margin classifiers), k-nearest neighbours, and neural networks / deep learning (the subject of the following post). In this example I will be adopting logistic regression [1], which essentially aims to fit to the data an "S" shaped curve, called the "sigmoid" (or logistic) function, as opposed to the linear function used in the previous regression example. The inputs to the sigmoid function are again the feature variables and model parameters; however, the output is now bounded by 0 and 1, and can be interpreted as the probability of a particular sample belonging to the class at hand. In the present example the output is the probability of a country being in the OECD. If the probability is greater than 0.5 then the country is estimated to be in the OECD; if the probability is less than 0.5 it is estimated not to be in the OECD. In general, however, this threshold of 0.5 can be modified to control the predictive performance of the model.
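To make the sigmoid and the 0.5 threshold concrete, here is a minimal sketch in plain Python. The coefficient and intercept values are hypothetical, chosen only for illustration; they are not the fitted values from this study. The post does not say how a probability of exactly 0.5 is handled, so treating it as positive is also an assumption.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval between 0 and 1."""
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, coefficients, intercept):
    """Probability of the positive class (here: being in the OECD)."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

def predict_class(features, coefficients, intercept, threshold=0.5):
    """Return 1 (in the OECD) if the probability reaches the threshold, else 0.
    Assumption: a probability exactly equal to the threshold counts as positive."""
    p = predict_proba(features, coefficients, intercept)
    return 1 if p >= threshold else 0

# Hypothetical parameters for two standardised features (illustration only).
coefficients = [1.5, -2.0]
intercept = -0.5
sample = [1.0, -0.5]          # z = -0.5 + 1.5*1.0 + (-2.0)*(-0.5) = 2.0

p = predict_proba(sample, coefficients, intercept)
label_default = predict_class(sample, coefficients, intercept)
label_strict = predict_class(sample, coefficients, intercept, threshold=0.9)
print(round(p, 3), label_default, label_strict)
```

With these made-up numbers the sample is classed as positive at the default threshold but negative at a stricter threshold of 0.9, which is the sense in which the threshold controls the predictions.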

Before building the model, each feature is first standardised by subtracting the mean and then dividing by the standard deviation. To address underfitting (high bias) and overfitting (high variance) we break the available data into a training data set (60% of the samples) to build the model, a cross validation data set (20% of the samples) to select the optimal regularisation level/hyper-parameter, and a test data set (the remaining 20% of the samples) to determine the performance of the optimal model. I reduce the complexity of the logistic regression model by applying an L1 regularisation on the model parameters, which penalises candidate models for having large magnitude coefficients, with the penalty proportional to the regularisation hyper-parameter. This concept is discussed further in the previous post with respect to regression. The stronger the regularisation, the simpler the final model; the weaker the regularisation, the more complex the model becomes.
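The standardisation and the 60/20/20 split can be sketched with the Python standard library as follows. The helper names (standardise, split_indices), the random shuffle and the use of the population standard deviation are assumptions made for this sketch; the post does not state how the samples were assigned to each set, nor which form of the standard deviation was used.

```python
import random
import statistics

def standardise(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

def split_indices(n_samples, seed=0):
    """Shuffle the sample indices and split them 60% / 20% / 20%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = (6 * n_samples) // 10
    n_cv = (2 * n_samples) // 10
    train = indices[:n_train]
    cv = indices[n_train:n_train + n_cv]
    test = indices[n_train + n_cv:]   # whatever remains, roughly 20%
    return train, cv, test

# Made-up feature values, for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardise(values)

train, cv, test = split_indices(10)
print(scaled)
print(len(train), len(cv), len(test))
```

A common refinement, not mentioned in the post, is to compute the mean and standard deviation on the training set only and reuse them for the other two sets.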

Quantifying the error in classification is not as straightforward as in regression studies, particularly for skewed data in which there may be many more negative samples (not in the OECD) than positive samples (in the OECD), or vice versa. For such data, a model that always predicts the majority class would be correct most of the time while telling us nothing. In regression studies the cost function is simply the summed squared error between the model prediction and the true value. In classification studies there are four types of prediction outcomes:

true positive - a positive result (in OECD) is predicted for a positive event (in OECD)
true negative - a negative result (not in OECD) is predicted for a negative event (not in OECD)
false positive (or a Type I error) - a positive result (in OECD) is predicted for a negative event (not in OECD)
false negative (or a Type II error) - a negative result (not in OECD) is predicted for a positive event (in OECD)
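The four outcomes above can be counted directly from a list of true labels and a list of predicted labels. The sketch below uses made-up labels (1 = in the OECD, 0 = not in the OECD), and the function name confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count the four prediction outcomes for binary labels (1 positive, 0 negative)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["tp"] += 1      # true positive
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1      # true negative
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1      # false positive (Type I error)
        else:
            counts["fn"] += 1      # false negative (Type II error)
    return counts

# Made-up labels, skewed towards the negative class as in the text.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```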

Associated with these outcomes are two measures: precision and recall.

Firstly, precision (P) is the proportion of correctly predicted positive events to the total number of events predicted as positive (True positives / [True positives + False positives]). The precision versus regularisation level is illustrated in the figure below. For all plots in this post the blue dots represent the training data set, and the red dots the cross validation data set. As is typical of machine learning studies, the precision (which rises as the generalisation error falls) of the training data set is greater than the precision of the cross validation data set for all regularisation levels. As the regularisation level decreases the precision of the training data set tends upwards, particularly for very small regularisation levels, as the model becomes increasingly complex. This means that the countries that are predicted as being in the OECD are in the OECD the vast majority of the time.

Recall (R) is the proportion of correctly predicted positive events to the total number of actual positive events (True positives / [True positives + False negatives]). Typically the greater the precision the lower the recall. We can see from the figure below that, for both the training and cross validation data sets, as the regularisation level decreases and the model becomes more complex, the recall of the model reduces (as the precision increases). This means that while the countries that are predicted as being in the OECD are in the OECD the vast majority of the time (high precision), the model is also classifying many countries as not being in the OECD when in fact they are (low recall). For logistic regression one can also trade off the precision and recall measures against each other by modifying the threshold probability (set to 0.5 here) between positive and negative events. This effect is discussed in more detail in [2].
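The trade-off between precision and recall as the threshold moves can be seen in a small sketch. The probabilities below are invented for illustration and do not come from the fitted model; the function name precision_recall is likewise hypothetical. Returning 0.0 when a denominator is zero is a convention adopted here, not something the post specifies.

```python
def precision_recall(y_true, probabilities, threshold=0.5):
    """Precision and recall of the positive class at a given probability threshold."""
    tp = fp = fn = 0
    for actual, p in zip(y_true, probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented model outputs: four positive samples followed by four negative ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

results = {t: precision_recall(y_true, probabilities, t) for t in (0.25, 0.5, 0.75)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

For these invented numbers, raising the threshold from 0.25 to 0.75 lifts the precision from about 0.67 to 1.0 while the recall falls from 1.0 to 0.5.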

One way of combining the precision (P) and recall (R) measures of model performance is the F1 score, given by 2*P*R/(P+R). This measure is illustrated below over the same range of regularisation parameters. The optimal model is defined as the one with the highest F1 score on the cross validation data set. This occurs for a regularisation parameter of 1.3.
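The selection procedure described above might look roughly like the following in scikit-learn. This is a sketch under several assumptions rather than the code behind the figures: the data is synthetic (generated with make_classification, not the Gapminder/OECD data), the candidate values are arbitrary, and the liblinear solver is simply one that supports an L1 penalty. Note that scikit-learn's C is the inverse of the regularisation strength; the post does not say whether its regularisation parameter corresponds to C or to its inverse.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed stand-in for the country data (NOT the real data set).
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# 60% training, 20% cross validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Standardise using statistics from the training set only (an assumption).
scaler = StandardScaler().fit(X_train)
X_train, X_cv, X_test = (scaler.transform(A) for A in (X_train, X_cv, X_test))

# Small C = strong regularisation (simpler model); large C = weak regularisation.
candidate_C = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]
cv_scores = {}
for C in candidate_C:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_train, y_train)
    cv_scores[C] = f1_score(y_cv, model.predict(X_cv), zero_division=0)

# The "optimal" model is the one with the highest F1 on the cross validation set.
best_C = max(cv_scores, key=cv_scores.get)
best_model = LogisticRegression(penalty="l1", solver="liblinear", C=best_C)
best_model.fit(X_train, y_train)
test_f1 = f1_score(y_test, best_model.predict(X_test), zero_division=0)
print(best_C, round(cv_scores[best_C], 3), round(test_f1, 3))
```

The test set is touched only once, after the hyper-parameter has been chosen, so that its F1 score remains an honest estimate of performance on unseen data.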

The next stage is to generate learning curves to determine the sensitivity of the error of the optimal model to the number of samples used to build the model. If sufficient data has been used to build the model, then the performance measures of the model over the training and cross validation sets should converge. As the number of samples used to build the model increases, the performance measures of the training data set should decrease (training error increases), whilst the performance measures of the cross validation data set should increase (generalisation error decreases). This appears to be the case from the precision learning curve illustrated below. The convergence is less clear, however, when inspecting the associated plots for the recall and F1 score measures.
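A learning curve of this kind can be sketched by refitting the model on progressively larger subsets of the training data and scoring each fit on both the subset and the cross validation set. The example below is illustrative only: the data is synthetic, the subset sizes and C=1.0 are arbitrary choices, and standardisation is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Gapminder/OECD data set).
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           random_state=1)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

sizes = [50, 100, 200, 300, 450]   # number of training samples used per fit
curve = []
for n in sizes:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    model.fit(X_train[:n], y_train[:n])
    # Precision on the samples the model was built from...
    train_p = precision_score(y_train[:n], model.predict(X_train[:n]),
                              zero_division=0)
    # ...and on the held-out cross validation samples.
    cv_p = precision_score(y_cv, model.predict(X_cv), zero_division=0)
    curve.append((n, train_p, cv_p))

for n, train_p, cv_p in curve:
    print(f"n={n}: train precision={train_p:.3f}, cv precision={cv_p:.3f}")
```

Plotting the two precision columns against n gives the learning curve; a narrowing gap between them suggests that adding further samples would bring little improvement.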

The dominant model coefficients in the optimal model are associated with the average life expectancy (lifeExp), government spend on health care (healthPCGov) and the under-5 death rate (deathU5). The optimal model is found to have an F1 score of approximately 0.9 for the training, cross validation and test data sets. To improve the predictive performance of the classification model (and also the regression model in the previous post) one could adopt a more complex supervised machine learning method, such as neural networks. This will be the subject of the following post.

References:

[1] Cox, D.R., 1958, "The regression analysis of binary sequences", J Roy Stat Soc B, Vol. 20, 215–242.

[2] Ng, A., 2015, Course in Machine Learning, Stanford University, https://class.coursera.org/ml-005/lecture

Sunday, 14 June 2015

Supervised machine learning – classification of countries in the OECD