
Thursday 10 December 2020

Machine learning: A primer

Last month I had the pleasure of taking part in a virtual conference organised by the Bank of England on big data and machine learning (ML). One of the things that struck me most was the relative youth of the presenters, many of whom are still writing their PhDs. This is a clear illustration of the fact that this is a brand new field whose limits are being extended every month and which is increasingly being applied to economics and finance. If you ever want to get in on the ground floor of what promises to be one of the new fields of economic analysis, now is a good time to get started.

Some basics 

Big data and ML go hand-in-hand. The development of web and cloud based systems allows the generation and capture of data in quantities which were unimaginable just a few years ago. It is estimated that 59% of the global population are active internet users – around 4.66 billion people. Every second they send 3.4 million emails, 5,700 tweets and perform almost 69,000 Google searches. PwC reckoned that in 2019 there were 4.4 Zettabytes (ZB) of data stored online – a figure that could hit 44ZB in 2020. A decent laptop these days will have a hard disk with one Terabyte of storage capacity, but you would need more than 47 billion of them to store the current volume of our data universe (44 x 1024³). If these were stacked one on top of another, the column would be over 1 million kilometres high – three times the distance to the moon. Clearly, a lot of the data stored online does not yield any valuable insight, but given the vast amount of available information even a small fraction of it is still too much for humans to reasonably digest.
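For anyone who wants to check the arithmetic, a quick back-of-the-envelope calculation in R (the language used for the worked example later in this post) reproduces these orders of magnitude; the 2.5cm laptop thickness is my own assumption:

zettabytes <- 44                          # estimated size of the data universe in ZB
laptops    <- zettabytes * 1024^3         # number of 1TB drives needed: roughly 47 billion
stack_km   <- laptops * 0.025 / 1000      # height of the stack in km, assuming 2.5cm per laptop
moon_km    <- 384400                      # average distance to the moon in km
c(laptops = laptops, stack_km = round(stack_km), times_to_moon = round(stack_km / moon_km, 1))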

This is where the machines come in. Traditional computer programs represent a series of instructions designed to perform a specific task in a predictable manner. But they run into difficulties in the case of big data applications because the decision trees built into the program (the “if-then” loops) can simply become too big. Moreover, a traditional program represents a fixed structure which goes on doing the same thing ad infinitum, which may not be ideal in a situation where we gather more data and begin to understand it better. A machine learning algorithm (MLA) is designed to be much more flexible. Rather than being based on a series of hard-coded decision rules, an MLA incorporates a very large number of basic rules which can be turned off and on via a series of weights derived from mathematical optimisation routines. This makes MLAs more successful than traditional computer programs in areas such as handwriting and speech recognition, and better able to deal with tasks such as driving where rapid adjustment to changed conditions is required.
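To make the distinction concrete, here is a small illustrative sketch in R using the iris data that feature in the worked example below: a hand-written if-then rule with an arbitrary threshold versus a logistic regression whose weights are chosen by a numerical optimisation routine. The choice of species, variables and threshold is purely illustrative.

# A hard-coded "if-then" rule versus a rule whose weights are learned from data.
# Illustrative only: classify irises as virginica or not on the basis of petal size.

# Hand-written rule with an assumed threshold: long petals suggest virginica
rule_based <- ifelse(iris$Petal.Length > 5, "virginica", "other")

# Learned rule: a logistic regression whose weights come from numerical optimisation
dat <- iris
dat$is_virginica <- as.integer(dat$Species == "virginica")
fit <- glm(is_virginica ~ Petal.Length + Petal.Width, data = dat, family = binomial)
coef(fit)                                  # the estimated weights
learned <- ifelse(predict(fit, type = "response") > 0.5, "virginica", "other")

# Compare how often each rule identifies the species correctly
table(rule_based, iris$Species)
table(learned, iris$Species)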

But a machine needs to be trained in order to progressively improve its performance in a specific task, in much the same way that humans learn by repetition. In the AI community there are five broad categories of training techniques, the most common of which is supervised learning, in which input and output data are labelled (i.e. tagged with informative labels that aid identification) and the MLA is manually corrected by the human supervisor in order to improve its accuracy[1]. One common problem is that the model might fit the training data very well but be completely flummoxed when faced with out-of-sample data (overfitting). By contrast, an underfitting model can replicate neither the training data nor the out-of-sample data, which makes it useless for decision making.
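The problem is easy to demonstrate with a stylised piece of R code (a made-up example rather than anything presented at the conference): polynomials of increasing degree are fitted to a noisy sine curve and their in-sample errors are compared with their errors on held-out data.

# Toy illustration of under- and over-fitting: fit polynomials of different
# degrees to noisy data and compare in-sample and out-of-sample errors.
set.seed(1)
dat <- data.frame(x = runif(100, 0, 2 * pi))
dat$y <- sin(dat$x) + rnorm(100, sd = 0.3)
train <- dat[1:70, ]                      # training sample
test  <- dat[71:100, ]                    # held-out (out-of-sample) data

for (deg in c(1, 5, 20)) {                # a very simple, a moderate and a very flexible model
  fit <- lm(y ~ poly(x, deg), data = train)
  mse_train <- mean((train$y - fitted(fit))^2)
  mse_test  <- mean((test$y  - predict(fit, newdata = test))^2)
  cat("degree", deg, ": train MSE", round(mse_train, 3),
      "| test MSE", round(mse_test, 3), "\n")
}

The most flexible model typically produces the smallest training error but not the smallest out-of-sample error – the signature of overfitting – while the straight line does poorly on both counts.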

Our final task is to ensure that the MLA has learned what we want it to. In one early experiment, data scientists tried to teach a system to differentiate between battle tanks and civilian vehicles. It turned out that it learned only to differentiate between sunny and cloudy days, and it proved useless in real-world situations. This demonstrates the old adage that if you ask a stupid question, you get a stupid answer, and highlights the importance of setting up the MLA so that it focuses on the question of interest.

Applying ML to economics 

How is any of this relevant to economics? First of all, ML has the potential to revolutionise our statistical analysis of big datasets. In particular, certain applications should make it easier to reduce the dimensionality of big datasets, making them more manageable whilst still retaining the meaningful properties of the original data (see below). This is important because large datasets are often “sparse”, i.e. a relatively small number of non-zero observations surrounded by large numbers of zeros, which tends to be a hindrance to many traditional statistical estimation methods.
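As a simple illustration of what dimensionality reduction means in practice (using principal components analysis from base R rather than any technique presented at the conference), the four measurements in the iris dataset used later in this post can be compressed into two components that retain most of the variation:

# Dimensionality reduction sketch: principal components analysis compresses
# the four iris measurements into a smaller number of components.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)                              # share of variance explained by each component
head(pca$x[, 1:2])                        # the first two components for each observation
plot(pca$x[, 1:2], col = iris$Species,    # most of the structure survives in two dimensions
     xlab = "First component", ylab = "Second component")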

ML also theoretically allows us to estimate and compare a range of models more easily. In applied economics, researchers normally start by choosing a single functional form and putting their efforts into a statistical assessment of whether the data agree with their preconceptions. Given the labour-intensive nature of the exercise, researchers usually operate with only one model, and comparing a range of models using traditional analytical techniques quickly becomes a laborious task. However, ML should make it easier to compare a range of different models. In a very interesting paper on the application of ML techniques to economics, Susan Athey argues that a more systematic approach to model selection facilitated by ML, which removes the arbitrary approach to specification searches, "will become a standard part of empirical practice in economics" in future.
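The flavour of such a systematic comparison can be conveyed with a stylised R sketch – a made-up horse race on the built-in mtcars data rather than anything from Athey's paper – in which several candidate specifications are scored on held-out data instead of a single preferred model being chosen up front:

# Stylised model comparison: score several candidate specifications on
# held-out data rather than committing to a single functional form.
set.seed(7)
idx   <- sample(seq_len(nrow(mtcars)), size = 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

candidates <- list(mpg ~ wt,
                   mpg ~ wt + hp,
                   mpg ~ wt + hp + disp + qsec)

for (f in candidates) {
  fit  <- lm(f, data = train)
  rmse <- sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  cat(deparse(f), "-> out-of-sample RMSE:", round(rmse, 2), "\n")
}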

An example 

In order to give a flavour of the application of ML techniques, I present here an example of a supervised learning technique known as a random forest model. It is a classification technique in which a large number of data observations are assigned to a small number of groups – an example of the kind of data reduction discussed above.

To conceptualise a random forest model, think of a decision tree formed by splitting our dataset into two (see chart above). Both halves can be further divided into sub-categories until at some point we run out of ways to split them further (in other words there is no additional information content). If we simply grow many such trees from the same underlying data, they will not be independent of each other. If, however, each tree is built from a random sample of the data, it can be shown that averaging across the resulting “forest” is less prone to error than relying on a single decision tree (for those interested in a more detailed discussion, this paper from the BoE is very accessible). We “train” our model by allowing it to operate on a sub-sample of our dataset and apply the “knowledge” gained during the training period to the rest of the sample to see whether it makes accurate predictions.
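For readers who want to see a single tree before moving on to the full forest, a minimal sketch using the rpart package that ships with R (an illustrative aside rather than the BoE code) grows one on the iris data used below:

# A single decision tree on the iris data: each split divides the sample in
# two until further splits add no useful information.
library(rpart)
tree <- rpart(Species ~ ., data = iris)
print(tree)                # the sequence of splits and the resulting groups
plot(tree); text(tree)     # a rough plot of the tree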

The Bank of England applied ML techniques in a paper using random forest models to predict banking distress. Based on this blog post by Saulo Pires de Oliveira, we can demonstrate the same techniques used in the BoE paper to show the random forest model in action. The example is written in the R software system and, rather than use financial data, it uses the famous Anderson iris data set, which records the characteristics of three species of iris (the data come as standard in the R system). Our objective is to determine, on the basis of these characteristics (the length and width of the sepals and petals), which category of plant each observation belongs to. The code is available below and, since R is free to download, it is a simple matter of copying it into R and running it to reproduce the results.
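A minimal sketch of the exercise, assuming the randomForest and pROC add-on packages are installed (the 70/30 split and the random seed are illustrative choices rather than necessarily those of the original post), looks something like this:

# Random forest on the iris data with a one-vs-rest ROC/AUC check.
library(randomForest)    # random forest implementation (install.packages("randomForest"))
library(pROC)            # ROC curves and AUC (install.packages("pROC"))

set.seed(42)
train_idx <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[train_idx, ]          # training sample
test  <- iris[-train_idx, ]         # held-out sample

rf <- randomForest(Species ~ ., data = train, ntree = 500)

# predicted class probabilities for the held-out observations
probs <- predict(rf, newdata = test, type = "prob")

# one-vs-rest area under the ROC curve for each species
for (sp in levels(iris$Species)) {
  r <- roc(response = as.integer(test$Species == sp), predictor = probs[, sp])
  cat(sp, "AUC:", round(auc(r), 3), "\n")
}

# plot the ROC curve for one of the classes as an illustration
plot(roc(as.integer(test$Species == "versicolor"), probs[, "versicolor"]))

The ROC curve and the area under it are calculated class-by-class on a one-versus-rest basis here; other implementations report a single multi-class summary instead.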

 

The model uses part of the dataset as input to a training algorithm and applies the results to the rest of the sample. How do we know whether our results are any good? Following the BoE example, we calculate the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate (see chart below). The former is high and the latter low, suggesting that the model performs well in determining which class of iris the data correspond to. This is confirmed by a cross-check against the area under the curve (AUC) measure, which shows a value of 98% (the higher the value, the better the fit).

Whilst this is a very simple example, it does show the power of ML techniques. From an economist's point of view, their use as a statistical technique for classifying patterns makes them an extremely powerful new tool. But we should beware of overdoing the hype when it comes to their use in some other areas. Many of the more general problems in cognition are not the kind of classification problems that MLAs are good at solving. Moreover, MLAs tend to be data hungry, whereas a human can learn abstract relationships with much less data input. One thing that humans still do better than machines is adaptation, and the machines are not going to replace us any time soon. But for statisticians they are likely to be a boon.


[1] The others are semi-supervised learning, in which only part of the data is labelled but the MLA is still subject to manual correction. An active learning approach allows the MLA to query an external information source for labels to aid in identification. Unsupervised learning forces the MLA to find structure in the input data without recourse to labels and without any correction from a supervisor. Finally, reinforcement learning is an autonomous, self-teaching approach in which the MLA learns by trial and error with the aim of achieving the best outcome. It has been likened to learning to ride a bicycle, in which early efforts often involve falling off, but fine-tuning actions to gradually eliminate mistakes eventually leads to success.