
Thursday 10 December 2020

Machine learning: A primer

Last month I had the pleasure of taking part in a virtual conference organised by the Bank of England on big data and machine learning (ML). One of the things that struck me most was the relative youth of the presenters, many of whom are still writing their PhDs. It is a clear illustration that this is a brand new field whose limits are being extended every month and which is increasingly being applied to economics and finance. If you want to get in on the ground floor of what promises to be one of the new fields of economic analysis, now is a good time to get started. 

Some basics 

Big data and ML go hand-in-hand. The development of web- and cloud-based systems allows the generation and capture of data in quantities which were unimaginable just a few years ago. It is estimated that 59% of the global population are active internet users – around 4.66 billion people. Every second they send 3.4 million emails, 5,700 tweets and perform almost 69,000 Google searches. PwC reckoned that in 2019 there were 4.4 Zettabytes (ZB) of data stored online – a figure that could hit 44ZB in 2020. A decent laptop these days will have a hard disk with one Terabyte of storage capacity but you would need more than 47 billion of them to store the current volume of our data universe (44 x 1024³). If these were stacked one on top of another, it would generate a column over 1 million kilometres high – three times the distance to the moon. Clearly, a lot of the data stored online does not yield any valuable insight but given the vast amount of available information even a small fraction of it is still too much for humans to reasonably digest.
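
For anyone who wants to check the arithmetic, it takes only a few lines of R (the laptop thickness of 2.5cm is my own assumption):

# Back-of-the-envelope arithmetic on the size of the data universe
zb_stock  <- 44                     # assumed stock of data, in zettabytes
laptops   <- zb_stock * 1024^3      # number of 1TB laptops needed to store it
laptop_cm <- 2.5                    # assumed thickness of one laptop, in cm
column_km <- laptops * laptop_cm / 100 / 1000   # height of the stack in km
moon_km   <- 384400                 # average distance to the moon in km

laptops               # ~47.2 billion
column_km             # ~1.18 million km
column_km / moon_km   # ~3 times the distance to the moon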

This is where the machines come in. Traditional computer programs represent a series of instructions designed to perform a specific task in a predictable manner. But they run into difficulties in the case of big data applications because the decision trees built into the program (the “if-then” rules) can simply become too big. Moreover, a traditional program represents a fixed structure which goes on doing the same thing ad infinitum, which may not be ideal in a situation where we gather more data and begin to understand it better. A machine learning algorithm (MLA) is designed to be much more flexible. Rather than being based on a series of hard-coded decision rules, an MLA incorporates a very large number of basic rules which can be turned on and off via a series of weights derived from mathematical optimisation routines. This makes MLAs more successful than traditional computer programs in areas such as handwriting and speech recognition, and better able to deal with tasks such as driving where rapid adjustment to changing conditions is required.
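
The distinction can be seen in miniature in the toy example below, which is purely illustrative and uses R's built-in mtcars data: one rule is fixed by the programmer, the other has weights estimated from the data and can be re-estimated as more data arrive.

# A hard-coded rule: a fixed "if-then" threshold chosen by the programmer
hard_rule <- function(wt) ifelse(wt < 3, "manual", "automatic")

# A learned rule: a logistic regression whose weights are chosen by an
# optimisation routine so as to fit the observed data as well as possible
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
learned_rule <- function(newdata) {
  ifelse(predict(fit, newdata, type = "response") > 0.5, "manual", "automatic")
}

# Compare the two rules against the actual transmission type (am: 1 = manual)
table(hard = hard_rule(mtcars$wt), actual = mtcars$am)
table(learned = learned_rule(mtcars), actual = mtcars$am)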

But a machine needs to be trained in order to progressively improve its performance in a specific task, in much the same way that humans learn by repetition. In the AI community there are five broad categories of training techniques, the most common being supervised learning, in which input and output data are labelled (i.e. tagged with informative labels that aid identification) and the MLA is manually corrected by the human supervisor in order to improve its accuracy[1]. One of the common problems is that the model might fit the training data very well but be completely flummoxed when faced with out-of-sample data (overfitting). By contrast, an underfitting model cannot replicate either the training data or the out-of-sample data, which makes it useless for decision making.
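
The problem is easy to demonstrate with simulated data. In the sketch below (my own illustration, not drawn from the conference), three regressions of increasing complexity are fitted to a training sample and then judged on a test sample they have never seen:

set.seed(42)
# Simulate a noisy quadratic relationship and split it into a
# training sample and an out-of-sample test set
x <- runif(200, -3, 3)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(200)
train <- data.frame(x = x[1:100],   y = y[1:100])
test  <- data.frame(x = x[101:200], y = y[101:200])

rmse <- function(model, data) sqrt(mean((data$y - predict(model, data))^2))

underfit <- lm(y ~ 1, data = train)            # ignores x altogether
goodfit  <- lm(y ~ poly(x, 2), data = train)   # matches the true process
overfit  <- lm(y ~ poly(x, 15), data = train)  # chases the noise

# The overfitted model looks best in the training sample but
# deteriorates when confronted with the out-of-sample data
sapply(list(under = underfit, good = goodfit, over = overfit),
       function(m) c(train = rmse(m, train), test = rmse(m, test)))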

Our final task is to ensure that the MLA has learned what we want it to. In one early experiment, data scientists tried to teach a system to differentiate between battle tanks and civilian vehicles. It turned out that the system learned only to differentiate between sunny and cloudy days, which made it useless in real world situations. This demonstrates the old adage that if you ask a stupid question, you get a stupid answer, and highlights the importance of setting up the MLA so that it focuses on the question of interest. 

Applying ML to economics 

How is any of this relevant to economics? First of all, ML has the potential to revolutionise our statistical analysis of big datasets. In particular, certain applications should make it easier to reduce the dimensionality of big datasets, making them more manageable whilst still retaining the meaningful properties of the original data (see below). This is important because large datasets are often “sparse”, i.e. a relatively small number of non-zero observations scattered among large numbers of zeros, which tends to be a hindrance to many traditional statistical estimation methods.
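
As a flavour of what dimensionality reduction means in practice, principal components analysis (a long-established statistical technique rather than ML proper, but the same idea) compresses a set of correlated variables into a handful of components that retain most of the information:

# Principal components on the four iris measurements: four correlated
# variables are compressed into two components which between them
# retain the bulk of the variation in the original data
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)         # the first two components explain roughly 96% of the variance
head(pca$x[, 1:2])   # the reduced, two-dimensional representation of the data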

ML also, in theory, allows us to estimate and compare a range of models more easily. In applied economics, researchers normally start by choosing a single functional form and put their efforts into a statistical assessment of whether the data agree with their preconceptions. Given the labour-intensive nature of the exercise, it is usual for researchers to operate with only one model, and comparing a range of alternatives using traditional analytical techniques quickly becomes impractical. However, ML should make it easier to compare a range of different models. In a very interesting paper on the application of ML techniques to economics, Susan Athey argues that a more systematic approach to model selection facilitated by ML, which removes the arbitrary approach to specification searches, “will become a standard part of empirical practice in economics” in future. 
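
A small-scale caricature of this more systematic approach is to let the data adjudicate between several candidate specifications via cross-validation, rather than committing to one of them at the outset (the specifications below are purely illustrative, again using the built-in mtcars data):

# Compare several candidate specifications by 5-fold cross-validation
set.seed(1)
specs <- list(mpg ~ wt,
              mpg ~ wt + hp,
              mpg ~ wt + hp + I(wt^2),
              mpg ~ wt * hp)

folds <- sample(rep(1:5, length.out = nrow(mtcars)))
cv_rmse <- function(formula) {
  errs <- sapply(1:5, function(k) {
    fit <- lm(formula, data = mtcars[folds != k, ])
    mean((mtcars$mpg[folds == k] - predict(fit, mtcars[folds == k, ]))^2)
  })
  sqrt(mean(errs))
}
sapply(specs, cv_rmse)   # out-of-sample RMSE for each specification; lower is better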

An example 

In order to give a flavour of the application of ML techniques, I present here an example of a supervised learning technique known as a random forest model. This is an ensemble classification technique in which the verdicts of a large number of decision trees, each grown on a random sample of the data, are combined in order to assign a large number of observations to a small number of groups.

To conceptualise a random forest model, think of a decision tree formed by splitting our dataset into two (see chart above). Both halves can be further divided into sub-categories until at some point we run out of ways to split them further (in other words there is no additional information content). Because the trees in such a model are derived from the same underlying data, they are not independent of each other. If, however, each tree is grown on a random sample of the data and the results are averaged across trees, it can be shown that the predictions are more reliable than those of a single decision tree (for those interested in a more detailed discussion, this paper from the BoE is very accessible). We “train” our model by allowing it to operate on a sub-sample of our dataset and apply the “knowledge” gained during the training period to the rest of the sample to see whether it makes accurate predictions. 

The Bank of England applied ML techniques in a paper using random forest models to predict banking distress. Based on this blog post by Saulo Pires de Oliveira, we can demonstrate exactly the same techniques used in the BoE paper to show the random forest model in action. It is written in R and, rather than use financial data, we use the famous Anderson iris data set, which records the characteristics of three species of iris (the data come as standard in the R system). Our objective is to determine, on the basis of those characteristics (the length and the width of the sepals and petals), which species each observation belongs to. The code is available below and, since R is free to download, it is a simple matter of copying it into R and running it to reproduce the results.
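
A minimal sketch of the exercise, along the lines of the Oliveira post rather than a verbatim copy of it, is set out below; it assumes the randomForest and pROC packages, which can be installed with install.packages() if they are not already present.

library(randomForest)
library(pROC)

set.seed(123)
# Split the 150 iris observations into a training sample and a test sample
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Train the random forest on the training sample...
rf <- randomForest(Species ~ ., data = train, ntree = 500)

# ...and apply it to the unseen test sample
pred_class <- predict(rf, test)
pred_prob  <- predict(rf, test, type = "prob")

# Confusion matrix: how often does the model pick the right species?
table(predicted = pred_class, actual = test$Species)

# ROC curve and area under the curve for one of the species
# (virginica versus the rest) as a check on classification accuracy
roc_virginica <- roc(response  = as.numeric(test$Species == "virginica"),
                     predictor = pred_prob[, "virginica"])
plot(roc_virginica)
auc(roc_virginica)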

 

The model uses part of the dataset as input to a training algorithm and applies the results to the rest of the sample. How do we know whether our results are any good? Following the BoE example, we calculate the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate (see chart below). The former is high and the latter low, suggesting that the model performs well in determining which class of iris the data correspond to. This is confirmed by a cross-check of the area under the curve (AUC), which comes in at 98% (the higher the value, the better the fit).

Whilst this is a very simple example it does show the power of ML techniques. From an economist's point of view, their use as a statistical technique for classifying patterns makes them an extremely powerful new tool. But we should beware of overdoing the hype when it comes to their use in some other areas. Many of the more general problems in cognition are not the classification problems that MLAs are good at solving. Moreover, MLAs tend to be data hungry, whereas a human can learn abstract relationships with much less data input. One thing that humans still do better than machines is adaptation, and they are not going to replace us any time soon. But for statisticians they are likely to be a boon.


[1] The others are semi-supervised learning, in which the data are unlabelled but the MLA is still subject to manual correction. An active learning approach allows the MLA to query an external information source for labels to aid in identification. Unsupervised learning forces the MLA to find structure in the input data without recourse to labels and without any correction from a supervisor. Finally, reinforcement learning is an autonomous, self-teaching MLA that learns by trial and error with the aim of achieving the best outcome. It has been likened to learning to ride a bicycle, in which early efforts often involve falling off, but fine-tuning one's actions gradually eliminates mistakes and eventually leads to success.

Tuesday 6 October 2020

How not to Excel


The UK authorities continue to find imaginative ways to screw things up which would be laughable were they not so serious. The recent news that the number of reported Covid-19 cases jumped by 85% on Saturday and by another 78% on Sunday, to leave them more than three times the figure reported on Friday, has been blamed on a computer error. But it was not a major system failure arising from the complexity of the infrastructure. It was one of those dumb things that happen from time to time, like when the Mars Climate Orbiter was lost in 1999 after one piece of software provided output in imperial units to a routine that was expecting metric units.

In this instance, the agencies responsible for collecting the swabs for the Covid track and trace system delivered data in a CSV file, whose length is theoretically unlimited, to Public Health England (PHE), which imported it into an Excel spreadsheet. Unfortunately, PHE failed to realise that an Excel worksheet is limited to 1,048,576 rows and 16,384 columns. Data which exceed these limits are simply ignored, hence 15,841 positive test results were overlooked – as were the details of those with whom they had been in contact. As someone with a lot of hands-on experience of handling datasets which regularly exceed Excel limits, I was very surprised that an organisation handling such volumes of data made such a basic error (for a small consultancy fee I will happily teach the health authorities to handle such datasets). 

Better tools for the job 

The issue resonated with me because over the course of recent months I have become very interested in the appliance of data science techniques to the collection and analysis of large datasets, particularly real time data. Whilst I am no expert, I know enough to recognise that Excel is not the appropriate tool. There are much better resources to handle data. If it is storage that the authorities are concerned with, a low cost solution would be to use a dedicated database such as Microsoft Access. Excel is great for dealing with relatively small datasets but it is completely the wrong tool for dealing with big data. Jon Crowcroft, Professor of Communications Systems at the University of Cambridge, was quoted as telling the BBC that “Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like … And then when you need to do something more serious, you build something bespoke that works - there's dozens of other things you could do. But you wouldn't use XLS. Nobody would start with that."

Nor is it necessarily the right tool for many economic applications. Back in 2013, it was revealed that a paper by Carmen Reinhart and Ken Rogoff contained a spreadsheet coding error which undermined their headline result that GDP growth declines once public debt exceeds 90% of GDP. For years academic economists have been using systems such as Matlab and Gauss for much of their quantitative work. Whilst these are excellent for handling the data matrices that underpin most econometric analysis, they come with a high price tag. This limits their use to those who have stumped up the licence fee and discourages those who merely wish to engage in low-cost experimentation. 

Increasingly, however, the economics profession is moving towards the use of systems which can store data and conduct advanced analytics. Two of the most popular are the R software environment and the Python programming language. Both are free to download and each has a huge volume of online libraries which users can integrate into their own system. So far as most economic applications are concerned, the likelihood is that someone has already written a library to do the analysis you are interested in or there is something sufficiently close that minimal code changes are required. Since both can do what Matlab and Gauss can do, and they can be downloaded for free, what’s not to like? 

The cost of change 

Unfortunately, financial costs are not the only issue: a major investment in time is required in order to become proficient in any system. Since neither R nor Python is particularly user-friendly at first glance, it is easy to understand why people are daunted by the prospect of getting stuck into what looks like some heavy duty coding. Moreover, those who many years ago invested time and effort in learning other systems need to be persuaded that the benefits of switching are worthwhile. In my case, I have yet to come across a system that handles structural macroeconomic models better than the Aremos system, whose roots extend back almost 50 years (a view that may not be shared by everyone, but it is a system which has worked well for me for many years). However, R and Python do a lot of other things far better, so I have been experimenting with both. 

Examples: (i) Big data sets

At the outset I should declare my preference for R, primarily for system-related reasons, although Python can do all the things I am about to describe. A good place to start is the analysis of big data sets, and anyone who has looked at the Google mobility data will have run into the same problem as PHE did when looking at Covid data. Whilst it is possible to download the CSV file from Google's website containing 2,621,847 records (as of today), it is not possible to load it into Excel. But R can handle vectors of more than 2.1 billion records, so it is straightforward to download the data and do any required data manipulation before exporting it in the format of your choice. 
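
A sketch of the sort of thing involved is shown below, using the data.table package; the URL and column names are those Google used at the time of writing and may since have changed.

library(data.table)

# Download the full mobility report: several million rows, too big for Excel
url <- "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv"
mobility <- fread(url)
nrow(mobility)

# Filter down to the UK series and export something Excel-sized
uk <- mobility[country_region_code == "GB"]
fwrite(uk, "uk_mobility.csv")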

(ii) Natural language processing 

Another thing that R does well – although probably not as well as Python – is natural language processing. I may look at this topic in more detail another time, but suffice to say that last year I did some work in R to analyse the communication content of the Bank of England MPC minutes. Amongst other things, the analysis looks at the readability of the minutes by calculating the Flesch reading-ease index. We can also examine particular keywords in context by identifying those words which are most closely associated with a specific term. Thus, for example, we can identify how often the word “inflation” is associated with words representing concern (“worries”, “problems” etc.), allowing us to quantify the extent to which the BoE is currently worried about inflation (we can add further filters to determine whether the concerns are about overly-high or overly-low inflation). 
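
A stripped-down sketch of the idea, using the quanteda family of packages and assuming the text of the minutes has already been read into a character vector (here called minutes_text), would look something like this:

library(quanteda)
library(quanteda.textstats)

corp <- corpus(minutes_text)             # one document per set of minutes
toks <- tokens(corp, remove_punct = TRUE)

# Flesch reading-ease score for each document
textstat_readability(corp, measure = "Flesch")

# Keywords in context: every appearance of "inflation" with five words
# either side, which can then be searched for words expressing concern
kwic(toks, pattern = "inflation", window = 5)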

(iii) Scraping the web 

A lot of data sit on websites which in the past might have had to be typed in manually. Those days are long behind us. Numerous libraries exist in both R and Python which allow users to grab data from online sources. We can, for example, import data from Twitter, which opens up numerous possibilities for analysing tweet patterns. One of the routines I regularly undertake is to scrape four-hourly data on UK electricity generation directly from Twitter as an input into my real-time economic analysis. 
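
The Twitter side of this can be sketched with the rtweet package; the example below assumes you have set up Twitter API credentials, and the account name is purely illustrative rather than the one I actually pull from.

library(rtweet)

# Pull the most recent tweets from a data-publishing account
# (authentication via a Twitter developer token is assumed)
tweets <- get_timeline("NationalGridESO", n = 200)

# Keep the timestamps and text for further parsing
generation <- tweets[, c("created_at", "text")]
head(generation)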

(iv) A bit of statistical fun 

For anyone who may be daunted by the thought of using systems such as R, the best way to get acquainted is to run some existing code and experiment with it yourself – something made more palatable if it happens to coincide with a subject that interests you. I will thus leave you with an example in the form of code (below) designed to extract data from the Fantasy Premier League database to predict my points score for last weekend’s fixtures. The top panel shows the code and the bottom panel displays the output. For anyone with a team entered in the Fantasy Premier League (and there are more than 6 million people around the world), all you have to do is customise the code by substituting your own team number into line 8 in the top panel (“entryid=…”). For the record, I was predicted to score 54 points but in the end I scored a miserable 36. The code worked fine – the problem was that the algorithm which produced my expected points score was an exogenous variable over which I had no control, thus highlighting the old computing adage of “garbage in, garbage out.”
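
For reference, the gist of the exercise boils down to a couple of calls to the FPL's JSON endpoints. The sketch below is along those lines rather than a copy of the code in the panels above; the endpoints are unofficial and may change, and the team number and gameweek shown are illustrative.

library(jsonlite)

entryid  <- 1234567   # substitute your own team number here
gameweek <- 4         # the gameweek you want to look at

# Player-level data, including each player's expected points (ep_next)
players <- fromJSON("https://fantasy.premierleague.com/api/bootstrap-static/")$elements

# The 15 players picked by the team in question for that gameweek
picks <- fromJSON(sprintf(
  "https://fantasy.premierleague.com/api/entry/%d/event/%d/picks/",
  entryid, gameweek))$picks

# Match the picks to the player data and add up the expected points,
# weighting by the multiplier (0 for bench players, 2 for the captain)
squad <- merge(picks, players[, c("id", "web_name", "ep_next")],
               by.x = "element", by.y = "id")
sum(as.numeric(squad$ep_next) * squad$multiplier)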

Last word 

Whilst Excel is a fantastic tool for many of the day-to-day tasks we undertake, it is limited in what it can do. You can be sure that PHE will not make the same data mistake again. But the point of this post is to demonstrate that there are more appropriate tools for the job they are trying to undertake. You don’t have to be a rocket scientist to figure that out. The appliance of data science will suffice.

Monday 30 April 2018

Beware the big data rush

Bank of England chief economist Andy Haldane today gave a speech entitled Will Big Data Keep Its Promise? in which he assessed the contribution that big data can make to improving decision making in finance and macroeconomics. Whilst I agree that this is indeed a subject that offers significant potential, we do have to be mindful of the downsides associated with the data trails we leave as we live our lives in a digital world.

In 2005 there were around 1 billion global internet users; today there are estimated to be almost 3.5 billion. Just as important, there has been a significant switch from the one-way flow of traffic from suppliers to consumers, which characterised the early years of internet use, to a more interactive medium. Today, users send around 6,000 tweets, make 40,000 Google searches and dispatch 2 million emails every second. The volume of text on the internet is estimated at 1.1 zettabytes, which is approximately 305.5 billion pages of A4 paper, and is projected to rise to 2 zettabytes by 2019 (more than 550 billion sheets). And that is without the pictures! To take another example, the Large Hadron Collider generates 15 petabytes of data each year, equivalent to around 15,000 years of digital music.

Where does all this data come from? Some of it is merely the transcription of existing data into an electronic form that makes it more accessible. Wikipedia, for example, has helped to democratise knowledge in a way that was previously impossible. But a lot of it has come into being as a result of technological developments which allow the capture of much greater volumes of information. This has been facilitated by the rise of cloud computing which allows users to store, manage and process vast amounts of information in a network of remote servers (ironically, this is a reversal of the trend of recent decades which saw a shift from centralised towards local data storage). Perhaps even more important, the rise of social media such as Twitter and Facebook has vastly increased the volume of information pumped out (not to mention the rise of microblogging sites in China such as Tencent or Sina Weibo).

Clearly, a lot of this information does not yield any valuable insight, but given the vast amount of available information even a small fraction of it is still too much for humans to reasonably digest. Even if we only require 0.5% of the information stored online, we would still need 1.5 billion sheets of A4. The problem is compounded by the fact that we do not necessarily know what is useful information and what can easily be discarded, so we have to scan far more than we require in order to sift out the good stuff. As a result, much progress has been made in recent years in devising methods of scanning large datasets in order to search for relevant information.

To the extent that knowledge is power, it stands to reason that those with the data in the digital age are those with the power. This raises a big question of how much control we should be prepared to give up, and there are legal issues about who owns the information that most of us have until now simply given away for free – something that the recent Facebook furore brought into the open.

But whilst social media platforms contain huge amounts of data that can be extracted at relatively little cost, and are often a useful barometer of public opinion, they are biased towards younger, urban-dwelling, higher-income users. Relying on tweets, for example, without accounting for this bias risks repeating the classic mistake made when trying to predict the US presidential results in 1936 and 1948, when the polling samples were skewed by the inclusion of those picked at random from the phonebook at a time when telephone penetration was low.

Thus, whilst I agree with Haldane’s sentiment that “economics and finance needs to make an on-going investment in Big Data and data analytics” we need to beware of the headlong rush. As I wrote in a piece last year, “before too long, there will almost certainly be a spectacular miss which will bring out the critics in droves” and it could yet be that the Facebook problems will be a catalyst for a rethink. At the present time, much of society is only operating in the foothills of the big data revolution. The real trick, as former boss of Hewlett-Packard Carly Fiorina once said, will be to turn data into information, and information into insight. We are not quite there yet.