
Tuesday, 6 October 2020

How not to Excel


The UK authorities continue to find imaginative ways to screw things up which would be laughable were they not so serious. The recent news that the number of reported Covid-19 cases jumped by 85% on Saturday and by another 78% on Sunday, leaving them more than three times the figure reported on Friday, has been blamed on a computer error. But it was not a major system failure arising from the complexity of the infrastructure. It was one of those dumb things that happen from time to time, like the loss of the Mars Climate Orbiter in 1999 after one piece of software provided output in imperial units to a routine that was expecting metric units.

In this instance, the agencies responsible for collecting the swabs for the Covid track and trace system delivered data in a CSV file, whose length is theoretically unlimited, to Public Health England (PHE), which imported it into an Excel spreadsheet. Unfortunately PHE failed to realise that an Excel worksheet is limited to 1,048,576 rows and 16,384 columns. Data which exceed these limits are simply dropped on import, hence 15,841 positive test results were overlooked – as were the details of those with whom they had been in contact. As someone with a lot of hands-on experience handling datasets which regularly exceed Excel's limits, I was very surprised that an organisation handling such volumes of data made such a basic error (for a small consultancy fee I will happily teach the health authorities to handle such datasets).

Better tools for the job 

The issue resonated with me because over the course of recent months I have become very interested in the appliance of data science techniques to the collection and analysis of large datasets, particularly real time data. Whilst I am no expert, I know enough to recognise that Excel is not the appropriate tool. There are much better resources to handle data. If it is storage that the authorities are concerned with, a low cost solution would be to use a dedicated database such as Microsoft Access. Excel is great for dealing with relatively small datasets but it is completely the wrong tool for dealing with big data. Jon Crowcroft, Professor of Communications Systems at the University of Cambridge, was quoted as telling the BBC that “Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like … And then when you need to do something more serious, you build something bespoke that works - there's dozens of other things you could do. But you wouldn't use XLS. Nobody would start with that.”

Nor is it necessarily the right tool for many economic applications. Back in 2013, it was revealed that a paper by Carmen Reinhart and Ken Rogoff contained a spreadsheet coding error which undermined their headline result that GDP growth declines once public debt exceeds 90% of GDP. For years academic economists have used systems such as Matlab and Gauss for much of their quantitative work. Whilst these are excellent for handling the data matrices underpinning most econometric analysis, they come with a high price tag, which limits their use to those who have stumped up the licence fee and discourages those who merely wish to engage in low cost experimentation.

Increasingly, however, the economics profession is moving towards systems which can both store data and conduct advanced analytics. Two of the most popular are the R software environment and the Python programming language. Both are free to download and each has a huge ecosystem of libraries which users can integrate into their own workflow. So far as most economic applications are concerned, the likelihood is that someone has already written a library to do the analysis you are interested in, or there is something sufficiently close that minimal code changes are required. Since both can do what Matlab and Gauss can do, and they can be downloaded for free, what’s not to like?

The cost of change 

Unfortunately, financial costs are not the only issue: a major investment of time is required to become proficient in any system. Since neither R nor Python is particularly user friendly at first glance, it is easy to understand why people are daunted by the prospect of getting stuck into what looks like some heavy duty coding. Moreover, those who many years ago invested time and effort in learning other systems need to be persuaded that the benefits of switching are worthwhile. In my case, I have yet to come across a system that handles structural macroeconomic models better than the Aremos system, whose roots extend back almost 50 years (a view that may not be shared by everyone, but it is a system which has worked well for me for many years). However, R and Python do a lot of other things far better, so I have been experimenting with both.

Examples: (i) Big data sets

At the outset I should declare my preference for R, primarily for system-related reasons, although Python can do all the things I am about to describe. A good place to start is the analysis of big datasets, and anyone who has looked at the Google mobility data will run into the same problem as PHE did with the Covid data. Whilst it is possible to download the CSV file containing 2,621,847 records (as of today) from Google’s website, it is not possible to load it into Excel. But R can handle vectors of more than 2.1 billion elements, so it is straightforward to read the data and do any required manipulation before exporting it in the format of your choice.
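By way of illustration, the short R sketch below reads the full mobility file and exports a UK-only extract. It assumes the data.table package is installed and that Google still publishes the Global Mobility Report at the URL shown; both the URL and the column names are assumptions on my part and may have changed.

# A minimal sketch: read a CSV far larger than Excel's 1,048,576-row limit,
# filter it, and export a manageable extract.
# Assumes the data.table package is installed; the URL and column names are
# those Google used for the Global Mobility Report and may have changed.
library(data.table)

url <- "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv"
mobility <- fread(url)                        # several million rows, no complaints

nrow(mobility)                                # far more rows than Excel will show
uk <- mobility[country_region_code == "GB"]   # keep only the UK records

fwrite(uk, "uk_mobility.csv")                 # export in the format of your choice

fread() typically reads a file of this size in seconds, and the filtered extract is small enough to open in Excel afterwards should you wish to.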

(ii) Natural language processing 

Another thing that R does well – although probably not as well as Python – is natural language processing. I may look at this topic in more detail another time, but suffice to say that last year I did some work in R to analyse the communication content of the Bank of England MPC minutes. Amongst other things, the analysis looks at the readability of the minutes by calculating the Flesch reading ease index. We can also examine particular keywords in context by identifying those words which are most closely associated with a specific term. Thus, for example, we can identify how often the word “inflation” appears alongside words representing concern (“worries”, “problems” etc.), allowing us to quantify the extent to which the BoE is currently worried about inflation (we can add further filters to determine whether the concerns are about overly-high or overly-low inflation).
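To give a flavour of how little code this requires, the sketch below uses the quanteda family of packages – my choice for illustration here, not necessarily what was used in the original exercise – to compute the Flesch score and pull out keywords in context. The two snippets of text are obviously stand-ins for the actual minutes.

# A minimal sketch: readability and keywords-in-context in R.
# Assumes the quanteda and quanteda.textstats packages are installed; the
# texts below are stand-ins for the actual MPC minutes.
library(quanteda)
library(quanteda.textstats)

minutes <- c(jan = "The Committee worries that inflation may rise sharply next year.",
             feb = "Problems in the labour market could instead push inflation lower.")

# Flesch reading ease: higher scores indicate easier text
textstat_readability(corpus(minutes), measure = "Flesch")

# Keywords in context: every occurrence of "inflation" with five words either side
toks <- tokens(corpus(minutes), remove_punct = TRUE)
kwic(toks, pattern = "inflation", window = 5)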

(iii) Scraping the web 

A lot of data sit on websites which in the past might have had to be typed in manually. Those days are long behind us. Numerous libraries exist in both R and Python which allow users to grab data from online sources. We can, for example, import data from Twitter, which opens up numerous possibilities for analysing tweet patterns. One of the routines I regularly undertake is to scrape four-hourly data on UK electricity generation directly from Twitter as an input into my real-time economic analysis.
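As a rough sketch of what this involves, the code below shows the two most common approaches in R: rvest for scraping an HTML page and rtweet for pulling tweets from an account. The URL and Twitter handle are placeholders of my own, and rtweet will not return anything without valid Twitter API credentials.

# A minimal sketch: two common ways to grab online data in R.
# Assumes the rvest and rtweet packages are installed; the URL and the
# Twitter handle are placeholders, and rtweet needs API credentials.
library(rvest)
library(rtweet)

# 1. Scrape any HTML tables from a web page (placeholder URL)
page   <- read_html("https://example.com/some-data-page")
tables <- html_table(page)            # a list of data frames, one per table

# 2. Pull recent tweets from an account that publishes data (placeholder handle)
tweets <- get_timeline("some_data_account", n = 200)
head(tweets$text)                     # raw tweet text, ready for parsing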

(iv) A bit of statistical fun 

For anyone who may be daunted by the thought of using systems such as R, the best way to get acquainted is to run some existing code and experiment with it yourself – something made more palatable if it happens to coincide with a subject that interests you. I will thus leave you with an example in the form of code (below) designed to extract data from the Fantasy Premier League database to predict my points score for last weekend’s fixtures. The top panel shows the code and the bottom panel displays the output. For anyone with a team entered in the Fantasy Premier League (and there are more than 6 million players around the world), all you have to do to customise the code is substitute your own team number into line 8 in the top panel (“entryid=…”). For the record I was predicted to score 54 points but in the end I scored a miserable 36. The code worked fine – the problem was that the expected points score is generated by an exogenous algorithm over which I have no control, thus highlighting the old computing adage of “garbage in, garbage out.”
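For readers without the screenshot to hand, the sketch below shows one way the same idea can be put together in R. It is not the code from the panels, and the public FPL endpoints and field names it relies on (bootstrap-static, the picks endpoint and the ep_next field) are assumptions that may change without notice.

# A minimal sketch (not the code in the screenshot): sum the FPL site's own
# expected points ("ep_next") across a team's picks for a given gameweek.
# Assumes the httr and jsonlite packages are installed and that the public
# FPL API endpoints below are still live and unchanged.
library(httr)
library(jsonlite)

entryid  <- 123456   # substitute your own team number here
gameweek <- 4        # the gameweek you want to look at

# Player-level data, including the site's expected points for the next round
static  <- fromJSON(content(GET("https://fantasy.premierleague.com/api/bootstrap-static/"),
                            as = "text", encoding = "UTF-8"))
players <- static$elements[, c("id", "web_name", "ep_next")]

# The 15 players picked by this entry for the chosen gameweek
picks_url <- sprintf("https://fantasy.premierleague.com/api/entry/%d/event/%d/picks/",
                     entryid, gameweek)
picks <- fromJSON(content(GET(picks_url), as = "text", encoding = "UTF-8"))$picks

team <- merge(picks, players, by.x = "element", by.y = "id")
sum(as.numeric(team$ep_next))    # a crude predicted points total

A proper version would need to account for the captain's multiplier and the bench, but even this crude total is enough to get a feel for how the API fits together.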

Last word 

Whilst Excel is a fantastic tool for many of the day-to-day tasks we undertake, it is limited in what it can do. You can be sure that PHE will not make the same data mistake again. But the point of this post is to demonstrate that there are more appropriate tools for the job they are trying to undertake. You don’t have to be a rocket scientist to figure that out. The appliance of data science will suffice.