May 2013 – Craig Nicol

I think every pragmatic programmer or aspiring code guru needs a core programming challenge that they return to whenever they want to try something new, like the signature tune a guitarist plays on every new guitar to see how it fits their style.

My favourite is the Mandelbrot Set, because it’s a nice way to check the main features of any language: looping, branching, and creating complex structures. Adding a graphical layer is also a good excuse to start exploring the surrounding libraries. It’s a neat optimisation problem too, and each language I’ve used lends itself to slightly different optimisations.

I’ve gone through a few versions, from Basic to a couple of versions of C++, Python and JavaScript. Along the way I discovered double-buffering when I got my hands on the SDL for gcc, and list comprehensions for full-screen iterations in Python. There’s always a new way to calculate and generate the output.
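As a rough illustration of the Python flavour, here is a minimal sketch that uses a loop with a branch for the escape test and a nested list comprehension for the full-screen pass. The function names, the viewing region and the ASCII output are assumptions made for this example, not a record of any of the versions described above.

```python
def escape_time(c, max_iter=50):
    """Return the iteration at which |z| first exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:  # branching: the point has escaped
            return i
        z = z * z + c     # looping: iterate z -> z^2 + c
    return max_iter


def render(width=60, height=24, max_iter=50):
    """Render the region [-2, 1] x [-1.2, 1.2] as a list of ASCII rows."""
    return [
        "".join(
            "#"
            if escape_time(
                complex(-2.0 + 3.0 * x / (width - 1), -1.2 + 2.4 * y / (height - 1)),
                max_iter,
            ) == max_iter
            else " "
            for x in range(width)
        )
        for y in range(height)
    ]


if __name__ == "__main__":
    print("\n".join(render()))
```

Swapping the ASCII rows for a pixel buffer is where the graphics libraries, and the optimisation questions, come in.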

So what’s your workbench? Do you build a unit-testing framework? Or a shopping cart app? Or do you turn every language into a LOLCode parser?

Introduction

To frame my thoughts for the May 30th Developer’s Hangout on Big Data, I wanted to set out the main things I’d like to discuss and include some references to show where I’m coming from. I realise some of these examples aren’t specific to Big Data; they’re simply extrapolations of existing problems. My main concern is that the technological problems are easy to overcome, but it’s going to take some seriously smart engineering to overcome the social and human problems, and they might not be solvable.

In order to focus on laying out my position, I’m going to save the references until the end.

The Models are too small

I started thinking about this problem whilst reading The Black Swan by Nassim Nicholas Taleb [3], which, if you’re a computer scientist, might not tell you anything you don’t already know, but is a great reminder of how analysis is done badly in the big bad world, particularly in finance. Stock market data is highly time-sensitive and high volume, and it requires fast, accurate analysis to be useful, so it is a great example of the type of problem that often comes up when talking about Big Data.

What Taleb argues is that the analysis used, despite winning various economists Nobel prizes, is fundamentally flawed, because it ultimately attempts to fit a simple, Gaussian-based model onto complex, fractal data. What this means is that the models produce nice, simple numbers like mean and standard deviation, which are then used to calculate risk. Unfortunately, those numbers are completely meaningless for the type of data they’re used on. If you were selling goods on the internet to Europe and the US, you might decide to build your warehouse at the mean location of your customers to reduce shipping costs, and use standard deviation to calculate how much you should charge. Your mean location will put you in the middle of the Atlantic, and the 2 standard deviations to cover 95% of your customers will also include shipping locations in South America and Africa that you’re not interested in. Time for a better shipping model.
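The warehouse example can be checked with a few lines of Python. This is only a sketch: the longitudes below are made-up values chosen to stand in for two customer clusters, and it looks at longitude alone rather than real shipping distances.

```python
import statistics

# Hypothetical customer longitudes in degrees east (invented for illustration).
us_customers = [-122.0, -118.0, -96.0, -87.0, -74.0]
eu_customers = [-3.0, 2.0, 9.0, 13.0, 21.0]
customers = us_customers + eu_customers

mean_lon = statistics.mean(customers)
sd_lon = statistics.stdev(customers)

# The two-standard-deviation band that a Gaussian model would treat as "most customers".
lower, upper = mean_lon - 2 * sd_lon, mean_lon + 2 * sd_lon

# How far the "average customer" is from the nearest real one.
nearest_gap = min(abs(lon - mean_lon) for lon in customers)

print(f"mean longitude: {mean_lon:.1f}")
print(f"2-sigma band: {lower:.1f} to {upper:.1f}")
print(f"nearest real customer: {nearest_gap:.1f} degrees away")
```

With these numbers the mean sits at 45.5 degrees west, more than 20 degrees of longitude from every customer, while the two-sigma band stretches well beyond both clusters. The summary statistics are computed correctly; they just describe a customer who doesn't exist.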

Whilst I like the beauty of Peter Norvig’s assertion that the only way to scale big data is to keep the model simple and the data big [4], we also need to bear in mind Einstein’s maxim that a solution should be as simple as possible, but no simpler.

The models are wrong

Even if you’re using the right model, we’re introducing big data into a world where 88% of Excel spreadsheets are wrong [9], and bad calculations cause banking crashes [7,8]. The thinking seems to be to gather more data from more sources to provide more confidence, so that people can ask “How can we possibly have missed anything?”

And even if the models are calculating the right results based on the inputs, if those inputs are wrong or incomplete, the model is just as wrong as one that multiplies when it should divide. Create a very smart terrorism screening program, and then match a name to the wrong person, and it produces the wrong result. Or analyse data for several countries to measure the impact of the recession, and forget to include the top 5 countries. That will skew your results.

We’re developers, we look for solutions, so we test the models on a variety of data and check the results. And we never notice we’re modelling the wrong thing because our tests are doomed to succeed, as my robotics lecturer liked to say.

We don’t understand the results

So, we overcome the problems with the models being too small or being wrong, and we get some sensible results. We calculate some p-values on some of the data to make sure we’re seeing something different from the norm, but no-one understands p-values [5,6].

Or we map variables and we see patterns in data, because we’re human and we like patterns. We like seeing faces on Mars, or a single narrative that links JFK, UFOs and Nixon. And we like confirmation that Internet Explorer is evil.

We make the wrong decisions based on the results

The biggest problem though is what we do with the analysis. Whether the model, the data, or the analysis is wrong, or even if it’s right, people draw conclusions that lead to poor decisions for themselves or for society.

I believe that there’s enough paranoia about Data Protection in the EU that it’s unlikely to be a privacy issue (at least in the public sector, especially following the backlash against ID cards in the UK), but it can lead companies and governments to make bad, and possibly dangerous, decisions. They see house prices going up and decide to offer 125% loans. They see that Just-In-Time delivery, driven by big data and cheap loans, saves money, right until banks collapse and stop giving loans, and suddenly there’s no more Woolworths.

Developers, and shop floor workers, may also be aware that sometimes managers like to measure “productivity” by seeing how many lines of code, or units of work, someone can do per hour, and offering incentives to increase productivity [10]. So developers

write

code

that

spans

several

lines.

Measuring code quality as bugs fixed may show improvement across releases, but if people are incentivised by bug count, the software will have more bugs and will ultimately be lower quality and more costly to maintain. At A&E, patients are left sitting in the ambulance, which cannot go to another call, because the A&E department wants to meet a 4-hour target for length of patient visit. Productivity, as measured, increases, but the act of measuring has made the situation worse. These perverse incentives only exist because the data is there. The data is accurate and the analysis is accurate, but the result is chaos.

What can we do about it?

First of all, we need to understand there are a few problems here, and each one has a different solution. Education is obviously important, but is there a way we can use the analysis itself to reduce errors by watching users and learning from the mistakes that are made? Or can we use the UX to highlight and guide the user away from the mistakes?

Do we plot data in the background and complain when users calculate the mean and standard deviation of data that doesn’t support it? Do we calculate power laws and refuse to calculate p-values if the data doesn’t justify it? In short, can the computer be smarter than the user in interpreting the data as well as just crunching the numbers, to make sure the model is big enough and correct?
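One way such a complaint might look is sketched below: a summary function that refuses to report mean and standard deviation when the tails look too heavy. Everything here is an assumption for illustration, including the names `guarded_summary` and `UnsuitableSummary`, the use of excess kurtosis as the warning sign, and the threshold of 3.0, which is an arbitrary choice rather than any statistical standard.

```python
import statistics


class UnsuitableSummary(ValueError):
    """Raised when mean and standard deviation would misdescribe the data."""


def excess_kurtosis(data):
    """Population excess kurtosis: about 0 for Gaussian data, large for heavy tails."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    if var == 0:
        return 0.0
    fourth = sum((x - mu) ** 4 for x in data) / n
    return fourth / var ** 2 - 3.0


def guarded_summary(data, max_kurtosis=3.0):
    """Return (mean, stdev), or refuse if the data looks heavy-tailed.

    max_kurtosis is an illustrative cut-off, not a recognised standard.
    """
    k = excess_kurtosis(data)
    if k > max_kurtosis:
        raise UnsuitableSummary(
            f"excess kurtosis {k:.1f} suggests heavy tails; "
            "mean and standard deviation may mislead"
        )
    return statistics.mean(data), statistics.stdev(data)
```

Evenly spread values such as `list(range(1, 101))` pass the check, while ninety-nine 1s and a single 1000 are refused. A real tool would need a better test than one number, but the shape is the point: the check runs before the user ever sees a mean.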

Once we have the analysis, is there any way to educate users or encourage them to use it the right way? The best example of this I’ve seen is the TheyWorkForYou stats on 3 word usage [11]: show silly examples to help people think about the numbers they are seeing. Do we need to add disclaimers to all analysis? Is there any automated approach to find ways to encourage continual improvement so that users can understand the data and the analysis better?

What can we do?