I’ve been re-evaluating the time I spend on the developer hangouts, and despite the interest I get from the invites, the turnouts have not been high enough to sustain these meetings. As a result, I will not be hosting any more.
If there is interest, I may consider a podcast-style Hangout-On-Air, but I would need speakers for that, and I think that outlet would be better served by the plans +TechUp Inverness has for streaming their talks. I might also see if I can persuade other Scottish developer groups to make more of their talks available online for streaming and review (although I know from experience at work that setting such a thing up involves significant extra effort).
It’s been an interesting experiment, but it’s time to try something else. In the short term, I will be using the time I spent on this to update my blog as I’ve got a backlog of posts about node.js that I want to complete.
Many thanks to everyone who attended any of the hangouts this year. I hope to catch up with you all again in person at other events.
In order to frame my thoughts for the May 30th Developers’ Hangout on Big Data, I wanted to put together the main things I’d like to discuss, with some references to show where I’m coming from. I realise some of these examples aren’t specific to Big Data; they’re simply extrapolations of existing problems. My main concern is that the technological problems are easy to overcome, but it’s going to take some seriously smart engineering to overcome the social and human problems, and they might not be solvable.
In order to focus on laying out my position, I’m going to save the references until the end.
The models are too small
I started thinking about this problem whilst reading The Black Swan by Nassim Nicholas Taleb, which, if you’re a computer scientist, might not tell you anything you don’t already know, but is a great reminder of how analysis is done badly in the big bad world, particularly in finance. Stock market data is highly time-sensitive, is high volume, and requires fast, accurate analysis to be useful, so is a great example of the type of problem that often comes up when talking about Big Data.
What Taleb argues is that the analysis used, despite winning various economists Nobel prizes, is fundamentally flawed, because it ultimately attempts to over-fit a simple, Gaussian-based model onto complex, fractal data. What this means is that the models produce nice, simple numbers like mean and standard deviation, which are then used to calculate risk. Unfortunately, those numbers are completely meaningless for the type of data they’re used on. If you were selling goods on the internet to Europe and the US, you might decide to build your warehouse at the mean location of your customers to reduce shipping costs, and use the standard deviation to calculate how much you should charge. Your mean location will put you in the middle of the Atlantic, and the 2 standard deviations that cover 95% of your customers will also include shipping locations in South America and Africa that you’re not interested in. Time for a better shipping model.
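To make the warehouse example concrete, here’s a toy sketch (the longitudes are invented for illustration, not real customer data) showing how the mean of bimodal data lands somewhere no customer actually is:

```python
import statistics

# Hypothetical customer longitudes: a US cluster around -95 degrees
# and a European cluster around +10 degrees (invented values).
us_customers = [-95.0, -97.0, -93.0, -100.0, -90.0]
eu_customers = [8.0, 10.0, 12.0, 9.0, 11.0]
longitudes = us_customers + eu_customers

mean_lon = statistics.mean(longitudes)
stdev_lon = statistics.stdev(longitudes)

# The "average customer" sits at roughly -42.5 degrees: the middle of
# the Atlantic, nowhere near any actual customer.
print(f"mean longitude: {mean_lon:.1f}, stdev: {stdev_lon:.1f}")
print("nearest customer:", min(longitudes, key=lambda x: abs(x - mean_lon)))
```

The mean and standard deviation are calculated correctly; they’re simply the wrong summary for data with two separate clusters.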
Whilst I like the beauty of Peter Norvig’s assertion that the only way to scale big data is to keep the model simple and the data big , we also need to bear in mind Einstein’s maxim that a solution should be as simple as possible, but no simpler.
The models are wrong
Even if you’re using the right model, we’re bringing big data into a world where 88% of Excel spreadsheets are wrong, and bad calculations cause banking crashes [7,8]. The thinking seems to be that gathering more data from more sources provides more confidence, so they can ask, “How can we possibly have missed anything?”
And even if the models are calculating the right results based on the inputs, if those inputs are wrong or incomplete, the model is just as wrong as one that multiplies when it should divide. Create a very smart terrorism screening program, and then match a name to the wrong person, and it produces the wrong result. Or analyse data for several countries to measure the impact of the recession, and forget to include the top 5 countries. That will skew your results.
We’re developers, we look for solutions, so we test the models on a variety of data and check the results. And we never notice we’re modelling the wrong thing because our tests are doomed to succeed, as my robotics lecturer liked to say.
We don’t understand the results
So, we overcome the problems with the models being too small or being wrong, and we get some sensible results. We calculate some p-values on some of the data to make sure we’re seeing something different from the norm, but no-one understands p-values [5,6].
Or we map variables and we see patterns in data, because we’re human and we like patterns. We like seeing faces on Mars, or a single narrative that links JFK, UFOs and Nixon. And we like confirmation that Internet Explorer is evil.
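The p-value misreading above is easy to demonstrate: a p-value below 0.05 doesn’t mean an effect is real, and if you run enough tests you’ll “find” effects in pure noise. A minimal sketch, using simulated data and a standard two-sample z-test:

```python
import math
import random

random.seed(1)

def looks_significant(a, b):
    """Two-sample z-test, assuming both samples are drawn from N(0, 1).
    Returns True if the difference in means gives p < 0.05."""
    n = len(a)
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)
    return abs(z) > 1.96

# 1000 A/B comparisons where there is genuinely no difference at all.
false_positives = sum(
    looks_significant([random.gauss(0, 1) for _ in range(100)],
                      [random.gauss(0, 1) for _ in range(100)])
    for _ in range(1000)
)

# Roughly 5% of these no-effect experiments come out "significant" anyway.
print(f"{false_positives} of 1000 null experiments looked significant")
```

Run enough analyses over a big enough dataset and some of them will clear the 0.05 bar by chance alone, which is exactly the pattern-seeking trap.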
We make the wrong decisions based on the results
The biggest problem though is what we do with the analysis. Whether the model, the data, or the analysis is wrong, or even if it’s right, people draw conclusions that lead to poor decisions for themselves or for society.
I believe that there’s enough paranoia about Data Protection in EU that it’s unlikely to be a privacy issue (at least in the public sector, especially following the backlash against ID cards in the UK), but it can lead companies and governments to make bad, and possibly dangerous decisions. They see house prices going up and decide to offer 125% loans, they see that Just-In-Time delivery, driven by big data and cheap loans, saves money right until banks collapse and stop giving loans, and suddenly there’s no more Woolworths.
Developers, and shop-floor workers, may also be aware that managers sometimes like to measure “productivity” by counting how many lines of code, or units of work, someone can produce per hour, and offering incentives to increase that productivity. So developers optimise for the measure rather than the work.
Measuring code quality as bugs fixed may show improvement across releases, but if people are incentivised by bug count, the software will have more bugs and is ultimately lower quality and more costly to maintain. At A&E, patients are left sitting in the ambulance, which cannot go to another call, because the A&E department wants to meet a 4-hour target for the length of a patient visit. Productivity, as measured, increases, but the act of measuring has made the situation worse. These perverse incentives only exist because the data is there. The data is accurate and the analysis is accurate, but the result is chaos.
What can we do about it?
First of all, we need to understand there are a few problems here, and each one has a different solution. Education is obviously important, but is there a way we can use the analysis itself to reduce errors by watching users and learning from the mistakes that are made? Or can we use the UX to highlight and guide the user away from the mistakes?
Do we plot the data in the background and complain if users calculate a mean and standard deviation the data doesn’t support? Do we check for power laws and refuse to calculate p-values the data doesn’t justify? In short, can the computer be smarter than the user in interpreting the data, as well as just crunching the numbers, to make sure the model is big enough and correct?
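One sketch of what that gate might look like (the `summarise` helper is hypothetical, and the kurtosis threshold is a crude stand-in for real shape tests, but it shows the idea of the tool refusing a summary the data can’t support):

```python
import random
import statistics

def excess_kurtosis(data):
    """Sample excess kurtosis: roughly 0 for Gaussian data, strongly
    negative for bimodal data, strongly positive for heavy tails."""
    n = len(data)
    mu = statistics.fmean(data)
    m2 = sum((x - mu) ** 2 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m4 / (m2 ** 2) - 3.0

def summarise(data, threshold=1.0):
    """Report mean/stdev only if the shape of the data plausibly
    supports them; otherwise send the user back to the plot."""
    if abs(excess_kurtosis(data)) > threshold:
        return "warning: data doesn't look Gaussian - plot it before summarising"
    return f"mean={statistics.fmean(data):.2f}, stdev={statistics.stdev(data):.2f}"

random.seed(42)
gaussian = [random.gauss(50, 5) for _ in range(1000)]
bimodal = [0.0] * 500 + [100.0] * 500   # two tight clusters

print(summarise(gaussian))   # happily reports mean and stdev
print(summarise(bimodal))    # refuses: a mean of 50 describes no point at all
```

A real implementation would want proper normality tests and a way for an expert to override the warning, but even this crude check catches the mid-Atlantic-warehouse class of mistake.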
Once we have the analysis, is there any way to educate users or encourage them to use it the right way? The best example of this I’ve seen is the TheyWorkForYou stats on 3 word usage: showing silly examples to help people think about the numbers they are seeing. Do we need to add disclaimers to all analysis? Is there any automated approach to find ways to encourage continual improvement, so that users can understand the data and the analysis better?
Different levels of professionalism depending on the experience of the team: enforce standards when teams are cowboys, and relax them as people gain experience.
How do you bring professionalism to the team or to new members? Provide information and guidance. Define new practices. Tools such as JIRA, use of source control, tie commits into tasks and issues.
Phases: in a small team, pick the right people, and they set the culture. For existing teams, you need stronger sticks, technical or otherwise, to enforce behaviour. Role models, training and mentoring.
Public good: What is a universal public good? Depends on the country. Kill-bots are good for saving lives of US soldiers, but not so good for Afghan and Iraqi citizens. Public interest disclosure – whistle-blowers. Responsible disclosure of exploits, data protection breaches, or other breaches: do you close the hole first or notify first? Also think about data protection: keep data in the right jurisdiction to avoid DPA or Patriot Act breaches.
Don’t be evil. Don’t do anything illegal.
US has a very different concept of “professional” from state to state. More like the Electronic Engineer chartership – CPD, maintaining quality, providing guarantees.
Health and engineering professions have been around for centuries: being responsible if things go wrong. “I’m certified in a few things, and I’ve decided I don’t like it very much”. Defines those “in” and “out”. See “scrum masters” – disbarring people who don’t do things the way things should be done. See lawyers in the US getting disbarred for opposing the Vietnam War.
Software Craftsmanship – principles and practices, a set of standards we agree to adhere to.
Saying no – if we can’t meet a deadline, say so. If something takes too long, make impact on deadlines clear. If you’re dependent on other systems, you’re very sensitive to them, and you have to adapt to them saying no.
Chartership – does that give your argument more weight? Does it help to build trust with the client?
Not enough software developers! And a lot of them are contractors. Scouring the world for the best talent.
Finance doesn’t trust certain developers, particularly in other countries (outside EEA especially), but skills are improving everywhere.
No ego – being gentle when criticising, removing ownership from code – owned by the team. Shared code ownership.
If you can’t keep up with the latest technology, or other CPD requirements, then you’re in the wrong profession.
Is programming too hard or too boring? Are programmers socially awkward anti-establishment hackers? Or are we all boring middle-aged men in suits?
That’s the perception outside the field, whereas we all know programmers are cool, interesting, social people. The question is, how do we make the next generation see it so they can join us, and help build whatever will follow the Internet?
The best answers we came up with all involved reaching out, whether that’s visiting schools or providing child-friendly events, and finding ways to get kids interested without bumping up against the stereotypes. Should we have hackday events? Do we need a Young Programmer of the Year competition where we can find and inspire teams to develop software and win a prize, take some regional heats and a national final? Or do we do things on a smaller scale?
So, do we tell people it’s problem solving? That’s ultimately the job, after all. Talk to people, understand their problems, and then build a solution. We use computers to do it, but the interest and complexity of the job is as much about the people and the processes we’re solving for as the tools we are using to do it. And if it’s problem solving, do we just need to find interesting problems to solve, whether it’s maze finding, or controlling robots, because robots are cool.
How much duty do we, as a community, and as a profession, have to find and nurture the next generation of developers and computer engineers?
In this month’s hangout, following the last fascinating discussion about recruitment, I want to look further back, into schools and universities and see what people think about the way programming is taught, and how to get the next generation as excited about software development as we are.
Let’s step further back. We discussed recruitment last month, so this month, let’s talk about education. What should we, as people who live and breathe software development, do to encourage the next generation?
Is it up to schools, or is there something else we should be doing? Volunteering at the Maker’s Fayre? Or is everything just peachy the way things are, with programming an ever more expensive hobby, ever more divorced from using a real machine?
Is the Raspberry Pi a good move for teaching, bringing back the BBC Micro and the 8-bit days, or is it irrelevant in the modern world of touchscreens and consumer technology?
Yep, it’s not Scottish, and it’s the second one – I 1-indexed that one, boosh.
A few technical problems – turns out microphones and headphones don’t work unless they’re turned on and enabled, and webcams don’t work if you don’t have one.
So, the chat? Can you work on an island whilst the rest of your team are on another? Webcams will see to most of that, but even on a beautiful island like Mull, don’t expect people to flock from all over the world to your office just to hack some PHP. Make use of the tools at your disposal.
As some of you may know, I started a monthly technical forum at work, broadcast out from head office to 3 remote sites and whatever client sites people want to dial in from. We make heavy use of slides and low-motion visuals to save bandwidth, and try and keep things accessible to people only on the phone. It’s a big team, and a big effort, but after a year of meetings, it’s really paying off, management are backing it, and it’s a good place for staff to talk and learn. Getting the images and sounds to the remote office isn’t the tricky part (although it’s harder than it should be), the trick is to keep the remote participants engaged. Give them feedback channels in advance of the event so they can get their questions answered. Provide videos afterwards for people to watch at their leisure. And at every opportunity, remind them that their voice should be heard. If an online conversation is swamped by one site, what’s the point of having it online?
If you’re coming along next week, noon at My Google+ Profile (invites: http://goo.gl/P0p2b), keep an eye on the news for something you want to discuss, or throw a link to a book, a gadget or a toolkit you want to discuss below.