Categories
development

CodeCraftConf 2019 : What is data anyway? (Answers)

Here’s my thoughts on data following my CodeCraftConf guided conversation. Here are the questions I asked during my guided conversation at CodeCraftConf 2019. They are also available on GitHub if you would like to fork and modify them for your own use.

Most developers are data driven, start with the data structure, not the algorithm. Either data driven design, or the Merise Methodology.

Data, whilst often divided by microservice, is often stored on the same server/cluster, creating a monolith behind the microservices.

Not all data access is secured and audited, although there does appear to be a trend to on-behalf-of flows through the microservice, allowing user-centered access control. Strict data access design is prevalent, although the efficacy was less clear, and strict design applies to all data, including publicly available data.

Keeping sight of data in distributed systems is hard. Jepsen was suggested as one resource to help, but I’m happy to hear of others.

As well as data that can be used to discriminate by collecting gender, name, postcode etc., we also discussed how missing data can be used to discriminate, such as when Glasgow accents aren’t included in voice training data, or when women aren’t used in medical trails.

There’s also the big and growing problem of data collected by people who do not consider the discrimination or privacy implications. For a biologist, DNA is a puzzle that helps them decode cancer, and more examples make the puzzle easier to solve. But for others, DNA is a tool to map insurance risk, to find criminals, and to track down family members whether or not they want to be found. How do we train everyone else to understand?

And the takeaway question : what questions aren’t you asking about your data?

Categories
data development

CodeCraftConf 2019 : What is data anyway? (Questions)

Here are the questions I asked during my guided conversation at CodeCraftConf 2019. They are also available on GitHub if you would like to fork and modify them for your own use. Thankyou to everyone who came to the discussion, I will post a follow-up to discuss some of the interesting answers.

What is data anyway?

Navigating SQL, NoSQL, JSON and how to work with data in a post-RDMS, big-data world

Questions

Data modelling

  1. When designing a system, do you start with the data or the code?
  2. Has the rise of cloud based or non relational data stores changed how we model our data?
  3. Do you need to update your data when the models in the code change? How do you do it?
  4. Does all your data have to have the same shape?
  5. Should the data you expose to the outside world broadly match the data at rest?

Data security

  1. How do you secure your data?
  2. In light of GDPR, How do you ensure you aren’t collecting too much data?
  3. Who has access to your data?
  • Do you know if anyone unauthorised has accessed it?
  1. How do you protect yourself against bad data and trojan data?
  • Bad data = data that is fake, or is used for real world attacks
  • Trojan data = data that can compromise your or your customer’s systems

Ethical data

  1. Can your data be used to discriminate?
  • Can you prove it?
  • Is your data biased?
  • Are you recording hidden correlations? (ZIP code suggests race)
  1. Who owns your data?
  2. What questions aren’t you asking?

Unused questions

  1. What makes data big?
  2. Are you collecting the right data?
  3. Is the data you’re collecting right?
  4. Where is your data?

Technology choices

  1. Do you still have a place for traditional RDBMS?
Categories
code development programming security

This is raw

This is raw chicken : 🐤

If you eat it like that, you may get hurt immediately, by its beak, or its claws. It may grab your money and run off with it.

If you want to eat it, better to kill it first. 💀

If you eat it like that, you may get hurt or die, in a few hours, or days. Washing it won’t help.

Cook it. Cook it well. If there’s any sign of pink, cook it some more. 🔥

It might still kill you, but at least you’re a lot safer than when you started.


This is raw data : 🐤

If you display it, users will get hurt immediately, whether by cross-site scripting, cookie sniffing, crypt-currency mining, or something else. If you’re lucky, it will be something your user’s see immediately and leave your site never to return. Otherwise they may get infected.

If you want to use it, better to validate it first. 💀

If you save it like that, your users are still vulnerable. It might appear on the front end in a different form. It might be a string of unicode characters that crashes your phone. It might be a link to somewhere they can’t trust.

Encapsulate it. Sandbox it. Never trust it, in or out . HTML encode, whitelist the output as well as the input.

And if you need to avoid spam, or incitement, or solicitation, maybe you need editors. Computers can’t fix all the social problems. 🔥

Categories
.net code

Self-testing configuration – the example.config story

Some halloween cakes reflected in a mirror
Mirror image

We’re all good developers and never store production secrets in our source code. We have to keep them somewhere though.

If you’re an ASP.Net developer and you store your secrets in a separate file, how do you check them on release?

In a previous job, we were building an application in AWS EC2, which meant we couldn’t sit on top of the lovely Azure Key Vault/secrets.json workflow, and instead used a secure file drop as part of the deployment. We had a couple of releases that failed because secrets that were available locally weren’t included in the release package, causing the website to fail to start. We didn’t have the nice logging provided by the .Net core configuration objects, so what we did was create configuration documentation in the form on an example.config file that contained all the keys we expected, and populated dummy values. On startup, we’d compare the list of keys in the example.config which did not contain any secrets, with the actual config, and fail the startup with a report of which keys were different so we had a very quick way of checking the config was in sync with expectations.

Categories
development programming security

NMandelbrot : running arbitrary code on client

As part of my grand plan for map-reduce in JavaScript and zero-install distributed computing, I had to think about how to gain user trust in a security context where we don’t trust the server. I couldn’t come up with a good answer.

Since then, we’ve seen stories of malicious JavaScript installed to mine cryptocurrencies , we know that JavaScript can be exploited to read kernel memory, including passwords, on the client, and I suspect we’ll see a lot more restrictions on what JavaScript is allowed to do – although as the Spectre exploit is fundamentally an array read, it’s going to be a complex fix at multiple levels.

I had ideas about how to sandbox the client JavaScript (I was looking at Python’s virtualenv and Docker containers to isolate code, as well as locking them into service workers which already have a vastly more limited API), but that relies on the browser and OS maintaining separation, and if VMs can’t maintain separation with their levels of isolation, it’s not an easy job for browser developers, or anyone running on their platform.

I also wondered if the clients should be written in a functional language that transpiled to JavaScript, to have language level enforcement of immutability and safety checks. And of course, because a functional style and API provides a simpler context to reason about map-reduce, by avoiding any implicit shared context.

Do you allow someone else’s JavaScript on your site, whether a library, or a tracking script, or random ads from Russia, Korea, botnets and script kiddies? How do you keep your customers safe? And how do you isolate processes you trust from processes that deal with the outside world and users? JavaScript will be more secure in the future, and the research is fascinating (JavaScript Zero: real JavaScript, and zero side-channel attacks) but can you afford to wait?

Meltdown and Spectre shouldn’t change any of this. But now is a good time to think about it. Make 2018 the year you become paranoid about users, 3rd parties and other threats. The year is still young, but the exploits are piling up.

 

Categories
data development free speech security

Government insecurity agencies

Given the SSL attacks that could be traced back to classing secure encryption as weapons subject to export restrictions, it’s clear that government security agencies have a deep conflict of interest that has led to significantly reduced security protection for their own citizens.

It’s clear that the Ransomware (or Ransomware as diversion) attacks on UK and US hospitals and many other sites are directly due to the NSA backdoor toolkit that was stolen earlier this year. Because if the government has a back door into a system, or an encryption platform, everyone has a backdoor, even if they don’t have access to it yet.

Which is why it’s great to see the EU outlawing backdoors in order to protect us as patients, service users, and data subjects, and I completely expect this will apply, like GDPR, to any system holding EU citizens data. So when the UK puts on its “we need a back door” legislation, companies need to choose to trade with the UK and compromise their security, or trade with the much bigger EU and protect their customers.

Encryption is like a lock, but it isn’t. It’s like a safe door, but it isn’t. Abstractions help to frame the problem, but they can obscure the issues. They make lawmakers think that what applies to banks applies to data.

(note: bank processes are optimised to replace credit cards because security works best when you can throw away a channel and start again if it’s compromised; this includes reversing transactions – which is hard to do when it’s the release of your personal data that needs reverted, rather than a row in a ledger than can be corrected by an additional row).

Encryption isn’t the problem. The San Bernardino iPhone had no useful intel. All the recent attackers in the UK were known, reported, and could have been tracked if they were prioritised. Banning encryption will have about as much impact as banning white vans. Breaking encryption weakens our security, threatens international trade especially with the EU, and when security holes lead to attacks on our hospitals and other infrastructure, bad security threatens our lives.

But so long as we’re afraid of terrorism, it’s OK for the populous to suffer?

Categories
development security

How much data can you lose before you’re in trouble?

Ransomware is a very aggressive attack. Whilst many espionage operations are about sneaking in and copying data without your knowledge, ransomware hits you over the head with a hammer to let you know you’ve lost your data. It’s not theft, it’s extortion.

The big pro is that at least you know you’ve been breached, and the form of attack means that whilst you might not have access to your data, the bad guys might not either.

But you’ve got a good backup strategy, right? You can roll back the data to a known good point in history, and maybe even roll forward your changes from there.

But maybe it doesn’t matter. Maybe you can run your business just as effectively without that data, or those templates. Maybe you shouldn’t be keeping that data at all?

If you have data you need, distribute it. Secure it, but decide if the greater risk is you losing access to the data, or someone else gaining access.

If you have data you don’t need, Don’t store it.

Categories
development

Moving house and denormalised data 

There’s a lot of abstract talk about data normalisation and having a single source of truth within your organisation for each type of data. Sometimes this goes very wrong. 

I moved house this year, and whilst I accept the extra administration that comes from each organisation needing to be updated individually (after all, why should DVLA need to know who I bank with?), I do expect each organisation to make it easy to change my address. 

And it was easy, except for my bank. Current accounts, credit cards and savings are all managed separately, the address I see online is unrelated to any of these, and I have to change my address on joint and single accounts separately. 

I don’t care what your backend system is like. My interface should be one call and you sort it out, not keep redirecting me to queue after queue. And sending me a letter to the new address telling me it’s fixed, followed by another letter dated a week later to the old address with marketing in it. 

And then, I tried to phone them about that letter, and I couldn’t pass their security questions, because my address was updated everywhere except the phone validation system. 

How easy is it for your data subjects to update their data? One call or many? Is their security tied to that data – could they lose access if a bad agent adds a new address or credit card? How confident are you that you know all the places in your user, marketing, communication and other databases that need updating? 

How can you make it easier? 

Categories
development programming security ux

Your API sucks : illegality

No human is illegal, especially your users. I know you were told to make it secure, so you’re filtering input, but some users (so long as you cover Scotland) live in Flat 1/2, so let me put the slash in their address. And let Shaun O’Malley have an apostrophe. Not only does this make for a poor user experience that the developers using your API either have to pass on to their users, or find a way to deal with – if you let them know what encodings you support – but that might be against security policy.

Worse than that however, a policy like that is a red flag to hackers. If you don’t allow /, do you allow \, or do you filter DROP? It’s a sign that there’s a wormhole through your code with a single line of defence against SQL Injection, or script injection. As a developer, it worries me, from a usability and a security perspective.

 

Stop sending the wrong message and making my users illegal.

Categories
code development

Your API sucks : documentation

Your API is perfect. It just needs a little finesse on the documentation. It’s fine, just farm it out to Johnny Clueless. After all, good APIs, like all good code, is self documenting. It’s just those damn users that keep asking what you mean when you said that, or wanting to know what all the options are. They’re very demanding.

Documentation out of sync with the code

getPostcode(double opt)
This method takes returns the current longitude and latitude of the user based on the information provided by their device.

Teaching Granny to suck eggs

Bad API documentation is like bad comments, it tells you everything except what you need to know, such as the return format.

getAddressFromPostcode(string postcode)
This method takes a postcode and returns the associated address.

Be explicit

However, don’t use this as an excuse to avoid documentation. Good documentation helps users access your API without using up your time. Be explicit about options, about which standards you support, and where necessary, how you support them. Even for simple things.

explicit-documentationThis image represents an authentication call to retrieve an API token that can be used when making further calls to an API. Each of the arrows represents an email discussion where the user needed clarification. This can be simplified with the right explicit statement:

To retrieve your API token, a POST request must be made to the HTTPS URL provided, with the following key-value pairs x-www-form-urlencoded into the Body of the request:

Key                       Value

client_id              your_username

client_secret      your_password

grant_type          client_credentials

If successful, this will return a JSON body in the response. The value associated with the key access_token should be appended to all subsequent requests as described below.

The use of italics is defined appropriately, and user specific information is supplied in a format to match the documented table.