Categories
code cosmosdb data development

Cosmosdb and Heterogeneous data

A selection of different watches. They all tell the time, but some are analogues, some are digital, some are branded, and some are not.
Same, but different

CosmosDb, in common with other NoSQL databases, is schema-free. In other words, it doesn’t validate incoming data by default. This is a feature, not a bug. But it’s a dramatic change in thinking, akin to moving to a dynamically typed language from a statically typed one (and not, as it might first appear, moving from a strongly typed to a weakly typed one).

For those of us coming from a SQL or OO background, it’s tempting to use objects, possibly nested, to represent and validate the data, and hence encourage all the data within a collection to have the same structure (give or take some optional fields). This works, but it doesn’t provide all the benefits of moving away from a structured database. And it inherits from classic ORMs the migration problem when the objects and schema need to change. It can very easily lead to a fragile big-bang deployment.

For those of us used to dynamic languages and are comfortable with Python’s duck typing or the optional-by-default sparse mapping required to use continuously-versioned JSON-based RESTful services, there’s an obvious alternative. Be generous in what you accept.

If I have a smart home, packed with sensors, I could create a subset of core data with time, sensor identifier and a warning flag. So long as the website knows if that identifier is a smoke alarm or a thermostat, it can alert the user appropriately. But on top of that, the smoke alarm can store particle count, battery level, mains power status, a flag for test mode enabled, and the thermostat can have a temperature value, current programme state, boiler status, etc, both tied into the same stream.

Why would I want to do this?

Versioning

Have historic and current data from a device/user in one place, recorded accurately as how it was delivered (so that you can tweak the algorithm to fix that timedrift bug) rather than having to reformat all your historical data when you know only a small subset will ever be read again.

Data siblings

Take all the similar data together for unified analysis – such as multiple thermostat models with the same base properties but different configurations. This allows you to generate a temperature trend across devices, even as the sensors change, if sensors are all from different manufacturers, and across anything with a temperature sensor.

Co-location

If you’re making good use of cosmosdb partitions you may want to keep certain data within a partition to optimise queries. For example, a customer, all of their devices, and aggregated summaries of their activity. You can do this by partitioning on the customer id, and collecting the different types of data into one collection.

Conclusion

NoSQL is not 3NF, so throw put those textbooks and start thinking of data as more dynamic and freeform. You can still enforce structure if you want to, but think about if you’re causing yourself pain further down the road.

Check out @craignicol’s Tweet: https://twitter.com/craignicol/status/1122224379658633217?s=09

Categories
development

CodeCraftConf 2019 : What is data anyway? (Answers)

Here’s my thoughts on data following my CodeCraftConf guided conversation. Here are the questions I asked during my guided conversation at CodeCraftConf 2019. They are also available on GitHub if you would like to fork and modify them for your own use.

Most developers are data driven, start with the data structure, not the algorithm. Either data driven design, or the Merise Methodology.

Data, whilst often divided by microservice, is often stored on the same server/cluster, creating a monolith behind the microservices.

Not all data access is secured and audited, although there does appear to be a trend to on-behalf-of flows through the microservice, allowing user-centered access control. Strict data access design is prevalent, although the efficacy was less clear, and strict design applies to all data, including publicly available data.

Keeping sight of data in distributed systems is hard. Jepsen was suggested as one resource to help, but I’m happy to hear of others.

As well as data that can be used to discriminate by collecting gender, name, postcode etc., we also discussed how missing data can be used to discriminate, such as when Glasgow accents aren’t included in voice training data, or when women aren’t used in medical trails.

There’s also the big and growing problem of data collected by people who do not consider the discrimination or privacy implications. For a biologist, DNA is a puzzle that helps them decode cancer, and more examples make the puzzle easier to solve. But for others, DNA is a tool to map insurance risk, to find criminals, and to track down family members whether or not they want to be found. How do we train everyone else to understand?

And the takeaway question : what questions aren’t you asking about your data?

Categories
data development

CodeCraftConf 2019 : What is data anyway? (Questions)

Here are the questions I asked during my guided conversation at CodeCraftConf 2019. They are also available on GitHub if you would like to fork and modify them for your own use. Thankyou to everyone who came to the discussion, I will post a follow-up to discuss some of the interesting answers.

What is data anyway?

Navigating SQL, NoSQL, JSON and how to work with data in a post-RDMS, big-data world

Questions

Data modelling

  1. When designing a system, do you start with the data or the code?
  2. Has the rise of cloud based or non relational data stores changed how we model our data?
  3. Do you need to update your data when the models in the code change? How do you do it?
  4. Does all your data have to have the same shape?
  5. Should the data you expose to the outside world broadly match the data at rest?

Data security

  1. How do you secure your data?
  2. In light of GDPR, How do you ensure you aren’t collecting too much data?
  3. Who has access to your data?
  • Do you know if anyone unauthorised has accessed it?
  1. How do you protect yourself against bad data and trojan data?
  • Bad data = data that is fake, or is used for real world attacks
  • Trojan data = data that can compromise your or your customer’s systems

Ethical data

  1. Can your data be used to discriminate?
  • Can you prove it?
  • Is your data biased?
  • Are you recording hidden correlations? (ZIP code suggests race)
  1. Who owns your data?
  2. What questions aren’t you asking?

Unused questions

  1. What makes data big?
  2. Are you collecting the right data?
  3. Is the data you’re collecting right?
  4. Where is your data?

Technology choices

  1. Do you still have a place for traditional RDBMS?
Categories
development security

Flatter Data

I was watching The Verge summary of The Selfish Ledger, Google X’s thought experiment on what your personal data could do in the future. I started to think about Flatland.

Flatland is a book by Edwin A Abbott about dimensions. In the book, A Square lives in a 2D world, with other 2D shapes, and tries to comprehend the universe when 3D shapes start turning up, but A Square can only comprehend them in slices or shadows/projections.

See this video by Carl Sagan if you want to know more.

The personal data organisations see of us is like the circles projected in Flatland. Google sees the videos I like and the technologies I search for help on. HMRC sees my income, savings, and charitable giving. NHS sees my health.

Companies make decisions on this data, and, like the flatlanders, generalise from the pink circles they see. Sometimes that accurately reflects the brown circles, oftentimes, not. Sometimes what looks like 2 circles is a pair of legs, and what looks like one circle is actually a group hug.

I don’t want companies to disambiguate that. I endorse the spirit of GDPR, that data should only be given up in informed consent (absent the usual rights exemptions for criminals who who violate the rights of others.)

For those of us who work in tech, we need to embrace the ambiguity, and help users and other data subjects understand how they have been categorised. Let them embrace anonymity via randomisation, such as number variance data masking.

You never own someone else’s data, you merely look after it for as long as they let you. It’s not about privacy. It’s not about data. It’s about trust. It’s about ethics.

Categories
code data development free speech security

Ethics in technology

This is an extension of a twitter thread I wrote in response to this tweet, and thread about the Cambridge Analytica revelations.

One of the key modern problems is how easy it is to access these tools. You don’t need professional training to string these together.

It’s as dangerous as if someone invented a weapon that could kill 10s or 100s of people, light enough to carry anywhere, and available in any store, without training. And expecting owners to police themselves.

People are terrified of AI. We know we don’t need AI to disable hospitals. We don’t need AI to intercept Facebook logins (although FireSheep and the pineapple are less effective now). We don’t need AI to send a drone into a crowded market.

Make a website the only place for government applications, such as medicare or millennials railcards and it’s easy to remove access for all citizens.

But combine all that with data and you can fuck up someone’s life without trying. You can give 2 people the same national insurance number or other id. You can flag them on the no fly list.

You can encode prejudice into the algorithm and incarcerate someone because they grew up in a black neighborhood.

The algorithm is God. The algorithm is infallible. Trust the algorithm.

Even when it tells you someone is more capable than the humans says she is, and punishes them.

(unless you’re under GDPR where you have the right to question the algorithm)

But tell anyone that people will use data for purposes they hadn’t considered (like using RIPA anti-terror legislation to see if someone’s in the school catchment area) then you’re paranoid.

Be paranoid. People will always stick crowbars in the seams. Whatever your worst case scenario for your code is, you’re probably not even close.


You can see my original tweet, and the repies, here:

The Guardian has a great interview on AI, existential threats and ethics on their podcast here.

Categories
development security

How much data can you lose before you’re in trouble?

Ransomware is a very aggressive attack. Whilst many espionage operations are about sneaking in and copying data without your knowledge, ransomware hits you over the head with a hammer to let you know you’ve lost your data. It’s not theft, it’s extortion.

The big pro is that at least you know you’ve been breached, and the form of attack means that whilst you might not have access to your data, the bad guys might not either.

But you’ve got a good backup strategy, right? You can roll back the data to a known good point in history, and maybe even roll forward your changes from there.

But maybe it doesn’t matter. Maybe you can run your business just as effectively without that data, or those templates. Maybe you shouldn’t be keeping that data at all?

If you have data you need, distribute it. Secure it, but decide if the greater risk is you losing access to the data, or someone else gaining access.

If you have data you don’t need, Don’t store it.

Categories
cloud data development security

Cloud is ephemeral

The Cloud is just someone else’s servers, or a portion thereof. Use the cloud because you want to scale quickly, only pay for what you use, and put someone else, with a global team, on the hook for recovering from outages. You’d also like a safety net, somewhere out there with the data you cannot afford to lose. But whatever is important to you, don’t keep it exclusively somewhere out of your control. Don’t keep your one copy “out there”. Back it up, replicate it. Put your configuration and infrastructure in source control. Distributed. Cloud thinking is about not relying on a machine. Eliminate Single Points of Failure, where you can, although there’s little you can do about a single domain name.

Understand your provider. Don’t let bad UI or configuration lose your data : Slack lost 800,000 messages.

Your cloud provider is a dependency. That makes it your responsibility. Each will give you features you can’t get on your own. They give you an ecosystem you can’t get from your desktop, and a platform to collaborate with others. They give you federated logins, global backups and recovery, content delivery networks, load balancing on a vast scale. But if the worst happens, know how to recover. “It’s in the cloud” is not a disaster recovery strategy, just ask the GitLab customers (although well played to them on their honesty so the rest of us can learn). Have your own backup. And remember, it’s not a backup unless you’ve verified you can restore.

It takes you 60 seconds to deploy to your current provider. How long does it take to deploy if that service goes dark?

Categories
development security

Primer : A tech view of GDPR

I was fortunate enough to attend an event at The Data Lab in Edinburgh today on the new General Data Protection Regulation, coming to the EU and the UK. There were 4 talks from a variety of angles, but for me the key takeaways were that the primary thrust of the regulation is about prevention rather than cure, and auditing and control rather than additional technical implementations, aside from the Data Portability clause.

Best practice still applies. Collect only the minimum data required, and don’t collect personal data unless you have to. Encrypt your data, in transit and at rest. Privacy should be the default, and only extended by informed choice.

But you need a data breach policy. An email to Troy Hunt might be OK if it’s a hobby project that was breached, but you need to notify data subjects and users if there is a breach, and you need the security policies and audits to protect you if the lawsuits start flying.

I’m not a lawyer, so I won’t offer advice there. But as you’re designing your systems, now’s the chance to audit, prepare and secure. Don’t be the first high-profile fine under the new rules.

february 14 2017 at 0237pm
february 14 2017 at 0237pm

dsc 0437
dsc 0437

dsc 0438
dsc 0438

dsc 0439
dsc 0439

Categories
development

Moving house and denormalised data 

There’s a lot of abstract talk about data normalisation and having a single source of truth within your organisation for each type of data. Sometimes this goes very wrong. 

I moved house this year, and whilst I accept the extra administration that comes from each organisation needing to be updated individually (after all, why should DVLA need to know who I bank with?), I do expect each organisation to make it easy to change my address. 

And it was easy, except for my bank. Current accounts, credit cards and savings are all managed separately, the address I see online is unrelated to any of these, and I have to change my address on joint and single accounts separately. 

I don’t care what your backend system is like. My interface should be one call and you sort it out, not keep redirecting me to queue after queue. And sending me a letter to the new address telling me it’s fixed, followed by another letter dated a week later to the old address with marketing in it. 

And then, I tried to phone them about that letter, and I couldn’t pass their security questions, because my address was updated everywhere except the phone validation system. 

How easy is it for your data subjects to update their data? One call or many? Is their security tied to that data – could they lose access if a bad agent adds a new address or credit card? How confident are you that you know all the places in your user, marketing, communication and other databases that need updating? 

How can you make it easier? 

Categories
data security

Privacy is not your only currency 

If you’re not paying, you’re the product.

But you’re not. In security, we talk about 2-factor authentication, where 2 factor is 2 out of 3 : who you are, what do you know, and what do you have. Who you are is the product, a subset of a target market for advertising, or a data point in a data collection scoop. The former requires giving up privacy, the latter less so.

Advertising is about segmenting audiences and focusing campaigns, so views and clicks both matter, to feed into demographics and success measures. Ad blocking is a double whammy – no ads displayed, and no data on you. Websites tend to argue that the former deprives them of revenue, many users argue that the latter deprives them of privacy.

What you have is money, and who you are is part of a demographic than can be monetised in order to advertise to you to get your money.

But what else do you have? If you’re on the web you have a CPU that can be used to compute something, whether it’s looking for aliens or looking for cancerous cells. If you’re happy to give up your CPU time.

Who else are you? You might be an influencer. You might be a data point in a freemium model that makes the premium model more valuable (hello, LinkedIn).

What do you know? If you’re a human you know how to read a CAPTCHA (maybe), you could know multiple languages. Maybe you know everything about porpoises and you can tell Wikipedia.

Your worth to a website isn’t always about the money you give them, or the money they can make from selling your data. It’s the way we’ve been trained to think, but there’s so much else we can do for value.