Categories
code cosmosdb data development

Cosmosdb and Heterogeneous data

A selection of different watches. They all tell the time, but some are analogues, some are digital, some are branded, and some are not.
Same, but different

CosmosDb, in common with other NoSQL databases, is schema-free. In other words, it doesn’t validate incoming data by default. This is a feature, not a bug. But it’s a dramatic change in thinking, akin to moving to a dynamically typed language from a statically typed one (and not, as it might first appear, moving from a strongly typed to a weakly typed one).

For those of us coming from a SQL or OO background, it’s tempting to use objects, possibly nested, to represent and validate the data, and hence encourage all the data within a collection to have the same structure (give or take some optional fields). This works, but it doesn’t provide all the benefits of moving away from a structured database. And it inherits from classic ORMs the migration problem when the objects and schema need to change. It can very easily lead to a fragile big-bang deployment.

For those of us used to dynamic languages and are comfortable with Python’s duck typing or the optional-by-default sparse mapping required to use continuously-versioned JSON-based RESTful services, there’s an obvious alternative. Be generous in what you accept.

If I have a smart home, packed with sensors, I could create a subset of core data with time, sensor identifier and a warning flag. So long as the website knows if that identifier is a smoke alarm or a thermostat, it can alert the user appropriately. But on top of that, the smoke alarm can store particle count, battery level, mains power status, a flag for test mode enabled, and the thermostat can have a temperature value, current programme state, boiler status, etc, both tied into the same stream.

Why would I want to do this?

Versioning

Have historic and current data from a device/user in one place, recorded accurately as how it was delivered (so that you can tweak the algorithm to fix that timedrift bug) rather than having to reformat all your historical data when you know only a small subset will ever be read again.

Data siblings

Take all the similar data together for unified analysis – such as multiple thermostat models with the same base properties but different configurations. This allows you to generate a temperature trend across devices, even as the sensors change, if sensors are all from different manufacturers, and across anything with a temperature sensor.

Co-location

If you’re making good use of cosmosdb partitions you may want to keep certain data within a partition to optimise queries. For example, a customer, all of their devices, and aggregated summaries of their activity. You can do this by partitioning on the customer id, and collecting the different types of data into one collection.

Conclusion

NoSQL is not 3NF, so throw put those textbooks and start thinking of data as more dynamic and freeform. You can still enforce structure if you want to, but think about if you’re causing yourself pain further down the road.

Check out @craignicol’s Tweet: https://twitter.com/craignicol/status/1122224379658633217?s=09

Categories
data development

CodeCraftConf 2019 : What is data anyway? (Questions)

Here are the questions I asked during my guided conversation at CodeCraftConf 2019. They are also available on GitHub if you would like to fork and modify them for your own use. Thankyou to everyone who came to the discussion, I will post a follow-up to discuss some of the interesting answers.

What is data anyway?

Navigating SQL, NoSQL, JSON and how to work with data in a post-RDMS, big-data world

Questions

Data modelling

  1. When designing a system, do you start with the data or the code?
  2. Has the rise of cloud based or non relational data stores changed how we model our data?
  3. Do you need to update your data when the models in the code change? How do you do it?
  4. Does all your data have to have the same shape?
  5. Should the data you expose to the outside world broadly match the data at rest?

Data security

  1. How do you secure your data?
  2. In light of GDPR, How do you ensure you aren’t collecting too much data?
  3. Who has access to your data?
  • Do you know if anyone unauthorised has accessed it?
  1. How do you protect yourself against bad data and trojan data?
  • Bad data = data that is fake, or is used for real world attacks
  • Trojan data = data that can compromise your or your customer’s systems

Ethical data

  1. Can your data be used to discriminate?
  • Can you prove it?
  • Is your data biased?
  • Are you recording hidden correlations? (ZIP code suggests race)
  1. Who owns your data?
  2. What questions aren’t you asking?

Unused questions

  1. What makes data big?
  2. Are you collecting the right data?
  3. Is the data you’re collecting right?
  4. Where is your data?

Technology choices

  1. Do you still have a place for traditional RDBMS?
Categories
development google leadership

Successful teams

Successful teams deliver successful projects. As a lead, how do you build a successful team?

There are many factors to build a successful team, but the foundation of them all is safety. Can problems be discussed openly? Does everyone trust everyone else? And once you have that, the team can build. Build diversity, build towards a common goal, and build something that matters.

Successful Google team

Google defines successful teams according to its research at https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/

Psychological safety: Can we take risks on this team without feeling insecure or embarrassed?
Dependability: Can we count on each other to do high quality work on time?
Structure & clarity: Are goals, roles, and execution plans on our team clear?
Meaning of work: Are we working on something that is personally important for each of us?
Impact of work: Do we fundamentally believe that the work we’re doing matters?

I accept, given multiple ongoing accusations against them about defending toxic managers and culture, that Google may not be living these values. However, these are clear statements that are supported by other studies such as The Game Outcomes project.

Penguins at Edinburgh Zoo

So how do we build a team like that?

Number one thing, and the only way I’ve found success, is to empower the team and everyone within it to make changes. Without that ownership, nothing matters.

Once you have that, you as the leader have to own the rest. Delegate where you need to, but own your team’s safety, support, direction, purpose and motivation.

Safety

Are you free to take risks and try something new?

Not everything you do will be a success, so do you celebrate knowledge and learning as a goal? Yes, that cost us time and money, but we learned not to do that again

Are team members supported? When someone mansplains your tech lead, do you correct them, and ensure her voice is heard? When a deaf developer joins the team, do you ask whether they prefer lip reading or sign, and help the team adapt appropriately? Do you recognise colour? Do you use preferred pronouns?

When mistakes are made, do you find someone to blame or do you all accept responsibility to address it? If the production database can be deleted by the graduate on their first day, and there are no backups, that is never their fault.

Creating Psychological Safety in the Workplace https://hbr.org/ideacast/2019/01/creating-psychological-safety-in-the-workplace

High Quality

Do you always have an high standard?

Everyone has their code reviewed, especially the lead. Is every line of code, and every process open to review and improvement? Great that you’re agile, but if you really value people over process, write the process down, and follow it. It doesn’t mean no process, it means that process serves the people, not the other way around. It means you change it when it no longer supports the people or the product.

What are your quality standards for code, for user experience, for security, and most importantly for behaviour? How are they enforced? And are they always enforced on time, every time?

Have policies. Do not have a daily fight over tabs vs spaces.

Direction

Ask everyone on your team what the team is building. If you get more than one answer, that’s a bug.

Ask everyone which part everyone else on the team plays towards that. Does that match how they see their role? Are there any gaps in responsibility?

Ask everyone what their priority is and why. Is anyone blocked? Ask them what their next priority is and if they have everything they need to fulfil it. If not, do they know where to get it?

Purpose

Is everyone bringing their whole self to work? Do office politics make them wary? Are they in a marginalized group and they have to bring representation as well as talent, and they are having to do both jobs at once?

At the office, is this the number one thing for them to be doing? Are your developers feeling stuck in support or BA? Are they frustrated that they aren’t allowed to refactor a gnarly piece of code that’s very open to be improved because “it works, don’t touch it”.

Does everyone on the team feel empowered to speak up and to fix things where they interfere with the goal of the team?

Motivation

Ask everyone why the team is building what they’re building, and why their part is important.

How will this change the user’s day? How will it affect the company? What’s the net improvement?

The Game Outcomes formulation

If you don’t like the Google formulation, try the game outcomes one. There’s plenty that applies to non-game projects. There’s a few negatives to avoid, and I’ll revisit them in a later post.

The most important indicators for success from the Game Outcomes project are:

  1. Great game development teams have a clear, shared vision of the game design and the development plan and an infectious enthusiasm for that vision.
  2. Great game development teams carefully manage the risks to the design vision and the development plan.
  3. Members of great game development teams buy into the decisions that are made.
  4. Great game development teams avoid crunch (overtime).
  5. Great gamedev teams build an environment where it’s safe to take a risk and stick your neck out to say what needs to be said.
  6. Great gamedev teams do everything they can to minimize turnover and avoid changing the team composition except for growing it when needed. This includes avoiding disruptive re-organizations as much as possible.
  7. Great gamedev teams resolve interpersonal conflicts swiftly and professionally.
  8. Great gamedev teams have a clearly-defined mission statement and/or set of values, which they genuinely buy into and believe in. This matters FAR more than you might think.
  9. Great gamedev teams keep the feedback loop going strong. No one should go too long without receiving feedback on their work.
  10. Great gamedev teams celebrate novel ideas, even if they don’t achieve their intended result. All team members need the freedom to fail, especially creative ones.

How do you keep your team on the right path?

We all want to work on successful projects, and there’s been a couple of times in my career I’ve been lucky enough to work in a team where everyone is delivering 10x. 10x developers don’t work in isolation, they work on teams where all the above needs are met, and they thrive off each other.

It’s great to have that dream team, but start by thinking about how to make your team reliably successful, and you’ll be doing better than most software teams.

Categories
.net code data development

CosmosDb in The Real World : Azure Global Bootcamp 2019 (Glasgow)

Thank you to those who came to my talk today about CosmosDb. I hope you found it useful.

If you’d like to review the slides, you’ll find the presentation online here :

CosmosDb In The Real World – GitPitch

If you have any further questions please ask below and I’ll do my best to answer.

Categories
code data development free speech security

Ethics in technology

This is an extension of a twitter thread I wrote in response to this tweet, and thread about the Cambridge Analytica revelations.

One of the key modern problems is how easy it is to access these tools. You don’t need professional training to string these together.

It’s as dangerous as if someone invented a weapon that could kill 10s or 100s of people, light enough to carry anywhere, and available in any store, without training. And expecting owners to police themselves.

People are terrified of AI. We know we don’t need AI to disable hospitals. We don’t need AI to intercept Facebook logins (although FireSheep and the pineapple are less effective now). We don’t need AI to send a drone into a crowded market.

Make a website the only place for government applications, such as medicare or millennials railcards and it’s easy to remove access for all citizens.

But combine all that with data and you can fuck up someone’s life without trying. You can give 2 people the same national insurance number or other id. You can flag them on the no fly list.

You can encode prejudice into the algorithm and incarcerate someone because they grew up in a black neighborhood.

The algorithm is God. The algorithm is infallible. Trust the algorithm.

Even when it tells you someone is more capable than the humans says she is, and punishes them.

(unless you’re under GDPR where you have the right to question the algorithm)

But tell anyone that people will use data for purposes they hadn’t considered (like using RIPA anti-terror legislation to see if someone’s in the school catchment area) then you’re paranoid.

Be paranoid. People will always stick crowbars in the seams. Whatever your worst case scenario for your code is, you’re probably not even close.


You can see my original tweet, and the repies, here:

The Guardian has a great interview on AI, existential threats and ethics on their podcast here.

Categories
data development free speech security

Government insecurity agencies

Given the SSL attacks that could be traced back to classing secure encryption as weapons subject to export restrictions, it’s clear that government security agencies have a deep conflict of interest that has led to significantly reduced security protection for their own citizens.

It’s clear that the Ransomware (or Ransomware as diversion) attacks on UK and US hospitals and many other sites are directly due to the NSA backdoor toolkit that was stolen earlier this year. Because if the government has a back door into a system, or an encryption platform, everyone has a backdoor, even if they don’t have access to it yet.

Which is why it’s great to see the EU outlawing backdoors in order to protect us as patients, service users, and data subjects, and I completely expect this will apply, like GDPR, to any system holding EU citizens data. So when the UK puts on its “we need a back door” legislation, companies need to choose to trade with the UK and compromise their security, or trade with the much bigger EU and protect their customers.

Encryption is like a lock, but it isn’t. It’s like a safe door, but it isn’t. Abstractions help to frame the problem, but they can obscure the issues. They make lawmakers think that what applies to banks applies to data.

(note: bank processes are optimised to replace credit cards because security works best when you can throw away a channel and start again if it’s compromised; this includes reversing transactions – which is hard to do when it’s the release of your personal data that needs reverted, rather than a row in a ledger than can be corrected by an additional row).

Encryption isn’t the problem. The San Bernardino iPhone had no useful intel. All the recent attackers in the UK were known, reported, and could have been tracked if they were prioritised. Banning encryption will have about as much impact as banning white vans. Breaking encryption weakens our security, threatens international trade especially with the EU, and when security holes lead to attacks on our hospitals and other infrastructure, bad security threatens our lives.

But so long as we’re afraid of terrorism, it’s OK for the populous to suffer?

Categories
Blogroll code development search

Microsoft Edge, ungooglability and a new class of bugs

Microsoft definitely has a naming problem. .net core was one thing, but calling a browser Edge was just trolling developers. Try searching for “Edge CSS” or “JavaScript Edge”. It’s a lesson in frustration, which means the bugs in the new browser are extra painful to debug because it’s that much harder to find the blog posts and Q&A for the last person to fix the problem.

And Edge doesn’t behave like IE, or Firefox, or Chrome. I’m sure Microsoft, like the other vendors, are updating OSS frameworks to help them target Edge, but there’s still a lot of Javascript and CSS that breaks silently, so no Console logs to help, no odd numbers in the calculated CSS, and no hacks to persuade Edge that it can render just like the big browsers.

I want to like the browser, I really do. Anything that brings the end of IE closer has to be welcomed, but even after the Anniversary update of Windows 10, it’s far from ready. If I try to open IIS failure logs in Windows 10, it opens up IE, and displays with the correct CSS, and then tells me I should use Edge, where the CSS is broken. It’s frustrating as a user, and as a developer. It’s an alpha product, and it should have been treated as such. Give it to devs, allow power users to opt in, and iterate it. Microsoft still needs to learn what it means to develop in the open.

Documentation

Unfortunately the problem is then compounded by Microsoft’s documentation problem. For all the faults of IE, at least Microsoft had a good reputation for documentation at the height of MSDN. Unfortunately, MSDN is starting to decay, and there’s a number of conflicting alternatives springing up. For us developers, the seemingly preferred route for latest information is blog posts (or the comments thereon – which were the only source of information for a knotty Docker problem we had), but there’s also GitHub, docs.microsoft.com and the occasional update to the existing MSDN documentation suite.

Microsoft seem to be trying to frustrate developers. Especially when they have evolving, and conflicting APIs (I’m looking at you Azure, and the Python vs PowerShell vs Node APIs, and the Portal experience). The documentation experience at Microsoft feels like the Google UI experience before Material Design. And it needs a similar overhaul.

I love seeing Microsoft trying to be more open and I see it working, to a certain extent, in the C# and .Net space, aside from the .Net Core RC release cycle chaos. They’ve come a long way from the days of alt.Net (although I agree that we need to recapture that passion, both for the sake of new developers, and for the sake of keeping Microsoft in check), but they’re in danger of alienating developers once more with the confusion, and the inconsistencies within certain platforms.

In that context, removing project.json and keeping .csproj was the right decision. One clear and consistent path. Now go and apply it across the board.

Categories
cloud data development security

Cloud is ephemeral

The Cloud is just someone else’s servers, or a portion thereof. Use the cloud because you want to scale quickly, only pay for what you use, and put someone else, with a global team, on the hook for recovering from outages. You’d also like a safety net, somewhere out there with the data you cannot afford to lose. But whatever is important to you, don’t keep it exclusively somewhere out of your control. Don’t keep your one copy “out there”. Back it up, replicate it. Put your configuration and infrastructure in source control. Distributed. Cloud thinking is about not relying on a machine. Eliminate Single Points of Failure, where you can, although there’s little you can do about a single domain name.

Understand your provider. Don’t let bad UI or configuration lose your data : Slack lost 800,000 messages.

Your cloud provider is a dependency. That makes it your responsibility. Each will give you features you can’t get on your own. They give you an ecosystem you can’t get from your desktop, and a platform to collaborate with others. They give you federated logins, global backups and recovery, content delivery networks, load balancing on a vast scale. But if the worst happens, know how to recover. “It’s in the cloud” is not a disaster recovery strategy, just ask the GitLab customers (although well played to them on their honesty so the rest of us can learn). Have your own backup. And remember, it’s not a backup unless you’ve verified you can restore.

It takes you 60 seconds to deploy to your current provider. How long does it take to deploy if that service goes dark?

Categories
data security

Privacy is not your only currency 

If you’re not paying, you’re the product.

But you’re not. In security, we talk about 2-factor authentication, where 2 factor is 2 out of 3 : who you are, what do you know, and what do you have. Who you are is the product, a subset of a target market for advertising, or a data point in a data collection scoop. The former requires giving up privacy, the latter less so.

Advertising is about segmenting audiences and focusing campaigns, so views and clicks both matter, to feed into demographics and success measures. Ad blocking is a double whammy – no ads displayed, and no data on you. Websites tend to argue that the former deprives them of revenue, many users argue that the latter deprives them of privacy.

What you have is money, and who you are is part of a demographic than can be monetised in order to advertise to you to get your money.

But what else do you have? If you’re on the web you have a CPU that can be used to compute something, whether it’s looking for aliens or looking for cancerous cells. If you’re happy to give up your CPU time.

Who else are you? You might be an influencer. You might be a data point in a freemium model that makes the premium model more valuable (hello, LinkedIn).

What do you know? If you’re a human you know how to read a CAPTCHA (maybe), you could know multiple languages. Maybe you know everything about porpoises and you can tell Wikipedia.

Your worth to a website isn’t always about the money you give them, or the money they can make from selling your data. It’s the way we’ve been trained to think, but there’s so much else we can do for value.

Categories
ai artificialintelligence c++ code development programming search

Updated slides on Genetic Algorithms

I had the opportunity at work to revisit my Generic Algorithm talk, to refactor it with a bit more time to hopefully make it clearer. I also ported the C++ template code to Python to make it easier to demo. I’ll be talking about the implementation differences in a future post but I’ve included the links for the talk below for public consumption.