Here are my thoughts on data following my guided conversation at CodeCraftConf 2019. The questions I asked during the session are listed below, and they are also available on GitHub if you would like to fork and modify them for your own use.
Most developers are data-driven: they start with the data structure, not the algorithm, using either data-driven design or the Merise methodology.
Data, whilst often divided by microservice, is frequently stored on the same server or cluster, creating a monolith behind the microservices.
Not all data access is secured and audited, although there does appear to be a trend towards on-behalf-of flows through the microservice, allowing user-centered access control. Strict data access design is prevalent, although its efficacy was less clear, and the strict design applies to all data, including publicly available data.
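To make the on-behalf-of idea concrete, here is a minimal sketch in Python. It is not taken from the conversation: `AuditedStore`, its record layout and the owner-only rule are all hypothetical, chosen only to show a service passing the end user's identity down to the data layer so that access can be checked and audited per user rather than per service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedStore:
    """Hypothetical store that checks and records every read per end user."""

    # Assumed layout: record_id -> {"owner": user_id, "data": payload}
    records: dict
    audit_log: list = field(default_factory=list)

    def read(self, record_id: str, on_behalf_of: str):
        record = self.records[record_id]
        # Assumed rule for this sketch: only the owner may read a record.
        allowed = record["owner"] == on_behalf_of
        # Log the attempt whether or not it succeeds, so denied reads are visible too.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": on_behalf_of,
            "record": record_id,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{on_behalf_of} may not read {record_id}")
        return record["data"]
```

Because denied attempts are logged as well as successful ones, the same log can help answer the later question of whether anyone unauthorised has tried to access the data.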
Keeping sight of data in distributed systems is hard. Jepsen was suggested as one resource to help, but I’m happy to hear of others.
As well as collected data that can be used to discriminate, such as gender, name and postcode, we also discussed how missing data can be used to discriminate, such as when Glasgow accents aren’t included in voice training data, or when women aren’t included in medical trials.
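One rough way to spot missing data of this kind is to compare the groups you expect to serve against the groups actually present in a dataset. The sketch below is only an illustration: the function name `representation_gaps`, the 5% threshold and the accent labels are assumptions, not anything agreed in the session.

```python
from collections import Counter


def representation_gaps(labels, expected_groups, min_share=0.05):
    """Return expected groups whose share of the data is below min_share.

    labels: one group label per sample (for example, the speaker's accent).
    expected_groups: groups you believe the system should work for.
    min_share: hypothetical threshold; pick one that suits your domain.
    """
    counts = Counter(labels)
    total = len(labels)
    gaps = {}
    for group in expected_groups:
        share = counts[group] / total if total else 0.0
        if share < min_share:
            gaps[group] = share
    return gaps


# Made-up sample: 90 recordings labelled "RP", 10 labelled "Scouse", no Glaswegian.
accents = ["RP"] * 90 + ["Scouse"] * 10
print(representation_gaps(accents, ["RP", "Scouse", "Glaswegian"]))
# prints {'Glaswegian': 0.0}
```

A check like this only finds gaps in groups you thought to list, so it complements rather than replaces asking who is missing from the data in the first place.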
There’s also the big and growing problem of data collected by people who do not consider the discrimination or privacy implications. For a biologist, DNA is a puzzle that helps them decode cancer, and more examples make the puzzle easier to solve. But for others, DNA is a tool to map insurance risk, to find criminals, and to track down family members whether or not they want to be found. How do we train everyone else to understand?
And the takeaway question: what questions aren’t you asking about your data?
What is data anyway?
Navigating SQL, NoSQL, JSON and how to work with data in a post-RDBMS, big-data world
- When designing a system, do you start with the data or the code?
- Has the rise of cloud based or non relational data stores changed how we model our data?
- Do you need to update your data when the models in the code change? How do you do it?
- Does all your data have to have the same shape?
- Should the data you expose to the outside world broadly match the data at rest?
- How do you secure your data?
- In light of GDPR, how do you ensure you aren’t collecting too much data?
- Who has access to your data?
- Do you know if anyone unauthorised has accessed it?
- How do you protect yourself against bad data and trojan data?
  - Bad data = data that is fake, or is used for real-world attacks
  - Trojan data = data that can compromise your systems or your customers’ systems
- Can your data be used to discriminate?
- Can you prove it?
- Is your data biased?
- Are you recording hidden correlations? (ZIP code suggests race)
- Who owns your data?
- What questions aren’t you asking?
- What makes data big?
- Are you collecting the right data?
- Is the data you’re collecting right?
- Where is your data?
- Do you still have a place for traditional RDBMS?