Here are my thoughts on data following my guided conversation at CodeCraftConf 2019. Here are the questions I asked during that conversation. They are also available on GitHub if you would like to fork and modify them for your own use.
Most developers are data driven: they start with the data structure, not the algorithm, using either data-driven design or the Merise methodology.
Data, whilst often divided by microservice, is frequently stored on the same server or cluster, creating a monolith behind the microservices.
Not all data access is secured and audited, although there does appear to be a trend towards on-behalf-of flows through the microservice, allowing user-centered access control. Strict data access design is prevalent, although its efficacy was less clear, and that strict design is applied to all data, including publicly available data.
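To make the on-behalf-of idea concrete, here is a minimal sketch in Python. It is not taken from any system discussed on the day: the `DataStore` class, its fields and the service and user names are all hypothetical. The point is only that the store checks the end user's rights rather than the calling service's, and records every attempt, including the denied ones.

```python
from dataclasses import dataclass, field


@dataclass
class DataStore:
    """Hypothetical store that authorises the end user, not the calling service."""
    records: dict  # record id -> data
    owners: dict   # record id -> user id allowed to read it
    audit_log: list = field(default_factory=list)

    def read(self, record_id, on_behalf_of, via_service):
        allowed = self.owners.get(record_id) == on_behalf_of
        # Audit every attempt, whether or not it is allowed.
        self.audit_log.append((via_service, on_behalf_of, record_id, allowed))
        if not allowed:
            raise PermissionError(f"{on_behalf_of} may not read {record_id}")
        return self.records[record_id]


# Illustrative data only.
store = DataStore(records={"r1": "alice's order"}, owners={"r1": "alice"})

# The microservice passes the user's identity through to the store.
result = store.read("r1", on_behalf_of="alice", via_service="orders-service")

# The same service acting for a different user is refused.
try:
    store.read("r1", on_behalf_of="bob", via_service="orders-service")
    denied = False
except PermissionError:
    denied = True
```

In a real system the user identity would arrive as a verified token rather than a plain string, but the shape of the check and the audit trail is the same.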
Keeping sight of data in distributed systems is hard. Jepsen was suggested as one resource to help, but I’m happy to hear of others.
As well as data that can be used to discriminate, such as gender, name and postcode, we also discussed how missing data can be used to discriminate, such as when Glasgow accents aren’t included in voice training data, or when women aren’t included in medical trials.
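One small, practical way to start looking for missing data is to compare the groups you expect to see against the groups that are actually in the data set. The sketch below assumes a list of samples tagged with an `accent` field; the function name, the field and the sample values are all made up for illustration.

```python
from collections import Counter


def missing_groups(samples, expected_groups, key):
    """Return the expected groups that have no samples at all."""
    counts = Counter(sample[key] for sample in samples)
    return sorted(group for group in expected_groups if counts[group] == 0)


# Illustrative data only: a tiny voice training set with no Glaswegian speakers.
samples = [{"accent": "RP"}, {"accent": "RP"}, {"accent": "Scouse"}]
expected = {"RP", "Scouse", "Glaswegian"}

gaps = missing_groups(samples, expected, "accent")
```

This only catches groups you already thought to list, which is rather the point of the discussion: the hard part is knowing who should have been in `expected` in the first place.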
There’s also the big and growing problem of data collected by people who do not consider the discrimination or privacy implications. For a biologist, DNA is a puzzle that helps them decode cancer, and more examples make the puzzle easier to solve. But for others, DNA is a tool to map insurance risk, to find criminals, and to track down family members whether or not they want to be found. How do we train everyone else to understand?
And the takeaway question: what questions aren’t you asking about your data?