Metadata
- Author:: Benn Stancil
- Full Title:: Will We Ever Have Clean Data?
- Category:: 🗞️Articles
- Document Tags:: data analysis, data culture, data quality
- URL:: https://benn.substack.com/p/will-we-ever-have-clean-data?utm_source=substack&utm_medium=email
- Finished date:: 2023-08-04
Highlights
When people talk about data quality and reliability, they often implicitly frame it as an unambiguous fight against entropy. We win if we’re persistent, prudent, disciplined, and thoughtful; we lose if we are lazy, reckless, inattentive, or foolish
the Frankenstein stack that gets created—planned one step at a time, full of half-built experiments and partially-deprecated failures—looks like a huge mistake. But, just as true of analysis, the mess has a purpose
Inevitably, even the best laid reporting plans give way to a lot of exploratory messes. Each potential metric produces a bunch of analyses to assess it; each analysis produces more questions and ad hoc offshoots. Multiply this by all the metrics and dashboards on your blueprint, and complicate it by constantly shifting the business underneath it, and the development process looks less like an organized construction site and more like an artist’s studio or a writer’s desk.
The good news, however, is that none of this is incompatible with data quality itself. We just have to imagine different ways to provide it. Instead of focusing on stability, for instance, are there ways to make instability safer? Or, to take it even further, could we make things easier to refactor, and in fact, encourage more rewrites?
These are the kinds of calculations that, if done in a single query, would probably happen in a series of CTEs. In dbt, however, it can make sense to pull each one (or some reasonable set of a few) out into their own models. This way, if you’re debugging this giant knot of logic, you can query that intermediate model directly. Or if it’s useful in other calculations, you can recycle it.
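Note to self: a rough sketch of the CTE-versus-model idea in the last highlight. This is not dbt itself. It assumes a plain SQLite view can stand in for an intermediate dbt model, and the `orders` table, its rows, and the `int_customer_totals` name are all hypothetical, made up for illustration.

```python
import sqlite3

# In-memory database with a small, made-up orders table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 20.0), (2, 10, 30.0), (3, 11, 5.0);
""")

# Single-query version: the intermediate step lives in a CTE,
# so it only exists inside this one statement.
single_query = """
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer_id
    )
    SELECT AVG(total_spent) FROM customer_totals
"""
cte_result = conn.execute(single_query).fetchone()[0]

# Split version: the same step pulled out into its own named relation.
# A view is used here as a stand-in for an intermediate dbt model (assumption).
conn.executescript("""
    CREATE VIEW int_customer_totals AS
        SELECT customer_id, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer_id;
""")

# Debugging: the intermediate result can now be inspected directly.
intermediate_rows = conn.execute(
    "SELECT customer_id, total_spent FROM int_customer_totals ORDER BY customer_id"
).fetchall()

# Same final metric as the CTE version, built on the named step.
view_result = conn.execute(
    "SELECT AVG(total_spent) FROM int_customer_totals"
).fetchone()[0]

# Reuse: a different calculation recycles the same intermediate step.
largest = conn.execute(
    "SELECT MAX(total_spent) FROM int_customer_totals"
).fetchone()[0]

print(intermediate_rows, cte_result, view_result, largest)
```

Both versions compute the same average. The difference is that the named step can be queried while debugging and reused by a second calculation without copying the CTE.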