Highlights

But then…things start to go downhill. The model fails for no apparent reason. When the data science team investigates, they discover a wall of technical debt written 8 years ago that no one understands. After pinging the data Slack channel, they receive very few meaningful answers and eventually write a ticket to the data engineering team’s JIRA backlog. What follows is weeks or months of back and forth, confusion, bickering over ownership, and an eventual breakdown in quality that leaves everyone exasperated and moving at a snail’s pace. (View Highlight)

Overloaded with work, data engineers do their best to string together one pipeline after another with minimal testing, documentation, or data modeling (View Highlight)

Following Agile, teams deploy iterative changes ranging from small bug fixes to features that significantly alter the customer experience. The data engineer can only keep up temporarily. Eventually, the deployment velocity is too much to manage, and expectations of what the data means and its underlying implementation fall out of sync. (View Highlight)

Data engineers spend days to weeks attempting to track down what caused the problem. When they finally do, it’s (as expected) an upstream issue (View Highlight)

Data consumers quickly lose trust in the complex SQL queries stitched together by their fellow data scientists. Who can be bothered to parse through hundreds of lines of filters and CASE WHEN statements with a product manager breathing down your neck about deliverables? (View Highlight)

Data quality starts to spiral out of control at an exponential rate. The organizational complexity of managing changes to many different data assets all at once creates hurdles for each stakeholder. Inevitably, pointing the finger at the other side is about as far as anyone gets toward actually solving the problem. At this point, it’s common for data leadership to try something drastic. Data teams might:

• Buy a new data-quality tool
• Switch to a different organizational paradigm
• Lobby for a large-scale refactor (and usually fail)
• Create data governance councils, committees, and other groups
• Cry

While well-intentioned, each of these solutions addresses a symptom, not the root cause. (View Highlight)

However, with democratization comes chaos. A lack of managed truth means that anything can change at any time, and when those changes don’t map back to our expectations, bad things happen. (View Highlight)

In a post-data warehouse world, the assumption is typically only that: an assumption. There is no evidence that the item_pricing table contains up-to-date information. There is no business expert who controls the floodgates to production ensuring data consumers are making the right choices, and there is rarely documentation to even explain how the data should be accessed or what a single row in a table represents. An explosion of unreliable data emerges as a result. What made sense for the start-up no longer works at scale. When applications require trust, like AI/ML pipelines, data products, or financial reporting, the flexibility of the Modern Data Stack and its democratization suddenly becomes a massive liability. (View Highlight)
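The “evidence” the passage says is missing is checkable in principle. Below is a minimal sketch of the kind of freshness assertion nobody in this story is running; the item_pricing table comes from the text, while the updated_at column, the SQLite stand-in, and the 24-hour threshold are assumptions for illustration.

```python
# Minimal freshness check: is there any evidence item_pricing is current?
from datetime import datetime, timedelta, timezone
import sqlite3  # stand-in for whatever warehouse driver is actually in use

MAX_STALENESS = timedelta(hours=24)  # assumed freshness SLA

def check_item_pricing_freshness(conn: sqlite3.Connection) -> None:
    # Assumes an updated_at column stored as an ISO-8601 UTC timestamp.
    row = conn.execute("SELECT MAX(updated_at) FROM item_pricing").fetchone()
    if row[0] is None:
        raise RuntimeError("item_pricing is empty: no evidence of freshness")
    last_update = datetime.fromisoformat(row[0]).replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - last_update
    if age > MAX_STALENESS:
        raise RuntimeError(
            f"item_pricing is stale: last updated {age} ago, "
            f"allowed {MAX_STALENESS}"
        )
```

A check like this fails loudly at read time instead of letting a stale price flow silently into a model or a report.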

A Post-Truth World (View Highlight)

Late in 2022, Bird announced they had overstated company revenue for two years by recognizing unpaid customer rides (View Highlight)

A survey of finance professionals by Blackline found that while 71% of C-suite respondents claimed to completely trust the accuracy of their financial data, only 38% of finance ICs said the same. (View Highlight)

Data Catalogs are a record of what has already happened, but they are not a framework for ensuring that the right information is communicated to the right person when things change (View Highlight)

Owners are defined in name only, not function. It is not the responsibility of the dataset owner to manage the expectations of data consumers because they were never asked to do so! (View Highlight)

Unfortunately, the rise of the Modern Data Stack demonstrates that we are living in a post-truth world, quite literally (View Highlight)

We are constantly in the position of playing catch-up, always reacting to the world around us after the damage has already occurred, and in some cases never reacting at all. In my opinion, there is no bigger challenge to solve in data than this (View Highlight)

The solution is surprisingly, painfully simple. Better data communication. (View Highlight)
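The article names the principle, not a mechanism. As one hedged illustration of what “better data communication” could look like, here is a sketch of a data contract that producers and consumers both import; every name in it (DataContract, the orders example, the owner and notify fields) is hypothetical rather than anything the article prescribes.

```python
# A versioned, owned, importable statement of expectations -- one possible
# concrete form of "better data communication". All names are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataContract:
    table: str
    version: str
    owner: str                       # an accountable owner, not one in name only
    columns: dict[str, type]         # column name -> expected Python type
    notify: list[str] = field(default_factory=list)  # who hears about changes

    def validate_row(self, row: dict) -> list[str]:
        """Return human-readable violations for one row."""
        problems = []
        for name, expected in self.columns.items():
            if name not in row:
                problems.append(f"{self.table}: missing column '{name}'")
            elif row[name] is not None and not isinstance(row[name], expected):
                problems.append(
                    f"{self.table}.{name}: expected {expected.__name__}, "
                    f"got {type(row[name]).__name__}"
                )
        return problems

orders_v1 = DataContract(
    table="orders",
    version="1.0.0",
    owner="checkout-team",
    columns={"order_id": str, "amount_cents": int, "discount_pct": float},
    notify=["data-science@company.example"],
)
```

The point is the owner and notify fields: a change to the contract becomes a reviewable event routed to named people, which is exactly what a catalog-as-record, as described above, does not give you.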

Software engineers make reasonable and locally optimal decisions for the services they maintain. Perhaps they drop a column from their production database because it’s no longer useful, split one column into three, or alter the underlying logic of how discounts are applied to an order. While that’s fine and good for their own applications, the downstream consumers of the data are affected in various ways: (View Highlight)
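The highlight cuts off before the list, but a toy example (all column names invented here) shows the nastiest of those ways: consumer code that keeps running and silently reports wrong numbers after a locally reasonable schema change.

```python
# The producer splits `discount` into three columns and drops the original.
old_row = {"order_id": "A1", "price": 100.0, "discount": 0.10}
new_row = {
    "order_id": "A1",
    "price": 100.0,
    "item_discount": 0.05,
    "member_discount": 0.05,
    "promo_discount": 0.00,
}

def net_revenue(row: dict) -> float:
    # Consumer code written when `discount` existed. The defensive
    # .get(..., 0.0) means the change doesn't crash anything -- it just
    # makes every answer wrong.
    return row["price"] * (1 - row.get("discount", 0.0))

print(net_revenue(old_row))  # 90.0  -- correct under the old schema
print(net_revenue(new_row))  # 100.0 -- silently ignores all discounts
```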

Data producers remain unaware of this problem while data scientists only recognize something is wrong when their most important data assets start to behave unexpectedly. (View Highlight)

for startups. When a company is small, data developers can afford to be in the ‘room where it happens.’ If (View Highlight)

“So what if your app usage numbers are off by 20%?” they might say. “If we have a general idea of what’s working and what’s not, do we really need the additional cost required to maintain 100% accuracy?” (View Highlight)
