Metrics layer

Tags:: 📝CuratedNotes, Data Engineering, Data Analysis

Voltaire said: “If you wish to converse with me, define your terms.”

First, what is “a metric”?

A metric is an aggregation over your facts or dimensions. The metrics layer has growing up to do - Amit’s Newsletter lists several kinds:

Simple aggregations

Aggregation with scalar functions (sum(Revenue) - sum(Cost))

Metrics that require joins (e.g., conversion rates with Slowly Changing Dimensions)

Metrics with window functions

Metrics with multiple aggregation levels (e.g., ratios in market share)

Multi-fact metrics (e.g., sales and purchases)
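
To make three of these kinds concrete, here is a minimal sketch using Python's built-in sqlite3 module. The fact_sales table, its columns, and its values are hypothetical and exist only for illustration; they do not come from the newsletter.

```python
import sqlite3

# Hypothetical fact table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (brand TEXT, revenue REAL, cost REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [("A", 100.0, 60.0), ("A", 50.0, 30.0), ("B", 150.0, 100.0)],
)

# Simple aggregation.
total_revenue = conn.execute(
    "SELECT SUM(revenue) FROM fact_sales"
).fetchone()[0]

# Aggregation with a scalar function: sum(Revenue) - sum(Cost).
profit = conn.execute(
    "SELECT SUM(revenue) - SUM(cost) FROM fact_sales"
).fetchone()[0]

# Multiple aggregation levels: revenue per brand (one level) divided by
# revenue overall (another level) gives a market-share ratio.
market_share = dict(
    conn.execute(
        """
        SELECT brand, SUM(revenue) / (SELECT SUM(revenue) FROM fact_sales)
        FROM fact_sales
        GROUP BY brand
        ORDER BY brand
        """
    ).fetchall()
)

print(total_revenue, profit, market_share)
```

The market-share query is the one that already resists a plain GROUP BY: the numerator and the denominator are aggregated at different levels.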

What is the problem solved by the metrics layer?

It provides a centralized definition of metrics outside of BI tools. We want to define metrics centrally because, as explained in 🗞️ The missing piece of the modern data stack:

Without a rollup to draw from, data consumers have to follow the second path: aggregate new metrics directly from dimension tables. That leaves the nature of the aggregation up to the person doing the analysis, and these aggregations are rarely simple. Counting weekly orders in Europe, for example, requires you to define week, order, and Europe. Do weeks start on Sunday or Monday? In which time zone? Do orders include those made with gift cards? What about returns? And are European customers those with billing addresses or shipping addresses in Europe? Are Russian customers European? Are British customers European? While all of this logic might live in the rollup_orders table, it isn’t necessarily in the dimension_orders table, meaning someone has to apply it on their own to do their analysis. This makes it incredibly difficult for people, especially people who aren’t analysts and aren’t familiar with the weird nuances that riddle most datasets, to consistently arrive at the same result.
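
The "define week" question in the quote can be shown with a small sketch. The order timestamps, the fixed UTC+1 offset, and the helper names week_start and weekly_orders are assumptions chosen for illustration. The same three orders produce different weekly counts depending on which day starts the week and which time zone is used.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical order timestamps, stored in UTC.
orders_utc = [
    datetime(2024, 1, 7, 23, 30, tzinfo=timezone.utc),  # Sunday, late evening
    datetime(2024, 1, 8, 10, 0, tzinfo=timezone.utc),   # Monday
    datetime(2024, 1, 13, 12, 0, tzinfo=timezone.utc),  # Saturday
]


def week_start(ts, tz, first_weekday):
    """Date on which the week containing `ts` starts, seen from time zone `tz`.

    first_weekday follows datetime.weekday(): 0 = Monday, 6 = Sunday.
    """
    local = ts.astimezone(tz)
    days_since_start = (local.weekday() - first_weekday) % 7
    return (local - timedelta(days=days_since_start)).date()


def weekly_orders(orders, tz, first_weekday):
    """Count orders per week under one specific definition of 'week'."""
    counts = {}
    for ts in orders:
        key = week_start(ts, tz, first_weekday)
        counts[key] = counts.get(key, 0) + 1
    return counts


utc_plus_1 = timezone(timedelta(hours=1))  # fixed offset, ignores DST

monday_utc = weekly_orders(orders_utc, timezone.utc, first_weekday=0)
sunday_utc = weekly_orders(orders_utc, timezone.utc, first_weekday=6)
monday_local = weekly_orders(orders_utc, utc_plus_1, first_weekday=0)

print(monday_utc)    # two weeks: 1 order, then 2 orders
print(sunday_utc)    # one week with 3 orders
print(monday_local)  # one week with 3 orders
```

None of the three answers is wrong. They only disagree because the definition was left to whoever wrote the query.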

With a way to define metrics…

we’re going to avoid two people defining sales as different numbers on a per-record level (does it include tax or not?), or using a different timezone to aggregate these numbers (are we using UTC, our head office time, or the local time of the store we sold things in?). (What’s an OLAP cube? 🎲 - Analytics Engineers Club)

There are two things to define: the base table, and the aggregation itself.

In the world of BI, a metric is a succinct summarization of data to make it easily palatable to humans. Inherent to this are two concepts — the formula to be applied to summarize the data (metric formula definition) and the data to be summarized (metric data definition). In most BI tools, these concepts are conflated into one and exist as the combined “metric definition”, locked up inside the BI tool.

We think these should be split apart.

The complex SQL query that produces the rows needed by the metric should be defined separately from the metric definition (the SUM or COUNT or AVERAGE operation performed by the metric). This is a fundamental concept that allows us to centralize data production for a metric and manage metric data lineage in the data warehouse — very similar to how data transformation is handled by DBT (The 7 traits of a modern metrics stack)
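
A minimal sketch of that split, assuming a hypothetical orders table and made-up names (MetricData, MetricFormula, compile_metric). It is not the API of any real metrics tool. The query that produces the rows is declared once, and several formulas reuse it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricData:
    """Metric data definition: the query that produces the rows to summarize."""
    name: str
    sql: str


@dataclass(frozen=True)
class MetricFormula:
    """Metric formula definition: the aggregation applied to those rows."""
    name: str
    expression: str  # e.g. "SUM(amount)"


def compile_metric(data, formula, group_by=()):
    """Combine a data definition and a formula into one SQL statement."""
    dims = ", ".join(group_by)
    select = f"{dims}, " if dims else ""
    tail = f" GROUP BY {dims} ORDER BY {dims}" if dims else ""
    return (
        f"SELECT {select}{formula.expression} AS {formula.name} "
        f"FROM ({data.sql}) AS {data.name}{tail}"
    )


# Hypothetical warehouse table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL, is_gift_card INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ES", 10.0, 0), ("ES", 20.0, 0), ("FR", 5.0, 0), ("FR", 99.0, 1)],
)

# One data definition (business rule: gift-card orders are excluded) ...
paid_orders = MetricData(
    "paid_orders", "SELECT country, amount FROM orders WHERE is_gift_card = 0"
)
# ... reused by two formula definitions.
revenue = MetricFormula("revenue", "SUM(amount)")
order_count = MetricFormula("order_count", "COUNT(*)")

revenue_by_country = conn.execute(
    compile_metric(paid_orders, revenue, group_by=("country",))
).fetchall()
total_orders = conn.execute(compile_metric(paid_orders, order_count)).fetchone()[0]

print(revenue_by_country, total_orders)
```

Because the gift-card rule lives only in paid_orders, changing it updates every metric built on that data definition.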

You might think views are enough, but they fall short for at least two reasons. First, the number of views can explode once you account for all the dimensions and grains you want to offer (🗣️ Coalesce 2021. The Metric System). Second, at some point you may want to materialize those views so that they are precomputed (Rollup tables).
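
A rough back-of-the-envelope sketch of that explosion, under the worst-case assumption (mine, not the talk's) that each combination of time grain and dimension subset gets its own view. The dimension and grain names are hypothetical.

```python
from itertools import combinations

# Hypothetical dimensions and time grains for a single metric.
dimensions = ["country", "channel", "product"]
grains = ["day", "week", "month"]


def views_needed(dimensions, grains):
    """One (grain, dimension subset) pair per view needed to predefine every cut."""
    subsets = [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]
    return [(grain, combo) for grain in grains for combo in subsets]


views = views_needed(dimensions, grains)
print(len(views))  # 3 grains * 2**3 dimension subsets = 24
print(len(views_needed(dimensions + ["store"], grains)))  # 3 * 2**4 = 48
```

Under this assumption each extra dimension doubles the count, and that is for one metric only.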

BI tools let you define metrics. However, we also want to access those definitions from places other than those tools. From 🗞️ The missing piece of the modern data stack: