- Tags:: 📝CuratedNotes , Data Engineering, Data Analysis
Related: Metrics frameworks
Voltaire decía: “Si usted quiere conversar conmigo, defina sus términos”.
Link to original
First, what is “a metric”?
Aggregations over your facts or dimensions. From The metrics layer has growing up to do - Amit’s Newsletter:
- Simple aggregations
- Aggregation with scalar functions (sum(Revenue) - sum(Cost))
- Metrics that require joins (e.g., conversion rates with Slowly Changing Dimensions)
- Metrics with window functions
- Metrics with multiple aggregation levels (e.g., ratios in market share).
- Multi-fact metrics (e.g., sales and purchases).
What is the problem solved by the metrics layer?
Having a centralized definition of metrics outside of BI tools. We want to define metrics because as explained in 🗞️ The missing piece of the modern data stack:
Without a rollup to draw from, data consumers have to follow the second path: aggregate new metrics directly from dimension tables. That leaves the nature of the aggregation up to the person doing the analysis, and these aggregations are rarely simple. Counting weekly orders in Europe, for example, requires you to define week, order, and Europe. Do weeks start on Sunday or Monday? In which time zone? Do orders include those made with gift cards? What about returns? And are European customers those with billing addresses or shipping addresses in Europe? Are Russian customers European? Are British customers European? While all of this logic might live in the
Link to originalrollup_orders
table, it isn’t necessarily in thedimension_orders
table, meaning someone has to apply it on their own to do their analysis. This makes it incredibly difficult for people, especially people who aren’t analysts and aren’t familiar with the weird nuances that riddle most datasets, to consistently arrive at the same result.
With a way to define metrics…
we’re going to avoid two people defining sales as different numbers on a per-record level (does it include tax or not?), or using a different timezone to aggregate these numbers (are we using UTC, our head office time, or the local time of the store we sold things in?). (What’s an OLAP cube? 🎲 - Analytics Engineers Club)
There are two things to define: the base table, and the aggregation itself.
In the world of BI, a metric is a succinct summarization of data to make it easily palatable to humans. Inherent to this are two concepts — the formula to be applied to summarize the data (metric formula definition) and the data to be summarized (metric data definition). In most BI tools, these concepts are conflated into one and exist as the combined “metric definition”, locked up inside the BI tool.
We think these should be split apart.
The complex SQL query that produces the rows needed by the metric should be defined separately from the metric definition (the SUM or COUNT or AVERAGE operation performed by the metric). This is a fundamental concept that allows us to centralize data production for a metric and manage metric data lineage in the data warehouse — very similar to how data transformation is handled by DBT (The 7 traits of a modern metrics stack)
You may think it is enough with views but views fall short for several reasons. On the one hand, you may have an explosion of views considering the dimensions and grains you want to offer (🗣️ Coalesce 2021. The Metric System). Second, at some point, you may want to materialize those views to have them precomputed (Rollup tables).
BI tools allow to define metrics. However, we would want to access such definitions from other places apart from the those tools. From 🗞️ The missing piece of the modern data stack:
Implementations
This is the current proposal (as of 2021-01-01) of dbt for metrics:
However, since they seem to be moving towards more direct support of them:
- [Feature] dbt should know about metrics · Issue #4071 · dbt-labs/dbt-core
- Metrics | dbt Docs
- dbt Metrics Framework Playbook
Additionally, there are several open source packages offering metrics on top of dbt. A good overview is in this article. On my own I saw:
- Metriql Docs | Metriql Docs
- lightdash/lightdash: An open source alternative to Looker built using dbt. Made for analysts ❤️ (though this one will also act as a addition/replacement of Metabase)
Metabase also supports metrics (07 Segments and Metrics (metabase.com)), but there is no way to link with dbt yet, at least using dbt-metabase (Setting Metabase Metrics using dbt properties · Issue #25 · gouline/dbt-metabase (github.com))
Of course, big companies design their own systems, such as Minerva in the case of AirBnB: How Airbnb Achieved Metric Consistency at Scale | by Robert Chang | The Airbnb Tech Blog | Medium
Other references
- You may want to make snapshots of your metrics: https://erikapullum.com/blog/snapshot-metrics
- Note that the metrics layer can also be called a metric store: What is a metrics store? Why your data team should define business metrics in code - Transform Data.
- An amazing explanation of Vicki Boykis on how… the truth does not really exists: All numbers are made up, some are useful - by Vicki Boykis (substack.com).
- Why cohort analysis beats all other approaches to calculating LTV (lifetimely.io)
- Metric trees