- Tags:: 📝CuratedNotes , Data Engineering
Data modeling comprises different techniques for structuring information so that most business questions are easy to answer, without needing to predict in advance which questions will be asked, and without ad-hoc transformations every time we want to answer a new one.
My own one-line explanation of dimensional modeling at a company: Fight the Entropy.
This is also very related to the Metrics layer.
Styles
Roughly ordered from most popular and widely applied nowadays to most exotic:
- Kimball Dimensional modeling
- Data Vault
- OBT - One Big Table
- Activity Schema
- Entity-Centric by Maxime Beauchemin
- Bill Inmon in Building the Data Warehouse book (kind of a 3NF)
- Functional Kimball by Maxime Beauchemin
Preferred way
The best advice comes from this critique of Kimball Dimensional modeling: Kimball in the context of the modern data warehouse: what’s worth keeping, and what’s not - YouTube, and another on OBT:
…ditching the star schema and just building denormalized big tables instead, that is, join all joinable fields to create a large multi-purpose table and making views over this table to be the data marts exposed to the end-users (Modern Data Warehouse Modelling: The Definitive Guide - Part 2)
You can start easily with OBT. Alternatively, you may use Kimball Dimensional modeling as an intermediate step toward OBT.
This is also shared by other modern articles such as: Star Schema vs. OBT for Data Warehouse Performance | Blog | Fivetran
“staging your ELT process such that the data all get transformed into something like a star schema before everything gets re-joined back”
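A minimal sketch of that approach, using hypothetical `fct_orders`, `dim_customer`, and `dim_product` tables (all names invented for illustration): join the star schema into one big table, then expose data marts as views over it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
-- Hypothetical star schema: one fact table plus two dimensions.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fct_orders   (order_id INTEGER, customer_id INTEGER, product_id INTEGER, amount REAL);

INSERT INTO dim_customer VALUES (1, 'Ada', 'UK'), (2, 'Grace', 'US');
INSERT INTO dim_product  VALUES (10, 'Widget', 'Hardware');
INSERT INTO fct_orders   VALUES (100, 1, 10, 9.5), (101, 2, 10, 3.0);

-- OBT: join all joinable fields into one wide, multi-purpose table.
CREATE TABLE obt_orders AS
SELECT o.order_id, o.amount,
       c.name AS customer_name, c.country,
       p.name AS product_name, p.category
FROM fct_orders o
JOIN dim_customer c USING (customer_id)
JOIN dim_product  p USING (product_id);

-- Data marts are then just views over the big table.
CREATE VIEW mart_sales_by_country AS
SELECT country, SUM(amount) AS revenue
FROM obt_orders
GROUP BY country;
""")

rows = cur.execute(
    "SELECT country, revenue FROM mart_sales_by_country ORDER BY country"
).fetchall()
print(rows)  # -> [('UK', 9.5), ('US', 3.0)]
```

End users only ever see the views, so the underlying big table can be rebuilt or re-modeled without breaking their queries.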
Slowly Changing Dimensions type 2
For SCD2 in the modern data stack, there are two alternatives: doing them directly with Airbyte, or doing them with dbt snapshots.
Joining the fact table and the SCD2 dimension is easy: just add conditions on the date as part of the join.
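A minimal sketch of that join, assuming a hypothetical SCD2 dimension with `valid_from`/`valid_to` columns (half-open intervals, with a far-future `valid_to` marking the current row) and a `fct_orders` fact table; all names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
-- Hypothetical SCD2 dimension: each row is valid over [valid_from, valid_to).
CREATE TABLE dim_customer_scd2 (
    customer_id INTEGER, country TEXT,
    valid_from TEXT, valid_to TEXT  -- '9999-12-31' marks the current row
);
CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL);

INSERT INTO dim_customer_scd2 VALUES
    (1, 'UK', '2020-01-01', '2021-06-01'),
    (1, 'US', '2021-06-01', '9999-12-31');
INSERT INTO fct_orders VALUES
    (100, 1, '2020-05-10', 9.5),
    (101, 1, '2022-01-15', 3.0);
""")

# The date conditions in the join pick the dimension version
# that was in effect at the time of each order.
rows = cur.execute("""
    SELECT o.order_id, d.country
    FROM fct_orders o
    JOIN dim_customer_scd2 d
      ON o.customer_id = d.customer_id
     AND o.order_date >= d.valid_from
     AND o.order_date <  d.valid_to
    ORDER BY o.order_id
""").fetchall()
print(rows)  # -> [(100, 'UK'), (101, 'US')]
```

Order 100 lands in the customer's UK period and order 101 in the US period, so each fact row gets the attributes as they were at the time of the event.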
Data modeling in Lakehouse. Convergence of Data Lake and Data Warehouse
Medallion Architecture + Data Vault + Kimball Dimensional modeling: Data Warehousing Modeling Techniques and Their Implementation on the Databricks Lakehouse Platform
Metadata propagation
gouline/dbt-metabase (model synchronization from dbt to Metabase) seems to be a very interesting package, providing two-way sync between dbt and Metabase: metadata is propagated to Metabase, and dashboards appear in the dbt lineage as Exposures | dbt Docs.
Other refs
A very nice example for explaining to other people why we model data: Data Modeling Layer & Concepts | The Analytics Setup Guidebook (holistics.io)
The truth hierarchy, with the burden of proof on the new metric: 🗞️ Productizing Analytics. A Retrospective
We also enforced a dashboard “truth hierarchy”, that is, if in the process of exploring data in Looker, a stakeholder developed a metric that “disagreed” with a metric already established in the canonical dashboards, the burden of proof was on them to understand the misalignment. The data team was available to help with quick investigations through a company Slack channel, but the goal was to teach stakeholders how to fish and to avoid having to spend all of their time chasing down numbers that didn’t quite agree