Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform

Metadata

Author:: Databricks
Full Title:: Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform
Category:: 🗞️Articles
Document Tags:: Data Modeling, Data Vault
URL:: https://www.databricks.com/blog/2022/06/24/prescriptive-guidance-for-implementing-a-data-vault-model-on-the-databricks-lakehouse-platform.html
Finished date:: 2023-05-19

Highlights

A Data Vault is well suited to the lakehouse methodology: its hub, link, and satellite design makes the data model granular and easily extensible, so design and ETL changes are easy to implement. (View Highlight)

Data Vault modeling recommends using a hash of business keys as the primary keys. Databricks supports hash, md5, and SHA functions out of the box to support business keys. (View Highlight)
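To make the hashed-key idea concrete, here is a minimal sketch in plain Python using the standard `hashlib` module rather than the Databricks built-in functions. The helper name `hash_key`, the `||` delimiter, and the trim-and-uppercase normalization are assumptions for illustration; they are not prescribed by the article.

```python
import hashlib


def hash_key(*business_keys, delimiter="||"):
    """Return a hex MD5 digest computed over normalized business key parts.

    Assumption: parts are trimmed and upper-cased so that cosmetic
    differences in the source systems do not produce different keys.
    """
    normalized = [str(part).strip().upper() for part in business_keys]
    joined = delimiter.join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hub key: derived from a single business key (hypothetical customer id).
customer_hk = hash_key("C-1001")

# Link key: derived from the business keys of all participating hubs.
customer_order_hk = hash_key("C-1001", "SO-42")
```

The delimiter keeps composite keys unambiguous: without it, the parts `("AB", "C")` and `("A", "BC")` would concatenate to the same string and collide.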

Business Vault tables can be organized by data domains - which serve as an enterprise “central repository” of standardized cleansed data. (View Highlight)

Point-in-Time tables as well as Bridge tables in the Gold/Presentation layer on top of the Business Data Vault (View Highlight)

As business processes change and adapt, the Data Vault model can be easily extended without the massive refactoring that dimensional models require. Additional hubs (subject areas) can be easily added to links (pure join tables), and additional satellites (e.g. customer segmentations) can be added to a hub (e.g. customer) with minimal changes. (View Highlight)
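The extensibility claim can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for lakehouse tables, and every table and column name (`hub_customer`, `sat_customer_details`, `sat_customer_segmentation`, and so on) is hypothetical. The point is only that a new satellite attaches to an existing hub without altering it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Initial model: one hub and one satellite (hypothetical names).
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,
    customer_id TEXT,
    load_ts     TEXT
);
CREATE TABLE sat_customer_details (
    customer_hk TEXT,
    load_ts     TEXT,
    name        TEXT,
    PRIMARY KEY (customer_hk, load_ts)
);
""")
conn.execute("INSERT INTO hub_customer VALUES ('hk1', 'C-1001', '2023-01-01')")
conn.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2023-01-01', 'Ada')")

schema_query = "SELECT sql FROM sqlite_master WHERE name = 'hub_customer'"
hub_schema_before = conn.execute(schema_query).fetchone()[0]

# Extension: add a segmentation satellite; the hub is not touched.
conn.execute("""
CREATE TABLE sat_customer_segmentation (
    customer_hk TEXT,
    load_ts     TEXT,
    segment     TEXT,
    PRIMARY KEY (customer_hk, load_ts)
)
""")
conn.execute(
    "INSERT INTO sat_customer_segmentation VALUES ('hk1', '2023-02-01', 'premium')"
)

hub_schema_after = conn.execute(schema_query).fetchone()[0]

# Existing and new attributes are reached through the same hub key.
rows = conn.execute("""
SELECT h.customer_id, d.name, s.segment
FROM hub_customer AS h
JOIN sat_customer_details AS d ON d.customer_hk = h.customer_hk
LEFT JOIN sat_customer_segmentation AS s ON s.customer_hk = h.customer_hk
""").fetchall()
```

Because the new attributes live in their own table, existing loads and queries against `hub_customer` and `sat_customer_details` keep working unchanged.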

Also, loading a dimensional-model data warehouse in the Gold layer becomes easier (View Highlight)

• Reduce the optimize.maxFileSize to a lower number, such as 32-64 MB vs. the default of 1 GB. By creating smaller files, you can benefit from file pruning and minimize the I/O needed to retrieve the data you join.
• A Data Vault model has comparatively more joins, so use the latest version of DBR, which ensures that Adaptive Query Execution is ON by default so that the best join strategy is chosen automatically. Use join hints only if necessary (for advanced performance tuning). (View Highlight)
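The two tuning tips above could be captured as session settings, sketched below. The fully qualified keys `spark.databricks.delta.optimize.maxFileSize` and `spark.sql.adaptive.enabled` are assumptions, since the highlight only names `optimize.maxFileSize`. The helper `apply_tuning` is hypothetical and takes any setter callable, so the snippet runs without a Spark session.

```python
# Assumed fully qualified configuration keys; the highlight itself
# only mentions "optimize.maxFileSize" and Adaptive Query Execution.
DATA_VAULT_TUNING = {
    # 32 MB target file size, the low end of the suggested 32-64 MB range.
    "spark.databricks.delta.optimize.maxFileSize": str(32 * 1024 * 1024),
    # Usually already on by default in recent runtimes; set explicitly
    # here only to make the intent visible.
    "spark.sql.adaptive.enabled": "true",
}


def apply_tuning(set_conf, settings=DATA_VAULT_TUNING):
    """Apply each setting through the given setter, e.g. spark.conf.set."""
    for key, value in settings.items():
        set_conf(key, value)
    return dict(settings)


# Dry run against a plain dict instead of a live Spark session.
recorded = {}
apply_tuning(recorded.__setitem__)
```

On a cluster, the same helper would be called as `apply_tuning(spark.conf.set)`, assuming `spark` is the active session.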
