Metadata
- Author: Chad Sanderson
- Full Title:: Data Contracts for the Warehouse
- Category:: 🗞️Articles
- Document Tags:: Data Contracts
- URL:: https://open.substack.com/pub/dataproducts/p/data-contracts-for-the-warehouse?utm_source=direct&utm_campaign=post&utm_medium=web
- Finished date:: 2023-01-31
Highlights
data contracts are abstract: an interface for data that describes schema and semantics, and should not be tied to any one tool or implementation (View Highlight)
an example of a simple Orders table contract defined using protobuf (View Highlight)
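The protobuf definition itself isn't captured in the highlight; a minimal sketch of what such an Orders contract might look like, with hypothetical field names and the semantics carried in comments:

```protobuf
// Hypothetical sketch of an Orders contract: the message is the schema,
// the comments carry the semantics. All field names are illustrative.
syntax = "proto3";

message Order {
  string order_id = 1;          // unique, never null
  string customer_id = 2;       // must reference an existing customer
  int64 order_total_cents = 3;  // non-negative, stored in cents
  string created_at = 4;        // ISO-8601 UTC timestamp
}
```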
Using tools like dbt, we can materialize a table in a dev environment to verify the schema and values of the table adhere to the contract (View Highlight)
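As a sketch of what that verification could look like in dbt (assuming a model named `orders` and using only built-in tests; all names are hypothetical), the contract's constraints can be expressed as schema tests that run against the dev materialization:

```yaml
# models/schema.yml — hypothetical dbt tests mirroring the contract
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: order_total_cents
        tests:
          - not_null
```

Running `dbt build --select orders --target dev` would then materialize the model in the dev environment and execute these tests against it.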
We use the Confluent (Kafka) Schema Registry to store contracts for the data warehouse (View Highlight)
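Storing a contract can be as simple as registering its schema under a subject via the Schema Registry's REST API; a minimal sketch, assuming a local registry and a hypothetical subject name:

```python
# Sketch: register a contract's schema with a Confluent Schema Registry
# over its REST API. The registry URL and subject name are hypothetical.
import json
import requests

REGISTRY_URL = "http://localhost:8081"
SUBJECT = "warehouse.orders-value"  # hypothetical subject for the Orders table

order_contract = """
syntax = "proto3";
message Order {
  string order_id = 1;
  int64 order_total_cents = 2;
}
"""

resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": order_contract, "schemaType": "PROTOBUF"}),
)
resp.raise_for_status()
print("registered schema id:", resp.json()["id"])
```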
Once we’ve run integration tests to verify the contract can be met in the data warehouse, we take the schemas of the tables under contract and use the production schema registry to check for backward-incompatible changes (View Highlight)
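The backward-compatibility check maps onto the registry's compatibility endpoint; a sketch (same hypothetical registry and subject as above) that gates deployment on the result:

```python
# Sketch: test a proposed schema against the latest registered version.
# A False result means the change would break downstream consumers.
import json
import requests

REGISTRY_URL = "http://localhost:8081"
SUBJECT = "warehouse.orders-value"

proposed_schema = """
syntax = "proto3";
message Order {
  string order_id = 1;
  int64 order_total_cents = 2;
  string created_at = 3;  // adding a new field is backward compatible
}
"""

resp = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": proposed_schema, "schemaType": "PROTOBUF"}),
)
resp.raise_for_status()
if not resp.json()["is_compatible"]:
    raise SystemExit("backward-incompatible change; refusing to deploy")
```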
Using dbt and Airflow, we move data through our data warehouse in scheduled batches. When a data contract is created for a specific table, a process reads the contract and generates a corresponding set of monitors. These checks run immediately after the data is updated, before the next transformation, so we can detect bad data and stop it from propagating through the rest of the pipeline. (View Highlight)
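A sketch of how that gating might be wired in Airflow 2.x (all DAG, task, and monitor names are hypothetical): the contract-derived checks run as a task between the model update and the downstream transforms, so a check failure blocks the rest of the pipeline:

```python
# Sketch, assuming Airflow 2.x and hypothetical task names: a failing
# contract check fails its task, so downstream transforms never run.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator


def run_contract_checks(**_):
    # Hypothetical monitors derived from the contract; in practice these
    # would be generated from the contract and run against the warehouse.
    monitors = [
        ("order_id is never null", True),   # placeholder results
        ("order_total_cents >= 0", True),
    ]
    failures = [name for name, passed in monitors if not passed]
    if failures:
        # Raising fails this task, so Airflow never schedules the
        # downstream transforms and bad data does not propagate.
        raise ValueError(f"contract violations: {failures}")


with DAG(
    dag_id="orders_pipeline",  # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    update = EmptyOperator(task_id="run_dbt_orders_model")  # stands in for the dbt run
    check = PythonOperator(
        task_id="check_orders_contract",
        python_callable=run_contract_checks,
    )
    transform = EmptyOperator(task_id="downstream_transforms")

    update >> check >> transform
```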