Metadata
- Author: Gleb Mezhanskiy
- Full Title:: The Best Data Contract Is the Pull Request
- Category:: 🗞️Articles
- Document Tags:: Data Contracts
- URL:: https://www.datafold.com/blog/the-best-data-contract-is-the-pull-request
- Finished date:: 2023-01-31
Highlights
This data contract is essentially a structured description of a table that can be published somewhere and consumed programmatically. What can we do with it? (View Highlight)
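As a rough illustration of what such a structured, machine-readable description could look like, the sketch below models a contract with plain Python dataclasses and round-trips it through JSON. The `ColumnSpec` and `TableContract` classes, the `analytics.orders` table, and its columns are hypothetical and are not taken from the article.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """One column of the table (hypothetical shape)."""
    name: str
    dtype: str
    nullable: bool = False


@dataclass(frozen=True)
class TableContract:
    """Structured description of a table that can be published as JSON."""
    table: str
    owner: str
    columns: tuple = ()

    def to_json(self) -> str:
        # asdict() recurses into the nested ColumnSpec instances.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "TableContract":
        data = json.loads(raw)
        cols = tuple(ColumnSpec(**c) for c in data["columns"])
        return cls(table=data["table"], owner=data["owner"], columns=cols)


# Hypothetical contract for an orders table.
orders_contract = TableContract(
    table="analytics.orders",
    owner="data-platform",
    columns=(
        ColumnSpec("order_id", "bigint"),
        ColumnSpec("customer_id", "bigint"),
        ColumnSpec("amount", "numeric", nullable=True),
    ),
)

published = orders_contract.to_json()            # what a producer would publish
consumed = TableContract.from_json(published)    # what a consumer would load
```

Because the contract is just data, the same JSON could feed a validator, a catalog, or a code generator.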
In development, data contracts help us prevent breaking changes by validating the new versions of data against the contract. For example, we can have a CI (e.g. GitHub Actions) check the schema of the table against the contract and raise an error if someone attempts to modify the schema without updating the contract. (View Highlight)
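The CI check described above can be sketched as a small comparison between the contract and the schema a change would produce. This is a minimal sketch under stated assumptions: the `CONTRACT` mapping, the column names, and the `schema_violations` and `ci_exit_code` helpers are hypothetical, and a real check would read the actual schema from the warehouse rather than from a literal.

```python
# Hypothetical contract: column name -> expected type.
CONTRACT = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "amount": "numeric",
}


def schema_violations(contract: dict, actual: dict) -> list:
    """Return human-readable differences between the contract and a schema."""
    problems = []
    for col, expected in contract.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != expected:
            problems.append(f"type changed: {col} {expected} -> {actual[col]}")
    # Columns added without updating the contract also count as violations.
    for col in sorted(actual.keys() - contract.keys()):
        problems.append(f"column not in contract: {col}")
    return problems


def ci_exit_code(problems: list) -> int:
    """Non-zero exit code makes the CI job fail."""
    return 1 if problems else 0


# Hypothetical schema produced by a pull request that edits the table.
proposed_schema = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "amount": "text",    # type changed without updating the contract
    "coupon": "text",    # new column not declared in the contract
}

violations = schema_violations(CONTRACT, proposed_schema)
for line in violations:
    print(line)
```

A CI step would call `sys.exit(ci_exit_code(violations))` so that the pull request cannot merge until either the schema or the contract is brought back in line.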
E.g. Dagster, a competitor to Airflow, allows you to define the inputs and outputs of any task so that they can be verified before the task actually runs, addressing one of Airflow’s chronic pain points. (View Highlight)
The primary difference between having data contracts in place vs. a catalog is that a catalog is descriptive, whereas contracts are both descriptive and prescriptive, i.e. they define not just how the data looks, but how it must look. (View Highlight)
Catalogs are made for humans, and contracts are made primarily for machines (so that they can do the hard work of validation for humans). However, the metadata from data contracts is an excellent source of information for a data catalog to help people discover and understand the data better. (View Highlight)
At the moment it seems unlikely that modern data stack vendors, who mostly speak to each other in SQL, API calls or pointers to database tables, would adopt a unified data contract interface, unless emerging platform frameworks such as dbt establish one and force it upon everyone. (View Highlight)
- If every part of the data processing pipeline is version-controlled
- If one can easily know the impact of every change (Know What You Ship principle)
- If every change is staged and reviewed as a pull request before getting into production
Then you can achieve data reliability without implementing data contracts throughout the entire data platform! (View Highlight)