- Tags:: 🗞️Articles, Data Quality, Data culture
- Author:: Krishna Puttaswamy and Suresh Srinivas
- Link:: Uber’s Journey Toward Better Data Culture From First Principles | Uber Engineering Blog
- Source date:: 2021-03-16
- Finished date:: 2021-04-24
- Data Quality Checks from Uber:
    - Freshness: the delay between when data is produced and when it is 99.9% complete in the destination system. The measure includes a completeness watermark (default: three 9s, i.e. 99.9%), because optimizing for freshness without considering completeness leads to poor-quality decisions.
    - Completeness: the percentage of rows in the destination system relative to the number of rows in the source system.
    - Duplication: the percentage of rows with duplicate primary or unique keys. The default threshold is 0% duplication in raw data tables, with a small percentage of duplication allowed in modeled tables.
    - Cross-data-center consistency: the percentage of data loss when the copy of a dataset in the current data center is compared with the copy in another data center.
    - Semantic checks: capture critical properties of fields in the data, such as null/not-null, uniqueness, number of distinct values, and range of values.
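
To make the five definitions concrete, here is a minimal Python sketch of how each check could be computed. It is not Uber's implementation: the function names, the inputs (row counts, key lists, arrival timestamps), and the choice to count every row that shares a key as a duplicate are all assumptions made for illustration.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def freshness(produced_at, arrival_times, watermark=0.999):
    """Delay until `watermark` of the rows have landed in the destination.

    Assumption: `arrival_times` holds one destination arrival time per row.
    """
    arrivals = sorted(arrival_times)
    if not arrivals:
        raise ValueError("no rows have arrived yet")
    # Number of rows that must be present to reach the watermark.
    needed = max(1, math.ceil(watermark * len(arrivals)))
    return arrivals[needed - 1] - produced_at


def completeness(source_rows, dest_rows):
    """Destination row count as a percentage of the source row count."""
    if source_rows == 0:
        return 100.0
    return 100.0 * dest_rows / source_rows


def duplication(keys):
    """Percentage of rows whose primary/unique key occurs more than once.

    Assumption: every row in a duplicated group counts, not just the extras.
    """
    keys = list(keys)
    if not keys:
        return 0.0
    counts = Counter(keys)
    dup_rows = sum(c for c in counts.values() if c > 1)
    return 100.0 * dup_rows / len(keys)


def cross_dc_loss(local_keys, remote_keys):
    """Percentage of the other data center's rows missing from the local copy."""
    remote = set(remote_keys)
    if not remote:
        return 0.0
    missing = remote - set(local_keys)
    return 100.0 * len(missing) / len(remote)


def semantic_profile(values):
    """Null count, uniqueness, distinct count, and value range for one field."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(distinct),
        "is_unique": len(distinct) == len(non_null),
        "min": min(non_null, default=None),
        "max": max(non_null, default=None),
    }
```

Each function returns a raw measurement. Comparing it against a threshold (for example, 0% duplication for raw tables versus a small allowance for modeled tables) would be a separate step on top.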