- Tags:: 📚Books, Modern Data Stack, Data Engineering
- Author:: Dave Fowler and Matthew C. David (ChartIO)
- Liked:: 4
- Link:: The Informed Company
- Source date:: 2021-10-22
- Finished date:: 2021-11-01
- Cover::
Not bad as a kind of executive summary of the modern data stack, but if you are at all immersed in this space it won't add much. Also, it was edited by people who work at dbt, so expect a fair amount of bias.
Focused on “analytics engineering”:
This book is not for AI‐enabled teams and does not cover AI workflows, machine learning models, or real‐time operational use cases. Instead, its goal is to provide best practices for building and maintaining a robust data analytics stack (i.e. the analytics foundation on which an AI workflow can be built).
Their view of data lakes, data warehouses, and the lakehouse as the convergence of data lake and data warehouse:
We build data marts to fight growing entropy (slow team velocity vs. entropy and needs):
given enough time, hundreds of tables accumulate in a warehouse. Users become overwhelmed when trying to find relevant data. It’s also possible that, depending on the team, department, or use case, different people want to use the same data structured in different ways. So while the meanings of individual fields are unified, the abstractions used by different departments have diverged. To sort through these challenges, we progress to the data mart stage. (p. xxxiv)
Speaking of entropy… dashboard entropy.
People tend to keep adding more and more charts to existing dashboards. This leads to cluttered interfaces and less accessible information to non‐subject matter experts (…) Generally, it’s best to organize a dashboard around a single question or goal and then break out multiple dashboards for follow‐up questions and analyses. (p. 19)
They reference another “mini book” of theirs that might be interesting: How to design a dashboard.
Custom extract-and-load is not worth it:
The biggest reason DIY is generally a wrong choice is that extracting and loading costs more when data scientists, analysts, and engineers do it rather than a third‐party service provider (p. 54)
Also, the potential need to write new code and maintain existing code when data sources update is high. We recommend avoiding manual extract and load; use tools like Fivetran or Stitch, which automatically handle data source updates so that any data engineers can focus on more critical tasks (p. 60).
Sources change all the time, and ELT tools manage these changes (…) it very well may be unavoidable to track down broken queries and update them to work with a new version of the API (p. 60).
Unlike the crusty Metabase, other visualization tools have “smart refresh”, to avoid endlessly refreshing a dashboard that nobody is actively viewing (p. 67).
On the views that serve as the interface to the source tables…
Even if we were keeping the entire table, instead of writing a SELECT * FROM [tablename] query, we should write out all the columns that should be kept. This will make it easier to edit in the future and prevent new columns from being added in without us knowing (p. 88)
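A minimal sketch of that pattern, using sqlite3 and a hypothetical `raw_users` table so it runs standalone:

```python
import sqlite3

# Hypothetical source table; the point is that the interface view
# spells out its columns instead of using SELECT *.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_users (id INTEGER, email TEXT, full_name TEXT, internal_debug_blob TEXT);
INSERT INTO raw_users VALUES (1, 'a@example.com', 'Ann', 'junk');

-- Explicit column list: a column added to raw_users later will NOT
-- silently show up downstream, and future edits stay obvious in review.
CREATE VIEW users AS
SELECT id, email, full_name
FROM raw_users;
""")
print(conn.execute("SELECT * FROM users").fetchall())
# -> [(1, 'a@example.com', 'Ann')]
```

If a new column is later added to `raw_users`, the `users` view keeps exposing exactly the three columns it declares.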
In the typical debate between full-on Kimball dimensional modeling and OBT (One Big Table), they favor coexistence:
create a wide table (materialized as a view) that contains pre‐joined results. These wide tables can sit alongside the normalized version in the data warehouse (p. 127)
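A sketch of that coexistence, again with hypothetical tables in sqlite3: the normalized `orders` and `customers` stay as they are, and a pre-joined wide view sits next to them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized tables (hypothetical schema).
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Acme');
INSERT INTO orders VALUES (10, 1, 99.5);

-- "One Big Table" materialized as a view: pre-joined results that
-- coexist with the normalized tables instead of replacing them.
CREATE VIEW orders_wide AS
SELECT o.order_id, o.amount, c.customer_id, c.name AS customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
""")
print(conn.execute("SELECT * FROM orders_wide").fetchall())
# -> [(10, 99.5, 1, 'Acme')]
```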
On business metrics: it's fine to have metrics in the views/tables, but avoid pre-aggregation unless it's necessary because computing them on the fly would be very expensive, and keep the dashboards standardized (p. 130).
On keeping historical data, Slowly Changing Dimensions and all that, the book is sparse. Its only prescription is taking snapshots of the data with dbt, plus a very brief allusion (without naming it) to Change Data Capture (p. 133).
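For reference, a dbt snapshot roughly takes this shape (hypothetical `app.users` source; the config keys shown are the standard timestamp-strategy ones, but check the dbt docs for your version):

```sql
{% snapshot users_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('app', 'users') }}

{% endsnapshot %}
```

On each run, dbt compares rows by `unique_key` and `updated_at` and appends a new row version with validity timestamps, effectively a type-2 Slowly Changing Dimension.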
And finally, be strict about naming conventions for tables, columns, and values to prevent misuse:
The style rule of pre‐pending “deprecated_” to fields is the best way to manage analysts aggregating it for metrics because it’s apparent to everyone that the data should not be used. It’s also worth letting users know that these fields and metrics are no longer useful through email or with any BI tools, to avoid catching anyone off guard. Again, naming conventions play an integral role in keeping users from querying data warehouse objects incorrectly. (p. 147)
Change Flags and Cryptic Abbreviations to Meaningful Values (e.g., booleans to “True” and “False” strings).
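A small sketch combining both conventions (hypothetical codes and columns): cryptic values get translated in the interface view, and a retired field is renamed with the `deprecated_` prefix:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_accounts (id INTEGER, is_active INTEGER, plan_cd TEXT, legacy_score REAL);
INSERT INTO raw_accounts VALUES (1, 1, 'ent', 0.4), (2, 0, 'sb', 0.9);

-- Interface view: flags and cryptic abbreviations become
-- self-describing values, and a field nobody should aggregate
-- anymore is flagged with the deprecated_ prefix.
CREATE VIEW accounts AS
SELECT
  id,
  CASE is_active WHEN 1 THEN 'True' ELSE 'False' END AS is_active,
  CASE plan_cd WHEN 'ent' THEN 'Enterprise'
               WHEN 'sb'  THEN 'Small Business'
               ELSE 'Unknown' END AS plan,
  legacy_score AS deprecated_legacy_score
FROM raw_accounts;
""")
print(conn.execute("SELECT * FROM accounts").fetchall())
# -> [(1, 'True', 'Enterprise', 0.4), (2, 'False', 'Small Business', 0.9)]
```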