- Tags:: 📚Books, Data Engineering
- Author::
- Liked::
- Link::
- Source date::
- Finished date::
- Cover::
Why did I want to read it?
What did I get out of it?
Preface
I know plenty of people like this! Their impotence as data scientists moves them upstream, to the origin of their data:
How did this book come about? The origin is deeply rooted in our journey from data science into data engineering. We often jokingly refer to ourselves as recovering data scientists. (p. xiii)
1. Data Engineering described
Note some of the definitions in the book. From them, one could think that data engineering is not a subset of software engineering but the reverse:
Data engineering is all about the movement, manipulation, and management of data. —Lewis Gavin (p. 4)
What is Dan Ariely doing here?
Dan Ariely tweeted, “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” (p. 8)
We are moving away from the Hadoop world; it takes too much effort to maintain:
Despite the power and sophistication of open source big data tools, managing them was a lot of work and required constant attention. Often, companies employed entire teams of big data engineers, costing millions of dollars a year, to babysit these platforms. Big data engineers often spent excessive time maintaining complicated tooling and arguably not as much time delivering the business’s insights (…) Whereas data engineers historically tended to the low-level details of monolithic frameworks such as Hadoop, Spark, or Informatica, the trend is moving toward decentralized, modularized, managed, and highly abstracted tools (p. 9-10)
Nowadays, the data-tooling landscape is dramatically less complicated to manage and deploy. Modern data tools considerably abstract and simplify workflows. As a result, data engineers are now focused on balancing the simplest and most cost-effective, best-of-breed services that deliver value to the business. (p. 13)
And SQL is back, alongside non-SQL tools:
The advent of MapReduce and the big data era relegated SQL to passé status. Since then, various developments have dramatically enhanced the utility of SQL in the data engineering lifecycle. Spark SQL, Google BigQuery, Snowflake, Hive, and many other data tools can process massive amounts of data by using declarative, set-theoretic SQL semantics. SQL is also supported by many streaming frameworks, such as Apache Flink, Beam, and Kafka. We believe that competent data engineers should be highly proficient in SQL. (p. 20)
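A minimal sketch of that declarative, set-based style, here through PySpark's SQL interface (the `orders` table and its columns are invented for illustration); much the same query would run on BigQuery, Snowflake, or Flink SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# A toy table registered as a view so that plain SQL can query it.
orders = spark.createDataFrame(
    [("2024-01-01", "EU", 120.0),
     ("2024-01-01", "US", 80.0),
     ("2024-01-02", "EU", 60.0)],
    ["order_date", "region", "amount"],
)
orders.createOrReplaceTempView("orders")

# Declarative SQL over a distributed engine: describe the result set,
# let the engine plan the execution.
spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""").show()
```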
without going too far:
A proficient data engineer also recognizes when SQL is not the right tool for the job and can choose and code in a suitable alternative. A SQL expert could likely write a query to stem and tokenize raw text in a natural language processing (NLP) pipeline but would also recognize that coding in native Spark is a far superior alternative to this masochistic exercise. (p. 21)
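For contrast, a sketch of the "native Spark" alternative using the built-in `RegexTokenizer` (the sample text is made up; stemming would similarly plug in as a UDF wrapping a library such as NLTK, rather than being contorted into SQL):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder.appName("nlp-sketch").getOrCreate()

docs = spark.createDataFrame(
    [(1, "Everyone talks about big data; nobody really knows how to do it.")],
    ["id", "text"],
)

# One native transformer replaces what would be a masochistic tangle
# of SPLIT/REGEXP_EXTRACT calls in pure SQL.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
tokenizer.transform(docs).select("id", "tokens").show(truncate=False)
```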
There is a three-stage data maturity model (very similar to others, such as the one in 📖 Data Means Business), but the key is to keep your eyes on the prize (business use cases; talk to people!) and:
Avoid undifferentiated heavy lifting. Don’t box yourself in with unnecessary technical complexity. Use off-the-shelf, turnkey solutions wherever possible. (p. 15)
The main bottleneck for scaling is not cluster nodes, storage, or technology but the data engineering team. Focus on solutions that are simple to deploy and manage to expand your team’s throughput. (p. 16)
Technology distractions are a more significant danger here than in the other stages. There’s a temptation to pursue expensive hobby projects that don’t deliver value to the business. Utilize custom-built technology only where it provides a competitive advantage. (p. 17)
You want “Type A data engineers” (the “buy” ones) for the foundation, and “Type B data engineers” (the “build” ones) when you are more mature. (p. 22)
2. The Data Engineering lifecycle
…
3. Designing good data architecture
Not really data-specific. They even quote 📖 Fundamentals of Software Architecture. An Engineering Approach. These are their principles:
- Choose common components wisely.
- Plan for failure.
- Architect for scalability.
- Architecture is leadership.
- Always be architecting.
- Build loosely coupled systems.
- Make reversible decisions (one-way vs. two-way doors).
- Prioritize security.
- Embrace FinOps.
Data lakes vs. warehouses (Examples and types of data architecture)
I was particularly interested in why someone would want a data lake as the central piece of their architecture nowadays, even a managed lakehouse such as Databricks (much less something built on your own, like this DIY on GCP: Open data lakehouse on Google Cloud | Google Cloud Blog), compared to the simplicity of a cloud warehouse such as BigQuery or Snowflake. Is it just about cost? I didn't find anything surprising here:
- Traditional data lakes ended up being super expensive:
Cheap, off-the-shelf hardware would replace custom vendor solutions. In reality, big data costs ballooned as the complexities of managing Hadoop clusters forced companies to hire large teams of engineers at high salaries. (p. 102)
- We have convergence: cloud data warehouses now allow text, JSON, and querying from cloud storage, and data lake offerings add ACID transactions, tables… (Lakehouse. Convergence of Data Lake and Data Warehouse). And of course, you may mix and match from a plethora of data services from your cloud vendor. Though, who else offers a really integrated lakehouse apart from Databricks?
- Regarding the Modern Data Stack, it is the way to go:
Regardless of where “modern” goes (we share our ideas in Chapter 11), we think the key concept of plug-and-play modularity with easy-to-understand pricing and implementation is the way of the future. (p. 104)
- The key advantage of a lakehouse (Lakehouse. Convergence of Data Lake and Data Warehouse) is interoperability:
It’s much easier to exchange data between tools when stored in an open file format. Reserializing data from a proprietary database format incurs overhead in processing, time, and cost. In a data lakehouse architecture, various tools can connect to the metadata layer and read data directly from object storage. (p. 217)
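To make the open-format point concrete, a minimal sketch with Parquet and two independent engines, PyArrow and DuckDB (file and column names are invented); each engine reads the same bytes directly, with no reserialization from a proprietary format:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Write once, in an open file format (Parquet).
events = pa.table({"user_id": [1, 2, 2], "event": ["view", "view", "buy"]})
pq.write_table(events, "events.parquet")

# Read the same file from two different tools.
print(pq.read_table("events.parquet").num_rows)  # PyArrow
duckdb.sql(
    "SELECT event, count(*) AS n FROM 'events.parquet' GROUP BY event"
).show()  # DuckDB
```

In a full lakehouse, a metadata/table layer (Delta Lake, Iceberg, Hudi) sits on top of such open files to add the ACID guarantees mentioned above.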
(To be continued)