Pretty flat :/

Team roles

Data team roles. (p. 91)

No full stacks:

There are some “full-stack” data scientists out there, but very few and far between (p. 58).

… you’ll need two to five data engineers for each data scientist. This ratio is so high because 80 to 90 percent of the work for tasks fits into data engineering tasks rather than data science tasks (p. 98).

But no strict division either:

The same team is responsible for both the data pipeline code and keeping it running. This method is chosen to prevent the quintessential “throw it over the fence” problems that have long existed between developers and operations staff, where developers create code of questionable quality and the operations team is forced to deal with the problems (p. 15).

Other notes

Related to the Data Eng. roadmap:

For organizations that previously had only a data science team, the first task that the new data engineering team should take on is cleaning up the technical debt from the data science team (p. 58).

On Nobody wants to be a Data Engineer, and Measuring a data team impact:

Data engineering and —especially— operations won’t get the credit they deserve unless management makes a concerted effort to educate others (p. 103).

Suggested KPIs for Data Engineering (p. 133), on Measuring a data team impact (which, to be honest, is again a bit disappointing):

  • Improvement in data quality.
  • Socialization of data internally and/or externally.
  • Increased self-service data for internal consumption.
  • (For new data engineering teams) Enabling the data science team to do previously impossible tasks.
  • Improvements to data scientist quality of work life.
  • Increased automation of machine learning, code deployment, or code creation.

Metadata service importance:

While many firms went begrudgingly into their versions of GDPR compliance, one segment realized upside. If you look across Uber, Lyft, Netflix, LinkedIn, Stitch Fix, and other firms roughly in that level of maturity, they each have an open source project regarding a knowledge graph of metadata about dataset usage: Amundsen, Data Hub, Marquez, and so on (p. 115).

To argue for good Model serving:

However, having data science deliver ad hoc models to production is not recommended. Models are data, so they should be subject to the same governance that other data receives, such as data security, auditing, and traceability (p. 117).
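
To make "models are data" a bit more concrete, here is a minimal sketch of what a traceability record for a model artifact could look like. This is my own illustration, not something from the book: the `audit_record` function, its field names, and the example values are all hypothetical. The only idea borrowed from the quote is that a model gets the same auditing treatment as any other dataset.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(model_bytes: bytes, name: str, owner: str, training_data: str) -> dict:
    """Build a traceability record for a serialized model (hypothetical schema)."""
    return {
        "name": name,
        "owner": owner,
        # Which dataset the model was trained on, so lineage can be followed.
        "training_data": training_data,
        # Content hash: lets anyone verify which exact artifact is in production.
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        # When the artifact was registered, in UTC.
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder bytes stand in for a real serialized model.
record = audit_record(
    b"fake-model-weights",
    name="churn-model",
    owner="data-science",
    training_data="warehouse.events_2023",
)
print(json.dumps(record, indent=2))
```

An ad hoc model copied to a server has none of this; a model that goes through a registration step like the one sketched here can at least be audited later.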

On writing before starting the project:

We go through every nook and cranny of what they [business] want to do and the business reasons for doing it. Often, the data engineers are getting anxious to start whiteboarding and diagramming out the architecture (p. 126).

A twist on Boring technology:

Learn to only select technology that you hate and look to hate more technology every day (…). I find that the best technology selection is when you know fully well what is bad about a technology but it still is the best technology for the task.