Highlights

we have observed that most data pipelines would ideally be expressed with a combination of both relational queries and complex procedural algorithms. Unfortunately, these two classes of systems—relational and procedural—have until now remained largely disjoint (View Highlight)

Not sure what this means

to our knowledge one of the most widely-used systems with a “language-integrated” API similar to DryadLINQ [20] (View Highlight)

significantly easier for users to work with thanks to their integration in a full programming language. For example, users can break up their code into Scala, Java or Python functions that pass DataFrames between them to build a logical plan, and will still benefit from optimizations across the whole plan when they run an output operation (View Highlight)
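
A minimal sketch of what this looks like in practice (the function names and the input path are illustrative, not from the paper): each Scala function only adds steps to the logical plan, and nothing executes until the output operation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Each function contributes one step of the logical plan; nothing executes yet.
def loadEvents(spark: SparkSession): DataFrame =
  spark.read.json("/data/events")          // hypothetical input path

def onlyErrors(events: DataFrame): DataFrame =
  events.filter(col("level") === "ERROR")

def perService(errors: DataFrame): DataFrame =
  errors.groupBy("service").count()

// Only the output operation triggers optimization across the whole composed plan,
// so e.g. the filter can be optimized together with the aggregation, regardless
// of the function boundaries the code was split into.
perService(onlyErrors(loadEvents(spark))).show()
```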

one can define UDFs that operate on an entire table by taking its name, as in MADLib [12], and use the distributed Spark API within them (View Highlight)
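
A sketch of the idea, assuming a hypothetical `summarize` function and table name; in stock Spark this is an ordinary Scala function rather than a registered SQL UDF, but it matches the MADlib-style pattern the paper describes:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// A "whole-table" function: it receives a table *name* rather than rows,
// looks the table up, and runs distributed Spark code over it.
def summarize(spark: SparkSession, tableName: String): DataFrame =
  spark.table(tableName).describe()   // distributed summary statistics

summarize(spark, "sales").show()      // "sales" is a hypothetical table
```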

we wanted to enable external developers to extend the optimizer—for example, by adding data source specific rules that can push filtering or aggregation into external storage systems, or support for new data types. Catalyst supports both rule-based and cost-based optimization (View Highlight)
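
In later Spark versions (postdating the paper) the public hook for this is `spark.experimental.extraOptimizations`. A minimal sketch of an extra optimizer rule, assuming the illustrative simplification that `UPPER(LOWER(x))` is equivalent to `UPPER(x)`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Lower, Upper}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: drop a LOWER that is immediately re-uppercased.
object SimplifyUpperLower extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Upper(Lower(child)) => Upper(child)
  }
}

val spark = SparkSession.builder().getOrCreate()
// Register the rule so the optimizer applies it to every query.
spark.experimental.extraOptimizations ++= Seq(SimplifyUpperLower)
```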

This memory format is exclusive to Spark and is not Arrow, although Arrow can be used to convert to Pandas DFs

Spark SQL can materialize (often referred to as “cache”) hot data in memory using columnar storage. Compared with Spark’s native cache, which simply stores data as JVM objects (View Highlight)
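
In today's API this is `spark.catalog.cacheTable` (or equivalently `df.cache()`); a short sketch with an illustrative path and table name:

```scala
// Materialize a hot table in Spark SQL's in-memory columnar format.
val logs = spark.read.parquet("/data/logs")   // hypothetical input path
logs.createOrReplaceTempView("logs")
spark.catalog.cacheTable("logs")              // columnar cache, not plain JVM objects

// Subsequent queries read from the compact columnar representation.
spark.sql("SELECT level, COUNT(*) FROM logs GROUP BY level").show()
```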

Functional languages were designed in part to build compilers, so we found Scala well-suited to this task. Nonetheless, Catalyst is, to our knowledge, the first production-quality query optimizer built on such a language. (View Highlight)

rules may need to execute multiple times to fully transform a tree (View Highlight)
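
A self-contained toy illustration of why (modeled on the paper's Add/Literal constant-folding example, with names of my own): a single top-down pass over `(1 + 2) + 3` folds the inner sum but exposes a new foldable node, so the rule is reapplied until the tree stops changing, i.e. reaches a fixed point.

```scala
// Toy expression tree in the spirit of Catalyst's Add/Literal example.
sealed trait Expr {
  // Apply a rule at this node, then recurse into the children of the result.
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val applied = if (rule.isDefinedAt(this)) rule(this) else this
    applied match {
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case other     => other
    }
  }
}
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Constant folding: an Add of two literals becomes one literal.
val constantFold: PartialFunction[Expr, Expr] = {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}

// (1 + 2) + 3: the first pass only folds the inner Add, producing 3 + 3;
// a second pass folds that to 6, and a third confirms the fixed point.
var tree: Expr = Add(Add(Lit(1), Lit(2)), Lit(3))
var last: Expr = null
while (tree != last) { last = tree; tree = tree.transform(constantFold) }
// tree == Lit(6)
```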

functional transformations on immutable trees make the whole optimizer very easy to reason about and debug. They also enable parallelization in the optimizer, although we do not yet exploit this. (View Highlight)

In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost. All other phases are purely rule-based (View Highlight)

support code generation to speed up execution (View Highlight)

expressions such as (x+y)+1. Without code generation, such expressions would have to be interpreted for each row of data, by walking down a tree of Add, Attribute and Literal nodes. This introduces large amounts of branches and virtual function calls that slow down execution. With code generation, we can write a function to translate a specific expression tree to a Scala AST (View Highlight)

they result directly in a Scala AST instead of running the Scala parser at runtime (View Highlight)
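
The paper's illustration of this is roughly the following, reproduced here as a self-contained sketch with toy node classes (the real Catalyst nodes are more elaborate):

```scala
import scala.reflect.runtime.universe._

// Toy nodes for an expression like (x + y) + 1.
sealed trait Node
case class Literal(value: Int) extends Node
case class Attribute(name: String) extends Node
case class Add(left: Node, right: Node) extends Node

// Quasiquotes (q"...") build Scala ASTs directly, so no parser runs at runtime.
def compile(node: Node): Tree = node match {
  case Literal(value)  => q"$value"
  case Attribute(name) => q"row.get($name)"
  case Add(l, r)       => q"${compile(l)} + ${compile(r)}"
}

// compile(Add(Attribute("x"), Literal(1))) yields the AST for: row.get("x") + 1
```

The compiled function then evaluates the whole expression with ordinary arithmetic, avoiding the per-row branches and virtual calls of walking the tree.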

I don’t understand the expression interpreter

combine code-generated evaluation with interpreted evaluation for expressions we do not yet generate code for, since the Scala code we compile can directly call into our expression interpreter (View Highlight)
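
On the note above: the expression interpreter is simply the ordinary tree-walking evaluator that code generation replaces. Reusing the toy node classes from the previous sketch (with rows as a plain `Map`, purely for illustration), a minimal version looks like this:

```scala
// Tree-walking evaluation: one recursive call (and branch) per node, per row.
def interpret(node: Node, row: Map[String, Int]): Int = node match {
  case Literal(v)   => v
  case Attribute(n) => row(n)
  case Add(l, r)    => interpret(l, row) + interpret(r, row)
}

interpret(Add(Attribute("x"), Literal(1)), Map("x" -> 41))  // == 42
```

Because the generated Scala code can call straight into such an interpreter, any subtree that has no codegen rule yet can fall back to `interpret` while the rest of the expression runs compiled.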