Best Practices for Bucketing in Spark SQL

Metadata

Author: David Vrba
Full Title:: Best Practices for Bucketing in Spark SQL
Category:: 🗞️Articles
Document Tags:: spark Spark
URL:: https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53
Finished date:: 2023-05-02

Highlights

We need to save the data as a table (a simple save function is not sufficient) because the information about the bucketing needs to be saved somewhere. Calling saveAsTable ensures that the metadata is saved in the metastore (if the Hive metastore is correctly set up).

Together with bucketBy, we can also call sortBy, which sorts each bucket by the specified fields. Calling sortBy is optional.
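The two highlights above can be sketched as a small PySpark helper. This is a minimal illustration, not code from the article: the function name `write_bucketed_table` and its parameters are hypothetical, `df` is assumed to be a `pyspark.sql.DataFrame`, and the `overwrite` save mode is an arbitrary choice made for the example.

```python
def write_bucketed_table(df, table_name, num_buckets, bucket_col, sort_col=None):
    """Save `df` (assumed to be a pyspark.sql.DataFrame) as a bucketed table.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # bucketBy distributes rows into `num_buckets` buckets by `bucket_col`.
    writer = df.write.mode("overwrite").bucketBy(num_buckets, bucket_col)
    if sort_col is not None:
        # sortBy is optional; it sorts the rows inside each bucket.
        writer = writer.sortBy(sort_col)
    # saveAsTable (rather than a plain save) records the bucketing
    # metadata in the metastore.
    writer.saveAsTable(table_name)
```

With a running Spark session this could be called as, for example, `write_bucketed_table(df, "events_bucketed", 16, "user_id", sort_col="user_id")`, where the table and column names are placeholders.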

There are two main areas where bucketing can help: the first is avoiding a shuffle in queries with joins and aggregations, and the second is reducing I/O with a feature called bucket pruning.
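The idea behind bucket pruning can be illustrated with a toy model in plain Python. This is only a conceptual sketch and does not call Spark: the bucket count, the row layout, and the helper names are all hypothetical, and CRC32 stands in for Spark's own Murmur3-based hash. The point it shows is that a filter on the bucketing column only needs to read the one bucket the value hashes to.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count for the example


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash: Spark uses its own Murmur3-based hash, not CRC32.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_buckets(rows, key_field, num_buckets=NUM_BUCKETS):
    # Place every row in the bucket determined by the hash of its key.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_id(row[key_field], num_buckets)].append(row)
    return buckets


def pruned_lookup(buckets, key_field, value):
    # Only the bucket that `value` hashes to can contain matching rows,
    # so all other buckets are skipped without being read.
    target = buckets[bucket_id(value, len(buckets))]
    matches = [row for row in target if row[key_field] == value]
    return matches, len(target)


rows = [{"user_id": i, "amount": i * 10} for i in range(100)]
buckets = write_buckets(rows, "user_id")
matches, rows_scanned = pruned_lookup(buckets, "user_id", 42)
```

Here `rows_scanned` is the size of a single bucket rather than the size of the whole dataset, which is the I/O saving the highlight refers to.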

What happens if only one table is bucketed and the other is not? The answer depends on the number of buckets and the number of shuffle partitions. If the number of buckets is greater than or equal to the number of shuffle partitions, Spark will shuffle only one side of the join.
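The rule in this highlight can be written down as a small decision function. This is a sketch for reasoning about a join plan, not Spark's planner logic: the function name and the return labels are hypothetical, and the `else` branch (both sides shuffled when there are fewer buckets than shuffle partitions) is an assumption that the highlight itself does not state.

```python
def sides_to_shuffle(num_buckets, shuffle_partitions):
    """Return which sides of a join are shuffled when only one table is bucketed.

    Hypothetical helper; the labels "bucketed" and "unbucketed" are illustrative.
    """
    if num_buckets >= shuffle_partitions:
        # Stated in the highlight: only one side of the join is shuffled.
        return ["unbucketed"]
    # Assumption, not stated in the highlight: with fewer buckets than
    # shuffle partitions, the existing bucketing is not reused.
    return ["bucketed", "unbucketed"]
```

The number of shuffle partitions is controlled by the `spark.sql.shuffle.partitions` setting, so comparing it with the table's bucket count is one way to anticipate which case applies.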
