Metadata
- Author: David Vrba
- Full Title:: Best Practices for Bucketing in Spark SQL
- Category:: 🗞️Articles
- Document Tags:: spark Spark
- URL:: https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53
- Finished date:: 2023-05-02
Highlights
We need to save the data as a table (a simple save function is not sufficient) because the information about the bucketing needs to be saved somewhere. Calling saveAsTable will make sure the metadata is saved in the metastore (if the Hive metastore is correctly set up).
Together with bucketBy, we can also call sortBy, which sorts each bucket by the specified fields. Calling sortBy is optional.
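A minimal sketch of the write path described in the two highlights above, assuming PySpark and a session with a working metastore. The helper name `write_bucketed` and its parameters are hypothetical; only `mode`, `bucketBy`, `sortBy`, and `saveAsTable` are real `DataFrameWriter` methods.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only; PySpark is assumed to be installed
    # wherever this helper is actually called.
    from pyspark.sql import DataFrame


def write_bucketed(df: "DataFrame", table: str, num_buckets: int,
                   key: str, sort: bool = True) -> None:
    """Save `df` as a bucketed table (hypothetical helper, not from the article)."""
    # bucketBy distributes rows into `num_buckets` buckets by the hash of `key`.
    writer = df.write.mode("overwrite").bucketBy(num_buckets, key)
    if sort:
        # Optional: sort the rows inside each bucket by the same column.
        writer = writer.sortBy(key)
    # saveAsTable (rather than a plain save) records the bucketing
    # metadata in the metastore.
    writer.saveAsTable(table)
```

With a real session this could be called as `write_bucketed(df, "events_bucketed", 16, "user_id")`, where the table name, bucket count, and column are placeholder values.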
There are two main areas where bucketing can help: the first is avoiding a shuffle in queries with joins and aggregations; the second is reducing I/O through a feature called bucket pruning.
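To make the bucket pruning idea concrete, here is a toy model in plain Python, not Spark itself. It assumes a row's bucket is a hash of the key modulo the number of buckets; `zlib.crc32` is only a stand-in for Spark's own hash function, and all function names are made up for the illustration.

```python
import zlib
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def bucket_id(key: str, num_buckets: int) -> int:
    # Stand-in hash for illustration only; Spark uses its own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_buckets(rows: List[Row], num_buckets: int) -> Dict[int, List[Row]]:
    # Each row lands in exactly one bucket, chosen by the hash of its key.
    buckets: Dict[int, List[Row]] = {i: [] for i in range(num_buckets)}
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def lookup(buckets: Dict[int, List[Row]], key: str) -> List[int]:
    # Bucket pruning: a filter on the bucketing key only needs to read the
    # single bucket that key can hash to; the other buckets are skipped.
    target = bucket_id(key, len(buckets))
    return [value for k, value in buckets[target] if k == key]
```

The same property explains the shuffle benefit: when two tables are bucketed the same way on the join key, matching keys already sit in buckets with the same id.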
What happens if only one table is bucketed and the other is not? The answer depends on the number of buckets and the number of shuffle partitions. If the number of buckets is greater than or equal to the number of shuffle partitions, Spark will shuffle only one side of the join.
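The rule in the last highlight can be written down as a small decision function. This is a sketch of the stated rule, not of Spark's planner: the function name and labels are hypothetical, and two details go beyond the highlight as quoted (that the side being shuffled is the non-bucketed table, and that both sides are shuffled otherwise), so treat them as assumptions. The number of shuffle partitions is controlled by the `spark.sql.shuffle.partitions` setting.

```python
from typing import List


def shuffled_sides(num_buckets: int, shuffle_partitions: int) -> List[str]:
    """Which sides of a join get shuffled when only one table is bucketed."""
    if num_buckets >= shuffle_partitions:
        # Stated in the highlight: only one side is shuffled.
        # Assumption: it is the table that was not bucketed.
        return ["non_bucketed"]
    # Assumption (not in the highlight as quoted): both sides are shuffled.
    return ["bucketed", "non_bucketed"]
```

Under this rule, lowering `spark.sql.shuffle.partitions` to the bucket count before the join is one way to end up in the one-sided case.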