Metadata
- Author: robinlinacre.com
- Full Title: Demystifying Apache Arrow
- Category: 🗞️Articles
- URL: https://www.robinlinacre.com/demystifying_arrow/
- Finished date: 2023-02-06
Highlights
Faster CSV reading
Data in Arrow is stored in-memory in record batches: 2D data structures containing contiguous columns of data of equal length. A ‘table’ can be created from these batches without requiring additional memory copying, because tables can have ‘chunked’ columns (i.e. sections of data, each representing a contiguous chunk of memory). This design means that data can be read in parallel, rather than with the single-threaded approach of pandas.
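As a minimal sketch of what this looks like in practice (the file name is a placeholder), pyarrow's CSV reader parallelises the read and returns a table whose columns are built from those chunks:

```python
import pyarrow.csv as pv

# read_csv uses multiple threads by default, splitting the file into
# record batches that become the chunks of each column
table = pv.read_csv("data.csv")  # placeholder file name

# each column is a ChunkedArray: contiguous chunks, assembled with no copy
print(table.column(0).num_chunks)

# the underlying record batches can be viewed directly
for batch in table.to_batches():
    print(batch.num_rows)
```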
Faster User Defined Functions (UDFs) in PySpark
The use of Arrow almost completely eliminates the serialisation and deserialisation step, and allows data to be processed in columnar batches, meaning more efficient vectorised algorithms can be used.
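A sketch of an Arrow-backed pandas UDF in PySpark; the session config, column name, and the add_vat function are illustrative, not from the article:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# enable Arrow-based columnar data transfer between the JVM and Python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def add_vat(price: pd.Series) -> pd.Series:  # illustrative UDF
    # receives a whole columnar batch at once, so pandas' vectorised
    # arithmetic applies instead of a per-row Python loop
    return price * 1.2

df = spark.createDataFrame([(10.0,), (20.0,)], ["price"])
df.select(add_vat("price")).show()
```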
Arrow provides translators that are able to convert this into language-specific in-memory formats like the pandas dataframe.
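For example, a minimal sketch of this translation in pyarrow (the sample dataframe is illustrative):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# translate pandas -> Arrow, then back again
table = pa.Table.from_pandas(df)
df_roundtrip = table.to_pandas()
```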
This problem of translation is one that Arrow aims eventually to eliminate altogether: ideally there would be a single in-memory format for dataframes rather than each tool having its own representation.
Writing Parquet files
Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on-disk and in-memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is more desirable if you are IO-bound. This will especially be the case if you are loading data from cloud storage such as AWS S3.
The trade-offs for columnar data are different in memory than on disk: for data on disk, smaller file sizes are the priority, since IO is often the bottleneck.
This makes it desirable to have close cooperation in the development of the in-memory and on-disk formats, with predictable and speedy translation between the two representations — and this is what is offered by the Arrow and Parquet formats.
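A sketch of that round trip in pyarrow (the file name and sample data are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# persist in the compressed on-disk format...
pq.write_table(table, "example.parquet", compression="snappy")

# ...and translate back into the in-memory Arrow format
table_again = pq.read_table("example.parquet")
```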
A common (cross-language) in-memory representation of dataframes
These ideas come together in the description of Arrow as an ‘API for data’. The idea is that Arrow provides a cross-language in-memory data format and an associated query execution language that together provide the building blocks for analytics libraries. Like any good API, Arrow offers a performant solution to common problems without the user needing to fully understand the implementation.
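For instance (a sketch, with illustrative column data), pyarrow's compute module exposes optimised kernels that libraries can build on without reimplementing them:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"score": [3, 1, 4, 1, 5]})

# vectorised kernels run directly on columnar data,
# with no conversion to pandas required
total = pc.sum(table["score"])
above_two = table.filter(pc.greater(table["score"], 2))
```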
This paves the way for a step change in data processing capabilities, with analytics libraries:
- being parallelised by default
- applying highly optimised calculations on dataframes
- no longer needing to work with the constraint that the full dataset must fit into memory
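On the last point, as a hedged sketch (the directory path and the process function are hypothetical), pyarrow's dataset API scans data in batches rather than loading everything at once:

```python
import pyarrow.dataset as ds

# scan a directory of Parquet files batch by batch, so the full
# dataset never has to fit in memory at one time
dataset = ds.dataset("data/", format="parquet")
for batch in dataset.to_batches():
    process(batch)  # hypothetical per-batch handler
```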