2025-11-07, Talk Track 1
PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built in Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining these two technologies unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver substantial speedups over conventional row-based Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be combined with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.
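To make the mechanics concrete, here is a minimal sketch of an Arrow-optimized Python UDF, assuming Spark 3.5+ (where the useArrow flag was introduced); the function name and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "x")

# useArrow=True switches the data exchange between the JVM and the
# Python worker from Pickle to Arrow; the function body itself still
# runs row-at-a-time, so only the serialization path changes.
@udf(returnType="long", useArrow=True)
def plus_one(x: int) -> int:
    return x + 1

df.select(plus_one("x").alias("y")).show()
```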
Objective
To introduce PyData practitioners to Arrow Python UDFs in PySpark and demonstrate how they can be integrated with other execution engines such as Polars to accelerate UDF execution. The session highlights the mechanics, benchmarks, and practical applications of this approach.
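One way to integrate Polars is through PySpark's mapInArrow API (Spark 3.3+), which hands the function an iterator of Arrow record batches that Polars can wrap without copying. A minimal sketch, with the function and column names as illustrative assumptions:

```python
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

def double_with_polars(batches):
    # batches is an iterator of pyarrow.RecordBatch supplied by Spark.
    for batch in batches:
        # pl.from_arrow wraps the Arrow batch in a Polars DataFrame
        # without copying the underlying buffers.
        pdf = pl.from_arrow(batch)
        out = pdf.with_columns((pl.col("x") * 2).alias("doubled"))
        # Convert back to Arrow and stream the batches to Spark.
        yield from out.to_arrow().to_batches()

result = df.mapInArrow(double_with_polars, schema="x long, doubled long")
result.show(3)
```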
Key Takeaways
• Why Arrow UDFs matter and how they work in PySpark
• How Polars leverages zero-copy Arrow integration to accelerate UDFs
• Performance comparisons: Polars Arrow UDFs vs Pandas UDFs (a baseline sketch follows this list)
• Practical guidance for adopting this pattern in Spark pipelines
• How this approach fits into the broader PyData + Arrow ecosystem
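As a reference point for the comparison above, here is a minimal sketch of the equivalent pandas UDF baseline; names are illustrative, and the transformation mirrors the Polars version so the two can be benchmarked side by side:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

@pandas_udf("long")
def double_with_pandas(s: pd.Series) -> pd.Series:
    # Spark converts each Arrow batch to a pandas Series (which may
    # copy data), applies the function, and converts the result back.
    return s * 2

result = df.select("x", double_with_pandas("x").alias("doubled"))
result.show(3)
```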
Audience
• Data engineers and data scientists working with PySpark at scale
• Practitioners interested in Arrow and next-generation DataFrame tools
• Engineers seeking concrete strategies to optimize Spark UDFs
Background Knowledge Expected
• Familiarity with PySpark DataFrames and UDFs
• Some experience with Pandas or Polars helpful, but not required
Speaker
Software Engineer at Databricks. Apache Spark Committer.