PyData Global 2025

Lessons learnt in optimizing a large-scale pandas application using Polars, FireDucks and cuDF: Go Smart and Save More!
2025-12-09 , Analytics, Visualization & Decision Science

In general, a data scientist spends significant effort transforming raw data into a more digestible format before training an AI model or creating visualisations. Traditional tools such as pandas have long been the linchpin in this process, offering powerful capabilities but not without limitations. Because pandas offers many ways to write the same operation, users often end up choosing an inefficient one, which leads to high computational cost as the data size grows. We introduce a couple of frequently occurring, intricate performance issues in pandas and share what we have learnt in solving them with popular high-performance pandas alternatives: Polars, FireDucks and cuDF. The talk highlights one best practice (breaking out of loops) to follow in large-scale data analysis, and demonstrates the key advantages of the high-performance pandas alternatives in different scenarios.
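To make the "breaking out of loops" practice concrete, here is a minimal sketch in plain pandas. The column names (`price`, `quantity`), the synthetic data and both function names are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data, used only to illustrate the pattern.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, size=1_000),
    "quantity": rng.integers(1, 10, size=1_000),
})


def revenue_loop(frame: pd.DataFrame) -> pd.Series:
    """Loop-based version: visits every row in Python, one at a time."""
    totals = []
    for _, row in frame.iterrows():
        totals.append(row["price"] * row["quantity"])
    return pd.Series(totals, index=frame.index)


def revenue_vectorized(frame: pd.DataFrame) -> pd.Series:
    """DataFrame-API version: a single column-wise multiplication."""
    return frame["price"] * frame["quantity"]


loop_result = revenue_loop(df)
vec_result = revenue_vectorized(df)

# Both versions compute the same values; only the execution strategy differs.
print(np.allclose(loop_result.to_numpy(), vec_result.to_numpy()))
```

The vectorized form hands the whole column to the library at once, which is typically what lets pandas, and libraries with pandas-like APIs, run the operation in compiled code rather than in the Python interpreter.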


It is well known that pandas can be slow in large-scale data analysis, but knowing how to write an efficient pandas application can save you a lot. For a data scientist who specialises in extracting key insights from data, it can be difficult to program with runtime memory consumption, data-flow optimization and similar concerns in mind. High-performance pandas alternatives such as Polars, FireDucks and cuDF are designed to address these issues and can save a lot of operational cost (e.g., cloud cost and human cost). We will talk about the key lessons we have learnt in optimizing a large-scale pandas application and the decision points in selecting a high-performance pandas alternative. The talk should be useful for data professionals who love the flexible user APIs in pandas and want to improve the performance of their applications without much effort when dealing with voluminous and complex data on a regular basis.

The key takeaways would be as follows:
1. How the choice and execution order of API calls in a data-related application impact its performance.
2. How to stop thinking in loops and design algorithms using DataFrame APIs.
3. How the internal query optimizers in libraries like Polars and FireDucks can bring SQL-like optimizations to the Python level.
4. Whether to pay a large migration cost for optimizing an existing pandas-based application or to go smart with some minor modifications and save more operational cost.
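As a rough illustration of the first takeaway, the sketch below computes the same result in plain pandas with two different call orders. The tables, column names and the 30.0 threshold are hypothetical. Filtering and projecting before the join is the kind of reordering a query optimizer may apply automatically; here it is done by hand.

```python
import pandas as pd

# Hypothetical toy tables, used only to illustrate the call order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "customer_id": [10, 20, 10, 30, 20, 10],
    "amount": [50.0, 20.0, 75.0, 10.0, 40.0, 5.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["EU", "US", "EU"],
})


def merge_then_filter(o: pd.DataFrame, c: pd.DataFrame) -> pd.DataFrame:
    """Joins every order first, then discards most of the joined rows."""
    merged = o.merge(c, on="customer_id", how="inner")
    return merged.loc[merged["amount"] > 30.0, ["order_id", "region"]]


def filter_then_merge(o: pd.DataFrame, c: pd.DataFrame) -> pd.DataFrame:
    """Filters rows and drops unused columns first, so the join sees less data."""
    big = o.loc[o["amount"] > 30.0, ["order_id", "customer_id"]]
    return big.merge(c, on="customer_id", how="inner")[["order_id", "region"]]


# Normalise row order and index so the two results can be compared directly.
late_filter = (
    merge_then_filter(orders, customers)
    .sort_values("order_id")
    .reset_index(drop=True)
)
early_filter = (
    filter_then_merge(orders, customers)
    .sort_values("order_id")
    .reset_index(drop=True)
)

print(late_filter.equals(early_filter))
```

On a table this small the difference is negligible. The point is only that the second ordering gives the join fewer rows and columns to process, which tends to matter as the data grows.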


Prior Knowledge Expected:

No

Sourav has 12+ years of professional experience at NEC Corporation in the diverse fields of High-Performance Computing, Distributed Programming, Compiler Design, and Data Science. Currently, his team at NEC R&D Lab, Japan, is researching various data processing-related algorithms. Combining niche technologies from compiler frameworks, high-performance computing, and multi-threaded programming, they have developed a Python library named FireDucks with highly compatible pandas APIs for DataFrame-related operations. In his previous engagements, he worked on the research and development of performance-critical AI and Big Data solutions, and on the optimization of several legacy applications written in C++ and Fortran, related to weather prediction, earthquake simulation, etc. He has spoken at several meetups and technical conferences related to HPC and Data Science.