PyData Global 2025

Bodo DataFrames: a fast and scalable HPC-based drop-in replacement for Pandas
2025-12-11, Data Engineering & Infrastructure

Pandas is a popular library for data scientists, but it struggles with large datasets: programs either become too slow or run out of memory. In this talk, we introduce Bodo DataFrames (https://github.com/bodo-ai/Bodo) as a drop-in replacement for the Pandas library that uses high performance computing (HPC) techniques such as the Message Passing Interface (MPI) and JIT compilation for acceleration and scaling. We give an overview of its architecture, explain how it avoids the problems of Pandas while keeping user code the same, go over concrete examples, and finally discuss current limitations. This talk is for Pandas users who would like to run their code on larger data while avoiding frustrating code rewrites to other APIs. Basic knowledge of Pandas and Python is recommended.


Despite its popularity for data manipulation tasks, Pandas struggles at scale due to its single-threaded execution and significant Python-based overheads. In this talk, we introduce Bodo DataFrames as a solution for scaling Pandas with a one-line code change: simply replace import pandas as pd with import bodo.pandas as pd.
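
For illustration, here is a minimal sketch of that change (the file name is a hypothetical placeholder; the rest of the script is ordinary Pandas code left untouched):

    import bodo.pandas as pd   # the one-line change; was: import pandas as pd

    # Existing Pandas code runs as before.
    df = pd.read_parquet("orders.parquet")
    print(df.head())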

Bodo DataFrames transforms Pandas code into lazily evaluated plans, enabling database-quality query optimizations, and runs on a streaming, parallel backend that uses the Message Passing Interface (MPI) for fast worker-to-worker communication. This design avoids out-of-memory errors and scales easily from a laptop to a large cloud cluster. Unlike other data processing engines, Bodo DataFrames combines powerful techniques from high performance computing (HPC) and databases while remaining fully Pandas compatible.
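
As a rough sketch of that design (file and column names are hypothetical), each Pandas call below contributes to a lazy plan, and the optimized plan only executes, streaming and in parallel, when the result is written out:

    import bodo.pandas as pd

    # These calls build up a lazy query plan rather than materializing data.
    orders = pd.read_parquet("orders.parquet")
    large = orders[orders["amount"] > 100.0]
    totals = large.groupby("customer_id", as_index=False)["amount"].sum()

    # Writing the output is where the optimized plan actually runs.
    totals.to_parquet("customer_totals.parquet")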

We will present multiple examples and benchmarks demonstrating how to use Bodo DataFrames. The first example will show how to scale a simple program covering operations such as reading and writing Parquet files, Series datetime (Series.dt) methods, merge, and groupby-aggregate. The next example will demonstrate how to accelerate user-defined functions (i.e., map and apply) using Bodo DataFrames' built-in support for Just-In-Time (JIT) compilation. The final example will demonstrate how to use Bodo DataFrames' support for the Apache Iceberg format, which provides schema evolution and time travel for ever-changing datasets. We will also discuss how Bodo DataFrames falls back to Pandas when a workload uses operations it does not yet support, as well as planned future work.
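
To give a flavor of the first two examples, here is a hedged sketch (file names, columns, and the UDF are illustrative placeholders, not the actual demo code):

    import bodo.pandas as pd

    # Hypothetical inputs with an order_date datetime column.
    orders = pd.read_parquet("orders.parquet")
    customers = pd.read_parquet("customers.parquet")

    # Datetime handling, merge, and groupby-aggregate via the usual Pandas API.
    orders["month"] = orders["order_date"].dt.month
    joined = orders.merge(customers, on="customer_id", how="inner")
    monthly = joined.groupby(["segment", "month"], as_index=False)["amount"].sum()

    # A user-defined function mapped over a column; Bodo DataFrames can
    # accelerate such UDFs with JIT compilation.
    def amount_band(amount):
        if amount < 1_000:
            return "small"
        elif amount < 100_000:
            return "medium"
        return "large"

    monthly["band"] = monthly["amount"].map(amount_band)
    monthly.to_parquet("monthly_by_segment.parquet")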

This talk is designed for users of Pandas: data scientists, data engineers, and AI/ML practitioners who are interested in easily accelerating and scaling their workloads. In addition to a new tool in their toolbox, attendees will walk away with an understanding of techniques from HPC and databases, unlocking deeper insights into performance and memory utilization.


Prior Knowledge Expected:

Yes

Scott is a Software Engineer at Bodo.ai, where he has worked on the performance and reliability of the BodoSQL engine, contributed to the Bodo Just-In-Time Python Compiler, and is currently working on Bodo DataFrames. He earned his undergraduate degree in computer science from Carnegie Mellon University.