PyData Seattle 2025

Taming the Data Tsunami: An Open-Source Playbook to Get Ready for ML
2025-11-08, Talk Track 2

Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?

In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.
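
To make the "declarative, policy-as-code" idea concrete, here is a minimal sketch of what a source-level policy could look like in plain Python. Everything in it (the `Policy` dataclass, `apply_policy`, and the rule keys) is a hypothetical illustration of the pattern, not the API of any specific framework:

```python
# Hypothetical sketch of a source-level policy: the names (Policy,
# apply_policy, the rule keys) are invented for illustration and do not
# belong to any real framework.
import random
from dataclasses import dataclass, field

@dataclass
class Policy:
    drop_if: dict = field(default_factory=dict)      # field -> value that marks noise
    sample_rate: dict = field(default_factory=dict)  # label -> keep probability
    enrich: dict = field(default_factory=dict)       # static metadata to attach

def apply_policy(records, policy):
    """Apply a declarative policy to records at the source, before shipping."""
    for rec in records:
        # Discard noise: drop records that match a disallowed field value.
        if any(rec.get(k) == v for k, v in policy.drop_if.items()):
            continue
        # Balance classes: probabilistically downsample over-represented labels.
        if random.random() > policy.sample_rate.get(rec.get("label"), 1.0):
            continue
        # Enrich signal: attach provenance metadata for reproducibility.
        yield {**rec, **policy.enrich}

policy = Policy(
    drop_if={"status": "heartbeat"},   # keep-alive records are pure noise
    sample_rate={"negative": 0.2},     # keep 20% of the majority class
    enrich={"source": "edge-node-7", "policy_version": "v1"},
)
raw = [
    {"status": "ok", "label": "negative", "x": 1.0},
    {"status": "heartbeat", "label": "negative", "x": 0.0},
    {"status": "ok", "label": "positive", "x": 2.0},
]
print(list(apply_policy(raw, policy)))
```

Because the policy is plain data, it can be versioned, reviewed, and rolled out like any other code artifact, which is the "policy-as-code" property described above.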

Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.

Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.


For over a decade, the Python ecosystem has given us a powerful arsenal to tame data. We started with the interactive magic of Pandas on a single machine, a revolutionary step that made complex analysis accessible. When our ambitions (and data) outgrew our laptops, we turned to Dask and Spark to scale our computations across clusters. More recently, projects like Apache Arrow began solving the critical problem of creating a standardized, efficient language for these distributed systems to speak.

Each step in this journey solved a painful bottleneck. Yet, in our success, we've created a new one: the runaway cost and complexity of the "ingest-it-all-first" paradigm. Our cloud bills have become a tax on raw, unfiltered data, and our elegant downstream tools—from Airflow and dbt to our own ML models—are forced to waste expensive cycles sifting through noise just to find the signal.

This talk argues for the next logical step in our stack's evolution: an Upstream Data Control Plane. We'll explore an open-source playbook for applying intelligent filtering, transformation, and governance before data ever hits your expensive lakehouse. Just as Dask parallelized our processing and Arrow standardized our memory, this approach optimizes our data in motion, ensuring that our powerful downstream systems operate only on the high-value signals we care about. Join us to learn a declarative, policy-as-code framework that makes your entire data stack cheaper, faster, and more resilient.
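
As a rough illustration of optimizing data in motion with tools the ecosystem already has, the sketch below pre-filters a batch with Apache Arrow before anything is written to downstream storage. The column names, the 0.5 threshold, and the output path are assumptions made up for this example, not part of any particular framework:

```python
# Minimal sketch: pre-filtering a batch with Apache Arrow before anything
# is written to downstream storage. The column names, the 0.5 threshold,
# and the output path are illustrative assumptions.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

batch = pa.table({
    "score": [0.1, 0.9, 0.7, 0.2],
    "label": ["neg", "pos", "pos", "neg"],
    "debug_blob": ["...", "...", "...", "..."],  # bulky noise we never want to ship
})

# Keep only high-signal rows and drop the noisy column entirely, so the
# reduced table is all that ever reaches the lakehouse.
filtered = batch.filter(pc.greater(batch["score"], 0.5)).select(["score", "label"])
pq.write_table(filtered, "high_signal.parquet")
```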


Prior Knowledge Expected:

No previous knowledge expected

I am the CEO and co-founder of Expanso and the Bacalhau Project, where I help deploy and organize our community as we build the next generation of the Internet.

Previously, I was co-director of Research Development at Protocol Labs, led Open Source Machine Learning Strategy at Azure, led product management for Kubernetes at Google, launched Google Kubernetes Engine, and co-founded the Kubeflow and SAME projects. I have also worked at Microsoft, Amazon, and Chef, and co-founded three startups.

When not spending too much time in service of electrons, I can be found on a mountain (on skis), traveling the world (via restaurants), or participating in kid activities, of which there are far more than I remember from my own childhood.