David Aronchick PyData Seattle 2025

David Aronchick

I am CEO and co-founder of Expanso and the Bacalhau Project, helping to deploy and organize our community as it builds the next generation of the Internet.

Previously, I was co-director of Research Development at Protocol Labs, led Open Source Machine Learning Strategy at Azure and product management for Kubernetes on behalf of Google, launched Google Kubernetes Engine, and co-founded the Kubeflow project and the SAME project. I have also worked at Microsoft, Amazon, and Chef, and co-founded three startups.

When not spending too much time in service of electrons, I can be found on a mountain (on skis), traveling the world (via restaurants), or participating in kid activities, of which there are a lot more than I remember from when I was that age.

Session

11-08

10:55

45min

Taming the Data Tsunami: An Open-Source Playbook to Get Ready for ML

David Aronchick

Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?

In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.
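
As a rough illustration of what a declarative, source-level policy might look like, here is a minimal Python sketch. It is not the framework presented in the talk: the `SourcePolicy` class, the `apply_policy` function, and the record fields (`id`, `label`, `signal`) are all hypothetical names chosen for this example. The policy is plain data, so it can be reviewed and versioned like code, and it is applied where the records are produced so that only the kept records move downstream.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator


# Hypothetical policy object: every field below is an assumption for this sketch.
@dataclass(frozen=True)
class SourcePolicy:
    min_signal: float    # discard records whose "signal" score is below this
    max_per_label: int   # cap on records kept per label, to balance the dataset
    site: str            # provenance tag added to every kept record


def apply_policy(records: Iterable[dict], policy: SourcePolicy) -> Iterator[dict]:
    """Filter, deduplicate, balance, and enrich records at the source."""
    seen_ids = set()
    per_label = Counter()
    for rec in records:
        if rec["signal"] < policy.min_signal:
            continue  # discard noise
        if rec["id"] in seen_ids:
            continue  # discard redundant copies
        if per_label[rec["label"]] >= policy.max_per_label:
            continue  # keep labels balanced
        seen_ids.add(rec["id"])
        per_label[rec["label"]] += 1
        yield {**rec, "site": policy.site}  # enrich with provenance


# Made-up sample records, for illustration only.
raw = [
    {"id": 1, "label": "ok", "signal": 0.9},
    {"id": 1, "label": "ok", "signal": 0.9},     # duplicate of id 1
    {"id": 2, "label": "ok", "signal": 0.1},     # below the signal threshold
    {"id": 3, "label": "ok", "signal": 0.8},
    {"id": 4, "label": "ok", "signal": 0.7},     # exceeds the per-label cap
    {"id": 5, "label": "fault", "signal": 0.95},
]

policy = SourcePolicy(min_signal=0.5, max_per_label=2, site="edge-01")
kept = list(apply_policy(raw, policy))
print([r["id"] for r in kept])
```

In this toy run, half of the six records are dropped before anything would leave the source. A real deployment would read the policy from a versioned file and run the function on each data-producing node, details this sketch leaves out.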

Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.

Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.

Room 301B
