David Aronchick
David Aronchick is CEO of Expanso (expanso.io), the global, intelligent pipeline company.
Previously, he led Compute over Data at Protocol Labs and Open Source Machine Learning Strategy at Azure, was a product manager for Kubernetes on behalf of Google, launched Google Kubernetes Engine, and co-founded the Kubeflow project and the SAME project. He has also worked at Amazon and Chef, and co-founded three startups.
When not spending too much time in service of electrons, he can be found on a mountain (on skis), traveling the world (via restaurants), or participating in kid activities, of which there are a lot more than he remembers from when he was that age.
Session
Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late, after the data has already been moved into centralized stores or training clusters, creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?
In this talk, we’ll discuss a playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.
Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle rather than patching them in afterward.
Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.