2025-12-10 – General Track
Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late, after the data has already been moved into centralized stores or training clusters, creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?
In this talk, we’ll discuss a playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.
Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle rather than patching them in afterward.
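To make the idea concrete, here is a minimal sketch of what a declarative, policy-as-code filter applied at the data source could look like. The policy schema, the field names, and the `apply_policy()` helper are invented for illustration; they are assumptions for this sketch, not the API of any specific framework discussed in the talk.

```python
# Hypothetical illustration: a tiny "policy-as-code" filter run where data is
# generated, before anything is shipped to central storage or a training cluster.
# Policy keys, record fields, and apply_policy() are illustrative assumptions.
import hashlib

POLICY = {
    "drop_if_missing": ["label", "feature_text"],               # discard incomplete records
    "dedupe_on": ["feature_text"],                              # drop exact duplicates at the source
    "class_balance": {"field": "label", "max_per_class": 2},    # cap over-represented classes
}

def apply_policy(records, policy):
    seen_hashes, per_class_counts, kept = set(), {}, []
    for rec in records:
        # 1. Drop records missing required fields (noise never leaves the source).
        if any(rec.get(f) in (None, "") for f in policy["drop_if_missing"]):
            continue
        # 2. Deduplicate on a content hash of the configured fields.
        key = hashlib.sha256(
            "|".join(str(rec[f]) for f in policy["dedupe_on"]).encode()
        ).hexdigest()
        if key in seen_hashes:
            continue
        seen_hashes.add(key)
        # 3. Enforce a per-class cap so the dataset stays balanced downstream.
        cls = rec[policy["class_balance"]["field"]]
        if per_class_counts.get(cls, 0) >= policy["class_balance"]["max_per_class"]:
            continue
        per_class_counts[cls] = per_class_counts.get(cls, 0) + 1
        kept.append(rec)
    return kept

raw = [
    {"label": "cat", "feature_text": "fluffy"},
    {"label": "cat", "feature_text": "fluffy"},    # duplicate -> dropped
    {"label": "cat", "feature_text": "sleepy"},
    {"label": "cat", "feature_text": "grumpy"},    # over the class cap -> dropped
    {"label": "dog", "feature_text": "loyal"},
    {"label": None,  "feature_text": "mystery"},   # missing label -> dropped
]
print(apply_policy(raw, POLICY))  # only the complete, deduplicated, balanced records remain
```

Because the policy is plain data rather than imperative glue code, it can be versioned, reviewed, and reproduced alongside the model, which is what the talk means by embedding reproducibility and compliance into the lifecycle.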
Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.
For over a decade, the Python ecosystem has given us a powerful arsenal to tame data. We started with the interactive magic of Pandas on a single machine, a revolutionary step that made complex analysis accessible. When our ambitions (and data) outgrew our laptops, we turned to Dask and Spark to scale our computations across clusters. More recently, projects like Apache Arrow began solving the critical problem of creating a standardized, efficient language for these distributed systems to speak.
Each step in this journey solved a painful bottleneck. Yet, in our success, we've created a new one: the runaway cost and complexity of the "ingest-it-all-first" paradigm. Our cloud bills have become a tax on raw, unfiltered data, and our elegant downstream tools—from Airflow and dbt to our own ML models—are forced to waste expensive cycles sifting through noise just to find the signal.
This talk argues for the next logical step in our stack's evolution: an Upstream Data Control Plane. We'll explore a playbook for applying intelligent filtering, transformation, and governance before data ever hits your expensive lakehouse. Just as Dask parallelized our processing and Arrow standardized our memory, this approach optimizes our data in motion, ensuring that our powerful downstream systems operate only on the high-value signals we care about. Join us to learn a declarative, policy-as-code framework that makes your entire data stack cheaper, faster, and more resilient.
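As a flavor of what "optimizing data in motion" can mean in practice, the sketch below collapses a window of raw events into one enriched summary record at the source node, so the lakehouse and the tools behind it only ever receive signal. The event format, severity levels, and `summarize_window()` helper are illustrative assumptions, not a specific framework's API.

```python
# Hypothetical sketch of upstream reduction: each source node summarizes raw events
# before anything crosses the network. Field names and helpers are assumptions.
from collections import Counter
from datetime import datetime, timezone

def summarize_window(raw_events, severity_floor="WARN"):
    """Collapse a window of raw log events into one compact, enriched record."""
    levels = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}
    floor = levels[severity_floor]
    interesting = [e for e in raw_events if levels.get(e["level"], 0) >= floor]
    return {
        "window_end": datetime.now(timezone.utc).isoformat(),
        "total_events": len(raw_events),       # volume metadata, not the raw payloads
        "events_kept": len(interesting),
        "by_level": dict(Counter(e["level"] for e in interesting)),
        "sample_messages": [e["msg"] for e in interesting[:3]],  # small, bounded sample
    }

window = [
    {"level": "DEBUG", "msg": "heartbeat"},
    {"level": "INFO", "msg": "request served"},
    {"level": "ERROR", "msg": "model feature missing"},
    {"level": "WARN", "msg": "slow response"},
]
# One compact record leaves the node instead of every raw line.
print(summarize_window(window))
```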
David Aronchick is CEO of Expanso (expanso.io), the global, intelligent pipeline company.
Previously, he led Compute over Data at Protocol Labs and Open Source Machine Learning Strategy at Azure, was a product manager for Kubernetes at Google, launched Google Kubernetes Engine, and co-founded the Kubeflow project and the SAME project. He has also worked at Amazon and Chef, and co-founded three startups.
When not spending too much time in service of electrons, he can be found on a mountain (on skis), traveling the world (via restaurants), or participating in kid activities, of which there are a lot more than he remembers from when he was that age.