PyData Seattle 2025

Building Web-Scale Data Pipelines: From 248 Billion Documents to Production-Ready ML Datasets

Processing web-scale data for machine learning remains one of the hardest challenges in production AI systems. This talk reveals the engineering strategies behind transforming 248 billion raw web documents into a clean, taxonomically labeled 24-trillion-token dataset ready for LLM training.

You'll learn practical techniques for building robust data pipelines that actually scale: multi-layered deduplication strategies that reduce data volume by 90%, quality filtering that preserves valuable technical content, and distributed inference systems achieving 32,000 requests per second. We'll explore real production code using modern Python tools like Daft for orchestration and vLLM for inference, showing how to develop locally and scale to thousands of GPUs without major refactoring.
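
As a taste of the local-to-distributed pattern, here is a minimal sketch. The S3 path, column names, and Ray address are placeholders rather than the talk's production values, and the real pipeline adds many more stages:

    import daft
    from daft import col

    # Daft runs on its local multithreaded runner by default. Pointing the
    # same code at a Ray cluster is a one-line change (address is a placeholder):
    # daft.context.set_runner_ray(address="ray://head-node:10001")

    # Lazily scan web documents stored as Parquet (path is illustrative).
    docs = daft.read_parquet("s3://example-bucket/web-crawl/*.parquet")

    # The same dataframe expressions run unchanged on a laptop or a cluster:
    # drop very short documents and keep only the columns we need.
    docs = (
        docs
        .where(col("text").str.length() > 200)
        .select("url", "text")
    )

    docs.show(5)  # triggers execution of the lazy plan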

Whether you're processing gigabytes or petabytes, you'll leave with actionable patterns for data deduplication, quality assessment, and large-scale model inference that you can apply to your own projects immediately.
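
The talk walks through a multi-layered deduplication approach rather than a single recipe, but as a rough illustration, the sketch below layers an exact content-hash pass with a near-duplicate pass using MinHash LSH from the third-party datasketch package. The Jaccard threshold, shingle size, and permutation count are illustrative assumptions, not the production settings:

    import hashlib
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128                                        # MinHash permutations (assumed)
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)    # 0.8 Jaccard threshold (assumed)
    seen_exact = set()                                    # layer 1: exact content hashes

    def is_duplicate(doc_id: str, text: str) -> bool:
        """Return True if `text` duplicates a previously seen document."""
        # Layer 1: cheap exact-match check on a whitespace-normalized hash.
        digest = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in seen_exact:
            return True
        seen_exact.add(digest)

        # Layer 2: fuzzy check with MinHash over word 5-gram shingles.
        words = text.split()
        shingles = {" ".join(words[i:i + 5]) for i in range(max(1, len(words) - 4))}
        mh = MinHash(num_perm=NUM_PERM)
        for s in shingles:
            mh.update(s.encode("utf-8"))
        if lsh.query(mh):        # any stored document above the threshold?
            return True
        lsh.insert(doc_id, mh)
        return False

In a production pipeline these layers would run as distributed dataframe operations over sharded state rather than one in-memory set, but the layering idea is the same.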


What and Why: This talk demonstrates how to build production data pipelines that process web-scale data efficiently. The topic is critical because most teams struggle when scaling beyond single-machine processing, often rewriting entire pipelines or facing astronomical compute costs. Using a real case study of transforming 248 billion web documents into 24 trillion clean tokens, I'll show engineering patterns that scale seamlessly from gigabytes to petabytes.

Who: Data engineers and ML engineers who need to process large datasets but are frustrated by tools that don't scale or require complete rewrites. Intermediate Python knowledge expected (pandas, basic async concepts).

Type and Tone: Hands-on and practical with real code examples. The tone is pragmatic, focusing on what actually works in production rather than theoretical ideals.

Outline:

  • Minutes 0-5: Why traditional approaches fail at scale
  • Minutes 5-15: Multi-layered deduplication reducing data by 90%
  • Minutes 15-25: Quality filtering that preserves technical content
  • Minutes 25-35: Distributed inference achieving 32,000 RPS with Daft/vLLM (see the sketch after this outline)
  • Minutes 35-40: Q&A
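
To make the inference segment concrete, here is a minimal single-node sketch of offline batched classification with vLLM. The model name, prompt template, and taxonomy labels are placeholders; the aggregate 32,000 requests per second discussed in the talk comes from many such workers coordinated by the orchestration layer, not from one process:

    from vllm import LLM, SamplingParams

    # Placeholder model and taxonomy; the production classifier differs.
    llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
    params = SamplingParams(temperature=0.0, max_tokens=8)

    docs = [
        "CUDA kernels for sparse matrix multiplication ...",
        "Top ten celebrity beach photos of the summer ...",
    ]
    prompts = [
        "Classify the following web page into one of: science, technology, "
        f"entertainment, other. Answer with the label only.\n\n{d}\n\nLabel:"
        for d in docs
    ]

    # vLLM batches the prompts internally via continuous batching.
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text.strip())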

Takeaway: You'll leave knowing how to build pipelines that scale without rewrites, implement efficient deduplication strategies, and meaningfully cut compute costs through smart optimization patterns.


Prior Knowledge Expected:

Intermediate Python (pandas, basic async concepts).