PyData Seattle 2025

Wrangling Internet-scale Image Datasets
2025-11-07, Room 301A

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million–image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible.
In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WebDataset (WDS), and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque to observable.
Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.


This is an engineering-focused talk on the pragmatic challenges of prepping image datasets at scale.

Topics we will cover:

How do you go from a huge list of URLs to usable images?

Benchmark image datasets for machine learning (e.g., LAION-5B) are open source and published at major conferences. However, research teams are not data providers, and more often than not the ‘datasets’ they provide are lists of URLs and metadata.
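To make that gap concrete, here is a minimal sketch of turning such a list into images on disk; the Parquet input, the "url"/"caption" column names, and the output paths are assumptions, and a real pipeline would add parallelism, retries, and rate limiting.

```python
# Minimal sketch: turn a URL + metadata list into image files on disk.
# The Parquet input, "url"/"caption" column names, and output paths are
# assumptions; a production pipeline would parallelize downloads and
# write sharded archives instead of loose files.
import io
import os

import pandas as pd
import requests
from PIL import Image

os.makedirs("images", exist_ok=True)
metadata = pd.read_parquet("urls.parquet")  # hypothetical URL/metadata list

records = []
for row in metadata.itertuples():
    try:
        resp = requests.get(row.url, timeout=10)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")
        path = f"images/{row.Index:09d}.jpg"
        img.save(path, quality=90)
        records.append({"url": row.url, "path": path, "status": "ok"})
    except Exception as exc:  # dead links and bad bytes are the common case
        records.append({"url": row.url, "path": None, "status": repr(exc)})

# Keep the bookkeeping next to the images so the run stays reproducible.
pd.DataFrame(records).to_json("download_log.jsonl", orient="records", lines=True)
```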

Practical engineering approaches to managing large volumes of data
- Choosing file formats to optimize parallel and sequential reads and writes
- Why your ls command will freeze (a manifest sketch follows this list)
- Bringing observability to data pipelines with tools such as Weights & Biases
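As one illustration of the manifest idea, here is a sketch (paths and schema are assumptions) that streams directory entries with os.scandir instead of shelling out to ls, and records the result as a Parquet manifest that later stages can query.

```python
# Sketch: build a Parquet manifest of files by streaming directory entries
# instead of listing millions of files with ls (which buffers and sorts
# everything before printing anything). Paths and schema are assumed.
import os

import pyarrow as pa
import pyarrow.parquet as pq

def scan(root):
    """Yield (path, size_bytes) lazily, one directory entry at a time."""
    for entry in os.scandir(root):
        if entry.is_file():
            yield entry.path, entry.stat().st_size
        elif entry.is_dir():
            yield from scan(entry.path)

paths, sizes = [], []
for path, size in scan("shards/"):
    paths.append(path)
    sizes.append(size)

# At real scale, flush batches with pq.ParquetWriter rather than holding
# one giant table in memory.
pq.write_table(pa.table({"path": paths, "size_bytes": sizes}), "manifest.parquet")
```

Downstream stages then read the manifest instead of re-walking the filesystem, which also makes resumption and auditing cheap.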

Ensuring data quality at scale

- Use of ML models to assess data quality at scale (CLIP similarity scores, OCR metrics, aesthetics scores); a sketch follows this list
- The importance of human evaluation to supplement quantitative approaches
- Bookkeeping and metadata as critical elements of reproducible pipelines
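As a concrete example of the first point, below is a minimal sketch of a CLIP image-text similarity score using Hugging Face transformers; the checkpoint, threshold, and file paths are illustrative assumptions, not the exact setup used for Re-LAION-Caption19M.

```python
# Sketch: score image-caption pairs with CLIP cosine similarity so that
# low-similarity pairs can be flagged or dropped. Model name, threshold,
# and file paths are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    image = Image.open(image_path).convert("RGB")
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=[caption], return_tensors="pt",
                            padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(**image_inputs)
        txt = model.get_text_features(**text_inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Hypothetical usage: keep pairs above an assumed similarity threshold.
if clip_score("images/000000000.jpg", "a brown dog on a beach") < 0.25:
    print("flag for human review or drop")
```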

Additional considerations when training models with large datasets
- Random vs. sequential access and shuffle buffers (pseudo-i.i.d. sampling); see the sketch after this list
- Efficient ablation over dataset parameters
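The shuffle-buffer idea from the first bullet, as a minimal sketch (buffer size and sample source are assumptions): shards are read sequentially for throughput, and a bounded in-memory buffer re-randomizes the order enough to approximate i.i.d. sampling.

```python
# Sketch of a shuffle buffer: read samples sequentially (fast for tar/WDS
# shards), hold a bounded buffer in memory, and emit random elements from
# it, giving pseudo-i.i.d. order without random disk access.
import random

def shuffle_buffer(samples, buffer_size=10_000, seed=0):
    rng = random.Random(seed)
    buf = []
    for sample in samples:              # sequential read from shards
        buf.append(sample)
        if len(buf) >= buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    rng.shuffle(buf)                    # drain the remainder at epoch end
    yield from buf

# Hypothetical usage with any iterable of decoded samples:
for sample in shuffle_buffer(range(100_000), buffer_size=1_000):
    pass  # hand off to the training loop
```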

This talk will be useful for data engineers and MLEs who want to train on benchmark, open-source image datasets.

Exposure to Linux, Kubernetes, file systems, and ML fundamentals will be useful, but little background knowledge is required.


Prior Knowledge Expected: No previous knowledge expected

My name is Carlos Garcia Jurado Suarez, and I’m a Software and Machine Learning Engineering Consultant at CodePointers, helping research organizations with their software and machine learning needs.

I have over 25 years of experience as an engineer, applied scientist, and manager at organizations of all sizes: Big Tech (Microsoft Research, Meta), early- and growth-stage startups, and academia. My expertise and passion are in Machine Learning and Scientific Computing, and in particular in bridging the research and engineering worlds.

I hold master's degrees in Computer Science and in Applied Mathematics, both from the University of Washington, as well as a bachelor's degree in Physics from ITESM, in Monterrey, Mexico.

Machine Learning Engineer with experience training billion-parameter generative models and building high-throughput data pipelines across 1000+ GPUs. Specializes in scalable PyTorch training, structured dataset curation, and distributed infra for large-scale multimodal systems.