PyData Seattle 2025

Carlos Garcia Jurado Suarez

My name is Carlos Garcia Jurado Suarez, and I’m a Software and Machine Learning Engineering Consultant helping research organizations.

I have over 25 years of experience as an engineer, applied scientist, and manager at organizations of all sizes: Big Tech (Microsoft Research, Meta), early- and growth-stage startups, and academia. My expertise and passion are in Machine Learning and Scientific Computing, and in particular in bridging the research and engineering worlds.

I hold master's degrees in Computer Science and in Applied Mathematics, both from the University of Washington, as well as a bachelor's degree in Physics from ITESM, in Monterrey, Mexico.


Session

11-07
14:35
45min
Wrangling Internet-scale Image Datasets
Carlos Garcia Jurado Suarez, Nicholas Merchant

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million–image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible.
In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WebDataset (WDS), and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can make long-running jobs observable rather than opaque.
Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
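
As a rough illustration of the file-manifest idea mentioned above, here is a minimal sketch using only the Python standard library. It is not the pipeline used for Re-LAION-Caption19M; the function names (`build_manifest`, `read_shard`), the record fields, and the round-robin sharding scheme are all assumptions made for this example. The point it tries to show is that listing and hashing files once, into a JSONL manifest, lets later stages split work across workers without re-walking the file tree.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(root: Path, manifest_path: Path) -> int:
    """Write one JSON line per file under `root`; return the number of files listed.

    Hypothetical helper: keep `manifest_path` outside `root`, otherwise the
    manifest file would list itself.
    """
    count = 0
    with manifest_path.open("w", encoding="utf-8") as out:
        # Sort so that repeated runs over the same tree give the same manifest.
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            record = {
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                # Reads the whole file into memory; fine for a sketch,
                # but very large files would call for chunked hashing.
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_shard(manifest_path: Path, shard: int, num_shards: int) -> list[dict]:
    """Return the manifest records assigned to one worker (round-robin by line)."""
    with manifest_path.open(encoding="utf-8") as f:
        return [json.loads(line) for i, line in enumerate(f) if i % num_shards == shard]
```

In this sketch, worker `k` of `n` would call `read_shard(manifest_path, k, n)` and process only its own records, so a failed worker can be rerun on its shard alone.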

Talk Track 3