Nicholas Merchant
Machine Learning Engineer with experience training billion-parameter generative models and building
high-throughput data pipelines across 1000+ GPUs. Specializes in scalable PyTorch training, structured
dataset curation, and distributed infrastructure for large-scale multimodal systems.
Session
Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million-image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible.
In this talk, we’ll share practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WebDataset (WDS), and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can make long-running jobs observable rather than opaque.
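As a taste of the manifest idea, here is a minimal sketch, assuming pyarrow and a hypothetical shard directory; the paths, schema, and build_manifest helper are illustrative only, not the actual Re-LAION-Caption19M pipeline.

```python
# Minimal sketch of the "manifest first" pattern: walk the shard directory once,
# record per-file metadata, and persist it as Parquet so later stages never have
# to re-list millions of files. Layout and schema are hypothetical.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


def build_manifest(root: str, pattern: str = "*.tar") -> pa.Table:
    """Collect one row per matching file: path, size in bytes, modification time."""
    rows = [
        {"path": str(p), "size_bytes": p.stat().st_size, "mtime": p.stat().st_mtime}
        for p in sorted(Path(root).rglob(pattern))
    ]
    return pa.Table.from_pylist(rows)


if __name__ == "__main__":
    table = build_manifest("/data/shards")        # hypothetical shard root
    pq.write_table(table, "manifest.parquet")     # later stages read this instead of re-listing
    print(f"indexed {table.num_rows} files")
```

Scanning the filesystem once and reading the Parquet manifest thereafter keeps downstream jobs fast and reproducible, since every stage sees the same frozen file list.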
Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
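As one concrete reading of "parallel processing and fault tolerance", here is a small sketch using Python's concurrent.futures with bounded per-shard retries; process_shard, the worker count, and failed_shards.jsonl are hypothetical placeholders rather than the pipeline described in the talk.

```python
# Sketch of a fault-tolerant parallel shard loop: shards are processed
# independently, failures are retried a few times, and anything that never
# succeeds is recorded instead of killing the whole run.
import json
import logging
from concurrent.futures import ProcessPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def process_shard(path: str) -> int:
    """Placeholder for the real per-shard work (decode, filter, recaption, re-shard)."""
    return 0  # e.g. number of samples written


def run(shards: list[str], max_retries: int = 2) -> None:
    """Process shards in parallel, retrying failed shards a bounded number of times."""
    pending = list(shards)
    for attempt in range(max_retries + 1):
        failed = []
        with ProcessPoolExecutor(max_workers=8) as pool:
            futures = {pool.submit(process_shard, s): s for s in pending}
            for fut in as_completed(futures):
                shard = futures[fut]
                try:
                    n = fut.result()
                    log.info("done %s (%d samples)", shard, n)
                except Exception as exc:
                    log.warning("attempt %d failed for %s: %s", attempt, shard, exc)
                    failed.append(shard)
        if not failed:
            return
        pending = failed
    # Record what never succeeded so a follow-up run can target just those shards.
    with open("failed_shards.jsonl", "w") as f:
        for shard in pending:
            f.write(json.dumps({"shard": shard}) + "\n")


if __name__ == "__main__":
    run([f"shard-{i:05d}.tar" for i in range(4)])  # hypothetical shard names
```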