2025-12-09 – Machine Learning & AI
Training state-of-the-art Large Language Models (LLMs) increasingly relies on the availability of clean, diverse, and large-scale datasets. Traditional CPU-based preprocessing pipelines often become a bottleneck when curating datasets that span tens or hundreds of terabytes. In this talk, we introduce NeMo Curator, an open-source, GPU-accelerated data curation framework developed by NVIDIA. Built on Python and powered by RAPIDS, NeMo Curator enables scalable, high-throughput data processing for LLMs, including semantic deduplication, filtering, classification, PII redaction, and synthetic data generation. With support for multi-node, multi-GPU environments, the framework has demonstrated improvements of up to 7% in downstream model performance on large-scale benchmarks. We will walk through its modular pipeline design, highlight real-world applications, and show how to integrate it into existing workflows for fast, reproducible, and efficient LLM training.
The development and performance of Large Language Models (LLMs) increasingly rely on the availability of high-quality, diverse, and representative datasets. Scaling data preparation for LLMs remains a significant bottleneck in training pipelines, particularly when dealing with massive raw web-scale data. Traditional CPU-based preprocessing frameworks are often too slow and resource-intensive to meet the growing demand for efficiency, scalability, and compliance. This talk presents NeMo Curator, an open-source, GPU-accelerated data curation framework designed to accelerate and streamline the preparation of massive datasets across multi-node, multi-GPU infrastructures.
NeMo Curator introduces a modular pipeline architecture that enables high-throughput preprocessing with native integration of RAPIDS for GPU acceleration. Its functionality spans semantic deduplication, heuristic filtering, automated classification, personally identifiable information (PII) redaction, and synthetic data generation. These features work in tandem to reduce noise, eliminate redundancy, and enhance data quality, ultimately improving LLM training outcomes. With support for reward-based filtering and configurable augmentation modules, NeMo Curator can generate or enhance data in low-resource domains while maintaining quality and diversity.
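To make the modular-pipeline idea concrete, here is a minimal, self-contained sketch in plain Python of the kind of stage chain described above: a heuristic word-count filter, a toy PII redaction step, and exact deduplication. All names here are illustrative assumptions for this example, not NeMo Curator's actual API; the real framework runs equivalent stages on GPUs via RAPIDS.

```python
import re

# Illustrative sketch of a modular curation pipeline (NOT the NeMo Curator API):
# each stage is a small function, and run_pipeline chains them in order.

def word_count_filter(doc, min_words=5):
    """Heuristic filter: drop documents shorter than min_words."""
    return doc if len(doc.split()) >= min_words else None

def redact_pii(doc):
    """Toy PII redaction: mask email addresses with a placeholder."""
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "<EMAIL>", doc)

def exact_dedup(docs):
    """Drop exact duplicate documents while preserving order."""
    seen, kept = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            kept.append(d)
    return kept

def run_pipeline(docs):
    docs = [word_count_filter(d) for d in docs]
    docs = [redact_pii(d) for d in docs if d is not None]
    return exact_dedup(docs)

corpus = [
    "Contact us at team@example.com for the full dataset release notes.",
    "Too short.",
    "Contact us at team@example.com for the full dataset release notes.",
]
print(run_pipeline(corpus))
# One document survives: the short one is filtered, the duplicate is dropped,
# and the email address is redacted.
```

The design point this illustrates is that each stage has a uniform document-in/document-out contract, which is what lets a framework swap CPU stages for GPU-accelerated equivalents without changing the pipeline definition.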
This talk will provide an informative walkthrough of NeMo Curator’s capabilities and show how its pipelines can be integrated into existing workflows to preprocess massive datasets efficiently. Attendees will see how to configure and execute the framework through Python APIs, leveraging both single-node and distributed environments. By the end of this talk, participants will become familiar with scalable data curation techniques and walk away with practical tools to enhance their own LLM training pipelines using GPU-accelerated infrastructure.
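As a taste of the deduplication techniques the talk covers, the following self-contained sketch shows near-duplicate detection via Jaccard similarity over word shingles, which is the intuition behind fuzzy deduplication. The shingle size and similarity threshold are assumptions chosen for this toy example; at terabyte scale, frameworks such as NeMo Curator replace the pairwise comparison loop with MinHash/LSH-style approximations running on GPUs.

```python
# Illustrative near-duplicate detection (not NeMo Curator's implementation):
# documents are compared via Jaccard similarity of their word 3-grams.

def shingles(text, n=3):
    """Return the set of word n-grams for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_dedup(docs, threshold=0.8):
    """Keep a document only if it is below the similarity threshold
    against every document already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bend",
    "an entirely different sentence about training large language models",
]
print(near_dedup(docs))
# The second document differs by one word, scores above the threshold,
# and is dropped; the unrelated third document is kept.
```

The quadratic pairwise loop here is exactly the cost that GPU-accelerated, hash-based approaches are designed to avoid on web-scale corpora.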
Detailed Outline:
1. Challenges in Scaling LLM Data Preparation (5 min)
2. Overview of NeMo Curator Framework (10 min)
3. Pipeline Modules and Functional Components (5 min)
4. Demonstration: Multi-GPU Pipeline Execution (5 min)
5. Case Studies and Performance Metrics (5 min)
Targeted Audience:
• Data Scientists, ML/AI Engineers, AI Researchers
Yes
Allison Ding is a developer advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, with a specialization in large language models (LLMs) and advanced data science techniques. She brings over eight years of hands-on experience as a data scientist, focusing on managing and delivering end-to-end data science solutions. Her academic background includes a strong emphasis on natural language processing (NLP) and generative AI. Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University.