PyData Global 2025

Allison Ding

Allison Ding is a developer advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, with a specialization in large language models (LLMs) and advanced data science techniques. She brings over eight years of hands-on experience as a data scientist, focusing on managing and delivering end-to-end data science solutions. Her academic background includes a strong emphasis on natural language processing (NLP) and generative AI. Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University.


Session

12-09
17:00
30min
Scaling Data Processing for LLMs with NeMo Curator
Allison Ding

Training state-of-the-art Large Language Models (LLMs) increasingly relies on the availability of clean, diverse, and large-scale datasets. Traditional CPU-based preprocessing pipelines often become a bottleneck when curating datasets that span tens or hundreds of terabytes. In this talk, we introduce NeMo Curator, an open-source, GPU-accelerated data curation framework developed by NVIDIA. Built on Python and powered by RAPIDS, NeMo Curator enables scalable, high-throughput data processing for LLMs, including semantic deduplication, filtering, classification, PII redaction, and synthetic data generation. With support for multi-node, multi-GPU environments, the framework has demonstrated up to a 7% improvement in downstream model performance on large-scale benchmarks. We will walk through its modular pipeline design, highlight real-world applications, and show how to integrate it into existing workflows for fast, reproducible, and efficient LLM training.
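To give a feel for what a modular curation pipeline looks like, here is a minimal CPU-only sketch in plain Python. It does not use the NeMo Curator API: the stage names (`min_words`, `exact_dedup`, `redact_emails`), the `run_pipeline` helper, and the sample documents are all hypothetical, and exact hash-based deduplication stands in for the semantic deduplication mentioned above.

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator, List

# A stage takes a stream of documents and yields a (possibly smaller) stream.
Stage = Callable[[Iterable[str]], Iterator[str]]

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def min_words(n: int) -> Stage:
    """Build a filtering stage that drops documents shorter than n words."""
    def stage(docs: Iterable[str]) -> Iterator[str]:
        return (d for d in docs if len(d.split()) >= n)
    return stage


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Keep the first copy of each document, ignoring case and whitespace."""
    seen = set()
    for d in docs:
        normalized = " ".join(d.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield d


def redact_emails(docs: Iterable[str]) -> Iterator[str]:
    """Replace email addresses with a placeholder token."""
    for d in docs:
        yield EMAIL_RE.sub("<EMAIL>", d)


def run_pipeline(docs: Iterable[str], stages: List[Stage]) -> List[str]:
    """Chain the stages in order and materialize the result."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


docs = [
    "Contact me at jane@example.com for the dataset details.",
    "contact me at  jane@example.com for the dataset details.",
    "too short",
    "GPUs speed up large scale text curation considerably.",
]
curated = run_pipeline(docs, [min_words(5), exact_dedup, redact_emails])
for doc in curated:
    print(doc)
```

Because each stage has the same signature, stages can be reordered, removed, or swapped for accelerated implementations without changing the rest of the pipeline.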

Machine Learning & AI