2025-11-08 – Talk Track 2
Processing large-scale image datasets for captioning presents coordination challenges that often lead to complex, difficult-to-maintain systems. I've been exploring how Ray Data can simplify these workflows while improving throughput and reliability. This talk demonstrates how to build image captioning pipelines by combining Ray Data's batch processing, Ray Data LLM's batch inference, and vLLM's efficient model serving. I'll walk through how we:
1. Structure data processing pipelines using Ray Data's map and map_batches operations
2. Use Ray Data LLM's Processor object, which encapsulates the logic for batch LLM inference over a Ray Data dataset
3. Integrate vLLM for high-throughput batch inference on vision-language models
4. Handle fault tolerance and checkpointing for long-running jobs
5. Optimize GPU resource utilization across distributed workloads
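To make the map_batches pattern in steps 1–3 concrete, here is a minimal plain-Python stand-in (not Ray's actual API — `iter_batches`, `caption_batch`, and the toy captioner are hypothetical illustrations of the batching granularity the pipeline operates at):

```python
from typing import Iterator

def iter_batches(records: list[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield fixed-size batches, mirroring the granularity of map_batches."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def caption_batch(batch: list[dict]) -> list[dict]:
    """Stand-in for a vLLM-backed vision-language model call.
    In the real pipeline this would issue one batched inference request."""
    for record in batch:
        record["caption"] = f"caption for {record['path']}"
    return batch

images = [{"path": f"img_{i}.jpg"} for i in range(10)]
captioned = [rec for batch in iter_batches(images, batch_size=4)
             for rec in caption_batch(batch)]
print(len(captioned), captioned[0]["caption"])
```

The key idea the talk builds on is that batching amortizes per-call overhead: the model is invoked once per batch rather than once per image.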
We'll explore practical patterns for processing thousands of images, including data loading strategies, batching considerations, and state management approaches. The talk showcases how Ray Data's and Ray Data LLM's abstractions can replace complex actor coordination patterns, demonstrating a path from prototype-scale scripts to production-ready pipelines that handle real-world computer vision datasets.
Target Audience
ML engineers working with large-scale vision datasets, researchers scaling computer vision experiments, and teams building production ML pipelines.
Talk Outline (40 minutes)
Problem Context (8 minutes)
1. Challenges in large-scale image processing workflows
2. Common patterns: actor coordination, queue management, state tracking
3. Trade-offs between simplicity and scale in existing approaches
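As a toy illustration of the hand-rolled coordination these pipelines tend to accumulate (names and structure are hypothetical, not from any particular system), the "before" picture is typically a work queue, worker threads, and manually locked shared state — exactly the plumbing a dataset abstraction is meant to absorb:

```python
import queue
import threading

work: queue.Queue = queue.Queue()
results: list[str] = []
lock = threading.Lock()

def worker() -> None:
    """Drain the queue until a poison pill (None) arrives."""
    while True:
        item = work.get()
        if item is None:
            work.task_done()
            return
        with lock:                      # manual state tracking
            results.append(f"processed {item}")
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(20):
    work.put(i)
for _ in threads:                        # one poison pill per worker
    work.put(None)
work.join()                              # block until every item is handled
print(len(results))
```

Every piece here — queue sizing, shutdown signaling, locking — is bespoke code that must also grow retry, checkpoint, and monitoring logic as the pipeline scales.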
Ray Data Approach (15 minutes)
1. Ray Data fundamentals for batch processing
2. Integration patterns with vLLM for vision-language models
3. Code walkthrough: data loading, batching, and result handling
4. Resource management and GPU sharing strategies
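One GPU-sharing idea the outline touches on is fractional resource requests: several lightweight workers can share a single device. A back-of-the-envelope sketch (the helper and the numbers are illustrative, not a Ray API):

```python
def max_concurrent_workers(num_gpus: int, gpu_fraction_per_worker: float) -> int:
    """How many workers fit on the cluster when each requests a
    fractional share of a GPU (in the style of fractional resource requests)."""
    if not 0 < gpu_fraction_per_worker <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(num_gpus / gpu_fraction_per_worker)

# e.g. 4 GPUs with each captioning worker requesting half a GPU
print(max_concurrent_workers(4, 0.5))
```

The practical trade-off: smaller fractions increase concurrency but risk GPU memory contention, so the fraction must leave headroom for the model's weights and KV cache.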
Production Considerations (5 minutes)
1. Checkpointing and restart strategies
2. Error handling and monitoring approaches
3. Performance characteristics and optimization techniques
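The checkpoint-and-restart strategy above can be sketched in a few lines: persist the set of completed image ids, and on restart process only what is missing. This is a simplified file-based stand-in (all names hypothetical), far cruder than what a production pipeline would use, but it captures the idempotent-restart idea:

```python
import json
import tempfile
from pathlib import Path

def load_done(ckpt: Path) -> set[str]:
    """Read the set of already-captioned image ids from the checkpoint file."""
    if ckpt.exists():
        return set(json.loads(ckpt.read_text()))
    return set()

def run(images: list[str], ckpt: Path) -> list[str]:
    """Caption only images not recorded in the checkpoint, then update it."""
    done = load_done(ckpt)
    newly = [img for img in images if img not in done]
    # ... batched captioning of `newly` would happen here ...
    ckpt.write_text(json.dumps(sorted(done | set(newly))))
    return newly

ckpt = Path(tempfile.mkdtemp()) / "done.json"
first = run(["a.jpg", "b.jpg", "c.jpg"], ckpt)
second = run(["a.jpg", "b.jpg", "c.jpg", "d.jpg"], ckpt)  # simulated restart
print(first, second)
```

On the second run only the new image is processed, so a crash-and-restart cycle never recaptions completed work.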
Q&A (2 minutes)
Key Technical Points
1. Practical Ray Data usage patterns for vision workloads
2. vLLM integration for efficient batch inference
3. Resource optimization techniques for GPU-intensive pipelines
4. State management without external coordination systems
Speaker Bio
Anindya is a Machine Learning Platform Engineer at Zoox, building scalable infrastructure for distributed training of LLMs and VLMs. Previously at Lyft, he led the development of Spark Notebooks on Kubernetes to accelerate ML prototyping. He has worked across LLMOps, MLOps, and data infrastructure, and has built systems for training, serving, and monitoring ML models at scale using Kubernetes, Spark, and modern ML tooling.