PyData Seattle 2025

Supercharging Multimodal Feature Engineering with Lance and Ray
2025-11-08, Talk Track 1

Efficient feature engineering is key to unlocking modern multimodal AI workloads. In this talk, we’ll dive deep into how Lance - an open-source format with built-in indexing, random access, and data evolution - works seamlessly with Ray’s distributed compute and UDF capabilities. We’ll walk through practical pipelines for preprocessing, embedding computation, and hybrid feature serving, highlighting concrete patterns attendees can take home to supercharge their own multimodal pipelines. See https://lancedb.github.io/lance/integrations/ray to learn more about this integration.


As AI workloads shift from purely tabular data to multimodal domains including vectors, images, audio, video, and text, the requirements for feature engineering pipelines have fundamentally changed. Teams must preprocess diverse data types at scale, compute and store large embeddings, support hybrid retrieval, and manage evolving schemas without downtime or costly rewrites. This talk explores how Lance and Ray together address these needs.

Lance is an open table and file format purpose-built for AI data. It provides native support for vectors and blobs, random access, and efficient indexing (e.g., IVF-PQ, HNSW, inverted and n-gram indexes). Its data evolution capabilities enable schema changes and backfilling without rewriting entire datasets. During the talk, we will compare Lance’s data-evolution approach with those of Parquet, Iceberg, and Delta Lake, highlighting where those solutions excel and where Lance’s design offers advantages for multimodal feature pipelines.

Ray is a widely adopted distributed compute framework for scalable data processing, training, and serving. With Ray Datasets and user-defined functions (UDFs), it becomes straightforward to parallelize transformations and embedding computation. We will walk through an end-to-end case study that starts with a simple pip install lance-ray, then demonstrates how to efficiently add new feature columns and backfill them with LLM-generated data, leveraging Ray for distributed execution and Lance for fast storage, indexing, and iteration.
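
The add-a-column-and-backfill pattern described above can be sketched conceptually in plain Python. This is only an illustration of the idea, not the lance-ray API: the dict of lists stands in for a Lance dataset, ThreadPoolExecutor stands in for Ray workers, and the names backfill_column, iter_batches, and word_count_udf are hypothetical. The point it tries to show is that a batch UDF reads only the columns it needs, runs in parallel, and its output is attached as a new column while existing columns are left untouched.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "dataset": one list per column, all the same length.
# (A stand-in for a Lance dataset; not the Lance API.)
columns = {
    "id": [0, 1, 2, 3, 4, 5],
    "caption": [
        "a cat",
        "two small dogs",
        "red car",
        "clear blue sky at noon",
        "tree",
        "an old boat",
    ],
}


def word_count_udf(batch):
    """Batch UDF: takes a dict of input columns, returns a dict of new columns."""
    return {"caption_words": [len(text.split()) for text in batch["caption"]]}


def iter_batches(cols, read_columns, batch_size):
    """Yield (start_row, batch) pairs holding only the columns the UDF reads."""
    num_rows = len(next(iter(cols.values())))
    for start in range(0, num_rows, batch_size):
        stop = start + batch_size
        yield start, {name: cols[name][start:stop] for name in read_columns}


def backfill_column(cols, udf, read_columns, batch_size=2, workers=3):
    """Run the UDF over batches in parallel and attach its output as new columns.

    Existing column lists are never copied or modified; only new lists are added.
    """
    num_rows = len(next(iter(cols.values())))
    batches = list(iter_batches(cols, read_columns, batch_size))
    # Threads stand in for distributed workers; pool.map preserves batch order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda item: udf(item[1]), batches))
    new_cols = {}
    for (start, _), out in zip(batches, results):
        for name, values in out.items():
            target = new_cols.setdefault(name, [None] * num_rows)
            target[start:start + len(values)] = values
    cols.update(new_cols)
    return cols


original_caption = columns["caption"]
backfill_column(columns, word_count_udf, read_columns=["caption"])
print(columns["caption_words"])
```

In a real pipeline the UDF would typically compute embeddings or call an LLM rather than count words, and the storage layer, not a Python dict, would be responsible for persisting the new column without rewriting the existing ones.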

To ensure this work integrates with existing enterprise stacks in production, we will also touch upon Lance's flexible namespace API, which connects Lance datasets to existing infrastructure such as Hive Metastore, AWS Glue, Unity Catalog, and Apache Polaris. We will demonstrate how to connect to these systems when using features in lance-ray, showing that Lance datasets can be registered alongside Parquet- or Iceberg-backed tables, preserving governance and discovery workflows while incrementally adopting Lance’s multimodal capabilities.

Finally, we would like to say thank you to all the community contributors who made lance-ray a reality and continue to add features, improve performance, and expand the ecosystem. Their efforts are what make it possible for practitioners to build and share efficient multimodal AI feature engineering pipelines.


Prior Knowledge Expected:

No previous knowledge expected

Jack Ye is a software engineer at LanceDB. He is a PMC member of Apache Iceberg and contributor to various open source projects in the data infra domain such as Apache Spark and Trino. Before joining LanceDB, Jack was a tech lead at AWS for products including SageMaker Lakehouse, S3 Tables, EMR and Athena integration with Iceberg and Delta Lake.