PyData Berlin 2025

AI-Ready Data in Action: Powering Smarter Agents
2025-09-01, B09

This hands-on workshop focuses on what AI engineers do most often: making data AI-ready and turning it into production-useful applications. Together with dltHub and LanceDB, you’ll walk through an end-to-end workflow: collecting and preparing real-world data with best practices, managing it in LanceDB, and powering AI applications with search, filters, hybrid retrieval, and lightweight agents. By the end, you’ll know how to move from raw data to functional, production-ready AI setups without the usual friction. We will also touch on multi-modal data and on taking this end-to-end use case to production.


Modern AI applications are only as powerful as the data that fuels them. Yet much of the real-world data AI engineers encounter is messy, incomplete, or unoptimized. In this hands-on tutorial, AI-Ready Data in Action: Powering Smarter Agents, participants will walk through the full lifecycle of preparing unstructured data, embedding it into LanceDB, and leveraging it for search and agentic applications. Using a real-world dataset, attendees will incrementally ingest, clean, and vectorize text data, tune hybrid search strategies, and build a lightweight chat agent to surface relevant results. The tutorial concludes by showing how to take a working demo into production. By the end, participants will gain practical experience in bridging the gap between messy raw data and production-ready pipelines for AI applications.
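The ingest-clean-vectorize lifecycle described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the workshop's code: `clean_text` and `embed` are hypothetical names, and the hash-based "embedding" stands in for a real embedding model.

```python
import hashlib
import math

def clean_text(raw: str) -> str:
    """Basic preprocessing: collapse whitespace and lowercase."""
    return " ".join(raw.split()).lower()

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic 'embedding': hash words into a fixed-size,
    L2-normalized vector. A real pipeline would call an embedding model."""
    vec = [0.0] * dim
    for word in text.split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Raw" records, shaped like what an incremental loader might yield.
raw_records = [
    {"id": 1, "text": "  Crash when loading   LARGE parquet files "},
    {"id": 2, "text": "Docs: add example for hybrid search"},
]

# Ingest: clean and vectorize each record before it is stored.
table = [
    {"id": r["id"],
     "text": clean_text(r["text"]),
     "vector": embed(clean_text(r["text"]))}
    for r in raw_records
]
print(table[0]["text"])  # → "crash when loading large parquet files"
```

In the tutorial itself, dlt handles the ingestion side (schema evolution, incremental loads) and LanceDB stores the vectors; the sketch only shows the shape of the data flowing between them.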

Prior knowledge

  • Basic Python programming.
  • Awareness of embeddings, vectors, and AI search concepts (we’ll explain where needed).

The tutorial is designed to be accessible: engineers familiar with Python should be able to follow along step by step.

Key Takeaways

By the end of the tutorial, participants will:

  1. Understand the end-to-end workflow of taking raw, real-world data and preparing it for AI applications.
  2. Build and run an incremental dlt pipeline to ingest real data into LanceDB.
  3. Apply text preprocessing and generate embeddings for semantic search.
  4. Optimize retrieval with vector and hybrid search strategies.
  5. Implement a lightweight AI agent capable of surfacing relevant issues from a natural language description.
  6. Learn how to transition from a demo project to a production setup using LanceDB Cloud.
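Takeaway 4, the difference between pure vector and hybrid retrieval, can be illustrated with a small self-contained sketch: cosine similarity for the vector side, query-term overlap for the lexical side, blended with a weight. The `alpha` parameter and both scoring functions are illustrative only, not LanceDB's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the document."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_search(query_vec, query_text, docs, alpha=0.5):
    """Blend vector and keyword scores; alpha=1.0 is pure vector search."""
    scored = [
        (alpha * cosine(query_vec, d["vector"])
         + (1 - alpha) * keyword_score(query_text, d["text"]), d)
        for d in docs
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)

docs = [
    {"text": "crash loading parquet files", "vector": [1.0, 0.0]},
    {"text": "hybrid search example docs", "vector": [0.0, 1.0]},
]
results = hybrid_search([1.0, 0.0], "parquet crash", docs, alpha=0.5)
print(results[0][1]["text"])  # → "crash loading parquet files"
```

Tuning `alpha` (and the distance metric behind the vector score) is exactly the kind of retrieval optimization the tutorial covers with LanceDB's real implementation.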

Outline

  • Introduce dlt (data load tool) and how it enables schema evolution, incremental loading, and normalization in pipelines.
  • Introduce LanceDB and explain embeddings, vector search, hybrid retrieval and multi-modal data for AI applications.
  • Ingest and preprocess a real dataset with dlt, generate embeddings, and load it into LanceDB following best data engineering practices.
  • Optimize search in LanceDB by tuning parameters, selecting distance metrics, and adding hybrid retrieval.
  • Build a lightweight AI agent that queries LanceDB and returns the most relevant issues from natural-language prompts.
  • Demonstrate the path to production using automation, monitoring, and LanceDB Cloud for scaling and reliability.
  • Conclude with key takeaways and an open Q&A.
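The agent step in the outline reduces to a retrieve-then-respond loop. Here is a minimal stdlib-only sketch under that assumption; the term-overlap retrieval is a placeholder for an actual LanceDB query, and the string formatting stands in for an LLM synthesizing the answer.

```python
def retrieve(query: str, issues: list[dict], top_k: int = 2) -> list[dict]:
    """Placeholder retrieval: rank issues by query-term overlap.
    In the workshop this would be a LanceDB vector/hybrid query."""
    q = set(query.lower().split())
    ranked = sorted(
        issues,
        key=lambda i: len(q & set(i["title"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def agent(query: str, issues: list[dict]) -> str:
    """Retrieve relevant issues and format an answer.
    A real agent would hand the hits to an LLM for synthesis."""
    hits = retrieve(query, issues)
    lines = [f"- #{i['id']}: {i['title']}" for i in hits]
    return "Possibly related issues:\n" + "\n".join(lines)

issues = [
    {"id": 101, "title": "crash when loading parquet files"},
    {"id": 102, "title": "add dark mode to dashboard"},
    {"id": 103, "title": "parquet reader memory leak"},
]
print(agent("my app crashes reading parquet", issues))
```

Even at this toy scale, the structure mirrors the workshop's agent: a natural-language description goes in, the most relevant issues come back.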

Expected audience expertise (domain):

Novice

Prerequisites:

TBA

Abstract as a tweet (X) or toot (Mastodon):

TBA

Violetta Mishechkina leads Solutions Engineering at dltHub, helping teams build AI-ready data pipelines using the open-source library dlt. With a background in ML and MLOps, she focuses on turning messy, real-world data into reliable inputs for production systems. Over the past few years, Violetta has led several workshops on AI and data engineering, sharing practical insights with data teams across industries.

Chang is the CEO/Co-founder of LanceDB and has been making data tooling for ML/AI for almost two decades.
One of the original co-authors of the pandas project, Chang started LanceDB to make it easy for AI teams to work with all of the data that doesn't fit neatly into dataframes - from embeddings to images, from audio to video, at petabyte scale.