PyData Seattle 2025
Real-time machine learning depends on features and data that by definition can’t be pre-computed. Detecting fraud, powering accurate chatbots, and serving product recommendations at scale all require processing events that emerged only seconds ago and building context from a multitude of data sources. How do we build an infrastructure platform that executes complex data pipelines end-to-end, on demand, in under 5 ms?
All while meeting data teams where they are: in Python, the language of ML. We’ll share how we built a Symbolic Python Interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized and run at scale with Velox, an open-source (~4k stars) unified C++ query engine from Meta.
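As a hypothetical sketch of the transpilation idea (not the production interpreter), operator overloading is enough to turn ordinary-looking Python arithmetic into a static expression DAG instead of executing it eagerly:

```python
# Hypothetical sketch: proxy objects record operations into an expression
# DAG rather than computing values; all names here are illustrative.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Expr("add", self, other)

    def __mul__(self, other):
        return Expr("mul", self, other)

    def __repr__(self):
        return f"{self.op}({', '.join(map(repr, self.args))})"

class Col(Expr):
    def __init__(self, name):
        super().__init__("col", name)

def feature_pipeline(amount, risk_score):
    # Ordinary Python; under symbolic tracing it builds a static DAG that
    # a query engine like Velox can optimize and execute.
    return amount * risk_score + amount

dag = feature_pipeline(Col("amount"), Col("risk_score"))
print(dag)  # add(mul(col('amount'), col('risk_score')), col('amount'))
```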
Testing and building document-based AI agents typically requires access to real documents, which can be sensitive or hard to access. Synthetic data can be a safe and flexible alternative. This talk explores how synthetic data can be used to prototype and evaluate document-based AI agents without requiring access to sensitive or proprietary documents.
In this session, we will cover why synthetic data is increasingly relevant for LLM training as well as for the development of successful AI agent workflows. We will address the challenges and expectations of synthetic document generation, such as capturing the structure, semantics, and dependencies of real-world documents.
We will also cover how to integrate synthetic data into the design of multi-agent systems while ensuring their coordination, reliability, and reproducibility. Finally, we will showcase a practical workflow for document retrieval, demonstrating how synthetic data can accelerate experimentation and improve trust in agent-driven systems.
By focusing on both the role of synthetic data and the complexity of AI agent workflow design, this talk highlights hands-on strategies for safe experimentation while addressing challenges of reproducibility and scalability.
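To make the generation step concrete, here is a minimal, hypothetical sketch using the Faker library; the invoice schema and field names are illustrative, not the talk's actual pipeline:

```python
# A minimal sketch of structured synthetic-document generation with Faker.
# The schema captures structure (fields) and a simple dependency (the total
# derived from line items), mimicking a real-world document.
from faker import Faker

fake = Faker()

def synthetic_invoice():
    items = [
        {"description": fake.bs(),
         "amount": round(fake.pyfloat(min_value=10, max_value=500), 2)}
        for _ in range(3)
    ]
    return {
        "vendor": fake.company(),
        "customer": fake.name(),
        "date": fake.date(),
        "line_items": items,
        "total": round(sum(i["amount"] for i in items), 2),
    }

docs = [synthetic_invoice() for _ in range(5)]
```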
Data scientists need data to train their models. The process of feeding the training algorithm with data is loosely described as "data loading." This talk looks at the data loading process from a data engineer's perspective. We will describe common techniques such as splits, shuffling, clumping, epochs, and distribution, and show how the way data is loaded affects training speed and model quality. Finally, we examine what constraints these workloads put on data systems and discuss best practices for preparing a database to serve as a source for data loading.
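As a taste of these techniques, a minimal pure-NumPy sketch of splitting, per-epoch shuffling, and batching:

```python
# A minimal sketch of core data-loading concepts: train/validation split,
# a fresh shuffle every epoch, and fixed-size batches.
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)                      # stand-in for training records

split = int(0.9 * len(data))                # train/validation split
train, valid = data[:split], data[split:]

def batches(records, batch_size, epochs):
    for epoch in range(epochs):
        order = rng.permutation(len(records))   # reshuffle each epoch
        for start in range(0, len(records), batch_size):
            yield records[order[start:start + batch_size]]

for batch in batches(train, batch_size=32, epochs=2):
    pass  # feed the batch to the training step
```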
AI initiatives don’t stall because of weak models or scarce GPUs—they stall because organizations (and their LLMs) can’t reliably find, connect, and trust their own tabular data. Traditional catalogs promised order but turned into graveyards of stale metadata: manually curated, impossible to maintain, and blind to the messy realities of enterprise-scale environments.
What’s needed is a semantic foundation that doesn’t just document data, but deterministically maps it—every valid join, entity, and lineage verifiable against the data itself.
This talk explores methods designed for that reality: statistical profiling to reveal true distributions, functional type detection to identify natural keys and relationships, deterministic join validation to separate signal from noise, and entity-centric mapping that organizes data around business concepts rather than table names. These approaches automate what was once brittle and manual, keeping catalogs alive, current, and grounded in evidence.
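For instance, deterministic join validation can be as simple as checking key uniqueness and referential coverage against the data itself; a pandas sketch with illustrative table and column names:

```python
# A sketch of deterministic join validation: test a candidate key by
# verifying uniqueness on one side and coverage on the other.
import pandas as pd

def validate_join(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    right_keys = right[key].dropna()
    left_keys = left[key].dropna()
    return {
        "right_key_unique": right_keys.is_unique,        # natural-key test
        "coverage": left_keys.isin(right_keys).mean(),   # share of matches
    }

orders = pd.DataFrame({"customer_id": [1, 2, 2, 3, None]})
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
print(validate_join(orders, customers, "customer_id"))
# {'right_key_unique': True, 'coverage': 1.0}
```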
Agents need timely and relevant context data to work effectively in an interactive environment. If an agent takes more than a few seconds to react to an action in a client application, users will not perceive it as intelligent, just laggy.
Real-time context engineering involves building real-time data pipelines to pre-process application data and serve relevant and timely context to agents. This talk will focus on how you can leverage application identifiers (user ID, session ID, article ID, order ID, etc.) to identify which real-time context data to provide to agents. We will contrast this approach with the more traditional RAG approach of using vector indexes to retrieve chunks of relevant text based on the user query. Our approach will necessitate introducing the Agent-to-Agent protocol, an emerging standard for defining APIs for agents.
We will also demonstrate how we provide real-time context data from applications inside Python agents using the Hopsworks feature store. We will walk through an example of an interactive application (a TikTok clone).
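A hedged sketch of the ID-based lookup pattern with the Hopsworks Python API; the feature-view name, version, and keys are illustrative assumptions:

```python
# A sketch of ID-keyed context lookup with the Hopsworks feature store.
# The feature-view name and entity keys are illustrative assumptions.
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Serve precomputed real-time context features keyed by application IDs.
fv = fs.get_feature_view(name="user_session_context", version=1)
context = fv.get_feature_vector({"user_id": 42, "session_id": "abc123"})
# `context` can now be injected into the agent's prompt or tool inputs.
```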
Why can we solve some equations with neat formulas, while others stubbornly resist every trick we know? Equations with squares bow to the quadratic formula. Those with cubes and fourth powers also have solutions. But then the magic stops. And when we, as data scientists, add exponentials, logarithms, or trigonometric terms into models, the resulting equations often cross into territory where no closed-form solutions exist.
This talk is both fun and useful. With Python and SymPy, we’ll “cheat” our way through centuries of mathematics, testing families of equations to see when closed forms appear and when numerical methods are our only option. Attendees will enjoy surprising examples, a bit of mathematical history, and practical insight into when exact solutions exist — and when to stop searching and switch to numerical methods.
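A small SymPy taste of the theme: a quartic yields a closed form, while a transcendental equation forces a numerical method:

```python
import sympy as sp

x = sp.symbols("x")

# Degree <= 4: closed forms exist (this quartic factors nicely).
print(sp.solve(sp.Eq(x**4 - 5*x**2 + 4, 0), x))   # [-2, -1, 1, 2]

# cos(x) = x has no closed-form solution; switch to a numerical solver.
print(sp.nsolve(sp.cos(x) - x, x, 1))             # ~0.739085 (the Dottie number)
```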
The proliferation of AI/ML workloads across commercial enterprises necessitates robust mechanisms to track, inspect, and analyze their use of on-prem/cloud infrastructure. Effective insights are crucial for optimizing cloud resource allocation as workload demand increases, while mitigating cloud infrastructure costs and promoting operational stability.
This talk will outline an approach to systematically monitor, inspect, and analyze AI/ML workload properties like runtime, resource demand/utilization, and cost attribution tags. By implementing granular inspection across multiple teams and projects, organizations can gain actionable insights into resource bottlenecks, identify opportunities for cost savings, and enable AI/ML platform engineers to directly attribute infrastructure costs to specific workloads.
Cost attribution of infrastructure usage by AI/ML workloads focuses on key metrics such as compute node group information, CPU usage seconds, data transfer, GPU allocation, and memory and ephemeral storage utilization. It enables platform administrators to identify competing workloads that lead to diminishing ROI. Answering questions from data scientists like "Why did my workload run for 6 hours today, when it took only 2 hours yesterday?" or "Why did my workload start 3 hours behind schedule?" also becomes easier.
Through our work on Metaflow, we will showcase how we built a comprehensive framework for transparent usage reporting, cost attribution, performance optimization, and strategic planning for future AI/ML initiatives. Metaflow is a human-centric Python library that enables seamless scaling and management of AI/ML projects.
Ultimately, a well-defined usage tracking system empowers organizations to maximize the return on investment from their AI/ML endeavors while maintaining budgetary control and operational efficiency. Platform engineers and administrators will be able to gain insights into the following operational aspects of supporting a battle-hardened ML platform:
1. Optimize resource allocation: Understand consumption patterns to right-size clusters and allocate resources more efficiently, reducing idle time and preventing bottlenecks.
2. Proactively manage capacity: Forecast future resource needs based on historical usage trends, ensuring the infrastructure can scale effectively with increasing workload demand.
3. Facilitate strategic planning: Make informed decisions regarding future infrastructure investments and scaling strategies.
4. Diagnose workload execution delays: Identify resource contention, queuing issues, or insufficient capacity leading to delayed workload starts.
Data scientists, on the other hand, will gain clarity on the factors that influence workload performance; tuning them can lead to efficiencies in runtime and associated cost profiles.
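As one concrete pattern (a hedged sketch, not the full framework), Metaflow steps can declare resource requests that usage tracking and cost attribution can later be tied to; runs can additionally be tagged (e.g., "python train_flow.py run --tag team:recsys") so that costs roll up by team:

```python
# A minimal sketch of attributable workloads in Metaflow: per-step resource
# requests give the platform something to meter and bill against.
from metaflow import FlowSpec, resources, step

class TrainFlow(FlowSpec):

    @resources(cpu=4, memory=16000)   # requests usable for cost attribution
    @step
    def start(self):
        self.model = "trained"        # placeholder for real training work
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```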
As generative AI systems become more powerful and widely deployed, ensuring safety and security is critical. This talk introduces AI red teaming—systematically probing AI systems to uncover potential risks—and demonstrates how to get started using PyRIT (Python Risk Identification Toolkit), an open-source framework for automated and semi-automated red teaming of generative AI systems. Attendees will leave with a practical understanding of how to identify and mitigate risks in AI applications, and how PyRIT can help along the way.
Modern data pipelines are fast and expressive, but ensuring data quality is often not as straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library designed on top of the Polars DataFrame library. Paguro enables users to validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate the lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.
PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built on Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining these two technologies unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver performance speedups compared to Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be used with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.
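A hedged sketch of the pattern using DataFrame.mapInArrow, which hands the UDF Arrow record batches that Polars can wrap without copying:

```python
# A sketch of zero-copy interop: mapInArrow passes Arrow record batches to
# the UDF, and Polars wraps them without serialization overhead.
import polars as pl
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

def double_x(batches):
    for batch in batches:
        pdf = pl.from_arrow(pa.Table.from_batches([batch]))  # zero-copy wrap
        out = pdf.with_columns((pl.col("x") * 2).alias("x"))
        yield from out.to_arrow().to_batches()

result = df.mapInArrow(double_x, schema="x long")
result.show(3)
```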
Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million-image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible.
In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WebDataset (WDS), and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque into observable.
Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
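As one example of the manifest pattern (a minimal sketch with illustrative paths): record every file's path, size, and hash once in Parquet, so later stages can plan work without re-listing storage:

```python
# A sketch of a file manifest: scan once, record per-file metadata, and
# persist it in Parquet for downstream pipeline stages.
import hashlib
from pathlib import Path
import pandas as pd

def build_manifest(root: str) -> pd.DataFrame:
    rows = []
    for path in Path(root).rglob("*.jpg"):
        rows.append({
            "path": str(path),
            "bytes": path.stat().st_size,
            "sha1": hashlib.sha1(path.read_bytes()).hexdigest(),
        })
    return pd.DataFrame(rows)

manifest = build_manifest("images/")          # illustrative directory
manifest.to_parquet("manifest.parquet", index=False)
```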
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) strike a rare balance: they combine the flexibility of complex models with the clarity of simple ones.
They often achieve performance comparable to black-box models, yet remain:
- Easy to interpret
- Computationally efficient
- Aligned with the growing demand for transparency in AI
With recent U.S. AI regulations (White House, 2022) and increasing pressure from decision-makers for explainable models, GAMs are emerging as a natural choice across industries.
Audience
This guide is for readers with some background in Python and statistics, including:
- Data scientists
- Machine learning engineers
- Researchers
Takeaway
By the end, you’ll understand:
- The intuition behind GAMs
- How to build and apply them in practice
- How to interpret and explain GAM predictions and results in Python
Prerequisites
You should be comfortable with:
- Basic regression concepts
- Model regularization
- The bias–variance trade-off
- Python programming
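As a taste of the hands-on portion, a minimal sketch using pyGAM, one open-source option for fitting GAMs in Python (synthetic data for illustration):

```python
# Each s(i) term is a smooth spline on feature i, and partial dependence
# makes every term's contribution directly inspectable.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, 500)

gam = LinearGAM(s(0) + s(1)).fit(X, y)
gam.summary()

# Inspect the learned shape of each feature's effect.
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    XX = gam.generate_X_grid(term=i)
    effect = gam.partial_dependence(term=i, X=XX)
```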
Learn how to build accurate retail demand forecasts using MLForecast, an open-source Python library that automates feature engineering and machine learning models, with practical examples for common retail scenarios.
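A hedged sketch of the MLForecast workflow; the input file is an illustrative stand-in for retail sales data in the library's expected long format (unique_id, ds, y):

```python
# Lag and date features are declared once and generated automatically for
# any sklearn-style model.
import pandas as pd
import lightgbm as lgb
from mlforecast import MLForecast

df = pd.read_csv("retail_sales.csv", parse_dates=["ds"])  # illustrative file

fcst = MLForecast(
    models=[lgb.LGBMRegressor()],
    freq="D",
    lags=[7, 14, 28],                # demand from 1, 2, and 4 weeks ago
    date_features=["dayofweek", "month"],
)
fcst.fit(df)
preds = fcst.predict(h=28)           # 28-day-ahead forecast per series
```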
This lighthearted educational talk explores the wild west of dataframes. We discuss where dataframes got their origin (it wasn't R), how dataframes have evolved over time, and why dataframe is such a confusing term (what even is a dataframe?). We will look at what makes dataframes special from both a theoretical computer science perspective (the math is brief, I promise!) and from a technology landscape perspective. This talk doesn't advocate for any specific tool or technology, but instead surveys the broad field of dataframes as a whole.
Large language models are often too large to run on personal machines, requiring specialized hardware with massive memory. Quantization provides a way to shrink models, speed them up, and reduce memory usage - all while retaining most of their accuracy.
This talk introduces the fundamentals of neural network quantization, key techniques, and demonstrates how to apply them using Keras’s extensible quantization framework.
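To ground the fundamentals, a from-scratch NumPy sketch of affine int8 quantization, the core idea behind most post-training schemes (this illustrates the math, not Keras's API):

```python
# Affine quantization: map a float range onto 256 integer bins, then map
# back; the reconstruction error is what quantization-aware methods manage.
import numpy as np

def quantize(w: np.ndarray):
    scale = (w.max() - w.min()) / 255.0          # float range -> 256 bins
    zero_point = np.round(-w.min() / scale)      # bin that represents 0.0
    q = np.clip(np.round(w / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(w)
print(np.abs(w - dequantize(q, scale, zp)).max())  # small reconstruction error
```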
Living on Washington State’s peninsula offers endless beauty, nature, and commuting challenges. In this talk, I’ll share how I built an agentic AI system that creates and compares optimal routes to the mainland, factoring in ferry schedules, costs, driving distances, and live traffic. Originally a testbed for the Model Context Protocol (MCP) framework, this project now manages my travel schedule, generates expense estimates, and sends timely notifications for events. I’ll give a comprehensive overview of MCP, show how to quickly turn ideas into working agentic AI, and discuss practical integration with real-world APIs. Attendees will leave with actionable insights and a roadmap for building their own agentic AI solutions.
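As a flavor of MCP, a hedged sketch of exposing one tool with the Python SDK's FastMCP helper; the ferry-schedule logic is a stub stand-in:

```python
# A sketch of an MCP tool server. In the real system the tool body would
# call a live ferry-schedule API; here it returns canned times.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ferry-planner")

@mcp.tool()
def next_sailings(terminal: str, count: int = 3) -> list[str]:
    """Return the next departure times from a ferry terminal."""
    return ["10:05", "11:25", "12:45"][:count]   # stubbed schedule

if __name__ == "__main__":
    mcp.run()   # an MCP client (e.g., an agent) can now call next_sailings
```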
Most AI pipelines still treat models like Python UDFs, just another function bolted onto Spark, Pandas, or Ray. But models aren’t functions: they’re expensive, stateful, and unreliable. In this talk, we’ll explore why this mental model breaks at scale and share practical patterns for treating models as first-class citizens in your pipelines.
In the world of AI voice agents, especially in sensitive contexts like healthcare, audio clarity is everything. Background noise—a barking dog, a TV, street sounds—degrades transcription accuracy, leading to slower, clunkier, and less reliable AI responses. But how do you solve this in real-time without breaking the bank?
This talk chronicles our journey at a health-tech startup to ship background noise filtration at scale. We'll start with the core principles of noise reduction and our initial experiments with open-source models, then dive deep into the engineering architecture required to scale a compute-hungry ML service using Python and Kubernetes. You'll learn about the practical, operational considerations of deploying third-party models and, most importantly, how to measure their true impact on the product.
Processing large-scale image datasets for captioning presents coordination challenges that often lead to complex, difficult-to-maintain systems. I've been exploring how Ray Data can simplify these workflows while improving throughput and reliability. This talk demonstrates how to build image captioning pipelines combining Ray Data's batch processing capabilities, Ray Data LLM's batch inference capabilities, and vLLM for efficient model serving.
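A hedged sketch of the pipeline shape with Ray Data's map_batches; the captioning model is a stub stand-in for a vLLM-served model, and the paths are illustrative:

```python
# Read images, then run batched captioning with a stateful class so the
# model loads once per actor rather than once per batch.
import numpy as np
import ray

class StubCaptioner:
    """Stand-in for a real (e.g., vLLM-served) captioning model."""
    def generate(self, images):
        return ["a photo"] * len(images)

class CaptionBatch:
    def __init__(self):
        self.model = StubCaptioner()   # load weights once per actor

    def __call__(self, batch):
        batch["caption"] = np.array(self.model.generate(batch["image"]))
        return batch

ds = (
    ray.data.read_images("s3://bucket/images/")   # illustrative path
    .map_batches(CaptionBatch, concurrency=4, batch_size=32)
)
ds.write_parquet("s3://bucket/captions/")
```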
Advancements in deep learning for biomedical image processing have led to the development of promising algorithms across multiple clinical domains, including radiology, digital pathology, ophthalmology, cardiology, and dermatology, among others. With robust AI models demonstrating commendable results, it is crucial to understand that their limited interpretability can impede the clinical translation of deep learning algorithms. The inference mechanism of these black-box models is not entirely understood by clinicians, patients, regulatory authorities, and even algorithm developers, thereby exacerbating safety concerns. In this interactive talk, we will explore some novel explainability techniques designed to interpret the decision-making process of robust deep learning algorithms for biomedical image processing. We will also discuss the impact and limitations of these techniques and analyze their potential to provide medically meaningful algorithmic explanations. Open-source resources for implementing these interpretability techniques using Python will be covered to provide a holistic understanding of explaining deep learning models for biomedical image processing.
This talk is distilled from a course that Ojas Ramwala designed, which received the best seminar award for the highest graduate student enrollment at the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle.
Efficient feature engineering is key to unlocking modern multimodal AI workloads. In this talk, we’ll dive deep into how Lance - an open-source format with built-in indexing, random access, and data evolution - works seamlessly with Ray’s distributed compute and UDF capabilities. We’ll walk through practical pipelines for preprocessing, embedding computation, and hybrid feature serving, highlighting concrete patterns attendees can take home to supercharge their own multimodal pipelines. See https://lancedb.github.io/lance/integrations/ray to learn more about this integration.
Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?
In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.
Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.
Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.
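To make "declarative, policy-as-code" concrete, a hypothetical sketch of filtering at the source; every name here is illustrative rather than a specific tool's API:

```python
# Policies are data; a tiny engine applies them before anything is shipped
# to central storage or training clusters.
POLICIES = [
    {"name": "drop_low_quality", "field": "blur_score", "op": "lt", "value": 0.3},
    {"name": "drop_pii", "field": "contains_pii", "op": "eq", "value": True},
]

OPS = {"lt": lambda a, b: a < b, "eq": lambda a, b: a == b}

def admit(record: dict) -> bool:
    """Return True if the record survives every drop policy."""
    return not any(OPS[p["op"]](record[p["field"]], p["value"]) for p in POLICIES)

events = [
    {"blur_score": 0.9, "contains_pii": False},
    {"blur_score": 0.1, "contains_pii": False},   # dropped at the source
]
kept = [e for e in events if admit(e)]
```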
The world of generative AI is expanding, with new models hitting the market daily. The field has bifurcated between model training and model inference, and the need for fast inference has led to the development of numerous tile languages. These languages use concepts from linear algebra and borrow common NumPy APIs. In this talk we will show how tiling works and how to build inference models from scratch in pure Python with embedded tile languages. The goal is to provide attendees with a good overview that can be integrated into common data pipelines.
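To show the core idea before any GPU enters the picture, a pure-NumPy sketch of tiled matrix multiplication, the same block decomposition that tile languages map onto hardware:

```python
# Tiling: compute the matmul block by block; each (i, j) output tile
# accumulates products of small input tiles.
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A, B = np.random.rand(256, 128), np.random.rand(128, 512)
assert np.allclose(tiled_matmul(A, B), A @ B)
```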
LLM apps fail without reliable, reproducible evaluation. This talk maps the open‑source evaluation landscape, compares leading techniques (RAGAS, G-Eval, graders) and frameworks (DeepEval, Phoenix, LangFuse, OpenAI Evals), and shows how to combine unit tests, RAG‑specific evals, and observability to ship higher‑quality systems.
Attendees leave with a decision checklist, code patterns, and a production‑ready playbook.
Women make up only 22% of data and AI roles and contribute just 3% of Python commits, leaving a “missing 78%” of untapped talent and perspective. This talk shares what happened when our community doubled overnight, revealing hidden demand for inclusive spaces in scientific Python.
We’ll present the data behind this growth, examine systemic barriers, and introduce the VIM framework (Visibility–Invitation–Mechanism) — a research-backed model for building resilient, inclusive communities. Attendees will leave with practical, reproducible strategies to grow engagement, improve retention, and ensure that the future of AI and Python is shaped by all voices, not just the few.
Fine-tuning improves what an LLM knows, but it does little to guarantee how the model behaves under real-world prompt variation. Small changes in format, phrasing, or ordering can cause accuracy to collapse, exposing brittle decision boundaries. This talk presents practical methods for evaluating and improving robustness, including FORMATSPREAD for estimating performance spread, DivSampling for generating diverse stress tests, mixture-of-formats for structured variation, and alignment-aware techniques such as adversarial contrast sets and multilingual perturbations. We also show how post-training optimization with Direct Preference Optimization (DPO) can integrate robustness feedback into the alignment loop.
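A minimal FORMATSPREAD-style sketch of the evaluation idea, where a hypothetical ask_model callable stands in for the LLM under test: score the same task under several semantically equivalent formats and report the spread:

```python
# Accuracy spread across equivalent prompt formats; a large spread signals
# brittle decision boundaries.
FORMATS = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "{q}\nRespond with the answer only.",
]

def accuracy_under_format(fmt: str, dataset, ask_model) -> float:
    hits = sum(ask_model(fmt.format(q=q)).strip() == gold for q, gold in dataset)
    return hits / len(dataset)

def format_spread(dataset, ask_model) -> float:
    scores = [accuracy_under_format(f, dataset, ask_model) for f in FORMATS]
    return max(scores) - min(scores)
```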
Modern LLM applications rely heavily on embeddings and vector databases for retrieval-augmented generation (RAG). But in 2025, researchers and OWASP flagged vector databases as a new attack surface — from embedding inversion (recovering sensitive training text) to poisoned vectors that hijack prompts. This talk demystifies these threats for practitioners and shows how to secure your RAG pipeline with real-world techniques like encrypted stores, anomaly detection, and retrieval validation. Attendees will leave with a practical security checklist for keeping embeddings safe while still unlocking the power of retrieval.
Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.
In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will showcase how uplift modeling helps identify who will respond positively to an intervention while avoiding those whom it might deter.
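As a minimal illustration of the idea (a two-model T-learner sketch with scikit-learn on synthetic data, not the talk's production approach):

```python
# T-learner uplift: fit separate outcome models for treated and control,
# then score the difference in predicted response.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
treated = rng.integers(0, 2, 5000).astype(bool)
# Synthetic outcome: feature 0 drives the response to the intervention.
y = (rng.random(5000) < 0.2 + 0.3 * treated * (X[:, 0] > 0)).astype(int)

m_t = GradientBoostingClassifier().fit(X[treated], y[treated])
m_c = GradientBoostingClassifier().fit(X[~treated], y[~treated])

uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
# Target customers with the highest predicted uplift, not the highest
# raw purchase probability.
top = np.argsort(-uplift)[:500]
```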
This talk examines multi-threaded parallel inference on PyTorch models using the new no-GIL, free-threaded version of Python. Using a simple 124M parameter GPT-2 model that we train from scratch, we explore the new territory unlocked by free-threaded Python: parallel PyTorch model inference, where multiple threads, unimpeded by the GIL, generate text from a transformer-based model in parallel.
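A hedged sketch of the pattern, with a small stand-in transformer rather than the talk's GPT-2; on a free-threaded build these threads run forward passes concurrently instead of serializing on the GIL:

```python
# Several threads share one model and run inference concurrently.
import threading
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
model.eval()

def worker(thread_id: int):
    x = torch.randn(1, 16, 64)           # one dummy sequence per thread
    with torch.inference_mode():
        out = model(x)                    # no GIL contention on 3.13t builds
    print(thread_id, out.shape)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```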
Over the past few years, large language models (LLMs) have transformed the AI landscape, becoming an integral part of our daily workflows. Now, a new wave of AI innovation is emerging: AI agents. These agents go beyond static responses - they can reason, take actions, use tools, and solve multi-step problems, often with minimal human guidance. This evolution is driven by advancements like tool use and the Model Context Protocol (MCP), which enable models to interact with real-world environments. In this hands-on workshop, participants will learn how to build practical, task-oriented AI agents.
Hosted by a 5x Worlds-qualifying robotics team from Bellevue, WA.
As datasets continue to grow in both size and complexity, CPU-based visualization pipelines often become bottlenecks, slowing down exploratory data analysis and interactive dashboards. In this session, we’ll demonstrate how GPU acceleration can transform Python-based interactive visualization workflows, delivering speedups of up to 50x with minimal code changes. Using libraries such as hvPlot, Datashader, cuxfilter, and Plotly Dash, we’ll walk through real-world examples of visualizing both tabular and unstructured data and demonstrate how RAPIDS, a suite of open-source GPU-accelerated data science libraries from NVIDIA, accelerates these workflows. Attendees will learn best practices for accelerating preprocessing, building scalable dashboards, and profiling pipelines to identify and resolve bottlenecks. Whether you are an experienced data scientist or developer, you’ll leave with practical techniques to instantly scale your interactive visualization workflows on GPUs.
This talk explores how AI agents integrated directly into Jupyter notebooks can help with every part of your data science work. We'll cover the latest notebook-focused agentic features in VS Code, demonstrating how they automate tedious tasks like environment management or graph styling, enhance your "scratch notebook" to sharable code, and more generally streamline data science workflows directly in notebooks.
While AI copilots like Cursor and Claude Code have recently revolutionized software engineering workflows, many data scientists have so far been let down by the promise of AI. In this session, we’ll explore the capabilities of the Sphinx copilot, a Jupyter-native tool built specifically for data scientists, and learn tips and tricks on how to best leverage it to accelerate analytical workflows.
AI/ML workloads depend heavily on complex software stacks, including numerical computing libraries (SciPy, NumPy), deep learning frameworks (PyTorch, TensorFlow), and specialized toolchains (CUDA, cuDNN). However, integrating these dependencies into Bazel-based workflows remains challenging due to compatibility issues, dependency resolution, and performance optimization. This session explores the process of creating and maintaining Bazel packages for key AI/ML libraries, ensuring reproducibility, performance, and ease of use for researchers and engineers.
In this hands-on tutorial, we’ll walk through building a lightweight, agent-style workflow that takes a user-specified topic and uses retrieval-augmented generation (RAG) to perform deep research, summarize insights, and generate a podcast-style script. We’ll also show how to convert that script into audio using a simple text-to-speech tool.
This is a beginner-friendly, practical workshop that introduces key concepts in agent task design and content orchestration using LLMs.
Do you need to move your code from notebooks into production? Or do you want to level up your software engineering skills? In this tutorial, we will show you how to turn a Jupyter notebook into a robust, reproducible Python script. You will learn how to use tools for converting notebooks into scripts, how to make your code modular, and how to write unit tests.
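As a preview of the refactor target, a minimal sketch: a notebook cell becomes a pure function in a module, which a pytest test can then exercise (file names are illustrative):

```python
# analysis.py -- notebook logic extracted into a testable function.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and normalize column names."""
    out = df.dropna().copy()
    out.columns = [c.strip().lower() for c in out.columns]
    return out

# test_analysis.py -- a unit test pytest can discover and run.
def test_clean_drops_missing_and_normalizes():
    df = pd.DataFrame({" Age ": [1, None], "Name": ["a", "b"]})
    result = clean(df)
    assert list(result.columns) == ["age", "name"]
    assert len(result) == 1
```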
LLMs have a lot of hype around them these days. Let’s demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. However, LLMs pose a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel, and using those behaviors in our data science artifacts. This talk will introduce you to LLMs, the Chatlas package, and how they can be integrated into a Shiny app to create an AI-powered dashboard (using querychat). We’ll see how we can leverage the tasks LLMs are good at to better our data science products.
Have you ever wished your Python libraries were faster? That they could run on GPUs without changing your huge codebase, switching to a different library, or rewriting everything in a faster language (C, Rust)? Discover how an API dispatching mechanism redirects function calls to a faster implementation in a separate backend package, through approaches like Python's entry_points specification and the Array API standard, as used in various Scientific Python projects!
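A minimal sketch of the entry_points approach; the group and backend names are illustrative assumptions:

```python
# A library looks up faster backends registered by separately installed
# packages via packaging metadata, importing them lazily.
from importlib.metadata import entry_points

def get_backend(name: str = "default"):
    # Backend packages advertise themselves in their pyproject.toml:
    # [project.entry-points."mylib.backends"]
    # gpu = "mylib_gpu:Backend"
    for ep in entry_points(group="mylib.backends"):
        if ep.name == name:
            return ep.load()()          # import and instantiate on demand
    raise LookupError(f"no backend named {name!r} installed")

# A call like mylib.sort(x) can then be routed to
# get_backend("gpu").sort(x) when a GPU backend is installed.
```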
How can you use LLMs in professional settings where cloud APIs are off-limits due to cost, privacy, or compliance? In this talk, we’ll explore how to run powerful, open-source models like Mistral and LLaMA locally — and make them useful in the real world.
We’ll cover the engineering patterns, trade-offs, and deployment approaches that make local LLMs production-ready. You’ll learn how to build a private internal knowledge assistant that runs completely offline using RAG (retrieval-augmented generation), local embeddings, and quantized models. A short live demo will show it in action — answering organization-specific questions without sending a single token to the cloud.
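A hedged sketch of the fully offline loop using llama-cpp-python and sentence-transformers; the model path and documents are illustrative:

```python
# Local RAG: local embeddings for retrieval, a quantized local model for
# generation; no tokens leave the machine.
import numpy as np
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # local embeddings
llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

docs = ["Expense reports are due the 5th.", "VPN issues: contact IT ops."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def answer(question: str) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    context = docs[int(np.argmax(doc_vecs @ q_vec))]   # top-1 retrieval
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt, max_tokens=128)["choices"][0]["text"]

print(answer("When are expense reports due?"))
```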
DataMaps are ML-powered visualizations of high-dimensional data, and in this talk the data is collections of embedding vectors. Interactive DataMaps run in-browser as web-apps, potentially without any code running on the web server. DataMap tech can be used to visualize, say, the entire collection of chunks in a RAG vector database.
The best-of-breed tools of this new DataMap technique are liberally licensed open source. This presentation is an introduction to building with those repos. The maths will be mentioned only in passing; the topic here is simply how-to with specific tools. Talk attendees will be learning about Python tools, which produce high-quality web UIs.
DataMapPlot is the premier tool for rendering a DataMap as a web-app. Here is a live demo:
http://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b
00-10: Intro to DataMaps
10-15: A pipeline blueprint and a DataMap file format that gets assembled by pipelines
15-35: Demo tour of tools such as UMAP, HDBSCAN, DataMapPlot, Toponymy, etc.
35-40: Q & A
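A hedged sketch of the pipeline stages covered in the demo tour; random vectors stand in for real embeddings, and the DataMapPlot call's arguments may differ by version:

```python
# Embed -> reduce (UMAP) -> cluster (HDBSCAN) -> render (DataMapPlot).
import numpy as np
import hdbscan
import umap
import datamapplot

# Stand-in for real embedding vectors (e.g., encoded RAG chunks).
vectors = np.random.default_rng(0).normal(size=(2000, 384))

coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(vectors)
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(coords)

# Render a self-contained interactive web app; no server-side code needed.
names = np.array([f"cluster {l}" if l >= 0 else "noise" for l in labels])
fig = datamapplot.create_interactive_plot(coords, names)
fig.save("datamap.html")
```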
Fast iteration is the backbone of machine learning innovation. I’ve been exploring how to enable ML engineers to prototype and scale training workloads with minimal friction and maximal flexibility - all without leaving the comfort of Python. This talk demonstrates how Ray can be used as a powerful framework for accelerating ML development workflows through standalone persistent Ray clusters as well as ephemeral per-job Ray clusters.
Traditional subgraph isomorphism algorithms like VF2 rely on sequential tree-search that can't leverage parallel computing. This talk introduces Δ-Motif, a data-centric approach that transforms graph matching into data operations using Python's data science stack.
Δ-Motif decomposes graphs into small "motifs" to reconstruct matches. By representing graphs as tabular data with RAPIDS cuDF and Pandas, we achieve 10-595X speedups over VF2 without custom GPU kernels.
I'll demonstrate practical applications from social networks to quantum computing, and show when GPU acceleration provides the biggest benefits for graph analysis problems. Perfect for data scientists working with network analysis, recommendation systems, or pattern matching at scale.
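To preview the data-centric idea, a pure-pandas sketch that enumerates triangle motifs with self-joins on an edge table; cuDF accepts essentially the same code for the GPU path:

```python
# Graph matching as data operations: join the edge table twice to build
# 2-hop paths, then join once more to close each triangle.
import pandas as pd

edges = pd.DataFrame({"src": [0, 1, 2, 0, 3], "dst": [1, 2, 0, 3, 1]})

paths = edges.merge(edges, left_on="dst", right_on="src",
                    suffixes=("_ab", "_bc"))
triangles = paths.merge(
    edges,
    left_on=["dst_bc", "src_ab"],   # require the closing edge c -> a
    right_on=["src", "dst"],
)
# One row per rotation of each directed triangle.
print(triangles[["src_ab", "dst_ab", "dst_bc"]])
```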
The problem of address matching arises when the address of one physical place is written in two or more different ways. This situation is very common in companies that receive customer records from different sources. The differences can be classified as syntactic and semantic. In the first type, the meaning is the same but the way the addresses are written differs; for example, one can find "Street" vs. "St". In the second type, the meaning is not exactly the same; for example, one can find "Road" instead of "Street". To solve this problem and match addresses, we have a couple of approaches. The first and simpler one uses similarity metrics. The second uses natural language models and transformers. This is a hands-on talk intended for data process analysts. We will go through these solutions implemented in a Jupyter notebook using Python.
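A small sketch of both approaches: rapidfuzz for syntactic similarity and sentence-transformers for semantic similarity (the model choice is illustrative):

```python
# Syntactic matching via string similarity, semantic matching via embeddings.
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

a, b = "123 Main Street, Apt 4", "123 Main St Apt 4"

# Syntactic: token-based edit similarity handles "Street" vs "St".
print(fuzz.token_sort_ratio(a, b))            # high score, likely a match

# Semantic: embeddings can catch "Road" vs "Street" style variation.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([a, "123 Main Road, Apt 4"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())    # cosine similarity in [-1, 1]
```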