PyData Seattle 2025

08:00
55min
Registration & Breakfast
Room 301B
08:55
10min
Opening Notes
Room 301B
09:05
40min
Keynote: Josh Starmer - Communicating Concepts, Clearly Explained!!! (Or, why I don’t worry about AI taking my job and sense of purpose away from me.)
Josh Starmer

In this talk I'll discuss 3 goals I have when I try to communicate complicated topics. I'll then illustrate how I used these goals to guide the development of my most popular video, PCA Step-by-Step, which has over 3.4 million views.

Room 301B
09:45
25min
Break
Room 301B
10:10
45min
Data Loading for Data Engineers
Weston Pace

Data scientists need data to train their models. The process of feeding the training algorithm with data is loosely described as "data loading." This talk looks at the data loading process from a data engineer's perspective. We will describe common techniques such as splits, shuffling, clumping, epochs, and distribution. We will show how the way data is loaded can affect training speed and model quality. Finally, we examine what constraints these workloads put on data systems and discuss best practices for preparing a database to serve as a source for data loading.

Room 313
10:10
45min
Real-Time Context Engineering for Agents
Jim Dowling

Agents need timely and relevant context data to work effectively in an interactive environment. If an agent takes more than a few seconds to react to an action in a client application, users will not perceive it as intelligent - just laggy.

Real-time context engineering involves building real-time data pipelines to pre-process application data and serve relevant and timely context to agents. This talk will focus on how you can leverage application identifiers (user ID, session ID, article ID, order ID, etc.) to identify which real-time context data to provide to agents. We will contrast this approach with the more traditional RAG approach of using vector indexes to retrieve chunks of relevant text using the user query. Our approach will necessitate the introduction of the Agent-to-Agent protocol, an emerging standard for defining APIs for agents.

We will also demonstrate how we provide real-time context data from applications inside Python agents using the Hopsworks feature store. We will walk through an example of an interactive application (TikTok clone).

Room 301B
10:55
45min
Building valuable Deterministic products in a Probabilistic world
John Carney

For the first time in computing history, the paradigm of designing applications is following a probabilistic approach rather than a deterministic one. Large language models have generated huge amounts of excitement among investors, management, and engineers who are using them in product development. However, according to both peer-reviewed studies and anecdotal observations, it has proven difficult to translate this optimism into business value.

Room 301B
10:55
45min
Explore Solvable and Unsolvable Equations with SymPy
Carl Kadie

Why can we solve some equations with neat formulas, while others stubbornly resist every trick we know? Equations with squares bow to the quadratic formula. Those with cubes and fourth powers also have solutions. But then the magic stops. And when we, as data scientists, add exponentials, logarithms, or trigonometric terms into models, the resulting equations often cross into territory where no closed-form solutions exist.

This talk is both fun and useful. With Python and SymPy, we’ll “cheat” our way through centuries of mathematics, testing families of equations to see when closed forms appear and when numerical methods are our only option. Attendees will enjoy surprising examples, a bit of mathematical history, and practical insight into when exact solutions exist — and when to stop searching and switch to numerical methods.
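As a taste of the kind of experiment the talk describes, here is a minimal SymPy sketch; the specific equations are illustrative choices, not necessarily the talk's examples:

```python
import sympy as sp

x = sp.symbols('x')

# Degree <= 4: closed-form roots exist and SymPy finds them.
print(sp.solve(x**2 - 3*x + 2, x))  # [1, 2]

# A quintic that is not solvable by radicals: SymPy can only return
# symbolic RootOf placeholders, not a closed-form formula.
roots = sp.solve(x**5 - x - 1, x)
print(isinstance(roots[0], sp.CRootOf))  # True

# Mix in a transcendental term and numerical methods take over.
print(sp.nsolve(sp.cos(x) - x, 1))  # ~0.739085
```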

Room 313
11:40
45min
Optimizing AI/ML Workloads: Resource Management and Cost Attribution
Saurabh Garg

The proliferation of AI/ML workloads across commercial enterprises necessitates robust mechanisms to track, inspect, and analyze their use of on-prem/cloud infrastructure. To that end, effective insights are crucial for optimizing cloud resource allocation as workload demand increases, while mitigating cloud infrastructure costs and promoting operational stability.

This talk will outline an approach to systematically monitor, inspect, and analyze AI/ML workloads' properties such as runtime, resource demand/utilization, and cost attribution tags. By implementing granular inspection across multiple teams and projects, organizations can gain actionable insights into resource bottlenecks, identify opportunities for cost savings, and enable AI/ML platform engineers to directly attribute infrastructure costs to specific workloads.

Cost attribution of infrastructure usage by AI/ML workloads focuses on key metrics such as compute node group information, CPU usage seconds, data transfer, GPU allocation, and memory and ephemeral storage utilization. It enables platform administrators to identify competing workloads that lead to diminishing ROI. Answering questions from data scientists like "Why did my workload run for 6 hours today, when it took only 2 hours yesterday?" or "Why did my workload start 3 hours behind schedule?" also becomes easier.

Through our work on Metaflow, we will showcase how we built a comprehensive framework for transparent usage reporting, cost attribution, performance optimization, and strategic planning for future AI/ML initiatives. Metaflow is a human-centric Python library that enables seamless scaling and management of AI/ML projects.

Ultimately, a well-defined usage tracking system empowers organizations to maximize the return on investment from their AI/ML endeavors while maintaining budgetary control and operational efficiency. Platform engineers and administrators will be able to gain insights into the following operational aspects of supporting a battle-hardened ML platform:

1. Optimize resource allocation: Understand consumption patterns to right-size clusters and allocate resources more efficiently, reducing idle time and preventing bottlenecks.

2. Proactively manage capacity: Forecast future resource needs based on historical usage trends, ensuring the infrastructure can scale effectively with increasing workload demand.

3. Facilitate strategic planning: Make informed decisions regarding future infrastructure investments and scaling strategies.

4. Diagnose workload execution delays: Identify resource contention, queuing issues, or insufficient capacity leading to delayed workload starts.

Data scientists, on the other hand, will gain clarity on the factors that influence workload performance. Tuning them can lead to efficiencies in runtime and associated cost profiles.

Room 301B
11:40
45min
Red Teaming AI: Getting Started with PyRIT for Safer Generative AI Systems
Roman Lutz

As generative AI systems become more powerful and widely deployed, ensuring safety and security is critical. This talk introduces AI red teaming—systematically probing AI systems to uncover potential risks—and demonstrates how to get started using PyRIT (Python Risk Identification Toolkit), an open-source framework for automated and semi-automated red teaming of generative AI systems. Attendees will leave with a practical understanding of how to identify and mitigate risks in AI applications, and how PyRIT can help along the way.

Room 313
12:25
60min
Lunch
Room 301B
13:25
45min
Keynote: Zaheera Valani - Driving Data Democratization with the Databricks Data Intelligence Platform
Zaheera Valani

Join us for our Keynote with Zaheera Valani

Room 301B
14:10
25min
Break
Room 301B
14:35
45min
Know Your Data(Frame) with Paguro: Declarative and Composable Validation and Metadata using Polars
Bernardo Dionisi

Modern data pipelines are fast and expressive, but ensuring data quality is often not as straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library designed on top of the Polars DataFrame library. Paguro enables users to validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate the lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.

Room 313
14:35
45min
Polars on Spark: Unlocking Performance with Arrow Python UDFs
Allison Wang, Shujing Yang

PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built on Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining these two technologies unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver performance speedups compared to Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be used with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.

Room 301B
14:35
45min
Wrangling Internet-scale Image Datasets
Carlos Garcia Jurado Suarez, Nicholas Merchant

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million-image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible.
In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WDS and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque into observable.
Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
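One of those patterns, the file manifest, can be as simple as a JSONL file listing each object with its size and checksum. The following stdlib-only sketch is hypothetical, not the authors' actual tooling:

```python
import hashlib
import json
import pathlib
import tempfile

def build_manifest(root: pathlib.Path, manifest_path: pathlib.Path) -> int:
    """Write one JSON line per file: relative path, byte size, sha256."""
    count = 0
    with open(manifest_path, "w") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            record = {
                "path": str(path.relative_to(root)),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count

# Tiny demo on a throwaway directory (manifest written outside the root
# so it never lists itself).
root = pathlib.Path(tempfile.mkdtemp())
(root / "a.txt").write_text("hello")
(root / "b.txt").write_text("world")
manifest = pathlib.Path(tempfile.mkdtemp()) / "manifest.jsonl"
print(build_manifest(root, manifest))  # 2
```

With millions of files, a manifest like this turns "list the bucket again" into a single sequential read, and the checksums make reruns and integrity checks cheap.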

Room 301A
15:20
45min
Generalized Additive Models: Explainability Strikes Back
Pedro Albuquerque

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) strike a rare balance: they combine the flexibility of complex models with the clarity of simple ones.

They often achieve performance comparable to black-box models, yet remain:
- Easy to interpret
- Computationally efficient
- Aligned with the growing demand for transparency in AI

With recent U.S. AI regulations (White House, 2022) and increasing pressure from decision-makers for explainable models, GAMs are emerging as a natural choice across industries.


Audience

This guide is for readers with some background in Python and statistics, including:
- Data scientists
- Machine learning engineers
- Researchers


Takeaway

By the end, you’ll understand:
- The intuition behind GAMs
- How to build and apply them in practice
- How to interpret and explain GAM predictions and results in Python


Prerequisites

You should be comfortable with:
- Basic regression concepts
- Model regularization
- The bias–variance trade-off
- Python programming

Room 313
15:20
45min
Multi-Series Forecasting at Scale with StatsForecast
Khuyen Tran, Yibei Hu

Learn how to build fast and reliable retail demand forecasts using StatsForecast, an open-source Python library for scalable statistical forecasting. This session will cover techniques including rolling-origin cross-validation and conformal prediction, with practical retail demand examples.

Room 301B
15:20
45min
Panel: Building Data-Driven Startups with User-Centric Design
Eloisa Elias T, Yinhan Liu, Joshua Ahmed, Yujian Tang, Pedro Luraschi

Creating successful data products requires more than just powerful algorithms; it demands a deep understanding of user needs. In this panel, founders and leaders from innovative data-driven startups share their strategies for designing user-centric data products, including Python-based tools.

Room 301A
16:05
45min
Actually using GPs in practice with PyMC
Bill Engels

This talk will be about the Gaussian process (GP) functionality in the open source Python package PyMC, and how to use GPs effectively for models in the real world. The goal will be to bridge the (wide!) gap between theory and practice, using an example from baseball. By the end of the talk you'll know what's possible in PyMC and how to avoid common pitfalls.

Room 313
16:05
45min
We don't dataframe shame: A love letter to dataframes
Devin Petersohn

This lighthearted educational talk explores the wild west of dataframes. We discuss where dataframes got their origin (it wasn't R), how dataframes have evolved over time, and why dataframe is such a confusing term (what even is a dataframe?). We will look at what makes dataframes special from both a theoretical computer science perspective (the math is brief, I promise!) and from a technology landscape perspective. This talk doesn't advocate for any specific tool or technology, but instead surveys the broad field of dataframes as a whole.

Room 301B
08:00
60min
Registration & Breakfast
Room 301B
09:00
40min
Keynote: Chang She - Never Send a Human to do an Agent's Search
Chang She

Keynote by Chang She

Room 301B
09:45
25min
Break
Room 301B
10:10
45min
How to Optimize your Python Program for Slowness: Inspired by New Turing Machine Results
Carl Kadie

Many talks show how to make Python code faster. This one flips the script: what if we try to make our Python as slow as possible? By exploring deliberately inefficient programs — from infinite loops to Turing machines that halt only after an astronomically long time — we’ll discover surprising lessons about computation, large numbers, and the limits of programming languages. Inspired by new Turing machine results, this talk will connect Python experiments with deep questions in theoretical computer science.
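For a taste of deliberate slowness, consider this trivial sketch (my own illustration, not necessarily one of the talk's examples): innocent-looking double recursion already explodes exponentially.

```python
def slow_fib(n: int) -> int:
    """Naive double recursion: exponentially many calls, no memoization on purpose."""
    if n < 2:
        return n
    return slow_fib(n - 1) + slow_fib(n - 2)

# Each +1 to n multiplies the work by roughly the golden ratio (~1.618),
# so slow_fib(100) would outlive the hardware -- a tiny glimpse of how
# short programs can run for astronomically long (yet finite) times.
print(slow_fib(20))  # 6765
```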

Room 301B
10:10
45min
Practical Quantization in Keras: Running Large Models on Small Devices
Jyotinder Singh

Large language models are often too large to run on personal machines, requiring specialized hardware with massive memory. Quantization provides a way to shrink models, speed them up, and reduce memory usage - all while retaining most of their accuracy.

This talk introduces the fundamentals of neural network quantization, key techniques, and demonstrates how to apply them using Keras’s extensible quantization framework.

Room 313
10:10
45min
There and back again... by ferry or I-5?
Justin Castilla

Living on Washington State’s peninsula offers endless beauty, nature, and commuting challenges. In this talk, I’ll share how I built an agentic AI system that creates and compares optimal routes to the mainland, factoring in ferry schedules, costs, driving distances, and live traffic. Originally a testbed for the Model Context Protocol (MCP) framework, this project now manages my travel schedule, generates expense estimates, and sends timely notifications for events. I’ll give a comprehensive overview of MCP, show how to quickly turn ideas into working agentic AI, and discuss practical integration with real-world APIs. Attendees will leave with actionable insights and a roadmap for building their own agentic AI solutions.

Room 301A
10:55
45min
Building Agents with Agent Bricks and MCP
Denny Lee

Want to create AI agents that can do more than just generate text? Join us to explore how combining Databricks' Agent Bricks with the Model Context Protocol (MCP) unlocks powerful tool-calling capabilities. We'll show you how MCP provides a standardized way for AI agents to interact with external tools, data and APIs, solving the headache of fragmented integration approaches. Learn to build agents that can retrieve both structured and unstructured data, execute custom code and tackle real enterprise challenges.

Room 301A
10:55
45min
Scaling Background Noise Filtration for AI Voice Agents
Stephen Cheng

In the world of AI voice agents, especially in sensitive contexts like healthcare, audio clarity is everything. Background noise—a barking dog, a TV, street sounds—degrades transcription accuracy, leading to slower, clunkier, and less reliable AI responses. But how do you solve this in real-time without breaking the bank?

This talk chronicles our journey at a health-tech startup to ship background noise filtration at scale. We'll start with the core principles of noise reduction and our initial experiments with open-source models, then dive deep into the engineering architecture required to scale a compute-hungry ML service using Python and Kubernetes. You'll learn about the practical, operational considerations of deploying third-party models and, most importantly, how to measure their true impact on the product.

Room 313
10:55
45min
Taming the Data Tsunami: An Open-Source Playbook to Get Ready for ML
David Aronchick

Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?

In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.

Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.

Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.

Room 301B
11:40
45min
Explainable AI for Biomedical Image Processing
Ojas Ankurbhai Ramwala

Advancements in deep learning for biomedical image processing have led to the development of promising algorithms across multiple clinical domains, including radiology, digital pathology, ophthalmology, cardiology, and dermatology, among others. With robust AI models demonstrating commendable results, it is crucial to understand that their limited interpretability can impede the clinical translation of deep learning algorithms. The inference mechanism of these black-box models is not entirely understood by clinicians, patients, regulatory authorities, and even algorithm developers, thereby exacerbating safety concerns. In this interactive talk, we will explore some novel explainability techniques designed to interpret the decision-making process of robust deep learning algorithms for biomedical image processing. We will also discuss the impact and limitations of these techniques and analyze their potential to provide medically meaningful algorithmic explanations. Open-source resources for implementing these interpretability techniques using Python will be covered to provide a holistic understanding of explaining deep learning models for biomedical image processing.

This talk is distilled from a course that Ojas Ramwala designed, which received the best seminar award for the highest graduate student enrollment at the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle.

Room 313
11:40
45min
Supercharging Multimodal Feature Engineering with Lance and Ray
Jack Ye

Efficient feature engineering is key to unlocking modern multimodal AI workloads. In this talk, we’ll dive deep into how Lance - an open-source format with built-in indexing, random access, and data evolution - works seamlessly with Ray’s distributed compute and UDF capabilities. We’ll walk through practical pipelines for preprocessing, embedding computation, and hybrid feature serving, highlighting concrete patterns attendees can take home to supercharge their own multimodal pipelines. See https://lancedb.github.io/lance/integrations/ray to learn more about this integration.

Room 301B
11:40
45min
Why Models Break Your Pipelines (and How to Make Them First-Class Citizens)
Everett Kleven

Most AI pipelines still treat models like Python UDFs, just another function bolted onto Spark, Pandas, or Ray. But models aren’t functions: they’re expensive, stateful, and difficult to configure. In this talk, we’ll explore why this mental model breaks at scale and share practical patterns for treating models as first-class citizens in your pipelines.

Room 301A
12:25
60min
Lunch
Room 301B
13:25
45min
Lightning Talks

Sign up for a 5-minute lightning talk at the NumFOCUS booth on Friday.

Room 301B
14:10
25min
Break
Room 301B
14:35
45min
Building Inference Workflows with Tile Languages
Andy Terrel

The world of generative AI is expanding. New models are hitting the market daily. The field has bifurcated between model training and model inference. The need for fast inference has led to the development of numerous tile languages. These languages use concepts from linear algebra and borrow common NumPy APIs. In this talk we will show how tiling works and how to build inference models from scratch in pure Python with embedded tile languages. The goal is to provide attendees with a good overview that can be integrated into common data pipelines.

Room 313
14:35
45min
Evaluation is all you need
Sebastian Duerr

LLM apps fail without reliable, reproducible evaluation. This talk maps the open-source evaluation landscape, compares leading techniques (RAGAS, Evaluation Driven Development) and frameworks (DeepEval, Phoenix, LangFuse, and Braintrust), and shows how to combine tests, RAG-specific evals, and observability to ship higher-quality systems.
Attendees leave with a decision checklist, code patterns, and a production-ready playbook.

Room 301A
14:35
45min
The Missing 78%: What We Learned When Our Community Doubled Overnight
Noor Aftab

Women make up only 22% of data and AI roles and contribute just 3% of Python commits, leaving a “missing 78%” of untapped talent and perspective. This talk shares what happened when our community doubled overnight, revealing hidden demand for inclusive spaces in scientific Python.

We’ll present the data behind this growth, examine systemic barriers, and introduce the VIM framework (Visibility–Invitation–Mechanism) — a research-backed model for building resilient, inclusive communities. Attendees will leave with practical, reproducible strategies to grow engagement, improve retention, and ensure that the future of AI and Python is shaped by all voices, not just the few.

Room 301B
15:20
45min
Democratizing (Py)Data: Remote computing for all
C.A.M. Gerlach

PhD students, postdocs and independent researchers often struggle when trying to scale their code and data beyond their local machine, to a HPC cluster or the cloud. This is even more difficult if they don’t happen to have access to IT staff and resources to set up the necessary infrastructure, as is the case in many developing countries. We introduce a new open source, extensible remote development architecture, supported in version 6.1 of the Spyder scientific environment and IDE, that allows users to manage packages, browse files and run code remotely on a completely austere host from the comfort of their local machine.

Room 313
15:20
45min
Prompt Variation as a Diagnostic Tool: Exposing Contamination, Memorization, and True Capability in LLMs
Aziza Mirsaidova

Prompt variation isn't just an engineering nuisance; it's a window into fundamental LLM limitations. When a model's accuracy drops from 95% to 75% due to minor rephrasing, we're not just seeing brittleness; we're potentially exposing data contamination, spurious correlations, and shallow pattern matching. This talk explores prompt variation as a powerful diagnostic tool for understanding LLM reliability. We discuss how small changes in format, phrasing, or ordering can cause accuracy to collapse, revealing that models are memorizing benchmark patterns or learning superficial correlations rather than robust task representations. Drawing from academic and industry research, you will learn to distinguish between an LLM's true capability and memorization, identify when models are pattern-matching rather than reasoning, and build evaluation frameworks that expose these vulnerabilities before deployment.

Room 301B
15:20
45min
Securing Retrieval-Augmented Generation: How to Defend Vector Databases Against 2025 Threats
Rajesh

Modern LLM applications rely heavily on embeddings and vector databases for retrieval-augmented generation (RAG). But in 2025, researchers and OWASP flagged vector databases as a new attack surface — from embedding inversion (recovering sensitive training text) to poisoned vectors that hijack prompts. This talk demystifies these threats for practitioners and shows how to secure your RAG pipeline with real-world techniques like encrypted stores, anomaly detection, and retrieval validation. Attendees will leave with a practical security checklist for keeping embeddings safe while still unlocking the power of retrieval.

Room 301A
16:05
45min
Beyond Just Prediction: Causal Thinking in Machine Learning
Avik Basu

Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.

In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will showcase how uplift modeling helps identify who will respond positively to interventions while avoiding those whom the intervention might deter.

Room 313
16:05
45min
Diversity Panel: Data for All: Empowering Underrepresented Voices in Data Science and Analytics
Eloisa Elias T, Anquida Adams, Micheleen Harris, Oli Dinov, Heejoon Ahn

Data science has the power to shape industries and societies. This panel will focus on empowering underrepresented groups in data science through education, access to tools, and career opportunities. Panelists will share their journeys, discuss the importance of democratizing data skills, and explore how to make the field more accessible to diverse talent.

Room 301A
16:05
45min
Unlocking Parallel PyTorch Inference (and More!) with Python Free-Threading
Trent Nelson

From the speaker who got kicked off the stage after 54 minutes of his 45-minute PyParallel talk at PyData NYC 2013, comes a new talk foaming about the virtues of Python's new free-threaded support!

Room 301B
17:30
120min
Conference Social

Join your fellow conference attendees and local meetup members at Bellevue Brewing Company - Spring District Brewpub 12190 NE District Wy, Bellevue, WA 98005

https://maps.app.goo.gl/3HSM4WvPXSfVWS3f7

Room 301B
08:00
60min
Registration & Breakfast
Room 127
09:00
90min
Building Intelligent DIY Robots: From Hardware to Vision Systems
FTC 18225 High Definition

In this talk, Ethan Lee, lead programmer of an FTC (FIRST Tech Challenge) high school robotics team, and Jake Poznanski, startup founder and software engineer, will show how software, hardware, and data converge to build intelligent robots. Ethan will discuss how FTC robots apply computer vision, including OpenCV and neural networks, to convert raw camera data into autonomous robot action. He will also examine the challenges of operating under strict computation constraints, such as latency, calibration, and synchronization. Jake will explore the process of creating a DIY robot, such as CAD design, electronics, and message passing.

Room 121
09:00
90min
Scaling Large-Scale Interactive Data Visualization with Accelerated Computing
Allison Ding

As datasets continue to grow in both size and complexity, CPU-based visualization pipelines often become bottlenecks, slowing down exploratory data analysis and interactive dashboards. In this session, we’ll demonstrate how GPU acceleration can transform Python-based interactive visualization workflows, delivering speedups of up to 50x with minimal code changes. Using libraries such as hvPlot, Datashader, cuxfilter, and Plotly Dash, we’ll walk through real-world examples of visualizing both tabular and unstructured data and demonstrate how RAPIDS, a suite of open-source GPU-accelerated data science libraries from NVIDIA, accelerates these workflows. Attendees will learn best practices for accelerating preprocessing, building scalable dashboards, and profiling pipelines to identify and resolve bottlenecks. Whether you are an experienced data scientist or developer, you’ll leave with practical techniques to instantly scale your interactive visualization workflows on GPUs.

Room 118
09:00
90min
There's no place like home: using AI agents in Jupyter notebooks
Sarah Kaiser

This talk explores how AI agents integrated directly into Jupyter notebooks can help with every part of your data science work. We'll cover the latest notebook-focused agentic features in VS Code, demonstrating how they automate tedious tasks like environment management or graph styling, turn your "scratch notebook" into shareable code, and more generally streamline data science workflows directly in notebooks.

Room 127
10:30
30min
Break
Room 127
11:00
90min
Building Bazel Packages for AI/ML: SciPy, PyTorch, and Beyond
Ramesh Oswal, Jiten Oswal

AI/ML workloads depend heavily on complex software stacks, including numerical computing libraries (SciPy, NumPy), deep learning frameworks (PyTorch, TensorFlow), and specialized toolchains (CUDA, cuDNN). However, integrating these dependencies into Bazel-based workflows remains challenging due to compatibility issues, dependency resolution, and performance optimization. This session explores the process of creating and maintaining Bazel packages for key AI/ML libraries, ensuring reproducibility, performance, and ease of use for researchers and engineers.

Room 121
11:00
90min
Building a Deep Research Agentic Workflow
Nidhin Pattaniyil, Ravi Kumar Yadav

OpenAI and Gemini's Deep Research offerings are a great way to get a detailed research report on a topic.

In this beginner friendly tutorial, we’ll walk through building a simple lightweight agent workflow to perform deep research.

Room 127
11:00
90min
Going From Notebooks to Production Code
Catherine Nelson, Robert Masson

Do you need to move your code from notebooks into production? Or do you want to level up your software engineering skills? In this tutorial, we will show you how to turn a Jupyter notebook into a robust, reproducible Python script. You will learn how to use tools for converting notebooks into scripts, how to make your code modular, and how to write unit tests.

Room 118
11:00
90min
How to make datamap web-apps of embedding vectors via open source tooling
John Tigue

Datamaps are ML-powered visualizations of high-dimensional data; in this talk, the data consists of collections of embedding vectors. Interactive datamaps run in-browser as web-apps, potentially without any code running on the web server. Datamap tech can be used to visualize, say, the entire collection of chunks in a RAG vector database.

The best-of-breed tools of this new datamap technique are liberally licensed open source. This presentation is an introduction to building with those repos. The maths will be mentioned only in passing; the topic here is simply how-to with specific tools. Attendees will learn about Python tools that produce high-quality web UIs.

DataMapPlot is the premier tool for rendering a datamap as a web-app. Here is a live demo:
https://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b

00-25: Intro to datamaps
25-45: Pipeline architecture
45-55: Demos touring tools such as UMAP, HDBSCAN, DataMapPlot, Toponomy, etc.
55-90: Group coding

A Google account is required to log in to Google Colab, where participants can run the workshop notebooks. A Hugging Face API key (token) is needed to download Gemma models.

Room 122
12:30
12:30
60min
Lunch
Room 127
13:30
13:30
210min
GPU Accelerated Python
Andy Terrel

Accelerating Python using the GPU is much easier than you might think. We will explore the powerful CUDA-enabled Python ecosystem in this tutorial through hands-on examples using some of the most popular accelerated scientific computing libraries.

Room 127
13:30
90min
LLMs, Chatbots, and Dashboards: Visualize and Analyze Your Data with Natural Language
Daniel Chen

LLMs have a lot of hype around them these days. Let's demystify how they work and see how we can put them in context for data science use. As data scientists, we want our results to be inspectable, reliable, reproducible, and replicable, and we already have many tools to help us on this front. However, LLMs pose a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel and using those behaviors in our data science artifacts. This talk will introduce you to LLMs, the Chatlas package, and how they can be integrated into a Shiny app to create an AI-powered dashboard (using querychat). We'll see how we can leverage the tasks LLMs are good at to improve our data science products.

Room 118
13:30
90min
Newcomer Sprint!
C.A.M. Gerlach, Eloisa Elias T, Fangchen Li, Rachel Wagner-Kaiser, Jake Stevens-Haas, Joseph Holsten

Looking to contribute to open source, but not sure where to start? Want to level up your skills in debugging, programming, collaboration, and more? Curious about how to fix a bug or add a feature you're missing in your favorite software project? Come to our special newcomer sprint to learn how and try it for yourself! Newcomers to Python or open source are welcome and encouraged, as are attendees with open source experience who can help guide them!

Room 121
15:00
15:00
30min
Break
Room 118
15:30
15:30
90min
Subgraph Isomorphism at Scale with Data Science Tools
Esteban Ginez

Traditional subgraph isomorphism algorithms like VF2 rely on sequential tree-search that can't leverage parallel computing. This talk introduces Δ-Motif, a data-centric approach that transforms graph matching into data operations using Python's data science stack.
Δ-Motif decomposes graphs into small "motifs" to reconstruct matches. By representing graphs as tabular data with RAPIDS cuDF and Pandas, we achieve 10-595X speedups over VF2 without custom GPU kernels.
I'll demonstrate practical applications from social networks to quantum computing, and show when GPU acceleration provides the biggest benefits for graph analysis problems. Perfect for data scientists working with network analysis, recommendation systems, or pattern matching at scale.
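The core idea of recasting subgraph matching as data operations can be sketched in plain Python (a toy illustration of join-style triangle matching, not the Δ-Motif implementation; the talk's version expresses these joins as cuDF/Pandas table operations):

```python
from collections import defaultdict

# Edge table for a small undirected graph, stored in both directions
# so "joins" can extend a partial match from either endpoint.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
edge_table = edges + [(b, a) for a, b in edges]

adj = defaultdict(set)
for a, b in edge_table:
    adj[a].add(b)

# Join edges (a, b) with edges (b, c), then keep rows where (c, a)
# also exists — the relational analogue of matching a triangle motif.
triangles = set()
for a, b in edge_table:
    for c in adj[b]:
        if c != a and a in adj[c]:
            triangles.add(frozenset((a, b, c)))

print(sorted(tuple(sorted(t)) for t in triangles))  # [(0, 1, 2)]
```

Because each step is a bulk join/filter over a table rather than a sequential tree search, the same logic maps naturally onto vectorized engines like cuDF, which is where the talk's reported speedups come from.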

Room 121
15:30
90min
The Problem of Address Matching: a Journey through NLP and AI
Ivan Perez Avellaneda

The problem of address matching arises when the address of a single physical place is written in two or more different ways. This situation is very common in companies that receive customer records from different sources. The differences can be classified as syntactic or semantic. In the first type, the meaning is the same but the writing differs: for example, "Street" vs. "St". In the second type, the meaning is not exactly the same: for example, "Road" instead of "Street". To match addresses, we have a couple of approaches. The first, and simpler, uses similarity metrics. The second uses natural language processing and transformers. This is a hands-on talk intended for data process analysts. We will go through these solutions implemented in a Jupyter notebook using Python.
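The first, similarity-metric approach can be sketched with the standard library's difflib (a minimal illustration; the 0.8 threshold is an arbitrary assumption, and the talk's notebook may use different metrics):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("123 Main Street", "123 Main St"),    # syntactic difference
    ("123 Main Street", "123 Main Road"),  # semantic difference
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "match" if score >= 0.8 else "no match"  # illustrative threshold
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
```

Note how the syntactic pair scores high while the semantic pair does not: character-level metrics handle abbreviations well, which is exactly the gap the transformer-based approach is meant to close.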

Room 118