PyData Seattle 2025
Real-time machine learning depends on features and data that by definition can’t be pre-computed. Detecting fraud, powering accurate chatbots, and serving product recommendations at scale all require processing events that emerged only seconds ago and building context from a multitude of data sources. How do we build an infrastructure platform that executes complex data pipelines end-to-end, on demand, in under 5 ms?
All while meeting data teams where they are: in Python, the language of ML. We’ll share how we built a Symbolic Python Interpreter that accelerates ML pipelines by transpiling Python into DAGs of static expressions. These expressions are optimized and run at scale with Velox, an open-source (~4k stars) unified C++ query engine from Meta.
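As a hypothetical sketch of the transpilation idea (not the production interpreter), operator overloading is enough to turn ordinary-looking Python arithmetic into a static expression DAG instead of executing it eagerly:

```python
# Hypothetical sketch: proxy objects record operations into an expression
# DAG rather than computing values; all names here are illustrative.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Expr("add", self, other)

    def __mul__(self, other):
        return Expr("mul", self, other)

    def __repr__(self):
        return f"{self.op}({', '.join(map(repr, self.args))})"

class Col(Expr):
    def __init__(self, name):
        super().__init__("col", name)

def feature_pipeline(amount, risk_score):
    # Ordinary Python; under symbolic tracing it builds a static DAG that
    # a query engine like Velox can optimize and execute.
    return amount * risk_score + amount

dag = feature_pipeline(Col("amount"), Col("risk_score"))
print(dag)  # add(mul(col('amount'), col('risk_score')), col('amount'))
```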
Testing and building document-based AI agents typically requires access to real documents, which can be sensitive or hard to access. Synthetic data can be a safe and flexible alternative. This talk explores how synthetic data can be used to prototype and evaluate document-based AI agents without requiring access to sensitive or proprietary documents.
In this session, we will cover why synthetic data is increasingly relevant for LLM training as well as for the development of successful AI agent workflows. We will address the challenges and expectations of synthetic document generation, such as capturing the structure, semantics, and dependencies of real-world documents.
We will also cover how to integrate synthetic data into the design of multi-agent systems while ensuring their coordination, reliability, and reproducibility. Finally, we will showcase a practical workflow for document retrieval, demonstrating how synthetic data can accelerate experimentation and improve trust in agent-driven systems.
By focusing on both the role of synthetic data and the complexity of AI agent workflow design, this talk highlights hands-on strategies for safe experimentation while addressing challenges of reproducibility and scalability.
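To make the generation step concrete, here is a minimal, hypothetical sketch using the Faker library; the invoice schema and field names are illustrative, not the talk's actual pipeline:

```python
# A minimal sketch of structured synthetic-document generation with Faker.
# The schema captures structure (fields) and a simple dependency (the total
# derived from line items), mimicking a real-world document.
from faker import Faker

fake = Faker()

def synthetic_invoice():
    items = [
        {"description": fake.bs(),
         "amount": round(fake.pyfloat(min_value=10, max_value=500), 2)}
        for _ in range(3)
    ]
    return {
        "vendor": fake.company(),
        "customer": fake.name(),
        "date": fake.date(),
        "line_items": items,
        "total": round(sum(i["amount"] for i in items), 2),
    }

docs = [synthetic_invoice() for _ in range(5)]
```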
Data scientists need data to train their models. The process of feeding the training algorithm with data is loosely described as "data loading." This talk looks at the data loading process from a data engineer's perspective. We will describe common techniques such as splits, shuffling, clumping, epochs, and distribution, and show how the way data is loaded affects training speed and model quality. Finally, we examine what constraints these workloads put on data systems and discuss best practices for preparing a database to serve as a source for data loading.
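As a taste of these techniques, a minimal pure-NumPy sketch of splitting, per-epoch shuffling, and batching:

```python
# A minimal sketch of core data-loading concepts: train/validation split,
# a fresh shuffle every epoch, and fixed-size batches.
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)                      # stand-in for training records

split = int(0.9 * len(data))                # train/validation split
train, valid = data[:split], data[split:]

def batches(records, batch_size, epochs):
    for epoch in range(epochs):
        order = rng.permutation(len(records))   # reshuffle each epoch
        for start in range(0, len(records), batch_size):
            yield records[order[start:start + batch_size]]

for batch in batches(train, batch_size=32, epochs=2):
    pass  # feed the batch to the training step
```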
AI initiatives don’t stall because of weak models or scarce GPUs—they stall because organizations (and their LLMs) can’t reliably find, connect, and trust their own tabular data. Traditional catalogs promised order but turned into graveyards of stale metadata: manually curated, impossible to maintain, and blind to the messy realities of enterprise-scale environments.
What’s needed is a semantic foundation that doesn’t just document data, but deterministically maps it—every valid join, entity, and lineage verifiable against the data itself.
This talk explores methods designed for that reality: statistical profiling to reveal true distributions, functional type detection to identify natural keys and relationships, deterministic join validation to separate signal from noise, and entity-centric mapping that organizes data around business concepts rather than table names. These approaches automate what was once brittle and manual, keeping catalogs alive, current, and grounded in evidence.
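For instance, deterministic join validation can be as simple as checking key uniqueness and referential coverage against the data itself; a pandas sketch with illustrative table and column names:

```python
# A sketch of deterministic join validation: test a candidate key by
# verifying uniqueness on one side and coverage on the other.
import pandas as pd

def validate_join(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    right_keys = right[key].dropna()
    left_keys = left[key].dropna()
    return {
        "right_key_unique": right_keys.is_unique,        # natural-key test
        "coverage": left_keys.isin(right_keys).mean(),   # share of matches
    }

orders = pd.DataFrame({"customer_id": [1, 2, 2, 3, None]})
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
print(validate_join(orders, customers, "customer_id"))
# {'right_key_unique': True, 'coverage': 1.0}
```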
Agents need timely and relevant context data to work effectively in an interactive environment. If an agent takes more than a few seconds to react to an action in a client application, users will not perceive it as intelligent, just laggy.
Real-time context engineering involves building real-time data pipelines to pre-process application data and serve relevant and timely context to agents. This talk will focus on how you can leverage application identifiers (user ID, session ID, article ID, order ID, etc.) to identify which real-time context data to provide to agents. We will contrast this approach with the more traditional RAG approach of using vector indexes to retrieve chunks of relevant text based on the user query. Our approach will necessitate introducing the Agent-to-Agent protocol, an emerging standard for defining APIs for agents.
We will also demonstrate how we provide real-time context data from applications inside Python agents using the Hopsworks feature store. We will walk through an example of an interactive application (a TikTok clone).
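A hedged sketch of the ID-based lookup pattern with the Hopsworks Python API; the feature-view name, version, and keys are illustrative assumptions:

```python
# A sketch of ID-keyed context lookup with the Hopsworks feature store.
# The feature-view name and entity keys are illustrative assumptions.
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

# Serve precomputed real-time context features keyed by application IDs.
fv = fs.get_feature_view(name="user_session_context", version=1)
context = fv.get_feature_vector({"user_id": 42, "session_id": "abc123"})
# `context` can now be injected into the agent's prompt or tool inputs.
```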
Why can we solve some equations with neat formulas, while others stubbornly resist every trick we know? Equations with squares bow to the quadratic formula. Those with cubes and fourth powers also have solutions. But then the magic stops. And when we, as data scientists, add exponentials, logarithms, or trigonometric terms into models, the resulting equations often cross into territory where no closed-form solutions exist.
This talk is both fun and useful. With Python and SymPy, we’ll “cheat” our way through centuries of mathematics, testing families of equations to see when closed forms appear and when numerical methods are our only option. Attendees will enjoy surprising examples, a bit of mathematical history, and practical insight into when exact solutions exist — and when to stop searching and switch to numerical methods.
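A small SymPy taste of the theme: a quartic yields a closed form, while a transcendental equation forces a numerical method:

```python
import sympy as sp

x = sp.symbols("x")

# Degree <= 4: closed forms exist (this quartic factors nicely).
print(sp.solve(sp.Eq(x**4 - 5*x**2 + 4, 0), x))   # [-2, -1, 1, 2]

# cos(x) = x has no closed-form solution; switch to a numerical solver.
print(sp.nsolve(sp.cos(x) - x, x, 1))             # ~0.739085 (the Dottie number)
```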
The proliferation of AI/ML workloads across commercial enterprises necessitates robust mechanisms to track, inspect, and analyze their use of on-prem/cloud infrastructure. Effective insights are crucial for optimizing cloud resource allocation as workload demand increases, while mitigating cloud infrastructure costs and promoting operational stability.
This talk will outline an approach to systematically monitor, inspect, and analyze AI/ML workload properties like runtime, resource demand/utilization, and cost attribution tags. By implementing granular inspection across multiple teams and projects, organizations can gain actionable insights into resource bottlenecks, identify opportunities for cost savings, and enable AI/ML platform engineers to directly attribute infrastructure costs to specific workloads.
Cost attribution of infrastructure usage by AI/ML workloads focuses on key metrics such as compute node group information, CPU usage seconds, data transfer, GPU allocation, and memory and ephemeral storage utilization. It enables platform administrators to identify competing workloads that lead to diminishing ROI. Answering questions from data scientists like "Why did my workload run for 6 hours today, when it took only 2 hours yesterday?" or "Why did my workload start 3 hours behind schedule?" also becomes easier.
Through our work on Metaflow, we will showcase how we built a comprehensive framework for transparent usage reporting, cost attribution, performance optimization, and strategic planning for future AI/ML initiatives. Metaflow is a human-centric Python library that enables seamless scaling and management of AI/ML projects.
Ultimately, a well-defined usage tracking system empowers organizations to maximize the return on investment from their AI/ML endeavors while maintaining budgetary control and operational efficiency. Platform engineers and administrators will be able to gain insights into the following operational aspects of supporting a battle-hardened ML platform:
1. Optimize resource allocation: Understand consumption patterns to right-size clusters and allocate resources more efficiently, reducing idle time and preventing bottlenecks.
2. Proactively manage capacity: Forecast future resource needs based on historical usage trends, ensuring the infrastructure can scale effectively with increasing workload demand.
3. Facilitate strategic planning: Make informed decisions regarding future infrastructure investments and scaling strategies.
4. Diagnose workload execution delays: Identify resource contention, queuing issues, or insufficient capacity leading to delayed workload starts.
Data scientists, on the other hand, will gain clarity on the factors that influence workload performance; tuning them can lead to efficiencies in runtime and associated cost profiles.
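As one concrete pattern (a hedged sketch, not the full framework), Metaflow steps can declare resource requests that usage tracking and cost attribution can later be tied to; runs can additionally be tagged (e.g., "python train_flow.py run --tag team:recsys") so that costs roll up by team:

```python
# A minimal sketch of attributable workloads in Metaflow: per-step resource
# requests give the platform something to meter and bill against.
from metaflow import FlowSpec, resources, step

class TrainFlow(FlowSpec):

    @resources(cpu=4, memory=16000)   # requests usable for cost attribution
    @step
    def start(self):
        self.model = "trained"        # placeholder for real training work
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```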
As generative AI systems become more powerful and widely deployed, ensuring safety and security is critical. This talk introduces AI red teaming—systematically probing AI systems to uncover potential risks—and demonstrates how to get started using PyRIT (Python Risk Identification Toolkit), an open-source framework for automated and semi-automated red teaming of generative AI systems. Attendees will leave with a practical understanding of how to identify and mitigate risks in AI applications, and how PyRIT can help along the way.
Modern data pipelines are fast and expressive, but ensuring data quality is often not as straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library designed on top of the Polars DataFrame library. Paguro enables users to validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate the lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.
PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built on Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining these two technologies unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver performance speedups compared to Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be used with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.
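A hedged sketch of the pattern using DataFrame.mapInArrow, which hands the UDF Arrow record batches that Polars can wrap without copying:

```python
# A sketch of zero-copy interop: mapInArrow passes Arrow record batches to
# the UDF, and Polars wraps them without serialization overhead.
import polars as pl
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

def double_x(batches):
    for batch in batches:
        pdf = pl.from_arrow(pa.Table.from_batches([batch]))  # zero-copy wrap
        out = pdf.with_columns((pl.col("x") * 2).alias("x"))
        yield from out.to_arrow().to_batches()

result = df.mapInArrow(double_x, schema="x long")
result.show(3)
```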
Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million-image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible.
In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WebDataset (WDS), and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque into observable.
Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
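As one example of the manifest pattern (a minimal sketch with illustrative paths): record every file's path, size, and hash once in Parquet, so later stages can plan work without re-listing storage:

```python
# A sketch of a file manifest: scan once, record per-file metadata, and
# persist it in Parquet for downstream pipeline stages.
import hashlib
from pathlib import Path
import pandas as pd

def build_manifest(root: str) -> pd.DataFrame:
    rows = []
    for path in Path(root).rglob("*.jpg"):
        rows.append({
            "path": str(path),
            "bytes": path.stat().st_size,
            "sha1": hashlib.sha1(path.read_bytes()).hexdigest(),
        })
    return pd.DataFrame(rows)

manifest = build_manifest("images/")          # illustrative directory
manifest.to_parquet("manifest.parquet", index=False)
```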
Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) strike a rare balance: they combine the flexibility of complex models with the clarity of simple ones.
They often achieve performance comparable to black-box models, yet remain:
- Easy to interpret
- Computationally efficient
- Aligned with the growing demand for transparency in AI
With recent U.S. AI regulations (White House, 2022) and increasing pressure from decision-makers for explainable models, GAMs are emerging as a natural choice across industries.
Audience
This guide is for readers with some background in Python and statistics, including:
- Data scientists
- Machine learning engineers
- Researchers
Takeaway
By the end, you’ll understand:
- The intuition behind GAMs
- How to build and apply them in practice
- How to interpret and explain GAM predictions and results in Python
Prerequisites
You should be comfortable with:
- Basic regression concepts
- Model regularization
- The bias–variance trade-off
- Python programming
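As a taste of the hands-on portion, a minimal sketch using pyGAM, one open-source option for fitting GAMs in Python (synthetic data for illustration):

```python
# Each s(i) term is a smooth spline on feature i, and partial dependence
# makes every term's contribution directly inspectable.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, 500)

gam = LinearGAM(s(0) + s(1)).fit(X, y)
gam.summary()

# Inspect the learned shape of each feature's effect.
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    XX = gam.generate_X_grid(term=i)
    effect = gam.partial_dependence(term=i, X=XX)
```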
Learn how to build accurate retail demand forecasts using MLForecast, an open-source Python library that automates feature engineering and machine learning models, with practical examples for common retail scenarios.
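A hedged sketch of the MLForecast workflow; the input file is an illustrative stand-in for retail sales data in the library's expected long format (unique_id, ds, y):

```python
# Lag and date features are declared once and generated automatically for
# any sklearn-style model.
import pandas as pd
import lightgbm as lgb
from mlforecast import MLForecast

df = pd.read_csv("retail_sales.csv", parse_dates=["ds"])  # illustrative file

fcst = MLForecast(
    models=[lgb.LGBMRegressor()],
    freq="D",
    lags=[7, 14, 28],                # demand from 1, 2, and 4 weeks ago
    date_features=["dayofweek", "month"],
)
fcst.fit(df)
preds = fcst.predict(h=28)           # 28-day-ahead forecast per series
```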
This lighthearted educational talk explores the wild west of dataframes. We discuss where dataframes got their origin (it wasn't R), how dataframes have evolved over time, and why dataframe is such a confusing term (what even is a dataframe?). We will look at what makes dataframes special from both a theoretical computer science perspective (the math is brief, I promise!) and from a technology landscape perspective. This talk doesn't advocate for any specific tool or technology, but instead surveys the broad field of dataframes as a whole.
Large language models are often too large to run on personal machines, requiring specialized hardware with massive memory. Quantization provides a way to shrink models, speed them up, and reduce memory usage - all while retaining most of their accuracy.
This talk introduces the fundamentals of neural network quantization, key techniques, and demonstrates how to apply them using Keras’s extensible quantization framework.
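To ground the fundamentals, a from-scratch NumPy sketch of affine int8 quantization, the core idea behind most post-training schemes (this illustrates the math, not Keras's API):

```python
# Affine quantization: map a float range onto 256 integer bins, then map
# back; the reconstruction error is what quantization-aware methods manage.
import numpy as np

def quantize(w: np.ndarray):
    scale = (w.max() - w.min()) / 255.0          # float range -> 256 bins
    zero_point = np.round(-w.min() / scale)      # bin that represents 0.0
    q = np.clip(np.round(w / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(w)
print(np.abs(w - dequantize(q, scale, zp)).max())  # small reconstruction error
```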
Living on Washington State’s peninsula offers endless beauty, nature, and commuting challenges. In this talk, I’ll share how I built an agentic AI system that creates and compares optimal routes to the mainland, factoring in ferry schedules, costs, driving distances, and live traffic. Originally a testbed for the Model Context Protocol (MCP) framework, this project now manages my travel schedule, generates expense estimates, and sends timely notifications for events. I’ll give a comprehensive overview of MCP, show how to quickly turn ideas into working agentic AI, and discuss practical integration with real-world APIs. Attendees will leave with actionable insights and a roadmap for building their own agentic AI solutions.
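As a flavor of MCP, a hedged sketch of exposing one tool with the Python SDK's FastMCP helper; the ferry-schedule logic is a stub stand-in:

```python
# A sketch of an MCP tool server. In the real system the tool body would
# call a live ferry-schedule API; here it returns canned times.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ferry-planner")

@mcp.tool()
def next_sailings(terminal: str, count: int = 3) -> list[str]:
    """Return the next departure times from a ferry terminal."""
    return ["10:05", "11:25", "12:45"][:count]   # stubbed schedule

if __name__ == "__main__":
    mcp.run()   # an MCP client (e.g., an agent) can now call next_sailings
```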
Most AI pipelines still treat models like Python UDFs, just another function bolted onto Spark, Pandas, or Ray. But models aren’t functions: they’re expensive, stateful, and unreliable. In this talk, we’ll explore why this mental model breaks at scale and share practical patterns for treating models as first-class citizens in your pipelines.
In the world of AI voice agents, especially in sensitive contexts like healthcare, audio clarity is everything. Background noise—a barking dog, a TV, street sounds—degrades transcription accuracy, leading to slower, clunkier, and less reliable AI responses. But how do you solve this in real-time without breaking the bank?
This talk chronicles our journey at a health-tech startup to ship background noise filtration at scale. We'll start with the core principles of noise reduction and our initial experiments with open-source models, then dive deep into the engineering architecture required to scale a compute-hungry ML service using Python and Kubernetes. You'll learn about the practical, operational considerations of deploying third-party models and, most importantly, how to measure their true impact on the product.
Processing large-scale image datasets for captioning presents coordination challenges that often lead to complex, difficult-to-maintain systems. I've been exploring how Ray Data can simplify these workflows while improving throughput and reliability. This talk demonstrates how to build image captioning pipelines combining Ray Data's batch processing capabilities, Ray Data LLM's batch inference capabilities, and vLLM for efficient model serving.
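A hedged sketch of the pipeline shape with Ray Data's map_batches; the captioning model is a stub stand-in for a vLLM-served model, and the paths are illustrative:

```python
# Read images, then run batched captioning with a stateful class so the
# model loads once per actor rather than once per batch.
import numpy as np
import ray

class StubCaptioner:
    """Stand-in for a real (e.g., vLLM-served) captioning model."""
    def generate(self, images):
        return ["a photo"] * len(images)

class CaptionBatch:
    def __init__(self):
        self.model = StubCaptioner()   # load weights once per actor

    def __call__(self, batch):
        batch["caption"] = np.array(self.model.generate(batch["image"]))
        return batch

ds = (
    ray.data.read_images("s3://bucket/images/")   # illustrative path
    .map_batches(CaptionBatch, concurrency=4, batch_size=32)
)
ds.write_parquet("s3://bucket/captions/")
```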
Advancements in deep learning for biomedical image processing have led to the development of promising algorithms across multiple clinical domains, including radiology, digital pathology, ophthalmology, cardiology, and dermatology, among others. With robust AI models demonstrating commendable results, it is crucial to understand that their limited interpretability can impede the clinical translation of deep learning algorithms. The inference mechanism of these black-box models is not entirely understood by clinicians, patients, regulatory authorities, and even algorithm developers, thereby exacerbating safety concerns. In this interactive talk, we will explore some novel explainability techniques designed to interpret the decision-making process of robust deep learning algorithms for biomedical image processing. We will also discuss the impact and limitations of these techniques and analyze their potential to provide medically meaningful algorithmic explanations. Open-source resources for implementing these interpretability techniques using Python will be covered to provide a holistic understanding of explaining deep learning models for biomedical image processing.
This talk is distilled from a course that Ojas Ramwala designed, which received the best seminar award for the highest graduate student enrollment at the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle.
Efficient feature engineering is key to unlocking modern multimodal AI workloads. In this talk, we’ll dive deep into how Lance - an open-source format with built-in indexing, random access, and data evolution - works seamlessly with Ray’s distributed compute and UDF capabilities. We’ll walk through practical pipelines for preprocessing, embedding computation, and hybrid feature serving, highlighting concrete patterns attendees can take home to supercharge their own multimodal pipelines. See https://lancedb.github.io/lance/integrations/ray to learn more about this integration.
Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?
In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.
Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.
Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.
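To make "declarative, policy-as-code" concrete, a hypothetical sketch of filtering at the source; every name here is illustrative rather than a specific tool's API:

```python
# Policies are data; a tiny engine applies them before anything is shipped
# to central storage or training clusters.
POLICIES = [
    {"name": "drop_low_quality", "field": "blur_score", "op": "lt", "value": 0.3},
    {"name": "drop_pii", "field": "contains_pii", "op": "eq", "value": True},
]

OPS = {"lt": lambda a, b: a < b, "eq": lambda a, b: a == b}

def admit(record: dict) -> bool:
    """Return True if the record survives every drop policy."""
    return not any(OPS[p["op"]](record[p["field"]], p["value"]) for p in POLICIES)

events = [
    {"blur_score": 0.9, "contains_pii": False},
    {"blur_score": 0.1, "contains_pii": False},   # dropped at the source
]
kept = [e for e in events if admit(e)]
```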
The world of generative AI is expanding, with new models hitting the market daily. The field has bifurcated between model training and model inference, and the need for fast inference has led to the development of numerous tile languages. These languages use concepts from linear algebra and borrow common NumPy APIs. In this talk we will show how tiling works and how to build inference models from scratch in pure Python with embedded tile languages. The goal is to provide attendees with a good overview that can be integrated into common data pipelines.
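To show the core idea before any GPU enters the picture, a pure-NumPy sketch of tiled matrix multiplication, the same block decomposition that tile languages map onto hardware:

```python
# Tiling: compute the matmul block by block; each (i, j) output tile
# accumulates products of small input tiles.
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A, B = np.random.rand(256, 128), np.random.rand(128, 512)
assert np.allclose(tiled_matmul(A, B), A @ B)
```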
LLM apps fail without reliable, reproducible evaluation. This talk maps the open‑source evaluation landscape, compares leading techniques (RAGAS, G-Eval, graders) and frameworks (DeepEval, Phoenix, LangFuse, OpenAI Evals), and shows how to combine unit tests, RAG‑specific evals, and observability to ship higher‑quality systems.
Attendees leave with a decision checklist, code patterns, and a production‑ready playbook.
Women make up only 22% of data and AI roles and contribute just 3% of Python commits, leaving a “missing 78%” of untapped talent and perspective. This talk shares what happened when our community doubled overnight, revealing hidden demand for inclusive spaces in scientific Python.
We’ll present the data behind this growth, examine systemic barriers, and introduce the VIM framework (Visibility–Invitation–Mechanism) — a research-backed model for building resilient, inclusive communities. Attendees will leave with practical, reproducible strategies to grow engagement, improve retention, and ensure that the future of AI and Python is shaped by all voices, not just the few.
Fine-tuning improves what an LLM knows, but it does little to guarantee how the model behaves under real-world prompt variation. Small changes in format, phrasing, or ordering can cause accuracy to collapse, exposing brittle decision boundaries. This talk presents practical methods for evaluating and improving robustness, including FORMATSPREAD for estimating performance spread, DivSampling for generating diverse stress tests, mixture-of-formats for structured variation, and alignment-aware techniques such as adversarial contrast sets and multilingual perturbations. We also show how post-training optimization with Direct Preference Optimization (DPO) can integrate robustness feedback into the alignment loop.
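A minimal FORMATSPREAD-style sketch of the evaluation idea, where a hypothetical ask_model callable stands in for the LLM under test: score the same task under several semantically equivalent formats and report the spread:

```python
# Accuracy spread across equivalent prompt formats; a large spread signals
# brittle decision boundaries.
FORMATS = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "{q}\nRespond with the answer only.",
]

def accuracy_under_format(fmt: str, dataset, ask_model) -> float:
    hits = sum(ask_model(fmt.format(q=q)).strip() == gold for q, gold in dataset)
    return hits / len(dataset)

def format_spread(dataset, ask_model) -> float:
    scores = [accuracy_under_format(f, dataset, ask_model) for f in FORMATS]
    return max(scores) - min(scores)
```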
Modern LLM applications rely heavily on embeddings and vector databases for retrieval-augmented generation (RAG). But in 2025, researchers and OWASP flagged vector databases as a new attack surface — from embedding inversion (recovering sensitive training text) to poisoned vectors that hijack prompts. This talk demystifies these threats for practitioners and shows how to secure your RAG pipeline with real-world techniques like encrypted stores, anomaly detection, and retrieval validation. Attendees will leave with a practical security checklist for keeping embeddings safe while still unlocking the power of retrieval.
Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.
In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will showcase how uplift modeling helps identify who will respond positively to an intervention while avoiding those whom it might deter.
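As a minimal illustration of the idea (a two-model T-learner sketch with scikit-learn on synthetic data, not the talk's production approach):

```python
# T-learner uplift: fit separate outcome models for treated and control,
# then score the difference in predicted response.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
treated = rng.integers(0, 2, 5000).astype(bool)
# Synthetic outcome: feature 0 drives the response to the intervention.
y = (rng.random(5000) < 0.2 + 0.3 * treated * (X[:, 0] > 0)).astype(int)

m_t = GradientBoostingClassifier().fit(X[treated], y[treated])
m_c = GradientBoostingClassifier().fit(X[~treated], y[~treated])

uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]
# Target customers with the highest predicted uplift, not the highest
# raw purchase probability.
top = np.argsort(-uplift)[:500]
```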
This talk examines multi-threaded parallel inference on PyTorch models using the new no-GIL, free-threaded version of Python. Using a simple 124M parameter GPT-2 model that we train from scratch, we explore the new territory unlocked by free-threaded Python: parallel PyTorch model inference, where multiple threads, unimpeded by the GIL, generate text from a transformer-based model in parallel.
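A hedged sketch of the pattern, with a small stand-in transformer rather than the talk's GPT-2; on a free-threaded build these threads run forward passes concurrently instead of serializing on the GIL:

```python
# Several threads share one model and run inference concurrently.
import threading
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
model.eval()

def worker(thread_id: int):
    x = torch.randn(1, 16, 64)           # one dummy sequence per thread
    with torch.inference_mode():
        out = model(x)                    # no GIL contention on 3.13t builds
    print(thread_id, out.shape)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```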
Over the past few years, large language models (LLMs) have transformed the AI landscape, becoming an integral part of our daily workflows. Now, a new wave of AI innovation is emerging: AI agents. These agents go beyond static responses - they can reason, take actions, use tools, and solve multi-step problems, often with minimal human guidance. This evolution is driven by advancements like tool use and the Model Context Protocol (MCP), which enable models to interact with real-world environments. In this hands-on workshop, participants will learn how to build practical, task-oriented AI agents.
Hosted by a 5x Worlds-qualifying robotics team from Bellevue, WA.
As datasets continue to grow in both size and complexity, CPU-based visualization pipelines often become bottlenecks, slowing down exploratory data analysis and interactive dashboards. In this session, we’ll demonstrate how GPU acceleration can transform Python-based interactive visualization workflows, delivering speedups of up to 50x with minimal code changes. Using libraries such as hvPlot, Datashader, cuxfilter, and Plotly Dash, we’ll walk through real-world examples of visualizing both tabular and unstructured data and demonstrate how RAPIDS, a suite of open-source GPU-accelerated data science libraries from NVIDIA, accelerates these workflows. Attendees will learn best practices for accelerating preprocessing, building scalable dashboards, and profiling pipelines to identify and resolve bottlenecks. Whether you are an experienced data scientist or developer, you’ll leave with practical techniques to instantly scale your interactive visualization workflows on GPUs.
This talk explores how AI agents integrated directly into Jupyter notebooks can help with every part of your data science work. We'll cover the latest notebook-focused agentic features in VS Code, demonstrating how they automate tedious tasks like environment management or graph styling, enhance your "scratch notebook" to sharable code, and more generally streamline data science workflows directly in notebooks.
While AI copilots like Cursor and Claude Code have recently revolutionized software engineering workflows, many data scientists have so far been let down by the promise of AI. In this session, we’ll explore the capabilities of the Sphinx copilot, a Jupyter-native tool built specifically for data scientists, and learn tips and tricks on how to best leverage it to accelerate analytical workflows.
AI/ML workloads depend heavily on complex software stacks, including numerical computing libraries (SciPy, NumPy), deep learning frameworks (PyTorch, TensorFlow), and specialized toolchains (CUDA, cuDNN). However, integrating these dependencies into Bazel-based workflows remains challenging due to compatibility issues, dependency resolution, and performance optimization. This session explores the process of creating and maintaining Bazel packages for key AI/ML libraries, ensuring reproducibility, performance, and ease of use for researchers and engineers.
In this hands-on tutorial, we’ll walk through building a lightweight, agent-style workflow that takes a user-specified topic and uses retrieval-augmented generation (RAG) to perform deep research, summarize insights, and generate a podcast-style script. We’ll also show how to convert that script into audio using a simple text-to-speech tool.
This is a beginner-friendly, practical workshop that introduces key concepts in agent task design and content orchestration using LLMs.
Do you need to move your code from notebooks into production? Or do you want to level up your software engineering skills? In this tutorial, we will show you how to turn a Jupyter notebook into a robust, reproducible Python script. You will learn how to use tools for converting notebooks into scripts, how to make your code modular, and how to write unit tests.
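As a preview of the refactor target, a minimal sketch: a notebook cell becomes a pure function in a module, which a pytest test can then exercise (file names are illustrative):

```python
# analysis.py -- notebook logic extracted into a testable function.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and normalize column names."""
    out = df.dropna().copy()
    out.columns = [c.strip().lower() for c in out.columns]
    return out

# test_analysis.py -- a unit test pytest can discover and run.
def test_clean_drops_missing_and_normalizes():
    df = pd.DataFrame({" Age ": [1, None], "Name": ["a", "b"]})
    result = clean(df)
    assert list(result.columns) == ["age", "name"]
    assert len(result) == 1
```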
LLMs have a lot of hype around them these days. Let’s demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. However, LLMs pose a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel, and using those behaviors in our data science artifacts. This talk will introduce you to LLMs, the Chatlas package, and how they can be integrated into a Shiny app to create an AI-powered dashboard (using querychat). We’ll see how we can leverage the tasks LLMs are good at to better our data science products.
Have you ever wished your Python libraries were faster? That they could run on GPUs without changing your huge codebase, switching to a different library, or rewriting everything in a faster language (C, Rust)? Discover how an API dispatching mechanism redirects function calls to a faster implementation in a separate backend package, through approaches like Python's entry_points specification and the Array API standard, as used in various Scientific Python projects!
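A minimal sketch of the entry_points approach; the group and backend names are illustrative assumptions:

```python
# A library looks up faster backends registered by separately installed
# packages via packaging metadata, importing them lazily.
from importlib.metadata import entry_points

def get_backend(name: str = "default"):
    # Backend packages advertise themselves in their pyproject.toml:
    # [project.entry-points."mylib.backends"]
    # gpu = "mylib_gpu:Backend"
    for ep in entry_points(group="mylib.backends"):
        if ep.name == name:
            return ep.load()()          # import and instantiate on demand
    raise LookupError(f"no backend named {name!r} installed")

# A call like mylib.sort(x) can then be routed to
# get_backend("gpu").sort(x) when a GPU backend is installed.
```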
How can you use LLMs in professional settings where cloud APIs are off-limits due to cost, privacy, or compliance? In this talk, we’ll explore how to run powerful, open-source models like Mistral and LLaMA locally — and make them useful in the real world.
We’ll cover the engineering patterns, trade-offs, and deployment approaches that make local LLMs production-ready. You’ll learn how to build a private internal knowledge assistant that runs completely offline using RAG (retrieval-augmented generation), local embeddings, and quantized models. A short live demo will show it in action — answering organization-specific questions without sending a single token to the cloud.
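A hedged sketch of the fully offline loop using llama-cpp-python and sentence-transformers; the model path and documents are illustrative:

```python
# Local RAG: local embeddings for retrieval, a quantized local model for
# generation; no tokens leave the machine.
import numpy as np
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # local embeddings
llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

docs = ["Expense reports are due the 5th.", "VPN issues: contact IT ops."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def answer(question: str) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    context = docs[int(np.argmax(doc_vecs @ q_vec))]   # top-1 retrieval
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt, max_tokens=128)["choices"][0]["text"]

print(answer("When are expense reports due?"))
```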
DataMaps are ML-powered visualizations of high-dimensional data, and in this talk the data is collections of embedding vectors. Interactive DataMaps run in-browser as web-apps, potentially without any code running on the web server. DataMap tech can be used to visualize, say, the entire collection of chunks in a RAG vector database.
The best-of-breed tools of this new DataMap technique are liberally licensed open source. This presentation is an introduction to building with those repos. The maths will be mentioned only in passing; the topic here is simply how-to with specific tools. Talk attendees will be learning about Python tools, which produce high-quality web UIs.
DataMapPlot is the premier tool for rendering a DataMap as a web-app. Here is a live demo:
http://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b
00-10: Intro to DataMaps
10-15: A pipeline blueprint and a DataMap file format that gets assembled by pipelines
15-35: Demo tour of tools such as UMAP, HDBSCAN, DataMapPlot, Toponymy, etc.
35-40: Q & A
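A hedged sketch of the pipeline stages covered in the demo tour; random vectors stand in for real embeddings, and the DataMapPlot call's arguments may differ by version:

```python
# Embed -> reduce (UMAP) -> cluster (HDBSCAN) -> render (DataMapPlot).
import numpy as np
import hdbscan
import umap
import datamapplot

# Stand-in for real embedding vectors (e.g., encoded RAG chunks).
vectors = np.random.default_rng(0).normal(size=(2000, 384))

coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(vectors)
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(coords)

# Render a self-contained interactive web app; no server-side code needed.
names = np.array([f"cluster {l}" if l >= 0 else "noise" for l in labels])
fig = datamapplot.create_interactive_plot(coords, names)
fig.save("datamap.html")
```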
Fast iteration is the backbone of machine learning innovation. I’ve been exploring how to enable ML engineers to prototype and scale training workloads with minimal friction and maximal flexibility - all without leaving the comfort of Python. This talk demonstrates how Ray can be used as a powerful framework for accelerating ML development workflows through standalone persistent Ray clusters as well as ephemeral per-job Ray clusters.
Traditional subgraph isomorphism algorithms like VF2 rely on sequential tree-search that can't leverage parallel computing. This talk introduces Δ-Motif, a data-centric approach that transforms graph matching into data operations using Python's data science stack.
Δ-Motif decomposes graphs into small "motifs" to reconstruct matches. By representing graphs as tabular data with RAPIDS cuDF and Pandas, we achieve 10-595X speedups over VF2 without custom GPU kernels.
I'll demonstrate practical applications from social networks to quantum computing, and show when GPU acceleration provides the biggest benefits for graph analysis problems. Perfect for data scientists working with network analysis, recommendation systems, or pattern matching at scale.
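To preview the data-centric idea, a pure-pandas sketch that enumerates triangle motifs with self-joins on an edge table; cuDF accepts essentially the same code for the GPU path:

```python
# Graph matching as data operations: join the edge table twice to build
# 2-hop paths, then join once more to close each triangle.
import pandas as pd

edges = pd.DataFrame({"src": [0, 1, 2, 0, 3], "dst": [1, 2, 0, 3, 1]})

paths = edges.merge(edges, left_on="dst", right_on="src",
                    suffixes=("_ab", "_bc"))
triangles = paths.merge(
    edges,
    left_on=["dst_bc", "src_ab"],   # require the closing edge c -> a
    right_on=["src", "dst"],
)
# One row per rotation of each directed triangle.
print(triangles[["src_ab", "dst_ab", "dst_bc"]])
```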
The problem of address matching arises when the address of one physical place is written in two or more different ways. This situation is very common in companies that receive customer records from different sources. The differences can be classified as syntactic and semantic. In the first type, the meaning is the same but the way the addresses are written differs; for example, one can find "Street" vs. "St". In the second type, the meaning is not exactly the same; for example, one can find "Road" instead of "Street". To solve this problem and match addresses, we have a couple of approaches. The first and simpler one uses similarity metrics. The second uses natural language models and transformers. This is a hands-on talk intended for data process analysts. We will go through these solutions implemented in a Jupyter notebook using Python.
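A small sketch of both approaches: rapidfuzz for syntactic similarity and sentence-transformers for semantic similarity (the model choice is illustrative):

```python
# Syntactic matching via string similarity, semantic matching via embeddings.
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

a, b = "123 Main Street, Apt 4", "123 Main St Apt 4"

# Syntactic: token-based edit similarity handles "Street" vs "St".
print(fuzz.token_sort_ratio(a, b))            # high score, likely a match

# Semantic: embeddings can catch "Road" vs "Street" style variation.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([a, "123 Main Road, Apt 4"], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())    # cosine similarity in [-1, 1]
```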