PyData Virginia 2025

08:00
60min
Registration
Auditorium 5
08:00
60min
Registration
Auditorium 4
08:00
60min
Registration
Auditorium 3
09:00
15min
Opening Notes
Auditorium 5
09:15
45min
Keynote: Building AI-First Organizations
Rajkumar Venkatesan

As businesses strive to become AI-first, the pivotal role of AI practitioners extends beyond technical implementation to encompass strategic stewardship. This transition necessitates a profound understanding of organizational goals, data governance, and ethical considerations. By aligning AI initiatives with business objectives, fostering cross-functional collaboration, and addressing challenges such as data privacy and employee adaptation, AI professionals can drive effective transformation. This keynote explores the essential competencies and approaches required for AI practitioners to lead their organizations successfully into an AI-centric future.

Auditorium 5
10:00
20min
Break
Auditorium 5
10:00
20min
Break
Auditorium 4
10:00
20min
Break
Auditorium 3
10:20
35min
Bayesian Risk Analysis For Large Multi-Modal Data
Sihang Jiang

In the era of big data, multi-modal data from multiple sources or modalities has become increasingly prevalent in fields such as healthcare. The National COVID Cohort Collaborative (N3C) provides researchers with abundant clinical data in different forms by aggregating and harmonizing Electronic Health Records (EHR) data across clinical organizations in the United States, making it convenient for researchers to analyze COVID-related topics and build models with large multi-modal data. Bayesian risk analysis has advantages in handling the complexities and heterogeneities of multi-modal healthcare data, specifically in cohort studies where researchers try to answer questions of interest in the public health and medicine fields regarding COVID and Long COVID.
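The abstract doesn't spell out the modeling details, but the flavor of Bayesian risk estimation can be shown with a minimal beta-binomial sketch (illustrative only; the N3C analyses described above are far richer, and the numbers below are made up):

```python
# Minimal beta-binomial sketch of Bayesian risk estimation. With a
# Beta(a, b) prior on an event risk and k events among n patients,
# the posterior is Beta(a + k, b + n - k) by conjugacy.
def posterior_risk(k, n, a=1.0, b=1.0):
    a_post = a + k
    b_post = b + (n - k)
    mean = a_post / (a_post + b_post)  # posterior mean risk estimate
    return a_post, b_post, mean

# 12 adverse events among 200 patients under a uniform Beta(1, 1) prior:
a_post, b_post, mean = posterior_risk(12, 200)
```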

Auditorium 3
10:20
35min
Making the most of test-time compute in LLMs
Suhas Pai

Reasoning models like OpenAI's o3 and DeepSeek's R1 herald a new paradigm that leverages test-time compute to solve tasks requiring reasoning. These models represent a departure from traditional LLMs, upending long-held assumptions about them. In this session, we will discuss the different dimensions along which test-time compute can be expended and scaled. We will showcase best practices for prompting reasoning models as well as how to direct test-time compute towards achieving desired results. Finally, we will demonstrate how to train our own reasoning models specific to our domain or use case.

Auditorium 5
10:20
35min
Practical Applications of Apache Arrow
Will Ayd, Matt Topol

Data system interoperability remains a significant challenge in open source ecosystems, with high costs in development time and resources when moving data across complex infrastructures. The Apache Arrow project offers a standardized solution to reduce these integration challenges.

Will Ayd (Apache Arrow Committer and pandas maintainer) and Matt Topol (Apache Arrow PMC Member and author of "In Memory Analytics with Apache Arrow") will discuss how Apache Arrow is changing the data landscape. A brief overview of Arrow standards will be provided, while also reviewing real world implementations of where the Arrow specification has driven down the cost of data interoperability.

Auditorium 4
10:55
35min
Data wrangling with DuckDB
Will Angel

Learn how to wrangle data in Python with DuckDB, a fast, open source, in-process analytical SQL database!

Auditorium 4
10:55
35min
Evaluating LLMs at S&P Global: Building a Robust Evaluation Framework for GenAI Productivity Tools
MacKenzye Leroy

Discover how S&P Global built an enterprise-grade evaluation framework that transformed our GenAI deployment process. Through automated monitoring, expert validation, & continuous testing, we’ve streamlined the document integration step of our RAG tools, while ensuring our AI tools maintain consistent quality and reliability.

Auditorium 5
10:55
35min
Saving Lives with Data Science: How data science shortened the COVID-19 pandemic by 2 months
Greg Michaelson

When every day counted during the COVID-19 pandemic, data science became an essential catalyst in accelerating the path to widespread vaccination. This talk delves into the data-driven strategies that enabled the U.S. government’s vaccine trials to move faster, cutting crucial weeks—6 to 8, by our estimates—off the timeline to deployment. Through sophisticated geospatial modeling, we identified and swiftly mobilized trial recruitment efforts in emerging hot zones, ensuring that each candidate pool was both numerically sufficient and demographically representative. Attendees will discover how advanced analytics, predictive modeling, and interdisciplinary collaboration converged to target the right communities at the right time, ultimately expediting vaccine availability. This behind-the-scenes look at rapid-response data science highlights not just the technical innovations, but the decisive cultural and operational shifts that turned real-time insights into life-saving action.

Auditorium 3
11:30
35min
Maximizing Multimodal: Exploring the search frontier of text-to-image models to improve visual find-ability for creatives
Nathan Day

Text-to-image models, like CLIP, have brought us into a new frontier of visual search. Whether it's searching by circling a section of a photo or powering image generators like DALL-E, the gap between pixels and tokens has never been smaller. This talk discusses how we are improving search and empowering designers with these models at Eezy, a stock art marketplace.

Auditorium 5
11:30
35min
The Art of Brain Data in ASD Subjects: Celebrating Neurodiversity Through Aesthetic Data Visualization
Siwen Liao

In our project, we took MRI-derived brain data and reinterpreted it through an aesthetic lens. Using multidimensional scaling (MDS) to distill complex patterns in cortical anatomy, we transformed these insights into physical 3D-printed brain models. Each sculpture serves as a tangible narrative, celebrating both the subtle and striking differences between male and female brains, whether neurotypical or affected by ASD.
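For readers unfamiliar with MDS, here is a minimal classical-MDS sketch in NumPy (illustrative only; the cortical-anatomy pipeline described above involves far more than this toy):

```python
import numpy as np

# Classical MDS: given a squared-distance matrix D2, double-center it
# and use the top eigenvectors (scaled by sqrt of eigenvalues) as
# low-dimensional coordinates.
def classical_mds(D2, k=2):
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ D2 @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]     # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

# Three points on a line, pairwise distances 1, 1, 2 — MDS recovers them:
pts = np.array([[0.0], [1.0], [2.0]])
D2 = (pts - pts.T) ** 2
X = classical_mds(D2, k=1)
```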

Auditorium 3
11:30
35min
Zero Code Change GPU-Powered Graph Analytics with NetworkX and cuGraph
Ralph Liu

Graphs are a fundamental way of storing data, because everything is connected! Hence, graphs are very useful for modeling and solving a wide variety of real-world problems.

While NetworkX is amazing for getting started with Graphs, the library encounters bottlenecks in performance at scale.

Is there a solution for users who want more performance from NetworkX, and for open-source developers who want to implement fast algorithms? Yes! Thanks to the magic of dispatching.

NetworkX now supports dispatching to various backends, including the GPU accelerated cuGraph library by Nvidia RAPIDS.

Attend this talk to learn how you can use nx-cugraph – the cuGraph-powered backend for NetworkX – and how it unlocks exciting new possibilities for solving real-world graph analytics problems.
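The dispatching idea can be pictured with a small, purely conceptual sketch (hypothetical registry; this is not NetworkX's actual implementation, which in recent versions accepts a `backend=` keyword on real algorithms and discovers backends like nx-cugraph via entry points):

```python
# Conceptual backend dispatching: a public function routes a call to a
# registered backend when one is requested, else runs the reference code.
_backends = {}

def register_backend(name, func):
    _backends[name] = func

def degree_centrality(adj, backend=None):
    if backend is not None:
        return _backends[backend](adj)   # dispatch to the chosen backend
    n = len(adj)                         # pure-Python reference version
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}

# A stand-in "gpu" backend (here it just calls the reference version):
register_backend("gpu", lambda adj: degree_centrality(adj))

graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
result = degree_centrality(graph, backend="gpu")
```

The appeal of this pattern is that user code keeps calling the same API; only the backend selection changes, which is why nx-cugraph can accelerate existing NetworkX code with zero code changes.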

Auditorium 4
12:05
30min
Exploring Eviction Trends in Virginia
Samantha Toet, Dr. Michele Claibourn

Where do landlords engage in more eviction actions? What characteristics of renters or landlords increase the practice of serial filing? There is widespread interest in using administrative data -- information collected by government agencies in the implementation of public programs -- to evaluate systems and promote more just outcomes. Working with the Civil Court Data Initiative of Legal Services Corporation, we use data collected from civil court records in Virginia to analyze the behavior of landlords. Expanding on our Virginia Evictors Catalog, we use data on court evictions to build additional data tools to support the work of legal and housing advocates, and model key eviction outcomes to contribute to our understanding of landlord behavior.

Auditorium 3
12:05
30min
Fine tuning embeddings for semantic caching
Tyler Hutcherson, Srijith Rajamohan, Waris Gill

Large Language Models (LLMs) have opened new frontiers in natural language processing but often come with high inference costs and slow response times in production. In this talk, we’ll show how semantic caching using vector embeddings—particularly for frequently asked questions—can mitigate these issues in a RAG architecture. We’ll also discuss how we used contrastive fine-tuning methods to boost embedding model performance to accurately identify duplicate questions. Attendees will leave with strategies for reducing infrastructure costs, improving RAG latency, and strengthening the reliability of their LLM-based applications. Basic familiarity with NLP or foundation models is helpful but not required.
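The core cache-lookup idea can be sketched in a few lines of pure Python (illustrative only: production systems use a fine-tuned embedding model and a vector database rather than a list scan, and the threshold and vectors below are made up):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Semantic cache: return the cached answer when the query embedding is
# close enough to a cached question's embedding, else fall through to RAG.
def cache_lookup(query_vec, cache, threshold=0.9):
    best = max(cache, key=lambda item: cosine(query_vec, item["vec"]))
    if cosine(query_vec, best["vec"]) >= threshold:
        return best["answer"]            # cache hit: skip the LLM call
    return None                          # cache miss: run the full pipeline

cache = [{"vec": [1.0, 0.0], "answer": "cached answer"}]
hit = cache_lookup([0.98, 0.05], cache)  # nearly parallel -> hit
miss = cache_lookup([0.0, 1.0], cache)   # orthogonal -> miss
```

The contrastive fine-tuning discussed in the talk is about making those embeddings place duplicate questions close together and distinct questions far apart, so a threshold like this becomes reliable.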

Auditorium 5
12:05
30min
Practical Multi Armed Bandits
Benjamin Bengfort

Multi-armed bandits are a reinforcement learning tool often used in environments where the costs or rewards of different choices are unknown, or where those functions may change over time. The good news is that bandits are surprisingly easy to implement; in practice, the difficulty comes from defining a reward function that best targets your specific use case. In this talk, we will discuss how to use bandit algorithms effectively, taking note of practical strategies for experimental design and deployment of bandits in your applications.
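As a taste of how simple the core algorithm is, here is an epsilon-greedy sketch with Bernoulli rewards (a toy: the reward function here is handed to us, and designing a realistic one is exactly the hard part the talk addresses):

```python
import random

# Epsilon-greedy bandit: explore a random arm with probability eps,
# otherwise exploit the arm with the best running mean reward.
def run_bandit(true_rates, steps=5000, eps=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    values = [0.0] * len(true_rates)     # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:           # explore
            arm = rng.randrange(len(true_rates))
        else:                            # exploit current best estimate
            arm = max(range(len(true_rates)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# Three arms with hidden success rates; the bandit finds the best one:
counts, values = run_bandit([0.2, 0.5, 0.8])
```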

Auditorium 4
12:35
60min
LUNCH BREAK
Auditorium 5
12:35
60min
LUNCH BREAK
Auditorium 4
12:35
60min
Author Chat & Book Signing
Renee Teate, Will Ayd, Matt Topol, Suhas Pai

Lunchtime chat with data science authors, with some offering book giveaways and signing books!

Auditorium 3
13:35
60min
Panel: Principles for Effective and Successful Data Scientists
Renee Teate, Aaron Baker, David Der

What truly makes a data scientist effective in their job and career? Come hear our panel of data scientists, each with a unique pathway into the field, discuss the principles that matter: pathways to data science, translating business problems, and what technical expertise means for data science. Grow your insight into becoming the kind of data scientist people trust to solve the right problems, the right way.

Auditorium 5
14:35
20min
BREAKS & SNACKS
Auditorium 5
14:35
20min
BREAKS & SNACKS
Auditorium 4
14:35
20min
BREAKS & SNACKS
Auditorium 3
14:55
35min
Addressing Climate Change with AI
Dan Loehr

This talk will survey how AI is currently used to address climate change, and describe possible future use cases. This high-level overview will touch on various aspects of climate change (e.g. energy, transportation, land use), of AI (e.g. image processing, reinforcement learning, LLMs), and of their intersection. The talk will conclude with resources for learning more about this area, and suggestions for contributing to current and future efforts.

Auditorium 5
14:55
35min
Using Changepoint and Bayesian Analysis to Drive Safety Improvements in Mining
Mauricio Mathey

In the mining industry's pursuit of zero harm, distinguishing real safety improvements from random variation is crucial yet challenging. This talk demonstrates how classical changepoint analysis and Bayesian methods provide safety teams at Asarco LLC with rigorous tools to objectively evaluate progress towards our zero-harm goal. Using near miss reporting and lost time metrics, we will show how these statistical approaches help identify meaningful trends while avoiding misleading conclusions from natural variation. While the focus is on mining, these methods are applicable to other safety-critical and data-limited scenarios. No prior experience with changepoint analysis is required.
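To make the idea concrete, here is a minimal least-squares single-changepoint sketch (illustrative only; the talk pairs classical changepoint methods with Bayesian analysis on real safety metrics, and the series below is invented):

```python
# Single changepoint by least squares: choose the split that minimizes
# the total squared error of fitting one mean to each segment.
def sse(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def find_changepoint(series):
    return min(range(1, len(series)),
               key=lambda i: sse(series[:i]) + sse(series[i:]))

# Monthly incident counts that drop after a safety intervention:
series = [9, 8, 10, 9, 11, 3, 2, 4, 3, 2]
cp = find_changepoint(series)  # index where the second regime begins
```

A Bayesian treatment would instead put a posterior over the changepoint location, which is what lets a safety team say how confident they are that the drop is a real improvement rather than noise.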

Auditorium 3
14:55
35min
Using Python to Unlock Insights from OpenStreetMap Data at Scale
Cory Eicher

Geospatial data can unlock valuable insights. OpenStreetMap includes electric power and telecommunication infrastructure geospatial data, and it is already “open”. This presentation will demonstrate how to use Python to “unlock the insights” available in OSM power and telecommunications geospatial data.

Auditorium 4
15:30
35min
Real-Time Fitness Leaderboards with Open-Source Moose
David Der

Ever wished you could power live leaderboards for fitness challenges or dynamically award wellness badges in real time? Traditional OLTP systems often buckle under the pressure of continuous writes and aggregate reads. In this talk, we’ll explore how Moose, an open-source OLAP platform, enables rapid ingestion and lightning-fast queries on health and workout data. We’ll walk through a demo of creating real-time fitness leaderboards, awarding achievement badges, and using Python-based tools for data ingestion and visualization. Attendees will learn how an OLAP approach streamlines the architecture for modern wellness and health applications.

Auditorium 5
15:30
35min
The Secret Sauce of Customer Satisfaction: Turning Data Pipelines into Data Products
Josh Fairchild, Liam Agnew

What comes to mind when you think of an exceptional customer experience? Whether it was a "peak experience" or a "dumpster fire", it stuck with you! We recognize the importance of great customer experiences in industries like retail and hospitality—but what about in data? Does long-term success depend on creating exceptional customer experiences, or are client expectations just challenges to manage?

In this session we will share insights from a data and analytics project Elder Research is implementing for a Quick-Service Restaurant corporation. By prioritizing the customer experience and embracing a "Data as a Product" mindset, data teams can drive greater business value and build stronger, more sustainable client relationships.

Auditorium 3
15:30
35min
Versioning Multimodal Data: Metadata & Beyond
Dmitry Petrov

The team behind DVC has spent years tackling data versioning challenges. With the rise of AI, we’ve seen new complexities emerge - especially with multimodal datasets like images, video, audio, and text. This talk shows why multimodal data versioning is different and how Pydantic provides a powerful way to structure and integrate metadata.

Auditorium 4
16:05
35min
AI Ready Data
Hamish Brookeman, Alec Gosse

In today’s AI-first era, customers expect data products to be deeply interconnected, consumable with minimal effort, and widely available. The need for ‘AI Ready Data’, suitable for consumption directly by AI agents, has never been clearer.

Auditorium 4
16:05
35min
Machine Learning Pipelines in Higher Education: Lessons Learned Taking Models From Training to Production
Brian Richards

Building machine learning models with live, human-centric data is often a messy endeavor. However, by thinking about the entire machine learning pipeline and the lifecycle of the population being modeled, we can prevent the model (and data scientist) from overpromising and underdelivering. Come learn about the potential pitfalls that occur when working with human-centric data and what you can do to prevent them from ruining your model performance.

Auditorium 3
16:05
60min
Panel: Bridging the Gap: Collaborative Approaches to Data Science
Christopher N. Eichelberger, Thomas Loeber, Manikandarajan Shanmugavel, Renee Teate

During this expert panel, we'll explore the critical intersections of data science, engineering, and stakeholder engagement in today's organizations. This discussion will address how to break down silos between technical disciplines, establish effective collaboration models, create rapid experimentation frameworks, and successfully transition projects from exploration to production. Our panelists bring diverse perspectives on building integrated teams that balance innovation with enterprise standards while delivering real value.

Auditorium 5
16:40
35min
Visualization of higher-dimensional feature spaces during model training
Vivek Dhand

Modern machine learning models typically utilize extremely high-dimensional feature spaces, which inhibits robustness and explainability. Finer-grained control over model training requires more powerful tools for observing and interacting with latent features as they evolve over time. In this talk, we give several examples of visualizations of nearest-neighbor graphs that illuminate common training pitfalls and provide practical insights for diagnosing model performance issues.

Auditorium 4
16:40
35min
What is Geometric Algebra and can it help me?
Alex Arsenovic

An introduction to Geometric Algebra, with a focus on how it can (and can't) be used as a practical computational tool in Python. The discussion will present concrete examples using the open-source Python library ‘Kingdon’. The audience should leave with a grasp of what GA is and what it isn't, so that they can decide if it is a tool worthy of their cognitive investment.
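As a flavor of the subject, the 2D geometric product can be written out by hand in a few lines (pure Python; the ‘Kingdon’ library mentioned above generates this kind of product for arbitrary algebras):

```python
# In 2D GA, the geometric product of two vectors splits into a symmetric
# inner (scalar) part and an antisymmetric outer (bivector) part:
#   u v = u . v + u ^ v
def geometric_product(u, v):
    inner = u[0] * v[0] + u[1] * v[1]    # scalar: u . v
    outer = u[0] * v[1] - u[1] * v[0]    # bivector coefficient: u ^ v
    return inner, outer

# Orthogonal unit vectors anticommute: e1 e2 = -(e2 e1) = the unit bivector.
s1, b1 = geometric_product((1, 0), (0, 1))
s2, b2 = geometric_product((0, 1), (1, 0))
```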

Auditorium 3
08:00
60min
REGISTRATION
Room 120
08:00
60min
REGISTRATION
Room 130
08:00
60min
REGISTRATION
Room 140
09:00
90min
Mastering LLMs: From Prompt Engineering to Agentic AI
John Berryman

This workshop will provide a comprehensive introduction to Large Language Models (LLMs), covering their capabilities, structure, and practical applications. Participants will learn prompt engineering techniques, retrieval-augmented generation (RAG), agentic AI design, fine-tuning strategies, and model evaluation methods. The session will conclude with a discussion on the future of AI-powered reasoning machines.

Room 120
09:00
90min
Responsible AI with SciPy
Andrea Hobby

SciPy is a powerful library for scientific and technical computing in Python. The primary objectives of this presentation are to explore the core concepts of Responsible AI and to demonstrate these concepts with SciPy.

Room 130
09:00
90min
Tutorial on Image Classification using Scikit-Image, Scikit-learn, and PyTorch
Matt Litz

Tutorial on building an image segmentation and classification pipeline for binary or multiclass classification using the popular packages scikit-learn, scikit-image and PyTorch.

Room 140
10:30
30min
BREAKS
Room 120
10:30
30min
BREAKS
Room 130
10:30
30min
BREAKS
Room 140
11:00
90min
A Beginner's Guide to Variational Inference
Chris Fonnesbeck

When Bayesian modeling scales up to large datasets, traditional MCMC methods can become impractical due to their computational demands. Variational Inference (VI) offers a scalable alternative, trading exactness for speed while retaining the essence of Bayesian inference.

In this tutorial, we’ll explore how to implement and compare VI techniques in PyMC, including Automatic Differentiation Variational Inference (ADVI) and the cutting-edge Pathfinder algorithm.

Starting with simple models like linear regression, we’ll gradually introduce more complex, real-world applications, comparing the performance of VI against Markov Chain Monte Carlo (MCMC) to understand the trade-offs in speed and accuracy.

This tutorial will arm participants with practical tools to deploy VI in their workflows and help answer pressing questions, like "What do I do when MCMC is too slow?", or "How does VI compare to MCMC in terms of approximation quality?".

Room 140
11:00
90min
Building Rich RAG Systems with Docling: Unlock Information from Tables, Images, and Complex Documents
Krishna Rekapalli

Traditional PDF extraction tools often struggle with complex layouts, tables, and images. Docling, an open-source Python library developed at IBM, excels at extracting structured information from these elements, enabling the creation of richer, more accurate vector databases. This hands-on tutorial will guide participants through building a Retrieval Augmented Generation (RAG) system using Docling.

Participants will learn how to harness Docling's advanced capabilities to build superior RAG systems that can understand and retrieve information from complex document elements that traditional tools might miss. They will also learn how to handle complex documents, extract structured information, and create an efficient vector database for semantic search. The session will cover best practices for document parsing, chunking strategies, and integration with popular LLM frameworks.

Room 120
11:00
90min
Data Viz in Python as a Tool to Study HIV Health Disparities
Dr. Kimberly Deas

Health disparities remain a critical challenge in public health, demanding innovative approaches to uncover inequities and drive actionable change. This session will demonstrate how Python can serve as a powerful tool for creating data visualizations that illustrate the unequal burden of HIV across different populations. Participants will learn how Python’s popular libraries, such as Matplotlib, Seaborn, and Plotly, can transform complex datasets into accessible, impactful visuals.
Using an HIV dataset containing demographic, geographic, and clinical variables, this session will guide attendees through a series of practical examples. From creating heatmaps and geospatial maps to analyzing temporal trends, the session emphasizes how to identify and communicate key social determinants related to race, gender, socioeconomic status, and access to care. Through hands-on demonstrations, attendees will see how Python’s capabilities streamline data analysis and visualization workflows.
Key takeaways from the session include identifying regions and communities in Texas disproportionately affected by HIV, uncovering intersectional factors influencing health outcomes, and leveraging visual tools to inform policy and resource allocation. Special attention will be given to designing visuals that resonate with non-technical audiences, ensuring findings are actionable for public health professionals and policymakers.

Room 130
12:30
60min
LUNCH
Room 120
12:30
60min
LUNCH
Room 130
12:30
60min
LUNCH
Room 140
13:30
90min
Build Your Own Data Science AI Agents
Chuxin Liu, Astha Puri, Niharika Krishnan, Michelle Rojas

When “AI agent” became a buzzword, did you ever wonder: what exactly is an AI agent? What is a multi-agent system? And how can you use the power of AI agents in your day-to-day data science workflow? In this hands-on tutorial, we will introduce AI agents and demonstrate how to design, build, and manage a multi-agent system for your data science workflows. Participants will learn how to break down complex tasks, assign AI agents to collaborate effectively, and ensure accuracy and reliability in their outputs. We will also discuss the trade-offs, limitations, and best practices for incorporating AI agents into data science projects.

Room 120
13:30
90min
Getting Started with RAPIDS: GPU-Accelerated Data Science for PyData Users
Naty Clementi, Mike McCarty

In this introductory hands-on tutorial, participants will learn how to accelerate their data workflows with RAPIDS, an open-source suite of libraries designed to leverage the power of NVIDIA GPUs for end-to-end data pipelines. Using familiar PyData APIs like cuDF (GPU-accelerated pandas) and cuML (GPU-accelerated machine learning), attendees will explore how to seamlessly integrate these tools into their existing workflows with minimal code changes, achieving significant speedups in tasks such as data processing and model training.

Room 130
13:30
90min
Introduction to Wikidata
Lane Rasberry, Robin Isadora Brown

We will review Wikipedia, introduce Wikidata, and then demonstrate queries for accessing wiki content.

Room 140
15:00
30min
BREAKS & SNACKS
Room 120
15:00
30min
BREAKS & SNACKS
Room 130
15:00
30min
BREAKS & SNACKS
Room 140
15:30
90min
Blazing the AI Trail: Using LangGraph to Conquer the Oregon Trail
Robert Shelton

Agents have become one of the most talked-about topics in the AI community, but much of the discussion focuses on their potential impact rather than practical implementation. This hands-on workshop will guide data scientists and engineers through building a complete workflow using LangGraph, and will show how to define custom tools, implement vector retrieval, leverage semantic caching, incorporate allow/block list routing, and structure model output for downstream consumption. In order to participate, attendees will need Python (>=3.11), Docker, an OpenAI API key, and the starter code for the project cloned.

Starter code: https://github.com/redis-developer/agents-redis-lang-graph-workshop

Note: participants can test their environment setup ahead of time by following the README and running python test_setup.py before the workshop.

Room 120
15:30
90min
From Pandas to PySpark
Cynthia Ukawu

Tired of waiting for massive datasets to load on your local machine? In this beginner-friendly tutorial, we’ll explore how to scale your data analysis skills from pandas to PySpark using a real-world anime dataset. We’ll walk through the basics of distributed computing, discuss why Spark was created, and demonstrate the benefits of working with PySpark for big data tasks—including reading, cleaning, and transforming millions of records with ease. By the end of this workshop, you’ll understand how PySpark harnesses cluster computing to handle large-scale data and you’ll be comfortable applying these techniques to your own projects.

Participant Requirements:
- A laptop (any OS) with an internet connection
- A Google account (to access Colab notebooks and slides)
- Familiarity with Python and pandas

Here's the link to the Google Colab to follow along 👇🏾
https://colab.research.google.com/drive/1fi0cTQ1NIE5kDEH0ynp2sqDuVeiBJJWU?usp=sharing

Here are the slides 👇🏾
https://drive.google.com/file/d/11JIih1VzLxTJ9O6PeGzqD_e8vumTZQmw/view?usp=sharing

Room 130