PyData Global 2025

11:30
30min
When AI Makes Things Up: Understanding and Tackling Hallucinations
Aarti Jha

AI systems are increasingly being integrated into real-world products - from chatbots and search engines to summarisation tools and coding assistants. Yet, despite their fluency, these models can produce confident but false or misleading information, a phenomenon known as hallucination. In production settings, such errors can erode user trust, misinform decisions, and introduce serious risks. This talk unpacks the root causes of hallucinations, explores their impact on various applications, and highlights emerging techniques to detect and mitigate them. With a focus on practical strategies, the session offers guidance for building more trustworthy AI systems fit for deployment.

Machine Learning & AI
12:00
30min
Harnessing Generative Models for Synthetic Non-Life Insurance Data
Claudio Giorgio Giancaterino

This study presents a synthetic non-life insurance premium dataset generated using several Generative Models, with a Conditional Gaussian Mixture Model employed as the benchmark. The validation of the generated data involved several steps: data visualisation, univariate comparisons, and PCA and UMAP representations of the training data versus the generated samples. To check the consistency of the generated data, the statistical Kolmogorov–Smirnov test and predictive modelling of frequency and severity with Generalised Linear Models (GLMs) under a Tweedie distribution were used as measures of the generated data's quality, followed by evidence of feature importance. For further comparison, advanced Deep Learning architectures have been employed: Conditional Variational Autoencoders (CVAEs), CVAEs enhanced with a Transformer Decoder, a Conditional Diffusion Model, and Large Language Models. The analysis assesses each model’s ability to capture the underlying distributions, preserve complex dependencies, and maintain relationships intrinsic to the premium data. These findings provide insightful directions for enhancing synthetic data generation in insurance, with potential applications in risk modelling, pricing strategies under data scarcity, and regulatory compliance.

Machine Learning & AI
12:00
30min
Python Meets Excel: Smarter Workflows for Analysts and Data Teams
Dr Nisha Arora

Python drives modern data workflows, yet Excel remains the lingua franca of business. Many Python-based data teams struggle when the “last mile” of delivery still involves exporting results to Excel for business users. This talk explores practical ways for Python users to automate, scale, and enhance Excel-heavy processes using open-source libraries.
This talk will help you bridge the gap between code and the business-facing spreadsheet world.
We will discuss real-world use cases for report generation, batch processing, and dashboard templating, all from a Python-first perspective.
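
To make the "last mile" concrete, here is a minimal sketch of automated report generation with pandas and the openpyxl engine (file, sheet, and column names are illustrative):

    import pandas as pd

    # Illustrative data; in practice this comes from your pipeline.
    summary = pd.DataFrame({"region": ["North", "South"], "revenue": [1200, 950]})
    detail = pd.DataFrame({"order_id": [1, 2, 3], "region": ["North", "South", "North"]})

    # Write both tables into one workbook that business users can open directly.
    with pd.ExcelWriter("monthly_report.xlsx", engine="openpyxl") as writer:
        summary.to_excel(writer, sheet_name="Summary", index=False)
        detail.to_excel(writer, sheet_name="Detail", index=False)

The same pattern scales to batch-producing dozens of templated workbooks.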

General Track
12:30
90min
Fast, Cost-Efficient Analytics on Blockchain Data using DuckDB - Solana as a Case Study
Busirah Olaitan Hammed

Blockchain generates millions of transactions daily, making it a rich yet complex source of data for developers, analysts, and researchers. While Google BigQuery offers public access to Solana’s historical data, repeated querying at scale can become costly and slow, especially during iterative exploration and analysis.

In this talk, I’ll demonstrate a practical workflow that combines the power of BigQuery for data extraction with the speed and flexibility of DuckDB for local, in-memory analytics. We’ll show how to efficiently query Solana data in BigQuery, export it to partitioned Parquet files, and use DuckDB to run fast, repeatable SQL queries without incurring additional cloud costs.
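
As a sketch of the local half of this workflow, assuming the BigQuery results have already been exported to Hive-partitioned Parquet files under solana/ (paths and column names are illustrative):

    import duckdb

    con = duckdb.connect()
    # DuckDB reads only the partitions and columns a query touches,
    # so repeated exploration stays fast and incurs no cloud costs.
    df = con.execute("""
        SELECT block_date, count(*) AS tx_count
        FROM read_parquet('solana/*/*.parquet', hive_partitioning = true)
        GROUP BY block_date
        ORDER BY block_date
    """).df()
    print(df.head())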

You'll learn:
- Basic blockchain data-structure terms and how transactions are stored.
- How to navigate and query Solana’s public datasets on BigQuery.
- How to export filtered blockchain data to efficient Parquet files.
- How DuckDB can serve as a lightweight analytics engine for on-chain data.
- Tips for partitioning, enriching, and automating your Solana data pipeline.

This demo will run entirely in Google Colab, to save time and to enable participants to follow along during the session.

Whether you're working on blockchain analytics, wallet behavior analysis, or on-chain data engineering, this talk will equip you with a practical approach to blockchain data workflows using open tools.

Data Engineering & Infrastructure
12:30
30min
Scaling Fuzzy Product Matching with BM25: A Comparative Study of Python and Database Solutions
Aniket

Tired of exact matches failing on messy data? This talk showcases how BM25, a powerful fuzzy search algorithm, tackles the challenge of enriching massive datasets with noisy product names. We'll compare practical, large-scale implementations using Python's bm25s library (accelerated by GPUs) and DuckDB's built-in full-text search. Join us to learn how to achieve fast, accurate data integration and discover the optimal tools for your fuzzy matching needs.
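
For a flavor of the Python side, a minimal sketch based on the bm25s quickstart (the product strings are made up; exact API details may vary between versions):

    import bm25s

    catalog = [
        "apple iphone 13 128gb black",
        "samsung galaxy s22 ultra 256gb",
        "apple iphone 13 pro max 256gb",
    ]

    # Index the catalog once, then run fuzzy lookups against it.
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(catalog))

    # A noisy query still surfaces the closest catalog entries by BM25 score.
    results, scores = retriever.retrieve(bm25s.tokenize("iphone 13 black"), corpus=catalog, k=2)
    print(results, scores)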

Analytics, Visualization & Decision Science
12:30
30min
torchTextClassifiers: Modernizing Text Classification for French National Statistics
Cédric Couralet, Meilame Tayebjee

Discover how Insee (the French National Statistics Institute) transitioned from fastText to a PyTorch-based model for text classification by developing and open-sourcing the torchTextClassifiers Python package. This presentation will cover the creation, deployment, and practical applications of torchTextClassifiers in modernizing automatic coding systems, benefiting Insee and other European National Statistical Institutes (NSIs).

Machine Learning & AI
13:00
90min
Building LLM-Powered Applications for Data Scientists and Software Engineers
Hugo Bowne-Anderson

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.

Machine Learning & AI
13:00
30min
Python Beyond the Code: Unlocking Hidden Contributions in Open Source
Iyanu Falaye

Contributing to open source isn’t just about code. Documentation, testing, community support, and issue triaging are critical but often overlooked. In this talk, I’ll share how Python developers — from junior to senior — can make a meaningful, visible impact in open source. Whether you're new to open source or looking to expand your profile, this session will help you discover practical, beginner-friendly ways to contribute and stay engaged in the long term.

General Track
13:30
30min
Lessons learnt in optimizing a large-scale pandas application using Polars, FireDucks and cuDF: Go Smart and Save More!
Sourav Saha

In general, a Data Scientist spends significant effort transforming raw data into a more digestible format before training an AI model or creating visualisations. Traditional tools such as pandas have long been the linchpin in this process, offering powerful capabilities but not without limitations. With numerous possible ways to write the same thing in pandas, users often end up selecting an uneconomical, inefficient one, leading to large computational costs as data grows. We introduce a couple of frequently occurring, intricate performance issues in pandas and what we have learnt in solving them using popular high-performance pandas alternatives: Polars, FireDucks and cuDF. The talk highlights one of the best practices (breaking out of loops) to follow when dealing with large-scale data analysis, while demonstrating the key advantages of the high-performance pandas alternatives in different scenarios.

Analytics, Visualization & Decision Science
14:00
30min
Designing a Fast, Offline-Capable Reverse Geocoder in Python: An Open Source Alternative to Big Geo APIs
Sooraj Sivadasan

While commercial reverse geocoding APIs, such as Google Maps or Mapbox, are effective, they are also costly, have rate limitations, and are not appropriate for offline or privacy-sensitive settings.

In this session, we will demonstrate how to build a fast, scalable, offline-capable reverse geocoding system in Python, using openly available datasets and modules like cKDTree, shapely, and geopandas.

You will learn how to:
- Convert geographic shapefiles into effective spatial indices
- Perform location lookups in milliseconds using tree search and vector mathematics
- Handle edge cases like unclear borders, cities with identical names, and GPS noise
- Improve performance and memory usage through multiprocessing

The system is fully open source and has been production-tested in a high-throughput environment. Whether you are developing applications for edge inference, mapping, or logistics, this talk will help you take control of your geospatial infrastructure without depending on costly commercial APIs.
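
To make the core lookup concrete, here is a minimal sketch of the nearest-place query with scipy's cKDTree (the three cities are illustrative; lat/lon is converted to 3D unit vectors so straight-line distance respects the Earth's curvature):

    import numpy as np
    from scipy.spatial import cKDTree

    # (lat, lon) of known places; in practice loaded from a shapefile or gazetteer.
    places = {"Berlin": (52.52, 13.40), "Paris": (48.86, 2.35), "Madrid": (40.42, -3.70)}
    names = list(places)

    def to_xyz(latlon):
        # Map degrees to points on the unit sphere.
        lat, lon = np.radians(np.asarray(latlon, dtype=float)).T
        return np.column_stack(
            [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)]
        )

    tree = cKDTree(to_xyz(list(places.values())))

    # Millisecond lookup: nearest known place to a query coordinate.
    _, idx = tree.query(to_xyz([(48.80, 2.30)]), k=1)
    print(names[int(idx[0])])  # -> Paris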

Data Engineering & Infrastructure
14:00
30min
Lane detection in self-driving using only NumPy
Emma Saroyan

Are you a scientist or a developer looking to understand how to use NumPy to solve computer vision problems?
NumPy is a Python package that provides the multidimensional array object, which you can use to solve the lane detection problem in computer vision for self-driving cars and autonomous driving. You can apply non-machine-learning techniques using NumPy to find straight lines in street images. No other external libraries, just Python with NumPy.
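
As a minimal illustration of the NumPy-only approach (a synthetic image stands in for a street photo; a real pipeline would add smoothing, a region-of-interest mask, and a Hough-style vote):

    import numpy as np

    # Synthetic 100x100 grayscale image with one bright diagonal "lane line".
    img = np.zeros((100, 100))
    rows = np.arange(100)
    img[rows, (0.8 * rows + 5).astype(int)] = 1.0

    # Edge detection via gradient magnitude, keeping only strong edges.
    gy, gx = np.gradient(img)
    edges = np.hypot(gx, gy) > 0.5

    # Fit a straight line through the edge pixels with least squares.
    ys, xs = np.nonzero(edges)
    slope, intercept = np.polyfit(xs, ys, deg=1)
    print(f"detected line: y = {slope:.2f}x + {intercept:.2f}")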

General Track
14:30
30min
Enhancing Apache NiFi 2.x with Python Processors
Timothy Spann

In this talk, I will delve into the world of Apache NiFi 2.0 Python processors, exploring the capabilities they offer and demonstrating how to build custom processors to enhance your data processing pipelines.

By the end of this talk, participants will have a comprehensive understanding of building and optimizing Apache NiFi 2.0 Python processors, enabling them to integrate Python seamlessly into their data processing workflows.

This session is suitable for data engineers, architects, and anyone interested in harnessing the combined power of Apache NiFi and Python for efficient data integration and flow management. One of the main uses is to build prompts and call open LLMs and AI services. NiFi excels at integration; I will cover some interesting sources, sinks and enrichments, and show where Python is helpful.

Data Engineering & Infrastructure
14:30
30min
From Feature Engineering to Context Engineering for Agents
Jim Dowling

Context Engineering for Agents involves getting relevant data into the LLM’s prompt, building on the in-context learning capabilities of LLMs. But LLMs have finite-sized context windows, so you can't just dump unprocessed context data into your Agent's LLM prompt. You need to select the right data, process it into the correct format, and compress or summarize it before using it as context.

In this talk, we will introduce techniques for selection, preprocessing, and compression of context data, taking inspiration from the tried and tested techniques used for feature engineering for ML. What goes around, comes around.

Machine Learning & AI
14:30
30min
Python Worst Practices: Learn from the Expert
Evan Wimpey

Data and Analytics Comedian Evan Wimpey is here to roast his own codebase! Enjoy a walkthrough of the worst Python habits. In this talk, you'll get to see:
* Incomprehensible variable names
* final_final_2.ipynb files
* rerunning the same cell and hoping it works this time
* imports that are never used
* debugging with print
* ML models that are validated on training data
* code so poorly written that even ChatGPT can't understand it
* and more!

General Track
15:00
60min
Keynote
General Track
16:30
90min
Bayesian Decision Analysis with PyMC: Beyond A/B Testing
Allen Downey

This hands-on tutorial introduces practical Bayesian inference using PyMC, focusing on A/B testing, decision-making under uncertainty, and hierarchical modeling. With real-world examples, you'll learn how to build and interpret Bayesian models, evaluate competing hypotheses, and implement adaptive strategies like Thompson sampling. Whether you're working in marketing, healthcare, public policy, UX design, or data science more broadly, these techniques offer powerful tools for experimentation, decision-making, and evidence-based analysis.

Analytics, Visualization & Decision Science
16:30
30min
Combining Zarr, HDF5, and TIFF into a single data format
Mark Kittisopikul, Ph.D.

TIFF, HDF5, and Zarr represent a few of the choices for storing the large n-dimensional arrays that hold scientific and machine learning data. Trade-offs have to be considered when selecting one of these formats. While TIFF files are recognized by many applications, particularly for imaging, they are limited in the number of dimensions: traditionally two, or three in the case of GeoTIFF. HDF5 was created to support hierarchical scientific data with arrays of up to 32 dimensions, but is mainly readable by scientific applications. Neither TIFF nor HDF5 was designed with the cloud in mind. Meanwhile, Zarr reimagined HDF5 for the era of cloud computing and key-value object stores. In retrospect, these disparate formats have many similarities. I will demonstrate how to take advantage of these similarities to combine the formats and make data accessible to a wide range of local and cloud-based applications without duplicating the data itself.

Data Engineering & Infrastructure
16:30
30min
Debugging LLM Pipelines in Python: Engineering Lessons from the Trenches
Kalyan Prasad

LLMs are powerful but things can quickly go wrong when you’re building real apps. Prompts may fail, tool calls can break, and outputs often behave in unexpected ways. In this talk, we’ll look at real-world issues developers face when working with Python-based LLM pipelines and how to fix them. You’ll walk away with practical debugging tips and tools to help make your AI apps more stable and trustworthy.

Machine Learning & AI
16:30
30min
Why Julia's GPU-Accelerated ODE Solvers are 20x-100x Faster than JAX and PyTorch
Chris Rackauckas

You may have seen the benchmark results and thought, "how the heck are the Julia ODE solvers on GPUs orders of magnitude faster than the GPU-accelerated Python libraries, that can't be true?" In this talk I will go into detail about the architectural differences between the Julia approaches to generating GPU-accelerated solvers vs the standard ML library approach to GPU usage. By the end of the talk you'll have a good enough understanding of models of GPU acceleration to understand why this performance difference exists, and the many applications that can take advantage of this performance improvement.

General Track
17:00
30min
Scaling Data Processing for LLMs with NeMo Curator
Allison Ding

Training state-of-the-art Large Language Models (LLMs) increasingly relies on the availability of clean, diverse, and large-scale datasets. Traditional CPU-based preprocessing pipelines often become a bottleneck when curating datasets that span tens or hundreds of terabytes. In this talk, we introduce NeMo Curator, an open-source, GPU-accelerated data curation framework developed by NVIDIA. Built on Python and powered by RAPIDS, NeMo Curator enables scalable, high-throughput data processing for LLMs, including semantic deduplication, filtering, classification, PII redaction, and synthetic data generation. With support for multi-node, multi-GPU environments, the framework has demonstrated up to a 7% improvement in downstream model performance on large-scale benchmarks. We will walk through its modular pipeline design, highlight real-world applications, and show how to integrate it into existing workflows for fast, reproducible, and efficient LLM training.

Machine Learning & AI
17:30
30min
I Built a Transformer from Scratch So You Don’t Have To
Jen Wei

Want to understand how transformers actually work without wading through 10,000 lines of framework code or drowning in tensor shapes? This talk walks you through building a transformer model from scratch — no pre-trained shortcuts, no black-box abstractions — just clean PyTorch code and good old-fashioned curiosity. You'll walk away with a clearer mental model of how attention, encoders, decoders, and masking really work.

Machine Learning & AI
17:30
30min
Modernizing JSON for Julia
Jacob Quinn

JSON support and interfaces vary widely across languages, and Julia has been no different. As Julia has evolved as a language, patterns and best practices around interfaces have also evolved to best leverage Julia's unique strengths: multiple dispatch, library composability, and zero-cost abstraction. The original JSON.jl package has been rewritten from scratch for a (finally!) 1.0 release, bringing JSON support in Julia up to modern best practices and patterns and combining functionality from at least 3(!) existing JSON packages into one unified library.

Data Engineering & Infrastructure
18:00
30min
Communicating Data Quality: Making the Invisible Visible (and Fun!) with Pointblank
Richard Iannone

Ensuring and communicating data quality (DQ) is one of the most persistent challenges in data-driven organizations. Data scientists, engineers, and analysts often struggle not just with detecting DQ issues, but with presenting those issues in actionable ways for diverse stakeholders across an organization (e.g., pipeline owners, fellow developers, less-technical colleagues, etc.). On top of this, DQ work has an image problem, as it can be seen as tedious, opaque, or even adversarial.

This talk introduces Pointblank, a Python package designed to make data quality validation and communication both robust and approachable. The library provides a comprehensive set of tools for profiling, validating, and reporting on data quality. There’s a strong focus on beautiful and actionable outputs as well. It can help you to generate tabular validation reports, data summaries, and granular error reporting that make it easy for anyone (technical or not) to understand what’s wrong and why.

Attendees will learn how Pointblank can help their teams not only catch data issues early, but also communicate them effectively, fostering a culture of shared responsibility for data quality. The talk will include live demos of common DQ workflows, showing how Pointblank turns a traditionally painful process into something transparent, productive, and even a little bit fun.

Analytics, Visualization & Decision Science
18:00
30min
projspec: what's this project anyway?
Martin Durant

Most code and related workflows take place in "projects": directories with descriptive metadata. With so many types of these around these days, it is hard to know what is contained where. projspec solves this for the majority of the python-data ecosystem, so that you can introspect your projects, act on them, and search across all your projects, local or remote.

General Track
12:00
30min
Getting big OpenStreetMap data with QuackOSM
Kamil Raczycki

OpenStreetMap data is publicly available, but it's hard to get it downloaded at scale without domain knowledge and an external technology stack.

With QuackOSM, you can easily work with whole-country vector and tag data without installing additional dependencies - come and find out how you can use it in your next project!

Data Engineering & Infrastructure
12:00
30min
PyData/Sparse & Finch: extending sparse computing in the Python ecosystem
Mateusz Sokół, Willow Marie Ahrens

The Scientific Python ecosystem offers a wide variety of numerical packages, such as NumPy, CuPy, and JAX. One domain that also captures a lot of attention in the community is sparse computing.

In this talk, we will present the current landscape of sparse computing in the Python ecosystem and our efforts to revive/expand it. Our main contributions to the Python ecosystem cover: (1) making a novel Finch sparse tensor compiler and Galley scheduler available for the community, (2) standardizing various aspects of sparse computing. We will show how to use the Finch compiler with the PyData/Sparse package and how it outperforms well-established alternatives for multiple kernels, such as MTTKRP or SDDMM.

Real-world use-cases will show you how, step-by-step, Python practitioners can migrate their code to an Array API compatible version and benefit from tensor operator fusion and autoscheduling capabilities offered by the Finch compiler.
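
As a small taste of the PyData/Sparse entry point (plain COO usage with NumPy-style operations; the Finch backend selection discussed in the talk is omitted here):

    import numpy as np
    import sparse

    # Build a sparse 2-D tensor from a mostly-zero NumPy array.
    rng = np.random.default_rng(42)
    dense = rng.random((500, 500))
    dense[dense < 0.99] = 0.0          # roughly 1% of entries survive
    s = sparse.COO.from_numpy(dense)

    # Familiar array operations work without densifying.
    col_sums = s.sum(axis=0)
    product = sparse.tensordot(s, s.T, axes=1)
    print(s.nnz, product.shape, col_sums.todense()[:3])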

Apart from the existing Julia implementation, the number of sparse backends offered by PyData/Sparse will grow in the future to provide Python-native alternatives to scipy.sparse and Numba solutions. One currently under development is finch-tensor-lite, a pure-Python rewrite of the Finch.jl compiler, meant to make the solution lightweight by dropping the Julia runtime dependency while providing the majority of its features.

General Track
12:00
30min
The Human Side: Leading and Mentoring Global Data Teams in the Age of AI
Amar Naik

Building great AI-driven products starts with empowered teams. Hear proven strategies for leading, mentoring, and growing distributed engineering teams, with lessons in innovation, compliance, and diversity from global digital enterprises.

Machine Learning & AI
12:30
30min
Realtime Financial Fraud Detection with Modern Python
César Soto Valero

Building ML models for financial fraud detection sounds straightforward, until you have to evaluate, validate, and deploy them in real-world pipelines. This talk walks through the practical stack, metrics, and mindsets needed to build fraud detection systems with modern Python. We'll cover key challenges like concept drift, extreme class imbalance, false-positive overload, and why the usual ML workflows fall short. Along the way, we’ll explore a real-world architecture using classical ML, deep learning, and GNNs, plus the validation techniques and production patterns that make or break fraud systems. If you're tired of toy problems and want patterns that survive real money and real latency, this talk’s for you.

Machine Learning & AI
13:00
30min
EffVer: Versioning code by the effort required to upgrade
Jacob Tomlinson

Many notable PyData projects, including JupyterHub, Matplotlib and JAX, follow a versioning scheme called EffVer, where instead of making promises about backward compatibility they communicate the likelihood and magnitude of the work required to adopt a new version.

In this talk we will dive into EffVer, what it is and what it means for developers and users. We will discuss how to apply EffVer to your own projects and how to depend on projects that use it.

General Track
13:00
30min
How to Effectively Use Text Embeddings in Tree-Based Models
Claudio Salvatore Arcidiacono

Text embeddings are a powerful tool for encoding the essence of unstructured text data into a structured, dense, multidimensional vector representation. Due to their inner structure, tree-based models such as decision trees, gradient-boosted decision trees and random forests struggle to use text embedding features effectively: a tree can use only one feature at each split, so the number of embedding dimensions it can use is limited by the tree depth.

Other models, such as linear models, can use text embeddings more effectively because they are able to use all of the embedding dimensions simultaneously.

In this presentation we will introduce a novel approach to transforming text embedding features into a format that tree-based models can use effectively. The proposed approach combines the strengths of non-tree-based models with the predictive power of tree-based models to create a more effective feature representation for tree-based models.
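
One plausible instantiation of the idea (illustrative, not necessarily the exact method from the talk): fit a linear model on the raw embedding, then hand its score to the tree model as a single dense feature that summarizes all dimensions at once.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stand-in for text embeddings: 384-dim vectors with a linear signal.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 384))
    y = (X @ rng.normal(size=384) + rng.normal(size=2000)) > 0

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # The linear model reads all embedding dimensions at once...
    # (in practice, fit it on a separate fold to avoid leakage)
    linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # ...and its decision score becomes one extra feature for the trees.
    X_tr_aug = np.column_stack([X_tr, linear.decision_function(X_tr)])
    X_te_aug = np.column_stack([X_te, linear.decision_function(X_te)])

    gbt = GradientBoostingClassifier().fit(X_tr_aug, y_tr)
    print("accuracy:", gbt.score(X_te_aug, y_te))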

Machine Learning & AI
13:00
30min
RDepot - 100% open source enterprise management of Python and R repositories
Jonas Van Malder

RDepot is a solution for the management of R package repositories in an enterprise environment. Python support has recently been implemented, and this talk will introduce RDepot to the Python community. It allows users to submit packages through a user interface or API and to automatically update and publish Python and R repositories. In this talk we will walk Python users and developers through different features of RDepot and demonstrate how these can be useful in different scenarios.

Data Engineering & Infrastructure
13:00
30min
Reviving Survival Analysis: Timeless, Yet Overlooked?
Malte Tichy

Survival analysis tackles one of the oldest and most universal questions in data science: can we learn from the past when something will happen in the future? I will introduce you to the core concepts of survival analysis, visualize time-to-event datasets with Python and R, and introduce pertinent probability distributions. Classical analysis methods for fitting such datasets - some developed long before the age of modern computing - will be compared with machine-learning approaches. Along the way, surprising paradoxes and counterintuitive results will reveal why survival analysis is not merely a blend of regression and classification, but an important prediction problem in its own right.
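
As a taste of the classical toolkit in Python, a Kaplan–Meier fit on synthetic time-to-event data (lifelines is one common choice, not necessarily the package used in the talk):

    import numpy as np
    from lifelines import KaplanMeierFitter

    # Synthetic durations with right-censoring indicators.
    rng = np.random.default_rng(1)
    durations = rng.exponential(scale=10, size=200)
    observed = rng.random(200) < 0.8   # ~20% of subjects are censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)

    # Estimated probability that the event has not yet happened by t = 5.
    print(kmf.predict(5.0))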

Analytics, Visualization & Decision Science
13:30
90min
Hands-on with Blosc2: Accelerating Your Python Data Workflows
Francesc Alted, Luke Shaw

As datasets grow, I/O becomes a primary bottleneck, slowing down scientific computing and data analysis. This tutorial provides a hands-on introduction to Blosc2, a powerful meta-compressor designed to turn I/O-bound workflows into CPU-bound ones. We will move beyond basic compression and explore how to structure data for high-performance computation.

Participants will learn to use the python-blosc2 library to compress and decompress data with various codecs and filters, optimizing for speed and ratio. The core of the tutorial will focus on the Blosc2 NDArray object, a chunked, N-dimensional array that lives on disk or in memory. Through a series of interactive exercises, you will learn how to perform out-of-core mathematical operations and analytics directly on compressed arrays, effectively handling datasets larger than available RAM.

We will also cover practical topics like data storage backends, two-level partitioning for faster data slicing, and how to integrate Blosc2 into existing NumPy-based workflows. You will leave this session with the practical skills needed to significantly accelerate your data pipelines and manage massive datasets with ease.

General Track
13:30
30min
Optimal Variable Binning in Logistic Regression
Charaf ZGUIOUAR

In many regulated industries—finance, healthcare, insurance—logistic regression remains the model of choice for its interpretability and regulatory acceptability. Yet capturing non-linear effects and interactions often requires variable binning, and naive approaches (equal-width or quantile cuts) can either wash out signal or invite overfitting. In this 30-minute session, data scientists and risk analysts with a working knowledge of logistic regression and Python will learn to:

- Diagnose the weaknesses of basic binning strategies.
- Select and apply optimal-binning algorithms for different use cases.
- Assess bin stability and guard against model overfit.

All code, data samples, and a turnkey notebook will be available on GitHub, so you can start experimenting immediately.
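
One common family of optimal-binning algorithms uses a shallow decision tree to place cut points where the target rate actually changes; a minimal sketch of that idea (illustrative, not the session's exact code):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic feature whose event rate shifts at x = 30 and x = 70.
    rng = np.random.default_rng(7)
    x = rng.uniform(0, 100, size=5000)
    rate = np.where(x < 30, 0.1, np.where(x < 70, 0.5, 0.8))
    y = (rng.random(5000) < rate).astype(int)

    # A small tree finds purity-maximizing cut points;
    # min_samples_leaf guards against unstable, overfit bins.
    tree = DecisionTreeClassifier(max_leaf_nodes=3, min_samples_leaf=200)
    tree.fit(x.reshape(-1, 1), y)

    # Internal-node thresholds are the learned bin edges.
    edges = np.sort(tree.tree_.threshold[tree.tree_.feature == 0])
    print("bin edges:", edges)   # expected near 30 and 70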

Machine Learning & AI
14:00
30min
Bundestag Chat: Discovering Political Landscape with RAG Systems
Piotr Kalota, Matthias Boeck

Retrieval-Augmented Generation (RAG) systems are transforming how we interact with unstructured data using Large Language Models (LLMs). While it’s now relatively easy to stand up a basic RAG prototype, deploying a robust, customizable, and production-ready system remains challenging.
In this talk, we present our open-source RAG blueprint through the lens of a real-world application: Bundestag Chat—a system that enables users to explore and converse with German parliamentary speeches. We’ll demonstrate how the blueprint streamlined development and scaling, and how its modular architecture allowed for seamless integration of components like LlamaIndex, Hugging Face embeddings, PGVector, Langfuse, and Ragas.
Attendees will walk away with practical insights into customizing RAG pipelines for real use cases, whether building internal tools or user-facing applications. We’ll also explore build-vs-buy trade-offs, retrieval and scaling strategies, and considerations around privacy, evaluation, and monitoring.

Machine Learning & AI
14:00
30min
From Ideas to APIs: Delivering Fast with Modern Python
César Soto Valero

The modern Python ecosystem shortens the distance between idea and implementation. This talk presents a focused workflow to move from a business question to a working prototype, fast. We'll explore reproducible environments (uv, Docker), quick data iteration with polars and duckdb, clean project scaffolding (pyproject.toml), and lightweight service layers with FastAPI and pydantic. Along the way, we’ll integrate tests (pytest), static checks (mypy), and fast linting (ruff). You’ll leave with a reusable structure, toolchain recommendations, and a mental model for optimizing feedback loops and development in modern Python projects.

Data Engineering & Infrastructure
14:00
30min
🚪🚪🐐 Lessons in Decision Making from the Monty Hall Problem
Eyal Kazin

Switch or stay, what do you say? And more importantly, why?

The Monty Hall Problem is a well-known brain teaser from which we can learn important lessons in decision making that are useful in general and in particular for data scientists.

If you are not familiar with this problem, prepare to be perplexed 🤯. If you are, I hope to shine light on aspects that you might not have considered 💡.

I introduce the problem and solve with three types of intuitions: Common, Bayesian and Causal. I summarise with a discussion on lessons learnt for better data decision making.
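
If you want to spoil the perplexity ahead of time, the whole puzzle fits in a short simulation (the host always opens a goat door, so switching wins exactly when the first pick was wrong):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    prize = rng.integers(0, 3, size=n)    # door hiding the car
    choice = rng.integers(0, 3, size=n)   # contestant's first pick

    # Staying wins only when the first pick was right (1/3 of the time);
    # switching wins whenever it was wrong (2/3 of the time).
    print("stay  :", np.mean(prize == choice))
    print("switch:", np.mean(prize != choice))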

Analytics, Visualization & Decision Science
14:30
30min
Quiet on Set: Building an On-Air Sign with Open Source Technologies
Danica Fine

While many of us have adapted to work from home life, one major problem remains: finding an easy way to keep folks in your home away from your workspace when you’re on an important call. Dust off your Raspberry Pi––let’s build a custom on-air sign with Apache Kafka®, Apache Flink®, and Apache Iceberg™!

We’ll begin by writing Python scripts to capture key events––such as when a Zoom meeting is running and when a camera is being used––and produce them into Kafka. The live data are then consumed by a Raspberry Pi script to drive the operation of a custom-designed on-air sign. From there, you’ll be introduced to the ins and outs of Flink SQL for stream processing as we wrangle the data into a better format for downstream use. And, finally, we’ll see Iceberg in action and learn how to use query engines to analyze meeting and recording trends.

By the end of the session, you’ll be well-acquainted with this powerful trio of open source technologies and know how you could use the same scaffolding and scale out a simple, at-home project to millions of users and simultaneous events.

Data Engineering & Infrastructure
14:30
30min
We Have an AI in the Room: Can You Still Trust Technical Interviews?
Kseniya Bernat

Modern AI tools are increasingly used by candidates to cheat during technical interviews — often in real time. This talk explores how these tools work, which interview formats are most vulnerable, and how to design assessments that accurately reveal a candidate’s true technical ability. Ideal for hiring managers and engineers involved in technical assessment.

Machine Learning & AI
15:00
60min
Keynote
General Track
16:00
30min
Building Production-Ready Research AI Assistants with One-Command Setup
Cainã Max Couto da Silva

Academic research is often fragmented across dense PDFs, complex jargon, and scattered media articles, making it hard to access for students, interns, and the broader public. To address this, we introduce SciChat: an open-source Research AI Assistant that unifies a lab’s papers and media coverage into a conversational system, where anyone can ask natural language questions and receive structured answers with full source citations.

This talk demonstrates how to build and deploy a production-ready RAG pipeline that uses Landing.AI for vision-based PDF parsing, Firecrawl for media extraction, and LangGraph for agentic orchestration. The entire system is containerized with FastAPI and Streamlit, launching with a single command: docker compose up.

Attendees will learn how to turn scattered research artifacts into a transparent, queryable knowledge base, making lab insights accessible, reproducible, and conversational for all.

Machine Learning & AI
16:00
30min
Decisions Under Uncertainty: A Hands‑On Guide to Bayesian Decision Theory
Quan Nguyen

We often must make decisions under uncertainty—should you carry an umbrella if there's a 30% chance of rain? Bayesian decision theory provides a principled, probabilistic framework to answer such questions by combining beliefs (probabilities), utilities (what matters to us), and actions to maximize expected gain.

This talk:
- Introduces key decision‑theoretic concepts in intuitive terms.
- Uses a toy umbrella example to ground ideas in relatable context.
- Demonstrates applications in Bayesian optimization (PoI/EI) and Bayesian experimental design.
- Is hands‑on—with Python code and practical tools—so participants leave ready to apply these ideas to real‑world problems.
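
The umbrella example reduces to a few lines of arithmetic; with illustrative utilities, maximizing expected utility already settles the question:

    # Probabilities and (illustrative) utilities for the umbrella decision.
    p_rain = 0.30
    utility = {
        ("carry", "rain"): -1,   # dry, but encumbered either way
        ("carry", "sun"): -1,
        ("leave", "rain"): -10,  # soaked
        ("leave", "sun"): 0,
    }

    for action in ("carry", "leave"):
        eu = p_rain * utility[(action, "rain")] + (1 - p_rain) * utility[(action, "sun")]
        print(action, "expected utility:", eu)
    # carry: -1.0 vs leave: -3.0 -> carry the umbrella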

Analytics, Visualization & Decision Science
16:00
90min
GPU Python for the Real World: Practical Steps to GPU-Accelerated Python with RAPIDS
Jacob Tomlinson, Naty Clementi

NVIDIA GPUs offer unmatched speed and efficiency for data processing and model training, significantly reducing the time and cost associated with these tasks. Using GPUs is even more tempting when you use zero-code-change plugins and libraries. You can use PyData libraries including pandas, Polars and NetworkX without needing to rewrite your code to get the benefits of GPU acceleration. We can also mix in GPU-native libraries like Numba, CuPy and PyTorch to accelerate our workflows from end to end.

However, integrating GPUs into our workflow can be a new challenge: we need to learn about installation, dependency management, and deployment in the Python ecosystem. When writing code, we also need to monitor performance, leverage hardware effectively, and debug when things go wrong.

This is where RAPIDS and its tooling ecosystem come to the rescue. RAPIDS is a collection of open source software libraries for executing end-to-end data pipelines on NVIDIA GPUs using familiar PyData APIs.

Data Engineering & Infrastructure
16:00
30min
Text Mining Orkut’s Community Data with Python: Cultural Memory, Platform Neglect, and Digital Amnesia
Rodrigo Silva Ferreira

Orkut was once the emotional and cultural core of Brazil’s internet. Its scraps, testimonials, and communities gave users a way to publicly shape identity, build relationships, and engage with everything from music and religion to politics and humor. When Google shut it down in 2014, most of its data was deleted. What remains today is fragmented and buried in the Wayback Machine.

In this talk, I use Python to recover and analyze limited traces of Orkut’s digital legacy. I scraped thousands of community names from archived HTML using requests and BeautifulSoup, processed them with multilingual sentence embeddings from sentence-transformers, and applied scikit-learn and BERTopic to cluster the data, surface major social themes, and quantify them. These techniques reveal how users created meaning, formed subcultures, and expressed identity through online interactions.

Alongside the technical walkthrough, I draw on Cory Doctorow’s concept of enshittification, defined as the slow decline of platforms as they shift from serving users to exploiting them. Orkut is a case of enshittification by neglect: its shutdown led not just to the death of a platform, but to the erasure of a generation’s digital memory. According to Google's farewell announcement, over its 10 years of existence, Orkut hosted 51 million communities, 120 million discussion topics, and more than 1 billion interactions; most of which were permanently deleted.

This talk is for Python users interested not only in working with social media text data but also in uncovering the cultural narratives embedded within it. It invites the audience to see datasets as more than technical artifacts, viewing them instead as living records of online social life.

General Track
16:30
30min
Optimizing AI/ML Workloads: Resource Management and Cost Attribution
Saurabh Garg

The proliferation of AI/ML workloads across commercial enterprises necessitates robust mechanisms to track, inspect and analyze their use of on-prem/cloud infrastructure. To that end, effective insights are crucial for optimizing cloud resource allocation as workload demand grows, while mitigating cloud infrastructure costs and promoting operational stability.

This talk will outline an approach to systematically monitor, inspect and analyze AI/ML workloads’ properties like runtime, resource demand/utilization and cost attribution tags. By implementing granular inspection across multi-player teams and projects, organizations can gain actionable insights into resource bottlenecks, identify opportunities for cost savings, and enable AI/ML platform engineers to directly attribute infrastructure costs to specific workloads.

Cost attribution of infrastructure usage by AI/ML workloads focuses on key metrics such as compute node group information, CPU usage seconds, data transfer, GPU allocation, memory and ephemeral storage utilization. It enables platform administrators to identify competing workloads that lead to diminishing ROI. Answering questions from data scientists like "Why did my workload run for 6 hours today, when it took only 2 hours yesterday?" or "Why did my workload start 3 hours behind schedule?" also becomes easier.

Through our work on Metaflow, we will showcase how we built a comprehensive framework for transparent usage reporting, cost attribution, performance optimization, and strategic planning for future AI/ML initiatives. Metaflow is a human centric python library that enables seamless scaling and management of AI/ML projects.

Ultimately, a well-defined usage tracking system empowers organizations to maximize the return on investment from their AI/ML endeavors while maintaining budgetary control and operational efficiency. Platform engineers and administrators will be able to gain insights into the following operational aspects of supporting a battle hardened ML Platform:

1. Optimize resource allocation: Understand consumption patterns to right-size clusters and allocate resources more efficiently, reducing idle time and preventing bottlenecks.

2. Proactively manage capacity: Forecast future resource needs based on historical usage trends, ensuring the infrastructure can scale effectively with increasing workload demand.

3. Facilitate strategic planning: Make informed decisions regarding future infrastructure investments and scaling strategies.

4. Diagnose workload execution delays: Identify resource contention, queuing issues, or insufficient capacity leading to delayed workload starts.

Data scientists, on the other hand, will gain clarity on the factors that influence workload performance; tuning them can lead to efficiencies in runtime and associated cost profiles.

Machine Learning & AI
16:30
90min
Python Polars: The Definitive Crash Course
Jeroen Janssens

Polars is a lightning-fast DataFrame library that is taking the data science community by storm. Its elegant and expressive API makes analyses pleasant to write and efficient to run. In this workshop, we’ll demonstrate how Polars enables data scientists to go from raw data to reports by reading, transforming, and visualizing data.

General Track
17:00
30min
Let Me Structure Freely? How to Improve LLM Structured Output Quality
Boris

Ever wonder why structured LLM output doesn’t feel as reliable as its natural language responses? At Khan Academy, we asked ourselves the same thing—especially as we leaned heavily on JSON-based structured outputs to power our AI tutor, Khanmigo.

Surprisingly, the root of the problem often lies in one of the most familiar tools in a Python developer’s toolbox: the humble dict. In this talk, we follow the story of how dictionary ordering can shape (and sometimes distort) structured LLM output. We’ll walk through how different frameworks—OpenAI, Claude, LangChain, OpenRouter, vLLM—handle structured responses, and why those differences matter more than you’d expect.

Along the way, we’ll share practical best practices we’ve developed to improve structured output reliability, observe subtle failure cases, and debug weird edge behaviors. If you’re building LLM apps with structured output, you’ll leave with concrete tips—and a deeper appreciation for the details that make or break your system.

Machine Learning & AI
17:00
30min
fastplotlib: driving scientific discovery through data visualization
Kushal Kolar, Caitlin Lewis

Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here we present fastplotlib, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. Fastplotlib is built upon pygfx, which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as Vulkan for fast rendering of objects. Fastplotlib is non-blocking, allowing for interactivity with data after plot generation. Ultimately, fastplotlib is a general-purpose scientific plotting library useful for fast, live visualization and analysis of complex datasets.
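
A minimal example in the spirit of fastplotlib's quickstart (API details may differ between versions, and a WGPU-capable environment is required):

    import numpy as np
    import fastplotlib as fpl

    # A sine wave as (x, y) pairs.
    xs = np.linspace(0, 10, 1_000)
    data = np.column_stack([xs, np.sin(xs)])

    fig = fpl.Figure()
    fig[0, 0].add_line(data, name="signal")
    fig.show()  # interaction (pan/zoom) stays responsive after plotting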

Analytics, Visualization & Decision Science
17:30
30min
Build your own Personal Data Warehouse
Michael Alan Washington

Tired of paying for cloud compute just to view your own data? Discover how to build a completely free, open-source personal data warehouse that runs entirely on your machine.
– Import data from Excel, CSV, SQL Server, and Microsoft Fabric
– Use AI-powered Python/C# code for advanced data transformations
– Generate SSRS-style reports – no cloud required
– Leverage local compute power to avoid cloud costs

Machine Learning & AI
18:00
30min
LLMs, Chatbots, and Dashboards: Visualize Your Data with Natural Language
Daniel Chen

LLMs have a lot of hype around them these days. Let's demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. However, LLMs provide a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel, and using those behaviors in our data science artifacts. This talk will introduce you to LLMs and the Chatlas package, and show how they can be integrated with Shiny to create an AI-powered dashboard. We'll see how we can leverage the tasks LLMs are good at to better our data science products.

Machine Learning & AI
18:00
90min
Time series analysis for coupled neurons
Indranil Ghosh

The complex nervous system provides a repertoire of evolutionary properties like neuron spiking, bursting, and chaos that are yet to be fully understood. One approach is to tackle these time-dependent properties using the technique of "dynamical systems", such as ordinary differential equations. Since the popular work by Hodgkin and Huxley, many dynamical-systems models of neurons have been proposed, of which the FitzHugh–Nagumo and Morris–Lecar models draw special attention. The nervous system is made of a network of neurons possessing a complex structural and functional topology. This topology is a function of different parameters, among which the coupling strength plays a major role. Our focus will be to systematically study the effect of various coupling strategies on the firing patterns exhibited by a collection of neurons. In this workshop, my goal is to popularize a reduced-order model of neuron dynamics known as the “denatured Morris–Lecar” system and to teach how Python can be used efficiently for research on time series analysis of coupled neurons.

General Track
18:30
30min
UQLM: Detecting LLM Hallucinations with Uncertainty Quantification in Python
Dylan Bouchard, Mohit Singh Chauhan

As LLMs become increasingly embedded in critical applications across healthcare, legal, and financial domains, their tendency to generate plausible-sounding but false information poses significant risks. This talk introduces UQLM, an open-source Python package for uncertainty-aware generation that flags likely hallucinations without requiring ground truth data. UQLM computes response-level confidence scores from token probabilities, consistency across sampled responses, LLM judges, and tunable ensembles. Attendees will learn practical strategies for implementing hallucination detection in production systems and leave with code examples they can immediately apply to improve the reliability of their LLM-powered applications. No prior uncertainty quantification background required.
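
To give a flavor of consistency-based scoring, here is a generic sketch (not UQLM's actual API; the samples stand in for several responses drawn from the same prompt):

    from difflib import SequenceMatcher
    from itertools import combinations

    # Stand-ins for multiple sampled LLM responses to one question.
    samples = [
        "The Eiffel Tower is 330 metres tall.",
        "The Eiffel Tower stands about 330 m high.",
        "The Eiffel Tower is 530 metres tall.",
    ]

    # Mean pairwise similarity: low agreement across samples
    # is a signal that the answer may be hallucinated.
    pairs = list(combinations(samples, 2))
    score = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    print(f"consistency score: {score:.2f}")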

Machine Learning & AI
11:30
30min
Building a Lightweight Feature Store for Electricity Grid Forecasts with Polars
Robin Troesch

Get a firsthand look at how we built a lightweight feature store to accelerate electricity grid forecasting. We’ll cover our decision process, design choices, and implementation using Polars and Google Cloud Storage. Expect lessons learned, real-world bumps, and a clear view of the costs, trade-offs and benefits of our solution.
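
A minimal sketch of the point-in-time join at the heart of such a feature store, using Polars (column names are illustrative):

    from datetime import datetime
    import polars as pl

    # Feature values as they became available over time.
    features = pl.DataFrame({
        "ts": [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
        "grid_load_mw": [410.0, 425.5, 399.2],
    }).sort("ts")

    # Forecast rows to enrich, point-in-time correct (no future leakage).
    targets = pl.DataFrame({
        "ts": [datetime(2024, 1, 2, 12), datetime(2024, 1, 3, 6)],
    }).sort("ts")

    joined = targets.join_asof(features, on="ts", strategy="backward")
    print(joined)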

Data Engineering & Infrastructure
11:30
30min
Revolutionizing Safety Log Analysis in Oil and Gas: A Multi-Stage LLM Approach for Enhanced Hazard Identification
Andrew Yule, Iain Docherty

In this presentation, we demonstrate how Large Language Models (LLMs) can revolutionize safety log analysis in the oil and gas industry. Our research with a major operator involved processing 15,000 safety observations through a novel multi-stage pipeline. First, we developed a domain-specific categorical framework aligned with industry standards. We then implemented an unsupervised learning approach using sentence transformers to calculate semantic similarity between observations and predefined categories. This enabled multi-dimensional classification with weighted confidence percentages. Finally, we deployed a fine-tuned LLM to assign priority scores and enhance categorization accuracy, all while maintaining data privacy through on-premises processing. The resulting system streamlines real-time safety log processing, enabling more efficient identification of potential hazards and trends. Our implementation demonstrates significant improvements in classification accuracy and processing efficiency compared to traditional methods, providing actionable insights for proactive safety management.

Machine Learning & AI
12:00
30min
How Big Are SLMs?
Jayita Bhattacharyya

Small Language Models (SLMs) are designed to deliver high performance with significantly fewer parameters than Large Language Models (LLMs). Typically, SLMs range from 100 million to 30 billion parameters, enabling them to operate efficiently on devices with limited computational resources, such as smartphones and embedded systems.

Machine Learning & AI
12:00
30min
When the Meter Maxes Out: Chernobyl Disaster Lessons for ML Systems in Production
Idan Richman Goshen

At 1:23 a.m. on 26 April 1986, the graphite-moderated RBMK reactor No. 4 at Chernobyl exploded. Every dosimeter still working inside flat-lined at 3.6 R/h, its maximum reading, while lethal radiation raged unseen. That single detail from Chernobyl is the perfect allegory for what can go wrong in modern machine-learning pipelines: clipped features, hidden distribution shifts, missing logs, runaway feedback loops, and more. This talk unpacks key incidents from the disaster and maps each one to an equivalent failure mode in production ML, showing how silent risk creeps into data systems and how to engineer for resilience. Attendees will leave with a practical set of questions to ask, signals to track, and cultural habits that keep models (and the businesses that rely on them) well clear of their own meltdowns. No nuclear physics required.
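
One of those signals can be checked in a few lines: how often a feature sits exactly at its observed maximum, the data equivalent of a dosimeter pinned at 3.6 R/h (a sketch with pandas; the threshold is illustrative):

    import numpy as np
    import pandas as pd

    # Synthetic sensor that an upstream system silently clips at 100.
    rng = np.random.default_rng(3)
    df = pd.DataFrame({"reading": np.minimum(rng.normal(90, 20, size=10_000), 100.0)})

    # A large mass of values at the exact maximum is a clipping red flag.
    at_max = (df["reading"] == df["reading"].max()).mean()
    if at_max > 0.01:
        print(f"WARNING: {at_max:.1%} of 'reading' pinned at max; feature may be clipped")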

General Track
12:30
30min
Engineering Large-scale geospatial raster processing with xarray and dask
CLINTON OYOGO DAVID

Geospatial analysis often involves harmonizing and processing raster datasets from diverse sources with varying resolutions, coordinate systems, and data formats. This talk demonstrates how you can build efficient, scalable pipelines for zonal statistics extraction using Python’s scientific computing stack, xarray, and dask to handle rasters that would otherwise overwhelm traditional processing approaches.
Through a real-world case study of processing multi-source geospatial data for small-area estimation of poverty, we’ll explore practical strategies for memory-efficient raster harmonization, parallel computing workflows, and automated statistical aggregation across administrative boundaries.

Data Engineering & Infrastructure
12:30
30min
Supercharge your Python performance with FFIs for AI workflows
Shivay Lamba, Rudraksh Karpe

Python is the go-to language in AI for its simplicity, but it often struggles with heavy computations due to the Global Interpreter Lock (GIL). This talk shows how Foreign Function Interfaces (FFIs) like Cython, ctypes, cffi, and PyO3 can dramatically enhance Python performance by calling native C, C++, or Rust code. Attendees will learn to identify bottlenecks, apply FFIs effectively, and accelerate AI and data science workflows.
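
The lowest-friction entry point is ctypes from the standard library; a minimal sketch calling the C math library directly (library lookup is platform-dependent, so treat this as illustrative):

    import ctypes
    import ctypes.util

    # Locate and load the C math library (name resolution varies by platform).
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # Declare the C signature so values cross the boundary correctly.
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))  # 1.0, computed in native code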

Machine Learning & AI
13:00
30min
Automating ML with PyCaret: Train & Compare Multiple Models to Find the Best Performer
Manjunath Janardhan

This live demonstration shows how PyCaret, an open-source low-code machine learning library, can dramatically simplify model training and comparison workflows. PyCaret is democratizing machine learning by empowering anyone to train multiple algorithms and compare their performance with minimal code. Attendees will witness live demonstrations of training various ML algorithms and using automated comparison techniques to select the best performer based on key metrics. Perfect for data scientists, developers, and ML enthusiasts looking to spend less time coding and more time on model analysis and selection.

Machine Learning & AI
13:00
30min
Open Source Models' Security: Adversarial Attacks, Poisoning & Sponge
natan katz

The use of open-source models is rapidly increasing. According to Gartner, during the Magnetic Era, their adoption is expected to triple compared to foundational models. However, this rise in usage also brings heightened cybersecurity risks. In this lecture, we will explore the unique vulnerabilities associated with open-source models, the algorithmic techniques used to exploit them, and how our startup is addressing these challenges.

General Track
13:30
30min
Accelerate deployment of your Python data science apps using ShinyProxy
Tobia De Koninck

ShinyProxy is 100% open-source software to deploy data science apps in an enterprise context. This talk will - for the first time - introduce ShinyProxy to the Python community. We'll start with a realistic example to explore what it takes to deploy a data science app for production use. Throughout the talk, you'll see how ShinyProxy addresses many of the common challenges faced when deploying apps.
These include authentication, scaling, security (such as TLS), audit logging, version control, reproducibility, and more. The main goal of ShinyProxy is to ensure data scientists can focus on doing science instead of spending time on technical requirements, procedures and maintenance. This talk is tailored for both data scientists and anyone interested in setting up ShinyProxy. No deep technical knowledge is required to follow along. At the end of the talk, you'll know everything to get started with ShinyProxy and to deploy your first app.

Data Engineering & Infrastructure
13:30
30min
Streaming AI Workflows in Python: Kafka Queues and Flink-Powered LLM Inference
Shekhar Prasad Rajak, bhrathjatoth

Python users working on real-time analytics—from payment processing and fraud detection to AI-driven support—rely on message queues to keep data moving reliably and efficiently. Traditional message queues, however, can struggle with large-scale, concurrent workloads, especially when you need durability and replayability.

In this session, we’ll show how Kafka 4.0 introduces robust queue semantics to distributed streaming, empowering Python applications to handle fair, concurrent, and isolated message processing at scale—using familiar Kafka Python clients and frameworks.

But the power lies in what you can build next. We’ll demonstrate how Apache Flink can connect Kafka event streams to real-time Large Language Model (LLM) inference for tasks like sentiment analysis and summarization, all orchestrated via Python APIs and remote model endpoints for powerful, flexible AI inference.

To complete the picture, we’ll cover how enriched results can be stored in popular data lake solutions—such as Apache Iceberg—enabling long-term analytics, time travel, and integration with downstream data science workflows. Support for Iceberg and other lakehouse formats is optional, giving you flexibility to choose the right data backend for your needs.

Machine Learning & AI
14:00
30min
From Handwritten Notes to Smart Knowledge: Build Local AI Agents with Python
Piotr Stepinski

Your notebooks are full of insights—but they’re scattered and hard to search.
In this live-coding session I’ll show how to turn handwritten notes into a searchable, connected knowledge base using local AI and minimal Python.

We start with AnythingLLM’s UI for quick wins, then move to Python agents that:
• classify note types,
• extract key ideas,
• build a personal knowledge graph.

The entire stack runs on your laptop with MLC-AI—no cloud, no data leaks.
You’ll leave with a reusable agent blueprint you can drop into any data-processing workflow tomorrow.
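As a preview of the agent pattern, here is a sketch of the note-classification step, assuming a local model server (such as one started with MLC-AI) exposing an OpenAI-compatible endpoint; the URL and model name are placeholders:

```python
# Classify a handwritten-note transcript with a local LLM -- no cloud calls.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

def classify_note(text: str) -> str:
    """Ask the local model to label a transcribed note."""
    resp = client.chat.completions.create(
        model="local-model",  # whichever model the local server has loaded
        messages=[
            {"role": "system",
             "content": "Classify the note as one of: idea, todo, reference."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

print(classify_note("Try k-NN entropy for detecting regime shifts"))
```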

Machine Learning & AI
Machine Learning & AI
14:00
30min
GPU Accelerated Zarr
Tom Augspurger

The zarr-python 3.0 release includes native support for device buffers, enabling Zarr workloads to run on compute accelerators like NVIDIA GPUs. This enables you to get more work done faster.

This talk is primarily intended for people who are at least somewhat familiar with Zarr and are curious about accelerating their n-dimensional array workloads with GPUs. That said, we will start with a brief introduction to Zarr and why you might consider it as a storage format for n-dimensional arrays (common in geospatial, microscopy, and genomics domains, among others). We'll see which factors affect performance and how to maximize throughput for your data analysis or deep learning pipeline. Finally, we'll preview future improvements to GPU-accelerated Zarr and the packages building on top of it, like xarray and cubed.

After attending this talk, you'll have the knowledge needed to determine if using zarr-python's support for device buffers can help accelerate your workload.
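For orientation, opting in to device buffers looks roughly like this (a sketch assuming a CUDA GPU with CuPy installed, following the zarr-python 3 GPU documentation):

```python
# Route zarr reads/writes through GPU device buffers instead of host memory.
import zarr
import cupy as cp

zarr.config.enable_gpu()   # opt in to zarr-python 3's GPU buffers

z = zarr.zeros((1024, 1024), chunks=(256, 256), dtype="f4")
z[:] = cp.random.random((1024, 1024))   # data written from the device
block = z[:256, :256]                   # reads come back as CuPy arrays
print(type(block))
```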

General Track
General Track
14:30
14:30
30min
Detecting Regime Shifts in Time Series with Python: Entropy-Based Change-Point Detection
Sergei Nasibian

Financial and other real-world time series often experience abrupt regime changes that can break assumptions and invalidate models. This talk shows how to use k-nearest neighbor entropy estimators combined with clustering algorithms, implemented entirely in Python, to detect these change-points early. We’ll explore practical examples with financial market data, discuss strengths and limitations, and provide reusable open-source code. Attendees will leave with tools to make their time series models more robust to sudden structural changes.
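As a generic illustration of the idea (the speaker's exact estimator and clustering pipeline may differ), a Kozachenko-Leonenko k-NN entropy computed over sliding windows will jump when the data's regime changes:

```python
# Rolling k-NN (Kozachenko-Leonenko) entropy; a jump hints at a regime shift.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(x: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko entropy estimate for 1-D samples."""
    x = x.reshape(-1, 1)
    n = len(x)
    dist, _ = cKDTree(x).query(x, k + 1)   # k+1 because the 1st hit is self
    eps = np.maximum(dist[:, -1], 1e-12)   # distance to the k-th neighbor
    return digamma(n) - digamma(k) + np.mean(np.log(2 * eps))

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)])
window = 100
entropies = [knn_entropy(series[i:i + window])
             for i in range(0, len(series) - window, 25)]
print(np.round(entropies, 2))   # entropy rises once the volatile regime starts
```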

Machine Learning & AI
Machine Learning & AI
15:00
15:00
60min
Keynote
General Track
16:30
16:30
30min
Animating Equity: Python Dashboards for Small-Town Housing and Displacement Risk
Matthew Cox

This talk demonstrates how open-source Python tools like censusdis, pandas, and folium can be combined to create an interactive, time-enabled dashboard for visualizing economic vulnerability, housing affordability, and displacement risk in small communities. Using Oxford, NC as a case study, the talk showcases a multi-year, multi-indicator mapping project designed to support equitable local planning.
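A rough sketch of the stack (the variable code B19013_001E is ACS median household income; the exact indicators and geographies in the talk may differ):

```python
# Pull tract-level median income with geometry and drop it on a folium map.
import censusdis.data as ced
from censusdis import states

gdf = ced.download(
    "acs/acs5", 2022,
    ["NAME", "B19013_001E"],
    state=states.NC, county="077",   # Granville County, home of Oxford, NC
    tract="*",
    with_geometry=True,              # returns a GeoDataFrame
)
m = gdf.explore(column="B19013_001E", legend=True)  # folium map via GeoPandas
m.save("median_income.html")
```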

Analytics, Visualization & Decision Science
Analytics, Visualization & Decision Science
16:30
30min
Bodo DataFrames: a fast and scalable HPC-based drop-in replacement for Pandas
scott-routledge

Pandas is a popular library for data scientists, but it struggles with large datasets: programs either become too slow or run out of memory. In this talk, we introduce Bodo DataFrames (https://github.com/bodo-ai/Bodo) as a drop-in replacement for the Pandas library that uses high-performance-computing (HPC) techniques such as the Message Passing Interface (MPI) and JIT compilation for acceleration and scaling. We give an overview of its architecture, explain how it avoids the problems of Pandas (while keeping user code the same), go over concrete examples, and finally discuss current limitations. This talk is for Pandas users who would like to run their code on larger data while avoiding frustrating rewrites to other APIs. Basic knowledge of Pandas and Python is recommended.
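The drop-in pattern is roughly a one-line change (assuming the import path from the project's documentation; the file path and columns below are illustrative):

```python
# Swap the import; the rest of the script stays plain Pandas code.
import bodo.pandas as pd

df = pd.read_parquet("s3://my-bucket/transactions.parquet")
summary = df.groupby("customer_id")["amount"].sum()
print(summary.head())
```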

Data Engineering & Infrastructure
Data Engineering & Infrastructure
16:30
30min
HPC Implementation of a Hybrid Recommender System in Julia
José Quenum, marthin thomas

This talk discusses a hybrid recommender system implemented in Julia for preselecting job applicants. The recommender is built on a hybrid neural architecture that combines the convolutional layers of a graph neural network with a transformer (both encoder and decoder). We discuss the preprocessing of applicant metadata and job adverts to generate a heterogeneous graph. Next, we present the recommender model and its training on an HPC cluster.

Machine Learning & AI
Machine Learning & AI
17:00
17:00
30min
Beyond Just Prediction: Causal Thinking in Machine Learning
Avik Basu

Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.

In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will show how uplift modeling helps identify who will respond positively to an intervention while avoiding those whom the intervention might deter.
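For a sense of the mechanics, here is a minimal T-learner sketch, one common uplift approach (the talk's exact method may differ; the data below is synthetic):

```python
# T-learner uplift: model treated and control groups separately, then take
# the difference of predicted outcome probabilities as the uplift score.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
treated = rng.integers(0, 2, size=2000).astype(bool)
# Outcome depends on feature 0, plus a treatment effect for some customers.
y = (X[:, 0] + treated * (X[:, 1] > 0) + rng.normal(size=2000) > 0.5).astype(int)

model_t = GradientBoostingClassifier().fit(X[treated], y[treated])
model_c = GradientBoostingClassifier().fit(X[~treated], y[~treated])

uplift = model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]
print("highest-uplift customers:", np.argsort(uplift)[::-1][:5])
```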

Analytics, Visualization & Decision Science
Analytics, Visualization & Decision Science
17:00
30min
Connected Identities: Rethinking Identity and Access Management with Neo4j and Python
Irina Loghin

Access control is ultimately about relationships—between people, systems, and resources. In this talk, we’ll look at how modeling connected identities with a graph database unlocks a more efficient and transparent way to manage Identity and Access Management (IAM).

Using Neo4j and Python, we’ll walk through a practical approach to building an IAM system that prioritizes clarity, performance, and portability. You’ll learn how to model users, roles, and permissions as a connected graph, write access logic in Cypher, and deploy a lightweight system that scales without adding complexity.

In this fast-paced talk, you’ll learn how to:

  • Map users, roles, and permissions like a detective

  • Write smart queries to control access

  • Build a lightweight, graph-powered IAM engine

No graph skills? No problem. Just bring Python and curiosity.
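To ground the idea, here is a minimal sketch with the official Neo4j Python driver; the labels and relationship types (User, HAS_ROLE, GRANTS, ON) are an assumed schema, not a prescribed one:

```python
# Answer "can this user access this resource?" with one graph traversal.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

CAN_ACCESS = """
MATCH (u:User {name: $user})-[:HAS_ROLE]->(:Role)
      -[:GRANTS]->(p:Permission)-[:ON]->(r:Resource {name: $resource})
RETURN count(p) > 0 AS allowed
"""

def can_access(user: str, resource: str) -> bool:
    with driver.session() as session:
        record = session.run(CAN_ACCESS, user=user, resource=resource).single()
        return record["allowed"]

print(can_access("ada", "billing-dashboard"))
```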

General Track
General Track
17:30
17:30
30min
Enhancing Marketplace Competitiveness: A Bayesian Approach to modelling the cold start problem
Agustin Figueroa Nazar

This session shows how Bayesian statistical modeling helps determine when you have collected enough data about new products for them to be ready for competition. We'll explore:

  • how this approach enables efficient decision-making with minimal data

  • why we chose Bayesian models over machine learning models

  • how we accounted for the required assumptions

  • how this enables a risk-management approach while providing interpretable results that business stakeholders can understand and trust

You will learn how to identify a Bayesian problem at your company and how to navigate the modelling with real-world data!
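As a stripped-down illustration of the core idea (priors, data, and thresholds below are invented), a Beta-Binomial posterior can tell you when an estimate is tight enough to act on:

```python
# Stop collecting data once the posterior credible interval is narrow enough.
from scipy import stats

alpha_prior, beta_prior = 1, 1           # uninformative Beta(1, 1) prior
conversions, trials = 12, 80             # observed data for a new product

posterior = stats.beta(alpha_prior + conversions,
                       beta_prior + trials - conversions)
low, high = posterior.ppf([0.05, 0.95])  # 90% credible interval

ready = (high - low) < 0.10              # decision rule: interval narrow enough
print(f"rate in [{low:.2f}, {high:.2f}] -> ready: {ready}")
```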

Analytics, Visualization & Decision Science
Analytics, Visualization & Decision Science
17:30
30min
From Maintainer to Monetizer: Commercializing Open Source Without Selling Your Soul
Sheikh Shuvo

Open source fuels the modern data stack, but most projects struggle to convert community value into sustainable businesses. This talk explores how maintainers and technical founders can transform open-source projects into thriving commercial ventures—without compromising openness. You’ll learn what works (and what doesn’t), hear playbooks from real-world examples, and walk away with a strategy you can apply to your own work.

General Track
General Track
18:30
18:30
30min
TinyTroupe: Enhancing Marketing Insights through LLM-Powered Multiagent Persona Simulation
Hajime Takeda

Understanding customer behavior is essential in marketing. Traditionally, marketers rely on methods such as surveys, customer interviews, and focus groups to gather insights. However, these approaches can be expensive, time-consuming, and limited in scale and diversity.
Recently, multi-agent simulation powered by Large Language Models (LLMs) has emerged as an innovative alternative. TinyTroupe, for example, enables the creation of distinct personas (e.g., budget-minded Gen-Z shoppers, premium-seeking parents), allowing marketers to predict and optimize advertising effectiveness and to rapidly replace time-consuming interviews.
In this talk, I will introduce the key concepts of LLM-powered multi-agent simulations, demonstrate their practical application in marketing through TinyTroupe, and share actionable insights and recommendations.
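For a quick sense of the API, here is a sketch following the usage patterns in TinyTroupe's README (the library needs an LLM API key configured; the persona details here are invented):

```python
# Define a persona and probe its simulated reaction to a marketing concept.
from tinytroupe.agent import TinyPerson

shopper = TinyPerson("Maya")
shopper.define("age", 22)
shopper.define("occupation", "university student")
shopper.define("personality", "budget-minded, follows sneaker trends")

# Pitch an ad concept and observe how the persona responds.
shopper.listen_and_act("What do you think of a $15/month sneaker subscription?")
```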

Machine Learning & AI
Machine Learning & AI