PyData Global 2025

11:30
30min
When AI Makes Things Up: Understanding and Tackling Hallucinations
Aarti Jha

AI systems are increasingly being integrated into real-world products - from chatbots and search engines to summarisation tools and coding assistants. Yet, despite their fluency, these models can produce confident but false or misleading information, a phenomenon known as hallucination. In production settings, such errors can erode user trust, misinform decisions, and introduce serious risks. This talk unpacks the root causes of hallucinations, explores their impact on various applications, and highlights emerging techniques to detect and mitigate them. With a focus on practical strategies, the session offers guidance for building more trustworthy AI systems fit for deployment.

Machine Learning & AI
12:00
30min
Harnessing Generative Models for Synthetic Non-Life Insurance Data
Claudio Giorgio Giancaterino

This study presents a synthetic non-life insurance premium dataset generated using several Generative Models, with a Conditional Gaussian Mixture Model employed as the benchmark. The validation of the generated data involved several steps: data visualisation, univariate comparisons, and PCA and UMAP representations of the training data versus the generated samples. To check the consistency of the generated data, the statistical Kolmogorov–Smirnov test and predictive modelling of frequency and severity with Generalised Linear Models (GLMs) under a Tweedie distribution were used as measures of the generated data's quality, followed by evidence of feature importance. For further comparison, advanced Deep Learning architectures have been employed: Conditional Variational Autoencoders (CVAEs), CVAEs enhanced with a Transformer Decoder, a Conditional Diffusion Model, and Large Language Models. The analysis assesses each model’s ability to capture the underlying distributions, preserve complex dependencies, and maintain relationships intrinsic to the premium data. These findings provide insightful directions for enhancing synthetic data generation in insurance, with potential applications in risk modelling, pricing strategies under data scarcity, and regulatory compliance.

Machine Learning & AI
12:00
30min
Python Meets Excel: Smarter Workflows for Analysts and Data Teams
Dr Nisha Arora

Python drives modern data workflows, yet Excel remains the lingua franca of business. Many Python-based data teams struggle when the “last mile” of delivery still involves exporting results to Excel for business users. This talk explores practical ways for Python users to automate, scale, and enhance Excel-heavy processes using open-source libraries.
This talk will help you bridge the gap between code and the business-facing spreadsheet world.
We will discuss real-world use cases for report generation, batch processing, and dashboard templating, all from a Python-first perspective.
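
To make the "last mile" concrete, here is a minimal sketch of automated report generation with pandas and the openpyxl engine (file, sheet, and column names are illustrative):

    import pandas as pd

    # Illustrative data; in practice this comes from your pipeline.
    summary = pd.DataFrame({"region": ["North", "South"], "revenue": [1200, 950]})
    detail = pd.DataFrame({"order_id": [1, 2, 3], "region": ["North", "South", "North"]})

    # Write both tables into one workbook that business users can open directly.
    with pd.ExcelWriter("monthly_report.xlsx", engine="openpyxl") as writer:
        summary.to_excel(writer, sheet_name="Summary", index=False)
        detail.to_excel(writer, sheet_name="Detail", index=False)

The same pattern scales to batch-producing dozens of templated workbooks.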

General Track
12:30
90min
Fast, Cost-Efficient Analytics on Blockchain Data using DuckDB - Solana as a Case Study
Busirah Olaitan Hammed

Blockchain generates millions of transactions daily, making it a rich yet complex source of data for developers, analysts, and researchers. While Google BigQuery offers public access to Solana’s historical data, repeated querying at scale can become costly and slow, especially during iterative exploration and analysis.

In this talk, I’ll demonstrate a practical workflow that combines the power of BigQuery for data extraction with the speed and flexibility of DuckDB for local, in-memory analytics. We’ll show how to efficiently query Solana data in BigQuery, export it to partitioned Parquet files, and use DuckDB to run fast, repeatable SQL queries without incurring additional cloud costs.
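
As a sketch of the local half of this workflow, assuming the BigQuery results have already been exported to Hive-partitioned Parquet files under solana/ (paths and column names are illustrative):

    import duckdb

    con = duckdb.connect()
    # DuckDB reads only the partitions and columns a query touches,
    # so repeated exploration stays fast and incurs no cloud costs.
    df = con.execute("""
        SELECT block_date, count(*) AS tx_count
        FROM read_parquet('solana/*/*.parquet', hive_partitioning = true)
        GROUP BY block_date
        ORDER BY block_date
    """).df()
    print(df.head())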

You'll learn:
- Basic blockchain data-structure terms and how transactions are stored.
- How to navigate and query Solana’s public datasets on BigQuery.
- How to export filtered blockchain data to efficient Parquet files.
- How DuckDB can serve as a lightweight analytics engine for on-chain data.
- Tips for partitioning, enriching, and automating your Solana data pipeline.

This demo will run entirely in Google Colab, to save time and to enable participants to follow along during the session.

Whether you're working on blockchain analytics, wallet behavior analysis, or on-chain data engineering, this talk will equip you with a practical approach to blockchain data workflows using open tools.

Data Engineering & Infrastructure
12:30
30min
Scaling Fuzzy Product Matching with BM25: A Comparative Study of Python and Database Solutions
Aniket

Tired of exact matches failing on messy data? This talk showcases how BM25, a powerful fuzzy search algorithm, tackles the challenge of enriching massive datasets with noisy product names. We'll compare practical, large-scale implementations using Python's bm25s library (accelerated by GPUs) and DuckDB's built-in full-text search. Join us to learn how to achieve fast, accurate data integration and discover the optimal tools for your fuzzy matching needs.
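
For a flavor of the Python side, a minimal sketch based on the bm25s quickstart (the product strings are made up; exact API details may vary between versions):

    import bm25s

    catalog = [
        "apple iphone 13 128gb black",
        "samsung galaxy s22 ultra 256gb",
        "apple iphone 13 pro max 256gb",
    ]

    # Index the catalog once, then run fuzzy lookups against it.
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(catalog))

    # A noisy query still surfaces the closest catalog entries by BM25 score.
    results, scores = retriever.retrieve(bm25s.tokenize("iphone 13 black"), corpus=catalog, k=2)
    print(results, scores)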

Analytics, Visualization & Decision Science
12:30
30min
torchTextClassifiers: Modernizing Text Classification for French National Statistics
Cédric Couralet, Meilame Tayebjee

Discover how Insee (the French National Statistics Institute) transitioned from fastText to a PyTorch-based model for text classification by developing and open-sourcing the torchTextClassifiers Python package. This presentation will cover the creation, deployment, and practical applications of torchTextClassifiers in modernizing automatic coding systems, benefiting Insee and other European National Statistical Institutes (NSIs).

Machine Learning & AI
13:00
90min
Building LLM-Powered Applications for Data Scientists and Software Engineers
Hugo Bowne-Anderson

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.

Machine Learning & AI
13:00
30min
Python Beyond the Code: Unlocking Hidden Contributions in Open Source
Iyanu Falaye

Contributing to open source isn’t just about code. Documentation, testing, community support, and issue triaging are critical but often overlooked. In this talk, I’ll share how Python developers — from junior to senior — can make a meaningful, visible impact in open source. Whether you're new to open source or looking to expand your profile, this session will help you discover practical, beginner-friendly ways to contribute and stay engaged in the long term.

General Track
13:30
30min
Lessons learnt in optimizing a large-scale pandas application using Polars, FireDucks and cuDF: Go Smart and Save More!
Sourav Saha

In general, a Data Scientist spends significant effort transforming raw data into a more digestible format before training an AI model or creating visualisations. Traditional tools such as pandas have long been the linchpin in this process, offering powerful capabilities but not without limitations. With numerous possible ways to write the same thing in pandas, users often end up selecting an uneconomical, inefficient one, leading to large computational costs as data grows. We introduce a couple of frequently occurring, intricate performance issues in pandas and what we have learnt in solving them using popular high-performance pandas alternatives: Polars, FireDucks and cuDF. The talk highlights one of the best practices (breaking out of loops) to follow when dealing with large-scale data analysis, while demonstrating the key advantages of the high-performance pandas alternatives in different scenarios.

Analytics, Visualization & Decision Science
14:00
30min
Designing a Fast, Offline-Capable Reverse Geocoder in Python: An Open Source Alternative to Big Geo APIs
Sooraj Sivadasan

While commercial reverse geocoding APIs, such as Google Maps or Mapbox, are effective, they are also costly, have rate limitations, and are not appropriate for offline or privacy-sensitive settings.

In this session, we will demonstrate how to build a fast, scalable, offline-capable reverse geocoding system in Python, using openly available datasets and modules like cKDTree, shapely, and geopandas.

You will learn how to:
- Convert geographic shapefiles into effective spatial indices
- Perform location lookups in milliseconds using tree search and vector mathematics
- Handle edge cases like unclear borders, cities with identical names, and GPS noise
- Improve performance and memory usage through multiprocessing

The system is fully open source and has been production-tested in a high-throughput environment. Whether you are developing applications for edge inference, mapping, or logistics, this talk will help you take control of your geospatial infrastructure without depending on costly commercial APIs.
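
To make the core lookup concrete, here is a minimal sketch of the nearest-place query with scipy's cKDTree (the three cities are illustrative; lat/lon is converted to 3D unit vectors so straight-line distance respects the Earth's curvature):

    import numpy as np
    from scipy.spatial import cKDTree

    # (lat, lon) of known places; in practice loaded from a shapefile or gazetteer.
    places = {"Berlin": (52.52, 13.40), "Paris": (48.86, 2.35), "Madrid": (40.42, -3.70)}
    names = list(places)

    def to_xyz(latlon):
        # Map degrees to points on the unit sphere.
        lat, lon = np.radians(np.asarray(latlon, dtype=float)).T
        return np.column_stack(
            [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)]
        )

    tree = cKDTree(to_xyz(list(places.values())))

    # Millisecond lookup: nearest known place to a query coordinate.
    _, idx = tree.query(to_xyz([(48.80, 2.30)]), k=1)
    print(names[int(idx[0])])  # -> Paris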

Data Engineering & Infrastructure
14:00
30min
Lane detection in self-driving using only NumPy
Emma Saroyan

Are you a scientist or a developer looking to understand how to use NumPy to solve computer vision problems?
NumPy is a Python package that provides the multidimensional array object, which you can use to solve the lane detection problem in computer vision for self-driving cars and autonomous driving. You can apply non-machine-learning techniques using NumPy to find straight lines in street images. No other external libraries, just Python with NumPy.
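
As a minimal illustration of the NumPy-only approach (a synthetic image stands in for a street photo; a real pipeline would add smoothing, a region-of-interest mask, and a Hough-style vote):

    import numpy as np

    # Synthetic 100x100 grayscale image with one bright diagonal "lane line".
    img = np.zeros((100, 100))
    rows = np.arange(100)
    img[rows, (0.8 * rows + 5).astype(int)] = 1.0

    # Edge detection via gradient magnitude, keeping only strong edges.
    gy, gx = np.gradient(img)
    edges = np.hypot(gx, gy) > 0.5

    # Fit a straight line through the edge pixels with least squares.
    ys, xs = np.nonzero(edges)
    slope, intercept = np.polyfit(xs, ys, deg=1)
    print(f"detected line: y = {slope:.2f}x + {intercept:.2f}")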

General Track
14:30
30min
Enhancing Apache NiFi 2.x with Python Processors
Timothy Spann

In this talk, I will delve into the world of Apache NiFi 2.0 Python processors, exploring the capabilities they offer and demonstrating how to build custom processors to enhance your data processing pipelines.

By the end of this talk, participants will have a comprehensive understanding of building and optimizing Apache NiFi 2.0 Python processors, enabling them to integrate Python seamlessly into their data processing workflows.

This session is suitable for data engineers, architects, and anyone interested in harnessing the combined power of Apache NiFi and Python for efficient data integration and flow management. One of the main uses is to build prompts and call open LLMs and AI services. NiFi excels at integration; I will cover some interesting sources, sinks and enrichments, and show where Python is helpful.

Data Engineering & Infrastructure
14:30
30min
From Feature Engineering to Context Engineering for Agents
Jim Dowling

Context Engineering for Agents involves getting relevant data into the LLM’s prompt, building on the in-context learning capabilities of LLMs. But LLMs have finite-sized context windows, so you can't just dump unprocessed context data into your Agent's LLM prompt. You need to select the right data, process it into the correct format, and compress or summarize it before using it as context.

In this talk, we will introduce techniques for selection, preprocessing, and compression of context data, taking inspiration from the tried and tested techniques used for feature engineering for ML. What goes around, comes around.

Machine Learning & AI
14:30
30min
Python Worst Practices: Learn from the Expert
Evan Wimpey

Data and Analytics Comedian Evan Wimpey is here to roast his own codebase! Enjoy a walkthrough of the worst Python habits. In this talk, you'll get to see:
* Incomprehensible variable names
* final_final_2.ipynb files
* rerunning the same cell and hoping it works this time
* imports that are never used
* debugging with print
* ML models that are validated on training data
* code so poorly written that even ChatGPT can't understand it
* and more!

General Track
15:00
60min
Keynote
General Track
16:30
90min
Bayesian Decision Analysis with PyMC: Beyond A/B Testing
Allen Downey

This hands-on tutorial introduces practical Bayesian inference using PyMC, focusing on A/B testing, decision-making under uncertainty, and hierarchical modeling. With real-world examples, you'll learn how to build and interpret Bayesian models, evaluate competing hypotheses, and implement adaptive strategies like Thompson sampling. Whether you're working in marketing, healthcare, public policy, UX design, or data science more broadly, these techniques offer powerful tools for experimentation, decision-making, and evidence-based analysis.

Analytics, Visualization & Decision Science
16:30
30min
Combining Zarr, HDF5, and TIFF into a single data format
Mark Kittisopikul, Ph.D.

TIFF, HDF5, and Zarr represent a few of the choices for storing the large n-dimensional arrays that hold scientific and machine learning data. Trade-offs have to be considered when selecting one of these formats. While TIFF files are recognized by many applications, particularly for imaging, they are limited in the number of dimensions: traditionally two, or three in the case of GeoTIFF. HDF5 was created to support hierarchical scientific data with arrays of up to 32 dimensions, but is mainly readable by scientific applications. Neither TIFF nor HDF5 was designed with the cloud in mind. Meanwhile, Zarr reimagined HDF5 for the era of cloud computing and key-value object stores. In retrospect, these disparate formats have many similarities. I will demonstrate how to take advantage of these similarities to combine the formats and make data accessible to a wide range of local and cloud-based applications without duplicating the data itself.

Data Engineering & Infrastructure
16:30
30min
Debugging LLM Pipelines in Python: Engineering Lessons from the Trenches
Kalyan Prasad

LLMs are powerful but things can quickly go wrong when you’re building real apps. Prompts may fail, tool calls can break, and outputs often behave in unexpected ways. In this talk, we’ll look at real-world issues developers face when working with Python-based LLM pipelines and how to fix them. You’ll walk away with practical debugging tips and tools to help make your AI apps more stable and trustworthy.

Machine Learning & AI
16:30
30min
Why Julia's GPU-Accelerated ODE Solvers are 20x-100x Faster than JAX and PyTorch
Chris Rackauckas

You may have seen the benchmark results and thought, "how the heck are the Julia ODE solvers on GPUs orders of magnitude faster than the GPU-accelerated Python libraries, that can't be true?" In this talk I will go into detail about the architectural differences between the Julia approaches to generating GPU-accelerated solvers vs the standard ML library approach to GPU usage. By the end of the talk you'll have a good enough understanding of models of GPU acceleration to understand why this performance difference exists, and the many applications that can take advantage of this performance improvement.

General Track
17:00
30min
Scaling Data Processing for LLMs with NeMo Curator
Allison Ding

Training state-of-the-art Large Language Models (LLMs) increasingly relies on the availability of clean, diverse, and large-scale datasets. Traditional CPU-based preprocessing pipelines often become a bottleneck when curating datasets that span tens or hundreds of terabytes. In this talk, we introduce NeMo Curator, an open-source, GPU-accelerated data curation framework developed by NVIDIA. Built on Python and powered by RAPIDS, NeMo Curator enables scalable, high-throughput data processing for LLMs, including semantic deduplication, filtering, classification, PII redaction, and synthetic data generation. With support for multi-node, multi-GPU environments, the framework has demonstrated up to a 7% improvement in downstream model performance on large-scale benchmarks. We will walk through its modular pipeline design, highlight real-world applications, and show how to integrate it into existing workflows for fast, reproducible, and efficient LLM training.

Machine Learning & AI
17:30
30min
I Built a Transformer from Scratch So You Don’t Have To
Jen Wei

Want to understand how transformers actually work without wading through 10,000 lines of framework code or drowning in tensor shapes? This talk walks you through building a transformer model from scratch — no pre-trained shortcuts, no black-box abstractions — just clean PyTorch code and good old-fashioned curiosity. You'll walk away with a clearer mental model of how attention, encoders, decoders, and masking really work.

Machine Learning & AI
17:30
30min
Modernizing JSON for Julia
Jacob Quinn

JSON support and interfaces vary widely across languages, and Julia has been no different. As Julia has evolved as a language, patterns and best practices around interfaces have also evolved to best leverage Julia's unique strengths: multiple dispatch, library composability, and zero-cost abstraction. The original JSON.jl package has been rewritten from scratch for a (finally!) 1.0 release, bringing JSON support in Julia up to modern best practices and patterns and combining functionality from at least 3(!) existing JSON packages into one unified library.

Data Engineering & Infrastructure
18:00
30min
Communicating Data Quality: Making the Invisible Visible (and Fun!) with Pointblank
Richard Iannone

Ensuring and communicating data quality (DQ) is one of the most persistent challenges in data-driven organizations. Data scientists, engineers, and analysts often struggle not just with detecting DQ issues, but with presenting those issues in actionable ways for diverse stakeholders across an organization (e.g., pipeline owners, fellow developers, less-technical colleagues, etc.). On top of this, DQ work has an image problem, as it can be seen as tedious, opaque, or even adversarial.

This talk introduces Pointblank, a Python package designed to make data quality validation and communication both robust and approachable. The library provides a comprehensive set of tools for profiling, validating, and reporting on data quality. There’s a strong focus on beautiful and actionable outputs as well. It can help you to generate tabular validation reports, data summaries, and granular error reporting that make it easy for anyone (technical or not) to understand what’s wrong and why.

Attendees will learn how Pointblank can help their teams not only catch data issues early, but also communicate them effectively, fostering a culture of shared responsibility for data quality. The talk will include live demos of common DQ workflows, showing how Pointblank turns a traditionally painful process into something transparent, productive, and even a little bit fun.

Analytics, Visualization & Decision Science
18:00
30min
projspec: what's this project anyway?
Martin Durant

Most code and related workflows take place in "projects": directories with descriptive metadata. With so many types of these around these days, it is hard to know what is contained where. projspec solves this for the majority of the python-data ecosystem, so that you can introspect your projects, act on them, and search across all your projects, local or remote.

General Track
12:00
30min
Getting big OpenStreetMap data with QuackOSM
Kamil Raczycki

OpenStreetMap data is publicly available, but it's hard to get it downloaded at scale without domain knowledge and an external technology stack.

With QuackOSM, you can easily work with whole-country vector and tag data without installing additional dependencies - come and find out how you can use it in your next project!

Data Engineering & Infrastructure
12:00
30min
PyData/Sparse & Finch: extending sparse computing in the Python ecosystem
Mateusz Sokół, Willow Marie Ahrens

The Scientific Python ecosystem offers a wide variety of numerical packages, such as NumPy, CuPy, and JAX. One domain that also captures a lot of attention in the community is sparse computing.

In this talk, we will present the current landscape of sparse computing in the Python ecosystem and our efforts to revive/expand it. Our main contributions to the Python ecosystem cover: (1) making a novel Finch sparse tensor compiler and Galley scheduler available for the community, (2) standardizing various aspects of sparse computing. We will show how to use the Finch compiler with the PyData/Sparse package and how it outperforms well-established alternatives for multiple kernels, such as MTTKRP or SDDMM.

Real-world use-cases will show you how, step-by-step, Python practitioners can migrate their code to an Array API compatible version and benefit from tensor operator fusion and autoscheduling capabilities offered by the Finch compiler.
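
As a small taste of the PyData/Sparse entry point (plain COO usage with NumPy-style operations; the Finch backend selection discussed in the talk is omitted here):

    import numpy as np
    import sparse

    # Build a sparse 2-D tensor from a mostly-zero NumPy array.
    rng = np.random.default_rng(42)
    dense = rng.random((500, 500))
    dense[dense < 0.99] = 0.0          # roughly 1% of entries survive
    s = sparse.COO.from_numpy(dense)

    # Familiar array operations work without densifying.
    col_sums = s.sum(axis=0)
    product = sparse.tensordot(s, s.T, axes=1)
    print(s.nnz, product.shape, col_sums.todense()[:3])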

Apart from the existing Julia implementation, the number of sparse backends offered by PyData/Sparse will grow in the future to provide Python-native alternatives to scipy.sparse and Numba solutions. One currently under development is finch-tensor-lite, a pure-Python rewrite of the Finch.jl compiler, meant to make the solution lightweight by dropping the Julia runtime dependency while providing the majority of its features.

General Track
12:00
30min
The Human Side: Leading and Mentoring Global Data Teams in the Age of AI
Amar Naik

Building great AI-driven products starts with empowered teams. Hear proven strategies for leading, mentoring, and growing distributed engineering teams, with lessons in innovation, compliance, and diversity from global digital enterprises.

Machine Learning & AI
12:30
30min
Realtime Financial Fraud Detection with Modern Python
César Soto Valero

Building ML models for financial fraud detection sounds straightforward, until you have to evaluate, validate, and deploy them in real-world pipelines. This talk walks through the practical stack, metrics, and mindsets needed to build fraud detection systems with modern Python. We'll cover key challenges like concept drift, extreme class imbalance, false-positive overload, and why the usual ML workflows fall short. Along the way, we’ll explore a real-world architecture using classical ML, deep learning, and GNNs, plus the validation techniques and production patterns that make or break fraud systems. If you're tired of toy problems and want patterns that survive real money and real latency, this talk’s for you.

Machine Learning & AI
13:00
30min
EffVer: Versioning code by the effort required to upgrade
Jacob Tomlinson

Many notable PyData projects, including JupyterHub, Matplotlib and JAX, follow a versioning scheme called EffVer, where instead of making promises about backward compatibility they communicate the likelihood and magnitude of the work required to adopt a new version.

In this talk we will dive into EffVer, what it is and what it means for developers and users. We will discuss how to apply EffVer to your own projects and how to depend on projects that use it.

General Track
13:00
30min
How to Effectively Use Text Embeddings in Tree-Based Models
Claudio Salvatore Arcidiacono

Text embeddings are a powerful tool for encoding the essence of unstructured text data into a structured, dense, multidimensional vector representation. Due to their inner structure, tree-based models such as decision trees, gradient-boosted decision trees and random forests struggle to use text embedding features effectively: a tree can use only one feature at each split, so the number of embedding dimensions it can use is limited by the tree depth.

Other models, such as linear models, can use text embeddings more effectively because they are able to use all of the embedding dimensions simultaneously.

In this presentation we will introduce a novel approach to transforming text embedding features into a format that tree-based models can use effectively. The proposed approach combines the strengths of non-tree-based models with the predictive power of tree-based models to create a more effective feature representation for tree-based models.
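
One plausible instantiation of the idea (illustrative, not necessarily the exact method from the talk): fit a linear model on the raw embedding, then hand its score to the tree model as a single dense feature that summarizes all dimensions at once.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stand-in for text embeddings: 384-dim vectors with a linear signal.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 384))
    y = (X @ rng.normal(size=384) + rng.normal(size=2000)) > 0

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # The linear model reads all embedding dimensions at once...
    # (in practice, fit it on a separate fold to avoid leakage)
    linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # ...and its decision score becomes one extra feature for the trees.
    X_tr_aug = np.column_stack([X_tr, linear.decision_function(X_tr)])
    X_te_aug = np.column_stack([X_te, linear.decision_function(X_te)])

    gbt = GradientBoostingClassifier().fit(X_tr_aug, y_tr)
    print("accuracy:", gbt.score(X_te_aug, y_te))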

Machine Learning & AI
13:00
30min
RDepot - 100% open source enterprise management of Python and R repositories
Jonas Van Malder

RDepot is a solution for the management of R package repositories in an enterprise environment. Python support has recently been implemented, and this talk will introduce RDepot to the Python community. It allows users to submit packages through a user interface or API and to automatically update and publish Python and R repositories. In this talk we will walk Python users and developers through different features of RDepot and demonstrate how these can be useful in different scenarios.

Data Engineering & Infrastructure
13:00
30min
Reviving Survival Analysis: Timeless, Yet Overlooked?
Malte Tichy

Survival analysis tackles one of the oldest and most universal questions in data science: can we learn from the past when something will happen in the future? I will introduce you to the core concepts of survival analysis, visualize time-to-event datasets with Python and R, and introduce pertinent probability distributions. Classical analysis methods for fitting such datasets - some developed long before the age of modern computing - will be compared with machine-learning approaches. Along the way, surprising paradoxes and counterintuitive results will reveal why survival analysis is not merely a blend of regression and classification, but an important prediction problem in its own right.
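
As a taste of the classical toolkit in Python, a Kaplan–Meier fit on synthetic time-to-event data (lifelines is one common choice, not necessarily the package used in the talk):

    import numpy as np
    from lifelines import KaplanMeierFitter

    # Synthetic durations with right-censoring indicators.
    rng = np.random.default_rng(1)
    durations = rng.exponential(scale=10, size=200)
    observed = rng.random(200) < 0.8   # ~20% of subjects are censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)

    # Estimated probability that the event has not yet happened by t = 5.
    print(kmf.predict(5.0))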

Analytics, Visualization & Decision Science
13:30
90min
Hands-on with Blosc2: Accelerating Your Python Data Workflows
Francesc Alted, Luke Shaw

As datasets grow, I/O becomes a primary bottleneck, slowing down scientific computing and data analysis. This tutorial provides a hands-on introduction to Blosc2, a powerful meta-compressor designed to turn I/O-bound workflows into CPU-bound ones. We will move beyond basic compression and explore how to structure data for high-performance computation.

Participants will learn to use the python-blosc2 library to compress and decompress data with various codecs and filters, optimizing for speed and ratio. The core of the tutorial will focus on the Blosc2 NDArray object, a chunked, N-dimensional array that lives on disk or in memory. Through a series of interactive exercises, you will learn how to perform out-of-core mathematical operations and analytics directly on compressed arrays, effectively handling datasets larger than available RAM.

We will also cover practical topics like data storage backends, two-level partitioning for faster data slicing, and how to integrate Blosc2 into existing NumPy-based workflows. You will leave this session with the practical skills needed to significantly accelerate your data pipelines and manage massive datasets with ease.

General Track
13:30
30min
Optimal Variable Binning in Logistic Regression
Charaf ZGUIOUAR

In many regulated industries—finance, healthcare, insurance—logistic regression remains the model of choice for its interpretability and regulatory acceptability. Yet capturing non-linear effects and interactions often requires variable binning, and naive approaches (equal-width or quantile cuts) can either wash out signal or invite overfitting. In this 30-minute session, data scientists and risk analysts with a working knowledge of logistic regression and Python will learn to:

- Diagnose the weaknesses of basic binning strategies.
- Select and apply optimal-binning algorithms for different use cases.
- Assess bin stability and guard against model overfit.

All code, data samples, and a turnkey notebook will be available on GitHub, so you can start experimenting immediately.
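
One common family of optimal-binning algorithms uses a shallow decision tree to place cut points where the target rate actually changes; a minimal sketch of that idea (illustrative, not the session's exact code):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic feature whose event rate shifts at x = 30 and x = 70.
    rng = np.random.default_rng(7)
    x = rng.uniform(0, 100, size=5000)
    rate = np.where(x < 30, 0.1, np.where(x < 70, 0.5, 0.8))
    y = (rng.random(5000) < rate).astype(int)

    # A small tree finds purity-maximizing cut points;
    # min_samples_leaf guards against unstable, overfit bins.
    tree = DecisionTreeClassifier(max_leaf_nodes=3, min_samples_leaf=200)
    tree.fit(x.reshape(-1, 1), y)

    # Internal-node thresholds are the learned bin edges.
    edges = np.sort(tree.tree_.threshold[tree.tree_.feature == 0])
    print("bin edges:", edges)   # expected near 30 and 70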

Machine Learning & AI
14:00
30min
Bundestag Chat: Discovering Political Landscape with RAG Systems
Piotr Kalota, Matthias Boeck

Retrieval-Augmented Generation (RAG) systems are transforming how we interact with unstructured data using Large Language Models (LLMs). While it’s now relatively easy to stand up a basic RAG prototype, deploying a robust, customizable, and production-ready system remains challenging.
In this talk, we present our open-source RAG blueprint through the lens of a real-world application: Bundestag Chat—a system that enables users to explore and converse with German parliamentary speeches. We’ll demonstrate how the blueprint streamlined development and scaling, and how its modular architecture allowed for seamless integration of components like LlamaIndex, Hugging Face embeddings, PGVector, Langfuse, and Ragas.
Attendees will walk away with practical insights into customizing RAG pipelines for real use cases, whether building internal tools or user-facing applications. We’ll also explore build-vs-buy trade-offs, retrieval and scaling strategies, and considerations around privacy, evaluation, and monitoring.

Machine Learning & AI
14:00
30min
From Ideas to APIs: Delivering Fast with Modern Python
César Soto Valero

The modern Python ecosystem shortens the distance between idea and implementation. This talk presents a focused workflow to move from a business question to a working prototype, fast. We'll explore reproducible environments (uv, Docker), quick data iteration with polars and duckdb, clean project scaffolding (pyproject.toml), and lightweight service layers with FastAPI and pydantic. Along the way, we’ll integrate tests (pytest), static checks (mypy), and fast linting (ruff). You’ll leave with a reusable structure, toolchain recommendations, and a mental model for optimizing feedback loops and development in modern Python projects.

Data Engineering & Infrastructure
14:00
30min
🚪🚪🐐 Lessons in Decision Making from the Monty Hall Problem
Eyal Kazin

Switch or stay, what do you say? And more importantly, why?

The Monty Hall Problem is a well-known brain teaser from which we can learn important lessons in decision making that are useful in general and in particular for data scientists.

If you are not familiar with this problem, prepare to be perplexed 🤯. If you are, I hope to shine light on aspects that you might not have considered 💡.

I introduce the problem and solve with three types of intuitions: Common, Bayesian and Causal. I summarise with a discussion on lessons learnt for better data decision making.
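
If you want to spoil the perplexity ahead of time, the whole puzzle fits in a short simulation (the host always opens a goat door, so switching wins exactly when the first pick was wrong):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    prize = rng.integers(0, 3, size=n)    # door hiding the car
    choice = rng.integers(0, 3, size=n)   # contestant's first pick

    # Staying wins only when the first pick was right (1/3 of the time);
    # switching wins whenever it was wrong (2/3 of the time).
    print("stay  :", np.mean(prize == choice))
    print("switch:", np.mean(prize != choice))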

Analytics, Visualization & Decision Science
14:30
30min
Quiet on Set: Building an On-Air Sign with Open Source Technologies
Danica Fine

While many of us have adapted to work from home life, one major problem remains: finding an easy way to keep folks in your home away from your workspace when you’re on an important call. Dust off your Raspberry Pi––let’s build a custom on-air sign with Apache Kafka®, Apache Flink®, and Apache Iceberg™!

We’ll begin by writing Python scripts to capture key events––such as when a Zoom meeting is running and when a camera is being used––and produce them into Kafka. The live data are then consumed by a Raspberry Pi script to drive the operation of a custom-designed on-air sign. From there, you’ll be introduced to the ins and outs of Flink SQL for stream processing as we wrangle the data into a better format for downstream use. And, finally, we’ll see Iceberg in action and learn how to use query engines to analyze meeting and recording trends.

By the end of the session, you’ll be well-acquainted with this powerful trio of open source technologies and know how you could use the same scaffolding and scale out a simple, at-home project to millions of users and simultaneous events.

Data Engineering & Infrastructure
14:30
30min
We Have an AI in the Room: Can You Still Trust Technical Interviews?
Kseniya Bernat

Modern AI tools are increasingly used by candidates to cheat during technical interviews — often in real time. This talk explores how these tools work, which interview formats are most vulnerable, and how to design assessments that accurately reveal a candidate’s true technical ability. Ideal for hiring managers and engineers involved in technical assessment.

Machine Learning & AI
15:00
60min
Keynote
General Track
16:00
30min
Building Production-Ready Research AI Assistants with One-Command Setup
Cainã Max Couto da Silva

Academic research is often fragmented across dense PDFs, complex jargon, and scattered media articles, making it hard to access for students, interns, and the broader public. To address this, we introduce SciChat: an open-source Research AI Assistant that unifies a lab’s papers and media coverage into a conversational system, where anyone can ask natural language questions and receive structured answers with full source citations.

This talk demonstrates how to build and deploy a production-ready RAG pipeline that uses Landing.AI for vision-based PDF parsing, Firecrawl for media extraction, and LangGraph for agentic orchestration. The entire system is containerized with FastAPI and Streamlit, launching with a single command: docker compose up.

Attendees will learn how to turn scattered research artifacts into a transparent, queryable knowledge base, making lab insights accessible, reproducible, and conversational for all.

Machine Learning & AI
16:00
30min
Decisions Under Uncertainty: A Hands‑On Guide to Bayesian Decision Theory
Quan Nguyen

We often must make decisions under uncertainty—should you carry an umbrella if there's a 30% chance of rain? Bayesian decision theory provides a principled, probabilistic framework to answer such questions by combining beliefs (probabilities), utilities (what matters to us), and actions to maximize expected gain.

This talk:
- Introduces key decision‑theoretic concepts in intuitive terms.
- Uses a toy umbrella example to ground ideas in relatable context.
- Demonstrates applications in Bayesian optimization (PoI/EI) and Bayesian experimental design.
- Is hands‑on—with Python code and practical tools—so participants leave ready to apply these ideas to real‑world problems.
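
The umbrella example reduces to a few lines of arithmetic; with illustrative utilities, maximizing expected utility already settles the question:

    # Probabilities and (illustrative) utilities for the umbrella decision.
    p_rain = 0.30
    utility = {
        ("carry", "rain"): -1,   # dry, but encumbered either way
        ("carry", "sun"): -1,
        ("leave", "rain"): -10,  # soaked
        ("leave", "sun"): 0,
    }

    for action in ("carry", "leave"):
        eu = p_rain * utility[(action, "rain")] + (1 - p_rain) * utility[(action, "sun")]
        print(action, "expected utility:", eu)
    # carry: -1.0 vs leave: -3.0 -> carry the umbrella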

Analytics, Visualization & Decision Science
16:00
90min
GPU Python for the Real World: Practical Steps to GPU-Accelerated Python with RAPIDS
Jacob Tomlinson, Naty Clementi

NVIDIA GPUs offer unmatched speed and efficiency for data processing and model training, significantly reducing the time and cost associated with these tasks. Using GPUs is even more tempting when you use zero-code-change plugins and libraries. You can use PyData libraries including pandas, Polars and NetworkX without needing to rewrite your code to get the benefits of GPU acceleration. We can also mix in GPU-native libraries like Numba, CuPy and PyTorch to accelerate our workflows from end to end.

However, integrating GPUs into our workflow can be a new challenge: we need to learn about installation, dependency management, and deployment in the Python ecosystem. When writing code, we also need to monitor performance, leverage hardware effectively, and debug when things go wrong.

This is where RAPIDS and its tooling ecosystem come to the rescue. RAPIDS is a collection of open source software libraries for executing end-to-end data pipelines on NVIDIA GPUs using familiar PyData APIs.

Data Engineering & Infrastructure
16:00
30min
Text Mining Orkut’s Community Data with Python: Cultural Memory, Platform Neglect, and Digital Amnesia
Rodrigo Silva Ferreira

Orkut was once the emotional and cultural core of Brazil’s internet. Its scraps, testimonials, and communities gave users a way to publicly shape identity, build relationships, and engage with everything from music and religion to politics and humor. When Google shut it down in 2014, most of its data was deleted. What remains today is fragmented and buried in the Wayback Machine.

In this talk, I use Python to recover and analyze limited traces of Orkut’s digital legacy. I scraped thousands of community names from archived HTML using requests and BeautifulSoup, processed them with multilingual sentence embeddings from sentence-transformers, and applied scikit-learn and BERTopic to cluster the data, surface major social themes, and quantify them. These techniques reveal how users created meaning, formed subcultures, and expressed identity through online interactions.

Alongside the technical walkthrough, I draw on Cory Doctorow’s concept of enshittification, defined as the slow decline of platforms as they shift from serving users to exploiting them. Orkut is a case of enshittification by neglect: its shutdown led not just to the death of a platform, but to the erasure of a generation’s digital memory. According to Google's farewell announcement, over its 10 years of existence, Orkut hosted 51 million communities, 120 million discussion topics, and more than 1 billion interactions; most of which were permanently deleted.

This talk is for Python users interested not only in working with social media text data but also in uncovering the cultural narratives embedded within it. It invites the audience to see datasets as more than technical artifacts, viewing them instead as living records of online social life.

General Track
16:30
30min
Optimizing AI/ML Workloads: Resource Management and Cost Attribution
Saurabh Garg

The proliferation of AI/ML workloads across commercial enterprises necessitates robust mechanisms to track, inspect and analyze their use of on-prem/cloud infrastructure. To that end, effective insights are crucial for optimizing cloud resource allocation as workload demand grows, while mitigating cloud infrastructure costs and promoting operational stability.

This talk will outline an approach to systematically monitor, inspect and analyze AI/ML workloads’ properties like runtime, resource demand/utilization and cost attribution tags. By implementing granular inspection across multi-player teams and projects, organizations can gain actionable insights into resource bottlenecks, identify opportunities for cost savings, and enable AI/ML platform engineers to directly attribute infrastructure costs to specific workloads.

Cost attribution of infrastructure usage by AI/ML workloads focuses on key metrics such as compute node group information, CPU usage seconds, data transfer, GPU allocation, memory and ephemeral storage utilization. It enables platform administrators to identify competing workloads that lead to diminishing ROI. Answering questions from data scientists like "Why did my workload run for 6 hours today, when it took only 2 hours yesterday?" or "Why did my workload start 3 hours behind schedule?" also becomes easier.

Through our work on Metaflow, we will showcase how we built a comprehensive framework for transparent usage reporting, cost attribution, performance optimization, and strategic planning for future AI/ML initiatives. Metaflow is a human centric python library that enables seamless scaling and management of AI/ML projects.

Ultimately, a well-defined usage tracking system empowers organizations to maximize the return on investment from their AI/ML endeavors while maintaining budgetary control and operational efficiency. Platform engineers and administrators will be able to gain insights into the following operational aspects of supporting a battle hardened ML Platform:

1. Optimize resource allocation: Understand consumption patterns to right-size clusters and allocate resources more efficiently, reducing idle time and preventing bottlenecks.

2. Proactively manage capacity: Forecast future resource needs based on historical usage trends, ensuring the infrastructure can scale effectively with increasing workload demand.

3. Facilitate strategic planning: Make informed decisions regarding future infrastructure investments and scaling strategies.

4. Diagnose workload execution delays: Identify resource contention, queuing issues, or insufficient capacity leading to delayed workload starts.

Data scientists, on the other hand, will gain clarity on the factors that influence workload performance; tuning them can lead to efficiencies in runtime and associated cost profiles.

Machine Learning & AI
16:30
90min
Python Polars: The Definitive Crash Course
Jeroen Janssens

Polars is a lightning-fast DataFrame library that is taking the data science community by storm. Its elegant and expressive API makes analyses pleasant to write and efficient to run. In this workshop, we’ll demonstrate how Polars enables data scientists to go from raw data to reports by reading, transforming, and visualizing data.

General Track
17:00
30min
Let Me Structure Freely? How to Improve LLM Structured Output Quality
Boris

Ever wonder why structured LLM output doesn’t feel as reliable as its natural language responses? At Khan Academy, we asked ourselves the same thing—especially as we leaned heavily on JSON-based structured outputs to power our AI tutor, Khanmigo.

Surprisingly, the root of the problem often lies in one of the most familiar tools in a Python developer’s toolbox: the humble dict. In this talk, we follow the story of how dictionary ordering can shape (and sometimes distort) structured LLM output. We’ll walk through how different frameworks—OpenAI, Claude, LangChain, OpenRouter, vLLM—handle structured responses, and why those differences matter more than you’d expect.

Along the way, we’ll share practical best practices we’ve developed to improve structured output reliability, observe subtle failure cases, and debug weird edge behaviors. If you’re building LLM apps with structured output, you’ll leave with concrete tips—and a deeper appreciation for the details that make or break your system.

Machine Learning & AI
17:00
30min
fastplotlib: driving scientific discovery through data visualization
Kushal Kolar, Caitlin Lewis

Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here we present fastplotlib, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. Fastplotlib is built upon pygfx, which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as Vulkan for fast rendering of objects. Fastplotlib is non-blocking, allowing for interactivity with data after plot generation. Ultimately, fastplotlib is a general-purpose scientific plotting library useful for fast, live visualization and analysis of complex datasets.
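
A minimal example in the spirit of fastplotlib's quickstart (API details may differ between versions, and a WGPU-capable environment is required):

    import numpy as np
    import fastplotlib as fpl

    # A sine wave as (x, y) pairs.
    xs = np.linspace(0, 10, 1_000)
    data = np.column_stack([xs, np.sin(xs)])

    fig = fpl.Figure()
    fig[0, 0].add_line(data, name="signal")
    fig.show()  # interaction (pan/zoom) stays responsive after plotting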

Analytics, Visualization & Decision Science
17:30
30min
Build your own Personal Data Warehouse
Michael Alan Washington

Tired of paying for cloud compute just to view your own data? Discover how to build a completely free, open-source personal data warehouse that runs entirely on your machine.
– Import data from Excel, CSV, SQL Server, and Microsoft Fabric
– Use AI-powered Python/C# code for advanced data transformations
– Generate SSRS-style reports – no cloud required
– Leverage local compute power to avoid cloud costs

Machine Learning & AI
18:00
30min
LLMs, Chatbots, and Dashboards: Visualize Your Data with Natural Language
Daniel Chen

LLMs have a lot of hype around them these days. Let's demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. However, LLMs provide a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel, and using those behaviors in our data science artifacts. This talk will introduce you to LLMs and the Chatlas package, and show how they can be integrated with Shiny to create an AI-powered dashboard. We'll see how we can leverage the tasks LLMs are good at to better our data science products.

Machine Learning & AI
18:00
90min
Time series analysis for coupled neurons
Indranil Ghosh

The complex nervous system provides a repertoire of evolutionary properties like neuron spiking, bursting, and chaos that are yet to be fully understood. One approach is to tackle these time-dependent properties using the technique of "dynamical systems", such as ordinary differential equations. Since the popular work by Hodgkin and Huxley, many dynamical-systems models of neurons have been proposed, of which the FitzHugh–Nagumo and Morris–Lecar models draw special attention. The nervous system is made of a network of neurons possessing a complex structural and functional topology. This topology is a function of different parameters, among which the coupling strength plays a major role. Our focus will be to systematically study the effect of various coupling strategies on the firing patterns exhibited by a collection of neurons. In this workshop, my goal is to popularize a reduced-order model of neuron dynamics known as the “denatured Morris–Lecar” system and to teach how Python can be used efficiently for research on time series analysis of coupled neurons.

General Track
18:30
30min
UQLM: Detecting LLM Hallucinations with Uncertainty Quantification in Python
Dylan Bouchard, Mohit Singh Chauhan

As LLMs become increasingly embedded in critical applications across healthcare, legal, and financial domains, their tendency to generate plausible-sounding but false information poses significant risks. This talk introduces UQLM, an open-source Python package for uncertainty-aware generation that flags likely hallucinations without requiring ground truth data. UQLM computes response-level confidence scores from token probabilities, consistency across sampled responses, LLM judges, and tunable ensembles. Attendees will learn practical strategies for implementing hallucination detection in production systems and leave with code examples they can immediately apply to improve the reliability of their LLM-powered applications. No prior uncertainty quantification background required.
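
To give a flavor of consistency-based scoring, here is a generic sketch (not UQLM's actual API; the samples stand in for several responses drawn from the same prompt):

    from difflib import SequenceMatcher
    from itertools import combinations

    # Stand-ins for multiple sampled LLM responses to one question.
    samples = [
        "The Eiffel Tower is 330 metres tall.",
        "The Eiffel Tower stands about 330 m high.",
        "The Eiffel Tower is 530 metres tall.",
    ]

    # Mean pairwise similarity: low agreement across samples
    # is a signal that the answer may be hallucinated.
    pairs = list(combinations(samples, 2))
    score = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    print(f"consistency score: {score:.2f}")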

Machine Learning & AI
11:30
30min
Building a Lightweight Feature Store for Electricity Grid Forecasts with Polars
Robin Troesch

Get a firsthand look at how we built a lightweight feature store to accelerate electricity grid forecasting. We’ll cover our decision process, design choices, and implementation using Polars and Google Cloud Storage. Expect lessons learned, real-world bumps, and a clear view of the costs, trade-offs and benefits of our solution.
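
A minimal sketch of the point-in-time join at the heart of such a feature store, using Polars (column names are illustrative):

    from datetime import datetime
    import polars as pl

    # Feature values as they became available over time.
    features = pl.DataFrame({
        "ts": [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
        "grid_load_mw": [410.0, 425.5, 399.2],
    }).sort("ts")

    # Forecast rows to enrich, point-in-time correct (no future leakage).
    targets = pl.DataFrame({
        "ts": [datetime(2024, 1, 2, 12), datetime(2024, 1, 3, 6)],
    }).sort("ts")

    joined = targets.join_asof(features, on="ts", strategy="backward")
    print(joined)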

Data Engineering & Infrastructure
11:30
30min
Revolutionizing Safety Log Analysis in Oil and Gas: A Multi-Stage LLM Approach for Enhanced Hazard Identification
Andrew Yule, Iain Docherty

In this presentation, we demonstrate how Large Language Models (LLMs) can revolutionize safety log analysis in the oil and gas industry. Our research with a major operator involved processing 15,000 safety observations through a novel multi-stage pipeline. First, we developed a domain-specific categorical framework aligned with industry standards. We then implemented an unsupervised learning approach using sentence transformers to calculate semantic similarity between observations and predefined categories. This enabled multi-dimensional classification with weighted confidence percentages. Finally, we deployed a fine-tuned LLM to assign priority scores and enhance categorization accuracy, all while maintaining data privacy through on-premises processing. The resulting system streamlines real-time safety log processing, enabling more efficient identification of potential hazards and trends. Our implementation demonstrates significant improvements in classification accuracy and processing efficiency compared to traditional methods, providing actionable insights for proactive safety management.

Machine Learning & AI
12:00
30min
How Big Are SLMs?
Jayita Bhattacharyya

Small Language Models (SLMs) are designed to deliver high performance with significantly fewer parameters than Large Language Models (LLMs). Typically, SLMs range from 100 million to 30 billion parameters, enabling them to operate efficiently on devices with limited computational resources, such as smartphones and embedded systems.

Machine Learning & AI
12:00
30min
When the Meter Maxes Out: Chernobyl Disaster Lessons for ML Systems in Production
Idan Richman Goshen

At 1:23 a.m. on 26 April 1986, the graphite-moderated RBMK reactor No. 4 at Chernobyl exploded. Every dosimeter still working inside flat-lined at 3.6 R/h, its maximum reading, while lethal radiation raged unseen. That single detail from Chernobyl is the perfect allegory for what can go wrong in modern machine-learning pipelines: clipped features, hidden distribution shifts, missing logs, runaway feedback loops, and more. This talk unpacks key incidents from the disaster and maps each one to an equivalent failure mode in production ML, showing how silent risk creeps into data systems and how to engineer for resilience. Attendees will leave with a practical set of questions to ask, signals to track, and cultural habits that keep models (and the businesses that rely on them) well clear of their own meltdowns. No nuclear physics required.
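
One of those signals can be checked in a few lines: how often a feature sits exactly at its observed maximum, the data equivalent of a dosimeter pinned at 3.6 R/h (a sketch with pandas; the threshold is illustrative):

    import numpy as np
    import pandas as pd

    # Synthetic sensor that an upstream system silently clips at 100.
    rng = np.random.default_rng(3)
    df = pd.DataFrame({"reading": np.minimum(rng.normal(90, 20, size=10_000), 100.0)})

    # A large mass of values at the exact maximum is a clipping red flag.
    at_max = (df["reading"] == df["reading"].max()).mean()
    if at_max > 0.01:
        print(f"WARNING: {at_max:.1%} of 'reading' pinned at max; feature may be clipped")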

General Track
12:30
30min
Engineering Large-scale geospatial raster processing with xarray and dask
CLINTON OYOGO DAVID

Geospatial analysis often involves harmonizing and processing raster datasets from diverse sources with varying resolutions, coordinate systems, and data formats. This talk demonstrates how you can build efficient, scalable pipelines for zonal statistics extraction using Python’s scientific computing stack, xarray, and dask to handle rasters that would otherwise overwhelm traditional processing approaches.
Through a real-world case study of processing multi-source geospatial data for small-area estimation of poverty, we’ll explore practical strategies for memory-efficient raster harmonization, parallel computing workflows, and automated statistical aggregation across administrative boundaries.

Data Engineering & Infrastructure
12:30
30min
Supercharge your Python performance with FFIs for AI workflows
Shivay Lamba, Rudraksh Karpe

Python is the go-to language in AI for its simplicity, but it often struggles with heavy computations due to the Global Interpreter Lock (GIL). This talk shows how Foreign Function Interfaces (FFIs) like Cython, ctypes, cffi, and PyO3 can dramatically enhance Python performance by calling native C, C++, or Rust code. Attendees will learn to identify bottlenecks, apply FFIs effectively, and accelerate AI and data science workflows.
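
The lowest-friction entry point is ctypes from the standard library; a minimal sketch calling the C math library directly (library lookup is platform-dependent, so treat this as illustrative):

    import ctypes
    import ctypes.util

    # Locate and load the C math library (name resolution varies by platform).
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # Declare the C signature so values cross the boundary correctly.
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))  # 1.0, computed in native code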

Machine Learning & AI
13:00
30min
Automating ML with PyCaret: Train & Compare Multiple Models to Find the Best Performer
Manjunath Janardhan

This live demonstration shows how PyCaret, an open-source low-code machine learning library, can dramatically simplify model training and comparison workflows. PyCaret is democratizing machine learning by empowering anyone to train multiple algorithms and compare their performance with minimal code. Attendees will witness live demonstrations of training various ML algorithms and using automated comparison techniques to select the best performer based on key metrics. Perfect for data scientists, developers, and ML enthusiasts looking to spend less time coding and more time on model analysis and selection.

Machine Learning & AI
13:00
30min
Open Source Models' Security: Adversarial Attacks, Poisoning & Sponge
natan katz

The use of open-source models is rapidly increasing. According to Gartner, during the Magnetic Era, their adoption is expected to triple compared to foundational models. However, this rise in usage also brings heightened cybersecurity risks. In this lecture, we will explore the unique vulnerabilities associated with open-source models, the algorithmic techniques used to exploit them, and how our startup is addressing these challenges.

General Track
13:30
30min
Accelerate deployment of your Python data science apps using ShinyProxy
Tobia De Koninck

ShinyProxy is 100% open-source software to deploy data science apps in an enterprise context. This talk will - for the first time - introduce ShinyProxy to the Python community. We'll start with a realistic example to explore what it takes to deploy a data science app for production use. Throughout the talk, you'll see how ShinyProxy addresses many of the common challenges faced when deploying apps.
These include authentication, scaling, security (such as TLS), audit logging, version control, reproducibility, and more. The main goal of ShinyProxy is to ensure data scientists can focus on doing science instead of spending time on technical requirements, procedures and maintenance. This talk is tailored for both data scientists and anyone interested in setting up ShinyProxy. No deep technical knowledge is required to follow along. At the end of the talk, you'll know everything to get started with ShinyProxy and to deploy your first app.

Data Engineering & Infrastructure
13:30
30min
Streaming AI Workflows in Python: Kafka Queues and Flink-Powered LLM Inference
Shekhar Prasad Rajak, bhrathjatoth

Python users working on real-time analytics—from payment processing and fraud detection to AI-driven support—rely on message queues to keep data moving reliably and efficiently. Traditional message queues, however, can struggle with large-scale, concurrent workloads, especially when you need durability and replayability.

In this session, we’ll show how Kafka 4.0 introduces robust queue semantics to distributed streaming, empowering Python applications to handle fair, concurrent, and isolated message processing at scale—using familiar Kafka Python clients and frameworks.

But the power lies in what you can build next. We’ll demonstrate how Apache Flink can connect Kafka event streams to real-time Large Language Model (LLM) inference for tasks like sentiment analysis and summarization, all orchestrated via Python APIs and remote model endpoints for powerful, flexible AI inference.

To complete the picture, we’ll cover how enriched results can be stored in popular data lake solutions—such as Apache Iceberg—enabling long-term analytics, time travel, and integration with downstream data science workflows. Support for Iceberg and other lakehouse formats is optional, giving you flexibility to choose the right data backend for your needs.

Machine Learning & AI
14:00
30min
From Handwritten Notes to Smart Knowledge: Build Local AI Agents with Python
Piotr Stepinski

Your notebooks are full of insights—but they’re scattered and hard to search.
In this live-coding session I’ll show how to turn handwritten notes into a searchable, connected knowledge base using local AI and minimal Python.

We start with AnythingLLM’s UI for quick wins, then move to Python agents that:
• classify note types,
• extract key ideas,
• build a personal knowledge graph.

The entire stack runs on your laptop with MLC-AI—no cloud, no data leaks.
You’ll leave with a reusable agent blueprint you can drop into any data-processing workflow tomorrow.
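As a preview of the agent pattern, here is a sketch of the note-classification step, assuming a local model server (such as one started with MLC-AI) exposing an OpenAI-compatible endpoint; the URL and model name are placeholders:

```python
# Classify a handwritten-note transcript with a local LLM -- no cloud calls.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

def classify_note(text: str) -> str:
    """Ask the local model to label a transcribed note."""
    resp = client.chat.completions.create(
        model="local-model",  # whichever model the local server has loaded
        messages=[
            {"role": "system",
             "content": "Classify the note as one of: idea, todo, reference."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

print(classify_note("Try k-NN entropy for detecting regime shifts"))
```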

Machine Learning & AI
Machine Learning & AI
14:00
30min
GPU Accelerated Zarr
Tom Augspurger

The zarr-python 3.0 release includes native support for device buffers, enabling Zarr workloads to run on compute accelerators like NVIDIA GPUs. This enables you to get more work done faster.

This talk is primarily intended for people who are at least somewhat familiar with Zarr and are curious about accelerating their n-dimensional array workloads with GPUs. That said, we will start with a brief introduction to Zarr and why you might consider it as a storage format for n-dimensional arrays (common in geospatial, microscopy, and genomics domains, among others). We'll see which factors affect performance and how to maximize throughput for your data analysis or deep learning pipeline. Finally, we'll preview future improvements to GPU-accelerated Zarr and the packages building on top of it, like xarray and cubed.

After attending this talk, you'll have the knowledge needed to determine if using zarr-python's support for device buffers can help accelerate your workload.
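For orientation, opting in to device buffers looks roughly like this (a sketch assuming a CUDA GPU with CuPy installed, following the zarr-python 3 GPU documentation):

```python
# Route zarr reads/writes through GPU device buffers instead of host memory.
import zarr
import cupy as cp

zarr.config.enable_gpu()   # opt in to zarr-python 3's GPU buffers

z = zarr.zeros((1024, 1024), chunks=(256, 256), dtype="f4")
z[:] = cp.random.random((1024, 1024))   # data written from the device
block = z[:256, :256]                   # reads come back as CuPy arrays
print(type(block))
```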

General Track
General Track
14:30
14:30
30min
Detecting Regime Shifts in Time Series with Python: Entropy-Based Change-Point Detection
Sergei Nasibian

Financial and other real-world time series often experience abrupt regime changes that can break assumptions and invalidate models. This talk shows how to use k-nearest neighbor entropy estimators combined with clustering algorithms, implemented entirely in Python, to detect these change-points early. We’ll explore practical examples with financial market data, discuss strengths and limitations, and provide reusable open-source code. Attendees will leave with tools to make their time series models more robust to sudden structural changes.
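As a generic illustration of the idea (the speaker's exact estimator and clustering pipeline may differ), a Kozachenko-Leonenko k-NN entropy computed over sliding windows will jump when the data's regime changes:

```python
# Rolling k-NN (Kozachenko-Leonenko) entropy; a jump hints at a regime shift.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(x: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko entropy estimate for 1-D samples."""
    x = x.reshape(-1, 1)
    n = len(x)
    dist, _ = cKDTree(x).query(x, k + 1)   # k+1 because the 1st hit is self
    eps = np.maximum(dist[:, -1], 1e-12)   # distance to the k-th neighbor
    return digamma(n) - digamma(k) + np.mean(np.log(2 * eps))

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)])
window = 100
entropies = [knn_entropy(series[i:i + window])
             for i in range(0, len(series) - window, 25)]
print(np.round(entropies, 2))   # entropy rises once the volatile regime starts
```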

Machine Learning & AI
Machine Learning & AI
15:00
15:00
60min
Keynote
General Track
16:30
16:30
30min
Animating Equity: Python Dashboards for Small-Town Housing and Displacement Risk
Matthew Cox

This talk demonstrates how open-source Python tools like censusdis, pandas, and folium can be combined to create an interactive, time-enabled dashboard for visualizing economic vulnerability, housing affordability, and displacement risk in small communities. Using Oxford, NC as a case study, the talk showcases a multi-year, multi-indicator mapping project designed to support equitable local planning.
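A rough sketch of the stack (the variable code B19013_001E is ACS median household income; the exact indicators and geographies in the talk may differ):

```python
# Pull tract-level median income with geometry and drop it on a folium map.
import censusdis.data as ced
from censusdis import states

gdf = ced.download(
    "acs/acs5", 2022,
    ["NAME", "B19013_001E"],
    state=states.NC, county="077",   # Granville County, home of Oxford, NC
    tract="*",
    with_geometry=True,              # returns a GeoDataFrame
)
m = gdf.explore(column="B19013_001E", legend=True)  # folium map via GeoPandas
m.save("median_income.html")
```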

Analytics, Visualization & Decision Science
Analytics, Visualization & Decision Science
16:30
30min
Bodo DataFrames: a fast and scalable HPC-based drop-in replacement for Pandas
scott-routledge

Pandas is a popular library for data scientists, but it struggles with large datasets: programs either become too slow or run out of memory. In this talk, we introduce Bodo DataFrames (https://github.com/bodo-ai/Bodo) as a drop-in replacement for the Pandas library that uses high-performance-computing (HPC) techniques such as the Message Passing Interface (MPI) and JIT compilation for acceleration and scaling. We give an overview of its architecture, explain how it avoids the problems of Pandas (while keeping user code the same), go over concrete examples, and finally discuss current limitations. This talk is for Pandas users who would like to run their code on larger data while avoiding frustrating rewrites to other APIs. Basic knowledge of Pandas and Python is recommended.
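The drop-in pattern is roughly a one-line change (assuming the import path from the project's documentation; the file path and columns below are illustrative):

```python
# Swap the import; the rest of the script stays plain Pandas code.
import bodo.pandas as pd

df = pd.read_parquet("s3://my-bucket/transactions.parquet")
summary = df.groupby("customer_id")["amount"].sum()
print(summary.head())
```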

Data Engineering & Infrastructure
Data Engineering & Infrastructure
16:30
30min
HPC Implementation of a Hybrid Recommender System in Julia
José Quenum, marthin thomas

This talk discusses a hybrid recommender system implemented in Julia for preselecting job applicants. The recommender is built on a hybrid neural architecture that combines the convolutional layers of a graph neural network with a transformer (both encoder and decoder). We discuss the preprocessing of applicant metadata and job adverts to generate a heterogeneous graph. Next, we present the recommender model and its training on an HPC cluster.

Machine Learning & AI
Machine Learning & AI
17:00
17:00
30min
Beyond Just Prediction: Causal Thinking in Machine Learning
Avik Basu

Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.

In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will show how uplift modeling helps identify who will respond positively to an intervention while avoiding those whom the intervention might deter.
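For a sense of the mechanics, here is a minimal T-learner sketch, one common uplift approach (the talk's exact method may differ; the data below is synthetic):

```python
# T-learner uplift: model treated and control groups separately, then take
# the difference of predicted outcome probabilities as the uplift score.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
treated = rng.integers(0, 2, size=2000).astype(bool)
# Outcome depends on feature 0, plus a treatment effect for some customers.
y = (X[:, 0] + treated * (X[:, 1] > 0) + rng.normal(size=2000) > 0.5).astype(int)

model_t = GradientBoostingClassifier().fit(X[treated], y[treated])
model_c = GradientBoostingClassifier().fit(X[~treated], y[~treated])

uplift = model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]
print("highest-uplift customers:", np.argsort(uplift)[::-1][:5])
```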

Analytics, Visualization & Decision Science
Analytics, Visualization & Decision Science
17:00
30min
Connected Identities: Rethinking Identity and Access Management with Neo4j and Python
Irina Loghin

Access control is ultimately about relationships—between people, systems, and resources. In this talk, we’ll look at how modeling connected identities with a graph database unlocks a more efficient and transparent way to manage Identity and Access Management (IAM).

Using Neo4j and Python, we’ll walk through a practical approach to building an IAM system that prioritizes clarity, performance, and portability. You’ll learn how to model users, roles, and permissions as a connected graph, write access logic in Cypher, and deploy a lightweight system that scales without adding complexity.

In this fast-paced talk, you’ll learn how to:

  • Map users, roles, and permissions like a detective

  • Write smart queries to control access

  • Build a lightweight, graph-powered IAM engine

No graph skills? No problem. Just bring Python and curiosity.
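To ground the idea, here is a minimal sketch with the official Neo4j Python driver; the labels and relationship types (User, HAS_ROLE, GRANTS, ON) are an assumed schema, not a prescribed one:

```python
# Answer "can this user access this resource?" with one graph traversal.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

CAN_ACCESS = """
MATCH (u:User {name: $user})-[:HAS_ROLE]->(:Role)
      -[:GRANTS]->(p:Permission)-[:ON]->(r:Resource {name: $resource})
RETURN count(p) > 0 AS allowed
"""

def can_access(user: str, resource: str) -> bool:
    with driver.session() as session:
        record = session.run(CAN_ACCESS, user=user, resource=resource).single()
        return record["allowed"]

print(can_access("ada", "billing-dashboard"))
```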

General Track
General Track
17:30
17:30
30min
Enhancing Marketplace Competitiveness: A Bayesian Approach to modelling the cold start problem
Agustin Figueroa Nazar

This session shows how Bayesian statistical modeling helps determine when you have collected enough data about new products for them to be ready for competition. We'll explore:

  • how this approach enables efficient decision-making with minimal data

  • why we chose Bayesian models over machine learning models

  • how we accounted for the required assumptions

  • how this enables a risk-management approach while providing interpretable results that business stakeholders can understand and trust

You will learn how to identify a Bayesian problem at your company and how to navigate the modelling with real-world data!
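As a stripped-down illustration of the core idea (priors, data, and thresholds below are invented), a Beta-Binomial posterior can tell you when an estimate is tight enough to act on:

```python
# Stop collecting data once the posterior credible interval is narrow enough.
from scipy import stats

alpha_prior, beta_prior = 1, 1           # uninformative Beta(1, 1) prior
conversions, trials = 12, 80             # observed data for a new product

posterior = stats.beta(alpha_prior + conversions,
                       beta_prior + trials - conversions)
low, high = posterior.ppf([0.05, 0.95])  # 90% credible interval

ready = (high - low) < 0.10              # decision rule: interval narrow enough
print(f"rate in [{low:.2f}, {high:.2f}] -> ready: {ready}")
```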

Analytics, Visualization & Decision Science
Analytics, Visualization & Decision Science
17:30
30min
From Maintainer to Monetizer: Commercializing Open Source Without Selling Your Soul
Sheikh Shuvo

Open source fuels the modern data stack, but most projects struggle to convert community value into sustainable businesses. This talk explores how maintainers and technical founders can transform open-source projects into thriving commercial ventures—without compromising openness. You’ll learn what works (and what doesn’t), hear playbooks from real-world examples, and walk away with a strategy you can apply to your own work.

General Track
General Track
18:30
18:30
30min
TinyTroupe: Enhancing Marketing Insights through LLM-Powered Multiagent Persona Simulation
Hajime Takeda

Understanding customer behavior is essential in marketing. Traditionally, marketers rely on methods such as surveys, customer interviews, and focus groups to gather insights. However, these approaches can be expensive, time-consuming, and limited in scale and diversity.
Recently, multi-agent simulation powered by Large Language Models (LLMs) has emerged as an innovative alternative. TinyTroupe, for example, enables the creation of distinct personas (e.g., budget-minded Gen-Z shoppers, premium-seeking parents), allowing marketers to predict and optimize advertising effectiveness and to rapidly replace time-consuming interviews.
In this talk, I will introduce the key concepts of LLM-powered multi-agent simulations, demonstrate their practical application in marketing through TinyTroupe, and share actionable insights and recommendations.
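For a quick sense of the API, here is a sketch following the usage patterns in TinyTroupe's README (the library needs an LLM API key configured; the persona details here are invented):

```python
# Define a persona and probe its simulated reaction to a marketing concept.
from tinytroupe.agent import TinyPerson

shopper = TinyPerson("Maya")
shopper.define("age", 22)
shopper.define("occupation", "university student")
shopper.define("personality", "budget-minded, follows sneaker trends")

# Pitch an ad concept and observe how the persona responds.
shopper.listen_and_act("What do you think of a $15/month sneaker subscription?")
```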

Machine Learning & AI
Machine Learning & AI