<?xml version='1.0' encoding='utf-8' ?>
<iCalendar xmlns:pentabarf='http://pentabarf.org' xmlns:xCal='urn:ietf:params:xml:ns:xcal'>
    <vcalendar>
        <version>2.0</version>
        <prodid>-//Pentabarf//Schedule//EN</prodid>
        <x-wr-caldesc></x-wr-caldesc>
        <x-wr-calname></x-wr-calname>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>HCURNN@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-HCURNN</pentabarf:event-slug>
            <pentabarf:title>Python Meets Excel: Smarter Workflows for Analysts and Data Teams</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T120000</dtstart>
            <dtend>20251209T123000</dtend>
            <duration>003000</duration>
            <summary>Python Meets Excel: Smarter Workflows for Analysts and Data Teams</summary>
            <description>This talk is designed for Python developers, analysts, and data scientists who routinely interact with Excel-based deliverables in their organization. It focuses on practical workflows that enhance productivity and reproducibility without requiring the audience to write or understand VBA or Excel formulas.
The session begins by outlining common challenges Python users face when integrating with Excel, then introduces Python tools for manipulating Excel files, specifically pandas, xlsxwriter, and xlwings.
We will discuss real-world use cases, such as generating reports, automating dashboards, creating custom functions in Excel, and batch processing Excel files at scale.
The talk concludes with a summary of tools, limitations, and best practices for integrating Python into Excel-centric workflows. This is a conceptual and strategic talk aimed at helping Python professionals work more effectively with Excel natives in the business ecosystem.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/HCURNN/</url>
            <location>General Track</location>
            
            <attendee>DR NISHA ARORA</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>GPFCXZ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-GPFCXZ</pentabarf:event-slug>
            <pentabarf:title>Python Beyond the Code: Unlocking Hidden Contributions in Open Source</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T130000</dtstart>
            <dtend>20251209T133000</dtend>
            <duration>003000</duration>
            <summary>Python Beyond the Code: Unlocking Hidden Contributions in Open Source</summary>
            <description>Open-source projects thrive on contributions, but those contributions don&#8217;t always come in the form of pull requests. In the Python community, roles such as documentation writing, bug reproduction, testing, onboarding, user feedback, and project coordination are vital to long-term sustainability.

This talk aims to dispel the myth that only seasoned developers or prolific coders can contribute meaningfully to open-source projects. Through real-world examples and lessons from my own experience working with Python-based open-source communities, I&#8217;ll walk the audience through practical paths for getting involved &#8212; even if you&apos;re just starting or come from a non-traditional background like product, design, or DevRel.

The session will outline the different ways contributions are recognized in the Python ecosystem, including the impact of GitHub discussions, contributing guides, documentation standards like reStructuredText or Markdown, and the importance of clear communication with maintainers.

Expected audience: Python developers, career switchers, junior engineers, community managers, and anyone curious about participating in open source.

Takeaway: You&apos;ll leave with an actionable roadmap to contribute beyond code and understand how to track and present your work to peers, employers, and the broader Python community.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/GPFCXZ/</url>
            <location>General Track</location>
            
            <attendee>Iyanu Falaye</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>QMUABM@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-QMUABM</pentabarf:event-slug>
            <pentabarf:title>Open Source Models&apos; Security- Adversarial attacks, Poisoning &amp; Sponge</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T133000</dtstart>
            <dtend>20251209T140000</dtend>
            <duration>003000</duration>
            <summary>Open Source Models&apos; Security- Adversarial attacks, Poisoning &amp; Sponge</summary>
            <description>In this talk, I will discuss various methods for attacking machine learning models, including model poisoning, DDoS-style attacks, and adversarial-example generation methods such as Projected Gradient Descent (PGD), Carlini-Wagner attacks, and others. I will also present defense strategies that are data-agnostic and focus on model-driven approaches to protecting AI systems, particularly those that use open-source models. Finally, I will discuss how protecting open-source models differs from protecting regular LLMs (this talk does not cover the OWASP LLM guidance).</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/QMUABM/</url>
            <location>General Track</location>
            
            <attendee>Natan Katz</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>FHSZP7@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-FHSZP7</pentabarf:event-slug>
            <pentabarf:title>Opening Notes &amp; Keynote by Isabel Zimmerman</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T140000</dtstart>
            <dtend>20251209T150000</dtend>
            <duration>010000</duration>
            <summary>Opening Notes &amp; Keynote by Isabel Zimmerman</summary>
            <description>Isabel is a Senior Software Engineer at Posit, PBC.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/FHSZP7/</url>
            <location>General Track</location>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>TTDNXY@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-TTDNXY</pentabarf:event-slug>
            <pentabarf:title>Python Worst Practices: Learn from the Expert</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T153000</dtstart>
            <dtend>20251209T160000</dtend>
            <duration>003000</duration>
            <summary>Python Worst Practices: Learn from the Expert</summary>
            <description>This is meant to be comedy, but the best jokes always include a little bit of truth. Evan will share some hilarious stories of Python gone wrong, complete with code examples. That is, if the code even runs.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/TTDNXY/</url>
            <location>General Track</location>
            
            <attendee>Evan Wimpey</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>QHTA73@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-QHTA73</pentabarf:event-slug>
            <pentabarf:title>Text Mining Orkut&#8217;s Community Data with Python: Cultural Memory, Platform Neglect, and Digital Amnesia</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T160000</dtstart>
            <dtend>20251209T163000</dtend>
            <duration>003000</duration>
            <summary>Text Mining Orkut&#8217;s Community Data with Python: Cultural Memory, Platform Neglect, and Digital Amnesia</summary>
            <description>This talk explores how Python can be used to recover and analyze digital traces from a platform that once defined Brazil&#8217;s online culture. *Orkut*, active from 2004 to 2014, hosted millions of communities where users expressed identity, humor, politics, and emotion in public and often poetic ways. When the platform was shut down, nearly all of this user-generated data was deleted. Today, only fragmented pieces remain, preserved in the Wayback Machine.

I present a data analysis project that extracts and categorizes *Orkut* community names using open-source Python tools. I use `requests` and `BeautifulSoup` to scrape data from archived HTML snapshots. I then apply multilingual sentence embeddings from the `sentence-transformers` library to generate vector representations of the text, followed by clustering techniques using `scikit-learn` and `BERTopic` to uncover and quantify recurring social themes.

This technical walkthrough is grounded in a sociological lens. I draw on Cory Doctorow&#8217;s concept of *enshittification*, which describes how platforms degrade as they prioritize value extraction over user experience. *Orkut*&apos;s case illustrates how platform neglect can result not only in product death but also in large-scale cultural erasure. By treating community names as social artifacts, I show how data science can help recover forgotten histories and highlight overlooked communities at the intersection of digital humanities, memorialization, and cultural heritage.

Attendees will gain practical skills in web scraping, multilingual NLP, and unsupervised clustering. The talk also raises broader questions about data loss, platform decay, and the ethical role of data scientists, software engineers, and tech workers in preserving digital memory.

No advanced data science, scraping, text mining, or NLP knowledge is required. The talk is best suited for data scientists and Python developers interested in working with real-world social data and approaching datasets with both technical rigor and cultural sensitivity. Regardless of background, this talk is accessible to anyone interested in data science, NLP, and text mining.

**Time Breakdown (30 min)**

| **Time**  | **Section**                                                          |
| --------- | ------------------------------------------------------------------------------------- |
| 0&#8211;4 min   | Introduction to *Orkut* and its cultural role in Brazil and in the Global South                |
| 4&#8211;7 min   | Platform shutdown, data loss, digital memory and neglect         |
| 7&#8211;10 min  | Project overview: goals, ethical framing, and data source (Wayback) |
| 10&#8211;15 min | Scraping with `requests` and `BeautifulSoup` from archived HTML      |
| 15&#8211;20 min | Processing: multilingual embeddings with `sentence-transformers`     |
| 20&#8211;23 min | Clustering and theme discovery using `scikit-learn` and `BERTopic`   |
| 23&#8211;26 min | Insights: social themes, quantification, and what topic categories mattered to users   |
| 26&#8211;29 min | Reflection: *enshittification*, data loss, and cultural preservation   |
| 29&#8211;30 min | Final remarks and invitation to rethink data as memory + Q&amp;A        |

**Additional remarks:**
1) A GitHub repository containing the scraping scripts, archived HTML files, datasets, and analysis will be shared with attendees.
2) This project was inspired by both personal nostalgia and frustration over losing access to my *Orkut* profile, photos, testimonials, and communities.
3) Besides its overwhelming popularity in Brazil, *Orkut* also had a strong foothold in other countries across the Global South, such as India and China, reflecting its **broader appeal beyond the English-speaking tech centers** typically prioritized in platform histories. This context makes the talk especially relevant for a **PyData Global audience**.

| Country         | Traffic on Mar 31, 2004 | Traffic on Sep 30, 2014 |
|----------------|--------------------------|--------------------------|
| Brazil         | 5.16%                    | 55.5%                    |
| United States  | 51.36%                   | 3.3%                     |
| India          | &#8212;                        | 18.4%                    |
| China          | &#8212;                        | 6.4%                     |
| Japan          | 7.74%                    | 2.7%                     |
| Netherlands    | 4.10%                    | &#8212;                        |
| United Kingdom | 3.72%                    | &#8212;                        |
| Other          | 27.92%                   | 15.7%                    |

Reference: https://web.archive.org/web/20140109153358/http://www.alexa.com/siteinfo/orkut.com.br</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/QHTA73/</url>
            <location>General Track</location>
            
            <attendee>Rodrigo Silva Ferreira</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>FSTP8H@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-FSTP8H</pentabarf:event-slug>
            <pentabarf:title>Why Julia&apos;s GPU-Accelerated ODE Solvers are 20x-100x Faster than Jax and PyTorch</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T163000</dtstart>
            <dtend>20251209T170000</dtend>
            <duration>003000</duration>
            <summary>Why Julia&apos;s GPU-Accelerated ODE Solvers are 20x-100x Faster than Jax and PyTorch</summary>
            <description>This talk presents the results of the publication &quot;Automated translation and accelerated solving of differential equations on multiple GPU platforms&quot; [1], published in 2024, which demonstrates that the Julia GPU-based ODE solvers, specifically DiffEqGPU.jl, are 20x-100x faster than Jax (diffrax) and PyTorch (torchdiffeq). The publication details the architectural reasons for the performance difference, going as far as recreating the ML style of GPU acceleration in Julia to demonstrate that such an approach loses the performance advantage, and testing against alternative CUDA C++ implementations of a similar form to show exactly how the architectural decisions affect the resulting performance. However, as a highly technical article, it is not always as easy to follow as it could be. In this talk we&apos;re going to give a barebones &quot;no HPC background required&quot; explanation of how the Julia GPU stack enables a completely different approach from the &quot;standard&quot; ML libraries&apos; form of GPU acceleration, and how for some applications this can be majorly beneficial. We will note that the GPU design of the ML libraries is actually optimal for ML applications, but certain properties of some ODE solver applications call for a completely different formulation.

We will additionally talk about other projects which have seen similar results, such as solving nonlinear systems in Julia (with NonlinearSolve.jl), GPU-accelerated optimization with Optimization.jl, and new global optimizer methods in ParallelParticleSwarms.jl which all rely on this technique and the special aspects of the Julia GPU infrastructure.

[1] https://www.sciencedirect.com/science/article/abs/pii/S0045782523007156</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/FSTP8H/</url>
            <location>General Track</location>
            
            <attendee>Chris Rackauckas</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>93KHNT@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-93KHNT</pentabarf:event-slug>
            <pentabarf:title>Bridging Interactive Data Science and Big Data with Hybrid Execution</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T173000</dtstart>
            <dtend>20251209T180000</dtend>
            <duration>003000</duration>
            <summary>Bridging Interactive Data Science and Big Data with Hybrid Execution</summary>
            <description>pandas is one of the most widely used tools in the Python ecosystem, but scaling it beyond memory limits has traditionally required significant refactoring or switching to other tools. In this talk, we introduce Hybrid Execution, a new capability powered by Modin that allows pandas code to seamlessly switch between local, in-memory execution and distributed backends. This approach preserves the familiar pandas API while enabling users to scale their workflows without rewriting code. We&apos;ll explore how Hybrid Execution works under the hood, how Modin enables backend flexibility, and what it means for building interactive, scalable data pipelines with pandas.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/93KHNT/</url>
            <location>General Track</location>
            
            <attendee>Jonathan Shi</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>JCXBBW@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-JCXBBW</pentabarf:event-slug>
            <pentabarf:title>projspec: what&apos;s this project anyway?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T180000</dtstart>
            <dtend>20251209T183000</dtend>
            <duration>003000</duration>
            <summary>projspec: what&apos;s this project anyway?</summary>
            <description>Daily workflows in pydata usually occur in the context of projects - a directory tree of stuff, with special metadata files describing those contents. Many metadata specifications are in use for each of the many tools that operate on projects, storing information in small yaml, toml or json files, or in the pyproject.toml file for Python-specific projects. This model encompasses not only the majority of the environment management tools and task runners in pydata (uv, pixi, poetry, etc.) but also other essential tools (e.g., git), definitions (e.g., Hugging Face datasets), deployment (briefcase, helm, wheel) and workflow-specific metadata (e.g., pyscript).

The range of possible metadata is bewildering! Most projects show how to invoke their functionality in README files, with the first step downloading some specific tool. In some ways, all this flexibility has taken us backwards. There is no easy way to tell what type a project is and what definitions it contains without reading the supporting documentation and browsing specific files, or even downloading the whole thing and running a specific tool against it.

projspec aspires to be a layer over the most common pydata related project types. It provides introspection of project type and contents from the metadata definitions, and this can be done on remote project directories too. For each project type, we infer a set of &quot;contents&quot; (things that are defined in the project and inherently part of it) and &quot;artifacts&quot; (things the project can make or do, usually by calling a subprocess). A project can be multiple types at once: a project designed to be executed with pixi, for instance, still likely contains git information and may also have dataset declarations, things that pixi is not concerned with. Projects may also contain sub-projects of the same or different type, e.g., a conda recipe alongside a code library.
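
As a rough illustration of this kind of introspection, the sketch below infers project types from marker files in a local directory. It is not the projspec API: the `MARKERS` table and the `detect_project_types` and `find_subprojects` functions are hypothetical names invented for this example, only a handful of well-known marker files are covered, and remote directories are not handled.

```python
from pathlib import Path

# Hypothetical marker table: file or directory name mapped to a project type.
# Illustration only; this is not the rule set projspec itself uses.
MARKERS = {
    "pyproject.toml": "python",
    "pixi.toml": "pixi",
    "uv.lock": "uv",
    "poetry.lock": "poetry",
    ".git": "git",
    "meta.yaml": "conda-recipe",
}


def detect_project_types(root):
    """Return the sorted project types whose marker exists directly under root."""
    root = Path(root)
    return sorted({kind for name, kind in MARKERS.items() if (root / name).exists()})


def find_subprojects(root):
    """Map each immediate subdirectory that matches a marker to its types."""
    found = {}
    for child in sorted(Path(root).iterdir()):
        # Skip the git metadata directory; it is a marker, not a sub-project.
        if child.is_dir() and child.name != ".git":
            kinds = detect_project_types(child)
            if kinds:
                found[child.name] = kinds
    return found
```

For a directory that contains only `pixi.toml` and `.git`, `detect_project_types` returns `["git", "pixi"]`, which mirrors the point that one project can be several types at once. `find_subprojects` gives the same treatment to nested directories such as a conda recipe kept alongside a code library.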

Projspec, due to be released in time for this talk, will provide a handy API to work with projects of many types, including introspection and effecting actions. It will have a way to index many projects locally or remotely, to allow for querying with complex criteria, to find the project that matches your needs - contains certain datasets, depends on specific library/versions or is capable of creating particular output types. We will demonstrate all of this!</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/JCXBBW/</url>
            <location>General Track</location>
            
            <attendee>Martin Durant</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>JVJZFT@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-JVJZFT</pentabarf:event-slug>
            <pentabarf:title>Keynote by Lisa Amini- What&#8217;s Next in AI for Data and Data Management?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T183000</dtstart>
            <dtend>20251209T191500</dtend>
            <duration>004500</duration>
            <summary>Keynote by Lisa Amini- What&#8217;s Next in AI for Data and Data Management?</summary>
            <description>Dr. Lisa Amini leads IBM&apos;s Data &amp; AI Platforms Research efforts globally, along with IBM&apos;s AI Horizons Network. She is also an IBM Distinguished Engineer (DE). The mission of the Data &amp; AI Platforms Research theme is to infuse generative and agentic AI throughout IBM&apos;s Data Platform, to make it more intelligent, self-service, and autonomous, and to optimize its performance on AI workloads.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/JVJZFT/</url>
            <location>General Track</location>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>QREPPX@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-QREPPX</pentabarf:event-slug>
            <pentabarf:title>Building LLM-Powered Applications for Data Scientists and Software Engineers</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T103000</dtstart>
            <dtend>20251209T120000</dtend>
            <duration>013000</duration>
            <summary>Building LLM-Powered Applications for Data Scientists and Software Engineers</summary>
            <description>This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using LLMs to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you&apos;re a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.

**What You&apos;ll Learn:**

* How to integrate AI models and APIs into a practical application.
* Techniques to manage non-determinism and optimize outputs through prompt engineering.
* How to monitor, log, and evaluate AI systems to ensure reliability.
* The importance of handling structured outputs and using function calling in AI models.
* The software engineering side of building AI systems, including iterative development, debugging, and performance monitoring.
* Practical experience in building an app to query PDFs using multimodal models.

**What is Unique About This Session:**

This workshop uniquely bridges the gap between software engineering and generative AI development. While most AI workshops focus solely on model usage or tuning, this session emphasizes the entire AI software lifecycle &#8212; from prompt engineering to monitoring and tracing. Participants will learn how to manage non-determinism and create production-ready AI applications, giving them the knowledge to tackle the software engineering challenges of AI-powered apps. The hands-on approach ensures that attendees walk away with practical skills and a functional app.

**Workshop Prerequisite Knowledge:**
* Basic programming knowledge in Python.
* Familiarity with REST APIs.
* Experience working with Jupyter Notebooks or similar environments (preferred but not required).
* No prior experience with AI or machine learning is required.
* Most importantly, a sense of curiosity and a desire to learn!

If you have a background in data science, ML, or AI, this workshop will help you understand the software engineering side of building AI applications.

We will introduce you to certain modern frameworks in the workshop, but the emphasis will be on first principles and using vanilla Python and LLM calls to build AI-powered systems.

[All tutorial material will be in this GitHub repository](https://github.com/hugobowne/AI-for-SWEs).</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/QREPPX/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>hugo bowne-anderson</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>BQLTSH@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-BQLTSH</pentabarf:event-slug>
            <pentabarf:title>When AI Makes Things Up: Understanding and Tackling Hallucinations</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T120000</dtstart>
            <dtend>20251209T123000</dtend>
            <duration>003000</duration>
            <summary>When AI Makes Things Up: Understanding and Tackling Hallucinations</summary>
            <description>This session will unpack the problem of AI hallucination - not just what it is, but how it surfaces in everyday use. We&#8217;ll look at the common causes, ranging from incomplete context to over-generalisation, and walk through detection and prevention techniques such as grounding, prompt design and RAG. Whether you&#8217;re building AI products or evaluating outputs, this talk will give you the tools to recognise hallucinations and reduce their risk.

##### Outline:
* Introduction to hallucinations in LLMs
* Common causes behind hallucinated outputs
* Impact on production applications
* Techniques for detecting and evaluating hallucinations
* Strategies to reduce hallucinations
* Best practices for building trustworthy AI products
* Key takeaways

##### Background Knowledge Required:
Beginner-friendly - no prior knowledge needed. Familiarity with LLMs is a plus but not necessary.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/BQLTSH/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Aarti Jha</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>JGSYEP@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-JGSYEP</pentabarf:event-slug>
            <pentabarf:title>torchTextClassifiers : Modernizing Text classification for French National Statistics</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T123000</dtstart>
            <dtend>20251209T130000</dtend>
            <duration>003000</duration>
            <summary>torchTextClassifiers : Modernizing Text classification for French National Statistics</summary>
            <description>Insee, France&apos;s National Institute of Statistics and Economic Studies, has long relied on fastText for automatic coding tasks. Recognizing the need to modernize and future-proof this critical functionality, we developed torchTextClassifiers &#8212; an open-source Python package that enables easy training and deployment of a PyTorch-based model for text classification, paving the way for further innovation in this domain.

This session will delve into the motivations behind replacing the archived fastText package, the design and implementation of torchTextClassifiers, and its integration into Insee&apos;s production environment. We&apos;ll discuss the challenges faced during this transition, including model compatibility, performance optimization, and user adoption.

Attendees will gain insights into:

- The rationale for moving from fastText to a PyTorch-based model in production

- Packaging a PyTorch-based model architecture and open-source collaboration

- Key features and architecture of torchTextClassifiers

- Deployment strategies within a public administration (MLOps, cloud native tools, security)

- Lessons learned and best practices for similar transitions

This talk is intended for data scientists, machine learning engineers, and practitioners interested in NLP, model deployment, and open-source tool development.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/JGSYEP/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Meilame Tayebjee</attendee>
            
            <attendee>C&#233;dric Couralet</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>GMWTUK@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-GMWTUK</pentabarf:event-slug>
            <pentabarf:title>Harnessing Generative Models for Synthetic Non-Life Insurance Data</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T130000</dtstart>
            <dtend>20251209T133000</dtend>
            <duration>003000</duration>
            <summary>Harnessing Generative Models for Synthetic Non-Life Insurance Data</summary>
            <description>In classification and regression tasks, generative models aim to learn the joint probability distribution of the data, and they focus on generating data points similar to the training data. Open insurance datasets are rare because they encode the proprietary risk structures of the company, which limits researchers&#8217; access to comprehensive data for analysis and for assessing new approaches. Generative models enable reproducible experimentation and innovation today.
In the talk I explore several generative models used to produce synthetic data.

1) Conditional Gaussian Mixture Models used as a benchmark;
2) Conditional Variational Autoencoders;
3) Conditional Variational Autoencoders with a Transformer Decoder;
4) Conditional Diffusion Model;
5) Large Language Models.

Finally, I present the overall results, followed by a discussion of the different approaches.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/GMWTUK/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Claudio Giorgio Giancaterino</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>RL9RDQ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-RL9RDQ</pentabarf:event-slug>
            <pentabarf:title>From Feature Engineering to Context Engineering for Agents</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T150000</dtstart>
            <dtend>20251209T153000</dtend>
            <duration>003000</duration>
            <summary>From Feature Engineering to Context Engineering for Agents</summary>
            <description>Context Engineering for Agents involves getting relevant data into the LLM&#8217;s prompt and builds on the in-context learning capabilities of LLMs. But LLMs have finite-sized context windows, so you can&apos;t just dump unprocessed context data into your Agent&apos;s LLM prompt. You need to select the right data, process it into the correct format, and compress or summarize it before using it as context.
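
As a rough illustration of the select, process, and compress steps, here is a minimal sketch in plain Python. It is not the method presented in the talk: the function name `build_context`, the relevance scores, and the sample chunks are all hypothetical, and a whitespace word count stands in for a real tokenizer.

```python
# Hypothetical sketch: pick the most relevant chunks, trim each one,
# and stop once a fixed budget is used up. Word counts stand in for tokens.

def build_context(chunks, budget, per_chunk_cap):
    """chunks: list of (relevance_score, text) pairs; returns one context string."""
    selected = []
    remaining = budget
    # Selection: visit chunks from most to least relevant.
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if remaining == 0:
            break
        words = text.split()
        # Compression (crude): truncate to the cap and to what is left of the budget.
        take = min(len(words), per_chunk_cap, remaining)
        if take:
            selected.append(" ".join(words[:take]))
            remaining -= take
    return "\n".join(selected)


chunks = [
    (0.2, "The office is closed on public holidays."),
    (0.9, "Refunds are issued within 14 days of a return request."),
    (0.7, "Returns require the original receipt and packaging."),
]
context = build_context(chunks, budget=12, per_chunk_cap=8)
print(context)
```

Truncation is the simplest possible form of compression; a summarization step could replace it without changing the selection logic.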

In this talk, we will introduce techniques for selection, preprocessing, and compression of context data, taking inspiration from the tried and tested techniques used for feature engineering for ML. What goes around, comes around.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/RL9RDQ/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Jim Dowling</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>JVPL8S@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-JVPL8S</pentabarf:event-slug>
            <pentabarf:title>Scaling Data Processing for LLMs with NeMo Curator</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T173000</dtstart>
            <dtend>20251209T180000</dtend>
            <duration>003000</duration>
            <summary>Scaling Data Processing for LLMs with NeMo Curator</summary>
            <description>The development and performance of Large Language Models (LLMs) increasingly rely on the availability of high-quality, diverse, and representative datasets. Scaling data preparation for LLMs remains a significant bottleneck in training pipelines, particularly when dealing with massive raw web-scale data. Traditional CPU-based preprocessing frameworks are often too slow and resource-intensive to meet the growing demand for efficiency, scalability, and compliance. This talk presents NeMo Curator, an open-source, GPU-accelerated data curation framework designed to accelerate and streamline the preparation of massive datasets across multi-node, multi-GPU infrastructures.

NeMo Curator introduces a modular pipeline architecture that enables high throughput preprocessing with native integration of RAPIDS for GPU acceleration. Its functionality spans semantic deduplication, heuristic filtering, automated classification, personally identifiable information (PII) redaction, and synthetic data generation. These features work in tandem to reduce noise, eliminate redundancy, and enhance data quality, ultimately improving LLM training outcomes. With support for reward-based filtering and configurable augmentation modules, NeMo Curator can generate or enhance data in low-resource domains while maintaining quality and diversity.

This talk will provide an informative walkthrough of NeMo Curator&#8217;s capabilities and show how its pipelines can be integrated into existing workflows to preprocess massive datasets efficiently. Attendees will see how to configure and execute the framework through Python APIs, leveraging both single-node and distributed environments. By the end of this talk, participants will become familiar with scalable data curation techniques and walk away with practical tools to enhance their own LLM training pipelines using GPU-accelerated infrastructure.

Detailed Outline:
1. Challenges in Scaling LLM Data Preparation (5 min)
2. Overview of NeMo Curator Framework (10 min)
3. Pipeline Modules and Functional Components (5 min)
4. Demonstration: Multi-GPU Pipeline Execution (5 min)
5. Case Studies and Performance Metrics (5 min)

Target Audience:
&#8226; Data scientists, ML/AI engineers, and AI researchers</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/JVPL8S/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Allison Ding</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>QWXTAN@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-QWXTAN</pentabarf:event-slug>
            <pentabarf:title>I Built a Transformer from Scratch So You Don&#8217;t Have To</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T180000</dtstart>
            <dtend>20251209T183000</dtend>
            <duration>003000</duration>
            <summary>I Built a Transformer from Scratch So You Don&#8217;t Have To</summary>
            <description>Transformers power modern large language models, but their inner workings are often buried under complex libraries and unreadable abstractions. In this talk, we&#8217;ll peel back the layers and build the original Transformer architecture (Vaswani et al., 2017) step by step in PyTorch, from input embeddings to attention masks to the full encoder-decoder stack.

This talk is designed for attendees with a basic understanding of deep learning and PyTorch who want to go beyond surface-level blog posts and get a hands-on, conceptual grasp of what happens under the hood. You&apos;ll see how each part of the transformer connects back to the equations in the original paper, how to debug common implementation pitfalls, and how to avoid getting lost in tensor dimension hell.
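
For orientation, the central equation of the original paper is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The sketch below is a minimal illustration in plain Python rather than PyTorch (a choice made here to keep it dependency-free, not the code from the talk); the function name `causal_attention` and the toy inputs are hypothetical. It applies a causal mask by letting each position attend only to itself and to earlier positions.

```python
import math

def causal_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask, on lists of vectors."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i only sees keys 0..i.
        visible = range(i + 1)
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append(
            [sum(w * values[j][k] for j, w in zip(visible, weights)) for k in range(dim)]
        )
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)
print(out[0])  # the first position can only attend to itself
```

Restricting the loop to the visible positions has the same effect as setting masked scores to negative infinity before the softmax, which is how tensor implementations usually express it.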

This talk features:

&#128269; A walkthrough of key components: attention, positional encoding, encoder/decoder stack

&#129504; Visual explanations of attention masks, shapes, and residuals

&#9888;&#65039; Common bugs and debugging strategies (like handling shape mismatches and masking errors)

&#9989; Real-world implementation tips and tricks that demystify the architecture

By the end of the talk, attendees will:

- Understand the full forward pass of a transformer

- Know how each component connects to the original paper

- Feel more confident reading or writing custom model architectures

The tone will be light-hearted and educational &#8212; ideal for those who are mathematically curious but don&#8217;t want to get bogged down in heavy theory. No prior experience building models from scratch required &#8212; just a working knowledge of Python and PyTorch.

**Prior Knowledge Expected**

- Basic Python and PyTorch

- Some familiarity with neural networks (e.g., feedforward, softmax)

- No need for prior experience in building models from scratch</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/QWXTAN/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Jen Wei</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>8NYGXU@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-8NYGXU</pentabarf:event-slug>
            <pentabarf:title>Scaling Fuzzy Product Matching with BM25: A Comparative Study of Python and Database Solutions</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T123000</dtstart>
            <dtend>20251209T130000</dtend>
            <duration>003000</duration>
            <summary>Scaling Fuzzy Product Matching with BM25: A Comparative Study of Python and Database Solutions</summary>
            <description>**The problem at hand:**
Are you constantly battling messy, inconsistent product names across massive datasets? Traditional exact matching just doesn&apos;t cut it when you&apos;re trying to integrate data from various sources (like a 1-million-row internal catalog with a 3.8-million-row external one like Open Food Facts). This talk addresses that exact problem: how to efficiently and accurately find fuzzy matches, saving you countless hours of manual reconciliation and enabling robust data enrichment. It&apos;s crucial for anyone working with real-world, imperfect data at scale.

**Is this talk for me?**
This talk is for data engineers, data scientists, and analytics professionals who work with large-scale datasets and face challenges with data integration, record linkage, or building robust search functionalities. A basic understanding of dataframes and SQL will be helpful, but no deep prior knowledge of search algorithms is required.

This will be an informative and practical talk with a clear focus on real-world application. While we&apos;ll briefly cover the &quot;why&quot; behind BM25, the emphasis will be on &quot;how&quot; to implement and optimize it. We&apos;ll present concrete benchmarks and code examples, moving beyond theoretical concepts.

**What will I learn?**
By the end of this session, you will:
- Understand why BM25 is a superior choice for fuzzy matching noisy product names compared to traditional methods.
- See a practical, head-to-head comparison of implementing BM25 using Python libraries (specifically the optimized Cython bm25s) and DuckDB&apos;s native full-text search.
- Gain insights into performance implications (speed and memory usage) for each approach on large datasets, including the benefits of GPU acceleration with Dask cuDF.
- Learn production tips for persisting indexes, handling bulk queries, and managing memory effectively.
- Be equipped to choose the most suitable BM25 implementation for your specific data enrichment and fuzzy matching needs, allowing you to build faster and more accurate data pipelines.
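
To make the scoring idea concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not the bm25s or DuckDB implementation compared in the talk: the function name `bm25_scores`, the sample catalog, the whitespace tokenization, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


catalog = [
    "organic whole milk 1l",
    "whole wheat bread 500g",
    "dark chocolate bar 70% cocoa",
]
scores = bm25_scores("milk whole organic 1 l", catalog)
best = max(range(len(catalog)), key=lambda i: scores[i])
print(catalog[best])
```

Word order does not matter and rarer terms weigh more, which is why the reordered query still ranks the milk entry first. The word-level tokens used here do not match "1 l" against "1l"; handling such variants needs a different tokenization.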

**Any pre-requisite knowledge I should have?**
- An intermediate-level background in Python
- Introductory-level knowledge of DuckDB
- Introductory-level knowledge of how BM25 works would be a bonus!</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/8NYGXU/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Aniket Abhay Kulkarni</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>VHX7E7@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-VHX7E7</pentabarf:event-slug>
            <pentabarf:title>Lessons learnt in optimizing a large-scale pandas application using Polars, FireDucks and cuDF: Go Smart and Save More!</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T133000</dtstart>
            <dtend>20251209T140000</dtend>
            <duration>003000</duration>
            <summary>Lessons learnt in optimizing a large-scale pandas application using Polars, FireDucks and cuDF: Go Smart and Save More!</summary>
            <description>It is well known that pandas can be slow for large-scale data analysis, but knowing how to write an effective pandas application can save you a lot. For a data scientist who specializes in finding key insights in data, it can be difficult to program with runtime memory consumption, effective data-flow optimization, and similar concerns in mind. High-performance pandas alternatives such as Polars, FireDucks, and cuDF are designed to address these issues and can save substantial operational cost (e.g., cloud cost and human cost). We will talk about the key lessons we learnt in optimizing a large-scale pandas application and the decision points in selecting a high-performance pandas alternative. The talk should be useful for data professionals who love the flexible pandas APIs and want to improve the performance of their applications without much effort when dealing with voluminous and complex data on a regular basis.

The key takeaways are as follows:
  1. How the choice and execution order of API calls in a data-related application impacts its performance.
  2. How to stop thinking in loops and design algorithms using DataFrame APIs.
  3. How the internal query optimizers in libraries like Polars and FireDucks bring SQL-like optimizations to the Python level.
  4. Whether to pay a large migration cost for optimizing an existing pandas-based application or to go smart with some minor modifications and save more operational cost.

Here is the [presentation deck](https://github.com/qsourav/PyData-Global-2025/blob/main/docs/PyDataGlobal_20251209.pdf) used during the talk.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/VHX7E7/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Sourav Saha</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>9CMRXJ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-9CMRXJ</pentabarf:event-slug>
            <pentabarf:title>Communicating Data Quality: Making the Invisible Visible (and Fun!) with Pointblank</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T193000</dtstart>
            <dtend>20251209T200000</dtend>
            <duration>003000</duration>
            <summary>Communicating Data Quality: Making the Invisible Visible (and Fun!) with Pointblank</summary>
            <description>The overall goal of this talk is to get people excited about data quality (DQ) and show how the Pointblank library makes DQ validation and communication easier, clearer, and more collaborative. I&#8217;ll demonstrate some practical workflows that will hopefully inspire attendees to treat DQ as a shared (yet approachable) responsibility.

Here&#8217;s an outline for this talk:

1. The Data Quality Communication Problem
- why DQ is hard: technical, social, and organizational barriers
- the &#8220;last mile&#8221; problem: not just finding issues, but making them clear and actionable
- the validation plan, execution, and report lifecycle 

2. Introducing Pointblank
- overview of the package and its philosophy: affordances for humans, not just machines
- key features: validation, profiling, reporting, and workflow support

3. Making Data Quality Actionable
- live demo: Python API for data profiling, validation, and missing value reports
- nice-looking outputs: tabular report, step-by-step summaries, and crystal-clear DQ messaging
- how these outputs can help people get to the root of DQ problems faster

4. Flexible Workflows
- using LLMs to draft a validation plan
- creating a validation plan from YAML
- integrating with CI/CD and data pipelines

5. Designing this Library for Collaboration and Fun
- small design choices can make a big difference: easy-to-understand summaries, actionable extracts, and a user-friendly CLI
- my personal goal: make DQ work less annoying and more rewarding

The intended audience is data engineers, scientists, analysts, and anyone responsible for data quality. This talk might also interest team leads and managers looking to improve DQ culture in their organization. As for skill level, this talk is suitable for Python users at any level.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/9CMRXJ/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Richard Iannone</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>HKWFL8@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-HKWFL8</pentabarf:event-slug>
            <pentabarf:title>Fast, Cost-Efficient Analytics on Blockchain data using DuckDB - Solana as a case study</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T123000</dtstart>
            <dtend>20251209T140000</dtend>
            <duration>013000</duration>
            <summary>Fast, Cost-Efficient Analytics on Blockchain data using DuckDB - Solana as a case study</summary>
            <description>This talk explores how to build a workflow for Solana blockchain data using BigQuery and DuckDB. You&apos;ll learn how to query Solana&#8217;s public datasets in BigQuery, export key data as Parquet files, and use DuckDB for high-speed analytics. The session is ideal for blockchain developers, data engineers, and analysts working with large on-chain datasets.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/HKWFL8/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Busirah Olaitan Hammed</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>CDEZQQ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-CDEZQQ</pentabarf:event-slug>
            <pentabarf:title>Designing a Fast, Offline-Capable Reverse Geocoder in Python: An Open Source Alternative to Big Geo APIs</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T150000</dtstart>
            <dtend>20251209T153000</dtend>
            <duration>003000</duration>
            <summary>Designing a Fast, Offline-Capable Reverse Geocoder in Python: An Open Source Alternative to Big Geo APIs</summary>
            <description>Reverse geocoding &#8212; converting coordinates into readable place names &#8212; is a core building block of applications in logistics, mapping, mobility, and location intelligence. Yet developers are often locked into commercial APIs that are expensive, rate-limited, and unsuitable for offline or privacy-first use cases.

In this talk, we&#8217;ll walk through the architecture and implementation of a fast reverse geocoding engine built entirely in Python using open-source tooling. You&#8217;ll see how spatial data (such as OpenStreetMap shapefiles) can be indexed efficiently using `scipy`&apos;s `cKDTree`, queried with millisecond latency, and integrated into real-world systems.
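
To show the core lookup in isolation, here is a minimal sketch using only the standard library. It swaps the `cKDTree` index for a brute-force scan with the haversine distance, so it illustrates the nearest-place idea, not the indexed engine from the talk; the `PLACES` list and the function names are hypothetical.

```python
import math

# Hypothetical gazetteer: (name, latitude, longitude) in degrees.
PLACES = [
    ("Paris", 48.8566, 2.3522),
    ("Berlin", 52.5200, 13.4050),
    ("Madrid", 40.4168, -3.7038),
]

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def reverse_geocode(lat, lon, places=PLACES):
    """Return the name of the nearest known place (linear scan over all places)."""
    return min(places, key=lambda p: haversine_km(lat, lon, p[1], p[2]))[0]


print(reverse_geocode(48.80, 2.13))  # a point a little west of Paris
```

A linear scan costs one distance computation per place for every query, which is the cost a spatial index such as a k-d tree is meant to avoid on large datasets.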

We&#8217;ll explore performance trade-offs, data preprocessing techniques, and methods for dealing with ambiguous or noisy GPS data. The session includes benchmarks and a live walkthrough of the code powering the reverse geocoder &#8212; which is lightweight enough to run on a laptop or edge device.

Attendees will leave with a clear understanding of how to build and adapt this system for their own needs &#8212; and gain insight into how geospatial systems work behind the scenes.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/CDEZQQ/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Sooraj Sivadasan</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>3BLRCH@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-3BLRCH</pentabarf:event-slug>
            <pentabarf:title>Enhancing Apache NiFi 2.x with Python Processors</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T153000</dtstart>
            <dtend>20251209T160000</dtend>
            <duration>003000</duration>
            <summary>Enhancing Apache NiFi 2.x with Python Processors</summary>
            <description>In this talk, I will delve into the world of Apache NiFi 2.0 Python processors, exploring the capabilities they offer and demonstrating how to build custom processors to enhance your data processing pipelines.

By the end of this talk, participants will have a comprehensive understanding of building and optimizing Apache NiFi 2.0 Python processors, enabling them to integrate Python seamlessly into their data processing workflows.

This session is suitable for data engineers, architects, and anyone interested in harnessing the combined power of Apache NiFi and Python for efficient data integration and flow management. One of the main uses is building prompts and calling open LLMs and other AI models. NiFi excels at integration, so I will cover some interesting sources, sinks, and enrichments and show when Python is helpful.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/3BLRCH/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Timothy Spann</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>AAGRYV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-AAGRYV</pentabarf:event-slug>
            <pentabarf:title>Combining Zarr, HDF5, and TIFF into a single data format</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T163000</dtstart>
            <dtend>20251209T170000</dtend>
            <duration>003000</duration>
            <summary>Combining Zarr, HDF5, and TIFF into a single data format</summary>
            <description>Choosing a standard format for high-dimensional (N &gt;= 2) array data is challenging because one must weigh trade-offs between compatible software packages, cloud optimization, and complexity, yet the need for such data has increased with recent advances in machine learning and volumetric imaging in the earth and biological sciences. The 927th installment of the XKCD comic series illustrates how standards proliferate: the existence of many prior, imperfect standards portends the creation of yet another standard to supplant the ones that came before, often without considering similarities or compatibility with prior standard formats. For n-dimensional data, TIFF, HDF5, and Zarr are now common formats in use across various fields and scientific domains. While TIFF and HDF5 were designed decades ago with flexible metadata structures, cloud optimization of these formats has helped to consolidate their metadata and narrow the differences with the cloud-native file format Zarr. While Zarr has traditionally used individual keys for each compressed chunk, version 3 of the format introduces a sharding codec allowing multiple chunks to exist in the same file under a single key. The consolidation of chunks is reminiscent of tiles in TIFF files or chunked datasets in HDF5. Essentially, each of these file formats has the capability to describe the locations and sizes of the individual blocks of data contained within. By taking advantage of metadata consolidation to achieve modularity, we can tailor and combine these formats to point to the same data blocks, avoiding duplication. The result is a hybrid file format that is simultaneously a TIFF, HDF5, and Zarr v3 shard. Readers of any of these formats can be used to read the same data blocks contained within this format.

To illustrate the concept of a combined Zarr, HDF5, and TIFF format, I have created an example Jupyter notebook demonstrating a small Python library that can write data in this hybrid format. I then show how data can be read using libtiff, h5py, or tensorstore, manipulated by h5py, and then have the changes read using the same libraries.
https://github.com/mkitti/simple_image_formats/blob/main/header_formats.ipynb</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/AAGRYV/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Mark Kittisopikul, Ph.D.</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>9HUY9G@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-9HUY9G</pentabarf:event-slug>
            <pentabarf:title>GPU Python for the Real World: Practical Steps to GPU-Accelerated Python with RAPIDS</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T193000</dtstart>
            <dtend>20251209T210000</dtend>
            <duration>013000</duration>
            <summary>GPU Python for the Real World: Practical Steps to GPU-Accelerated Python with RAPIDS</summary>
            <description>In this tutorial we will cover:
- An introduction to cuDF, cuML, and more, showcasing a simple example of data processing and model training on GPUs.
- Answers to questions like: &#8220;Where do I get a GPU?&#8221;, &#8220;How do I run a container on a VM with a GPU?&#8221;, &#8220;How do I install GPU packages into an existing environment?&#8221;, as well as follow-along examples to get a GPU up and running.
- Troubleshooting and monitoring: examples of performance analysis, diagnostics, and debugging.

This is a hands-on tutorial, with multiple examples to help you get familiar with the RAPIDS ecosystem. Participants should ideally have some experience using Python, pandas, and scikit-learn. We&apos;ll use cloud-based VMs, so familiarity with the cloud and resource creation is helpful but not required. No prior GPU knowledge is needed.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/9HUY9G/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Jacob Tomlinson</attendee>
            
            <attendee>Naty Clementi</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>8CFCDH@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-8CFCDH</pentabarf:event-slug>
            <pentabarf:title>The Lifecycle of a Jupyter Environment: From Exploration to Production-Grade Pipelines</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T150000</dtstart>
            <dtend>20251209T154000</dtend>
            <duration>004000</duration>
            <summary>The Lifecycle of a Jupyter Environment: From Exploration to Production-Grade Pipelines</summary>
            <description>- (3 mins) Intro 
    - I&apos;ve been supporting various groups in their developer experience since 2020, after working as a freelance Python consultant. I&apos;ve worked on many dozens of projects, unblocking users and helping them pick the right tools for the task at hand.
    - It works on my machine 
    - What we&apos;re building today: ML pipeline with RAPIDS -&gt; Snowflake
    - We&apos;re going to watch a real project grow up
- (3 mins) Exploration - starting as a single messy notebook, sample data set. 
    - Why RAPIDS? GPU
        - Large data sets
        - GPU availability - remote machine, local GPU
        - workflows that work well with GPU 
    - Load Data cuDF / pandas
    - Quick EDA and data visualization
    - Train cuML / scikit-learn model 
    - no-code change philosophy
- (7 mins) Make it repeatable - Start with simple tried-and-true tools, explore where tools like Papermill help with flexibility and reproducibility
    - common pain points: operating cadence, specialized scenarios, manual execution is error-prone
    - shell scripts versus papermill 
    - reproducible environments
    - generate HTML reports
    - pass through parameters in your notebook
- (8 mins) Make it reliable - Modular code &amp; testing
    - common pain points: data schema changes, debugging issues, testing &amp; modularity
    - nbconvert + Python: turn your notebook into a script
    - turn a function into a module
    - dashboard with HoloViz / Panel, discuss choosing tools like Voila and PyScript
- (5 mins) Snowflake integration
    - common pain points: data volume, coordination with other data systems, audits
    - picking the right tools: cost complexity tradeoff
    - RAPIDS preprocessing to Snowflake storage
    - self-service access for stakeholders
- (3 mins) Conclusion 
    - Start simple
    - Add complexity when you feel specific pain</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/8CFCDH/</url>
            <location>Live from PyData Boston</location>
            
            <attendee>Dawn Wages</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>UHN9UX@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-UHN9UX</pentabarf:event-slug>
            <pentabarf:title>Using Traditional AI and LLMs to Automate Complex and Critical Documents in Healthcare</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T161500</dtstart>
            <dtend>20251209T165500</dtend>
            <duration>004000</duration>
            <summary>Using Traditional AI and LLMs to Automate Complex and Critical Documents in Healthcare</summary>
            <description>Informed Consent Forms are highly complex documents that require high precision and quality. A phase 2 / 3 clinical trial can have almost 1000 different forms that takes considerable time to complete.We identified this challenge that directly impacts trial timelines and patient engagement. The automated AI solution: the &#8220;ICF Autodrafter&#8221;, a custom LLM-powered application that automates the drafting of ICFs. This tool ingests a clinical trial protocol and ICF template and outputs a complete draft in minutes, cutting document preparation time by 90%. 

This solution is not generic automation. The backend logic parses highly structured protocol documents, segments them, and feeds the relevant content into a carefully fine-tuned LLM that maps text to specific ICF fields. The front-end is designed for usability by clinical trial managers, with human-in-the-loop reviews. This system has already supported ICF creation for more than ten trials and has achieved near-perfect consistency (97%) with human-generated content, underscoring the speed, quality, and robustness of the solution. 

We rigorously tested each version with A/B comparisons, iterated on feedback from end users, and anchored all development within regulatory and ethical guardrails. The impact extends beyond efficiency. By standardizing and accelerating ICF production, we can reduce delays in trial start-up and potentially get medicines to patients faster, without compromising safety, compliance, or clarity. It also lays down a scalable model for future AI-driven document workflows across other parts of life sciences and healthcare.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/UHN9UX/</url>
            <location>Live from PyData Boston</location>
            
            <attendee>Aman Bhandari</attendee>
            
            <attendee>Lily Xu</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>V7GSU7@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-V7GSU7</pentabarf:event-slug>
            <pentabarf:title>Where Have All the Metrics Gone?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T170000</dtstart>
            <dtend>20251209T174000</dtend>
            <duration>004000</duration>
            <summary>Where Have All the Metrics Gone?</summary>
            <description>In the good old supervised learning days, standard measures like accuracy, F1, and MSE were like blazes on the data science trail, showing us how to descend the gradient towards &quot;better&quot;. But now we&apos;re in uncharted analytics territory, where our work increasingly involves unlabeled data and generative AI outputs, and metrics are either unavailable or undefined.

The key to every successful trek is preparation. We have to move from thinking about &#8220;metrics as defaults&#8221; to &#8220;metrics as design choices.&#8221; We also need to be ready to design those metrics before we even start testing, because when we devise metrics post-training, we risk HARKing (Hypothesizing After Results are Known) and losing our scientific footing.

This talk will provide a field guide for translating different kinds of modern research questions into clearly-defined metrics, including:
* Metrics of the past and why they aren&apos;t as useful now (~5 min)
* Common failure modes when attempting to evaluate generative AI outputs and other unlabeled data (~8 min)
* Techniques for identifying proxies when labels are missing (~8 min)
* Defining criteria for open-ended outputs (~8 min)
* Open source Python libraries (including new tools like [outlines](https://github.com/dottxt-ai/outlines) and [dspy](https://github.com/stanfordnlp/dspy) as well as old favorites like [hypothesis](https://hypothesis.readthedocs.io/en/latest/) and [pytest](https://docs.pytest.org/en/stable/)) to equip you for your next data science adventure (~8 min)

Come learn how to define and adapt new metrics so that you&apos;ll be prepared for wherever your modeling journey takes you.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/V7GSU7/</url>
            <location>Live from PyData Boston</location>
            
            <attendee>Dr. Rebecca Bilbro</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>YHTMZY@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-YHTMZY</pentabarf:event-slug>
            <pentabarf:title>The SAT math gap: gender difference or selection bias?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T193000</dtstart>
            <dtend>20251209T201000</dtend>
            <duration>004000</duration>
            <summary>The SAT math gap: gender difference or selection bias?</summary>
            <description>Overview

This talk uses the SAT math gap as a case study to demonstrate modern Bayesian modeling in practice. For decades, male test takers have outperformed female test takers on the SAT math section by about 30 points. This outcome could reflect an actual difference in ability, or it could be explained by selection bias, if boys with weaker math skills are less likely to take the SAT than girls with comparable skills.
I present a generative Bayesian model that explicitly incorporates this selection mechanism and estimates the fraction of the observed gap attributable to bias. The talk emphasizes workflow over theory: how to build, validate, and interpret Bayesian models using PyMC, ArviZ, and PreliZ.

Audience

The target audience includes data scientists, applied researchers, and engineers who:
* Use Python for data analysis,
* Have basic familiarity with probability distributions,
* Are curious about Bayesian modeling but do not necessarily have prior experience with PyMC or Bayesian statistics.

Learning goals

Attendees will learn:

* How to frame a substantive question as a Bayesian generative model,

* How to use PreliZ for prior elicitation, PyMC for model building, and ArviZ for diagnostics and posterior predictive checks,

* How to interpret results in terms of latent traits vs. observed outcomes,

* How Bayesian models can provide a principled way to reason about confounding and bias.


Outline (approx. 30&#8211;40 minutes)

Introduction &amp; background (5 min)
 &#8211; The SAT math gap and the debate over its causes
 &#8211; Why Bayesian inference is a good fit for this problem

Model construction (10 min)
 &#8211; Latent efficacy distribution
 &#8211; Selection mechanism (logistic link)
 &#8211; Noise modeling for score perturbations

Workflow demonstration (15 min)
 &#8211; Prior elicitation with PreliZ
 &#8211; Sampling and diagnostics with PyMC and ArviZ
 &#8211; Posterior predictive checks

Results &amp; interpretation (5&#8211;7 min)
 &#8211; Estimated contribution of selection bias to the observed gap
 &#8211; Broader implications for educational testing and applied modeling

Takeaways (3&#8211;5 min)
 &#8211; Lessons about Bayesian workflow
 &#8211; Relevance to real-world problems of bias and confounding


Materials

All code and data preprocessing will be available in a public GitHub repository so attendees can reproduce the analysis and adapt it to their own work.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/YHTMZY/</url>
            <location>Live from PyData Boston</location>
            
            <attendee>Allen Downey</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>TZSWMW@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-TZSWMW</pentabarf:event-slug>
            <pentabarf:title>The Boringly Simple Loop Powering GenAI Apps</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251209T204500</dtstart>
            <dtend>20251209T212500</dtend>
            <duration>004000</duration>
            <summary>The Boringly Simple Loop Powering GenAI Apps</summary>
            <description>### Central Thesis
We are at a point where talking about GenAI apps has become more complex than building them. Social media is obsessed with the &quot;top 10 libraries for GenAI&quot;, search engines are swamped with shallow tutorials, and many devs I meet are rightfully confused about which frameworks they should spend time on.

The answer is &quot;none, GenAI isn&apos;t all that complicated&quot;. However, that answer isn&apos;t sexy because it doesn&apos;t grab attention, doesn&apos;t sell consulting hours, and doesn&apos;t convince someone to buy an online course. Hence few people give it. That has to change!

That&apos;s what this talk is about: The boringly simple basics of building GenAI apps and how you can use a simple nested while loop to build assistants, AI agents, or multi-agent systems. Sometimes less is more.

### Takeaways
- Create prototypes of agentic apps from scratch using fundamental building blocks
- Choose the right components (like RAG or MCP) for your specific problem
- Debug agentic apps by spotting misconfigured context

### Target Audience
This talk is for software engineers and data professionals who want to get hands-on with GenAI. Medium and Substack taught you concepts like RAG and AI Agents, social media hyped you up, and now it&#8217;s time to build. The only problem: Where do you start? How do you turn &quot;let&apos;s build something that does XYZ&quot; into a concrete software product? If you feel like you are sitting with a pile of Lego pieces while everyone else is playing with a completed spaceship, this talk is for you. It&apos;s for builders who are ready to go from reading to coding.

### Prerequisites
You should have working knowledge of Python and familiarity with LLM terminology (tokens, context window, system prompt, ...). If you&apos;re comfortable reading source code, you have everything you need. No prior experience with frameworks like LangChain, LlamaIndex, or others is necessary.

### Outline
**Introduction** (2 min)

**The core loop** (15 min)
- Introduction to the fundamental pattern that orchestrates GenAI apps (the &quot;core loop&quot;)
- Definition of the terms &quot;Turns&quot; and &quot;Traces&quot; that are foundational to building and optimizing flows
- Showcase of how to create assistants, workflows, AI agents, and multi-agent systems using this pattern

**Context Engineering** (15 min)
- Introduction to the three parts of context engineering: Plans, Knowledge, and Tools.
- Discussion on how these parts relate to the core loop and where to define them
- Showcase of how RAG, MCP, memory, etc. assist in setting up the system context

**Q&amp;A** (5 min)
**Buffer** (3 min)

### Bio
I&apos;m an engineer and open-source maintainer with a PhD in Computer Science and over a decade of hands-on experience building with AI/ML. Having scaled ImageIO, a foundational Python library, from 2 to 35 million monthly downloads, I know what it takes to build robust, scalable software. I co-founded PyData Stockholm and am deeply integrated into our data community. My current focus is to bring first principles thinking to the GenAI landscape and help developers build more robust systems.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/TZSWMW/</url>
            <location>Live from PyData Boston</location>
            
            <attendee>Sebastian Wallk&#246;tter</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>B3QRQA@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-B3QRQA</pentabarf:event-slug>
            <pentabarf:title>PyData/Sparse &amp; Finch: extending sparse computing in the Python ecosystem</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T120000</dtstart>
            <dtend>20251210T123000</dtend>
            <duration>003000</duration>
            <summary>PyData/Sparse &amp; Finch: extending sparse computing in the Python ecosystem</summary>
            <description>In this talk we&apos;re going to understand the current landscape of sparse computing in the Python ecosystem first. Then a high-level overview of the Finch technology and compiler&apos;s architecture will be presented together with other solutions vital for the project: Array API Standard and binsparse format.

Next, we&apos;re going to present a selected set of benchmarks - also focusing on real world use-cases: how Finch impacts users&apos; experience when writing sparse programs in Python. Last but not least a showcase of the current development will be shown - pure Python rewrite of Finch compiler.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/B3QRQA/</url>
            <location>General Track</location>
            
            <attendee>Mateusz Sok&#243;&#322;</attendee>
            
            <attendee>Willow Marie Ahrens</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>NMYJM8@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-NMYJM8</pentabarf:event-slug>
            <pentabarf:title>EffVer: Versioning code by the effort required to upgrade</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T130000</dtstart>
            <dtend>20251210T133000</dtend>
            <duration>003000</duration>
            <summary>EffVer: Versioning code by the effort required to upgrade</summary>
            <description>Intended Effort Versioning (EffVer), the version scheme where you just tell your users what order of magnitude to expect the upgrade effort to be.

Version numbers are hard to get right. Semantic Versioning (SemVer) communicates backward compatibility via version numbers which often lead to a false sense of security and broken promises. Calendar Versioning (CalVer) sits at the other extreme of communicating almost no useful information at all.

Many Python projects follow a looser scheme called EffVer where instead of making promises around backward compatibility they communicate the likelihood and magnitude of work required to adopt a new version.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/NMYJM8/</url>
            <location>General Track</location>
            
            <attendee>Jacob Tomlinson</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>UXHBEZ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-UXHBEZ</pentabarf:event-slug>
            <pentabarf:title>Hands-on with Blosc2: Accelerating Your Python Data Workflows</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T133000</dtstart>
            <dtend>20251210T150000</dtend>
            <duration>013000</duration>
            <summary>Hands-on with Blosc2: Accelerating Your Python Data Workflows</summary>
            <description>## Audience &amp; Prerequisites

This tutorial is for data scientists, engineers, and researchers who work with large numerical datasets in Python.

Prerequisites: Attendees should have intermediate Python programming skills and be comfortable with the basics of NumPy arrays. No prior experience with Blosc2 is necessary.

Setup: Participants will need a laptop and can follow along using a provided cloud-based environment (e.g., Binder) or a local installation of Python, Jupyter, and the python-blosc2 library.

## Learning Objectives

By the end of this tutorial, attendees will be able to:

* Understand the core concepts behind the Blosc2 meta-compressor.
* Compress and decompress NumPy arrays, tuning parameters for optimal performance.
* Create, manipulate, and slice Blosc2 NDArray objects for out-of-core processing.
* Perform efficient mathematical computations directly on compressed data.
* Store and retrieve compressed datasets using different storage backends.
* Integrate Blosc2 into their existing data analysis workflows to mitigate I/O bottlenecks.

## Outline (90 minutes)

### Introduction &amp; Setup (10 mins)

  * The I/O Bottleneck Problem.
  * Core Concepts: What are meta-compressors, chunks, and blocks?
  * Tutorial environment setup (Jupyter notebooks).

### Part 1: Compression Fundamentals (20 mins)

  * Hands-on: Using blosc2.compress() and blosc2.decompress().
  * Exploring codecs (lz4, zstd), compression levels, and filters (shuffle, bitshuffle).
  * Exercise: Compressing a sample dataset and analyzing the trade-offs between speed and ratio. 

### Part 2: The NDArray - Computing on Compressed Data (35 mins)

  * Hands-on: Creating NDArray objects from scratch and from NumPy arrays.
  * Storing arrays on-disk vs. in-memory.
  * Exercise: Slicing and accessing data from an on-disk NDArray.
  * Performing mathematical operations (arr * 2 + 1) and reductions (arr.sum()) on compressed data.
  * Exercise: Analyzing a dataset larger than RAM.

### Part 3: Advanced Features &amp; Integration (20 mins)

  * Hands-on: Using two-level partitioning (meta-chunks) for faster slicing.
  * Brief overview of Caterva2 for sharing compressed data via an API.
  * Recap and Q&amp;A.

Repository: Tutorial materials including notebooks and datasets will be available at a public GitHub repository (link to be provided upon acceptance).</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/UXHBEZ/</url>
            <location>General Track</location>
            
            <attendee>Francesc Alted</attendee>
            
            <attendee>Luke Shaw</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>NKQFBQ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-NKQFBQ</pentabarf:event-slug>
            <pentabarf:title>Keynote: David Aronchick - From Pandas to Policy-as-Code: The Future of ML Data Engineering</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T160000</dtstart>
            <dtend>20251210T163000</dtend>
            <duration>003000</duration>
            <summary>Keynote: David Aronchick - From Pandas to Policy-as-Code: The Future of ML Data Engineering</summary>
            <description>For over a decade, the Python ecosystem has given us a powerful arsenal to tame data. We started with the interactive magic of Pandas on a single machine, a revolutionary step that made complex analysis accessible. When our ambitions (and data) outgrew our laptops, we turned to Dask and Spark to scale our computations across clusters. More recently, projects like Apache Arrow began solving the critical problem of creating a standardized, efficient language for these distributed systems to speak.

Each step in this journey solved a painful bottleneck. Yet, in our success, we&apos;ve created a new one: the runaway cost and complexity of the &quot;ingest-it-all-first&quot; paradigm. Our cloud bills have become a tax on raw, unfiltered data, and our elegant downstream tools&#8212;from Airflow and dbt to our own ML models&#8212;are forced to waste expensive cycles sifting through noise just to find the signal.

This talk argues for the next logical step in our stack&apos;s evolution: an Upstream Data Control Plane. We&apos;ll explore a playbook for applying intelligent filtering, transformation, and governance before data ever hits your expensive lakehouse. Just as Dask parallelized our processing and Arrow standardized our memory, this approach optimizes our data in motion, ensuring that our powerful downstream systems operate only on the high-value signals we care about. Join us to learn a declarative, policy-as-code framework that makes your entire data stack cheaper, faster, and more resilient.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/NKQFBQ/</url>
            <location>General Track</location>
            
            <attendee>David Aronchick</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>BSY9GA@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-BSY9GA</pentabarf:event-slug>
            <pentabarf:title>Python Polars: The Definitive Crash Course</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T163000</dtstart>
            <dtend>20251210T180000</dtend>
            <duration>013000</duration>
            <summary>Python Polars: The Definitive Crash Course</summary>
            <description>Based on the book Python Polars: The Definitive Guide, we&#8217;ll teach the essentials of Polars to read, transform, and visualize data. While a hallmark of Polars is its speed, we&#8217;ll emphasize the benefits of its expression system for writing flexible, maintainable code.

This hands-on workshop will cover:

* Reading data from CSV, spreadsheets, Parquet, and databases
* Common transformations such as selecting, filtering, sorting, and aggregating
* Complex data types, including text, time, and nested structures 
* Expressions, the building blocks of every query
* Visualizing data

By the end of this workshop, attendees will have gained a solid understanding of Polars and be equipped to start applying this lightning-fast DataFrame library to their own datasets. No prior knowledge of Polars is required.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/BSY9GA/</url>
            <location>General Track</location>
            
            <attendee>Jeroen Janssens</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>N7EAFM@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-N7EAFM</pentabarf:event-slug>
            <pentabarf:title>Time series analysis for coupled neurons.</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T180000</dtstart>
            <dtend>20251210T193000</dtend>
            <duration>013000</duration>
            <summary>Time series analysis for coupled neurons.</summary>
            <description>This is a tutorial on hands-on time series analysis of coupled neuron models. We will build mathematical models of coupled neurons, and then utilize tools from nonlinear dynamics to analyze simulated time series. We will discuss various empirically informed coupling strategies and statistically efficient time series measures. This workshop is 100% Jupyter notebook and will have room to openly brainstorm ideas to extend and improve the studies. Let&#8217;s unravel some complex dynamics together!

The pipeline of this tutorial will be the following:
(i) Start by building a coupled neuron system based on different coupling strategies,
(ii) Simulate the system and generate time series data,
(iii) Perform time series analysis by computing various metrics from the nonlinear dynamics literature,
(iv) Finally, discuss what these metrics tell us about the temporal behavior of neurons.

Coupling strategies we are going to look at:
(i) Gap junction coupling
(ii) Chemical coupling
(iii) A hybrid coupling influenced by a superconductor model in physics
(iv) Electromagnetic coupling
(v) Higher-order (non-pairwise) coupling (some background in graph theory is recommended)
(vi) A random coupling strategy

We will implement the following methodologies/algorithms for time series analysis of coupled neuron models:
(i) Hurst exponent: measuring persistence of time series,
(ii) Sample entropy: measuring the complexity of time series,
(iii) 0&#8211;1 test: measuring chaos,
(iv) Kuramoto order-parameter: measuring synchrony between the neurons.
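
As a rough illustration of measure (iv), and not taken from the tutorial materials: the Kuramoto order parameter is the magnitude of the average unit phasor, r = |(1/N) * sum over j of exp(i * theta_j)|, so r is close to 1 when the phases agree and close to 0 when they are spread evenly around the circle. The sketch below uses only the standard library. The function name is hypothetical, and it assumes each neuron&apos;s phase has already been extracted as an angle in radians.

```python
import cmath
import math


def kuramoto_order_parameter(phases):
    """Return r in [0, 1] for a list of oscillator phases (radians)."""
    if not phases:
        raise ValueError("phases must be non-empty")
    # Average the unit phasors exp(i * theta_j) and take the magnitude.
    mean_phasor = sum(cmath.exp(1j * theta) for theta in phases) / len(phases)
    return abs(mean_phasor)


# Identical phases are fully synchronized, so r is 1 (up to rounding).
in_sync = kuramoto_order_parameter([0.3, 0.3, 0.3, 0.3])

# Phases spread evenly around the circle cancel out, so r is 0 (up to rounding).
spread = kuramoto_order_parameter([2 * math.pi * k / 4 for k in range(4)])
```

In practice the phases would be evaluated at each time step of the simulated series, giving r as a function of time.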

This tutorial is 100% Python, and I will use Jupyter Notebooks to deliver the workshop. Packages that need to be installed beforehand are:
(i) `matplotlib` for plotting,
(ii) `numpy` and `scipy` for scientific computations,
(iii) `nolds` for nonlinear measures of dynamical systems,
(iv) `pandas` for data handling.

The audience should find this interesting because it is a hands-on introduction to how the mechanisms of neurons can be explored using different tools from the nonlinear dynamics literature. Mathematically modelling the dynamics of neurons has attracted several researchers in recent years because of the popularity of artificial intelligence. The field of neuron dynamics is booming, which makes this workshop timely. I will also leave room for brainstorming with the audience on how this study could be extended and improved, making this an interactive session.

The goal is to attract applied mathematicians, computer scientists, data scientists, engineers, and statisticians alike and provide them with a battery of tools to add to their knowledge base. The audience would then be able to apply these tools in domains other than neurodynamics, for example, climate, finance, or social science. The only technical background I would expect from the audience is familiarity with `matplotlib`, `numpy` and `pandas`, and some basic statistics (regression, correlation coefficient), linear algebra (matrix operations), and graphs (as in networks). After the tutorial, the audience will leave with a newly built insight into the mathematical modeling of neuron dynamics.

Here is the breakdown of the tutorial:

0&#8211;15 mins: Introduction to neurons as dynamical systems and why we care about their behavior over time. We will talk about a single neuron&apos;s behavior and the selection of a mathematical model. We will also talk about the bursting phenomenon in neurons.

15-30 mins: We will then mathematically model a coupled system of neurons. We will cover the topic of  &#8220;small networks&#8221; of neurons and what they teach us about the bigger picture: a complex, connected nervous system.

30-45 mins: Next, we will introduce various empirically informed coupling mechanisms. We will talk about how these couplings produce different firing patterns in the coupled neurons, ranging from regular behavior to chaotic firing.

45-75 mins: Finally, I will introduce time series analysis of neuron data. We will then implement the algorithms mentioned above to characterize different dynamical properties of the neurons.

75-90 mins: Open Q&amp;A and brainstorming on ideas to improve or extend the analysis of neuron time series data.

All materials for the tutorial can be accessed via this repository link: https://github.com/indrag49/PyData-Global-Tutorial-2025</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/N7EAFM/</url>
            <location>General Track</location>
            
            <attendee>Indranil Ghosh</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ETQTHC@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ETQTHC</pentabarf:event-slug>
            <pentabarf:title>Using MCP to turn Claude into a Football Opposition Analyst</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T113000</dtstart>
            <dtend>20251210T120000</dtend>
            <duration>003000</duration>
            <summary>Using MCP to turn Claude into a Football Opposition Analyst</summary>
            <description>Analysis in sports is changing. Advanced statistics like Wins Above Replacement (WAR) or Expected Goals (xG) are making their way into TV punditry and conversations in bars. But the people who need the information the most, ex-professionals and coaches without a background in statistics, often shun it.

Not because they don&apos;t see the value, but because the language is impenetrable, the underlying data is overwhelming, and the insights are difficult to translate.

Generative AI provides an opportunity to bridge the gap.

In this talk, I&apos;ll share how I used Model Context Protocol (MCP) to turn Anthropic&apos;s Claude Desktop into a football opposition analyst by providing access to team and player performance event data, and in turn lower the barriers so anyone can turn a sea of numbers into actions.

This talk will cover:

- How MCP enables AI to access and interpret domain-specific knowledge
- Real examples of AI-generated football insights in action</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ETQTHC/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Adam Cowley</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>EKX7LV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-EKX7LV</pentabarf:event-slug>
            <pentabarf:title>The Human Side: Leading and Mentoring Global Data Teams in the Age of AI</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T120000</dtstart>
            <dtend>20251210T123000</dtend>
            <duration>003000</duration>
            <summary>The Human Side: Leading and Mentoring Global Data Teams in the Age of AI</summary>
            <description>For engineering leaders, managers, and aspiring mentors. The session covers structures for remote work, upskilling, cross-cultural collaboration, promoting innovation, and embedding compliance and ethics in technical work, drawing on real executive experience.

If you would like hands-on tutorials for any of the 30-minute talks, or want the content tailored to a specific audience (engineering, product, executive), it can be customized for workshop, intermediate, or advanced levels.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/EKX7LV/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>amar naik</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>J7JK79@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-J7JK79</pentabarf:event-slug>
            <pentabarf:title>Realtime Financial Fraud Detection with Modern Python</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T123000</dtstart>
            <dtend>20251210T130000</dtend>
            <duration>003000</duration>
            <summary>Realtime Financial Fraud Detection with Modern Python</summary>
            <description>This talk distills a production&#8209;tested path for real&#8209;time financial fraud detection in Python, including choosing the right objective, validating in time, and shipping with guardrails.

Core idea:

Optimize the business decision (alerts under cost/latency constraints), not just the ML score.

Outline (30 minutes):

1. Problem framing: Adversaries, label delay, extreme imbalance, and why &#8220;accuracy&#8221; lies.

2. Metrics that matter: Precision and recall, AUC&#8209;PR vs ROC, cost&#8209;weighted utility, calibration for decisions.

3. Validation done right: Temporal splits, rolling/blocked CV with gap, prequential test&#8209;then&#8209;train, leak and drift traps.

4. Modeling under latency budgets: Where XGBoost shines, when to add tabular DL, injecting graph signals without blowing latency (simple handcrafted graph stats + GNNs).

5. From notebook to service: Small, testable core, FastAPI endpoint, thresholds and shadow mode, alert quotas, analyst feedback loops.

6. Operations &amp; monitoring: Drift indicators, calibration checks, label&#8209;delay dashboards, canaries/rollbacks.

7. Wrap&#8209;up/Q&amp;A: Failure modes and a 1&#8209;page runbook.
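
As a rough illustration of the rolling validation with a gap mentioned in item 3, the sketch below builds time-ordered splits in plain Python. The function name and parameters are hypothetical and are not taken from the talk materials; it assumes rows are already sorted by time.

```python
def rolling_splits_with_gap(n_samples, n_splits, test_size, gap):
    """Return (train_indices, test_indices) pairs for time-ordered rows.

    Training always stops `gap` rows before the test block starts, so
    labels that arrive late cannot leak into the training window.
    """
    splits = []
    for k in range(n_splits):
        # Test blocks are carved from the end of the series, oldest first.
        test_start = n_samples - (n_splits - k) * test_size
        train = list(range(0, max(test_start - gap, 0)))
        test = list(range(max(test_start, 0), test_start + test_size))
        # Skip folds that have no history or an incomplete test block.
        if train and len(test) == test_size:
            splits.append((train, test))
    return splits


splits = rolling_splits_with_gap(n_samples=12, n_splits=3, test_size=2, gap=1)
```

Each training window grows over time while the test block moves forward, which mimics how a fraud model is retrained and then scored on later transactions. scikit-learn's `TimeSeriesSplit` exposes a `gap` parameter for the same purpose.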

Attendee outcomes:

- A copy&#8209;and&#8209;adapt roadmap for deploying financial fraud detection services with Python.

- A latency&#8209;aware model selection heuristic.

- A minimal deployment pattern (service, thresholds, monitoring) that scales from pilot to production.

Prior knowledge expected:

- Basic Python and DataFrames, ML classification basics, HTTP/JSON.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/J7JK79/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>C&#233;sar Soto Valero</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>W9RJKW@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-W9RJKW</pentabarf:event-slug>
            <pentabarf:title>How to Effectively use text embeddings in tree based models</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T130000</dtstart>
            <dtend>20251210T133000</dtend>
            <duration>003000</duration>
            <summary>How to Effectively use text embeddings in tree based models</summary>
            <description>The presentation is aimed at Data Science and Machine Learning practitioners who are already familiar with tree-based models and want to learn how to effectively incorporate text embedding features to boost the performance of their models.

The methodology showcased in the presentation is available in the sklearo open source package.

The structure of the talk will be as follows:

- **5 minutes** Overview of text embeddings, how tree-based models are built, and the challenges they face with text embeddings compared to linear models.
- **5 minutes** Explanation of how we can leverage non-tree-based models to transform text embeddings into a format that tree-based models can effectively use.
- **5 minutes** Explanation of *cross-fitting*, a technique used to avoid target leakage when generating features using the target variable.
- **5 minutes** Code examples of how this technique can be used in practice using the `sklearo` open source library.
- **5 minutes** Performance comparison of tree-based models using text embeddings as-is vs using the transformed features.
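
To make the cross-fitting idea concrete, here is a minimal pure-Python sketch. It assumes a toy one-dimensional embedding and uses a least-squares line as a stand-in for the non-tree model; the function names are hypothetical and this is not the `sklearo` API.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return mean_y - slope * mean_x, slope


def cross_fit_feature(xs, ys, n_folds):
    """Out-of-fold predictions: row i is scored by a model that never saw ys[i]."""
    feature = [None] * len(xs)
    for fold in range(n_folds):
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            feature[i] = intercept + slope * xs[i]
    return feature


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0, 50.0]  # the last target is an outlier
oof = cross_fit_feature(xs, ys, n_folds=3)
```

The generated feature for the last row comes from a line fitted without its own outlying target, so the feature cannot simply memorize that target. A tree-based model would then be trained on `oof` instead of on the raw embedding.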

Prior knowledge about fundamental machine learning concepts such as overfitting, cross-validation, and feature engineering is recommended but not required.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/W9RJKW/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Claudio Salvatore Arcidiacono</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>BT7M3S@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-BT7M3S</pentabarf:event-slug>
            <pentabarf:title>Optimal Variable Binning in Logistic Regression</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T133000</dtstart>
            <dtend>20251210T140000</dtend>
            <duration>003000</duration>
            <summary>Optimal Variable Binning in Logistic Regression</summary>
            <description>Despite the rise of complex &#8220;black-box&#8221; models, regulated environments still demand transparency. Properly binned variables not only improve model fit but also yield coefficients that the business and auditors can interpret. However, determining cut-points that preserve true signal while avoiding data-snooping bias is non-trivial.

By the end of this session, attendees will be able to:

- Understand the basic idea behind binning (the what).
- Know in which contexts variable binning makes sense (the when and why).
- Choose among popular optimal-binning techniques (e.g., ChiMerge, MDLP, decision-tree-based) based on data size, feature type, and operational constraints (the how).

Who Should Attend?

Data scientists and risk analysts who use logistic regression in regulated settings and need a reproducible, explainable feature-engineering pipeline.

Detailed 30-Minute Agenda

| Time | Topic |
| --- | --- |
| 0&#8211;3 min | Context &amp; Why Binning Matters for Explainability |
| 3&#8211;8 min | Pitfalls of Na&#239;ve Binning (real-life examples) |
| 8&#8211;18 min | Binning as an Optimization Problem: Algorithms &amp; Decision Criteria |
| 18&#8211;26 min | Hands-On Python Demo: From Data to Defensible Bins |
| 26&#8211;30 min | Q&amp;A, Resources &amp; Next Steps |

Prerequisites &amp; Materials

- Prerequisites: Basic Python (pandas, scikit-learn) and logistic-regression familiarity
- Materials: GitHub repo with notebook and data samples, to be shared during the talk

You&#8217;ll leave equipped to choose the right optimal&#8208;binning algorithm for your data.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/BT7M3S/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Charaf ZGUIOUAR</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>YPRZBE@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-YPRZBE</pentabarf:event-slug>
            <pentabarf:title>Bundestag Chat: Discovering Political Landscape with RAG Systems</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T140000</dtstart>
            <dtend>20251210T143000</dtend>
            <duration>003000</duration>
            <summary>Bundestag Chat: Discovering Political Landscape with RAG Systems</summary>
            <description>Retrieval-Augmented Generation (RAG) systems are among the most impactful applications of LLMs, allowing for intelligent querying and contextual understanding of unstructured data. However, turning a prototype into a polished, scalable product is often where complexity sets in.

In this talk, we walk through how our open-source RAG blueprint was used to create *Bundestag Chat*&#8212;a system that allows users to interact with over a decade of German parliamentary debates via a chat interface. This real-world use case illustrates the key benefits of our blueprint: modularity, observability, evaluation, and scalability.

Our architecture includes:

- **LlamaIndex** for document parsing and chunking,
- **Hugging Face embedding models** stored in a **PGVector** vector database,
- **Chainlit** for an intuitive chat UI,
- **Langfuse** for logging, observability, and feedback collection,
- **Ragas** for evaluating response quality across dimensions like faithfulness and relevance.

What made this system successful was the flexibility to swap components, configure data flows, and monitor performance from day one. This modular design made it straightforward to go from an initial prototype to a system deployed in a privacy-sensitive environment.

We&#8217;ll also contrast open-source and commercial RAG stacks, sharing insights on when to build versus buy. Topics include:

- Estimating system requirements across different workloads,
- Evaluating model performance and output reliability,
- Ensuring data privacy and legal compliance,
- Gathering and acting on human feedback to improve quality.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/YPRZBE/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Piotr Kalota</attendee>
            
            <attendee>Matthias Boeck</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>GS9GQP@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-GS9GQP</pentabarf:event-slug>
            <pentabarf:title>Building Production-Ready Research AI Assistants with One-Command Setup</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T160000</dtstart>
            <dtend>20251210T163000</dtend>
            <duration>003000</duration>
            <summary>Building Production-Ready Research AI Assistants with One-Command Setup</summary>
            <description>In this talk, we introduce Lab Lens: an open-source framework for building research AI assistants that lets labs ingest scientific papers and media coverage, build a vector database, and query it via natural language, all with one reproducible command.

This 30-minute talk will explore:

- **&#129504; Architecture:** How LangGraph, FastAPI, and Streamlit are combined with agentic reasoning for document Q&amp;A.
- **&#128196; Multi-modal Ingestion:** How Lab Lens uses Landing.AI (vision agentic document extraction) and Firecrawl to intelligently extract content from complex PDFs and dynamic media pages.
- **&#129302; LLM Workflow:** How intents are classified, documents retrieved, and responses synthesized with structured JSON output and source attribution.
- **&#128260; Reusability and Extensibility:** How any lab or research group can plug in their own documents and deploy in minutes.
- **&#9881;&#65039; One-Line Setup:** How a single YAML config and `docker compose up` sets up ingestion, vectorization, API, UI, and Slack bot integration.

We&apos;ll conclude with a live demo showing how Lab Lens answers real research questions using citation-backed reasoning, emphasizing transparency, reliability, and ease of use.

Lab Lens is designed for reproducibility, minimal setup, and immediate utility. If you&apos;re interested in bringing GenAI to your research workflow&#8212;or your research to the world&#8212;this talk will show you exactly how.

**Target Audience:** Researchers, students, and enthusiasts wanting practical AI tools.

**Prerequisites:**
- Python knowledge
- Familiarity with containerization concepts.

**Resources Provided:** Complete open-source codebase with Docker configuration for immediate deployment.
The main goal is to make a lab&apos;s entire document collection (papers and media coverage) accessible, so that anyone can ask questions about it and get answers with source citations.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/GS9GQP/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Cain&#227; Max Couto da Silva</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>TXYJHL@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-TXYJHL</pentabarf:event-slug>
            <pentabarf:title>Optimizing AI/ML Workloads: Resource Management and Cost Attribution</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T163000</dtstart>
            <dtend>20251210T170000</dtend>
            <duration>003000</duration>
            <summary>Optimizing AI/ML Workloads: Resource Management and Cost Attribution</summary>
            <description>This talk presents a framework for systematically monitoring and analyzing AI/ML workloads to optimize resource utilization and attribute costs effectively. By providing granular insights into resource consumption, the system helps identify cloud infrastructure bottlenecks, reducing resource contention and promoting fairer use of resources. Built on Metaflow, this approach enables transparent usage reporting, improved performance, and strategic planning for future AI/ML initiatives. Ultimately, it empowers organizations to maximize ROI from their AI/ML investments while maintaining budgetary control and operational efficiency for both platform engineers and data scientists.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/TXYJHL/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Saurabh Garg</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>YBZLZK@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-YBZLZK</pentabarf:event-slug>
            <pentabarf:title>Let Me Structure Freely? How to Improve LLM Structured Output Quality</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T170000</dtstart>
            <dtend>20251210T173000</dtend>
            <duration>003000</duration>
            <summary>Let Me Structure Freely? How to Improve LLM Structured Output Quality</summary>
            <description>Structured output (like JSON) is increasingly used in LLM applications to enforce a predictable schema and simplify downstream parsing. However, developers often assume that structured output is deterministic and robust&#8212;until they run into subtle bugs. At Khan Academy, we&#8217;ve run Khanmigo on structured JSON output since before it was even a supported feature. Along the way, we&#8217;ve learned a lot about where things can go wrong.

Our investigation began when we noticed inconsistent output quality across different LLM frameworks, even with identical prompts and models. The culprit? Python dictionary ordering and how different frameworks serialize JSON schemas.

We&apos;ll explore:

* How Python&apos;s evolution from unordered (pre-3.7) to insertion-ordered dictionaries affects LLM frameworks, and how ordering issues linger in some frameworks even after 3.7
* Framework-specific serialization behaviors in OpenAI SDK, Anthropic SDK, LangChain, OpenRouter, and vLLM
* Measurable impact on output quality through A/B testing results
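
A small sketch of the underlying mechanism, using only the standard library: two schema dictionaries with the same content but different insertion order compare equal in Python yet serialize to different JSON text. The key names `reasoning` and `answer` are hypothetical, and this does not reproduce the behavior of any specific framework.

```python
import json

# Same properties, different insertion order.
props_a = {"reasoning": {"type": "string"}, "answer": {"type": "string"}}
props_b = {"answer": {"type": "string"}, "reasoning": {"type": "string"}}


def serialize(properties, sort_keys=False):
    """Serialize a minimal JSON schema; sort_keys mimics a library that reorders keys."""
    schema = {"type": "object", "properties": properties}
    return json.dumps(schema, sort_keys=sort_keys)


same_content = props_a == props_b  # dict equality ignores order
same_text = serialize(props_a) == serialize(props_b)
same_sorted = serialize(props_a, sort_keys=True) == serialize(props_b, sort_keys=True)
```

Here `same_content` is true while `same_text` is false, so the model would be shown the fields in a different order depending on how the schema was built. A serializer that sorts keys makes both texts identical, which also means it can silently override the order the developer intended.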

Attendees should have basic familiarity with Python and JSON, but no deep LLM expertise is required. We&apos;ll explain technical concepts clearly while providing actionable insights for immediate application.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/YBZLZK/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Boris Lau</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SFG8MV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SFG8MV</pentabarf:event-slug>
            <pentabarf:title>Build your own Personal Data Warehouse</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T173000</dtstart>
            <dtend>20251210T180000</dtend>
            <duration>003000</duration>
            <summary>Build your own Personal Data Warehouse</summary>
            <description>Typically, a data warehouse operates in the cloud. When you access your data&#8212;even just to view it&#8212;you incur compute costs (in other words, you pay). But you already have a computer with a CPU and memory capable of handling most tasks. Wouldn&#8217;t it be great to view, edit, and transform your data right on your own machine? Now you can!

This free, open-source application lets you import your data, transform it with AI-generated Python code that performs the calculations, and report on and export the results.

In this talk, Microsoft MVP Michael Washington shows how to:
&#8211; Import data from Excel, CSV, SQL Server, and Microsoft Fabric
&#8211; Use AI-powered Python/C# code for advanced data transformations
&#8211; Generate SSRS-style reports &#8211; no cloud required
&#8211; Leverage local compute power to avoid cloud costs

Whether you&#8217;re a developer, analyst, or data enthusiast, this session will help you take full control of your data with zero hosting fees. Live demos included!</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/SFG8MV/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Michael Alan Washington</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ARAZTG@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ARAZTG</pentabarf:event-slug>
            <pentabarf:title>LLMs, Chatbots, and Dashboards: Visualize Your Data with Natural Language</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T180000</dtstart>
            <dtend>20251210T183000</dtend>
            <duration>003000</duration>
            <summary>LLMs, Chatbots, and Dashboards: Visualize Your Data with Natural Language</summary>
            <description>This talk gives data scientists the tools and techniques needed to integrate AI into their data products. Specifically, it covers how to use APIs to work with chat providers, and shows where and how we can lean on the tasks LLMs are good at so that we can be confident in their output.

Talk breakdown:

0-5: introduction and where we can push LLMs
5-10: Examples of tasks where the LLM can do well, and where it can fail (in a data science context)
10-15: brief introduction to the chatlas package
15-20: brief introduction to Shiny dashboards and integrating chatlas into Shiny
20-25: demo + example of putting everything together and how we can create an LLM-powered data science product.
25-30: Q+A / overflow</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ARAZTG/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Daniel Chen</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>8U7WLS@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-8U7WLS</pentabarf:event-slug>
            <pentabarf:title>UQLM: Detecting LLM Hallucinations with Uncertainty Quantification in Python</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T183000</dtstart>
            <dtend>20251210T190000</dtend>
            <duration>003000</duration>
            <summary>UQLM: Detecting LLM Hallucinations with Uncertainty Quantification in Python</summary>
            <description>### Objective.
Show how to add uncertainty-aware controls to LLM apps using UQLM so practitioners can detect and handle hallucinations at generation time without ground truth data.

### Context and Gap.
Many hallucination detection methods assume the existence of ground truth data, which is rarely available in production. Research has proposed ground-truth-free uncertainty quantification (UQ) techniques, but adoption suffers from fragmented tooling. UQLM packages these methods behind a simple API and provides a versatile suite of UQ-based confidence scorers that work across tasks.

### What you will see.
- Black-box UQ via response consistency from multiple samples
- White-box UQ from token log probabilities
- LLM-as-a-judge scoring
- Ensemble tuning and threshold selection for your use case
- Patterns for routing: block, warn, or escalate to human review
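
The idea behind consistency-based black-box scoring can be sketched in a few lines of plain Python: sample several responses to the same prompt and measure how often they agree. The function below is a hypothetical illustration that uses exact-match agreement as a crude stand-in for a real similarity measure; it is not the UQLM API.

```python
from collections import Counter


def consistency_score(responses):
    """Return the most common normalized response and the share of samples matching it."""
    normalized = [response.strip().lower() for response in responses]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Four sampled answers to the same prompt (made-up strings for illustration).
samples = ["Paris", "paris ", "Lyon", "Paris"]
answer, score = consistency_score(samples)
```

A low score means the samples disagree, which is the signal a routing rule could use to warn the user or escalate to human review.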

### Outline (30 minutes total).
- 0&#8211;4: Why hallucinations matter in production 
- 4&#8211;8: Limits of traditional hallucination detection approaches and where UQ fits
- 8&#8211;20: UQLM walkthrough and code examples
- 20&#8211;24: Choosing thresholds and tuning ensembles
- 24&#8211;27: Results on several use cases and interpreting confidence
- 27&#8211;30: Q&amp;A

### Expected background. 
Basic familiarity with LLMs and machine learning. No prior uncertainty quantification knowledge required.

### Key takeaways.
- When and why ground-truth-free hallucination detection is useful in production
- How to add UQLM to a Python app in a few lines of code
- Pros and cons of consistency-based, token-probability-based, and judge-based methods
- Practical guidance on thresholds, ensemble tuning, and handling low-confidence outputs</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/8U7WLS/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Dylan Bouchard</attendee>
            
            <attendee>Mohit Singh Chauhan</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ZS37FH@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ZS37FH</pentabarf:event-slug>
            <pentabarf:title>Reviving Survival Analysis: Timeless, Yet Overlooked?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T130000</dtstart>
            <dtend>20251210T133000</dtend>
            <duration>003000</duration>
            <summary>Reviving Survival Analysis: Timeless, Yet Overlooked?</summary>
            <description>Since at least 1693, when the first actuarial tables were used for calculating insurance premiums, survival (or &quot;time-to-event&quot;) analysis has been relevant for many disciplines. Whether predicting when a mechanical component will fail, when a patient will recover, or when a customer will return a product, survival analysis has applications in nearly every domain - from engineering and medicine to finance and e-commerce. Despite its broad applicability and deep statistical foundations, survival analysis remains underappreciated in modern data science.

I therefore want to give the audience, who need not have heard of survival analysis before, an impression of what survival analysis is, what one needs to be careful about, and which analytical and computational tools lead to reliable predictions. In a step-by-step constructive approach, I will guide the audience from the simplest case, the fully observed time-to-event problem, to the more intricate versions that include censoring and truncation, in which managing one&apos;s own ignorance becomes the most important and challenging aspect. Numerous code examples in Python and R will make the talk hands-on and allow listeners to replicate the numerical experiments and visualizations. At the same time, I will keep returning to clear everyday examples (How old should the house you buy be to avoid problems? How long can you use the winter tires on your car? Why is milk often still good after the best-before date?) and thereby hopefully convince the audience that survival analysis is almost everywhere.

Outline: 

- Motivation: The oldest problem in data science? [1 min]
- Introduction: Prediction problems that are in fact survival problems? [3 min]
- The simple case: Fully observed datasets. Visualization of the cumulative failure distribution. [3 min]
- The Weibull distribution as the workhorse of survival analysis: How to model early failures, constant risks and wear-outs. [4 min]
- Why reporting another case of illness can be good news. [2 min]
- Censoring: What can we learn from not having observed anything yet? [2 min]
- The Kaplan-Meier estimator and the maximum-likelihood principle. [5 min]
- Machine Learning approaches to the survival problem. [3 min]
- Outlook: Which degree of individualized survival forecasts can we expect in the future? [2 min]
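
To make the censoring and Kaplan-Meier items concrete, here is a minimal sketch of the estimator in NumPy. The function name `kaplan_meier` and the toy data are illustrative assumptions, not material from the talk; dedicated survival libraries offer more complete implementations.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return event times and the Kaplan-Meier survival estimate at those times.

    durations: time until the event or until censoring.
    observed:  1 if the event was seen, 0 if the record is right-censored.
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        # Units still under observation just before time t.
        at_risk = np.count_nonzero(np.greater_equal(durations, t))
        # Events actually observed at time t (censored records do not count).
        events = np.count_nonzero(np.logical_and(durations == t, observed))
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical toy data: five units, two still running when observation ended.
times, surv = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(times, np.round(surv, 3))  # survival is roughly 0.8, 0.6, 0.3
```

Censored units never count as events, but they stay in the at-risk set until their censoring time, which is how "not having observed anything yet" still contributes information.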

After the talk, the audience will be able to recognize the time-to-event problem in their own domain, and use the appropriate tools in Python and R to analyze and model it.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ZS37FH/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Malte Tichy</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>S7PC89@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-S7PC89</pentabarf:event-slug>
            <pentabarf:title>&#128682;&#128682;&#128016; Lessons in Decision Making from the Monty Hall Problem</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T140000</dtstart>
            <dtend>20251210T143000</dtend>
            <duration>003000</duration>
            <summary>&#128682;&#128682;&#128016; Lessons in Decision Making from the Monty Hall Problem</summary>
            <description>Imagine you&apos;re a contestant on a game show. Three doors stand before you: behind one is a prize car, behind the other two are goats. You choose a door, and the host&#8212;who knows what&apos;s behind each&#8212;reveals a goat behind one of the doors you didn&#8217;t pick. Now you&apos;re asked: &quot;Do you want to switch your choice or stay?&quot;

This is the essence of the Monty Hall Problem, a classic puzzle that famously baffles our intuitions about probability. While it may seem like just a fun brain teaser, it offers profound lessons for decision-making under uncertainty.

In this talk, we&apos;ll break down the Monty Hall Problem, explore its counterintuitive nature, and uncover what it teaches us about probabilistic reasoning and critical thinking. Together, we&apos;ll navigate multiple perspectives.

Key Topics:
* The Monty Hall Problem: Origins, setup, and why it confuses even experts
* Misconceptions and cognitive biases: Why our gut reactions often lead us astray
* Bayesian thinking: The power of belief updating in uncertain scenarios
* Information theory: How the host&apos;s actions reveal hidden information
* Causal reasoning: A fresh lens for understanding the game&apos;s dynamics
* Real-world takeaways: Applying these lessons to practical decision-making
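
For readers who want to check the result empirically, the game described above can be simulated in a few lines of Python. This is a minimal sketch under the standard rules (the host always opens a goat door that the contestant did not pick); the function names are illustrative and not taken from the talk.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a goat door that is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, rounds=100_000, seed=0):
    """Fraction of rounds won with a fixed stay-or-switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(rounds)) / rounds

print("stay:  ", win_rate(switch=False))  # close to 1/3
print("switch:", win_rate(switch=True))   # close to 2/3
```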

By the end of this session, attendees will gain:

* A clear understanding of the Monty Hall Problem and its solution
* Insights into the pitfalls of intuitive probability judgments
* Strategies for approaching complex decisions and probabilistic reasoning

This session is for data scientists, analysts, and decision-makers at all experience levels. No advanced math is required&#8212;just curiosity and a willingness to rethink what you know about probability.

Join me to discover how a seemingly trivial game show puzzle can sharpen your decision-making skills and elevate your approach to statistics, data science, and beyond.

I have summarised this talk in this publication: [bit.ly/mh-lessons](https://bit.ly/mh-lessons).</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/S7PC89/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Eyal Kazin</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>J9JCL9@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-J9JCL9</pentabarf:event-slug>
            <pentabarf:title>Decisions Under Uncertainty: A Hands&#8209;On Guide to Bayesian Decision Theory</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T160000</dtstart>
            <dtend>20251210T163000</dtend>
            <duration>003000</duration>
            <summary>Decisions Under Uncertainty: A Hands&#8209;On Guide to Bayesian Decision Theory</summary>
            <description>This talk bridges everyday decision-making (umbrella example) with advanced techniques like Bayesian optimization and experimental design, and equips attendees with conceptual clarity and immediate code they can adapt to their data-driven workflows.

## Audience

Primarily data scientists, ML practitioners, and statisticians who:

- Have applied Bayesian models but want a broader decision-theory perspective.
- Want actionable insight into uncertainty-aware decision frameworks.
- Seek practical demos in Python.

## Outline

### Motivation &amp; Core Concepts (5&#8239;min)

- Frame real-world decision problems: rain or shine, clinical trials, A/B testing.
- Introduce Bayesian decision theory: beliefs &#215; utilities &#8594; action via expected utility maximization.

### Toy Example: Should I Bring an Umbrella? (8&#8239;min)

- Define: probability p of rain; utility/loss matrix

| Action      | Rain         | No Rain            |
| ----------- | ------------ | ------------------ |
| Umbrella    | &#8211;1 (weight)  | &#8211;1 (inconvenience) |
| No Umbrella | &#8211;10 (soaked) | 0                  |

- Derive expected utility:
```
EU_umbrella = -1
EU_no_umbrella = -10p
```

So bring umbrella if p &gt; 0.1.

- Interactive Python demo: explore how p and utility values shift the decision point.
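
As a minimal sketch of the umbrella calculation above, in plain Python (the names `UTILITY`, `expected_utility`, and `best_action` are illustrative assumptions, not part of the talk materials):

```python
# Utilities from the table above: each action maps to (rain, no rain).
UTILITY = {
    "umbrella": (-1.0, -1.0),
    "no umbrella": (-10.0, 0.0),
}

def expected_utility(action, p_rain):
    """Expected utility of an action when it rains with probability p_rain."""
    u_rain, u_dry = UTILITY[action]
    return p_rain * u_rain + (1.0 - p_rain) * u_dry

def best_action(p_rain):
    """Action with the highest expected utility."""
    return max(UTILITY, key=lambda action: expected_utility(action, p_rain))

for p in (0.05, 0.2, 0.5):
    print(p, best_action(p))
```

Changing the entries of `UTILITY` moves the decision point, which is the effect the interactive demo explores.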

### Bayesian Optimization: PoI &amp; EI (8 min)

- Introduce Gaussian-process-based optimization and the need to trade off exploration vs. exploitation.
- Define Probability of Improvement (PoI) and Expected Improvement (EI)
- Show how they&apos;re derived from decision theory: choosing the next point to maximize expected gain.
- Python demo using GPyTorch: fit GP, compute PoI/EI acquisition functions, visualize decision boundary&#8212;why one chooses a high-uncertainty point vs. one near known good values.
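
The two acquisition functions have closed forms when the model predicts a Gaussian with mean `mu` and standard deviation `sigma` at a candidate point. The sketch below uses only the standard library and assumes a maximization problem with incumbent value `best`; it is an illustration of the formulas, not the GPyTorch demo from the talk.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, best):
    """P(f exceeds best) under a N(mu, sigma^2) prediction."""
    return normal_cdf((mu - best) / sigma)

def expected_improvement(mu, sigma, best):
    """E[max(f - best, 0)] under a N(mu, sigma^2) prediction."""
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

# Two hypothetical candidates: one near a known good value, one highly uncertain.
best = 1.0
candidates = {"safe": (1.0, 0.1), "uncertain": (0.8, 1.0)}
for name, (mu, sigma) in candidates.items():
    poi = probability_of_improvement(mu, sigma, best)
    ei = expected_improvement(mu, sigma, best)
    print(name, round(poi, 3), round(ei, 3))
```

With these made-up numbers, PoI ranks the safe candidate higher while EI ranks the uncertain one higher, which is the exploration vs. exploitation trade-off in miniature.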

### Bayesian Experimental Design (BED): Minimizing Uncertainty (8 min)

- Motivation: cost-sensitive data collection (labeling, surveys, medical tests).
- Define an information-based utility (e.g., expected reduction in entropy).
- Show how decision theory prescribes choosing the next experiment to maximize this expected utility.
- Python demo using OptBayesExpt.


### Summary &amp; Takeaways (1 min)

- Reiterate the decision-theoretic arc: belief &#8594; utility &#8594; action.
- Emphasize the unifying framework across umbrella example, optimization, and experimental design.
- Share resources &amp; practical tips: GPyTorch / scikit-optimize, OptBayesExpt</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/J9JCL9/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Quan Nguyen</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>8RUFNS@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-8RUFNS</pentabarf:event-slug>
            <pentabarf:title>fastplotlib: driving scientific discovery through data visualization</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T170000</dtstart>
            <dtend>20251210T173000</dtend>
            <duration>003000</duration>
            <summary>fastplotlib: driving scientific discovery through data visualization</summary>
            <description>Over the past decade, advanced pipelines have been developed for the analysis of large datasets. However, fast visualization and live interactivity during data collection remain challenging. While current tools within the Python plotting ecosystem allow for interactive data visualization, they either fail to leverage modern GPUs efficiently, lack intuitive APIs for rapid prototyping, or require users to write their own shaders. Additionally, other popular plotting libraries, such as bokeh and matplotlib, are not geared towards fast interactive visualization with millions of objects. Given these challenges, there is a need for a modern GPU-driven interactive plotting library. In this presentation, we will go through the technical details and give a brief demo of how fastplotlib makes fast interactive visualization of complex datasets possible. We will demonstrate the broad applicability of fastplotlib as a fast, general-purpose plotting library.
Fastplotlib is built on top of pygfx, a cutting-edge Python rendering engine that uses WGPU and can efficiently leverage modern GPU and CPU hardware. WGPU is the successor to OpenGL and features low overhead in the amount of code run per draw and per object, allowing for speed even when rendering millions of objects. Pygfx is also non-blocking, which allows for interactivity and modification of already drawn objects. Fastplotlib combines the pygfx rendering library with an expressive API for scientific visualization. It reduces boilerplate code, so users can focus on their data without having to manage the underlying rendering process. Additionally, fastplotlib allows for animations as well as high-level interactivity among plots, which can be combined with lazy loading and lazy compute of very large datasets that are hundreds of gigabytes or terabytes in size. Furthermore, fastplotlib can be used in Jupyter notebooks, so it can run on cloud computing and other remote infrastructure for streaming visualizations of extremely large datasets. Together, these features and the underlying architecture create a plotting library that is fast, easy to use, and multifaceted.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/8RUFNS/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Kushal Kolar</attendee>
            
            <attendee>Caitlin Lewis</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>NJNHQB@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-NJNHQB</pentabarf:event-slug>
            <pentabarf:title>Bayesian Decision Analysis with PyMC: Beyond A/B Testing</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T173000</dtstart>
            <dtend>20251210T190000</dtend>
            <duration>013000</duration>
            <summary>Bayesian Decision Analysis with PyMC: Beyond A/B Testing</summary>
            <description>Bayesian methods offer a natural and interpretable framework for updating beliefs with data, and PyMC makes it easy to apply these techniques in practice. In this tutorial, we&#8217;ll walk through a series of examples that demonstrate the core concepts:

1. Bayesian A/B Testing with the Beta-Binomial Model

  * Represent prior beliefs with the beta distribution  
  * Use binomial likelihoods to model observed outcomes
  * Understand posterior distributions and credible intervals

2. Bayesian Bandits and Thompson Sampling

  * Go beyond hypothesis testing: estimate the probability of one version outperforming another
  * Use Thompson sampling to guide decision-making
  * Simulate and visualize an adaptive email campaign

3. Hierarchical Models for Partial Pooling and Prediction

  * Learn how to share information across variants
  * Use posterior predictive distributions to quantify uncertainty
  * Understand second-order probabilities
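
As a rough preview of parts 1 and 2, the Beta-Binomial update and a single Thompson sampling step can be sketched with NumPy alone, because the beta prior is conjugate to the binomial likelihood. The tutorial itself works in PyMC; the data and names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: (conversions, trials) per variant.
data = {"A": (40, 1000), "B": (55, 1000)}

# A Beta(1, 1) prior with a binomial likelihood gives a Beta posterior in closed form.
posterior = {k: (1 + s, 1 + n - s) for k, (s, n) in data.items()}

# Monte Carlo estimate of the probability that B has the higher conversion rate.
draws = {k: rng.beta(a, b, size=100_000) for k, (a, b) in posterior.items()}
prob_b_better = float(np.mean(np.greater(draws["B"], draws["A"])))
print("P(B beats A):", round(prob_b_better, 3))

def thompson_pick(posterior, rng):
    """Sample one rate per variant and pick the variant with the largest sample."""
    samples = {k: rng.beta(a, b) for k, (a, b) in posterior.items()}
    return max(samples, key=samples.get)

print("next variant to show:", thompson_pick(posterior, rng))
```

Repeating `thompson_pick` after each new observation, and updating `posterior` accordingly, yields the adaptive allocation that part 2 simulates.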

Hands-On Learning

Participants will follow along in Jupyter notebooks (hosted on Colab &#8212; no installation required). Exercises are embedded throughout, with guided solutions. Code is based on PyMC, ArviZ, and standard scientific Python libraries.

Prerequisites

  * Intermediate Python: basic familiarity with NumPy, plotting, and Jupyter notebooks
  * No prior experience with Bayesian statistics or PyMC is assumed
  * All materials run on Colab (no setup required)</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/NJNHQB/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Allen Downey</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>EXUXFR@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-EXUXFR</pentabarf:event-slug>
            <pentabarf:title>Getting big OpenStreetMap data with QuackOSM</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T120000</dtstart>
            <dtend>20251210T123000</dtend>
            <duration>003000</duration>
            <summary>Getting big OpenStreetMap data with QuackOSM</summary>
            <description>[QuackOSM](https://github.com/kraina-ai/quackosm) is a powerful and user-friendly library that streamlines the process of accessing and manipulating OpenStreetMap (OSM) vector and tags data. It uses the [DuckDB](http://duckdb.org/) engine with its [Spatial extension](https://duckdb.org/docs/extensions/spatial/overview), together with the PyArrow library, to let users efficiently retrieve large-scale OSM data in the GeoParquet format.

It&apos;s similar in functionality to other available libraries, but it&apos;s faster, can work with larger-than-memory datasets, and doesn&apos;t require any additional dependencies.

---

Target audience:
Data engineers/analysts/scientists who have worked with or want to work with geospatial data.

---

Outline:
- Brief OpenStreetMap data introduction
- Introduction to DuckDB and PyArrow
- Why is it hard to work with big OSM datasets? Introduction to the OpenStreetMap data schema and PBF format.
- QuackOSM overview: basic usage, data filtering, example use-cases + benchmark against available libraries (OSMnx, Pyrosm, PyDriosm and others).
- Example of a simple ML model built on top of geospatial data</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/EXUXFR/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Kamil Raczycki</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>YTYRLZ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-YTYRLZ</pentabarf:event-slug>
            <pentabarf:title>RDepot - 100% open source enterprise management of Python and R repositories</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T130000</dtstart>
            <dtend>20251210T133000</dtend>
            <duration>003000</duration>
            <summary>RDepot - 100% open source enterprise management of Python and R repositories</summary>
            <description>[RDepot](https://rdepot.io) is a solution for the management of Python and R package repositories in an enterprise environment.
It allows users to submit packages through a user interface or API and to automatically update and publish Python and R repositories.
Multiple departments can manage their own repositories and different users can have different roles in the management of their packages.
With continuous integration infrastructure for quality assurance on Python and R packages, package uploads can be automated.
All configuration is declarative and RDepot can be set up as infrastructure as code, which is especially relevant in regulated contexts, since it makes validation activities much easier.
Packages from publicly available Python repositories such as [PyPI](https://pypi.org/) can be mirrored selectively in custom repositories for use behind a firewall, in internal networks and offline.
Combined with [Crane](https://craneserver.net), authentication and fine-grained authorization (using [OpenID Connect](https://openid.net/developers/how-connect-works/)) can be configured per repository, which offers extra security when dealing with sensitive data or sensitive methodology.

In this talk we will walk Python users and developers through different features of RDepot and demonstrate how these can be useful in different scenarios.
The logic of the different workflows will be explained and live demos will be given to see the open source solution in action.
We will make sure to address needs ranging from small research groups sharing a handful of packages up to multinational companies managing their Python (and R) code across the globe.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/YTYRLZ/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Jonas Van Malder</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>CCRL7W@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-CCRL7W</pentabarf:event-slug>
            <pentabarf:title>Modernizing JSON for Julia</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T133000</dtstart>
            <dtend>20251210T140000</dtend>
            <duration>003000</duration>
            <summary>Modernizing JSON for Julia</summary>
            <description>Over Julia&apos;s history, there have been a number of JSON packages providing various forms of JSON support:
* JSON.jl: oldest/original JSON package; very simple JSON support for reading/writing for mostly just core Julia data structures
* LazyJSON.jl: package that attempted to provide &quot;lazy&quot; parsing support where JSON could be scanned without fully materializing objects in memory; the functionality was never quite &quot;finished&quot;, and the package was thus never widely adopted
* JSON2.jl/JSON3.jl: Iterations on interfaces to support custom struct serialization/deserialization in Julia

The new 1.0 release of the JSON package combines the functionality from all these packages in a single, unified, *and modern* interface. Package functionality now includes:
* Same basic JSON support of reading/writing for core data structures
* Support for lazily processing JSON including extracting deeply nested values without intermediate materialization
* A new JSON.Object structure that mimics a `Dict{Symbol, Any}` but preserves insertion (or in this case parse) order, allows dot access, and in most cases is faster with fewer memory allocations than Dict.
* Custom struct serialization/deserialization support that includes specifying field defaults, custom field lower/lift functionality, or directly mutating fields (of mutable structs) while parsing

This talk aims to cover the historical context leading to the JSON.jl 1.0 release, how the package leverages clean internal interfaces to provide a ton of functionality without exploding the codebase, and why the decision was made to ultimately rewrite the original JSON.jl package for a 1.0 release instead of yet-another-JSONX.jl type package.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/CCRL7W/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Jacob Quinn</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ZXVYCB@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ZXVYCB</pentabarf:event-slug>
            <pentabarf:title>From Ideas to APIs: Delivering Fast with Modern Python</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T140000</dtstart>
            <dtend>20251210T143000</dtend>
            <duration>003000</duration>
            <summary>From Ideas to APIs: Delivering Fast with Modern Python</summary>
            <description>This talk outlines a practical, opinionated workflow for building real things quickly using modern Python without relying on heavy frameworks or over-engineering.

Core idea: 

The shortest path from notebook to usable component is a repeatable, well-lit toolchain with the right structure.

Attendees will learn how to:

1. Scaffold a clean project using pyproject.toml, deterministic environments (uv), and lightweight automation (e.g. Makefile or CLI scripts).

2. Explore data rapidly with polars and duckdb, capturing the business logic in small, testable functions.

3. Wrap the logic in a minimal FastAPI app with pydantic validation, creating clean contracts and boundaries.

4. Add fast feedback mechanisms: tests with pytest, type safety via mypy, and low-friction code hygiene using ruff and pre-commit.

5. Package a handoff-friendly interface (command-line entrypoints, minimal docs) for teammates or deployment pipelines.

This talk isn&#8217;t a showcase of cutting-edge libraries. It&#8217;s a field guide on how to leverage modern Python tools and foster repeatable software engineering habits to maximize value delivery.

You&#8217;ll leave with:

- A blueprint for rapid iteration.

- Reusable patterns for API-bound prototyping.

- A mindset that treats reproducibility as a first-class concern.

Prior knowledge expected:

Basic Python (functions, environments), familiarity with DataFrame operations, and HTTP/JSON fundamentals.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ZXVYCB/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>C&#233;sar Soto Valero</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>K38JGZ@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-K38JGZ</pentabarf:event-slug>
            <pentabarf:title>Quiet on Set: Building an On-Air Sign with Open Source Technologies</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T143000</dtstart>
            <dtend>20251210T150000</dtend>
            <duration>003000</duration>
            <summary>Quiet on Set: Building an On-Air Sign with Open Source Technologies</summary>
            <description>Learn how to build a custom On-Air sign using Apache Kafka&#174;, Apache Flink&#174;, and Apache Iceberg&#8482;! See how to capture events like Zoom meetings and camera usage with Python, process data with FlinkSQL, analyze trends in your Iceberg tables, and bring it all together with a practical IoT project that easily scales out.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/K38JGZ/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Danica Fine</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>BGK8N8@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-BGK8N8</pentabarf:event-slug>
            <pentabarf:title>[BoF] From Data to Decisions: Leveraging Generative AI Across the Data Science Workflow</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251210T160000</dtstart>
            <dtend>20251210T170000</dtend>
            <duration>010000</duration>
            <summary>[BoF] From Data to Decisions: Leveraging Generative AI Across the Data Science Workflow</summary>
            <description>This Birds of a Feather session provides an opportunity for a cross-disciplinary dialogue about practical applications, challenges, ethical considerations, and emerging best practices for leveraging generative AI in data science.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/BGK8N8/</url>
            <location>Impact Scholarship Program</location>
            
            <attendee>Inessa Pawson</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>NSWVT3@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-NSWVT3</pentabarf:event-slug>
            <pentabarf:title>When the Meter Maxes Out: Chernobyl Disaster Lessons for ML Systems in Production</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T130000</dtstart>
            <dtend>20251211T133000</dtend>
            <duration>003000</duration>
            <summary>When the Meter Maxes Out: Chernobyl Disaster Lessons for ML Systems in Production</summary>
            <description>Software engineers aren&#8217;t nuclear engineers, yet the patterns behind catastrophic failure are uncannily transferable. In Chernobyl&#8217;s control room, a radiation gauge pinned at 3.6 R/h masked lethal reality; in production we truncate floats, or hide exploding metrics behind poorly chosen histogram bins. Operators overrode the reactor&#8217;s emergency cooling &#8220;just for this test&#8221;; we disable schema validation to hurry a back-fill. Steam-void reactivity formed a positive feedback loop; recommenders amplify popularity bias until user engagement collapses.

The session walks through several such parallels. Each mini-segment starts with the historical context, then immediately pivots into a modern use-case that demonstrates the ML analogue, for instance, an ad-ranking model whose session_depth feature is computed differently online than in training, yielding a negative CTR lift despite glowing offline metrics.
While the historical narrative keeps the material memorable, the engineering focus stays firmly on actionable prevention: tools like Great Expectations, out-of-distribution gates, reproducible datasets, and, perhaps most importantly, a culture that treats &#8220;impossible&#8221; as a probability, not a certainty.

No specialized nuclear knowledge is assumed. Code examples (when present) use the familiar PyData stack: NumPy, pandas, scikit-learn. The use cases, concepts and tools shown can appeal to both seasoned practitioners and those earlier in their ML journey.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/NSWVT3/</url>
            <location>General Track</location>
            
            <attendee>Idan Richman Goshen</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>PSNG8L@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-PSNG8L</pentabarf:event-slug>
            <pentabarf:title>GPU Accelerated Zarr</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T140000</dtstart>
            <dtend>20251211T143000</dtend>
            <duration>003000</duration>
            <summary>GPU Accelerated Zarr</summary>
            <description>This talk is targeted at users who have at least heard of zarr, but we will give a brief introduction of the basics. The primary purpose is to spread knowledge about zarr-python&#8217;s recently added support for device (GPU) buffers and arrays, and how it can be used to speed up your array-based workload.

An outline:

- Introduction

  - Brief overview of zarr (cloud-native format for storing chunked, n-dimensional arrays)
  - Brief example of how easy it is to use zarr-python&#8217;s native support for device arrays

- Overview of GPU-accelerated Zarr workloads

  - We&#8217;ll show some high-level examples of how Zarr fits into larger workloads (e.g. analyzing climate simulations, as part of a deep learning pipeline)
  - We&#8217;ll discuss the key factors to think about when trying to maximize performance

- Overview of how it works
  - Show zarr&#8217;s configuration options for selecting between host and device buffers
  - An overview of the Zarr codec pipeline
  - Show how on-device decompression can be used, to accelerate decompression if that&#8217;s a bottleneck in your workload

- Benchmarks showing the speedup users can expect to see from GPU acceleration

- Preview of future work
  - Zarr-python currently only uses a single GPU, and doesn&#8217;t use any features like CUDA Streams. https://github.com/zarr-developers/zarr-python/issues/3271 tracks possible improvements for exposing additional parallelism.
  - We&#8217;ll look at a prototype of how CUDA streams enable asynchronous host-to-device memory copies, enabling you to start computing on one chunk of data while another chunk is being copied to the device.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/PSNG8L/</url>
            <location>General Track</location>
            
            <attendee>Tom Augspurger</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>FLD9SR@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-FLD9SR</pentabarf:event-slug>
            <pentabarf:title>Keynote- Noor Aftab- The Next Commit: Building Inclusive, Data-Driven Ecosystems for Responsible AI</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T150000</dtstart>
            <dtend>20251211T160000</dtend>
            <duration>010000</duration>
            <summary>Keynote- Noor Aftab- The Next Commit: Building Inclusive, Data-Driven Ecosystems for Responsible AI</summary>
            <description>Python powers the global AI ecosystem, yet 78% of the talent pool is missing.

We are building the most advanced systems in history with a critical &quot;innovation debt.&quot; This is evidenced not just by the gender gap, but by biased algorithms and higher error rates in production models. This talk treats this gap as an engineering crisis and provides a research-backed solution.

Drawing on published work from the SciPy Proceedings and a quantitative study of 24 global tech communities, we will introduce the 5-Step Engineering Framework. We will deconstruct the VIM Model (Visibility, Invitation, Mechanism), which drove 179% membership growth in the IBM Women in AI pilot.

Attendees will walk away with three actionable tools:

1) The System Audit: A method to measure &quot;innovation debt&quot; in your own teams using specific retention and contribution metrics.
2) The VIM Patch: A blueprint for deploying high-yield mechanisms&#8212;such as hands-on Python labs (requested by 76% of members)&#8212;that statistically outperform generic networking.
3) The Retention Fix: A step-by-step guide to stabilizing the &quot;leaky pipeline,&quot; specifically addressing the mid-career drop-off point where 50% of diverse talent currently leaves.

This session is for builders and maintainers ready to stop admiring the problem and commit to the fix.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/FLD9SR/</url>
            <location>General Track</location>
            
            <attendee>Noor Aftab</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SBM8ZY@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SBM8ZY</pentabarf:event-slug>
            <pentabarf:title>Garbage In, Lawsuit Out: Building Compliant and Reproducible ML Pipelines</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T160000</dtstart>
            <dtend>20251211T163000</dtend>
            <duration>003000</duration>
            <summary>Garbage In, Lawsuit Out: Building Compliant and Reproducible ML Pipelines</summary>
            <description>This session is a reality check for anyone shipping machine learning in production. We&#8217;ll walk through the dark corners of modern ML pipelines: mutable datasets with no history, mystery data sources with missing labels, and a forgotten column of PII that&#8217;s just been shipped to production. Then we&#8217;ll show how to fix it&#8212;without turning your data team into compliance officers. 

You&#8217;ll learn how to embed reproducibility, traceability, and policy enforcement into your pipeline without slowing it to a crawl: track every dataset change, version every experiment, validate against policy gates, and generate audit trails that actually mean something. Whether you&#8217;re dealing with GDPR, HIPAA, or just not wanting to get roasted by internal audit, this talk gives you the blueprint for ML you can defend in court&#8212;and still ship on time.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/SBM8ZY/</url>
            <location>General Track</location>
            
            <attendee>Itai Gilo</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SSVDUG@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SSVDUG</pentabarf:event-slug>
            <pentabarf:title>Connected Identities: Rethinking Identity and Access Management with Neo4j and Python</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T170000</dtstart>
            <dtend>20251211T173000</dtend>
            <duration>003000</duration>
            <summary>Connected Identities: Rethinking Identity and Access Management with Neo4j and Python</summary>
            <description>Access control: it sounds boring&#8212;until it breaks. In this talk, we&#8217;ll look at how to build a smarter Identity and Access Management (IAM) system using Neo4j and Python, and why graphs are a game-changer for modeling who can do what.

You&#8217;ll get a crash course in graph-based thinking for IAM, see how to represent users, roles, and permissions as connected data, and learn how a few Cypher queries can uncover misconfigurations, rogue access, and hidden connections&#8212;all in real time.

As systems scale and architectures grow more distributed, Identity and Access Management (IAM) often becomes a heavy, costly layer&#8212;difficult to maintain, expensive to scale, and slow to adapt. But it doesn&#8217;t have to be this way.

This talk introduces an approach to IAM that is lightweight, portable, and cost-efficient, using Neo4j and Python. By leveraging the natural connectedness of identity data&#8212;users, roles, permissions, and resources&#8212;we can model access in a way that&#8217;s easy to manage, fast to query, and flexible to deploy.

Attendees will learn how to build a graph-based IAM system that avoids complex cloud dependencies, offers real-time access insights, and supports role- and attribute-based access control without requiring massive infrastructure. Whether you&apos;re managing internal tools, building developer platforms, or scaling services, this approach provides strong access control without unnecessary overhead.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/SSVDUG/</url>
            <location>General Track</location>
            
            <attendee>Irina Loghin</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>AJD8TU@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-AJD8TU</pentabarf:event-slug>
            <pentabarf:title>Revolutionizing Safety Log Analysis in Oil and Gas: A Multi-Stage LLM Approach for Enhanced Hazard Identification</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T113000</dtstart>
            <dtend>20251211T120000</dtend>
            <duration>003000</duration>
            <summary>Revolutionizing Safety Log Analysis in Oil and Gas: A Multi-Stage LLM Approach for Enhanced Hazard Identification</summary>
            <description>This presentation explores a new application of Large Language Models (LLMs) in the oil and gas industry, specifically for safety log analysis. While oil and gas operators have traditionally been cautious in adopting LLM technologies, this project demonstrates a compelling use case that delivers tangible value through enhanced hazard identification and trend analysis. Attendees will learn how our multi-stage LLM pipeline processes safety observations to generate actionable insights while maintaining data privacy through on-premises processing. The presentation will showcase how this approach significantly improves classification accuracy and processing efficiency compared to traditional methods, providing a practical framework for organizations looking to leverage AI for safety management.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/AJD8TU/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Andrew Yule</attendee>
            
            <attendee>Iain Docherty</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ATM79G@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ATM79G</pentabarf:event-slug>
            <pentabarf:title>How Big are SLMs</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T120000</dtstart>
            <dtend>20251211T123000</dtend>
            <duration>003000</duration>
            <summary>How Big are SLMs</summary>
            <description>The development of small language models (SLMs) addresses the growing demand for AI solutions that are cost-effective, energy-efficient, and capable of running locally to ensure data privacy and reduce latency. Recent advancements have demonstrated that SLMs can rival or even surpass larger models in specific tasks, thanks to optimized architectures and training methodologies.
A notable example is Google&apos;s Gemma 3, a multimodal SLM family with models ranging from 1 to 27 billion parameters. Gemma 3 introduces vision understanding capabilities, supports longer context windows of at least 128K tokens, and employs architectural changes to reduce memory usage. The 27B parameter version of Gemma 3 has achieved competitive performance, ranking among the top 10 models in the LMSys Chatbot Arena with an Elo score of 1339.
The shift towards SLMs signifies a paradigm change in AI development, focusing on creating models that are not only powerful but also accessible and adaptable to a wide range of applications. As the field evolves, SLMs are poised to play a crucial role in democratizing AI technology.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ATM79G/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Jayita Bhattacharyya</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ECCYVF@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ECCYVF</pentabarf:event-slug>
            <pentabarf:title>Automating ML with PyCaret: Train &amp; Compare Multiple Models to Find the Best Performer</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T130000</dtstart>
            <dtend>20251211T133000</dtend>
            <duration>003000</duration>
            <summary>Automating ML with PyCaret: Train &amp; Compare Multiple Models to Find the Best Performer</summary>
            <description>Machine learning workflows often involve repetitive tasks, complex code, and time-consuming model comparisons. PyCaret changes this paradigm by democratizing machine learning - empowering anyone to train multiple algorithms and systematically compare their performance with low-code solutions. With PyCaret&apos;s philosophy of &quot;spend less time coding and more time on analysis,&quot; this library transforms the model selection process by automating training and comparison across multiple algorithms.
In this 30-minute session, you&apos;ll discover:

ML and PyCaret Fundamentals (13 mins)

1. What is Machine Learning, Machine Learning Algorithms and workflows
2. What is PyCaret

Live Demo: Multi-Algorithm Training &amp; Comparison (10 mins)

1. Hands-on demonstration using the Diabetes Dataset
2. Training multiple algorithms simultaneously with minimal code
3. Automated model comparison using various performance metrics
4. Real-time exploration of model performance visualizations
5. Selecting the best performer based on key evaluation metrics


Wrap-up &amp; Resources (2 mins)

1. Key takeaways and next steps
2. Access to GitHub repository with slides and demo notebooks

Q&amp;A (5 mins)

Who Should Attend:

1. Data scientists looking to accelerate their workflow
2. Python developers interested in machine learning
3. ML practitioners seeking efficient model prototyping tools
4. Anyone curious about low-code ML solutions

Prerequisites:

1. Basic understanding of Python
2. Familiarity with machine learning concepts (helpful but not required)
3. No prior PyCaret experience needed

What You&apos;ll Take Away:

1. Practical knowledge of automated model training and comparison
2. Experience with systematic algorithm evaluation using PyCaret
3. Understanding of performance metrics for model selection
4. Ready-to-use code examples for multi-algorithm comparison
5. Confidence to choose the best ML algorithm for your specific projects

Join us for this fast-paced, demo-heavy session that will transform how you approach machine learning projects!</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ECCYVF/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Manjunath Janardhan</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>7MEX7V@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-7MEX7V</pentabarf:event-slug>
            <pentabarf:title>Streaming AI Workflows in Python: Kafka Queues and Flink-Powered LLM Inference</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T133000</dtstart>
            <dtend>20251211T140000</dtend>
            <duration>003000</duration>
            <summary>Streaming AI Workflows in Python: Kafka Queues and Flink-Powered LLM Inference</summary>
            <description>This talk includes:

Live Python-oriented demo and architecture walkthrough.

Building an end-to-end pipeline: Kafka queue &#8594; Flink+LLM inference &#8594; (optional) Data lake storage (e.g., Iceberg).

Python code samples, best practices, and design patterns for powering real-time, intelligent analytics on modern cloud-native stacks.

Whether you&#8217;re developing in Jupyter Notebooks, Pandas, or PySpark, you&#8217;ll discover practical ways to combine Kafka, Flink, and LLMs in your Python data workflows&#8212;with or without a lakehouse backend.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/7MEX7V/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Shekhar Prasad Rajak</attendee>
            
            <attendee>bhrathjatoth</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SPFEYP@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SPFEYP</pentabarf:event-slug>
            <pentabarf:title>From Handwritten Notes to Smart Knowledge: Build Local AI Agents with Python</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T140000</dtstart>
            <dtend>20251211T143000</dtend>
            <duration>003000</duration>
            <summary>From Handwritten Notes to Smart Knowledge: Build Local AI Agents with Python</summary>
            <description>What you&#8217;ll learn
&#8226; When to stay in a UI vs. when Python is essential
&#8226; How to orchestrate agents with CrewAI and plug in custom logic
&#8226; Clean patterns for local LLM inference with MLC-AI
&#8226; A complete, copy-paste-ready pipeline for knowledge extraction &amp; linking

Live demos
&#8226; AnythingLLM quick-start (2 min)
&#8226; Python agent orchestration classifying &amp; linking 10+ handwritten notes (15 min)
&#8226; Querying the resulting knowledge graph for recurring themes (3 min)

Take-home repo
GitHub repo + requirements.txt + Docker compose file so attendees can rerun everything on their own notes.

Prerequisites
Basic Python (functions, classes, pip install). No prior AI/ML knowledge required.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/SPFEYP/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>piotr stepinski</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>RQSLXN@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-RQSLXN</pentabarf:event-slug>
            <pentabarf:title>Detecting Regime Shifts in Time Series with Python: Entropy-Based Change-Point Detection</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T143000</dtstart>
            <dtend>20251211T150000</dtend>
            <duration>003000</duration>
            <summary>Detecting Regime Shifts in Time Series with Python: Entropy-Based Change-Point Detection</summary>
            <description>Time series data in finance, IoT, or sensor monitoring are rarely stationary &#8212; regime shifts happen suddenly, and failing to detect them early can lead to inaccurate predictions or large financial losses.

This talk presents a practical, Python-based approach to change-point detection in multivariate time series using k-nearest neighbor entropy estimators combined with clustering techniques. This method uses open-source libraries like NumPy, scikit-learn, and pandas, and can be adapted to various domains.

Takeaways:

- How to implement entropy-based change-point detection with open-source Python tools.

- How to identify and handle abrupt shifts in time series to make models more robust.

- How to apply these techniques beyond finance to any time series with regime shifts.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/RQSLXN/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Sergei Nasibian</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>DYXWAV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-DYXWAV</pentabarf:event-slug>
            <pentabarf:title>Future proof your AI product</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T160000</dtstart>
            <dtend>20251211T163000</dtend>
            <duration>003000</duration>
            <summary>Future proof your AI product</summary>
            <description>Most LLM frameworks are opaque and obscure what they are doing. New state-of-the-art models are released every week, and different models respond differently to the same prompts. Prompts hardcoded inside these frameworks make the system difficult to debug, update, and improve. Walls of text are also a terrible way to program and are hard to maintain. DSPy offers a better way: it uses abstractions to express your intent to the LLM in code without defining the prompt, which makes your product future-proof. By changing one line, you can switch models, tasks, or inference strategies.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/DYXWAV/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Breno Brito</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>EJJSKK@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-EJJSKK</pentabarf:event-slug>
            <pentabarf:title>HPC Implementation of a Hybrid Recommender System in Julia</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T163000</dtstart>
            <dtend>20251211T170000</dtend>
            <duration>003000</duration>
            <summary>HPC Implementation of a Hybrid Recommender System in Julia</summary>
            <description>In this talk, we present the implementation of a hybrid recommender system that helps preselect candidates for a job application. We discuss how the data is preprocessed with NLP techniques, building on various libraries, including TextAnalysis, Embeddings and MLJ. The input information (applicant metadata and job adverts) is aggregated into a heterogeneous graph, later converted into a GNN using GraphNeuralNetworks. The underlying model supporting the recommendations combines several graph convolutional layers and a transformer (encoder and decoder). To make the model&apos;s training more efficient, we rely on the Distributed and ClusterManagers libraries. Our preprocessing and training steps run on a supercomputer, and we present the implementation and the job submission details.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/EJJSKK/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Jos&#233; Quenum</attendee>
            
            <attendee>marthin thomas</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>7PTYQX@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-7PTYQX</pentabarf:event-slug>
            <pentabarf:title>TinyTroupe: Enhancing Marketing Insights through LLM-Powered Multiagent Persona Simulation</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T183000</dtstart>
            <dtend>20251211T190000</dtend>
            <duration>003000</duration>
            <summary>TinyTroupe: Enhancing Marketing Insights through LLM-Powered Multiagent Persona Simulation</summary>
            <description>**Agenda:**

- Introduction
- Business Context: customer understanding &amp; traditional research
- The Challenge: &#8220;Can&#8217;t we just use ChatGPT?&#8221;
- TinyTroupe: LLM-powered multi-agent persona simulation
- Code Walkthrough: end-to-end concept-test demo (running-shoe example)
- Summary &amp; practical tips

**Code Walkthrough Part**
https://github.com/takechanman1228/Effective-Persona-Simulation

**Key Takeaways:**
- Understand the core concepts and advantages of LLM-powered multi-agent persona simulation.
- Learn how to leverage TinyTroupe for efficient and insightful marketing analytics.

**Target Audience:**
- Data analysts and data scientists interested in customer analytics and marketing.
- Marketers, business analysts, and executives seeking innovative approaches to understanding customer behavior and optimizing marketing strategies.
- IT specialists and developers interested in applying LLM and multi-agent simulation technologies to real-world business scenarios.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/7PTYQX/</url>
            <location>Machine Learning &amp; AI</location>
            
            <attendee>Hajime Takeda</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>WRSZRV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-WRSZRV</pentabarf:event-slug>
            <pentabarf:title>Computer Vision Data Version Control and Reproducibility at Scale</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T133000</dtstart>
            <dtend>20251211T150000</dtend>
            <duration>013000</duration>
            <summary>Computer Vision Data Version Control and Reproducibility at Scale</summary>
            <description>Petabytes of unstructured data are the foundation on which successful Machine Learning (ML) models are built. To train models, researchers commonly extract subsets of that data to their local environments by simply copying and pasting. This allows for iterative experimentation, but it also makes data management inefficient when developing machine learning models: reproducibility suffers, data transfer is slow, and local compute power is limited.

Data version control technologies can help computer vision researchers overcome these challenges. In this workshop we&apos;ll cover:

- How to use open source tooling to version control your data when working with data locally.
- Best practices for working with data that avoid copying it locally while still enabling model training at scale directly in the cloud. This will be demoed with an OSS stack:
  - LangChain
  - TensorFlow
  - PyTorch
  - Keras

You will come away with practical methods to improve your data management when developing and iterating upon Machine Learning models, built for modern computer vision research.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/WRSZRV/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Joe Pringle</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>M78NZT@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-M78NZT</pentabarf:event-slug>
            <pentabarf:title>Animating Equity: Python Dashboards for Small-Town Housing and Displacement Risk</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T163000</dtstart>
            <dtend>20251211T170000</dtend>
            <duration>003000</duration>
            <summary>Animating Equity: Python Dashboards for Small-Town Housing and Displacement Risk</summary>
            <description>How do you turn raw census tables into something a small town can actually use to guide housing policy? In this talk, I walk through the design and development of an animated spatial dashboard built entirely with Python, designed to help local residents and planners in Oxford, North Carolina understand where their most vulnerable neighbors live &#8212; and how that vulnerability is changing over time.

Oxford is a rural town facing new development pressure, including non-contiguous annexation and suburban for-sale housing growth. While these changes promise tax base expansion, they also risk pushing out low-income renters, especially in historically underserved neighborhoods. My dashboard uses ACS 5-Year estimates and USDA Food Access data to visualize key indicators like rent burden, SNAP share, senior population, and a normalized displacement risk index &#8212; all animated from 2017 to 2023 using Leaflet.TimeDimension inside folium.

The talk is both a case study in data storytelling for place-based equity and a practical demo of working with geospatial census data in Python &#8212; no proprietary software or expensive tools required.

Outline (with time estimates)

0&#8211;5 min &#8212; Context: Why Oxford, NC? The risks of unchecked suburban growth for small cities

5&#8211;10 min &#8212; Data: ACS, USDA, and parcel-level value data via censusdis and publicly-available shapefiles

10&#8211;20 min &#8212; Dashboard architecture: Python data pipeline, Folium with TimeSliderChoropleth, adding map interactivity, overlays, and popups

20&#8211;25 min &#8212; Use case: Displacement risk and the intersection of rent burden, food access, and annexation

25&#8211;30 min &#8212; Q&amp;A, tips for adapting the method to other communities

Audience

This talk is intended for:

Data analysts, GIS specialists, and Python developers interested in civic tech or applied geospatial analysis

Planners, advocates, and public servants exploring how open data and open tools can improve policy transparency

Anyone working with small-area census data, especially at the block group or tract level

Attendees should have a basic familiarity with Python and data visualization libraries (pandas, folium, etc.), but no prior experience with geospatial programming is required.

Takeaways

Attendees will learn:

How to download and preprocess ACS data at the block group level using Python

How to build time-animated choropleth maps using folium + Leaflet.TimeDimension

How normalized composite indicators like a displacement risk index can help surface hidden patterns in small towns

How interactive mapping can drive better community conversations around housing, equity, and development</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/M78NZT/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Matthew Cox</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>VY398A@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-VY398A</pentabarf:event-slug>
            <pentabarf:title>Beyond Just Prediction: Causal Thinking in Machine Learning</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T170000</dtstart>
            <dtend>20251211T173000</dtend>
            <duration>003000</duration>
            <summary>Beyond Just Prediction: Causal Thinking in Machine Learning</summary>
            <description>## Audience
This talk is for data scientists and ML engineers at any level. Basic familiarity with Python and machine learning concepts is helpful but not required.

## Objective
Attendees will learn when to use causal thinking rather than predictive modeling and how to implement uplift models in Python. They will also see how these techniques apply across domains such as marketing and healthcare.

## Details
Predictive ML models are used everywhere for data-driven decision making across industries. However, accurate forecasts don&apos;t always translate to optimal actions.

We will begin by exploring the fundamental challenges of deriving actions from model predictions, especially when determining the right audience to target. We will then cover the basics of causal inference and how it differs from traditional ML, before introducing uplift modeling and its key concepts: treatment effects, counterfactuals, and meta-learning approaches. We will see how these elements work together to create causal ML models.

Finally, we will put theory into practice by building a sample uplift model in Python. We&apos;ll walk through each step using publicly available real-world intervention data, demonstrating how this approach can improve decision-making and help ensure that interventions target the right audience for the right reasons.
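
To make the T-Learner idea concrete, here is a minimal sketch, not the model built in the talk. It fits one outcome model on treated rows and one on control rows, then takes the difference of their predictions as the estimated uplift. The `SegmentMeanModel` base learner and the sample rows are hypothetical stand-ins; a real project would typically plug in a proper regressor or classifier.

```python
from collections import defaultdict


class SegmentMeanModel:
    """Toy base learner: predicts the mean outcome observed for each segment."""

    def fit(self, segments, outcomes):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for segment, outcome in zip(segments, outcomes):
            totals[segment] += outcome
            counts[segment] += 1
        self.means_ = {s: totals[s] / counts[s] for s in totals}
        self.default_ = sum(outcomes) / len(outcomes)
        return self

    def predict(self, segment):
        return self.means_.get(segment, self.default_)


def fit_t_learner(rows):
    """rows: list of (segment, treated, outcome). Returns an uplift function."""
    treated = [(s, y) for s, t, y in rows if t == 1]
    control = [(s, y) for s, t, y in rows if t == 0]
    model_t = SegmentMeanModel().fit(*zip(*treated))  # outcome model for treated
    model_c = SegmentMeanModel().fit(*zip(*control))  # outcome model for control
    return lambda segment: model_t.predict(segment) - model_c.predict(segment)


# Hypothetical campaign data: (segment, treated, converted).
rows = [
    ("new", 1, 1), ("new", 1, 1), ("new", 1, 0), ("new", 1, 1),
    ("new", 0, 0), ("new", 0, 1), ("new", 0, 0), ("new", 0, 0),
    ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 1, 0),
    ("loyal", 0, 1), ("loyal", 0, 1), ("loyal", 0, 1), ("loyal", 0, 0),
]
uplift = fit_t_learner(rows)
for segment in ("new", "loyal"):
    print(segment, round(uplift(segment), 2))
```

In this toy data, the "loyal" segment converts at the same rate with or without treatment, so its estimated uplift is zero even though a purely predictive model would score it as highly likely to convert.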

## Outline
- Introduction and motivation [1 min]
- From correlation to causation [4 min]
   - Correlation vs Causation
   - When do we need a causal angle
- Core causal concepts [4 min]
   - Treatment effects
   - Counterfactuals
   - Intervention problem
- Uplift modeling concepts [5 min]
   - Four types of individual responses to a treatment
   - Meta-learning approach
   - T-Learner and S-Learner comparison
- Hands-on case study [10 min]
   - Problem explanation and formulation
   - Predictive model output
   - Causal uplift model in Python
   - Compare targeting strategies and intervention impact
- Evaluation [4 min]
   - Why accuracy or F1 scores don&#8217;t work for uplift
   - Uplift curves
   - Qini coefficient
   - Explainability
- Practical Considerations [2 min]
   - A/B testing treatment effects
   - Cross-domain applications</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/VY398A/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Avik Basu</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ESFUQB@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ESFUQB</pentabarf:event-slug>
            <pentabarf:title>Enhancing Marketplace Competitiveness: A Bayesian Approach to modelling the cold start problem</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T173000</dtstart>
            <dtend>20251211T180000</dtend>
            <duration>003000</duration>
            <summary>Enhancing Marketplace Competitiveness: A Bayesian Approach to modelling the cold start problem</summary>
            <description>In this session, we will explore the application of Bayesian methodology to address the cold start problem in a recommendation system: determining if there is enough data for a new product in a marketplace to be accurately ranked, or if the product should get further exposure to reach that stage. 

The target audience of this talk is data analysts of all levels, data practitioners interested in modelling, and professionals working in recommendation systems. 

Unlike traditional machine learning models, Bayesian statistical modelling offers a robust framework for updating probabilities with new evidence, making it particularly suited to dynamic environments like online marketplaces. We can update our estimate of a new product&apos;s performance daily, which supports efficient decisions on &#8220;should I keep on exploring this new product or not?&#8221; while minimising the traffic investment and enabling a risk-management-based approach. We will also cover how we check the assumptions that the Bayesian model requires.
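
The daily update described above can be sketched with a conjugate Beta-binomial model. This is an illustration under stated assumptions rather than the model presented in the talk: the prior, the daily counts, and the 2% threshold are hypothetical, and the sketch treats impressions as independent trials.

```python
import random


def update_beta(prior_alpha, prior_beta, bookings, impressions):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return prior_alpha + bookings, prior_beta + (impressions - bookings)


def prob_rate_above(alpha, beta, threshold, draws=20000, seed=0):
    """Monte Carlo estimate of P(conversion rate > threshold) under Beta(alpha, beta)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha, beta) > threshold for _ in range(draws))
    return hits / draws


# Hypothetical prior: roughly 100 impressions' worth of belief at a 2% rate.
alpha, beta = 2.0, 98.0
# Hypothetical daily (bookings, impressions) for one new product.
daily_data = [(1, 40), (3, 55), (2, 60)]

for day, (bookings, impressions) in enumerate(daily_data, start=1):
    alpha, beta = update_beta(alpha, beta, bookings, impressions)
    posterior_mean = alpha / (alpha + beta)
    p_good = prob_rate_above(alpha, beta, threshold=0.02)
    print(f"day {day}: mean={posterior_mean:.4f}, P(rate above 2%)={p_good:.3f}")
```

A stopping rule could then end exploration once this probability is high or low enough for the business to act on; where to set those cut-offs is a risk-management choice.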

Key takeaways:
1. Understanding Bayesian Methods: Learn how Bayesian statistics can be applied to real-world business problems, offering a flexible and interpretable approach to decision-making.

2. Benefits Over Machine Learning: Discover why statistical modelling can be more advantageous than machine learning in certain business contexts, particularly when managing risk, handling sparse data and providing interpretable results to the business.

3. Practical Application: Learn about the challenges of applying Bayesian models in a real marketplace.

Outline:
Introduction to the cold-start problem (2 min)
How we rank incoming activities at GetYourGuide and how modelling could make us more efficient (5 min)
Explaining the model (15 min)
    Intro to a Bayesian binomial model (3 min)
    Controlling for independence among trials (3 min)
    Defining the prior (3 min)
    Designing a stopping criterion (6 min)
Risk management: why Bayesian modelling over machine learning (5 min)
Questions (3 min)

Prerequisites

Learn what the cold start problem in a recommender system is (https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)).

Get familiar with Bayesian thinking (https://www.countbayesie.com/blog/2022/2/19/how-to-read-the-news-like-a-bayesian).

If you want to go fancy, read this paper: https://arxiv.org/pdf/2410.02126</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/ESFUQB/</url>
            <location>Analytics, Visualization &amp; Decision Science</location>
            
            <attendee>Agustin Figueroa Nazar</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SRCNAR@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SRCNAR</pentabarf:event-slug>
            <pentabarf:title>Building a Lightweight Feature Store for Electricity Grid Forecasts with Polars</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T113000</dtstart>
            <dtend>20251211T120000</dtend>
            <duration>003000</duration>
            <summary>Building a Lightweight Feature Store for Electricity Grid Forecasts with Polars</summary>
            <description>In this talk, we&#8217;ll share how we built a lightweight, production-ready feature store to support electricity grid forecasting. You&apos;ll hear a firsthand account of our journey&#8212;from identifying the need to accelerating model prototyping through feature standardization and flexibility.
We&#8217;ll start with a high-level overview of our decision-making process: why we chose to build rather than buy, and the trade-offs we considered. Then, we&#8217;ll dive into the architecture of our custom feature store, detailing how we leveraged Polars for fast processing and Google Cloud Storage as a scalable backend.
Expect an honest look at the challenges we faced, the benefits we gained, and the costs we encountered along the way. Whether you&apos;re considering building your own feature store or just curious about scaling ML for time series problems, this session will offer practical insights and real-world lessons.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/SRCNAR/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Robin Troesch</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>YN7DYP@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-YN7DYP</pentabarf:event-slug>
            <pentabarf:title>Engineering Large-scale geospatial raster processing with xarray and dask</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T123000</dtstart>
            <dtend>20251211T130000</dtend>
            <duration>003000</duration>
            <summary>Engineering Large-scale geospatial raster processing with xarray and dask</summary>
            <description>This talk addresses a common challenge faced by data scientists, data engineers, researchers, and geospatial analysts working with large-scale geospatial data: how to efficiently process and harmonize raster datasets that exceed memory limits, while maintaining both data integrity and computational performance. Attendees are expected to have a basic familiarity with Python and an understanding of fundamental geospatial concepts.

I will begin by outlining prevalent issues in geospatial data processing, such as memory constraints when working with large rasters, the difficulty of harmonizing datasets with varying resolutions and projections, and the computational cost of performing zonal statistics across multiple layers. To address these challenges, I will demonstrate how libraries like xarray and rioxarray offer elegant abstractions for geospatial data manipulation, while Dask facilitates out-of-core computation and parallel processing. A technical walkthrough will showcase a flexible pipeline designed to handle key data processing scenarios: downsampling, upsampling, masking, managing missing values, and other steps. 

I will give a live code demonstration based on a project involving zonal statistics for small area poverty estimation. This will include processing layers such as population density, distance to healthcare, and nightlights to produce harmonized zonal statistics at administrative level three for a selected country. To wrap up, we&#8217;ll briefly touch on optimization strategies, including chunking techniques and memory management.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/YN7DYP/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>CLINTON OYOGO DAVID</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>VS8HWU@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-VS8HWU</pentabarf:event-slug>
            <pentabarf:title>Accelerate deployment of your Python data science apps using ShinyProxy</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T133000</dtstart>
            <dtend>20251211T140000</dtend>
            <duration>003000</duration>
            <summary>Accelerate deployment of your Python data science apps using ShinyProxy</summary>
            <description>ShinyProxy is already a well-known tool for deploying apps built with R and Shiny. This talk will, for the first time, introduce ShinyProxy to the Python community. In the first part of the talk, we present how Bob wrote a super useful Python app but struggles to get it deployed at his company. A first challenge is to get hold of a server with all dependencies and libraries installed. Next, Bob is informed that the app must be protected using TLS and integrated with the existing authentication system. After these first obstacles, Bob learns that there are even more requirements and gets stuck. The second part of this talk demonstrates how Bob can solve all these problems using ShinyProxy. For example, using container technology (Docker), Bob has full control over installing dependencies and libraries, while at the same time improving the reproducibility of the setup. This talk is tailored for both data scientists and anyone interested in setting up ShinyProxy. No deep technical knowledge is required to follow along. At the end of the talk, you&apos;ll know everything you need to get started with ShinyProxy and deploy your first app. ShinyProxy supports almost any web application, including Streamlit, Dash, Voila and Gradio. Therefore, we don&apos;t focus on a specific framework. Everything covered in this talk is applicable to your favourite framework.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/VS8HWU/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Tobia De Koninck</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>UKDKZ7@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-UKDKZ7</pentabarf:event-slug>
            <pentabarf:title>Bodo DataFrames: a fast and scalable HPC-based drop-in replacement for Pandas</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T163000</dtstart>
            <dtend>20251211T170000</dtend>
            <duration>003000</duration>
            <summary>Bodo DataFrames: a fast and scalable HPC-based drop-in replacement for Pandas</summary>
            <description>Despite its popularity for data manipulation tasks, Pandas struggles at scale due to its single-threaded execution and significant Python-based overheads. In this talk, we introduce Bodo DataFrames as a way to scale Pandas with a single-line code change: simply replace `import pandas as pd` with `import bodo.pandas as pd`.

Bodo DataFrames transforms Pandas code into lazily evaluated plans, enabling database-quality query optimizations, and runs on a streaming, parallel backend using the Message Passing Interface (MPI) for fast worker-to-worker communication. This design avoids out-of-memory errors and scales easily from a laptop to a large cloud cluster. Unlike other data processing engines, Bodo DataFrames combines powerful techniques from high performance computing (HPC) and databases while remaining fully Pandas compatible.

We will present multiple examples and benchmarks demonstrating how to use Bodo DataFrames. The first example will show how to scale a simple program covering functions like reading/writing Parquet files, Series-datetime, merge, and groupby-agg. The next example will demonstrate how to accelerate user-defined functions (i.e. map and apply) using Bodo DataFrames&apos; built-in support for Just-In-Time (JIT) compilation. The final example will demonstrate how to use Bodo DataFrames&apos; support for the Apache Iceberg format, which provides schema evolution and time travel for ever-changing datasets. We will also discuss how Bodo DataFrames falls back to Pandas when a workload uses operations it doesn&apos;t support, as well as planned future work.

This talk is designed for users of Pandas (data scientists, data engineers, and AI/ML practitioners) who are interested in accelerating and scaling their workloads easily. In addition to a new tool under their belt, attendees will walk away with an understanding of techniques from HPC and databases, unlocking deeper insights into aspects of performance and memory utilization.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/UKDKZ7/</url>
            <location>Data Engineering &amp; Infrastructure</location>
            
            <attendee>Scott Routledge</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SGNMQM@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SGNMQM</pentabarf:event-slug>
            <pentabarf:title>How Do We Create Access for Those Who Don&#8217;t Show Up in Our Spaces?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T120000</dtstart>
            <dtend>20251211T130000</dtend>
            <duration>010000</duration>
            <summary>How Do We Create Access for Those Who Don&#8217;t Show Up in Our Spaces?</summary>
            <description>Impact Scholars Program</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/SGNMQM/</url>
            <location>Impact Scholarship Program</location>
            
            <attendee>Anita Ihuman</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>PXLTKU@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-PXLTKU</pentabarf:event-slug>
            <pentabarf:title>BoF - networking session</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20251211T170000</dtstart>
            <dtend>20251211T180000</dtend>
            <duration>010000</duration>
            <summary>BoF - networking session</summary>
            <description>Impact Scholars Program</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/pydataglobal2025/talk/PXLTKU/</url>
            <location>Impact Scholarship Program</location>
            
        </vevent>
        
    </vcalendar>
</iCalendar>
