<?xml version='1.0' encoding='utf-8' ?>
<!-- Made with love by pretalx v2026.1.0.dev0. -->
<schedule>
    <generator name="pretalx" version="2026.1.0.dev0" />
    <version>0.20</version>
    <conference>
        <title>PyData Berlin 2025</title>
        <acronym>berlin2025</acronym>
        <start>2025-09-01</start>
        <end>2025-09-03</end>
        <days>3</days>
        <timeslot_duration>00:05</timeslot_duration>
        <base_url>https://cfp.pydata.org</base_url>
        
        <time_zone_name>Europe/Berlin</time_zone_name>
        
        
        <track name="Data Handling &amp; Engineering" slug="6077-data-handling-engineering"  color="#000000" />
        
        <track name="Natural Language Processing &amp; Audio (incl. Generative AI NLP)" slug="6078-natural-language-processing-audio-incl-generative-ai-nlp"  color="#000000" />
        
        <track name="Computer Vision (incl. Generative AI CV)" slug="6079-computer-vision-incl-generative-ai-cv"  color="#000000" />
        
        <track name="Generative AI" slug="6080-generative-ai"  color="#000000" />
        
        <track name="Embedded Systems &amp; Robotics" slug="6081-embedded-systems-robotics"  color="#000000" />
        
        <track name="PyData &amp; Scientific Libraries Stack" slug="6082-pydata-scientific-libraries-stack"  color="#000000" />
        
        <track name="Visualisation &amp; Jupyter" slug="6083-visualisation-jupyter"  color="#000000" />
        
        <track name="Community &amp; Diversity" slug="6084-community-diversity"  color="#000000" />
        
        <track name="Education, Career &amp; Life" slug="6085-education-career-life"  color="#000000" />
        
        <track name="Infrastructure - Hardware &amp; Cloud" slug="6086-infrastructure-hardware-cloud"  color="#000000" />
        
        <track name="Ethics &amp; Privacy" slug="6087-ethics-privacy"  color="#000000" />
        
        <track name="Lightning Talks" slug="6296-lightning-talks"  color="#000000" />
        
    </conference>
    <day index='1' date='2025-09-01' start='2025-09-01T04:00:00+02:00' end='2025-09-02T03:59:00+02:00'>
        <room name='Kuppelsaal' guid='a413bdae-4730-5a9d-8aa1-045579ce1087'>
            <event guid='c46c1a84-276f-5036-baf5-8d7e69a9ed42' id='80988' code='YF3MVA'>
                <room>Kuppelsaal</room>
                <title>Opening Session</title>
                <subtitle></subtitle>
                <type>Plenary Session [Organizers]</type>
                <date>2025-09-01T09:00:00+02:00</date>
                <start>09:00</start>
                <duration>00:20</duration>
                <abstract>Opening Session for PyData Berlin 2025</abstract>
                <slug>berlin2025-80988-opening-session</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>Opening Session for PyData Berlin 2025</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/YF3MVA/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/YF3MVA/feedback/</feedback_url>
            </event>
            <event guid='1f3fff55-d06f-553e-a60e-551b68821ef5' id='77339' code='HYGHBG'>
                <room>Kuppelsaal</room>
                <title>PyData 2077: a data science future retrospective</title>
                <subtitle></subtitle>
                <type>Keynote</type>
                <date>2025-09-01T09:20:00+02:00</date>
                <start>09:20</start>
                <duration>00:50</duration>
                <abstract>From: Chrono-Regulatory Commission, Temporal Enforcement Division
To: PyData Berlin Organising Committee
Subject: Citation #TMP-2077-091 - Unauthorised Spacetime Disturbance

Dear Committee,
Our temporal monitoring systems have detected an unauthorised chronological anomaly emanating from your facility (Berliner Congress Center, coordinates 52.52068&#176;N, 13.416451&#176;E) scheduled to manifest on September 1st at 9:20 a.m.</abstract>
                <slug>berlin2025-77339-pydata-2077-a-data-science-future-retrospective</slug>
                <track>Education, Career &amp; Life</track>
                
                <persons>
                    <person id='78818'>Laura Summers</person><person id='78819'>Andy Kitchen</person>
                </persons>
                <language>en</language>
                <description>VIOLATION DETAILS:
- Unauthorised temporal incursion detected
- Speakers identified as: Kitchen, A. &amp; Summers, L. (baseline timeline)
- Anomalous data signatures suggest retrospective analysis from non-contemporaneous source
- Evidence of information leakage: late 21st-century technological practices and standards
- Risk assessment: Moderate timeline contamination potential

REGULATORY COMPLIANCE REQUIRED:
Per Temporal Code Section 2077.3, you are hereby notified that failure to contain this spacetime disturbance will result in fines of up to 50,000 temporal credits. You must ensure adequate attendance at the specified coordinates to properly observe and contain the anomaly as it unfolds.
WARNING: Preliminary scans indicate the transmission contains advanced analytical frameworks and critical commentary on primitive early-21st-century data science practices. Attendees may experience paradigm shifts, changes to mental models, or sudden clarity regarding field trajectories.

Sincerely,
Compliance Officer Z-7749
Chrono-Regulatory Commission
&quot;Keeping Yesterday Safe for Tomorrow&quot;</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/HYGHBG/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/HYGHBG/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B09' guid='844f8596-e84f-5029-b709-8892c0fca5c3'>
            <event guid='8b6e927f-6ee7-5dd5-a3fa-42a917d27515' id='77507' code='GRZ3RG'>
                <room>B09</room>
                <title>A Beginner&apos;s Guide to State Space Modeling</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-01T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>01:30</duration>
<abstract>**State Space Models** (SSMs) are powerful tools for time series analysis, widely used in finance, economics, ecology, and engineering. They allow researchers to encode structural behavior into time series models, including *trends*, *seasonality*, *autoregression*, and *irregular fluctuations*, to name just a few. Many workhorse time series models, including ARIMA, VAR, and ETS, are special cases of the general state-space framework.

In this practical, hands-on tutorial, attendees will **learn how to leverage PyMC&apos;s new state-space modeling** capabilities (`pymc_extras.statespace`) to build, fit, and interpret Bayesian state space models.

Starting from fundamental concepts, we&apos;ll **explore several real-world use cases**, demonstrating how SSMs help tackle common time series challenges, such as handling missing observations, integrating external regressors, and generating forecasts.</abstract>
                <slug>berlin2025-77507-a-beginner-s-guide-to-state-space-modeling</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='78638'>Jesse Grabowski</person><person id='79034'>Alexandre Andorra</person>
                </persons>
                <language>en</language>
                <description>State Space Models offer **a structured yet flexible framework for time series analysis**. They elegantly handle latent processes like trends, seasonality, and noisy observations, making them particularly valuable in real-world applications.

We&apos;ll start with a brief overview of the theory behind SSMs, followed by practical examples where participants will:

- **Understand the components of SSMs**, including observation and state equations.
- **Learn how to specify and fit SSMs** using PyMC&apos;s state space module.
- Implement a **modeling workflow using a survey data example**, showing how to use SSMs to model the data and generate predictions.
- **Explore advanced topics** such as incorporating external regressors, generating forecasts or building custom models.
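
As a taste of what specifying and fitting looks like, here is a minimal, hand-rolled local-level model in plain PyMC. This is a conceptual sketch on simulated data, not the `pymc_extras.statespace` API the tutorial itself uses:

```python
import numpy as np
import pymc as pm

# Simulated data: a slowly drifting latent level observed through noise.
rng = np.random.default_rng(0)
true_level = np.cumsum(rng.normal(0.0, 0.1, size=100))
y = true_level + rng.normal(0.0, 0.5, size=100)

with pm.Model():
    # State equation: the latent level evolves as a Gaussian random walk.
    sigma_level = pm.HalfNormal(&quot;sigma_level&quot;, 1.0)
    level = pm.GaussianRandomWalk(
        &quot;level&quot;, sigma=sigma_level, init_dist=pm.Normal.dist(0.0, 10.0), shape=len(y)
    )
    # Observation equation: we see the level plus measurement noise.
    sigma_obs = pm.HalfNormal(&quot;sigma_obs&quot;, 1.0)
    pm.Normal(&quot;obs&quot;, mu=level, sigma=sigma_obs, observed=y)
    idata = pm.sample()
```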

### Target Audience
This tutorial is aimed at data scientists, statisticians, and data analysts with a basic understanding of statistics and Python, who are interested in expanding their toolkit with Bayesian time series methods. Prior experience with PyMC is not required but will be beneficial.

### Takeaways

By the end of this tutorial, attendees will:

- Understand the **theoretical foundations** of State Space Models.
- Be able to **implement common SSMs** (local level, trend, and seasonal models) in PyMC.
- **Evaluate and interpret** Bayesian state space models using PyMC.
- **Appreciate practical scenarios** where SSMs outperform traditional time series approaches. 

### Background Knowledge Required
Basic understanding of probability and statistics, and familiarity with Python. Prior experience with PyMC is not required but will be beneficial.

### Materials Distribution
All tutorial materials, including notebooks and datasets, will be made available via a GitHub repository.

## Outline

**0 - 10 min: Introduction to State Space Models**

- What are SSMs, and why use them?

**10 - 25 min: State Space Model Fundamentals**

- Observation and state equations.
- Latent states, Kalman filters, and smoothing in Bayesian frameworks.

**25 - 55 min: Implementing SSMs with PyMC (Hands-On)**

- Setting up a local-level model in PyMC.
- Extending models to incorporate trends and seasonality.
- Posterior inference: interpreting results and uncertainty.

**55 - 75 min: Advanced State Space Modeling (Hands-On)**

- Dealing with missing data and irregular intervals.
- Adding external covariates (regression components).
- Model diagnostics and posterior predictive checks.

**75 - 85 min: Real-world Application Case Study**

- Demonstrating an end-to-end modeling example with real data.
- Discussing best practices for practical time series modeling.

**85 - 90 min: Wrap-up and Interactive Q&amp;A**

- Open floor for questions and further resources.

---

## Additional Resources

- [Introduction to PyMC state space module](https://www.youtube.com/watch?v=G9VWXZdbtKQ)
- [Podcast episode on PyMC&apos;s state space module](https://learnbayesstats.com/episode/124-state-space-models-structural-time-series-jesse-grabowski)
- [PyMC State Space Module GitHub Repository](https://github.com/pymc-devs/pymc-extras/tree/main/pymc_extras/statespace)

We believe this tutorial will empower participants with practical knowledge of state space modeling in PyMC, enabling them to effectively analyze complex time series data using Bayesian approaches.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GRZ3RG/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GRZ3RG/feedback/</feedback_url>
            </event>
            <event guid='d2d2dc16-9e10-5c3d-8a16-76a30afd6b0e' id='80782' code='MQS99P'>
                <room>B09</room>
                <title>PyLadies &amp; Empowered in Tech Lunch</title>
                <subtitle></subtitle>
                <type>Social Event</type>
                <date>2025-09-01T12:30:00+02:00</date>
                <start>12:30</start>
                <duration>01:00</duration>
                <abstract>Join PyLadies &amp; Empowered in Tech for a special lunch event aimed at fostering community. Enjoy meaningful conversations and networking opportunities.</abstract>
                <slug>berlin2025-80782-pyladies-empowered-in-tech-lunch</slug>
                <track>Community &amp; Diversity</track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>**PyLadies** is an international mentorship group with a focus on helping more women and gender non-conforming people become active participants and leaders in the Python open-source community. Its mission is to promote, educate and advance a diverse Python community through outreach, education, conferences, events and social gatherings.

---

**Empowered in Tech** is a community in Berlin dedicated to empowering FLINTA (women, lesbians, intersex, non-binary, trans and agender) people to excel in their tech journey. We welcome engineers, software developers, data scientists, designers, product managers, career changers and other professionals in the tech industry. We are open to all tech stacks, programming languages and experience levels. Our goal is to support our members in growing their careers, connecting with like-minded people and feeling welcome in tech.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/MQS99P/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/MQS99P/feedback/</feedback_url>
            </event>
            <event guid='50936293-ab4b-5562-a131-3ef9be680c2b' id='77715' code='WXPVCS'>
                <room>B09</room>
                <title>More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-01T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>01:30</duration>
                <abstract>Most Python developers reach for Pandas or Polars when working with tabular data&#8212;but DuckDB offers a powerful alternative that&#8217;s more than just another DataFrame library. In this tutorial, you&#8217;ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL&#8212;all without leaving Python. We&#8217;ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You&#8217;ll leave with a solid mental model for using DuckDB effectively as the &#8220;SQLite for analytics.&#8221;</abstract>
                <slug>berlin2025-77715-more-than-dataframes-data-pipelines-with-the-swiss-army-knife-duckdb</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78689'>Mehdi Ouazza</person>
                </persons>
                <language>en</language>
                <description>The goal of this tutorial is to help Python users understand and use DuckDB not just as a DataFrame interface, but as a fully featured analytics database embedded in their Python workflows. We&apos;ll highlight real-world patterns where DuckDB shines compared to traditional libraries, especially for medium-scale datasets that don&#8217;t justify a full data warehouse.
You&#8217;ll learn:
- When and why to reach for DuckDB instead of Pandas/Polars
- How DuckDB handles local files (CSV, Parquet, JSON, Postgres database, and more)
- Using DuckDB to build lightweight, SQL-based data pipelines
- Techniques for caching intermediate data in-process
- How to analyze data from remote sources via HTTP or S3
- Tips for using DuckDB with Jupyter, dbt, or your favorite Python tools
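
As a small taste of the workflow above, the following sketch (file and table names are hypothetical) queries a CSV with SQL, caches it as a table, and only materializes the final result as a DataFrame:

```python
import duckdb

con = duckdb.connect(&quot;analytics.duckdb&quot;)  # single-file database on disk

# Read the raw file with SQL and cache it as a table in one step.
con.sql(&quot;CREATE OR REPLACE TABLE events AS SELECT * FROM read_csv_auto(&apos;events.csv&apos;)&quot;)

# Run an aggregation and only hand the final result back to pandas.
top_users = con.sql(&quot;&quot;&quot;
    SELECT user_id, count(*) AS n_events
    FROM events
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
&quot;&quot;&quot;).df()
```
</description>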
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/WXPVCS/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/WXPVCS/feedback/</feedback_url>
            </event>
            <event guid='8d2c5b36-f666-52f6-a032-c7696f7c21fa' id='79732' code='XFPTWN'>
                <room>B09</room>
                <title>AI-Ready Data in Action: Powering Smarter Agents</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-01T15:40:00+02:00</date>
                <start>15:40</start>
                <duration>01:30</duration>
<abstract>This hands-on workshop focuses on what AI engineers do most often: making data AI-ready and turning it into production-useful applications. Together with dltHub and LanceDB, you&#8217;ll walk through an end-to-end workflow: collecting and preparing real-world data with best practices, managing it in LanceDB, and powering AI applications with search, filters, hybrid retrieval, and lightweight agents. By the end, you&#8217;ll know how to move from raw data to functional, production-ready AI setups without the usual friction. We will also touch on multi-modal data and on taking this end-to-end use case to production.</abstract>
                <slug>berlin2025-79732-ai-ready-data-in-action-powering-smarter-agents</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='80642'>Violetta Mishechkina</person><person id='80832'>Chang She</person>
                </persons>
                <language>en</language>
<description>Modern AI applications are only as powerful as the data that fuels them. Yet, much of the real-world data AI engineers encounter is messy, incomplete, or unoptimized. In this hands-on tutorial, AI-Ready Data in Action: Powering Smarter Agents, participants will walk through the full lifecycle of preparing unstructured data, embedding it into LanceDB, and leveraging it for search and agentic applications. Using a real-world dataset, attendees will incrementally ingest, clean, and vectorize text data, tune hybrid search strategies, and build a lightweight chat agent to surface relevant results. The tutorial concludes by showing how to take a working demo into production. By the end, participants will gain practical experience in bridging the gap between messy raw data and production-ready pipelines for AI applications.

**Prior knowledge**

- Basic Python programming.
- Awareness of embeddings, vectors, and AI search concepts (we&#8217;ll explain where needed).

The tutorial is designed to be accessible: engineers familiar with Python should be able to follow along step by step.

**Key Takeaways**

By the end of the tutorial, participants will:

1. Understand the end-to-end workflow of taking raw, real-world data and preparing it for AI applications.
2. Build and run an incremental dlt pipeline to ingest real data into LanceDB.
3. Apply text preprocessing and generate embeddings for semantic search.
4. Optimize retrieval with vector and hybrid search strategies.
5. Implement a lightweight AI agent capable of surfacing relevant issues from a natural language description.
6. Learn how to transition from a demo project to a production setup using LanceDB Cloud.

**Outline**

- Introduce dlt (data load tool) and how it enables schema evolution, incremental loading, and normalization in pipelines.
- Introduce LanceDB and explain embeddings, vector search, hybrid retrieval and multi-modal data for AI applications.
- Ingest and preprocess a real dataset with dlt, generate embeddings, and load it into LanceDB following best data engineering practices.
- Optimize search in LanceDB by tuning parameters, selecting distance metrics, and adding hybrid retrieval.
- Build a lightweight AI agent that queries LanceDB and returns the most relevant issues from natural-language prompts.
- Demonstrate the path to production using automation, monitoring, and LanceDB Cloud for scaling and reliability.
- Conclude with key takeaways and an open Q&amp;A.
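
For orientation, here is a minimal sketch of the ingestion step. The resource and its fields are hypothetical and configuration details may differ; dlt ships a LanceDB destination as an optional extra:

```python
import dlt

# Hypothetical resource yielding raw issue records from some source system.
@dlt.resource(name=&quot;issues&quot;, write_disposition=&quot;merge&quot;, primary_key=&quot;id&quot;)
def issues():
    yield {&quot;id&quot;: 1, &quot;title&quot;: &quot;Pipeline fails on empty file&quot;}
    yield {&quot;id&quot;: 2, &quot;title&quot;: &quot;Add retry logic to the loader&quot;}

# Incremental pipeline: dlt handles schema evolution and normalization,
# LanceDB stores the records ready for embedding and vector search.
pipeline = dlt.pipeline(
    pipeline_name=&quot;issues_to_lancedb&quot;,
    destination=&quot;lancedb&quot;,
    dataset_name=&quot;ai_ready&quot;,
)
print(pipeline.run(issues()))
```
</description>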
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/XFPTWN/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/XFPTWN/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B07-B08' guid='e7ecef66-8ce7-51e4-9629-123a47fb4391'>
            <event guid='feef6de7-3152-54fb-b808-efe6ba927a27' id='77772' code='VBCU9H'>
                <room>B07-B08</room>
                <title>Beyond Linear Funnels: Visualizing Conditional User Journeys with Python</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:30</duration>
<abstract>Optimizing user funnels is a common task for data analysts and data scientists. Funnels are not always linear in the real world; often, the next step depends on earlier responses or actions. This results in complex funnels that can be tricky to analyze. I&#8217;ll introduce an open-source Python library I developed that analyzes and visualizes non-linear, conditional funnels by utilizing Graphviz and Streamlit. It calculates conversion rates, drop-offs, time spent on each step, and highlights bottlenecks by color. Attendees will learn how to quickly explore complex user journeys and generate insightful funnel data.</abstract>
                <slug>berlin2025-77772-beyond-linear-funnels-visualizing-conditional-user-journeys-with-python</slug>
                <track>Visualisation &amp; Jupyter</track>
                
                <persons>
                    <person id='78707'>Yaseen Esmaeelpour</person>
                </persons>
                <language>en</language>
                <description>When we talk about funnels in analytics, most people think of linear funnels, where users move step-by-step through a fixed sequence of actions. But in real-world applications like dynamic forms, on-boarding flows, or diagnostic tools, funnels are often conditional and non-linear. The next step in the journey depends on user input at earlier stages, leading to different paths and variable funnel lengths for every user.

An example is a vehicle pricing tool: while all users answer general questions (e.g., type, mileage), follow-up questions may differ based on previous answers. For instance, only users with electric cars are asked about battery capacity. This branching logic creates challenges for traditional funnel visualization techniques which mostly consider funnels as linear.

Alternative immediate solutions are not perfect:
- Visuals like Sankey diagrams are too limited or too general, and they often visually collapse under real-world data messiness (users going back and forth, drop-offs, missing events).
- Milestone-based funnels (where you set a few milestones along the funnel to mimic a linear one) simplify things too much, hiding key details and masking where things actually break down.

As a data analyst, I needed a way to understand and visualize such nonlinear flows in a more straightforward and consumable way. Finding no library that met this need out of the box, I created funnelius, a Python library that processes raw event logs into ready-to-consume funnel graphs.

The library accepts a pandas DataFrame with user_id, action and action_timestamp columns (see the sketch below for the input format). It uses pandas to transform the DataFrame into a format suitable for Graphviz, adds the columns needed to filter and declutter the graph, and then renders the funnel with the dot engine. The output includes:
- Key metrics for every step: number of users, conversion rate, time spent, percentage of total users, and drop-offs.
- Conditional formatting based on different metrics to highlight bottlenecks.
- Comparison with another DataFrame, showing changes.
- The answers users gave at each step, with the percentage of each answer.
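
For illustration, the expected input is just a tidy event log, one row per user action. The values below are made up:

```python
import pandas as pd

# One row per action a user took, in the order it happened.
events = pd.DataFrame({
    &quot;user_id&quot;: [1, 1, 1, 2, 2],
    &quot;action&quot;: [&quot;start&quot;, &quot;car_type&quot;, &quot;battery_capacity&quot;, &quot;start&quot;, &quot;car_type&quot;],
    &quot;action_timestamp&quot;: pd.to_datetime([
        &quot;2025-01-01 10:00&quot;, &quot;2025-01-01 10:01&quot;, &quot;2025-01-01 10:03&quot;,
        &quot;2025-01-02 09:00&quot;, &quot;2025-01-02 09:02&quot;,
    ]),
})
# funnelius turns this log into a funnel graph with per-step metrics.
```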

The graph can be fine-tuned with options such as:
- Show only the top-N routes to declutter the graph
- Show or hide dropped-user data
- Include only users who started from specific steps; if users are known to start at specific steps, this helps remove possible data issues
- Define which metrics should be calculated

There is also a Streamlit-based UI to interactively adjust parameters and export the funnel analysis as a PDF instead of doing it programmatically.



This tool can be helpful for data analysts and data scientists with Python knowledge who need to analyse conditional funnels.

GitHub repository:
https://github.com/yaseenesmaeelpour/funnelius</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://github.com/yaseenesmaeelpour/funnelius">Github repo</link>
                
                    <link href="https://pypi.org/project/funnelius/0.0.1/">PYPI page</link>
                </links>
                <attachments>
                    <attachment href="https://cfp.pydata.org/media/berlin2025/submissions/VBCU9H/resources/Screenshot_FWR3_uIzcdG8.png">Screenshot 1</attachment>
                
                    <attachment href="https://cfp.pydata.org/media/berlin2025/submissions/VBCU9H/resources/Screenshot_From_18wYCC4.png">Screenshot 2</attachment>
                
                    <attachment href="https://cfp.pydata.org/media/berlin2025/submissions/VBCU9H/resources/funnelius_slide_19iyllf.pdf">Slides</attachment>
                </attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/VBCU9H/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/VBCU9H/feedback/</feedback_url>
            </event>
            <event guid='d79f139d-d068-58f2-9d78-3c96f1e8a89d' id='77698' code='QMPX9V'>
                <room>B07-B08</room>
                <title>Democratizing Digital Maps: How Protomaps Changes the Game</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T11:20:00+02:00</date>
                <start>11:20</start>
                <duration>00:30</duration>
                <abstract>Digital mapping has long been dominated by commercial providers, creating barriers of cost, complexity, and privacy concerns. This talk introduces Protomaps, an open-source project that reimagines how web maps are delivered and consumed. Using the innovative PMTiles format &#8211; a single-file approach to vector tiles &#8211; Protomaps eliminates complex server infrastructure while reducing bandwidth usage and improving performance. We&apos;ll explore how this technology democratizes cartography by making self-hosted maps accessible without API keys, usage quotas, or recurring costs. The presentation will demonstrate implementations with Leaflet and MapLibre, showcase customization options, and highlight cases where Protomaps enables privacy-conscious, offline-capable mapping solutions. Discover how this technology puts mapping control back in the hands of developers while maintaining the rich experiences modern applications demand.</abstract>
                <slug>berlin2025-77698-democratizing-digital-maps-how-protomaps-changes-the-game</slug>
                <track>Visualisation &amp; Jupyter</track>
                
                <persons>
                    <person id='78817'>Veit Schiele</person>
                </persons>
                <language>en</language>
                <description>In today&#8217;s digital landscape, maps have become essential components of countless applications and services, from navigation and logistics to social platforms and data visualization. But for too long, the field has been dominated by a few companies whose services, while powerful, come with significant drawbacks: Usage quotas, tracking requirements, styling limitations, and recurring costs that can quickly skyrocket as applications grow.

This talk will introduce Protomaps, an innovative open source mapping technology that is fundamentally reshaping the way digital maps are created, distributed and used. At its core, Protomaps utilizes the groundbreaking PMTiles format &#8211; a single-file approach to vector tiles that eliminates the need for complex tile server infrastructure while increasing performance and reducing bandwidth consumption.

#### Technical innovation

We start with the technical basics of Protomaps and explain how the PMTiles format works and why it represents such a significant advance over conventional tile map approaches. Unlike conventional solutions that rely on thousands of individual tile files served by a complex infrastructure, PMTiles bundles vector map data into a single, efficiently indexed file that can be hosted anywhere.

The presentation will demonstrate how this approach enables progressive loading, allowing maps to render quickly at variable zoom levels while preserving the rich detail and interactive capabilities users expect from modern mapping solutions. We&#8217;ll examine the efficiency gains in terms of bandwidth usage, server requirements, and client-side rendering performance.

#### Democratization in Practice

This talk will focus on how Protomaps democratizes digital mapping in a tangible way:

##### Economic Accessibility

By eliminating recurring API costs and usage-based pricing models, Protomaps opens up mapping opportunities for projects of all sizes, from hobby developers to non-profit organizations and educational institutions with limited budgets.

##### Technical Accessibility

We demonstrate practical implementations with Leaflet and MapLibre GL and show how developers can integrate Protomaps with just a few lines of code and minimal configuration.

##### Customization Freedom

Without the styling restrictions imposed by commercial vendors, Protomaps allows complete creative control over the appearance of the map. We show examples of customized maps that would be difficult or impossible to achieve with traditional services.

##### Privacy by Design

As Protomaps enables fully self-hosted mapping solutions, there is no need to share user location data or mapping activity with third parties &#8211; a crucial aspect for privacy-conscious applications and those operating under strict regulatory frameworks.

#### Takeaways for Attendees

Participants will leave this session with the following knowledge:

* An understanding of how PMTiles and Protomaps work
* Knowledge of how to use Protomaps in their own projects
* The ability to customize maps to meet specific design and data needs
* A new perspective on the possibilities of democratized digital mapping

Whether you are a developer seeking cost-effective mapping solutions, an organization concerned about data privacy, or simply interested in the evolution of open source geospatial technology, this talk will give you valuable insight into how Protomaps is reshaping the landscape of digital cartography by putting powerful mapping capabilities back into the hands of developers and communities.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/QMPX9V/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/QMPX9V/feedback/</feedback_url>
            </event>
            <event guid='530ae3c3-6252-5888-98c4-139767b29c78' id='77728' code='KBEEHS'>
                <room>B07-B08</room>
                <title>Accessible Data Visualizations</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:30</duration>
                <abstract>Data visualizations often exclude users with visual impairments and temporary or situational constraints. Many regulations (European Accessibility Act, American Disabilities Act) now mandate inclusive digital content. Our research provides practical solutions &#8212; optimized color palettes, supplementary patterns, and alternative formats &#8212; implemented in popular libraries like Bokeh and Vega-Altair. These techniques, available through our open-source cusy Design System, create visualizations that reach broader audiences while meeting compliance requirements and improving comprehension for all users.</abstract>
                <slug>berlin2025-77728-accessible-data-visualizations</slug>
                <track>Visualisation &amp; Jupyter</track>
                
                <persons>
                    <person id='78813'>Maris Nieuwenhuis</person>
                </persons>
                <language>en</language>
                <description>## Introduction

Accessible data visualizations extend beyond aesthetics to meet established standards and accommodate diverse visual abilities. This presentation demonstrates how to create visualizations that comply with Web Content Accessibility Guidelines (WCAG) contrast requirements, support users with color vision deficiencies, and convey information through multiple encoding channels. The topics in the presentation explore practical techniques using colors, patterns, SVG accessibility features, and alternative data formats. 

This presentation is designed for data scientists, visualization specialists, dashboard designers, and accessibility auditors who need to communicate findings effectively to diverse audiences. Attendees will benefit by:

- Learning practical techniques to make visualizations accessible without sacrificing analytical depth
- Gaining implementation strategies for common data visualization libraries
- Acquiring skills to expand their reach to users with visual impairments
- Taking away ready-to-use color palettes and pattern sets for immediate implementation

## Topics

## Color Accessibility

Data visualizations must meet WCAG contrast ratios (&#8805;3:1) for distinguishable elements. Our optimized palette features:

- Eight distinct colors plus neutral gray for invalid data
- CIEDE2000 perceptual differences &gt;20 between colors
- Verified compatibility with various color vision deficiencies
- Print-friendly CMYK values (ISO Coated V2 300% or Pantone C)
- Contrast ratios &gt;3.0 (WCAG AA-level) against white and black backgrounds
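
The contrast requirement is easy to check programmatically. Below is a small sketch of the WCAG 2.x relative-luminance formula; the sample color is only an illustration, not necessarily one of our palette colors:

```python
def linearize(channel):
    # WCAG 2.x sRGB linearization for a single 0-255 channel.
    c = channel / 255
    return c / 12.92 if c &lt;= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb):
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(color_a, color_b):
    lighter, darker = sorted((luminance(color_a), luminance(color_b)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# Palette colors must reach at least 3:1 against white and black backgrounds.
print(contrast_ratio((0, 95, 115), (255, 255, 255)))  # roughly 7, comfortably above 3.0
```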

## Pattern Implementation

Patterns provide critical secondary encoding when color alone is insufficient. We&apos;ll present:

- Unique pattern paired with each color
- Area fills that maintain distinction at various scales
- Sequential pattern densities for quantitative data
- Pattern elements adaptable as point markers
- Implementation via SVG `&lt;pattern&gt;` tags

## Technical Implementation

Practical examples will demonstrate:

- Using color contrast checkers for validation
- Implementing SVG `&lt;pattern&gt;` elements
- Creating accessible SVG with proper ARIA attributes
- Providing alternative data formats (e.g. HTML tables with semantic descriptions)
- Testing with screen readers and accessibility tools

## Conclusion

Implementing these practices creates data visualizations that are not only compliant with accessibility regulations but also more effective for all users. The cusy Design System offers open-source resources to implement these techniques across various visualization libraries.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/KBEEHS/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/KBEEHS/feedback/</feedback_url>
            </event>
            <event guid='28878430-3bc5-5414-8648-07329de41424' id='77485' code='AU8F9U'>
                <room>B07-B08</room>
                <title>Automating Content Creation with LLMs: A Journey from Manual to AI-Driven Excellence</title>
                <subtitle></subtitle>
                <type>Talk [Sponsored]</type>
                <date>2025-09-01T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>00:30</duration>
                <abstract>In the fast-paced realm of travel experiences, GetYourGuide encountered the challenge of maintaining consistent, high-quality content across its global marketplace. Manual content creation by suppliers often resulted in inconsistencies and errors, negatively impacting conversion rates. To address this, we leveraged large language models (LLMs) to automate content generation, ensuring uniformity and accuracy. This talk will explore our innovative approach, including the development of fine-tuned models for generating key text sections and the use of Function Calling GPT API for structured data. A pivotal aspect of our solution was the creation of an LLM evaluator to detect and correct hallucinations, thereby improving factual accuracy. Through A/B testing, we demonstrated that AI-driven content led to fewer defects and increased bookings. Attendees will gain insights into training data refinement, prompt engineering, and deploying AI at scale, offering valuable lessons for automating content creation across industries.</abstract>
                <slug>berlin2025-77485-automating-content-creation-with-llms-a-journey-from-manual-to-ai-driven-excellence</slug>
                <track>Generative AI</track>
                
                <persons>
                    <person id='78658'>Marco Vene</person>
                </persons>
                <language>en</language>
                <description>GetYourGuide, a global marketplace for travel experiences, needs to provide structured and inspiring content for every activity in its marketplace. 
Before the release of our AI models, suppliers would create their content fully manually. The manual approach led to several issues in production, such as content inconsistencies, incorrect grammar, non-English language, and poor adherence to our content guidelines.
These content defects negatively impact the conversion rate of activities.
At the same time, with the large scale of new activity generation, our internal teams could only review a very small fraction of the submitted content.  

With our LLM solution, suppliers can now automatically generate optimal content for their activities. Our feature allows users to simply copy-paste any existing raw text of their activity, and our models then prefill most of the content sections. Suppliers then have the opportunity to review and edit the content.
We chose two different methods to generate free text content and structured information.

For free text, we used the OpenAI fine-tuning API to create two models that generate the relevant sections of our travel activities, i.e. the title, the highlights, and the short and full descriptions.
For structured information, we used the function-calling GPT API to prefill the activity tags and categories that have fixed-value constraints in our database, such as the transport used or the type of guide.
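
As an illustration of the structured-information path, a function-calling request looks roughly like the following sketch. The tag schema, enum values, and model name here are invented for the example and are not our production setup:

```python
from openai import OpenAI

client = OpenAI()
raw_supplier_text = &quot;Guided boat tour of the harbor with a live guide...&quot;

tools = [{
    &quot;type&quot;: &quot;function&quot;,
    &quot;function&quot;: {
        &quot;name&quot;: &quot;set_activity_tags&quot;,  # hypothetical tag schema
        &quot;parameters&quot;: {
            &quot;type&quot;: &quot;object&quot;,
            &quot;properties&quot;: {
                &quot;transport&quot;: {&quot;type&quot;: &quot;string&quot;, &quot;enum&quot;: [&quot;boat&quot;, &quot;bus&quot;, &quot;walking&quot;]},
                &quot;guide_type&quot;: {&quot;type&quot;: &quot;string&quot;, &quot;enum&quot;: [&quot;live&quot;, &quot;audio&quot;, &quot;none&quot;]},
            },
        },
    },
}]
# Force the model to answer through the function, so values stay in the enum.
response = client.chat.completions.create(
    model=&quot;gpt-4o-mini&quot;,
    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: raw_supplier_text}],
    tools=tools,
    tool_choice={&quot;type&quot;: &quot;function&quot;, &quot;function&quot;: {&quot;name&quot;: &quot;set_activity_tags&quot;}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```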

In order to validate our models, as well as for production monitoring, we developed a dedicated LLM evaluator that identifies hallucinations for our specific case, that is, our models generating information that is not factually consistent with the input supplier text. With this hallucination evaluator, we were able to score the performance of different models and unlock key learnings and iterations. The evaluator also enables our internal team to detect and correct hallucinations in production.

After several A/B experiments, the new automated content creation feature has been fully released to all our suppliers. The activities with content generated via AI showed significantly fewer content defects and a significant increase in bookings, with only a small fraction of hallucinations that can be reviewed and corrected manually.

In this talk, we will share our long journey consisting of several training data iterations to build our fine-tuned models, the prompt engineering challenges in building our evaluator and our function call model. We will also cover the different experiments and the operational challenges in training the models and deploying the service in production.
The talk will provide some concrete ideas and tools to automate the generation of optimal content with LLMs, which is a common use case in many industries.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/AU8F9U/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/AU8F9U/feedback/</feedback_url>
            </event>
            <event guid='2e0be8e6-d8a6-5914-800c-e92fd1d986e3' id='77605' code='ZLJRNN'>
                <room>B07-B08</room>
                <title>Benchmarking 2000+ Cloud Servers for GBM Model Training and LLM Inference Speed</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T14:20:00+02:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
<abstract>Spare Cores is a Python-based, open-source, and vendor-independent ecosystem collecting, generating, and standardizing comprehensive data on cloud server pricing and performance. In our latest project, we launched 2,000+ server types across five cloud vendors to evaluate their suitability for serving Large Language Models from 135M to 70B parameters. We tested how efficiently models can be loaded into memory or VRAM, and measured inference speed across varying token lengths for prompt processing and text generation. The published data can help you find the optimal instance type for your LLM serving needs, and we will also share our experiences and challenges with the data collection and insights into general patterns.</abstract>
                <slug>berlin2025-77605-benchmarking-2000-cloud-servers-for-gbm-model-training-and-llm-inference-speed</slug>
                <track>Infrastructure - Hardware &amp; Cloud</track>
                
                <persons>
                    <person id='78719'>Gergely Daroczi</person>
                </persons>
                <language>en</language>
<description>Spare Cores is a vendor-independent, open-source, Python-based ecosystem offering a comprehensive inventory and performance evaluation of servers across cloud providers. We automate the discovery and provisioning of thousands of server types in public clouds, using GitHub Actions to run hardware inspection tools and benchmarks for different workloads, including:
- General performance (GeekBench, PassMark)
- Memory bandwidth and compression algorithms
- OpenSSL, Redis, and web serving speed
- DS/ML-specific benchmarks like GBM training and LLM inference on CPUs and GPUs

All results and open-source tools (such as database dumps, APIs, and SDKs) are openly published to help users identify and launch the most cost-efficient instance type for their specific use case in their own cloud environment.

This talk introduces the open-source ecosystem, then highlights our latest benchmarking efforts, including the performance evaluation of ~2,000 server types to determine the largest LLM (from 135M to 70B parameters) that can be loaded on each machine and the inference speeds achievable with various token lengths for prompt processing and text generation.

Slides: https://sparecores.com/assets/slides/pydata-berlin-2025.html#/cover-slide</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/ZLJRNN/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/ZLJRNN/feedback/</feedback_url>
            </event>
            <event guid='86e16a31-6177-5d69-b1ac-9b69efdf9476' id='77874' code='FPDP3E'>
                <room>B07-B08</room>
                <title>Scaling Python: An End-to-End ML Pipeline for ISS Anomaly Detection with Kubeflow and MLFlow</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T15:40:00+02:00</date>
                <start>15:40</start>
                <duration>00:30</duration>
                <abstract>Building and deploying scalable, reproducible machine learning pipelines can be challenging, especially when working with orchestration tools like Slurm or Kubernetes. In this talk, we demonstrate how to create an end-to-end ML pipeline for anomaly detection in International Space Station (ISS) telemetry data using only Python code.

We show how Kubeflow Pipelines, MLFlow, and other open-source tools enable the seamless orchestration of critical steps: distributed preprocessing with Dask, hyperparameter optimization with Katib, distributed training with PyTorch Operator, experiment tracking and monitoring with MLFlow, and scalable model serving with KServe. All these steps are integrated into a holistic Kubeflow pipeline.

By leveraging Kubeflow&apos;s Python SDK, we simplify the complexities of Kubernetes configurations while achieving scalable, maintainable, and reproducible pipelines. This session provides practical insights, real-world challenges, and best practices, demonstrating how Python-first workflows empower data scientists to focus on machine learning development rather than infrastructure.</abstract>
                <slug>berlin2025-77874-scaling-python-an-end-to-end-ml-pipeline-for-iss-anomaly-detection-with-kubeflow-and-mlflow</slug>
                <track>Infrastructure - Hardware &amp; Cloud</track>
                
                <persons>
                    <person id='78743'>Christian Geier</person>
                </persons>
                <language>en</language>
                <description>Among popular open-source MLOps tools, **Kubeflow** stands out as a Kubernetes-native platform designed to support the entire ML lifecycle, from data preprocessing to model training, deployment, and retraining. Its modular structure enables the integration of a wide range of tools, making it a highly versatile framework for building scalable and reproducible ML workflows. Despite this, most existing resources focus on individual components rather than demonstrating how these can be orchestrated into a seamless, end-to-end pipeline.

In this talk, we present a practical case study that highlights the potential of Kubeflow in a real-world application. Specifically, we showcase how an automated ML pipeline for anomaly detection in International Space Station (ISS) telemetry data can be built and deployed using Kubeflow and other open-source MLOps tools. The dataset, originating from the Columbus module of the ISS, introduces unique challenges due to its complexity and high-dimensional nature, providing an excellent testbed for MLOps workflows.

### **What makes this approach unique?**

Our workflow is built entirely in Python, leveraging Kubeflow&#8217;s Python SDK to orchestrate every stage of the pipeline. This eliminates the need for manual interaction with Kubernetes or container configurations, making the process accessible to ML engineers and data scientists without extensive DevOps expertise.
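
To illustrate the Python-first style, here is a toy Kubeflow Pipelines v2 sketch. The component bodies, names, and images are placeholders, not the actual pipeline from this talk:

```python
from kfp import compiler, dsl

@dsl.component(base_image=&quot;python:3.11&quot;)
def preprocess(raw_path: str) -&gt; str:
    # Placeholder for the Dask-based preprocessing step.
    return raw_path + &quot;.cleaned&quot;

@dsl.component(base_image=&quot;python:3.11&quot;)
def train(data_path: str) -&gt; str:
    # Placeholder for the distributed PyTorch training step.
    return &quot;model-v1&quot;

@dsl.pipeline(name=&quot;iss-anomaly-detection&quot;)
def anomaly_pipeline(raw_path: str):
    cleaned = preprocess(raw_path=raw_path)
    train(data_path=cleaned.output)

# Everything is plain Python; this compiles to a Kubernetes-ready spec.
compiler.Compiler().compile(anomaly_pipeline, &quot;pipeline.yaml&quot;)
```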

### **Key takeaways for attendees:**

*   **Tool integration:** Learn how to combine Dask for distributed preprocessing, Katib for hyperparameter optimization, PyTorch Operator for distributed training, MLFlow for experiment tracking and monitoring, and KServe for scalable model serving. These tools are orchestrated into a unified pipeline using Kubeflow Pipelines.
*   **Overcoming challenges:** Gain insights into the technical hurdles faced during the implementation of this pipeline and discover the strategies and best practices that made it possible.
*   **Real-world impact:** Understand how to apply MLOps principles to complex, real-world datasets and how these principles translate into scalable, maintainable, and reproducible workflows.

To ensure reproducibility and accessibility, the entire pipeline, including configurations and code, is publicly available in our GitHub repository [here](https://github.com/hsteude/code-ml4cps-paper). Attendees will be able to replicate the workflow, adapt it to their own use cases, or extend it with additional features.

### **Who should attend?**

This session is designed for data scientists, ML engineers, and Python enthusiasts who want to simplify the development of scalable ML pipelines. Whether you&apos;re new to Kubernetes or looking to streamline your MLOps workflows, this talk will provide actionable insights and tools to help you succeed.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/FPDP3E/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/FPDP3E/feedback/</feedback_url>
            </event>
            <event guid='3886859b-9079-56cd-a3d5-d1ce975e6873' id='77394' code='SB88M7'>
                <room>B07-B08</room>
                <title>Beyond the Black Box: Interpreting ML models with SHAP</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T16:20:00+02:00</date>
                <start>16:20</start>
                <duration>00:30</duration>
                <abstract>As machine learning models become more accurate and complex, explainability remains essential. Explainability helps not just with trust and transparency but also with generating actionable insights and guiding decision-making. One way of interpreting the model outputs is using SHapley Additive exPlanations (SHAP). In this talk, I will go through the concept of Shapley values and its mathematical intuition and then walk through a few real-world examples for different ML models. Attendees will gain a practical understanding of SHAP&apos;s strengths and limitations and how to use it to explain model predictions in their projects effectively.</abstract>
                <slug>berlin2025-77394-beyond-the-black-box-interpreting-ml-models-with-shap</slug>
                <track>Visualisation &amp; Jupyter</track>
                
                <persons>
                    <person id='78624'>Avik Basu</person>
                </persons>
                <language>en</language>
                <description>## Audience
This talk is for Data Scientists and Machine Learning Engineers at any level. Basic knowledge of machine learning is useful but not necessary.

## Objective
Attendees will learn why explainable machine learning is important and how to use and interpret SHAP values for their model.

## Details

ML models behave as black boxes in most scenarios. The model predicts or provides a certain output, but it is very difficult to generate any actionable insights directly. This is mostly because we generally have no idea which features are contributing the most to the model&apos;s behavior internally. SHAP provides a way to explain model predictions and can be an important tool in a data scientist&apos;s toolbox.

In this talk, we will begin by explaining to the audience the need for explainability and why it is essential to understand beyond what the model outputs. We will then briefly review the mathematical intuition behind Shapley values and their origins in game theory. After that, we will walk through a couple of case studies of tree-based and neural network-based models. We will be focusing on the interpretation of SHAP through various plots. Finally, we will discuss the best practices for interpreting SHAP visualizations, handling large datasets, and common pitfalls to avoid.
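
To give a flavor of the workflow, here is a minimal tree-model example on a public scikit-learn dataset (an assumption for illustration; the case studies in the talk use different data and models):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor().fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree models.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X)

shap.plots.beeswarm(shap_values)      # global view: feature impact across the data
shap.plots.waterfall(shap_values[0])  # local view: one prediction explained
```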

## Outline

- Introduction and motivation [1 min]
- Why explainability matters [5 min]
   - Problem with black box models
   - Actionable insights
- SHAP theory and intuition [5 min]
    - Shapley values
    - Game theory origins
    - SHAP
- Case study 1: Tree-based model [4 min]
    - Problem definition
    - Model output
    - SHAP visualization
      - Global plots
      - Local plots
    - Interpretation
- Case study 2: Neural Network model [8 min]
    - Problem definition
    - Model output
    - SHAP visualization
       - Global plots
       - Local plots
    - Interpretation
- Best practices and common pitfalls [4 min]
    - Interpret SHAP correctly
    - Avoid misleading explanations
    - Performance challenges for large datasets
    - Other techniques for explainability
- Q/A [3 min]</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/SB88M7/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/SB88M7/feedback/</feedback_url>
            </event>
            <event guid='5e014a5e-35d0-54fe-aab9-cbff0d7f186e' id='77039' code='VURY38'>
                <room>B07-B08</room>
                <title>Building an A/B Testing Framework with NiceGUI</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T17:00:00+02:00</date>
                <start>17:00</start>
                <duration>00:30</duration>
                <abstract>NiceGUI is a Python-based web UI framework that enables developers to build interactive web applications without using JavaScript. In this talk, I&#8217;ll share how my team used NiceGUI to create an internal A/B testing platform entirely in Python. I&#8217;ll discuss the key requirements for the platform, why we chose NiceGUI, and how it helped us design the UI, display results, and integrate with the backend. This session will demonstrate how NiceGUI simplifies development, reduces frontend complexity, and speeds up internal tool creation for Python developers.</abstract>
                <slug>berlin2025-77039-building-an-a-b-testing-framework-with-nicegui</slug>
                <track>Visualisation &amp; Jupyter</track>
                
                <persons>
                    <person id='78702'>Wessel van de Goor</person>
                </persons>
                <language>en</language>
                <description>NiceGUI is a Python-based web UI framework that enables developers to create full-featured, interactive web applications without needing to write JavaScript. 

 In this talk, I&#8217;ll share how my team and I used NiceGUI to build an internal A/B testing platform entirely in Python. A/B testing is essential for validating new features and improving user experience, and by creating a custom platform, we were able to streamline experiment management and simplify data visualization.

This talk is ideal for Python developers, data scientists, or anyone interested in creating web-based internal tools quickly. If you&apos;re looking for a solution that minimizes frontend complexity while providing a powerful framework for building interactive applications, this talk will provide valuable insights. No prior knowledge of JavaScript or frontend frameworks is necessary; familiarity with Python and basic web concepts will suffice.

After a brief introduction, I&#8217;ll first explain what A/B testing is and why it&#8217;s so crucial for making data-driven decisions. I&#8217;ll also discuss why having a custom-built platform can help improve experiment efficiency and results interpretation.

Next, I&#8217;ll dive into the key requirements we had for the platform, such as flexibility, ease of use, and seamless integration with our existing backend systems. I&#8217;ll also explain why we chose NiceGUI over other Python-based frameworks, emphasizing its ability to help us build a robust web application without the complexities of traditional frontend development.

Throughout the talk, I&#8217;ll walk through how we used NiceGUI to design the user interface, display results, and integrate with the backend. I&#8217;ll focus on the development experience, highlighting the challenges we faced and how NiceGUI&#8217;s features allowed us to make rapid progress while keeping things simple and Pythonic.

The takeaway for the audience will be understanding how NiceGUI simplifies the development of interactive web applications, focusing on internal tools like dashboards or experiment management platforms. I&#8217;ll also share the benefits we&#8217;ve experienced with the platform so far and discuss the lessons we&#8217;ve learned. Finally, I&#8217;ll explain how NiceGUI helped us create an interactive, production-ready tool with minimal frontend complexity.

This session will demonstrate, through a specific use case, how NiceGUI can be an ideal solution for Python developers looking to quickly build internal tools, reduce frontend complexity, and speed up development cycles.
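
To give a sense of how little code a NiceGUI page needs, here is a toy sketch (the experiment name and handlers are placeholders, not our actual platform code):

```python
from nicegui import ui

# A minimal page: show an experiment and let the user interact with it.
ui.label("Experiment: checkout-button-color")
with ui.row():
    ui.button("Start experiment", on_click=lambda: ui.notify("Experiment started"))
    ui.button("View results", on_click=lambda: ui.notify("Results not ready yet"))

ui.run()  # serves the app in the browser; no JavaScript required
```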

Agenda:
1. Introduction &amp; Background (5 minutes)
2. Requirements for an A/B Testing Platform (2 minutes)
3. Why We Chose NiceGUI (2 minutes)
4. How We Built It &#8211; Patterns &amp; Architecture (10 minutes)
5. Benefits and Outcomes (3 minutes)
6. Challenges and Lessons Learned (3 minutes)</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/VURY38/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/VURY38/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B05-B06' guid='aae589d8-5d0f-5d2f-8c55-720e32dc637e'>
            <event guid='8c402e0f-8d4c-5a53-a553-2401d5fe39cc' id='77590' code='KCPVYN'>
                <room>B05-B06</room>
                <title>&#128752;&#65039;&#10145;&#65039;&#129489;&#8205;&#128187;: Streamlining Satellite Data for Analysis-Ready Outputs</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:30</duration>
                <abstract>I will share how our team built an end-to-end system to transform raw satellite imagery into analysis-ready datasets for use cases like vegetation monitoring, deforestation detection, and identifying third-party activity. We streamlined the entire pipeline from automated acquisition and cloud storage to preprocessing that ensures spatial, spectral, and temporal consistency. By leveraging Prefect for orchestration, Anyscale Ray for scalable processing, and the open source STAC standard for metadata indexing, we reduced processing times from days to near real-time. We addressed challenges like inconsistent metadata and diverse sensor types, building a flexible system capable of supporting large-scale geospatial analytics and AI workloads.</abstract>
                <slug>berlin2025-77590-streamlining-satellite-data-for-analysis-ready-outputs</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78673'>Vinayak Nair</person>
                </persons>
                <language>en</language>
                <description>Satellite imagery offers powerful insights for vegetation monitoring, deforestation detection, and identifying unauthorized activity but raw data isn&#8217;t analysis-ready. In this talk, I will share how our team built a scalable, cloud-native pipeline that automates satellite data acquisition, storage, and preprocessing into consistent, analysis-ready datasets (ARDs). Designed for flexibility and growth, the system handles various sensors and formats while ensuring high data quality.

We use Prefect for workflow orchestration and Anyscale Ray for distributed processing, cutting processing times from days to near real-time. Open source SpatioTemporal Asset Catalog (STAC) standards enable robust metadata indexing, supporting fast querying and long-term interoperability. This adaptable architecture empowers fast, reliable geospatial analytics across domains.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/KCPVYN/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/KCPVYN/feedback/</feedback_url>
            </event>
            <event guid='0a691385-2c47-5a6a-bacb-3c1b6d385099' id='77898' code='8UJA37'>
                <room>B05-B06</room>
                <title>Exploring Millions of High-dimensional Datapoints in the Browser for Early Drug Discovery</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T11:20:00+02:00</date>
                <start>11:20</start>
                <duration>00:30</duration>
                <abstract>The visual exploration of large, high-dimensional datasets presents significant challenges in data processing, transfer, and rendering for engineering in various industries. This talk will explore innovative approaches to harnessing massive datasets for early drug discovery, with a focus on interactive visualizations. We will demonstrate how our team at Bayer utilizes a modern tech stack to efficiently navigate and analyze millions of data points in a high-dimensional embedding space. Attendees will gain insights into overcoming performance challenges, optimizing data rendering, and developing user-friendly tools for effective data exploration. We aim to demonstrate how these technologies can transform the way we interact with complex datasets in engineering applications and eventually allow us to find the needle in a multidimensional haystack.</abstract>
                <slug>berlin2025-77898-exploring-millions-of-high-dimensional-datapoints-in-the-browser-for-early-drug-discovery</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='79018'>Tim Tenckhoff</person><person id='78753'>Matthias Orlowski</person>
                </persons>
                <language>en</language>
                <description>From initial screening to regulatory approval, developing new drugs can take over a decade. A major bottleneck is the early-stage identification of promising compounds, a process that increasingly relies on high-throughput image-based profiling and requires researchers to sift through vast oceans of potential molecular candidates. Analyzing these large-scale, high-dimensional datasets introduces challenges in data ingestion, transformation, and visualization. Overcoming those challenges has the potential to significantly accelerate the journey from discovery to delivery, thus providing life-saving treatments to patients faster.

In this talk, we share how our team at Bayer engineered a system to navigate millions of cell-level data points in the browser. Starting with raw microscopy images, we use computer vision and deep learning models to extract morphological features. These features are aggregated into &#8220;consensus profiles&#8221; that enable robust comparisons across treatment conditions and experimental batches.
We&#8217;ll present how we automated and optimized what was previously a four-week manual workflow using a tech stack including the following (see the sketch after this list):

- Apache Airflow for orchestrating parallel processing and ensuring reproducibility
- GraphQL combined with REST for a balance of flexibility and speed in serving data
- React and Next.js for building user interfaces that support real-time interaction with millions of records
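
For the orchestration piece, a stripped-down Airflow 2-style DAG along these lines conveys the shape of the workflow (task and batch names are illustrative, not our production pipeline):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def profiling_pipeline():
    @task
    def extract_features(batch: str):
        ...  # run the CV/deep-learning feature extraction for one batch
        return batch

    @task
    def build_consensus(batches: list[str]):
        ...  # aggregate cell-level features into consensus profiles

    # Dynamic task mapping fans extraction out over batches in parallel.
    build_consensus(extract_features.expand(batch=["plate-001", "plate-002"]))

profiling_pipeline()
```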

We&#8217;ll also showcase techniques for creating accessible and performant visualizations: scatter plots, dose-response curves, dendrograms, and similarity heatmaps. These visualizations were designed for scientists who are not software developers, so particular attention was paid to usability, accessibility, and performance.

By presenting practical challenges and solutions, we will enable attendees to improve their approaches to data visualization and interaction in their own domains. We aim to convey how these technologies can transform the way we interact with complex datasets in engineering applications on a broad spectrum, empowering us with more efficient methodologies to locate the needle in a multidimensional haystack.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/8UJA37/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/8UJA37/feedback/</feedback_url>
            </event>
            <event guid='da1fc13d-f428-576f-a4a7-d3276e066ba4' id='77352' code='RQCNQV'>
                <room>B05-B06</room>
                <title>Democratizing Experimentation: How GetYourGuide Built a Flexible and Scalable A/B Testing Platform</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:30</duration>
                <abstract>At GetYourGuide, we transformed experimentation from a centralized, closed system into a democratized, self-service platform accessible to all analysts, engineers, and product teams. In this talk, we&apos;ll share our journey to empower individuals across the company to define metrics, create dimensions, and easily extend statistical methods. We&apos;ll discuss how we built a Python-based Analyzer toolkit enabling standardized, reusable calculations, and how our experimentation platform provides ad-hoc analytical capabilities through a flexible API. Attendees will gain practical insights into creating scalable, maintainable, and user-friendly experimentation infrastructure, along with access to our open-source sequential testing implementation.</abstract>
                <slug>berlin2025-77352-democratizing-experimentation-how-getyourguide-built-a-flexible-and-scalable-a-b-testing-platform</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78625'>Konrad Richter</person>
                </persons>
                <language>en</language>
                <description>Experimentation is essential for data-driven product development, but centralized experimentation systems often become bottlenecks, limiting innovation and velocity. At GetYourGuide, we faced this challenge and decided to democratize experimentation, enabling analysts and product teams across the company to define, run, and analyze experiments independently. In this session, we&apos;ll share practical insights from our journey toward democratization, focusing on technical implementation details and lessons learned.  

**From Centralized to Decentralized Experimentation**  
Initially, experimentation at GetYourGuide was centralized, limiting flexibility and slowing down decision-making. We recognized the need to empower individual contributors (ICs) by creating a self-service experimentation platform. We&apos;ll discuss the practical challenges we encountered, including managing complexity, maintaining consistency, and ensuring data quality across decentralized teams.  

**Enabling Flexible Metric and Dimension Definitions**  
To democratize experimentation effectively, we needed to empower analysts to define their own metrics and dimensions without heavy engineering involvement. We&apos;ll share how we designed a modular SQL-template approach, allowing analysts to quickly create, test, and deploy new definitions. We&apos;ll illustrate this approach with real-world examples, such as conversion rate, revenue per visitor, channel splits, and platform segmentation, demonstrating how this flexibility significantly accelerated experimentation velocity.  
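
To illustrate the idea (this is a purely hypothetical template and schema, not GetYourGuide&apos;s actual definitions), a metric might be declared as a parameterized SQL fragment:

```python
# Hypothetical SQL-template metric definition; table and column names are invented.
CONVERSION_RATE = """
SELECT
    experiment_id,
    variant,
    COUNT(DISTINCT CASE WHEN converted = 1 THEN visitor_id END) * 1.0
        / COUNT(DISTINCT visitor_id) AS conversion_rate
FROM {exposure_table}
GROUP BY experiment_id, variant
"""

query = CONVERSION_RATE.format(exposure_table="analytics.experiment_visitors")
```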

**Standardizing Statistical Calculations with the Analyzer Toolkit**
Our initial experimentation infrastructure relied heavily on Looker data models, which proved insufficient for complex statistical methods like sequential testing. To address this, we built a Python-based analysis package, the Analyzer, that standardized statistical calculations and provided reusable components. We&apos;ll explain how analysts leverage this toolkit to ensure consistency, accuracy, and extensibility of statistical methods. We&apos;ll also share how the Analyzer became a valuable resource beyond experimentation, supporting broader analytical use-cases across the organization.  

**Batch Processing and API-Driven Experiment Results**  
To ensure timely access to experiment results, we implemented a robust batch processing pipeline that pre-calculates daily experiment impressions, metrics, and dimensions. Additionally, we developed a flexible API layer to enable analysts to retrieve specific experiment results dynamically, without waiting for scheduled batch jobs. We&apos;ll discuss the technical architecture behind this dual approach, highlighting how it balances efficiency, reliability, and flexibility.  

**Key Lessons and Takeaways**  
Attendees will leave this session with practical insights into:
* Democratizing experimentation to accelerate innovation and velocity.
* Best practices for designing flexible, scalable, and maintainable experimentation infrastructure.
* Technical strategies for enabling self-service metric/dimension definitions, standardized statistical calculations, and extensible analytical capabilities.  
  
We&apos;ll conclude by briefly outlining our future plans, including additional discriminators, advanced statistical methods, and further UI enhancements aimed at continuous democratization.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://cfp.pydata.org/media/berlin2025/submissions/RQCNQV/resources/Democratizing_E_smpmEED.pdf">Slides</attachment>
                </attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/RQCNQV/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/RQCNQV/feedback/</feedback_url>
            </event>
            <event guid='e0b37a7a-8e14-5e4f-b020-e782afe4048d' id='77020' code='LZYBVH'>
                <room>B05-B06</room>
                <title>The EU AI Act: Unveiling Lesser-Known Aspects, Implementation Entities, and Exemptions</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>00:30</duration>
                <abstract>The EU AI Act is already partly in effect which prohibits certain AI systems. After going through the basics, we cover some of the less talked about aspects of the Act, introducing entities involved in its implementation and how many high risk government and law enforcement use cases are excluded!</abstract>
                <slug>berlin2025-77020-the-eu-ai-act-unveiling-lesser-known-aspects-implementation-entities-and-exemptions</slug>
                <track>Ethics &amp; Privacy</track>
                
                <persons>
                    <person id='78828'>Adrin Jalali</person>
                </persons>
                <language>en</language>
                <description>The EU AI Act is a groundbreaking regulatory framework, partly in effect, designed to govern AI systems based on their perceived risk. This talk provides an overview of the basics and explores lesser-discussed aspects of the Act, such as the entities involved in its implementation, the role of the private sector, and notable exemptions for high-risk government and law enforcement use cases.

The AI Act categorizes AI systems into different groups based on their potential harm. The two most notable are the unacceptable-risk and high-risk groups. Social scoring systems, systems that subliminally manipulate behavior, and mass CCTV facial recognition systems are among the prohibited, unacceptable-risk group.

On the other hand, high-risk systems, including biometric identification systems, AI systems used in education and vocational training, and employment and worker management systems, must meet stringent obligations before entering the market.

Surprisingly, the AI Act excludes many high-risk government and law enforcement use cases. AI systems used for national security, defense, and law enforcement tasks like border control, crime prevention, and criminal investigations are largely exempt. These exemptions aim to preserve public security and Member States&apos; sovereignty but raise concerns about potential AI misuse in these sensitive areas. For instance, predictive policing tools, though controversial, fall outside the AI Act&apos;s scope.

Additionally, the AI Act will not apply to AI systems used as research or development tools or to systems developed or used exclusively for military purposes. This leaves a substantial gap in the regulation of high-risk AI systems, emphasizing the need for complementary safeguards.

One of the less talked about aspects is the complex ecosystem of entities involved in the AI Act&apos;s implementation. The European Artificial Intelligence Board is the Act&apos;s central hub, comprising representatives from each national supervisory authority, the European Data Protection Supervisor, and the Commission. The board will issue opinions and recommendations to ensure the AI Act&apos;s consistent application. National supervisory authorities, such as data protection agencies, will oversee the Act&apos;s enforcement, exchanging information through the board. The European Commission will facilitate cooperation among national authorities and with international organizations.

When it comes to verifying submitted documents and claimed [lack] of high-risk status, each Member State will establish entities called notifying bodies to assess and certify notified bodies. Notified bodies are conformity assessment bodies accredited to evaluate high-risk AI systems; they are a space where the private sector and startups can grow and engage with the regulatory bodies. They will play a crucial role in ensuring high-risk AI systems conform to the AI Act&apos;s requirements.

Moreover, the AI Act introduces AI regulatory sandboxes, temporary experimental spaces allowing developers to test innovative AI systems under regulatory supervision. National competent authorities will establish and monitor these sandboxes, fostering innovation while minimizing risks. The private sector can engage with these sandboxes, creating opportunities for startups and established companies to develop and test their new systems.

In conclusion, the EU AI Act is a comprehensive regulatory framework that establishes a complex ecosystem of implementation entities and offers opportunities for private sector engagement. However, it also presents notable exemptions for high-risk government and law enforcement use cases, sparking debates about its scope and effectiveness. Understanding these lesser-known aspects is crucial for navigating the AI Act&apos;s regulatory landscape and fostering responsible AI innovation.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/LZYBVH/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/LZYBVH/feedback/</feedback_url>
            </event>
            <event guid='93a2cebb-0afb-53d9-961e-405a5168f30a' id='77727' code='JE8YJT'>
                <room>B05-B06</room>
                <title>What&#8217;s Really Going On in Your Model? A Python Guide to Explainable AI</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T14:20:00+02:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>As machine learning models become more complex, understanding why they make certain predictions is becoming just as important as the predictions themselves. Whether you&apos;re dealing with business stakeholders, regulators, or just debugging unexpected results, the ability to explain your model is no longer optional , it&apos;s essential.

In this talk, we&apos;ll walk through practical tools in the Python ecosystem that help bring transparency to your models, including SHAP, LIME, and Captum. Through hands-on examples, you&apos;ll learn how to apply these libraries to real-world models from decision trees to deep neural networks and make sense of what&apos;s happening under the hood.

If you&apos;ve ever struggled to explain your model&#8217;s output or justify its decisions, this session will give you a toolkit to build more trustworthy, interpretable systems without sacrificing performance.</abstract>
                <slug>berlin2025-77727-what-s-really-going-on-in-your-model-a-python-guide-to-explainable-ai</slug>
                <track>Ethics &amp; Privacy</track>
                
                <persons>
                    <person id='78681'>Yashasvi Misra</person>
                </persons>
                <language>en</language>
                <description>We&#8217;ve all been there, your machine learning model performs well in testing, but when it comes time to explain why it made a specific prediction, things get murky. In many real-world applications, especially in domains like healthcare, finance, or operations, being able to explain your model isn&#8217;t just helpful it&#8217;s critical.This talk is a practical walkthrough of explainable AI (XAI) tools in Python, aimed at data scientists and engineers who want to make their models more transparent and trustworthy. We&#8217;ll cover libraries like SHAP, LIME, and Captum, and show how to use them to generate both local and global explanations for models ranging from random forests to deep neural nets.You&#8217;ll see hands-on examples, common pitfalls to avoid, and ideas for integrating interpretability into your workflow whether you&#8217;re trying to debug your model or justify its predictions to a non-technical stakeholder.If you&#8217;ve ever wanted to better understand your own models or help others trust them this session is for you.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/JE8YJT/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/JE8YJT/feedback/</feedback_url>
            </event>
            <event guid='5996ca5a-eeef-5d1b-a5f0-6e19e5b1d5f6' id='77512' code='GW9EXL'>
                <room>B05-B06</room>
                <title>Consumer Choice Models with PyMC Marketing</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T16:20:00+02:00</date>
                <start>16:20</start>
                <duration>00:30</duration>
                <abstract>Consumer choice models are an important part of product innovation and market strategy. In this talk we&apos;ll see how they can be used to learn about substitution goods and market shares in competitive markets using PyMC marketing&apos;s new consumer choice module.</abstract>
                <slug>berlin2025-77512-consumer-choice-models-with-pymc-marketing</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='78676'>Nathaniel Forde</person>
                </persons>
                <language>en</language>
                <description>The market sets the price, but what drives market demand? Classical implementations of discrete choice models discovered that market structure needed to be explicitly encoded in the model to avoid the problem of implausible predictions about the substitution value of distinct products. We demonstrate this issue and how to resolve it by adding more explicit structure to the models of market demand while giving insight into what drives the utility of products for consumers. These consumer choice models find a natural expression in the Bayesian paradigm and we show how to fit them to real data with PyMC Marketing&apos;s Consumer Choice module.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GW9EXL/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GW9EXL/feedback/</feedback_url>
            </event>
            <event guid='5b0fc728-d7ca-5418-b5fa-7a9203a200b5' id='77527' code='3XMJM3'>
                <room>B05-B06</room>
                <title>Risk Budget Optimization for Causal&#8239;Mix&#8239;Models</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-01T17:00:00+02:00</date>
                <start>17:00</start>
                <duration>00:30</duration>
                <abstract>Traditional budget planners chase the highest predicted return and hope for the best.&#8239;Bayesian models take the opposite route: they quantify uncertainty first, then let us optimize budgets with that uncertainty fully on display.&#8239;In this talk we&#8217;ll show how posterior distributions become a set of possible futures, and how risk&#8209;aware loss functions convert those probabilities into spend decisions that balance upside with resilience.&#8239;Whether you lead marketing, finance, or product, you&#8217;ll learn a principled workflow for turning probabilistic insight into capital allocation that&#8217;s both aggressive and defensible&#8212;no black&#8209;box magic, just transparent Bayesian reasoning and disciplined risk management.</abstract>
                <slug>berlin2025-77527-risk-budget-optimization-for-causal-mix-models</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='78793'>Carlos Trujillo</person>
                </persons>
                <language>en</language>
                <description>Budget planning often treats forecasts as fixed targets, leaving decision&#8209;makers blind to the volatility hiding beneath the averages.&#8239;This talk shows how Bayesian modelling turns every unknown&#8212;channel response, cost elasticity, future demand&#8212;into an explicit probability distribution.&#8239;By simulating thousands of plausible futures, we can measure upside and downside simultaneously and translate a company&#8217;s risk appetite into clear optimisation objectives such as Value&#8209;at&#8209;Risk, Conditional VaR, entropic risk, or custom utility functions that respect budget caps and pacing rules.

Using reproducible PyMC code, we will walk through converting posterior samples into risk&#8209;aware spend recommendations, and visualising trade&#8209;offs so non&#8209;technical stakeholders grasp both opportunity and exposure.
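
As a minimal illustration of the core idea, assume we already have posterior samples of the return under a candidate budget; tail-risk measures then fall out in a few lines (the samples below are simulated stand-ins for a fitted model&#8217;s posterior predictive):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for posterior samples of return under one budget allocation.
returns = rng.normal(loc=1.08, scale=0.15, size=10_000)

alpha = 0.05
var = np.quantile(returns, alpha)                     # Value-at-Risk (5% level)
tail = np.sort(returns)[: int(alpha * returns.size)]  # the worst 5% of futures
cvar = tail.mean()                                    # Conditional VaR
print(f"VaR: {var:.3f}  CVaR: {cvar:.3f}")
```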

Attendees will leave with a notebook and code to adapt PyMC Bayesian models with PyMC-Marketing for marketing budgets, capital allocation, or any scenario where uncertainty and risk tolerance must shape financial decisions.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/3XMJM3/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/3XMJM3/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='2' date='2025-09-02' start='2025-09-02T04:00:00+02:00' end='2025-09-03T03:59:00+02:00'>
        <room name='Kuppelsaal' guid='a413bdae-4730-5a9d-8aa1-045579ce1087'>
            <event guid='1c481ea6-949e-57b7-bb57-3509d8aecfcd' id='77228' code='JKEHMH'>
                <room>Kuppelsaal</room>
                <title>Narwhals: enabling universal dataframe support</title>
                <subtitle></subtitle>
                <type>Keynote</type>
                <date>2025-09-02T09:10:00+02:00</date>
                <start>09:10</start>
                <duration>01:00</duration>
                <abstract>Ever tried passing a Polars Dataframe to a data science library and found that it...just works? No errors, no panics, no noticeable overhead, just...results? This is becoming increasingly common in 2025, yet only 2 years ago, it was mostly unheard of. So, what changed? A large part of the answer is: Narwhals.

Narwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, pandas, PySpark, DuckDB, and more! And it&apos;s not just a theoretical possibility: with ~30 million monthly downloads and set as a required dependency of Altair, Bokeh, Marimo, Plotly, Shiny, and more, it&apos;s clear that it&apos;s reshaping the data science landscape. By the end of the talk, you&apos;ll understand why writing generic dataframe code was such a headache (and why it isn&apos;t anymore), how Narwhals works and how its community operates, and how you can use it in your projects today. The talk will be technical yet accessible and light-hearted.</abstract>
                <slug>berlin2025-77228-narwhals-enabling-universal-dataframe-support</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='79021'>Marco Gorelli</person>
                </persons>
                <language>en</language>
                <description>Narwhals is a lightweight and extensible compatibility layer between dataframe libraries. It is already used by several major open source libraries including Altair, Bokeh, Marimo, Plotly, and more. You will learn how to use Narwhals to build dataframe-agnostic tools, how Narwhals gained traction in a short amount of time, and what the future of dataframes looks like.

This is a technical talk, and basic familiarity with Python and dataframes will be assumed. We will cover:

* What the data science landscape looked like in 2024 before Narwhals came onto the scene.
* What problems Narwhals solves, why you can&apos;t &quot;just convert to pandas&quot; or &quot;just use PyArrow&quot;.
* How to use Narwhals, with an emphasis on lazy-only computation.
* Static typing.
* Narwhals and SQL.
* Extending Narwhals with your own backend.
* The Narwhals community, and how you can get involved.
* What we think the future of dataframes looks like, and how you can help make it happen.

Tool builders will learn how to build tools for modern dataframe libraries without sacrificing support for foundational classic libraries such as pandas. Data scientists will learn about what goes on under the hood when their favourite tools support their favourite dataframe libraries. Finally, everyone will learn from insights on community building and management.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/JKEHMH/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/JKEHMH/feedback/</feedback_url>
            </event>
            <event guid='fe6a24f3-ff8a-54ed-986d-c08d003acb91' id='78371' code='3BVEKT'>
                <room>Kuppelsaal</room>
                <title>Lightning Talks</title>
                <subtitle></subtitle>
                <type>Plenary Session [Organizers]</type>
                <date>2025-09-02T16:45:00+02:00</date>
                <start>16:45</start>
                <duration>00:45</duration>
                <abstract>Lightning Talks are short, 5-minute presentations open to all attendees. They&#8217;re a fun and fast-paced way to share ideas, showcase projects, spark discussions, or raise awareness about topics you care about &#8212; whether technical, community-related, or just inspiring.

No slides are required, and talks can be spontaneous or prepared. It&#8217;s a great chance to speak up and connect with the community!</abstract>
                <slug>berlin2025-78371-lightning-talks</slug>
                <track>Lightning Talks</track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>&#9889; Lightning Talk Rules

- No promotion for products or companies.
- No call for &apos;we are hiring&apos; (but you may name your employer).
- One LT per person per conference.

Community Event Announcements

- &#9201; You want to announce a community event? You have ONE minute.
- All event announcements will be collected in a single slide deck; see instructions at the Lightning Talk desk in the Community Space in the Lounge on Level 1.

All other LTs:

- &#9201; You have exactly 5 minutes. The clock starts when you start &#8212; and ends when time&#8217;s up. That&#8217;s the thrill of Lightning Talks &#9889;
- &#127919; Be sharp, clear, and fun. Introduce your idea, make your point, give the audience something to remember. No pressure. (Okay, maybe a little.)
- &#128013; Keep it relevant to Python, PyData and the community. You can go broad &#8212; tools, workflows, stories, experiments &#8212; as long as there&#8217;s some connection to Python, PyData or the community.
- &#128079; Keep it respectful. Keep it awesome. Humor is welcome, but please be kind, inclusive, and professional.
- &#127908; Be ready when your name is called. We&#8217;re running a tight session &#8212; speakers go on stage rapid-fire. Stay close and stay hyped.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/3BVEKT/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/3BVEKT/feedback/</feedback_url>
            </event>
            <event guid='11244ab6-85c7-58d1-9b52-1aae46f4742b' id='80783' code='URGKYN'>
                <room>Kuppelsaal</room>
                <title>PyLadies &amp; Empowered in Tech Social Event @Hofbr&#228;u Wirtshaus</title>
                <subtitle></subtitle>
                <type>Social Event</type>
                <date>2025-09-02T18:00:00+02:00</date>
                <start>18:00</start>
                <duration>01:00</duration>
                <abstract>Social event organized by PyLadies &amp; Empowered in Tech

Location: Hofbr&#228;u Wirtshaus, Karl-Liebknecht-Str. 30, 10178 Berlin

We&#8217;ll meet outside the BCC at 18:00</abstract>
                <slug>berlin2025-80783-pyladies-empowered-in-tech-social-event-hofbrau-wirtshaus</slug>
                <track>Community &amp; Diversity</track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>**PyLadies** is an international mentorship group with a focus on helping more women and gender non-conforming people become active participants and leaders in the Python open-source community. Its mission is to promote, educate and advance a diverse Python community through outreach, education, conferences, events and social gatherings.

---

**Empowered in Tech** is a community in Berlin dedicated to empowering FLINTA (women, lesbians, intersex, non-binary, trans and agender) people to excel in their tech journey. We welcome engineers, software developers, data scientists, designers, product managers, career changers and other professionals in the tech industry. We are open to all tech stacks, programming languages and experience levels. Our goal is to support our members in growing their careers, connecting with like-minded people and feeling welcome in tech.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/URGKYN/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/URGKYN/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B09' guid='844f8596-e84f-5029-b709-8892c0fca5c3'>
            <event guid='6a38ff69-24e1-5e06-adce-b64e59d4c5b4' id='77921' code='GBVFJ8'>
                <room>B09</room>
                <title>Probably Fun: Games to teach Machine Learning</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-02T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>01:30</duration>
                <abstract>In this tutorial, you will play several games that can be used to teach machine learning concepts. Each game can be played in big and small groups. Some involve hands- on material such as cards, some others involve electronic app. All games contain one or more concepts from Machine Learning.

As an outcome, you will take away multiple ideas that make complex topics more understandable &#8211; and enjoyable. By doing so, we would like to demonstrate that Machine Learning does not require computers: the core ideas can be illustrated in a clear and memorable way without them. We also would like to demonstrate that gamification is not limited to online quiz questions, but offers ways for learners to bond.

We will bring a set of carefully selected games that have been proven in a big classroom setting and contain useful abstractions of linear models, decision trees, LLMs and several other Machine Learning concepts. We also believe that it is probably fun to participate in this tutorial.</abstract>
                <slug>berlin2025-77921-probably-fun-games-to-teach-machine-learning</slug>
                <track>Education, Career &amp; Life</track>
                
                <persons>
                    <person id='78795'>Dr. Kristian Rother</person><person id='78801'>Shreyaasri Prakash</person>
                </persons>
                <language>en</language>
                <description>Board gaming has recently been declared part of the immaterial cultural heritage in Germany by UNESCO. Games encourage people to use their brains in a focused, constructive and peaceful way. This makes games a fantastic tool in the classroom. While many games contain algorithms and statistical models right under the surface, finding an actual model of Machine Learning is a bit harder. We have put some thought into creating or finding games that have a clear connection to Machine Learning.

We have conducted a tutorial featuring board games at PyConDE 2025. This time, we have upped the ante and moved the focus from statistics to Machine Learning. At PyData Berlin we also expect a particular challenge: we do not expect a room with tables for 80+ people. Therefore, we chose game mechanics that work with minimal material and scale up to big groups. As a consequence, the games are easier to adapt to a larger class, such as university courses and seminars. We also take care to limit the time a game requires. In a classroom situation this allows using the game as a priming exercise that can be followed up with theory and/or practical exercises using computers and programming.

The tutorial will be executed according to the following pseudocode (or lesson plan):

1. Game #1 is played in a plenary (5 min)
2. The presenters give a short introduction on why games matter (5 min)
3. The audience is randomly sampled into teams of 6 (2 min)
4. Game #2 is played in the teams in a cooperative manner (15 min)
5. Game #3 is played in the teams in a cooperative manner (15 min)
6. Game #4 is played with the teams competing against each other (20 min)
7. Winners are determined and applauded (5 min)
8. Game #5 is played in the plenary again (10 min)
9. Q &amp; A and wrap-up (10 min)

One of the presenters is certified as a board game educator &quot;Fachkraft f&#252;r Gesellschaftsspiele&quot; by the Brettspielakademie (https://brettspielakademie.de/).

The games and lessons have been field-tested with university courses and are made available under a Creative Commons license. You are free to reuse or modify them for your own teaching. Several games (mostly on statistics) and sample lesson plans are available on https://www.academis.eu/probably_fun/ .</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GBVFJ8/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GBVFJ8/feedback/</feedback_url>
            </event>
            <event guid='57f34947-5053-51e5-a4c7-8d6c21ffae07' id='77707' code='W9Q7JY'>
                <room>B09</room>
                <title>Deep Dive into the Synthetic Data SDK</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-02T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>01:30</duration>
                <abstract>In January the Synthetic Data SDK was introduced and it quickly is gaining traction as becoming the standard Open Source library for creating privacy-preserving synthetic data. In this hands-on tutorial we&apos;re going beyond the basics and we&apos;ll look at many of the advanced features of the SDK including differential privacy, conditional generation, multi-tables, and fair synthetic data.</abstract>
                <slug>berlin2025-77707-deep-dive-into-the-synthetic-data-sdk</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78814'>Tobias Hann</person>
                </persons>
                <language>en</language>
                <description>This hands-on tutorial will take participants beyond the basics of the Synthetic Data SDK, the emerging open-source standard for creating privacy-preserving synthetic data.

After a brief recap of the SDK&#8217;s core capabilities, the session will dive into advanced functionality, beginning with an in-depth exploration of differential privacy. Attendees will learn how the SDK integrates formal privacy guarantees, configure key parameters (i.e., epsilon and delta), and observe the trade-offs between privacy and utility through live examples.

The session will then focus on conditional generation, demonstrating how users can guide synthetic data output based on specific constraints or target values - an essential feature for scenario testing and AI model validation.

A dedicated section will cover multi-table synthesis, where participants will learn how to model and generate relational datasets with primary-foreign key dependencies, preserving structural and statistical integrity across multiple linked tables.

Finally, the tutorial will introduce the concept of fair synthetic data, showing how the SDK supports data generation aligned with the principle of statistical parity to help reduce representational bias in downstream use cases.

Each segment includes interactive coding exercises and real-world datasets to ensure practical understanding. Participants should have a working knowledge of Python and prior experience with the SDK or similar tools.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/W9Q7JY/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/W9Q7JY/feedback/</feedback_url>
            </event>
            <event guid='fd7d6eca-7e17-5b4a-a0e1-fe9d2e459421' id='77956' code='ZXTLEW'>
                <room>B09</room>
                <title>Forget the Cloud: Building Lean Batch Pipelines from TCP Streams with Python and DuckDB</title>
                <subtitle></subtitle>
                <type>Talk (long)</type>
                <date>2025-09-02T15:50:00+02:00</date>
                <start>15:50</start>
                <duration>00:45</duration>
                <abstract>Many industrial and legacy systems still push critical data over TCP streams. Instead of reaching for heavyweight cloud platforms, you can build fast, lean batch pipelines on-prem using Python and DuckDB.

In this talk, you&apos;ll learn how to turn raw TCP streams into structured data sets, ready for analysis, all running on-premise. We&apos;ll cover key patterns for batch processing, practical architecture examples, and real-world lessons from industrial projects.

If you work with sensor data, logs, or telemetry, and you value simplicity, speed, and control, this talk is for you.</abstract>
                <slug>berlin2025-77956-forget-the-cloud-building-lean-batch-pipelines-from-tcp-streams-with-python-and-duckdb</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='79012'>Orell Garten</person>
                </persons>
                <language>en</language>
                <description>Cloud-native tools are everywhere. But not every system can or should move to the cloud.

In many industries like manufacturing, logistics, or energy, TCP streams remain the backbone of real-time data exchange. These systems are often on-premise, resource-constrained, and mission-critical.

This talk shows how you can build lean, powerful batch pipelines from TCP-stream source data using Python and DuckDB, all without the complexity of cloud services.
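
The heart of the pattern fits in a screenful of Python; here is a deliberately simplified sketch (port, schema, and batch size are illustrative):

```python
import socket
import duckdb

con = duckdb.connect("telemetry.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS readings (raw VARCHAR)")

# Assumed setup: a device pushing newline-delimited records to port 9000.
srv = socket.create_server(("0.0.0.0", 9000))
conn, _ = srv.accept()
buf = b""
batch = []
while True:
    data = conn.recv(4096)
    if not data:
        break
    buf += data
    *lines, buf = buf.split(b"\n")  # keep any partial record in the buffer
    batch.extend(line.decode() for line in lines)
    if len(batch) >= 1000:  # flush in batches, not per message
        con.executemany("INSERT INTO readings VALUES (?)", [(r,) for r in batch])
        batch.clear()
```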

We&apos;ll cover:

- Why TCP streams still matter
- Stream vs. Batch: Choosing the right model for industrial data
- Pipeline architecture: From streams to batch
- DuckDB + Python: The perfect combo for lightweight analytics
- Key pitfalls along the way
- Limitations of this approach


You&apos;ll walk away with:

- Ready-to-use patterns for TCP-based data pipelines
- Insights on when to avoid unnecessary cloud complexity
- Tips for building fast, reliable batch jobs on local infrastructure

Whether you process factory sensor data, machine logs, or legacy telemetry, this talk will give you modern tools to make your data streams actionable and efficient.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/ZXTLEW/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/ZXTLEW/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B07-B08' guid='e7ecef66-8ce7-51e4-9629-123a47fb4391'>
            <event guid='c9b651a9-f1ee-55d1-af74-3b020e433b06' id='78332' code='KPHH7H'>
                <room>B07-B08</room>
                <title>Training Specialized Language Models with Less Data: An End-to-End Practical Guide</title>
                <subtitle></subtitle>
                <type>Talk [Sponsored]</type>
                <date>2025-09-02T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:30</duration>
                <abstract>Small Language Models (SLMs) offer an efficient and cost-effective alternative to LLMs&#8212;especially when latency, privacy, inference costs or deployment constraints matter. However, training them typically requires large labeled datasets and is time-consuming, even if it isn&apos;t your first rodeo.

This talk presents an end-to-end approach for curating high-quality synthetic data using LLMs to train domain-specific SLMs. Using a real-world use case, we&#8217;ll demonstrate how to reduce manual labeling time, cut costs, and maintain performance&#8212;making SLMs viable for production applications.

Whether you are a seasoned Machine Learning Engineer or a person just getting started with building AI features, you will come away with the inspiration to build more performant, secure and environmentally-friendly AI systems.</abstract>
                <slug>berlin2025-78332-training-specialized-language-models-with-less-data-an-end-to-end-practical-guide</slug>
                <track>Natural Language Processing &amp; Audio (incl. Generative AI NLP)</track>
                
                <persons>
                    <person id='79424'>Jacek Golebiowski</person>
                </persons>
                <language>en</language>
                <description>Training effective language models typically involves two major bottlenecks: the need for vast amounts of labeled data and the engineering complexity of fine-tuning. This talk introduces a practical framework for addressing both, enabling teams to build small, domain-specialized language models (SLMs) that are deployable, secure, and cost-efficient&#8212;without needing massive labeled datasets.

SLMs are especially well-suited for focused tasks such as classification, function calling, or question answering, where full-scale LLMs are overkill. They are smaller, faster, and easier to deploy on local or mobile infrastructure&#8212;making them ideal for latency-sensitive, privacy-conscious, or resource-limited applications. However, fine-tuning them still traditionally requires manually labeled data in the tens of thousands.

Our approach uses synthetic data generation and validation techniques to drastically reduce the labeling burden. Leveraging large language models (LLMs) as &#8220;teacher models,&#8221; we generate and curate synthetic training data tailored to specific tasks. This data, combined with a handful of manually labeled examples and a clear task description, is then used to fine-tune SLMs (&#8220;student models&#8221;) that match or exceed the performance of larger models on the same narrow tasks.
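
As a deliberately simplified sketch of the &#8220;teacher&#8221; step (assuming an OpenAI-style client and a toy classification task; the prompt and labels are illustrative):

```python
from openai import OpenAI

client = OpenAI()
PROMPT = (
    "Write one short customer-support message and label it as exactly one of "
    "billing, shipping, or refund. Answer in the form label: message"
)

examples = []
for _ in range(100):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # high temperature encourages diverse examples
    )
    label, _, text = resp.choices[0].message.content.partition(":")
    # Validation step: drop malformed or off-label generations.
    if label.strip() in {"billing", "shipping", "refund"} and text.strip():
        examples.append({"label": label.strip(), "text": text.strip()})
```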

We&apos;ll walk through a detailed example focused on a real-life use case covering:
- Task scoping: How to define your model&#8217;s purpose and output space clearly.
- Synthetic data generation: Prompting LLMs to generate meaningful and diverse examples.
- Data validation: Techniques for filtering out poor-quality, duplicate, or malformed synthetic data.
- Model fine-tuning: How the student model is trained to emulate the teacher&#8217;s domain knowledge.
- Deployment: Delivering the model as binaries for use on internal infrastructure or edge devices.

We&#8217;ll also discuss key challenges teams face in adopting this approach&#8212;such as validation bottlenecks, overfitting on synthetic data, and the need for interpretable task definitions&#8212;and how we&#8217;ve addressed them in production environments.

This talk is targeted at data scientists, ML engineers, and tech leads who are looking for pragmatic strategies to bring specialized AI features into production without relying on API-based LLMs or manual annotation at scale. No prior knowledge of model distillation is required, though basic familiarity with supervised learning and model training will be helpful.

Attendees will leave with:
- A concrete workflow for training SLMs using synthetic data
- Insights into trade-offs between SLMs and LLMs
- Techniques for validating and curating LLM-generated data
- A better understanding of when and how to deploy small models effectively in production

This is not a theoretical talk. It is a field-tested approach grounded in real use cases, designed to empower small teams to build efficient, private, and reliable NLP systems.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/KPHH7H/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/KPHH7H/feedback/</feedback_url>
            </event>
            <event guid='a1132f13-93ad-5bbe-ac39-ac56fc293db5' id='77882' code='CAUAZY'>
                <room>B07-B08</room>
                <title>Most AI Agents Are Useless. Let&#8217;s Fix That</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T11:20:00+02:00</date>
                <start>11:20</start>
                <duration>00:30</duration>
                <abstract>AI agents are having a moment, but most of them are little more than fragile prototypes that break under pressure. Together, we&#8217;ll explore why so many agentic systems fail in practice, and how to fix that with real engineering principles. In this talk, you&#8217;ll learn how to build agents that are modular, observable, and ready for production. If you&#8217;re tired of LLM demos that don&#8217;t deliver, this talk is your blueprint for building agents that actually work.</abstract>
                <slug>berlin2025-77882-most-ai-agents-are-useless-let-s-fix-that</slug>
                <track>Natural Language Processing &amp; Audio (incl. Generative AI NLP)</track>
                
                <persons>
                    <person id='78674'>Bilge Y&#252;cel</person>
                </persons>
                <language>en</language>
                <description>Let&#8217;s face it: most AI agents are glorified demos. They look flashy, but they&#8217;re brittle, hard to debug, and rarely make it into real products. Why? Because wiring an LLM to a few tools is easy. Engineering a robust, testable, and scalable system is hard.

This talk is for practitioners, data scientists, AI engineers, and developers who want to stop tinkering and start shipping. We&#8217;ll take a candid look at the common reasons agent systems fail and introduce practical patterns to fix them using Haystack, an open-source Python framework to build custom AI applications.
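
To ground the discussion, the modular wiring Haystack encourages looks like this minimal (non-agentic) sketch; the prompt and model choice are illustrative:

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Each component is a replaceable, inspectable unit; connections are explicit.
pipe = Pipeline()
pipe.add_component("prompt", PromptBuilder(template="Answer briefly: {{question}}"))
pipe.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))
pipe.connect("prompt", "llm")

result = pipe.run({"prompt": {"question": "What makes an agent observable?"}})
print(result["llm"]["replies"][0])
```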

You&#8217;ll learn how to design agents that are:

- **Modular**, so they&#8217;re easy to extend and evolve
- **Observable**, so you can trace failures and understand the behavior
- **Maintainable**, so they don&#8217;t become one-off science projects

We&#8217;ll also cover advanced topics like multimodal inputs and Model Context Protocol (MCP) to push your agents into more capable territory.

Whether you&#8217;re just starting to explore agents or trying to tame an unruly prototype, you&#8217;ll leave with a clear, actionable blueprint to build something that&#8217;s not just smart, but also reliable.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/CAUAZY/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/CAUAZY/feedback/</feedback_url>
            </event>
            <event guid='8ffd3416-6807-5609-bcbd-f3e7c4d84bdf' id='77791' code='NUNXEV'>
                <room>B07-B08</room>
                <title>One API to Rule Them All? LiteLLM in Production</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:30</duration>
                <abstract>Using LiteLLM in a Real-World RAG System: What Worked and What Didn&#8217;t

LiteLLM provides a unified interface to work with multiple LLM providers&#8212;but how well does it hold up in practice? In this talk, I&#8217;ll share how we used LiteLLM in a production system to simplify model access and handle token budgets. I&#8217;ll outline the benefits, the hidden trade-offs, and the situations where the abstraction helped&#8212;or got in the way. This is a practical, developer-focused session on integrating LiteLLM into real workflows, including lessons learned and limitations. If you&#8217;re considering LiteLLM, this talk offers a grounded look at using it beyond simple prototypes.</abstract>
                <slug>berlin2025-77791-one-api-to-rule-them-all-litellm-in-production</slug>
                <track>Generative AI</track>
                
                <persons>
                    <person id='78592'>Alina Dallmann</person>
                </persons>
                <language>en</language>
                <description>Building a real-world LLM system often means juggling different providers, endpoints, and API quirks. LiteLLM promises a unified interface across model backends&#8212;but how well does it hold up in production?

In this talk, I&#8217;ll share how we integrated LiteLLM into a real-world system that includes budget usage tracking and other production concerns. From provider switching to budget handling, I&#8217;ll walk through the benefits we saw&#8212;and the challenges we hit. I&#8217;ll also touch on the limits of abstraction. You&#8217;ll get a practical look at where LiteLLM helped us and where it didn&#8217;t.
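
For context, this is the shape of LiteLLM&#8217;s unified interface (a minimal sketch; the model names are just examples):

```python
import litellm

# The call shape stays the same across providers; the model string
# selects the backend.
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)

# Switching providers is a one-line change, e.g.:
# litellm.completion(model="anthropic/claude-3-haiku-20240307", messages=...)
```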

**Key Takeaways**
- Understand how LiteLLM can be used to unify access to multiple LLM providers
- Learn how it fits into a real production pipeline (especially budget management and model load balancing)


**Target Audience**
- Developers and engineers working with LLMs in production
- Anyone curious about LiteLLM&#8217;s strengths and limitations in a real system</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/NUNXEV/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/NUNXEV/feedback/</feedback_url>
            </event>
            <event guid='5970bd5f-7b65-5969-959d-f15c0dba3b00' id='77541' code='BCGJQB'>
                <room>B07-B08</room>
                <title>Scaling Probabilistic Models with Variational Inference</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>00:30</duration>
                <abstract>This talk presents variational inference as a tool to scale probabilistic models. We describe practical examples with NumPyro and PyMC to demonstrate this method, going through the main concepts and diagnostics. Instead of going heavy into the math, we focus on the code and practical tips to make this work in real industry applications.</abstract>
                <slug>berlin2025-77541-scaling-probabilistic-models-with-variational-inference</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                <logo>/media/berlin2025/submissions/BCGJQB/numpyro_hierarchical_fore_LHObegf.png</logo>
                <persons>
                    <person id='78571'>Dr. Juan Orduz</person>
                </persons>
                <language>en</language>
                <description>Probabilistic models have proven to be a great tool for solving business-critical problems in fields such as marketing, demand forecasting, and risk-based optimization. One of the biggest challenges is scaling these models to large data sets and efficiently utilizing modern computing power. 

This talk addresses the challenges of scaling probabilistic models using variational inference and similar methods. We will explain the core concepts of variational inference in an accessible way, avoiding heavy mathematics, and use practical examples with NumPyro and PyMC to demonstrate how to apply it effectively, starting with simple models and then moving on to custom forecasting models and neural network components. Additionally, we will cover diagnostics such as simulation-based calibration and coverage to ensure model reliability. Our discussion will also include strategies for scaling, including mini-batch optimization and distributed computing.
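
As a flavour of the code-first examples, here is a minimal NumPyro SVI sketch (a toy model for illustration, not one of the talk&apos;s case studies):

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal

def model(y=None):
    # Toy model: infer the mean of noisy observations.
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    numpyro.sample("obs", dist.Normal(mu, 1.0), obs=y)

guide = AutoNormal(model)  # mean-field variational approximation
svi = SVI(model, guide, numpyro.optim.Adam(step_size=0.01), loss=Trace_ELBO())
result = svi.run(random.PRNGKey(0), 2000, y=jnp.array([0.8, 1.2, 0.9, 1.1]))
print(result.params)  # variational parameters of the fitted guide
```
</description>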
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/BCGJQB/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/BCGJQB/feedback/</feedback_url>
            </event>
            <event guid='ddd55345-9dd6-5ed1-a8bb-77f752cdf455' id='77835' code='WGJJQN'>
                <room>B07-B08</room>
                <title>How We Automate Chaos: Agentic AI and Community Ops at PyCon DE &amp; PyData</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T14:20:00+02:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>Using AI agents and automation, PyCon DE &amp; PyData volunteers have transformed chaos into streamlined conference ops. From YAML files to LLM-powered assistants, they automate speaker logistics, FAQs, video processing, and more while keeping humans focused on creativity. This case study reveals practical lessons on making AI work in real-world scenarios: structured workflows, validation, and clear context beat hype. Live demos and open-source tools included.</abstract>
                <slug>berlin2025-77835-how-we-automate-chaos-agentic-ai-and-community-ops-at-pycon-de-pydata</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78816'>Alexander CS Hendorf</person>
                </persons>
                <language>en</language>
                <description>Every year, PyCon DE &amp; PyData is run by a rotating crew of volunteers who build a full conference from scratch &#8212; in their spare time, with limited tools, shifting knowledge, and lots of coffee. It&#8217;s like launching a startup, dismantling it, and repeating from memory.

To survive (and stay sane), we&#8217;ve turned conference ops into a sandbox for automation &#8212; leaning on scripts, structured documentation, and increasingly, agentic AI systems. Think YAML files, GitHub Actions, custom bots, and LLM-powered assistants doing the boring stuff, so humans can focus on creativity and connection.

This talk is a no-fluff case study in what it actually takes to make automation &#8212; and especially AI agents &#8212; work in the wild:
 * How we went from chaotic Notion boards to reproducible workflows
 * How we use LLMs + APIs (LLMs, GitHub, Google, Drives, Pretalx, Pretix,&#8230;) to support speaker logistics, FAQs, the video app, video cuts, certificates of participation, and schedule drafts
 * Why Pydantic, structure, and even simple scripts matter more than hype (see the sketch after this list)
 * And most importantly: why agents are useless without clear structure, validation, and context
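
As a taste of the &#8220;structure beats hype&#8221; point, here is a minimal Pydantic sketch (the schema is hypothetical, not our actual tooling):

```python
from pydantic import BaseModel, HttpUrl, ValidationError

# Hypothetical speaker record; the fields are illustrative, not the real
# PyCon DE tooling schema.
class Speaker(BaseModel):
    name: str
    email: str
    talk_title: str
    slides_url: HttpUrl | None = None

try:
    Speaker(name="Ada", email="ada@example.org", talk_title="Automating Chaos")
except ValidationError as err:
    print(err)  # structured, machine-readable validation errors
```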

We&#8217;ll show live examples, share the open tools we&#8217;ve built (and broken), and make the case that good community infrastructure is open-source-worthy. If you&#8217;re building tools for humans, this talk is for you.

Want to help? We&#8217;re actively looking for contributors, testers, and curious minds to build better community tech together &#8212; come chat after the talk or find us online.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://bit.ly/AI-AGENTS-BER">Talk Slides: Agentic AI and Community Ops at PyCon DE &amp; PyData</link>
                </links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/WGJJQN/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/WGJJQN/feedback/</feedback_url>
            </event>
            <event guid='02d602c6-1023-5b4f-8399-64bcad27ecd8' id='77665' code='KEJJSP'>
                <room>B07-B08</room>
                <title>Template-based web app and deployment pipeline at an enterprise-ready level on Azure</title>
                <subtitle></subtitle>
                <type>Talk (long)</type>
                <date>2025-09-02T15:50:00+02:00</date>
                <start>15:50</start>
                <duration>00:45</duration>
                <abstract>A practical deep-dive into Azure DevOps pipelines, the Azure CLI, and how to combine pipeline, bicep, and python templates to build a fully automated web app deployment system. Deploying a new proof of concept app within an actual enterprise environment never was faster.</abstract>
                <slug>berlin2025-77665-template-based-web-app-and-deployment-pipeline-at-an-enterprise-ready-level-on-azure</slug>
                <track>Infrastructure - Hardware &amp; Cloud</track>
                
                <persons>
                    <person id='78396'>Johannes Sch&#246;ck</person>
                </persons>
                <language>en</language>
                <description>In many enterprise environments, deploying a proof-of-concept data app to the cloud remains frustratingly slow and manual. Early user feedback often depends on clunky screen shares or static screenshots. This talk shows how we transformed that process - automating everything from infrastructure provisioning to web app deployment - using a system of pipeline, bicep, and python templates. The result? Stakeholders can interact with a working Streamlit app within minutes of a commit, with no further manual setup required.

We take you with us on our journey from awkward beginnings to an elegant template-based setup, where all steps of the configuration and deployment process are automated. All Azure resources are created without manual steps. And it takes only one bio-break from submitting your work to the repository to the business user being able to test it live. Along the way we share best practices and pitfalls we discovered, as well as how we structure our templates and repositories, both for the web app, as well as the deployment pipeline. At the end, we will deploy a new web app together and explore the workings of the system live.

While the concept will need adapting to other providers, you don&apos;t need to use Azure to benefit from this talk - all cloud platforms share similar tools and challenges.

Detailed Outline:

1\. Motivation (5 min)

- Why it&apos;s hard to get user feedback early and why that is problematic
- Why it&apos;s hard to get a real application running early
- What if we could automate app deployment and configuration, or how the NKD data science teams went from awkward to awesome

2\. The app creation, deployment, and configuration process (12 min)

- Struggles and best practices
- Tools that help with consistency and automation
- Handling virtual environments across dev systems and the cloud
- Web app and pipeline repositories and templates

3\. The pipeline (18 min)

- Structure of the stages
- Minimizing manual configuration with file parsing and bicep
- Matching branch and target server
- Automated Azure resource creation using Azure CLI
- App authorization and authentication configuration with more Azure CLI
- Finally, the deployment

4\. Showcase (5 min)

- What the setup looks like when it&apos;s fully set up
- Deploying an app live

Key Takeaways:

- How to reduce app deployment time from days to minutes using automated templates
- Collaboration setup for small and medium-sized data teams
- Best practices for structuring pipelines and web apps for consistency, security, and scalability
- What not to do: key pitfalls we encountered and how we fixed them

Target audience:

Data or machine learning scientists or engineers in small or medium-sized teams, who want to deploy web apps faster and in a more consistent way. Attendees should be comfortable with python and have basic familiarity with web apps or DevOps principles. While Azure users benefit most, no in-depth knowledge is required - concepts will be explained as we go.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://github.com/JSchoeck/talk_webapp_template_and_pipeline_on_azure">Repository</link>
                
                    <link href="https://github.com/JSchoeck/talk_webapp_template_and_pipeline_on_azure/blob/main/Template-based%20Web%20App%20and%20Deployment%20Pipeline%20on%20Azure%20in%20an%20enterprise%20environment.pdf">Slides</link>
                </links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/KEJJSP/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/KEJJSP/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B05-B06' guid='aae589d8-5d0f-5d2f-8c55-720e32dc637e'>
            <event guid='017ce203-4d75-50ae-bf17-be85a167dd55' id='77894' code='DBL9PQ'>
                <room>B05-B06</room>
                <title>The Importance and Elegance of Polars Expressions</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:30</duration>
                <abstract>Polars is known for its speed, but its elegance comes from its use of expressions. In this talk, we&#8217;ll explore how Polars expressions work and why they are key to efficient and elegant data manipulation. Through real-world examples, you&#8217;ll learn how to create, expand, and combine expressions in Polars to wrangle data more effectively.</abstract>
                <slug>berlin2025-77894-the-importance-and-elegance-of-polars-expressions</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='78692'>Jeroen Janssens</person>
                </persons>
                <language>en</language>
                <description>Polars has gained popularity for its speed, but what truly makes it stand out is its syntax, especially the use of expressions. The book Python Polars: The Definitive Guide defines an expression as &quot;a tree of operations that describe how to construct one or more Series.&quot; In this talk, we&#8217;ll demystify this concept, explaining how expressions make Polars an elegant tool for data manipulation.

We will cover:

- Why expressions are crucial in Polars
- A formal definition of an expression and what it means in practice
- Creating expressions from existing columns or other values
- Using expressions to select, filter, sort, and aggregate data
- Applying expressions for aggregate statistics, mathematical transformations, and handling missing values
- Combining expressions with operators, comparisons, and Boolean logic
- A comparison of idiomatic vs. non-idiomatic Polars code
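
As a small preview, this is the kind of expression-based code the talk builds up to (a minimal illustration, not an excerpt from the talk):

```python
import polars as pl

df = pl.DataFrame({
    "store": ["a", "a", "b", "b"],
    "sales": [10, 15, 5, 20],
})

# Each pl.col(...) call builds an expression: a lazy description of a
# computation that only runs inside a context such as filter, group_by,
# select, or with_columns.
result = (
    df.filter(pl.col("sales") > 5)
    .group_by("store")
    .agg(
        pl.col("sales").sum().alias("total"),
        pl.col("sales").mean().alias("avg"),
    )
    .sort("total", descending=True)
)
print(result)
```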

By the end of this talk, you&#8217;ll understand how to leverage Polars expressions to write cleaner and more efficient data manipulation code.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/DBL9PQ/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/DBL9PQ/feedback/</feedback_url>
            </event>
            <event guid='ae3a9006-4088-56d5-86bd-f69c20a7faa0' id='77643' code='HUNUEB'>
                <room>B05-B06</room>
                <title>Causal Inference in Network Structures: Lessons learned From Financial Services</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T11:20:00+02:00</date>
                <start>11:20</start>
                <duration>00:30</duration>
<abstract>*Causal inference techniques are crucial to understanding the impact of actions on outcomes.* *This talk shares lessons learned from applying these techniques in real-world scenarios where standard methods do not immediately apply. Our key question is: What is the causal impact of wealth planning services on a network of individuals&#8217; investments and securities? We&apos;ll examine the challenges posed by practical constraints and show how to deal with them before applying standard approaches like staggered difference-in-differences.* 

*This self-contained talk is prepared for general data scientists who want to add causal inference techniques to their toolbox and learn from real-world data challenges.*</abstract>
                <slug>berlin2025-77643-causal-inference-in-network-structures-lessons-learned-from-financial-services</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='78704'>Danial Senejohnny</person>
                </persons>
                <language>en</language>
                <description>Wealth planning is a service offered by financial institutions. The advice helps clients grow their wealth through investing. This talk focuses on measuring the true impact of these services on a network of individual&#8217;s investments and securities. However, measuring this impact presents several practical challenges, which will be tackled in this talk:

   1) Controlled experiments are often impossible in practice, leaving only observational data available.
   
   2) Defining robust control groups is challenging when treatments are administered to individuals in relationship graphs at different times.

   3) Analysis must account for multiple outcomes with different modalities&#8212;securities (time-series) and investing (binary).

   4) The parallel-trend assumption doesn&apos;t immediately hold.

   5) The confounding effect of market trends on outcomes needs to be corrected.
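
For orientation, here is a minimal two-way fixed-effects sketch of the difference-in-differences baseline that such corrections build on (toy data; all column names are illustrative):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy panel: 'treated' switches on after each unit's (staggered) adoption period.
df = pd.DataFrame({
    "unit":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "period":  [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "treated": [0, 1, 1, 0, 0, 1, 0, 0, 0],
    "outcome": [1.0, 2.1, 2.3, 0.9, 1.0, 1.8, 1.1, 1.0, 1.2],
})

# Two-way fixed effects: unit and period dummies absorb level differences;
# the 'treated' coefficient is the naive DiD effect before corrections.
model = smf.ols("outcome ~ treated + C(unit) + C(period)", data=df).fit()
print(model.params["treated"])
```
</description>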
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/HUNUEB/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/HUNUEB/feedback/</feedback_url>
            </event>
            <event guid='f4d8a5f6-52d5-50b0-be43-fe8f744d218d' id='77770' code='GPZPFP'>
                <room>B05-B06</room>
                <title>Building Reactive Data Apps with Shinylive and WebAssembly</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:30</duration>
                <abstract>WebAssembly is reshaping how Python applications can be delivered - allowing fully interactive apps that run directly in the browser, without a traditional backend server. In this talk, I&#8217;ll demonstrate how to build reactive, data-driven web apps using Shinylive for Python, combining efficient local storage with Parquet and extending functionality with optional FastAPI cloud services. We&#8217;ll explore the benefits and limitations of this architecture, share practical design patterns, and discuss when browser-based Python is the right choice. Attendees will leave with hands-on techniques for creating modern, lightweight, and highly responsive Python data applications.</abstract>
                <slug>berlin2025-77770-building-reactive-data-apps-with-shinylive-and-webassembly</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='78647'>Christoph Scheuch</person>
                </persons>
                <language>en</language>
                <description>In recent years, WebAssembly (Wasm) has opened new frontiers for delivering Python applications - enabling fully interactive, browser-native experiences without requiring a traditional server backend. This paradigm shift is particularly exciting for data scientists and developers looking to build lightweight, highly responsive data apps that can be deployed as static websites, reducing infrastructure complexity while improving user experience.

In this talk, I will walk through how to use Shinylive for Python, an emerging framework that combines reactive programming principles with the power of WebAssembly, to create rich data applications that run entirely in the browser. We&#8217;ll cover how Shinylive translates reactive code into client-side interactions, eliminating the need for round-trips to a Python server. I&#8217;ll also introduce techniques for integrating efficient local storage (via Apache Parquet) and show how optional FastAPI services can be layered on for hybrid architectures when needed.
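
For a flavour of the reactive model, here is a minimal Shiny for Python app (an illustrative sketch; with Shinylive, the same code can be exported to run entirely in the browser):

```python
from shiny import App, render, ui

app_ui = ui.page_fluid(
    ui.input_slider("n", "Sample size", min=10, max=1000, value=100),
    ui.output_text("summary"),
)

def server(input, output, session):
    # Re-runs automatically whenever the slider value changes.
    @render.text
    def summary():
        return f"You selected n = {input.n()}"

app = App(app_ui, server)
```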

This talk is intended for data scientists, machine learning engineers, and Python developers who are interested in building modern web applications without becoming full-time JavaScript engineers. Attendees will leave with practical techniques for building and deploying reactive data apps that run entirely in the browser.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://github.com/tidy-intelligence/pydata-berlin-2025">Link to repo with slides &amp; example app</link>
                </links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GPZPFP/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GPZPFP/feedback/</feedback_url>
            </event>
            <event guid='92ff4ed7-9e7a-533e-ada0-850edfa0a7d3' id='80816' code='YZ9BY7'>
                <room>B05-B06</room>
                <title>Data science in containers: the good, the bad, and the ugly</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>00:30</duration>
                <abstract>If we want to run data science workloads (e.g. using Tensorflow, PyTorch, and others) in containers (for local development or production on Kubernetes), we need to build container images. Doing that with a Dockerfile is fairly straightforward, but is it the best method?
In this talk, we&apos;ll take a well-known speech-to-text model (Whisper) and show various ways to run it in containers, comparing the outcomes in terms of image size and build time.</abstract>
                <slug>berlin2025-80816-data-science-in-containers-the-good-the-bad-and-the-ugly</slug>
                <track>Infrastructure - Hardware &amp; Cloud</track>
                
                <persons>
                    <person id='82330'>J&#233;r&#244;me Petazzoni</person>
                </persons>
                <language>en</language>
                <description>We&apos;ll demonstrate how to switch versions DRY-style (without maintaining multiple Dockerfiles!), how to leverage newer techniques like BuildKit cache mounts, and discuss other important considerations like the use of Alpine with Python, progressive image loading, and model loading strategies.

Attendees will learn practical containerization techniques specifically tailored for data science workflows, with concrete examples using the Whisper model as our case study.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/YZ9BY7/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/YZ9BY7/feedback/</feedback_url>
            </event>
            <event guid='a7fe4b67-2534-55d7-90f5-8c707350e578' id='77660' code='YKFWKQ'>
                <room>B05-B06</room>
                <title>Beyond Benchmarks: Practical Evaluation Strategies for Compound AI Systems</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T14:20:00+02:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>Evaluating large language models (LLMs) in real-world applications goes far beyond standard benchmarks. When LLMs are embedded in complex pipelines, choosing the right models, prompts, and parameters becomes an ongoing challenge.

In this talk, we will present a practical, human-in-the-loop evaluation framework that enables systematic improvement of LLM-powered systems based on expert feedback. By combining domain expert insights and automated evaluation methods, it is possible to iteratively refine these systems while building transparency and trust.

This talk will be valuable for anyone who wants to ensure their LLM applications can handle real-world complexity - not just perform well on generic benchmarks.</abstract>
                <slug>berlin2025-77660-beyond-benchmarks-practical-evaluation-strategies-for-compound-ai-systems</slug>
                <track>Natural Language Processing &amp; Audio (incl. Generative AI NLP)</track>
                
                <persons>
                    <person id='78698'>Iryna Kondrashchenko</person><person id='78700'>Oleh Kostromin</person>
                </persons>
                <language>en</language>
                <description>As large language models become integral to real-world applications, evaluating and improving their performance is a growing challenge. Generic benchmarks and simple metrics fail to adequately assess domain-specific, multi-step reasoning required by compound AI pipelines like retrieval-augmented generation (RAG), multi-tool agents, or knowledge assistants. Moreover, manual evaluation of every step is infeasible at scale, while fully automated LLM-as-a-judge approaches lack critical domain insights.

In this talk, we will present a practical evaluation approach to enable continuous improvement of LLM-powered systems. It incorporates the following stages: 
- Automatic tracing: capturing input/output pairs across the pipeline to build an evaluation dataset.
- Expert feedback collection: working with subject matter experts and user interactions to assess correctness and identify failure points.
- Iterative improvement cycle: tuning the components and/or optimizing prompts.
- Degradation tests: turning feedback into automated evaluation tests - ranging from exact match checks to LLM-as-a-judge assertions - to guard against regressions.
- Continuous monitoring: using the growing evaluation dataset to validate the system as models, tools, or data sources evolve.
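
As an illustration of the degradation-test stage, here is a minimal pytest-style sketch (all names hypothetical):

```python
# Each case pins a known-good behavior so later model, prompt, or data
# changes cannot silently regress it.
CASES = [
    {"question": "What is the refund window?", "expected": "30 days"},
]

def run_pipeline(question: str) -> str:
    raise NotImplementedError("call the actual RAG / agent pipeline here")

def test_no_regressions():
    for case in CASES:
        answer = run_pipeline(case["question"])
        # Exact-match check; swap in an LLM-as-a-judge assertion for
        # open-ended answers.
        assert case["expected"] in answer
```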

This framework ensures that LLM applications remain reliable and aligned with specific business needs over time.

Target audience: AI practitioners developing and maintaining LLM-based applications.

Attendees will learn strategies to:
- Build a human-in-the-loop evaluation process combining expert feedback and automated methods.
- Turn expert knowledge into automatic tests to guard against regressions.
- Use iterative improvement cycles to refine LLM pipelines over time.

Attendees are assumed to be familiar with LLMs and machine learning workflows; deep NLP expertise is not required.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/YKFWKQ/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/YKFWKQ/feedback/</feedback_url>
            </event>
            <event guid='136b901a-53e4-5027-8da1-51120b11a88c' id='77081' code='JEKYLT'>
                <room>B05-B06</room>
                <title>Navigating healthcare scientific knowledge:building AI agents for accurate biomedical data retrieval</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-02T15:00:00+02:00</date>
                <start>15:00</start>
                <duration>00:30</duration>
<abstract>With a focus on healthcare applications where accuracy is non-negotiable, this talk highlights challenges and delivers practical insights on building AI agents that query complex biological and scientific data to answer sophisticated questions. Drawing from our experience developing Owkin-K Navigator, a free-to-use AI co-pilot for biological research, I&apos;ll share hard-won lessons about combining natural language processing with SQL querying and vector database retrieval to navigate large biomedical knowledge sources, addressing challenges of preventing hallucinations and ensuring proper source attribution.
This session is ideal for data scientists, ML engineers, and anyone interested in applying the Python and LLM ecosystems to the healthcare domain.</abstract>
                <slug>berlin2025-77081-navigating-healthcare-scientific-knowledge-building-ai-agents-for-accurate-biomedical-data-retrieval</slug>
                <track>Generative AI</track>
                
                <persons>
                    <person id='78690'>Laura Dumont</person>
                </persons>
                <language>en</language>
                <description>The growth of scientific healthcare literature and publicly available biomedical databases has created many opportunities but also great challenges for researchers. While large amounts of biological data are now freely available, finding and connecting relevant information across disparate sources remains time-consuming and complex. LLM-powered tools offer promising solutions to this challenge, but implementing them in healthcare, where accuracy can impact patient outcomes, requires specialised approaches and careful design considerations.
This talk will share practical lessons and technical strategies to address hallucinations, complex domain-specific terminology, and source attribution.

The presentation will be structured into three main sections:

1. The challenge of scientific data retrieval (5 mins)
    1. Overview of the current landscape of biological databases and scientific literature
    2. Common challenges researchers face when searching for information across multiple sources
    3. Specificities of healthcare domain where accuracy is critical

2. Technical architecture for LLM-powered scientific search (15 mins)
    1. Reliable approaches to querying structured databases using natural language
    2. Vector database implementation for semantic search across scientific literature
    3. Strategies to ensure retrieved information is properly attributed to sources
    4. Real-world performance considerations: balancing accuracy, latency, and cost

3. Lessons learned and future directions (5 mins)
    1. Performance metrics and user feedback from academic researchers
    2. Challenges and limitations of current approaches
    3. Future directions for AI-assisted scientific discovery

Throughout the talk, I&apos;ll provide concrete examples of how these technologies can be applied to real research questions in a production environment, demonstrating the practical value of AI agents in accelerating scientific discovery.

Intended audience: This talk is designed for data scientists, ML / Software engineers, bioinformaticians, and researchers interested in leveraging AI for scientific data retrieval and analysis. 
While examples will focus on biological data, the principles and techniques discussed are applicable across scientific domains. Basic familiarity with Python and AI concepts will be helpful but is not required.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/JEKYLT/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/JEKYLT/feedback/</feedback_url>
            </event>
            <event guid='6ec0344e-c2fb-5672-8f89-f803dabbf89e' id='77627' code='3LDDAB'>
                <room>B05-B06</room>
                <title>From Manual to LLMs: Scaling Product Categorization</title>
                <subtitle></subtitle>
                <type>Talk (long)</type>
                <date>2025-09-02T15:50:00+02:00</date>
                <start>15:50</start>
                <duration>00:45</duration>
<abstract>How to use LLMs to categorize hundreds of thousands of products into 1,000 categories at scale? Learn about our journey from manual/rule-based methods, via fine-tuned semantic models, to a robust multi-step process which uses embeddings and LLMs via the OpenAI APIs. This talk offers data scientists and AI practitioners learnings and best practices for putting such a complex LLM-based system into production. This includes prompt development, balancing cost vs. accuracy via model selection, testing multi-case vs. single-case prompts, and saving costs by using the OpenAI Batch API and a smart early-stopping approach. We also describe our automation and monitoring in a PySpark environment.</abstract>
                <slug>berlin2025-77627-from-manual-to-llms-scaling-product-categorization</slug>
                <track>Generative AI</track>
                
                <persons>
                    <person id='78718'>Giampaolo Casolla</person><person id='78445'>Ansgar Gr&#252;ne</person>
                </persons>
                <language>en</language>
                <description>**Target Audience:** Data scientists, AI/ML engineers, and practitioners interested in applying large language models (LLMs) / generative AI to solve real-world classification problems at scale. Attendees should have a foundational understanding of machine learning concepts and familiarity with the Python data science stack. Exposure to vector embeddings or LLM APIs is helpful but not mandatory.

**Takeaway:** Attendees will gain practical insights and learn best practices for building, debugging, scaling, and productionizing a complex, multi-step generative AI system for large-scale product categorization. They will understand the evolution from traditional methods to LLMs, learn specific techniques for prompt engineering, batch processing, cost optimization with models like OpenAI&apos;s, and see the tangible business impact of such a system.

**Detailed Outline:**

This talk chronicles our journey tackling a challenging problem: accurately categorizing hundreds of thousands of diverse products into a fine-grained taxonomy of over 1,000 categories. We&apos;ll share our evolution from initial manual and rule-based systems to a sophisticated, production-ready Generative AI pipeline.



* **Part 1: The Challenge &amp; Initial Approaches (10 minutes)**
    * Introduction to the business need for accurate product categorization at scale.
    * Overview of the limitations encountered with traditional methods:
        * Manual Curation: Slow, expensive, inconsistent, and impossible to scale.
        * Rule-Based Systems: Brittle, hard to maintain, and unable to handle nuances or new product types.
        * Fine-tuned Semantic Models: An improvement, but struggled with zero-shot generalization to new categories and required significant labeled data and retraining.
* **Part 2: Entering the GenAI Era - Iterations &amp; Lessons Learned (10 minutes)**
    * Our initial exploration using LLMs for categorization, what worked, and what failed.
    * **Developing the Prompt:** We&apos;ll dive deep into the iterative process of prompt engineering for this complex multi-label, hierarchical classification task. We&apos;ll show examples of early prompts, their failure modes (e.g., inconsistent output format, hallucinated categories, difficulty handling multiple classification signals), and the refinements that led to more reliable results. We will discuss techniques for achieving structured output (e.g., JSON) from the LLM.
    * **Early Scaling Issues:** Discussing the pitfalls of naive API usage, latency problems, and prohibitive costs when dealing with large volumes.
* **Part 3: Building a Robust, Scalable GenAI Pipeline (10 minutes)**
    * **The Hybrid Approach:** Detailing our successful multi-step architecture that combines the strengths of semantic embeddings for efficient candidate retrieval/filtering and LLMs (specifically leveraging OpenAI models) for nuanced final categorization.
    * **Productionization Strategies:**
        * *Batching:* Implementing efficient batch processing using asynchronous requests and the OpenAI Batch API to drastically reduce latency and cost.
        * *Cost vs. Accuracy:* Strategies for selecting the right model based on complexity and cost constraints.
        * *Semantic Similarity &amp; Early Stopping:* Using vector similarity to intelligently prune the search space, avoiding the need to evaluate every product against all 1,000+ categories with the LLM, thus significantly optimizing cost and throughput (see the sketch after this outline).
        * *Automation &amp; Monitoring*: How we process updates of categories and products automatically in PySpark and monitor that the live system works as expected.
* **Part 4: Measuring Impact &amp; Looking Ahead (10 minutes)**
    * Presenting the results: Showcasing the significant improvements in categorization accuracy and coverage compared to previous methods, with illustrative examples of challenging products correctly categorized by the GenAI system.
    * Discussing the tangible business value derived, as measured in A/B tests.
    * Briefly touching upon ongoing work and future directions.
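
To make the semantic-similarity pruning concrete, here is a minimal sketch (a hypothetical helper; embeddings are assumed precomputed and L2-normalized so the dot product equals cosine similarity):

```python
import numpy as np

def top_candidates(product_vec, category_vecs, k=10, threshold=0.85):
    # Prune the 1,000+ category space before any LLM call.
    sims = category_vecs @ product_vec
    order = np.argsort(sims)[::-1]
    # Early stopping: if the best match is confident enough, skip the LLM.
    if sims[order[0]] >= threshold:
        return [int(order[0])], "accept_without_llm"
    return [int(i) for i in order[:k]], "ask_llm"
```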

This presentation will focus on the practical application and engineering challenges, sharing reusable techniques and hard-won lessons applicable to anyone looking to leverage the power of generative AI for large-scale, real-world problems. We aim to provide a transparent account of not just the successes, but also the crucial learnings from failures encountered along the way.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/3LDDAB/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/3LDDAB/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='3' date='2025-09-03' start='2025-09-03T04:00:00+02:00' end='2025-09-04T03:59:00+02:00'>
        <room name='Kuppelsaal' guid='a413bdae-4730-5a9d-8aa1-045579ce1087'>
            <event guid='d6b8b89e-a27d-50cf-ba76-821549fbe7e8' id='77260' code='C3MGDN'>
                <room>Kuppelsaal</room>
                <title>Maintainers of the Future: Code, Culture, and Everything After</title>
                <subtitle></subtitle>
                <type>Keynote</type>
                <date>2025-09-03T09:10:00+02:00</date>
                <start>09:10</start>
                <duration>01:00</duration>
                <abstract>How we sustain what we build &#8212; and why the future of tech depends on care, not only code.

The last five years have reshaped tech &#8212; through a pandemic, economic uncertainty, shifting politics, and the rapid rise of AI. While these changes have opened new opportunities, they&#8217;ve also exposed the limits &#8212; and harms &#8212; of a &#8220;move fast and break things&#8221; mindset.</abstract>
                <slug>berlin2025-77260-maintainers-of-the-future-code-culture-and-everything-after</slug>
                <track>Education, Career &amp; Life</track>
                
                <persons>
                    <person id='79022'>Jessica Greene</person>
                </persons>
                <language>en</language>
                <description>This talk invites the audience into a collective reflection on the state of tech today &#8212; and a reimagining of the futures we want to build. We&#8217;ll explore how small, mission-driven teams can use AI and automation to scale impact while centering their values, and why the work of maintenance &#8212; often invisible and undervalued &#8212; is foundational to responsible innovation.

Drawing from my experience as a software engineer at a mission-driven company, and as an open-source community leader, I&#8217;ll unpack the challenges of long-term technical work: invisible labor, ethical drift, burnout and the quiet leadership of those who stay. In a world obsessed with velocity and dominance, this is a talk about resilience &#8212; and why the future belongs to those willing to maintain it as a radical act of shaping what comes next.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/C3MGDN/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/C3MGDN/feedback/</feedback_url>
            </event>
            <event guid='9eab448a-afae-539d-9a74-69e54cf24d19' id='80989' code='M3RVNA'>
                <room>Kuppelsaal</room>
                <title>Closing Session</title>
                <subtitle></subtitle>
                <type>Plenary Session [Organizers]</type>
                <date>2025-09-03T15:10:00+02:00</date>
                <start>15:10</start>
                <duration>00:15</duration>
                <abstract>Closing Session</abstract>
                <slug>berlin2025-80989-closing-session</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>Closing Session</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/M3RVNA/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/M3RVNA/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B09' guid='844f8596-e84f-5029-b709-8892c0fca5c3'>
            <event guid='0aad57bb-beab-5b01-a837-c5b6b8170de7' id='77951' code='GZUXGZ'>
                <room>B09</room>
                <title>Building an AI Agent for Natural Language to SQL Query Execution on Live Databases</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-03T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>01:30</duration>
                <abstract>This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, validates and executes them on live databases, and returns accurate responses. Participants will build a system that intelligently routes between a specialized SQL agent and a ReAct chat agent, implementing RAG for query similarity matching, comprehensive safety validation, and human-in-the-loop confirmation. By the end of this session, attendees will have created a powerful and extensible system they can adapt to their own data sources.</abstract>
                <slug>berlin2025-77951-building-an-ai-agent-for-natural-language-to-sql-query-execution-on-live-databases</slug>
                <track>Generative AI</track>
                
                <persons>
                    <person id='78734'>Cain&#227; Max Couto da Silva</person>
                </persons>
                <language>en</language>
                <description>### Overview

Natural&#8209;language interfaces unlock database insights for non&#8209;technical users. This tutorial provides a practical implementation for building these systems reliably and effectively. 

Participants will build an AI agent system that can:

1. Route intelligently between SQL generation and ReAct chat agent workflows
2. Ingest and understand database schemas with domain knowledge
3. Retrieve relevant context and similar query examples using RAG with vector similarity
4. Generate accurate SQL with validation and safety guardrails (a minimal sketch follows after this list)
5. Execute queries safely with human-in-the-loop approval
6. Present results in an understandable format
7. Track costs and monitor performance using LangSmith
8. Manage session-based memory and conversation context
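
As an illustration of the validation step (item 4), here is a minimal guardrail sketch (a toy filter, not the tutorial&apos;s full implementation):

```python
import re

# Allow a single read-only statement; block anything that mutates state.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b", re.I
)

def validate_sql(query: str) -> str:
    statements = [s for s in query.split(";") if s.strip()]
    if len(statements) != 1:
        raise ValueError("Exactly one statement is allowed.")
    if FORBIDDEN.search(query):
        raise ValueError("Only read-only queries may run.")
    if not statements[0].lstrip().lower().startswith("select"):
        raise ValueError("Query must start with SELECT.")
    return query  # safe to hand over for human-in-the-loop approval
```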

We&apos;ll use the Kaggle dataset &quot;[Brazilian E-Commerce dataset by Olist](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)&quot; as our working example, demonstrating how to handle multiple tables across two schemas with complex relationships. This dataset will be hosted on an EC2 AWS instance for live interaction during the tutorial.

This tutorial addresses real-world database complexity with production-grade considerations. Participants will start from a repository with backbone code and implement the key components during the session. By the end, attendees will have a working system they can adapt to their own datasets.

### Tools and Frameworks
This tutorial will leverage modern tools and frameworks for efficient development:

**AI and Agent Frameworks:**
- LangChain for agent components and LLM interactions
- LangGraph for agent orchestration and workflow management
- LangSmith for comprehensive cost tracking and monitoring
- OpenAI models with examples of alternatives

**Database and Vector Store:**
- SQLAlchemy for database interactions and schema retrieval
- PostgreSQL as the database engine for the live dataset
- PGVector for similarity-based query retrieval

**Development:**
- YAML for configuration management
- `pyproject.toml` for standardized project configuration
- UV for reliable package management and Ruff for code formatting/linting</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GZUXGZ/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GZUXGZ/feedback/</feedback_url>
            </event>
            <event guid='30899fbe-3680-5b55-9326-4455eb71b620' id='77222' code='B3STGX'>
                <room>B09</room>
                <title>See only what you are allowed to see: Fine-Grained Authorization</title>
                <subtitle></subtitle>
                <type>Tutorial</type>
                <date>2025-09-03T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>01:30</duration>
                <abstract>Managing who can see or do what with your data is a fundamental challenge, especially as applications and data grow in complexity. Traditional role-based systems often lack the granularity needed for modern data platforms. 
Fine-Grained Authorization (FGA) addresses this by controlling access at the individual resource level. In this 90-minute hands-on tutorial, we will explore implementing FGA using OpenFGA, an open-source authorization engine inspired by Google&apos;s Zanzibar. Attendees will learn the core concepts of Relationship-Based Access Control (ReBAC) and get practical experience defining authorization models, writing relationship tuples, and performing authorization checks using the OpenFGA Python SDK. Bring your laptop ready to code to learn how to build secure and flexible permission systems for your data applications.</abstract>
                <slug>berlin2025-77222-see-only-what-you-are-allowed-to-see-fine-grained-authorization</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78918'>Maria Knorps</person>
                </persons>
                <language>en</language>
                <description>This tutorial provides a practical, hands-on introduction to implementing Fine-Grained Authorization (FGA) for data-intensive applications using the open-source tool OpenFGA. As data platforms evolve and regulatory requirements become stricter, controlling access at a granular level &#8211; perhaps even row-level in a database context &#8211; becomes essential. Role-Based Access Control (RBAC), while common, often struggles to meet these complex needs, leading to insufficient flexibility or administrative overhead.
We will introduce the concept of Relationship-Based Access Control (ReBAC), the authorization paradigm powering systems like Google&apos;s Zanzibar and OpenFGA. You&apos;ll learn how ReBAC defines permissions based on the relationships between users and objects (e.g., &quot;Alice is a viewer of Document &apos;report_Q3&apos;&quot;), enabling highly flexible and scalable access control logic.
The core of the tutorial will be dedicated to practical implementation. We will guide attendees through:
1. Setting up a local OpenFGA instance (e.g., using Docker).
2. Defining an authorization model using OpenFGA&apos;s Domain Specific Language (DSL) to represent resources, users, and the relationships between them. We will use a simplified data access scenario as our example, potentially inspired by challenges faced in research or data collaboration platforms.
3. Writing and managing relationship tuples in OpenFGA.
4. Using the OpenFGA Python SDK to connect your application logic to the authorization engine.
5. Exploring strategies for integrating this with application backend code and potentially addressing concepts like enforcing row-level permissions.
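
To preview the ReBAC idea, here is a toy in-memory version of the tuple-and-check model that OpenFGA manages for you (deliberately simplified; this is not the OpenFGA SDK API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTuple:
    user: str
    relation: str
    object: str

# "Alice is a viewer of Document 'report_Q3'" as a relationship tuple.
TUPLES = {
    RelationTuple("user:alice", "viewer", "document:report_Q3"),
    RelationTuple("user:bob", "owner", "document:report_Q3"),
}

# In this toy model, owners implicitly hold the viewer relation.
IMPLIED = {"viewer": {"viewer", "owner"}}

def check(user: str, relation: str, obj: str) -> bool:
    allowed = IMPLIED.get(relation, {relation})
    return any(RelationTuple(user, r, obj) in TUPLES for r in allowed)

print(check("user:bob", "viewer", "document:report_Q3"))  # True
```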

Attendees will follow along with live coding examples and complete exercises designed to solidify their understanding and build confidence in applying FGA principles with OpenFGA. By the end of the 90 minutes, you will have a foundational understanding of FGA/ReBAC and the practical skills to start integrating OpenFGA into your own projects. The tutorial materials, including code examples and setup instructions, will be provided via a GitHub repository.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/B3STGX/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/B3STGX/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B07-B08' guid='e7ecef66-8ce7-51e4-9629-123a47fb4391'>
            <event guid='0761e5f1-8282-596e-a772-146b96d85e9d' id='77778' code='HKMYHY'>
                <room>B07-B08</room>
                <title>Bye-Bye Query Spaghetti: Write Queries You&apos;ll Actually Understand Using Pipelined SQL Syntax</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:30</duration>
                <abstract>Are your SQL queries becoming tangled webs that are difficult to decipher, debug, and maintain? This talk explores how to write shorter, more debuggable, and extensible SQL code using **Pipelined SQL**, an alternative syntax where queries are written as **a series of orthogonal, understandable steps**. We&apos;ll survey which databases and query engines currently support pipelined SQL natively or through extensions, and how it can be used on any platform by compiling pipelined SQL to any SQL dialect using open-source tools. A series of real-world examples, comparing traditional and pipelined SQL syntax side by side for a variety of use cases, will show you how to simplify existing code and make complex data transformations intuitive and manageable.</abstract>
                <slug>berlin2025-77778-bye-bye-query-spaghetti-write-queries-you-ll-actually-understand-using-pipelined-sql-syntax</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78696'>Tobias Lampert</person>
                </persons>
                <language>en</language>
                <description>This session introduces Pipelined SQL, an alternative syntax for writing complex data queries as a clear, sequential flow of manageable transformations within a single query.

Traditional SQL combines filtering (WHERE), aggregation (GROUP BY), and projection (SELECT expressions) within a single, monolithic block. This can make it challenging to discern individual data transformations or modify one aspect without impacting others. Pipelined SQL, in contrast, encourages building queries like an assembly line. You&apos;ll learn to structure your query logic so that each step performs a specific transformation and cleanly passes its result to the next. This pipelined approach, moving away from deeply nested subqueries or sprawling Common Table Expressions (CTEs), leads to queries that are more readable, because the logic can be followed from start to finish. As an added benefit, the resulting code is simpler to debug and easier to extend with additional transformation steps.


The talk will explain the core concepts of Pipelined SQL, how it differs from traditional SQL, and what its main advantages are. Native support for pipelined syntax is steadily growing across many modern databases, query engines and cloud data warehouses. We will explore the landscape of emerging dialects and identify which platforms currently offer native support or extensions for this powerful syntax. The session also covers a range of open-source tools that can compile such pipelined query code into any traditional SQL dialect, making this approach suitable for almost any platform.

Through practical, real-world examples using BigQuery&apos;s pipe syntax, you&apos;ll see side-by-side comparisons demonstrating how Pipelined SQL can drastically reduce complexity and improve clarity for common data manipulation tasks. Prepare for genuine &apos;a-ha!&apos; moments as you discover how Pipelined SQL offers refreshingly simple approaches to tasks that usually require convoluted traditional SQL.

This session is ideal for data analysts, scientists, engineers, and anyone with basic SQL knowledge who wants to write cleaner, more robust, and more maintainable queries. You&apos;ll leave with a solid understanding of Pipelined SQL&apos;s benefits and practical knowledge to start simplifying your own SQL workflows.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/HKMYHY/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/HKMYHY/feedback/</feedback_url>
            </event>
            <event guid='bada3545-34f0-506e-9caf-460a0d882e30' id='77931' code='GQBX3J'>
                <room>B07-B08</room>
                <title>Docling: Get your documents ready for gen AI</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T11:20:00+02:00</date>
                <start>11:20</start>
                <duration>00:30</duration>
                <abstract>Docling, an open source package, is rapidly becoming the de facto standard for document parsing and export in the Python community. Having earned close to 30,000 GitHub stars in less than one year, it is now part of the LF AI &amp; Data Foundation. Docling is redefining document AI with its ease and speed of use. In this session, we&#8217;ll introduce Docling and its features, including usage with various generative AI frameworks and protocols (e.g. MCP).</abstract>
                <slug>berlin2025-77931-docling-get-your-documents-ready-for-gen-ai</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78598'>Michele Dolfi</person><person id='78993'>Christoph Auer</person>
                </persons>
                <language>en</language>
                <description>Docling, an open source package, is rapidly becoming the de facto standard for document parsing and export in the Python community. Having earned close to 30,000 GitHub stars in less than one year, it is now part of the LF AI &amp; Data Foundation. Docling is redefining document AI with its ease and speed of use. In this session, we&#8217;ll introduce Docling and its features, including: 

- Support for a wide array of formats&#8212;such as PDFs, DOCX, PPTX, HTML, images, and Markdown&#8212;and easy conversion to structured Markdown or JSON. 
- Advanced document understanding through capture of intricate page layouts, reading order, and table structures&#8212;ideal for complex analysis.
- Integration of the DoclingDocument format with popular AI frameworks&#8212;such as LlamaIndex, LangChain, and LlamaStack&#8212;for retrieval-augmented generation (RAG) and QA applications.
- Optical character recognition (OCR) support for scanned documents.
- Support for Visual Language Models like SmolDocling, created in collaboration with Hugging Face.
- A user-friendly command line interface (CLI) and MCP connectors for developers.
- How to use it as a service and at scale by deploying your own docling-serve (a minimal conversion sketch follows below).
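
As a first taste of the library, here is a minimal sketch of the basic conversion flow, assuming Docling&#8217;s documented quickstart API (the input path is invented):

```python
# Minimal Docling conversion sketch; "report.pdf" is a hypothetical input.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # local paths and URLs are accepted

# Export the unified DoclingDocument to a structured format.
print(result.document.export_to_markdown()[:500])
```</description>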
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GQBX3J/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GQBX3J/feedback/</feedback_url>
            </event>
            <event guid='66d0c2ed-b3f9-5e51-bef5-cb50af71e62f' id='77779' code='PPAYDV'>
                <room>B07-B08</room>
                <title>Better docs, happier users: What we learned applying Diataxis to HoloViz libraries</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:30</duration>
                <abstract>Clear documentation is crucial for the success of open-source libraries, but it&#8217;s often hard to get right. In this talk, I&#8217;ll share our experience applying the Diataxis documentation framework to improve two HoloViz ecosystem libraries, hvPlot and Panel. Attendees will come away with practical insights on applying Diataxis and strengthening documentation for their own projects.</abstract>
                <slug>berlin2025-77779-better-docs-happier-users-what-we-learned-applying-diataxis-to-holoviz-libraries</slug>
                <track>Community &amp; Diversity</track>
                
                <persons>
                    <person id='78644'>Maxime Liquet</person>
                </persons>
                <language>en</language>
                <description>Good documentation turns users into contributors &#8212; but achieving it requires more than good intentions. This talk shares the journey of applying the Diataxis framework to improve two open-source Python libraries from the HoloViz ecosystem: Panel and hvPlot. We&#8217;ll start with a short introduction to Diataxis (its four documentation types: tutorials, how-to guides, explanations, and references), then briefly present the libraries we worked on and their documentation challenges.

The heart of the talk focuses on practical lessons learned: how we mapped existing content into the Diataxis structure, handled content gaps and redundancies, engaged with the user community, and evolved our approach over time. We&#8217;ll also discuss what we would do differently if we started again.

The goal is to give attendees a realistic, hands-on perspective on adopting Diataxis &#8212; including both its benefits and its challenges.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://cfp.pydata.org/media/berlin2025/submissions/PPAYDV/resources/Diataxis_PyData_ELh5Ud5.pdf">Presentation</attachment>
                </attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/PPAYDV/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/PPAYDV/feedback/</feedback_url>
            </event>
            <event guid='4d8330be-5bfb-5f33-9e0f-b07ebef991f6' id='77950' code='SCQE8H'>
                <room>B07-B08</room>
                <title>Spot the difference: &#128373;&#65039; using foundation models to monitor for change with satellite imagery &#128752;&#65039;</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>00:30</duration>
                <abstract>Energy infrastructure is vulnerable to damage by erosion or third party interference, which often takes the form of unsanctioned construction. In this talk we discuss our experiences using deep learning algorithms powered by large foundation models to monitor for changes in bi-temporal very-high resolution satellite imagery.</abstract>
                <slug>berlin2025-77950-spot-the-difference-using-foundation-models-to-monitor-for-change-with-satellite-imagery</slug>
                <track>Computer Vision (incl. Generative AI CV)</track>
                
                <persons>
                    <person id='78772'>Ferdinand Schenck</person>
                </persons>
                <language>en</language>
                <description>Oil and gas pipelines are usually buried around 1.5 meters underground, making them vulnerable to human activity or natural processes like erosion. Pipeline operators need to perform regular checks to ensure the integrity of their infrastructure. Very High Resolution (VHR) satellite images, with ground sampling distances of less than 1 meter, provide an interesting solution to this problem, allowing for large-scale monitoring and regular revisit rates.

Spotting changes is far from simple, as one needs to distinguish between relevant changes (such as construction activity) and irrelevant changes (such as shadows, seasonal changes, or changes due to viewing angles).

Geospatial foundation models, trained on vast collections of satellite imagery from across the globe, offer enhanced generalisation capabilities while requiring relatively few labels to achieve strong performance. This global-scale pretraining enables these models to develop robust feature representations that transfer effectively to new geographic regions and tasks.
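
As a toy illustration of this embedding-based idea (not the production pipeline from the talk), change between two co-registered patches can be scored by comparing features from a frozen backbone:

```python
# Toy sketch: a generic ImageNet backbone stands in for a geospatial
# foundation model; a real system would use pretrained EO weights.
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()  # keep the 2048-d feature vector
backbone.eval()
preprocess = weights.transforms()

def change_score(patch_t0, patch_t1):
    """Cosine distance between frozen features of two co-registered patches."""
    with torch.no_grad():
        f0 = backbone(preprocess(patch_t0).unsqueeze(0))
        f1 = backbone(preprocess(patch_t1).unsqueeze(0))
    return 1.0 - torch.nn.functional.cosine_similarity(f0, f1).item()
```</description>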
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://docs.google.com/presentation/d/e/2PACX-1vQcW-gZfbu4ORWSkvHRPrtEFgU-Cc2-7XWrkaP5Q3LKNlp3UXM4q0sSUp7gWy8mh7Ny6wjE0gPuFtLB/pub">Slides Online (google slides)</link>
                </links>
                <attachments>
                    <attachment href="https://cfp.pydata.org/media/berlin2025/submissions/SCQE8H/resources/PyData_2025_Usi_fo1HCYb.pdf">Slides PDF</attachment>
                </attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/SCQE8H/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/SCQE8H/feedback/</feedback_url>
            </event>
            <event guid='6b188db3-0981-5284-a49e-1dad57b93ebd' id='81305' code='RM8CNV'>
                <room>B07-B08</room>
                <title>Kubeflow pipelines meet uv</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T14:20:00+02:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>Kubeflow is a platform for building and deploying portable and scalable machine learning (ML) workflows using containers on Kubernetes-based systems.

We will code a simple Kubeflow pipeline together and show how to test it locally. As a bonus, we will explore one solution to avoid **dependency hell** using the modern dependency management tool **uv**.</abstract>
                <slug>berlin2025-81305-kubeflow-pipelines-meet-uv</slug>
                <track>PyData &amp; Scientific Libraries Stack</track>
                
                <persons>
                    <person id='82721'>Fabrizio Damicelli</person>
                </persons>
                <language>en</language>
                <description>In this demo, you will learn how to set up and run locally a Kubeflow pipeline that:

- adheres to the standard pyproject.toml format
- keeps a consistent Python version and dependencies across components
- manages the dependencies of all components at once, including a lockfile

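As a minimal starting point, here is a hedged sketch assuming the KFP v2 SDK and its local runner (the component and its values are invented; the uv-based dependency setup from the talk is not shown):

```python
# Hedged sketch: a tiny KFP v2 component executed locally for fast feedback.
from typing import List

from kfp import dsl, local

@dsl.component(base_image="python:3.12", packages_to_install=["numpy"])
def mean(values: List[float]) -&gt; float:
    import numpy as np
    return float(np.mean(values))

# Run the component in a throwaway local virtual environment.
local.init(runner=local.SubprocessRunner(use_venv=True))
task = mean(values=[1.0, 2.0, 3.0])
print(task.output)  # 2.0
```
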
We will discuss how and why this enhanced setup can improve pipeline and dependency maintainability for systems running in production, while still taking advantage of the Kubeflow API&apos;s flexibility and features.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/RM8CNV/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/RM8CNV/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='B05-B06' guid='aae589d8-5d0f-5d2f-8c55-720e32dc637e'>
            <event guid='c28b2d44-4a29-5392-a292-227fbe73e0e2' id='77891' code='KKWBKK'>
                <room>B05-B06</room>
                <title>Edge of Intelligence: The State of AI in Browsers</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:30</duration>
                <abstract>API calls suck! Okay, not all of them. But making your AI features reliant on third-party APIs can bring a lot of trouble. In this talk you&apos;ll learn how to use web technologies to become more independent.</abstract>
                <slug>berlin2025-77891-edge-of-intelligence-the-state-of-ai-in-browsers</slug>
                <track>Infrastructure - Hardware &amp; Cloud</track>
                
                <persons>
                    <person id='78794'>Johannes Kolbe</person>
                </persons>
                <language>en</language>
                <description>The current AI hype runs on API calls, GPU clusters, and costly infrastructure. But what if we could break free from these constraints and run our models directly in the consumer&apos;s browser?

Imagine a world where AI development is more reliable, cheaper, and more secure. In this talk, we&apos;ll explore the current state of WebAI, including the latest developments, challenges, and opportunities. We&apos;ll dive into the libraries, tools, and technologies that make it possible to run AI models in the browser, such as WebAssembly, WebGPU, and ONNX. We&apos;ll discuss how these technologies enable fast and efficient execution of AI models, and how they relate to Python.
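
As one concrete bridge from Python to the browser, a model can be exported to ONNX and then executed by a browser runtime such as onnxruntime-web. A hedged sketch with an invented toy model:

```python
# Hedged sketch: export a toy PyTorch model to ONNX for in-browser inference.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
model.eval()

# The resulting model.onnx file can be loaded by onnxruntime-web on the client.
torch.onnx.export(
    model,
    torch.randn(1, 16),  # example input that defines the graph shapes
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
)
```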

After the talk, you&apos;ll have a clear understanding of how to bring AI to the browser and unlock new possibilities for your applications. Join us to learn how to harness the power of AI and make it more accessible for everyone.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/KKWBKK/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/KKWBKK/feedback/</feedback_url>
            </event>
            <event guid='6ab2bd16-c161-5328-9968-24c2eeb21dcf' id='80770' code='GKFB3J'>
                <room>B05-B06</room>
                <title>How Digital David Wins Against Data Goliaths</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T11:20:00+02:00</date>
                <start>11:20</start>
                <duration>00:30</duration>
                <abstract>This talk introduces a new and innovative business model supported by a network of digital activists that form a collective force for protecting humanity, enabling digitally aware users to reclaim control over their data.</abstract>
                <slug>berlin2025-80770-how-digital-david-wins-against-data-goliaths</slug>
                <track>Education, Career &amp; Life</track>
                
                <persons>
                    <person id='82284'>Pawel Herman</person>
                </persons>
                <language>en</language>
                <description>After the era of Big Oil, we now live in the age of Big Data. Whoever controls the data, controls the world. Large tech companies lure users with free digital services - email, messaging, and social platforms - but the hidden cost is steep: loss of privacy, autonomy, and data sovereignty. While open-source solutions offer alternatives, their adoption often remains limited to IT specialists. In a world where convenience has become a subtle form of control, the pressing question emerges: Why hasn&#8217;t a broader movement for digital freedom taken hold, and is there a viable path forward?

This talk introduces a new and innovative business model supported by a network of digital activists that form a collective force for protecting humanity, enabling digitally aware users to reclaim control over their data. By combining innovative tools, thoughtful practices, and forward-looking approaches, we&#8217;ll show how digital gurus can become &#8220;Digital Davids,&#8221; standing up to the Data Goliaths and shaping a more sovereign digital future for their communities.</description>
                <recording>
                    <license></license>
                    <optout>true</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/GKFB3J/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/GKFB3J/feedback/</feedback_url>
            </event>
            <event guid='cd877ae3-725f-59c3-8f9e-b99b33383f94' id='77153' code='XE9F7X'>
                <room>B05-B06</room>
                <title>Flying Beyond Keywords: Our Aviation Semantic Search Journey</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:30</duration>
                <abstract>In aviation, search isn&#8217;t simple&#8212;people use abbreviations, slang, and technical terms that make exact matching tricky. We started with just Postgres, aiming for something that worked. Over time, we upgraded: semantic embeddings, reranking. We tackled filter complexity, slow index builds, and embedding updates and much more. Along the way, we learned a lot about making AI search fast, accurate, and actually usable for our users. It&#8217;s been a journey&#8212;full of turbulence, but worth the landing.</abstract>
                <slug>berlin2025-77153-flying-beyond-keywords-our-aviation-semantic-search-journey</slug>
                <track>Infrastructure - Hardware &amp; Cloud</track>
                
                <persons>
                    <person id='78678'>Dat Tran</person><person id='78682'>Dennis Schmidt</person>
                </persons>
                <language>en</language>
                <description>In aviation, search is anything but straightforward. Reports are written by humans&#8212;pilots, cabin crew, engineers&#8212;each using their own mix of abbreviations, technical jargon, and everyday language. Standard keyword search often falls short. You might miss critical safety signals because a pilot wrote &#8220;navigation didn&#8217;t work&#8221; instead of &#8220;gps jamming,&#8221; or used a shorthand unknown to engineers on the ground. What we needed was semantic search&#8212;something that understands meaning, not just matches strings.

But we started simple with a plain Postgres setup. Our goal: build something that works. We began with pgvector and basic sentence embeddings to enable semantic search inside Postgres. It was scrappy, but it gave us just enough traction to prove the value of semantic search in this domain.
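
A rough sketch of that starting point, assuming pgvector with psycopg and the sentence-transformers library (table, column, and connection details are invented):

```python
# Hedged sketch: semantic search inside Postgres with pgvector.
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("gps jamming")  # numpy array

with psycopg.connect("dbname=reports") as conn:
    register_vector(conn)  # adapt numpy arrays to the vector column type
    rows = conn.execute(
        # the &lt;=&gt; operator is cosine distance in pgvector
        "SELECT id, body FROM reports ORDER BY embedding &lt;=&gt; %s LIMIT 10",
        (query_vec,),
    ).fetchall()
```
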
Then things took off. As complexity grew, so did the need for better retrieval and smarter ranking. We restructured the system: upgraded to better sentence embeddings, and most importantly, added reranking using cross-encoders. This turned our search results from &#8220;kinda relevant&#8221; to &#8220;spot on.&#8221; We moved to OpenVINO to make reranking faster on the CPU, especially important since we deploy on AWS Lambda.

But the technical challenges didn&#8217;t stop there. We experimented with different pgvector index types&#8212;IVFFlat vs HNSW&#8212;and discovered surprising trade-offs in index build times and performance, especially under constrained RDS instances. Embedding updates became their own problem, so we built a parallel processing system using SQS and a tool we call &#8220;Cockpit&#8221; to manage recomputation.

On top of that, search in our world isn&apos;t a single step. We layer semantic retrieval with full-text filtering, structured filters (e.g., airport, aircraft type), and real-time inputs. This creates a multi-layered AI search pipeline that needs to feel snappy and reliable to end-users.

In this talk, we&#8217;ll walk through how we made this work with minimal ML infrastructure, how we evolved from an MVP to a robust system, and what tools made the biggest difference&#8212;from tokenization strategies and inference optimizations to batching tricks and search composition patterns. You&#8217;ll also hear the gritty details: bottlenecks between tokenization and inference, indexing challenges, and lessons from building this in production for a safety-critical industry.

This talk is for folks who want to leverage Postgres for hybrid search as well. It&#8217;s for anyone who has ever duct-taped search with SQL and wondered how to take the next step. We&#8217;ll keep it real, share what we did, and reflect on what we&#8217;d do differently next time.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/XE9F7X/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/XE9F7X/feedback/</feedback_url>
            </event>
            <event guid='55eb68f9-517d-5fbb-9579-38abf5af7f94' id='77524' code='FDBZSR'>
                <room>B05-B06</room>
                <title>When Postgres is enough: solving document storage, pub/sub and distributed queues without more tools</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T13:40:00+02:00</date>
                <start>13:40</start>
                <duration>00:30</duration>
                <abstract>When a new requirement appears, whether it&apos;s document storage, pub/sub messaging, distributed queues, or even full-text search, Postgres can often handle it without introducing more infrastructure.

This talk explores how to leverage Postgres&apos; native features like JSONB, LISTEN/NOTIFY, queueing patterns and vector extensions to build robust, scalable systems without increasing infrastructure complexity. 

You&apos;ll learn practical patterns that extend Postgres just far enough, keeping systems simpler, more maintainable, and easier to operate, especially in small to medium projects or freelancing setups, where Postgres often already forms a critical part of the stack.

Postgres might not replace everything forever - but it can often get you much further than you think.</abstract>
                <slug>berlin2025-77524-when-postgres-is-enough-solving-document-storage-pub-sub-and-distributed-queues-without-more-tools</slug>
                <track>Data Handling &amp; Engineering</track>
                
                <persons>
                    <person id='78645'>Eugen Geist</person>
                </persons>
                <language>en</language>
                <description>When building modern systems, it&apos;s easy to reach for specialized tools as new requirements pop up: a document store like MongoDB for flexible schemas, Kafka for pub/sub, Redis for distributed queuing, or Weaviate for storing vectors.

But what if you could meet many of these needs by simply extending the Postgres database you likely already have?

In this talk, we&#8217;ll explore how Postgres&apos; powerful native features such as:
- JSONB for document storage
- LISTEN/NOTIFY for pub/sub messaging 
- SELECT FOR UPDATE SKIP LOCKED for queueing
- an extension for vectors 

can be used to solve real-world problems without introducing new infrastructure.
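
For instance, a minimal pub/sub sketch with psycopg 3 (channel name and connection string are invented; a queueing sketch follows further below):

```python
# Hedged sketch: a subscriber blocking on Postgres LISTEN/NOTIFY.
import psycopg

with psycopg.connect("dbname=app", autocommit=True) as conn:
    conn.execute("LISTEN events")
    for note in conn.notifies():  # blocks until a NOTIFY arrives
        print("received:", note.payload)
        break  # handle a single message in this demo

# A publisher on another connection would run:
#     conn.execute("NOTIFY events, 'hello'")
```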

Throughout the talk, we&#8217;ll walk through practical code examples in Python and SQL to show exactly how these patterns can be implemented in real projects.

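And a hedged sketch of the queue-consumer pattern, again with psycopg 3 (the jobs table is invented):

```python
# Hedged sketch: claim one pending job; SKIP LOCKED lets many workers
# poll the same table without blocking on each other.
import psycopg

def claim_next_job(conn: psycopg.Connection):
    with conn.transaction():
        row = conn.execute(
            """
            SELECT id, payload FROM jobs
            WHERE status = 'pending'
            ORDER BY id
            FOR UPDATE SKIP LOCKED
            LIMIT 1
            """
        ).fetchone()
        if row is None:
            return None  # queue empty or all pending rows are claimed
        job_id, payload = row
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = %s", (job_id,))
        return payload
```
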
The goal isn&#8217;t to suggest that Postgres replaces purpose-built tools like Kafka, Redis, or MongoDB forever. Specialized systems still have their place, especially at larger scales. However, by reusing Postgres intelligently, you can delay these decisions until they are truly necessary, keeping your system simpler, easier to operate, and more maintainable in the meantime.

Especially for freelancers, startups, and small teams, reducing system complexity early on means faster iteration, fewer operational headaches, and lower costs. And since Postgres is already present in most modern tech stacks, these capabilities are often just a few SQL queries away.

## Outline

1. **Introduction**: re-using existing infrastructure instead of introducing new systems, so you can stay focused on solving problems
1. **Pub/Sub with Postgres**: messaging between services using LISTEN/NOTIFY
1. **Queuing with Postgres**: building distributed queues with SELECT FOR UPDATE SKIP LOCKED
1. **Document Storage with Postgres**: handling flexible schemas and semi-structured data using JSONB
1. **Conclusion**: when re-using Postgres makes sense - and when specialized systems are needed
1. **Bonus: storing vectors with Postgres for your AI workloads**: adding efficient vector functionality by installing an extension</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://github.com/e-geist/when_postgres_is_enough">Slides + Examples</link>
                </links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/FDBZSR/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/FDBZSR/feedback/</feedback_url>
            </event>
            <event guid='fb048691-4903-52f7-b54e-31f5c3f009e8' id='80840' code='WWSZKY'>
                <room>B05-B06</room>
                <title>Scraping urban mobility: analysis of Berlin carsharing</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2025-09-03T14:20:00+02:00</date>
                <start>14:20</start>
                <duration>00:30</duration>
                <abstract>Free-floating carsharing systems struggle to balance vehicle supply and demand, which often results in inefficient fleet distribution and reduced vehicle utilization. This talk explores how data scraping can be used to model vehicle demand and user behavior, enabling targeted incentives to encourage self-balancing vehicle flows.

Using information scraped from a major mobility provider over multiple months, the presentation provides spatiotemporal analyses and machine learning results to determine whether it&apos;s practically possible to offer low-friction discounts that lead to improved fleet balance.</abstract>
                <slug>berlin2025-80840-scraping-urban-mobility-analysis-of-berlin-carsharing</slug>
                <track></track>
                
                <persons>
                    <person id='82359'>Florian K&#246;nig</person>
                </persons>
                <language>en</language>
                <description>You&apos;ll see the hidden patterns that carsharing data reveals when contextualized with urban information. Comprehensive data visualizations demonstrate the impact of area use, traffic, and weather conditions across the city.
Building on existing research, the talk will present opportunities that arise from including user data in the equation and offer starting points for additional predictors. 

This session is ideal for data scientists interested in urban analytics, transportation modeling, or real-world applications of predictive modeling in mobility systems.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://cfp.pydata.org/berlin2025/talk/WWSZKY/</url>
                <feedback_url>https://cfp.pydata.org/berlin2025/talk/WWSZKY/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    
</schedule>
