{"$schema": "https://c3voc.de/schedule/schema.json", "generator": {"name": "pretalx", "version": "2026.1.0.dev0"}, "schedule": {"url": "https://cfp.pydata.org/berlin2025/schedule/", "version": "0.20", "base_url": "https://cfp.pydata.org", "conference": {"acronym": "berlin2025", "title": "PyData Berlin 2025", "start": "2025-09-01", "end": "2025-09-03", "daysCount": 3, "timeslot_duration": "00:05", "time_zone_name": "Europe/Berlin", "colors": {"primary": "#4c9cb4"}, "rooms": [{"name": "Kuppelsaal", "slug": "4560-kuppelsaal", "guid": "a413bdae-4730-5a9d-8aa1-045579ce1087", "description": "upper floor", "capacity": 800}, {"name": "B09", "slug": "4561-b09", "guid": "844f8596-e84f-5029-b709-8892c0fca5c3", "description": "room for tutorials", "capacity": null}, {"name": "B07-B08", "slug": "4562-b07-b08", "guid": "e7ecef66-8ce7-51e4-9629-123a47fb4391", "description": null, "capacity": null}, {"name": "B05-B06", "slug": "4563-b05-b06", "guid": "aae589d8-5d0f-5d2f-8c55-720e32dc637e", "description": null, "capacity": null}], "tracks": [{"name": "Data Handling & Engineering", "slug": "6077-data-handling-engineering", "color": "#000000"}, {"name": "Natural Language Processing & Audio (incl. Generative AI NLP)", "slug": "6078-natural-language-processing-audio-incl-generative-ai-nlp", "color": "#000000"}, {"name": "Computer Vision (incl. 
Generative AI CV)", "slug": "6079-computer-vision-incl-generative-ai-cv", "color": "#000000"}, {"name": "Generative AI", "slug": "6080-generative-ai", "color": "#000000"}, {"name": "Embedded Systems & Robotics", "slug": "6081-embedded-systems-robotics", "color": "#000000"}, {"name": "PyData & Scientific Libraries Stack", "slug": "6082-pydata-scientific-libraries-stack", "color": "#000000"}, {"name": "Visualisation & Jupyter", "slug": "6083-visualisation-jupyter", "color": "#000000"}, {"name": "Community & Diversity", "slug": "6084-community-diversity", "color": "#000000"}, {"name": "Education, Career & Life", "slug": "6085-education-career-life", "color": "#000000"}, {"name": "Infrastructure - Hardware & Cloud", "slug": "6086-infrastructure-hardware-cloud", "color": "#000000"}, {"name": "Ethics & Privacy", "slug": "6087-ethics-privacy", "color": "#000000"}, {"name": "Lightning Talks", "slug": "6296-lightning-talks", "color": "#000000"}], "days": [{"index": 1, "date": "2025-09-01", "day_start": "2025-09-01T04:00:00+02:00", "day_end": "2025-09-02T03:59:00+02:00", "rooms": {"Kuppelsaal": [{"guid": "c46c1a84-276f-5036-baf5-8d7e69a9ed42", "code": "YF3MVA", "id": 80988, "logo": null, "date": "2025-09-01T09:00:00+02:00", "start": "09:00", "duration": "00:20", "room": "Kuppelsaal", "slug": "berlin2025-80988-opening-session", "url": "https://cfp.pydata.org/berlin2025/talk/YF3MVA/", "title": "Opening Session", "subtitle": "", "track": null, "type": "Plenary Session [Organizers]", "language": "en", "abstract": "Opening Session for PyData Berlin 2025", "description": "Opening Session for PyData Berlin 2025", "recording_license": "", "do_not_record": false, "persons": [], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/YF3MVA/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/YF3MVA/", "attachments": []}, {"guid": "1f3fff55-d06f-553e-a60e-551b68821ef5", "code": "HYGHBG", "id": 77339, "logo": null, "date": "2025-09-01T09:20:00+02:00", "start": 
"09:20", "duration": "00:50", "room": "Kuppelsaal", "slug": "berlin2025-77339-pydata-2077-a-data-science-future-retrospective", "url": "https://cfp.pydata.org/berlin2025/talk/HYGHBG/", "title": "PyData 2077: a data science future retrospective", "subtitle": "", "track": "Education, Career & Life", "type": "Keynote", "language": "en", "abstract": "From: Chrono-Regulatory Commission, Temporal Enforcement Division\r\nTo: PyData Berlin Organising Committee\r\nSubject: Citation #TMP-2077-091 - Unauthorised Spacetime Disturbance\r\n\r\nDear Committee,\r\nOur temporal monitoring systems have detected an unauthorised chronological anomaly emanating from your facility (Berliner Congress Center, coordinates 52.52068\u00b0N, 13.416451\u00b0E) scheduled to manifest on September 1st at 9:20 a.m.", "description": "VIOLATION DETAILS:\r\n- Unauthorized temporal incursion detected\r\n- Speakers identified as: Kitchen, A. & Summers, L. (baseline timeline)\r\n- Anomalous data signatures suggest retrospective analysis from non-contemporaneous source\r\n- Evidence of information leakage: late 21st-century technological practices and standards\r\n- Risk assessment: Moderate timeline contamination potential\r\n\r\nREGULATORY COMPLIANCE REQUIRED:\r\nPer Temporal Code Section 2077.3, you are hereby notified that failure to contain this spacetime disturbance will result in fines of up to 50,000 temporal credits. You must ensure adequate attendance at the specified coordinates to properly observe and contain the anomaly as it unfolds.\r\nWARNING: Preliminary scans indicate the transmission contains advanced analytical frameworks and critical commentary on primitive early-21st-century data science practices. 
Attendees may experience paradigm shifts, changes to mental models, or sudden clarity regarding field trajectories.\r\n\r\nSincerely,\r\nCompliance Officer Z-7749\r\nChrono-Regulatory Commission\r\n\"Keeping Yesterday Safe for Tomorrow\"", "recording_license": "", "do_not_record": false, "persons": [{"code": "3QJLK8", "name": "Laura Summers", "avatar": "https://cfp.pydata.org/media/avatars/3QJLK8_8WJj3Ud.webp", "biography": "Laura is a very technical designer\u2122\ufe0f, working at  Pydantic as Lead Design Engineer. Her side projects include Sweet Summer Child Score (summerchild.dev) and Ethics Litmus Tests (ethical-litmus.site). Laura is passionate about feminism, digital rights and designing for privacy. She speaks, writes and runs workshops at the intersection of design and technology.", "public_name": "Laura Summers", "guid": "0d79c925-8220-556f-b80f-b758c187a5d1", "url": "https://cfp.pydata.org/berlin2025/speaker/3QJLK8/"}, {"code": "Y3GHEB", "name": "Andy Kitchen", "avatar": "https://cfp.pydata.org/media/avatars/Y3GHEB_oIDflAH.webp", "biography": "Andy Kitchen is a hacker, startup founder and AI/Neuroscience researcher. 
Let's grab a beer and talk about philosophy, computer science and society (and science fiction while we're at it!)", "public_name": "Andy Kitchen", "guid": "3d850e4c-f6e5-588b-8700-0a4a6f2e6242", "url": "https://cfp.pydata.org/berlin2025/speaker/Y3GHEB/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/HYGHBG/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/HYGHBG/", "attachments": []}], "B09": [{"guid": "8b6e927f-6ee7-5dd5-a3fa-42a917d27515", "code": "GRZ3RG", "id": 77507, "logo": null, "date": "2025-09-01T10:40:00+02:00", "start": "10:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-77507-a-beginner-s-guide-to-state-space-modeling", "url": "https://cfp.pydata.org/berlin2025/talk/GRZ3RG/", "title": "A Beginner's Guide to State Space Modeling", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Tutorial", "language": "en", "abstract": "**State Space Models** (SSMs) are powerful tools for time series analysis, widely used in finance, economics, ecology, and engineering. They allow researchers to encode structural behavior into time series models, including *trends*, *seasonality*, *autoregression*, and *irregular fluctuations*, to name just a few. Many workhorse time series models, including ARIMA, VAR, and ETS, are special cases of the general statespace framework.  \r\n\r\nIn this practical, hands-on tutorial, attendees will **learn how to leverage PyMC's new state-space modeling** capabilities (`pymc_extras.statespace`) to build, fit, and interpret Bayesian state space models.\r\n\r\nStarting from fundamental concepts, we'll **explore several real-world use cases**, demonstrating how SSMs help tackle common time series challenges, such as handling missing observations, integrating external regressors, and generating forecasts.", "description": "State Space Models offer **a structured yet flexible framework for time series analysis**. 
They elegantly handle latent processes like trends, seasonality, and noisy observations, making them particularly valuable in real-world applications.\r\n\r\nWe'll start with a brief overview of the theory behind SSMs, followed by practical examples where participants will:\r\n\r\n- **Understand the components of SSMs**, including observation and state equations.\r\n- **Learn how to specify and fit SSMs** using PyMC's state space module.\r\n- Implement a **modeling workflow using a survey data example**, showing how to use SSMs to model the data and generate predictions.\r\n- **Explore advanced topics** such as incorporating external regressors, generating forecasts or building custom models.\r\n\r\n### Target Audience\r\nThis tutorial is aimed at data scientists, statisticians, and data analysts with a basic understanding of statistics and Python, who are interested in expanding their toolkit with Bayesian time series methods. Prior experience with PyMC is not required but will be beneficial.\r\n\r\n### Takeaways\r\n\r\nBy the end of this tutorial, attendees will:\r\n\r\n- Understand the **theoretical foundations** of State Space Models.\r\n- Be able to **implement common SSMs** (local level, trend, and seasonal models) in PyMC.\r\n- **Evaluate and interpret** Bayesian state space models using PyMC.\r\n- **Appreciate practical scenarios** where SSMs outperform traditional time series approaches. \r\n\r\n### Background Knowledge Required\r\nBasic understanding of probability and statistics, and familiarity with Python. 
Prior experience with PyMC is not required but will be beneficial.\r\n\r\n### Materials Distribution\r\nAll tutorial materials, including notebooks and datasets, will be made available via a GitHub repository.\r\n\r\n## Outline\r\n\r\n**0 - 10 min: Introduction to State Space Models**\r\n\r\n- What are SSMs, and why use them?\r\n\r\n**10 - 25 min: State Space Model Fundamentals**\r\n\r\n- Observation and state equations.\r\n- Latent states, Kalman filters, and smoothing in Bayesian frameworks.\r\n\r\n**25 - 55 min: Implementing SSMs with PyMC (Hands-On)**\r\n\r\n- Setting up a local-level model in PyMC.\r\n- Extending models to incorporate trends and seasonality.\r\n- Posterior inference: interpreting results and uncertainty.\r\n\r\n**55 - 75 min: Advanced State Space Modeling (Hands-On)**\r\n\r\n- Dealing with missing data and irregular intervals.\r\n- Adding external covariates (regression components).\r\n- Model diagnostics and posterior predictive checks.\r\n\r\n**75 - 85 min: Real-world Application Case Study**\r\n\r\n- Demonstrating an end-to-end modeling example with real data.\r\n- Discussing best practices for practical time series modeling.\r\n\r\n**85 - 90 min: Wrap-up and Interactive Q&A**\r\n\r\n- Open floor for questions and further resources.\r\n\r\n---\r\n\r\n## Additional Resources\r\n\r\n- [Introduction to PyMC state space module](https://www.youtube.com/watch?v=G9VWXZdbtKQ)\r\n- [Podcast episode on PyMC's state space module](https://learnbayesstats.com/episode/124-state-space-models-structural-time-series-jesse-grabowski)\r\n- [PyMC State Space Module GitHub Repository](https://github.com/pymc-devs/pymc-extras/tree/main/pymc_extras/statespace)\r\n\r\nWe believe this tutorial will empower participants with practical knowledge of state space modeling in PyMC, enabling them to effectively analyze complex time series data using Bayesian approaches.", "recording_license": "", "do_not_record": false, "persons": [{"code": "8ZGVGR", "name": "Jesse 
Grabowski", "avatar": "https://cfp.pydata.org/media/avatars/8ZGVGR_foAcL4V.webp", "biography": "Jesse Grabowski is a PhD candidate at Paris 1 Pantheon-Sorbonne. He is also a principal data scientist at PyMC labs, and a core developer of PyMC, Pytensor, and related packages. His area of research includes time series modeling, macroeconomics, and finance.", "public_name": "Jesse Grabowski", "guid": "b17eb38d-f6ff-57ce-996a-802be0ab0384", "url": "https://cfp.pydata.org/berlin2025/speaker/8ZGVGR/"}, {"code": "7HJPXF", "name": "Alexandre Andorra", "avatar": "https://cfp.pydata.org/media/avatars/7HJPXF_4N7UxI9.webp", "biography": "\u26be Senior Applied Scientist\r\n\ud83c\udf99\ufe0f Creator @ LearnBayesStats Podcast\r\n\ud83d\udcca Cofounder @ PyMC Labs\r\n\ud83d\udc68\u200d\ud83c\udfeb Teacher @ Intuitive Bayes", "public_name": "Alexandre Andorra", "guid": "e0a2c782-4f00-5eee-a7c3-60e78b2b9e46", "url": "https://cfp.pydata.org/berlin2025/speaker/7HJPXF/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GRZ3RG/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GRZ3RG/", "attachments": []}, {"guid": "d2d2dc16-9e10-5c3d-8a16-76a30afd6b0e", "code": "MQS99P", "id": 80782, "logo": null, "date": "2025-09-01T12:30:00+02:00", "start": "12:30", "duration": "01:00", "room": "B09", "slug": "berlin2025-80782-pyladies-empowered-in-tech-lunch", "url": "https://cfp.pydata.org/berlin2025/talk/MQS99P/", "title": "PyLadies & Empowered in Tech Lunch", "subtitle": "", "track": "Community & Diversity", "type": "Social Event", "language": "en", "abstract": "Join PyLadies & Empowered in Tech for a special lunch event aimed at fostering community. Enjoy meaningful conversations and networking opportunities.", "description": "**PyLadies** is an international mentorship group with a focus on helping more women and gender non-conforming people become active participants and leaders in the Python open-source community. 
Its mission is to promote, educate and advance a diverse Python community through outreach, education, conferences, events and social gatherings.\r\n\r\n---\r\n\r\n**Empowered in Tech** is a community in Berlin dedicated to empowering FLINTA (women, lesbians, intersex, non-binary, trans and agender) people to excel in their tech journey. We welcome engineers, software developers, data scientists, designers, product managers, career changers and other professionals in the tech industry. We are open to all tech stacks, programming languages and experience levels. Our goal is to support our members in growing their careers, connecting with like-minded people and feeling welcome in tech.", "recording_license": "", "do_not_record": false, "persons": [], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/MQS99P/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/MQS99P/", "attachments": []}, {"guid": "50936293-ab4b-5562-a131-3ef9be680c2b", "code": "WXPVCS", "id": 77715, "logo": null, "date": "2025-09-01T13:40:00+02:00", "start": "13:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-77715-more-than-dataframes-data-pipelines-with-the-swiss-army-knife-duckdb", "url": "https://cfp.pydata.org/berlin2025/talk/WXPVCS/", "title": "More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB", "subtitle": "", "track": "Data Handling & Engineering", "type": "Tutorial", "language": "en", "abstract": "Most Python developers reach for Pandas or Polars when working with tabular data\u2014but DuckDB offers a powerful alternative that\u2019s more than just another DataFrame library. In this tutorial, you\u2019ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL\u2014all without leaving Python. We\u2019ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. 
You\u2019ll leave with a solid mental model for using DuckDB effectively as the \u201cSQLite for analytics.\u201d", "description": "The goal of this tutorial is to help Python users understand and use DuckDB not just as a DataFrame interface, but as a fully featured analytics database embedded in their Python workflows. We'll highlight real-world patterns where DuckDB shines compared to traditional libraries, especially for medium-scale datasets that don\u2019t justify a full data warehouse.\r\nYou\u2019ll learn:\r\n- When and why to reach for DuckDB instead of Pandas/Polars\r\n- How DuckDB handles local files (CSV, Parquet, JSON, Postgres database, and more)\r\n- Using DuckDB to build lightweight, SQL-based data pipelines\r\n- Techniques for caching intermediate data in-process\r\n- How to analyze data from remote sources via HTTP or S3\r\n- Tips for using DuckDB with Jupyter, dbt, or your favorite Python tools", "recording_license": "", "do_not_record": false, "persons": [{"code": "MPYCX8", "name": "Mehdi Ouazza", "avatar": "https://cfp.pydata.org/media/avatars/MPYCX8_yYr02Wg.webp", "biography": "I'm Mehdi, also known as mehdio, a data enthusiast with nearly a decade of experience in data engineering for companies of all sizes. I'm not your average data guy\u2014I inject humor and fun into my work to make complex topics easier to digest. 
When I'm not actively contributing to the data community through my blog, YouTube, and social media, you can find me off-beat, marching to the beat of my own data drum.\r\n\r\nRecently, I joined MotherDuck as a developer advocate, where I bring my data engineering expertise to supercharge DuckDB.", "public_name": "Mehdi Ouazza", "guid": "82c1138f-768a-5981-8a46-b4c95f5f852a", "url": "https://cfp.pydata.org/berlin2025/speaker/MPYCX8/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/WXPVCS/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/WXPVCS/", "attachments": []}, {"guid": "8d2c5b36-f666-52f6-a032-c7696f7c21fa", "code": "XFPTWN", "id": 79732, "logo": null, "date": "2025-09-01T15:40:00+02:00", "start": "15:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-79732-ai-ready-data-in-action-powering-smarter-agents", "url": "https://cfp.pydata.org/berlin2025/talk/XFPTWN/", "title": "AI-Ready Data in Action: Powering Smarter Agents", "subtitle": "", "track": "Data Handling & Engineering", "type": "Tutorial", "language": "en", "abstract": "This hands-on workshop focuses on what AI engineers do most often: making data AI-ready and turning it into production-useful applications. Together with dltHub and LanceDB, you\u2019ll walk through an end-to-end workflow: collecting and preparing real-world data with best practices, managing it in LanceDB, and powering AI applications with search, filters, hybrid retrieval, and lightweight agents. By the end, you\u2019ll know how to move from raw data to functional, production-ready AI setups without the usual friction. We will touch upon multi-modal data and going to production with this end-to-end use case.", "description": "Modern AI applications are only as powerful as the data that fuels them. Yet, much of the real-world data AI engineers encounter is messy, incomplete, or unoptimized. 
In this hands-on tutorial, AI-Ready Data in Action: Powering Smarter Agents, participants will walk through the full lifecycle of preparing unstructured data, embedding it into LanceDB, and leveraging it for search and agentic applications. Using a real-world dataset, attendees will incrementally ingest, clean, and vectorize text data, tune hybrid search strategies, and build a lightweight chat agent to surface relevant results. The tutorial concludes by showing how to take a working demo into production. By the end, participants will gain practical experience in bridging the gap between messy raw data and production-ready pipelines for AI applications.\r\n\r\n**Prior knowledge**\r\n\r\n- Basic Python programming.\r\n- Awareness of embeddings, vectors, and AI search concepts (we\u2019ll explain where needed).\r\n\r\nThe tutorial is designed to be accessible: engineers familiar with Python should be able to follow along step by step.\r\n\r\n**Key Takeaways**\r\n\r\nBy the end of the tutorial, participants will:\r\n\r\n1. Understand the end-to-end workflow of taking raw, real-world data and preparing it for AI applications.\r\n2. Build and run an incremental dlt pipeline to ingest real data into LanceDB.\r\n3. Apply text preprocessing and generate embeddings for semantic search.\r\n4. Optimize retrieval with vector and hybrid search strategies.\r\n5. Implement a lightweight AI agent capable of surfacing relevant issues from a natural language description.\r\n6. 
Learn how to transition from a demo project to a production setup using LanceDB Cloud.\r\n\r\n**Outline**\r\n\r\n- Introduce dlt (data load tool) and how it enables schema evolution, incremental loading, and normalization in pipelines.\r\n- Introduce LanceDB and explain embeddings, vector search, hybrid retrieval and multi-modal data for AI applications.\r\n- Ingest and preprocess a real dataset with dlt, generate embeddings, and load it into LanceDB following best data engineering practices.\r\n- Optimize search in LanceDB by tuning parameters, selecting distance metrics, and adding hybrid retrieval.\r\n- Build a lightweight AI agent that queries LanceDB and returns the most relevant issues from natural-language prompts.\r\n- Demonstrate the path to production using automation, monitoring, and LanceDB Cloud for scaling and reliability.\r\n- Conclude with key takeaways and an open Q&A.", "recording_license": "", "do_not_record": false, "persons": [{"code": "3RZLNH", "name": "Violetta Mishechkina", "avatar": "https://cfp.pydata.org/media/avatars/3RZLNH_1ep4qi2.webp", "biography": "Violetta Mishechkina leads Solutions Engineering at dltHub, helping teams build AI-ready data pipelines using the open-source library dlt. With a background in ML and MLOps, she focuses on turning messy, real-world data into reliable inputs for production systems. 
Over the past few years, Violetta has led several workshops on AI and data engineering, sharing practical insights with data teams across industries.", "public_name": "Violetta Mishechkina", "guid": "55628046-39dd-5259-89a2-0b69ae00a714", "url": "https://cfp.pydata.org/berlin2025/speaker/3RZLNH/"}, {"code": "UHZJTM", "name": "Chang She", "avatar": "https://cfp.pydata.org/media/avatars/UHZJTM_AGqVsWb.webp", "biography": "Chang is the CEO/Co-founder of LanceDB and has been making data tooling for ML/AI for almost two decades.\r\nOne of the original co-authors of the pandas project, Chang started LanceDB to make it easy for AI teams to work with all of the data that doesn't fit neatly into all of those dataframes - from embeddings to images, from audio to video, at petabyte scale.", "public_name": "Chang She", "guid": "83f9431b-58da-5112-9313-682fef19e066", "url": "https://cfp.pydata.org/berlin2025/speaker/UHZJTM/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/XFPTWN/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/XFPTWN/", "attachments": []}], "B07-B08": [{"guid": "feef6de7-3152-54fb-b808-efe6ba927a27", "code": "VBCU9H", "id": 77772, "logo": null, "date": "2025-09-01T10:40:00+02:00", "start": "10:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77772-beyond-linear-funnels-visualizing-conditional-user-journeys-with-python", "url": "https://cfp.pydata.org/berlin2025/talk/VBCU9H/", "title": "Beyond Linear Funnels: Visualizing Conditional User Journeys with Python", "subtitle": "", "track": "Visualisation & Jupyter", "type": "Talk", "language": "en", "abstract": "Optimizing user funnels is a common task for data analysts and data scientists. In the real world, funnels are not always linear: often, the next step depends on earlier responses or actions. This results in complex funnels that can be tricky to analyze. 
I\u2019ll introduce an open-source Python library I developed that analyzes and visualizes non-linear, conditional funnels by utilizing Graphviz and Streamlit. It calculates conversion rates, drop-offs, time spent on each step, and highlights bottlenecks by color. Attendees will learn how to quickly explore complex user journeys and generate insightful funnel data.", "description": "When we talk about funnels in analytics, most people think of linear funnels, where users move step-by-step through a fixed sequence of actions. But in real-world applications like dynamic forms, onboarding flows, or diagnostic tools, funnels are often conditional and non-linear. The next step in the journey depends on user input at earlier stages, leading to different paths and variable funnel lengths for every user.\r\n\r\nAn example is a vehicle pricing tool: while all users answer general questions (e.g., type, mileage), follow-up questions may differ based on previous answers. For instance, only users with electric cars are asked about battery capacity. This branching logic creates challenges for traditional funnel visualization techniques, which mostly treat funnels as linear.\r\n\r\nReadily available alternatives are not perfect:\r\nVisuals like Sankey diagrams are too limited/general and often visually collapse under real-world data messiness (users going back and forth, drop-offs, missing events).\r\nMilestone-based funnels (where you set a few milestones during the funnel to mimic linear funnels) simplify things too much, hiding key details and masking where things actually break down.\r\n\r\nAs a data analyst, I needed a way to understand and visualize such nonlinear flows in a more straightforward and consumable way. 
Finding no library that met this need out of the box, I created funnelius, a Python library that processes raw event logs into ready-to-consume funnel graphs.\r\n\r\nThe library accepts a pandas DataFrame with user_id, action and action_timestamp columns. It uses pandas to transform the DataFrame into a format suitable for Graphviz, adding the columns needed to filter and declutter the graph. It then visualizes the funnel using the dot rendering engine, which includes:\r\n- Calculating key metrics for every step: number of users per step, conversion rates, time spent, percentage of total users and drop-offs.\r\n- Conditional formatting based on different metrics to highlight bottlenecks.\r\n- Comparison with another DataFrame, showing changes.\r\n- Showing the answers that users gave at each step and calculating the percentage of each answer at every step.\r\n\r\nThe graph can be fine-tuned with options such as:\r\n- Only show top-N routes to declutter the graph\r\n- Show/hide dropped-user data\r\n- Only include users who started from specific steps (if users must start at specific steps, this helps remove possible data issues)\r\n- Define which metrics should be calculated\r\n\r\nThere is also a Streamlit-based UI to interactively adjust parameters and export the funnel analysis as a PDF instead of doing it programmatically.\r\n\r\nThis tool can be helpful for data analysts and data scientists with Python knowledge who need to analyse conditional funnels.\r\n\r\nGitHub Repository:\r\nhttps://github.com/yaseenesmaeelpour/funnelius", "recording_license": "", "do_not_record": false, "persons": [{"code": "ZJDVCQ", "name": "Yaseen Esmaeelpour", "avatar": "https://cfp.pydata.org/media/avatars/ZJDVCQ_UYEqf2m.webp", "biography": "I am a data analyst with experience in various sectors including tech and supply chain. 
I am also a hobby programmer and like to spend my spare time working on cool personal projects.", "public_name": "Yaseen Esmaeelpour", "guid": "0d02f12a-913b-5680-a9c9-6bf82dabf6b7", "url": "https://cfp.pydata.org/berlin2025/speaker/ZJDVCQ/"}], "links": [{"title": "Github repo", "url": "https://github.com/yaseenesmaeelpour/funnelius", "type": "related"}, {"title": "PYPI page", "url": "https://pypi.org/project/funnelius/0.0.1/", "type": "related"}], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/VBCU9H/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/VBCU9H/", "attachments": [{"title": "Screenshot 1", "url": "/media/berlin2025/submissions/VBCU9H/resources/Screenshot_FWR3_uIzcdG8.png", "type": "related"}, {"title": "Screenshot 2", "url": "/media/berlin2025/submissions/VBCU9H/resources/Screenshot_From_18wYCC4.png", "type": "related"}, {"title": "Slides", "url": "/media/berlin2025/submissions/VBCU9H/resources/funnelius_slide_19iyllf.pdf", "type": "related"}]}, {"guid": "d79f139d-d068-58f2-9d78-3c96f1e8a89d", "code": "QMPX9V", "id": 77698, "logo": null, "date": "2025-09-01T11:20:00+02:00", "start": "11:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77698-democratizing-digital-maps-how-protomaps-changes-the-game", "url": "https://cfp.pydata.org/berlin2025/talk/QMPX9V/", "title": "Democratizing Digital Maps: How Protomaps Changes the Game", "subtitle": "", "track": "Visualisation & Jupyter", "type": "Talk", "language": "en", "abstract": "Digital mapping has long been dominated by commercial providers, creating barriers of cost, complexity, and privacy concerns. This talk introduces Protomaps, an open-source project that reimagines how web maps are delivered and consumed. Using the innovative PMTiles format \u2013 a single-file approach to vector tiles \u2013 Protomaps eliminates complex server infrastructure while reducing bandwidth usage and improving performance. 
We'll explore how this technology democratizes cartography by making self-hosted maps accessible without API keys, usage quotas, or recurring costs. The presentation will demonstrate implementations with Leaflet and MapLibre, showcase customization options, and highlight cases where Protomaps enables privacy-conscious, offline-capable mapping solutions. Discover how this technology puts mapping control back in the hands of developers while maintaining the rich experiences modern applications demand.", "description": "In today\u2019s digital landscape, maps have become essential components of countless applications and services, from navigation and logistics to social platforms and data visualization. But for too long, the field has been dominated by a few companies whose services, while powerful, come with significant drawbacks: Usage quotas, tracking requirements, styling limitations, and recurring costs that can quickly skyrocket as applications grow.\r\n\r\nThis talk will introduce Protomaps, an innovative open source mapping technology that is fundamentally reshaping the way digital maps are created, distributed and used. At its core, Protomaps utilizes the groundbreaking PMTiles format \u2013 a single-file approach to vector tiles that eliminates the need for complex tile server infrastructure while increasing performance and reducing bandwidth consumption.\r\n\r\n#### Technical innovation\r\n\r\nWe start with the technical basics of protomaps and explain how the PMTiles format works and why it represents such a significant advance over conventional tile map approaches. 
Unlike conventional solutions that rely on thousands of individual tile files provided by a complex infrastructure, PMTiles bundles vector map data into a single, efficiently indexed file that can be hosted anywhere.\r\nThe presentation will demonstrate how this approach enables progressive loading, allowing maps to render quickly at variable zoom levels while preserving the rich detail and interactive capabilities users expect from modern mapping solutions. We\u2019ll examine the efficiency gains in terms of bandwidth usage, server requirements, and client-side rendering performance.\r\n\r\n#### Democratization in Practice\r\n\r\nThis talk will focus on how Protomaps democratizes digital mapping in a tangible way:\r\n\r\n##### Economic Accessibility\r\n\r\nBy eliminating recurring API costs and usage-based pricing models, Protomaps opens up mapping opportunities for projects of all sizes, from hobby developers to non-profit organizations and educational institutions with limited budgets.\r\n\r\n##### Technical Accessibility\r\n\r\nWe demonstrate practical implementations with Leaflet and MapLibre GL and show how developers can integrate Protomaps with just a few lines of code and minimal configuration.\r\n\r\n##### Customization Freedom\r\n\r\nWithout the styling restrictions imposed by commercial vendors, Protomaps allows complete creative control over the appearance of the map. 
We show examples of customized maps that would be difficult or impossible to achieve with traditional services.\r\n\r\n##### Privacy by Design\r\n\r\nAs Protomaps enables fully self-hosted mapping solutions, there is no need to share user location data or mapping activity with third parties \u2013 a crucial aspect for privacy-conscious applications and those operating under strict regulatory frameworks.\r\n\r\n#### Takeaways for Attendees\r\n\r\nParticipants will leave this session with:\r\n\r\n* An understanding of how PMTiles and Protomaps work\r\n* The knowledge to use Protomaps in their own projects\r\n* The ability to customize maps to meet specific design and data needs\r\n* A new perspective on the possibilities of democratized digital mapping\r\n\r\nWhether you are a developer seeking cost-effective mapping solutions, an organization concerned about data privacy, or simply interested in the evolution of open source geospatial technology, this talk will give you valuable insight into how Protomaps is reshaping the landscape of digital cartography by putting powerful mapping capabilities back into the hands of developers and communities.", "recording_license": "", "do_not_record": false, "persons": [{"code": "EZ3Z3G", "name": "Veit Schiele", "avatar": "https://cfp.pydata.org/media/avatars/EZ3Z3G_Lv6UGty.webp", "biography": "Veit Schiele is a German IT expert and entrepreneur best known as the founder and CEO of cusy GmbH, a company focused on bridging the gap between software engineering and data science, developing robust, reproducible and scalable solutions for data analysis and visualization. He is also an experienced trainer who has authored tutorials on Python for data scientists and is known for his work in scientific programming, agile methodologies and IT compliance.\r\n\r\nVeit is also active in the Python community, particularly in the area of scientific computing. 
He organizes training courses and conferences on Python and data visualization, with the aim of promoting best practices in research software development.", "public_name": "Veit Schiele", "guid": "4e1e3239-cb9b-5926-9fe8-0a272fde2e39", "url": "https://cfp.pydata.org/berlin2025/speaker/EZ3Z3G/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/QMPX9V/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/QMPX9V/", "attachments": []}, {"guid": "530ae3c3-6252-5888-98c4-139767b29c78", "code": "KBEEHS", "id": 77728, "logo": null, "date": "2025-09-01T12:00:00+02:00", "start": "12:00", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77728-accessible-data-visualizations", "url": "https://cfp.pydata.org/berlin2025/talk/KBEEHS/", "title": "Accessible Data Visualizations", "subtitle": "", "track": "Visualisation & Jupyter", "type": "Talk", "language": "en", "abstract": "Data visualizations often exclude users with visual impairments and temporary or situational constraints. Many regulations (European Accessibility Act, American Disabilities Act) now mandate inclusive digital content. Our research provides practical solutions \u2014 optimized color palettes, supplementary patterns, and alternative formats \u2014 implemented in popular libraries like Bokeh and Vega-Altair. These techniques, available through our open-source cusy Design System, create visualizations that reach broader audiences while meeting compliance requirements and improving comprehension for all users.", "description": "## Introduction\r\n\r\nAccessible data visualizations extend beyond aesthetics to meet established standards and accommodate diverse visual abilities. This presentation demonstrates how to create visualizations that comply with Web Content Accessibility Guidelines (WCAG) contrast requirements, support users with color vision deficiencies, and convey information through multiple encoding channels. 
The presentation explores practical techniques using colors, patterns, SVG accessibility features, and alternative data formats. \r\n\r\nThis presentation is designed for data scientists, visualization specialists, dashboard designers, and accessibility auditors who need to communicate findings effectively to diverse audiences. Attendees will benefit by:\r\n\r\n- Learning practical techniques to make visualizations accessible without sacrificing analytical depth\r\n- Gaining implementation strategies for common data visualization libraries\r\n- Acquiring skills to expand their reach to users with visual impairments\r\n- Taking away ready-to-use color palettes and pattern sets for immediate implementation\r\n\r\n# Topics\r\n\r\n## Color Accessibility\r\n\r\nData visualizations must meet WCAG contrast ratios (\u22653:1) for distinguishable elements. Our optimized palette features:\r\n\r\n- Eight distinct colors plus neutral gray for invalid data\r\n- CIEDE2000 perceptual differences >20 between colors\r\n- Verified compatibility with various color vision deficiencies\r\n- Print-friendly CMYK values (ISO Coated V2 300% or Pantone C)\r\n- Contrast ratios >3.0 (WCAG AA-level) against white and black backgrounds\r\n\r\n## Pattern Implementation\r\n\r\nPatterns provide critical secondary encoding when color alone is insufficient. We'll present:\r\n\r\n- A unique pattern paired with each color\r\n- Area fills that maintain distinction at various scales\r\n- Sequential pattern densities for quantitative data\r\n- Pattern elements adaptable as point markers\r\n- Implementation via SVG `<pattern>` tags\r\n\r\n## Technical Implementation\r\n\r\nPractical examples will demonstrate:\r\n\r\n- Using color contrast checkers for validation\r\n- Implementing SVG `<pattern>` elements\r\n- Creating accessible SVG with proper ARIA attributes\r\n- Providing alternative data formats (e.g. 
HTML tables with semantic descriptions)\r\n- Testing with screen readers and accessibility tools\r\n\r\n## Conclusion\r\n\r\nImplementing these practices creates data visualizations that are not only compliant with accessibility regulations but also more effective for all users. The cusy Design System offers open-source resources to implement these techniques across various visualization libraries.", "recording_license": "", "do_not_record": false, "persons": [{"code": "W3RRQT", "name": "Maris Nieuwenhuis", "avatar": "https://cfp.pydata.org/media/avatars/W3RRQT_vHAuCRy.webp", "biography": "## Junior Dev\r\n- TS/JS, Python, Java, and a teeny bit o' C++\r\n- WebDev, DataViz, Backend-Buzz  \r\n\r\n#a11y", "public_name": "Maris Nieuwenhuis", "guid": "d67b117c-3fd6-5000-8344-a12742b915f5", "url": "https://cfp.pydata.org/berlin2025/speaker/W3RRQT/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/KBEEHS/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/KBEEHS/", "attachments": []}, {"guid": "28878430-3bc5-5414-8648-07329de41424", "code": "AU8F9U", "id": 77485, "logo": null, "date": "2025-09-01T13:40:00+02:00", "start": "13:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77485-automating-content-creation-with-llms-a-journey-from-manual-to-ai-driven-excellence", "url": "https://cfp.pydata.org/berlin2025/talk/AU8F9U/", "title": "Automating Content Creation with LLMs: A Journey from Manual to AI-Driven Excellence", "subtitle": "", "track": "Generative AI", "type": "Talk [Sponsored]", "language": "en", "abstract": "In the fast-paced realm of travel experiences, GetYourGuide encountered the challenge of maintaining consistent, high-quality content across its global marketplace. Manual content creation by suppliers often resulted in inconsistencies and errors, negatively impacting conversion rates. To address this, we leveraged large language models (LLMs) to automate content generation, ensuring uniformity and accuracy. 
This talk will explore our innovative approach, including the development of fine-tuned models for generating key text sections and the use of Function Calling GPT API for structured data. A pivotal aspect of our solution was the creation of an LLM evaluator to detect and correct hallucinations, thereby improving factual accuracy. Through A/B testing, we demonstrated that AI-driven content led to fewer defects and increased bookings. Attendees will gain insights into training data refinement, prompt engineering, and deploying AI at scale, offering valuable lessons for automating content creation across industries.", "description": "GetYourGuide, a global marketplace for travel experiences, needs to provide structured and inspiring content for every activity in its marketplace. \r\nBefore the release of our AI models, suppliers would create their content fully manually. The manual approach led to several issues in production, such as content inconsistencies, incorrect grammar, non-English language, and poor adherence to our content guidelines.\r\nThese content defects negatively impact the conversion rate of activities.\r\nAt the same time, with the large scale of new activity generation, our internal teams could only review a very small fraction of the submitted content.  \r\n\r\nWith our LLM solution, suppliers can now automatically generate optimal content for their activities. Our feature allows users to simply copy-paste any existing raw text of their activity, and our models would then prefill most of the content sections. Suppliers then have the opportunity to review and edit the content.\r\nWe chose two different methods to generate free text content and structured information.\r\n\r\nFor free text, we used the OpenAI fine-tune API to create two different models generating the relevant sections of our travel activities, i.e. 
the title, the highlights, the short and full descriptions.\r\nFor structured information, we used the Function Calling GPT API to prefill the different activity tags and categories that have fixed value constraints in our database, such as the transport used or the type of the guide. \r\n\r\nIn order to validate our models, as well as for production monitoring, we developed a dedicated LLM evaluator that identifies hallucinations for our specific case, that is, our models generating information that is not factually consistent with the input supplier text. With this hallucination evaluator, we were able to score the performance of different models and unlock key learnings and iterations. The evaluator also enables our internal team to detect and correct the hallucinations in production.\r\n\r\nAfter several A/B experiments, the new automated content creation feature is fully released to all our suppliers. The activities with content generated via AI showed significantly fewer content defects and a significant increase in bookings, with only a small fraction of hallucinations that can be reviewed and corrected manually.\r\n\r\nIn this talk, we will share our long journey consisting of several training data iterations to build our fine-tuned models, the prompt engineering challenges in building our evaluator and our function call model. 
We will also cover the different experiments and the operational challenges in training the models and deploying the service in production.\r\nThe talk will provide some concrete ideas and tools to automate the generation of optimal content with LLMs, which is a common use case in many industries.", "recording_license": "", "do_not_record": false, "persons": [{"code": "AVMY7M", "name": "Marco Vene", "avatar": "https://cfp.pydata.org/media/avatars/AVMY7M_GbQChi4.webp", "biography": "With over a decade of experience in data science and analytics, I am a Senior Data Scientist at GetYourGuide, where I lead initiatives in leveraging large language models (LLMs) to enhance content quality and conversion rates. My expertise includes fine-tuning LLMs for custom text generation and classification, developing NLP models for discovering new travel interests, and automating predictive models for global travel demand. I have a robust background in machine learning, natural language processing, and AI-driven content automation, which has significantly improved operational efficiencies and business outcomes.\r\nPrior to moving to Data Science, I was a Senior Data Analyst at GetYourGuide, where I developed key metrics for availability and loyalty, built automated forecasting for our travel activities, performed impact analyses for sales and marketing, and automated data analyses with custom libraries. \r\nBefore joining GetYourGuide, I worked as a Data Analyst at Foodpanda, an online food delivery platform, where I optimized restaurant ranking algorithms and developed recommendation systems. 
\r\nMy analytical journey began at Wealth-X in Budapest, where I worked as a Business Analyst, and later as a Research Consultant at Millward Brown Vermeer, where I applied statistical techniques to report insights to external customers.\r\nI hold a Master's degree in Marketing from Rotterdam School of Management, Erasmus University, graduating cum laude, and a Bachelor's degree in Business/Managerial Economics from Universit\u00e0 di Pisa.\r\nDriven by a passion for data-driven decision-making, I am committed to advancing AI technologies to solve complex business challenges. At PyData 2025 Berlin, I aim to share insights into deploying AI at scale, refining training data, and mastering prompt engineering to automate content creation across industries.", "public_name": "Marco Vene", "guid": "f5e5736b-23d1-5e60-9949-22dceb95e06b", "url": "https://cfp.pydata.org/berlin2025/speaker/AVMY7M/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/AU8F9U/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/AU8F9U/", "attachments": []}, {"guid": "2e0be8e6-d8a6-5914-800c-e92fd1d986e3", "code": "ZLJRNN", "id": 77605, "logo": null, "date": "2025-09-01T14:20:00+02:00", "start": "14:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77605-benchmarking-2000-cloud-servers-for-gbm-model-training-and-llm-inference-speed", "url": "https://cfp.pydata.org/berlin2025/talk/ZLJRNN/", "title": "Benchmarking 2000+ Cloud Servers for GBM Model Training and LLM Inference Speed", "subtitle": "", "track": "Infrastructure - Hardware & Cloud", "type": "Talk", "language": "en", "abstract": "Spare Cores is a Python-based, open-source, and vendor-independent ecosystem collecting, generating, and standardizing comprehensive data on cloud server pricing and performance. In our latest project, we launched 2000+ server types across five cloud vendors to evaluate their suitability for serving Large Language Models from 135M to 70B parameters. 
We tested how efficiently models can be loaded into memory or VRAM, and measured inference speed across varying token lengths for prompt processing and text generation. The published data can help you find the optimal instance type for your LLM serving needs, and we will also share our experiences and challenges with the data collection and insights into general patterns.", "description": "Spare Cores is a vendor-independent, open-source, Python-based ecosystem offering a comprehensive inventory and performance evaluation of servers across cloud providers. We automate the discovery and provisioning of thousands of server types in public clouds using GitHub Actions to run hardware inspection tools and benchmarks for different workloads, including:\r\n- General performance (GeekBench, PassMark)\r\n- Memory bandwidth and compression algorithms\r\n- OpenSSL, Redis, and web serving speed\r\n- DS/ML-specific benchmarks like GBM training and LLM inference on CPUs and GPUs\r\n\r\nAll results and open-source tools (such as database dumps, APIs, and SDKs) are openly published to help users identify and launch the most cost-efficient instance type for their specific use case in their own cloud environment.\r\n\r\nThis talk introduces the open-source ecosystem, then highlights our latest benchmarking efforts, including the performance evaluation of ~2,000 server types to determine the largest LLM (from 135M to 70B parameters) that can be loaded on the machines and the inference speeds achievable with various token lengths for prompt processing and text generation.\r\n\r\nSlides: https://sparecores.com/assets/slides/pydata-berlin-2025.html#/cover-slide", "recording_license": "", "do_not_record": false, "persons": [{"code": "H9DKCZ", "name": "Gergely Daroczi", "avatar": "https://cfp.pydata.org/media/avatars/H9DKCZ_sny7LWK.webp", "biography": "Gergely Daroczi, PhD, has been a passionate R/Python user and package developer for two decades. 
With over 15 years in the industry, he has expertise in data science, engineering, cloud infrastructure, and data operations across SaaS, fintech, adtech, and healthtech startups in California and Hungary, focusing on building scalable data platforms. Gergely maintains a dozen open-source R and Python projects and organizes a tech meetup with 1,800 members in Hungary \u2013 along with other open-source and data conferences.", "public_name": "Gergely Daroczi", "guid": "84ed02d5-da16-5d9d-89f0-8579b6266fa0", "url": "https://cfp.pydata.org/berlin2025/speaker/H9DKCZ/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/ZLJRNN/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/ZLJRNN/", "attachments": []}, {"guid": "86e16a31-6177-5d69-b1ac-9b69efdf9476", "code": "FPDP3E", "id": 77874, "logo": null, "date": "2025-09-01T15:40:00+02:00", "start": "15:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77874-scaling-python-an-end-to-end-ml-pipeline-for-iss-anomaly-detection-with-kubeflow-and-mlflow", "url": "https://cfp.pydata.org/berlin2025/talk/FPDP3E/", "title": "Scaling Python: An End-to-End ML Pipeline for ISS Anomaly Detection with Kubeflow and MLFlow", "subtitle": "", "track": "Infrastructure - Hardware & Cloud", "type": "Talk", "language": "en", "abstract": "Building and deploying scalable, reproducible machine learning pipelines can be challenging, especially when working with orchestration tools like Slurm or Kubernetes. In this talk, we demonstrate how to create an end-to-end ML pipeline for anomaly detection in International Space Station (ISS) telemetry data using only Python code.\r\n\r\nWe show how Kubeflow Pipelines, MLFlow, and other open-source tools enable the seamless orchestration of critical steps: distributed preprocessing with Dask, hyperparameter optimization with Katib, distributed training with PyTorch Operator, experiment tracking and monitoring with MLFlow, and scalable model serving with KServe. 
All these steps are integrated into a holistic Kubeflow pipeline.\r\n\r\nBy leveraging Kubeflow's Python SDK, we simplify the complexities of Kubernetes configurations while achieving scalable, maintainable, and reproducible pipelines. This session provides practical insights, real-world challenges, and best practices, demonstrating how Python-first workflows empower data scientists to focus on machine learning development rather than infrastructure.", "description": "Among popular open-source MLOps tools, **Kubeflow** stands out as a Kubernetes-native platform designed to support the entire ML lifecycle, from data preprocessing to model training, deployment, and retraining. Its modular structure enables the integration of a wide range of tools, making it a highly versatile framework for building scalable and reproducible ML workflows. Despite this, most existing resources focus on individual components rather than demonstrating how these can be orchestrated into a seamless, end-to-end pipeline.\r\n\r\nIn this talk, we present a practical case study that highlights the potential of Kubeflow in a real-world application. Specifically, we showcase how an automated ML pipeline for anomaly detection in International Space Station (ISS) telemetry data can be built and deployed using Kubeflow and other open-source MLOps tools. The dataset, originating from the Columbus module of the ISS, introduces unique challenges due to its complexity and high-dimensional nature, providing an excellent testbed for MLOps workflows.\r\n\r\n### **What makes this approach unique?**\r\n\r\nOur workflow is built entirely in Python, leveraging Kubeflow\u2019s Python SDK to orchestrate every stage of the pipeline. 
This eliminates the need for manual interaction with Kubernetes or container configurations, making the process accessible to ML engineers and data scientists without extensive DevOps expertise.\r\n\r\n### **Key takeaways for attendees:**\r\n\r\n*   **Tool integration:** Learn how to combine Dask for distributed preprocessing, Katib for hyperparameter optimization, PyTorch Operator for distributed training, MLFlow for experiment tracking and monitoring, and KServe for scalable model serving. These tools are orchestrated into a unified pipeline using Kubeflow Pipelines.\r\n*   **Overcoming challenges:** Gain insights into the technical hurdles faced during the implementation of this pipeline and discover the strategies and best practices that made it possible.\r\n*   **Real-world impact:** Understand how to apply MLOps principles to complex, real-world datasets and how these principles translate into scalable, maintainable, and reproducible workflows.\r\n\r\nTo ensure reproducibility and accessibility, the entire pipeline, including configurations and code, is publicly available in our GitHub repository [here](https://github.com/hsteude/code-ml4cps-paper). Attendees will be able to replicate the workflow, adapt it to their own use cases, or extend it with additional features.\r\n\r\n### **Who should attend?**\r\n\r\nThis session is designed for data scientists, ML engineers, and Python enthusiasts who want to simplify the development of scalable ML pipelines. Whether you're new to Kubernetes or looking to streamline your MLOps workflows, this talk will provide actionable insights and tools to help you succeed.", "recording_license": "", "do_not_record": false, "persons": [{"code": "RL9F37", "name": "Christian Geier", "avatar": "https://cfp.pydata.org/media/avatars/RL9F37_XXV8RHO.webp", "biography": "Christian has 12+ years of experience in the scientific application of python in academic and industry settings. 
He is one of the founders of prokube.ai where he builds an MLOps platform built around Kubeflow, MLFlow, Kubernetes, and a host of other open source tools. He also holds a PhD in physics, where he gained experience in maintaining distributed compute clusters. Christian is a maintainer of several OSS projects.", "public_name": "Christian Geier", "guid": "406c71d6-c246-5e68-893b-df4bb32e608e", "url": "https://cfp.pydata.org/berlin2025/speaker/RL9F37/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/FPDP3E/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/FPDP3E/", "attachments": []}, {"guid": "3886859b-9079-56cd-a3d5-d1ce975e6873", "code": "SB88M7", "id": 77394, "logo": null, "date": "2025-09-01T16:20:00+02:00", "start": "16:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77394-beyond-the-black-box-interpreting-ml-models-with-shap", "url": "https://cfp.pydata.org/berlin2025/talk/SB88M7/", "title": "Beyond the Black Box: Interpreting ML models with SHAP", "subtitle": "", "track": "Visualisation & Jupyter", "type": "Talk", "language": "en", "abstract": "As machine learning models become more accurate and complex, explainability remains essential. Explainability helps not just with trust and transparency but also with generating actionable insights and guiding decision-making. One way of interpreting the model outputs is using SHapley Additive exPlanations (SHAP). In this talk, I will go through the concept of Shapley values and its mathematical intuition and then walk through a few real-world examples for different ML models. Attendees will gain a practical understanding of SHAP's strengths and limitations and how to use it to explain model predictions in their projects effectively.", "description": "## Audience\r\nThis talk is for Data Scientists and Machine Learning Engineers at any level. 
Basic knowledge of machine learning is useful but not necessary.\r\n\r\n## Objective\r\nAttendees will learn why explainable machine learning is important and how to use and interpret SHAP values for their model.\r\n\r\n## Details\r\n\r\nML models behave as black boxes in most scenarios. The model predicts or provides a certain output, but it is very difficult to generate any actionable insights directly. This is mostly because we generally have no idea which features are contributing the most to the model's behavior internally. SHAP provides a way to explain model predictions and can be an important tool in a data scientist's toolbox.\r\n\r\nIn this talk, we will begin by explaining to the audience the need for explainability and why it is essential to look beyond what the model outputs. We will then briefly review the mathematical intuition behind Shapley values and their origins in game theory. After that, we will walk through a couple of case studies of tree-based and neural network-based models. We will be focusing on the interpretation of SHAP through various plots. Finally, we will discuss the best practices for interpreting SHAP visualizations, handling large datasets, and common pitfalls to avoid.\r\n\r\n## Outline\r\n\r\n- Introduction and motivation [1 min]\r\n- Why does explainability matter? 
[5 min]\r\n   - Problem with black box models\r\n   - Actionable insights\r\n- SHAP theory and intuition [5 min]\r\n    - Shapley values\r\n    - Game theory origins\r\n    - SHAP\r\n- Case study 1: Tree-based model [4 min]\r\n    - Problem definition\r\n    - Model output\r\n    - SHAP visualization\r\n      - Global plots\r\n      - Local plots\r\n    - Interpretation\r\n- Case study 2: Neural Network model [8 min]\r\n    - Problem definition\r\n    - Model output\r\n    - SHAP visualization\r\n       - Global plots\r\n       - Local plots\r\n    - Interpretation\r\n- Best practices and common pitfalls [4 min]\r\n    - Interpret SHAP correctly\r\n    - Avoid misleading explanations\r\n    - Performance challenges for large datasets\r\n    - Other techniques for explainability\r\n- Q/A [3 min]", "recording_license": "", "do_not_record": false, "persons": [{"code": "Z7XLKH", "name": "Avik Basu", "avatar": "https://cfp.pydata.org/media/avatars/Z7XLKH_Rey05Sy.webp", "biography": "Avik Basu is a Staff Data Scientist passionate about building intelligent, scalable systems that blend research with practical impact. With extensive experience in time series modeling, anomaly detection, and explainable AI, he focuses on making machine learning robust, interpretable, and production-ready.\r\n\r\nAvik is a frequent speaker at conferences like PyCascades, PyData and KubeCon, where he shares insights on topics such as reproducible ML workflows, ML-driven observability, etc. He is also an active contributor to the open-source ecosystem, serving as a maintainer of the real-time data processing framework Numaflow and a reviewer for scientific Python projects.\r\n\r\nOutside of work, he explores the intersection of machine learning, personal finance, and open-source tools, aiming to build software that is accessible, self-hostable, and privacy-focused. 
He is driven by a strong belief in community, transparency, and empowering others through education and mentorship.", "public_name": "Avik Basu", "guid": "ca1347a5-7adf-5604-b533-d0a11b6764be", "url": "https://cfp.pydata.org/berlin2025/speaker/Z7XLKH/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/SB88M7/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/SB88M7/", "attachments": []}, {"guid": "5e014a5e-35d0-54fe-aab9-cbff0d7f186e", "code": "VURY38", "id": 77039, "logo": null, "date": "2025-09-01T17:00:00+02:00", "start": "17:00", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77039-building-an-a-b-testing-framework-with-nicegui", "url": "https://cfp.pydata.org/berlin2025/talk/VURY38/", "title": "Building an A/B Testing Framework with NiceGUI", "subtitle": "", "track": "Visualisation & Jupyter", "type": "Talk", "language": "en", "abstract": "NiceGUI is a Python-based web UI framework that enables developers to build interactive web applications without using JavaScript. In this talk, I\u2019ll share how my team used NiceGUI to create an internal A/B testing platform entirely in Python. I\u2019ll discuss the key requirements for the platform, why we chose NiceGUI, and how it helped us design the UI, display results, and integrate with the backend. This session will demonstrate how NiceGUI simplifies development, reduces frontend complexity, and speeds up internal tool creation for Python developers.", "description": "NiceGUI is a Python-based web UI framework that enables developers to create full-featured, interactive web applications without needing to write JavaScript. \r\n\r\n In this talk, I\u2019ll share how my team and I used NiceGUI to build an internal A/B testing platform entirely in Python. 
A/B testing is essential for validating new features and improving user experience, and by creating a custom platform, we were able to streamline experiment management and simplify data visualization.\r\n\r\nThis talk is ideal for Python developers, data scientists, or anyone interested in creating web-based internal tools quickly. If you're looking for a solution that minimizes frontend complexity while providing a powerful framework for building interactive applications, this talk will provide valuable insights. No prior knowledge of JavaScript or frontend frameworks is necessary; familiarity with Python and basic web concepts will suffice.\r\n\r\nAfter a brief introduction, I\u2019ll first explain what A/B testing is and why it\u2019s so crucial for making data-driven decisions. I\u2019ll also discuss why having a custom-built platform can help improve experiment efficiency and results interpretation.\r\n\r\nNext, I\u2019ll dive into the key requirements we had for the platform, such as flexibility, ease of use, and seamless integration with our existing backend systems. I\u2019ll also explain why we chose NiceGUI over other Python-based frameworks, emphasizing its ability to help us build a robust web application without the complexities of traditional frontend development.\r\n\r\nThroughout the talk, I\u2019ll walk through how we used NiceGUI to design the user interface, display results, and integrate with the backend. I\u2019ll focus on the development experience, highlighting the challenges we faced and how NiceGUI\u2019s features allowed us to make rapid progress while keeping things simple and Pythonic.\r\n\r\nThe takeaway for the audience will be understanding how NiceGUI simplifies the development of interactive web applications, focusing on internal tools like dashboards or experiment management platforms. I\u2019ll also share the benefits we\u2019ve experienced with the platform so far and discuss the lessons we\u2019ve learned. 
Finally, I\u2019ll explain how NiceGUI helped us create an interactive, production-ready tool with minimal frontend complexity.\r\n\r\nThis session will demonstrate, through a specific use case, how NiceGUI can be an ideal solution for Python developers looking to quickly build internal tools, reduce frontend complexity, and speed up development cycles.\r\n\r\nAgenda:\r\n1. Introduction & Background (5 minutes)\r\n2. Requirements for an A/B Testing Platform (2 minutes)\r\n3. Why We Chose NiceGUI (2 minutes)\r\n4. How We Built It \u2013 Patterns & Architecture (10 minutes)\r\n5. Benefits and Outcomes (3 minutes)\r\n6. Challenges and Lessons Learned (3 minutes)", "recording_license": "", "do_not_record": false, "persons": [{"code": "QD8QT7", "name": "Wessel van de Goor", "avatar": "https://cfp.pydata.org/media/avatars/QD8QT7_S6UB8As.webp", "biography": "Wessel's greatest passion is working with data. He loves collecting, storing, transforming, analyzing, and presenting data. Wessel is an Analytics Engineer at Lotum, where he creates data models and develops ETL pipelines and dashboards to assist his colleagues in their daily work.\r\n\r\nEven in his free time, Wessel's love for data continues, as many of his hobbies can be explored in a data-driven way. 
He particularly enjoys diving into video and board games, analyzing everything that can be quantified.", "public_name": "Wessel van de Goor", "guid": "20fbfafe-62dc-569f-8d42-a20521f990a1", "url": "https://cfp.pydata.org/berlin2025/speaker/QD8QT7/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/VURY38/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/VURY38/", "attachments": []}], "B05-B06": [{"guid": "8c402e0f-8d4c-5a53-a553-2401d5fe39cc", "code": "KCPVYN", "id": 77590, "logo": null, "date": "2025-09-01T10:40:00+02:00", "start": "10:40", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77590-streamlining-satellite-data-for-analysis-ready-outputs", "url": "https://cfp.pydata.org/berlin2025/talk/KCPVYN/", "title": "\ud83d\udef0\ufe0f\u27a1\ufe0f\ud83e\uddd1\u200d\ud83d\udcbb: Streamlining Satellite Data for Analysis-Ready Outputs", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "I will share how our team built an end-to-end system to transform raw satellite imagery into analysis-ready datasets for use cases like vegetation monitoring, deforestation detection, and identifying third-party activity. We streamlined the entire pipeline from automated acquisition and cloud storage to preprocessing that ensures spatial, spectral, and temporal consistency. By leveraging Prefect for orchestration, Anyscale Ray for scalable processing, and the open source STAC standard for metadata indexing, we reduced processing times from days to near real-time. We addressed challenges like inconsistent metadata and diverse sensor types, building a flexible system capable of supporting large-scale geospatial analytics and AI workloads.", "description": "Satellite imagery offers powerful insights for vegetation monitoring, deforestation detection, and identifying unauthorized activity but raw data isn\u2019t analysis-ready. 
In this talk, I will share how our team built a scalable, cloud-native pipeline that automates satellite data acquisition, storage, and preprocessing into consistent, analysis-ready datasets (ARDs). Designed for flexibility and growth, the system handles various sensors and formats while ensuring high data quality.\r\n\r\nWe use Prefect for workflow orchestration and Anyscale Ray for distributed processing, cutting processing times from days to near real-time. Open source SpatioTemporal Asset Catalog  (STAC) standards enable robust metadata indexing, supporting fast querying and long-term interoperability. This adaptable architecture empowers fast, reliable geospatial analytics across domains.", "recording_license": "", "do_not_record": false, "persons": [{"code": "LYCURQ", "name": "Vinayak Nair", "avatar": "https://cfp.pydata.org/media/avatars/LYCURQ_QiAl2i7.webp", "biography": "Remote Sensing & Space System Engineer | Innovating AI-Powered Geospatial Solutions | Expert in Satellite Data and Infrastructure Monitoring", "public_name": "Vinayak Nair", "guid": "0a8f2469-5d09-501e-ac92-f83ecdefa0de", "url": "https://cfp.pydata.org/berlin2025/speaker/LYCURQ/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/KCPVYN/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/KCPVYN/", "attachments": []}, {"guid": "0a691385-2c47-5a6a-bacb-3c1b6d385099", "code": "8UJA37", "id": 77898, "logo": null, "date": "2025-09-01T11:20:00+02:00", "start": "11:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77898-exploring-millions-of-high-dimensional-datapoints-in-the-browser-for-early-drug-discovery", "url": "https://cfp.pydata.org/berlin2025/talk/8UJA37/", "title": "Exploring Millions of High-dimensional Datapoints in the Browser for Early Drug Discovery", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "The visual exploration of large, high-dimensional datasets presents significant 
challenges in data processing, transfer, and rendering for engineering in various industries. This talk will explore innovative approaches to harnessing massive datasets for early drug discovery, with a focus on interactive visualizations. We will demonstrate how our team at Bayer utilizes a modern tech stack to efficiently navigate and analyze millions of data points in a high-dimensional embedding space. Attendees will gain insights into overcoming performance challenges, optimizing data rendering, and developing user-friendly tools for effective data exploration. We aim to demonstrate how these technologies can transform the way we interact with complex datasets in engineering applications and eventually allow us to find the needle in a multidimensional haystack.", "description": "From initial screening to regulatory approval, developing new drugs can take over a decade. A major bottleneck is the early-stage identification of promising compounds, a process that increasingly relies on high-throughput image-based profiling and requires researchers to sift through vast oceans of potential molecular candidates. Analyzing these large-scale, high-dimensional datasets introduces challenges in data ingestion, transformation, and visualization. Overcoming those challenges has the potential to significantly accelerate the journey from discovery to delivery, thus providing life-saving treatments to patients faster.\r\n\r\nIn this talk, we share how our team at Bayer engineered a system to navigate millions of cell-level data points in the browser. Starting with raw microscopy images, we use computer vision and deep learning models to extract morphological features. 
These features are aggregated into \u201cconsensus profiles\u201d that enable robust comparisons across treatment conditions and experimental batches.\r\nWe\u2019ll present how we automated and optimized what was previously a four-week manual workflow using a tech stack including:\r\n\r\n\u2022\tApache Airflow for orchestrating parallel processing and ensuring reproducibility  \r\n\u2022\tGraphQL combined with REST for a balance of flexibility and speed in serving data\r\n\u2022\tReact and Next.js for building user interfaces that support real-time interaction with millions of records\r\n\r\nWe\u2019ll also showcase techniques for creating accessible and performant visualizations: scatter plots, dose-response curves, dendrograms, and similarity heatmaps. These visualizations were designed for scientists who are not software developers, so particular attention was paid to usability, accessibility, and performance.\r\n\r\nBy presenting practical challenges and solutions, we will enable attendees to improve their approaches to data visualization and interaction in their own domains. We aim to convey how these technologies can transform the way we interact with complex datasets in engineering applications on a broad spectrum, empowering us with more efficient methodologies to locate the needle in a multidimensional haystack.", "recording_license": "", "do_not_record": false, "persons": [{"code": "SHDJXQ", "name": "Tim Tenckhoff", "avatar": "https://cfp.pydata.org/media/avatars/SHDJXQ_bbsSXEU.webp", "biography": "Tim is a Software Development Consultant at Netlight with a track record of experience in diverse industries, including MedTech, E-Mobility, FinTech, E-Commerce, EdTech and IoT. With a passion for technology and a relentless pursuit of excellence, he is dedicated to continuously pushing the boundaries of innovation while crafting clean, well-architected solutions and streamlining processes for efficiency. 
Currently, Tim is supporting Bayer in the Research and Development domain by visualising extensive cell painting image data in early drug discovery.", "public_name": "Tim Tenckhoff", "guid": "96f68912-d668-526f-a5f9-dd3da449aeca", "url": "https://cfp.pydata.org/berlin2025/speaker/SHDJXQ/"}, {"code": "KKDF38", "name": "Matthias Orlowski", "avatar": "https://cfp.pydata.org/media/avatars/KKDF38_IzM0Vwm.webp", "biography": "As a Machine Learning Engineer at Bayer, Matthias Orlowski has contributed to various projects, focusing on natural language processing in pharmacovigilance and medical image processing in radiology and early drug discovery. Matthias studied in Konstanz, Nottingham (UK), Durham (North Carolina, USA), and Berlin, where he earned a PhD from Humboldt University in 2015. Prior to joining Bayer, Matthias gained diverse experience in multiple roles and organizations, tackling projects in consumer targeting, campaigning, and recommender systems.", "public_name": "Matthias Orlowski", "guid": "da15a5ce-8324-514a-8ce7-e4b0526b4734", "url": "https://cfp.pydata.org/berlin2025/speaker/KKDF38/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/8UJA37/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/8UJA37/", "attachments": []}, {"guid": "da1fc13d-f428-576f-a4a7-d3276e066ba4", "code": "RQCNQV", "id": 77352, "logo": null, "date": "2025-09-01T12:00:00+02:00", "start": "12:00", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77352-democratizing-experimentation-how-getyourguide-built-a-flexible-and-scalable-a-b-testing-platform", "url": "https://cfp.pydata.org/berlin2025/talk/RQCNQV/", "title": "Democratizing Experimentation: How GetYourGuide Built a Flexible and Scalable A/B Testing Platform", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "At GetYourGuide, we transformed experimentation from a centralized, closed system into a democratized, self-service platform 
accessible to all analysts, engineers, and product teams. In this talk, we'll share our journey to empower individuals across the company to define metrics, create dimensions, and easily extend statistical methods. We'll discuss how we built a Python-based Analyzer toolkit enabling standardized, reusable calculations, and how our experimentation platform provides ad-hoc analytical capabilities through a flexible API. Attendees will gain practical insights into creating scalable, maintainable, and user-friendly experimentation infrastructure, along with access to our open-source sequential testing implementation.", "description": "Experimentation is essential for data-driven product development, but centralized experimentation systems often become bottlenecks, limiting innovation and velocity. At GetYourGuide, we faced this challenge and decided to democratize experimentation, enabling analysts and product teams across the company to define, run, and analyze experiments independently. In this session, we'll share practical insights from our journey toward democratization, focusing on technical implementation details and lessons learned.  \r\n\r\n**From Centralized to Decentralized Experimentation**  \r\nInitially, experimentation at GetYourGuide was centralized, limiting flexibility and slowing down decision-making. We recognized the need to empower individual contributors (ICs) by creating a self-service experimentation platform. We'll discuss the practical challenges we encountered, including managing complexity, maintaining consistency, and ensuring data quality across decentralized teams.  \r\n\r\n**Enabling Flexible Metric and Dimension Definitions**  \r\nTo democratize experimentation effectively, we needed to empower analysts to define their own metrics and dimensions without heavy engineering involvement. We'll share how we designed a modular SQL-template approach, allowing analysts to quickly create, test, and deploy new definitions. 
We'll illustrate this approach with real-world examples, such as conversion rate, revenue per visitor, channel splits, and platform segmentation, demonstrating how this flexibility significantly accelerated experimentation velocity.  \r\n\r\n**Standardizing Statistical Calculations with the Analyzer Toolkit**\r\nOur initial experimentation infrastructure relied heavily on Looker data models, which proved insufficient for complex statistical methods like sequential testing. To address this, we built a Python-based analysis package, the Analyzer, that standardized statistical calculations and provided reusable components. We'll explain how analysts leverage this toolkit to ensure consistency, accuracy, and extensibility of statistical methods. We'll also share how the Analyzer became a valuable resource beyond experimentation, supporting broader analytical use-cases across the organization.  \r\n\r\n**Batch Processing and API-Driven Experiment Results**  \r\nTo ensure timely access to experiment results, we implemented a robust batch processing pipeline that pre-calculates daily experiment impressions, metrics, and dimensions. Additionally, we developed a flexible API layer to enable analysts to retrieve specific experiment results dynamically, without waiting for scheduled batch jobs. We'll discuss the technical architecture behind this dual approach, highlighting how it balances efficiency, reliability, and flexibility.  \r\n\r\n**Key Lessons and Takeaways**  \r\nAttendees will leave this session with practical insights into:\r\n* Democratizing experimentation to accelerate innovation and velocity.\r\n* Best practices for designing flexible, scalable, and maintainable experimentation infrastructure.\r\n* Technical strategies for enabling self-service metric/dimension definitions, standardized statistical calculations, and extensible analytical capabilities.  
\r\n  \r\nWe'll conclude by briefly outlining our future plans, including additional discriminators, advanced statistical methods, and further UI enhancements aimed at continuous democratization.", "recording_license": "", "do_not_record": false, "persons": [{"code": "PP8Y9C", "name": "Konrad Richter", "avatar": "https://cfp.pydata.org/media/avatars/PP8Y9C_xppi7qa.webp", "biography": null, "public_name": "Konrad Richter", "guid": "27c5abeb-2464-5e1e-b8d1-4abb92d6fee1", "url": "https://cfp.pydata.org/berlin2025/speaker/PP8Y9C/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/RQCNQV/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/RQCNQV/", "attachments": [{"title": "Slides", "url": "/media/berlin2025/submissions/RQCNQV/resources/Democratizing_E_smpmEED.pdf", "type": "related"}]}, {"guid": "e0b37a7a-8e14-5e4f-b020-e782afe4048d", "code": "LZYBVH", "id": 77020, "logo": null, "date": "2025-09-01T13:40:00+02:00", "start": "13:40", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77020-the-eu-ai-act-unveiling-lesser-known-aspects-implementation-entities-and-exemptions", "url": "https://cfp.pydata.org/berlin2025/talk/LZYBVH/", "title": "The EU AI Act: Unveiling Lesser-Known Aspects, Implementation Entities, and Exemptions", "subtitle": "", "track": "Ethics & Privacy", "type": "Talk", "language": "en", "abstract": "The EU AI Act is already partly in effect which prohibits certain AI systems. After going through the basics, we cover some of the less talked about aspects of the Act, introducing entities involved in its implementation and how many high risk government and law enforcement use cases are excluded!", "description": "The EU AI Act is a groundbreaking regulatory framework, partly in effect, designed to govern AI systems based on their perceived risk. 
This talk provides an overview of the basics and explores lesser-discussed aspects of the Act, such as the entities involved in its implementation, the role of the private sector, and notable exemptions for high-risk government and law enforcement use cases.\r\n\r\nThe AI Act categorizes AI systems into different groups based on their potential harm. The two most notable are the unacceptable-risk and high-risk groups. Social scoring systems, systems that subliminally manipulate behavior, and mass CCTV facial recognition systems are among the prohibited unacceptable-risk group.\r\n\r\nOn the other hand, high-risk systems, including biometric identification systems, AI systems used in education and vocational training, and employment and worker management systems, must meet stringent obligations before entering the market.\r\n\r\nSurprisingly, the AI Act excludes many high-risk government and law enforcement use cases. AI systems used for national security, defense, and law enforcement tasks like border control, crime prevention, and criminal investigations are largely exempt. These exemptions aim to preserve public security and Member States' sovereignty but raise concerns about potential AI misuse in these sensitive areas. For instance, predictive policing tools, though controversial, fall outside the AI Act's scope.\r\n\r\nAdditionally, the AI Act will not apply to AI systems used as research or development tools or to systems developed or used exclusively for military purposes. This leaves a substantial gap in the regulation of high-risk AI systems, emphasizing the need for complementary safeguards.\r\n\r\nOne of the less talked about aspects is the complex ecosystem of entities involved in the AI Act's implementation. The European Artificial Intelligence Board is the Act's central hub, comprising representatives from each national supervisory authority, the European Data Protection Supervisor, and the Commission. 
The board will issue opinions and recommendations to ensure the AI Act's consistent application. National supervisory authorities, such as data protection agencies, will oversee the Act's enforcement, exchanging information through the board. The European Commission will facilitate cooperation among national authorities and with international organizations.\r\n\r\nWhen it comes to verifying submitted documents and the claimed lack of high-risk status, there will be entities called notifying bodies, which will be established by each Member State to assess and certify notified bodies. Notified bodies are conformity assessment bodies accredited to evaluate high-risk AI systems. These notified bodies are a space where the private sector and startups can grow and engage with the regulatory bodies. They will play a crucial role in ensuring high-risk AI systems conform to the AI Act's requirements.\r\n\r\nMoreover, the AI Act introduces AI regulatory sandboxes, temporary experimental spaces allowing developers to test innovative AI systems under regulatory supervision. National competent authorities will establish and monitor these sandboxes, fostering innovation while minimizing risks. The private sector can engage with these sandboxes, creating opportunities for startups and established companies to develop and test their new systems.\r\n\r\nIn conclusion, the EU AI Act is a comprehensive regulatory framework that establishes a complex ecosystem of implementation entities and offers opportunities for private sector engagement. However, it also presents notable exemptions for high-risk government and law enforcement use cases, sparking debates about its scope and effectiveness. 
Understanding these lesser-known aspects is crucial for navigating the AI Act's regulatory landscape and fostering responsible AI innovation.", "recording_license": "", "do_not_record": false, "persons": [{"code": "HGSWKF", "name": "Adrin Jalali", "avatar": "https://cfp.pydata.org/media/avatars/HGSWKF_ZnotwvU.webp", "biography": "Adrin is VP Labs at probabl.ai and has a PhD in computational biology. He is also a maintainer of open source projects such as scikit-learn and fairlearn. He focuses on developer tools in the statistical machine learning and responsible ML space.", "public_name": "Adrin Jalali", "guid": "a4868d84-4229-51c6-9af9-2ed9d356b361", "url": "https://cfp.pydata.org/berlin2025/speaker/HGSWKF/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/LZYBVH/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/LZYBVH/", "attachments": []}, {"guid": "93a2cebb-0afb-53d9-961e-405a5168f30a", "code": "JE8YJT", "id": 77727, "logo": null, "date": "2025-09-01T14:20:00+02:00", "start": "14:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77727-what-s-really-going-on-in-your-model-a-python-guide-to-explainable-ai", "url": "https://cfp.pydata.org/berlin2025/talk/JE8YJT/", "title": "What\u2019s Really Going On in Your Model? A Python Guide to Explainable AI", "subtitle": "", "track": "Ethics & Privacy", "type": "Talk", "language": "en", "abstract": "As machine learning models become more complex, understanding why they make certain predictions is becoming just as important as the predictions themselves. Whether you're dealing with business stakeholders, regulators, or just debugging unexpected results, the ability to explain your model is no longer optional, it's essential.\r\n\r\nIn this talk, we'll walk through practical tools in the Python ecosystem that help bring transparency to your models, including SHAP, LIME, and Captum. 
Through hands-on examples, you'll learn how to apply these libraries to real-world models, from decision trees to deep neural networks, and make sense of what's happening under the hood.\r\n\r\nIf you've ever struggled to explain your model\u2019s output or justify its decisions, this session will give you a toolkit to build more trustworthy, interpretable systems without sacrificing performance.", "description": "We\u2019ve all been there: your machine learning model performs well in testing, but when it comes time to explain why it made a specific prediction, things get murky. In many real-world applications, especially in domains like healthcare, finance, or operations, being able to explain your model isn\u2019t just helpful, it\u2019s critical. This talk is a practical walkthrough of explainable AI (XAI) tools in Python, aimed at data scientists and engineers who want to make their models more transparent and trustworthy. We\u2019ll cover libraries like SHAP, LIME, and Captum, and show how to use them to generate both local and global explanations for models ranging from random forests to deep neural nets. You\u2019ll see hands-on examples, common pitfalls to avoid, and ideas for integrating interpretability into your workflow, whether you\u2019re trying to debug your model or justify its predictions to a non-technical stakeholder. If you\u2019ve ever wanted to better understand your own models or help others trust them, this session is for you.", "recording_license": "", "do_not_record": false, "persons": [{"code": "LXQGX3", "name": "Yashasvi Misra", "avatar": "https://cfp.pydata.org/media/avatars/LXQGX3_VVXhnDD.webp", "biography": "Yashasvi Misra is a Data Engineer at Pure Storage and Chair of the NumFOCUS Code of Conduct Working Group, where she helps foster inclusive practices across the open-source ecosystem. She has contributed to foundational projects like NumPy and has been an active part of the Python community since her college days. 
Yashasvi is also a passionate advocate for diversity and inclusion in tech. She introduced a period leave policy at a previous organisation and continues to work toward building more equitable workplaces. She has shared her work and insights at conferences around the world, including PyCon India, PyCon Europe, PyLadiesCon, and PyData Global.", "public_name": "Yashasvi Misra", "guid": "fdf2509a-de08-5695-b450-7533fe8fcbf1", "url": "https://cfp.pydata.org/berlin2025/speaker/LXQGX3/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/JE8YJT/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/JE8YJT/", "attachments": []}, {"guid": "5996ca5a-eeef-5d1b-a5f0-6e19e5b1d5f6", "code": "GW9EXL", "id": 77512, "logo": null, "date": "2025-09-01T16:20:00+02:00", "start": "16:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77512-consumer-choice-models-with-pymc-marketing", "url": "https://cfp.pydata.org/berlin2025/talk/GW9EXL/", "title": "Consumer Choice Models with PyMC Marketing", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "Consumer choice models are an important part of product innovation and market strategy. In this talk we'll see how they can be used to learn about substitution goods and market shares in competitive markets using PyMC marketing's new consumer choice module.", "description": "The market sets the price, but what drives market demand? Classical implementations of discrete choice models discovered that market structure needed to be explicitly encoded in the model to avoid the problem of implausible predictions about the substitution value of distinct products. We demonstrate this issue and how to resolve it by adding more explicit structure to the models of market demand while giving insight into what drives the utility of products for consumers. 
These consumer choice models find a natural expression in the Bayesian paradigm and we show how to fit them to real data with PyMC Marketing's Consumer Choice module.", "recording_license": "", "do_not_record": false, "persons": [{"code": "GB9KHE", "name": "Nathaniel Forde", "avatar": "https://cfp.pydata.org/media/avatars/GB9KHE_9dI5Rai.webp", "biography": "I\u2019m a data scientist specialising in probabilistic modelling for the study of risk and causal inference. I have experience in model development, deployment, multivariate testing and monitoring.\r\n\r\nI\u2019m interested in questions of inference and measurement in the face of natural variation and confounding.\r\n\r\nMy academic background is in mathematical logic and philosophy where I mostly imagined possible worlds and modal logics.", "public_name": "Nathaniel Forde", "guid": "a1e45954-a39c-523b-93be-ac53df0ad728", "url": "https://cfp.pydata.org/berlin2025/speaker/GB9KHE/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GW9EXL/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GW9EXL/", "attachments": []}, {"guid": "5b0fc728-d7ca-5418-b5fa-7a9203a200b5", "code": "3XMJM3", "id": 77527, "logo": null, "date": "2025-09-01T17:00:00+02:00", "start": "17:00", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77527-risk-budget-optimization-for-causal-mix-models", "url": "https://cfp.pydata.org/berlin2025/talk/3XMJM3/", "title": "Risk Budget Optimization for Causal\u202fMix\u202fModels", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "Traditional budget planners chase the highest predicted return and hope for the best.\u202fBayesian models take the opposite route: they quantify uncertainty first, then let us optimize budgets with that uncertainty fully on display.\u202fIn this talk we\u2019ll show how posterior distributions become a set of possible futures, and how risk\u2011aware loss functions 
convert those probabilities into spend decisions that balance upside with resilience.\u202fWhether you lead marketing, finance, or product, you\u2019ll learn a principled workflow for turning probabilistic insight into capital allocation that\u2019s both aggressive and defensible\u2014no black\u2011box magic, just transparent Bayesian reasoning and disciplined risk management.", "description": "Budget planning often treats forecasts as fixed targets, leaving decision\u2011makers blind to the volatility hiding beneath the averages.\u202fThis talk shows how Bayesian modelling turns every unknown\u2014channel response, cost elasticity, future demand\u2014into an explicit probability distribution.\u202fBy simulating thousands of plausible futures, we can measure upside and downside simultaneously and translate a company\u2019s risk appetite into clear optimisation objectives such as Value\u2011at\u2011Risk, Conditional VaR, entropic risk, or custom utility functions that respect budget caps and pacing rules.\r\n\r\nUsing reproducible PyMC code, we will walk through converting posterior samples into risk\u2011aware spend recommendations, and visualising trade\u2011offs so non\u2011technical stakeholders grasp both opportunity and exposure.\u202f\r\n\r\nAttendees will leave with a notebook and code to adapt PyMC Bayesian models with PyMC-Marketing to optimise marketing budgets, capital allocation, or any scenario where uncertainty and risk tolerance must shape financial decisions.", "recording_license": "", "do_not_record": false, "persons": [{"code": "WDM7WM", "name": "Carlos Trujillo", "avatar": "https://cfp.pydata.org/media/avatars/WDM7WM_KlOvVw3.webp", "biography": "Eight years ago, I discovered a lasting passion for data and AI\u2014the kind that keeps you experimenting long after your calendar says \u201cdone.\u201d That curiosity took me from Venezuela to Chile and, most recently, to Estonia, where I collaborate with teams across Latin\u202fAmerica, Europe, and 
Africa.\r\n\r\nAfter years in Chile doing marketing consultancy and working with companies like Omnicom Media Group at the regional level, I moved to help Bolt accelerate its marketing-data\u2011driven transformation, and recently shifted just a few tram stops north to Wise\u2014Estonia\u2019s largest tech unicorn\u2014bringing everything I learned from one high\u2011velocity scale\u2011up to another. My focus remains on turning marketing ambitions into measurable, model\u2011powered outcomes, even when the roadmap seems to sprint faster than the release notes.\r\n\r\nBeyond the day job, I\u2019m a core member of PyMC\u202fLabs, the research group behind open\u2011source projects such as PyMC, PyMC\u2011Marketing, CausalPy, and PyTensor. If you run PyMC\u2011Marketing and something unexpectedly works a little better, there\u2019s a non\u2011zero chance it came from one of my late\u2011night pull requests.\r\n\r\nMy long\u2011term goal is to master the hybrid role of \u201cMarketing Scientist\u201d, blending statistical rigor with business storytelling. 
If you like statistics, Bayesian models, data\u2011driven decisions, and the occasional open\u2011source cameo, then let\u2019s connect.", "public_name": "Carlos Trujillo", "guid": "f822b3ac-f8d6-5dfb-bde4-62ed71500e83", "url": "https://cfp.pydata.org/berlin2025/speaker/WDM7WM/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/3XMJM3/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/3XMJM3/", "attachments": []}]}}, {"index": 2, "date": "2025-09-02", "day_start": "2025-09-02T04:00:00+02:00", "day_end": "2025-09-03T03:59:00+02:00", "rooms": {"Kuppelsaal": [{"guid": "1c481ea6-949e-57b7-bb57-3509d8aecfcd", "code": "JKEHMH", "id": 77228, "logo": null, "date": "2025-09-02T09:10:00+02:00", "start": "09:10", "duration": "01:00", "room": "Kuppelsaal", "slug": "berlin2025-77228-narwhals-enabling-universal-dataframe-support", "url": "https://cfp.pydata.org/berlin2025/talk/JKEHMH/", "title": "Narwhals: enabling universal dataframe support", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Keynote", "language": "en", "abstract": "Ever tried passing a Polars Dataframe to a data science library and found that it...just works? No errors, no panics, no noticeable overhead, just...results? This is becoming increasingly common in 2025, yet only 2 years ago, it was mostly unheard of. So, what changed? A large part of the answer is: Narwhals.\r\n\r\nNarwhals is a lightweight compatibility layer between dataframe libraries which lets your code work seamlessly across Polars, pandas, PySpark, DuckDB, and more! And it's not just a theoretical possibility: with ~30 million monthly downloads and set as a required dependency of Altair, Bokeh, Marimo, Plotly, Shiny, and more, it's clear that it's reshaping the data science landscape. 
By the end of the talk, you'll understand why writing generic dataframe code was such a headache (and why it isn't anymore), how Narwhals works and how its community operates, and how you can use it in your projects today. The talk will be technical yet accessible and light-hearted.", "description": "Narwhals is a lightweight and extensible compatibility layer between dataframe libraries. It is already used by several major open source libraries including Altair, Bokeh, Marimo, Plotly, and more. You will learn how to use Narwhals to build dataframe-agnostic tools, how Narwhals gained traction in a short amount of time, and what the future of dataframes looks like.\r\n\r\nThis is a technical talk, and basic familiarity with Python and dataframes will be assumed. We will cover:\r\n\r\n* What the data science landscape looked like in 2024 before Narwhals came onto the scene.\r\n* What problems Narwhals solves, why you can't \"just convert to pandas\" or \"just use PyArrow\".\r\n* How to use Narwhals, with an emphasis on lazy-only computation.\r\n* Static typing.\r\n* Narwhals and SQL.\r\n* Extending Narwhals with your own backend.\r\n* The Narwhals community, and how you can get involved.\r\n* What we think the future of dataframes looks like, and how you can help make it happen.\r\n\r\nTool builders will learn how to build tools for modern dataframe libraries without sacrificing support for foundational classic libraries such as pandas. Data scientists will learn about what goes on under the hood when their favourite tools support their favourite dataframe libraries. Finally, everyone will learn from insights on community building and management.", "recording_license": "", "do_not_record": false, "persons": [{"code": "9DVFDX", "name": "Marco Gorelli", "avatar": "https://cfp.pydata.org/media/avatars/9DVFDX_clPiL1G.webp", "biography": "Marco is the author of Narwhals, core contributor to pandas and Polars, and works at Quansight Labs as Senior Software Engineer. 
He also consults and trains clients professionally on Polars. He has also written the first Polars Plugins Tutorial and has taught Polars Plugins to clients.\r\n\r\nHe has a background in Mathematics and holds an MSc from the University of Oxford, and was one of the prize winners in the M6 Forecasting Competition (2nd place overall Q1).", "public_name": "Marco Gorelli", "guid": "f8effc8d-f4bf-5fcc-9ce4-0df7fe68b727", "url": "https://cfp.pydata.org/berlin2025/speaker/9DVFDX/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/JKEHMH/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/JKEHMH/", "attachments": []}, {"guid": "fe6a24f3-ff8a-54ed-986d-c08d003acb91", "code": "3BVEKT", "id": 78371, "logo": null, "date": "2025-09-02T16:45:00+02:00", "start": "16:45", "duration": "00:45", "room": "Kuppelsaal", "slug": "berlin2025-78371-lightning-talks", "url": "https://cfp.pydata.org/berlin2025/talk/3BVEKT/", "title": "Lightning Talks", "subtitle": "", "track": "Lightning Talks", "type": "Plenary Session [Organizers]", "language": "en", "abstract": "Lightning Talks are short, 5-minute presentations open to all attendees. They\u2019re a fun and fast-paced way to share ideas, showcase projects, spark discussions, or raise awareness about topics you care about \u2014 whether technical, community-related, or just inspiring.\r\n\r\nNo slides are required, and talks can be spontaneous or prepared. It\u2019s a great chance to speak up and connect with the community!", "description": "\u26a1 Lightning Talk Rules\r\n\r\n- No promotion for products or companies.\r\n- No call for 'we are hiring' (but you may name your employer).\r\n- One LT per person per conference policy.\r\n\r\nCommunity Event Announcements\r\n\r\n- \u23f1 You want to announce a community event? 
You have ONE minute.\r\n- All event announcements will be collected in a single slide deck, see instructions at the Lightning Talk desk in the Community Space in the Lounge on Level 1.\r\n\r\nAll other LTs:\r\n\r\n- \u23f1 You have exactly 5 minutes. The clock starts when you start \u2014 and ends when time\u2019s up. That\u2019s the thrill of Lightning Talks \u26a1\r\n- \ud83c\udfaf Be sharp, clear, and fun. Introduce your idea, make your point, give the audience something to remember. No pressure. (Okay, maybe a little.)\r\n- \ud83d\udc0d Keep it relevant to Python, PyData and the community. You can go broad \u2014 tools, workflows, stories, experiments \u2014 as long as there\u2019s some connection to Python, PyData or the community.\r\n- \ud83d\udc4f Keep it respectful. Keep it awesome. Humor is welcome, but please be kind, inclusive, and professional.\r\n- \ud83c\udfa4 Be ready when your name is called. We\u2019re running a tight session \u2014 speakers go on stage rapid-fire. Stay close and stay hyped.", "recording_license": "", "do_not_record": false, "persons": [], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/3BVEKT/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/3BVEKT/", "attachments": []}, {"guid": "11244ab6-85c7-58d1-9b52-1aae46f4742b", "code": "URGKYN", "id": 80783, "logo": null, "date": "2025-09-02T18:00:00+02:00", "start": "18:00", "duration": "01:00", "room": "Kuppelsaal", "slug": "berlin2025-80783-pyladies-empowered-in-tech-social-event-hofbrau-wirtshaus", "url": "https://cfp.pydata.org/berlin2025/talk/URGKYN/", "title": "PyLadies & Empowered in Tech Social Event @Hofbr\u00e4u Wirtshaus", "subtitle": "", "track": "Community & Diversity", "type": "Social Event", "language": "en", "abstract": "Social event organized by PyLadies & Empowered in Tech\r\n\r\nLocation: Hofbr\u00e4u Wirtshaus, Karl-Liebknecht-Str. 
30, 10178 Berlin\r\n\r\nWe\u2019ll meet outside the BCC at 18:00", "description": "**PyLadies** is an international mentorship group with a focus on helping more women and gender non-conforming people become active participants and leaders in the Python open-source community. Its mission is to promote, educate and advance a diverse Python community through outreach, education, conferences, events and social gatherings.\r\n\r\n---\r\n\r\n**Empowered in Tech** is a community in Berlin dedicated to empowering FLINTA (women, lesbians, intersex, non-binary, trans and agender) people to excel in their tech journey. We welcome engineers, software developers, data scientists, designers, product managers, career changers and other professionals in the tech industry. We are open to all tech stacks, programming languages and experience levels. Our goal is to support our members in growing their careers, connecting with like-minded people and feeling welcome in tech.", "recording_license": "", "do_not_record": false, "persons": [], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/URGKYN/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/URGKYN/", "attachments": []}], "B09": [{"guid": "6a38ff69-24e1-5e06-adce-b64e59d4c5b4", "code": "GBVFJ8", "id": 77921, "logo": null, "date": "2025-09-02T10:40:00+02:00", "start": "10:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-77921-probably-fun-games-to-teach-machine-learning", "url": "https://cfp.pydata.org/berlin2025/talk/GBVFJ8/", "title": "Probably Fun: Games to teach Machine Learning", "subtitle": "", "track": "Education, Career & Life", "type": "Tutorial", "language": "en", "abstract": "In this tutorial, you will play several games that can be used to teach machine learning concepts. Each game can be played in big and small groups. Some involve hands-on material such as cards, others involve an electronic app. 
All games contain one or more concepts from Machine Learning.\r\n\r\nAs an outcome, you will take away multiple ideas that make complex topics more understandable \u2013 and enjoyable. By doing so, we would like to demonstrate that Machine Learning does not require computers, but that the core ideas can be exemplified in a clear and memorable way without them. We would also like to demonstrate that gamification is not limited to online quiz questions, but offers ways for learners to bond.\r\n\r\nWe will bring a set of carefully selected games that have been proven in a big classroom setting and contain useful abstractions of linear models, decision trees, LLMs and several other Machine Learning concepts. We also believe that it is probably fun to participate in this tutorial.", "description": "Board gaming has recently been declared part of the intangible cultural heritage in Germany by UNESCO. Games encourage people to use their brains in a focused, constructive and peaceful way. This makes games a fantastic tool in the classroom. While many games contain algorithms and statistical models right under the surface, finding an actual model of Machine Learning is a bit harder. We have put some thought into creating or finding games that have a clear connection to Machine Learning.\r\n\r\nWe have conducted a tutorial featuring board games at PyConDE 2025. This time, we have upped the ante and moved the focus from statistics to Machine Learning. Also, at PyData Berlin we anticipate a particular challenge: we do not expect a room with tables for 80+ people. Therefore, we chose game mechanics that work with minimal material and scale up to big groups. As a consequence, the games would be easier to adapt to a larger class, such as university courses and seminars. Also, we take care to limit the time a game requires. 
In a classroom situation, this allows using the game as a priming exercise that can be followed up with theory and/or practical exercises using computers and programming.\r\n\r\nThe tutorial will be executed according to the following pseudocode (or lesson plan):\r\n\r\n1. Game #1 is played in a plenary (5 min)\r\n2. The presenters give a short introduction on why games matter (5 min)\r\n3. The audience is randomly sampled into teams of 6 (2 min)\r\n4. Game #2 is played in the teams in a cooperative manner (15 min)\r\n5. Game #3 is played in the teams in a cooperative manner (15 min)\r\n6. Game #4 is played with the teams competing against each other (20 min)\r\n7. Winners are determined and applauded (5 min)\r\n8. Game #5 is played in the plenary again (10 min)\r\n9. Q & A and wrap-up (10 min)\r\n\r\nOne of the presenters is certified as a board game educator \"Fachkraft f\u00fcr Gesellschaftsspiele\" by the Brettspielakademie (https://brettspielakademie.de/).\r\n\r\nThe games and lessons have been field-tested with university courses and are made available under a Creative Commons license. You are free to reuse or modify them for your own teaching. Several games (mostly on statistics) and sample lesson plans are available at https://www.academis.eu/probably_fun/ .", "recording_license": "", "do_not_record": false, "persons": [{"code": "9EPNQG", "name": "Dr. Kristian Rother", "avatar": "https://cfp.pydata.org/media/avatars/9EPNQG_QOc7ckn.webp", "biography": "Kristian is a freelance Python trainer who wrote his first lines of Python in the year 11111001111. After a career writing software for life science research, he has been teaching Python, Data Analysis and Machine Learning throughout Europe since 2011. More recently, he has built data pipelines for the real estate and medical sector.\r\n\r\nKristian has translated 5 Python books and written 2 more himself, in addition to numerous teaching guides. Kristian has collected 364 stars on Advent of Code. 
His knowledge about async is, unfortunately, miserable. His favorite Python module is 're'. Kristian believes everybody can learn programming.", "public_name": "Dr. Kristian Rother", "guid": "1096b371-55c7-509f-a52d-73e66c5db09b", "url": "https://cfp.pydata.org/berlin2025/speaker/9EPNQG/"}, {"code": "CSKERW", "name": "Shreyaasri Prakash", "avatar": null, "biography": null, "public_name": "Shreyaasri Prakash", "guid": "87eb6c67-c8d1-5970-aa9c-0749ab630d4d", "url": "https://cfp.pydata.org/berlin2025/speaker/CSKERW/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GBVFJ8/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GBVFJ8/", "attachments": []}, {"guid": "57f34947-5053-51e5-a4c7-8d6c21ffae07", "code": "W9Q7JY", "id": 77707, "logo": null, "date": "2025-09-02T13:40:00+02:00", "start": "13:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-77707-deep-dive-into-the-synthetic-data-sdk", "url": "https://cfp.pydata.org/berlin2025/talk/W9Q7JY/", "title": "Deep Dive into the Synthetic Data SDK", "subtitle": "", "track": "Data Handling & Engineering", "type": "Tutorial", "language": "en", "abstract": "In January the Synthetic Data SDK was introduced, and it is quickly gaining traction as the standard open-source library for creating privacy-preserving synthetic data. In this hands-on tutorial we're going beyond the basics: we'll look at many of the advanced features of the SDK, including differential privacy, conditional generation, multi-table synthesis, and fair synthetic data.", "description": "This hands-on tutorial will take participants beyond the basics of the Synthetic Data SDK, the emerging open-source standard for creating privacy-preserving synthetic data.\r\n\r\nAfter a brief recap of the SDK\u2019s core capabilities, the session will dive into advanced functionality, beginning with an in-depth exploration of differential privacy. 
Attendees will learn how the SDK integrates formal privacy guarantees, configure key parameters (i.e., epsilon and delta), and observe the trade-offs between privacy and utility through live examples.\r\n\r\nThe session will then focus on conditional generation, demonstrating how users can guide synthetic data output based on specific constraints or target values - an essential feature for scenario testing and AI model validation.\r\n\r\nA dedicated section will cover multi-table synthesis, where participants will learn how to model and generate relational datasets with primary-foreign key dependencies, preserving structural and statistical integrity across multiple linked tables.\r\n\r\nFinally, the tutorial will introduce the concept of fair synthetic data, showing how the SDK supports data generation aligned with the principle of statistical parity to help reduce representational bias in downstream use cases.\r\n\r\nEach segment includes interactive coding exercises and real-world datasets to ensure practical understanding. Participants should have a working knowledge of Python and prior experience with the SDK or similar tools.", "recording_license": "", "do_not_record": false, "persons": [{"code": "BYHCTT", "name": "Tobias Hann", "avatar": "https://cfp.pydata.org/media/avatars/BYHCTT_OCR43Qp.webp", "biography": "Tobias is the CEO of MOSTLY AI, the leader in privacy-preserving synthetic data. Originally from Vienna, Austria, he is currently based in Munich, Germany. Before joining MOSTLY AI, Tobias worked as a management consultant with the Boston Consulting Group and in tech start-ups in different leadership roles. He earned a PhD from the Vienna University of Business and Economics and an MBA from the Haas School of Business at UC Berkeley. 
With his extensive background in strategy and technology, Tobias drives MOSTLY AI\u2019s mission to revolutionize data access and data insights across industries.", "public_name": "Tobias Hann", "guid": "228ec27e-9b7b-5b90-9632-981408b8995e", "url": "https://cfp.pydata.org/berlin2025/speaker/BYHCTT/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/W9Q7JY/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/W9Q7JY/", "attachments": []}, {"guid": "fd7d6eca-7e17-5b4a-a0e1-fe9d2e459421", "code": "ZXTLEW", "id": 77956, "logo": null, "date": "2025-09-02T15:50:00+02:00", "start": "15:50", "duration": "00:45", "room": "B09", "slug": "berlin2025-77956-forget-the-cloud-building-lean-batch-pipelines-from-tcp-streams-with-python-and-duckdb", "url": "https://cfp.pydata.org/berlin2025/talk/ZXTLEW/", "title": "Forget the Cloud: Building Lean Batch Pipelines from TCP Streams with Python and DuckDB", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk (long)", "language": "en", "abstract": "Many industrial and legacy systems still push critical data over TCP streams. Instead of reaching for heavyweight cloud platforms, you can build fast, lean batch pipelines on-prem using Python and DuckDB.\r\n\r\nIn this talk, you'll learn how to turn raw TCP streams into structured data sets, ready for analysis, all running on-premise. We'll cover key patterns for batch processing, practical architecture examples, and real-world lessons from industrial projects.\r\n\r\nIf you work with sensor data, logs, or telemetry, and you value simplicity, speed, and control, this talk is for you.", "description": "Cloud-native tools are everywhere. But not every system can or should move to the cloud.\r\n\r\nIn many industries like manufacturing, logistics, or energy, TCP streams remain the backbone of real-time data exchange. 
These systems are often on-premise, resource-constrained, and mission-critical.\r\n\r\nThis talk shows how you can build lean, powerful batch pipelines with source data coming from TCP streams using Python and DuckDB. All without the complexity of cloud services.\r\n\r\nWe'll cover:\r\n\r\n- Why TCP streams still matter\r\n- Stream vs. Batch: Choosing the right model for industrial data\r\n- Pipeline architecture: From streams to batch\r\n- DuckDB + Python: The perfect combo for lightweight analytics\r\n- Key pitfalls along the way\r\n- Limitations of this approach\r\n\r\n\r\nYou'll walk away with:\r\n\r\n- Ready-to-use patterns for TCP-based data pipelines\r\n- Insights on when to avoid unnecessary cloud complexity\r\n- Tips for building fast, reliable batch jobs on local infrastructure\r\n\r\nWhether you process factory sensor data, machine logs, or legacy telemetry, this talk will give you modern tools to make your data streams actionable and efficient.", "recording_license": "", "do_not_record": false, "persons": [{"code": "7H3JY8", "name": "Orell Garten", "avatar": "https://cfp.pydata.org/media/avatars/7H3JY8_69A3T08.webp", "biography": "Software and data engineering consultant. I build data systems that help businesses to answer questions about their business. 
I like solving problems in a pragmatic way.", "public_name": "Orell Garten", "guid": "c8748514-701f-5f1b-ab33-15b37bed1df8", "url": "https://cfp.pydata.org/berlin2025/speaker/7H3JY8/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/ZXTLEW/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/ZXTLEW/", "attachments": []}], "B07-B08": [{"guid": "c9b651a9-f1ee-55d1-af74-3b020e433b06", "code": "KPHH7H", "id": 78332, "logo": null, "date": "2025-09-02T10:40:00+02:00", "start": "10:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-78332-training-specialized-language-models-with-less-data-an-end-to-end-practical-guide", "url": "https://cfp.pydata.org/berlin2025/talk/KPHH7H/", "title": "Training Specialized Language Models with Less Data: An End-to-End Practical Guide", "subtitle": "", "track": "Natural Language Processing & Audio (incl. Generative AI NLP)", "type": "Talk [Sponsored]", "language": "en", "abstract": "Small Language Models (SLMs) offer an efficient and cost-effective alternative to LLMs\u2014especially when latency, privacy, inference costs or deployment constraints matter. However, training them typically requires large labeled datasets and is time-consuming, even if it isn't your first rodeo.\r\n\r\nThis talk presents an end-to-end approach for curating high-quality synthetic data using LLMs to train domain-specific SLMs. Using a real-world use case, we\u2019ll demonstrate how to reduce manual labeling time, cut costs, and maintain performance\u2014making SLMs viable for production applications.\r\n\r\nWhether you are a seasoned Machine Learning Engineer or a person just getting started with building AI features, you will come away with the inspiration to build more performant, secure and environmentally-friendly AI systems.", "description": "Training effective language models typically involves two major bottlenecks: the need for vast amounts of labeled data and the engineering complexity of fine-tuning. 
This talk introduces a practical framework for addressing both, enabling teams to build small, domain-specialized language models (SLMs) that are deployable, secure, and cost-efficient\u2014without needing massive labeled datasets.\r\n\r\nSLMs are especially well-suited for focused tasks such as classification, function calling, or question answering, where full-scale LLMs are overkill. They are smaller, faster, and easier to deploy on local or mobile infrastructure\u2014making them ideal for latency-sensitive, privacy-conscious, or resource-limited applications. However, fine-tuning them still traditionally requires manually labeled data in the tens of thousands.\r\n\r\nOur approach uses synthetic data generation and validation techniques to drastically reduce the labeling burden. Leveraging large language models (LLMs) as \u201cteacher models,\u201d we generate and curate synthetic training data tailored to specific tasks. This data, combined with a handful of manually labeled examples and a clear task description, is then used to fine-tune SLMs (\u201cstudent models\u201d) that match or exceed the performance of larger models on the same narrow tasks.\r\n\r\nWe'll walk through a detailed example focused on a real-life use case covering:\r\n- Task scoping: How to define your model\u2019s purpose and output space clearly.\r\n- Synthetic data generation: Prompting LLMs to generate meaningful and diverse examples.\r\n- Data validation: Techniques for filtering out poor-quality, duplicate, or malformed synthetic data.\r\n- Model fine-tuning: How the student model is trained to emulate the teacher\u2019s domain knowledge.\r\n- Deployment: Delivering the model as binaries for use on internal infrastructure or edge devices.\r\n\r\nWe\u2019ll also discuss key challenges teams face in adopting this approach\u2014such as validation bottlenecks, overfitting on synthetic data, and the need for interpretable task definitions\u2014and how we\u2019ve addressed them in 
production environments.\r\n\r\nThis talk is targeted at data scientists, ML engineers, and tech leads who are looking for pragmatic strategies to bring specialized AI features into production without relying on API-based LLMs or manual annotation at scale. No prior knowledge of model distillation is required, though basic familiarity with supervised learning and model training will be helpful.\r\n\r\nAttendees will leave with:\r\n- A concrete workflow for training SLMs using synthetic data\r\n- Insights into trade-offs between SLMs and LLMs\r\n- Techniques for validating and curating LLM-generated data\r\n- A better understanding of when and how to deploy small models effectively in production\r\n\r\nThis is not a theoretical talk. It is a field-tested approach grounded in real use cases, designed to empower small teams to build efficient, private, and reliable NLP systems.", "recording_license": "", "do_not_record": false, "persons": [{"code": "7B9VSH", "name": "Jacek Golebiowski", "avatar": "https://cfp.pydata.org/media/avatars/7B9VSH_UmLXReJ.webp", "biography": "Jacek is the CTO of distil labs, making it easy to build specialized AI agents that can be deployed on-device/on-prem. Before that, he was a machine learning team lead at AWS, working on the core components of AWS Q, Automated ML, and natural language processing. 
He holds a PhD in Machine Learning for Quantum Mechanics from Imperial College London.", "public_name": "Jacek Golebiowski", "guid": "c42f929a-3400-5acc-be30-98e09d21c2f5", "url": "https://cfp.pydata.org/berlin2025/speaker/7B9VSH/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/KPHH7H/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/KPHH7H/", "attachments": []}, {"guid": "a1132f13-93ad-5bbe-ac39-ac56fc293db5", "code": "CAUAZY", "id": 77882, "logo": null, "date": "2025-09-02T11:20:00+02:00", "start": "11:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77882-most-ai-agents-are-useless-let-s-fix-that", "url": "https://cfp.pydata.org/berlin2025/talk/CAUAZY/", "title": "Most AI Agents Are Useless. Let\u2019s Fix That", "subtitle": "", "track": "Natural Language Processing & Audio (incl. Generative AI NLP)", "type": "Talk", "language": "en", "abstract": "AI agents are having a moment, but most of them are little more than fragile prototypes that break under pressure. Together, we\u2019ll explore why so many agentic systems fail in practice, and how to fix that with real engineering principles. In this talk, you\u2019ll learn how to build agents that are modular, observable, and ready for production. If you\u2019re tired of LLM demos that don\u2019t deliver, this talk is your blueprint for building agents that actually work.", "description": "Let\u2019s face it: most AI agents are glorified demos. They look flashy, but they\u2019re brittle, hard to debug, and rarely make it into real products. Why? Because wiring an LLM to a few tools is easy. Engineering a robust, testable, and scalable system is hard.\r\n\r\nThis talk is for practitioners, data scientists, AI engineers, and developers who want to stop tinkering and start shipping. 
We\u2019ll take a candid look at the common reasons agent systems fail and introduce practical patterns to fix them using Haystack, an open-source Python framework to build custom AI applications.\r\n\r\nYou\u2019ll learn how to design agents that are:\r\n\r\n- **Modular**, so they\u2019re easy to extend and evolve\r\n- **Observable**, so you can trace failures and understand the behavior\r\n- **Maintainable**, so they don\u2019t become one-off science projects\r\n\r\nWe\u2019ll also cover advanced topics like multimodal inputs and Model Context Protocol (MCP) to push your agents into more capable territory.\r\n\r\nWhether you\u2019re just starting to explore agents or trying to tame an unruly prototype, you\u2019ll leave with a clear, actionable blueprint to build something that\u2019s not just smart, but also reliable.", "recording_license": "", "do_not_record": false, "persons": [{"code": "PVSNUG", "name": "Bilge Y\u00fccel", "avatar": "https://cfp.pydata.org/media/avatars/PVSNUG_6VcFG3b.webp", "biography": "Bilge is a developer relations engineer at deepset, where she helps developers build powerful AI applications and teaches the world how to use Haystack. Passionate about RAG, LLMs, and all things Gen AI, she enjoys making complex AI concepts accessible both online and at real-life events", "public_name": "Bilge Y\u00fccel", "guid": "f8c5b23a-9de7-56c2-8983-a22c3cce5eb0", "url": "https://cfp.pydata.org/berlin2025/speaker/PVSNUG/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/CAUAZY/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/CAUAZY/", "attachments": []}, {"guid": "8ffd3416-6807-5609-bcbd-f3e7c4d84bdf", "code": "NUNXEV", "id": 77791, "logo": null, "date": "2025-09-02T12:00:00+02:00", "start": "12:00", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77791-one-api-to-rule-them-all-litellm-in-production", "url": "https://cfp.pydata.org/berlin2025/talk/NUNXEV/", "title": "One API to Rule Them All? 
LiteLLM in Production", "subtitle": "", "track": "Generative AI", "type": "Talk", "language": "en", "abstract": "Using LiteLLM in a Real-World RAG System: What Worked and What Didn\u2019t\r\n\r\nLiteLLM provides a unified interface to work with multiple LLM providers\u2014but how well does it hold up in practice? In this talk, I\u2019ll share how we used LiteLLM in a production system to simplify model access and handle token budgets. I\u2019ll outline the benefits, the hidden trade-offs, and the situations where the abstraction helped\u2014or got in the way. This is a practical, developer-focused session on integrating LiteLLM into real workflows, including lessons learned and limitations. If you\u2019re considering LiteLLM, this talk offers a grounded look at using it beyond simple prototypes.", "description": "Building a real-world LLM system often means juggling different providers, endpoints, and API quirks. LiteLLM promises a unified interface across model backends\u2014but how well does it hold up in production?\r\n\r\nIn this talk, I\u2019ll share how we integrated LiteLLM into a real-world system that includes budget usage tracking and other production concerns. From provider switching to budget handling, I\u2019ll walk through the benefits we saw\u2014and the challenges we hit. I\u2019ll also touch on the limits of abstraction. 
You\u2019ll get a practical look at where LiteLLM helped us and where it did not.\r\n\r\n**Key Takeaways**\r\n- Understand how LiteLLM can be used to unify access to multiple LLM providers\r\n- Learn how it fits into a real production pipeline (especially budget management and model load balancing)\r\n\r\n\r\n**Target Audience**\r\n- Developers and engineers working with LLMs in production\r\n- Anyone curious about LiteLLM\u2019s strengths and limitations in a real system", "recording_license": "", "do_not_record": false, "persons": [{"code": "JNDJND", "name": "Alina Dallmann", "avatar": "https://cfp.pydata.org/media/avatars/JNDJND_W9pG6Lm.webp", "biography": "Alina Dallmann is a computer scientist currently working as a Data Scientist at scieneers GmbH. Her enthusiasm for classical software development and data-driven projects has recently come together in various projects focused on building retrieval-augmented generation (RAG) systems.", "public_name": "Alina Dallmann", "guid": "8d6a67e5-489d-57f4-a151-ad9d2e2c6b18", "url": "https://cfp.pydata.org/berlin2025/speaker/JNDJND/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/NUNXEV/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/NUNXEV/", "attachments": []}, {"guid": "5970bd5f-7b65-5969-959d-f15c0dba3b00", "code": "BCGJQB", "id": 77541, "logo": "https://cfp.pydata.org/media/berlin2025/submissions/BCGJQB/numpyro_hierarchical_fore_LHObegf.png", "date": "2025-09-02T13:40:00+02:00", "start": "13:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77541-scaling-probabilistic-models-with-variational-inference", "url": "https://cfp.pydata.org/berlin2025/talk/BCGJQB/", "title": "Scaling Probabilistic Models with Variational Inference", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "This talk presents variational inference as a tool to scale probabilistic models. 
We describe practical examples with NumPyro and PyMC to demonstrate this method, going through the main concepts and diagnostics. Instead of going heavy on the math, we focus on the code and practical tips to make this work in real industry applications.", "description": "Probabilistic models have proven to be a great tool for solving business-critical problems in fields such as marketing, demand forecasting, and risk-based optimization. One of the biggest challenges is scaling these models to large data sets and efficiently utilizing modern computing power. \r\n\r\nThis talk addresses the challenges of scaling probabilistic models using variational inference and other similar methods. We will explain the core concepts of variational inference in an accessible way, avoiding heavy mathematics. We will use practical examples with NumPyro and PyMC to demonstrate how to apply variational inference effectively, starting with simple models and then showing applications with custom forecasting models and neural network components. Additionally, we will cover diagnostics such as simulation-based calibration and coverage to ensure model reliability. Our discussion will also include strategies for scaling, including mini-batch optimization and distributed computing.", "recording_license": "", "do_not_record": false, "persons": [{"code": "ADJDMC", "name": "Dr. Juan Orduz", "avatar": "https://cfp.pydata.org/media/avatars/ADJDMC_OcS7UZk.webp", "biography": "Juan is a Mathematician (Ph.D., Humboldt Universit\u00e4t zu Berlin) and data scientist. He is interested in interdisciplinary applications of mathematical methods, particularly time series analysis, Bayesian methods, and causal inference.", "public_name": "Dr. 
Juan Orduz", "guid": "bce64d59-705c-50b3-a6de-415e9b8a5ad1", "url": "https://cfp.pydata.org/berlin2025/speaker/ADJDMC/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/BCGJQB/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/BCGJQB/", "attachments": []}, {"guid": "ddd55345-9dd6-5ed1-a8bb-77f752cdf455", "code": "WGJJQN", "id": 77835, "logo": null, "date": "2025-09-02T14:20:00+02:00", "start": "14:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77835-how-we-automate-chaos-agentic-ai-and-community-ops-at-pycon-de-pydata", "url": "https://cfp.pydata.org/berlin2025/talk/WGJJQN/", "title": "How We Automate Chaos: Agentic AI and Community Ops at PyCon DE & PyData", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "Using AI agents and automation, PyCon DE & PyData volunteers have transformed chaos into streamlined conference ops. From YAML files to LLM-powered assistants, they automate speaker logistics, FAQs, video processing, and more while keeping humans focused on creativity. This case study reveals practical lessons on making AI work in real-world scenarios: structured workflows, validation, and clear context beat hype. Live demos and open-source tools included.", "description": "Every year, PyCon DE & PyData is run by a rotating crew of volunteers who build a full conference from scratch \u2014 in their spare time, with limited tools, shifting knowledge, and lots of coffee. It\u2019s like launching a startup, dismantling it, and repeating from memory.\r\n\r\nTo survive (and stay sane), we\u2019ve turned conference ops into a sandbox for automation \u2014 leaning on scripts, structured documentation, and increasingly, agentic AI systems. 
Think YAML files, GitHub Actions, custom bots, and LLM-powered assistants doing the boring stuff, so humans can focus on creativity and connection.\r\n\r\nThis talk is a no-fluff case study in what it actually takes to make automation \u2014 and especially AI agents \u2014 work in the wild:\r\n * How we went from chaotic Notion boards to reproducible workflows\r\n * How we use LLMs + APIs (LLMs, GitHub, Google, Drives, Pretalx, Pretix,\u2026) to support speaker logistics, FAQs, video app, video cuts, certificates of participation and schedule drafts\r\n * Why Pydantic, Structure and even simple scripts matter more than hype\r\n * And most importantly: why agents are useless without clear structure, validation, and context\r\n\r\nWe\u2019ll show live examples, share the open tools we\u2019ve built (and broken), and make the case that good community infrastructure is open-source-worthy. If you\u2019re building tools for humans, this talk is for you.\r\n\r\nWant to help? We\u2019re actively looking for contributors, testers, and curious minds to build better community tech together \u2014 come chat after the talk or find us online.", "recording_license": "", "do_not_record": false, "persons": [{"code": "78ZETH", "name": "Alexander CS Hendorf", "avatar": "https://cfp.pydata.org/media/avatars/78ZETH_hPaAQIk.webp", "biography": "Alexander C. S. Hendorf has over 20 years of experience in digitalization, data, and artificial intelligence. As an independent consultant, he focuses on the practical implementation, adoption, and communication of data- and AI-driven strategies and decision-making processes.\r\n\r\nWhile still in law school, he worked as a DJ\u2014before dropping out to join a transatlantic music start-up. The venture evolved into a decent independent label group and, eventually, a small stock corporation, where Alexander became a partner and, at 28, took over as COO. 
He led the company\u2019s digital transformation and designed systems that could scale with growth. This entrepreneurial journey laid the foundation for his deep understanding of business strategy, technology, and innovation.\r\n\r\nAfter closing the chapter on digital music, Alexander turned his focus to data science and AI\u2014initially driven by curiosity, with weekends on Coursera and evenings on GPUs. That passion evolved into a career advising organizations on AI integration, data strategy, and building impact-driven teams.\r\n\r\nSome say he just picks the flashiest jobs\u2014record label owner, data scientist\u2014but really, he follows his passion: for what\u2019s new, what matters, and what connects people and technology.\r\n\r\nToday, he supports clients\u2014especially in regulated or legacy-heavy industries\u2014in aligning emerging technologies with real-world business goals. His work emphasizes cultural impact, sustainable change, and interdisciplinary thinking.\r\n\r\nAlexander is a recognized expert in data intelligence and a frequent speaker and chair at international conferences, including PyCon DE & PyData, Data2Day, and EuroPython. 
He\u2019s a Python Software Foundation Fellow, EuroPython Fellow, and board member of the Python Software Verband (Germany).\r\n\r\nSince 2024, he has been driving [Pioneers Hub](https://pioneershub.org), a non-profit supporting vibrant, inclusive tech communities\u2014and helping innovators keep pace in a rapidly changing world.", "public_name": "Alexander CS Hendorf", "guid": "18677b6a-d09e-55d8-8138-b81b4e33abe8", "url": "https://cfp.pydata.org/berlin2025/speaker/78ZETH/"}], "links": [{"title": "Talk Slides: Agentic AI and Community Ops at PyCon DE & PyData", "url": "https://bit.ly/AI-AGENTS-BER", "type": "related"}], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/WGJJQN/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/WGJJQN/", "attachments": []}, {"guid": "02d602c6-1023-5b4f-8399-64bcad27ecd8", "code": "KEJJSP", "id": 77665, "logo": null, "date": "2025-09-02T15:50:00+02:00", "start": "15:50", "duration": "00:45", "room": "B07-B08", "slug": "berlin2025-77665-template-based-web-app-and-deployment-pipeline-at-an-enterprise-ready-level-on-azure", "url": "https://cfp.pydata.org/berlin2025/talk/KEJJSP/", "title": "Template-based web app and deployment pipeline at an enterprise-ready level on Azure", "subtitle": "", "track": "Infrastructure - Hardware & Cloud", "type": "Talk (long)", "language": "en", "abstract": "A practical deep-dive into Azure DevOps pipelines, the Azure CLI, and how to combine pipeline, bicep, and python templates to build a fully automated web app deployment system. Deploying a new proof of concept app within an actual enterprise environment never was faster.", "description": "In many enterprise environments, deploying a proof-of-concept data app to the cloud remains frustratingly slow and manual. Early user feedback often depends on clunky screen shares or static screenshots. 
This talk shows how we transformed that process - automating everything from infrastructure provisioning to web app deployment - using a system of pipeline, Bicep, and Python templates. The result? Stakeholders can interact with a working Streamlit app within minutes of a commit, with no further manual setup required.\r\n\r\nWe take you with us on our journey from awkward beginnings to an elegant template-based setup, where all steps of the configuration and deployment process are automated. All Azure resources are created without manual steps. And it takes only one bio-break from submitting your work to the repository to the business user being able to test it live. Along the way we share best practices and pitfalls we discovered, as well as how we structure our templates and repositories, for both the web app and the deployment pipeline. At the end, we will deploy a new web app together and explore the workings of the system live.\r\n\r\nWhile the concept will need adaptation for other providers, you don't need to use Azure to benefit from this talk - all cloud platforms share similar tools and challenges.\r\n\r\nDetailed Outline:\r\n\r\n1\\. Motivation (5 min)\r\n\r\n- Why it's hard to get user feedback early and why that is problematic\r\n- Why it's hard to get a real application running early\r\n- What if we could automate app deployment and configuration, or how the NKD data science teams went from awkward to awesome\r\n\r\n2\\. The app creation, deployment, and configuration process (12 min)\r\n\r\n- Struggles and best practices\r\n- Tools that help with consistency and automation\r\n- Handling virtual environments across dev systems and the cloud\r\n- Web app and pipeline repositories and templates\r\n\r\n3\\. 
The pipeline (18 min)\r\n\r\n- Structure of the stages\r\n- Minimizing manual configuration with file parsing and Bicep\r\n- Matching branch and target server\r\n- Automated Azure resource creation using Azure CLI\r\n- App authorization and authentication configuration with more Azure CLI\r\n- Finally, the deployment\r\n\r\n4\\. Showcase (5 min)\r\n\r\n- What the setup looks like when it's fully set up\r\n- Deploying an app live\r\n\r\nKey Takeaways:\r\n\r\n- How to reduce app deployment time from days to minutes using automated templates\r\n- Collaboration setup for small and medium-sized data teams\r\n- Best practices for structuring pipelines and web apps for consistency, security, and scalability\r\n- What not to do: key pitfalls we encountered and how we fixed them\r\n\r\nTarget audience:\r\n\r\nData or machine learning scientists or engineers in small or medium-sized teams, who want to deploy web apps faster and in a more consistent way. Attendees should be comfortable with Python and have basic familiarity with web apps or DevOps principles. 
While Azure users benefit most, no in-depth knowledge is required - concepts will be explained as we go.", "recording_license": "", "do_not_record": false, "persons": [{"code": "BBLTES", "name": "Johannes Sch\u00f6ck", "avatar": "https://cfp.pydata.org/media/avatars/BBLTES_xXEb7F6.webp", "biography": "A natural scientist by training and expert in power semiconductors, Johannes taught himself data science skills and now does what he loves: solving data and tech challenges that generate value.", "public_name": "Johannes Sch\u00f6ck", "guid": "b516ac82-d297-5f36-845b-9c7a98723e94", "url": "https://cfp.pydata.org/berlin2025/speaker/BBLTES/"}], "links": [{"title": "Repository", "url": "https://github.com/JSchoeck/talk_webapp_template_and_pipeline_on_azure", "type": "related"}, {"title": "Slides", "url": "https://github.com/JSchoeck/talk_webapp_template_and_pipeline_on_azure/blob/main/Template-based%20Web%20App%20and%20Deployment%20Pipeline%20on%20Azure%20in%20an%20enterprise%20environment.pdf", "type": "related"}], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/KEJJSP/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/KEJJSP/", "attachments": []}], "B05-B06": [{"guid": "017ce203-4d75-50ae-bf17-be85a167dd55", "code": "DBL9PQ", "id": 77894, "logo": null, "date": "2025-09-02T10:40:00+02:00", "start": "10:40", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77894-the-importance-and-elegance-of-polars-expressions", "url": "https://cfp.pydata.org/berlin2025/talk/DBL9PQ/", "title": "The Importance and Elegance of Polars Expressions", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "Polars is known for its speed, but its elegance comes from its use of expressions. In this talk, we\u2019ll explore how Polars expressions work and why they are key to efficient and elegant data manipulation. 
Through real-world examples, you\u2019ll learn how to create, expand, and combine expressions in Polars to wrangle data more effectively.", "description": "Polars has gained popularity for its speed, but what truly makes it stand out is its syntax, especially the use of expressions. The book Python Polars: The Definitive Guide defines an expression as \"a tree of operations that describe how to construct one or more Series.\" In this talk, we\u2019ll demystify this concept, explaining how expressions make Polars an elegant tool for data manipulation.\r\n\r\nWe will cover:\r\n\r\n- Why expressions are crucial in Polars\r\n- A formal definition of an expression and what it means in practice\r\n- Creating expressions from existing columns or other values\r\n- Using expressions to select, filter, sort, and aggregate data\r\n- Applying expressions for aggregate statistics, mathematical transformations, and handling missing values\r\n- Combining expressions with operators, comparisons, and Boolean logic\r\n- A comparison of idiomatic vs. non-idiomatic Polars code\r\n\r\nBy the end of this talk, you\u2019ll understand how to leverage Polars expressions to write cleaner and more efficient data manipulation code.", "recording_license": "", "do_not_record": false, "persons": [{"code": "C7AFFQ", "name": "Jeroen Janssens", "avatar": "https://cfp.pydata.org/media/avatars/C7AFFQ_jZMuQv9.webp", "biography": "Jeroen Janssens, PhD, is a Senior Developer Relations Engineer at Posit, PBC. His expertise lies in visualizing data, implementing machine learning models, and building solutions using Python, R, JavaScript, and Bash. He\u2019s passionate about open source and sharing knowledge. He\u2019s the author of Python Polars: The Definitive Guide (O\u2019Reilly, 2025) and Data Science at the Command Line (O\u2019Reilly, 2021). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. 
He lives with his wife and two kids in Rotterdam, the Netherlands.", "public_name": "Jeroen Janssens", "guid": "eba629fb-1976-5196-b046-7dc5a7495533", "url": "https://cfp.pydata.org/berlin2025/speaker/C7AFFQ/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/DBL9PQ/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/DBL9PQ/", "attachments": []}, {"guid": "ae3a9006-4088-56d5-86bd-f69c20a7faa0", "code": "HUNUEB", "id": 77643, "logo": null, "date": "2025-09-02T11:20:00+02:00", "start": "11:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77643-causal-inference-in-network-structures-lessons-learned-from-financial-services", "url": "https://cfp.pydata.org/berlin2025/talk/HUNUEB/", "title": "Causal Inference in Network Structures: Lessons Learned From Financial Services", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "*Causal inference techniques are crucial to understanding the impact of actions on outcomes.* *This talk shares lessons learned from applying these techniques in real-world scenarios where standard methods do not immediately apply. Our key question is: What is the causal impact of wealth planning services on a network of individuals\u2019 investments and securities? We'll examine the challenges posed by practical constraints and show how to deal with them before applying standard approaches like staggered difference-in-differences.* \r\n\r\n*This self-contained talk is prepared for general data scientists who want to add causal inference techniques to their toolbox and learn from real-world data challenges.*", "description": "Wealth planning is a service offered by financial institutions. The advice helps clients grow their wealth through investing. This talk focuses on measuring the true impact of these services on a network of individuals\u2019 investments and securities. 
However, measuring this impact presents several practical challenges, which will be tackled in this talk:\r\n\r\n   1) Controlled experiments are often impossible in practice, leaving only observational data available.\r\n   \r\n   2) Defining robust control groups is challenging when treatments are administered to individuals in relationship graphs at different times.\r\n\r\n   3) Analysis must account for multiple outcomes with different modalities\u2014securities (time-series) and investing (binary).\r\n\r\n   4) The parallel-trend assumption doesn't immediately hold.\r\n\r\n   5) The confounding effect of market trends on the outcome needs to be corrected.", "recording_license": "", "do_not_record": false, "persons": [{"code": "SGPWBV", "name": "Danial Senejohnny", "avatar": "https://cfp.pydata.org/media/avatars/SGPWBV_mAJd0XR.webp", "biography": "Danial is a data scientist & analytics translator with a PhD in applied mathematics (systems & control). In his career, he has worked across different sectors, e.g. manufacturing, cybersecurity, healthcare, and finance. 
In his current adventure at ABN AMRO, he contributes to personalized solutions that improve clients\u2019 experience and satisfaction.", "public_name": "Danial Senejohnny", "guid": "b6f3f551-88f3-5650-8cf8-010a0753c272", "url": "https://cfp.pydata.org/berlin2025/speaker/SGPWBV/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/HUNUEB/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/HUNUEB/", "attachments": []}, {"guid": "f4d8a5f6-52d5-50b0-be43-fe8f744d218d", "code": "GPZPFP", "id": 77770, "logo": null, "date": "2025-09-02T12:00:00+02:00", "start": "12:00", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77770-building-reactive-data-apps-with-shinylive-and-webassembly", "url": "https://cfp.pydata.org/berlin2025/talk/GPZPFP/", "title": "Building Reactive Data Apps with Shinylive and WebAssembly", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "WebAssembly is reshaping how Python applications can be delivered - allowing fully interactive apps that run directly in the browser, without a traditional backend server. In this talk, I\u2019ll demonstrate how to build reactive, data-driven web apps using Shinylive for Python, combining efficient local storage with Parquet and extending functionality with optional FastAPI cloud services. We\u2019ll explore the benefits and limitations of this architecture, share practical design patterns, and discuss when browser-based Python is the right choice. Attendees will leave with hands-on techniques for creating modern, lightweight, and highly responsive Python data applications.", "description": "In recent years, WebAssembly (Wasm) has opened new frontiers for delivering Python applications - enabling fully interactive, browser-native experiences without requiring a traditional server backend. 
This paradigm shift is particularly exciting for data scientists and developers looking to build lightweight, highly responsive data apps that can be deployed as static websites, reducing infrastructure complexity while improving user experience.\r\n\r\nIn this talk, I will walk through how to use Shinylive for Python, an emerging framework that combines reactive programming principles with the power of WebAssembly, to create rich data applications that run entirely in the browser. We\u2019ll cover how Shinylive translates reactive code into client-side interactions, eliminating the need for round-trips to a Python server. I\u2019ll also introduce techniques for integrating efficient local storage (via Apache Parquet) and show how optional FastAPI services can be layered on for hybrid architectures when needed.\r\n\r\nThis talk is intended for data scientists, machine learning engineers, and Python developers who are interested in building modern web applications without becoming full-time JavaScript engineers. Attendees will leave with practical techniques for building and deploying reactive data apps that run entirely in the browser.", "recording_license": "", "do_not_record": false, "persons": [{"code": "KETSDD", "name": "Christoph Scheuch", "avatar": "https://cfp.pydata.org/media/avatars/KETSDD_vOM4XY6.webp", "biography": "Christoph Scheuch is an independent data science and business intelligence expert, currently serving as an external lecturer at Humboldt University of Berlin and as a summer school instructor at the Barcelona School of Economics. 
He is the co-creator and maintainer of the [Tidy Finance](https://www.tidy-finance.org/) project, an open-source initiative promoting transparent and reproducible research in financial economics, and the [EconDataverse](https://www.econdataverse.org/), a universe of open-source packages to work seamlessly with economic data in R and Python.\r\n\r\nPreviously, Christoph held leadership roles at the social trading platform wikifolio.com, including Head of Artificial Intelligence, Director of Product, and Head of BI & Data Science. He has also lectured at the Vienna University of Economics and Business, where he earned his PhD in Finance through the Vienna Graduate School of Finance.", "public_name": "Christoph Scheuch", "guid": "43fc6027-1a52-52db-b10a-3078eda45852", "url": "https://cfp.pydata.org/berlin2025/speaker/KETSDD/"}], "links": [{"title": "Link to repo with slides & example app", "url": "https://github.com/tidy-intelligence/pydata-berlin-2025", "type": "related"}], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GPZPFP/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GPZPFP/", "attachments": []}, {"guid": "92ff4ed7-9e7a-533e-ada0-850edfa0a7d3", "code": "YZ9BY7", "id": 80816, "logo": null, "date": "2025-09-02T13:40:00+02:00", "start": "13:40", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-80816-data-science-in-containers-the-good-the-bad-and-the-ugly", "url": "https://cfp.pydata.org/berlin2025/talk/YZ9BY7/", "title": "Data science in containers: the good, the bad, and the ugly", "subtitle": "", "track": "Infrastructure - Hardware & Cloud", "type": "Talk", "language": "en", "abstract": "If we want to run data science workloads (e.g. using Tensorflow, PyTorch, and others) in containers (for local development or production on Kubernetes), we need to build container images. 
Doing that with a Dockerfile is fairly straightforward, but is it the best method?\r\nIn this talk, we'll take a well-known speech-to-text model (Whisper) and show various ways to run it in containers, comparing the outcomes in terms of image size and build time.", "description": "We'll demonstrate how to switch versions DRY-style (without maintaining multiple Dockerfiles!), how to leverage newer techniques like BuildKit cache mounts, and discuss other important considerations like the use of Alpine with Python, progressive image loading, and model loading strategies.\r\n\r\nAttendees will learn practical containerization techniques specifically tailored for data science workflows, with concrete examples using the Whisper model as our case study.", "recording_license": "", "do_not_record": false, "persons": [{"code": "MDDQY3", "name": "J\u00e9r\u00f4me Petazzoni", "avatar": "https://cfp.pydata.org/media/avatars/MDDQY3_BGSAy3p.webp", "biography": "J\u00e9r\u00f4me was part of the team that built, scaled, and operated the dotCloud PAAS, before that company became Docker. He's now an independent consultant, and since he loves to share what he learned, he continues to give many talks and demos on containers, Docker, and Kubernetes. He values diversity, and strives to be a good ally, or at least a decent social justice sidekick. 
He also collects musical instruments and can arguably play the theme of Zelda on a dozen of them.", "public_name": "J\u00e9r\u00f4me Petazzoni", "guid": "fee28126-e7f3-5fac-943e-9bd80c070b05", "url": "https://cfp.pydata.org/berlin2025/speaker/MDDQY3/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/YZ9BY7/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/YZ9BY7/", "attachments": []}, {"guid": "a7fe4b67-2534-55d7-90f5-8c707350e578", "code": "YKFWKQ", "id": 77660, "logo": null, "date": "2025-09-02T14:20:00+02:00", "start": "14:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77660-beyond-benchmarks-practical-evaluation-strategies-for-compound-ai-systems", "url": "https://cfp.pydata.org/berlin2025/talk/YKFWKQ/", "title": "Beyond Benchmarks: Practical Evaluation Strategies for Compound AI Systems", "subtitle": "", "track": "Natural Language Processing & Audio (incl. Generative AI NLP)", "type": "Talk", "language": "en", "abstract": "Evaluating large language models (LLMs) in real-world applications goes far beyond standard benchmarks. When LLMs are embedded in complex pipelines, choosing the right models, prompts, and parameters becomes an ongoing challenge.\r\n\r\nIn this talk, we will present a practical, human-in-the-loop evaluation framework that enables systematic improvement of LLM-powered systems based on expert feedback. By combining domain expert insights and automated evaluation methods, it is possible to iteratively refine these systems while building transparency and trust.\r\n\r\nThis talk will be valuable for anyone who wants to ensure their LLM applications can handle real-world complexity - not just perform well on generic benchmarks.", "description": "As large language models become integral to real-world applications, evaluating and improving their performance is a growing challenge. 
Generic benchmarks and simple metrics fail to adequately assess the domain-specific, multi-step reasoning required by compound AI pipelines like retrieval-augmented generation (RAG), multi-tool agents, or knowledge assistants. Moreover, manual evaluation of every step is infeasible at scale, while fully automated LLM-as-a-judge approaches lack critical domain insights.\r\n\r\nIn this talk, we will present a practical evaluation approach to enable continuous improvement of LLM-powered systems. It incorporates the following stages: \r\n- Automatic tracing: capturing input/output pairs across the pipeline to build an evaluation dataset.\r\n- Expert feedback collection: working with subject matter experts and user interactions to assess correctness and identify failure points.\r\n- Iterative improvement cycle: tuning the components and/or optimizing prompts.\r\n- Degradation tests: turning feedback into automated evaluation tests - ranging from exact match checks to LLM-as-a-judge assertions - to guard against regressions.\r\n- Continuous monitoring: using the growing evaluation dataset to validate the system as models, tools, or data sources evolve.\r\n\r\nThis framework ensures that LLM applications remain reliable and aligned with specific business needs over time. 
\r\n\r\nTarget audience: AI practitioners developing and maintaining LLM-based applications.\r\n\r\nAttendees will learn strategies to:\r\n- Build a human-in-the-loop evaluation process combining expert feedback and automated methods.\r\n- Turn expert knowledge into automatic tests to guard against regressions.\r\n- Use iterative improvement cycles to refine LLM pipelines over time.\r\n\r\nThe attendees are assumed to have familiarity with LLMs and machine learning workflows but do not require deep NLP expertise.", "recording_license": "", "do_not_record": false, "persons": [{"code": "EACXYX", "name": "Iryna Kondrashchenko", "avatar": "https://cfp.pydata.org/media/avatars/EACXYX_merNsK0.webp", "biography": "Iryna is a data scientist and co-founder of DataForce Solutions GmbH, a company specialized in delivering end-to-end data science and AI services. She contributes to several open-source libraries, and strongly believes that open-source products foster a more inclusive tech industry, equipping individuals and organizations with the necessary tools to innovate and compete.", "public_name": "Iryna Kondrashchenko", "guid": "70018355-f170-5858-9983-12cb340d312d", "url": "https://cfp.pydata.org/berlin2025/speaker/EACXYX/"}, {"code": "NWAQCX", "name": "Oleh Kostromin", "avatar": "https://cfp.pydata.org/media/avatars/NWAQCX_J60YTrz.webp", "biography": "I am a Data Scientist primarily focused on Deep Learning and MLOps. 
In my spare time I contribute to several open-source Python libraries.", "public_name": "Oleh Kostromin", "guid": "68c7801e-b9b7-5c17-bc44-d5e705e5c269", "url": "https://cfp.pydata.org/berlin2025/speaker/NWAQCX/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/YKFWKQ/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/YKFWKQ/", "attachments": []}, {"guid": "136b901a-53e4-5027-8da1-51120b11a88c", "code": "JEKYLT", "id": 77081, "logo": null, "date": "2025-09-02T15:00:00+02:00", "start": "15:00", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77081-navigating-healthcare-scientific-knowledge-building-ai-agents-for-accurate-biomedical-data-retrieval", "url": "https://cfp.pydata.org/berlin2025/talk/JEKYLT/", "title": "Navigating healthcare scientific knowledge: building AI agents for accurate biomedical data retrieval", "subtitle": "", "track": "Generative AI", "type": "Talk", "language": "en", "abstract": "With a focus on healthcare applications where accuracy is non-negotiable, this talk highlights challenges and delivers practical insights on building AI agents that query complex biological and scientific data to answer sophisticated questions. Drawing from our experience developing Owkin-K Navigator, a free-to-use AI co-pilot for biological research, I'll share hard-won lessons about combining natural language processing with SQL querying and vector database retrieval to navigate large biomedical knowledge sources, addressing challenges of preventing hallucinations and ensuring proper source attribution.\r\nThis session is ideal for data scientists, ML engineers, and anyone interested in applying the Python and LLM ecosystem to the healthcare domain.", "description": "The growth of scientific healthcare literature and publicly available biomedical databases has created many opportunities but also great challenges for researchers. 
While large amounts of biological data are now freely available, finding and connecting relevant information across disparate sources remains time-consuming and complex. LLM-powered tools offer promising solutions to this challenge, but implementing them in healthcare, where accuracy can impact patient outcomes, requires specialised approaches and careful design considerations.\r\nThis talk will share practical lessons and technical strategies to address hallucinations, complex domain-specific terminology, and source citations. \r\n\r\nThe presentation will be structured into three main sections:\r\n\r\n1. The challenge of scientific data retrieval (5 mins)\r\n    1. Overview of the current landscape of biological databases and scientific literature\r\n    2. Common challenges researchers face when searching for information across multiple sources\r\n    3. Specificities of the healthcare domain where accuracy is critical\r\n\r\n2. Technical architecture for LLM-powered scientific search (15 mins)\r\n    1. Reliable approaches to querying structured databases using natural language\r\n    2. Vector database implementation for semantic search across scientific literature\r\n    3. Strategies to ensure retrieved information is properly attributed to sources\r\n    4. Real-world performance considerations: balancing accuracy, latency, and cost\r\n\r\n3. Lessons learned and future directions (5 mins)\r\n    1. Performance metrics and user feedback from academic researchers\r\n    2. Challenges and limitations of current approaches\r\n    3. 
Future directions for AI-assisted scientific discovery\r\n\r\nThroughout the talk, I'll provide concrete examples of how these technologies can be applied to real research questions, in a production environment, demonstrating the practical value of AI agents in accelerating scientific discovery.\r\n\r\nIntended audience: This talk is designed for data scientists, ML/software engineers, bioinformaticians, and researchers interested in leveraging AI for scientific data retrieval and analysis. \r\nWhile examples will focus on biological data, the principles and techniques discussed are applicable across scientific domains. Basic familiarity with Python and AI concepts will be helpful but is not required.", "recording_license": "", "do_not_record": false, "persons": [{"code": "E79SEQ", "name": "Laura Dumont", "avatar": "https://cfp.pydata.org/media/avatars/E79SEQ_zaUlxxH.webp", "biography": "I have worked in the healthcare industry for more than 10 years, currently in a senior machine learning role at Owkin. 
Committed to open source and open science principles, I aspire to leverage Python and data science for social good, focusing on health, inclusion, and projects that make a meaningful difference in people's lives.", "public_name": "Laura Dumont", "guid": "e7ef5e01-20c2-576a-91d3-fe433ea84196", "url": "https://cfp.pydata.org/berlin2025/speaker/E79SEQ/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/JEKYLT/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/JEKYLT/", "attachments": []}, {"guid": "6ec0344e-c2fb-5672-8f89-f803dabbf89e", "code": "3LDDAB", "id": 77627, "logo": null, "date": "2025-09-02T15:50:00+02:00", "start": "15:50", "duration": "00:45", "room": "B05-B06", "slug": "berlin2025-77627-from-manual-to-llms-scaling-product-categorization", "url": "https://cfp.pydata.org/berlin2025/talk/3LDDAB/", "title": "From Manual to LLMs: Scaling Product Categorization", "subtitle": "", "track": "Generative AI", "type": "Talk (long)", "language": "en", "abstract": "How do you use LLMs to categorize hundreds of thousands of products into 1,000 categories at scale? Learn about our journey from manual/rule-based methods, via fine-tuned semantic models, to a robust multi-step process that uses embeddings and LLMs via the OpenAI APIs. This talk offers data scientists and AI practitioners learnings and best practices for putting such a complex LLM-based system into production. This includes prompt development, balancing cost vs. accuracy via model selection, testing multi-case vs. single-case prompts, and saving costs by using the OpenAI Batch API and a smart early-stopping approach. We also describe our automation and monitoring in a PySpark environment.", "description": "**Target Audience:** Data scientists, AI/ML engineers, and practitioners interested in applying large language models (LLMs) / generative AI to solve real-world classification problems at scale. 
Attendees should have a foundational understanding of machine learning concepts and familiarity with the Python data science stack. Exposure to vector embeddings or LLM APIs is helpful but not mandatory.\r\n\r\n**Takeaway:** Attendees will gain practical insights and learn best practices for building, debugging, scaling, and productionizing a complex, multi-step generative AI system for large-scale product categorization. They will understand the evolution from traditional methods to LLMs, learn specific techniques for prompt engineering, batch processing, cost optimization with models like OpenAI's, and see the tangible business impact of such a system.\r\n\r\n**Detailed Outline:**\r\n\r\nThis talk chronicles our journey tackling a challenging problem: accurately categorizing hundreds of thousands of diverse products into a fine-grained taxonomy of over 1,000 categories. We'll share our evolution from initial manual and rule-based systems to a sophisticated, production-ready Generative AI pipeline.\r\n\r\n\r\n\r\n* **Part 1: The Challenge & Initial Approaches (10 minutes)**\r\n    * Introduction to the business need for accurate product categorization at scale.\r\n    * Overview of the limitations encountered with traditional methods:\r\n        * Manual Curation: Slow, expensive, inconsistent, and impossible to scale.\r\n        * Rule-Based Systems: Brittle, hard to maintain, and unable to handle nuances or new product types.\r\n        * Fine-tuned Semantic Models: An improvement, but struggled with zero-shot generalization to new categories and required significant labeled data and retraining.\r\n* **Part 2: Entering the GenAI Era - Iterations & Lessons Learned (10 minutes)**\r\n    * Our initial exploration using LLMs for categorization, what worked, and what failed.\r\n    * **Developing the Prompt:** We'll dive deep into the iterative process of prompt engineering for this complex multi-label, hierarchical classification task. 
We'll show examples of early prompts, their failure modes (e.g., inconsistent output format, hallucinated categories, difficulty handling multiple classification signals), and the refinements that led to more reliable results. We will discuss techniques for achieving structured output (e.g., JSON) from the LLM.\r\n    * **Early Scaling Issues:** Discussing the pitfalls of naive API usage, latency problems, and prohibitive costs when dealing with large volumes.\r\n* **Part 3: Building a Robust, Scalable GenAI Pipeline (10 minutes)**\r\n    * **The Hybrid Approach:** Detailing our successful multi-step architecture that combines the strengths of semantic embeddings for efficient candidate retrieval/filtering and LLMs (specifically leveraging OpenAI models) for nuanced final categorization.\r\n    * **Productionization Strategies:**\r\n        * *Batching:* Implementing efficient batch processing using asynchronous requests and the OpenAI Batch API to drastically reduce latency and cost.\r\n        * *Cost vs. 
Accuracy:* Strategies for selecting the right model based on complexity and cost constraints.\r\n        * *Semantic Similarity & Early Stopping:* Using vector similarity to intelligently prune the search space, avoiding the need to evaluate every product against all 1,000+ categories with the LLM, thus significantly optimizing cost and throughput.\r\n        * *Automation & Monitoring*: How we process updates of categories and products automatically in PySpark and monitor that the live system works as expected.\r\n* **Part 4: Measuring Impact & Looking Ahead (10 minutes)**\r\n    * Presenting the results: Showcasing the significant improvements in categorization accuracy and coverage compared to previous methods. Illustrative examples of challenging products correctly categorized by the GenAI system.\r\n    * Discussing the tangible business value derived, as measured in A-B tests.\r\n    * Briefly touching upon ongoing work and future directions.\r\n\r\nThis presentation will focus on the practical application and engineering challenges, sharing reusable techniques and hard-won lessons applicable to anyone looking to leverage the power of generative AI for large-scale, real-world problems. We aim to provide a transparent account of not just the successes, but also the crucial learnings from failures encountered along the way.", "recording_license": "", "do_not_record": false, "persons": [{"code": "SXXSJ7", "name": "Giampaolo Casolla", "avatar": "https://cfp.pydata.org/media/avatars/SXXSJ7_xrisFlC.webp", "biography": "Giampaolo Casolla is a Senior Data Scientist at GetYourGuide, leveraging advanced machine learning and Generative AI to solve complex travel industry challenges. With expertise spanning areas like Safety, Risk, and Security, and strong skills in stats, Python, R, and cloud tech, he brings a diverse background to the role. 
Prior to GetYourGuide, Giampaolo developed award-winning ML solutions at Amazon and has a background in research with publications and conference presentations. At GetYourGuide, he's focused on integrating LLMs and GenAI into data products to drive innovation in travel technology.", "public_name": "Giampaolo Casolla", "guid": "508d0482-7983-54bb-b13f-dcb3196ef399", "url": "https://cfp.pydata.org/berlin2025/speaker/SXXSJ7/"}, {"code": "YFJSC9", "name": "Ansgar Gr\u00fcne", "avatar": "https://cfp.pydata.org/media/avatars/YFJSC9_28XOgfM.webp", "biography": "Ansgar Gr\u00fcne is a Senior Data Scientist at GetYourGuide in Berlin. His work focuses on ML/AI approaches to improve users' search and discovery experience on the platform. He holds a Ph.D. in Theoretical Computer Science and has 10 years of experience as a Data Scientist in the travel industry following several years as a software engineer.", "public_name": "Ansgar Gr\u00fcne", "guid": "9dd0fe48-b3ec-527b-8e05-3a74c21da85f", "url": "https://cfp.pydata.org/berlin2025/speaker/YFJSC9/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/3LDDAB/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/3LDDAB/", "attachments": []}]}}, {"index": 3, "date": "2025-09-03", "day_start": "2025-09-03T04:00:00+02:00", "day_end": "2025-09-04T03:59:00+02:00", "rooms": {"Kuppelsaal": [{"guid": "d6b8b89e-a27d-50cf-ba76-821549fbe7e8", "code": "C3MGDN", "id": 77260, "logo": null, "date": "2025-09-03T09:10:00+02:00", "start": "09:10", "duration": "01:00", "room": "Kuppelsaal", "slug": "berlin2025-77260-maintainers-of-the-future-code-culture-and-everything-after", "url": "https://cfp.pydata.org/berlin2025/talk/C3MGDN/", "title": "Maintainers of the Future: Code, Culture, and Everything After", "subtitle": "", "track": "Education, Career & Life", "type": "Keynote", "language": "en", "abstract": "How we sustain what we build \u2014 and why the future of tech depends on care, not only code.\r\n\r\nThe last 
five years have reshaped tech \u2014 through a pandemic, economic uncertainty, shifting politics, and the rapid rise of AI. While these changes have opened new opportunities, they\u2019ve also exposed the limits \u2014 and harms \u2014 of a \u201cmove fast and break things\u201d mindset.", "description": "This talk invites the audience into a collective reflection on the state of tech today \u2014 and a reimagining of the futures we want to build. We\u2019ll explore how small, mission-driven teams can use AI and automation to scale impact while centering their values, and why the work of maintenance \u2014 often invisible and undervalued \u2014 is foundational to responsible innovation.\r\n\r\nDrawing from my experience as a software engineer at a mission-driven company, and as an open-source community leader, I\u2019ll unpack the challenges of long-term technical work: invisible labor, ethical drift, burnout and the quiet leadership of those who stay. In a world obsessed with velocity and dominance, this is a talk about resilience \u2014 and why the future belongs to those willing to maintain it as a radical act of shaping what comes next.", "recording_license": "", "do_not_record": false, "persons": [{"code": "ZHB8A3", "name": "Jessica Greene", "avatar": "https://cfp.pydata.org/media/avatars/ZHB8A3_ZPg6olD.webp", "biography": "Jessica Greene is a self/community-taught developer who came to tech by way of the film industry and specialty coffee roasting. She is now a Senior Machine Learning Engineer at Ecosia.org, where she explores how ML and generative AI can support climate action. Passionate about ethical, sustainable, and inclusive technology, Jessica co-leads PyLadies Berlin, serves on the board of the Python Software Verband (PySV), and is part of the Python Software Foundation\u2019s Conduct Working Group. 
In 2024, she was honored with the inaugural Outstanding PyLadies Award and the PSF Community Service Award for her contributions to the Python ecosystem.", "public_name": "Jessica Greene", "guid": "36ff6ca0-14ed-5606-b242-b0027f09a498", "url": "https://cfp.pydata.org/berlin2025/speaker/ZHB8A3/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/C3MGDN/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/C3MGDN/", "attachments": []}, {"guid": "9eab448a-afae-539d-9a74-69e54cf24d19", "code": "M3RVNA", "id": 80989, "logo": null, "date": "2025-09-03T15:10:00+02:00", "start": "15:10", "duration": "00:15", "room": "Kuppelsaal", "slug": "berlin2025-80989-closing-session", "url": "https://cfp.pydata.org/berlin2025/talk/M3RVNA/", "title": "Closing Session", "subtitle": "", "track": null, "type": "Plenary Session [Organizers]", "language": "en", "abstract": "Closing Session", "description": "Closing Session", "recording_license": "", "do_not_record": false, "persons": [], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/M3RVNA/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/M3RVNA/", "attachments": []}], "B09": [{"guid": "0aad57bb-beab-5b01-a837-c5b6b8170de7", "code": "GZUXGZ", "id": 77951, "logo": null, "date": "2025-09-03T10:40:00+02:00", "start": "10:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-77951-building-an-ai-agent-for-natural-language-to-sql-query-execution-on-live-databases", "url": "https://cfp.pydata.org/berlin2025/talk/GZUXGZ/", "title": "Building an AI Agent for Natural Language to SQL Query Execution on Live Databases", "subtitle": "", "track": "Generative AI", "type": "Tutorial", "language": "en", "abstract": "This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, validates and executes them on live databases, and returns accurate responses. 
Participants will build a system that intelligently routes between a specialized SQL agent and a ReAct chat agent, implementing RAG for query similarity matching, comprehensive safety validation, and human-in-the-loop confirmation. By the end of this session, attendees will have created a powerful and extensible system they can adapt to their own data sources.", "description": "### Overview\r\n\r\nNatural\u2011language interfaces unlock database insights for non\u2011technical users. This tutorial provides a practical implementation for building these systems reliably and effectively. \r\n\r\nParticipants will build an AI agent system that can:\r\n\r\n1. Route intelligently between SQL generation and ReAct chat agent workflows\r\n2. Ingest and understand database schemas with domain knowledge\r\n3. Retrieve relevant context and similar query examples using RAG with vector similarity\r\n4. Generate accurate SQL with validation and safety guardrails\r\n5. Execute queries safely with human-in-the-loop approval\r\n6. Present results in an understandable format\r\n7. Track costs and monitor performance using LangSmith\r\n8. Manage session-based memory and conversation context\r\n\r\nWe'll use the Kaggle dataset \"[Brazilian E-Commerce dataset by Olist](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)\" as our working example, demonstrating how to handle multiple tables across two schemas with complex relationships. This dataset will be hosted on an EC2 AWS instance for live interaction during the tutorial.\r\n\r\nThis tutorial addresses real-world database complexity with production-grade considerations. Participants will start from a repository with backbone code and implement the key components during the session. 
By the end, attendees will have a working system they can adapt to their own datasets.\r\n\r\n### Tools and Frameworks\r\nThis tutorial will leverage modern tools and frameworks for efficient development:\r\n\r\n**AI and Agent Frameworks:**\r\n- LangChain for agent components and LLM interactions\r\n- LangGraph for agent orchestration and workflow management\r\n- LangSmith for comprehensive cost tracking and monitoring\r\n- OpenAI models with examples of alternatives\r\n\r\n**Database and Vector Store:**\r\n- SQLAlchemy for database interactions and schema retrieval\r\n- PostgreSQL as the database engine for the live dataset\r\n- PGVector for similarity-based query retrieval\r\n\r\n**Development:**\r\n- YAML for configuration management\r\n- `pyproject.toml` for standardized project configuration\r\n- UV for reliable package management and Ruff for code formatting/linting", "recording_license": "", "do_not_record": false, "persons": [{"code": "UFPWF3", "name": "Cain\u00e3 Max Couto da Silva", "avatar": "https://cfp.pydata.org/media/avatars/UFPWF3_rZpSwsE.webp", "biography": "I\u2019m a data scientist and AI engineer with 10+ years of experience across academic research and industry, building GenAI and machine learning solutions for research labs, startups, and Fortune 500 companies. 
I\u2019m also a passionate educator, contributing to data training programs as a professor and consultant, and an active open-source contributor and speaker at conferences like SciPy and PyData.", "public_name": "Cain\u00e3 Max Couto da Silva", "guid": "bc2083a3-b173-5b81-963b-c9f43b337cb1", "url": "https://cfp.pydata.org/berlin2025/speaker/UFPWF3/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GZUXGZ/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GZUXGZ/", "attachments": []}, {"guid": "30899fbe-3680-5b55-9326-4455eb71b620", "code": "B3STGX", "id": 77222, "logo": null, "date": "2025-09-03T13:40:00+02:00", "start": "13:40", "duration": "01:30", "room": "B09", "slug": "berlin2025-77222-see-only-what-you-are-allowed-to-see-fine-grained-authorization", "url": "https://cfp.pydata.org/berlin2025/talk/B3STGX/", "title": "See only what you are allowed to see: Fine-Grained Authorization", "subtitle": "", "track": "Data Handling & Engineering", "type": "Tutorial", "language": "en", "abstract": "Managing who can see or do what with your data is a fundamental challenge, especially as applications and data grow in complexity. Traditional role-based systems often lack the granularity needed for modern data platforms. \r\nFine-Grained Authorization (FGA) addresses this by controlling access at the individual resource level. In this 90-minute hands-on tutorial, we will explore implementing FGA using OpenFGA, an open-source authorization engine inspired by Google's Zanzibar. Attendees will learn the core concepts of Relationship-Based Access Control (ReBAC) and get practical experience defining authorization models, writing relationship tuples, and performing authorization checks using the OpenFGA Python SDK. 
Bring your laptop ready to code to learn how to build secure and flexible permission systems for your data applications.", "description": "This tutorial provides a practical, hands-on introduction to implementing Fine-Grained Authorization (FGA) for data-intensive applications using the open-source tool OpenFGA. As data platforms evolve and regulatory requirements become stricter, controlling access at a granular level \u2013 perhaps even row-level in a database context \u2013 becomes essential. Role-Based Access Control (RBAC), while common, often struggles to meet these complex needs, leading to insufficient flexibility or administrative overhead.\r\nWe will introduce the concept of Relationship-Based Access Control (ReBAC), the authorization paradigm powering systems like Google's Zanzibar and OpenFGA. You'll learn how ReBAC defines permissions based on the relationships between users and objects (e.g., \"Alice is a viewer of Document 'report_Q3'\"), enabling highly flexible and scalable access control logic.\r\nThe core of the tutorial will be dedicated to practical implementation. We will guide attendees through:\r\n1. Setting up a local OpenFGA instance (e.g., using Docker).\r\n2. Defining an authorization model using OpenFGA's Domain Specific Language (DSL) to represent resources, users, and the relationships between them. We will use a simplified data access scenario as our example, potentially inspired by challenges faced in research or data collaboration platforms.\r\n3. Writing and managing relationship tuples in OpenFGA.\r\n4. Using the OpenFGA Python SDK to connect your application logic to the authorization engine.\r\n5. Exploring strategies for integrating this with application backend code and potentially addressing concepts like enforcing row-level permissions.\r\n\r\nAttendees will follow along with live coding examples and complete exercises designed to solidify their understanding and build confidence in applying FGA principles with OpenFGA. 
By the end of the 90 minutes, you will have a foundational understanding of FGA/ReBAC and the practical skills to start integrating OpenFGA into your own projects. The tutorial materials, including code examples and setup instructions, will be provided via a GitHub repository.", "recording_license": "", "do_not_record": false, "persons": [{"code": "QTCFLN", "name": "Maria Knorps", "avatar": "https://cfp.pydata.org/media/avatars/QTCFLN_AlMADsa.webp", "biography": "Maria is a Principal Consultant (Data Engineer) at Modus Create, specializing in data science, software development, and emerging GenAI initiatives. With a PhD in mathematical modeling, she applies a rigorous, methodical approach to developing and maintaining high-quality, data-driven solutions. \r\n \r\nOutside of her technical pursuits, Maria is passionate about promoting diversity in the IT industry and inspiring girls and women to engage in programming. Balancing her career with motherhood of three, she finds limited but cherished time for personal hobbies such as riding motorcycles and knitting.", "public_name": "Maria Knorps", "guid": "1b28146f-196d-5854-add3-3841c6873e41", "url": "https://cfp.pydata.org/berlin2025/speaker/QTCFLN/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/B3STGX/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/B3STGX/", "attachments": []}], "B07-B08": [{"guid": "0761e5f1-8282-596e-a772-146b96d85e9d", "code": "HKMYHY", "id": 77778, "logo": null, "date": "2025-09-03T10:40:00+02:00", "start": "10:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77778-bye-bye-query-spaghetti-write-queries-you-ll-actually-understand-using-pipelined-sql-syntax", "url": "https://cfp.pydata.org/berlin2025/talk/HKMYHY/", "title": "Bye-Bye Query Spaghetti: Write Queries You'll Actually Understand Using Pipelined SQL Syntax", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "Are your SQL queries 
becoming tangled webs that are difficult to decipher, debug, and maintain? This talk explores how to write shorter, more debuggable, and extensible SQL code using **Pipelined SQL**, an alternative syntax where queries are written as **a series of orthogonal, understandable steps**. We'll survey which databases and query engines currently support pipelined SQL natively or through extensions, and how it can be used on any platform by compiling pipelined SQL to any SQL dialect using open-source tools. A series of real-world examples, comparing traditional and pipelined SQL syntax side by side for a variety of use cases, will show you how to simplify existing code and make complex data transformations intuitive and manageable.", "description": "This session introduces Pipelined SQL, an alternative syntax for writing complex data queries as a clear, sequential flow of manageable transformations within a single query.\r\n\r\nTraditional SQL combines filtering (WHERE), aggregation (GROUP BY), and projection (SELECT expressions) within a single, monolithic block. This can make it challenging to discern individual data transformations or modify one aspect without impacting others. Pipelined SQL, in contrast, encourages building queries like an assembly line. You'll learn to structure your query logic so that each step performs a specific transformation and cleanly passes its result to the next. This pipelined approach, moving away from deeply nested subqueries or sprawling Common Table Expressions (CTEs), leads to queries that are more readable because the logic can easily be followed from start to finish. As an added benefit, the resulting code is simpler to debug and also easily extendable by additional transformation steps.\r\n\r\n\r\nThe talk will explain the core concepts of Pipelined SQL, how it differs from traditional SQL, and what its main advantages are. 
Native support for pipelined syntax is steadily growing across many modern databases, query engines and cloud data warehouses. We will explore the landscape of emerging dialects and identify which platforms currently offer native support or extensions for this powerful syntax. The session also covers a range of open-source tools that can compile such pipelined query code into any traditional SQL dialect, making this approach suitable for almost any platform.\r\n\r\nThrough practical, real-world examples using BigQuery's pipe syntax, you'll see side-by-side comparisons demonstrating how Pipelined SQL can drastically reduce complexity and improve clarity for common data manipulation tasks. Prepare for genuine 'a-ha!' moments as you discover how Pipelined SQL offers refreshingly simple approaches to tasks that usually require convoluted traditional SQL.\r\n\r\nThis session is ideal for data analysts, scientists, engineers, and anyone with basic SQL knowledge who wants to write cleaner, more robust, and more maintainable queries. You'll leave with a solid understanding of Pipelined SQL's benefits and practical knowledge to start simplifying your own SQL workflows.", "recording_license": "", "do_not_record": false, "persons": [{"code": "PMZLKB", "name": "Tobias Lampert", "avatar": "https://cfp.pydata.org/media/avatars/PMZLKB_3A8DVom.webp", "biography": "An accomplished technical leader, Tobias brings over two decades of experience in software development, complemented by profound expertise in Data Science and Data Engineering. His career has focused on the end-to-end design and implementation of complex data-intensive applications, spanning the full lifecycle from data ingestion to deployment. 
In his current role at Lotum he is tackling a data volume of several hundred million events from mobile games per day.", "public_name": "Tobias Lampert", "guid": "988edae9-b784-556d-828a-35063416916c", "url": "https://cfp.pydata.org/berlin2025/speaker/PMZLKB/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/HKMYHY/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/HKMYHY/", "attachments": []}, {"guid": "bada3545-34f0-506e-9caf-460a0d882e30", "code": "GQBX3J", "id": 77931, "logo": null, "date": "2025-09-03T11:20:00+02:00", "start": "11:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77931-docling-get-your-documents-ready-for-gen-ai", "url": "https://cfp.pydata.org/berlin2025/talk/GQBX3J/", "title": "Docling: Get your documents ready for gen AI", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "Docling, an open source package, is rapidly becoming the de facto standard for document parsing and export in the Python community. Having earned close to 30,000 GitHub stars in less than one year, it is now part of the Linux AI & Data Foundation. Docling is redefining document AI with its ease and speed of use. In this session, we\u2019ll introduce Docling and its features, including usage with various generative AI frameworks and protocols (e.g. MCP).", "description": "Docling, an open source package, is rapidly becoming the de facto standard for document parsing and export in the Python community. Having earned close to 30,000 GitHub stars in less than one year, it is now part of the Linux AI & Data Foundation. Docling is redefining document AI with its ease and speed of use. In this session, we\u2019ll introduce Docling and its features, including: \r\n\r\n- Support for a wide array of formats\u2014such as PDFs, DOCX, PPTX, HTML, images, and Markdown\u2014and easy conversion to structured Markdown or JSON. 
\r\n- Advanced document understanding through capture of intricate page layouts, reading order, and table structures\u2014ideal for complex analysis.\r\n- Integration of the DoclingDocument format with popular AI frameworks\u2014such as LlamaIndex, LangChain, and LlamaStack\u2014for retrieval-augmented generation (RAG) and QA applications.\r\n- Optical character recognition (OCR) support for scanned documents.\r\n- Support for Visual Language Models like SmolDocling, created in collaboration with Hugging Face.\r\n- A user-friendly command line interface (CLI) and MCP connectors for developers.\r\n- How to use it as a service and at scale by deploying your own docling-serve.", "recording_license": "", "do_not_record": false, "persons": [{"code": "YRFJ3P", "name": "Michele Dolfi", "avatar": null, "biography": "Dr. Michele Dolfi is a technical lead in the AI for Knowledge group at IBM Research, focusing on knowledge engineering and understanding. Michele is one of the researchers who created the Deep Search platform and the Docling open source project. His expertise spans from artificial intelligence to high performance computing and quantum systems.", "public_name": "Michele Dolfi", "guid": "3911a7d0-9ce5-59f4-a9d2-5eed2100eb06", "url": "https://cfp.pydata.org/berlin2025/speaker/YRFJ3P/"}, {"code": "QHWNDQ", "name": "Christoph Auer", "avatar": null, "biography": "Dr. Christoph Auer is a technical lead in the AI for Knowledge group at IBM Research, focusing on automated knowledge extraction and dataset modeling. 
His ongoing dedication to document understanding systems has been instrumental in the development of key innovations that power Docling today.", "public_name": "Christoph Auer", "guid": "f67035be-d833-59cc-8d12-0b7e28e57683", "url": "https://cfp.pydata.org/berlin2025/speaker/QHWNDQ/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GQBX3J/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GQBX3J/", "attachments": []}, {"guid": "66d0c2ed-b3f9-5e51-bef5-cb50af71e62f", "code": "PPAYDV", "id": 77779, "logo": null, "date": "2025-09-03T12:00:00+02:00", "start": "12:00", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77779-better-docs-happier-users-what-we-learned-applying-diataxis-to-holoviz-libraries", "url": "https://cfp.pydata.org/berlin2025/talk/PPAYDV/", "title": "Better docs, happier users: What we learned applying Diataxis to HoloViz libraries", "subtitle": "", "track": "Community & Diversity", "type": "Talk", "language": "en", "abstract": "Clear documentation is crucial for the success of open-source libraries, but it\u2019s often hard to get right. In this talk, I\u2019ll share our experience applying the Diataxis documentation framework to improve two HoloViz ecosystem libraries, hvPlot and Panel. Attendees will come away with practical insights on applying Diataxis and strengthening documentation for their own projects.", "description": "Good documentation turns users into contributors \u2014 but achieving it requires more than good intentions. This talk shares the journey of applying the Diataxis framework to improve two open-source Python libraries from the HoloViz ecosystem: Panel and hvPlot. 
We\u2019ll start with a short introduction to Diataxis (its four documentation types: tutorials, how-to guides, explanations, and references), then briefly present the libraries we worked on and their documentation challenges.\r\n\r\nThe heart of the talk focuses on practical lessons learned: how we mapped existing content into the Diataxis structure, handled content gaps and redundancies, engaged with the user community, and evolved our approach over time. We\u2019ll also discuss what we would do differently if we started again.\r\n\r\nThe goal is to give attendees a realistic, hands-on perspective on adopting Diataxis \u2014 including both its benefits and its challenges.", "recording_license": "", "do_not_record": false, "persons": [{"code": "DQFXPP", "name": "Maxime Liquet", "avatar": "https://cfp.pydata.org/media/avatars/DQFXPP_35hUjUH.webp", "biography": "Software Engineer at Anaconda, maintaining and improving the open-source data viz libraries of the HoloViz ecosystem. Previously a civil engineer specialized in flood risk assessment.", "public_name": "Maxime Liquet", "guid": "adaf972a-a409-5695-8764-181a31f9f525", "url": "https://cfp.pydata.org/berlin2025/speaker/DQFXPP/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/PPAYDV/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/PPAYDV/", "attachments": [{"title": "Presentation", "url": "/media/berlin2025/submissions/PPAYDV/resources/Diataxis_PyData_ELh5Ud5.pdf", "type": "related"}]}, {"guid": "4d8330be-5bfb-5f33-9e0f-b07ebef991f6", "code": "SCQE8H", "id": 77950, "logo": null, "date": "2025-09-03T13:40:00+02:00", "start": "13:40", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-77950-spot-the-difference-using-foundation-models-to-monitor-for-change-with-satellite-imagery", "url": "https://cfp.pydata.org/berlin2025/talk/SCQE8H/", "title": "Spot the difference: \ud83d\udd75\ufe0f using foundation models to monitor for change with satellite imagery 
\ud83d\udef0\ufe0f", "subtitle": "", "track": "Computer Vision (incl. Generative AI CV)", "type": "Talk", "language": "en", "abstract": "Energy infrastructure is vulnerable to damage by erosion or third party interference, which often takes the form of unsanctioned construction. In this talk we discuss our experiences using deep learning algorithms powered by large foundation models to monitor for changes in bi-temporal very-high resolution satellite imagery.", "description": "Oil and gas pipelines are usually buried around 1.5 meters underground, making them vulnerable to human activity or natural processes like erosion. Pipeline operators need to perform regular checks to ensure the integrity of their infrastructure. Very High Resolution (VHR) satellite images, with ground sampling distances of less than 1 meter,  provide an interesting solution to this problem allowing for large scale monitoring and regular revisit rates. \r\n\r\nSpotting changes is far from simple, as one needs to distinguish between relevant changes (such as construction activity), and irrelevant changes, such as shadows, seasonal changes or changes due to viewing angles. \r\n\r\nGeospatial foundation models, trained on vast collections of satellite imagery from across the globe, offer enhanced generalisation capabilities while requiring relatively few labels to achieve powerful performance. 
This global-scale pretraining enables these models to develop robust feature representations that transfer effectively to new geographic regions and tasks.", "recording_license": "", "do_not_record": false, "persons": [{"code": "J9ZPD3", "name": "Ferdinand Schenck", "avatar": "https://cfp.pydata.org/media/avatars/J9ZPD3_jVGqeu9.webp", "biography": "Machine Learning Engineer at LiveEO | Recovering Physicist | Spends most of his time making machines understand the world from space.", "public_name": "Ferdinand Schenck", "guid": "98b03cf1-6f41-5387-964b-07a76d4bb9cc", "url": "https://cfp.pydata.org/berlin2025/speaker/J9ZPD3/"}], "links": [{"title": "Slides Online (google slides)", "url": "https://docs.google.com/presentation/d/e/2PACX-1vQcW-gZfbu4ORWSkvHRPrtEFgU-Cc2-7XWrkaP5Q3LKNlp3UXM4q0sSUp7gWy8mh7Ny6wjE0gPuFtLB/pub", "type": "related"}], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/SCQE8H/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/SCQE8H/", "attachments": [{"title": "Slides PDF", "url": "/media/berlin2025/submissions/SCQE8H/resources/PyData_2025_Usi_fo1HCYb.pdf", "type": "related"}]}, {"guid": "6b188db3-0981-5284-a49e-1dad57b93ebd", "code": "RM8CNV", "id": 81305, "logo": null, "date": "2025-09-03T14:20:00+02:00", "start": "14:20", "duration": "00:30", "room": "B07-B08", "slug": "berlin2025-81305-kubeflow-pipelines-meet-uv", "url": "https://cfp.pydata.org/berlin2025/talk/RM8CNV/", "title": "Kubeflow pipelines meet uv", "subtitle": "", "track": "PyData & Scientific Libraries Stack", "type": "Talk", "language": "en", "abstract": "Kubeflow is a platform for building and deploying portable and scalable machine learning (ML) workflows using containers on Kubernetes-based systems.\r\n\r\nWe will code together a simple Kubeflow pipeline, show how to test it locally. 
As a bonus, we will explore one solution to avoid **dependency hell** using the modern dependency management tool **uv**.", "description": "In this demo, you will learn how to set up and locally run a Kubeflow pipeline that:\r\n\r\n- adheres to the standard pyproject.toml format\r\n- keeps a consistent Python version and dependencies across components\r\n- manages dependencies of all components at once, including a lockfile\r\n\r\nWe will discuss how and why this enhanced setup can improve pipeline and dependency maintainability for systems running in production, while still taking advantage of the Kubeflow API's flexibility and features.", "recording_license": "", "do_not_record": false, "persons": [{"code": "HXRXM8", "name": "Fabrizio Damicelli", "avatar": "https://cfp.pydata.org/media/avatars/HXRXM8_pjxLIin.webp", "biography": "I\u2019m Fabrizio \ud83e\uddc9, PhD in Computational Neuroscience \ud83c\udf93, now a Data Scientist \ud83d\udcbb in Hamburg (Germany). I work on fraud detection \ud83d\udd75\ud83c\udffd using neural networks and large datasets at one of the top e-commerce platforms in Europe.\r\n\r\nAs an open source advocate, I have contributed to a few projects and created a couple of Python packages that I maintain \ud83e\udd13. 
Beyond that, I am generally interested in topics around \ud83d\udc0d Python, \ud83e\udd16 Machine Learning, \ud83d\udcca Data Science, Computational modeling, \ud83d\ude80 Scientific computing and the application of data-driven solutions to make people\u2019s lives better.", "public_name": "Fabrizio Damicelli", "guid": "c09ec6f6-8118-5e89-8c16-ff54430abbc8", "url": "https://cfp.pydata.org/berlin2025/speaker/HXRXM8/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/RM8CNV/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/RM8CNV/", "attachments": []}], "B05-B06": [{"guid": "c28b2d44-4a29-5392-a292-227fbe73e0e2", "code": "KKWBKK", "id": 77891, "logo": null, "date": "2025-09-03T10:40:00+02:00", "start": "10:40", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77891-edge-of-intelligence-the-state-of-ai-in-browsers", "url": "https://cfp.pydata.org/berlin2025/talk/KKWBKK/", "title": "Edge of Intelligence: The State of AI in Browsers", "subtitle": "", "track": "Infrastructure - Hardware & Cloud", "type": "Talk", "language": "en", "abstract": "API calls suck! Okay, not all of them. But building your AI features reliant on third-party APIs can bring a lot of trouble. In this talk you'll learn how to use web technologies to become more independent.", "description": "The current AI hype is being run on API calls, GPU clusters, and costly infrastructure. But what if we could break free from these constraints and run our models directly in the consumer's browser?\r\n\r\nImagine a world where AI development is more reliable, cheaper, and more secure. In this talk, we'll explore the current state of WebAI, including the latest developments, challenges, and opportunities. We'll dive into the libraries, tools, and technologies that make it possible to run AI models in the browser, such as WebAssembly, WebGPU, and ONNX. 
We'll discuss how these technologies enable fast and efficient execution of AI models, and how they relate to Python.\r\n\r\nAfter the talk, you'll have a clear understanding of how to bring AI to the browser and unlock new possibilities for your applications. Join us to learn how to harness the power of AI and make it more accessible for everyone.", "recording_license": "", "do_not_record": false, "persons": [{"code": "QFSMUG", "name": "Johannes Kolbe", "avatar": "https://cfp.pydata.org/media/avatars/QFSMUG_UekUIHx.webp", "biography": "Hey, \r\n\r\nI'm Johannes, a Data Scientist who loves to tell educative stories about Machine Learning methods and AI. Preferably I'm doing this in Open Source communities.\r\n\r\nI've been working with Computer Vision for more than 10 years, ranging from designing my own Haar-Cascade face detection, through research on autonomous cars, all the way to helping people configure their photobooks in a smart and easy way.", "public_name": "Johannes Kolbe", "guid": "6bc9aab7-4c1e-57c5-ac96-e67af8999cd1", "url": "https://cfp.pydata.org/berlin2025/speaker/QFSMUG/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/KKWBKK/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/KKWBKK/", "attachments": []}, {"guid": "6ab2bd16-c161-5328-9968-24c2eeb21dcf", "code": "GKFB3J", "id": 80770, "logo": null, "date": "2025-09-03T11:20:00+02:00", "start": "11:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-80770-how-digital-david-wins-against-data-goliaths", "url": "https://cfp.pydata.org/berlin2025/talk/GKFB3J/", "title": "How Digital David Wins Against Data Goliaths", "subtitle": "", "track": "Education, Career & Life", "type": "Talk", "language": "en", "abstract": "This talk introduces a new and innovative business model supported by a network of digital activists that form a collective force for protecting humanity, enabling digitally aware users to reclaim control over their data.", "description": "After 
the era of Big Oil, we now live in the age of Big Data. Whoever controls the data controls the world. Large tech companies lure users with free digital services - email, messaging, and social platforms - but the hidden cost is steep: loss of privacy, autonomy, and data sovereignty. While open-source solutions offer alternatives, their adoption often remains limited to IT specialists. In a world where convenience has become a subtle form of control, the pressing question emerges: Why hasn\u2019t a broader movement for digital freedom taken hold, and is there a viable path forward?\r\n\r\nThis talk introduces a new and innovative business model supported by a network of digital activists that form a collective force for protecting humanity, enabling digitally aware users to reclaim control over their data. By combining innovative tools, thoughtful practices, and forward-looking approaches, we\u2019ll show how digital gurus can become \u201cDigital Davids,\u201d standing up to the Data Goliaths and shaping a more sovereign digital future for their communities.", "recording_license": "", "do_not_record": true, "persons": [{"code": "MHD8AQ", "name": "Pawel Herman", "avatar": null, "biography": null, "public_name": "Pawel Herman", "guid": "c9c08368-b7d3-582f-a9a7-5f92aaa31d0a", "url": "https://cfp.pydata.org/berlin2025/speaker/MHD8AQ/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/GKFB3J/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/GKFB3J/", "attachments": []}, {"guid": "cd877ae3-725f-59c3-8f9e-b99b33383f94", "code": "XE9F7X", "id": 77153, "logo": null, "date": "2025-09-03T12:00:00+02:00", "start": "12:00", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77153-flying-beyond-keywords-our-aviation-semantic-search-journey", "url": "https://cfp.pydata.org/berlin2025/talk/XE9F7X/", "title": "Flying Beyond Keywords: Our Aviation Semantic Search Journey", "subtitle": "", "track": "Infrastructure - Hardware & Cloud", 
"type": "Talk", "language": "en", "abstract": "In aviation, search isn\u2019t simple\u2014people use abbreviations, slang, and technical terms that make exact matching tricky. We started with just Postgres, aiming for something that worked. Over time, we upgraded: semantic embeddings, reranking. We tackled filter complexity, slow index builds, and embedding updates and much more. Along the way, we learned a lot about making AI search fast, accurate, and actually usable for our users. It\u2019s been a journey\u2014full of turbulence, but worth the landing.", "description": "In aviation, search is anything but straightforward. Reports are written by humans\u2014pilots, cabin crew, engineers\u2014each using their own mix of abbreviations, technical jargon, and everyday language. Standard keyword search often falls short. You might miss critical safety signals because a pilot wrote \u201cnavigation didn\u2019t work\u201d instead of \u201cgps jamming,\u201d or used a shorthand unknown to engineers on the ground. What we needed was semantic search\u2014something that understands meaning, not just matches strings.\r\n\r\nBut we started simple with a plain Postgres setup. Our goal: build something that works. We began with pgvector and basic sentence embeddings to enable semantic search inside Postgres. It was scrappy, but it gave us just enough traction to prove the value of semantic search in this domain.\r\nThen things took off. As complexity grew, so did the need for better retrieval and smarter ranking. We restructured the system: upgraded to better sentence embeddings, and most importantly, added reranking using cross-encoders. This turned our search results from \u201ckinda relevant\u201d to \u201cspot on.\u201d We moved to OpenVINO to make reranking faster on the CPU, especially important since we deploy on AWS Lambda.\r\n\r\nBut the technical challenges didn\u2019t stop there. 
We experimented with different pgvector index types\u2014IVFFlat vs HNSW\u2014and discovered surprising trade-offs in index build times and performance, especially under constrained RDS instances. Embedding updates became their own problem, so we built a parallel processing system using SQS and a tool we call \u201cCockpit\u201d to manage recomputation.\r\n\r\nOn top of that, search in our world isn't a single step. We layer semantic retrieval with full-text filtering, structured filters (e.g., airport, aircraft type), and real-time inputs. This creates a multi-layered AI search pipeline that needs to feel snappy and reliable to end-users.\r\n\r\nIn this talk, we\u2019ll walk through how we made this work with minimal ML infrastructure, how we evolved from an MVP to a robust system, and what tools made the biggest difference\u2014from tokenization strategies and inference optimizations to batching tricks and search composition patterns. You\u2019ll also hear the gritty details: bottlenecks between tokenization and inference, indexing challenges, and lessons from building this in production for a safety-critical industry.\r\n\r\nThis talk is for folks who want to leverage Postgres for hybrid search as well. It\u2019s for anyone who has ever duct-taped search with SQL and wondered how to take the next step. We\u2019ll keep it real, share what we did, and reflect on what we\u2019d do differently next time.", "recording_license": "", "do_not_record": false, "persons": [{"code": "RK89YW", "name": "Dat Tran", "avatar": "https://cfp.pydata.org/media/avatars/RK89YW_5L6gL3O.webp", "biography": "Dat is a seasoned technology and business leader with deep expertise in AI, machine learning, and digital transformation. As Partner & CTO at DATANOMIQ, he advises companies on AI strategy and implementation. Before that he worked for Axel Springer SE, idealo.de, Pivotal Labs and Accenture. 
His interests are diverse, ranging from traditional machine learning, deep learning, computer vision, and AI in general to large language models. He has a lot of experience, from devising realistic data-driven use cases to actual implementation in a real product, and is more than capable of distinguishing hype, buzzwords and wannabes from substance. He\u2019s actively engaged with the global tech community, sharing insights on AI, tech leadership, and digital transformation with over 76k followers on LinkedIn. As a frequent keynote speaker, he has presented at conferences such as PyData, WeAreDevelopers, and many more, mentoring professionals in machine learning and leadership along the way.", "public_name": "Dat Tran", "guid": "403ee557-5eaa-5a22-9600-2cdf8d54f786", "url": "https://cfp.pydata.org/berlin2025/speaker/RK89YW/"}, {"code": "8DEJ3D", "name": "Dennis Schmidt", "avatar": "https://cfp.pydata.org/media/avatars/8DEJ3D_tr2xyyj.webp", "biography": "Dennis is an engineering leader and product-focused technologist with deep expertise in building modern software systems, mobile platforms, and intuitive, user-centered products. As Staff Engineer at Beams, he helps shape AI-driven solutions for safety and risk management in aviation and beyond. Previously, he held senior engineering and leadership roles at Pivotal Labs and SoundCloud\u2014helping scale teams, launch cross-functional product initiatives, and drive iterative development practices for large clients such as Volkswagen and across sectors like banking, insurance, and multimedia. A passionate builder, Dennis has co-founded multiple startups and continues to run several side projects, among which is an \u201cApp of the Day\u201d winner in the US. 
He works across languages and stacks, focusing on long-term maintainability, thoughtful architecture, and delivering real-world impact.", "public_name": "Dennis Schmidt", "guid": "333765e8-7669-55d3-9f78-4438b4933656", "url": "https://cfp.pydata.org/berlin2025/speaker/8DEJ3D/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/XE9F7X/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/XE9F7X/", "attachments": []}, {"guid": "55eb68f9-517d-5fbb-9579-38abf5af7f94", "code": "FDBZSR", "id": 77524, "logo": null, "date": "2025-09-03T13:40:00+02:00", "start": "13:40", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-77524-when-postgres-is-enough-solving-document-storage-pub-sub-and-distributed-queues-without-more-tools", "url": "https://cfp.pydata.org/berlin2025/talk/FDBZSR/", "title": "When Postgres is enough: solving document storage, pub/sub and distributed queues without more tools", "subtitle": "", "track": "Data Handling & Engineering", "type": "Talk", "language": "en", "abstract": "When a new requirement appears, whether it's document storage, pub/sub messaging, distributed queues, or even full-text search, Postgres can often handle it without introducing more infrastructure.\r\n\r\nThis talk explores how to leverage Postgres' native features like JSONB, LISTEN/NOTIFY, queueing patterns and vector extensions to build robust, scalable systems without increasing infrastructure complexity. 
\r\n\r\nYou'll learn practical patterns that extend Postgres just far enough, keeping systems simpler, more maintainable, and easier to operate, especially in small to medium projects or freelancing setups, where Postgres often already forms a critical part of the stack.\r\n\r\nPostgres might not replace everything forever - but it can often get you much further than you think.", "description": "When building modern systems, it's easy to reach for specialized tools as new requirements pop up: a document store like MongoDB for flexible schemas, Kafka for pub/sub, Redis for distributed queuing, or Weaviate for storing vectors.\r\n\r\nBut what if you could meet many of these needs by simply extending the Postgres database you likely already have?\r\n\r\nIn this talk, we\u2019ll explore how Postgres' powerful native features such as:\r\n- JSONB for document storage\r\n- LISTEN/NOTIFY for pub/sub messaging \r\n- SELECT FOR UPDATE SKIP LOCKED for queueing\r\n- an extension for vectors \r\n\r\ncan be used to solve real-world problems without introducing new infrastructure.\r\n\r\nThroughout the talk, we\u2019ll walk through practical code examples in Python and SQL to show exactly how these patterns can be implemented in real projects.\r\n\r\nThe goal isn\u2019t to suggest that Postgres replaces purpose-built tools like Kafka, Redis, or MongoDB forever. Specialized systems still have their place, especially at larger scales. However, by reusing Postgres intelligently, you can delay these decisions until they are truly necessary, keeping your system simpler, easier to operate, and more maintainable in the meantime.\r\n\r\nEspecially for freelancers, startups, and small teams, reducing system complexity early on means faster iteration, fewer operational headaches, and lower costs. And since Postgres is already present in most modern tech stacks, these capabilities are often just a few SQL queries away.\r\n\r\n## Outline\r\n\r\n1. 
**Introduction**: re-using existing infrastructure instead of introducing new systems to focus on solving problems\r\n1. **Pub/Sub with Postgres**: messaging between services using LISTEN/NOTIFY\r\n1. **Queuing with Postgres**: building distributed queues with SELECT FOR UPDATE SKIP LOCKED\r\n1. **Document Storage with Postgres**: handling flexible schemas and semi-structured data using JSONB\r\n1. **Conclusion**: when re-using Postgres makes sense - and when specialized systems are needed\r\n**Bonus: storing vectors with Postgres for your AI workloads**: adding efficient vector functionality by installing an extension", "recording_license": "", "do_not_record": false, "persons": [{"code": "8NMQMV", "name": "Eugen Geist", "avatar": "https://cfp.pydata.org/media/avatars/8NMQMV_tM1gIOO.webp", "biography": "Seasoned Software & Data Engineering Professional with extensive experience in high-frequency trading systems, data warehousing, and cloud solutions. Expert in optimizing mission-critical systems and implementing engineering best practices. 
Specialized in Python, SQL and cloud technologies.\r\n\r\nCurrently working as a Freelance Developer focusing on software and data engineering.\r\n\r\nSkilled in developing distributed systems, data pipelines, and performance optimization, consistently delivering solutions that maximize business value.", "public_name": "Eugen Geist", "guid": "25cff7d0-c990-5d16-8f9b-b1624332672c", "url": "https://cfp.pydata.org/berlin2025/speaker/8NMQMV/"}], "links": [{"title": "Slides + Examples", "url": "https://github.com/e-geist/when_postgres_is_enough", "type": "related"}], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/FDBZSR/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/FDBZSR/", "attachments": []}, {"guid": "fb048691-4903-52f7-b54e-31f5c3f009e8", "code": "WWSZKY", "id": 80840, "logo": null, "date": "2025-09-03T14:20:00+02:00", "start": "14:20", "duration": "00:30", "room": "B05-B06", "slug": "berlin2025-80840-scraping-urban-mobility-analysis-of-berlin-carsharing", "url": "https://cfp.pydata.org/berlin2025/talk/WWSZKY/", "title": "Scraping urban mobility: analysis of Berlin carsharing", "subtitle": "", "track": null, "type": "Talk", "language": "en", "abstract": "Free-floating carsharing systems struggle to balance vehicle supply and demand, which often results in inefficient fleet distribution and reduced vehicle utilization. This talk explores how data scraping can be used to model vehicle demand and user behavior, enabling targeted incentives to encourage self-balancing vehicle flows.\r\n\r\nUsing information scraped from a major mobility provider over multiple months, the presentation provides spatiotemporal analyses and machine learning results to determine whether it's practically possible to offer low-friction discounts that lead to improved fleet balance.", "description": "You'll see the hidden patterns that carsharing data reveals when contextualized with urban information. 
The outcomes visually demonstrate the impact of area use, traffic and weather conditions across the city through comprehensive data visualizations.\r\nBuilding on existing research, the talk will present opportunities that arise from including user data in the equation and offer starting points for additional predictors. \r\n\r\nThis session is ideal for data scientists interested in urban analytics, transportation modeling, or real-world applications of predictive modeling in mobility systems.", "recording_license": "", "do_not_record": false, "persons": [{"code": "3GBZAT", "name": "Florian K\u00f6nig", "avatar": "https://cfp.pydata.org/media/avatars/3GBZAT_c2leQ3p.webp", "biography": "Florian is a multidisciplinary software engineer with a deep interest in human mobility and infrastructure - both physical and digital. His background in software engineering at CODE Berlin and digital design at IADE Lisbon brings a creative approach to technology and data visualization. Currently, he works as a mobile and backend engineer at TBO Digital.", "public_name": "Florian K\u00f6nig", "guid": "abe546c5-4c18-5a5a-bab2-cda9ae03fc45", "url": "https://cfp.pydata.org/berlin2025/speaker/3GBZAT/"}], "links": [], "feedback_url": "https://cfp.pydata.org/berlin2025/talk/WWSZKY/feedback/", "origin_url": "https://cfp.pydata.org/berlin2025/talk/WWSZKY/", "attachments": []}]}}]}}}