<?xml version='1.0' encoding='utf-8' ?>
<iCalendar xmlns:pentabarf='http://pentabarf.org' xmlns:xCal='urn:ietf:params:xml:ns:xcal'>
    <vcalendar>
        <version>2.0</version>
        <prodid>-//Pentabarf//Schedule//EN</prodid>
        <x-wr-caldesc></x-wr-caldesc>
        <x-wr-calname></x-wr-calname>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>DZTLEW@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-DZTLEW</pentabarf:event-slug>
            <pentabarf:title>Keynote: Building AI-First Organizations</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T091500</dtstart>
            <dtend>20250418T100000</dtend>
            <duration>004500</duration>
            <summary>Keynote: Building AI-First Organizations</summary>
            <description>In the quest to become AI-first, organizations face the imperative of aligning technological innovation with strategic business objectives. This transformation requires AI practitioners to evolve into strategic stewards who not only possess technical expertise but also deeply understand organizational goals and the multifaceted challenges of AI implementation. Key considerations include:

- **Strategic Alignment:** AI initiatives must be closely integrated with the organization&#x27;s overarching goals. This entails identifying areas where AI can drive significant value, such as enhancing operational efficiency, improving customer experiences, or enabling data-driven decision-making. A clear strategic vision ensures that AI projects are purpose-driven and aligned with business priorities. 
- **Data Management:** Treating data as a strategic asset is fundamental. This means going beyond simply establishing robust data governance frameworks that ensure data quality, privacy, and security. Strategic data management practices enable leaders to realize the monetary value of the organization’s data, build reliable AI models, and foster trust among stakeholders.
- **Targeted AI Investment:** Organizations should focus AI development in domains where human capabilities are limited, allowing AI to complement human strengths. Conversely, in areas where humans excel and AI falls short—such as tasks requiring deep creativity, empathy, or complex judgment—investment should prioritize human expertise. This strategic allocation ensures that AI serves as an effective tool without encroaching upon domains where human skills are paramount. 
- **Human-AI Interaction Design:** Insights from research on human-machine interaction are vital for designing AI systems that are intuitive and user-friendly. Emphasizing the human-in-the-loop approach ensures that AI tools augment human capabilities, leading to more effective and ethical AI implementations. 
- **Ethical Considerations:** Addressing ethical challenges such as data privacy, bias, and regulatory compliance is crucial. Implementing AI responsibly involves proactive measures to mitigate risks and uphold ethical standards, thereby maintaining public trust and safeguarding the organization&#x27;s reputation. 
- **Change Management:** Transitioning to an AI-first organization necessitates effective change management strategies. This includes reskilling and upskilling employees, managing cultural shifts, and addressing potential resistance to change. Empowering employees to work alongside AI technologies fosters a culture of innovation and continuous improvement.

This keynote delves into these critical aspects, offering insights into how AI practitioners can become effective stewards of AI strategy. By embracing a holistic approach that encompasses strategic alignment, robust data practices, ethical considerations, and proactive change management, organizations can successfully navigate the complexities of AI adoption and thrive in an AI-centric future.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/DZTLEW/</url>
            <location>Auditorium 5</location>
            
            <attendee>Rajkumar Venkatesan</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>3YQQ8N@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-3YQQ8N</pentabarf:event-slug>
            <pentabarf:title>Making the most of test-time compute in LLMs</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T102000</dtstart>
            <dtend>20250418T105500</dtend>
            <duration>003500</duration>
            <summary>Making the most of test-time compute in LLMs</summary>
<description>The objectives of this session are to:
1. Highlight differences between mainstream LLMs and reasoning models.
2. Understand test-time compute and the different dimensions along which it can be scaled.
3. Demonstrate experimental results with reasoning models from DeepSeek and OpenAI.
4. Learn how to prompt reasoning models effectively.
5. Showcase how to leverage test-time compute at the application level to achieve good results.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/3YQQ8N/</url>
            <location>Auditorium 5</location>
            
            <attendee>Suhas Pai</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>JNHA9R@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-JNHA9R</pentabarf:event-slug>
            <pentabarf:title>Evaluating LLMs at S&amp;P Global: Building a Robust Evaluation Framework for GenAI Productivity Tools</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T105500</dtstart>
            <dtend>20250418T113000</dtend>
            <duration>003500</duration>
            <summary>Evaluating LLMs at S&amp;P Global: Building a Robust Evaluation Framework for GenAI Productivity Tools</summary>
            <description>In this talk, we will provide an in-depth look at how S&amp;P Global built a comprehensive and reliable evaluation framework for our Generative AI (GenAI)-powered internal productivity tools, with a focus on our Market Intelligence (MI) Sales Assistant application.

We will begin by discussing the unique challenges of evaluating large language models (LLMs) and the importance of a robust evaluation strategy, especially for Retrieval Augmented Generation (RAG)-based systems. We’ll then dive into the key components of our framework:

• Metrics: We thoughtfully combine traditional statistical metrics like accuracy, precision, and latency with LLM-specific metrics such as answer relevance, faithfulness to source, and hallucination detection. We’ll explain each metric and its role in assessing model performance and talk about how custom metrics are often necessary in LLM applications.

• Question-Answer Pair Generation: We’ll share our process for generating diverse and representative question-answer pairs, including the models used, quality control measures, and lessons learned around promoting diversity in evaluation data.

• Ground Truth Creation: Our framework heavily involves subject matter experts (SMEs) to create and validate ground truth data. We’ll detail our process for engaging SMEs, documenting and versioning ground truth, and maintaining high standards.

• Evaluation Implementation: We’ll provide a technical overview of our framework, built using the MLflow library. We’ll cover our daily sampling process for continuous monitoring, our comprehensive testing triggered by new releases and document updates, and cost considerations. We will also talk broadly about other tools available outside of MLflow.
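
To give a flavor of what such an evaluation loop looks like, here is a minimal, illustrative sketch in plain Python (the function name and metric choices are our own simplification, not the framework described in the talk):

```python
import statistics
import time

def evaluate_qa(answer_fn, qa_pairs):
    # Illustrative only: combines a traditional statistical metric
    # (exact-match accuracy) with an operational one (latency).
    correct, latencies = 0, []
    for question, expected in qa_pairs:
        t0 = time.perf_counter()
        answer = answer_fn(question)
        latencies.append(time.perf_counter() - t0)
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return {
        "accuracy": correct / len(qa_pairs),
        "p50_latency_s": statistics.median(latencies),
    }
```

In practice, exact match is replaced or supplemented by LLM-specific metrics such as faithfulness and answer relevance, as the Metrics bullet above describes.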

Throughout the talk, we’ll share real-world results and concrete lessons learned, such as effective strategies for question generation, SME engagement, and scaling evaluation processes. We’ll demonstrate our MI Sales Assistant and evaluation dashboard to illustrate the framework in action.

Attendees will come away with a clear understanding of what it takes to implement a robust evaluation framework for a real-world GenAI application. They’ll learn proven best practices and potential pitfalls, equipping them to ensure their own AI systems consistently deliver value.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/JNHA9R/</url>
            <location>Auditorium 5</location>
            
            <attendee>MacKenzye Leroy</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>FHY93D@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-FHY93D</pentabarf:event-slug>
            <pentabarf:title>Maximizing Multimodal: Exploring the search frontier of text-to-image models to improve visual find-ability for creatives</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T113000</dtstart>
            <dtend>20250418T120500</dtend>
            <duration>003500</duration>
            <summary>Maximizing Multimodal: Exploring the search frontier of text-to-image models to improve visual find-ability for creatives</summary>
            <description>Objective:
Describe where and how we have improved the search experience in our product with open-source multi-modal models and libraries, with real-world examples from the things we have shipped (and decided not to ship) to production.

Outline:
1. Cover the architecture of open source hybrid search stack at Eezy (Elasticsearch, FAISS, PyTorch)
2. Demo the capabilities and limitations of openCLIP for retrieval embeddings
3. Highlight meaningful stops on our product roadmap from the last 2 years of deploying features into production.
4. Describe notable missteps and surprises uncovered along the way, so people see it&#x27;s not all roses in the AI powered future.
5. Demo of BORGES, a novel search framework that allows users to search with multiple queries for a nuanced navigation of the catalog to find exactly what they need
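
To make the hybrid-search idea in step 1 concrete, here is a toy score-fusion sketch in plain Python (min-max normalization with a blend weight; an illustration of the general technique, not Eezy’s production logic):

```python
def fuse_scores(lexical, vector, alpha=0.5):
    # Min-max normalize each score dict, then blend lexical and vector
    # scores with weight alpha. Returns doc ids ranked best-first.
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (v - lo) / span for doc, v in scores.items()}
    lex, vec = normalize(lexical), normalize(vector)
    fused = {}
    for doc in set(lex) | set(vec):
        fused[doc] = alpha * lex.get(doc, 0.0) + (1.0 - alpha) * vec.get(doc, 0.0)
    return sorted(fused, key=fused.get, reverse=True)
```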

Audience:
- Anyone curious about real-world results we have extracted from AI
- Search practitioners developing hybrid search applications
- PyTorch and transformers enthusiasts interested in applications in vector space
- This talk is not overly technical and does not require a background in ML/search/AI. The most math required is some multiplication and division; if you&#x27;ve got that, jump in.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/FHY93D/</url>
            <location>Auditorium 5</location>
            
            <attendee>Nathan Day</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>XEBBH7@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-XEBBH7</pentabarf:event-slug>
            <pentabarf:title>Fine tuning embeddings for semantic caching</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T120500</dtstart>
            <dtend>20250418T123500</dtend>
            <duration>003000</duration>
            <summary>Fine tuning embeddings for semantic caching</summary>
            <description># Who Should Attend?
This talk is designed for AI engineers and researchers interested in building with LLMs in production. Attendees with a basic understanding of NLP and RAG systems will benefit most, but the concepts and demonstrations will be approachable for a general technical audience.

# Why It’s Interesting?
As organizations incorporate LLMs into real-world products, they grapple with inference compute demands and sluggish response times. Semantic caching offers a pragmatic solution: once you identify frequently asked questions (or recurring queries), you can serve results from a cache rather than running a fresh, computationally expensive inference every time. This lowers cost and latency. Moreover, using various fine-tuning methods on the retrieval models improves the accuracy of “question deduplication,” ensuring cache hits are matched reliably.

# Key Takeaways
- Semantic Caching Fundamentals: How to design and implement a caching layer tailored for question-answering or conversational systems (RAG).
- Embedding Fine-Tuning: An overview of contrastive methods to improve embedding models’ ability to detect near-duplicate or semantically similar queries.
- Practical Insights: Best practices for integrating semantic caching in production, along with tips for monitoring performance and keeping infrastructure costs down.
- Real world examples.

# Background Knowledge
- Minimal NLP/ML Knowledge: Familiarity with embeddings, vector similarity, and basic model inference is helpful.
- Basic Software Engineering: Familiarity with productionizing ML workflows will help contextualize the caching strategy.

# Talk Outline (30 minutes)
1. Introduction to LLM challenges in production (high inference cost, slow responses) with real world examples.
2. Overview of semantic caching: concepts, benefits, and common pitfalls.
3. Improving cache hit rates with contrastive fine-tuning: what it is and how it enhances embedding models.
4. Demo of improving duplicate question detection.
5. Recap and system architecture review.
6. Share resources for further learning (GitHub links, additional reading, etc.)
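
As a taste of the caching layer outlined above, here is a minimal, self-contained sketch in plain Python (the embed function and threshold are placeholders; a production system would use a real embedding model and a vector database):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    # Toy semantic cache: serve a stored answer when a new query's
    # embedding is close enough to a cached one.
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query):
        q = self.embed(query)
        best, best_score = None, 0.0
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best, best_score = answer, score
        if best_score >= self.threshold:
            return best   # cache hit
        return None       # cache miss: caller runs the LLM, then put()

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Contrastive fine-tuning, covered in step 3, is what pushes near-duplicate queries closer together in embedding space so that a fixed threshold like this produces reliable hits.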

By the end of this session, attendees will have a clear roadmap for employing semantic caching and contrastive fine-tuning to reduce costs and improve performance in LLM-powered applications. We look forward to sharing our experiences and answering your questions!</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/XEBBH7/</url>
            <location>Auditorium 5</location>
            
            <attendee>Tyler Hutcherson</attendee>
            
            <attendee>Srijith Rajamohan</attendee>
            
            <attendee>Waris Gill</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>NEKHFV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-NEKHFV</pentabarf:event-slug>
            <pentabarf:title>Panel: Principles for Effective and Successful Data Scientists</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T133500</dtstart>
            <dtend>20250418T143500</dtend>
            <duration>010000</duration>
            <summary>Panel: Principles for Effective and Successful Data Scientists</summary>
            <description>This conversational panel brings together experienced data science professionals to explore what truly matters for success in the field beyond what&#x27;s typically learned in educational settings.

Our panelists will share insights on:
* The &quot;real world&quot; skills critical to data science that aren&#x27;t typically taught in academic programs
* Foundations of data science: a core understanding of data, the mechanics of models, and the importance of considering MLOps as a data scientist
* How to stand out in data science job opportunities, and the pathways into and through data science
* Practical advice for students, job seekers, and career changers looking to enter or advance in data science

This session will be valuable for students, early-career data scientists, those interviewing for data science roles, professionals seeking promotions, and individuals looking to transition from other fields into data science.

The panel will include time for audience Q&amp;A, allowing attendees to ask specific questions about each major discussion point.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/NEKHFV/</url>
            <location>Auditorium 5</location>
            
            <attendee>Aaron Baker</attendee>
            
            <attendee>Renee Teate</attendee>
            
            <attendee>David Der</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>HKZH7C@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-HKZH7C</pentabarf:event-slug>
            <pentabarf:title>Addressing Climate Change with AI</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T145500</dtstart>
            <dtend>20250418T153000</dtend>
            <duration>003500</duration>
            <summary>Addressing Climate Change with AI</summary>
            <description>Overview: 
AI is profoundly shaping society.  An equally forceful phenomenon is climate change; humanity is already feeling the impacts, and temperatures and greenhouse gas emissions keep rising.  The goal of this talk is to briefly survey the many ways AI is and can be used to address climate change, and to provide pointers to anyone interested in contributing to the effort.  The intended audience is anyone with an interest in this intersection of AI and climate change.

Climate Change: 
We’ll briefly discuss aspects of climate change which AI is tackling, such as mitigating emissions from the five most carbon-intensive sectors: energy, manufacturing, land use, transportation, and buildings / infrastructure.  We’ll also look at AI’s application to other areas such as climate modeling, carbon capture, climate finance, and reducing the carbon footprint of AI itself.  

AI: 
We’ll see how a number of AI methods can be used to address climate change, including: various neural net architectures (e.g. convolutional, recurrent, graph), LLMs, reinforcement learning, generative AI, neural operators, causality, and natural language processing.

Their intersection: 
We’ll display a matrix of climate change domains and selected AI methods that can address them, as a guide to tractable areas to tackle.  We’ll look at unsolved climate-related areas where AI could potentially help.  We’ll conclude by providing resources for anyone wishing to learn more about this intersection, and for technologists wanting to plug into an existing community to contribute to this effort.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/HKZH7C/</url>
            <location>Auditorium 5</location>
            
            <attendee>Dan Loehr</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SF7WAK@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SF7WAK</pentabarf:event-slug>
            <pentabarf:title>Real-Time Fitness Leaderboards with Open-Source Moose</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T153000</dtstart>
            <dtend>20250418T160500</dtend>
            <duration>003500</duration>
            <summary>Real-Time Fitness Leaderboards with Open-Source Moose</summary>
            <description>What &amp; Why
Health and fitness applications produce constant streams of data, from workout logs and step counts to heart-rate measurements and sleep metrics. Crafting a dynamic, user-facing experience—like up-to-the-minute leaderboards or automated badge award systems—requires real-time data access and frequent aggregations. Traditional OLTP databases can stall under heavy reads and writes, making it tough to maintain a snappy user experience.

Enter Moose, an open-source analytics engine built around a columnar architecture. With Moose, developers and data teams can:

- Ingest large volumes of real-time data from wearables, apps, and sensors.
- Run near-instantaneous aggregations to power live dashboards or personal health insights.
- Scale analytics cost-effectively thanks to Moose’s open-source foundation and Python-friendly ecosystem.

Practical Use Case: Real-Time Fitness Leaderboards
We’ll demonstrate how to build a workout leaderboard that updates in real time as users complete activities. We’ll also show how to apply custom rules for awarding achievement badges, ensuring that your application can both process and surface analytics-driven insights at scale.

Who Should Attend
- Data &amp; Analytics Engineers: Seeking solutions to handle large volumes of health/wellness data with frequent aggregations.
- Developers/Architects: Building real-time or near-real-time consumer apps that rely on fast analytics.
- Product Managers &amp; Tech Leads: Interested in creating engaging features like live dashboards and automatic badge systems within their wellness offerings.
- Health &amp; Fitness Enthusiasts: Looking to understand how data architecture can enhance user engagement and personalized metrics.

A basic understanding of databases, Python data tools, and event streams (e.g., from wearable devices) is helpful but not required.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/SF7WAK/</url>
            <location>Auditorium 5</location>
            
            <attendee>David Der</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>D3Z7XN@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-D3Z7XN</pentabarf:event-slug>
            <pentabarf:title>Panel: Bridging the Gap: Collaborative Approaches to Data Science</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T160500</dtstart>
            <dtend>20250418T170500</dtend>
            <duration>010000</duration>
            <summary>Panel: Bridging the Gap: Collaborative Approaches to Data Science</summary>
            <description>This panel brings together practitioners and leaders to discuss the evolving landscape of data science collaboration and implementation. As organizations face increasing pressure to derive value from AI/ML initiatives, the traditional boundaries between disciplines are being reexamined and redefined.

Our panelists will explore:

- Breaking down isolation between data scientists, MLOps engineers, developers, and other stakeholders
- Creating effective frameworks for rapid experimentation that balance innovation with enterprise standards
- Establishing robust handoff processes for transitioning models from exploration to production
- Bridging cultural divides between the explorative nature of data science and the engineering mindset of MLOps
- Practical strategies for cross-functional collaboration that leverages complementary skills
- Managing stakeholder expectations and improving communication with non-technical audiences

This discussion is designed for data professionals at all levels—from individual contributors to team leaders and executives—who are navigating the challenges of modern data science implementation. The panel will address both technical and organizational aspects of successful data science teams.

The session will include time for audience Q&amp;A, allowing attendees to engage directly with panelists about their specific challenges in building collaborative data science environments.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/D3Z7XN/</url>
            <location>Auditorium 5</location>
            
<attendee>Thomas Loeber</attendee>
            
            <attendee>Manikandarajan Shanmugavel</attendee>
            
            <attendee>Renee Teate</attendee>
            
            <attendee>Christopher N. Eichelberger</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>RBYY9R@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-RBYY9R</pentabarf:event-slug>
            <pentabarf:title>Practical Applications of Apache Arrow</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T102000</dtstart>
            <dtend>20250418T105500</dtend>
            <duration>003500</duration>
            <summary>Practical Applications of Apache Arrow</summary>
            <description>The Apache Arrow project has been drastically improving the way analytical tools perform, interoperate, and scale. However, as Arrow is primarily used by developers, much of those improvements are happening &quot;behind the scenes,&quot; leaving many uninformed as to what exactly Apache Arrow is.

In this talk, we will provide a more formal definition of Apache Arrow, and discuss its various components that collectively are helping to revolutionize the data landscape. We will also take some time to explore how popular Python packages like pandas, polars, and pantab have been leveraging Apache Arrow for interoperability between utilities, while also having an open discussion as to what can still be done.

By the end of this talk, users will have an appreciation of how Apache Arrow is powering their Python (and non-Python!) libraries today, and how it will shape the data landscape going forward. Topics like Arrow Flight, Arrow Flight SQL, Arrow ADBC, and nanoarrow will be discussed, and attendees will gain a deeper understanding of how these technologies are evolving the way data is used in embedded environments, relational databases, HTTP exchanges, AI applications, and more.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/RBYY9R/</url>
            <location>Auditorium 4</location>
            
            <attendee>William Ayd</attendee>
            
            <attendee>Matthew Topol</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>8GSQPK@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-8GSQPK</pentabarf:event-slug>
            <pentabarf:title>Data wrangling with DuckDB</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T105500</dtstart>
            <dtend>20250418T113000</dtend>
            <duration>003500</duration>
            <summary>Data wrangling with DuckDB</summary>
<description>Learn how to use DuckDB to process data in Python! In the era of &quot;big data,&quot; many data practitioners immediately reach for distributed computing solutions when facing large datasets. Modern hardware capabilities combined with efficient tools like DuckDB make this much less necessary than it was a few years ago. This talk will demonstrate how to effectively wrangle data using DuckDB in Python, offering a powerful alternative to pandas and Spark for the majority of data science workflows.

This session will cover:

- Understanding DuckDB&#x27;s architecture and its integration with the Python ecosystem
- Practical examples of migrating from pandas to DuckDB
- Performance benchmarks comparing DuckDB against pandas and other popular Python data processing methods
- Real-world scenarios where DuckDB shines, including handling larger-than-memory datasets
- Discussion of the &quot;shrinking size&quot; of big data and when to consider DuckDB versus distributed computing solutions

This talk is aimed at Python data practitioners who regularly work with medium to large datasets (100MB-100GB) and are looking to optimize their data processing workflows. The presentation will include both conceptual explanations and hands-on code examples.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/8GSQPK/</url>
            <location>Auditorium 4</location>
            
            <attendee>Will Angel</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>UDQZBM@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-UDQZBM</pentabarf:event-slug>
            <pentabarf:title>Zero Code Change GPU-Powered Graph Analytics with NetworkX and cuGraph</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T113000</dtstart>
            <dtend>20250418T120500</dtend>
            <duration>003500</duration>
            <summary>Zero Code Change GPU-Powered Graph Analytics with NetworkX and cuGraph</summary>
<description>This talk will showcase a GPU-accelerated graph backend presented by NVIDIA in partnership with the NetworkX community. It aims to show how GPUs are well-suited to solving graph problems at large scales.

The talk is intended for Python developers who are interested in using GPUs in their workflows and data scientists interested in Graph analytics.

During the talk, we intend to go over the following.

1. Brief introduction to Graphs and why Graph Analytics is so powerful.

2. Introducing NetworkX – Why is it so popular? What are its limitations?

3. Example showcasing the magic of dispatching: the design philosophy and how it benefits both users and open-source developers.

4. Real-world example on the Pokec (Social Network) dataset. How to do Community Detection on a large Graph using Louvain (with Zero Code Change)!

5. Finally, how we aim to work with the community to add new algorithm implementations and contribute to upstream NetworkX.
 
6. Q&amp;A!
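The zero-code-change idea can be sketched as follows (a minimal example assuming NetworkX 3.x; nx-cugraph and a GPU are needed only for the accelerated path, and the script runs fine on plain NetworkX):

```python
import networkx as nx

# Ordinary NetworkX code -- no GPU-specific imports.
G = nx.karate_club_graph()
communities = nx.community.louvain_communities(G, seed=42)
print(len(communities), "communities")

# With nx-cugraph installed, the SAME call can run on the GPU, either via
#   NX_CUGRAPH_AUTOCONFIG=True   (environment variable), or
#   nx.community.louvain_communities(G, seed=42, backend="cugraph")
```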

===

Learn more:

 - [Project page](https://rapids.ai/nx-cugraph/)!
 - [GitHub](https://github.com/rapidsai/nx-cugraph)!

I&#x27;d love to connect with you and discuss ideas of applying Graph analytics to *your* work.

Reach out via [LinkedIn](https://www.linkedin.com/in/ralph-liu/)</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/UDQZBM/</url>
            <location>Auditorium 4</location>
            
            <attendee>Ralph Liu</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>XRXKDK@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-XRXKDK</pentabarf:event-slug>
            <pentabarf:title>Practical Multi Armed Bandits</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T120500</dtstart>
            <dtend>20250418T123500</dtend>
            <duration>003000</duration>
            <summary>Practical Multi Armed Bandits</summary>
            <description>Imagine a row of slot machines (often called one-armed bandits because of the lever on the side and the fact that they take your money) -- you know that one of them will pay out more than the others over time, but how do you figure out which one? This is the premise of the multi-armed bandit (MAB) problem, which has become a vital reinforcement learning technique used to balance the exploration-exploitation dilemma (e.g., at what point do you start exploiting the best choice to maximize your rewards instead of exploring for better options).

Multi-armed bandits are straightforward to implement: define your choices and assign each of them a probability distribution for selection. Each time a choice is made, the probability distribution for that choice is updated based on the outcome of a reward function. Easy, right? The trick is in designing both your choices and your reward function in such a way that you capture the dynamics of your experimental environment, often a live environment that involves user behavior and other irregularities!
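To make that loop concrete, here is a toy Thompson-sampling bandit in pure Python (an illustrative sketch, not code from the talk; the Beta-Bernoulli arms and the simulated reward function are assumptions for the example):

```python
import random

# Each arm keeps a Beta(wins+1, losses+1) posterior over its payout rate.
class Arm:
    def __init__(self, true_rate):
        self.true_rate = true_rate      # hidden payout probability
        self.wins, self.losses = 0, 0   # observed reward counts

    def sample(self):                   # draw a plausible rate from the posterior
        return random.betavariate(self.wins + 1, self.losses + 1)

    def pull(self):                     # reward function: pays 1 with true_rate
        reward = random.random() < self.true_rate
        self.wins += reward
        self.losses += not reward
        return reward

random.seed(0)
arms = [Arm(0.2), Arm(0.5), Arm(0.7)]
for _ in range(2000):
    # Explore/exploit in one move: pull the arm whose posterior sample is highest.
    max(arms, key=Arm.sample).pull()

print([a.wins + a.losses for a in arms])  # pull counts per arm
```

Over time the pull counts concentrate on the best arm while the others are still sampled occasionally, which is exactly the exploration-exploitation balance described above.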

Things get more complicated when you have multiple agents - each of them with their own probability distributions. Here, you need to design the reward functions such that your desired behavior emerges from the collective interactions of each individual agent. The best type of complexity arises globally from many simple local interactions! 

In this talk, we will learn how to implement multi-armed bandits and reward functions for three use cases: ordering a news feed, prioritizing tasks for a team in a sprint, and minimizing cloud costs for a distributed system. We&#x27;ll focus on practical strategies for designing reward functions and dealing with change. At the end of this talk you should be ready and excited to implement bandit algorithms for your own data science problems!</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/XRXKDK/</url>
            <location>Auditorium 4</location>
            
            <attendee>Benjamin Bengfort</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>AFZSVT@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-AFZSVT</pentabarf:event-slug>
            <pentabarf:title>Using Python to Unlock Insights from OpenStreetMap Data at Scale</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T145500</dtstart>
            <dtend>20250418T153000</dtend>
            <duration>003500</duration>
            <summary>Using Python to Unlock Insights from OpenStreetMap Data at Scale</summary>
            <description>Commercial real estate organizations are avid consumers of geospatial data. These organizations have already identified the value of spatial data on power and telecommunications infrastructure in particular for making business decisions. Examples of these data include the locations of power plants, transmission lines, fiber backbone cables, and submarine fiber cables.

One rich source for these datasets is OpenStreetMap (OSM); however, OSM does not natively streamline access to its data, especially at scale. Because OSM data are open, we can use Python to query, download, and transform OSM power and telecommunications spatial data for use within open-source and commercial Geographic Information Systems (GIS) software, models built in Python and other languages, and virtually any other tool or process that can read GIS data. 
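One common way to query OSM programmatically is the Overpass API; the sketch below builds an Overpass QL query for power infrastructure in a bounding box (an illustration of the general approach, not necessarily the pipeline shown in the talk):

```python
# Build an Overpass QL query for OSM power features inside a bounding box.
def overpass_power_query(south, west, north, east, timeout=180):
    bbox = f"{south},{west},{north},{east}"
    return (
        f"[out:json][timeout:{timeout}];"
        f"(node[power=plant]({bbox});"
        f"way[power=line]({bbox});"
        f"relation[power=plant]({bbox}););"
        "out geom;"
    )

# The query string would then be POSTed to an Overpass endpoint such as
# https://overpass-api.de/api/interpreter (e.g., with requests.post).
q = overpass_power_query(36.5, -83.7, 39.5, -75.2)  # roughly Virginia
print(q)
```

For continent-scale extracts the talk describes a different download approach; the query above is only practical for modest regions.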

This presentation will give a high-level overview of the overall data flow and then dive into the individual steps and how each was implemented in Python. Examples will be provided, and maps and analyses based on the resulting spatial data will be demonstrated. The presentation will also explain one approach to downloading very large OSM datasets, for example data spanning continents and including many different themes. Along the way, it will touch on how to avoid “gotchas” and how this approach could be adapted to other types of OSM data supporting other use cases and business requirements.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/AFZSVT/</url>
            <location>Auditorium 4</location>
            
            <attendee>Cory Eicher</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ECJWAP@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ECJWAP</pentabarf:event-slug>
            <pentabarf:title>Versioning Multimodal Data: Metadata &amp; Beyond</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T153000</dtstart>
            <dtend>20250418T160500</dtend>
            <duration>003500</duration>
            <summary>Versioning Multimodal Data: Metadata &amp; Beyond</summary>
            <description>The team behind DVC has spent years tackling data versioning challenges. With the rise of AI, we’ve seen new complexities emerge, especially with multimodal datasets spanning images, video, audio, and text. Simply tracking files is no longer enough: metadata, including bounding boxes, poses, text annotations, and embeddings, is now central to dataset management, and using LLMs for auto-annotation is becoming a daily routine. This talk shows why multimodal data versioning is different, how Pydantic provides a powerful way to structure and integrate metadata, and how this approach is implemented in the open-source library DataChain.
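As a sketch of what structuring such annotation metadata with Pydantic can look like (the model names here are illustrative assumptions, not the actual DataChain schema):

```python
from pydantic import BaseModel

# Hypothetical annotation models: typed, validated, and serializable,
# so metadata can be versioned alongside the files it describes.
class BBox(BaseModel):
    x: float
    y: float
    width: float
    height: float

class ImageAnnotation(BaseModel):
    label: str
    bbox: BBox
    confidence: float = 1.0

# Nested dicts (e.g., from an auto-annotation pipeline) validate on construction.
ann = ImageAnnotation(label="cat", bbox={"x": 10, "y": 20, "width": 64, "height": 48})
print(ann.label, ann.bbox.width, ann.confidence)
```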

We’ll also cover efficient dataset operations at scale: computing diffs across millions of files, managing expensive GPU-based metadata computations such as embeddings, and performing incremental dataset updates. The audience will learn practical tricks for building scalable, high-performance AI workflows with modern dataset management techniques.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/ECJWAP/</url>
            <location>Auditorium 4</location>
            
            <attendee>Dmitry Petrov</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>8M9ZJN@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-8M9ZJN</pentabarf:event-slug>
            <pentabarf:title>AI Ready Data</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T160500</dtstart>
            <dtend>20250418T164000</dtend>
            <duration>003500</duration>
            <summary>AI Ready Data</summary>
            <description>Customers have been clear that receiving ‘just’ data is no longer sufficient. They expect data to be immediately accessible, usable, and understandable to both human and AI consumers with “zero ETL” (Extract, Transform, Load). We will discuss the direction S&amp;P is taking explicitly aimed at serving this need, greatly increasing the insight available to customers. This includes providing machine-readable metadata at the column level for datasets. This metadata permits AI and ETL tools to automatically ingest and connect delivered data to a customer’s own data, as well as automatically import that data into a customer’s data catalog.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/8M9ZJN/</url>
            <location>Auditorium 4</location>
            
            <attendee>Alec Gosse</attendee>
            
            <attendee>Hamish Brookeman</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>FMQ8PA@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-FMQ8PA</pentabarf:event-slug>
            <pentabarf:title>Visualization of higher-dimensional feature spaces during model training</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T164000</dtstart>
            <dtend>20250418T171500</dtend>
            <duration>003500</duration>
            <summary>Visualization of higher-dimensional feature spaces during model training</summary>
            <description>The goal of this talk is to provide machine learning practitioners with a few simple visualizations for more effective model training. These techniques have been developed through several years of real-world experience with model training, validation, deployment, and maintenance. Since the internal workings of large models are usually somewhat opaque, model trainers often ask themselves a familiar set of questions:  

When should I stop training my model? 

Which one of my saved model checkpoints is the “best”? 

What training data should I add (or remove) to achieve a given outcome? 

How do I know if my model is giving the right answer for the wrong reasons, or vice versa? 

How robust is my model to out-of-distribution data? 

Why is there performance drift in my deployed model? 

We argue that much greater emphasis on model observability and explainability is needed, and that the right sorts of visualizations can generate valuable insights and point toward specific improvements.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/FMQ8PA/</url>
            <location>Auditorium 4</location>
            
            <attendee>Vivek Dhand</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>ZXYBV3@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-ZXYBV3</pentabarf:event-slug>
            <pentabarf:title>Bayesian Risk Analysis For Large Multi-Modal Data</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T102000</dtstart>
            <dtend>20250418T105500</dtend>
            <duration>003500</duration>
            <summary>Bayesian Risk Analysis For Large Multi-Modal Data</summary>
            <description>This talk is based on research projects by UVA iTHRIV on the N3C platform. Its target audience includes data scientists, undergraduate and graduate students, researchers, and anyone interested in data science. The talk will consist of a brief introduction to the National COVID Cohort Collaborative (N3C), a database with multi-modal data sets; an overview of quantitative methods and models in Bayesian risk analysis; and real-world applications of these methods, along with publications by our team. The talk will balance mathematical exposition with real-world applications, and the audience will learn quantitative methods for analyzing multi-modal data in N3C.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/ZXYBV3/</url>
            <location>Auditorium 3</location>
            
            <attendee>Sihang Jiang</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>GLBTZD@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-GLBTZD</pentabarf:event-slug>
            <pentabarf:title>Saving Lives with Data Science: How data science shortened the COVID-19 pandemic by 2 months</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T105500</dtstart>
            <dtend>20250418T113000</dtend>
            <duration>003500</duration>
            <summary>Saving Lives with Data Science: How data science shortened the COVID-19 pandemic by 2 months</summary>
            <description>This talk explores how data science accelerated COVID-19 vaccine trials, saving 6-8 weeks in deployment. Through geospatial modeling, we targeted diverse recruitment in emerging hot zones, ensuring efficient and representative trials. Attendees will discover how advanced analytics and collaboration turned insights into life-saving action.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/GLBTZD/</url>
            <location>Auditorium 3</location>
            
            <attendee>Greg Michaelson</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>PG9CKX@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-PG9CKX</pentabarf:event-slug>
            <pentabarf:title>The Art of Brain Data in ASD Subjects: Celebrating Neurodiversity Through Aesthetic Data Visualization</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T113000</dtstart>
            <dtend>20250418T120500</dtend>
            <duration>003500</duration>
            <summary>The Art of Brain Data in ASD Subjects: Celebrating Neurodiversity Through Aesthetic Data Visualization</summary>
            <description>Historically, research has highlighted a notable disparity in ASD diagnoses—with males being diagnosed significantly more frequently than females. However, beneath these statistics lies a rich tapestry of neuroanatomical diversity that often goes unnoticed. Our work reimagines this disparity as a piece of art, where data becomes a sculptural medium inviting viewers to engage with and reflect on the intricacies of brain structure.

Drawing on over 300 3D brain surface models from the Autism Centers of Excellence (ACE) study, our approach blends advanced MRI neuroimaging, multivariate statistical analysis, and cutting-edge 3D printing technology. The result is an artful representation that not only quantifies but also visually and tangibly celebrates sex differences in brain morphology across both ASD and non-ASD populations.

This presentation will take you on a journey through our methodological and creative process—from the acquisition and analysis of complex neuroimaging data to the transformation of these insights into physical art. We will discuss the technical details of MRI scanning, the challenges and innovations in our multivariate analyses, and the craftsmanship behind the 3D printing process.

Designed for an audience spanning both scientific and artistic disciplines, this presentation aims to inspire new ways of thinking about data visualization. By embracing &quot;data as art,&quot; we encourage a more holistic understanding of neurodiversity—one that not only informs but also resonates on an emotional and aesthetic level. Join us for this presentation as we explore how the fusion of art and science can lead to innovative insights into the human brain, fostering a deeper appreciation for the nuanced interplay of sex differences in ASD and beyond.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/PG9CKX/</url>
            <location>Auditorium 3</location>
            
            <attendee>Siwen Liao</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>CF3VVT@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-CF3VVT</pentabarf:event-slug>
            <pentabarf:title>Exploring Eviction Trends in Virginia</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T120500</dtstart>
            <dtend>20250418T123500</dtend>
            <duration>003000</duration>
            <summary>Exploring Eviction Trends in Virginia</summary>
            <description>Virginia is home to 5 of the top 10 cities with the highest rates of eviction nationwide. Housing instability threatens the security of entire communities and burdens already limited social safety nets. Yet research shows that housing instability is rooted not in individual or community failures, but in policies of exclusion, displacement, disinvestment, and discrimination.

While collected to support programmatic goals, administrative data can also be used to shift the lens to those in power. In this work, we first visualize eviction activity across the Commonwealth in an interactive Shiny app to address the questions and needs of organizations providing legal, policy, and community advocacy. In addition, we estimate landlord actions – eviction filings and serial filings – as a function of community and landlord characteristics. Using a series of mixed-effects models, with data aggregated to zipcodes nested in counties, we estimate the impact of community characteristics and landlord attributes on the likelihood of eviction filings and nuisance filings. Both the app and the analysis speak to the larger causes and consequences of housing instability.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/CF3VVT/</url>
            <location>Auditorium 3</location>
            
            <attendee>Samantha Toet</attendee>
            
            <attendee>Dr. Michele Claibourn</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>NNXPCL@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-NNXPCL</pentabarf:event-slug>
            <pentabarf:title>Author Chat &amp; Book Signing</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T123500</dtstart>
            <dtend>20250418T133500</dtend>
            <duration>010000</duration>
            <summary>Author Chat &amp; Book Signing</summary>
            <description>Come meet the authors of some of your favorite data science books, or learn more about a book you&#x27;re interested in but haven&#x27;t purchased yet. 

The authors listed below will be available during lunch for informal discussions, so drop in any time during the lunch break for a meet &amp; greet. Some authors will be signing books, so bring your books written by these authors if you want your copy autographed! (And check this schedule again before Friday, as we may have authors joining this session up until the day before the event.) Some limited copies may be available as giveaways.

Will Ayd: Pandas Cookbook, Third Edition (Packt)

Suhas Pai: Designing Large Language Model Applications (O&#x27;Reilly)

Renee M. P. Teate: SQL for Data Scientists (Wiley)

Matt Topol: In-Memory Analytics with Apache Arrow (Packt)


Note that author John Berryman will be presenting a tutorial on Saturday, and will be available during lunchtime on Saturday to chat about his book &quot;Prompt Engineering for LLMs: The Art and Science of Building Large Language Model-Based Applications&quot; (O&#x27;Reilly).</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/NNXPCL/</url>
            <location>Auditorium 3</location>
            
            <attendee>William Ayd</attendee>
            
            <attendee>Matthew Topol</attendee>
            
            <attendee>Renee Teate</attendee>
            
            <attendee>Suhas Pai</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>L3GESN@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-L3GESN</pentabarf:event-slug>
            <pentabarf:title>Using Changepoint and Bayesian Analysis to Drive Safety Improvements in Mining</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T145500</dtstart>
            <dtend>20250418T153000</dtend>
            <duration>003500</duration>
            <summary>Using Changepoint and Bayesian Analysis to Drive Safety Improvements in Mining</summary>
            <description>The presentation will cover how changepoint analysis is implemented, how the insights generated are applied to improve the safety metrics, and the challenges we have faced in communicating the insights. It will be structured as follows:
•	Understanding variability in the process (5 min): How random variation impacts safety metrics and challenges in measuring zero-harm.
•	Changepoint analysis implementation (10 min): Introduction to changepoint analysis using the changepoint package in R and Bayesian changepoint detection using the Rbeast package in Python.
•	Communicating the insights (10 min): Challenges in communicating the insights and presenting them in a way that is actionable for the safety team and executives.
•	Q&amp;A (5-10 min): Open discussion and audience questions.
Attendees will learn:
•	Why comparing absolute numbers might be misleading.
•	How to implement changepoint analysis to detect significant changes in safety metrics.
•	Strategies for communicating actionable findings to non-data-science teams and executives.
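To make the core idea concrete, here is a toy single-changepoint search in plain Python (an illustrative sketch of the mean-shift idea behind these methods, not the changepoint or Rbeast packages themselves; the incident counts are made up):

```python
def best_changepoint(series):
    """Return the split index that minimizes within-segment squared error."""
    def sse(seg):
        m = sum(seg) / len(seg)                  # segment mean
        return sum((x - m) ** 2 for x in seg)    # squared error around it

    # Try every split point and keep the one with the lowest total error.
    return min(range(1, len(series)),
               key=lambda i: sse(series[:i]) + sse(series[i:]))

# Monthly incident counts: the rate drops after a safety intervention.
incidents = [9, 11, 10, 12, 9, 10, 4, 5, 3, 4, 5, 4]
print(best_changepoint(incidents))  # → 6, the month the mean shifts
```

Real changepoint packages handle multiple changepoints, penalties against overfitting, and (in the Bayesian case) uncertainty over the changepoint location, which is exactly what makes them preferable to a hand-rolled search.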
This session is ideal for data practitioners with a background in basic probability and statistics (e.g., understanding distributions and confidence intervals). No programming expertise is required, but references to Python libraries and code snippets will provide actionable insights for those looking to implement these techniques in their work.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/L3GESN/</url>
            <location>Auditorium 3</location>
            
            <attendee>Mauricio Mathey</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>WRJYDF@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-WRJYDF</pentabarf:event-slug>
            <pentabarf:title>The Secret Sauce of Customer Satisfaction: Turning Data Pipelines into Data Products</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T153000</dtstart>
            <dtend>20250418T160500</dtend>
            <duration>003500</duration>
            <summary>The Secret Sauce of Customer Satisfaction: Turning Data Pipelines into Data Products</summary>
            <description>Since 2023, Elder Research has partnered with a major U.S.-based Quick Service Restaurant corporation to enhance the effectiveness of their enterprise data &amp; analytics group. Our goal was to instill a &quot;Data as a Product&quot; mindset across six Data Portfolios, which support internal analytics teams by maintaining core data pipelines for critical business and customer-facing applications.

In this talk, we’ll share key insights from the work by our technical business analysts and data engineers on this project, highlight the business value delivered to our client, and explore how &quot;Data as a Product&quot; principles can strengthen client relationships for all of us.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/WRJYDF/</url>
            <location>Auditorium 3</location>
            
            <attendee>Josh Fairchild</attendee>
            
            <attendee>Liam Agnew</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>9BTPLD@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-9BTPLD</pentabarf:event-slug>
            <pentabarf:title>Machine Learning Pipelines in Higher Education: Lessons Learned Taking Models From Training to Production</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T160500</dtstart>
            <dtend>20250418T164000</dtend>
            <duration>003500</duration>
            <summary>Machine Learning Pipelines in Higher Education: Lessons Learned Taking Models From Training to Production</summary>
            <description>In this talk, we will discuss some lessons learned working with human-centric data in higher education and the pitfalls you may encounter. The higher education student cycle begins with admissions, follows the student through the terms they attend, and ideally ends with graduation. Using this student lifecycle as a guide, we will dive into how the data available at each point of the student lifecycle and machine learning pipeline needs to be accounted for during training to prevent failures in production. We will also discuss how working with operational datasets imposes unique limits on our models and what to watch out for.

This talk is geared towards a general audience, though familiarity with machine learning will be helpful. 

Outline:

Introduction to the student lifecycle (5 min)

Introduction to machine learning pipelines (5 min)

Working with data from across the student lifecycle (10 min)

Working with operational datasets for a machine learning model (5 min)

Concluding thoughts and Q&amp;A (5 min)</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/9BTPLD/</url>
            <location>Auditorium 3</location>
            
            <attendee>Brian Richards</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>RHCHVC@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-RHCHVC</pentabarf:event-slug>
            <pentabarf:title>What is Geometric Algebra and can it help me?</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250418T164000</dtstart>
            <dtend>20250418T171500</dtend>
            <duration>003500</duration>
            <summary>What is Geometric Algebra and can it help me?</summary>
            <description>Geometric Algebra (GA) is a mathematical language that has recently received significant attention from the computer graphics and engineering communities. Proponents of GA claim that it provides a geometrically intuitive interface, concise syntax, and the ability to unify several of the most important algebras. This talk will discuss the pros and cons of GA as a practical computational tool in Python data science. The first half of the talk will introduce the concepts of GA, and the second half will provide concrete demonstrations with the Kingdon library. 
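To preview the flavor of GA ahead of the Kingdon demonstrations, here is a hand-rolled sketch of the geometric product in 2D (a toy illustration under the usual Euclidean signature; Kingdon handles arbitrary algebras):

```python
# 2D multivectors as (scalar, e1, e2, e12) with e1*e1 = e2*e2 = 1
# and e1*e2 = -e2*e1 = e12 (the unit bivector, which squares to -1).
def gp(a, b):
    """Geometric product of two 2D multivectors."""
    s1, x1, y1, b1 = a
    s2, x2, y2, b2 = b
    return (
        s1*s2 + x1*x2 + y1*y2 - b1*b2,   # scalar part
        s1*x2 + x1*s2 - y1*b2 + b1*y2,   # e1 part
        s1*y2 + y1*s2 + x1*b2 - b1*x2,   # e2 part
        s1*b2 + b1*s2 + x1*y2 - y1*x2,   # e12 part
    )

e1 = (0, 1, 0, 0)
e2 = (0, 0, 1, 0)
print(gp(e1, e1))  # → (1, 0, 0, 0): a vector squares to a scalar
print(gp(e1, e2))  # → (0, 0, 0, 1): perpendicular vectors multiply to a bivector
```

Note how one product subsumes both the dot product (the scalar part) and the cross-product-like area element (the bivector part); this unification is what the talk's "concise syntax" claim refers to.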
While geared toward data scientists, this talk can be enjoyed by anyone interested in applied mathematics. A basic background in linear algebra will be helpful, and those using vector algebra, complex numbers, quaternions, rotation matrices, and the like will be especially interested. The audience should leave with a grasp of what GA is and what it isn&#x27;t, so that they can decide if it is a tool worthy of their cognitive investment.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Talk</category>
            <url>https://cfp.pydata.org/virginia2025/talk/RHCHVC/</url>
            <location>Auditorium 3</location>
            
            <attendee>Alex Arsenovic</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>7EUB8R@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-7EUB8R</pentabarf:event-slug>
            <pentabarf:title>Mastering LLMs: From Prompt Engineering to Agentic AI</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T090000</dtstart>
            <dtend>20250419T103000</dtend>
            <duration>013000</duration>
            <summary>Mastering LLMs: From Prompt Engineering to Agentic AI</summary>
            <description>The rapid evolution of AI and Large Language Models (LLMs) has opened new possibilities for automation, content generation, and interactive agents. This hands-on workshop is designed for developers, researchers, and AI enthusiasts who want to deepen their understanding of LLMs and learn how to harness their full potential. Topics covered include:
- How LLMs work and the role of reinforcement learning in training
- The art and science of prompt engineering, including zero-shot and few-shot techniques
- Retrieval-Augmented Generation (RAG) for integrating external knowledge
- Agentic AI: Designing chatbots and workflow agents
- Fine-tuning models using LoRA for custom behaviors
- Evaluation methods for improving AI performance
- Future trends, including multimodal models and new interaction paradigms
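The prompt-engineering portion can be previewed with a minimal, hedged sketch of zero-shot vs. few-shot prompt construction (the task and examples are hypothetical, and no model call is made):

```python
# Illustrative only: zero-shot vs. few-shot prompts built as plain strings.
task = "Classify the sentiment of the review as positive or negative."

# Zero-shot: the instruction alone, no worked examples.
zero_shot = f"{task}\n\nReview: The plot dragged on forever.\nSentiment:"

# Few-shot: the same instruction preceded by labeled demonstrations.
few_shot_examples = [
    ("An instant classic, I loved it.", "positive"),
    ("A waste of two hours.", "negative"),
]
demos = "\n".join(
    f"Review: {text}\nSentiment: {label}" for text, label in few_shot_examples
)
few_shot = f"{task}\n\n{demos}\n\nReview: The plot dragged on forever.\nSentiment:"
```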
Attendees will leave with practical skills, implementation strategies, and insights into the future of AI-powered applications.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/7EUB8R/</url>
            <location>Room 120</location>
            
            <attendee>John Berryman</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>XPFPFE@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-XPFPFE</pentabarf:event-slug>
            <pentabarf:title>Building Rich RAG Systems with Docling: Unlock Information from Tables, Images, and Complex Documents</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T110000</dtstart>
            <dtend>20250419T123000</dtend>
            <duration>013000</duration>
            <summary>Building Rich RAG Systems with Docling: Unlock Information from Tables, Images, and Complex Documents</summary>
            <description>### Overview and Objectives
This tutorial leverages Docling (https://ds4sd.github.io/docling/), a powerful open-source library designed for advanced document processing and AI integration. The session aims to equip data scientists and ML engineers with practical skills for building robust RAG systems by utilizing Docling&#x27;s comprehensive feature set. We will work through scenarios such as extracting multi-page tables, processing research papers while preserving multi-column layouts and equations, and managing technical documentation that contains code blocks and diagrams. Through these examples, you&#x27;ll gain practical experience in building document processing pipelines that outperform traditional extraction tools.

Participants will learn how to:
- Process and parse various document formats (PDF, DOCX, HTML) using Docling
- Extract structured information including tables, formulas, and images
- Implement effective text chunking strategies for optimal retrieval
- Create vector databases for semantic search
- Integrate the pipeline with LLM frameworks for end-to-end RAG solutions
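As a flavor of the chunking step above, here is a hedged sketch of a naive fixed-size, overlapping chunker (Docling ships its own structure-aware chunkers, which the tutorial covers; this baseline is only for orientation):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks with overlap, a naive retrieval
    baseline; structure-aware chunkers keep tables and sections intact."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# 500 words, 200-word chunks, 50-word overlap -> 3 overlapping chunks.
chunks = chunk_text("word " * 500, chunk_size=200, overlap=50)
```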

### Target Audience
This tutorial is designed for:
- Data scientists and ML engineers working on document processing and LLM applications
- Software developers implementing RAG systems
- Anyone interested in building production-ready document processing pipelines

**Experience Level:** Intermediate

**Prerequisites:**
- Basic Python programming knowledge
- Familiarity with basic NLP concepts
- Understanding of LLMs and vector databases (basic level)

### Technical Requirements
Participants should have:
- Python 3.10 or 3.11 installed
- A code editor or IDE
- Ability to install Python packages via pip
- 4GB+ of free disk space for models and dependencies

### Detailed Outline (90 minutes)

1. Introduction and Setup (15 minutes)
   - RAG system architecture overview
   - Setting up the development environment
   - Installing Docling and dependencies


2. Document Processing with Docling (25 minutes)
   - Understanding Docling&#x27;s document processing capabilities
   - Comparing traditional PDF extraction vs. Docling&#x27;s advanced parsing
   - Advanced extraction of tables, images, and complex layouts
   - Hands-on exercise: Processing sample documents with rich content


3. Building the RAG Pipeline (25 minutes)
   - Creating rich vector embeddings that preserve document structure
   - Integration with LLM frameworks
   - Hands-on exercise: Building a complete RAG pipeline


4. Best Practices and Production Considerations (15 minutes)
   - Performance optimization techniques
   - Using accelerators 
   - Docling-serve (https://github.com/docling-project/docling-serve) for deploying Docling as an API service
   - Creating effective evaluations



5. Q&amp;A and Interactive Problem Solving (10 minutes)
   - Addressing participant questions
   - Troubleshooting common issues
   - Discussion of real-world applications


### Materials
https://github.com/KrishnaRekapalli/docling-rag-tutorial-pydata-2025

### Pre-work
Make sure that you have a Hugging Face access token or a Replicate API key for LLM inference. Both platforms offer some free inference credit without requiring a credit card. Another option is running models locally with Ollama. For more details, check https://github.com/KrishnaRekapalli/docling-rag-tutorial-pydata-2025

### Key Takeaways
Participants will leave the tutorial with:
- Practical experience in building RAG systems
- Understanding of document processing best practices
- Ability to extract and utilize information from complex document elements
- Hands-on experience comparing traditional vs. advanced extraction methods
- Knowledge of common pitfalls and how to avoid them
- Strategies for handling tables and images in RAG systems</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/XPFPFE/</url>
            <location>Room 120</location>
            
            <attendee>Krishna Rekapalli</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>3JXT7N@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-3JXT7N</pentabarf:event-slug>
            <pentabarf:title>Build Your Own Data Science AI Agents</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T133000</dtstart>
            <dtend>20250419T150000</dtend>
            <duration>013000</duration>
            <summary>Build Your Own Data Science AI Agents</summary>
            <description>**Prerequisites**:
1. An OpenAI developer API key. If you do not have one, this video walks through creating an account and generating a key: https://www.youtube.com/watch?v=JuAOOO18ycg
2. A LangSmith API key: https://smith.langchain.com/


**Tutorial Materials**: available at this Google Drive link: https://drive.google.com/drive/folders/1keoQYO6iEm_b9olxxcWgOfmpipProaPJ?usp=drive_link

This hands-on tutorial will guide participants through designing, building, and deploying AI agents to streamline data science tasks.

**What You’ll Learn**
This tutorial will provide a deep dive into AI agents and multi-agent systems, covering:
- The role of AI agents in automating data science tasks such as data preprocessing, feature engineering, model selection, and evaluation.
- How to design a multi-agent system that efficiently distributes tasks while ensuring reliability and accuracy.
- Strategies for incorporating AI agents into everyday workflows to save time and enhance productivity.
- Common challenges, trade-offs, and best practices when using AI agents in data science.
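The task-distribution idea above can be sketched, very loosely, with plain functions standing in for agents (a conceptual toy only, not the frameworks used in the tutorial, where agents would call an LLM):

```python
# Toy registry that routes data-science tasks to specialized "agents".
def preprocess_agent(data):
    """Drop missing values before downstream steps."""
    return [x for x in data if x is not None]

def eda_agent(data):
    """Summarize the (cleaned) data."""
    return {"rows": len(data), "mean": sum(data) / len(data)}

AGENTS = {"preprocess": preprocess_agent, "eda": eda_agent}

def run_pipeline(data, steps):
    """Dispatch each named step to its agent, chaining cleaned data forward."""
    report = {}
    for step in steps:
        result = AGENTS[step](data)
        if step == "preprocess":
            data = result          # cleaned data feeds the later agents
        report[step] = result
    return report

report = run_pipeline([1, None, 3, 5], ["preprocess", "eda"])
```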

**Tutorial Structure**
1. Introduction to AI Agents in Data Science (15 minutes)
- What are AI agents, and how do they fit into data science workflows?
- Examples of AI-driven automation in data science.
- Overview of multi-agent collaboration for data-related tasks.
2. Setting Up the Development Environment (10 minutes)
- Tools and frameworks for building AI agents in data science.
- Accessing tutorial materials (Google Drive).
3. Building an AI-Driven Data Science Workflow (40 minutes)
- Hands-on implementation: Automating exploratory data analysis (EDA), data preprocessing, model training, and evaluation with AI agents.
- Orchestrating agent collaboration for complex workflows.
- Ensuring accuracy, reliability, and interpretability in AI-assisted data tasks.
4. Challenges, Trade-offs, and Best Practices (15 minutes)
5. Q&amp;A and Wrap-Up (10 minutes)
- Discussion on real-world applications and industry adoption.
- Key takeaways and next steps for implementing AI agents in data projects.

**Who Should Attend?**
This tutorial is designed for data analysts, data scientists, machine learning practitioners, and AI engineers looking to integrate AI agents into their workflows. Attendees should have a basic understanding of Python and machine learning concepts. 

**Prerequisites &amp; Materials**
- Skill Level: Intermediate (basic Python and ML knowledge recommended).
- Resources: A Google Colab environment for hands-on execution (no local installation required).

By the end of this tutorial, participants will have a practical framework for using AI agents to automate and optimize data science workflows, improving efficiency and scalability in their projects.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/3JXT7N/</url>
            <location>Room 120</location>
            
            <attendee>Niharika Krishnan</attendee>
            
            <attendee>Chuxin Liu</attendee>
            
            <attendee>Astha Puri</attendee>
            
            <attendee>Michelle Rojas</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>LMBBBF@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-LMBBBF</pentabarf:event-slug>
            <pentabarf:title>Blazing the AI Trail: Using LangGraph to Conquer the Oregon Trail</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T153000</dtstart>
            <dtend>20250419T170000</dtend>
            <duration>013000</duration>
            <summary>Blazing the AI Trail: Using LangGraph to Conquer the Oregon Trail</summary>
            <description>Despite the growing excitement around AI agents, many practitioners lack clear guidance on how to implement them effectively. This workshop aims to bridge that gap by providing a structured, hands-on approach to building AI agent workflows with LangGraph. Participants will create an agent capable of playing the Oregon Trail and making in-game decisions, illustrating in a fun way not only how to implement agents but also when, why, and for what sorts of problems. 

Session outline:
1. **Understanding Agent Workflows (10 min)**
    - Overview of agentic workflows and their importance
    - When and why to build agent workflows
2. **Building a Basic LangGraph Agent (20 min)**
    - Setting up the LangGraph framework
    - Defining discrete operations with custom tools
3. **Enhancing Agent Capabilities (20 min)**
    - Structuring output for API interactions
    - Implementing vector retrieval for RAG to improve contextual responses
4. **Optimizing for Performance and Control (25 min)**
    - Creating a semantic cache to reduce LLM latency and cost
    - Implementing allow/block list routing for controlled execution
5. **Review and Discuss (15 min)**
    - Review what was just accomplished and why
    - Discuss any design challenges or open debugging questions
    - Open Q&amp;A for questions related to best practice
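The semantic cache in section 4 can be sketched as follows (a hedged toy using bag-of-words cosine similarity in place of the real embedding similarity a production cache would use):

```python
import math
from collections import Counter

# Toy stand-in for embedding similarity: bag-of-words cosine.
def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a new prompt is similar enough."""
    def __init__(self, threshold=0.8):
        self.entries = []          # list of (vector, response)
        self.threshold = threshold

    def get(self, prompt):
        v = _vec(prompt)
        for vec, response in self.entries:
            if _cosine(v, vec) >= self.threshold:
                return response    # cache hit: skip the LLM call
        return None

    def put(self, prompt, response):
        self.entries.append((_vec(prompt), response))

cache = SemanticCache()
cache.put("how much food should we buy", "200 pounds")
hit = cache.get("how much food should we buy now")       # near-duplicate
miss = cache.get("which river crossing is safest")       # unrelated
```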

This workshop has been tested with participants at a variety of levels and typically takes ~60 minutes to complete once the environment setup has been confirmed in advance.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/LMBBBF/</url>
            <location>Room 120</location>
            
            <attendee>Robert Shelton</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>HNWLPV@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-HNWLPV</pentabarf:event-slug>
            <pentabarf:title>Responsible AI with SciPy</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T090000</dtstart>
            <dtend>20250419T103000</dtend>
            <duration>013000</duration>
            <summary>Responsible AI with SciPy</summary>
            <description>The tutorial provides an introduction to Responsible AI using SciPy.

The session will begin with an overview of Responsible AI concepts and of SciPy&#x27;s core features, followed by a hands-on tutorial on how to implement these concepts with SciPy.

The following items will be covered during the tutorial. 

- Data Processing and Validation 
- Bias Detection and Mitigation 
- Sensitivity Analysis 
- Explainability and Transparency 
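As a taste of the bias-detection topic, a hedged sketch with SciPy (the counts are hypothetical, purely for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (illustration only, not real clinical data):
# rows = demographic groups A/B, columns = favorable / unfavorable outcome.
table = np.array([[90, 10],
                  [60, 40]])

# Chi-square test of independence: is outcome associated with group?
chi2, p, dof, expected = chi2_contingency(table)

# Simple fairness-style summary: gap in favorable-outcome rates.
rate_a = table[0, 0] / table[0].sum()
rate_b = table[1, 0] / table[1].sum()
gap = rate_a - rate_b
```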

Each topic will be demonstrated with examples, including links to extended tutorials featuring real-world applications from the healthcare industry.

By the end of this session, attendees will have a solid understanding of how to use SciPy for Responsible AI Applications. Additionally, they will be able to apply these concepts to their own projects immediately.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/HNWLPV/</url>
            <location>Room 130</location>
            
            <attendee>Andrea Hobby</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>RQCCPA@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-RQCCPA</pentabarf:event-slug>
            <pentabarf:title>Data Viz in Python as a Tool to Study HIV Health Disparities</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T110000</dtstart>
            <dtend>20250419T123000</dtend>
            <duration>013000</duration>
            <summary>Data Viz in Python as a Tool to Study HIV Health Disparities</summary>
            <description>Targeted to the intermediate Python user, this session will begin with a brief overview of the tools and libraries that will be used, such as Pandas, Matplotlib, Seaborn, Plotly, and GeoPandas. Participants will do hands-on coding, exploring how to transform secondary data into practical, professional visuals. Key coding topics include:
1. Data Preprocessing and Exploration:
- Advanced techniques in Pandas for cleaning and reshaping datasets, including handling missing data and filtering key variables.
- Conducting exploratory data analysis (EDA) to uncover trends and patterns related to HIV disparities.

2. Building Complex Visualizations:
- Heatmaps with Seaborn to visualize correlations between demographic factors and health outcomes.
- Geospatial maps using GeoPandas and Plotly to pinpoint regions with high HIV prevalence and disparities in care access.
- Bar plots, stacked charts, and histograms to analyze outcomes across intersectional demographics.
- Time series plots using Matplotlib and Seaborn to explore temporal changes in HIV rates and interventions.

3. Next Steps:
- Share Findings with Stakeholders: Present the visualizations and key insights to relevant stakeholders, such as public health officials, policymakers, healthcare providers, and community organizations, using clear and actionable language.
- Develop Targeted Interventions: Use the insights from the analysis to design and propose interventions aimed at addressing identified disparities, such as community outreach programs, resource allocation strategies, or policy changes.
- Monitor and Evaluate Impact: Implement a plan to track the effectiveness of interventions using measurable outcomes, such as reductions in infection rates or improvements in access to care, and iterate on strategies based on the results.
- Build Collaborative Partnerships: Partner with community organizations, research institutions, and funding agencies to amplify efforts, secure resources, and ensure sustained action to address health disparities over time.
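The preprocessing and aggregation steps above can be sketched with synthetic data (the column names are placeholders for illustration, not a real surveillance schema):

```python
import pandas as pd

# Synthetic, hypothetical data: placeholder columns, not real HIV data.
df = pd.DataFrame({
    "region":    ["North", "North", "South", "South", "South"],
    "year":      [2021, 2022, 2021, 2022, 2022],
    "new_cases": [120, None, 340, 310, None],
})

# Handle missing data, then aggregate into a plot-ready series.
clean = df.dropna(subset=["new_cases"])
by_region = clean.groupby("region")["new_cases"].mean()
```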

This session will emphasize practical, hands-on coding, and participants are encouraged to follow along to develop scripts they can apply to their own datasets. By the end of the session, attendees will have a deeper understanding of how to use Python for data visualization and actionable insights in public health.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/RQCCPA/</url>
            <location>Room 130</location>
            
            <attendee>Dr. Kimberly Deas</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>WAWAHD@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-WAWAHD</pentabarf:event-slug>
            <pentabarf:title>Getting Started with RAPIDS: GPU-Accelerated Data Science for PyData Users</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T133000</dtstart>
            <dtend>20250419T150000</dtend>
            <duration>013000</duration>
            <summary>Getting Started with RAPIDS: GPU-Accelerated Data Science for PyData Users</summary>
            <description>[NVIDIA](https://www.nvidia.com/) GPUs offer unmatched speed and efficiency for data processing and model training, significantly reducing the time and cost associated with these tasks. The appeal of GPUs becomes even stronger with zero-code-change libraries and plugins, allowing you to take advantage of GPU acceleration without having to rewrite your existing code. With [RAPIDS](https://rapids.ai/), you can use popular PyData libraries like **pandas**, **polars**, and **networkx** while reaping the performance benefits of GPUs.

This tutorial provides an introduction to **RAPIDS**, an open-source suite of libraries that accelerates data science and machine learning workflows using GPU technology. Aimed at data scientists and machine learning practitioners of all experience levels, the session will focus on how RAPIDS can be seamlessly integrated into existing data pipelines to achieve substantial performance improvements with minimal code changes.

Through hands-on coding exercises, attendees will explore the RAPIDS ecosystem, including **cuDF** (GPU-accelerated pandas) and **cuML** (GPU-accelerated machine learning), and learn how to integrate these tools into their workflows to accelerate tasks like data processing and model training. By the end of this tutorial, they&#x27;ll understand how RAPIDS integrates with the PyData ecosystem and how it can significantly speed up their workflows.

The target audience for this tutorial is data scientists and machine learning practitioners. No prior GPU knowledge is required, but participants should have some experience with Python, pandas, and scikit-learn.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/WAWAHD/</url>
            <location>Room 130</location>
            
            <attendee>Naty Clementi</attendee>
            
            <attendee>Mike McCarty</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>B9RT3L@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-B9RT3L</pentabarf:event-slug>
            <pentabarf:title>From Pandas to PySpark</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T153000</dtstart>
            <dtend>20250419T170000</dtend>
            <duration>013000</duration>
            <summary>From Pandas to PySpark</summary>
            <description>This tutorial aims to close the gap between small-scale data analysis and big data processing. If you’ve ever tried to load a multi-gigabyte CSV into pandas or Excel, you know the frustration of crashing programs and endless waits. This tutorial shows how to level up your data skills using PySpark’s distributed DataFrame API.

We’ll do more than just introduce Spark concepts—we’ll work through a lively anime dataset full of ratings, genres, and user insights, so you can see how PySpark handles real-world tasks (like filtering, grouping, and joining) at scale. You’ll get comfortable with Spark’s architecture and learn how it uses lazy evaluation, cluster computing, and in-memory operations to achieve speedups. One highlight of the workshop is its hands-on approach: all exercises will be run in Google Colab. That means zero friction in setup—no cluster installation or environment wrangling. We’ll walk through the entire pipeline: loading massive CSV files, performing transformations that mirror pandas operations, and drawing insights through SQL-like queries.
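The pandas-to-Spark mapping can be previewed with a hedged sketch (hypothetical anime-dataset columns; the PySpark counterparts appear as comments, assuming `from pyspark.sql import functions as F`):

```python
import pandas as pd

# Hypothetical slice of an anime ratings table (columns for illustration).
df = pd.DataFrame({
    "title":  ["A", "B", "C", "D"],
    "genre":  ["action", "drama", "action", "drama"],
    "rating": [8.5, 6.9, 9.1, 7.8],
})

# pandas operation                           # PySpark counterpart
top = df[df["rating"] > 7.0]                 # df.filter(df.rating > 7.0)
avg = top.groupby("genre")["rating"].mean()  # df.groupBy("genre").agg(F.avg("rating"))
```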

Expect a fast-paced but accessible look at Spark’s key features, practical code examples, and best practices to keep your big data workflows efficient and transparent.

Tutorial Outline
- Why Spark?: A short overview of Hadoop MapReduce and how Spark rose to address its shortcomings.
- Distributed Data 101: Breaking down Spark’s architecture, executors, and lazy evaluation.
- Hands-On Setup: Launching PySpark in Google Colab so everyone can follow along in real time.
- Exploring the Anime Dataset: Reading data from CSV, structuring DataFrames, and performing data cleaning.
- Common Operations at Scale: Filtering, grouping, and aggregating millions of rows with PySpark.
- Comparisons to Pandas: Mapping familiar DataFrame operations to their Spark counterparts.
- Final Thoughts: Discussion of where Spark fits into modern data stacks, plus pointers for advanced usage (MLlib, streaming, cluster optimization).</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/B9RT3L/</url>
            <location>Room 130</location>
            
            <attendee>Cynthia Ukawu</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>SHTFQY@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-SHTFQY</pentabarf:event-slug>
            <pentabarf:title>Tutorial on Image Classification using Scikit-Image, Scikit-learn, and PyTorch</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T090000</dtstart>
            <dtend>20250419T103000</dtend>
            <duration>013000</duration>
            <summary>Tutorial on Image Classification using Scikit-Image, Scikit-learn, and PyTorch</summary>
            <description>Welcome to the exciting world of computer vision and machine learning!  This tutorial presents foundational computer vision operations to prepare you to build your first successful classification pipeline.  My goal is to help guide you past potential pitfalls and present topics for consideration as you embark on your machine learning journey.

1. Computer Vision Basics
   * The Basics
   * Software and Packages
2. Image Segmentation
   * Preprocessing (histograms, filters)
   * Thresholding
   * Morphological Operators
   * Advanced Segmentation
3. Feature Extraction
   * Textures
      * GLCM
      * LBP
4. Model Development - scikit-learn
   * scikit-learn
       * Gaussian Process
5. Feature Importance
   * Shapley
6. Neural Networks - PyTorch
7.  Model Development
    * CNN
    * Transfer Learning
8. Model Performance
   * Tensorboard
   * Saliency map

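As a flavor of the thresholding topic in the outline, here is a hedged, pure-NumPy sketch of Otsu's method (scikit-image provides this as `skimage.filters.threshold_otsu`; the synthetic "image" is for illustration only):

```python
import numpy as np

def otsu_threshold(image, nbins=256):
    """Pick the threshold that maximizes between-class variance."""
    counts, edges = np.histogram(image, bins=nbins)
    mids = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(counts)                        # pixels at or below each bin
    w1 = w0[-1] - w0                              # pixels above each bin
    sum0 = np.cumsum(counts * mids)
    mu0 = sum0 / np.where(w0 == 0, 1, w0)         # mean of the dark class
    mu1 = (sum0[-1] - sum0) / np.where(w1 == 0, 1, w1)  # mean of the bright class
    variance = w0 * w1 * (mu0 - mu1) ** 2         # between-class variance
    return mids[np.argmax(variance)]

# Synthetic bimodal "image": a dark class near 60 and a bright class near 180.
rng = np.random.default_rng(0)
image = np.concatenate([rng.normal(60, 10, 500), rng.normal(180, 10, 500)])
t = otsu_threshold(image)
mask = image > t                                  # foreground segmentation
```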
Notebooks will be available prior to the start of the tutorial. Please come prepared with the following Python packages installed:
* numpy
* pandas
* scikit-learn
* scikit-image 
* torch
* torchvision
* tensorboard</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/SHTFQY/</url>
            <location>Room 140</location>
            
            <attendee>Matt Litz</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>GYFR7G@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-GYFR7G</pentabarf:event-slug>
            <pentabarf:title>A Beginner&#x27;s Guide to Variational Inference</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T110000</dtstart>
            <dtend>20250419T123000</dtend>
            <duration>013000</duration>
            <summary>A Beginner&#x27;s Guide to Variational Inference</summary>
            <description>## Description

This tutorial is **for data scientists, statisticians, and machine learning practitioners who are comfortable with Python and the basics of probability**.

We’ll break down the mechanics of variational inference (VI) and its application in PyMC in an approachable way, starting with intuitive explanations and building up to practical examples.

Participants will learn how to apply ADVI and Pathfinder in PyMC and evaluate their results against MCMC, gaining insights into when and why to choose VI.
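The core idea can be illustrated numerically before touching PyMC (a hedged toy, not PyMC code): for a Gaussian target posterior, choosing the member of a Gaussian family that minimizes KL(q || p), equivalently maximizes the ELBO, recovers the target exactly:

```python
import numpy as np

# Target posterior p = N(2, 1); approximating family q = N(m, s^2).
def kl_gauss(m, s, mu=2.0, sigma=1.0):
    """Closed-form KL( N(m, s^2) || N(mu, sigma^2) )."""
    return np.log(sigma / s) + (s**2 + (m - mu) ** 2) / (2 * sigma**2) - 0.5

# Grid search over the variational parameters (real VI uses gradients).
ms = np.linspace(0.0, 4.0, 81)   # candidate means
ss = np.linspace(0.2, 2.0, 91)   # candidate standard deviations
M, S = np.meshgrid(ms, ss)
kl = kl_gauss(M, S)
i, j = np.unravel_index(np.argmin(kl), kl.shape)
best_m, best_s = M[i, j], S[i, j]  # should recover m = 2, s = 1
```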

### Takeaways

Participants will leave understanding:

- The fundamentals of VI and how it differs from MCMC.
- How to implement ADVI and Pathfinder in PyMC.
- Practical considerations when selecting and evaluating inference methods.

### Background Knowledge Required

- Basic understanding of probability and Bayesian inference.
- Familiarity with Python. Prior PyMC experience is helpful but not required.

### Materials Distribution

All materials, including notebooks and datasets, will be available on GitHub.

## Outline

1. **Introduction: Why Variational Inference?** (10 min)
- The limitations of MCMC for large datasets.
- Overview of VI: How it works and why it’s faster.

2. **Variational Inference Basics** (20 min)
- Key concepts: Evidence Lower Bound (ELBO), optimization, and approximation families.
- Intuitive explanation of ADVI and Pathfinder.

3. **Implementing VI with PyMC** (15 min)
- Step-by-step walkthrough of VI with a linear model.
- Comparing ADVI, Pathfinder, and MCMC.

4. **Evaluating VI Approximations** (10 min)
- How to measure the quality of VI approximations (ELBO, simulation-based calibration, etc.).
- Practical trade-offs between speed and accuracy.

5. **Scaling Up: Complex Models and Real-World Applications** (25 min)
- Applying VI to hierarchical and large-scale models.
- Tips for debugging and optimizing VI workflows.

6. **Open Discussion and Q&amp;A** (10 min)
- Address audience-specific use cases and questions.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/GYFR7G/</url>
            <location>Room 140</location>
            
            <attendee>Chris Fonnesbeck</attendee>
            
        </vevent>
        
        <vevent>
            <method>PUBLISH</method>
            <uid>WZKH8G@@cfp.pydata.org</uid>
            <pentabarf:event-id></pentabarf:event-id>
            <pentabarf:event-slug>-WZKH8G</pentabarf:event-slug>
            <pentabarf:title>Introduction to Wikidata</pentabarf:title>
            <pentabarf:subtitle></pentabarf:subtitle>
            <pentabarf:language>en</pentabarf:language>
            <pentabarf:language-code>en</pentabarf:language-code>
            <dtstart>20250419T133000</dtstart>
            <dtend>20250419T150000</dtend>
            <duration>013000</duration>
            <summary>Introduction to Wikidata</summary>
            <description>Wikipedia is a general reference source written for humans to read. Wikidata is its interconnected, structured data complement, accessible through queries. We will consider Wikidata&#x27;s purpose, scope, and editorial community, then query for interesting results in pop culture, science, civics, and more. Attendees will learn how to access sample queries, including through Jupyter Notebooks.</description>
            <class>PUBLIC</class>
            <status>CONFIRMED</status>
            <category>Tutorial</category>
            <url>https://cfp.pydata.org/virginia2025/talk/WZKH8G/</url>
            <location>Room 140</location>
            
            <attendee>Lane Rasberry</attendee>
            
            <attendee>Robin Isadora Brown</attendee>
            
        </vevent>
        
    </vcalendar>
</iCalendar>
