PyData Vermont 2025
Data are only as powerful as the trust they carry and the communities they serve. In this talk, I’ll explore how building authentic partnerships transforms not just what data we collect, but how they are interpreted and acted upon. At a time when data shape decisions across every sector, it has never been more important for us to use data responsibly. When grounded in trust and collaboration, data have the power to illuminate our most complex challenges and break down silos to drive meaningful change.
This session explores the early design and back-end development of the Vermont Data Collaborative, an open source data dashboard for the state. Participants will engage in a collaborative discussion on incorporating feedback from community partners, with an emphasis on design considerations, data access, prototyping with the final project in mind, and potential pitfalls. The aim is both to provide a practical framework for your open source data project and to improve that framework itself.
Do you know where your data is? It's time to unlock the real power of location with Python. This talk is your practical guide to the open geospatial ecosystem, designed for data practitioners who are ready to turn location data into meaningful insights.
We'll cover the end-to-end workflow: from acquiring open data to performing powerful spatial joins and creating compelling maps. You will learn to use core libraries like GeoPandas and Shapely, and finally demystify one of the trickiest parts of geospatial work: coordinate reference systems.
This session is for data scientists, analysts, and engineers familiar with pandas who want to add a powerful new dimension to their work. You’ll leave with a clear, actionable roadmap for integrating geospatial analysis into your projects.
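As a taste of the workflow described above, here is a minimal sketch of an open-data spatial join with GeoPandas; the file name, columns, and coordinates are hypothetical placeholders, not materials from the talk.

```python
# Hypothetical sketch: join point records to the polygons that contain them.
import geopandas as gpd

# Load an open polygon dataset (e.g., town boundaries) from any common format.
towns = gpd.read_file("vt_town_boundaries.geojson")  # placeholder file

# Build a GeoDataFrame of points from plain longitude/latitude values.
points = gpd.GeoDataFrame(
    {"site_id": [1, 2]},
    geometry=gpd.points_from_xy([-73.212, -72.576], [44.476, 44.260]),
    crs="EPSG:4326",  # WGS84 longitude/latitude
)

# Reproject so both layers share a coordinate reference system before joining.
points = points.to_crs(towns.crs)

# Spatial join: attach the attributes of the containing town to each point.
joined = gpd.sjoin(points, towns, how="left", predicate="within")
print(joined.head())
```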
We will discuss fundamental linguistics and data science concepts that underpin the ability to extract signal from text. This talk brings theoretical context to general data science and NLP approaches. Topics will include the linguistic grounding of large language models (LLMs), basic NLP methods, and common pitfalls in textual analysis. We will also present some tools developed by our lab that can act as powerful lenses for textual data. Some examples we will use to approach these topics include: word frequency and distributions, Zipf’s law, the Distributional Hypothesis, allotaxonometry, sentiment, time series, and scale.
Takeaways from this talk will be theoretical background and tools that support a holistic approach to extracting signal from text, empowering attendees to engage critically with NLP applications in the wild and to deploy NLP approaches responsibly and creatively.
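To give a flavor of two of the listed topics, word frequency and Zipf's law, here is a small self-contained illustration; the sample text is a placeholder rather than a corpus from the talk.

```python
# Rank words by frequency and compare the observed counts against the
# roughly 1/rank decay that Zipf's law predicts.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox sleeps"
counts = Counter(text.lower().split())

ranked = counts.most_common()
top_freq = ranked[0][1]
for rank, (word, freq) in enumerate(ranked, start=1):
    zipf_estimate = top_freq / rank  # Zipf's law: frequency proportional to 1/rank
    print(f"{rank:>2}  {word:<6}  observed={freq}  zipf~{zipf_estimate:.1f}")
```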
A 45-minute talk walking through a decision tree for choosing Python environment and package management tools geared toward scientific computing, with context on why Python environment management is so difficult -- spoiler: it's difficult for the same reasons Python is so popular, its flexibility and extensibility.
Apache Iceberg is an open table format that brings transactional guarantees, schema evolution, and broad interoperability across query engines. Learn what it is, how it works, and how it is used at BETA Technologies to empower analytics at scale.
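As a hedged sketch of what working with an Iceberg table from Python can look like, here is a minimal PyIceberg example; the catalog name, namespace, table, and filter column are invented for illustration and are not from the talk.

```python
# Read an Iceberg table with PyIceberg; catalog configuration is assumed to
# exist (e.g., in .pyiceberg.yaml).
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")                          # placeholder catalog name
table = catalog.load_table("analytics.flight_telemetry")   # placeholder table

# Push a filter down into the table scan, then materialize the result in pandas.
df = table.scan(row_filter="battery_temp_c > 45").to_pandas()
print(df.head())
```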
This hands-on workshop explores "small data" through the creation of physical and analog data visualizations. Inspired by projects like Giorgia Lupi's "Dear Data", participants will form small groups and collaborate to represent and humanize data from one of five pre-selected datasets. We will explore how these physical representations can foster community collaboration and new ways of seeing, hearing and sharing data.
Messy and inconsistent data is the curse of any analytic or modeling workflow. This talk uses the example of working with address data and demonstrates how natural-language-based approaches can clean and normalize addresses at scale. The presentation will showcase the results of several methods, ranging from naive regular-expression rules to third-party APIs, open-source address parsers, scalable LLM embeddings with vector search, and custom text embeddings.
Attendees will leave knowing when to choose each method and how to balance cost, speed, and precision.
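For context, here is a minimal sketch of the "naive regular expression rules" end of the spectrum; the rules and example address are illustrative and not the presenter's actual code.

```python
# Normalize a few common street-suffix and whitespace variants with regexes.
import re

SUFFIXES = {r"\bst\b\.?": "street", r"\bave\b\.?": "avenue", r"\brd\b\.?": "road"}

def normalize_address(raw: str) -> str:
    addr = raw.strip().lower()
    addr = re.sub(r"\s+", " ", addr)  # collapse repeated whitespace
    for pattern, replacement in SUFFIXES.items():
        addr = re.sub(pattern, replacement, addr)
    return addr

print(normalize_address("123  Main St., Burlington,VT"))
# -> "123 main street, burlington,vt"
# Rules like these break down quickly, which is why the talk also covers
# parsers and embedding-based methods.
```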
In this introductory hands-on tutorial, participants will learn how to accelerate their data workflows with RAPIDS, an open-source suite of libraries designed to leverage the power of NVIDIA GPUs for end-to-end data pipelines. Using familiar PyData APIs like cuDF (GPU-accelerated pandas) and cuML (GPU-accelerated machine learning), attendees will explore how to seamlessly integrate these tools into their existing workflows with minimal code changes, achieving significant speedups in tasks such as data processing and model training.
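A minimal sketch of the "minimal code changes" idea follows: the pandas-style call sites stay largely the same and only the imports change. The file name and column names are hypothetical, and an NVIDIA GPU with RAPIDS installed is assumed.

```python
import cudf                      # GPU-accelerated, pandas-like DataFrame library
from cuml.cluster import KMeans  # GPU-accelerated, scikit-learn-style estimator

# Load and prepare data on the GPU with familiar pandas-style calls.
df = cudf.read_csv("measurements.csv")   # placeholder file
features = df[["x", "y"]].dropna()       # placeholder columns

# Fit and predict on the GPU.
model = KMeans(n_clusters=4, random_state=0)
features["cluster"] = model.fit_predict(features)
print(features["cluster"].value_counts())
```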
If you have worked with AI in any capacity, you'll know that AI is only as valuable as the data it can leverage. Data is the cornerstone of AI, and developers need better ways to transform complex documents into structured data ready for model training and inference.
In this 90-minute, code-along workshop, you'll learn how to turn common, real-world documents and scans into structured data for search and RAG using Docling, an open-source toolkit for advanced document conversion that helps you bring your data into AI workflows more effectively. We'll complete three labs: Conversion, Chunking, and RAG, and you'll leave with runnable notebooks from a public GitHub repo.
Audience: Python practitioners shipping document-centric apps.
Prereqs: basic Python/Jupyter.
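As a preview of the Conversion lab, here is a minimal sketch using the Docling package; the sample PDF URL is a placeholder and is not the workshop repository.

```python
# Convert a document (PDF, DOCX, images, ...) into a structured representation.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")  # placeholder source

# Export the parsed, layout-aware document as Markdown, ready for chunking and RAG.
print(result.document.export_to_markdown()[:500])
```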
Learn to extend Claude's capabilities by building a Model Context Protocol (MCP) server that connects Claude Desktop to external APIs. This hands-on tutorial guides participants through creating a simple MCP server using Python and conda, demonstrating how to enable Claude to access real-time data from the New York Times Books API.
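For orientation, here is a hedged sketch of the kind of MCP server the tutorial builds, using the FastMCP helper from the official MCP Python SDK; the NYT endpoint details and the NYT_API_KEY environment variable are assumptions for illustration, not the tutorial's exact code.

```python
# Minimal MCP server exposing one tool that queries the NYT Books API.
import os
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("nyt-books")

@mcp.tool()
def best_sellers(list_name: str = "hardcover-fiction") -> list[dict]:
    """Return the current NYT best-seller list as title/author pairs."""
    resp = httpx.get(
        f"https://api.nytimes.com/svc/books/v3/lists/current/{list_name}.json",
        params={"api-key": os.environ["NYT_API_KEY"]},  # assumed env variable
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {"title": book["title"], "author": book["author"]}
        for book in resp.json()["results"]["books"]
    ]

if __name__ == "__main__":
    mcp.run()  # Claude Desktop launches this process via its MCP server config
```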