Know Your Data(Frame) with Paguro: Declarative and Composable Validation and Metadata using Polars PyData Seattle 2025

2025-11-07 14:35–15:20, Room 313

Modern data pipelines are fast and expressive, but ensuring data quality is often less straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library built on top of the Polars DataFrame library. Paguro enables users to validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate this lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.

Objective

To show how Paguro makes data validation simple, declarative, and non-intrusive in DataFrame-centered pipelines, while supporting advanced use cases like multi-DataFrame validation and cross-DataFrame relationship checks. The goal is to help attendees understand how to integrate validation naturally into their workflow and why expressive diagnostics improve debugging and trust.

Central Thesis

Validation should align with how users already write Data(Lazy)Frame code: concise, composable, and expressive. Paguro delivers this by allowing users to introduce validation without restructuring their logic, while producing clear, context-rich terminal reports that explain not only what failed, but why and where.

Outline

Motivation: Why Validation Matters

Validation in tabular workflows can be a natural extension of how transformations are already written. By grounding it in the expressive power of Polars expressions, we can create a framework that makes validations simpler to author, more composable, and capable of capturing a wider range of correctness checks.

Introducing Paguro

Design goals: a non-intrusive, expressive, Polars-native API
Two core functions that cover most validation needs, with a rich ecosystem around them for more complex cases

Validation in Practice

Single-Data(Lazy)Frame checks where Polars expressions are first-class objects
Validating collections of Data(Lazy)Frames together
Checking cross-DataFrame relationships
Demonstration of validation in a pipeline

Diagnostics and User Experience

Beautifully formatted terminal output with detailed error reports
Metadata and descriptive rules for interpretability

Takeaways and Applications

Where Paguro fits: exploration, production, automated checks
Benefits: modularity, clarity, maintainability
How it complements the Polars ecosystem

Key Takeaways

Validation can be added to existing Polars pipelines without rewriting the code already in place.
Paguro allows checks on both single Data(Lazy)Frames and collections of Data(Lazy)Frames, including cross-DataFrame relationships.
Users get beautiful, context-rich diagnostics that make debugging easier and data pipelines more trustworthy.

Audience

Data scientists, data engineers, and developers who work with tabular data.

Background Knowledge

Assumes basic familiarity with Python and DataFrame-based workflows.
Some experience with DataFrame libraries (Polars) is assumed.
Some knowledge of Polars expressions is helpful but not required.

Prior Knowledge Expected: Previous knowledge expected

Bernardo Dionisi

Hi, I’m Bernardo. I earned my PhD at Duke University, where I studied the economics of innovation. That work drew me into the practical challenges of data—how to make pipelines reliable, how to integrate validation naturally, and more recently, how these tools can be combined with AI.
