2025-11-07, Talk Track 2
Modern data pipelines are fast and expressive, but ensuring data quality is often less straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library built on top of the Polars DataFrame library. Paguro lets users validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate this lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.
Objective
To show how Paguro makes data validation simple, declarative, and non-intrusive in DataFrame-centered pipelines, while supporting advanced use cases like multi-DataFrame validation and cross-DataFrame relationship checks. The goal is to help attendees understand how to integrate validation naturally into their workflow and why expressive diagnostics improve debugging and trust.
Central Thesis
Validation should align with how users already write Data(Lazy)Frame code: concise, composable, and expressive. Paguro delivers this by allowing users to introduce validation without restructuring their logic, while producing clear, context-rich terminal reports that explain not only what failed, but why and where.
Outline
- Motivation: Why Validation Matters
  - Validation in tabular workflows can be a natural extension of how transformations are already written. By grounding it in the expressive power of Polars expressions, we can create a framework that makes validations simpler to author, more composable, and capable of capturing a wider range of correctness checks.
- Introducing Paguro
  - Design goals: non-intrusive, expressive, Polars-native API
  - Two core functions that cover most validation needs, with a rich ecosystem around them for more complex needs
- Validation in Practice
  - Single-Data(Lazy)Frame checks where Polars expressions are first-class objects
  - Validating collections of Data(Lazy)Frames together
  - Checking cross-DataFrame relationships
  - Demonstration of validation in a pipeline
- Diagnostics and User Experience
  - Beautifully formatted terminal output with detailed error reports
  - Metadata and descriptive rules for interpretability
- Takeaways and Applications
  - Where Paguro fits: exploration, production, automated checks
  - Benefits: modularity, clarity, maintainability
  - How it complements the Polars ecosystem
Key Takeaways
- Validation can be added to existing Polars pipelines without rewriting them.
- Paguro allows checks on both single Data(Lazy)Frames and collections of Data(Lazy)Frames, including cross-DataFrame relationships.
- Users get beautiful, context-rich diagnostics that make debugging easier and data pipelines more trustworthy.
Audience
Data scientists, data engineers, and developers who work with tabular data.
Background Knowledge
- Assumes basic familiarity with Python and DataFrame-based workflows.
- Some experience with a DataFrame library such as Polars is assumed.
- Some knowledge of Polars expressions is helpful but not required.
Speaker
Hi, I’m Bernardo. I earned my PhD at Duke University, where I studied the economics of innovation. That work drew me into the practical challenges of data—how to make pipelines reliable, how to integrate validation naturally, and more recently, how these tools can be combined with AI.