PyData Tel Aviv 2025

Let Your Data Tell Its Story: Building a Lightweight In-House Data Lineage Solution
2025-11-05, Eng

What if your data could tell you its own story: where it came from, how it moved, and how it was used? In this talk, we’ll show you how to build a lightweight, in-house data lineage tool that brings that story to life. By capturing how data flows through your pipelines and systems, you gain instant visibility into dependencies, usage, and the downstream impact of data changes and failures. Whether you're tracking down broken pipelines or auditing data usage, knowing your data’s lineage gives you the context needed to take quick, informed action.


Understanding how data moves through your systems is essential—not just for ensuring data quality, but for responding quickly when things break. In this session, we’ll share how we built a fast, flexible, and low-overhead data lineage tool tailored to our pipelines and team needs. It’s designed to integrate seamlessly into existing workflows and help answer high-stakes questions: What processes could be impacted if a data source is replaced? If my process fails, who else is affected? Is the data pipeline our team spent months building actually being used?
We’ll walk through how we designed this tool using Airflow, Python, and Redis, and how it helps us trace dependencies, surface critical usage patterns, and act quickly when things go sideways. Whether you're running a modern data stack or wrangling legacy pipelines, you’ll leave with a practical blueprint—and a few hard-earned lessons—for implementing your own in-house data lineage system, no vendor required.
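
To make the questions above concrete, here is a minimal sketch of how lineage might be modeled as a directed graph and queried for downstream impact. It is an illustration under stated assumptions, not the tool presented in the talk: the `LineageGraph` class, its method names, and the dataset names are all hypothetical, and the graph lives in plain Python dictionaries rather than in Redis.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical in-memory lineage graph; edges point from producer to consumer."""

    def __init__(self):
        # Maps each node to the set of nodes that read from it.
        self._downstream = defaultdict(set)

    def record(self, source, target):
        """Record that `target` consumes data produced by `source`."""
        self._downstream[source].add(target)

    def impacted_by(self, node):
        """Return every node reachable downstream of `node` (breadth-first)."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for consumer in self._downstream.get(current, ()):
                # Skip nodes already visited so cycles cannot loop forever.
                if consumer not in seen and consumer != node:
                    seen.add(consumer)
                    queue.append(consumer)
        return seen

    def is_used(self, node):
        """True if at least one recorded consumer reads from `node`."""
        return bool(self._downstream.get(node))


# Illustrative (made-up) pipeline edges.
graph = LineageGraph()
graph.record("raw.transactions", "etl.clean_transactions")
graph.record("etl.clean_transactions", "ml.fraud_features")
graph.record("etl.clean_transactions", "bi.daily_revenue")
graph.record("ml.fraud_features", "ml.fraud_model")
graph.record("raw.users", "etl.user_profiles")

# "If my process fails, who else is affected?"
print(sorted(graph.impacted_by("raw.transactions")))

# "Is the pipeline we built actually being used?"
print(graph.is_used("etl.user_profiles"))
```

In a Redis-backed variant, each node's consumer set could plausibly be stored as a Redis set keyed by node name, but that mapping is an assumption here and not a description of the speakers' design.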


Prior Knowledge Expected:

No previous knowledge expected

Ilana Makover enjoys finding bugs and solving them. She is a Machine Learning Engineer at Bluevine.