PyData Amsterdam 2025

Orchestrating success: How Vinted standardizes large-scale, decentralized data pipelines
09-26, 14:10–14:45 (Europe/Amsterdam), Nebula

At Vinted, Europe’s largest second-hand marketplace, over 20 decentralized data teams generate, transform, and build products on petabytes of data. Each team uses its own tools, workflows, and expertise. Coordinating data pipeline creation across such diverse teams presents significant challenges: complex inter-team dependencies, inconsistent scheduling solutions, and rapidly evolving requirements.

This talk is aimed at data engineers, platform engineers, and technical leads with experience in workflow orchestration and will demonstrate how we empower teams at Vinted to define data pipelines quickly and reliably. We will present our user-friendly abstraction layer built on top of Apache Airflow, enhanced by a Python code generator. This abstraction simplifies upgrades and migrations, removes scheduler complexity, and supports Vinted’s rapid growth. Attendees will learn how Python abstractions and code generation can standardize pipeline development across diverse teams, reduce operational complexity, and enable greater flexibility and control in large-scale data organizations. Through practical lessons and real-world examples of our abstraction interface, we will offer insights into designing scheduler-agnostic architectures for successful data pipeline orchestration.


This talk will present the architectural and practical decisions behind Vinted's approach to managing large-scale, decentralized data pipelines. Aimed at data engineers, platform builders, and technical leads familiar with orchestration tools (such as Apache Airflow, Prefect, or Dagster), the session will focus on the lessons learned from building and operating our "workflow abstraction layer," and the supporting code generation infrastructure that enables fast, reliable, and consistent pipeline delivery.

Key session topics:

  1. Background & Motivation
    - Describe Vinted's organizational landscape—multiple product verticals, autonomous data teams, and shared infrastructure.
    - Surface typical challenges: duplicated pipeline code, diverging scheduler solutions, migration issues, and silos across dbt, Docker-based, and ML workflows.
    - Explain the need for abstraction and unification to unlock velocity and safety at scale.

  2. Abstraction Layer Design
    - Present our user-facing abstraction API for describing pipelines, regardless of job type or runner.
    - Discuss how we enable standardization without constraining teams' unique needs or technical stacks.
    - Show how this API hides Airflow and scheduler-specific details, facilitating upgrades (e.g. Airflow 3.0 or even a completely different scheduler).
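As a flavor of what such a scheduler-agnostic interface could look like, here is a minimal sketch in plain Python dataclasses. All names (`Job`, `Pipeline`, the runner strings) are hypothetical illustrations, not Vinted's actual API; the point is that teams describe pipelines declaratively without importing Airflow anywhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    runner: str                      # e.g. "dbt", "docker", "vertex-ai"
    image: Optional[str] = None      # container image for docker-style runners
    depends_on: List[str] = field(default_factory=list)

@dataclass
class Pipeline:
    name: str
    owner: str
    schedule: str                    # plain cron; scheduler specifics stay hidden
    jobs: List[Job] = field(default_factory=list)

    def validate(self) -> None:
        # Catch broken intra-pipeline references before anything is deployed.
        known = {job.name for job in self.jobs}
        for job in self.jobs:
            missing = set(job.depends_on) - known
            if missing:
                raise ValueError(
                    f"{job.name!r} depends on unknown jobs: {sorted(missing)}"
                )

pipeline = Pipeline(
    name="listings_daily",
    owner="search-team",
    schedule="0 3 * * *",
    jobs=[
        Job(name="extract", runner="docker", image="registry.example/extract:latest"),
        Job(name="transform", runner="dbt", depends_on=["extract"]),
    ],
)
pipeline.validate()  # no scheduler imported anywhere above
```

Because the spec names only runners and dependencies, a generator can target Airflow today and a different scheduler tomorrow without changing team-owned code.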

  3. Python Code Generation & Deployment
    - Walk through our automated code generation tool.
    - Generating Airflow DAGs for dbt repositories, Dockerized data jobs, and Google Vertex AI (ML) pipelines.
    - How configuration and validation happen at code-generation time during CI, minimizing surprises at runtime.
    - How we deploy interconnected DAGs independently with declarative dependencies between data assets using a global data asset registry (metadata, SLOs, and owner information).
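To make the generation-time idea concrete, below is a hedged sketch of a generator that checks a pipeline's declared upstream assets against a global registry and emits Airflow DAG source as text. Dataset-based scheduling is a real Airflow 2.4+ feature, but the registry contents, config keys, and `generate_dag_source` function are illustrative assumptions, not Vinted's implementation.

```python
# Hypothetical sketch: generation-time wiring between DAGs via a shared
# asset registry; the config shape and registry contents are illustrative.

ASSET_REGISTRY = {
    # asset name -> metadata published by the owning team
    "warehouse.users": {"owner": "identity-team", "slo_hours": 6},
}

CONFIG = {
    "name": "listings_daily",
    "consumes": ["warehouse.users"],      # upstream assets owned elsewhere
    "produces": ["warehouse.listings"],   # assets this pipeline publishes
}

def generate_dag_source(cfg: dict) -> str:
    """Emit Airflow DAG source as text; unknown upstream assets fail CI."""
    for asset in cfg["consumes"]:
        if asset not in ASSET_REGISTRY:
            raise ValueError(f"unknown upstream asset: {asset}")
    consumed = ", ".join(f'Dataset("{a}")' for a in cfg["consumes"])
    lines = [
        "from airflow import DAG",
        "from airflow.datasets import Dataset",
        "",
        # Dataset-driven scheduling lets each DAG deploy independently:
        # this DAG runs whenever its declared upstream assets are updated.
        f'with DAG(dag_id="{cfg["name"]}", schedule=[{consumed}]) as dag:',
        "    ...",
    ]
    return "\n".join(lines) + "\n"

print(generate_dag_source(CONFIG))
```

Validating dependencies against the registry in CI, rather than discovering a missing upstream at runtime, is what allows interconnected DAGs to be deployed independently.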

  4. Outcomes, Lessons Learned, & Future Directions
    - Quantifiable impact: reduced pipeline delivery times, improved reliability, more autonomous teams.
    - Operational benefits: easier scheduler upgrades, standardized monitoring, and simplified incident management.
    - Lessons on balancing central guardrails and team autonomy.
    - Ideas for "scheduler-agnostic" evolution: how our abstractions would allow us to seamlessly upgrade to Airflow 3.0 and support new runtimes (e.g. on-prem Kubernetes tasks) in the future.