PyData Berlin 2025

Forget the Cloud: Building Lean Batch Pipelines from TCP Streams with Python and DuckDB
2025-09-02, B09

Many industrial and legacy systems still push critical data over TCP streams. Instead of reaching for heavyweight cloud platforms, you can build fast, lean batch pipelines on-prem using Python and DuckDB.

In this talk, you'll learn how to turn raw TCP streams into structured datasets ready for analysis, all running on-premises. We'll cover key patterns for batch processing, practical architecture examples, and real-world lessons from industrial projects.
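As a taste of the pattern, here is a minimal sketch of turning a raw TCP stream into structured, batched records. It assumes a newline-delimited `sensor_id,value` wire format, which is purely illustrative; real industrial protocols vary. The loopback server only stands in for the actual data source so the example is self-contained.

```python
import socket
import threading

def parse_line(line: bytes) -> dict:
    """Parse one 'sensor_id,value' message into a structured record.
    (The CSV-over-TCP wire format here is an illustrative assumption.)"""
    sensor_id, value = line.decode().split(",")
    return {"sensor_id": sensor_id, "value": float(value)}

def read_batches(sock: socket.socket, batch_size: int):
    """Accumulate newline-delimited TCP messages into fixed-size batches."""
    buf, batch = b"", []
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # peer closed the stream
            break
        buf += chunk
        while b"\n" in buf:    # frame on newlines; partial lines stay buffered
            line, buf = buf.split(b"\n", 1)
            batch.append(parse_line(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:                  # flush the final, possibly short, batch
        yield batch

# --- demo: a loopback server stands in for the industrial data source ---
server = socket.create_server(("127.0.0.1", 0))
host, port = server.getsockname()

def emit():
    conn, _ = server.accept()
    with conn:
        for msg in (b"s1,20.5\n", b"s2,21.0\n", b"s1,20.7\n"):
            conn.sendall(msg)

t = threading.Thread(target=emit)
t.start()
with socket.create_connection((host, port)) as client:
    batches = list(read_batches(client, batch_size=2))
t.join()
server.close()
print(batches)
```

The generator hands downstream code complete batches rather than raw bytes, which is the hinge between the streaming input and the batch processing model discussed in the talk.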

If you work with sensor data, logs, or telemetry, and you value simplicity, speed, and control, this talk is for you.


Cloud-native tools are everywhere. But not every system can or should move to the cloud.

In many industries like manufacturing, logistics, or energy, TCP streams remain the backbone of real-time data exchange. These systems are often on-premises, resource-constrained, and mission-critical.

This talk shows how you can build lean, powerful batch pipelines that ingest data from TCP streams using Python and DuckDB, all without the complexity of cloud services.

We'll cover:

  • Why TCP streams still matter
  • Stream vs. Batch: Choosing the right model for industrial data
  • Pipeline architecture: From streams to batch
  • DuckDB + Python: The perfect combo for lightweight analytics
  • Key pitfalls along the way
  • Limitations of this approach

You'll walk away with:

  • Ready-to-use patterns for TCP-based data pipelines
  • Insights on when to avoid unnecessary cloud complexity
  • Tips for building fast, reliable batch jobs on local infrastructure

Whether you process factory sensor data, machine logs, or legacy telemetry, this talk will give you modern tools to make your data streams actionable and efficient.


Prerequisites:

Attendees should have a basic understanding of data engineering principles, but no special knowledge is required.

Abstract as a tweet (X) or toot (Mastodon):

Learn how to build on-prem data pipelines with streaming inputs and batch processing.

Expected audience expertise (domain):

Intermediate

Software and data engineering consultant. I build data systems that help companies answer questions about their business. I like solving problems in a pragmatic way.