PyData London 2025

Hands-on with Apache Iceberg
06-06, 09:00–10:30 (Europe/London), Hardwick Hub

You've probably heard the name Apache Iceberg by now. If it wasn't when Databricks reportedly spent 2 billion USD buying Tabular, it might have been when AWS announced S3 Tables built on Iceberg. But do you know what Apache Iceberg actually is? Or how you could start using it today?

In this tutorial, we will walk through an end-to-end example of writing and reading Iceberg data, taking a few pit stops along the way to demonstrate Iceberg's selling points.


**This tutorial is aimed at data engineers who are somewhat familiar with cloud storage solutions such as S3, Azure Blob Storage or Google Cloud Storage. The tutorial will consist of fully local components running in Docker and Jupyter notebooks. You will be able to replicate the environment locally and play around with it yourself.**

Please clone https://github.com/andersbogsnes/pydata-london-2025-hands-on-apache-iceberg and run the commands in the README.md before the workshop if possible!

The goal of this tutorial is to give you an understanding of what Apache Iceberg is and does.

We will write data in Iceberg format to an object store, taking the opportunity to demonstrate each of Iceberg's selling points. Finally, we will query the data using a variety of query engines to demonstrate Iceberg's promise of interoperability.

Outline

  • Introduce some of the concepts needed to understand the why of Apache Iceberg
      • A brief history of table formats
      • A discussion of the importance of file formats
  • Introducing the dataset we will be working with
  • Writing data into Iceberg format - what is happening under the hood?
  • Demonstrating the main selling points of Iceberg and why you should care
      • Schema Evolution
      • Hidden Partitioning
      • Time Travel
      • Data Compaction
  • Querying the data
      • DuckDB
      • Polars
      • Other query engines

Prior Knowledge Expected

No previous knowledge expected

Anders is the Head of Investments Engineering at Nordea Asset Management and organizer of the PyData Copenhagen Meetup. He has a background as an ML Tech Lead and Python Enabler with an interest in data engineering, ML and ML Engineering. Hailing from Stavanger, Norway, he is currently located in Copenhagen, Denmark.