PyData Global 2025

Hands-on with Blosc2: Accelerating Your Python Data Workflows
2025-12-10, General Track

As datasets grow, I/O becomes a primary bottleneck, slowing down scientific computing and data analysis. This tutorial provides a hands-on introduction to Blosc2, a powerful meta-compressor designed to turn I/O-bound workflows into CPU-bound ones. We will move beyond basic compression and explore how to structure data for high-performance computation.

Participants will learn to use the python-blosc2 library to compress and decompress data with various codecs and filters, optimizing for speed and ratio. The core of the tutorial will focus on the Blosc2 NDArray object, a chunked, N-dimensional array that lives on disk or in memory. Through a series of interactive exercises, you will learn how to perform out-of-core mathematical operations and analytics directly on compressed arrays, effectively handling datasets larger than available RAM.

We will also cover practical topics like data storage backends, two-level partitioning for faster data slicing, and how to integrate Blosc2 into existing NumPy-based workflows. You will leave this session with the practical skills needed to significantly accelerate your data pipelines and manage massive datasets with ease.


Audience & Prerequisites

This tutorial is for data scientists, engineers, and researchers who work with large numerical datasets in Python.

Prerequisites: Attendees should have intermediate Python programming skills and be comfortable with the basics of NumPy arrays. No prior experience with Blosc2 is necessary.

Setup: Participants will need a laptop and can follow along using a provided cloud-based environment (e.g., Binder) or a local installation of Python, Jupyter, and the python-blosc2 library.

Learning Objectives

By the end of this tutorial, attendees will be able to:

  • Understand the core concepts behind the Blosc2 meta-compressor.
  • Compress and decompress NumPy arrays, tuning parameters for optimal performance.
  • Create, manipulate, and slice Blosc2 NDArray objects for out-of-core processing.
  • Perform efficient mathematical computations directly on compressed data.
  • Store and retrieve compressed datasets using different storage backends.
  • Integrate Blosc2 into their existing data analysis workflows to mitigate I/O bottlenecks.

Outline (90 minutes)

Introduction & Setup (10 mins)

  • The I/O Bottleneck Problem.
  • Core Concepts: What are meta-compressors, chunks, and blocks?
  • Tutorial environment setup (Jupyter notebooks).

Part 1: Compression Fundamentals (20 mins)

  • Hands-on: Using blosc2.compress() and blosc2.decompress().
  • Exploring codecs (lz4, zstd), compression levels, and filters (shuffle, bitshuffle).
  • Exercise: Compressing a sample dataset and analyzing the trade-offs between speed and ratio.

Part 2: The NDArray - Computing on Compressed Data (35 mins)

  • Hands-on: Creating NDArray objects from scratch and from NumPy arrays.
  • Storing arrays on-disk vs. in-memory.
  • Exercise: Slicing and accessing data from an on-disk NDArray.
  • Performing mathematical operations (arr * 2 + 1) and reductions (arr.sum()) on compressed data.
  • Exercise: Analyzing a dataset larger than RAM.

Part 3: Advanced Features & Integration (20 mins)

  • Hands-on: Using two-level partitioning (meta-chunks) for faster slicing.
  • Brief overview of Caterva2 for sharing compressed data via an API.
  • Recap and Q&A.

Repository: Tutorial materials including notebooks and datasets will be available at a public GitHub repository (link to be provided upon acceptance).


Prior Knowledge Expected:

Yes

I am a curious person who studied Physics (BSc, MSc) and Applied Maths (MSc). I spent over a year at CERN for my MSc in High Energy Physics. However, I found maths and computer science equally fascinating, so I left academia to pursue these fields. Over the years, I developed a passion for handling large datasets and using compression to enable their analysis on commodity hardware accessible to everyone.

I am the CEO of ironArray SLU and lead the Blosc Development Team. I am currently interested in determining, ahead of time, which combinations of codecs and filters can provide a personalized compression experience. I am also excited to provide an easy and effective way to share Blosc2 datasets over the network via Caterva2, and Cat2Cloud, a software-as-a-service for handling and computing with datasets directly in the cloud.

As an Open Source believer, I started the PyTables project more than 20 years ago. After 25 years in this business, I have started several other useful open source projects like Blosc2, Caterva2 and Btune; those efforts have won me two prizes that mean a lot to me.

You can learn more about what I am working on by reading my latest blog posts.

Degree in Physics, Princeton University, 2019
Master's in Applied Mathematics, University of Edinburgh, 2020
PhD in Applied Mathematics, Universitat Jaume I, 2024
Working at ironArray as an engineer and product owner since 2025.