PyData Seattle 2025

Building Inference Workflows with Tile Languages
2025-11-08, Talk Track 2

The world of generative AI is expanding. New models are hitting the market daily. The field has bifurcated between model training and model inference. The need for fast inference has led to the development of numerous tile languages. These languages use concepts from linear algebra and borrow common NumPy APIs. In this talk we will show how tiling works and how to build inference models from scratch in pure Python with embedded tile languages. The goal is to provide attendees with a solid overview that can be integrated into common data pipelines.


Recently, there has been an explosion of interest in block-based Python programming models targeting GPUs, driven by the machine learning community. Many new Python frameworks have been developed, such as Triton, JAX/Pallas, and Warp. In March 2025, NVIDIA announced a new block-based dialect (cuTile) and compiler stack for CUDA (Tile IR).

In tile-based programming models, you write seemingly sequential functions that operate on small, local arrays that subdivide your inputs. These functions are then invoked concurrently on multiple instances. Each instance has a group of threads associated with it, and array operations are parallelized across those threads. Concurrency and data movement within groups of threads are implicit and abstracted away, in contrast to models like SIMT where users must explicitly synchronize and coordinate threads and tensor cores, pipeline loading of data, account for memory coalescing, etc.
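To make this concrete, here is a minimal sketch of the idea in pure NumPy: a tiled matrix multiply where each "instance" computes one small output tile, and the launch loop plays the role of concurrent invocation. This is an illustration of the model only, not cuTile or Triton syntax; the function names and the `TILE` parameter are our own.

```python
import numpy as np

def matmul_tile_kernel(A, B, C, i, j, TILE=16):
    """One 'instance' of a tile program: computes output tile (i, j) of C = A @ B.

    In a real tile framework, this body would run on one group of GPU threads,
    and the small array operations below would be parallelized across that
    group automatically.
    """
    K = A.shape[1]
    acc = np.zeros((TILE, TILE), dtype=A.dtype)
    # Walk the shared K dimension one tile at a time, accumulating partial products.
    for k in range(0, K, TILE):
        a = A[i * TILE:(i + 1) * TILE, k:k + TILE]   # load a tile of A
        b = B[k:k + TILE, j * TILE:(j + 1) * TILE]   # load a tile of B
        acc += a @ b                                  # small local matmul
    C[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE] = acc

def matmul(A, B, TILE=16):
    """The 'launch': invoke the kernel once per output tile.

    On a GPU these instances run concurrently; here we simply loop.
    Assumes matrix dimensions are multiples of TILE.
    """
    M, N = A.shape[0], B.shape[1]
    C = np.empty((M, N), dtype=A.dtype)
    for i in range(M // TILE):
        for j in range(N // TILE):
            matmul_tile_kernel(A, B, C, i, j, TILE)
    return C
```

Note that the kernel body reads as sequential array code: the programmer never names threads, shared memory, or synchronization points, which is exactly the abstraction the tile frameworks provide over SIMT.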

Tile-based programming has been a staple in numerical and scientific computing for decades. Examples include NWChem's Tensor Contraction Engine, BLIS, and ATLAS. Tile-based programming is a form of array programming, and draws inspiration from languages and frameworks such as APL, MATLAB, and NumPy.

Motivation

This trend towards Pythonic tile-based models for GPU programming is due to a variety of factors:

  • More and more data scientists are programming GPUs, including those who are not experts in concurrency and hardware performance.
  • Tile-based code is simpler to design, write, and debug for data-parallel GPU applications.
  • Compilers can reason about tile-based programs without resorting to complex and brittle analyses.
  • Array-centric paradigms are more intuitive for Python developers familiar with NumPy.
  • Tile-based GPU frameworks offer better portability even as GPU architectures diverge between generations.
  • Tile-based models significantly simplify programming machine learning acceleration technology like tensor cores.

Simply put, more data scientists have to use GPUs, and GPU technology is evolving rapidly, creating a need for higher-level and more portable paradigms.

Results

We'll present the recently announced cuTile and Tile IR. cuTile is a new tile-based programming model for NVIDIA's CUDA platform. It is built on a novel compiler stack and intermediate representation called Tile IR.

We'll show a new reference large-language-model GPU application based on Llama 3 and DeepSeek, implemented in a variety of tile-based GPU programming frameworks, including cuTile, as well as in traditional SIMT.

By attending this talk, you will:

  • Learn the best practices for writing tile-based Python applications for GPUs.
  • Gain insight into the performance of tile-based Python GPU code and how it actually gets executed.
  • Discover how to reason about and debug tile-based Python GPU applications.
  • Understand the differences between tile-based and SIMT programming and when each paradigm should be used.
  • Dive into real examples of tile-based Python GPU code.
  • Explore NVIDIA's new cuTile and Tile IR projects.

Prior Knowledge Expected:

No previous knowledge expected

I lead CUDA Python Product Management, working to make CUDA a Python native.

I received my Ph.D. from the University of Chicago in 2010, where I built domain-specific languages to generate high-performance code for physics simulations with the PETSc and FEniCS projects. After a brief stint as a research professor at the University of Texas and the Texas Advanced Computing Center, I have been a serial startup executive, including a founding team member of Anaconda.

I am a leader in the Python open data science community (PyData). A contributor to Python's scientific computing stack since 2006, I am most notably a co-creator of the popular Dask distributed computing framework, the Conda package manager, and the SymPy symbolic computing library. I was a founder of the NumFOCUS foundation, where I served as president and director, leading the development of programs supporting open-source projects such as Pandas, NumPy, and Jupyter.