06-07, 10:20–11:05 (Europe/London), Hardwick Hub
This talk examines multi-threaded parallel inference on PyTorch models using the new no-GIL, free-threaded version of Python. Using a simple 124M-parameter GPT2 model that we train from scratch, we explore the territory newly unlocked by free-threaded Python: parallel PyTorch model inference, where multiple threads, unimpeded by the Python GIL, generate text from a transformer-based model in parallel.
Python 3.13, released in October 2024, is the first version of Python to support a "no-GIL" free-threaded mode, per PEP 703 ("Making the Global Interpreter Lock Optional in CPython"), unlocking the ability for multiple Python threads to run simultaneously.
This allows, for the first time since the language’s inception in December 1989, a single Python process to saturate all CPU cores in parallel with pure Python code (i.e. not farming out to extension modules written in C, C++, or, more recently, Rust).
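To make that concrete, here is a minimal sketch (not material from the talk) of CPU-bound, pure-Python work spread across threads. On a free-threaded build (e.g. `python3.13t`) the threads can run on all cores at once; on a standard build they serialize on the GIL. `sys._is_gil_enabled()` is available in CPython 3.13+; the fallback here assumes the GIL is on for older versions:

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def count(n: int) -> int:
    # Pure-Python busy loop: no C extension releases the GIL on our behalf.
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    # Report whether this interpreter is actually running GIL-free.
    gil = getattr(sys, "_is_gil_enabled", lambda: True)()
    print(f"GIL enabled: {gil}")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        # With the GIL disabled, these 8 tasks execute truly in parallel.
        list(pool.map(count, [10_000_000] * 8))
    print(f"8 threads took {time.perf_counter() - start:.2f}s")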
This talk explores what can now be done with PyTorch under the new free-threaded version of Python, focusing specifically on run-time inference on transformer-based generative models.
We will introduce a free-threaded implementation of an asyncio-based HTTP server that allows for parallel inference on a GPT2 PyTorch model, scaling up to multiple GPUs with ease, all within a single Python process; this is novel, uncharted territory that free-threaded Python has only now unlocked.
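As a taste of the approach, the following is a minimal sketch, not the talk's actual server: it stands in the off-the-shelf Hugging Face `gpt2` checkpoint for the from-scratch 124M model, and uses `asyncio.to_thread` to fan requests out to per-device model replicas. Keeping one replica per device means the worker threads never share mutable model state; under a free-threaded build the generations proceed in parallel:

```python
import asyncio
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# One model replica per available device (falls back to CPU if no GPU).
devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
models = {d: GPT2LMHeadModel.from_pretrained("gpt2").to(d).eval() for d in devices}

def generate(prompt: str, device: str) -> str:
    # Runs in a worker thread; without the GIL, threads driving different
    # devices (or CPU cores) execute concurrently.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = models[device].generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(out[0], skip_special_tokens=True)

async def handle(prompt: str, i: int) -> str:
    # Round-robin each request onto a device-local replica.
    return await asyncio.to_thread(generate, prompt, devices[i % len(devices)])

async def main() -> None:
    prompts = ["Hello, world", "Free-threaded Python", "PyTorch inference"]
    results = await asyncio.gather(*(handle(p, i) for i, p in enumerate(prompts)))
    for r in results:
        print(r)

asyncio.run(main())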
No previous knowledge expected
During his work at NVIDIA, Michał gained extensive experience in deep learning software development. He has tackled challenges in training and inference, ranging from small-scale to large-scale applications, as well as user-facing tasks and highly optimized benchmarks such as MLPerf. Michał also has a deep understanding of data loading problems, having worked as a developer on NVIDIA DALI, the Data Loading Library.