PyData Seattle 2025

Data Loading for Data Engineers
2025-11-07, Talk Track 2

Data scientists need data to train their models. The process of feeding the training algorithm with data is loosely described as "data loading." This talk looks at the data loading process from a data engineer's perspective. We will describe common techniques such as splits, shuffling, clumping, epochs, and distribution. We will show how the way data is loaded can affect both training speed and model quality. Finally, we will examine the constraints these workloads put on data systems and discuss best practices for preparing a database to serve as a source for data loading.


In the first part of the talk we will describe how data loading is done. Scientists typically need to split their data into test/train splits (or k different splits for k-fold cross validation). They need the data to arrive in random order, and they will usually feed the training algorithm through multiple epochs, where the same data is provided each time in a different permutation. This is typically a distributed process, and care must be taken so that each worker has a different, but still random, view of the data. We will briefly describe PyTorch and Ray, two tools commonly used here. We expect this portion to take 15 minutes.
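The steps above can be sketched with the standard library alone. This is an illustrative sketch, not how PyTorch or Ray implement it: the function names `train_test_split` and `epoch_shard`, the seed-mixing constant, and the strided sharding scheme are all assumptions made for the example.

```python
import random


def train_test_split(n_rows, test_fraction, seed):
    """Split row indices 0..n_rows-1 into disjoint train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def epoch_shard(train_indices, epoch, worker_rank, num_workers, seed):
    """Return the rows one worker reads during one epoch (hypothetical helper)."""
    order = list(train_indices)
    # Every worker derives the same permutation from (seed, epoch), so the
    # shards below never overlap. A new epoch gives a new permutation.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    # Strided slice: worker r takes positions r, r + num_workers, ...
    return order[worker_rank::num_workers]


train, test = train_test_split(n_rows=100, test_fraction=0.2, seed=7)
for epoch in range(2):
    for rank in range(4):
        rows = epoch_shard(train, epoch, rank, num_workers=4, seed=7)
        print(f"epoch {epoch} worker {rank}: {len(rows)} rows, first {rows[:3]}")
```

Because the permutation depends only on the shared seed and the epoch number, workers need no coordination beyond agreeing on those two values.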

Next we will discuss model training, evaluation, and the loss function. We will show how the loss function can be used to compare different training strategies. We will look at clumping, show how it reduces randomness but increases I/O throughput, and show how that trade-off affects the loss function. This portion should take 15 minutes.
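The clumping trade-off can be illustrated with a small sketch. It assumes that clumping means shuffling contiguous groups of rows as units rather than shuffling individual rows; the talk may define it differently. The names `clumped_order` and `count_reads` are hypothetical, and counting contiguous runs is only a rough proxy for the number of I/O requests.

```python
import random


def clumped_order(n_rows, clump_size, seed):
    """Shuffle contiguous clumps of rows; rows inside a clump stay in order."""
    clumps = [
        list(range(start, min(start + clump_size, n_rows)))
        for start in range(0, n_rows, clump_size)
    ]
    random.Random(seed).shuffle(clumps)
    return [row for clump in clumps for row in clump]


def count_reads(order):
    """Count contiguous runs of row ids, a rough proxy for I/O requests."""
    if not order:
        return 0
    runs = 1
    for prev, cur in zip(order, order[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


for clump_size in (1, 10, 100):
    order = clumped_order(n_rows=1000, clump_size=clump_size, seed=0)
    print(f"clump_size={clump_size}: about {count_reads(order)} reads")
```

With a clump size of 1 this is an ordinary full shuffle. Larger clumps mean fewer, larger sequential reads, at the cost of neighboring rows always arriving together.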

Finally, we will talk about the impact that random access has on different data storage solutions. In particular, random access is very challenging for columnar formats on cloud storage. We will briefly show how LanceDB and an NVMe cache can speed up a data loading workload by caching the data during the first epoch so that later epochs read from the cache. This will occupy the final 10 minutes.
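The first-epoch caching idea can be sketched generically as a read-through cache. This is not LanceDB's API or implementation: `ReadThroughCache`, `fetch_remote`, and `run_epochs` are hypothetical names, a local directory stands in for the NVMe drive, and keys are assumed to be safe to use as file names.

```python
import os
import tempfile


class ReadThroughCache:
    """Serve reads from a local directory, fetching remotely only on a miss."""

    def __init__(self, fetch_remote, cache_dir=None):
        # fetch_remote is a hypothetical callable: key -> bytes from slow storage.
        self.fetch_remote = fetch_remote
        self.cache_dir = cache_dir or tempfile.mkdtemp(prefix="loader-cache-")
        os.makedirs(self.cache_dir, exist_ok=True)
        self.remote_reads = 0

    def read(self, key):
        path = os.path.join(self.cache_dir, key)
        if os.path.exists(path):
            # Cache hit: later epochs take this fast local path.
            with open(path, "rb") as f:
                return f.read()
        # Cache miss: expected for every key during the first epoch.
        data = self.fetch_remote(key)
        self.remote_reads += 1
        with open(path, "wb") as f:
            f.write(data)
        return data


def run_epochs(cache, keys, num_epochs):
    """Read every key once per epoch and return the total remote reads."""
    for _ in range(num_epochs):
        for key in keys:
            cache.read(key)
    return cache.remote_reads
```

In this sketch, running several epochs over the same keys touches the remote source once per key; every later read is served locally.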


Prior Knowledge Expected:

No previous knowledge expected

Weston is an open source software engineer at LanceDB. He is on the PMC for Apache Arrow and Substrait and has spent an unhealthy amount of time thinking about how best to read data from cloud storage. Recently he has been helping develop the Lance file and table formats and studying how random access, multimodal data, and search can be integrated into the modern data lake.