Weston Pace
Weston is an open-source software engineer at LanceDB. He is on the PMC for Apache Arrow and Substrait and has spent an unhealthy amount of time thinking about how best to read data from cloud storage. Recently he has been helping develop the Lance file and table formats and studying how random access, multimodal data, and search can be integrated into the modern data lake.
Session
Data scientists need data to train their models. The process of feeding data to the training algorithm is loosely described as "data loading." This talk looks at the data loading process from a data engineer's perspective. We will describe common techniques such as splits, shuffling, clumping, epochs, and distribution. We will show how the way data is loaded can affect training speed and model quality. Finally, we will examine the constraints these workloads place on data systems and discuss best practices for preparing a database to serve as a source for data loading.