2025-12-11 – Analytics, Visualization & Decision Science
Computer vision, the field focused on enabling machines to interpret and understand visual data, tackles challenges such as image recognition, object detection, and scene understanding. PyData tools play a critical role here, with robust libraries like TensorFlow, PyTorch, Keras, and LangChain for building and training machine learning models, processing images, and managing large datasets. This hands-on session will show attendees how to optimize computer vision projects with end-to-end version control baked in.
Petabytes of unstructured data are the foundation on which successful Machine Learning (ML) models are built. A common approach is for researchers to copy subsets of that data into their local environments for model training. This allows for iterative experimentation, but it also introduces data management challenges, including limited reproducibility, inefficient data transfer, and constrained compute power.
This is where data version control technologies can help computer vision researchers overcome these challenges. In this workshop we'll cover:
- How to use open source tooling to version control your data when working with it locally (see the first sketch after this list).
- Best practices for working with data that remove the need to copy it locally while enabling model training at scale directly on the cloud (see the second sketch below). This will be demoed with an OSS stack:
  - LangChain
  - TensorFlow
  - PyTorch
  - Keras
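As a minimal sketch of the first bullet, here is one way to version a local dataset with DVC, a widely used open-source option. The workshop's own demos center on lakeFS, so treat the tool choice, paths, tag, and commit messages below as illustrative assumptions rather than the workshop's exact workflow:

```python
import subprocess

import dvc.api

# Track a local dataset with DVC and record the version in Git.
# Assumes `git` and `dvc` are installed and the project is already a Git repo;
# the paths, tag, and messages are placeholders.
subprocess.run(["dvc", "init"], check=True)
subprocess.run(["dvc", "add", "data"], check=True)       # hashes the data, writes data.dvc
subprocess.run(["git", "add", "data.dvc", ".gitignore"], check=True)
subprocess.run(["git", "commit", "-m", "Version raw training data"], check=True)
subprocess.run(["git", "tag", "v1.0"], check=True)

# Later, reproduce an experiment by reading a file exactly as it was at a
# given revision (tag, branch, or commit hash) instead of whatever happens
# to be on disk today.
labels_csv = dvc.api.read("data/labels.csv", rev="v1.0")
```

The same pattern pins model inputs to Git history, which directly addresses the reproducibility constraint described above.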
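And as a sketch of the second bullet, training against data in object storage rather than local copies: the snippet below streams images from an S3-compatible endpoint (for example, a repository and branch exposed through lakeFS's S3 gateway) into a PyTorch DataLoader. The endpoint URL, repository ("cv-datasets"), branch ("experiment-1"), credentials, and image layout are all assumed placeholders:

```python
import io

import s3fs
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision.transforms.functional import to_tensor

# s3fs speaks the S3 protocol, so it can point at any S3-compatible endpoint,
# including a lakeFS gateway where the "bucket" is the repository and the first
# path segment is the branch. Every name below is a placeholder.
fs = s3fs.S3FileSystem(
    key="<access-key-id>",
    secret="<secret-access-key>",
    client_kwargs={"endpoint_url": "https://lakefs.example.com"},
)

class RemoteImageDataset(Dataset):
    """Reads images lazily from object storage on each __getitem__ call."""

    def __init__(self, fs: s3fs.S3FileSystem, prefix: str):
        self.fs = fs
        self.paths = [p for p in fs.ls(prefix) if p.endswith(".jpg")]

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Only the requested object is transferred; nothing is copied to disk.
        with self.fs.open(self.paths[idx], "rb") as f:
            img = Image.open(io.BytesIO(f.read())).convert("RGB")
        return to_tensor(img.resize((224, 224)))  # CHW float tensor in [0, 1]

# "cv-datasets" = repository, "experiment-1" = branch (both hypothetical).
dataset = RemoteImageDataset(fs, "cv-datasets/experiment-1/images/")
loader = DataLoader(dataset, batch_size=32, num_workers=4)
for batch in loader:
    ...  # feed batches to the training loop of your framework of choice
```

Because the branch name is part of the path, pointing a training job at a different data version is a one-line change, and no bytes need to move when you switch back.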
You will come away with practical methods to improve data management when developing and iterating on Machine Learning models for modern computer vision research.
Joe Pringle is VP of Customer Success at lakeFS, supporting open source data version control and infrastructure by providing expertise on data strategy, data science, AI, and machine learning. He helps accelerate innovation and plan and execute data science and machine learning initiatives. He has 20+ years of experience helping large enterprises use data to increase their impact on important public policy issues, including education, health, the environment, and economic development. He also has a passion for focusing technology initiatives on people, working backwards from an understanding of end users to identify opportunities to help busy people work faster, smarter, and better.