2025-12-09 – Live from PyData Boston
Most data science projects start with a simple notebook—a spark of curiosity, some exploration, and a handful of promising results. But what happens when that experiment needs to grow up and go into production?
This talk follows the story of a single machine learning exploration that matures into a full-fledged ETL pipeline. We’ll walk through the practical steps and real-world challenges that come up when moving from a Jupyter notebook to something robust enough for daily use.
We’ll cover how to:
- Set clear objectives and document the process from the beginning
- Break messy notebook logic into modular, reusable components
- Choose the right tools (Papermill, nbconvert, shell scripts) based on your workflow—not just the hype
- Track environments and dependencies to make sure your project runs tomorrow the way it did today
- Handle data integrity, schema changes, and even evolving labels as your datasets shift over time
And as a bonus: bring your results to life with interactive visualizations using tools like PyScript, Voila, and Panel + HoloViz
- (3 mins) Intro
- I've been supporting teams with their developer experience since 2020, after working as a freelance Python consultant. I've worked on dozens of projects, unblocking users and helping them pick the right tools for the task at hand.
- It works on my machine
- What we're building today: ML pipeline with RAPIDS -> Snowflake
- We're going to watch a real project grow up
- (3 mins) Exploration - starting as a single messy notebook on a sample data set
- Why RAPIDS? GPU acceleration
- Large data sets
- GPU availability - remote machine, local GPU
- workflows that work well on GPUs
- Load data with cuDF / pandas
- Quick EDA and data visualization
- Train cuML / scikit-learn model
- no-code-change philosophy (see the sketch after this section)
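To make the no-code-change philosophy concrete, here is a minimal sketch of what the first exploration cells might look like; the file name and column names (`events.parquet`, `churned`, and friends) are placeholders:

```python
# Minimal sketch of the first exploration cells; file and column names are placeholders.

# cuDF's pandas accelerator: the same pandas code runs on the GPU when one is
# available and falls back to CPU pandas otherwise (the "no-code-change" idea).
try:
    import cudf.pandas
    cudf.pandas.install()
except ImportError:
    pass  # no RAPIDS / no GPU: plain pandas below still works

import pandas as pd

df = pd.read_parquet("events.parquet")  # sample data set
print(df.describe())                    # quick EDA
df["signup_month"] = pd.to_datetime(df["signup_date"]).dt.month

# Same idea for the model: prefer cuML's GPU estimator, fall back to scikit-learn.
try:
    from cuml.ensemble import RandomForestClassifier
except ImportError:
    from sklearn.ensemble import RandomForestClassifier

X = df[["signup_month", "sessions", "spend"]]
y = df["churned"]
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
```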
- (7 mins) Make it repeatable - start with simple, tried-and-true tools; explore where tools like Papermill help with flexibility and reproducibility
- common pain points: operating cadence, specialized scenarios, error-prone manual execution
- shell scripts versus Papermill
- reproducible environments
- generate HTML reports
- pass parameters into your notebook (see the Papermill sketch below)
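A minimal sketch of a scheduled driver, assuming the notebook has a cell tagged `parameters`; the notebook name and parameter keys are placeholders:

```python
# Minimal sketch of a scheduled driver script: Papermill for parameterised
# execution, nbconvert for a shareable HTML report. Names are placeholders.
import datetime
import subprocess

import papermill as pm

run_date = datetime.date.today().isoformat()
executed = f"reports/pipeline_{run_date}.ipynb"

# Execute the notebook, injecting values into its cell tagged "parameters".
pm.execute_notebook(
    "pipeline.ipynb",
    executed,
    parameters={"run_date": run_date, "sample_frac": 1.0},
)

# Render the executed notebook to HTML for stakeholders.
subprocess.run(["jupyter", "nbconvert", "--to", "html", executed], check=True)
```

The same two steps could live in a cron-driven shell script; Papermill mainly adds parameterisation and a saved, executed copy of each run.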
- (8 mins) Make it reliable - Modular code & testing
- common pain points: data schema changes, debugging issues, testing & modularity
- nbconvert + Python: turn your notebook into a script
- turn a function into a tested, importable module (see the sketch after this section)
- dashboard with HoloViz / Panel (sketched below); discuss choosing between tools like Voila and PyScript
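A minimal sketch of the modularisation step: after something like `jupyter nbconvert --to script pipeline.ipynb`, the feature logic moves into an importable module with a small pytest check. File, function, and column names are placeholders:

```python
# features.py -- notebook logic extracted into an importable module.
# Function and column names are placeholders.
import pandas as pd


def add_signup_month(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the signup-month feature used by the model."""
    out = df.copy()
    out["signup_month"] = pd.to_datetime(out["signup_date"]).dt.month
    return out


# test_features.py -- a small behaviour check that runs under pytest.
def test_add_signup_month():
    df = pd.DataFrame({"signup_date": ["2025-01-15", "2025-06-30"]})
    result = add_signup_month(df)
    assert list(result["signup_month"]) == [1, 6]
```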
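And a minimal Panel sketch for the dashboard bullet; the data source and column names are placeholders:

```python
# Minimal sketch of a Panel dashboard over the model's output.
# Data source and column names are placeholders.
import pandas as pd
import panel as pn
import hvplot.pandas  # noqa: F401 -- registers the .hvplot accessor

pn.extension()

scores = pd.read_parquet("reports/latest_scores.parquet")

segment = pn.widgets.Select(name="Segment", options=sorted(scores["segment"].unique()))


@pn.depends(segment)
def score_hist(segment):
    subset = scores[scores["segment"] == segment]
    return subset.hvplot.hist("churn_probability", bins=20)


dashboard = pn.Column("# Daily churn scores", segment, score_hist)
dashboard.servable()  # run with: panel serve dashboard.py
```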
- (5 mins) Snowflake integration
- common pain points: data volume, coordinating with other data systems, audits
- picking the right tools: the cost/complexity trade-off
- RAPIDS preprocessing to Snowflake storage (see the sketch after this section)
- self-service access for stakeholders
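A minimal sketch of the hand-off, assuming the `write_pandas` helper from snowflake-connector-python; connection details and table/column names are placeholders:

```python
# Minimal sketch: GPU preprocessing with cuDF, bulk load into Snowflake.
# Connection details and table/column names are placeholders.
import cudf
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Heavy preprocessing stays on the GPU with cuDF...
gdf = cudf.read_parquet("events.parquet")
features = (
    gdf.groupby("customer_id")
       .agg({"spend": "sum", "sessions": "count"})
       .reset_index()
)

# ...then convert to pandas and hand off to Snowflake for shared storage.
conn = snowflake.connector.connect(
    account="my_account",   # placeholder credentials -- use your secrets manager
    user="etl_user",
    password="***",
    warehouse="ANALYTICS_WH",
    database="ML",
    schema="FEATURES",
)
write_pandas(conn, features.to_pandas(), table_name="CUSTOMER_FEATURES",
             auto_create_table=True)
conn.close()
```

Once the table lands in Snowflake, stakeholders can query it with their existing tools, which is the self-service access point above.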
- (3 mins) Conclusion
- Start simple
- Add complexity when you feel specific pain