PyData Vermont 2025

Complex Data Ingestion with Open Source AI
2025-10-22, UVM Alumni House Silver Pavilion

If you have worked with AI in any capacity, you'll know that AI is only as valuable as the data it can leverage. Data is the cornerstone of AI, and developers need better ways to transform complex documents into structured data ready for model training and inference.

In this session, we will turn common, real-world documents and scans into structured data for search and RAG. In this 90-minute, code-along workshop, you’ll learn about Docling, an open-source toolkit for advanced document conversion that lets you use your data more effectively in AI workflows. We’ll complete three labs: Conversion, Chunking, and RAG. You’ll leave with runnable notebooks from a public GitHub repo.

Audience: Python practitioners shipping document-centric apps.

Prereqs: basic Python/Jupyter.


Docling is an MIT-licensed toolkit that turns messy, real-world files such as DOCX, XLSX, HTML, images, and more, into a single, richly structured “DoclingDocument” you can export to Markdown/JSON and wire directly into AI workflows. In this 90-minute, code-along workshop, you’ll learn how to convert heterogeneous documents reliably, chunk and serialize them for retrieval, and stand up a working RAG flow, using only your laptop.
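
As a rough preview of the conversion step described above, the sketch below wraps Docling's `DocumentConverter` in a small helper and exports the result to Markdown. The helper names `convert_to_markdown` and `output_path_for`, and the `converted` output folder, are hypothetical and chosen only for illustration. The workshop notebooks may structure this differently.

```python
from pathlib import Path


def convert_to_markdown(source: str) -> str:
    """Convert one file (path or URL) to Markdown via Docling."""
    # Imported lazily so this module still loads where Docling is not installed
    # (install with `pip install docling`).
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    # result.document is the structured DoclingDocument
    return result.document.export_to_markdown()


def output_path_for(source: str, out_dir: str = "converted") -> Path:
    """Hypothetical helper: map an input file to a .md path in out_dir."""
    return Path(out_dir) / (Path(source).stem + ".md")
```

A typical use, assuming a local file named `report.docx`, would be to call `convert_to_markdown("report.docx")` and write the returned string to `output_path_for("report.docx")`.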

Who should attend
Data scientists, ML/LLM engineers, data engineers, and Python practitioners who ship document-heavy apps (RAG, search, extraction) and want a robust open-source foundation.

Takeaways
- Why advanced document ingestion is important and how it can improve your workflows
- Hands-on experience using Docling
- A working, locally runnable RAG notebook you can adapt to your stack

Background expected
Comfort with Python and Jupyter; basic familiarity with vectors/RAG helpful but not required.

Time breakdown (minutes)
0–20: Why Docling? Capabilities & workflows.
20–35: Lab 1 (Conversion): convert PDFs/DOCX/PPTX/XLSX; export to Markdown/JSON.
35–60: Lab 2 (Chunking): hybrid chunking & serialization strategies for retrieval.
60–80: Lab 3 (RAG): build a multimodal RAG with a Python framework.
80–90: Q&A
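
To give a feel for what chunking means in Lab 2, here is a minimal, standard-library-only sketch of structure-aware chunking. It splits exported Markdown so that no chunk crosses a heading or exceeds a word budget. This is a simplified stand-in, not Docling's hybrid chunker. The function name `chunk_markdown` and the word-based size limit are assumptions made for illustration (real pipelines usually count tokens with the embedding model's tokenizer).

```python
def chunk_markdown(markdown: str, max_words: int = 50) -> list[dict]:
    """Split Markdown into chunks that stay within one heading and max_words."""
    chunks: list[dict] = []
    heading = ""            # most recent heading seen
    buffer: list[str] = []  # words collected for the current chunk

    def flush() -> None:
        # Emit the buffered words as one chunk tagged with its heading.
        if buffer:
            chunks.append({"heading": heading, "text": " ".join(buffer)})
            buffer.clear()

    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new heading always closes the current chunk.
            flush()
            heading = line.lstrip("#").strip()
            continue
        for word in line.split():
            if len(buffer) >= max_words:
                flush()
            buffer.append(word)
    flush()
    return chunks
```

Keeping the heading with each chunk is one simple way to preserve context for retrieval, since a chunk's text alone often does not say which section it came from.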

Participant requirements
Laptop with Python 3.10+ and Jupyter (or VS Code), or a Google Colab account.

Materials & distribution
All three labs plus sample documents will be in a public GitHub repository; the link will be included on the session page and slides.


Prior Knowledge Expected: No previous knowledge expected

Ming Zhao is an open-source developer and Developer Advocate at IBM Research, where he helps IBM leverage open technologies while building impactful tools and growing vibrant open-source communities. He’s passionate about making open tech accessible to all and ensuring developers have the tools they need to succeed in the rapidly developing AI space. Ming now leads community efforts around Docling, IBM’s fastest-growing open source project, recently welcomed into the LF AI & Data Foundation.