04-19, 11:00–12:30 (US/Eastern), Room 120
Traditional PDF extraction tools often struggle with complex layouts, tables, and images. Docling, an open-source Python library developed at IBM, is designed to extract structured information from these elements, enabling richer, more accurate vector databases. This hands-on tutorial guides participants through building a Retrieval-Augmented Generation (RAG) system with Docling.
Participants will learn how to use Docling's advanced capabilities to build RAG systems that can understand and retrieve information from complex document elements that traditional tools might miss. This includes handling complex documents, extracting structured information, and creating an efficient vector database for semantic search. The session also covers best practices for document parsing, chunking strategies, and integration with popular LLM frameworks.
Overview and Objectives
This tutorial uses Docling (https://ds4sd.github.io/docling/), an open-source library for advanced document processing and AI integration. The session aims to equip data scientists and ML engineers with practical skills for building robust RAG systems on top of Docling's feature set. We will work through scenarios such as documents with multi-page tables, research papers whose multi-column layouts and equations must be preserved, and technical documentation containing code blocks and diagrams. Through these examples, you'll gain practical experience building document processing pipelines that handle content traditional extraction tools tend to lose.
Participants will learn how to:
- Process and parse various document formats (PDF, DOCX, HTML) using Docling
- Extract structured information including tables, formulas, and images
- Implement effective text chunking strategies for optimal retrieval
- Create vector databases for semantic search
- Integrate the pipeline with LLM frameworks for end-to-end RAG solutions
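The first three objectives above can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code: `convert_to_markdown` uses Docling's `DocumentConverter` and `export_to_markdown`, while `chunk_by_heading` and its `max_chars` parameter are hypothetical helpers introduced here to show one simple, structure-aware chunking strategy.

```python
import re


def convert_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) to Markdown with Docling."""
    # Imported lazily so the chunking helper below also works
    # in an environment where Docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()


def chunk_by_heading(markdown: str, max_chars: int = 1000) -> list[dict]:
    """Split Markdown into chunks that never cross a heading boundary.

    Hypothetical helper: each chunk keeps the heading it appeared under,
    so the retriever can later show where a passage came from.
    """
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, skipping empty buffers.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            # A new heading closes the current chunk.
            flush()
            heading = match.group(1).strip()
        else:
            # Start a new chunk when adding this line would exceed the budget.
            if sum(len(item) + 1 for item in buffer) + len(line) > max_chars:
                flush()
            buffer.append(line)
    flush()
    return chunks
```

Assuming Docling is installed, `chunk_by_heading(convert_to_markdown("report.pdf"))` would yield heading-tagged chunks ready for embedding; the file name is a placeholder.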
Target Audience
This tutorial is designed for:
- Data scientists and ML engineers working on document processing and LLM applications
- Software developers implementing RAG systems
- Anyone interested in building production-ready document processing pipelines
Experience Level: Intermediate
Prerequisites:
- Basic Python programming knowledge
- Familiarity with basic NLP concepts
- Understanding of LLMs and vector databases (basic level)
Technical Requirements
Participants should have:
- Python 3.10 or 3.11 installed
- A code editor or IDE
- Ability to install Python packages via pip
- 4GB+ of free disk space for models and dependencies
Detailed Outline (90 minutes)
Introduction and Setup (15 minutes)
- RAG system architecture overview
- Setting up the development environment
- Installing Docling and dependencies
Document Processing with Docling (25 minutes)
- Understanding Docling's document processing capabilities
- Comparing traditional PDF extraction vs. Docling's advanced parsing
- Advanced extraction of tables, images, and complex layouts
- Hands-on exercise: Processing sample documents with rich content
Building the RAG Pipeline (25 minutes)
- Creating rich vector embeddings that preserve document structure
- Integration with LLM frameworks
- Hands-on exercise: Building a complete RAG pipeline
Best Practices and Production Considerations (15 minutes)
- Performance optimization techniques
- Using accelerators
- Deploying Docling as an API service with docling-serve (https://github.com/docling-project/docling-serve)
- Creating effective evaluations
Q&A and Interactive Problem Solving (10 minutes)
- Addressing participant questions
- Troubleshooting common issues
- Discussion of real-world applications
Materials
https://github.com/KrishnaRekapalli/docling-rag-tutorial-pydata-2025
Pre-work
Make sure that you have a Hugging Face access token or a Replicate API key for LLM inference. Both platforms offer some free inference credit without a credit card. Another option is to run models locally with Ollama. For more details, see https://github.com/KrishnaRekapalli/docling-rag-tutorial-pydata-2025
Key Takeaways
Participants will leave the tutorial with:
- Practical experience in building RAG systems
- Understanding of document processing best practices
- Ability to extract and utilize information from complex document elements
- Hands-on experience comparing traditional vs. advanced extraction methods
- Knowledge of common pitfalls and how to avoid them
- Strategies for handling tables and images in RAG systems
No previous knowledge expected
Krishna is a Senior Data Scientist at IBM's Watsonx.ai Solution Architecture Center of Excellence, specializing in designing and implementing enterprise-scale LLM-powered AI solutions and agentic workflows. With over 7 years of experience building machine learning applications, they bring extensive expertise in hybrid cloud architectures, geospatial data analysis, and artificial intelligence. At IBM, they work directly with clients to architect and deploy production-ready AI solutions, focusing on practical implementation challenges and scalable architectures.