Building Rich RAG Systems with Docling: Unlock Information from Tables, Images, and Complex Documents PyData Virginia 2025

04-19, 11:00–12:30 (US/Eastern), Room 120

Traditional PDF extraction tools often struggle with complex layouts, tables, and images. Docling, an open-source Python library developed at IBM, excels at extracting structured information from these elements, enabling the creation of richer, more accurate vector databases. This hands-on tutorial will guide participants through building a Retrieval Augmented Generation (RAG) system using Docling.

Participants will learn how to harness Docling's advanced capabilities to build RAG systems that understand and retrieve information from complex document elements that traditional tools might miss. They will handle complex documents, extract structured information, and create an efficient vector database for semantic search. The session will cover best practices for document parsing, chunking strategies, and integration with popular LLM frameworks.
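
As a rough illustration of what a chunking strategy can look like, the sketch below splits Markdown-like text at headings and packs paragraphs into chunks up to a size limit, so that no chunk crosses a section boundary. It uses only the standard library. The names `chunk_by_heading` and `max_chars` are hypothetical, and this is not Docling's own chunker.

```python
def chunk_by_heading(text: str, max_chars: int = 500) -> list[dict]:
    """Split Markdown-like text into chunks that never cross a heading."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk under the current heading.
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # A new section starts: close the previous chunk first.
            flush()
            heading = block.lstrip("#").strip()
            continue
        size = sum(len(b) for b in buffer) + len(block)
        if buffer and size > max_chars:
            flush()
        buffer.append(block)
    flush()
    return chunks
```

Keeping the heading alongside each chunk gives the retriever some context about where the text came from, which is one simple way to preserve document structure.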

Overview and Objectives

This tutorial leverages Docling (https://ds4sd.github.io/docling/), a powerful open-source library designed for advanced document processing and AI integration. The session aims to equip data scientists and ML engineers with practical skills for building robust RAG systems by utilizing Docling's comprehensive feature set. We will work through scenarios such as documents with multi-page tables, research papers whose multi-column layouts and equations must be preserved, and technical documentation containing code blocks and diagrams. Through these examples, you'll gain practical experience in building document processing pipelines that outperform traditional extraction tools.

Participants will learn how to:
- Process and parse various document formats (PDF, DOCX, HTML) using Docling
- Extract structured information including tables, formulas, and images
- Implement effective text chunking strategies for optimal retrieval
- Create vector databases for semantic search
- Integrate the pipeline with LLM frameworks for end-to-end RAG solutions
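
The first objective can be sketched in a few lines. The example below filters input files by the formats listed above and converts one document to Markdown with Docling's `DocumentConverter`. The helper names `select_sources` and `convert_to_markdown` are hypothetical, and the Docling import is deferred so that the file filter works even where Docling is not installed. Treat it as a minimal sketch rather than the tutorial's own code.

```python
from pathlib import Path

# Formats named in the list above; Docling supports more than these.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".html"}


def select_sources(paths: list[str]) -> list[Path]:
    """Keep only files whose extension is one of the formats listed above."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


def convert_to_markdown(source: Path) -> str:
    """Convert one document with Docling and return it as Markdown."""
    # Imported lazily so select_sources() works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

The first conversion may take a while, since Docling needs to fetch its models before it can process a document.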

Target Audience

This tutorial is designed for:
- Data scientists and ML engineers working on document processing and LLM applications
- Software developers implementing RAG systems
- Anyone interested in building production-ready document processing pipelines

Experience Level: Intermediate

Prerequisites:
- Basic Python programming knowledge
- Familiarity with basic NLP concepts
- Understanding of LLMs and vector databases (basic level)

Technical Requirements

Participants should have:
- Python 3.10 or 3.11 installed
- A code editor or IDE
- Ability to install Python packages via pip
- 4GB+ of free disk space for models and dependencies
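
A quick way to check the first and last requirements before the session is a small standard-library script. This is a sketch under the assumptions stated in the list above (Python 3.10 or 3.11, at least 4 GB free); `check_environment` is a hypothetical helper name.

```python
import shutil
import sys

MIN_FREE_GB = 4
SUPPORTED_PYTHON = {(3, 10), (3, 11)}


def check_environment(version: tuple[int, int], free_bytes: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if version not in SUPPORTED_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found; 3.10 or 3.11 expected")
    if free_bytes < MIN_FREE_GB * 1024**3:
        free_gb = free_bytes / 1024**3
        problems.append(f"only {free_gb:.1f} GB free; {MIN_FREE_GB} GB or more expected")
    return problems


if __name__ == "__main__":
    # Check the running interpreter and the disk holding the current directory.
    issues = check_environment(sys.version_info[:2], shutil.disk_usage(".").free)
    print("\n".join(issues) if issues else "Environment looks ready.")
```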

Detailed Outline (90 minutes)

Introduction and Setup (15 minutes)
- RAG system architecture overview
- Setting up the development environment
- Installing Docling and dependencies
Document Processing with Docling (25 minutes)
- Understanding Docling's document processing capabilities
- Comparing traditional PDF extraction vs. Docling's advanced parsing
- Advanced extraction of tables, images, and complex layouts
- Hands-on exercise: Processing sample documents with rich content
Building the RAG Pipeline (25 minutes)
- Creating rich vector embeddings that preserve document structure
- Integration with LLM frameworks
- Hands-on exercise: Building a complete RAG pipeline
Best Practices and Production Considerations (15 minutes)
- Performance optimization techniques
- Using accelerators
- Deploying Docling as an API service with docling-serve (https://github.com/docling-project/docling-serve)
- Creating effective evaluations
Q&A and Interactive Problem Solving (10 minutes)
- Addressing participant questions
- Troubleshooting common issues
- Discussion of real-world applications
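
The retrieval and evaluation steps in the outline can be sketched without any external dependencies. The example below uses a toy bag-of-words "embedding" in place of a real embedding model, ranks chunks by cosine similarity, assembles a prompt, and computes a simple top-k hit rate as one possible evaluation. All function names here are hypothetical, and a real pipeline would swap in a proper embedding model and vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real pipeline would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(chunks)), key=lambda i: cosine(q, embed(chunks[i])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Assemble an LLM prompt from the retrieved chunks."""
    context = "\n\n".join(chunks[i] for i in retrieve(question, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


def hit_rate(cases: list[tuple[str, int]], chunks: list[str], k: int = 2) -> float:
    """Fraction of questions whose expected chunk index appears in the top k."""
    hits = sum(1 for question, expected in cases if expected in retrieve(question, chunks, k))
    return hits / len(cases) if cases else 0.0
```

A small set of question and expected-chunk pairs, scored with something like `hit_rate`, is often enough to notice whether a change to parsing or chunking helped or hurt retrieval.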

Materials

https://github.com/KrishnaRekapalli/docling-rag-tutorial-pydata-2025

Pre-work

Make sure that you have a Hugging Face access token or a Replicate API key for LLM inference. Both platforms offer some free inference credit without a credit card. Another option is to run models locally with Ollama. For more details, see https://github.com/KrishnaRekapalli/docling-rag-tutorial-pydata-2025
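
One way to confirm which inference option is available on your machine is to look for the credentials in environment variables. The sketch below assumes the variable names `HF_TOKEN` and `REPLICATE_API_TOKEN`, which the respective client libraries conventionally read; the tutorial repository may expect different names, so check its README. `choose_backend` is a hypothetical helper.

```python
import os


def choose_backend(env: dict[str, str]) -> str:
    """Pick an LLM backend from available credentials, falling back to local Ollama."""
    # Variable names are assumptions; adjust them to match the tutorial repository.
    if env.get("HF_TOKEN"):
        return "huggingface"
    if env.get("REPLICATE_API_TOKEN"):
        return "replicate"
    return "ollama"


if __name__ == "__main__":
    print(f"LLM backend: {choose_backend(dict(os.environ))}")
```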

Key Takeaways

Participants will leave the tutorial with:
- Practical experience in building RAG systems
- Understanding of document processing best practices
- Ability to extract and utilize information from complex document elements
- Hands-on experience comparing traditional vs. advanced extraction methods
- Knowledge of common pitfalls and how to avoid them
- Strategies for handling tables and images in RAG systems
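
As one example of a strategy for tables, an extracted table can be serialized before embedding, either as a single Markdown block or as one self-describing sentence per row. The sketch below shows both using only the standard library. The function names and the row-sentence format are hypothetical, not something Docling prescribes.

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a table as a Markdown block, suitable for a single chunk."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def table_to_row_sentences(caption: str, header: list[str], rows: list[list[str]]) -> list[str]:
    """Serialize each row as a sentence that repeats the caption and column names."""
    return [
        f"{caption}: " + "; ".join(f"{h} = {v}" for h, v in zip(header, row))
        for row in rows
    ]
```

Row sentences repeat the column names in every chunk, which costs some space but lets a single retrieved row be understood without the rest of the table.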

Prior Knowledge Expected: No previous knowledge expected

Krishna Rekapalli

Krishna is a Senior Data Scientist at IBM's Watsonx.ai Solution Architecture Center of Excellence, specializing in designing and implementing enterprise-scale LLM-powered AI solutions and agentic workflows. With over 7 years of experience building machine learning applications, they bring extensive expertise in hybrid cloud architectures, geospatial data analysis, and artificial intelligence. At IBM, they work directly with clients to architect and deploy production-ready AI solutions, focusing on practical implementation challenges and scalable architectures.
