PyData Seattle 2025

Synthetic Data for LLMs and Multi-Agent Document Workflows
2025-11-07, Talk Track 2

Testing and building document-based AI agents typically requires access to real documents, which can be sensitive or hard to obtain. Synthetic data can be a safe and flexible alternative to real data. This talk explores how synthetic data can be used to prototype and evaluate document-based AI agents without requiring access to sensitive or proprietary documents.
In this session, we will cover why synthetic data is increasingly relevant for LLM training as well as for the development of successful AI agent workflows. We will address the challenges and expectations for synthetic document generation, such as capturing the structure, semantics, and dependencies found in real-world documents.
Furthermore, we will cover how to integrate synthetic data into the design of multi-agent systems while ensuring the coordination, reliability, and reproducibility of such systems. Finally, we will walk through a practical workflow for document retrieval, showing how synthetic data can accelerate experimentation and improve trust in AI-agent-driven systems.
By focusing on both the role of synthetic data and the complexity of AI agent workflow design, this talk highlights hands-on strategies for safe experimentation while addressing the challenges of reproducibility and scalability.


Developing AI agents for documents typically requires access to real-world data, which is often sensitive or difficult to obtain. Synthetic data can be a safe and flexible alternative as it enables access to data without exposing private and proprietary information.
In this session, we will explore why synthetic documents can improve LLM training as well as the outcome and reliability of multi-agent workflows. We’ll discuss the challenges of generating realistic corpora that capture structure, semantics, and dependencies, and how overcoming these challenges enables scalable and reproducible experimentation.
By the end of the talk, you will understand:

  • Why to adopt synthetic data for LLM training and AI agent development
  • What the challenges of generating high-quality synthetic documents are
  • How synthetic data strengthens the design of multi-agent workflows
  • A practical example of a retrieval workflow and synthetic data integration that demonstrates these concepts in action
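
As a rough illustration of the last point, the sketch below generates a small synthetic corpus and uses it to score a toy retriever. It is not taken from the talk: the invoice template, the vendor and item names, and the helpers `make_corpus`, `retrieve`, and `top1_accuracy` are all hypothetical, and the token-overlap retriever is only a stand-in for whatever retrieval component an agent would actually use. The point it shows is that synthetic documents come with known ground truth, so a retrieval step can be evaluated reproducibly from a fixed seed.

```python
import random

# Hypothetical, made-up values used only for this illustration.
VENDORS = ["Acme Corp", "Globex", "Initech", "Umbrella"]
ITEMS = ["laptops", "chairs", "licenses", "cables"]


def make_invoice(rng, doc_id):
    """Build one synthetic invoice whose fields depend on each other."""
    vendor = rng.choice(VENDORS)
    item = rng.choice(ITEMS)
    qty = rng.randint(1, 20)
    unit_price = rng.randint(10, 500)
    total = qty * unit_price  # dependency: total is derived from qty and price
    text = (
        f"Invoice {doc_id} from {vendor}: {qty} {item} "
        f"at {unit_price} USD each, total {total} USD."
    )
    return {
        "id": doc_id, "vendor": vendor, "item": item,
        "qty": qty, "unit_price": unit_price, "total": total, "text": text,
    }


def make_corpus(n, seed=0):
    """Generate n invoices; the same seed always yields the same corpus."""
    rng = random.Random(seed)
    return [make_invoice(rng, f"INV-{i:04d}") for i in range(n)]


def tokenize(text):
    """Lowercase the text and split it into a set of tokens."""
    for ch in ":,.":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(query, corpus):
    """Toy retriever: return the document sharing the most tokens with the query."""
    q = tokenize(query)
    return max(corpus, key=lambda d: len(q & tokenize(d["text"])))


def top1_accuracy(corpus):
    """Query each document by its known fields and check it is ranked first."""
    hits = 0
    for doc in corpus:
        query = f"invoice {doc['id']} {doc['vendor']}"
        hits += retrieve(query, corpus)["id"] == doc["id"]
    return hits / len(corpus)


corpus = make_corpus(50, seed=42)
print(f"top-1 accuracy: {top1_accuracy(corpus):.2f}")
```

Because every query is built from fields the generator itself produced, no manual labeling is needed, and rerunning with the same seed reproduces the same corpus and the same score.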

This session is aimed at data scientists and engineers who want to explore how to develop accurate and reliable LLM and AI Agent systems while overcoming challenges such as data access and data availability.


Prior Knowledge Expected:

No previous knowledge expected

Fabiana Clemente is an entrepreneur and startup founder with a background in data science and AI. She has led the development of solutions that improve data quality and leverage synthetic data and generative AI to accelerate innovation. A current maintainer and active contributor to the open-source project ydata-profiling, one of the most widely adopted open-source EDA technologies, Fabiana is also an advocate for privacy-preserving approaches to data and AI.
She has authored research, received industry awards, and frequently speaks at international conferences. Her work bridges data engineering, machine learning, and responsible AI, with a focus on making advanced workflows, from document intelligence to multi-agent systems, more practical and reliable.