09-26, 11:05–11:40 (Europe/Amsterdam), Apollo
Recently, the integration of Generative AI (GenAI) technologies into both our personal and professional lives has surged. In most organizations, the deployment of GenAI applications is on the rise, and this trend is expected to continue for the foreseeable future. Evaluating GenAI systems poses challenges that traditional ML does not: chief among them is the absence of ground truth for textual metrics such as text clarity, location extraction accuracy, and factual accuracy. At the same time, the non-negligible cost of serving these models demands an even more thorough evaluation of any system before it is deployed to production.
Defining ground truth for such metrics is a costly and time-consuming process that requires human annotation. To address this, we will present how to evaluate LLM-based applications by leveraging LLMs themselves as evaluators. We will also outline the complexities of, and evaluation methods for, LLM-based Agents, which operate autonomously and therefore present further evaluation challenges. Lastly, we will explore the critical role of evaluation in the GenAI lifecycle and outline the steps needed to integrate these processes seamlessly.
Whether you are an AI practitioner, user, or enthusiast, join us to gain insights into the future of GenAI evaluation and its impact on enhancing application performance.
The accelerating integration of LLMs and autonomous Agents into industry applications highlights an urgent need for robust evaluation methodologies, a step that is often bypassed or performed inadequately. This talk addresses that gap by providing a comprehensive guide to why and how to evaluate both single LLM outputs and LLM-driven Agent behaviors.
We will delve into the nuances that make GenAI evaluation distinct, present the "LLM-as-a-judge" methodology for single LLMs, and extend these principles to the multifaceted evaluation of LLM-based Agents, covering key metrics and operational complexities. Crucially, we will showcase how to implement these evaluations using accessible open-source frameworks.
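To give a flavor of the "LLM-as-a-judge" idea ahead of the session, below is a minimal sketch in Python. It is not the exact setup or framework demonstrated in the talk: the judge model (gpt-4o-mini), the rubric, and the 1-5 scoring scale are illustrative assumptions.

```python
# Minimal LLM-as-a-judge sketch (illustrative; the judge model, rubric, and
# 1-5 scale are assumptions, not the frameworks shown in the talk).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the ANSWER to the QUESTION on a 1-5 scale for factual accuracy and clarity.
Respond with JSON only: {{"factual_accuracy": <1-5>, "clarity": <1-5>, "reasoning": "<one sentence>"}}

QUESTION: {question}
ANSWER: {answer}
"""

def judge(question: str, answer: str) -> dict:
    """Ask a (hopefully stronger) LLM to grade another model's answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model: an assumption for this sketch
        temperature=0,        # keep the scoring as deterministic as possible
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    # Assumes the judge returns plain JSON; production code would validate and retry.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    scores = judge(
        question="Where does the session take place?",
        answer="The session takes place in the Apollo room.",
    )
    print(scores)  # e.g. {"factual_accuracy": 5, "clarity": 5, "reasoning": "..."}
```

The same pattern generalizes: swap the rubric to score other ground-truth-free metrics such as text clarity or location extraction accuracy.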
This session is designed for a diverse audience including GenAI practitioners, product managers, and technical leaders shaping GenAI strategy, alongside enthusiasts keen to understand how these powerful models can be reliably assessed. A basic familiarity with GenAI technologies and a high-level understanding of how single LLMs and LLM-based agents function will be beneficial.
An outline of the talk is as follows:
- The GenAI Evaluation Imperative: Why traditional metrics fall short and the risks of inadequate GenAI evaluation.
- The Uniqueness of GenAI Models: What makes LLM and Agent evaluation fundamentally different?
- LLM-as-a-Judge for Single LLMs:
  - Theory: Principles of using an LLM to evaluate text quality
  - Practice: Demonstrating evaluation of LLMs using open-source frameworks
- Evaluating LLM-Based Autonomous Agents:
  - Methodology: Defining success for Agents, assessing tool use, task completion, and consistency
  - Key Metrics & Challenges: Beyond single responses – evaluating multi-turn interactions and autonomy
  - Open-source tools for Agent evaluation (a minimal sketch follows this outline)
- Embedding Evaluation in the GenAI Lifecycle: From development and testing to continuous monitoring in production.
- Conclusion & Key Takeaways
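As a hint of what trace-based Agent evaluation can look like in code, here is a minimal self-contained sketch. The trace format and the metrics (tool recall, turn budget, judged answer score) are illustrative assumptions rather than any specific framework's API; in practice the `answer_judge` callable could be the LLM-as-a-judge function sketched earlier.

```python
# Minimal sketch of trace-based Agent evaluation (the trace format and metrics
# below are illustrative assumptions, not a specific framework's API).
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class AgentTrace:
    tool_calls: list[ToolCall]  # tools invoked by the agent, in order
    final_answer: str           # agent's final response to the user
    turns: int                  # number of reasoning/interaction turns taken

def evaluate_trace(trace: AgentTrace, expected_tools: list[str],
                   max_turns: int, answer_judge) -> dict:
    """Score one agent run on tool use, efficiency, and task completion.

    `answer_judge` is any callable returning a 0-1 quality score for the final
    answer; it could be an LLM-as-a-judge call like the one sketched above.
    """
    called = [c.name for c in trace.tool_calls]
    return {
        # Did the agent invoke every tool the task requires (order-insensitive here)?
        "tool_recall": len(set(expected_tools) & set(called)) / len(expected_tools),
        # Did it stay within the turn budget (a crude autonomy/efficiency check)?
        "within_turn_budget": trace.turns <= max_turns,
        # Is the final answer acceptable according to the judge?
        "answer_score": answer_judge(trace.final_answer),
    }

if __name__ == "__main__":
    trace = AgentTrace(
        tool_calls=[ToolCall("search_flights", {"to": "AMS"}),
                    ToolCall("book_flight", {"id": "KL1234"})],
        final_answer="Your flight to Amsterdam is booked.",
        turns=4,
    )
    report = evaluate_trace(trace,
                            expected_tools=["search_flights", "book_flight"],
                            max_turns=6,
                            answer_judge=lambda a: 1.0 if "booked" in a else 0.0)
    print(report)  # {'tool_recall': 1.0, 'within_turn_budget': True, 'answer_score': 1.0}
```

In the talk, these checks are run over many traces with open-source evaluation tooling rather than hand-rolled code; the sketch only illustrates the kind of signals collected per run.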
By the end of this talk, attendees will possess a solid understanding of why GenAI evaluation is critical, be equipped with established methods for evaluating both single LLMs and LLM-based agents using the LLM-as-a-judge paradigm, and know which open-source frameworks can facilitate these processes. This knowledge will empower them to build more reliable, trustworthy, and effective GenAI applications.