George Chouliaras
AI practitioner with more than 7 years of experience building and deploying AI systems. I have a particular interest in software quality for AI systems and in the development and evaluation of GenAI systems.
Session
Recently, the integration of Generative AI (GenAI) technologies into both our personal and professional lives has surged. In most organizations, the deployment of GenAI applications is on the rise, and this trend is expected to continue for the foreseeable future. Evaluating GenAI systems presents unique challenges not found in traditional ML. The main difficulty is the absence of ground truth for textual metrics such as text clarity, location-extraction accuracy, and factual accuracy. Nevertheless, the non-negligible model-serving cost demands an even more thorough evaluation before a system is deployed to production.
Defining the metric ground truth is a costly and time-consuming process requiring human annotation. To address this, we will present how to evaluate LLM-based applications by leveraging LLMs themselves as evaluators. Moreover, we will outline the complexities of and evaluation methods for LLM-based agents, which operate autonomously and present further evaluation challenges. Lastly, we will explore the critical role of evaluation in the GenAI lifecycle and outline the steps taken to integrate these processes seamlessly.
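To make the LLM-as-evaluator idea concrete, here is a minimal sketch in Python. It is illustrative only, not code from the session: it assumes the OpenAI Python client, a hypothetical clarity rubric, and an OPENAI_API_KEY in the environment.

```python
# Minimal LLM-as-a-judge sketch: score an answer's clarity without ground truth.
# The rubric, model choice, and prompt wording are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluator. Rate the clarity of the answer below
on a scale from 1 (incomprehensible) to 5 (perfectly clear).
Reply with the number only.

Question: {question}
Answer: {answer}"""


def judge_clarity(question: str, answer: str) -> int:
    """Ask one LLM to grade another LLM's answer, returning a 1-5 clarity score."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())


print(judge_clarity(
    "What is RAG?",
    "RAG retrieves relevant documents and feeds them to the model as context.",
))
```

In practice such judge scores are usually calibrated against a small human-annotated sample before being trusted at scale.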
Whether you are an AI practitioner, user, or enthusiast, join us to gain insights into the future of GenAI evaluation and its impact on improving application performance.