PyData Tel Aviv 2025

Evaluating Your AI Agent: How Do You Properly Measure Performance?
2025-11-05, AI

AI agents are becoming the next big thing. But deploying an agent without truly understanding its performance, limits, and potential failure points is a high-stakes gamble. How do you ensure your agent is not just functional, but genuinely reliable, robust, and safe?
This talk explores the practical challenges of evaluating AI agents effectively. We'll show how to define meaningful success metrics, implement comprehensive testing strategies that reflect real-world complexity, and meaningfully incorporate human feedback. You'll leave with a practical framework to confidently assess your agent's capabilities and ensure reliable performance when stakes are high.


Throughout the talk we will give examples from classic LLM evaluation alongside their counterparts in the realm of agents, showcasing what not to do and highlighting common pitfalls. The talk provides a practical guide for evaluating LLM agents, addressing the critical need for reliable, robust, and safe systems.
Using a running example agent throughout, we'll illustrate common evaluation pitfalls and demonstrate effective techniques. We'll begin by clearly differentiating LLM agent evaluation from base LLM evaluation, highlighting the added complexities of evaluating planning and tool interaction sequences. We will discuss the role and limitations of current benchmarks such as AgentBench and τ-bench.
The core of the talk focuses on three complementary evaluation perspectives crucial for a holistic view (see the sketch after this list):
Final Response: Assessing the quality and correctness of the agent's ultimate output.
Trajectory: Analyzing the sequence of steps and tool calls taken to reach the conclusion.
Single Step: Examining the validity of individual decisions within the trajectory (e.g., tool selection).
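To make the three perspectives concrete, here is a minimal sketch (not the speakers' implementation) of scoring a single agent run along each axis. The AgentRun/Step structures, the exact-match scorer, and the example data are illustrative assumptions, not part of the talk material.

```python
# Minimal sketch: scoring one agent run along the three evaluation perspectives.
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str          # tool the agent chose at this step
    arguments: dict    # arguments passed to the tool

@dataclass
class AgentRun:
    question: str
    final_answer: str
    steps: list[Step] = field(default_factory=list)

def score_final_response(run: AgentRun, reference_answer: str) -> float:
    """Final response: does the ultimate output match the reference?"""
    return 1.0 if run.final_answer.strip().lower() == reference_answer.strip().lower() else 0.0

def score_trajectory(run: AgentRun, expected_tools: list[str]) -> float:
    """Trajectory: how closely does the sequence of tool calls follow the expected plan?"""
    taken = [s.tool for s in run.steps]
    matches = sum(1 for a, b in zip(taken, expected_tools) if a == b)
    return matches / max(len(expected_tools), 1)

def score_single_step(step: Step, allowed_tools: set[str]) -> float:
    """Single step: was this individual tool selection even valid?"""
    return 1.0 if step.tool in allowed_tools else 0.0

# Illustrative run and reference data.
run = AgentRun(
    question="What was Q3 revenue?",
    final_answer="$1.2M",
    steps=[Step("sql_query", {"table": "revenue"}), Step("summarize", {})],
)
print(score_final_response(run, "$1.2M"))                           # 1.0
print(score_trajectory(run, ["sql_query", "summarize"]))            # 1.0
print(score_single_step(run.steps[0], {"sql_query", "summarize"}))  # 1.0
```

In practice the exact-match and sequence-overlap scorers would be replaced with task-specific metrics, but the three-way split of final response, trajectory, and single step stays the same.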
Furthermore, we'll delve into using LLMs as judges, covering essential calibration techniques and strategies for establishing overall trust and confidence in your evaluation process. Attendees will leave with a clear, actionable framework for implementing these modern LLM agent evaluation methods.
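As a rough illustration of the LLM-as-a-judge idea and why calibration matters, here is a minimal sketch under stated assumptions: the rubric prompt is an example only, and `call_llm` is a hypothetical stand-in for whatever model client you use.

```python
# Minimal sketch: an LLM-as-a-judge rubric plus a simple calibration check
# against a small set of human-labelled examples.
JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Agent answer: {answer}
Reference answer: {reference}
Reply with a single integer score from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with your actual model client.
    raise NotImplementedError

def judge(question: str, answer: str, reference: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer, reference=reference))
    return int(raw.strip())

def calibrate(judge_scores: list[int], human_scores: list[int], tolerance: int = 1) -> float:
    """Fraction of examples where the judge lands within `tolerance` of the human label.
    Low agreement means the rubric or judge model needs adjusting before you trust it."""
    agree = sum(1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tolerance)
    return agree / max(len(human_scores), 1)

# Example: compare judge scores against five human-labelled answers.
print(calibrate([5, 4, 2, 5, 3], [5, 5, 1, 4, 3]))  # 1.0 -> every judge score within 1 of the human label
```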


Prior Knowledge Expected: Previous knowledge expected

Linoy Cohen is a Senior Data Scientist on the NLP team at Intuit. As part of her role, she leads the evaluation track and is responsible for creating automatic evaluations for LLMs and agents that provide an objective way to measure their capabilities against specific, custom criteria and needs.

Shirli is a senior AI scientist at Intuit, where she brings cutting-edge innovation to life through generative models and agentic AI. Her areas of expertise span reinforcement learning, LLM training and evaluation, NLP, classical machine learning, and the design of intelligent agents.

Shirli holds a Ph.D. and M.Sc. in Electrical and Computer Engineering from the Technion, specializing in Reinforcement Learning, and a B.Sc. in Biomedical Engineering from Ben Gurion University.