06-07, 11:05–11:50 (Europe/London), Grand Hall
AI agents and multi-step workflows are powerful, but testing them can be tricky. This talk explores practical ways to test these complex systems — like running multi-step simulations, checking tool calls, and using LLMs for evaluation. You'll also learn how to prioritize what to test and set up session-level evaluations with open-source tools.
AI agents and multi-step AI workflows are incredibly powerful, but they can also be risky to deploy and even riskier to change. You don't want your users to be the ones finding the bugs, yet it is often unclear how to test such complex systems in advance. Traditional unit tests and ML evaluation methods fall short when interactions unfold unpredictably across an entire session.
In this talk, we'll break down practical ways to test compound AI systems, including chatbots and AI agents. We'll cover:
- Strategies for testing complex, multi-step AI systems
- Specific approaches, from checking the correctness of tool calls to running multi-step simulations (a minimal sketch follows this list)
- How to automate evaluation using both LLM-as-a-judge and deterministic checks (see the second sketch below)
- How to prioritize testing, balancing edge cases, adversarial scenarios, and core user experiences
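As a taste of what these checks can look like in practice, here is a minimal Python sketch of deterministic tool-call checks inside a simulated multi-step session. The `run_agent` stub, the tool names, and the refund scenario are hypothetical stand-ins for illustration, not part of any specific framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Session:
    tool_calls: list[ToolCall] = field(default_factory=list)


def run_agent(user_turns: list[str]) -> Session:
    # Hypothetical stub for the agent under test: a real implementation
    # would drive the agent with these turns and record every tool call.
    return Session(tool_calls=[
        ToolCall("lookup_order", {"order_id": "A-123"}),
        ToolCall("create_refund", {"order_id": "A-123", "reason": "damaged"}),
    ])


def check_tool_sequence(calls: list[ToolCall], expected_names: list[str]) -> bool:
    """Deterministic check: the agent called the expected tools, in order."""
    return [c.name for c in calls] == expected_names


def check_required_arguments(call: ToolCall, required: set[str]) -> bool:
    """Deterministic check: a tool call carries all required arguments."""
    return required <= set(call.arguments)


# Multi-step simulation: a scripted user drives the agent over several turns,
# then we assert on the tool calls produced across the whole session.
session = run_agent([
    "I want a refund for order A-123",
    "Yes, it arrived damaged",
])
assert check_tool_sequence(session.tool_calls, ["lookup_order", "create_refund"])
assert check_required_arguments(session.tool_calls[0], {"order_id"})
```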
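A companion sketch shows how an LLM-as-a-judge can sit alongside a cheap deterministic check on the same output. Here `call_llm` is a hypothetical stand-in for whichever LLM client you use, and the rubric and PASS/FAIL labels are assumptions chosen for illustration.

```python
JUDGE_PROMPT = """You are grading a support assistant's answer.
Question: {question}
Answer: {answer}
Does the answer address the question without promising anything the
assistant cannot actually do? Reply with exactly one word: PASS or FAIL."""


def call_llm(prompt: str) -> str:
    # Hypothetical stub: replace with a real call to your judge model.
    return "PASS"


def judge_answer(question: str, answer: str) -> bool:
    # LLM-as-a-judge: ask a model to grade the answer against a rubric.
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")


answer = "You can request a refund from your order page within 30 days."
assert "refund" in answer.lower()                      # deterministic check
assert judge_answer("How do I get a refund?", answer)  # LLM-as-a-judge
```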
We'll also share how you can configure and run session-level evaluation using open-source tools.
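To make "session-level" concrete, here is a plain-Python sketch of the idea, deliberately not tied to any particular library's API: whole sessions are scored rather than individual responses, and results are aggregated across a small test suite. The checks, data shapes, and pass criteria are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user: str
    assistant: str


@dataclass
class SessionResult:
    session_id: str
    passed: bool
    reasons: list[str]


def evaluate_session(session_id: str, turns: list[Turn]) -> SessionResult:
    # Session-level checks look at the conversation as a whole,
    # not at individual responses in isolation.
    reasons = []
    if len(turns) > 10:
        reasons.append("took too many turns to resolve the request")
    if turns and "sorry" in turns[-1].assistant.lower():
        reasons.append("session ends on an apology instead of a resolution")
    return SessionResult(session_id, passed=not reasons, reasons=reasons)


# A tiny illustrative suite of recorded sessions.
sessions = {
    "s1": [Turn("I want a refund", "Sure, I have started a refund for order A-123.")],
    "s2": [Turn("Cancel my plan", "Sorry, I can't help with that.")],
}
results = [evaluate_session(sid, turns) for sid, turns in sessions.items()]
pass_rate = sum(r.passed for r in results) / len(results)
print(f"Session pass rate: {pass_rate:.0%}")
for r in results:
    if not r.passed:
        print(r.session_id, r.reasons)
```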
No previous knowledge expected
Emeli Dral is a Co-founder and CTO at Evidently AI, a startup developing open-source tools to evaluate, test, and monitor the performance of AI systems.
Earlier, she co-founded an industrial AI startup and served as the Chief Data Scientist at Yandex Data Factory. She has led over 50 applied ML projects across industries, from banking to manufacturing. Emeli is a data science lecturer at Harbour.Space University and a co-author of the Machine Learning and Data Analysis curriculum on Coursera, which has over 100,000 students.