09-26, 11:50–12:25 (Europe/Amsterdam), Apollo
In this brave new world of vibe coding and YOLO-to-prod mentality, let’s take a step back and keep things grounded (pun intended). None of us would ever deploy a classical ML model to production without clearly defined metrics and proper evaluation, so let's talk about methodologies for measuring the performance of LLM-powered chatbots. Think of retriever recall, answer relevancy, correctness, faithfulness and hallucination rates. With the wild west of metric standards still in full swing, I’ll guide you through the challenges of curating a synthetic test set and selecting suitable metrics and open-source packages that help you evaluate your use case. Everything is possible, from simple LLM-as-a-judge approaches, as built into many packages such as MLflow, up to complex multi-step quantification approaches with Ragas. If you work in the GenAI space or with LLM-powered chatbots, this session is for you! Prior background knowledge is an advantage, but not required.
[min 0-4] Introduction: A brief overview of common LLM-powered chatbots (RAG, agentic). The focus will be on the high-level structure, highlighting the components we’ll want to evaluate later (retriever, QA-chain).
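To make the two evaluation targets concrete, here is a deliberately minimal, framework-agnostic sketch; `vector_store` and `llm` are hypothetical stand-ins for whatever retriever backend and model client you actually use.

```python
# Illustrative sketch only: a bare-bones RAG pipeline with the two components evaluated later.
# `vector_store` and `llm` are hypothetical objects standing in for your retriever backend and LLM client.

def retrieve(question: str, vector_store, k: int = 4) -> list[str]:
    """Retriever: fetch the k most relevant context chunks (later scored with Context Precision/Recall)."""
    return vector_store.similarity_search(question, k=k)

def answer(question: str, contexts: list[str], llm) -> str:
    """QA-chain: generate an answer grounded in the retrieved context
    (later scored with Answer Relevancy, Answer Correctness and Faithfulness)."""
    context_block = "\n\n".join(contexts)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt)
```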
[min 4-5] Motivation: To optimise performance, we first need to measure the performance of the individual components. We need a standardized measure to understand whether optimisation leads to measurable improvements.
[min 5-7] Step 1, the evaluation dataset: Curating a suitable test set is key for comprehensive metrics. Manually creating a test set is tedious; Ragas offers a set of tools that allow you to generate a synthetic test set. I will give an overview of the tools for test set generation and show examples. The examples will later be used with the evaluation metrics.
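As a rough preview of what test set generation looks like in code, here is a minimal sketch assuming the Ragas 0.1-style API together with LangChain and OpenAI models; imports and parameter names differ between Ragas versions, and `load_knowledge_base()` is a hypothetical placeholder for your own document loader.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

# `documents` should be a list of LangChain Document objects from your knowledge base;
# load_knowledge_base() is a hypothetical helper, replace it with your own loader.
documents = load_knowledge_base()

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o-mini"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=20,  # number of synthetic question/ground-truth pairs
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
print(testset.to_pandas().head())  # columns include question, contexts, ground_truth
```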
[min 7-20] Step 2, metrics for RAG evaluation: I will showcase, with examples, a set of metrics that allows for a fast comparison and covers the complete RAG pipeline. The goal is not to be exhaustive, but rather to give attendees a set of metrics that have been shown to be useful in combination with each other. This covers (see the code sketches after this list):
- Metrics to evaluate the document retriever performance: Context Precision (the proportion of retrieved documents that are relevant to the query) and Context Recall (the proportion of relevant documents that were successfully retrieved)
- QA metrics: Evaluate the chatbot response with respect to the user question in terms of Answer Relevancy, with respect to the ground truth in terms of Answer Correctness, and with respect to the retrieved context in terms of Faithfulness
- Open-source packages: For the metrics above, we’ll compare how various packages (MLflow, DeepEval, Ragas) calculate them
- A custom metric, Hallucination Rate, that measures how likely the chatbot’s answers are to contain hallucinated (unsupported) content
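The RAG metrics above can be computed in a single call. The sketch below assumes the Ragas 0.1-style `evaluate` API, where the test set is a Hugging Face `Dataset` with `question`, `answer`, `contexts` and `ground_truth` columns; column and metric names may differ in newer Ragas versions, and the single example row is purely illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per test question; `contexts` holds the chunks your retriever returned,
# `ground_truth` comes from the (synthetic) test set.
data = Dataset.from_dict({
    "question": ["What does the retriever do?"],
    "answer": ["It fetches the most relevant chunks for a query."],
    "contexts": [["The retriever returns the top-k chunks for a query."]],
    "ground_truth": ["The retriever selects the most relevant document chunks."],
})

result = evaluate(
    data,
    metrics=[context_precision, context_recall, answer_relevancy, answer_correctness, faithfulness],
)
print(result)  # dict-like object with one score per metric
```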
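For comparison, here is the same idea expressed with DeepEval's test-case abstraction; this is a sketch assuming DeepEval's metric classes for retrieval and answer quality, and the example values are again illustrative.

```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

# A single test case bundling question, chatbot answer, ground truth and retrieved context.
test_case = LLMTestCase(
    input="What does the retriever do?",
    actual_output="It fetches the most relevant chunks for a query.",
    expected_output="The retriever selects the most relevant document chunks.",
    retrieval_context=["The retriever returns the top-k chunks for a query."],
)

evaluate(
    test_cases=[test_case],
    metrics=[
        ContextualPrecisionMetric(),
        ContextualRecallMetric(),
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
    ],
)
```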
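The hallucination rate is not tied to any particular package; one simple way to implement it is an LLM-as-a-judge loop like the sketch below, where `judge` is a hypothetical callable that sends a prompt to your judge model and returns its text response.

```python
# Illustrative custom metric, not taken from any package: an LLM-as-a-judge hallucination rate.
# `judge` is a hypothetical callable: prompt string in, judge-model text response out.

JUDGE_PROMPT = """You are checking a chatbot answer against its retrieved context.
Context:
{context}

Answer:
{answer}

Does the answer contain any claim that is NOT supported by the context? Reply with YES or NO."""

def hallucination_rate(samples: list[dict], judge) -> float:
    """Fraction of answers the judge flags as containing unsupported claims."""
    flagged = 0
    for sample in samples:
        verdict = judge(JUDGE_PROMPT.format(
            context="\n".join(sample["contexts"]),
            answer=sample["answer"],
        ))
        flagged += verdict.strip().upper().startswith("YES")
    return flagged / len(samples)
```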
[min 20-25] Agentic and other metrics: Ragas also offers ways to evaluate the tool-calling workflow of agentic frameworks, e.g. through Tool Call Accuracy, as well as additional general-purpose metrics like the Summarization Score. The goal here is to give a short impression of these additional metrics that the package offers.
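As a taste of the agentic side, here is a sketch assuming the Ragas 0.2-style multi-turn sample and Tool Call Accuracy metric; class and module names may differ between Ragas versions.

```python
import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

# One multi-turn interaction: did the agent call the expected tool with the expected arguments?
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is the weather in Amsterdam right now?"),
        AIMessage(
            content="Let me check that for you.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "Amsterdam"})],
        ),
    ],
    reference_tool_calls=[ToolCall(name="weather_check", args={"location": "Amsterdam"})],
)

score = asyncio.run(ToolCallAccuracy().multi_turn_ascore(sample))
print(score)  # 1.0 when the agent's tool call matches the reference
```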
[min 25-30] Conclusion: To conclude, I offer tips and tricks on how to select the best metrics for your use case. The goal is that attendees walk away with concrete metrics they can use to evaluate their RAG/agentic frameworks and feel empowered to dive deeper into the functionality that the Ragas package offers.