PyData Amsterdam 2025

Measure twice, deploy once: Evaluation of retrieval systems
2025-09-25, Voyager

Improving retrieval systems—especially in RAG pipelines—requires a clear understanding of what’s working and what isn’t. The only scalable way to do that is through meaningful metrics. In this talk, we share insights from building a platform-agnostic search and retrieval product, and how we balance performance against cost. Bigger models often give better results… but at what price? We explain how to assess what’s “good enough” and why the choice of benchmark really matters.


We’ll dive into the metrics-based evaluation of retrieval systems for real-world RAG applications, covering both open- and closed-source models. Expect practical takeaways on managing trade-offs between model quality and cost, and on building evaluation pipelines that reflect production needs.

With a background in computational linguistics and a strong link to academia, Paul leads our DS & AI team. He focuses on practical, agnostic LLM applications with a strong open-source flavour.

After a PhD in computational physics, Marten transitioned from modelling solar cells to evaluating ML systems. He now works on building and assessing open-source retrieval pipelines at Sopra Steria.