04-18, 10:55–11:30 (US/Eastern), Auditorium 5
Discover how S&P Global built an enterprise-grade evaluation framework that transformed our GenAI deployment process. Through automated monitoring, expert validation, and continuous testing, we’ve streamlined the document integration step of our RAG tools while ensuring they maintain consistent quality and reliability.
In this talk, we will provide an in-depth look at how S&P Global built a comprehensive and reliable evaluation framework for our Generative AI (GenAI)-powered internal productivity tools, with a focus on our Market Intelligence (MI) Sales Assistant application.
We will begin by discussing the unique challenges of evaluating large language models (LLMs) and the importance of a robust evaluation strategy, especially for Retrieval Augmented Generation (RAG)-based systems. We’ll then dive into the key components of our framework:
• Metrics: We combine traditional metrics like accuracy, precision, and latency with LLM-specific metrics such as answer relevance, faithfulness to source, and hallucination detection. We’ll explain each metric and its role in assessing model performance, and discuss why custom metrics are often necessary in LLM applications (the second sketch after this list shows a toy custom metric).
• Question-Answer Pair Generation: We’ll share our process for generating diverse and representative question-answer pairs, including the models used, quality control measures, and lessons learned about promoting diversity in evaluation data (the first sketch after this list shows a simplified version of this step).
• Ground Truth Creation: Our framework relies heavily on subject matter experts (SMEs) to create and validate ground truth data. We’ll detail our process for engaging SMEs, documenting and versioning ground truth, and maintaining high standards.
• Evaluation Implementation: We’ll provide a technical overview of our framework, built on the MLflow library. We’ll cover our daily sampling process for continuous monitoring, the comprehensive testing triggered by new releases and document updates, and cost considerations (the second sketch after this list shows this pattern in miniature). We’ll also briefly survey evaluation tools available outside of MLflow.
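To make the question-generation step concrete, here is a minimal sketch of prompting an LLM to produce question-answer pairs from a single document chunk. It assumes an OpenAI-compatible chat completions endpoint; the prompt, model name, and JSON parsing are illustrative placeholders, not the production pipeline described in the talk.

```python
# Minimal sketch: generate question-answer pairs from one document chunk.
# Prompt wording, model name, and output parsing are illustrative only.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are generating evaluation data for a RAG assistant. "
    "From the document excerpt below, write {n} diverse question-answer pairs that "
    "can be answered using only the excerpt. Return a JSON list of objects with "
    "'question' and 'answer' keys.\n\nExcerpt:\n{chunk}"
)


def generate_qa_pairs(chunk: str, n: int = 3, model: str = "gpt-4o-mini") -> list[dict]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(n=n, chunk=chunk)}],
        temperature=0.7,  # some temperature helps diversity; dedupe afterwards
    )
    # A production pipeline would validate and repair the JSON before use.
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    chunk = "S&P Global Market Intelligence provides sector research and company financials."
    for pair in generate_qa_pairs(chunk):
        print(pair["question"], "->", pair["answer"])
```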
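And to illustrate how evaluation can be wired up, the sketch below registers a toy custom metric and runs mlflow.evaluate over a small static DataFrame standing in for a daily sample of logged question-answer pairs. It assumes MLflow 2.x; the column names, sample rows, and token-overlap heuristic are purely illustrative and are not the metrics used in the MI Sales Assistant framework.

```python
# Minimal sketch: a toy custom metric plugged into mlflow.evaluate over a static
# DataFrame that stands in for a daily sample of logged question/answer pairs.
import mlflow
import pandas as pd
from mlflow.metrics import MetricValue, make_metric


def _token_overlap(predictions, targets, metrics):
    # Crude proxy for answer quality: fraction of ground-truth tokens that appear
    # in the generated answer. Real relevance/faithfulness metrics are far richer.
    scores = []
    for answer, truth in zip(predictions, targets):
        answer_tokens = set(str(answer).lower().split())
        truth_tokens = set(str(truth).lower().split())
        scores.append(len(answer_tokens & truth_tokens) / max(len(truth_tokens), 1))
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": sum(scores) / len(scores)},
    )


answer_token_overlap = make_metric(
    eval_fn=_token_overlap, greater_is_better=True, name="answer_token_overlap"
)

# Hypothetical daily sample: production answers joined with SME-validated ground truth.
sample = pd.DataFrame(
    {
        "question": ["Which sectors does the latest MI report cover?"],
        "ground_truth": ["The report covers energy, financials, and technology."],
        "answer": ["It covers the energy, financials, and technology sectors."],
    }
)

with mlflow.start_run(run_name="daily-eval-sample"):
    results = mlflow.evaluate(
        data=sample,
        predictions="answer",
        targets="ground_truth",
        extra_metrics=[answer_token_overlap],
        evaluators="default",
    )
    print(results.metrics)  # e.g. {"answer_token_overlap/mean": ...}
```

The same call can be pointed at a larger sampled dataset on a schedule for continuous monitoring, or at the full evaluation set when a new release or document update triggers a comprehensive run.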
Throughout the talk, we’ll share real-world results and concrete lessons learned, such as effective strategies for question generation, SME engagement, and scaling evaluation processes. We’ll demonstrate our MI Sales Assistant and evaluation dashboard to illustrate the framework in action.
Attendees will come away with a clear understanding of what it takes to implement a robust evaluation framework for a real-world GenAI application. They’ll learn proven best practices and potential pitfalls, equipping them to ensure their own AI systems consistently deliver value.
Previous knowledge expected
MacKenzye Leroy is a Lead Data Scientist within S&P Global's newly formed MI Enterprise Technology & Internal Productivity Team, where he focuses on developing enterprise AI solutions to transform business operations. Working closely with stakeholders across Sales, Commercial, Legal, and Marketing, he implements AI-powered productivity solutions.
MacKenzye combines his M.S. in Data Science from the University of Virginia with his physics background to solve complex business challenges. His expertise spans artificial intelligence, machine learning, data pipeline development, anomaly detection, statistical analysis, and full-stack data science implementation - from initial concept through production deployment.
When not working with data, MacKenzye can be found exploring mountain trails by foot, bike, or snowboard, reading, or cheering on his beloved New York Mets.