PyData Global 2025

UQLM: Detecting LLM Hallucinations with Uncertainty Quantification in Python
2025-12-10, Machine Learning & AI

As LLMs become increasingly embedded in critical applications across healthcare, legal, and financial domains, their tendency to generate plausible-sounding but false information poses significant risks. This talk introduces UQLM, an open-source Python package for uncertainty-aware generation that flags likely hallucinations without requiring ground truth data. UQLM computes response-level confidence scores from token probabilities, consistency across sampled responses, LLM judges, and tunable ensembles. Attendees will learn practical strategies for implementing hallucination detection in production systems and leave with code examples they can immediately apply to improve the reliability of their LLM-powered applications. No prior uncertainty quantification background required.


Objective.

Show how to add uncertainty-aware controls to LLM apps using UQLM so practitioners can detect and handle hallucinations at generation time without ground truth data.

Context and Gap.

Many hallucination detection methods assume the existence of ground truth data, which is rarely available in production. Research has proposed ground-truth-free uncertainty quantification (UQ) techniques, but adoption suffers from fragmented tooling. UQLM packages these methods behind a simple API and provides a versatile suite of UQ-based confidence scorers that work across tasks.
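As a taste of that API, the sketch below scores a prompt with the consistency-based (black-box) scorer. It is a minimal example, assuming uqlm and langchain-openai are installed and an OpenAI key is configured; the class, method, and scorer names (BlackBoxUQ, generate_and_score, "semantic_negentropy") follow the UQLM README at the time of writing, so check them against your installed version.

    # Black-box (consistency-based) scoring: sample several responses per prompt
    # and score their mutual agreement; no ground truth or logprobs needed.
    import asyncio

    from langchain_openai import ChatOpenAI
    from uqlm import BlackBoxUQ

    async def main():
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)
        bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)
        results = await bbuq.generate_and_score(
            prompts=["When did the first human land on Mars?"],
            num_responses=5,
        )
        print(results.to_df())  # responses plus confidence scores in [0, 1]

    asyncio.run(main())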

What you will see.

  • Black-box UQ via response consistency from multiple samples
  • White-box UQ from token log probabilities (see the code sketch after this list)
  • LLM-as-a-judge scoring
  • Ensemble tuning and threshold selection for your use case
  • Patterns for routing: block, warn, or escalate to human review
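To make the white-box, judge, and ensemble items above concrete, here is a companion sketch under the same assumptions: the names WhiteBoxUQ, LLMPanel, UQEnsemble, and the scorer strings follow the UQLM README and should be verified against your installed version.

    # White-box, LLM-judge, and ensemble scoring on the same prompt.
    import asyncio

    from langchain_openai import ChatOpenAI
    from uqlm import LLMPanel, UQEnsemble, WhiteBoxUQ

    async def main():
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)
        prompts = ["Who was the first person to run a marathon on the Moon?"]

        # White-box: confidence from token log probabilities of a single
        # generation. Cheap (no extra calls), but the provider must expose logprobs.
        wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
        print((await wbuq.generate_and_score(prompts=prompts)).to_df())

        # LLM-as-a-judge: one or more judge models grade each original response.
        panel = LLMPanel(llm=llm, judges=[ChatOpenAI(model="gpt-4o-mini")])
        print((await panel.generate_and_score(prompts=prompts)).to_df())

        # Ensemble: combine consistency, logprob, and judge scorers into a
        # single tunable confidence score.
        uqe = UQEnsemble(llm=llm, scorers=["noncontradiction", "min_probability", llm])
        print((await uqe.generate_and_score(prompts=prompts, num_responses=5)).to_df())

    asyncio.run(main())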

Outline (30 minutes total).

  • 0–4: Why hallucinations matter in production
  • 4–8: Limits of traditional hallucination detection approaches and where UQ fits
  • 8–20: UQLM walkthrough and code examples
  • 20–24: Choosing thresholds and tuning ensembles
  • 24–27: Results on several use cases and interpreting confidence
  • 27–30: Q&A

Expected background.

Basic familiarity with LLMs and machine learning. No prior uncertainty quantification knowledge required.

Key takeaways.

  • When and why ground-truth-free hallucination detection is useful in production
  • How to add UQLM to a Python app in a few lines of code
  • Pros and cons of consistency-based, token-probability-based, and judge-based methods
  • Practical guidance on thresholds, ensemble tuning, and handling low-confidence outputs
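To ground the last point, here is a small framework-agnostic routing sketch; the route_response helper and its cutoffs are hypothetical placeholders, not UQLM defaults, and in practice you would calibrate them on a graded sample of your own traffic.

    # Illustrative routing helper: the function name and thresholds below are
    # hypothetical, not part of UQLM; calibrate cutoffs on graded data.
    def route_response(response: str, confidence: float,
                       block_below: float = 0.3, review_below: float = 0.7) -> dict:
        """Map a response-level confidence score in [0, 1] to an action."""
        if confidence < block_below:
            # Very low confidence: withhold the answer entirely.
            return {"action": "block", "response": None, "confidence": confidence}
        if confidence < review_below:
            # Middling confidence: return the answer but flag it for human review.
            return {"action": "escalate", "response": response, "confidence": confidence}
        # High confidence: serve the answer as-is.
        return {"action": "serve", "response": response, "confidence": confidence}

    print(route_response("The first Moon landing was in 1969.", confidence=0.92))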

Prior Knowledge Expected:

No

Dylan Bouchard is a Principal Applied Scientist focusing on AI Research & Open Source at CVS Health. He leads the company's Responsible AI Research program, where he developed two impactful open source libraries: UQLM, a toolkit for detecting hallucinations in large language models, and LangFair, a framework for evaluating bias and fairness in LLMs. His work bridges academic research with practical tools that help make AI systems more reliable and equitable.

I am a Senior Data Scientist at CVS Health working on Responsible AI and LLM/agentic systems. My expertise lies in the technical aspects of ethical AI, with a particular focus on bias and fairness testing. I am dedicated to identifying and mitigating biases in AI systems to ensure they are fair and equitable for all users. I also specialize in hallucination detection and mitigation for large language models (LLMs), multi-modal models, and AI agents, working to improve the reliability and trustworthiness of these technologies. My recent work includes the open-source libraries LangFair and UQLM.