2025-11-08, Room 301B
Prompt variation isn't just an engineering nuisance; it's a window into fundamental LLM limitations. When a model's accuracy drops from 95% to 75% due to minor rephrasing, we're not just seeing brittleness; we're potentially exposing data contamination, spurious correlations, and shallow pattern matching. This talk explores prompt variation as a powerful diagnostic tool for understanding LLM reliability. We discuss how small changes in format, phrasing, or ordering can cause accuracy to collapse, revealing that models have memorized benchmark patterns or learned superficial correlations rather than robust task representations. Drawing from academic and industry research, you will learn to distinguish between an LLM's true capability and memorization, identify when models are pattern-matching rather than reasoning, and build evaluation frameworks that expose these vulnerabilities before deployment.
Prompt variation serves as a diagnostic instrument for understanding what models actually know. By systematically changing how we ask questions while preserving their meaning, we can distinguish between three critical phenomena: genuine task understanding, dataset contamination that causes performance to collapse when phrasing changes, and shallow pattern matching that works for specific formats but fails when structure shifts. This distinction fundamentally changes how we interpret benchmark scores and deploy models in production.
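As a rough illustration of this kind of check, here is a minimal sketch in Python (not material from the talk; the `ask_model` callable, the example variants, and the substring-match scoring are all assumptions of this write-up). The same question is asked under several meaning-preserving phrasings and the per-variant accuracy spread is compared:

```python
# Minimal sketch, assuming a hypothetical `ask_model(prompt) -> str` callable;
# any LLM API could be plugged in. Illustrative only, not the talk's code.
def accuracy_per_variant(ask_model, variants, gold, n_trials=5):
    """Score each meaning-preserving phrasing separately. A wide spread across
    variants points to format sensitivity (memorization or shallow pattern
    matching) rather than robust task knowledge."""
    scores = {}
    for prompt in variants:
        hits = sum(gold in ask_model(prompt).lower() for _ in range(n_trials))
        scores[prompt] = hits / n_trials
    return scores

def robustness_gap(scores):
    """Best-case minus worst-case accuracy; 0.0 means phrasing did not matter."""
    return max(scores.values()) - min(scores.values())

# Example usage with several surface forms of the same question:
VARIANTS = [
    "What is the capital of Australia?",
    "Name the capital city of Australia.",
    "Australia's capital city is called what?",
    "Q: capital of Australia\nA:",
]
# scores = accuracy_per_variant(my_model, VARIANTS, gold="canberra")
# print(robustness_gap(scores))
```

A near-zero gap is consistent with genuine task understanding; a large gap suggests the model has learned the canonical phrasing rather than the task itself.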
This talk explores fundamental questions about LLM evaluation and behavior through the lens of prompt variation:
- What does prompt sensitivity reveal about model understanding?
- Can prompt variation reliably detect data contamination?
- How do different prompt perturbations expose weaknesses in model performance?
Evaluation Framework and Benchmarking. The talk presents practical approaches for implementing variation-based evaluation, sharing methods for generating semantics-preserving perturbations that keep the task intent fixed while varying surface form. We show how to construct contamination-resistant benchmarks that cannot be gamed through memorization, using variation as a built-in defense against dataset leakage.
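One concrete way to realize this idea (an illustrative sketch under an assumed multiple-choice item schema and prompt templates, not the talk's implementation) is to regenerate each item at evaluation time with shuffled option order and a randomly rotated template, so that a memorized question-to-letter mapping stops paying off:

```python
# Illustrative sketch: the MCItem schema and TEMPLATES are assumptions made
# for this example, not code or formats from the talk.
import random
from dataclasses import dataclass

@dataclass
class MCItem:
    question: str
    correct: str
    distractors: list[str]

TEMPLATES = [
    "{q}\n{opts}\nAnswer with the letter only.",
    "Question: {q}\nOptions:\n{opts}\nWhich option is correct?",
]

def render_variant(item: MCItem, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, gold_letter) with shuffled options and a randomly
    chosen template, so a memorized (question, letter) pair no longer helps."""
    options = [item.correct, *item.distractors]
    rng.shuffle(options)
    gold = "ABCD"[options.index(item.correct)]
    opts = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    template = rng.choice(TEMPLATES)
    return template.format(q=item.question, opts=opts), gold

# Example usage:
rng = random.Random(0)
item = MCItem("Which planet is known as the Red Planet?", "Mars",
              ["Venus", "Jupiter", "Mercury"])
prompt, gold = render_variant(item, rng)
```

Because the surface form is resampled on every run, a model that has merely memorized the benchmark's exact strings gains nothing from that memorization.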
Prompt variation is not a bug to be fixed but a diagnostic tool that reveals the gap between what models appear to know and what they actually understand. By systematically varying how we interact with models, we can build more reliable AI systems and more honest evaluations of their capabilities. This talk provides a technical roadmap for building and evaluating LLMs that are not only accurate but also reliable, aligned, and production-ready.
Aziza is an Applied Scientist at Oracle (AI Science) working on Generative AI Evaluations, specifically multimodal, text, and code generation. Previously, she worked on content moderation and AI safety on Microsoft's Responsible & OpenAI research team. She holds a Master of Science in Artificial Intelligence from Northwestern University. Aziza is interested in developing tools and methods that embed human-like reasoning capabilities into AI systems and in applying these technologies to socially driven tasks. She is based in Seattle; after work, she is busy training for her next marathon or hiking somewhere around the PNW.