Ever wonder why structured LLM output doesn’t feel as reliable as its natural language responses? At Khan Academy, we asked ourselves the same thing—especially as we leaned heavily on JSON-based structured outputs to power our AI tutor, Khanmigo.
Surprisingly, the root of the problem often lies in one of the most familiar tools in a Python developer’s toolbox: the humble dict. In this talk, we follow the story of how dictionary ordering can shape (and sometimes distort) structured LLM output. We’ll walk through how different frameworks—OpenAI, Claude, LangChain, OpenRouter, vLLM—handle structured responses, and why those differences matter more than you’d expect.
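The core of the issue can be sketched in a few lines. This is a minimal illustration (the schemas and field names here are hypothetical, not Khanmigo's actual ones): Python dicts preserve insertion order, so two schema dicts that compare equal can still serialize to different JSON, and the field order a model sees can change the order (and quality) of what it generates.

```python
import json

# Python dicts (3.7+) preserve insertion order, so the order in which
# schema properties are defined is the order they serialize in.
# Hypothetical example schemas -- same keys, different insertion order.
schema_a = {"answer": {"type": "string"}, "reasoning": {"type": "string"}}
schema_b = {"reasoning": {"type": "string"}, "answer": {"type": "string"}}

# The two dicts compare equal (dict equality ignores order)...
assert schema_a == schema_b

# ...but they serialize to different JSON, so the schema the model
# actually sees differs. With "reasoning" first, the model generates
# its reasoning before committing to an answer -- often a meaningful
# difference in output quality.
assert json.dumps(schema_a) != json.dumps(schema_b)
print(json.dumps(schema_b))
```

The same two lines of JSON can thus encode a real behavioral difference, even though the underlying dicts are "equal" to Python.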
Along the way, we’ll share the practical best practices we’ve developed to improve structured-output reliability, surface subtle failure cases, and debug weird edge behaviors. If you’re building LLM apps with structured output, you’ll leave with concrete tips—and a deeper appreciation for the details that make or break your system.