2025-11-07 – Talk Track 1
AI initiatives don’t stall because of weak models or scarce GPUs—they stall because organizations (and their LLMs) can’t reliably find, connect, and trust their own tabular data. Traditional catalogs promised order but turned into graveyards of stale metadata: manually curated, impossible to maintain, and blind to the messy realities of enterprise-scale environments.
What’s needed is a semantic foundation that doesn’t just document data, but deterministically maps it—every valid join, entity, and lineage verifiable against the data itself.
This talk explores methods designed for that reality: statistical profiling to reveal true distributions, functional type detection to identify natural keys and relationships, deterministic join validation to separate signal from noise, and entity-centric mapping that organizes data around business concepts rather than table names. These approaches automate what was once brittle and manual, keeping catalogs alive, current, and grounded in evidence.
AI initiatives aren’t failing industry-wide just because of weak models or a lack of GPUs. They fail because organizations (and LLMs) can’t reliably find, connect, and trust their own data, especially their tabular data.
Traditional catalogs promised order but became graveyards of stale metadata: manually curated, impossible to keep current, and blind to the messy realities of enterprise-scale data. The result? Data scientists burn weeks validating joins and features, analysts struggle to reconcile inconsistent views, auditors can’t trace metrics end-to-end, and AI projects stall before they begin.
What’s needed is a cataloging process that doesn’t just document data—it deterministically and precisely maps it, using methods that directly interrogate the data itself.
The Need – A reliable semantic foundation where every valid join, entity, and lineage is discoverable and verifiable, powering analytics and AI alike.
The Methods – Instead of depending on naming conventions or tribal knowledge, the new approach combines:
- Statistical profiling to expose true distributions and anomalies.
- Functional type detection to infer natural keys, dates, identifiers, and relationships.
- Deterministic join validation to separate real connections from coincidental overlaps.
- Entity-centric mapping that organizes data around business concepts, not table names.
These methods are designed to survive the realities of messy, inconsistent enterprise data; the sketches below illustrate each of them in turn.
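
As a concrete illustration of statistical profiling, here is a minimal sketch in Python, assuming pandas is available. The orders example, the column names, and the IQR-based outlier rule are illustrative assumptions, not the production implementation.

```python
# A minimal sketch of column-level statistical profiling, assuming a pandas
# DataFrame as input; column names and thresholds are illustrative only.
import numpy as np
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column's distribution so anomalies surface early."""
    rows = []
    for col in df.columns:
        s = df[col]
        stats = {
            "column": col,
            "dtype": str(s.dtype),
            "null_rate": s.isna().mean(),
            "distinct_ratio": s.nunique(dropna=True) / max(len(s), 1),
        }
        if pd.api.types.is_numeric_dtype(s):
            clean = s.dropna()
            q1, q3 = clean.quantile([0.25, 0.75])
            iqr = q3 - q1
            # Simple IQR fence to flag values far outside the bulk of the distribution.
            outliers = clean[(clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)]
            stats.update(
                mean=clean.mean(),
                std=clean.std(),
                skew=clean.skew(),
                outlier_rate=len(outliers) / max(len(clean), 1),
            )
        rows.append(stats)
    return pd.DataFrame(rows)

# Hypothetical orders table with one suspicious amount value.
orders = pd.DataFrame({
    "order_id": range(1, 101),
    "amount": list(np.random.lognormal(3, 1, 99)) + [1_000_000.0],
})
print(profile_columns(orders))
```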
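Functional type detection can be sketched in the same spirit: classify a column by how its values behave (date parseability, uniqueness, cardinality) rather than by its name. The categories and thresholds below are assumptions chosen for illustration.

```python
# A minimal sketch of functional type detection: classify columns by value
# behavior, not by column name. Categories and thresholds are assumptions.
import pandas as pd

def detect_functional_type(s: pd.Series) -> str:
    non_null = s.dropna()
    if non_null.empty:
        return "empty"
    # Date-like: most text values parse as timestamps.
    if not pd.api.types.is_numeric_dtype(s):
        parsed = pd.to_datetime(non_null.astype(str), errors="coerce")
        if parsed.notna().mean() > 0.95:
            return "date"
    # Candidate natural key: fully populated and unique.
    if s.isna().sum() == 0 and non_null.nunique() == len(non_null):
        return "candidate_key"
    # Low-cardinality values behave like a categorical code.
    if non_null.nunique() <= max(20, 0.05 * len(non_null)):
        return "categorical"
    return "measure_or_identifier"

df = pd.DataFrame({
    "cust_id": [101, 102, 103, 104],
    "signup": ["2024-01-03", "2024-02-11", "2024-03-20", "2024-04-02"],
    "region": ["NA", "NA", "EU", "EU"],
})
print({col: detect_functional_type(df[col]) for col in df.columns})
# -> {'cust_id': 'candidate_key', 'signup': 'date', 'region': 'categorical'}
```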
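Deterministic join validation reduces to checks you can run directly against the data: do the child column’s values actually appear in the parent column, and does the parent side behave like a key? The sketch below, including its 95% containment threshold, is illustrative rather than prescriptive.

```python
# A minimal sketch of deterministic join validation: test a proposed join by
# measuring value containment and key uniqueness on the data itself, rather
# than trusting column names. The threshold is an illustrative assumption.
import pandas as pd

def validate_join(child: pd.Series, parent: pd.Series,
                  min_containment: float = 0.95) -> dict:
    child_vals = set(child.dropna().unique())
    parent_vals = set(parent.dropna().unique())
    containment = (len(child_vals & parent_vals) / len(child_vals)
                   if child_vals else 0.0)
    parent_unique = parent.dropna().is_unique
    return {
        "containment": containment,          # share of child keys found in parent
        "parent_is_unique": parent_unique,   # parent side behaves like a primary key
        "valid": containment >= min_containment and parent_unique,
    }

customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 2]})
# A real foreign-key relationship: every order's customer exists in the parent.
print(validate_join(orders["customer_id"], customers["customer_id"]))
# A coincidental candidate: order_id is also an integer column, but containment fails.
print(validate_join(orders["order_id"], customers["customer_id"]))
```

The design choice worth noting is that the verdict depends only on the data: a join either satisfies the containment and uniqueness evidence or it does not, independent of naming conventions or documentation.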
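Finally, entity-centric mapping can be represented as a small structure that groups validated relationships around a business concept rather than around table names. The "customer" entity, table names, and key columns below are hypothetical examples.

```python
# A minimal sketch of entity-centric mapping: organize validated join evidence
# around a business entity ("customer") instead of individual table names.
# Entity, tables, and key columns are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EntityMap:
    entity: str
    anchor_table: str                      # table holding the entity's natural key
    key_column: str
    related: dict[str, str] = field(default_factory=dict)  # table -> validated join column

    def add_relation(self, table: str, join_column: str, validated: bool) -> None:
        # Only relationships that passed deterministic join validation are admitted.
        if validated:
            self.related[table] = join_column

customer = EntityMap(entity="customer", anchor_table="dim_customer",
                     key_column="customer_id")
customer.add_relation("fct_orders", "customer_id", validated=True)
customer.add_relation("web_sessions", "visitor_id", validated=False)  # rejected: no evidence
print(customer)
```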
The Tricks to Overcome – Automating these methods means the catalog self-corrects, stays current, and grounds every insight in evidence, not memory. This is in the same spirit as what Satya Nadella has highlighted as critical for agentic workflows: deterministic pre-processing that makes LLMs and AI agents reliable on enterprise data.
The Outcomes – With this foundation in place:
- Data scientists radically cut feature engineering time and deliver stronger models.
- Auditors trace every metric with deterministic lineage, eliminating gaps and guesswork.
- Analysts build ad-hoc, multi-source insights quickly and confidently.
- AI assistants generate precise, explainable SQL, finally grounded in truth instead of probability.
Speaker
Kirsten Lum — CEO of Schemantic.io and former Amazon Head of Data Science, Data Engineering & Economics; Lecturer at the University of Washington's Masters of Business Analytics program.
Previous knowledge expected
At Schemantic.io and Storytellers.ai, I oversee all aspects of data science, product, and engineering, with more than 40 patent claims underlying our technology. A nearly decade-long analytics veteran of Amazon and Expedia, I have led dozens of leaders across applied science, economics, analytics, data architecture, instrumentation, customer segmentation, customer retention, marketing operations, and impact measurement at global scale.