2025-09-25 – Apollo
Standard benchmarks are kinda bullsh**, and the internet knows it.
Little more than a marketing ploy, leaderboards have eroded our
trust in model release claims. They rarely reflect your unique,
real-world needs, leaving you without a reliable way to measure
success. This talk is about why building and continuously updating your
own evaluation systems is the key to creating a durable competitive
moat.
We’ll explore how to craft a robust “golden dataset” and review the
tooling ecosystem. I’ve picked up a few tricks for getting the most out
of your evals, from how to collect them to how to label them, and I want
to share them so you end up with the best golden dataset possible.
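To make the idea concrete, here is a minimal sketch of what a golden-dataset eval harness can look like. All names here (`GoldenExample`, `run_eval`, `toy_model`) are hypothetical, and the exact-match scoring is just a stand-in; the point is the loop: run the model over labeled examples, track the pass rate, and feed failures back into labeling.

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    prompt: str
    expected: str  # human-labeled reference answer

def run_eval(model, dataset):
    """Score a model callable against a golden dataset; returns (pass rate, failures)."""
    passes = 0
    failures = []
    for ex in dataset:
        output = model(ex.prompt)
        # Naive exact-match scoring; real evals often use rubric or LLM-based grading.
        if output.strip().lower() == ex.expected.strip().lower():
            passes += 1
        else:
            failures.append((ex.prompt, ex.expected, output))
    return passes / len(dataset), failures

# Toy stand-in "model" for demonstration only.
def toy_model(prompt: str) -> str:
    return "paris" if "France" in prompt else "unknown"

dataset = [
    GoldenExample("Capital of France?", "Paris"),
    GoldenExample("Capital of Spain?", "Madrid"),
]
rate, failures = run_eval(toy_model, dataset)
print(f"pass rate: {rate:.0%}")  # failures feed back into relabeling and expanding the set
```

Because the failures are returned alongside the score, every eval run doubles as a source of new labeling candidates, which is what keeps the golden dataset continuously updated rather than frozen at release time.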