2025-11-08, Room 301A
LLM apps fail without reliable, reproducible evaluation. This talk maps the open‑source evaluation landscape, compares leading techniques (RAGAS, evaluation‑driven development) and frameworks (DeepEval, Phoenix, Langfuse, and Braintrust), and shows how to combine tests, RAG‑specific evals, and observability to ship higher‑quality systems.
Attendees leave with a decision checklist, code patterns, and a production‑ready playbook.
Thesis: No single tool covers all evaluation needs. Effective teams combine (1) evaluation‑driven development, (2) RAG‑specific quality metrics, and (3) production observability/tracing.
Outline and approach:
Frame common failure modes (data leakage, brittle evals, silent regressions).
Introduce evaluation fundamentals and the leading open‑source frameworks.
Show the comparison methodology (safety, tracing, multimodal support, cost/latency considerations).
Code‑forward segments: evals with Braintrust; RAG scoring with RAGAS (a minimal sketch follows this outline); tracing and scoring with Phoenix/Langfuse; experiment tracking.
Run a live mini‑demo: evaluate the same output across tools, then discuss discrepancies and how to interpret them.
Close with a decision checklist and integration patterns for local dev, CI, and production.
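To make the RAGAS segment concrete, here is a minimal sketch of reference‑free RAG scoring. It assumes the classic `ragas` API (`evaluate()` over a Hugging Face `Dataset`; newer releases restructure this) and a judge‑model API key in the environment; the question, answer, and context strings are invented for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Hypothetical records; in practice these come from your RAG pipeline.
rows = {
    "question": ["What does RAGAS measure?"],
    "answer": ["RAGAS scores RAG pipelines on faithfulness and answer relevancy."],
    "contexts": [["RAGAS provides reference-free metrics for RAG pipelines."]],
}

# evaluate() calls an LLM judge under the hood, so a provider key
# (e.g. OPENAI_API_KEY) must be configured in the environment.
result = evaluate(Dataset.from_dict(rows), metrics=[faithfulness, answer_relevancy])
print(result)  # aggregate per-metric scores
```

The same dataset can be rescored with other tools in the demo, which is what surfaces the cross‑framework discrepancies discussed above.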
Target audience: Data scientists, ML/LLM engineers, MLOps/platform engineers (intermediate to advanced).
Talk type: hands‑on, code‑centric, data‑driven; light on theory, focused on reproducible workflows.
Key takeaways:
- A clear taxonomy of evaluation needs and where each framework fits
- A decision checklist to pick tools by use case (RAG, safety, tracing, multimodal, enterprise)
- Reference patterns for evals, CI/CD quality gates (see the sketch after this list), and production monitoring
- Pitfalls to avoid (overfitting to evals, prompt leakage, metric misuse) and how to mitigate them
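As one such reference pattern, below is a minimal sketch of a CI quality gate using DeepEval's pytest integration. The input/output strings and the 0.7 threshold are illustrative assumptions, and the metric requires a judge‑model API key at runtime.

```python
# test_quality_gate.py — run with `pytest`; a failing score fails the CI job.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # In CI, actual_output would come from the app under test.
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output="Click 'Forgot password' on the login page and follow the email link.",
    )
    # Fails the test if the LLM-judged relevancy score drops below 0.7.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```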
Seb is a Senior Member of Technical Staff at Cerebras Systems with a Master’s in Information Systems, originally from Germany and now US‑based. After beginning a PhD, he moved into consulting and served as Chief Product Officer at a major Austrian bank. He later pursued NLP research with MIT, co‑founded and exited a startup, and has built many AI/NLP systems in production. He has taught 20+ academic courses, published seven peer‑reviewed articles, and is known for translating complex concepts into practical solutions that bridge technical rigor and stakeholder needs.