PyData Seattle 2025

Evaluation is all you need
2025-11-08, Talk Track 3

LLM apps fail without reliable, reproducible evaluation. This talk maps the open‑source evaluation landscape, compares leading techniques (RAGAS, G-Eval, graders) and frameworks (DeepEval, Phoenix, LangFuse, OpenAI Evals), and shows how to combine unit tests, RAG‑specific evals, and observability to ship higher‑quality systems.
Attendees leave with a decision checklist, code patterns, and a production‑ready playbook.


Objective: Provide a practical guide to selecting and combining open‑source evaluation tools for LLM and RAG systems.

Thesis: No single tool covers all evaluation needs. Effective teams pair: (1) unit‑test style checks during development, (2) RAG‑specific quality metrics, and (3) production observability/tracing—optionally anchored in MLOps platforms for governance.

Outline and approach:

  1. Brief frame of failure modes (data leakage, brittle evals, silent regressions).
  2. Introduction of open‑source frameworks: DeepEval, RAGAS, Phoenix (Arize), LangFuse, and OpenAI Evals.
  3. Show comparison methodology (capability matrix across metrics, safety, tracing, multimodal, cost/latency considerations).
  4. Code‑forward segments: pytest‑style evals (DeepEval); RAG scoring with RAGAS; tracing + scoring with Phoenix/LangFuse; CI/CD gates and experiment tracking (code sketches follow this outline).
  5. Live mini demo: evaluate the same output across tools; discuss discrepancies and how to interpret them.
  6. Close with a decision checklist and integration patterns for local dev, CI, and production.
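To make segment 4 concrete, here is a minimal sketch of a pytest‑style check with DeepEval. The refund scenario, threshold, and test data are illustrative placeholders; the class and metric names follow DeepEval's documented API but may differ across versions, and running it assumes an LLM judge is configured via the usual environment variables.

```python
# Minimal pytest-style eval sketch (illustrative data; DeepEval API may vary by version).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_refund_answer_is_relevant():
    # One "unit test" per known-good behavior of the app.
    test_case = LLMTestCase(
        input="What is the refund window?",
        actual_output="You can request a refund within 30 days of purchase.",
        retrieval_context=["Refunds are accepted within 30 days of purchase."],
    )
    # Fails the test (and any CI job running it) if relevancy drops below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

A companion sketch for RAG scoring with RAGAS, assuming the 0.1‑style `evaluate` API (newer releases wrap samples in an `EvaluationDataset` instead of a raw `Dataset`); the example rows are placeholders and an LLM judge key is assumed to be configured in the environment.

```python
# RAG quality scoring sketch with RAGAS (0.1-style API; rows are placeholders).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["You can request a refund within 30 days of purchase."],
    "contexts": [["Refunds are accepted within 30 days of purchase."]],
})
# Returns per-metric scores you can log, compare across runs, or gate on in CI.
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)
```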

Target audience: Data scientists, ML/LLM engineers, MLOps/platform engineers (intermediate to advanced).

Talk type: hands‑on, code‑centric, data‑driven; light on theory, focused on reproducible workflows.

Key takeaways:
- A clear taxonomy of evaluation needs and where each framework fits
- A decision checklist to pick tools by use case (RAG, safety, tracing, multimodal, enterprise)
- Reference patterns for evals, CI/CD quality gates, and production monitoring (see the gate sketch after this list)
- Pitfalls to avoid (overfitting to evals, prompt leakage, metric misuse) and how to mitigate them
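
As a reference pattern for a CI/CD quality gate, a hypothetical Python step might read aggregated eval scores and fail the build on regression. The file name, metric names, and thresholds below are placeholders; adapt them to whatever DeepEval, RAGAS, or LangFuse export your eval step produces.

```python
# Hypothetical CI quality gate: fail the pipeline when aggregate eval
# scores fall below agreed thresholds. File name, metrics, and thresholds
# are placeholders for whatever your eval step writes out.
import json
import sys

THRESHOLDS = {"faithfulness": 0.80, "answer_relevancy": 0.75}

with open("eval_scores.json") as f:  # written by the preceding eval step
    scores = json.load(f)

failures = {m: s for m, s in scores.items()
            if m in THRESHOLDS and s < THRESHOLDS[m]}

if failures:
    print(f"Quality gate failed: {failures}")
    sys.exit(1)  # non-zero exit blocks the merge/deploy
print("Quality gate passed.")
```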


Prior Knowledge Expected: Previous knowledge expected

Seb is a Lead AI Engineer with a Master’s in Information Systems, originally from Germany and now US‑based. After beginning a PhD, he moved into consulting and served as Chief Product Officer at a major Austrian bank. He later pursued NLP research with MIT, co‑founded and exited a startup, and built AI/NLP systems in production. He has taught 20+ academic courses, published seven peer‑reviewed articles, and is known for translating complex concepts into practical solutions that bridge technical rigor with stakeholder needs.