AI Evals and Analytics Podcast
What is evaluation-driven development? When should you start building evals for your product? How do you build them from scratch? Using a real-world example of a customer chatbot for a medical insurance company, we walk through the process of setting up evals from scratch: translating product requirements into quantifiable metrics, curating quality test datasets (hint: you need fewer examples than you think), and making go/no-go decisions based on eval scores. You'll learn why accuracy and safety require different approaches, how to avoid the trap of AI-generated test data, and why 94% vs 95% accuracy matters less than you'd expect, while safety guardrails are non-negotiable. This is the practical blueprint for anyone building AI products who wants to catch problems before users do.

00:00 – Introduction: Why We Need to Talk About Evals Now
00:39 – When to Start AI Evals?
03:20 – Example Setup: Medical Insurance Customer Chatbot
04:30 – Defining Evals in Product Requirements
07:19 – What Is Evaluation-Driven Development?
08:27 – Breaking Down "Accuracy": What Does It Really Mean?
09:42 – Dataset Curation: Quality Over Quantity
11:24 – How Big Should Your Test Set Be?
12:25 – Safety Guardrails: Knowledge Boundary and PII Leakage
15:29 – Making Release Decisions with Eval Metrics
17:33 – Start with What's Critical to Your Use Case

Stella Liu: https://www.linkedin.com/in/wenxingl/
Amy Chen: https://www.linkedin.com/in/amy17519/

More about AI Evals and Analytics: https://ai-evals.org/
We (Stella & Amy) created the AI Evaluation & Analytics Playbook, a practical framework that helps teams ship production-ready, trustworthy AI systems.

Powered by Firstory Hosting: https://firstory.me/zh
3 episodes