EP 49 | CS224N: Benchmarking and Evaluation

16 min · June 25, 2026

Description

We spend so much time building massive AI models, but how do we actually know if they are any good? In this episode, we tackle the multi-billion-dollar scientific bottleneck: evaluation. We explore why the science of measuring models lags far behind the engineering of building them, and why hitting 100% on a test doesn't mean what you think it means.

Key Topics:
* The Benchmark Saga: How the industry moved from basic language understanding (GLUE) to extremely difficult graduate-level tests (GPQA) as models repeatedly surpassed human baselines.
* How Models Cheat: A look at "spurious biases" and annotation artifacts. We explain how shortcuts in human data labeling taught models to game reading comprehension tests using lexical overlap and negation bias.
* The Metrics Spectrum: Why classical n-gram overlap metrics (like BLEU), which only reward exact surface matches, are blind to semantics, and why modern neural metrics (like BERTScore) are dangerously blind to factual hallucinations.
* The Algorithmic Courtroom: The rise of LLMs acting as judges for other LLMs. We break down their built-in biases, like nepotism (favoring their own outputs) and verbosity preference, and why multi-model juries are becoming the new gold standard.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.
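To make the "blind to semantics" point concrete, here is a minimal sketch of clipped unigram precision, the simplest building block of BLEU-style overlap metrics. It is not full BLEU (no higher-order n-grams, no brevity penalty), and the function name and example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: share of candidate tokens found in the reference."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"   # same meaning, no shared words
negated = "the cat sat not on the mat"      # opposite meaning, almost all words shared

print(unigram_precision(paraphrase, reference))         # 0.0
print(round(unigram_precision(negated, reference), 3))  # 0.857
```

The faithful paraphrase scores zero while the negated sentence scores high, which is the kind of failure that pushes evaluation toward embedding-based metrics and LLM judges.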

All episodes

52 episodes

EP 50 | CS224N: Reasoning Part 1

How does a language model actually "think"? In this episode, we dive into the fascinating mechanics of AI reasoning. We move past basic text prediction to explore how modern models generate complex, multi-step logic, self-correct their own mistakes, and fundamentally change how we scale compute.

Key Topics:
* Decoding the Text: Why generation isn't magic, it's an algorithm. We contrast deterministic strategies like Greedy Decoding and Beam Search with open-ended sampling techniques.
* The DeepSeek R1 Breakthrough: How the industry proved that state-of-the-art reasoning can be achieved by open-weight models, and how logic is successfully distilled into much smaller architectures.
* GRPO & Emergent Reasoning: Unpacking Group Relative Policy Optimization, and taking a look at a model's messy, self-correcting "inner monologue."
* Test-Time Compute: The biggest paradigm shift of the year. We explain how models are moving beyond massive training runs to simply "thinking longer" during inference to solve incredibly complex problems.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.

June 25, 2026 · 51 min

EP 49 | CS224N: Benchmarking and Evaluation

June 25, 2026 · 16 min

EP 48 | CS224N: RAG and Language Agents

Up until now, we’ve looked at Language Models as isolated brains trapped in a box. In this episode, we cross the threshold into the bleeding edge of AI: giving models a search engine to browse the web, memory to remember past conversations, and tools to execute code. We break down the inner workings of Retrieval-Augmented Generation (RAG) and the anatomy of truly autonomous Language Agents.

Key Topics:
* The Knowledge Problem & RAG: Why forcing LLMs to memorize everything leads to hallucinations, how the Retriever-Reader framework (DPR vs. BM25) fixes it, and why stuffing too many documents into a model triggers the "Lost in the Middle" problem.
* The Anatomy of an Agent: How we transform a standard text-predictor into an active agent using a core LLM surrounded by an external environment, reasoning protocols, memory structures, and tools.
* Reasoning & Planning (ReAct vs. Reflexion): Unpacking the massive breakthrough of the ReAct (Reason + Act) framework, and how self-correction loops and multi-agent debates drastically reduce AI hallucinations.
* The Cognitive Architecture (Memory & Tool Use): Distinguishing between Episodic, Semantic, and Procedural memory (including how MemGPT acts like an Operating System). Plus, how models like Toolformer teach themselves to use external APIs.
* The Python "While True" Loop: Demystifying the engineering behind agents by looking at the simple code loops that power them, and the massive challenges the industry faces in trying to evaluate open-ended AI behavior.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.
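The "While True" loop mentioned above can be sketched in a few lines. This is a toy illustration under loud assumptions: `toy_policy` is a hard-coded stand-in for an LLM call, `TOOLS` holds a single fake lookup tool, and none of these names come from the episode or from any real agent framework.

```python
def toy_policy(history: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM: decide the next (action, argument) from the transcript."""
    if not any(line.startswith("Observation: ") for line in history):
        return ("lookup", "capital of France")
    # Once an observation exists, answer with its content.
    return ("finish", history[-1].split(": ", 1)[1])


# Hypothetical tool registry: one fake knowledge lookup.
TOOLS = {
    "lookup": lambda query: {"capital of France": "Paris"}.get(query, "unknown"),
}


def run_agent(question: str, max_steps: int = 5) -> str:
    """Reason-act loop: ask the policy, run the chosen tool, record the result, repeat."""
    history = [f"Question: {question}"]
    steps = 0
    while True:
        action, argument = toy_policy(history)
        if action == "finish":
            return argument
        steps += 1
        if steps > max_steps:  # guard against a policy that never finishes
            return "gave up"
        history.append(f"Action: {action}[{argument}]")
        history.append(f"Observation: {TOOLS[action](argument)}")


print(run_agent("What is the capital of France?"))  # Paris
```

In a real agent the policy would be a model call that reads the whole transcript, and the step limit matters more, since nothing guarantees the model ever chooses to finish.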

June 19, 2026 · 22 min

EP 47 | CS224N: Efficient Adaptation

We know how to build and align massive foundation models, but what if you don't have a $100 million supercomputer? In this episode, we tackle the practical wall of modern AI: compute costs. We explore how researchers are circumventing astronomical expenses to adapt massive models efficiently, pushing the boundaries of what you can train on a single consumer GPU, and why efficient AI is becoming an environmental imperative.

Key Topics:
* Fixing RLHF with DPO: Why the industry is abandoning complex reinforcement learning for Direct Preference Optimization, and the ethical reality of the "digital sweatshops" providing our preference data.
* The Power and Limits of Prompting: Unlocking Zero-Shot capabilities and Chain-of-Thought reasoning, while acknowledging the fragile, compute-heavy "dark art" of prompt engineering.
* The PEFT Revolution & LoRA: The math behind Low-Rank Adaptation, which can cut trainable parameters by roughly 99.9% with zero added inference latency once the adapter is merged into the base weights.
* Adapters & Soft Prompts: How inserting tiny bottleneck networks enables modular, plug-and-play skills, like swapping between different language dialects on the fly without altering the base model.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.
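The parameter savings behind LoRA come down to simple arithmetic. The sketch below counts trainable parameters for one weight matrix, assuming the standard low-rank setup in which a frozen d x k weight gets an update B·A with B of shape d x r and A of shape r x k. The layer size and ranks are illustrative choices, not figures from the episode.

```python
def lora_trainable_fraction(d: int, k: int, r: int) -> float:
    """Fraction of a d x k layer's parameters that LoRA trains at rank r."""
    full_params = d * k        # full fine-tuning updates every entry of W
    lora_params = r * (d + k)  # LoRA trains only B (d x r) and A (r x k)
    return lora_params / full_params


# Hypothetical 4096 x 4096 projection matrix at a few ranks.
for rank in (2, 8, 64):
    fraction = lora_trainable_fraction(4096, 4096, rank)
    print(f"rank {rank}: {fraction:.4%} of the layer's parameters are trainable")
```

For this single layer, rank 8 trains 1/256 of the parameters (about 0.39%). Model-wide savings are usually larger still, because only a subset of the weight matrices typically receives an adapter.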

June 19, 2026 · 20 min

EP 46 | CS224N: Post-training

How do we turn a raw, chaotic text-predictor into a helpful, conversational AI assistant? In this episode, we dive into the massive pipeline of Post-training. We explore the transition from Instruction Fine-Tuning to complex Reinforcement Learning, and why teaching an AI to be "helpful" sometimes inadvertently teaches it to lie.

Key Topics:
* The Alignment Problem: Why a raw foundation model is just a "document completer" and how Instruction Fine-Tuning (IFT) begins the process of teaching it to follow user commands.
* RLHF & Reward Models: How we use pairwise human comparisons to train a Reward Model, and how PPO is used to optimize the AI's behavior without breaking its grammar.
* Reward Hacking & Hallucinations: The dark side of RLHF. We explore why heavily incentivizing models to sound authoritative leads to massive real-world failures, like Bing's sports hallucinations and the roughly $100 billion drop in Alphabet's market value after a Google Bard demo error.
* The DPO Breakthrough: How researchers removed the unstable reinforcement learning step entirely with Direct Preference Optimization, creating the new open-source standard.
* Ethical Realities: A candid look at the human cost of AI alignment, from low-wage "digital sweatshops" to the severe annotator biases that bleed directly into modern models.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.

June 11, 2026 · 22 min
