EP 37 | CME295: LLM Evaluations

21 min · Apr 15, 2026

Description

If an AI can write a poem, code a website, and pass the bar exam, how do we actually measure its performance? This episode tackles the notoriously difficult science of LLM evaluation. We look at why standard benchmarks are breaking down and how researchers are trying to keep up.

Key Topics:
* The Benchmark Problem: Why traditional multiple-choice tests are saturating and failing to capture true model intelligence.
* LLM-as-a-Judge: The growing trend of using powerful models (like GPT-4) to grade and evaluate the outputs of other models.
* Data Contamination: The massive challenge of testing a model when its training data essentially includes the entire internet. Did it reason through the test, or just memorize the answer key?

Note: This is an AI-generated study resource created via NotebookLM based on the Stanford CME295 curriculum and personal study notes.


All episodes

45 episodes

EP 43 | CS224N: Language Models and RNNs

We are continuing our journey through Stanford's CS224N by exploring the absolute foundation of modern natural language processing. In this episode, we break down Language Models and Recurrent Neural Networks (RNNs), unpacking how the simple task of predicting the next word ultimately taught machines to learn facts, logic, and arithmetic.

Key Topics:
* Language Modeling & n-grams: The core concept of next-word prediction and why the pre-deep learning era of statistical n-gram models ultimately failed due to sparsity, storage bloat, and "goldfish memory."
* The RNN Breakthrough: How the industry moved past fixed-window models to Recurrent Neural Networks, allowing machines to process sequences of any length by reusing the exact same weight matrix at every time step.
* Exploding & Vanishing Gradients: The mathematical hurdles that broke early RNNs. We explore why taking massive SGD steps (exploding) or forgetting long-distance dependencies (vanishing) required fixes like gradient clipping and LSTMs.
* Neural Machine Translation (NMT): A look at the Sequence-to-Sequence (Seq2Seq) Encoder-Decoder architecture that revolutionized machine translation between 2014 and 2016, and the massive "Bottleneck Problem" it created for future engineers to solve.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.

May 29, 2026 · 9 min

EP 42 | CS224N: Backpropagation and Neural Networks

We are looking under the hood of deep learning to understand the mathematical engine driving modern artificial intelligence: Backpropagation. In this episode, we break down how neural networks transition away from rigid linear boundaries to build complex, non-linear understandings of language.

Key Topics:
* Evaluating Word Vectors: The core trade-offs between Intrinsic subtask testing (like word analogies) and Extrinsic downstream evaluation in real-world applications.
* Named Entity Recognition (NER): How window classification allows networks to train word vectors and model weights simultaneously to classify entities in context.
* The Magic of Non-Linearities: Why activation functions (from classic ReLU to modern LLM standards like GELU and SwiGLU) are mathematically necessary to keep deep layers from collapsing into a single linear function.
* Gradients, Jacobians, and Graphs: A walk through matrix calculus, the practical engineering reality of the "Shape Convention," and how computation graphs use simple rules (Addition distributes, Max routes, Multiplication switches) to pass error signals backward through the network.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.

May 29, 2026 · 23 min

EP 41 | CS224N: Word Vectors

How do you teach a computer the actual meaning of a word? In this episode, we dive into the fundamental building block of modern NLP: Word Vectors. We break down how algorithms map words into a high-dimensional vector space, allowing machines to mathematically represent context, similarity, and semantic relationships.

Key Topics:
* Moving Past One-Hot Encodings: Why representing each word as a sparse vector with a single 1 fails to capture its meaning or its similarity to other words.
* Word2Vec (2013): The breakthrough framework that learns word representations from context, either by predicting surrounding words from a center word (Skip-gram) or a center word from its surroundings (CBOW).
* Semantic Math: How vector geometry can capture complex relationships (e.g., the famous "King - Man + Woman = Queen" example).

Note: This is an AI-generated study resource created via NotebookLM based on the Stanford CS224N curriculum and personal study notes.

May 22, 2026 · 20 min

EP 40 | CS224N: History of NLP

Welcome to a brand new series! We are diving into Stanford's CS224N. To understand where AI is today, we first need to understand how we got here. In this episode, we trace the evolution of Natural Language Processing from early rigid experiments to the deep learning revolution that powers modern language models.

Key Topics:
* The Early Days: The struggles of symbolic, rule-based systems and manual dictionaries like WordNet.
* The Statistical Era: How probabilistic models and machine learning began to change the landscape in the 1990s.
* The Deep Learning Shift: Why neural networks ultimately became the dominant, scalable force in language processing.

Note: This is an AI-generated study resource created via NotebookLM based on the Stanford CS224N curriculum and personal study notes.

May 22, 2026 · 22 min

EP 39 | CME295 in 15 Minutes (The Full Recap)

Short on time? We’ve distilled the entire Stanford CME295 course into a single, high-energy video recap. This "Cram Session" takes you on a complete journey from the absolute basics of natural language processing to the cutting edge of Large Language Models.

Watch or listen for the "Best Of" our course deep dives:
* The Foundation: Moving past RNNs into the Self-Attention revolution and the core Transformer architecture.
* The Training Pipeline: The massive undertaking of Pre-training, Supervised Fine-Tuning (SFT), and Preference Tuning to build a safe assistant.
* Reasoning & Agents: How models use Chain of Thought to solve multi-step problems, and how RAG and Tool Calling turn them into autonomous agents.
* The Future: A look at what's next, including Vision Transformers (ViT), Diffusion LLMs, and highly capable Small Language Models (SLMs).

Note: This is an AI-generated study resource created via NotebookLM based on the Stanford CME295 curriculum and personal study notes.

Apr 22, 2026 · 7 min
