AI Bites: The Academic Series
Language models do not actually read text; they read tokens. In this episode, we explore the invisible preprocessing layer that Andrej Karpathy says is "at the heart of much weirdness of LLMs." We demystify the tokenization problem, explain why your AI can't count letters, and discuss the socio-economic inequalities baked into modern AI pricing.

Key Topics:
* The BPE Algorithm: How Byte Pair Encoding finds the "Goldilocks" zone between very long character-level sequences and rigid word-level vocabularies by repeatedly merging the most frequent pairs of adjacent bytes.
* Strawberries & Glitch Tokens: Why ChatGPT can confidently get the letters in "strawberry" wrong, and what the "SolidGoldMagikarp" glitch token reveals about adversarial vulnerabilities.
* Cross-Lingual Transfer & The Capacity Curse: How an AI trained on English sentiment can evaluate French zero-shot, yet loses overall performance when forced to learn too many languages at once.
* The Tokenization Tax: The stark reality of subword fertility, the number of tokens a tokenizer needs per word. We explain how English-biased tokenizers overcharge non-English speakers, slow down processing, and degrade output quality for the global majority.

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.
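For listeners who want to see the byte-pair merging and subword fertility ideas from the key topics in code, here is a minimal Python sketch. It is a toy for illustration only, not the tokenizer of any real model. The function names (`train_bpe`, `encode`, `fertility`), the tiny English corpus, and the choice to measure fertility as tokens per whitespace-separated word are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: token ids 0..255
    merges = {}                       # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break                     # nothing repeats, so merging gains nothing
        new_id = 256 + len(merges)    # new ids follow the 256 byte values
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def fertility(text, merges):
    """Average number of tokens per whitespace-separated word (toy definition)."""
    words = text.split()
    return sum(len(encode(word, merges)) for word in words) / len(words)


# Hypothetical toy corpus: English only, so the learned merges are English-biased.
corpus = "the cat sat on the mat. the cat ate the rat. " * 20
merges = train_bpe(corpus, num_merges=30)

english = "the cat sat on the mat"
french = "le chat était sur le tapis"
print("tokens for 'strawberry':", encode("strawberry", merges))
print("English fertility:", fertility(english, merges))
print("French fertility:", fertility(french, merges))
```

Because the merges were learned from English text only, the French sentence is split into more tokens per word than the English one. That gap is the "tokenization tax" in miniature. The `encode("strawberry", merges)` call also shows that the model side sees a list of integer ids rather than individual letters.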
55 episodes