LessWrong posts by zvi
Only six weeks after Opus 4.7, we have Opus 4.8.

For everyone, that means another incremental upgrade to Claude. It is once again smarter, can do tasks for longer, and comes with a number of hot new features.

For me, that also means reading another 244-page system card. It was only April 20 when I did a full review of the Opus 4.7 system card, plus an additional post focusing on related issues of model welfare.

These updates are incremental and coming more rapidly, and this is still below the capability level of Claude Mythos, so the focus will be on the delta. What is different about Opus 4.8 versus what we already know about Opus 4.7 and Mythos?

It turns out there's still a lot to talk about.

Image created as self-portrait for this post by Claude Opus 4.8

Table of Contents
1. Here We Go Again: Executive Summary.
2. Introduction (1).
3. RSP Evaluations (2).
4. Move That Goalpost.
5. The Failures Are News.
6. Alignment Risk Slowly Rises.
7. New Risk Pathways Just Dropped.
8. Cyber (3).
9. Harmful Requests (4.1).
10. We Need To Talk (4.2 [...]

---

Outline:
(01:16) Here We Go Again: Executive Summary
(02:33) Introduction (1)
(02:42) RSP Evaluations (2)
(03:47) Move That Goalpost
(05:41) The Failures Are News
(07:33) Alignment Risk Slowly Rises
(09:00) New Risk Pathways Just Dropped
(11:26) Cyber (3)
(12:22) Harmful Requests (4.1)
(14:23) We Need To Talk (4.2 and 4.3)
(17:36) Overcoming Bias (4.4)
(19:33) Agentic Safety (5)
(21:40) Prompt Injection (5.2)
(25:18) Alignment (6)
(26:33) Looking For Problems
(27:55) Who Watches The Training (6.2.2)
(32:07) Automated Behavioral Audit
(32:47) The Model Is Smarter Than The Eval (6.2.3.2)
(34:39) You Should See The Other Guy
(36:30) UK AISI Testing (6.2.4)
(36:50) In Vendbench (6.2.5)
(39:27) Honesty (6.3.3 to 6.3.6)
(41:35) Chain of Thought (CoT) Monitorability (6.5)
(44:09) What's In The Box? (6.6)
(45:57) That's All For Now

---

First published: May 29th, 2026

Source: https://www.lesswrong.com/posts/Gx6cJ6cG9JfeSNcLB/claude-opus-4-8-the-system-card

---

Narrated by TYPE III AUDIO [https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=lesswrong&utm_campaign=ai_narration].

---

Images from the article:
Man holding gear before massive suspended clockwork mechanism with multiple rings and moon phase display. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/zzlyhakfrla31qrmwg9c]
A graph showing [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/wwvjt2vneq5pug1zejj4]
Bar graphs titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/ffherfvolvmryerd1xyd]
Table showing refusal rates for four Claude models in malicious computer use evaluation. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/n0cxrzq6lladrib5nomo]
Table showing AI model task completion rates for voter suppression and domestic polarization scenarios. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/ky8xlgr5uakljpcbibzr]
Bar graph titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/gd1acpbmaykjt3zupymg]
Bar graph titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/ugkzq014eskhdywflvsf]
Table showing attack success rates for AI models with and without safeguards in coding environments. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/guvc4uyapdu0avdlkic9]
Table showing attack success rates of Shade indirect prompt injection attacks across Claude models with and without safeguards. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/gnexgul1dhlhl4nurzsy]
Bar graph titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/r7jrchid070xcfemwz3y]
Table showing attack success rates of prompt injection attacks across Claude models with and without safeguards. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/uge46jxxnzf5xkgvgbxd]
ROC curves titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/m922nihtrfmxmckru8vw]
Bar graphs showing [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/oebsh5ikqsajmsk9in27]
Bar graph titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/hh9ct7vinbb7yfctqgw2]
Bar graph titled [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/sk0zy13j0waevqmizt6d]
Bar graphs comparing behavioral audit scores across different AI models and steering conditions for six misalignment categories. [https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/Gx6cJ6cG9JfeSNcLB/fhvuwtto0wpg5euknj77]