The Scratchpad Monologues (CoT part 2)

46 min · Mar 16, 2026

Description

If chain of thought is a model "thinking aloud" to itself, why does it express doubt, frustration or suspicion about the problems it's solving, sometimes for pages and pages of its scratchpad? And what does chain of thought mean for AI safety? We'll hear from Julian Schulz, a researcher who's studying encoded reasoning in large language models, about where the opportunities, risks and weirdness lie in chain of thought.

Here are some links to his research:

* On a model jailbreaking its monitor: https://www.lesswrong.com/posts/szyZi5d4febZZSiq3/monitor-jailbreaking-evading-chain-of-thought-monitoring
* A roadmap for safety cases based on CoT: https://arxiv.org/html/2510.19476v1#S1
* His posts on LessWrong: https://www.lesswrong.com/users/wuschel-schulz

Some of the other papers we discussed include:

* On the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
* Monitoring reasoning models for misbehavior and the risks of promoting obfuscation: https://arxiv.org/pdf/2503.11926
* How steganography comes about: https://arxiv.org/pdf/2506.01926
* Assuring agent safety evals by analysing transcripts (with excerpts from weird monologues): https://www.alignmentforum.org/posts/e8nMZewwonifENQYB/assuring-agent-safety-evaluations-by-analysing-transcripts
* Stress-testing deliberative alignment for anti-scheming training: https://www.apolloresearch.ai/research/stress-testing-deliberative-alignment-for-anti-scheming-training/
* And the "watchers" CoT snippet from the paper above: https://www.antischeming.ai/snippets#using-non-standard-language


All episodes

20 episodes

Saturated

Benchmarks are the primary measure of AI capability. They involve testing the most advanced models and seeing what kinds of problems they can solve, or what kinds of human tasks they might be able to do. From late 2025 to mid-2026, most of the main benchmarks became saturated, meaning the models score so highly that the tests are no longer meaningful, either for comparing different models or for gauging an individual model's performance. That might suggest the models are just getting good at taking these tests. Or it might mean we're approaching the threshold of AGI. In this episode, we'll hear from Håvard Ihle, who came up with his own benchmark called WeirdML to try to answer this question.

Note: Håvard's views are his own and do not represent the views of his employer, the Norwegian Defence Research Establishment.

The METR time-horizon exponential graph is important context for this episode: https://metr.org/time-horizons/

Learn more about WeirdML:

* https://epoch.ai/benchmarks/weirdml
* https://www.lesswrong.com/posts/LfQCzph7rc2vxpweS/introducing-the-weirdml-benchmark
* https://www.lesswrong.com/posts/NLnGRDRXATW2pqXuE/is-the-gap-between-open-and-closed-models-growing-evidence
* https://www.lesswrong.com/posts/ifSBamvobbyB9KWjK/inference-costs-for-hard-coding-tasks-halve-roughly-every
* https://www.lesswrong.com/posts/hoQd3rE7WEaduBmMT/weirdml-time-horizons

May 25, 2026 · 33 min

AI & Mental Health

Could AI address the global mental health crisis at scale? And what are the risks and unknowns that go along with that? These are the questions being investigated by a working group called AIMHI (https://forum.effectivealtruism.org/posts/MrFBezseyfnQd9XmJ/seeking-feedback-an-initiative-on-ai-mental-health-and). In this episode, I talk to four members of the group about their field research as well as the mental health chatbot they're developing (https://stillwater.coach/), whose focus is on serving populations with severe mental healthcare shortages (https://impartial-priorities.org/p/ai-mental-health-chatbots-for-low).

* Find out more about Effective Mental Health: https://effectivementalhealth.com
* Join one of AIMHI's weekly coworking sessions: https://luma.com/calendar/cal-JNJlcdItDuFEFcn
* Read about the project's theory of change: https://impartial-priorities.org/p/breaking-the-cycle-of-trauma-and

May 4, 2026 · 36 min

2 AIs Take A Session (Fiction)

What if AIs went to therapy? Would it help them become "fitter, happier, more productive" (in the words of the old Radiohead song)? Or would they take it as a novel type of evaluation (and maybe that's what it really is)?

Note: this episode was written and recorded in November 2025, five months before the release of the Mythos Preview system card that mentions Claude's session with a human psychiatrist in the "Model welfare" section. So as weird as this episode might seem, the truth is actually stranger.

https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8dda846ab289.pdf (see page 180 for the psychiatrist's report)

Apr 13, 2026 · 7 min

You Be The Judge

Can we trust AI to keep AI honest? Having a human in the loop is already more illusion than reality, as the task of checking and overseeing LLM outputs is increasingly assigned to other LLMs. The problem is that these LLM judges tend to be biased in favor of the answers they generate themselves, even when the answers are wrong. To understand why this is, and what we can do about it, listen to my conversation with AI safety researcher Taslim Mahbub. We'll talk about his research into self-preference bias, the surprising results of his experiments and some potential mitigation strategies, as outlined in:

* This post on mitigating collusive self-preference: https://www.lesswrong.com/posts/nB7kAf8c4tvnvZ4u3/mitigating-collusive-self-preference-by-redaction-and-2
* This paper on mitigating self-preference through authorship obfuscation: https://arxiv.org/abs/2512.05379

As a bonus, if you're interested in Taslim's earlier research on using machine learning in service of biodiversity monitoring, here's the abstract of his paper on convolutional neural networks (CNNs) for identifying bat species: https://ieeexplore.ieee.org/document/9311084

Mar 30, 2026 · 22 min

The Scratchpad Monologues (CoT part 2)

Mar 16, 2026 · 46 min
