Redwood Research Blog

“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen

1 h 3 min · May 1, 2026

Description

Subtitle: Fitness-seeking is increasingly what misalignment looks like in practice. How should we respond?

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call “fitness-seeking”, a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis.

Fitness-seekers are, in many ways, notably safer than what I’ll call “classic schemers”. A classic schemer is an intelligent adversary with unified motivations whose fulfillment requires control over the whole world's resources. Meanwhile, fitness-seeking instances generally don’t share a common goal (e.g., a reward-seeker only cares about the reward for its current actions), and many fitness-seekers would be satisfied with modest-to-trivial [...]

Outline:
(03:50) Overview
(11:32) The basic reasons fitness-seekers might be safer than classic schemers
(16:01) Four mechanisms for risk and their mitigations
(16:46) Potemkin work
(22:13) Instability
(27:49) Manipulation
(31:25) Outcome enforcement
(36:52) Cross-cutting mitigations
(37:18) Deals
(40:05) Control
(44:45) Alignment
(44:48) Preventing fitness-seeking from arising
(48:51) Making any fitness-seeking motivations safer
(51:32) How does online training change the picture?
(55:09) Overall recommendations
(59:12) Conclusion

The original text contained 18 footnotes which were omitted from this narration.

First published: May 1st, 2026
Source: https://blog.redwoodresearch.org/p/risk-from-fitness-seeking-ais-mechanisms
Narrated by TYPE III AUDIO [https://type3.audio/].

All episodes

115 episodes


“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff

Subtitle: Models' no-CoT time horizon has doubled roughly every year.

by Francis Rhys Ward, Dewi Gould, Anders Cairns Woodruff et al. (see full author list at the end)

About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)? If models can do extensive reasoning without outputting any CoT, it would have implications for safety. Developers and deployment-time monitors couldn’t easily understand models’ motivations and catch dangerous planning. Models that reason substantially without a CoT might also drift further from human patterns of thought, since their reasoning is no longer constrained by text in the pretraining prior. As a result, they would be harder to understand and might be more likely to scheme.

Extending Ryan Greenblatt's research, we investigate this by measuring models’ ability to complete tasks without any CoT on a suite of 43 benchmarks spanning different domains. We compare AI reasoning ability to humans using the estimated 50% time horizon (TH): the typical time taken for a human to perform a task that the LLM performs with [...]

Outline:
(02:28) Methods
(04:59) Results
(06:34) FAQ
(08:09) Conclusion

First published: June 10th, 2026
Source: https://blog.redwoodresearch.org/p/estimating-no-cot-task-completion
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article:
Figure 1: Our no-CoT THs (green) compared to METR’s with-CoT THs (purple). Until the release of GPT-4, with- and without-CoT THs increased at a similar rate. Since GPT-4, with-CoT THs have grown at roughly twice the rate of no-CoT THs.
Figure 2: For each model, we fit a logistic curve to predict the probability it successfully completes tasks as a function of human time (left) or o3-mini reasoning tokens (right) needed to complete the task.
Figure 3: No-CoT TH (left) and token horizon (right) over time. Each point is a model plotted by release date; lines are exponential fits, with shading showing 95% CIs.
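The 50% time horizon mentioned in this description can be illustrated with a small sketch. The description only says that a logistic curve is fit to predict success probability as a function of human time; the code below is not the authors' code. It assumes the curve is fit on log2 of human minutes, uses a plain gradient-descent fit, and runs on made-up toy data. The function names are hypothetical as well. It then reads off the task length at which predicted success crosses 50%.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a * x + b) by gradient descent on mean log-loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def time_horizon_50(human_minutes, successes):
    """Human task length (minutes) at which the fitted success probability is 50%."""
    # Assumption: the fit is done on log2(minutes), not raw minutes.
    xs = [math.log2(t) for t in human_minutes]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a * x + b) = 0.5 exactly where a * x + b = 0, i.e. x = -b / a.
    return 2 ** (-b / a)


# Hypothetical toy data: four attempts per task length, fewer successes on longer tasks.
toy_minutes = [1, 2, 4, 8, 16]
toy_success_counts = [4, 3, 2, 1, 0]
minutes, outcomes = [], []
for t, k in zip(toy_minutes, toy_success_counts):
    for i in range(4):
        minutes.append(t)
        outcomes.append(1 if i < k else 0)

horizon = time_horizon_50(minutes, outcomes)
print(f"estimated 50% time horizon: {horizon:.2f} minutes")
```

On this symmetric toy data the estimate comes out close to 4 minutes, the task length at which half of the attempts succeed. Real benchmark data would call for a proper fitting routine and confidence intervals, which this sketch leaves out.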

June 10, 2026 · 9 min

“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck Shlegeris

Subtitle: When is "increasing safety budget" a useful concept?

I often use what I’ll call the “safety-usefulness tradeoff model”, which is: developers face a tradeoff between “safety” and “usefulness” of an AI deployment, and the developer has only limited willingness or ability to sacrifice usefulness for the sake of safety. This model assumes that developers choose whether to take safety-relevant actions based on their cost efficiency, i.e., the marginal safety gain relative to the cost. However, that is not necessarily true. In this post, I spell out different stories for how developers choose what safety-relevant actions to take, in order to clarify when this model is relevant and how strategies for reducing AI risk are affected when its assumptions don’t hold.

The model suggests two ways a safety-concerned person can increase safety:

* Safety tech improvements: push out the Pareto frontier, so that any given level of usefulness reduction buys more safety than it would have previously.
* Safety budget increase: increase the extent to which the developer sacrifices usefulness for safety. On the cheaper end, this means implementing safety measures; on the more expensive end, it might mean refraining from training or deploying models whose risks [...]

Outline:
(04:17) Rushed reasonable developers
(06:08) Limited political will
(08:42) This model is unhelpful if developers don't trade efficiently between safety and usefulness
(12:59) Overall thoughts
(14:26) Appendix: Definitions of safety and usefulness in the rushed reasonable developer model

First published: June 8th, 2026
Source: https://blog.redwoodresearch.org/p/efficient-tradeoffs-and-the-safety
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article:
Graph showing relationship between Safety and Usefulness with safety budget and tech improvements.

June 8, 2026 · 16 min

“Retrying vs Resampling in AI Control” by James Lucassen, Adam Kaufman

We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying” protocols similar to Claude Code auto mode or Codex Auto-review.

Motivation

Roughly a year ago we released Ctrl-Z, the first paper to study control techniques for agents. A headline result of that paper was the performance of resample protocols: strategies that involve taking multiple i.i.d. samples from the model per step. But since Ctrl-Z, models have gotten much stronger, and we have built more sophisticated control settings to keep up. We wanted to answer the following questions:

How well do the results from Ctrl-Z hold up with better models and a better setting? Current high stakes control research is trying to learn by analogy about how to do control effectively in a real high stakes deployment during a real intelligence explosion. Findings about technique performance are going to have to generalize pretty far to be useful.

If the resample protocols from Ctrl-Z still work, what makes them work? One way we try to make our work more generalizable is by understanding the dynamics governing outcomes [...]

Outline:
(00:30) Motivation
(02:35) TL;DR Takeaways
(04:34) Methodology
(07:14) Differences from Ctrl-Z
(12:14) Are Retrying Protocols Exploitable?
(15:25) Cost and Latency of Resampling
(17:24) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

First published: May 29th, 2026
Source: https://blog.redwoodresearch.org/p/retrying-vs-resampling-in-ai-control
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article:
Bar graph showing safety percentages across different resampling and monitoring methods.
Two graphs showing post-resample suspicion distributions and safety percentages at 0.3% audit budget.

May 29, 2026 · 19 min

“Advice for making robust-to-training model organisms” by Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training, i.e., training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.

Fragile model organisms aren’t very useful for technique development: when a sophisticated technique succeeds on one, you can’t tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it.

This post identifies factors that make model organisms more robust to untargeted training. Our main findings are:

1. Prompted model organisms [...]

Outline:
(03:11) Experimental setup
(06:12) Main results
(06:15) Result 1: Untargeted training removes misalignment from prompted model organisms without degrading other capabilities
(12:32) Result 2: FWFT model organisms are more robust than LoRA model organisms
(15:09) Result 3: Backdoor behavior and trigger affect robustness
(16:37) Result 4: Password locking makes model organisms less robust
(20:20) Conclusion
(21:29) Factors that don't help much
(21:54) 1: Training longer doesn't seem to increase robustness past a certain point for our backdoors
(22:33) 2: CoT-distilling doesn't increase robustness for our backdoors
(23:07) 3: Using larger models doesn't substantially increase robustness for 2 of our backdoors
(24:44) 4: There aren't clear trends for what types of backdoor triggers and behaviors are most robust
(25:32) 5: SOAP Optimizer doesn't help much
(26:35) 6: Weight Decay doesn't help
(27:18) Appendix:
(27:21) Appendix 1: Pirate SFT is better than QA SFT
(27:45) Appendix 2: Are blue team methods wiping previous training?
(29:30) Appendix 3: Backdoor list

The original text contained 10 footnotes which were omitted from this narration.

First published: May 28th, 2026
Source: https://blog.redwoodresearch.org/p/advice-for-making-robust-to-training
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article:
Diagram showing three methods of addressing backdoor prompts with increasing robustness.
Violin plots comparing Pareto frontier AUC across red team training methods for Pirate and QA SFT LoRA.
Table comparing backdoor types with prompts, showing normal and explicit trigger backdoors.
Table showing AI behavior based on problem type and password presence.
Table showing six backdoor types with examples and definitions for AI assistants.
Three violin plots comparing Pareto hull AUC across backdoor categories: Trigger, Sentiment, and Scope variations.

May 28, 2026 · 32 min

“Full automation of AI R&D probably yields a large speed up even without a software-only singularity” by Ryan Greenblatt

Subtitle: Full automation likely yields a one-time speed-up and higher returns from compute.

This is a somewhat technical note. By “software-only singularity”, I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement in jargon is: r is significantly greater than 1 (implying progress is getting faster and faster) and this remains the case for long enough to get large amounts of progress. For context, see How quick and big would a software intelligence explosion be?

Even without a “software-only singularity”, I think full automation of AI R&D probably greatly speeds up progress for two main reasons:

* You get a one-time speed up from automation and this speed up seems like it will be pretty large (even with r<1). See How quick and big would a software intelligence explosion be? for discussion and see the AI Futures Model for an end-to-end model that naturally incorporates this effect. Quantitatively, with my median [...]

The original text contained 6 footnotes which were omitted from this narration.

First published: May 27th, 2026
Source: https://blog.redwoodresearch.org/p/full-automation-of-ai-r-and-d-probably
Narrated by TYPE III AUDIO [https://type3.audio/].

May 27, 2026 · 6 min