AI Safety Fundamentals

Podcast by BlueDot Impact

Listen to resources from the AI Safety Fundamentals courses! https://aisafetyfundamentals.com/

All episodes

147 episodes
Measuring Progress on Scalable Oversight for Large Language Models

Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan

Original text: https://arxiv.org/abs/2211.03540

Narrated for AI Safety Fundamentals [https://www.agisafetyfundamentals.com/] by Perrin Walker [https://twitter.com/perrinjwalker] of TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
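To make the comparison concrete, here is a minimal sketch (in Python, not the paper's code) of how the three conditions in this design could be scored on a shared multiple-choice set; the Question class and the answer lists are illustrative stand-ins for MMLU- or QuALITY-style data.

```python
# A minimal sketch of the three conditions the proof-of-concept contrasts:
# model alone, unaided humans, and humans assisted by an unreliable dialog
# model, all scored on the same multiple-choice questions.

from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    choices: list[str]
    answer_index: int


def accuracy(answers: list[int], questions: list[Question]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(a == q.answer_index for a, q in zip(answers, questions))
    return correct / len(questions)


def evaluate_conditions(questions, model_answers, human_answers, assisted_answers):
    """Report the three accuracies; the design looks for tasks where the
    assisted condition beats both baselines."""
    return {
        "model_alone": accuracy(model_answers, questions),
        "human_alone": accuracy(human_answers, questions),
        "human_plus_model": accuracy(assisted_answers, questions),
    }


# Example usage with made-up answers for four questions:
qs = [Question(f"q{i}", ["A", "B", "C", "D"], answer_index=0) for i in range(4)]
print(evaluate_conditions(
    qs,
    model_answers=[0, 1, 2, 0],      # 0.50
    human_answers=[0, 0, 3, 1],      # 0.50
    assisted_answers=[0, 0, 2, 0],   # 0.75: assisted humans beat both baselines
))
```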

4 Jan 2025 - 9 min
AGI Ruin: A List of Lethalities

I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead. Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Crossposted from the LessWrong Curated Podcast [https://www.lesswrong.com/posts/kDjKF2yFhFEWe4hgC/announcing-the-lesswrong-curated-podcast] by TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.

4 Jan 2025 - 1 h 1 min
Feature Visualization

There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: feature visualization and attribution. This article focuses on feature visualization. While feature visualization is a powerful tool, actually getting it to work involves a number of details. In this article, we examine the major issues and explore common approaches to solving them. We find that remarkably simple methods can produce high-quality visualizations. Along the way we introduce a few tricks for exploring variation in what neurons react to, how they interact, and how to improve the optimization process.

Original text: https://distill.pub/2017/feature-visualization/

Narrated for AI Safety Fundamentals [https://www.agisafetyfundamentals.com/] by Perrin Walker [https://twitter.com/perrinjwalker] of TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
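As a rough illustration of the optimization-based approach the article describes, here is a minimal sketch assuming a PyTorch/torchvision setup; the GoogLeNet layer, channel index, and hyperparameters are arbitrary choices for the example, and the article's actual recipe adds regularization and transformation robustness on top of this bare loop.

```python
# A minimal sketch of feature visualization by optimization: gradient-ascend an
# input image so that one channel of an intermediate layer activates strongly.

import torch
import torchvision.models as models


def visualize_channel(model, layer, channel, steps=256, lr=0.05):
    """Optimize a random image to maximize one channel's mean activation."""
    activation = {}

    def hook(_module, _inp, out):
        activation["value"] = out

    handle = layer.register_forward_hook(hook)
    image = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        model(image)
        # Negative mean activation of the chosen channel: minimizing this maximizes it.
        loss = -activation["value"][0, channel].mean()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)  # keep pixels in a displayable range

    handle.remove()
    return image.detach()


# Usage (illustrative): visualize one channel of a pretrained GoogLeNet layer.
model = models.googlenet(weights="DEFAULT").eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image is optimized
img = visualize_channel(model, model.inception4a, channel=11)
```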

4 Jan 2025 - 31 min
Progress on Causal Influence Diagrams

By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then.

What are causal influence diagrams? A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we’re developing a formal theory of incentives based on causal influence diagrams (CIDs).

Source: https://deepmindsafetyresearch.medium.com/progress-on-causal-influence-diagrams-a7a32180b0d1

Narrated for AI Safety Fundamentals [https://www.agisafetyfundamentals.com/] by TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
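For readers who want a concrete handle on the object itself, here is a minimal sketch of a CID as a typed directed acyclic graph, using networkx; the node names and the recommender example are hypothetical, and this shows only the data structure, not the incentive analysis the papers develop.

```python
# A minimal sketch of a causal influence diagram (CID): a DAG whose nodes are
# labeled as chance, decision, or utility nodes.

import networkx as nx


def build_cid(chance, decisions, utilities, edges):
    """Build a CID as a directed acyclic graph with typed nodes."""
    g = nx.DiGraph()
    for name in chance:
        g.add_node(name, kind="chance")
    for name in decisions:
        g.add_node(name, kind="decision")
    for name in utilities:
        g.add_node(name, kind="utility")
    g.add_edges_from(edges)
    assert nx.is_directed_acyclic_graph(g), "a CID must be acyclic"
    return g


# Illustrative (hypothetical) example: a recommender whose action D influences
# user clicks U, with an information link from the user's state S into D.
cid = build_cid(
    chance=["S"],       # user state (chance node)
    decisions=["D"],    # which content to recommend
    utilities=["U"],    # observed clicks
    edges=[("S", "D"), ("S", "U"), ("D", "U")],
)
print(sorted(cid.nodes(data="kind")))
```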

4 Jan 2025 - 23 min
AI Watermarking Won’t Curb Disinformation

Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show.

One common proposal is that big companies should incorporate watermarks [https://en.wikipedia.org/wiki/Digital_watermarking] into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that’s undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.

Unfortunately, watermarking schemes are unlikely to work [https://www.wired.com/story/artificial-intelligence-watermarking-issues/]. So far most have proven easy to remove, and it’s likely that future schemes will have similar problems.

Source: https://transformer-circuits.pub/2023/monosemantic-features/index.html

Narrated for AI Safety Fundamentals [https://aisafetyfundamentals.com/] by Perrin Walker [https://twitter.com/perrinjwalker]

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
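As a toy illustration of the pixel-tweaking idea described above (and of how fragile it is), here is a short NumPy sketch that hides a bit pattern in an image's least significant bits and then detects it; the scheme, array shapes, and noise model are made up for the example and are far simpler than any production watermark.

```python
# A toy least-significant-bit watermark: embed a 0/1 pattern into each pixel's
# lowest red-channel bit, then check how much of it survives.

import numpy as np


def embed_watermark(image: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Write `pattern` (0/1 array, same height/width) into each pixel's lowest bit."""
    marked = image.copy()
    marked[..., 0] = (marked[..., 0] & 0xFE) | pattern  # touch only the red LSB
    return marked


def detect_watermark(image: np.ndarray, pattern: np.ndarray) -> float:
    """Return the fraction of pixels whose lowest bit matches the pattern."""
    return float(np.mean((image[..., 0] & 1) == pattern))


rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
pattern = rng.integers(0, 2, size=(64, 64), dtype=np.uint8)

marked = embed_watermark(img, pattern)
print(detect_watermark(marked, pattern))  # 1.0: watermark fully present

# Small random perturbations (as re-encoding or editing would cause) degrade the
# match toward chance, illustrating why such schemes are easy to remove.
noise = rng.integers(-2, 3, size=marked.shape)
noisy = (marked.astype(int) + noise).clip(0, 255).astype(np.uint8)
print(detect_watermark(noisy, pattern))
```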

4 Jan 2025 - 8 min