AI Safety Fundamentals

A podcast by BlueDot Impact

About AI Safety Fundamentals

Listen to resources from the AI Safety Fundamentals courses! https://aisafetyfundamentals.com/

All episodes

147 episodes
Measuring Progress on Scalable Oversight for Large Language Models

Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan

Original text: https://arxiv.org/abs/2211.03540

Narrated for AI Safety Fundamentals [https://www.agisafetyfundamentals.com/] by Perrin Walker [https://twitter.com/perrinjwalker] of TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
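The experimental design amounts to scoring the same question set under three conditions and comparing accuracies: the model alone, an unaided human, and a human who can chat with the model. Below is a minimal sketch of such a harness; ask_model, ask_human, and ask_human_with_model_chat are hypothetical placeholders for the real protocols, not anything from the paper's codebase.

```python
# Minimal sketch of the three-condition comparison described in the abstract.
# The answer functions are hypothetical placeholders: in the real study they
# would be a model API call, a human annotator, and a human chatting with the
# model, respectively.
from typing import Callable, Dict, List

def accuracy(answer_fn: Callable[[dict], str], questions: List[dict]) -> float:
    """Fraction of multiple-choice questions answered correctly under one condition."""
    correct = sum(answer_fn(q) == q["gold"] for q in questions)
    return correct / len(questions)

def compare_conditions(questions: List[dict],
                       conditions: Dict[str, Callable[[dict], str]]) -> Dict[str, float]:
    """Score every oversight condition on the same question set."""
    return {name: accuracy(fn, questions) for name, fn in conditions.items()}

# The encouraging result reported in the paper is the ordering
#   human + model chat > model alone   and   human + model chat > unaided human
# results = compare_conditions(mmlu_questions, {
#     "model alone": ask_model,
#     "unaided human": ask_human,
#     "human + model chat": ask_human_with_model_chat,
# })
```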

04 Jan 2025 - 9 min
AGI Ruin: A List of Lethalities

I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead. Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Crossposted from the LessWrong Curated Podcast [https://www.lesswrong.com/posts/kDjKF2yFhFEWe4hgC/announcing-the-lesswrong-curated-podcast] by TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.

04 Jan 2025 - 1 h 1 min
Feature Visualization

There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: feature visualization and attribution. This article focuses on feature visualization. While feature visualization is a powerful tool, actually getting it to work involves a number of details. In this article, we examine the major issues and explore common approaches to solving them. We find that remarkably simple methods can produce high-quality visualizations. Along the way we introduce a few tricks for exploring variation in what neurons react to, how they interact, and how to improve the optimization process.

Original text: https://distill.pub/2017/feature-visualization/

Narrated for AI Safety Fundamentals [https://www.agisafetyfundamentals.com/] by Perrin Walker [https://twitter.com/perrinjwalker] of TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
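At its core, feature visualization is activation maximization: start from noise and adjust the input image by gradient ascent until a chosen channel fires strongly. Here is a minimal PyTorch sketch under that reading; the pretrained GoogLeNet, the inception4c layer, and the channel index are illustrative choices, and the article's regularization tricks (transformation robustness, frequency penalties) are omitted.

```python
# Bare-bones activation maximization (illustrative layer/channel, no
# regularization); requires torch and torchvision with pretrained weights.
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()

# Capture the activations of one intermediate layer with a forward hook.
acts = {}
model.inception4c.register_forward_hook(lambda mod, inp, out: acts.update(out=out))

# Optimize the image itself, starting from random noise.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

channel = 7  # which channel of the layer to visualize (illustrative)
for _ in range(256):
    opt.zero_grad()
    model(torch.sigmoid(img))                # sigmoid keeps pixels in [0, 1]
    loss = -acts["out"][0, channel].mean()   # negative mean activation -> ascent
    loss.backward()
    opt.step()

visualization = torch.sigmoid(img).detach()  # an image that excites the channel
```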

04 Jan 2025 - 31 min
Progress on Causal Influence Diagrams

By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then.

What are causal influence diagrams? A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we’re developing a formal theory of incentives based on causal influence diagrams (CIDs).

Source: https://deepmindsafetyresearch.medium.com/progress-on-causal-influence-diagrams-a7a32180b0d1

Narrated for AI Safety Fundamentals [https://www.agisafetyfundamentals.com/] by TYPE III AUDIO [https://type3.audio/].

---

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
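A causal influence diagram is a directed acyclic graph whose nodes are typed as chance, decision, or utility, with incentive concepts read off from the graph structure. The sketch below is a generic illustration using networkx, not the authors' own tooling; the three-node diagram and the crude "instrumental relevance" helper are assumptions made for the example.

```python
# Illustrative three-node CID: a chance node, a decision node, and a utility
# node, encoded as a typed DAG with networkx (not the authors' library).
import networkx as nx

cid = nx.DiGraph()
cid.add_node("State", kind="chance")
cid.add_node("Action", kind="decision")
cid.add_node("Reward", kind="utility")

cid.add_edge("State", "Action")   # the agent observes the state before acting
cid.add_edge("State", "Reward")   # the reward depends on the state...
cid.add_edge("Action", "Reward")  # ...and on the chosen action

def influenced_utilities(g: nx.DiGraph, decision: str) -> list:
    """Utility nodes reachable from the decision: a crude proxy for the
    utilities the agent has an incentive to influence."""
    return [n for n in nx.descendants(g, decision)
            if g.nodes[n]["kind"] == "utility"]

print(influenced_utilities(cid, "Action"))  # ['Reward']
```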

04 Jan 2025 - 23 min
AI Watermarking Won’t Curb Disinformation

Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show.

One common proposal is that big companies should incorporate watermarks [https://en.wikipedia.org/wiki/Digital_watermarking] into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that’s undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.

Unfortunately, watermarking schemes are unlikely to work [https://www.wired.com/story/artificial-intelligence-watermarking-issues/]. So far most have proven easy to remove, and it’s likely that future schemes will have similar problems.

Source: https://transformer-circuits.pub/2023/monosemantic-features/index.html

Narrated for AI Safety Fundamentals [https://aisafetyfundamentals.com/] by Perrin Walker [https://twitter.com/perrinjwalker].

A podcast by BlueDot Impact [https://bluedot.org/]. Learn more on the AI Safety Fundamentals [https://aisafetyfundamentals.com/] website.
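To make the "subtly change many pixels" idea concrete, here is a toy least-significant-bit watermark; it is not any vendor's actual scheme, just an illustration of why such a mark is invisible to the eye, trivially detectable by a program, and just as trivially destroyed by ordinary re-encoding.

```python
# Toy pixel-level watermark: overwrite each pixel's least significant bit
# with a shared secret pattern. Purely illustrative, not a real scheme.
import numpy as np

rng = np.random.default_rng(seed=0)
pattern = rng.integers(0, 2, size=(64, 64), dtype=np.uint8)  # the shared secret

def embed(image: np.ndarray) -> np.ndarray:
    """Set every pixel's LSB to the pattern bit (a change of at most 1/255)."""
    return (image & 0xFE) | pattern

def detect(image: np.ndarray) -> float:
    """Fraction of LSBs matching the pattern: ~1.0 if marked, ~0.5 otherwise."""
    return float(((image & 1) == pattern).mean())

original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
marked = embed(original)
print(detect(marked))    # ~1.0: watermark present
print(detect(original))  # ~0.5: chance level on an unmarked image

# Removal is as easy as the coarse re-quantization a lossy re-encode performs:
attacked = (marked // 4) * 4
print(detect(attacked))  # back to ~0.5: the watermark is gone
```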

04 Jan 2025 - 8 min