Auditing LLMs & Hidden Objectives

17 min · 17 Mar 2025

Description

This episode summarizes "Auditing language models for hidden objectives" by Samuel Marks et al., submitted on March 14, 2025: https://arxiv.org/abs/2503.10965. The paper investigates the feasibility of alignment audits by training a language model with a hidden objective: to exhibit behaviors it believes reward models favor, even if those behaviors are undesirable. Three teams successfully identified the hidden objective using techniques such as interpretability tools, behavioral attacks, and training data analysis.


All episodes

18 episodes

Auditing LLMs & Hidden Objectives

17 Mar 2025 · 17 min

An AI Coding Team: UniTranslator

This episode summarizes "UniTranslator: Collaborative LLMs for Safe Code Translation" by Rabimba Karanjai et al., published March 14, 2025: https://arxiv.org/abs/2503.11237. UniTranslator is a novel framework for code translation that uses a collaborative network of smaller, specialized Large Language Models (LLMs) instead of relying on a single large model. A Director LLM coordinates several agent LLMs, each with specific expertise in programming languages and concepts, to achieve accurate and efficient translations, even for low-resource languages. Preliminary evaluations suggest that UniTranslator can rival or even surpass the performance of larger models on various code translation tasks.

17 Mar 2025 · 19 min

A Novel Method for LLM Conversations: SCOPE

This episode summarizes "Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs using Semantic Space" by Zhiliang Chen et al., submitted on March 14, 2025: https://arxiv.org/abs/2503.11586. SCOPE learns models of conversational transitions and rewards within a continuous semantic space. By predicting how a conversation will evolve semantically and what rewards it will yield, SCOPE can select the LLM responses that maximize long-term conversation quality.

17 Mar 2025 · 18 min

New Chain of Thought Technique: Up to 46% Better Performance

This episode summarizes "Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures", submitted on February 7, 2025: https://arxiv.org/abs/2502.05078. Adaptive Graph of Thoughts (AGoT) is a novel inference framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) at test time. AGoT dynamically decomposes complex problems into interconnected subproblems, forming a directed acyclic graph that unifies the strengths of existing methods such as Chain of Thought (CoT) and Tree of Thoughts (ToT). By expanding only the subproblems that require further analysis, AGoT allocates computational resources efficiently and improves performance on tasks such as multi-hop retrieval, scientific reasoning, and mathematical problem-solving.

10 Feb 2025 · 11 min

The Agentic Era

What is AI agency? This episode explores the evolving concept of "agency" in artificial intelligence (AI).

4 Feb 2025 · 18 min
