The Domestic Yak

Auditing LLMs & Hidden Objectives

17 min · 17 Mar 2025

Description

This episode summarizes "Auditing language models for hidden objectives" by Samuel Marks et al., submitted on March 14, 2025: https://arxiv.org/abs/2503.10965. The paper investigates the feasibility of alignment audits by deliberately training a language model with a hidden objective: to exhibit behaviors it believes reward models favor, even when those behaviors are undesirable. Three teams successfully identified the hidden objective using techniques such as interpretability tools, behavioral attacks, and training data analysis.

All episodes

18 episodes

New Chain of Thought Technique: Up to 46% Better Performance

This episode summarizes "Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures", submitted on February 7, 2025: https://arxiv.org/abs/2502.05078. The paper introduces Adaptive Graph of Thoughts (AGoT), a novel inference framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) at test time. AGoT dynamically decomposes complex problems into interconnected subproblems, forming a directed acyclic graph that unifies the strengths of existing methods such as Chain of Thought (CoT) and Tree of Thoughts (ToT). By expanding only the subproblems that require further analysis, AGoT allocates computational resources efficiently and improves performance on tasks such as multi-hop retrieval, scientific reasoning, and mathematical problem-solving.

10 Feb 2025 · 11 min
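The core idea in the AGoT summary (subproblems arranged as a directed acyclic graph, with only the hard ones expanded further) can be illustrated with a toy sketch. This is not the AGoT algorithm from the paper: how AGoT builds the graph, decides what to expand, and merges answers is described there. Everything below is hypothetical. `solve_dag`, `needs_expansion`, `decompose` and `toy_solve` are illustrative names, the LLM is replaced by a string-formatting stub, and the `max_depth` cap is an assumed way to bound recursion.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical stand-ins: a real system would call an LLM in these functions.
SolveFn = Callable[[str, dict[str, str]], str]
DecomposeFn = Callable[[str], tuple[dict[str, str], dict[str, set[str]]]]


def solve_dag(
    subproblems: dict[str, str],      # node id -> subproblem text
    deps: dict[str, set[str]],        # node id -> ids it depends on
    solve: SolveFn,                   # answers one subproblem given its context
    needs_expansion: Callable[[str], bool],
    decompose: DecomposeFn,           # splits a hard subproblem into a sub-DAG
    depth: int = 0,
    max_depth: int = 2,
) -> dict[str, str]:
    # Every node gets an entry, even if it has no dependencies.
    graph = {node: deps.get(node, set()) for node in subproblems}
    answers: dict[str, str] = {}
    # static_order() yields each node only after all of its dependencies.
    for node in TopologicalSorter(graph).static_order():
        text = subproblems[node]
        context = {d: answers[d] for d in graph[node]}
        # Selective expansion: only nodes flagged as hard get their own
        # sub-DAG, and recursion stops at max_depth.
        if depth < max_depth and needs_expansion(text):
            sub_nodes, sub_deps = decompose(text)
            sub_answers = solve_dag(
                sub_nodes, sub_deps, solve, needs_expansion, decompose,
                depth + 1, max_depth,
            )
            # Prefix sub-answer ids so they cannot collide with outer ids.
            context.update({f"{node}/{k}": v for k, v in sub_answers.items()})
        answers[node] = solve(text, context)
    return answers


def toy_solve(text: str, context: dict[str, str]) -> str:
    # Stub "model": just reports how many inputs the subproblem received.
    return f"{text} [{len(context)} inputs]"


answers = solve_dag(
    subproblems={"a": "find x", "b": "find y", "c": "combine"},
    deps={"c": {"a", "b"}},
    solve=toy_solve,
    needs_expansion=lambda t: t == "find y",
    decompose=lambda t: ({"y1": "step 1", "y2": "step 2"}, {"y2": {"y1"}}),
)
print(answers["c"])  # combine [2 inputs]
```

Here only "find y" is expanded into two chained steps, so it receives two sub-answers as context, while "find x" is answered directly. The final "combine" node runs last because the topological order guarantees both of its dependencies are solved first.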