Subliminal Learning Is Steering Vector Distillation

23 min · I går

Beskrivelse

This research explores subliminal learning, a phenomenon where a student language model inherits behavioral traits from a teacher model even when trained on semantically unrelated data. The authors demonstrate that this process is driven by steering vector distillation, where the teacher’s system prompt acts as a linear direction in activation space that the student internalizes during fine-tuning. By extracting and manipulating these steering vectors, the study shows they are both necessary and sufficient for transmitting traits like specific personality biases or preferences. The findings explain that subliminal learning often fails between different model families because these activation directions are highly model-specific. Furthermore, the researchers identify that adaptive optimizers and low-rank training are essential for the student to successfully capture these subtle signals. Ultimately, the work provides a mechanistic framework for understanding how non-semantic data can unexpectedly alter a model's high-level behavior.

Kommentarer

Vær den første til å kommentere

Registrer deg nå og bli medlem av Best AI papers explained sitt community!

Prøv gratis

Alle episoder

756 Episoder

Subliminal Learning Is Steering Vector Distillation

I går23 min

Subsidizing Sequential Search

This paper explores a market model where competing firms use subsidies to reduce the cost of product inspection for consumers. Through a subsidy-sorting principle, the authors demonstrate that higher-quality firms naturally offer larger subsidies to signal their value and secure priority in the search order. This behavior results in a unique equilibrium where low-quality firms are ignored, intermediate firms distinguish themselves through increasing subsidies, and top-tier firms pool at the maximum subsidy cap. The study further examines how AI-mediated platforms can manipulate this dynamic by pricing "inspection tokens" to extract profit. While this platform intervention can lead to excessive search beyond what is socially optimal, it maintains consumer welfare by reallocating surplus from sellers to buyers and the platform itself. Ultimately, the research characterizes how monetary incentives can efficiently organize consumer attention and information revelation in digital marketplaces.

I går20 min

Meta-Harness: End-to-End Optimization of Model Harnesses

This paper introduces Meta-Harness, an innovative system designed to automate harness engineering for large language models. Unlike traditional methods that rely on manual coding or compressed feedback, this system uses an agentic proposer to search through and optimize the code that governs how models store, retrieve, and process information. By utilizing a filesystem to access full execution traces and prior performance logs, the proposer can perform targeted edits and sophisticated program rewrites. Experimental results demonstrate that Meta-Harness outperforms human-engineered baselines and existing text optimizers across diverse tasks, including text classification, mathematical reasoning, and agentic coding. Ultimately, the research shows that providing automated agents with unfiltered access to historical experience enables the discovery of highly efficient, high-performance system architectures.

2. juni 202617 min

Self-Improving Language Models with Bidirectional Evolutionary Search

Researchers have developed Bidirectional Evolutionary Search (BES) to overcome the limitations of standard language model sampling, which often struggles with sparse feedback and predictable outputs. While traditional methods like tree search are confined to a narrow "entropy shell" of high-probability responses, BES escapes this range by using evolutionary operators such as crossover and translocation to recombine successful segments from different trajectories. Simultaneously, a backward search process decomposes complex goals into manageable sub-goals, providing the dense feedback necessary to guide the forward search. Theoretical analysis demonstrates that this dual approach can exponentially reduce the number of samples required to solve difficult reasoning problems. Experimental results confirm that BES significantly improves performance in both model training and real-time inference across logical, mathematical, and agentic tasks. By integrating genetic algorithms with goal decomposition, the framework enables models to discover novel, high-quality solutions that standard autoregressive generation would likely miss.

1. juni 202620 min

Generative Modeling via Drifting

This paper discusses Drifting Models, a novel generative modeling paradigm that enables high-quality, one-step image generation without the iterative inference required by diffusion or flow-matching models. Instead of decomposing transformations at the sampling stage, this method evolves a pushforward distribution during the training process by utilizing a neural network optimizer. The core mechanism is a drifting field governed by an anti-symmetric property, which uses positive data samples for attraction and generated negative samples for repulsion to achieve a state of equilibrium. This approach minimizes a training-time loss based on the movement of samples, effectively shifting the iterative complexity from the user's inference phase to the model's optimization phase. To handle high-dimensional data like images, the researchers implement the drifting loss within a multi-scale feature space using self-supervised encoders such as latent-MAE. Their results demonstrate state-of-the-art performance on ImageNet 256×256, achieving superior FID scores in both latent and pixel spaces. Furthermore, the model's versatility is highlighted by its success in robotic control tasks, where it matches or exceeds the performance of traditional multi-step diffusion policies.

31. mai 202621 min

Subliminal Learning Is Steering Vector Distillation

Beskrivelse

Kommentarer

Prøv gratis i 14 dager

Alle episoder