Best AI papers explained
This research investigates attention sinks: specific tokens in Transformer models that attract a disproportionate share of attention. The authors show that attention maps which look identical can implement two distinct computational mechanisms: Adaptive NOP and Broadcast. In the Adaptive NOP mechanism, the model attends to a "null" token whose value vector is near zero, which suppresses the update to the residual stream and effectively performs a "no-op" instruction. In the Broadcast mechanism, by contrast, the sink acts as a communication hub that aggregates global information and redistributes it across the entire sequence. Applying specialized diagnostics to vision transformers (ViTs), the study finds that both mechanisms coexist and that the sink often moves from the [CLS] token to specific patch tokens in deeper layers. Finally, the authors report that combining gated attention with register tokens mitigates these artifacts and significantly improves performance on dense spatial tasks.
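To make the contrast concrete, here is a minimal NumPy sketch rather than the paper's diagnostics. It assumes a single attention head, hand-picked attention weights that place 95% of each query's mass on token 0, and made-up value vectors; names such as `attention_update`, `nop_values`, and `bcast_values` are hypothetical. The same attention map is applied twice, and only the sink's value vector differs.

```python
import numpy as np

def attention_update(weights, values):
    """Mix value vectors by attention weights (one row of weights per query token)."""
    return weights @ values

n_tokens, dim = 5, 4
# Toy value vectors, one row per token (illustrative numbers, not real activations).
values = np.arange(1.0, n_tokens * dim + 1).reshape(n_tokens, dim)

# Identical attention map in both cases: every query puts 95% on token 0 (the sink)
# and spreads the remaining 5% evenly over the other tokens.
weights = np.full((n_tokens, n_tokens), 0.05 / (n_tokens - 1))
weights[:, 0] = 0.95

# Adaptive NOP (assumed form): the sink's value vector is zero, so attending to it
# contributes nothing and the update written to the residual stream stays small.
nop_values = values.copy()
nop_values[0] = 0.0
nop_update = attention_update(weights, nop_values)

# Broadcast (assumed form): the sink's value vector holds a global summary, here
# simply the mean of all tokens, so every query receives roughly that summary.
bcast_values = values.copy()
bcast_values[0] = values.mean(axis=0)
bcast_update = attention_update(weights, bcast_values)

print("NOP update norm per token:      ", np.linalg.norm(nop_update, axis=1))
print("Broadcast update norm per token:", np.linalg.norm(bcast_update, axis=1))
```

The attention weights alone cannot distinguish the two cases in this toy setup; the difference only appears once the sink's value vector and the size of the resulting update are inspected.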
764 episodes