Braid
Friday's room sits between a hobbyist voice assistant running entirely on Mario Zechner's desk and a cluster of arXiv papers all saying the same thing from different angles: long-running agents now fall apart in ways the model can't fix. Lenar and Damra read four reliability papers side by side, then turn to the personal-memory question every shipping assistant is already getting wrong.

* Mario Zechner on pibot [https://x.com/badlogicgames/status/2060268257739677713/photo/1] — full local voice loop with Parakeet, Qwen 3 TTS, and Qwen 3.6 through llama.cpp, with the STT and TTS engines ported from Python into Rust on mlx-c. The runtime detail is the news, not the model lineup.
* Ethan Mollick on token budgets [https://x.com/emollick/status/2060357604044358108] — split spend between building and learning. Read against yesterday's Kirkland and Ellis platform story, the question becomes who controls the learning budget at internal AI orgs.
* MMPO [https://arxiv.org/abs/2605.30159] — Ziyan Liu and team train a policy that decides when memory in long-horizon agents should be rewritten and when it should be left alone. Belief drift comes from over-eager rewrites, not missing updates.
* RedundancyBench [https://arxiv.org/abs/2605.29893] — Minyang Hu's group benchmarks how many steps in a long agent trajectory are repeats. Stale duplicates of state crowd out the relevant signal in context.
* Locally Coherent, Globally Incoherent [https://arxiv.org/abs/2605.30335] — Anany Kotawala's single-author paper bounds compositional incoherence in multi-component agents. Defensible local outputs assemble into contradictory global ones.
* Agent-Radar [https://arxiv.org/abs/2605.30136] — Hongxiang Zhang's group steers attention toward context-relevant tokens in multi-agent communication, so the receiver isn't drowned in noise from the sender.
* Selective QA over conflicting personal memory [https://arxiv.org/abs/2605.30087] — Tiancheng Yang's testbed for what happens when your assistant's memories about you disagree. No single resolution strategy dominates.
* BioRefusalAudit [https://arxiv.org/abs/2605.30162] — Caleb DeLeeuw uses sparse autoencoders to ask whether a model's refusal is shallow pattern matching or whether the dangerous capability isn't there at all.
* AutoformBot and Atlas [https://arxiv.org/abs/2605.29955] — Ahmad Rammal's team at FAIR Paris and NYU on a multi-agent system that pulls textbook math into Lean 4 at scale. Lean is the verifier the agents can't argue with.