Model routing isn’t load balancing (And that’s why you’re not ready)

20 min · May 12, 2026

Description

Multi-model AI isn’t a buzzword anymore; it’s how organizations actually operate. In this episode of Pop Goes the Stack, Lori MacVittie and Joel Moses dig into fresh findings from F5's State of Application Strategy Report showing that companies run an average of seven models and that more than half are already orchestrating multiple models together. That’s a big shift, and it changes what “infrastructure readiness” even means.

Why do teams chain models in the first place? The answer: cost, capability, and risk. The uncomfortable part? Most infrastructure is still built for deterministic systems, and AI routing is not the same problem as load balancing. Model routing isn’t about spreading traffic evenly. It’s about making a decision on every request: which model is best for this job, what it will cost, what the risk is, and what the fallback is when the answer is wrong or low quality.

Joel frames it as a category change, from “where should this request go?” to “what should happen as a result of this request?” That shift forces new requirements: policy enforcement across models, identity-aware access, decision justification, and mechanisms to recover when output quality degrades due to drift, configuration changes, or poisoned inputs like compromised RAG data. Lori ties it back to governance, not just availability, and explains why “AI workloads” expose gaps that traditional tooling can’t cover.

Many organizations are operationalizing AI, but that doesn’t mean it’s manageable yet. If you want to know how to move forward from here, this is an episode you don't want to miss. Get your copy of the 2026 State of Application Strategy Report [https://www.f5.com/resources/reports/state-of-application-strategy-report]
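To make the contrast with load balancing concrete, here is a minimal sketch of a per-request routing decision that weighs capability, cost, and risk, and then falls back when an answer is judged low quality. It is an illustration only, not something described in the episode: the model names, prices, scores, and the `route` and `answer` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in dollars
    capability: int            # hypothetical quality score, 1 (weak) to 10 (strong)
    risk_tier: int             # 1 = cleared for sensitive data, 3 = public data only


# Hypothetical catalog; names and numbers are illustrative only.
CATALOG = [
    Model("small-local", 0.0002, 4, 1),
    Model("mid-hosted", 0.002, 7, 2),
    Model("large-hosted", 0.02, 9, 3),
]


def route(min_capability, max_risk_tier, budget_per_1k):
    """Return the models allowed for this request, cheapest first.

    The first entry is the primary choice; the rest form the fallback chain.
    """
    eligible = [
        m for m in CATALOG
        if m.capability >= min_capability
        and m.risk_tier <= max_risk_tier
        and m.cost_per_1k_tokens <= budget_per_1k
    ]
    return sorted(eligible, key=lambda m: m.cost_per_1k_tokens)


def answer(prompt, candidates, call_model, is_acceptable):
    """Try each candidate in order until one output passes the quality check."""
    for model in candidates:
        output = call_model(model, prompt)
        if is_acceptable(output):
            return model.name, output
    raise RuntimeError("no candidate produced an acceptable answer")
```

Here `call_model` and `is_acceptable` are placeholders supplied by the caller. Unlike a load balancer, which treats backends as interchangeable, this decision depends on what the request needs and on whether the result was good enough.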

All episodes

45 episodes

Local-first AI: Keep context out of the cloud

“Just throw it in the cloud” gets complicated when the data is your meetings, your IP, and your operating context. In this episode of Pop Goes the Stack, Lori MacVittie and Joel Moses talk with Michael Daugherty, founder and CEO of Quill Meetings, about why local-first AI is showing up as a serious alternative to cloud-first convenience, especially when your AI is effectively a coworker sitting in every meeting. Local-first tools keep transcription, notes, highlights, and long-term context on your device or inside your org, so your most valuable (and most sensitive) inputs don’t default to third-party APIs.

The payoff:

* Better personalization from the context that only exists locally
* Stronger privacy & compliance for regulated teams and sensitive conversations
* Clear control over the “data tap”: share with other AI tools only when you choose
* Reusable meeting knowledge: build a personal/organizational lexicon you actually own
* Enterprise-friendly paths like private inference servers and VPN-controlled architectures

They also dig into practical realities (hardware variability, GPU/driver quirks, and resilient fallbacks), plus how Quill uses MCP (server + client) to let you bring your meeting corpus into tools like Claude and Cursor while keeping control where it belongs.

Bottom line: context is becoming the competitive advantage in AI, and where that context lives matters. Local-first tools give teams a way to set boundaries, reduce exposure, and still benefit from AI, without assuming the cloud is the only place intelligence can run.

May 26, 2026 · 21 min

DevOps meets AI agents: Risk, audit, and the Deming playbook

AI is no longer a lab tool; it’s showing up in pipelines, production systems, and the places where “seemed like a good idea” becomes a 2 a.m. incident. In this episode of Pop Goes the Stack, Lori MacVittie and Joel Moses are joined by John Willis, known for his work on DevOps and Deming, to separate what’s genuinely new about AI from what looks like the same organizational patterns repeating under a new label. John frames the shift in two parts. First, the human side: every major technology transition triggers the same dynamics, and there’s a century of first principles from Deming and others that still apply. Second, the operational side: AI introduces a different kind of authority into the delivery loop. DevOps optimized for speed with reasonably deterministic pipelines. AI pushes systems into probabilistic behavior, where correctness is no longer guaranteed 100% of the time and audits can’t pretend “this will never happen.” The conversation gets practical about what that means for enterprise teams adopting agents. The real questions aren’t whether tools use MCP or a CLI, but what authority an agent has: read-only, write/mutate, or execute. From there, you need boundaries, containment, escalation policies, kill switches, stronger logging, replayability, and the ability to justify decisions after the fact. The main takeaway is permission to slow down. Step back, define what risk you’re willing to accept at each stage, and build guardrails that match that risk. AI isn’t going away, but “move fast” without a risk model is just handing operational authority to a very smart script and hoping it behaves.
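The authority levels mentioned above (read-only, write/mutate, execute), together with escalation, a kill switch, and an audit trail, can be sketched in a few lines. This is an assumed design for illustration, not a method from the episode; the `Authority` enum and `Guardrail` class are hypothetical names.

```python
from enum import IntEnum


class Authority(IntEnum):
    READ = 1     # read-only
    WRITE = 2    # write/mutate
    EXECUTE = 3  # run commands or trigger side effects


class Guardrail:
    """Hypothetical gate that every agent action passes through."""

    def __init__(self, granted):
        self.granted = granted  # highest authority the agent holds
        self.killed = False     # kill switch state
        self.audit_log = []     # every decision is recorded for later review

    def kill(self):
        """Kill switch: deny everything from now on."""
        self.killed = True

    def check(self, agent, action, required):
        if self.killed:
            decision = "deny"
        elif required <= self.granted:
            decision = "allow"
        else:
            decision = "escalate"  # hand off to a human rather than fail silently
        self.audit_log.append(
            {"agent": agent, "action": action,
             "required": required.name, "decision": decision}
        )
        return decision
```

Because each call appends to `audit_log`, the decision can be justified after the fact, which is the property the episode description emphasizes.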

May 19, 2026 · 23 min

Model routing isn’t load balancing (And that’s why you’re not ready)

May 12, 2026 · 20 min

KV cache is the real inference bottleneck (Not GPUs)

GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack. Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. KV cache is the bridge between them, and keeping it available can skip expensive recomputation and drastically improve time to first token. From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. It’s slower than keeping everything on-GPU, but still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state. The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.
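As a rough illustration of cache-aware routing, the sketch below sends a returning session to the worker that already holds its KV cache and places new sessions on the least-loaded worker. It is a simplification under stated assumptions: the `CacheAwareRouter` class, the worker names, and the in-flight counter are hypothetical, and a real system would also need to handle cache eviction and tiered storage.

```python
class CacheAwareRouter:
    """Hypothetical router that prefers the worker already holding a session's state."""

    def __init__(self, workers):
        self.in_flight = {w: 0 for w in workers}  # requests currently running per worker
        self.session_home = {}                    # session id -> worker holding its KV cache

    def route(self, session_id):
        worker = self.session_home.get(session_id)
        if worker is None or worker not in self.in_flight:
            # Cold start: no cached state anywhere, so pick the least-loaded worker.
            worker = min(self.in_flight, key=self.in_flight.get)
            self.session_home[session_id] = worker
        self.in_flight[worker] += 1
        return worker

    def done(self, worker):
        """Call when a request finishes so load reflects in-flight work."""
        self.in_flight[worker] -= 1

    def evict(self, session_id):
        """Forget a session whose cache was dropped; its next request starts cold."""
        self.session_home.pop(session_id, None)
```

A plain least-loaded balancer would move a busy session to an idle GPU and pay for prefill again. This version accepts some load imbalance in exchange for skipping that recomputation.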

May 5, 2026 · 21 min

Measuring what matters: Observability for agents

Agents break the old rules of observability. Latency, throughput, and error rates still matter, but once software starts making decisions and taking actions on someone else’s behalf, the real question becomes: is it doing the right thing, and is it doing it for the right reasons? In this episode of Pop Goes the Stack, Lori MacVittie and Joel “OpenClaw” Moses are joined by observability expert Chris Hain to unpack what changes when systems become agentic. Instead of a single prompt-response interaction, you get decision chains that branch, loop, call tools, and evolve over time. A system can “succeed” operationally while still being wrong, expensive, or misaligned with intent. Chris argues you don’t have to throw away what already works. Distributed tracing still applies, but now each agent step becomes a span, decorated with richer metadata like model identity, tool calls, token usage, prompts, and cost. The discussion also dives into why standardization matters, including OpenTelemetry and emerging semantic conventions for generative and agentic AI, and why auto-instrumentation approaches like eBPF become critical when agents generate code that has no built-in telemetry. Joel adds a new set of metrics that feel uncomfortably necessary: decision loops per task, drift in tool-call chains, human override frequency, and the cost and token patterns that signal something has changed. The group also tackles the awkward feedback loop of using agents to make observability actionable, while acknowledging the risk of agents optimizing the dashboard instead of the system. If you’re building agentic workflows, this episode is a practical guide to why “failed successfully” is now a real production state, and why instrumenting for correctness and intent alignment is the next observability frontier.
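The idea of treating each agent step as a span with richer metadata, then deriving per-task metrics from those spans, can be sketched with plain dataclasses. This example deliberately avoids any real tracing library: the field names are hypothetical and are not the OpenTelemetry semantic conventions, and `summarize` is an assumed helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepSpan:
    """One agent step, recorded like a trace span (hypothetical field names)."""
    task_id: str
    step: str
    model: str
    tool: Optional[str]  # tool called in this step, if any
    tokens: int
    cost_usd: float
    human_override: bool = False


def summarize(spans):
    """Aggregate per-task metrics: step count, tokens, cost, and human overrides."""
    summary = {}
    for s in spans:
        task = summary.setdefault(
            s.task_id, {"steps": 0, "tokens": 0, "cost_usd": 0.0, "overrides": 0}
        )
        task["steps"] += 1
        task["tokens"] += s.tokens
        task["cost_usd"] += s.cost_usd
        task["overrides"] += int(s.human_override)
    return summary
```

Tracking `steps` per task over time gives a simple signal for the "decision loops per task" metric mentioned above. A task that used to finish in three steps and now takes nine has changed, even if it still reports success.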

April 28, 2026 · 20 min
