The Stateless Founder
STOP SELLING MODEL NAMES. SELL UPTIME: MULTI-PROVIDER ROUTING WITH CLIENT-FACING SLOS

THE PROBLEM NOBODY TALKS ABOUT

Every AI provider goes down. Not maybe. Not occasionally. Regularly.

* November 25, 2024: OpenAI suffered widespread timeouts and 503 errors for hours
* September 2025: Anthropic published postmortems for three separate Claude API incidents
* 2024-2025: Cloudflare global incidents cascaded into half the AI services on the internet

If your revenue depends on AI output, a single-provider architecture is a single point of failure with your name on it.

THE SOLUTION: RELIABILITY AS A FEATURE

Stop leading with "We use GPT-4" or "We're on Claude." Start leading with numbers:

* 99.5% of requests succeed
* P95 latency under 2.5 seconds
* Average cost per request under $0.015

That's a promise a client can hold you to, and it makes you worth more than the person who just says "we use the best model."

THE TECHNICAL STACK

1. TWO-PROVIDER ROUTER WITH LITELLM

Not five providers. Not a fancy model cascade. Two.

* Primary gets weight of 9, secondary gets weight of 1
* LiteLLM retries in-group once, then fails over automatically
* Your app hits one endpoint; routing happens behind the proxy
* Keep a bypass switch: BYPASS_ROUTER=true for 30-second rollback

Key Configuration:

* Set explicit routing order (primary first, secondary only on failure)
* 2-second stream timeout for time-to-first-token
* Pin providers for latency-critical paths

2. BUDGET GUARDRAILS AND COST CONTROL

The problem: Secondary providers can be 3x more expensive per token

The solution: Budget guardrails in LiteLLM

* Maximum cost per request
* Maximum tokens in/out
* Graceful degradation (truncate context, switch to cheaper model, return cached response)

Observability Stack:

* Tag every request: tenant ID, feature, provider, tokens, cost
* Pipe into Langfuse or Helicone (both have free tiers)
* Three alerts only:
  1. P95 latency over target for 15 minutes → page
  2. Success rate below target for 5 minutes → page
  3. Average cost per request over budget for 15 minutes → page

3. TRAVEL-MODE CACHE

The reality: Airport throttling, café wifi drops, connectivity chaos

The solution: Write-through cache + service workers

* Every router response written to local cache (Redis, SQLite)
* Keyed on normalized prompt version
* Service worker intercepts fetch requests, falls back to cache on network failure
* Bonus: 60%+ cache hit rates on repetitive prompts = major cost savings

Provider-side optimization:

* Anthropic prompt caching for stable blocks (system instructions, tool definitions)
* Default 5-minute TTL, optional 1-hour cache
* Reduces both latency and input token cost

CLIENT-FACING SLOS

THE LANGUAGE THAT WINS DEALS

Most AI agency proposals: "We use state-of-the-art AI models"

Your proposal:

> "99.5% success rate, p95 latency under 2.5 seconds, average cost per request under $0.015, measured over a rolling 30-day window"

Why this works:

* CTO understands your architecture
* VP of Operations understands "99.5% uptime"
* Different audiences, different languages

SLO VS SLA DISTINCTION

* SLI = The measurement (p95 latency)
* SLO = The target ("95% of requests complete in under 2.5 seconds")
* SLA = The contract (legal commitment with penalties)

Publish SLOs, not SLAs. SLO = transparency commitment. SLA = legal obligation with penalties.

ERROR BUDGET FRAMEWORK

If your target is 99.5% success rate over 30 days:

* You're allowed to fail on 0.5% of requests
* On 10,000 requests/month = 50 allowed failures
* Spend budget on deploys, experiments, provider hiccups
* When it's gone, freeze changes and stabilize

THE 30-MINUTE FRIDAY DRILL

WHY MANUAL DRILLS MATTER

Don't automate the drill. The point isn't to test the system; it's to test you. AWS calls these "chaos game days." Google calls them "Wheel of Misfortune exercises."

DRILL STRUCTURE (30 MINUTES)

Three roles (even if you're playing all three):

1. Drill lead runs the clock
2. Operator flips the switch
3. Scribe captures what happened

The process:

1. Revoke your primary provider's API key
2. Watch the router fail over
3. Confirm p95 stays within target
4. Restore the key and verify everything's green

Tie results to error budget: If failover took longer than expected or success rate dipped below SLO, that's a finding. Log it, fix it, run again next quarter.

WHEN IT'S BORING, IT WORKS

The goal: Make reliability boring. If your infrastructure is exciting, something's wrong. Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.

ACTION ITEMS

This week:

1. Stand up the router with two providers
2. Set the three alerts
3. Run the drill Friday

Next two weeks:

* Layer in the cache
* Add SLO language to proposals
* Implement full observability

RESOURCES

Download the complete Reliability SLO Kit:

* SLO one-pager template
* Budget guardrail sheet with alert thresholds
* Router config
* Cache recipe
* 30-minute drill SOP with rollback steps
* Client-safe proposal language

Available on the Resources page

----------------------------------------

Legal disclaimer: The SLO/SOW language provided is template language, not legal advice. Have your counsel review before shipping to clients.
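
A rough sketch of the travel-mode cache idea from the technical stack, on the server side only (the service worker half lives in the browser and is not shown). It is not the episode's cache recipe. Everything named here is an assumption for illustration: the `PROMPT_VERSION` constant, the `TravelCache` class, the `ask` helper, the `call_router` callable standing in for your router endpoint, and the choice to normalize prompts by collapsing whitespace only.

```python
import hashlib
import sqlite3

PROMPT_VERSION = "v1"  # hypothetical: bump when the prompt template changes


def cache_key(prompt, version=PROMPT_VERSION):
    """Key on prompt version plus a whitespace-normalized prompt (assumed rule)."""
    normalized = " ".join(prompt.split())
    return hashlib.sha256(f"{version}:{normalized}".encode("utf-8")).hexdigest()


class TravelCache:
    """Tiny SQLite-backed response cache (illustrative, single-process)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(key TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put(self, prompt, body):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO responses (key, body) VALUES (?, ?)",
                (cache_key(prompt), body),
            )

    def get(self, prompt):
        row = self.db.execute(
            "SELECT body FROM responses WHERE key = ?", (cache_key(prompt),)
        ).fetchone()
        return row[0] if row else None


def ask(prompt, call_router, cache):
    """Write-through on success; fall back to the cache on network failure."""
    try:
        body = call_router(prompt)
    except (ConnectionError, TimeoutError):
        cached = cache.get(prompt)
        if cached is None:
            raise  # nothing cached for this prompt, surface the failure
        return cached
    cache.put(prompt, body)
    return body
```

Every successful response is written before it is returned, so the cache is always as fresh as the last good call. A real deployment would likely add a TTL and tenant scoping to the key; both are left out to keep the sketch short.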
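
The error budget arithmetic and the three headline numbers (success rate, p95 latency, average cost per request) can be checked with a few lines of code. This is a minimal sketch, not the contents of the SLO kit. The `RequestRecord` shape, the function names, and the nearest-rank percentile method are assumptions made for illustration; the default targets are the ones quoted in the notes.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One logged request (hypothetical shape for this sketch)."""
    ok: bool
    latency_s: float
    cost_usd: float


def p95(values):
    """95th percentile using the nearest-rank method (an assumed choice)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def error_budget(total_requests, slo_success_rate):
    """Number of failed requests the SLO tolerates in the window."""
    return round(total_requests * (1 - slo_success_rate))


def slo_report(records, slo_success=0.995, slo_p95_s=2.5, slo_cost_usd=0.015):
    """Compute the three SLIs and the remaining error budget for a window."""
    total = len(records)
    failures = sum(1 for r in records if not r.ok)
    budget = error_budget(total, slo_success)
    p95_latency = p95([r.latency_s for r in records])
    avg_cost = sum(r.cost_usd for r in records) / total
    return {
        "success_rate": (total - failures) / total,
        "p95_latency_s": p95_latency,
        "avg_cost_usd": avg_cost,
        "latency_ok": p95_latency < slo_p95_s,
        "cost_ok": avg_cost < slo_cost_usd,
        "budget_total": budget,
        "budget_remaining": budget - failures,
        # Budget spent: stop deploys and experiments, stabilize.
        "freeze_changes": failures >= budget,
    }


print(error_budget(10_000, 0.995))  # 50 allowed failures
```

Feed it the rolling 30-day window of tagged requests. When `freeze_changes` comes back true, the budget is gone and the next deploy waits.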