Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

16 min · Yesterday

Description

THE PROBLEM NOBODY TALKS ABOUT

Every AI provider goes down. Not maybe. Not occasionally. Regularly.
* November 25, 2024: OpenAI suffered widespread timeouts and 503 errors for hours
* September 2025: Anthropic published postmortems for three separate Claude API incidents
* 2024-2025: Cloudflare global incidents cascaded into half the AI services on the internet

If your revenue depends on AI output, a single-provider architecture is a single point of failure with your name on it.

THE SOLUTION: RELIABILITY AS A FEATURE

Stop leading with "We use GPT-4" or "We're on Claude." Start leading with numbers:
* 99.5% of requests succeed
* P95 latency under 2.5 seconds
* Average cost per request under $0.015

That's a promise a client can hold you to, and it makes you worth more than the person who just says "we use the best model."

THE TECHNICAL STACK

1. TWO-PROVIDER ROUTER WITH LITELLM

Not five providers. Not a fancy model cascade. Two.
* Primary gets weight of 9, secondary gets weight of 1
* LiteLLM retries in-group once, then fails over automatically
* Your app hits one endpoint; routing happens behind the proxy
* Keep a bypass switch: BYPASS_ROUTER=true for 30-second rollback

Key Configuration:
* Set explicit routing order (primary first, secondary only on failure)
* 2-second stream timeout for time-to-first-token
* Pin providers for latency-critical paths

2. BUDGET GUARDRAILS AND COST CONTROL

The problem: Secondary providers can be 3x more expensive per token

The solution: Budget guardrails in LiteLLM
* Maximum cost per request
* Maximum tokens in/out
* Graceful degradation (truncate context, switch to cheaper model, return cached response)

Observability Stack:
* Tag every request: tenant ID, feature, provider, tokens, cost
* Pipe into Langfuse or Helicone (both have free tiers)
* Three alerts only:
  1. P95 latency over target for 15 minutes → page
  2. Success rate below target for 5 minutes → page
  3. Average cost per request over budget for 15 minutes → page

3. TRAVEL-MODE CACHE

The reality: Airport throttling, café wifi drops, connectivity chaos

The solution: Write-through cache + service workers
* Every router response written to local cache (Redis, SQLite)
* Keyed on normalized prompt version
* Service worker intercepts fetch requests, falls back to cache on network failure
* Bonus: 60%+ cache hit rates on repetitive prompts = major cost savings

Provider-side optimization:
* Anthropic prompt caching for stable blocks (system instructions, tool definitions)
* Default 5-minute TTL, optional 1-hour cache
* Reduces both latency and input token cost

CLIENT-FACING SLOS

THE LANGUAGE THAT WINS DEALS

Most AI agency proposals: "We use state-of-the-art AI models"

Your proposal:
> "99.5% success rate, p95 latency under 2.5 seconds, average cost per request under $0.015, measured over a rolling 30-day window"

Why this works:
* CTO understands your architecture
* VP of Operations understands "99.5% uptime"
* Different audiences, different languages

SLO VS SLA DISTINCTION
* SLI = The measurement (p95 latency)
* SLO = The target ("95% of requests complete in under 2.5 seconds")
* SLA = The contract (legal commitment with penalties)

Publish SLOs, not SLAs. SLO = transparency commitment. SLA = legal obligation with penalties.

ERROR BUDGET FRAMEWORK

If your target is 99.5% success rate over 30 days:
* You're allowed to fail on 0.5% of requests
* On 10,000 requests/month = 50 allowed failures
* Spend budget on deploys, experiments, provider hiccups
* When it's gone, freeze changes and stabilize

THE 30-MINUTE FRIDAY DRILL

WHY MANUAL DRILLS MATTER

Don't automate the drill. The point isn't to test the system; it's to test you. AWS calls these "chaos game days." Google calls them "Wheel of Misfortune exercises."

DRILL STRUCTURE (30 MINUTES)

Three roles (even if you're playing all three):
1. Drill lead runs the clock
2. Operator flips the switch
3. Scribe captures what happened

The process:
1. Revoke your primary provider's API key
2. Watch the router fail over
3. Confirm p95 stays within target
4. Restore the key and verify everything's green

Tie results to error budget: If failover took longer than expected or success rate dipped below SLO, that's a finding. Log it, fix it, run again next quarter.

WHEN IT'S BORING, IT WORKS

The goal: Make reliability boring. If your infrastructure is exciting, something's wrong. Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.

ACTION ITEMS

This week:
1. Stand up the router with two providers
2. Set the three alerts
3. Run the drill Friday

Next two weeks:
* Layer in the cache
* Add SLO language to proposals
* Implement full observability

RESOURCES

Download the complete Reliability SLO Kit:
* SLO one-pager template
* Budget guardrail sheet with alert thresholds
* Router config
* Cache recipe
* 30-minute drill SOP with rollback steps
* Client-safe proposal language

Available on the Resources page

----------------------------------------

Legal disclaimer: The SLO/SOW language provided is template language, not legal advice. Have your counsel review before shipping to clients.
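
The error-budget arithmetic and the three SLO targets in these notes can be checked with a few lines of Python. What follows is a minimal sketch under stated assumptions, not the contents of the Reliability SLO Kit: the `Request` record, the `Slo` defaults, and every function name are hypothetical, and the nearest-rank percentile is only one of several reasonable ways to define p95.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool          # did the request succeed?
    latency_s: float  # end-to-end latency in seconds
    cost_usd: float   # provider cost for this request


@dataclass
class Slo:
    # Targets quoted in the episode notes; adjust per client.
    success_rate: float = 0.995
    p95_latency_s: float = 2.5
    avg_cost_usd: float = 0.015


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def error_budget(total_requests, failures, slo):
    """Return (allowed_failures, remaining_budget) for the window."""
    # The small epsilon guards against float noise such as 49.99999999.
    allowed = math.floor(total_requests * (1 - slo.success_rate) + 1e-9)
    return allowed, allowed - failures


def check_slo(requests, slo):
    """Compare one window of requests against each SLO target."""
    total = len(requests)
    failures = sum(1 for r in requests if not r.ok)
    allowed, remaining = error_budget(total, failures, slo)
    return {
        "success_ok": failures <= allowed,
        "latency_ok": p95([r.latency_s for r in requests]) <= slo.p95_latency_s,
        "cost_ok": sum(r.cost_usd for r in requests) / total <= slo.avg_cost_usd,
        "budget_remaining": remaining,
    }
```

With the numbers from the notes, `error_budget(10_000, 12, Slo())` gives 50 allowed failures and 38 remaining. A negative remainder is the signal to freeze changes and stabilize.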


All episodes

28 episodes


The One-Page Operating Plan: 90 Days to a Nomad Business You Can Run From Anywhere

EPISODE OVERVIEW

Stop drowning in scattered Notion databases and Trello boards. In this episode, Santi and Kira show you how to consolidate your entire AI nomad business onto a single page that you can review in 12 minutes every Monday from any café with sketchy wifi.

WHAT YOU'LL LEARN

THE ONE-PAGE FRAMEWORK
* Offer & ICP: One line each, stating the outcome you deliver, not the tools you use
* Pricing & Margins: Target 50-65% for AI agencies, 75-85% for pure SaaS
* LTV:CAC Ratios: Aim for >3 with payback under 12 months
* Acquisition Model: Primary/secondary channels with conversion rates
* Self-Serve Flow: Time-to-value under 5 minutes
* QA Wall: Three layers of protection
* 90-Day Milestones: Specific, measurable outcomes

THREE COMPLETE BUSINESS MODELS

1. CONTENT OPS AGENCY
* Offer: 8 SEO articles + 12 social repurposes + 1 email brief/month
* Pricing: $5,000/month retainer + $2,000 setup
* Costs: $1,750 labor + $150 AI/tools + contingency = 55-60% margin
* KPIs: 72-hour activation, 90% QA pass rate, 6+ month retention

2. FRACTIONAL REVOPS
* Offer: HubSpot hygiene, lead routing, lifecycle stages, renewal risk flags
* Pricing: $5,500/month + $3,500 onboarding audit
* Costs: $2,000-3,000 operator + $100-300 tooling = 50-60% margin
* KPIs: +3-5 pts MQL→SQL conversion, <5 min lead response

3. SUPPORT AUTOMATIONS
* Offer: AI agent deployment with escalation workflows
* Pricing: $3,000 setup + $3,000/month + $0.99/resolution pass-through
* Target: 50-70% autonomous resolution on covered intents
* QA: Per-ticket LLM judge + weekly golden-set replay

THE NOMAD-SPECIFIC CHECKS

THE LISBON TEST (5 QUESTIONS)
1. Can I review KPIs on my phone in under 3 minutes?
2. Do critical delivery steps have offline fallbacks?
3. Are handoffs async with owner, due time, and definition of done?
4. Is tool spend capped as percentage of MRR with alerts?
5. Does one Notion page link everything I need for the week?

VISA RUN REVENUE
* Base living costs + coworking + tools + visa run costs (amortized monthly)
* 20% buffer for unexpected expenses
* Schengen 90/180 rule: 90 days in any rolling 180-day period

KEY RESOURCES MENTIONED
* David Skok's SaaS Metrics 2.0: LTV:CAC and payback guidelines
* PLG Handbook: Time-to-value under 5 minutes benchmark
* Parakeeto/Agency Management Institute: 50-65% delivery margin targets
* Intercom Fin: $0.99 per resolution pricing model
* EdgeTier/MaxContact: QA sampling statistics (2-5% manual coverage)

TOOLS & PRICING REFERENCES
* OpenAI API: Track weekly spend to protect margins
* Notion Business: Current workspace pricing for team costs
* Make.com: Credit-based automation platform pricing
* Google Workspace: $7-22/user/month for business tiers

DOWNLOADS

🎯 The Nomad AI Operator One-Pager Kit
* Notion one-page operating canvas (fill-in-the-blanks)
* Google Sheets KPI/Runway tracker with defaults
* Three worked examples (Content Ops, RevOps, Support)
* Lisbon Test checklist
* Visa Run Revenue calculator

Available on the Resources page

COMMUNITY CHALLENGE

Fill out your one-page plan this week and post it in the community. We're doing live teardowns on Wednesday's office hours: we'll pressure-test your margins and run the Lisbon Test on your stack.

NEXT STEPS
1. Tonight: Block 60 minutes, duplicate the Notion canvas, fill every field
2. This Week: Set up the Google Sheet with your actual costs and runway
3. Monday: Start the 12-minute weekly review cadence
4. 90 Days: Evaluate whether you have a business you can see clearly

----------------------------------------

Episode 17 • Season 1 • The Stateless Founder

Jun 1, 2026 · 15 min
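
The Visa Run Revenue rule and the margin targets in the notes above are simple arithmetic, so a short sketch may help make them concrete. This is an illustration only, not the kit's calculator: the function names, the `months_between_runs` parameter, and the sample cost figures in the usage note are assumptions.

```python
def visa_run_revenue(living, coworking, tools, visa_run_cost,
                     months_between_runs, buffer=0.20):
    """Monthly revenue floor: fixed costs plus amortized visa runs, plus a buffer."""
    # Spread the cost of one visa run over the months between runs.
    amortized_run = visa_run_cost / months_between_runs
    base = living + coworking + tools + amortized_run
    return round(base * (1 + buffer), 2)


def gross_margin(monthly_price, monthly_costs):
    """Delivery margin as a fraction of the monthly price."""
    return (monthly_price - monthly_costs) / monthly_price
```

With hypothetical costs of $2,000 living, $200 coworking, $300 tools, and a $600 visa run every 3 months, `visa_run_revenue(2000, 200, 300, 600, 3)` returns 3240.0. For the Content Ops example, `gross_margin(5000, 1750 + 150)` is 0.62 before contingency, which is consistent with the 55-60% quoted once contingency is included.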

Attach a Migration + License Addendum to Your Next SOW

THE PROBLEM: EVERYTHING BECOMES "DELIVERABLES"

Most nomad builders treat everything they deliver as one blob: "deliverables." Client pays, client owns deliverables. Done.

But that includes:
* Your prompt library that took a year to build
* Connector templates you use across every client
* Scoring models trained on your own data
* Monitoring scripts and error-handling patterns

All lumped together with the custom dashboard built specifically for their use case.

THE SOLUTION: BACKGROUND IP VS FOREGROUND IP

Background IP: Everything you brought to the engagement (pre-existing tools, libraries, templates, models)

Foreground IP: Stuff created specifically for this client under this SOW

The addendum says: client owns the foreground, you keep the background. But you license the background to the client so they can actually use what you built them.

THE THREE-PART SOW ADDENDUM

1. BACKGROUND IP SCHEDULE

A literal table listing every reusable component:
* Component name, type, version, owner
* License scope: "internal use only," "seat-based," "usage-based"
* Takes ~20 minutes if you know your stack

2. LICENSE GRANT WITH THREE PRICING PATHS

Seat-Based: Simple predictability
* 5 users × $10/seat/month = $50/month
* Right fit when access tied to named humans
* Agent-assist tools, back-office dashboards

Usage-Based with Caps: Value alignment without bill shock
* Base fee + per-unit rate above threshold + monthly ceiling
* Real-time usage meters so clients see exactly where they stand
* Hybrid model accelerating in AI-powered features

Revenue-Share: For outcome-tied modules
* Percentage of attributable revenue + monthly minimum
* Requires attribution rules in contract (last-touch, split, uplift)
* Upsell engines, lead-gen tools, pricing optimizers

3. GUARANTEED DATA HANDBACK CLAUSE
* Export client data (not your tools) in machine-readable format
* 30-60 day window, deletion certificate provided
* GDPR Article 28 already requires this for personal data
* Changes negotiation: "your data is yours, my tools are mine"

MIGRATION CHECKLIST: MAKING LICENSES CREDIBLE

PHASE 1: PREP
* Name owners, shared channel setup
* Mirror environment with masked data
* Schema diff: Source vs target, field by field
* Rate-limit planning: Bulk endpoints, client-side throttling, exponential backoff

PHASE 2: TEST AND CUT
* Dry run on 1-5% of data, reconcile counts
* Freeze period, final sync, switch DNS/keys/webhooks
* Rollback triggers: Record mismatch threshold, sustained 500s, critical test failures
* No heroics from hammocks in Gili Air

PHASE 3: POST-CUTOVER
* Reconciliation report signed by both sides
* Observability on, legacy credentials cleaned up

SWITCHING COSTS: VALUE, NOT HOSTAGE DYNAMICS

The Calculator Inputs (from academic research):
* Rebuild hours × blended rate
* Integration rework time
* Training hours by role
* PM overhead
* Opportunity cost per day of freeze
* Contractual fees

Key Principle: Share the math transparently. Walk clients through inputs, let them adjust numbers. Transparency separates value-based switching costs from hostage situations.

REGULATORY CONTEXT
* EU Data Act (2024): Pushing seamless switching between providers
* GDPR Article 28: Requires data return/deletion at service end
* Market trends: Hybrid pricing models rising, seat-only declining
* Gartner research: Value enhancement drives loyalty, not switching costs

RESOURCES

Migration + License Addendum Playbook includes:
1. SOW addendum with all three pricing options
2. Background IP schedule template
3. Migration runbook (schema diffs, rate limits, rollback)
4. Switch-cost calculator with formulas

KEY SOURCES
* Terms.Law IP + Work Product Addendum Generator
* AWS Prescriptive Guidance on migration cutovers
* Maxio 2025 SaaS Pricing Trends Report
* SEG 2026 Annual SaaS Report
* Burnham, Frels, Mahajan switching cost typology

----------------------------------------

Next episode: Wednesday

May 29, 2026 · 16 min
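
The three pricing paths in the license grant above reduce to three small formulas. The sketch below shows one way to express them; the function and parameter names are hypothetical, the rates in the usage note are made up for illustration, and a real contract would also need the attribution rules the notes mention.

```python
def seat_based(seats, price_per_seat):
    """Seat-based: named users times a flat monthly price."""
    return seats * price_per_seat


def usage_with_cap(units, base_fee, included_units, unit_rate, monthly_ceiling):
    """Usage-based with caps: base fee, per-unit rate above a threshold, ceiling."""
    # Only units above the included threshold are billed per unit.
    overage_units = max(0, units - included_units)
    bill = base_fee + overage_units * unit_rate
    # The ceiling is what prevents bill shock in a heavy month.
    return min(bill, monthly_ceiling)


def revenue_share(attributable_revenue, share, monthly_minimum):
    """Revenue-share: a percentage of attributable revenue, floored by a minimum."""
    return max(attributable_revenue * share, monthly_minimum)
```

`seat_based(5, 10)` reproduces the $50/month example from the notes. With an assumed $200 base fee, 1,000 included units, $0.05 per extra unit, and a $500 ceiling, 3,000 units bill at about $300 while 20,000 units stop at the $500 ceiling.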

The 14-Day Partner Sprint: Feed-Drops, Mini-Templates, and the 15-Minute SLA

THE QUESTION THAT STARTED IT ALL

Someone in Kira's Slack community asked: "I've done three collabs this year. A podcast swap, a newsletter mention, a joint webinar. Each one spiked traffic for like two days and then nothing. How do I make partnerships actually compound instead of just being one-off favors?"

The answer: Stop treating partnerships like networking events. Start treating them like a systematic distribution channel.

THE THREE MISSING PIECES

Most partnership marketing fails because it's missing:
1. A shared asset that lives beyond the collab: not a moment, but something that keeps working
2. Tracking that tells you which partner actually moved the needle, so you can prove ROI and repeat what works
3. A response system: when someone shows up from a partner's audience, you answer in 15 minutes, not 15 hours

THE 14-DAY PARTNER SPRINT SYSTEM

PARTNER SELECTION: THE ADJACENCY TEST

Use these five criteria to filter potential partners:
* Does their audience overlap with yours (same job title, same problem)?
* Do they cover topics within your top three themes?
* Can you ship the collab async?
* Is their engagement real (actual clicks and listens, not vanity followers)?
* Is there a clear contact you can reach?

Pass rate needed: 4 out of 5. If they only pass 3, the fit is too loose.

Partner types to target:
* Podcasters
* Community admins
* Tool companies
* Agencies
* Educators (newsletter writers, course creators)

Target: 4 prospects in each category = 20 total on your shortlist

Expected yes rate: 20-30% (plan for 70-80% rejection)

THE ASSETS THAT ACTUALLY COMPOUND

Feed-drops: A full episode from your podcast publishes directly in another podcast's RSS feed.

Key requirements:
* Host-voiced intro (20-30 seconds)
* Talent reads outperform generic announcer reads by 3 points on purchase intent
* Realistic conversion: ~0.67% device conversion (Chartable SmartPromos data)

Mini-templates: One-page, co-branded assets that solve a specific problem for the partner's audience
* Takes ~3 hours to produce
* Gate with email for 7 days, then open up
* Personalized assets drive 4x more demo requests than generic content (ON24 benchmarks)

THE MEASUREMENT LAYER

Wire three tracking systems from day one:

1. UTMs on every link
* Source = partner name
* Medium = channel type
* Campaign = sprint month
* Track in GA4: template view, template claim, demo intent

2. SmartPromos through Chartable
* For podcast-to-podcast attribution
* Tracks device conversion: did someone who heard the promo subsequently download your show?

3. Self-reported attribution
* "How did you first hear about us?" dropdown on template gates and demo forms
* Partner names in the options
* Cross-reference against UTM data; when they disagree, trust the human

THE 15-MINUTE SLA

The setup:
* Slack channel for any form submission with partner UTM or word "referred"
* Make or Zapier automation (10 minutes to build)
* Coverage blocks that overlap with your biggest partner's audience

The target: 15 minutes to first reply (not to close)

The message: "Hey, thanks for coming via [partner]. Here's a 15-minute fit check - pick a time."

Why it matters: Harvard Business Review study shows responding within an hour makes you nearly 7x more likely to qualify a lead. Most nomads respond the next morning because they were asleep in a different time zone.

THE SPRINT TIMELINE
* Day 1: Build the list and wire the tracking
* Day 3: Send 20 outreach messages
* Days 4-6: Negotiate and produce assets
* Days 8-12: Feed-drops and templates go live
* Day 13: Pull numbers and send partners a 5-line recap with their stats
* Day 14: Debrief, duplicate the board, load 5 new prospects for next sprint

THE COMPOUNDING FLYWHEEL

After the first sprint:
* You have a proven partner and co-created asset
* The partner knows you deliver
* The asset has a landing page and tracking
* Next sprint: skip prospecting for that partner, go straight to "what do we ship next?"
* Add 2 new partners to the rotation

Sprint progression:
* Sprint 1: 2 partners
* Sprint 2: 4 partners
* Sprint 3: 6 partners

Each tracked asset keeps collecting emails between sprints.

WHY THIS BEATS COLD OUTREACH FOR NOMADS
* Paid ads: Require budget and constant optimization
* SEO: Takes months for results
* Partnership marketing: Done this way, gives you signal in 14 days
* Location independence: Every asset ships async, no Zoom calls required

RESOURCES

Get the complete 14-Day Partner Sprint Kit with outreach scripts, negotiation checklist, Notion calendar, UTM spreadsheet, and SLA routing setup at https://statelessfounder.com/resources

----------------------------------------

Your one move this week: Build the 20-name shortlist. Run the adjacency test. If 4 pass, you're ready to sprint.

May 27, 2026 · 13 min
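
The UTM convention from the measurement layer above (source = partner name, medium = channel type, campaign = sprint month) is easy to apply consistently with a small helper. This is a sketch using only Python's standard `urllib.parse`; the function name `add_utm` and the example URL and values are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url, partner, channel, sprint_month):
    """Return url with the sprint's UTM parameters added or overwritten."""
    parts = urlsplit(url)
    # Keep any query parameters the link already carries.
    query = dict(parse_qsl(parts.query))
    query.update({
        "utm_source": partner,         # partner name
        "utm_medium": channel,         # channel type, e.g. feed-drop
        "utm_campaign": sprint_month,  # sprint month, e.g. 2026-06
    })
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `add_utm("https://example.com/template?ref=1", "acme-podcast", "feed-drop", "2026-06")` keeps `ref=1` and appends the three UTM parameters. Generating every partner link through one function keeps partner names spelled the same way in GA4 and in the self-reported dropdown.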

Build a Three-Layer QA Wall for AI Outputs in 48 Hours

Every AI deliverable you ship without quality checks is a bet against model drift, prompt degradation, and silent failures. This episode builds a three-layer QA wall that catches problems before clients do.

THE COST OF NOT CHECKING
* Human evaluation: $50 per case, 10 minutes
* LLM judge evaluation: $0.02 per case, 16 seconds
* At 1,000 cases/week: $50,000 vs $20 in evaluation costs

LAYER 1: RUBRIC-SCORED LLM JUDGE

Deploy an LLM judge against a weighted rubric before every deliverable ships:

FIVE-CRITERIA RUBRIC
* Task fulfillment (30%): Did it follow instructions?
* Factual accuracy (25%): Are claims verifiable?
* Clarity and structure (15%): Is it well-organized?
* Style and brand fit (10%): Matches client voice?
* Citations (10%): Proper attribution?
* Safety flags (negative weight): PII leakage, hallucinations

SCORING THRESHOLDS
* Green (ships automatically): 0.8+ total, no critical flags, top two criteria 4+
* Amber (human edit queue): 0.7-0.8 total, or any criterion ≤2
* Red (blocked/escalated): <0.7 total or any critical flag

RESEARCH BACKING
* ICLR 2026 AutoMetrics: +33.4% correlation with humans vs direct LLM-as-judge
* AAAI 2026 Think-J: Rubric-anchored judges more robust to noisy training data

LAYER 2: GOLDEN-SET REPLAY AND DRIFT DETECTION

Build a golden set of 40-60 items per output type, scored by humans with agreed-upon labels and rationales.

WEEKLY CALIBRATION PROCESS
1. Replay golden set through your judge
2. Measure agreement using Cohen's kappa or Kendall's tau
3. Kappa >0.61 = substantial agreement
4. Track week-over-week trends
5. When agreement drops → pause auto-shipping and investigate

DRIFT DETECTION
* PLOS One 2026 study: Weekly Bradley-Terry recalibration achieved τ=0.59-0.68 vs humans
* Detected three drift patterns: stable, improving, degrading
* Without weekly replay, you're "shipping and hoping"

GUARDRAILS AGAINST BRITTLENESS
* Randomize position: Run both A-B and B-A orders (Chatbot Arena method)
* Separate concerns: Rubric is workhorse, pairwise is tiebreaker
* Never self-judge: Don't let GPT-4o judge GPT-4o outputs

LAYER 3: HUMAN SAMPLING WITH RED/AMBER/GREEN THRESHOLDS

Strategic 5-10% human sampling focused on risk and borderlines:

SAMPLE COMPOSITION
* 50%: Amber decisions (borderlines judge wasn't sure about)
* 30%: High-risk greens (long outputs, safety-sensitive, new client styles)
* 20%: Random greens (keep judge honest)

DASHBOARD THRESHOLDS
* Green: Judge precision ≥95%, human disagreement <10%, no critical flags
* Amber: One metric slipped → raise cutline by 0.02, bump sampling to 15%
* Red: Critical safety event, 2+ major misses in 50-item sample, or kappa <0.5

CLIENT VALUE PROPOSITION

"Every output gets scored by a calibrated judge against a six-criterion rubric. Top performers ship automatically. Borderlines get human edit. Weekly 5-10% human sample with dashboard that updates every Monday."

THE MONDAY DASHBOARD

Five widgets for 30-minute weekly review:
1. Volume and mix: Items processed, percentage green/amber/red
2. Judge health: Agreement vs golden set with 4-week trend
3. Human QA metrics: Precision, disagreement rate, sample size
4. Risk flags: By type and resolution speed
5. Cost per eval: Track efficiency gains

COST ANALYSIS: VISA RUN REVENUE MATH
* Judge costs: $20/week for 1,000 items
* Human sample: 50-100 items at $15-20/hour
* Total QA cost: ~$350/week
* vs Full human review: $50,000/week
* ROI: If $350 prevents one client churn, pays for itself quarterly

IMPLEMENTATION CHECKLIST

THIS WEEK
1. Build golden set: 40 items from real output (good, borderline, bad)
2. Score manually: Create foundation for everything else
3. Schedule Monday review: 30 minutes on calendar

NEXT WEEK
1. Deploy rubric-scored judge on new outputs
2. Set up weekly golden-set replay
3. Implement human sampling workflow

RESOURCES

The QA Wall Kit includes:
* Rubric template with acceptance thresholds
* Judge prompt pack (rubric + pairwise modes)
* Human sampling SOP with R/A/G dashboard
* Monday review checklist

RESEARCH SOURCES
* ICLR 2026 AutoMetrics: Rubric-style evaluators improve correlation by 33.4%
* PLOS One 2026: Bias-calibrated LLM judges with weekly recalibration
* AAAI 2026 Think-J: Generative judges outperform classifier-style approaches
* UW Health Clinical Study: Cost/latency comparison of human vs LLM evaluation
* TREC AutoJudge 2026: Live benchmark studying judge vulnerabilities and guardrails

----------------------------------------

Next episode: Judge fine-tuning vs off-the-shelf models for domain-specific QA

May 25, 2026 · 12 min
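
The Layer 1 rubric and its green/amber/red thresholds can be sketched as a small triage function. Several details here are assumptions, since the notes do not spell them out: criteria are scored on a 1-5 scale, the total is normalized by the sum of the listed weights (which add up to 90), safety flags are modeled as a single boolean critical flag instead of a negative weight, and a total of 0.8 or more with a top-two criterion below 4 is routed to amber. All names are hypothetical.

```python
# Weights from the rubric in the notes, as integer percentages.
WEIGHTS = {
    "task_fulfillment": 30,
    "factual_accuracy": 25,
    "clarity_structure": 15,
    "style_brand_fit": 10,
    "citations": 10,
}
TOP_TWO = ("task_fulfillment", "factual_accuracy")
MAX_SCORE = 5  # assumed 1-5 scale per criterion


def total_score(scores):
    """Weighted total in [0, 1], normalized by the sum of the weights."""
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return weighted / (MAX_SCORE * sum(WEIGHTS.values()))


def triage(scores, critical_flag=False):
    """Map rubric scores to 'green', 'amber', or 'red'."""
    total = total_score(scores)
    # Red: any critical flag, or a total below 0.7.
    if critical_flag or total < 0.7:
        return "red"
    # Amber: total in the 0.7-0.8 band, or any criterion scored 2 or lower.
    if total < 0.8 or min(scores[name] for name in WEIGHTS) <= 2:
        return "amber"
    # Green also requires the two heaviest criteria to score 4 or higher.
    if any(scores[name] < 4 for name in TOP_TWO):
        return "amber"
    return "green"
```

Under these assumptions an output scoring 4 on every criterion lands exactly on the 0.8 cutline and ships, while one strong overall but scoring 2 on clarity goes to the human edit queue.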
