
The Stateless Founder

Podcast by Santi, Kira

English

Business


About The Stateless Founder

The Stateless Founder teaches digital nomads how to build location-independent businesses powered by AI and automation. Each week, Santi and Kira break down real business models, workflows, costs, and templates so you can grow from anywhere and spend more time traveling and less time grinding.

All episodes

21 episodes


Build a Minimal LLM Evaluation Loop That Catches Regressions While You Sleep

THE PROBLEM: SILENT AI FAILURES

When your website goes down, you get an alert. When Stripe breaks, payments fail immediately. But when your LLM starts producing worse outputs (slightly less accurate summaries, off-tone emails, JSON fields that are almost right), nobody tells you. The model doesn't throw an error. It just gets worse.

For nomad founders managing AI workflows across time zones, this silent failure mode is especially dangerous. You're asleep, on a 12-hour bus in Peru, or doing a visa run in Bangkok while your content repurposing tool ships summaries that drop key facts.

THE SOLUTION: A THREE-PIECE EVALUATION SYSTEM

1. GOLDEN TEST SETS (15-20 CASES PER OUTPUT TYPE)

* Real production data only: Synthetic test cases test synthetic problems
* JSONL format: One line per case, input paired with known-good output
* Tagged for slicing: Formal tone, has PII, Spanish language, etc.
* Three common types: Email rewrites, JSON extraction, content summaries

2. AI JUDGE PROMPTS (G-EVAL PATTERN)

* Rubric-guided scoring: Analysis first, then scores per dimension
* Cross-family judges: Generate with OpenAI, judge with Anthropic (or vice versa)
* Blind randomized order: Prevents position bias
* Four dimensions for email rewrites: Instruction-following, tone fit, clarity, PII leak check

3. PAIRWISE A/B TESTING

* Compare prompt A vs prompt B: Not just absolute scoring
* Randomized presentation: Judge sees outputs in random order
* Tie-breaking: Borderline cases escalate to human review

RELIABILITY MITIGATIONS

JUDGE BIAS PROBLEMS

* Self-preference bias: Judges favor their own model family's outputs
* Position bias: Prefer whatever they see first or whatever is longer
* Verbosity bias: Longer outputs score higher regardless of quality

SOLUTIONS

* Cross-family separation: Never use same provider for generation and judging
* Human sampling: 10-20% of live production jobs reviewed weekly
* Focus sampling: Pull cases where judge was least confident
* 95% agreement target: If judge-human disagreement exceeds 5% for two weeks, recalibrate

THE MONDAY SCORECARD (30 MINUTES WEEKLY)

SIX KEY NUMBERS

1. Pass rate per output type: Email rewrites (90% threshold), summarization (88%)
2. Win rate from pairwise A/Bs: New prompt vs baseline
3. P95 latency: 95th percentile response time
4. Cost per 100 jobs: Token usage × per-token price
5. Judge agreement: Percentage alignment with human sample
6. Incidents: Anything that broke during the week

DECISION FRAMEWORK

* Roll forward: Pass rates stable, costs in line
* Hold and investigate: Something dipped
* Roll back: Model deprecation broke judge or generator

IMPLEMENTATION TOOLS

CI REGRESSION GATE

* Promptfoo: Open source CLI with YAML config
* GitHub Actions: Automated eval runs on every PR
* Pass-rate thresholds: Build fails if quality regresses
* Non-zero exit code: Blocks deployment automatically

COST TRACKING

* OpenAI/Anthropic APIs: Return token usage on every call
* Real example: 4¢ per generation + 1.2¢ per judge call = $5.20 per 100 jobs
* Alert thresholds: Catch cost spikes before monthly review

MODEL DEPRECATION MONITORING

* Pin model versions: Keep last two working versions in environment variables
* Watch deprecation pages: OpenAI and Anthropic maintain lifecycle schedules
* One-line rollback: Pinned configs enable instant reversion

WEEKLY RHYTHM

* Friday: Add 3-5 fresh cases from production traces
* Sunday: Open PR with prompt/model changes, let CI run
* Monday: Fill scorecard, make decision, assign one action item
* Daily: Alerts on latency/cost thresholds catch spikes

MONTHLY MAINTENANCE

* Refresh golden sets: Replace stale cases with fresh production examples
* Close stale failures: Archive resolved issues
* Recalibrate judge: If agreement drops below 95% target

START SMALL: THE ONE-OUTPUT-TYPE VERSION

Don't try to build all three output types at once. Pick your highest-volume type, build 15 golden cases, wire up one judge prompt, run for two weeks. You'll catch things you didn't know were breaking.

The full three-type system is the mature version. One type is the version that fits in an afternoon and still saves you from Monday morning client complaints.
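Two of the scorecard numbers above, pass rate per output type and cost per 100 jobs, can be sketched in a few lines of Python. This is a minimal illustration, not the show's starter kit: the function names, the `email_rewrite` and `summarization` keys, and the shape of the results list are assumptions. Only the 90% and 88% thresholds and the 4¢ + 1.2¢ figures come from the notes.

```python
from collections import defaultdict

# Thresholds quoted in the scorecard; the dictionary keys are illustrative names.
PASS_THRESHOLDS = {"email_rewrite": 0.90, "summarization": 0.88}


def pass_rates(results):
    """Pass rate per output type from (output_type, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for output_type, passed in results:
        totals[output_type] += 1
        passes[output_type] += int(passed)
    return {t: passes[t] / totals[t] for t in totals}


def failing_types(rates, thresholds=PASS_THRESHOLDS):
    """Output types whose pass rate is below its threshold."""
    return sorted(t for t, r in rates.items() if r < thresholds.get(t, 0.0))


def gate_exit_code(rates):
    """Non-zero when any output type regressed, so a CI step can fail on it."""
    return 1 if failing_types(rates) else 0


def cost_per_100_jobs(generation_cents, judge_cents, jobs=100):
    """Dollar cost of `jobs` jobs, each with one generation and one judge call."""
    return round(jobs * (generation_cents + judge_cents) / 100, 2)


if __name__ == "__main__":
    # Hypothetical golden-set run: 20 cases per output type.
    run = (
        [("email_rewrite", True)] * 18 + [("email_rewrite", False)] * 2
        + [("summarization", True)] * 17 + [("summarization", False)] * 3
    )
    rates = pass_rates(run)
    print(rates, failing_types(rates), gate_exit_code(rates))
    print(cost_per_100_jobs(4, 1.2))  # the 4¢ + 1.2¢ example from the notes
```

In this made-up run, email rewrites sit exactly on the 90% threshold and pass, while summarization at 85% falls under 88% and would fail the gate.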
RESOURCES

* Starter Kit: JSONL templates, G-Eval judge prompts, Promptfoo CI config
* Monday Scorecard: Notion template with all six metrics
* Deprecations Checklist: Model lifecycle monitoring guide
* Human Sampling Guide: 10-20% review protocols

----------------------------------------

The vibes-based evaluation method works until it doesn't. When it doesn't, you find out from your customers. This system ensures you know before they do.

May 18, 2026 - 14 min

Build Self-Serve Revenue While You Sleep: Weekend Setup Guide

THE SELF-SERVE REVENUE PROBLEM

43% of SaaS companies now run hybrid pricing models (base fee + usage), but most nomad founders are still losing revenue to:

* Timezone gaps when buyers want to purchase
* Failed credit card payments with no recovery system
* Manual onboarding calls that don't scale across time zones

THREE SELF-SERVE PATTERNS YOU CAN SHIP THIS WEEKEND

PATTERN 1: TEMPLATE + ADD-ONS

Stack:

* Stripe Checkout or Payment Links for one-time purchases
* Stripe Billing for recurring add-ons (monthly updates, premium templates)
* Optional: Tally forms for gated delivery

Costs:

* Stripe: 2.9% + $0.30 per card charge
* Billing: Additional 0.7% per paid invoice

Activation Event: Template duplicated AND first checklist item completed within 24 hours

PATTERN 2: MICRO-SAAS WITH HYBRID PRICING

Stack:

* Stripe Billing with subscriptions
* Usage meters for hybrid pricing
* Customer portal for self-service management
* Usage caps to prevent runaway costs

Example Pricing: $29/month base + $0.15 per AI job after 100 jobs

Activation Event: First successful job completed within 24-48 hours

PATTERN 3: PRODUCTIZED SERVICES

Stack:

* Stripe Payment Links for deposits
* Calendly Free (1 event type, unlimited bookings)
* Tally forms for intake

Activation Event: Self-scheduled kickoff AND deposit paid within 24 hours

DUNNING & RECOVERY AUTOMATION

STRIPE CONFIGURATION

1. Go to Billing → Subscriptions and Emails → Manage Failed Payments
2. Enable Smart Retries (ML-driven retry timing)
3. Turn on all customer emails: failed payment, trial ending, upcoming invoice, expiring card
4. Add custom 7-14 day save sequence: Day 0, Day 3, Day 7
5. Include one-click card update links

PADDLE CONFIGURATION

* Built-in Retain system: 4 emails over 10-12 days
* 30-day total retry window
* Native SMS and in-app prompt support
* Multi-channel recovery without custom development

Recovery Results: Founders report recovering $2,400+/month and reducing involuntary churn from 1.0% to 0.3% monthly.

REPLY ROUTER FOR 15-MINUTE RESPONSE TIMES

System Design:

1. Classify incoming replies by intent (buying, expansion, billing, support)
2. Use lightweight LLM classifier
3. Check sender's local timezone and business hours
4. Page on-call person via Slack/SMS for high-intent messages
5. Auto-acknowledge outside business hours with response time commitment

Research Backing: Responding within 5 minutes makes you 21x more likely to qualify leads vs. 30-minute response times.

KEY METRICS TO TRACK

1. Activation Rate: % of signups hitting aha event within defined window
2. Day-One Retention
3. Trial-to-Paid Conversion
4. Involuntary Churn: Failed payments as % of MRR
5. Recovery Rate: Broken out by decline reason (expired card vs. insufficient funds)

Alert Threshold: If activation rate drops below 30% for two consecutive weeks, stop acquisition and fix onboarding.

WEEKEND IMPLEMENTATION CHALLENGE

This Weekend:

1. Pick your pattern
2. Set up checkout/paywall
3. Enable Smart Retries and email sequence

Next Week: Add reply router

Week After: Layer in SMS for high-value accounts

RESOURCES

* Self-Serve in a Weekend Config Pack: Flowchart, Stripe/Paddle checklists, webhook maps, email templates, and reply-router specifications
* All templates and configurations available on the Resources page

----------------------------------------

"The Lisbon Test for self-serve: Can a buyer in Tokyo try your product, hit a paywall, pay, and get started while you're asleep in Portugal? If yes, you've built something location-independent. If no, you've built a job with a nice view."
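The hybrid pricing example from Pattern 2 and the fee rates listed under Pattern 1 can be worked through as plain arithmetic. The sketch below is illustrative only: the function names are made up, and applying both the card rate and the Billing rate to the same invoice is an assumption. Actual Stripe fees depend on country, card type, and plan.

```python
def hybrid_invoice(jobs_used, base=29.00, included_jobs=100, per_job=0.15):
    """Monthly charge: base fee plus overage for jobs beyond the included 100."""
    overage_jobs = max(0, jobs_used - included_jobs)
    return round(base + overage_jobs * per_job, 2)


def estimated_fees(amount, card_pct=0.029, card_fixed=0.30, billing_pct=0.007):
    """Card fee plus Billing fee on one paid invoice, using the rates in the notes."""
    return round(amount * (card_pct + billing_pct) + card_fixed, 2)


if __name__ == "__main__":
    for jobs in (80, 250):
        charge = hybrid_invoice(jobs)
        fees = estimated_fees(charge)
        print(f"{jobs} jobs: charge ${charge:.2f}, fees ${fees:.2f}, net ${charge - fees:.2f}")
```

A customer who runs 250 jobs pays the $29 base plus 150 overage jobs at $0.15, or $51.50, and a customer under the 100-job allowance pays the base fee only.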

May 15, 2026 - 15 min

Build AI-First SOPs That Survive Model Changes

When models change on provider schedules you don't control, your prompts break. Today we build the fix: an AI-first SOP template that treats prompts as versioned assets.

THE PROBLEM: BRITTLE PROMPTS IN A MOVING TARGET ENVIRONMENT

* OpenAI retired GPT-4o from ChatGPT February 13, 2026 (hard cutoff)
* Traditional SOPs say "use GPT-4o" with no version, expiration, or fallback
* Result: contractors debugging prompts that aren't broken when models disappear

THE AI-FIRST SOP SCHEMA

HEADER (METADATA BLOCK)

* Owner name + backup owner (critical for async teams)
* Status: draft/approved/deprecated
* SOP version number
* Model tag with specific release date
* Temperature band (0-0.2 for compliance, 0.3-0.6 for creative)

STEPS WITH VERSIONED PROMPTS

* Each model call gets unique prompt key + version number
* Semantic versioning: major.minor.patch
  * Major: Output shape changes (text → JSON)
  * Minor: Instructions change, output contract same
  * Patch: Typo fixes, threshold tweaks
* Full label: prompt_key@1.3.2#model_tag+dataset_hash

INPUT/OUTPUT SCHEMAS

* Field name, type, required/optional, description
* JSON Schema for technical teams, simple tables for everyone else
* Contractors don't guess what prompts expect

FAILURE MODES & GUARDRAILS

* OWASP Top 10 for LLM Applications (v2.0, 2025) catalogs common risks
* Document specific failure modes for each workflow
* Attach guardrail policy IDs (AWS Bedrock, NeMo Guardrails)
* Version guardrail policies too

TOOLING OPTIONS

SMALL TEAMS (≤3 PEOPLE): PURE NOTION

* Database with owner, status, SemVer, model tag, last edited time
* Page history provides diffs for rollback
* Button stamps changelog entry when publishing new version
* Setup time: 45 minutes

BIGGER TEAMS: DEDICATED PLATFORMS

* PromptLayer: Registry with release labels, rollback, analytics
  * Speak scaled 1→11 markets training non-technical teams to version prompts
* Humanloop: Version control with .prompt files that sync to Git
  * Note: Platform sunset notice flagged in 2025 docs

PLATFORM RISK MITIGATION

* Keep SemVer convention, model tags, changelog in your SOP
* These survive any platform migration
* Tool can disappear; versioning scheme persists

THE 30-DAY CHANGE REVIEW PROCESS

WHAT TO CHECK MONTHLY

* OpenAI deprecations page
* Azure model retirement tables
* Anthropic deprecation docs
* Vertex AI deprecation page

WHEN SOMETHING'S FLAGGED

1. Pull affected SOPs
2. Rerun evals on replacement model (even just 5 test cases)
3. If outputs hold: update model tag, bump version
4. If outputs don't hold: patch prompt before deadline

REAL EXAMPLE: METICULATE

* Scaled to 1.5M LLM requests using PromptLayer
* Tagged every call by function and model
* When prompts regressed: search failing runs, find working version, rollback
* Versioned workflow enabled hotfixes in hours vs days

THE COST OF NOT HAVING THIS

* 3AM messages from confused contractors
* 2 hours debugging prompts that aren't broken
* Client complaints on LinkedIn in front of 11K followers
* Margins drifting as pricing changes go unnoticed

IMPLEMENTATION

This week: Pick your most critical AI workflow, the one that would hurt most if it broke tomorrow. Build its SOP first. Pin the model version, write the failure modes, set the 30-day review date.

Template: Grab the AI-First SOP template in the show notes. Duplicate it, fill in your model tag and inputs, get versioned prompts with built-in changelog by end of day.
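The full label format from the schema, prompt_key@1.3.2#model_tag+dataset_hash, together with the major/minor/patch rules, can be sketched as a small parser. This is one possible reading of the format rather than the template's own tooling: the regular expression, the `PromptLabel` class, and the example model tag and dataset hash are all hypothetical.

```python
import re
from dataclasses import dataclass, replace

# Assumed grammar: key@major.minor.patch#model_tag+dataset_hash
LABEL_RE = re.compile(
    r"^(?P<key>[^@#+]+)@(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"#(?P<model>[^+]+)\+(?P<dataset>.+)$"
)


@dataclass(frozen=True)
class PromptLabel:
    key: str
    major: int
    minor: int
    patch: int
    model_tag: str
    dataset_hash: str

    def __str__(self):
        version = f"{self.major}.{self.minor}.{self.patch}"
        return f"{self.key}@{version}#{self.model_tag}+{self.dataset_hash}"


def parse_label(text):
    """Split a full label into its parts, or raise ValueError if malformed."""
    m = LABEL_RE.match(text)
    if m is None:
        raise ValueError(f"not a valid prompt label: {text!r}")
    return PromptLabel(m["key"], int(m["major"]), int(m["minor"]),
                       int(m["patch"]), m["model"], m["dataset"])


def bump(label, change):
    """major: output shape changed; minor: instructions changed; patch: small fix."""
    if change == "major":
        return replace(label, major=label.major + 1, minor=0, patch=0)
    if change == "minor":
        return replace(label, minor=label.minor + 1, patch=0)
    if change == "patch":
        return replace(label, patch=label.patch + 1)
    raise ValueError(f"unknown change type: {change!r}")


if __name__ == "__main__":
    # The model tag and dataset hash here are placeholders, not real identifiers.
    label = parse_label("email_rewrite@1.3.2#example-model-2026-01-15+a1b2c3d")
    print(bump(label, "minor"))
```

Swapping the model during a 30-day review would then be a `replace(label, model_tag=...)` plus whichever bump matches how much the prompt had to change.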
----------------------------------------

RESOURCES

* AI-First SOP Template (Notion) [https://statelessfounder.com/resources/ai-sop-template] - Complete template with 3 worked examples
* OpenAI API Deprecations [https://platform.openai.com/docs/deprecations/instructgpt-models]
* OWASP Top 10 for LLM Applications v2.0 [https://owasp.org/www-project-top-10-for-large-language-model-applications/]
* Semantic Versioning Spec [https://semver.org/]

CASE STUDIES MENTIONED

* Speak: Language learning app scaled 1→11 markets using PromptLayer for non-technical prompt editing
* Meticulate: Scaled to 1.5M LLM requests with versioned prompt workflow for rapid rollbacks

May 13, 2026 - 14 min

Stop Interviews: Use a 90-Minute AI-Graded Skills Test

THE PROBLEM

That founder in Bangkok spent 11 hours across 5 calls in 4 time zones to hire one contractor, who ghosted after the trial project. Sound familiar?

Resume screens and portfolio reviews don't tell you if someone can actually handle malformed JSON at 2 AM when you're asleep on the other side of the planet.

THE SOLUTION: AI-GRADED SKILLS TESTS

Replace interviews with a paid, 90-minute async skills test graded by a calibrated LLM judge with human sampling on borderlines.

CORE ARCHITECTURE

Golden Set Calibration

* Build 6-10 test items per role: 4 happy-path scenarios, 2-3 edge cases, 1 failure-handling test
* For automation builders: clean webhook payload, Euro currency with commas, missing email field, duplicate event requiring idempotency logic
* Run 3-5 internal testers through the same test to calibrate rubric weights

Pairwise Judging with Permutation Debiasing

* Never use raw 1-10 scores: LLM judges show systematic position bias
* Show candidate work vs. golden answer side-by-side: "Which better satisfies this rubric?"
* Flip order and run again: if model picks same winner both times, reliable signal
* If it flips, flag for human review

Confidence Bands for Decisioning

* Compute win rate across all items (% of time candidate beat gold standard)
* Calculate 95% Wilson confidence interval around that number
* Pass: lower bound above 60%
* Borderline: win rate 55-65% or interval straddles 60%
* Reject: below 55% with upper bound under 60%

Human Sampling Protocol

* Every borderline case gets human review
* Sample 10-20% of clear passes (stratified by role/region) to check for model drift
* Route any critical criterion failure (e.g., factual accuracy in content) to human regardless of overall score

CONTENT OPS GRADING

Four weighted criteria:

* Factual accuracy: 35% (marked critical, auto-routes to human if flagged)
* Structure: 25%
* Voice adherence: 25%
* Brief compliance: 15%

ANTI-CHEAT WITHOUT SURVEILLANCE

Required Layer:

* Randomized inputs (rotate variants monthly)
* Time-boxed links (portal locks at 90 minutes)
* Honor statement checkbox

Optional Additions:

* Tab-switch logging
* Basic plagiarism detection

Avoid: Screen recording, keystroke logging, webcam monitoring. You're hiring async contractors, not surveilling them.

FAIR PAYMENT STRUCTURE

Regional Pay Bands (90-minute stipend):

Content Ops:

* Southeast Asia: $30
* Western Europe: $60
* US: $68

Automation Builders:

* Southeast Asia: $45
* Western Europe: $83
* US: $98

Based on Upwork median rates and Automattic's $25/hour trial standard.
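The confidence-band rule described under Confidence Bands for Decisioning can be sketched with the standard Wilson score interval. The function names are illustrative, and one interpretation is assumed: anything that is neither a clear pass nor a clear reject is treated as borderline and sent to a human.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate of wins/n (z=1.96 for 95%)."""
    if n <= 0:
        raise ValueError("need at least one judged item")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def decide(wins, n, bar=0.60, floor=0.55):
    """Map a win count to pass / borderline / reject using the bands in the notes."""
    low, high = wilson_interval(wins, n)
    rate = wins / n
    if low > bar:
        return "pass"        # lower bound above 60%
    if rate < floor and high < bar:
        return "reject"      # below 55% with upper bound under 60%
    return "borderline"      # assumed catch-all: goes to human review


if __name__ == "__main__":
    for wins in (2, 6, 10):
        low, high = wilson_interval(wins, 10)
        print(f"{wins}/10 wins: interval ({low:.2f}, {high:.2f}) -> {decide(wins, 10)}")
```

With a golden set of only ten items the interval is wide, so expect the borderline band to catch a fair share of candidates; that is the case the human sampling protocol is there to handle.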
APPEAL PROCESS

* 5-day window for human re-review requests
* Rubric feedback provided either way
* Brand signal: "We take your time seriously enough to build transparent systems"

RESEARCH FOUNDATION

* Stanford SCALE Autorubric: Per-criterion rubric checks with few-shot calibration
* Chatbot Arena methodology: Pairwise comparison with confidence-aware ranking
* Position bias studies: 100k+ evaluation instances show systematic bias in LLM judges
* G-Eval correlation: GPT-4 achieves ~0.51 Spearman with humans on summarization (good but not perfect)

QUALITY FLAGS & TRANSPARENCY

* Log every prompt, model version, score (HELM-style reporting)
* Version everything, changelog everything
* Defend every decision with audit trail
* 10-20% human sampling concentrated on borderlines and critical criteria

THE MATH

Traditional hiring: 11 hours of interviews + bad hire that costs a client

AI-graded test: $400 for 10 candidates + 40 minutes reviewing 2 borderline cases

The math isn't close.

RESOURCES

The Contractor Skills Test Pack includes:

* Golden-set datasets for automation builder and content ops roles
* Pairwise grader prompts with permutation logic
* Rubric weights and confidence-band calculator
* Human sampling SOP and anti-cheat checklist
* Regional pay-band tables
* Candidate-facing one-pager for Notion

NEXT STEPS

1. Grab the Contractor Skills Test Pack
2. Swap in your role and stack
3. Run 3 internal testers to calibrate bands
4. Post your first test by Friday

Ship it before your next visa run.
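The permutation-debiasing step from the pairwise judging section (ask twice with the order flipped, trust only a consistent winner) can be sketched as follows. The `judge` callable is hypothetical and stands in for a real LLM call; the two stub judges exist only to show the control flow.

```python
def debiased_verdict(judge, candidate, golden, rubric):
    """Run the pairwise comparison in both orders and keep only consistent results.

    `judge(first, second, rubric)` is an assumed interface that returns
    "first" or "second" for whichever answer better satisfies the rubric.
    """
    run_a = judge(candidate, golden, rubric)   # candidate shown first
    run_b = judge(golden, candidate, rubric)   # order flipped
    winner_a = "candidate" if run_a == "first" else "golden"
    winner_b = "candidate" if run_b == "second" else "golden"
    if winner_a == winner_b:
        return winner_a
    return "human_review"  # the verdict flipped with the order


def position_biased_judge(first, second, rubric):
    """Stub that always prefers whatever it sees first."""
    return "first"


def keyword_judge(first, second, rubric):
    """Stub that prefers the answer mentioning idempotency, wherever it appears."""
    return "first" if "idempot" in first.lower() else "second"


if __name__ == "__main__":
    candidate = "Dedupe the webhook with an idempotency key."
    golden = "Process every event as it arrives."
    rubric = "Handles duplicate events safely."
    print(debiased_verdict(position_biased_judge, candidate, golden, rubric))
    print(debiased_verdict(keyword_judge, candidate, golden, rubric))
```

The position-biased stub gets routed to human review because its pick changes when the order changes, while the order-independent stub produces a usable verdict.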

May 11, 2026 - 13 min

Build an AI Org Chart That Works While You Sleep

THE OAXACA DISASTER

Kira's 11 PM wake-up call in Oaxaca: contractor in Lagos finished fourteen blog posts, but the Berlin editor was on PTO with no backup assigned. Result? Nine posts reviewed while falling asleep at a tiny Airbnb desk, five shipped unreviewed, and one had the wrong client name in the headline. The 7 AM apology call from a mezcal hangover was the moment she realized her agency didn't have an org chart. It had her.

THE FOUR-ROLE FRAMEWORK

Not four people, four roles. One person can hold multiple roles when you're small:

* Builder: Ships the thing. Writes drafts, builds automations, pushes code
* Operator: Owns quality, schedules, budgets, client communications
* Reviewer: Independent check. Cannot be the Builder on the same task
* Agent/Dispatcher: Routes work, maintains schedules, pages people when things break

THREE SCALABLE PATTERNS

PATTERN 1: SOLO + CONTRACTORS

* Founder: Builder + Operator
* Contractor 1: Secondary Builder
* Contractor 2: Reviewer
* Make automation: Dispatcher with 30-minute human backstop

PATTERN 2: POD MODEL (3-5 PEOPLE)

* Lead writer (Builder)
* Ops person (Operator)
* Rotating editor (Reviewer)
* Published SLAs: 24h priority campaigns, 48h everything else
* Auto-approve on silence if automated checks pass

PATTERN 3: AGENCY CELL + DISPATCHER

* Multiple pods handling different clients/products
* Traffic Manager routes work and maintains coverage
* UTC coverage grid shows overlap windows

SLA MATRIX & ESCALATION

RESPONSE TIME TARGETS

* Revenue-critical leads: 15-minute acknowledgment during sender's business hours
* Code reviews: 4-hour first look during business hours
* Content approvals: 24-48 hours
* Support requests: Same business day

SEVERITY TIERS (ATLASSIAN FRAMEWORK)

* SEV 1: Revenue impact now → immediate paging
* SEV 2: Major degradation/deadline today → 30-60 minute window
* SEV 3: Normal work → business hours

TWO-LAYER ESCALATION

1. On-call Agent (15-minute acknowledgment window)
2. Auto-escalate to Operator if missed

COVERAGE GRID & HANDOFFS

UTC COVERAGE GRID

Spreadsheet with columns: name, role, UTC offset, work start/end, PTO dates. Calculate overlap hours between Builder in Bogotá and Reviewer in Bangkok.

FIVE-FIELD HANDOFF PACKET

Before passing work across time zones:

1. Context: What we're doing and for whom
2. Constraints: Deadlines, budgets, brand rules
3. Last good output: Most recent working version
4. Budget left: Hours or dollars remaining
5. Fallback: What to do if blocked for 12 hours

Receiving person must comment "I own it" and restate next checkpoint in UTC.

THE LISBON TEST FOR HANDOFFS

Could this work keep moving for 24 hours while you're offline? If any of the five fields is blank, you don't have a handoff. You have a hope.

REVIEWER ROTATION

GitLab's "Reviewer Roulette": Random assignment from a pool. For small teams, use a shared doc rotating weekly assignments with visible backup coverage.

BLAMELESS POSTMORTEMS

Google SRE template: What happened, timeline in UTC, root cause, what worked, what failed, three ranked fixes with owners and due dates. Run within 72 hours while details are fresh. Goal: fix the system, never punish.

EU AI ACT COMPLIANCE READY

August 2, 2026 applicability date for most provisions. Named Reviewers, documented approval chains, and evidence logs aren't just good ops. They're compliance readiness for human oversight requirements.

MINIMUM VIABLE PROCESS

Start with:

* One-page RACI per offer (not per task)
* UTC coverage grid in Google Sheets
* Five-field handoff packet
* Two-tier escalation (15-minute window only for revenue-critical leads)
* Pilot on one client for two weeks, then iterate

THIS WEEK'S ACTION

1. Download the AI-Augmented Org Packet (RACI template, SLA matrix, coverage grid, escalation tree, handoff checklist)
2. Duplicate and fill in roles for one client/product
3. Assign backup for every single role
4. Run one red-team handoff: hand real task to backup overnight
5. If it ships without you touching it, your org chart works

----------------------------------------

Resources:

* AI-Augmented Org Packet [https://statelessfounder.com/resources/ai-org-packet] - Complete templates and frameworks
* GitLab Reviewer Roulette [https://about.gitlab.com/blog/reviewer-roulette-one-year-on/] - Rotation system reference
* Atlassian Incident Response [https://www.atlassian.com/incident-management/handbook/incident-response] - Severity framework
* Google SRE Postmortem Culture [https://sre.google/sre-book/postmortem-culture/] - Blameless postmortem template
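The overlap calculation mentioned for the UTC coverage grid (Builder in Bogotá, Reviewer in Bangkok) can be sketched without any date library by converting each local shift to minutes of the UTC day. The function names and the 9-to-5 shifts are assumptions for illustration. The offsets used are UTC-5 for Bogotá and UTC+7 for Bangkok, neither of which observes daylight saving time.

```python
def utc_minutes(utc_offset_hours, start_hour, end_hour):
    """Set of UTC minutes-of-day covered by a local shift (handles midnight wrap)."""
    start = int((start_hour - utc_offset_hours) * 60)
    length = int(((end_hour - start_hour) % 24) * 60)
    return {(start + m) % 1440 for m in range(length)}


def overlap_hours(a, b):
    """Hours per day both people are working; a and b are (offset, start, end)."""
    return len(utc_minutes(*a) & utc_minutes(*b)) / 60


if __name__ == "__main__":
    builder_bogota = (-5, 9, 17)      # 14:00-22:00 UTC
    reviewer_bangkok = (7, 9, 17)     # 02:00-10:00 UTC
    late_reviewer = (7, 14, 22)       # 07:00-15:00 UTC
    print(overlap_hours(builder_bogota, reviewer_bangkok))
    print(overlap_hours(builder_bogota, late_reviewer))
```

Two standard 9-to-5 days in these cities never overlap, which is why the handoff packet has to carry the work. Shifting the Bangkok reviewer to an afternoon-to-evening schedule opens a one-hour window.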

May 8, 2026 - 15 min