The Experimentation Edge

Podcast by GrowthBook

English

Technology & science

About The Experimentation Edge

How do product teams decide what to build and what not to? The Experimentation Edge is the podcast where product, growth, and engineering leaders share how A/B testing, feature flags, and experimentation drive real business outcomes — backed by named companies and real numbers. From DoorDash's 12,000 A/B tests a year to Atlassian's experimentation-led product win to UPS's $500M experimentation team, each episode goes deep with operators running experimentation programs at scale. Hosted by Ashley Stirrup, CMO at GrowthBook and a 25-year executive in data and experimentation. For product managers, engineers, data scientists, and growth leaders at B2B tech companies who care about experimentation culture, statistical rigor, and shipping with confidence. No marketing speak. Just operators explaining what they shipped, what moved the needle, and how experimentation reshaped their teams. Topics: A/B testing, experimentation, growth experimentation, product experimentation, tech experimentation, feature flags, experimentation culture, statistical significance, marketplace experimentation, conversion rate optimization, experimentation at scale.

All episodes

13 episodes

Atlassian's Andrew Willingham on the Talent Product That A/B Testing Turned Around

This episode of The Experimentation Edge explores how A/B testing, feature flags, and user research transformed Atlassian's talent product after it failed with its first users. Andrew Willingham, who spent 11 years at Amazon and is now Head of Legal and People Products at Atlassian, shares how product experimentation works when you can't test at scale, why your customer and your user are not the same person, and how the metrics you choose decide which experiments you can even run.

Summary

Andrew Willingham, Head of Legal and People Products at Atlassian, spent 11 years at Amazon before joining Atlassian a year ago. His path from running A/B tests on millions of Amazon shoppers to building talent management software for a few hundred thousand employees forced a fundamental shift: when you can't run tests at scale, you have to sit with your actual users and watch them fail. He shares how building a talent review product for Amazon's HR specialists completely flopped when handed to HR business partners (HRBPs), and why that failure taught him more than any winning experiment. Now at Atlassian, he's applying that same rigor to reimagining hiring processes with AI, testing everything from recruiter screens to interview sequences that the industry has run the same way for decades.

Timestamps

03:09 From marketing Amazon's mobile app to building HR software for 1.5 million associates
08:19 Why a talent review product loved by IO psych experts flopped with actual HRBPs
11:11 How A/B testing helps product managers escape opinion-based politics
15:25 Testing copy that changes behavior: "We'll generate that status report for you"
17:20 The two North Star metrics Andrew optimizes: efficiency and quality
19:05 Khan Academy's metric trap: measuring cognitive engagement, not just completion
21:10 Why product managers resist experimentation, and what changes when you admit you don't know

Takeaways

- Your customer and your user may not be the same person: building for HR specialists instead of the HRBPs who actually run talent reviews resulted in a feature nobody could use.
- When you can't test at scale, desk rides replace A/B tests: sitting with users and watching them struggle reveals failures faster than any dashboard.
- Experimentation short-circuits political debates by removing opinion from product decisions.
- Test metrics before you test features: usage time could signal engagement or just mean your product takes too long to do its job.
- The experiments that fail deliver the most valuable learnings, especially when you expected a slam dunk.

Connect with the guest

Andrew Willingham on LinkedIn: https://www.linkedin.com/in/andrewwillingham/
Learn more about Atlassian: https://www.atlassian.com/

Sponsor

GrowthBook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts. GrowthBook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse. With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction. See a demo at https://www.growthbook.io/

Topics: A/B testing, product experimentation, feature flags, user research, talent management, qualitative research, metric design, experimentation at scale, growth experimentation.

13 May 2026 - 22 min

How DoorDash's Experimentation Platform Saved Millions With One A/B Test

This episode of The Experimentation Edge unpacks how DoorDash's experimentation platform runs 12,000+ A/B tests per year across 42 million monthly active users, and now powers merchant-led testing on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading the platform, explains how feature flags, marketplace experimentation, and CEO-level experiment reviews built a multi-million-dollar experimentation culture across consumers, dashers, and merchants.

Summary

Most companies struggle to scale experimentation beyond engineering teams. DoorDash runs over 12,000 experiments per year across 42 million monthly active users, and now they're enabling restaurant owners to run their own A/B tests on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading DoorDash's experimentation platform, shares how the company built a three-sided marketplace testing program that balances consumers, dashers, and merchants across 40+ countries. From his time scaling search at Amazon (where offline model evaluation narrowed hundreds of candidates down to 10 for live testing) to preventing DashPass churn at DoorDash, Ilya reveals what happens when experimentation scales beyond product teams, and why CEO-level experiment review emails drive cultural change faster than any training program.

One standout learning: expanding delivery radius to 11+ miles increased grocery orders but tanked retail conversions. The lesson wasn't about distance; it was that a one-metric approach breaks in multi-dimensional marketplaces. DoorDash now segments experimentation by vertical, behavior pattern, and regional market, using AI agents to mine institutional knowledge from past tests and auto-generate experiment summaries that ship company-wide within hours of readout.

Timestamps

00:40 From building Wasabi (Intuit's open-source platform) to running ML at Amazon and Uber
03:04 Why product velocity without experimentation creates feature bloat, not impact
05:32 Scaling search at Amazon: billions of products, 10 visible results, 25% win rate
08:22 Offline evaluation as a filter: golden data sets cut model candidates before live traffic
10:23 DoorDash's three-sided marketplace: 300 million feature flag evaluations per second
12:38 CEO Tony Xu reads every experiment email and replies with alternative hypotheses
13:33 Democratization at scale: enabling merchants to A/B test menu pricing and promotions
17:05 DashPass churn experiment uncovered a value perception gap, which became a full product area
22:03 Why expanding delivery radius killed retail orders but boosted grocery conversions
24:16 No single North Star metric: balancing consumer quality, dasher earnings, merchant mix
27:29 Four-dimensional scale: democratization, global expansion, new verticals, AI agents
31:03 Agentic experimentation: AI mines past tests to generate hypotheses and debug imbalance

Takeaways

- Win rate matters less than learnings per test: DoorDash ships company-wide experiment summaries (win or lose) that the CEO actively reads and responds to, creating cultural accountability around testing rigor.
- Offline evaluation acts as a pre-filter for model velocity: Amazon's search team used golden data sets to cut hundreds of ML candidates down to 10 for live A/B testing, preventing wasted experiment slots.
- One-size metrics break in multi-dimensional marketplaces: DoorDash balances consumer retention, dasher utilization, and merchant inventory mix across verticals because optimizing one side degrades the ecosystem.
- Democratization requires opinionated templates, not open-ended tools: enabling non-technical users to run tests means embedding success metrics and guardrails into pre-built experiment configs.
- AI scales institutional knowledge, not just analysis speed: mining past experiment readouts to auto-generate new hypotheses turns your testing history into a compounding advantage.

Connect with the guest

LinkedIn: https://www.linkedin.com/in/ilyaizrailevsky/
Learn more about DoorDash: https://www.doordash.com/

Topics: A/B testing, experimentation platform, feature flags, marketplace experimentation, machine learning, growth experimentation, statistical significance, experimentation culture, agentic AI workflows.

12 May 2026 - 31 min

How UPS's Experimentation Team Generated Half a Billion From 80+ Apps With A/B Testing

This episode of The Experimentation Edge shows how UPS's A/B testing program drove $500M+ in incremental revenue across 80+ customer-facing applications. Dave Massey, head of the J.E.D.I. team (Journey Experience and Design Innovation), walks through the first test that proved UX could move revenue, how he defended counterintuitive results to skeptical execs, and how a small experimentation team can override opinion with data at enterprise scale.

Summary

Dave Massey walked into UPS in 2016 and immediately got pulled into a meeting about A/B testing tools. By the end of the day, he owned the platform and the problem: UPS hadn't run a single meaningful experiment. Three years later, senior leadership gave him a hard number to hit. Prove UX could move revenue, or the pilot dies. His first test, removing navigation from the checkout flow, delivered $35 million in incremental revenue. Senior leaders didn't believe it. They made him defend the results upside down and sideways. When the dust settled, the data held. Today, Massey's team has driven over half a billion dollars in incremental revenue by treating UPS.com like the e-commerce business it actually is.

Massey's approach is simple: test everything, especially what senior leaders think will work. His team, Journey Experience and Design Innovation (nicknamed J.E.D.I.), has built a reputation for saying no with data, not opinion. When a business unit demanded required recipient emails to capture customer data, J.E.D.I. ran the test in 24 hours and killed it. Conversion tanked. Two years later, the international team asked for the same feature, but framed it as a customs solution. That test passed. Same feature, different reason, different outcome. That's the edge Massey's team delivers: rigorous hypothesis design, a UX research team embedded in the experimentation workflow, and zero tolerance for untested ideas.

Timestamps

03:09 Dave's first day at UPS: inheriting an A/B testing tool with no program
05:59 Senior leadership's ultimatum: prove UX ROI or kill the pilot
08:38 First test result: $35M from removing navigation in checkout
09:48 Defending the numbers: how Massey's team survived scrutiny
11:07 Why a data-driven engineering culture made experimentation inevitable
16:12 Team size: 80 people supporting almost 80 customer-facing applications
19:08 The 24-hour test: when required email fields killed conversion
22:28 Why Massey embeds UX research inside the experimentation team
24:41 AI at UPS: treating it as a tool, not a replacement

Takeaways

- Massey's first test removed navigation from UPS's shipping checkout flow and delivered $35 million in incremental revenue, proving e-commerce best practices apply even when customers think "this is just a tool, not e-commerce."
- J.E.D.I.'s win rate stays high because UX research and experimentation teams operate under the same leader, giving the program both behavioral metrics and voice-of-customer insight before tests ever launch.
- When senior leaders push ideas, Massey's team tests them instead of arguing, then delivers results that either validate the idea or identify three better alternatives the data actually supports.
- The same feature (required recipient email) failed for customer data capture but passed for international customs, proof that framing and customer benefit matter more than the feature itself.
- UPS runs everything centrally now, but the real win is that demand for testing has decentralized: business units across the company now come to J.E.D.I. asking to test their ideas.

Connect with the guest

LinkedIn: https://www.linkedin.com/in/masseycreates/
https://www.linkedin.com/in/davemassey
Learn more about UPS: https://www.ups.com

Topics: A/B testing, experimentation, conversion rate optimization, feature flags, UX research, e-commerce experimentation, statistical significance, experimentation team building, growth experimentation, sequential testing.

11 May 2026 - 23 min

How Experimentation Led to Annual Growth at Fanatics

Summary

Most e-commerce companies test a handful of features each month. Fanatics runs nearly 100 experiments monthly and delivers a big portion of the company's total annual growth through experimentation alone. Medha Umarji, VP of Growth and Experimentation at the multi-billion-dollar sports merchandising retailer, explains how she built a program that scales from 10 tests per month to 100 and maintains enough rigor to spot false positives before they become costly decisions. The difference isn't tooling or headcount. It's culture. When your CEO reads Excel spreadsheets for fun and actively wants data to prove him wrong, you stop debating whether to test and start debating how to test smarter.

Medha shares the frameworks Fanatics uses to balance speed with rigor: a "do no harm" track for brand plays that won't show up in conversion metrics, a small-sample framework for teams that can't hit statistical significance thresholds, and an experimentation Wiki that feeds a continuous iteration flywheel. One surprising test on ad removal initially showed 95% statistical significance, until they replicated it and found the result was a false positive. The lesson: even at scale, you need to double-click on causality.

Timestamps

03:09 How Fanatics scaled from 10 to 100 experiments per month over 10 years
05:25 Why some leadership teams embrace experimentation and others resist it
07:06 How experimentation consistently delivers a big portion of Fanatics' annual growth
08:20 What happens when your CEO consumes Excel spreadsheets and questions everything
10:35 How top-down humility shapes an entire company's testing culture
12:10 The ad removal test that looked like a 95% win, then failed replication
15:55 How Fanatics built an experimentation Wiki that powers their growth engine
22:45 The "do no harm" framework for features that don't measure cleanly in A/B tests
25:20 Why lowering barriers to adoption matters more than statistical perfection early on
26:27 Your odds of winning at experimentation are worse than roulette

Takeaways

* Replication catches false positives: At a 95% confidence level, about 1 in 20 tests of a change with no real effect will still come back significant. If a critical test outcome can't be explained through micro-metrics, run it again before committing resources.
* Top-down buy-in shifts the conversation from "why test?" to "how do we test?": When leadership treats data as the tiebreaker, teams stop defending opinions and start building better experiments.
* Frameworks like "do no harm" and "small sample" expand who can test: Not every initiative needs 30,000 orders to ship value; lower the barrier for teams that can't hit statistical thresholds while protecting core KPIs.
* Documenting experiments in a centralized Wiki creates a growth flywheel: Fanatics' Wiki feeds their roadmap with iterations on already-built features, reducing tech dependency and accelerating velocity.
* Micro-metrics establish causality beyond top-line KPIs: If revenue moves but scroll depth, cart adds, and product views don't follow the same pattern, question the result before declaring a win.
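
The replication point in the takeaways can be illustrated with a small simulation. This is a minimal sketch and not Fanatics' actual method: it assumes a pooled two-proportion z-test, a 5% baseline conversion rate, and 1,000 users per arm, all chosen purely for illustration, and the function names are hypothetical. Both arms receive the identical experience (an A/A test), so every "significant" result is a false positive. The sketch counts how often a single test crosses the 5% threshold and how often a repeat run crosses it again.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided tail probability of a standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(num_tests=1000, users_per_arm=1000, rate=0.05,
                      alpha=0.05, seed=7):
    """Run A/A tests (no true effect) and measure false-positive rates.

    Returns (share significant once, share significant on a rerun too).
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)

    def run_once():
        # Both arms draw from the same conversion rate.
        a = sum(rng.random() < rate for _ in range(users_per_arm))
        b = sum(rng.random() < rate for _ in range(users_per_arm))
        return two_proportion_p_value(a, users_per_arm, b, users_per_arm) < alpha

    single = 0
    replicated = 0
    for _ in range(num_tests):
        if run_once():
            single += 1
            # Replicate only the "winners", as a team would in practice.
            if run_once():
                replicated += 1
    return single / num_tests, replicated / num_tests


if __name__ == "__main__":
    once, twice = simulate_aa_tests()
    print(f"significant once:  {once:.3f}")
    print(f"significant twice: {twice:.3f}")
```

Under these assumptions the first share should land near the 5% threshold, while the share that survives an independent rerun should be far smaller, which is why rerunning a surprising result is a cheap guard against acting on noise. The sketch does not check that the rerun points in the same direction as the first result, which a real replication would also require.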

Connect with the guest

LinkedIn: https://www.linkedin.com/in/medhaumarji/
Learn more about Fanatics: https://www.fanatics.com/

7 May 2026 - 28 min

Inside Chess.com's Plan to Run 1,000 Experiments in a Single Year

Summary

Chess.com ran its first A/B test in 2023. Two years later, the team is on track to run 1,000 experiments in a single year, and they've already shipped 195 in Q1.

In this episode, Ashley Stirrup sits down with Nafis Shaikh, Director of Product Management at Chess.com, to get inside the experimentation engine powering one of the world's most beloved gaming products.

Nafis brings experience from Zynga and Prodigy and a refreshingly honest take on what changes when a product built on passion suddenly has to serve a 10-million-DAU user base that spans absolute beginners to rated FIDE players. He and Ashley get into why one-size-fits-all doesn't actually fit anyone, how to measure an AI coach when you can't tell whether users have their volume on, and a game review experiment that completely upended the team's assumptions about how players want to learn.

Nafis also shares practical advice for product managers trying to introduce experimentation culture to organizations that have never done it before, starting with a simple pre/post test rather than a fancy platform. If you lead product, care about experimentation maturity, or just want to hear how a classic product is scaling its learning loop, this one's worth your time.

Timestamps

* [00:35] – Chess.com's experimentation origin story and the 1,000-test goal
* [05:01] – Designing for a user base that spans beginners to FIDE-rated players
* [07:30] – The four metrics dimensions Nafis uses to evaluate tests
* [12:03] – How do you A/B test an AI coach when you can't tell who's listening?
* [15:49] – Embracing humility and the shift away from "we know what works"
* [20:50] – The game review test that surprised everyone: 80% of users review wins
* [24:06] – Advice for PMs introducing experimentation at a new company
* [29:15] – The onboarding debate and personalization from session zero

Takeaways

* Scale test volume to learning speed, not just shipping speed
* Build hypotheses around user psychology, not just KPI movement
* Accept that being wrong is the point: experimentation only works when leadership embraces humility
* Start simple if you're new to experimentation; a clean pre/post comparison beats a fancy platform you don't use
* Reposition features around how users actually feel, not how you assume they should feel
* Design onboarding around the shortest path to value, not the longest path to personalization

Guest LinkedIn: https://www.linkedin.com/in/nafis-shaikh-20161916/
Company website: https://www.chess.com

21 Apr 2026 - 31 min