Atlassian's Andrew Willingham on the Talent Product That A/B Testing Turned Around

Description

This episode of The Experimentation Edge explores how A/B testing, feature flags, and user research transformed Atlassian's talent product after it failed with its first users. Andrew Willingham (11 years at Amazon, now Head of Legal and People Products at Atlassian) shares how product experimentation works when you can't test at scale, why your customer and your user are not the same person, and how the metrics you choose decide which experiments you can even run.

Summary

Andrew Willingham, Head of Legal and People Products at Atlassian, spent 11 years at Amazon before joining Atlassian a year ago. His path from running A/B tests on millions of Amazon shoppers to building talent management software for a few hundred thousand employees forced a fundamental shift: when you can't run tests at scale, you have to sit with your actual users and watch them fail. He shares how a talent review product built for Amazon's HR specialists completely flopped when handed to HRBPs, and why that failure taught him more than any winning experiment. Now at Atlassian, he's applying that same rigor to reimagining hiring processes with AI, testing everything from recruiter screens to interview sequences that the industry has run the same way for decades.
Timestamps

03:09 From marketing Amazon's mobile app to building HR software for 1.5 million associates
08:19 Why a talent review product loved by IO psych experts flopped with actual HRBPs
11:11 How A/B testing helps product managers escape opinion-based politics
15:25 Testing copy that changes behavior: "We'll generate that status report for you"
17:20 The two North Star metrics Andrew optimizes: efficiency and quality
19:05 Khan Academy's metric trap: measuring cognitive engagement, not just completion
21:10 Why product managers resist experimentation, and what changes when you admit you don't know

Takeaways

- Your customer and your user may not be the same person: building for HR specialists instead of the HRBPs who actually run talent reviews resulted in a feature nobody could use.
- When you can't test at scale, desk rides replace A/B tests: sitting with users and watching them struggle reveals failures faster than any dashboard.
- Experimentation short-circuits political debates by removing opinion from product decisions.
- Test metrics before you test features: usage time could signal engagement or just mean your product takes too long to do its job.
- The experiments that fail deliver the most valuable learnings, especially when you expected a slam dunk.

Connect with the guest

Andrew Willingham on LinkedIn: https://www.linkedin.com/in/andrewwillingham/
Learn more about Atlassian: https://www.atlassian.com/

Sponsor

Growthbook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts. Growthbook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse. With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction. See a demo at https://www.growthbook.io/

Topics: A/B testing, product experimentation, feature flags, user research, talent management, qualitative research, metric design, experimentation at scale, growth experimentation.

How DoorDash's Experimentation Platform Saved Millions With One A/B Test

This episode of The Experimentation Edge unpacks how DoorDash's experimentation platform runs 12,000+ A/B tests per year across 42 million monthly active users, and now powers merchant-led testing on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading the platform, explains how feature flags, marketplace experimentation, and CEO-level experiment reviews built a multi-million-dollar experimentation culture across consumers, dashers, and merchants.

Summary

Most companies struggle to scale experimentation beyond engineering teams. DoorDash runs over 12,000 experiments per year across 42 million monthly active users, and now it is enabling restaurant owners to run their own A/B tests on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading DoorDash's experimentation platform, shares how the company built a three-sided marketplace testing program that balances consumers, dashers, and merchants across 40+ countries. From his time scaling search at Amazon (where offline model evaluation narrowed hundreds of candidates down to 10 for live testing) to preventing DashPass churn at DoorDash, Ilya reveals what happens when experimentation scales beyond product teams, and why CEO-level experiment review emails drive cultural change faster than any training program.

One standout learning: expanding delivery radius to 11+ miles increased grocery orders but tanked retail conversions. The lesson wasn't about distance; it was that a one-metric approach breaks in multi-dimensional marketplaces. DoorDash now segments experimentation by vertical, behavior pattern, and regional market, using AI agents to mine institutional knowledge from past tests and auto-generate experiment summaries that ship company-wide within hours of readout.
Timestamps

00:40 From building Wasabi (Intuit's open-source platform) to running ML at Amazon and Uber
03:04 Why product velocity without experimentation creates feature bloat, not impact
05:32 Scaling search at Amazon: billions of products, 10 visible results, 25% win rate
08:22 Offline evaluation as a filter: golden data sets cut model candidates before live traffic
10:23 DoorDash's three-sided marketplace: 300 million feature flag evaluations per second
12:38 CEO Tony Xu reads every experiment email and replies with alternative hypotheses
13:33 Democratization at scale: enabling merchants to A/B test menu pricing and promotions
17:05 DashPass churn experiment uncovered a value perception gap that became a full product area
22:03 Why expanding delivery radius killed retail orders but boosted grocery conversions
24:16 No single North Star metric: balancing consumer quality, dasher earnings, merchant mix
27:29 Four-dimensional scale: democratization, global expansion, new verticals, AI agents
31:03 Agentic experimentation: AI mines past tests to generate hypotheses and debug imbalance

Takeaways

- Win rate matters less than learnings per test: DoorDash ships company-wide experiment summaries (win or lose) that the CEO actively reads and responds to, creating cultural accountability around testing rigor.
- Offline evaluation acts as a pre-filter for model velocity: Amazon's search team used golden data sets to cut hundreds of ML candidates down to 10 for live A/B testing, preventing wasted experiment slots.
- One-size metrics break in multi-dimensional marketplaces: DoorDash balances consumer retention, dasher utilization, and merchant inventory mix across verticals because optimizing one side degrades the ecosystem.
- Democratization requires opinionated templates, not open-ended tools: enabling non-technical users to run tests means embedding success metrics and guardrails into pre-built experiment configs.
- AI scales institutional knowledge, not just analysis speed: mining past experiment readouts to auto-generate new hypotheses turns your testing history into a compounding advantage.

Connect with the guest

Ilya Izrailevsky on LinkedIn: https://www.linkedin.com/in/ilyaizrailevsky/
Learn more about DoorDash: https://www.doordash.com/

Sponsor

Growthbook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts. Growthbook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse. With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction. See a demo at https://www.growthbook.io/

Topics: A/B testing, experimentation platform, feature flags, marketplace experimentation, machine learning, growth experimentation, statistical significance, experimentation culture, agentic AI workflows.

May 12, 2026 · 31 min
