Latent Space: The AI Engineer Podcast
The podcast by and for AI Engineers! In 2025, over 10 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space
195 episodes
Physical AI that Moves the World - Qasar Younis & Peter Ludwig, Applied Intuition
From building Applied Intuition [https://www.appliedintuition.com/] from YC-era autonomy tooling into a $15B physical AI company, Qasar Younis [https://x.com/qasar] and Peter Ludwig [https://www.linkedin.com/in/peterwludwig/] have spent the last decade living through the full arc of autonomy: from simulation and data infrastructure for robotaxi companies, to operating systems for safety-critical machines, to deploying AI onto cars, trucks, mining equipment, construction vehicles, agriculture, defense systems, and driverless L4 trucks running in Japan today. They join us to explain why "physical AI" is not just LLMs on wheels, why the real bottleneck is no longer model intelligence but deployment onto constrained hardware, and why the future of autonomy may look less like one-off demos and more like Android for every moving machine.

We discuss:

* Applied Intuition's mission: building physical AI for a safer, more prosperous world, powering cars, trucks, construction and mining equipment, agriculture, defense, and other moving machines
* Why physical AI is different from screen-based AI: learned systems can make mistakes in chat or coding, but safety-critical machines like driverless trucks, autonomous vehicles, and robots need much higher reliability
* The evolution from autonomy tooling to a broad physical AI platform: starting with simulation and data infrastructure for robotaxi companies, then expanding into 30+ products across simulation, operating systems, autonomy, and AI models
* Why tooling companies came back into fashion: Qasar on why developer tooling looked unfashionable in 2016, why Applied Intuition still bet on it, and how the AI boom made workflows and tools central again
* The three core buckets of Applied Intuition's technology: simulation and RL infrastructure, true operating systems for vehicles and machines, and fundamental AI models for autonomy and world understanding
* Why vehicles need a real AI operating system: real-time control, sensor streaming, latency, memory management, fail-safes, reliable updates, and why "bricking a car" is much worse than bricking an iPad
* Physical machines as "phones before Android and iOS": Peter explains why today's vehicle and machine software stack is fragmented across many operating systems, and why Applied Intuition wants to consolidate the platform layer
* Coding agents inside Applied Intuition: Cursor, Claude Code, internal adoption leaderboards, and how AI tools are changing engineering workflows even in embedded systems and safety-critical software
* Verification and validation for physical AI: why evals get harder as models improve, how end-to-end autonomy changes simulation requirements, and why neural simulation has to be fast and cheap enough to make RL practical
* From deterministic tests to statistical safety: why autonomy validation is shifting from binary pass/fail requirements toward "how many nines" of reliability and mean time between failures
* Cruise, Waymo, and public trust: Qasar and Peter discuss why autonomy failures are not just technical issues, how companies interact with regulators, and why Waymo is setting a high bar for the industry
* Simulation vs. reality: why no simulator perfectly represents the real world, how sim-to-real validation works, and why real-world testing will never disappear
* World models for physical AI: hydroplaning, construction equipment, visual cues, cause-and-effect learning, and where world models help versus where they are not enough
* Onboard vs. offboard AI: why data-center models can be huge and slow, but onboard vehicle models need millisecond-level latency, low power, small size, and distillation-like efficiency
* Why physical AI is not constrained by model intelligence alone: the hard part is deploying models onto real hardware, under safety, latency, power, cost, and reliability constraints
* Legacy autonomy vs. intelligent autonomy: RTK GPS in mining and agriculture, why hand-coded path-following worked for decades, and why modern systems need perception and dynamic intelligence
* Planning for physical systems: how "plan mode" applies to robotaxis, mining, defense, and multi-step physical tasks where actions change the state of the world
* Why robotics demos are not production: the brittle last 1%, humanoid reliability, DARPA Grand Challenge-style prize policy, and the advanced engineering gap between research and deployment
* Applied Intuition's hard-earned lessons: after nearly a decade, Peter says they can look at a robotics demo and predict the next 20 problems the company will hit
* Qasar's advice to founders: constrain the commercial problem, avoid copying mature-company strategies too early, and remember that compounding technology only matters if you survive long enough to see it compound
* Why 2014 YC advice may not apply in 2026: capital markets, AI company dynamics, and the difference between building in stealth with a deep network versus building as a new founder today
* What Applied is hiring for: operating systems, autonomy, dev tooling, model performance, evals, safety-critical systems, hardware/software boundaries, and engineers with deep curiosity about how things work

Applied Intuition:
* YouTube: https://www.youtube.com/@AppliedIntuitionInc
* X: https://x.com/AppliedInt
* LinkedIn: https://www.linkedin.com/company/applied-intuition-inc

Qasar Younis:
* X: https://x.com/qasar
* LinkedIn: https://www.linkedin.com/in/qasar/

Peter Ludwig:
* LinkedIn: https://www.linkedin.com/in/peterwludwig/

Timestamps

00:00:00 Introduction: Applied Intuition, Physical AI, and 10 Years of Building
00:01:37 Physical AI vs. Screen AI: Why Safety-Critical Changes Everything
00:02:51 The Origin Story: Tooling, YC, and the Scale AI Comparison
00:05:41 The Three Buckets: Simulation, Operating Systems, and Autonomy Models
00:11:10 Hardware, Sensors, and the LiDAR Question
00:14:26 The Operating System Layer: Why Vehicles Are Like Pre-Android Phones
00:19:13 Customers, Licensing, and the Better-Together Stack
00:21:19 AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer
00:26:41 Verifiable Rewards, Evals, and Neural Simulation
00:31:04 Statistical Validation, Regulators, and the Cruise Lesson
00:40:25 World Models, Hydroplaning, and Cause-Effect Learning
00:43:34 Onboard vs. Offboard: Latency, Embedded ML, and Distillation
00:50:57 Plan Mode for Physical Systems and Next-Token Prediction Universally
00:53:04 Productionization: The 20 Problems Every Robotics Demo Will Hit
00:58:00 Founder Advice: Constraints, Compounding Tech, and Mature-Company Mimicry
01:05:41 Hiring Philosophy: Hardware/Software Boundary and Engineering Mindset
01:08:50 General Motors Institute, Education, and the Curiosity Mindset

Transcript

Introduction: Applied Intuition, Physical AI, and 10 Years of Building

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space.
Swyx [00:00:10]: And today we're very honored to have the founders of Applied Intuition, Qasar and Peter. Welcome.
Qasar [00:00:17]: You guys really know how to turn it on to podcast mode. You guys are real pros at this.
Qasar [00:00:23]: They were just joking around right before this, and then they flipped it pretty quick.
Alessio [00:00:29]: Oh yeah, it's good to have you guys. Maybe you just wanna introduce yourselves so people know the voice on the mic and they'll know what they're hearing.
Peter [00:00:33]: Oh, sure. Yeah, I'm Peter Ludwig. I'm the co-founder and CTO of Applied Intuition.
Qasar [00:00:38]: And my name is Qasar Younis. I am the CEO and co-founder with Peter.
Alessio [00:00:42]: Nice. Can you guys give the high-level overview of what Applied Intuition is? I was reading through some of the Congress files, when you went out there, Peter: eighteen of the top twenty global non-Chinese automakers, and you have customers in agriculture, defense, construction. I think most people have heard of Applied Intuition tied to YC when it was first started, and then you were kinda in stealth for a long time, so maybe just give people the high-level overview of what it is today, and then we'll dive into the different pieces.
Peter [00:01:10]: Yeah. So at Applied Intuition, our mission is to build physical AI for a safer, more prosperous world. And so we work on physical AI for all different types of moving systems, everything from cars to trucks to construction and mining equipment, to defense technologies. And we're a true technology company, so we build and sell the technology, and we sell it to the companies that make the machines. We sell it to the government, really anyone that wants to buy technology to make machines smart.

Physical AI vs. Screen AI: Why Safety-Critical Changes Everything

Qasar [00:01:38]: Yeah. And I think in the broader AI landscape, a lot of the focus, rightfully so, in the last three years has been on large language models, and so everything fits in a screen, whether it's code-complete products or things like that. And what's different about us is we're deploying intelligence onto a lot of things that don't have screens. They're physical machines. There are sometimes screens within the cabin of, for example, a car or a truck or something like that, but most of the value we provide is putting intelligence into safety-critical environments.
Qasar [00:02:28]: Those two words are really important, because learned systems can make mistakes if you're asking for something like, "Tell me about these podcast hosts that I'm about to go meet." But you can't do that when, as an example, we run driverless trucks in Japan right now, as we speak. We can't have errors. Those are L4 trucks. Yeah.
Alessio [00:02:40]: Yeah. Was that always the mission? I remember initially, I think people put you and Scale AI very similarly for some things, being kinda on the data infrastructure side of things. What was the evolution of the company?

The Origin Story: Tooling, YC, and the Scale AI Comparison

Peter [00:02:51]: Well, from the very beginning, we always wanted to really be a technology company that helped generally push forward the industrial sector. And so we started off working in autonomy. Our very first customers were robotaxi companies. And we started off doing a lot of work in simulation and data infrastructure. And then over the years, we've expanded our portfolio. Now we have over thirty products, and it's a pretty broad technology play within the landscape of physical AI.
Qasar [00:03:19]: Yeah, I think the Scale comparison is because we're all YC-universe companies. But it was a very different company. Scale was, is, more of a services company, a data labeling company fundamentally. We started as, and still are, doing a lot of tooling. So developer tooling is now in vogue again, thanks to the AI boom. But honestly, ten years ago, it was out of vogue. Doing a tooling company in 2016, 2017 was not, like, the thing to do, because, I don't know if you remember, the VCs' general view was that tooling companies are just workflows, and workflows ultimately are not really interesting. And we've gone and come full circle with that. But when we started the company, that was kinda in the periphery of what the company wanted to be.
It was like, from our earliest days, we wanna deploy software on physical machines, on cars and on trucks and things like that. And obviously, we didn't know that the transformer boom was gonna happen. We didn't know that autonomy systems would become end-to-end. Those things we didn't know. And why that's important: when autonomy systems become end-to-end, those models can now be generalized to multiple form factors. And so back nine, ten years ago, tooling was a great way, and still is a great way, to build technology and sell technology to our end customers, a lot of whom wanna build this stuff themselves. And so we just offer a spectrum of solutions, from using just one part of a development suite of tools all the way to buying the full thing. The way to think about the company, or at least the way we think about the company, is, as Peter said, a technology provider. It's kinda like what NVIDIA does, or what an AMD does, but we just don't do chips.
Qasar [00:05:06]: We don't do silicon. But we're a technology provider fundamentally. And we even used to joke when we started the company that we're not the guys to build, like, Instagram. That's just not us in a most fundamental way.
Alessio [00:05:20]: You have thoughts.
Qasar [00:05:21]: Yes.
Qasar [00:05:22]: Well, I mean, we worked on Maps and stuff, Google Maps. Consumer products are extremely difficult for a lot of different reasons. It just, I think, doesn't scratch the itch. I think we're Michigan guys who are kind of more of that traditional engineering realm, or lineage.

The Three Buckets: Simulation, Operating Systems, and Autonomy Models

Peter [00:05:41]: I gotta say, though, what was clear ten years ago was that there was so much more that was possible with software and AI in vehicles,
Peter [00:05:47]: and that was generally the space that we started in ten years ago.
Peter [00:05:51]: And on the precise path that we've taken over the years, I think we've been strategic, and we've adjusted to make sure that we're actually building stuff that's valuable to the market. And the technology has changed so much. Our own technology stack has completely changed, I would say, roughly every two years. And so now we've probably done, let's say, four complete evolutions of our own technology stack. And I sort of see that cadence roughly keeping up.
Peter [00:06:13]: And so the way we even think about engineering is almost on this two-year horizon. We're preparing ourselves that, hey, we wanna invest the appropriate amount, but then also be very dynamic as the research gets published and as our research team figures out new advancements, and adapt to that.
Qasar [00:06:27]: Yeah. One thing that has been consistent is the type of people we've recruited. It's engineers who fall into the sometimes very traditional, like, Google
Qasar [00:06:38]: -gen suite, but way different from other companies. We are hiring folks who really know the intersection of hardware and software, who know really low-level systems. Obviously, traditional ML researchers and folks who've actually put ML systems into production. That's been pretty consistent. And you look at the mix of our engineering: eighty-three percent of the company is engineering, so it's, like, a giant list.
Qasar [00:07:05]: A lot of engineers.
Alessio [00:07:06]: Which, by the way, a thousand engineers
Qasar [00:07:07]: Yeah. A thousand engineers.
Alessio [00:07:08]: that's on your website, so I imagine it's up to date.
Qasar [00:07:11]: It is up to date, yes.
Alessio [00:07:12]: Okay. And then forty-plus founders.
Qasar [00:07:15]: Yeah. This was more luck than strategy, but we've recruited a lot of ex-founders. It's been a great place for founders, YC and non, 'cause obviously I know a lot of the YC folks. It's kind of like how we recruit a lot of Google people.
Qasar [00:07:33]: It lets them exercise both their technical and non-technical skills, because we're on the applied side. We have a research team that does fundamental research, we publish, and we've had great traction there. But fundamentally, the business wants to take this intelligence and deploy it into production, and there's a certain type of person that's more interested in that.
Alessio [00:07:54]: Yeah. You mentioned the tech stack, Peter, so I just wanted to give you some rein to go into it. I'm interested in where Applied Intuition starts and ends, in some sense: what won't you do? What do you do that's common among all the verticals that you cover?
Peter [00:08:10]: There's a few buckets of work that we do, and we've been at this for almost ten years now, so the technology's pretty broad. But we got started
Qasar [00:08:17]: Yeah, with a thousand engineers, like, you could work on lots of things.
Peter [00:08:19]: There's lots of stuff, yeah, especially with AI tools to help.
Peter [00:08:22]: So we got our start in simulation and simulation tooling and infrastructure. Generally, if you're trying to build a very complex software system that involves moving machines, you need to test it, and the best way to test it is a combination of virtual development, in simulation, and then also obviously real-world testing.
Peter [00:08:39]: And then there's a very careful process of correlation between the simulation results and the real-world results, ensuring that the simulator is in fact accurate to that.
Simulation's a very deep topic.
Peter [00:08:49]: We have a whole suite of products in that, and we could talk for many hours about that specifically. But that is one part of what we do as a company. Reinforcement learning, as a subpart of that, is also super critical. I think a lot of the best advancements happening in a lot of these AI systems right now in some way relate to reinforcement learning, and now we have lots of compute, and you can do tons of interesting things with reinforcement learning. The second bucket of work that we do is on operating systems technology: true operating systems. Think about schedulers and memory management and middleware and message passing and highly reliable networking and data links. The reality is, if you want to deploy AI onto vehicles, you need a really good operating system. And when we were getting deeper into that space, there wasn't really anything that we were happy with.
Peter [00:09:39]: Things existed, absolutely, and we were using what was available in the market, and as an engineering organization, we roughly realized these things aren't great. We think we can do this better, so let's build something. And that was the moment of inspiration that started our operating systems business, which is now a very real business for us. In order to write and run great AI, you need a great operating system, and so that's what got us into that. And then the third bucket that we work on is true fundamental AI technology: models. We do a lot of work in, as mentioned, the foundational research, but then also the world models and the actual autonomy models that are running on these physical machines. And that's across cars, trucks, mining, construction, agriculture, and defense, and so that's land, air, and sea.
Qasar [00:10:31]: And also, a smaller subsector of that third bucket is the interaction of humans with those machines.
Qasar [00:10:38]: So that's a multimodal experience. Historically, if you're moving a dirt mover or any of these machines, there are buttons you press, whether they're actual physical tactile buttons or something like a touch screen. That fundamentally is changing to where you're just talking to the machine, and you're teaming with the machine.
Alessio [00:10:58]: Voice?
Qasar [00:10:59]: Yeah, voice, absolutely.
Alessio [00:11:00]: Oh.
Qasar [00:11:00]: And also the machine just being aware of who is in the cabin and what their state is. From a safety-systems perspective, the most simple version of this is, like, the driver is tired, right? You get those alerts when you're driving your car that say

Hardware, Sensors, and the LiDAR Question

Qasar [00:11:15]: maybe take a coffee break. Take that, times a couple of orders of magnitude up. But this concept of teaming man and machine is important. When you think about running agents, or just running different instances of Claude doing work for you in the background, you can take that analogy, almost copy and paste it, and put it into, like, a farm, where you have a farmer who's running a number of machines. Where they interact with the machine is where there's maybe a critical decision or a disengagement or something like that, but generally speaking, the agent on the physical machine is running and making decisions on behalf of the farmer until there's something maybe critical. And that's also what we work on. So that's not pure autonomy. It's a little bit of a mix, but it falls under autonomy. In the automotive sense, that's typically defined in SAE levels as an L2++ system
Qasar [00:12:05]: with a human in the loop. But just take that idea to other verticals.
Alessio [00:12:09]: Yeah. You've not mentioned hardware at all, like sensors, and obviously you mentioned you don't do chips.
I think even in AV there's, like, a big cameras-versus-lidars debate. What are, in your space, some of those design decisions that you made? Are they driven by the OEM's ability to put things on the machinery? And how much influence do you guys have on co-designing those?
Peter [00:12:32]: Yeah. So we don't make sensors. We're not a manufacturer. Obviously, we use a lot of sensors in our autonomy products. In terms of what actually goes on the vehicles, we have a preferred set of sensors that we, let's say, fully support, and then our customers can sort of choose from those. And obviously, if there's a very strong opinion on supporting something else, we'll add that to the platform as well. And the lidar question is at this point sort of the age-old
Peter [00:12:59]: topic in autonomy, and the state of the industry right now is that lidar is hands down a useful sensor, specifically for data collection and the R&D phase of autonomy development. If you see, for example, a Tesla R&D vehicle, it actually has lidar on it
Peter [00:13:17]: to this day, right? In the Bay Area we see these. You'll see, like, Model Ys or Cybercabs that have lidars on them just driving around. It's useful because it gives you per-pixel depth information. So if you pair a lidar with a camera, you can say that, well, this camera's looking in this direction, this lidar's looking in this direction, and now for each pixel of the camera I can see how far away that pixel is. You can then use that as part of your model training, and that depth information becomes a learned state of the camera data. And then for the production system, you can remove the lidar
Peter [00:13:52]: and now you can actually get depth with just the camera. And so that difference between, like, a highly sensored R&D vehicle and then the down-costed production vehicle, we use that across our whole portfolio of products.
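The lidar-camera pairing Peter describes, per-pixel depth supervision from projected lidar returns, can be sketched with standard pinhole projection. This is a minimal, generic illustration (not Applied Intuition's actual pipeline); the transform, intrinsics, and point data are hypothetical:

```python
import numpy as np

def lidar_depth_per_pixel(points_lidar, T_cam_from_lidar, K, image_shape):
    """Project lidar points into a camera image to get sparse per-pixel depth.

    points_lidar:      (N, 3) xyz points in the lidar frame (hypothetical data)
    T_cam_from_lidar:  (4, 4) rigid transform from the lidar to the camera frame
    K:                 (3, 3) camera intrinsic matrix
    image_shape:       (height, width) of the camera image
    """
    h, w = image_shape
    # Move the points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Pinhole projection: pixel = K @ (x/z, y/z, 1).
    uv = (K @ (pts_cam / pts_cam[:, 2:3]).T).T
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    # Discard points that land outside the image.
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[ok], v[ok], pts_cam[ok, 2]

    # Sparse depth map: where several returns hit one pixel, the nearest wins.
    depth = np.full((h, w), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels with no lidar return
    return depth
```

Training a depth head against sparse maps like this (masking out the zero pixels) is the usual way the "learned depth" Peter mentions gets distilled into a camera-only production model.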
And of course the end goal is you want super low cost and super reliable.
Peter [00:14:08]: And then in certain use cases you have some more bespoke things. In defense, as an example, you do things at night oftentimes, and so you care about sensors like infrared. And you don't wanna be putting energy out, so you don't wanna use lidar or radar,
Peter [00:14:23]: but you still need to be able to see at nighttime. So yeah, we work the whole gamut.

The Operating System Layer: Why Vehicles Are Like Pre-Android Phones

Alessio [00:14:27]: Cool. So that's kinda the hardware level. Then on the OS level, what does that look like? What is unique? I drive a Tesla, and whenever I drive some other car that has a screen, it always sucks.
Alessio [00:14:38]: It's on, like, a cheap Android tablet. It's laggy and all of that. What does the OS of the autonomy future look like?
Peter [00:14:46]: For most people, it's really what you just described. When you think about the operating system in a vehicle, you're thinking about the HMI, right? The human-machine interface. And absolutely, that's an important part of it, but it's actually only one thin layer on top. So when we talk about operating systems for AI in vehicles, there's many layers that go deep into the CPU-critical realm and embedded systems, and you're talking about the real-time control of,
Peter [00:15:13]: let's say, the electric motors or the engine and the actuators, and you have different redundancies for, let's say, the steering actuation in the vehicle. And all of these things need very core support in the operating system. And then of course, for autonomy, you have real-time sensor data that's streaming in, and the latencies there are really important, right? Imagine you try to run Microsoft Windows
Like, the latencies are gonna be absurd. Like, you can never do that. And so whatâs special about what we do is we really have this system level thinking, right? So weâre looking at, we care about every performance characteristics of the entire system, and then we also, because weâre doing a lot of the software or all of that software, we can fine-tune and control all of those things. So we can very carefully tune in the latencies for every aspect of the system. We can carefully tune in the memory management. We can have the right, fail-safes and fallbacks, for different things. âCause you have to account for what if, what if there is a critical failure? What if thereâs a cosmic ray that flips Peter [00:16:14]: a bit in the middle of the processor that causes some, malfunction? And you have to have a fail-safe to all of that, and so the core operating system is a part of that. And then the one last thing, which is a lot less exciting but is, actually a very big topic, is reliability of updates. Peter [00:16:30]: so the I have a Tesla and you get updates fairly frequently, right? Peter [00:16:36]: Once a month. Most companies that are making vehicles Peter [00:16:40]: are basically never doing updates, and theyâre And even if they are doing updates, theyâre usually only updating maybe one module. Maybe theyâre updating the HMI module. But theyâre not able to update, letâs say, the CPU critical parts of the system. Peter [00:16:51]: You have to go into the dealer for that. And so with our operating system now we can actually enable highly reliable updates of any system in the vehicle, and thatâs way easier said than done. Like, thereâs lots of technical, technically deep stuff, in the tech stack to do that in a way that youâre not going to accidentally brick a vehicle. Peter [00:17:08]: And right? If, imagine your Alessio [00:17:10]: That would be bad. Alessio [00:17:11]: Bad. 
Peter [00:17:11]: Bricking a car is a very expensive Peter [00:17:13]: and honestly, like across the industry maybe one of the most just pure impactful things that weâve done is weâve just, weâre, weâre now enabling the industry to actually do software updates. Alessio [00:17:22]: Just to clarify as well, who is the customer for this? Like, I assume a lot of hardware manufacturers have their own firmware, and Iâm sure some of them would just have you write it for them because youâre experts. And others would have their own. Like, who pays for this? Who invites you into the house? Is it, is it the end user, or is it, is it the manufacturer? Peter [00:17:41]: Yeah. So let me make an analogy firstly on the on the fragmentation of software. So physical machines today are more akin to the state of the phone market before Android and iOS existed, right? So I worked on Android at Google by the way many years ago, and part of the reason that Larry at Google decided to get into Android was they wanted to run Google products on a bunch of phones, and they bought all of these phones from the industry, and it turned out they had like 50 different operating systems on these phones. And it was virtually impossible Peter [00:18:17]: for Google to make their app run on all 50 devices equally well. And so the solution was, well, actually what if, what if they created-A really great operating system and made it attractive to all of these phone makers, and that was sort of the genesis for what Android was and why Android existed. It was a way for Google to get their products onto really wide diversity of devices. The state of the physical, industry right now, itâs a little bit like that. Like, thereâs yes, these companies have firmware, but they have so many different operating systems, itâs so fragmented, and to actually get a modern AI application to run on these vehicles, you actually, you first have to consolidate the operating system, and so thatâs, thatâs why weâve done that. 
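The "never brick a vehicle" requirement Peter describes is the same problem Android solves with A/B (seamless) updates: write the new image to the inactive slot, verify it, and keep the old slot as a known-good fallback. Here is a toy sketch of that pattern; the slot layout, counter, and threshold are illustrative assumptions, not Applied Intuition's actual OTA mechanism:

```python
import hashlib

class ABUpdater:
    """Toy A/B-slot updater: write to the inactive slot, verify, then switch.

    The slot names, boot-attempt counter, and rollback threshold are all
    hypothetical simplifications of how A/B update schemes generally work.
    """
    MAX_BOOT_ATTEMPTS = 3  # failed boots tolerated before rolling back

    def __init__(self):
        self.slots = {"A": b"v1-image", "B": b""}  # A holds the known-good image
        self.active = "A"
        self.boot_attempts = 0

    def install(self, image: bytes, expected_sha256: str) -> bool:
        inactive = "B" if self.active == "A" else "A"
        self.slots[inactive] = image  # the running slot is never touched
        # Verify integrity before ever marking the new slot bootable.
        if hashlib.sha256(self.slots[inactive]).hexdigest() != expected_sha256:
            self.slots[inactive] = b""  # discard the corrupt download
            return False
        self.active = inactive  # atomic flip of the boot flag
        self.boot_attempts = 0
        return True

    def boot(self, healthy: bool) -> str:
        """Simulate one boot; roll back if the new image keeps failing."""
        if healthy:
            self.boot_attempts = 0
            return self.active
        self.boot_attempts += 1
        if self.boot_attempts >= self.MAX_BOOT_ATTEMPTS:
            # Fall back to the other (known-good) slot instead of bricking.
            self.active = "B" if self.active == "A" else "A"
            self.boot_attempts = 0
        return self.active
```

Real automotive stacks layer signature verification, power-fail-safe flash writes, and per-ECU orchestration on top, but the core invariant is the same: the running slot is never modified, so a failed update can always fall back.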
And then, your specific question was who are our customers? Itâs, itâs, generally itâs the companies that are making these machines. Peter [00:19:06]: And weâre, weâre, weâre selling our technology to them to really simplify the architecture and then enable these AI applications to run on them. Customers, Licensing, and the Better-Together Stack Swyx [00:19:13]: How much is reusable across? Like, do you have, like, one OS that is just configured for everything, or is there some more customization that is needed? Peter [00:19:22]: Yeah, highly reusable. So the fundamental technology is quite universal, right? So things that we do have to think about though are, like, chipset support. And so if youâre, if youâre coding, letâs say, an LLM and you have start with an assumption that, âHey, oh, Iâm gonna, Iâm gonna use CUDA, and Iâm gonna run this, on an NVIDIA chip,â then you donât really have to think about the hardware in that sense. Like, youâre just, âOkay, Iâm just Iâm in the CUDA/NVIDIA ecosystem, and Iâm, Iâm going to use that.â But the hardware, especially in safety critical systems, itâs a lot more diverse. Thereâs not one or one or two players. Thereâs a bunch of different chipsets that we have to support. And so our operating system doesnât just run on, like, the equivalent of X86. It has to, it has to run on a number of different architectures from chips from a bunch of different companies. But again, weâve been working on this for a long time now, so we have, we have support for all of those chipsets. And then when you want to then run the AI applications, we can then do that reliably across now a variety of providers. Qasar [00:20:19]: And I think that is, like, heavily inspired by Android, right? Android has a huge suite of testing and itâs a reliable operating system that runs on thousands of devices. And we think we can, we can do the same in all these physical moving machines, with the difference that weâre really in a safety critical realm. 
Android isn't. Alessio [00:20:40]: So on Android, I don't need to use Gmail; I can use Superhuman. What about your machinery? Can people bring somebody else's automation to it, or is it kind of all-in-one? Qasar [00:20:50]: You have to use us. No, no. It's totally open. Peter [00:20:56]: Yeah, our philosophy is that we are a technology company, and so we license our technology to customers to use how they want. If a customer wants to license our autonomy tech and our operating system, then great, we'll license those. If they just want to license the operating system and then use different autonomy tech, that's fine also, and we have great documentation. Swyx [00:21:17]: Or if they want to use the developer tooling. Peter [00:21:18]: Yeah, exactly. AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer Swyx [00:21:19]: It's, like, better together, obviously, if they work together. Is it all C++, I assume, with different compile targets? Peter [00:21:27]: We use a lot of C++. Peter [00:21:28]: Rust is the new hot kid on the block Peter [00:21:32]: for a bunch of things as well. But yeah, the lower level you get, especially when you get to real-time constraints, you hit C++ at some point, and at some point maybe you work your way into assembly when needed. Swyx [00:21:44]: Oh, damn. Alessio [00:21:46]: I'm curious about the coding agent adoption, since you're mentioning more esoteric languages. What's the adoption internally? What have you learned? Peter [00:21:55]: Yeah, we use everything. Cursor was, I think, the hottest tool in the company for a good while. Now Claude Code, I think, has taken the reins. We have an internal leaderboard that we use just to encourage adoption Peter [00:22:09]: within the company. And yeah, they're phenomenally useful.
Honestly, we take inspiration from some of those tools in how we're adapting that mindset to the physical realm. If it's so easy to build an app for something that lives just on a screen, we're now taking a lot of the same ideas and applying them: okay, if you wanted a physical machine to do something, how easy can we make that, using our own tooling and platform as well? Alessio [00:22:40]: Are you changing any of the OS architecture, like the way you expose services, to be more AI-friendly? Peter [00:22:48]: Yeah, absolutely. In the early days of our tools and infrastructure work, you had engineers that were experts in certain topics, but the things you're dealing with are oftentimes more mathematical or more abstract, where GUI tools are very useful for certain things. As an example, we have a product we call Sensor Studio, which helps you design the sensor suite for your autonomous vehicle; again, it could be a car, a drone, mining equipment, or a robot. You place sensors in different places, there's a library, and you can understand the trade-offs you're making in the design of that system. And that was a very GUI-intensive thing, because it's a little more like a CAD tool in that sense, Swyx [00:23:37]: Yep Peter [00:23:37]: if you've seen CAD tools. Nowadays, though, we expose all of the underlying APIs for that, and using AI agents, you can actually configure a sensor suite with just text and likely reach a better result than you could have through the GUI in the past. And we're taking that thinking through the whole product portfolio.
Swyx [00:23:57]: Another thing I was thinking about: in terms of AI adoption, does it change your hiring at all, or how do you manage engineers differently? Peter [00:24:08]: Yeah, absolutely, it does. We, like every company in the Valley right now, are evolving our hiring practices Peter [00:24:16]: because the skills required to be effective are changing so fast, right? You used to really select for just rote implementation ability, and now it's more the AI engineer skill set. It's still about how to implement, but just banging out code is no longer the core job. It's actually knowing what questions to ask, knowing how to tie together these different AI tools. And so the interviews we give now, I think, are way harder than they've ever been. Peter [00:24:46]: But we also allow selective use of AI tools to solve the problems. And in that, you start to see more of a bimodal distribution of engineers. There's this subset of people that really get it. They're all in, and they've clearly invested the hours needed to learn these tools and how to be effective. Peter [00:25:09]: And then there's the group of people that haven't done that, and the productivity gap is just enormous. And so we're obviously trying to select for the people that are really into this. Swyx [00:25:20]: I first wrote my AI Engineer piece three years ago, and when I first wrote it, I said, "Actually, not everyone should be an AI engineer," because there's an extremist stance where every software engineer is an AI engineer. And my actual example of people who should not be adopting AI was embedded systems, operating systems, and database people. Are they adopting AI?
Peter [00:25:41]: I think it's the classic bitter lesson topic. Six months ago I would've said the same thing, but it's becoming super useful for every domain. Qasar [00:25:53]: I'm sure. Peter [00:25:54]: Right? Like, Peter [00:25:56]: I think six months ago, or maybe a year ago, if you tried to use, let's say, the latest Claude model for writing shaders, GPU shaders, the results were probably underwhelming. And if you use the latest model now for that kind of task, you're a little bit blown away, like, "Wow, that actually worked. That's amazing." And we see the same thing in the embedded realm. No question, though, especially when you get into safety-critical systems, human validation is Peter [00:26:25]: 100% key. You're not going to trust your life to AI-written software that hasn't been very carefully checked by humans. And so I think the challenge now is really about the appropriate level of human validation for these safety-critical systems. Verifiable Rewards, Evals, and Neural Simulation Alessio [00:26:41]: Touching on the simulation side, verifiable rewards and reinforcement learning are, like, the hottest thing. What have you done internally to build around that? And what makes you sleep at night if somebody is vibe coding something or Alessio [00:26:57]: wants to try something new? Do you have a good enough system? Because I think the opposite is also true: if it's super easy to write anything, Alessio [00:27:04]: it puts a lot of work on the verifiable Alessio [00:27:07]: side of it. What does that look like for people? Peter [00:27:10]: Yeah. So verifiability falls under a broader bucket of evaluations, right? How do you evaluate the results you're getting?
I think this is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults in the system. Peter [00:27:29]: And so the problem of doing proper evals to find those faults keeps getting harder as the models get better. But it's no less important than it's ever been, right? There are still going to be edge cases that are not covered. And so it's a big area of investment for us. On the reinforcement learning topic, the key thing is that there are all these new requirements that arise in the latest generation of these technologies. So for example, end-to-end is the big thing right now in autonomy and physical AI: you can now train models that effectively take sensor data in and put control signals out, and get really good results out of that. But the way you train and improve those models is really different from the previous generations. And so to do reinforcement learning on an end-to-end model, you now need to actually simulate all the sensor data, right? We call our work in this area neural simulation, but Peter [00:28:26]: think of it like a hybrid of Gaussian splatting and diffusion methods, where you really care about performance. Performance is everything. If you can't do enough simulation fast enough and cheaply enough, you actually can't get results that are worthwhile in the end. It also ties into a lot of our work in embedded systems, which is performance-critical work, and that performance optimization, that performance criticality, carries over to a lot of the model training work, because the only way to make it affordable is for it to be really fast.
Qasar [00:28:58]: I think it's worth a few minutes talking about our own evolving thoughts on verification and validation, going from Qasar [00:29:05]: traditional simulators, which you can think of as vehicle dynamics or something like that, where you're just taking formulas from textbooks Qasar [00:29:13]: and putting them into software, to this neural sim/world model universe. I think that's an interesting topic. Peter [00:29:20]: Yeah. So in more traditional development, you oftentimes had more black-and-white answers to questions. Peter [00:29:28]: In Europe, as an example, there's a regulatory system called Euro NCAP, the European New Car Assessment Programme, and as part of that, vehicles have to pass a bunch of tests, and those tests include safety systems. So automatic emergency braking for a child that runs in front of a car, Peter [00:29:51]: or, let's say, an occluded child that runs out. And so you end up with these binary answers: did the car under test pass this specific test? And there's a very well-known set of test cases Peter [00:30:05]: that the vehicle has to pass. And that was how the industry worked until, let's say, 10-ish years ago. But what's changed now is that with these models, everything is statistics, right? You no longer have a black-and-white answer; instead it's: how many orders of magnitude, how many nines of reliability can I get in the system, and how can I prove that to be true? And the big unlock, honestly, for physical AI as an industry is that these models are just becoming much more reliable. Things actually work a lot better. The number of nines you can get out of these systems is now good enough that it actually becomes cost-effective to really deploy these things.
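Peter's shift from binary pass/fail testing to statistical safety cases comes down to simple arithmetic on failure counts. A minimal sketch (my illustration, not Applied Intuition's methodology; the fleet numbers are invented), expressing observed reliability as "nines" and a mean-time-between-failures point estimate:

```python
# Toy reliability arithmetic; not a real safety case.
import math

def nines_of_reliability(failures: int, trials: int) -> float:
    """Observed failure rate expressed as a number of nines.

    1 failure in 1,000,000 trials -> failure rate 1e-6 -> ~6 nines.
    """
    return -math.log10(failures / trials)

def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Classic point estimate: total operating time divided by failure count."""
    return operating_hours / failures

# Invented fleet numbers: 50,000 hours logged with 2 failure-level events.
print(round(nines_of_reliability(1, 1_000_000)))  # roughly six nines
print(mtbf_hours(50_000, 2))                      # 25,000-hour MTBF estimate
```

A real safety case would report confidence bounds on these estimates rather than raw point values, since a fleet that has seen zero failures is not infinitely reliable.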
And so the big shift in verification and validation has been from the past, where it was strictly requirements, and either you meet them or you don't, to a more statistical verification and validation case, where it's all about how many nines of reliability, mean time between failures, that sort of thing. Statistical Validation, Regulators, and the Cruise Lesson Swyx [00:31:04]: And is the target audience regulators, or even the customers? I imagine the customers are bought in, and it's mostly regulators that need to be satisfied. Peter [00:31:15]: We do work with the US government, we do work of course with the European governments and the government of Japan, and the government is not like an AI lab by any means. Peter [00:31:25]: So Swyx [00:31:26]: They just care about the outcome. Peter [00:31:27]: They care about the outcome. Peter [00:31:28]: And so we do education in that regard: teaching, "Hey, this is how we think validation should be done, and this is an approach that we think is reasonable," and how to think about when a driverless system is actually safe enough to go on the roads, that sort of thing. But I wouldn't say that the government is asking for it. We're more teaching the government in that sense. And honestly, it's more so for our own comfort. We want to build very safe systems, and then of course our customers care deeply about that as well. But in that context we're also typically educating our customers. Qasar [00:32:01]: Yeah. Our first core value is around safety. So I think we can't underline enough that us verifying and validating that the systems we're deploying are safe is probably as important as some regulator or a customer saying so. Swyx [00:32:19]: Of course. Okay. Yeah. Swyx [00:32:20]: You have to satisfy yourselves.
Peter [00:32:22]: As I say, across the world, regulation is oftentimes almost a lowest common denominator. You really have to substantially exceed what the regulators are expecting to make good products. Swyx [00:32:33]: Yeah. One thing I often talk about, and I try to make this relatable to the audience, is Cruise, where they had an accident that basically ended the company. I wonder if people overreact to single incidents, because incidents are going to happen regardless, right? It's a statistical thing. I don't know if regulators understand that you cannot extrapolate from a single incident, but we do, because that's all we have to go on. And your sample sizes are necessarily going to be lower than, I don't know, Swyx [00:33:00]: consumer driving. Qasar [00:33:01]: Yeah. I think the Cruise example wasn't a technology failure. The real compounding issue there was how the company talked to the regulators and what their behavior was, and I think that became more of the issue. Peter [00:33:19]: It definitely was a technology failure, but it was made much worse by the Swyx [00:33:23]: Put the car back on the woman. Qasar [00:33:25]: Yeah. And let me put it another way: there is a version where Cruise still exists. Swyx [00:33:29]: Right. Right. Qasar [00:33:30]: Right. Swyx [00:33:30]: It was like the last straw Qasar [00:33:31]: It Swyx [00:33:31]: in a long chain of Swyx [00:33:33]: issues. Qasar [00:33:33]: And ATG had that horrific accident where someone actually died; that was a homeless person crossing the street. So yeah, I think we can't overstate that ultimately, statistical validation of something is one part of it, but it's not the only part of it. Consumer and, let's say, mainstream adoption of these technologies is also going to be part of that conversation.
I think companies like Waymo are doing the industry a lot of service in the sense that they're setting a high benchmark, and they're showing, in a very responsible way, how to deal with these incidents. There have been Waymo incidents as well; they've just not been as significant as the Cruise one that you mentioned. But yeah, I think you'll just continue to see that. I think the long-term question is really going to be, again, around... It is very clear humans are statistically way worse drivers. Qasar [00:34:29]: There's no debate. And so at what point... But we're emotional animals. Swyx [00:34:34]: Yeah. So my thing is, we have to get to a point as a society where we accept horrific accidents that would never happen with a human driver, because statistically we understand that it is safer overall. In the same way that planes are safer; I think they're the safest mode of transport that we have. Qasar [00:34:50]: Yeah, it's more dangerous to drive to the airport than it is to get on a flight. Qasar [00:34:53]: So if you're ever Qasar [00:34:54]: getting nervous about getting on a plane, just think, "I just gotta get to the airport." Swyx [00:34:58]: Yes, we're flying. Qasar [00:34:59]: If I get to the airport, Qasar [00:35:00]: I'll be good. Swyx [00:35:00]: But then planes also concentrate the tail risk. Qasar [00:35:03]: Yeah. And Peter [00:35:04]: And honestly, I don't think we have to worry about these systems ever causing accidents that are much worse than what humans would cause, because humans do terrible things. Peter [00:35:14]: People fall asleep at the wheel all the time. Swyx [00:35:16]: I have. Swyx [00:35:17]: I've been a drowsy driver. Peter [00:35:19]: And drunk drivers; that's Peter [00:35:20]: the extreme end of the example. But these AI systems have redundancies, have fallbacks.
Many things have to go wrong for something truly catastrophic to happen, because there are so many fallbacks in these systems. Alessio [00:35:36]: Your simulation is so vast because there are so many use cases. What are maybe things that worked in simulation and then you put them out and it's like, "F**k, this Alessio [00:35:45]: just did not work at all"? Peter [00:35:47]: Yes. Alessio [00:35:47]: Is Peter [00:35:47]: That's maybe a bit of a misconception about simulation. So let me go a little bit more technical on this. At first go, no simulation is going to represent the real world. There's always a process of sim-to-real matching, Peter [00:36:02]: where you need real-world feedback to feed into the parameters being used in the simulator, and you have to run that validation flow a number of times until you can get some confidence that the simulator is now accurately representing Peter [00:36:19]: what's going to happen in the real world. Now, if you have a situation where you've done that full validation, and you thought it was accurate, and then something is different, those are much trickier cases, and that absolutely can happen. But really, the validation process is a really important part. You can never skip the simulation validation process, where you're actually ensuring that, hey, my sim-to-real gap here is small enough that I can trust these simulation results. And there are so many fun things you can do when you get into it. I'll give one fun example that came up recently: in these humanoid robotics systems, overheating actuators is a real problem, right? So obviously, phenomenal demos. Peter [00:37:01]: The most amazing Alessio [00:37:02]: For 10 minutes. Peter [00:37:03]: The most amazing demos you can get.
I love watching robots do acrobatics like everybody else, but these systems actually overheat, right? One of the ways you can use simulation, though, is to have the temperature of those actuators be one of the parameters that's represented Peter [00:37:18]: in the simulation. And if you're doing reinforcement learning over a certain task, then the robot can actually adjust its motions in the simulation to account for the fact that it knows that, as it's moving, it's actually beginning to overheat this motor. But if you didn't have that parameter, let's say the heat of that motor, represented in the simulation initially, then your RL policy will disregard it. And now you run that on the robot, and the robot will overheat and fail. Alessio [00:37:43]: I guess the question is, how do you have all of these parameters taken care of while also understanding the deployment environment? Temperature is a great example, right? Well, Alessio [00:37:53]: why did you make my robot worse when it runs in a freezer? Alessio [00:37:57]: It shouldn't have to worry about that. So yeah, how do you design these simulations? Peter [00:38:02]: Honestly, this is what makes simulation so hard, because simulation is fundamentally about trying to optimize the development of a system: how can I build this system faster and better and cheaper, and what are all the levers that I have to actually accomplish that? And because a simulation is just a software program, you can change it a lot more easily than you can hardware systems.
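The actuator-overheating example is concrete enough to sketch. A toy rollout (entirely hypothetical constants, not Applied Intuition's simulator), assuming Newtonian cooling and a hard thermal limit, shows why a policy evaluated in a sim without the temperature parameter can look great and then fail on hardware:

```python
# Toy actuator model (all constants invented): motor temperature is part of
# the simulated state, so a policy evaluated here is penalized for motion
# schedules that would overheat the real hardware.
AMBIENT_C = 25.0
HEAT_PER_EFFORT = 3.0   # deg C added per unit of effort per step (assumed)
COOLING_RATE = 0.1      # fraction of excess heat shed per step (assumed)
MAX_SAFE_C = 80.0       # thermal shutdown threshold (assumed)

def rollout(efforts):
    """Run a motor-effort schedule; return (distance covered, overheated?)."""
    temp, distance = AMBIENT_C, 0.0
    for effort in efforts:
        temp += HEAT_PER_EFFORT * effort           # effort generates heat
        temp -= COOLING_RATE * (temp - AMBIENT_C)  # Newtonian cooling
        distance += effort                         # effort also moves the robot
        if temp > MAX_SAFE_C:
            return distance, True                  # would fail on real hardware
    return distance, False

# Sprinting flat-out overheats mid-run; pacing finishes the whole schedule.
sprint_distance, sprint_overheated = rollout([5.0] * 20)
paced_distance, paced_overheated = rollout([1.5] * 20)
```

A policy trained in a sim that lacks the `temp` state would happily pick the sprint schedule; representing the parameter lets RL discover the paced one, which is exactly the adjustment Peter describes.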
And then what's particularly awesome about, let's say, world models as a part of simulation is that the simulation no longer scales only by adding new math equations; Peter [00:38:36]: we can actually scale the simulation environment with additional real-world data, and that also unlocks a whole new field of robotics. Qasar [00:38:46]: There is a line you cross where doing real-world testing is still better. In this sim-to-real gap, you can reproduce reality at exceedingly high cost, so nothing is free. So really, you're finding the line where you're getting great performance and great feedback, whether it's on the training side or on the eval side, but it's way cheaper than doing it in the real world. At some point, it doesn't make sense. And so even from our earliest days in autonomy, our view was that you're still going to do real-world testing. There's not some magical land where you're not going to do that. And maybe an even more nuanced version of this, in traditional software development terms: most of your testing for software in a vehicle, 95% of it, can be traditional CI/CD flows like you'd have in traditional web development. Then, let's say you have a truck. You can do maybe 4% of those tests on a rig which has all the electrical and electronic components of a truck, but doesn't have the tires. And then you have the 1%, which is actually the vehicle. There's a similar analogy in using simulation for intelligent systems. You can do a lot in a simulator, using world models, but ultimately it's physical AI. You're going to deploy it on physical machines, and Qasar [00:40:17]: the freezer example comes to light.
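The sim-to-real matching loop Peter describes (fit simulator parameters to real-world logs, then check the residual gap before trusting the sim) can be sketched as a toy grid search. The drag model and all numbers below are invented for illustration:

```python
# Toy sim-to-real calibration: fit the simulator's drag parameter so that
# simulated stopping distances match logged real-world measurements, then
# report the residual gap used to decide whether to trust the simulator.

def sim_stopping_distance(speed: float, drag: float) -> float:
    """Stand-in simulator model: distance = v^2 / (2 * drag)."""
    return speed ** 2 / (2 * drag)

# Logged (speed, measured stopping distance) pairs from real-world tests.
REAL_LOGS = [(10.0, 12.5), (20.0, 50.0), (30.0, 112.5)]

def calibrate(candidates):
    """Grid-search the drag value; return (best drag, mean abs sim-to-real gap)."""
    def gap(drag):
        errors = [abs(sim_stopping_distance(v, drag) - d) for v, d in REAL_LOGS]
        return sum(errors) / len(errors)
    best = min(candidates, key=gap)
    return best, gap(best)

best_drag, residual = calibrate([3.0, 3.5, 4.0, 4.5])
# With these logs, drag = 4.0 reproduces the measurements almost exactly,
# so the residual gap is near zero and the sim can be trusted in this regime.
```

Real calibration fits many coupled parameters against noisy logs, but the loop is the same: fit, measure the gap, and only trust simulation results where the gap is small.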
Alessio [00:40:20]: The world model thing has been the hardest thing for me to Alessio [00:40:22]: wrap my head around. We had Fei-Fei Li on the podcast. World Models, Hydroplaning, and Cause-Effect Learning Qasar [00:40:25]: We've been doing a small series with another Intuition company, General Intuition, as well. Qasar [00:40:31]: Yeah, and lots of coverage on NeRFs, yes. Alessio [00:40:34]: Yeah. It feels like we talked about the heliocentric system, right? In a world model, if you just feed in visual data, the model might learn that the sun spins around the Earth. It makes sense, right? But, well, not really. What are some of these other things? Hydroplaning is one thing I think about: can a world model understand hydroplaning, and what amount of water causes it to happen? To me, I don't understand how you guys do it. I guess the real question is, when you're doing both cars on the highway in Japan and the excavator in a mine in Qasar [00:41:13]: Arizona Alessio [00:41:13]: wherever you're deploying them, Alessio [00:41:15]: how much are you relying on the world models to generate the simulations for you and then trying to close the gap after, versus giving the world models as a tool to your engineers to curate the simulations, if that makes sense? Peter [00:41:28]: Yeah, totally. At a pure engineering level, I think if you're hoping to do real-world deployments and you're purely relying on a world-model approach, you probably won't get to something that works before you go bankrupt. So there is just a very practical mindset: world models are amazing, and they're extremely useful for a lot of use cases, but there are a lot of other things you need to do to actually get something started, deployed, and working.
Most fundamentally, world models are about understanding the world, but also understanding what's going to happen. It's the cause-effect relationship. Peter [00:42:01]: Right? So take some sort of construction tool, and that construction tool is going to be doing some work on the earth in some way; it's going to be moving earth. The world model needs to understand that cause-effect relationship: okay, when I take this material from here and put it over there, now I have things that are over here and not over there anymore. Data, obviously, is a big problem. The hydroplaning Peter [00:42:26]: one is actually a really great example, because it's quite non-obvious sometimes. Right? It's raining, and, well, this road has, let's say, the appropriate curvature to it, so the water is running off the road, and cars are driving faster here. And then you approach a road that's very flat, and water is now puddling on that road, and all of a sudden cars are driving slower, because when they were driving faster they were starting to lose control. And there are a lot of very nuanced visual cues in the scene, and so I do think, in the world model concept, there's a good chance that the model actually would learn that you should just drive slower when these visual cues exist. And that's obviously the beauty of these kinds of models: they learn these non-obvious things. Swyx [00:43:14]: It doesn't need to know about hydroplaning to know that it needs to drive slower. Peter [00:43:17]: Yes. Swyx [00:43:17]: Yeah. I want to ask questions about deploying models, too. I presume you use a lot of these world models for training data and simulation, but what about deploying onto the systems in production? Presumably you have, like, GPUs on device Onboard vs.
Offboard: Latency, Embedded ML, and Distillation Swyx [00:43:36]: but they're... I keep saying "on device." What's the right term for that? Peter [00:43:40]: On machine. Swyx [00:43:41]: On machine. Peter [00:43:41]: Or embedded, yeah. Swyx [00:43:42]: Yeah. What is the embedded world like? Because for people who are not used to that world, this is very alien. Peter [00:43:49]: Yeah. So we actually call it onboard and offboard: Peter [00:43:52]: onboard software and offboard software. Peter [00:43:54]: And the great thing about offboard software is you don't have to care about time, and you can run really large models, right? You can say, "Well, this model, I don't care if it takes one second to give me a result or ten seconds, because we have time." And the models can be really big, and they can run in a data center or on a huge GPU, and you can obviously distribute the compute, et cetera. But onboard you don't have any of those benefits. You're like, "Well, I have this many milliseconds in which I need an answer from this model." And so a lot more of the energy goes into, think of it more like distillation: it's truly about efficiency, and literally every fraction of a millisecond counts. And you can't have a situation where the model takes too long, because then the vehicle can't actually function. Peter [00:44:42]: And so you can still use a lot of the same techniques, and the models themselves you can think of as derivatives of larger models that you run offline. You're trying to get a model that still performs really well, but is a small enough version that you can then run on this embedded system where you care about latency and power. Qasar [00:45:03]: Yeah.
And I think the broader point, which maybe is not obvious but is worth saying, is that in the physical AI world, we're not really constrained right now by the intelligence of the models. It's actually what Peter's talking about: it's actually deploying them on Swyx [00:45:19]: The hardware they give you. Qasar [00:45:21]: Yeah, on the hardware you're given. Qasar [00:45:22]: And there's just the reality of safety-critical systems. So those end up being your limiting factors, Qasar [00:45:29]: rather than, let's say, the limiting factor for a foundation model company, Qasar [00:45:34]: which is going to be capital, maybe, or researchers. Qasar [00:45:38]: So for us, as people who come from that realm, those constraints force creativity. Swyx [00:45:47]: And I imagine nobody was deploying, or giving you the hardware for, transformers back in 2018 or whenever, but now they are. What's the evolution been like? Just peel back the curtains a little bit. Peter [00:45:59]: Yeah. Transformers, first off: I think the paper was originally published in 2017. Swyx [00:46:02]: 2017. Swyx [00:46:02]: So there's no time. Peter [00:46:04]: And I Swyx [00:46:05]: But I'm just saying, I guess, embedded ML systems usually have a lot fewer parameters and a lot less compute, and now it's orders of magnitude more. Peter [00:46:14]: Yeah, absolutely. What I was going to say, though, was that in the original paper in 2017, maybe in the last paragraph, somewhere in the paper they talk about, "Oh, by the way, this technique might be useful for images and videos as well." Peter [00:46:30]: These last subjects. Peter [00:46:31]: And it took a few years for that impact to really hit. But now we're seeing transformers everywhere. Swyx [00:46:39]: Yeah. Vision transformers. Peter [00:46:40]: And then the compute just keeps getting better and better.
But you do have this fundamental trade-off, right? You have power, cost, and performance, and you're getting the right mix of those things in an embedded package that can also be shaken and baked in all the Peter [00:47:00]: conditions these things have to operate in. But yeah, I think they're only going to keep getting better, and so we also try to plan our strategy understanding the rate of improvement of these systems. Swyx [00:47:11]: Yeah. So, like, Google just released the Gemma 2B model, Swyx [00:47:15]: that effective 2B model. Is that useful to you guys, or is that too big? Peter [00:47:18]: You can run that model on an embedded system, definitely. Peter [00:47:21]: So yes, it's useful in that regard. The bigger question is what you use it for in an embedded system. You actually need to customize it quite a bit to make it useful for something. But yeah, you could run a two-billion-parameter model, definitely. Swyx [00:47:35]: It's also interesting what percent is a custom ML model that only does that thing versus a generalist LLM, Swyx [00:47:41]: which probably is not that useful, actually, for your context. Peter [00:47:46]: You can imagine different use cases, right? Peter [00:47:48]: So the Swyx [00:47:49]: The voice stuff, yes. Peter [00:47:49]: Yeah, the voice stuff. Totally, yes. Peter [00:47:51]: So for the actual autonomy elements, that's 100% in-house. We do every bit of that: the data, the simulation, the model, everything. But when you get into the more generic use cases, like a voice assistant kind of thing, that's where these more generalist models like Gemma actually can be quite useful. Swyx [00:48:09]: Yeah. And then there's also obviously a trade-off between what percent you must do on machine versus just calling home. Peter [00:48:16]: Yeah. It's all about latency. Swyx [00:48:17]: Latency.
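The onboard constraint Peter describes can be sketched as a hard deadline check: given a control-loop budget in milliseconds, deploy the largest distilled variant whose worst-case latency still fits. The model names and benchmark numbers below are invented:

```python
# Hypothetical model variants: (name, parameter count, worst-case latency in
# ms on the target embedded chip). All names and numbers are invented.
VARIANTS = [
    ("teacher-offboard", 7_000_000_000, 900.0),  # runs in the data center
    ("distilled-large",    400_000_000,  45.0),
    ("distilled-small",     80_000_000,  11.0),
]

def pick_variant(deadline_ms: float):
    """Largest model whose worst-case latency fits the control-loop budget."""
    fitting = [v for v in VARIANTS if v[2] <= deadline_ms]
    if not fitting:
        raise RuntimeError("no variant meets the real-time budget")
    return max(fitting, key=lambda v: v[1])

# A 30 ms control loop rules out everything but the small distilled model;
# a relaxed 100 ms budget admits the larger one.
print(pick_variant(30.0)[0])
print(pick_variant(100.0)[0])
```

The key design point is selecting on worst-case rather than average latency: as Peter notes, a model that is usually fast but occasionally blows the deadline means the vehicle cannot function.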
Swyx [00:48:18]: Yeah. Well, in a lot of contexts, especially in the US, you can just have a connection to the web.
Qasar [00:48:26]: Yeah. I think, though, most of our universe is one where everything has to be fairly embedded and local, just by its nature. Even in the US there's a lot of...
Swyx [00:48:39]: Patchiness.
Qasar [00:48:40]: ...areas that don't have coverage, right? And if you look at the old world of autonomy within mining, which is from long before transformers and neural networks, before the CNN universe, those were really just hand-coded systems. They were just: this machine is going to run to that place with this...
Peter [00:49:03]: That was all GPS, very accurate GPS.
Qasar [00:49:05]: Yeah. And that worked, and it worked for 20 years. So why would we actually need transformers or more modern end-to-end systems? Mainly because those systems can really only run a fixed path back and forth. That provided a lot of value, but not as much as you get when the machine is actually intelligent: it's seeing, it's perceiving, it's acting in a dynamic world.
Alessio [00:49:28]: I looked up RTK, real-time kinematic: one-to-two-centimeter accuracy.
Qasar [00:49:32]: Yeah. Fantastic. And it works in faraway lands where there's not going to be cell phone coverage.
Peter [00:49:39]: Yeah, so it's widely used in the legacy mining and agricultural autonomy systems today. For example, a combine that can be precise to within one or two centimeters as it's driving down the field uses RTK.
Qasar [00:49:53]: Yes.
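An editor's aside: the "hand-coded" legacy autonomy Qasar describes, with no perception or learning, is essentially driving a fixed list of surveyed waypoints, which is only workable because RTK-grade GPS gives centimeter-level fixes. The toy sketch below illustrates that pattern; the tolerance, function names, and structure are hypothetical, not Applied Intuition's actual stack.

```python
import math

# Toy waypoint follower in the spirit of pre-transformer mining autonomy:
# drive each surveyed waypoint in order, declaring it reached only when the
# position fix is within a tolerance that assumes RTK (1-2 cm) accuracy.

WAYPOINT_REACHED_M = 0.05  # 5 cm tolerance: only sane with RTK-level fixes

def heading_to(pos, target):
    """Bearing (radians) from the current position to the target waypoint."""
    return math.atan2(target[1] - pos[1], target[0] - pos[0])

def follow_path(pos, path):
    """Return the (heading, waypoint) commands issued while driving the path."""
    commands = []
    for wp in path:
        while math.dist(pos, wp) > WAYPOINT_REACHED_M:
            commands.append((heading_to(pos, wp), wp))
            pos = wp  # stand-in for "drive until the GPS fix says we're there"
    return commands

# Drive east one meter, then north one meter.
cmds = follow_path((0.0, 0.0), [(1.0, 0.0), (1.0, 1.0)])
print(len(cmds))  # -> 2
```

Notice there is no sensing of the world at all, which is exactly Qasar's point: the machine repeats a path, and anything dynamic in the environment is invisible to it.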
AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)
Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special [https://www.latent.space/p/unsupervised-learning] to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe [https://www.ai.engineer/europe/], but before the Cursor-xAI deal [https://cursor.com/blog/spacex-model-training]. Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what's real today, what will be real in the future, and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs. Thanks to Jacob and the UL production team for hosting and editing this! Jacob Effron * LinkedIn: https://www.linkedin.com/in/jacobeffron/ * X: https://x.com/jacobeffron Full Episode on Their YouTube We discuss: * swyx's view from the center of the AI engineering zeitgeist: OpenClaw, harness engineering, context engineering, evals, observability, GPUs, multimodality, and why conference tracks now reveal what matters most in AI * Whether AI infrastructure has finally stabilized: why "skills" may be the minimal viable packaging format for agents, why infra companies have had to reinvent themselves every year, and why application companies have had an easier time surviving model volatility * The vertical vs. 
horizontal AI startup debate: why application companies can act as the outsourced AI team for enterprises, why some horizontal companies still matter, and why sandboxes may be the clearest reinvention of classic cloud infrastructure for the AI era * The "agent lab" playbook: starting with frontier models, specializing for your domain, then training your own models once you have enough data, workload, and user behavior to justify the cost and latency savings * Why domain-specific model training is real, not just marketing: how companies like Cursor and Cognition can get users to choose their in-house models, and why search, domain specialization, and distillation are becoming more important * Open models, custom chips, and alternative inference infrastructure: why swyx has turned more bullish on open source, why non-NVIDIA hardware is suddenly getting real attention, and why every 10x speedup can unlock new product experiences * What it means to sell to agents instead of humans: why agent experience may mostly just be good developer experience by another name, why APIs and docs matter more than ever, and how pretraining-data incumbents are compounding advantages in an agent-first world * Why memory and personalization may become the next big wedge: today's models mostly reward frequency of mentions, but in the future, swyx expects product choice to be shaped much more by personalized memory systems * The state of the AI coding wars: why coding has become one of the largest and fastest-growing categories in AI, how Anthropic, OpenAI, Cursor, and Cognition have all ridden the wave, and why the category may still have more room to run * Capability exploration vs. efficiency: why the industry is still in a token-maxing, experiment-heavy phase where people are rewarded for spending more rather than less * Claude Code vs. 
Codex and the strange stickiness of coding products: why first magical product experiences may matter more than expected, and why the bigger mystery may be why only a few names have emerged as real winners so far * What the end state of the coding market might look like: two major players, a longer tail of niche products, and possible disruption if Microsoft, Mistral, xAI, or the Chinese labs push harder into coding * Where application companies still have room against the labs: why frontier labs are trying to expand into verticals like finance and healthcare, but still leave space for focused companies that own the workflow and the last mile * Why coding may be a preview of every other AI market: the first category to truly go parabolic, the clearest example of foundation model companies colliding with application companies, and a template for how future vertical AI markets may develop * Why AI valuations now feel unbounded: from billion-dollar ARR products built in a year to trillion-dollar market caps, swyx and Jacob unpack how the AI market has broken traditional startup intuitions about scale and durability * Consumer AI vs. coding AI: why ChatGPT's consumer category may have plateaued on frequency and product design, while coding continues to feel like a daily-use category with real momentum * The next product frontier beyond coding: consumer agents, computer use, and "coding agents breaking containment," with swyx's thesis that 2025 was the year of coding agents and 2026 may be the year they begin to do everything else * Whether foundation models are really killing startup categories: why swyx is less worried for early founders, more worried for mid-size startups and traditional SaaS, and why building something ambitious may now be the best job interview for a frontier lab * AI vs. 
SaaS and the internal culture war around adoption: the tension between AI-native employees who want to rip out expensive software and skeptics who think quick AI-built replacements create fragile systems * Why traditional SaaS may be under real pressure: swyx's own experience spending six figures on event and sponsor management software, the temptation to rebuild it cheaply with AI, and the broader question of whether teams will trust custom AI-native replacements * Biosafety, security, and frontier model access: why swyx raised biosafety at a dinner with Anthropic's Mike Krieger, why Krieger argued security is the bigger issue, and what restricted model releases reveal about Anthropic vs. OpenAI * The era of giant models: why 10T+ parameter systems may only be a temporary rationing phase before bigger clusters arrive, why labs may increasingly keep their most powerful models private for distillation, and why scale alone no longer feels like a complete answer * Memory as the slowest scaling factor in AI: why context windows have improved far more slowly than people hoped, why million-token context still has not changed most real workflows, and why memory may be the key bottleneck for the next generation of systems * What swyx changed his mind on in the past year: becoming more bullish on open models, more convinced that the top tier of agent startups behaves very differently from the median AI company, and more optimistic about fine-tuning and specialized model adaptation * "Dark factories" and zero-human-review coding: the next frontier after zero human-written code, where models not only write the code but ship it without human review, forcing companies to rethink testing and verification from first principles * Why RL and post-training may matter more than people assumed: even if the resulting models get thrown out every few months, the data, workflows, and domain-specific improvements persist * Synthetic rubrics, Dr. GRPO, and multi-turn RL: why 
reinforcement learning is becoming much more domain-specific and multi-step than many people realize, opening the door to much deeper customization * The next frontier after coding: memory, personalization, and world models, including why swyx thinks world models matter not just for robotics or gaming, but for giving AI something closer to lived understanding * Fei-Fei Li, spatial intelligence, and the Good Will Hunting analogy: the idea that today's LLMs may know everything by reading it all, but still lack the lived experience that turns knowledge into a deeper kind of intelligence Timestamps * 00:00:00 Intro preview: AI coding wars, startup pressure, and market structure * 00:00:28 Welcome to the Latent Space x Unsupervised Learning crossover * 00:01:17 What AI builders are focused on now: OpenClaw, harnesses, and infra * 00:04:33 Why AI infra is harder than apps, and where startups can still win * 00:06:39 Should companies train their own models? * 00:09:28 Open models, custom chips, and the new inference race * 00:11:25 Designing products for agents, not just humans * 00:16:49 The state of the AI coding wars in 2026 * 00:19:27 Capability exploration, token-maxing, and why coding is going parabolic * 00:21:41 What the end state of the coding market could look like * 00:23:50 Where app companies still have room against the labs * 00:27:02 Why AI valuations and market swings feel unprecedented * 00:28:56 Consumer AI vs. coding AI, and why sticky products still matter * 00:32:28 What the next breakthrough product experience might be * 00:32:53 2026 thesis: coding agents break containment and eat the world * 00:35:27 Are foundation models wiping out startup categories? * 00:37:33 AI vs. 
SaaS, vibe coding, and internal team tensions * 00:40:01 Biosafety, security, and the politics of restricted model releases * 00:42:19 Giant models, compute constraints, and the limits of scale * 00:44:30 Memory as the real bottleneck in AI * 00:44:57 Why swyx changed his mind on open models * 00:47:44 Dark factories and the future of zero-human-review coding * 00:49:36 Why post-training and RL may matter more than people think * 00:51:50 Memory, world models, and the next frontier of intelligence * 00:53:54 The Good Will Hunting analogy for LLMs * 00:54:21 Outro
Transcript
[00:00:00] swyx: Isn't that crazy? That number is just mind-boggling.
[00:00:03] Jacob Effron: What is the state of the AI coding wars today?
[00:00:05] swyx: We're in a phase of capability exploration. The general thesis I have been pursuing is that the same way 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else.
[00:00:16] Jacob Effron: Do you worry about the foundation models just getting into a bunch of these startup categories?
[00:00:21] swyx: Mid-size startups? Yes.
[00:00:23] Jacob Effron: What do you think the end state of this market is?
[00:00:25] swyx: For the market structure to significantly change, there would be...
[00:00:28] Jacob Effron: Today on Unsupervised Learning we had a fun episode, in what's really become an annual tradition: a crossover episode with our friends at Latent Space. swyx and I sat down and talked about everything happening in the AI ecosystem today: what we thought of the various changes at the model layer, what's happening in the infra world, the coding wars, and a bunch of other things. It's a ton of fun to do this with someone I really respect, another great podcaster in the game. Without further ado, here's our episode. Well, swyx, this is super fun to be back with another Unsupervised Learning x Latent Space crossover episode.
[00:01:02] swyx: Yeah.
[00:01:02] Jacob Effron: I feel like there are a lot of places we could start, but one thing I always find fascinating about the way you spend your time is that you're at the epicenter of this AI engineering movement and community. You run these events and conferences, put on these awesome talks, and I think you just have a great pulse on the zeitgeist of what's going on.
[00:01:16] swyx: Yeah.
[00:01:17] Jacob Effron: Maybe to start: what are the biggest topics people are thinking about right now?
[00:01:21] swyx: Yeah, so I just came back from London, where we did AIE Europe, and we're doing roughly one per quarter now.
[00:01:27] Jacob Effron: You've really upped the pace.
[00:01:29] swyx: We're trying to match AI speed, you know?
[00:01:30] Jacob Effron: Yeah, exactly. The topics would be completely different each time, I imagine.
[00:01:33] swyx: Yeah. I definitely curate the tracks, so you can see what I think when you see the track list and the speakers I invite. Obviously OpenClaw is the story of the last four or five months, and just below that I would consider harness engineering and context engineering to be two related topics in agents and RAG. Then there's a long tail of evergreen stuff: evals, observability, GPUs, and LLM infra in general. We also have other updates on multimodality and generative media, let's call it. But the first three I mentioned are top of mind for people.
[00:02:13] Jacob Effron: I think harnesses in particular are so interesting. There was a tweet from Harrison Chase, the LangChain CEO, that caught my eye recently, where he said it finally feels like we have stability around the infrastructure for AI.
And I think what he was basically implying is: look, over the past two or three years, as a company at the epicenter of AI infrastructure, it was a bit like playing whack-a-mole, right? You were constantly moving around with however the building patterns were evolving.
[00:02:36] swyx: For Harrison, for sure. He's basically had to reinvent the company every year since he started LangChain, right? It was LangChain, then LangGraph, then deep agents. I think he's one of the most nimble, adept, sharp people about this.
[00:02:49] Jacob Effron: And he's saying now is finally the time of stability.
[00:02:51] swyx: Yeah.
[00:02:52] Jacob Effron: Do you buy that? What do you make of that take?
[00:02:56] swyx: I think it's very expensive to say "this time is different." But when you're just writing code, it's actually okay to try to make a call, and it may not even matter if the call is right or not. I just don't care that much, because you can be right on a thesis, but if you don't figure out how to monetize the thesis, then who cares if you said something first. That said, it does feel like, for example, we went through a lot of different ways of packaging integrations up with agents, and it feels like we've landed on skills, which is the minimal viable format. It's just a markdown file with some scripts attached to it, and I don't see how it can get simpler than that. So there is some justification for the stability around harnesses. I feel like there may be more adaptation with regard to the real-time elements, or subagents, or memory, or any of those agent disciplines, let's call them, in agent engineering. But if the thesis is that agents are LLMs with tools in the loop, with a file system, with what they can do,
retrieval with skills, and all this standard tooling that now seems to be relatively consensus, then probably, yeah, that makes sense. I just think there's no point trying to stake your reputation on the thesis that we're there, because if it changes again, just change with it. It's fine.
[00:04:33] Jacob Effron: Yeah. I've always been struck by how much more challenging that is for infrastructure companies than application companies. On the application side you've seen Bret Taylor from Sierra, Max from Legora: they're like, look, we build what's ahead of the models, and we're willing to throw everything out every three months as the models get better and better. But the thing you at least have there is an end customer that's decently sticky. They'll mostly stick with you; they'll give you a shot at building these things. What I've always found more challenging about the reinvent-yourself-every-three-months infrastructure layer is that developers are definitely a pickier audience than an accounting firm or a bank. So it's a more challenging position to be in, to have to constantly reinvent yourself.
[00:05:17] swyx: Yeah. And when they churn, it's very complete. They'll leave for the hot new thing, because there's no defensibility, I guess. Even if you're a database, people can migrate workloads off databases; it's a known thing. So I think what we're really talking about is the vertical versus horizontal debate in AI startups. And the way I think about it is just that when you are
a Legora, when you are an Abridge, you are the outsourced AI team, right? Your job is to apply whatever the state-of-the-art AI methods are.
[00:05:55] Jacob Effron: Yeah, like this translation layer between model capabilities and your own customers.
[00:05:57] swyx: Yes, to the end customers. And if they didn't have you, they would've had to hire in-house, and they're not going to hire in-house, so they have you. I think that's a reasonable position, very robust to whatever trends and discoveries people make in the engineering layer. I do think there are useful horizontal companies being built, but they're all very much reinventions of classic cloud for the AI era, the primary one being sandboxes. Which, like, it's another form of compute, guys, let's not get too excited about it. But the workloads are enormous.
[00:06:38] Jacob Effron: Right.
[00:06:38] swyx: Yeah.
[00:06:39] Jacob Effron: It's interesting. As part of this, among the questions folks are asking around infrastructure, there's a lot about the extent to which companies should have their own AI teams and what they should be doing in-house. Should people be training their own models? Should people be doing RL in-house based on the data they have? One has to evolve their takes on this every three months at this pace, but where are you at on it today?
[00:07:00] swyx: Well, actually, training your own models has gone up across the board. Obviously I'm involved in Cognition, and Cursor is also doing a lot of its own model training.
And I think that's part of what I've been calling the agent lab playbook, where you start off with the state-of-the-art models from the big labs and you specialize for your domain. But once you have enough workload and enough high-quality data from your users, then you can train your own models and save a lot on cost and latency and all that good stuff. You also get a marketing bonus from calling it some fancy name and putting out some research.
[00:07:38] Jacob Effron: From my seat, I can't tell how much of it is actual value provided to the end user and how much of it is that marketing bonus. It seems like some combination of the two.
[00:07:45] swyx: I think it's both.
[00:07:46] Jacob Effron: Yeah.
[00:07:46] swyx: No, there actually is real value, and you know that for a number of reasons. One: even when it's not subsidized, people do choose it as one of the top four or five models. This is true of both Composer 2 and SWE-1.6, both among the top five models. In a fair market, a free market, in a model switcher, people do choose it, and it's not subsidized, so that's as good as it gets. But beyond that, domain-specific models, for example for search, which both companies have, absolutely make a ton of sense. Everyone says, yeah, we should always do this, and honestly the infrastructure for it is getting easier, with Thinking Machines' Tinker as well as the labs' own offerings. I mean, this is one of those reversals of the bitter lesson, where you first bootstrap on the large, general-purpose models to get big, and as you get very well-defined workloads that are high quantity but not high variance, you distill down to a smaller model and run that on your own. Right.
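An editor's aside: the "distill down to a smaller model" step swyx describes is classically done by training a small student to match a large teacher's output distribution. Below is a minimal, self-contained sketch of the temperature-softened distillation loss (Hinton-style logit distillation); the toy logits are illustrative, and this is not a claim about how Cursor or Cognition actually train.

```python
import math

# Minimal sketch of a distillation loss: KL divergence between the teacher's
# and student's temperature-softened output distributions over a vocabulary.

def softmax(logits, temperature=1.0):
    """Softmax with temperature, computed stably in pure Python."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student): zero when the student matches the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]
print(distill_loss(teacher, [4.0, 1.0, 0.5]))      # 0.0: student matches teacher
print(distill_loss(teacher, [0.5, 1.0, 4.0]) > 0)  # True: mismatch is penalized
```

In practice this loss is minimized over the high-volume, low-variance workload swyx mentions, which is exactly why it only pays off once you have enough production traffic to distill from.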
Which totally makes sense.
[00:08:50] Jacob Effron: What I'm less clear on is the DIY RL use case, which I think is mostly about improved quality for different things. There are probably more efficient ways to get a smaller model that's faster and cheaper. And it'll be interesting to see: two or three years ago you had this whole cohort of companies that were pre-training and claiming better outcomes in their domains, and then getting cooked as each model iteration improved. I wonder whether a similar story plays out in the RL space, for the companies focused on pure outcomes and quality rather than the cost side, because clearly your own models for cost at scale make a ton of sense.
[00:09:28] swyx: I think those are two sides of the same coin. You basically always want to hold quality constant, or trade off a little quality for a drastic decrease in cost, and that's true for everyone. One element I wanted to bring out, which is very much in favor of open models, is custom chips. So this would be Cerebras, but also Talu, and there's a huge range of stuff in between. This has been a huge story this past year: everything non-NVIDIA is getting bid up, including freaking MatX, which is very rewarding for me. It's one of those things where suddenly the number of alternative hardware options is increasing, and the inference you can get is insanely fast. We're talking thousands of tokens per second instead of less than a hundred. So the trade-off on quality doesn't bite as much anymore, because the speed is so high.
[00:10:24] Jacob Effron: Have you seen a lot of companies go all in on the alternative chips?
[00:10:26] swyx: Cognition has, yeah.
On Cerebras, and so has OpenAI. But beyond that, no, I don't think so. And that's mostly foreshadowing, yeah. I used to be kind of a skeptic: okay, so what if I get my inference sped up from a hundred tokens per second to 200 tokens per second? It's only 2x faster, not that big a deal. But I think every 10x does unlock a different usage pattern, and we have proof in Talas and some of the others that you can drastically improve inference speed. What happens from there, I don't even really know; it's so hard to predict when entire applications just appear at once. And it also isn't that expensive, right? So this is one of those things where I think the investment cycle is going to be multi-year, and I would caution people not to dismiss it too quickly.
[00:11:25] Jacob Effron: Yeah. One other infra question I was curious to get your thoughts on: it seems increasingly that a lot of the cutting-edge infra companies are building for agents as the buyers or users of their product, right?
[00:11:35] swyx: Ooh.
[00:11:36] Jacob Effron: And...
[00:11:37] swyx: Another huge theme. Yeah.
[00:11:38] Jacob Effron: And I'm trying to figure out what you have to do differently about selling to agents. Are they just the ultimate rational developers? Or is there, you know...
[00:11:46] swyx: No, absolutely not. I think they are easily prompt-injected and very tuned toward compounding existing winners.
[00:11:57] Jacob Effron: Yeah.
[00:11:57] swyx: So congrats if you won the lottery of getting into the training data before 2023, because now you're installed in there for the foreseeable future. But yeah.
You know, one stat that Vercel CTO Malte dropped at my conference was that 60% of traffic to their admin app architecture for configuring Vercel applications is now bots. It's not human. So your primary customer is agents now, and it's mostly coding agents, mostly people using the CLI or MCP or whatever. But I think step one: if it doesn't exist as an API that agents can use, it doesn't exist. Which is a good hygiene thing anyway, to make everything API-available, but it's also an extra push on product people to not only work on the UI; you should probably work on the CLI stuff too. Beyond that, honestly, I come from the sensibility that everything you're trying to do now for agent experience, which is the term Matt Biilmann at Netlify is trying to coin, is the same thing you should have been doing for developer experience. You should have had good docs, a consistent API that is mostly stateless, discoverability or progressive disclosure or search or whatever. So now that people have the energy to find these customers and do that, that's great. Do I believe in extending beyond that into something like AEO, gaming the chatbots? Not necessarily, but obviously there are going to be huge advantages for the people who figure out the short-term wins. Yeah, and short-term wins can compound.
[00:13:43] Jacob Effron: Do you think these compounding advantages for the pre-training-data-cutoff companies persist? Over some period of time, I imagine they don't. So as you think about, I dunno, three or four years from now, what the selection criteria end up being,
do you think it still mirrors exactly what you were saying before, that it's exactly what you should have been doing all along to sell a good product to developers?
[00:14:01] swyx: It could be, except that in three or four years I think we'll probably have much better memory and personalization. Then general AEO or GEO doesn't really matter as much. Whatever memory or personalization system we end up with will probably determine what you end up choosing, much more than what's currently the case, which is just frequency of mentions, let's call it.
[00:14:26] Jacob Effron: Yeah.
[00:14:26] swyx: So you just spam quantity, and that's something I'm looking forward to changing. I do think the fundamental exercise to work through for yourself is: if you start a new disruptor company now, and there's a big incumbent that everyone knows, like Supabase, which is kind of the Postgres database incumbent, if you want to start the new Supabase, how would you compete with them? I don't necessarily have the answer. But, for example, Resend is relatively new; I think they started in 2023. And there was a recent survey where people checked what Claude recommends by default if you don't prompt it with anything, just say "give me an email provider," and it says Resend in something like 70% of cases. The fact that you can get in there with such a relatively short existence is, I think, encouraging.
[00:15:14] Jacob Effron: Yeah.
[00:15:14] swyx: I do think you want to do whatever it takes to get onto that very short list of mentions, because it's not going to be 20 of them; it's going to be like three.
[00:15:26] Jacob Effron: No, definitely. It feels like probably more consolidation than ever,
or kind of a winner-take-most market, more than the physics of go-to-market in the past might have enabled.
[00:15:38] swyx: The other thing is that semantic association is going to be very important, in the sense that you want to do the combo articles: "use my thing with Vercel," with blah, blah, and all of that gets picked up in a corpus. So that's probably one thing you want to do well. I don't know what else; it's one of those things where I feel like I'm behind. I don't know how you feel about this.
[00:16:04] Jacob Effron: I think AI is just everyone constantly feeling like they're behind.
[00:16:08] swyx: Yeah.
[00:16:09] Jacob Effron: I want to meet the person that doesn't feel behind.
[00:16:11] swyx: But with AX, right, my stance was exactly what I said before: everything you should do for agents is something you should have done for humans anyway. So to the extent that it's giving people more energy to do things for agents, great. But it's hard to articulate what new thing, apart from just more spam, you should be doing. Anyway, that would be my take right now. I do think there will be more turns at this. I think the personalization turn that is coming will be big, and I don't know what that looks like, because basically we feel kind of tapped out on the memory side of things.
[00:16:49] Jacob Effron: Yeah. Since we last chatted, you took this role over at Cognition, and you obviously have a front-row seat to the AI coding space today. I feel like coding, in many ways,
People view it as, besides being the mother of all markets and this massive opportunity, kind of a preview of what's to come for many other spaces. Agents are most advanced in coding, and the competition between foundation models and application companies mirrors what we may see elsewhere. So maybe for our listeners, can you just lay out: what is the state of the AI coding wars today?

[00:17:25] swyx: It is massive. Last time we talked about this, I don't think we appreciated the size of it.

[00:17:32] Jacob Effron: No, I wish we did.

[00:17:33] swyx: The state of the AI coding wars today: both OpenAI and Anthropic have made it a priority to compete in coding. Anthropic is at something like $2.5 billion in ARR just from Claude Code; the way they recognize ARR is up for debate. For OpenAI I don't think a public number is known, but let's call it $2 billion as well. And then Cursor is rumored to be at $2 billion. Those are the public numbers that are known. So these are huge markets that have been created in the past year. Claude Code just recently celebrated its one-year anniversary, which is pretty remarkable. The other thing I see is people sharing the relative penetration of Claude use cases: coding is 50%, and then legal, health, and the rest make up the remainder. And there was a very popular tweet that said: look at the empty space in all these other use cases. If you are a new founder today, you should be betting on the other stuff, on a catch-up theory.
My pushback is the same pushback I had on the app-over-Google debate, which is: why is this time different? If coding went from, let's say, 10% to 50% in the past year, why can't it keep going? Getting that wrong is actually very painful, because you could have just made the momentum bet instead of the mean-reversion bet. So I think that's the state of things now. People are very much into LLM psychosis; they're getting rewarded for spending more rather than spending less. We're not in the efficiency phase, we're in a phase of capability exploration. So people who are more crazy, more creative, get rewarded comparatively.

[00:19:27] Jacob Effron: It's interesting. It feels like behind these token-maxing leaderboards, the first phase of this transition from a workforce perspective is you've just got to show your employer: hey, I use these tools.

[00:19:37] swyx: Here's the number of tokens I cost, and that's it. They don't care about the quality. It's maybe distasteful to someone who cares about the craft, but directionally everyone just wants the number to go up. So it's not very discerning, and it's probably very sloppy, but I think it's net fine, because we're still probably underusing AI in general. We had Ryan La Poplar from OBI on the podcast, who spends a billion tokens a day. For those counting at home, that's something like $10,000 worth of API tokens a day if they paid market rates, and most of us can't afford that. And probably a lot of what he does is slop.

[00:20:25] Jacob Effron: Right.
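The "for those counting at home" figure above is simple unit arithmetic: a billion tokens a day at a blended market rate of roughly $10 per million tokens comes out to about $10,000 a day. A minimal sketch, assuming that hypothetical blended rate (real API pricing varies by model and by input/output mix):

```python
# Back-of-envelope check of the token-spend figure from the conversation.
# The $10-per-million blended rate is an assumption for illustration,
# not any provider's actual price list.
TOKENS_PER_DAY = 1_000_000_000          # "a billion tokens a day"
USD_PER_MILLION_TOKENS = 10.0           # hypothetical blended rate

daily_cost = TOKENS_PER_DAY / 1_000_000 * USD_PER_MILLION_TOKENS
print(f"${daily_cost:,.0f}/day")        # → $10,000/day
```

At $10/M tokens the numbers line up with the quote; at a Flash-class rate well under $1/M, the same volume would cost an order of magnitude less, which is why raw token counts are only a rough proxy for actual spend.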
[00:20:25] swyx: But if there were a new capability, he would discover it first, before you, because he was trying and you were not trying. You only do things that work; well, good for you. But the people who are going to discover the next hot thing are living at the edge.

[00:20:42] Jacob Effron: Right, and increasingly living at the edge is just having the compute budget to run these experiments. Kind of similar to how living at the edge on the research side has always been constrained by the amount of compute you had to run experiments. It feels similar now on the builder side, actualizing these tools.

[00:20:56] swyx: Yeah. The other thing that's very obvious is that Anthropic is kind of the high-price premium player, where restricting limits, or even restricting model releases, is the name of the game. Whereas Codex is like: come on in, guys, use our SDK, use our login, we don't care, we're going to reset limits. Whatever you do, you want to exploit the subsidies where you can get them, and Codex is definitely super subsidized right now; Gemini is also very subsidized. And while that's going on, it's not that bad to be a capabilities explorer on just the $200-a-month plan from Claude Code or from OpenAI. My sense is that people aren't even there yet.

[00:21:41] Jacob Effron: How do you think this market ultimately plays out? It's obviously such a big market that any slice of it is interesting for anyone going after it. But what makes people so interested in the coding market particularly is that it feels like it's kind of this...
...foreshadowing of what will happen in any other application market that the foundation models eventually turn to, train their models against, and gather data around. So how do you think it plays out? Does there end up being room for lots of different kinds of players? What do you think the end state of this market is, and is that applicable to other markets?

[00:22:10] swyx: The status quo is probably the most likely outcome: there are two big players, and there's a small range of longer-tail players that fit use cases the two big players don't. That feels right to me. For the market structure to significantly change, there would need to be a significant change in the economics, or the brand building, or the value propositions of the companies involved, and I haven't seen anything in the last six months that has really changed the stories materially. So I feel like they would just keep going until something else happens. Something else happens meaning, say, Microsoft wakes up and goes: guys, we have GitHub, we'll do something much bigger here than just Copilot. That would be a big change. MSL has put out a model now, and I was at a breakfast with Alex Wang where they were like: yeah, we really, really want to go after the coding use case. They haven't done anything yet, but don't underestimate them. And similarly for the Chinese labs; I think they're trying to go after it. Z.ai is doing stuff, and Z.ai and GLM are the same thing. So everyone's trying to get a piece of that pie, but I feel like the status quo has been pretty stable for almost a year now.
[00:23:39] Jacob Effron: Yeah. And is the room for the application companies more on the enterprise side? What surface area do the model companies leave for application companies?

[00:23:50] swyx: Yeah, that's a good one. It's very much evolving. I will say, because OpenAI did not have this level of attention on coding a year ago, we just don't have that much history. And it seems like, for example, the big push at OpenAI now is the super app. Is that a consumer thing? Is that a product-portfolio-rationalization thing? How much is that going to take away attention from coding at the very moment they want to invest more in coding? I think it's very unclear. In both big labs (Anthropic here; DeepMind and xAI are separate cases), they are trying to seed the other TAM-expansion areas: Claude Code for finance, Claude Cowork, all those things. Whereas Cursor and Cognition are comparatively just focused on coding. So I do think they leave space, and for the other verticals that means the same thing: the labs are not going to be that intensely focused on any one domain. Except I would mark out finance and healthcare as the next ones they're clearly going after. Comparatively, healthcare seems more thorny. There have been some announcements about it, but I would respect the finance work a lot more just because the path to money is a lot clearer.

[00:25:12] Jacob Effron: Yeah. Maybe similar to the space that's being left in these other domains, there's obviously...
...a lot that's required to actually implement these tools in enterprises, versus maybe just giving model access to folks out of the box.

[00:25:27] swyx: Yeah. So the agent-lab pitch is: we'll do the last mile for you. Whereas the model labs tend to just trust the model and be minimalist about it. Both of them work.

[00:25:38] Jacob Effron: Yeah.

[00:25:38] swyx: I don't necessarily think one beats the other for every use case. All I do know is that the large enterprises do want a dedicated partner that isn't just the model labs, which is kind of interesting.

[00:25:55] Jacob Effron: We've been in this phase of pure capability exploration, and nothing has been better for the large labs. They're always going to be at the frontier of capability exploration, so they have a very good relationship with a lot of these enterprises. But ultimately, over time, the incentive structure of these labs is always going to be maximal token consumption by the end customers they work with. And there are just so few companies that have actually gotten to massive scale. Maybe coding is again the most interesting, since it's the first space that has really just gone completely parabolic. You must love it every day; it's absolutely insane.

[00:26:32] swyx: It gets even... okay. We say good things about Cursor and Cognition, but the sheer liftoff of both Anthropic and OpenAI, because they have independent valuations, and let's throw xAI in there too now that it's at 1.2 trillion, that number is just mind-boggling. I feel like in normal investing or normal startups there's kind of a ceiling market cap or valuation. Totally.
A ceiling you reach, and you go: all right, it's going to be chiller from now on. And these guys are not slowing down. No.

[00:27:02] Jacob Effron: Well, I also think the dynamic with some of these later-stage companies is fascinating. In the past, in the venture world, if you got to a certain level of scale, the question around you was really more a valuation question. That's why there were different types of venture people, and the late-stage growth people were incredible at a bit of what's-the-ultimate-market-opportunity, but mostly at what's the right way to value it. The outcome sat in some band; sure, there was variance, but it was relatively understood what that band was, and then maybe over time you got surprised to the upside. Whereas now, for any later-stage company, even the labs themselves, the band of what that company might be worth, even a year or two out, is so massive because of how fast the ecosystem changes. Even for later-stage companies, every three months could be an existential-level event, to the upside or the downside. And you're obviously seeing it in the positive with Claude Code. If you think about a company like Anthropic: for a while it was unclear whether they would have access to enough capital to really stay in the race. Then coding hit at the exact right time, they had the perfect model for it, they executed brilliantly, and now they're one of the most valuable companies in the world.

[00:28:13] swyx: At the same time, I have zero sympathy for OpenAI, because they're crushing it and they're all rich.
You know, this is a high-class champagne problem, to be number two at coding or whatever. Who cares? You're doing great.

[00:28:27] Jacob Effron: Yeah. It's funny, though. You'd be closer to this, given that you're in the AI coding space, but a lot of people I talk to think Codex is just as good as Claude Code, if not better. One thing I've been really surprised by (and maybe Claude Code is a better product in some ways; I'm curious about your thoughts) is that in consumer AI with ChatGPT, you saw this big first-mover advantage. Admittedly, today Claude and Gemini are great products, and it's not abundantly clear ChatGPT is any better, but people stick with ChatGPT; it's the first thing that introduced them.

[00:28:56] swyx: They stay, but they're not growing anymore. I don't know if you've seen...

[00:28:59] Jacob Effron: Right, but that to me is more of a product problem. It's not like they've lost share to someone else. My understanding is the overall problem with consumer AI today is much more: how do you take this tool, which for folks like us, knowledge workers, is this incredible magic tool, and make it a daily-active-use tool for a lot of people around the world? It's a category-wide problem. In coding, for example, the entire space has gone parabolic. There may be some relative growth among other consumer AI players, but it's not like consumer AI as a category is going parabolic and ChatGPT is failing to capture most of it. The larger problem is much more that the category has hit a bit of a plateau: people haven't figured out how to bring tons more users on board, or increase the frequency of those users.
So it seems more like a category-wide problem than a massive market-share change. I was going to draw the comparison to the coding space, where Claude Code was obviously the first product to introduce people to this magical experience. By all accounts, Codex is pretty damn close to as good, if not better. But still, you would have thought that first product would not be a super sticky surface area, and it turns out it has been. It feels like the first lab to introduce you to an experience really does keep a lot of the focus.

[00:30:12] swyx: Maybe it's still early days. ChatGPT is three-plus years old, and Claude Code just turned one. So give it time. Definitely a lot of people have switched to Codex; maybe that will keep going. It's really hard to tell. I do think that because we are in this high-volatility, high-temperature phase, the loyalty and stickiness to first movers and category creators isn't as high as it might be in some other areas we've looked at in our careers.

[00:30:47] Jacob Effron: Yeah. Though I've been surprised by the Claude Code thing. I would have thought that, in many ways... I always worried about the...

[00:30:52] swyx: You think it would have been gone by now?

[00:30:53] Jacob Effron: Not gone. But I always worried that the consumer business of these companies would be quite sticky, and that the enterprise API business was actually, in some ways, your least loyal buyers; they would move.

[00:31:05] swyx: Right, right. But they worked out that it wasn't the enterprise API, it was the enterprise product.

[00:31:09] Jacob Effron: Totally.
And maybe that was the secret. But the amount of lock-in, or just default behavior, that has happened in that space is more than I might have imagined, with two products that by all accounts are pretty damn similar.

[00:31:22] swyx: No fight there. I will say I do think Codex is still in catch-up mode, in terms of personal experience. The only things I like out of Codex are Spark, the skills integration, which is a little bit better, and the speed, which is a bit better, maybe because it's written in Rust or whatever. Very minor things that you're almost telling yourself rather than objectively assessing between the two of them. Vibes-wise, I think that's what's going on. The missing question in this whole debate is: why is this so concentrated in only two names? Where is the Gemini presence? Where's the xAI presence? They are trying; they just haven't made that much progress yet.

[00:32:12] Jacob Effron: But what the Claude Code moment does show, and it actually in some ways makes you a little more bullish on the potential for someone else to catch up, is that if you're the first to introduce some magical net-new product experience, that might actually be stickier than one would have imagined.

[00:32:27] swyx: Right, right.

[00:32:28] Jacob Effron: And so everyone can believe they have a shot at that.

[00:32:29] swyx: What do you think that new product experience might be? This is a failure of imagination on my part. People always say: the thing that will save us is being first to the next new thing. Like, what is it?
[00:32:41] Jacob Effron: Yeah.

[00:32:42] swyx: It's like...

[00:32:45] Jacob Effron: I don't know, something around a consumer agent plus computer-use hybrid. I think we're obviously just scratching the surface on the consumer side.

[00:32:53] swyx: So my current theory is that OpenClaw is a vision of things to come.

[00:32:58] Jacob Effron: Totally.

[00:32:58] swyx: And it's good that OpenAI has the association with OpenClaw, but by no means do they have the right to win it. The general thesis I have been pursuing is that the same way 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else. Coding agents continue to win, but because they generate software and software eats the world, it's kind of the transitive property: software eats the world, coding agents eat software, therefore coding agents eat the world.

[00:33:30] Jacob Effron: Yeah, and breaking containment is always easier in the consumer context than the enterprise one. You've seen people run these really cool experiments in their own personal lives.

[00:33:37] swyx: Yes.

[00:33:38] Jacob Effron: Obviously everyone's focused now on the enterprise side, on how you create these experiences. On the vibes, people love the narrative that everything has completely shifted. Actually, OpenAI, organizational volatility aside, has great products, a great team, great models. And everyone else in the world is incentivized for there to be two or three more great model companies; everyone would love more of them. So I feel like the natural forces of the world revolt when any one company is too much the star of the show, right?
There are so many people in the ecosystem incentivized for that not to happen. So I'd be shocked if we don't have a reversion of vibes, maybe not completely the other way, but at least a little more equal, at some point over the next six to twelve months.

[00:34:24] swyx: I think there are just different stages. When you talk about the world wanting more model companies, I think about the neolabs.

[00:34:30] Jacob Effron: Yeah.

[00:34:31] swyx: And, I don't know, is it fair to say none of them have really broken through in the past year?

[00:34:35] Jacob Effron: I think that's totally fair.

[00:34:37] swyx: Which is rough. Well, how are we going to grow that diversity of choice? That's the question.

[00:34:46] Jacob Effron: Yeah. It'll be really interesting to see what ends up happening with that. And you've seen folks like Nvidia, who are very incentivized to make sure there's a broader platform of other model providers.

[00:34:57] swyx: People say this, but I don't think they try that hard. Nvidia tries harder to build neoclouds...

[00:35:05] Jacob Effron: Yeah.

[00:35:06] swyx: ...than neolabs.

[00:35:07] Jacob Effron: Well, they try pretty damn hard to build neoclouds.

[00:35:09] swyx: That's...

[00:35:09] Jacob Effron: Yeah.

[00:35:10] swyx: But, you know, the CoreWeaves of the world are in a much happier place than any neolab built on top of them.

[00:35:18] Jacob Effron: Yeah. Though one might argue it's easier to enable a neocloud to be successful; you can't will a neolab into existence the same way.

[00:35:25] swyx: Nvidia has more direct control over it, for sure.

[00:35:27] Jacob Effron: What else is catching your eye today on the startup side?
There's obviously this whole narrative of the foundation models announcing a product and every stock going down 15%.

[00:35:36] swyx: Yeah.

[00:35:37] Jacob Effron: Do you worry about the foundation models eating into a bunch of these startup categories?

[00:35:43] swyx: Not really. There's the point of view of being an investor in startups, and there's the point of view of: do you want to start something? Honestly, the downside for all of these is so minimal, in the sense that the worst case is you just get hired into one of these labs anyway. For people who just do things and try things and try to execute in a competent way, even if it doesn't work out commercially, that's your job interview to go into one of these labs. So I don't feel that worry from a very small startup perspective. Mid-size startups, yes. There has been a lot of dead LLM infra, a lot of LLM infra consolidation, like the Langfuses of the world getting absorbed into ClickHouse. I think people have maybe worked out the domain-specific playbook, and I think that's okay; I'm not that worried about it. I would be more worried about traditional SaaS, low-NPS SaaS. This is the whole AI-versus-SaaS debate that's been going on. And literally I'm going through that exact thing in my own company, so I'm thinking through this on a very visceral level. On one hand you have the people who say: you vibe coders don't appreciate the amount of work that goes into a CRM. You think you can rip out Salesforce?
So did the 30 entrepreneurs before you, right? You classically underestimate the things you don't deeply know, and the target audience is not you. At the same time, we have never been able to build and customize software so easily, and yeah, you're not going to use 90% of the things in Salesforce.

[00:37:33] Jacob Effron: So what have you done internally?

[00:37:34] swyx: The main SaaS we use is for event management and sponsor management, and we pay $200K a year for that. Not huge, but chunky at my scale. And yeah, I could probably spend $2,000 and build a custom version of it. The trick has been dealing with the rest of my team and getting them on board, because I'm the most technical person on my team, but I can't make that decision myself. In the same way, I've been telling other CEOs and team leaders: you can be super Claude-pilled, you can be deep in LLM psychosis and think that's okay, but you have to bring your team with you. I think the widening disparity in LLM psychosis inside companies is causing real rifts. On one hand, the people who are less AI-native are not getting with the picture. They're actually behind; they're not waking up to the fact that everything you think is necessary is not actually that necessary, and in fact you'd be better off if you just held your nose, went in, and came out the other side only talking to agents in natural language. Your life would actually be better, and you're just being close-minded. That's one perspective. The other perspective is: oh, you vibe coder.
You did this in a weekend and got the 80% solution, and now the rest of your employees have to pick up the rest of your s**t. You thought you were so hot and amazing at this, but actually you didn't figure it out, and actually LLMs are still useless at this, and so on. So I think there's this huge debate going on in every company right now. I have a small microcosm of it, and it's making me hesitate to pull the trigger. But I will at some point; maybe I've put it off for one year, but not for five. So SaaS is definitely getting squeezed. It does make me wonder, though: I do think there's an opportunity for a more AI-native system-of-record thing that is not just Postgres or MongoDB, although both are very good. Maybe it's something like Convex; people bring up Convex a lot. I just feel like the quote-unquote Firebase of AI apps isn't really a thing yet, beyond what we have. Which is fine. We could probably start in a more rapid iteration cycle first before scaling up to a Postgres or MongoDB, which are older tech. I was at a dinner with Mike Krieger, the CPO of Anthropic, and we were going around the room asking: what are people most worried about? For me, instead of security, I brought up biosafety.

[00:40:21] Jacob Effron: Yeah, classic.

[00:40:22] swyx: Like I said, it was cliche and classic, and the rest of the table was like: what do you mean, someone sitting at home can manufacture a virus that wipes out half of humanity?

[00:40:32] Jacob Effron: Almost like the OG Geoffrey Hinton "this is why you should be scared."
[00:40:35] swyx: I'm like, yeah, read the risk reports; this is the thing. And Mike was just sitting there, knowing he was sitting on Mythos, going: actually, it's security. I think part of it is very good marketing. Too good, actually; I would advise Anthropic to tune down the marketing, because it is just a very good model and you don't have to make so many marketing claims around it. At the same time, it is not really a private model if you give it to 40 companies, each of whom has something like 10,000 employees. It's not private; there are bad actors in there.

[00:41:18] Jacob Effron: Yeah. Hopefully not as bad as releasing it widely. But it's an interesting case study for how many model releases might go; this might be the first model release that looks like the rest of them from now on, right?

[00:41:31] swyx: So there's an overall product strategy for Anthropic of restricting access and bundling product with model, maybe. Whereas OpenAI has definitely been a lot more philosophically aligned on: we will just enable access everywhere, and we don't know what will come out of it.

[00:41:51] Jacob Effron: Right. Though in this current moment, the cynical take is that it also just ties to the amount of compute the two companies have.

[00:41:56] swyx: Yeah, right, I think that's true. I do think the dawn of larger-than-10-trillion-parameter models is very interesting. I think it's a temporary phenomenon, because we have much larger compute clusters coming online for everyone over the next three to five years.
This is already written in the cards.

[00:42:18] Jacob Effron: Yeah.

[00:42:19] swyx: So will we have rationing of models above 10 trillion parameters in two years? I don't think so; I think everyone will have them.

[00:42:29] Jacob Effron: No, we'll just have rationing of the next phase.

[00:42:30] swyx: Right. But that's as it should be. My classic example, and this is just me theorizing, not anything confirmed by Google: when Google announced Gemini, they actually announced three sizes, Flash, Pro, and Ultra. They never released Ultra; they only have Pro and Flash. So my theory is they have Ultra sitting in a basement and they just keep distilling from it for Flash and Pro. And I actually think that's as it should be for any lab.

[00:43:02] Jacob Effron: Yeah. Just because those are the models people actually want to end up using, and it's cost-prohibitive otherwise.

[00:43:06] swyx: Yeah, it's the cost. It's not the want, it's just the cost. I do think it's interesting that for a while I was considering the theory that models capped out at 2 trillion parameters, and I think that's proving to be wrong. Well, if I'm wrong, how wrong am I? Do we do 200 trillion? Do we do 2 quadrillion? I don't think we have a straight answer to that. But it's interesting that we are continuing to scale parameter counts when everyone can see that we're not going to get the next 1,000x or 1,000,000x from this paradigm. So other labs are working on other model architecture improvements. We need a different scaling law, I guess, because I feel like people already feel we're tapped out on this one.
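The "Ultra in a basement" theory described above is classic knowledge distillation: keep a large teacher model private and train smaller student models to match its softened output distributions. A minimal sketch of the loss involved, assuming vanilla Hinton-style logit distillation (Google's actual recipe is not public, and the logits below are hypothetical stand-ins):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened next-token
    distributions: the loss a small 'Flash/Pro'-style student would
    minimize against a frozen 'Ultra'-style teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

# Hypothetical next-token logits for a single position.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([3.0, 1.5, 0.2])
print(distill_kl(teacher, student))      # small positive number; zero iff they match
```

The appeal for a lab is exactly the point being made: the expensive teacher never has to be served at inference prices, while the students inherit much of its behavior at a fraction of the cost.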
Like, the end state of this is we turn most of the world into data centers, and, like, I don't know. I don't know if we want that. [00:44:08] Jacob Effron: Yeah, I mean, uh, if the returns on intelligence are there, maybe, uh, maybe not so bad. [00:44:13] swyx: I think there's just a sheer amount of, like, unscalability that is wrangling people's sensibilities right now. Um, especially in terms of context lengths. Um, my classic quote is that context length is the slowest-scaling factor in LLMs. [00:44:30] Jacob Effron: Yeah. [00:44:30] swyx: Um, we took maybe three years to go from, like, 4,000 context length to a million, and that's about it. Yeah. Like, Gemini has had a million-token context length for two years now. Um, and no one's using it. So, like, yeah, it's memory. Memory is probably gonna be the biggest limiting constraint on all these things. [00:44:50] Jacob Effron: Yeah. Certainly seems that way. I guess I'm curious, over the last year since you recorded last, like, what's one thing you've changed your mind on? [00:44:57] swyx: I feel like I was kind of bearish on open models, like, last year. Um, in a sense of, like, I had just done the podcast with Ankur [00:45:07] Jacob Effron: Yeah. [00:45:08] swyx: of Braintrust, where he, I mean, you know, he has a good cross-section of all the top AI companies, and he says market share of open source is 5% and going down. Um, I think that's changed. I think it's going up. Um, and even if, [00:45:22] Jacob Effron: even though the capability gap does seem to be increasing. Spending on the [00:45:26] swyx: time. It's hard to tell. Yeah, it's, it's really hard to tell
Shopify's AI Phase Transition: 2026 Usage Explosion, Unlimited Opus-4.6 Token Budget, Tangle, Tangent, SimGym — with Mikhail Parakhin, Shopify CTO
Early bird discounts for the San Francisco World's Fair [https://www.ai.engineer/wf], the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP! From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection [https://www.latent.space/p/wtf2025?utm_source=publication-search], and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability. We also go inside Tangle [https://shopify.engineering/tangle], Tangent [https://apps.shopify.com/tangent-1], and SimGym [https://apps.shopify.com/simgym], three major AI initiatives Shopify is building to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP [https://www.shopify.com/ucp] and Liquid AI [https://www.liquid.ai/blog/liquid-ai-announces-multi-year-partnership-with-shopify-to-bring-sub-20ms-foundation-models-to-core-commerce-experiences], and covers why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify's customer simulation defensible, and what he learned from the Sydney era at Bing.
We discuss: * Mikhail's path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify * Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company * Shopify's internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools * Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output * Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation * Why AI coding can still lead to more bugs in production even if models write cleaner code on average than humans * Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point * How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era * Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed * What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start * Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams * What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more * Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers * Why AutoML finally feels real in the LLM era, and where auto-research still falls short today * Why Tangle, Tangent, and SimGym become much more powerful when combined into one system * What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify's data gives it a moat * How SimGym evolved from comparing A/B variants to telling merchants what to change on a single
live storefront to raise conversions * Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs * How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications * Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice * Shopify's new UCP and catalog work, including runtime product search, bulk lookups, and identity linking * Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice * Where Liquid already works inside Shopify today, from low-latency query understanding to large-scale catalog and Sidekick Pulse workloads * Whether Liquid could become frontier-scale with enough compute, and why Shopify remains pragmatic and merit-based about model choice * Who Shopify is hiring right now across ML, data science, and distributed databases * The Sydney story at Bing, why its personality was not an accident, and what Mikhail learned from deliberately shaping AI character early on Mikhail Parakhin * LinkedIn: https://www.linkedin.com/in/mikhail-parakhin/ [https://www.linkedin.com/in/mikhail-parakhin/] * X: https://x.com/MParakhin [https://x.com/MParakhin] Timestamps 00:00:00 Introduction: Mikhail Parakhin, Microsoft, and Shopify 00:01:16 Why Shopify Is Talking More About AI 00:02:29 Internal AI Adoption at Shopify and the December Inflection 00:06:54 Token Budgets, Jensen Huang, and Why Usage Metrics Can Mislead 00:10:55 Why Shopify Built Its Own AI PR Review System 00:12:38 AI Coding, More Bugs, and the Real Deployment Bottleneck 00:14:11 Why Git, PRs, and CI/CD May Need to Change for Agents 00:18:24 Tangle: Shopify's Reproducible ML and Data Workflow Engine 00:21:19 Why Tangle Is Different from Airflow 00:26:14 Tangent: Auto Research for Optimization and
Experimentation 00:30:07 How Tangent Democratizes Experimentation Beyond ML Engineers 00:33:06 The Limits of Auto Research 00:36:36 Why Tangle, Tangent, and SimGym Compound Together 00:37:20 SimGym: Simulating Customers with Shopify's Historical Data 00:42:47 The Infra Behind SimGym 00:46:00 Why SimGym Gets Better with Real Customer History 00:47:30 Counterfactuals, HSTU, and Modeling Merchant Trajectories 00:51:55 CRPs, Clustering, and Category-Level Customer Behavior 00:53:30 UCP, Shopify Catalog, and Identity Linking 00:55:07 Liquid AI: Why Shopify Uses Non-Transformer Models 00:59:13 Real Shopify Use Cases for Liquid 01:03:00 Can Liquid Scale into a Frontier Model? 01:09:49 Hiring at Shopify: ML, Data Science, and Databases 01:10:43 Sydney at Bing: Personality Shaping and AI Character 01:13:32 Closing Thoughts Transcript [00:00:00] swyx: Okay. We're here in the studio, a remote studio, with Mikhail Parakhin, CTO of Shopify. Welcome. [00:00:08] Mikhail Parakhin: Thank you. Welcome. [00:00:10] swyx: I don't even know if I should introduce you as CTO of Shopify. I feel like you have many identities. Uh, you led sort of the Bing ML team, I guess, uh, or ads team. I don't know, people variously refer to you as, like, CEO, or, uh, I don't know what that previous role at Microsoft was. [00:00:29] Mikhail Parakhin: Uh, that was... Yeah, my previous role at Microsoft was the-- I actually was the CEO of one of Microsoft's business units, which included, as I, you know, as we discussed, all the things that people like to laugh about, uh, including Windows and Edge and Bing and ads and everything. [00:00:47] swyx: Yeah, yeah. What a, what a wild time. You've obviously, uh, done a lot since you landed at Shopify.
Uh, one of the reasons I reached out was because you started promoting more sort of internal tooling, uh, primarily Tangle, but also a lot of people have seen and adopted Tobi's QMD, uh, and obviously, I think, uh, Shopify has always been sort of leading in terms of, uh, engineering. I think it's just more recent that you guys have been more vocal about your sort of AI adoption. Is that, is that true? [00:01:16] Mikhail Parakhin: Well, I think AI tools in general are a fairly recent development, uh, and we've-- Shopify, you know, at this stage of its development, we're developing AI in-house and, uh, building tools that use AI and, you know, interfacing with the wider AI community, uh, you know, we're on this sort of, uh, runaway trajectory. So it's just a sort of natural byproduct that we talk about it more, also. Just even yesterday, Andrej Karpathy famously tweeted about, oh, are there some, uh, ways, uh, that you can organize your agents to store the data and then, uh, look up the data so that you don't have to re-research, or lose context every time. And a little bit tongue-in-cheek, I tweeted that, "Hey, we've done it much earlier, and we even have different approaches, Tobi and I." Tobi, of course, is a big fan of QMD, and I'm more of a SQL, SQLite fan. But, uh, yeah, very similar things that we've already done here. The point is, yeah, we're a very dynamic, you know, explosively growing company, and we have to be at the forefront of AI adoption, obviously. [00:02:29] swyx: Yeah. Yeah. Um, your team kindly prepared some slides, actually, that we were gonna bring up on, uh, the screen. I think I can screen share, and then we can kind of go through some of the shocking stats that maybe, maybe put some numbers to what exactly is going on. So here we have, uh, an internal AI tool adoption chart. What are we looking at here?
[00:02:54] Mikhail Parakhin: Yeah, this is a very interesting statistic. Uh, this is the number of daily active workers, you know, think of, uh, DAU, basically the active users of [00:03:05] swyx: Yeah... [00:03:05] Mikhail Parakhin: an AI tool as a percentage of all the people in the company, right? And then [swyx: Yeah...] different AI tools. And, uh, you could see two things here. One is that the green is total. Uh, green is just total. So you could see that it approaches really % by now. It's hard to do your job now without interacting deeply with at least one tool. Another interesting thing you could see is, just as many people commented, December was the phase transition, when suddenly models got good enough that everything took off and started growing. Uh, many people noticed it; the thing is that small improvements accumulated into this big change in the December timeframe, roughly. [00:03:52] swyx: Yeah. [00:03:52] Mikhail Parakhin: The other thing I would claim you could see is that, uh, CLI-based tools, and tools that don't require you to look at the code, are becoming more popular, and you could see, yeah, various versions of, uh, Claude Code and Codex and Pi and internal development tools taking off. Uh, exactly, yeah, uh, and blue is our River, our internal agent for coding. Whereas tools, uh, that require IDEs, such as, uh, GitHub Copilot or Cursor, they're not exactly shrinking, but they're not growing as fast. Like, uh, the red line is the IDE kind of tools. So you could see that they're not experiencing as fast of a growth. [00:04:37] swyx: As I understand it, basically every employee has their choice, right? Of, choose whatever tool you use, and then you're just kind of doing a, a daily survey or something. [00:04:47] Mikhail Parakhin: Exactly. And, uh, we [swyx: Yeah...] the push is to get your job done, you can use any tool, and we effectively fund unlimited tokens for everybody.
Uh, we do try to control the models that, uh, people use, but from the bottom, not from the top. Like, we basically say, "Hey, please don't use anything less than Opus four point six." [00:05:09] swyx: Oh. [00:05:10] Mikhail Parakhin: Some people end up using GPT five point four extra high. Some people use Opus four point six. Um, uh, you know, uh, there are some pluses and minuses in going for the full one-million context window versus not. But, uh, we try to discourage people from using anything less than that. [00:05:28] swyx: Yeah, yeah. Got it, got it. Uh, I mean, uh, that's, you know... The next chart here, it really kind of shows the expansion and the sort of December twenty twenty-five inflection, right? That, uh, people are using a lot of tokens. I think it's also really interesting that no one was kind of abusing it in twenty twenty-five. Like, compared to this year, there was almost no growth. I mean, it still, like, you know, probably, probably grew fifty percent. [00:05:56] Mikhail Parakhin: Yeah. This is just a different scale. It's still exponential [swyx: Yeah, yeah...] growth, at just a different rate of expansion. Uh, there was an inflection point, and Sean, I would claim the super interesting part here is that you could see the distribution becoming more and more skewed. Yes. The top percentiles grow faster. So that means [swyx: Yeah...] the people in the top ten percentile, their consumption grows faster than the seventy-fifth percentile, and so forth. So, uh, the distribution skews more and more towards the highest users, which is... I don't know what it tells me. It feels not ideal, to be honest. Or maybe it's okay. We'll see. [00:06:36] swyx: Why does it feel not ideal? Is it because of, um, quantity over quality, or what's the concern? [00:06:42] Mikhail Parakhin: Because take it to the limit.
That means, you know, if this rate of separation continued [swyx: Ah, yes...] in a year, there will be one person consuming all the tokens. So it's just, it's kinda strange. [00:06:54] swyx: Yeah, I mean, um, uh, I think internal, like, teaching and all that, uh, will help sort of distribute things more widely. But in the early days, of course, the people who are more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled. Maybe let's call it that. I'll just kinda quickly, uh, pause from the... You know, we will go back to the rest of the slides, but I just wanna, um, review: you know, there are a lot of CTOs of large companies like yourself who are all considering some kind of token budget, right? Like, I think it's something that Jensen Huang has been talking about, where, like, if your $200K engineer is not using 100K of tokens every year, like, they're underutilizing coding agents. Of course Jensen Huang would say that, but, like, it seems a very quantity-over-quality approach, and, like, some people are basically saying, well, is this comparable to judging engineer quality by lines of code, right? Which we also know is, like, kind of flawed, but better than nothing. So I don't know if you have, like, a sort of management take here on how to view this kind of, uh, metric. [00:08:02] Mikhail Parakhin: Well, I mean, you're baiting me. I, I like... This is my favorite topic. Uh, if you let me, I'll probably talk for two hours on just this. I have a lot of things to say. Like, I do think Jensen's gotten a lot of bad press: "Oh, of course, you know, the cake seller says you don't have enough cakes." You know? Like, of course. Uh, but, uh, I actually, uh, think that's undeserved. I think he's actually right. Uh, I do think- He, [00:08:33] swyx: he's directionally correct. [00:08:35] Mikhail Parakhin: Yeah. Yeah.
He's directionally correct for sure. Uh- [00:08:37] swyx: Who knows what the right number is? Yeah. [00:08:39] Mikhail Parakhin: The thing that I do want to say, and this is something that we learned through trial and error, and it's very important, is, like, two things. One is that it's not about just consuming tokens. Uh, you can consume tokens, and, in fact, the anti-pattern is running multiple agents, too many agents in parallel that don't communicate with each other. That's almost useless, uh, compared to just fewer agents, and it burns tokens very fast. Uh, setting up the right critique loop, especially with the high-quality models, where one agent does something, the other one, ideally with a different model, critiques it, uh, suggests ways to improve it, the agent redoes it with this critique, and so on, takes much longer. So people don't like it, because latency goes up. You know, they have to wait until this debate has happened. But, uh, the quality of the code is much higher. And another thing, just since you mentioned, like, look, uh, the overall budget is just like, uh, lines of code. Lines of code are exploding for everybody right now, partially because AI is really more verbose, but partially just because AI can write a lot more code, you know, doesn't get tired. And so you have to have a very strong narrow waist during PR review. Otherwise, just the number of bugs will go through the roof. It's, uh, it's this unexpected consequence of just volume trumping everything. I would claim by now a good model writes code, on average, with fewer bugs than the average human. But since they write so much more of it, like, more of it will make it into production. So you have to- You still [00:10:26] swyx: have [00:10:26] Mikhail Parakhin: more bugs. Yeah. You have to have very rigorous PR reviews, also automated, of course. But, uh, yeah, you have to spend a lot of budget there.
Like, this for me, actually, the important metric is the ratio of budget spent during code generation versus, uh, spent on expensive tokens, like GPT five point four Pro or, uh, Deep Think from Gemini, you know, checking on PR reviews. [00:10:55] swyx: Yeah, totally. Uh, I noticed in your chart you didn't have any review tools. Do you just use, like, let's say, Claude Code to review? Or do you have another set of review tools, like the Greptiles, the CodeRabbits; uh, Devin has a review tool. I don't know if you've tried those specialist review tools. [00:11:13] Mikhail Parakhin: You are a little bit jumping ahead of my story right now, because the graphs were only showing public tools. Uh, the-- I haven't found a good PR review tool that does what I think should be done. And, uh, partially my thinking is because it's so... It just goes against both what people feel, like, emotionally they prefer, and, uh, frankly, even the business models that the companies run. At PR review time, you want to run the largest models. That means, I don't know, Codex or, uh, Claude Code is not gonna cut it. You need to have pro-level models if you really want to stem the tide of bugs from going into production. And you need to spend a lot of time, the models taking turns, but you don't want, like, a big swarm of, uh, agents. So, in fact, you end up in a different, dualistic world where you generate not that many tokens. You, in fact, generate few tokens, but it takes a long time, because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. So that's why I feel like I haven't found good tools, so we are using our own for PR review for now. [00:12:33] swyx: Yeah. Yeah. I mean, uh, I think a lot of companies are building their own, uh, especially to their needs, right? [00:12:38] Mikhail Parakhin: Mm-hmm.
[00:12:38] swyx: Um, uh, you also have a chart here, going back to the slides, on, uh, PR merge growth, where we're now at thirty percent, uh, month on month rather than ten percent. Uh, and also the estimated complexity is going up. You know, this is productivity, right? 'Cause presumably there's more stuff going into the code base and more features getting worked on. I'm curious about the backlog, right? Like, I actually don't mind a pro-level model taking an hour or two hours to review my PR, because I've dealt with humans who take a week to review my PR, right? And I keep pinging them on Slack, "Hey, hey, review my PR." So, you know, I think there's some trade-off here where, like, it still doesn't make sense. [00:13:18] Mikhail Parakhin: Exactly. That's exactly my point. Uh, that, on one hand, you can tolerate longer latencies at, uh, PR review. On the other hand, like, right now, the real problem is not in spending time waiting for PR review. The real problem is, since there's so much more code [swyx: Yeah...], the probability of at least some tests failing goes up, and then the tests keep failing, and then you have to find the offending PR, evict it, retest without that PR, and so the deployment cycle becomes much longer. Uh, so actually, in terms of the overall time to deploy, it's a total time savings if you spend more time on a larger model, like, thinking for an hour, because then you don't have to spend all that time during testing and rolling back the deployment. [00:14:03] swyx: Yeah, totally. That's still worth it. You know, you don't look at the individual, you look at the aggregate, and look at the change in the aggregate system. [00:14:11] Mikhail Parakhin: Exactly. [00:14:11] swyx: I'm kind of curious if, like, this PR mentality and, like, the CI/CD paradigm will be changed eventually.
Some people, like, obviously a lot of people want a new GitHub, but I even wonder if, like, Git is the problem, right? Like, is that the bottleneck? Is the concept of a PR a bottleneck? Do you guys use stacked diffs? I don't know if, uh, that's, like, a merge-queue, stacked-diff type of thing. [00:14:34] Mikhail Parakhin: We use stacks, we use Graphite. We worked with, uh, Graphite a lot. Uh, so we use stacked PRs. I think, uh, clearly the overall CI/CD in general, and the interaction with the code repository right now, is clearly the main issue and the bottleneck for us, uh, and highest top of mind. I would say we probably need a different metaphor, or a different whole design of how to process it in the new agentic world. I haven't seen anything dramatically better yet. I think everybody right now is just trying to keep their head above the water, 'cause there's so many PRs, and then everybody's CI/CD pipelines start creaking, the times are increasing, the number of bugs slipping by is increasing, and you have to clamp down. And so we are a little bit in this situation where we need to first stabilize that story, and then start thinking, hey, what could be a completely different and new world, which I haven't... I know some people working on it. I haven't seen anything, like, anything super compelling yet, but clearly the old things were designed for humans and will need to be morphed into something new. [00:15:53] swyx: One of the things that I think about is, kind of, like, the merge conflict is basically a global mutex on the whole system, right? And in human organizations, we do have something like that. It's the company standup. But, like, other than that, it's actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information source, but somewhat lossy.
Like, it's okay, you know, that not every delivery is, like, atomic consistency. Like, we're not dealing with a database sometimes. [00:16:27] Mikhail Parakhin: This is a very good point, uh, because since humans don't write code too fast, you know, that global mutex is not too bad. Once you [00:16:36] swyx: Yes... [00:16:37] Mikhail Parakhin: start writing code at the speed of machines, it becomes, you know, the bottleneck. Then what do you do? Maybe, and I can't believe I'm saying this, because I'm a lifelong opponent of, uh, microservices, and I always thought that was, like, a really bad idea. And now that you're saying it, like, maybe in the new world, microservices will make a comeback, you know, because then you can ship things independently, in tiny pieces, and managing all that complexity automatically will be much easier. I don't know. Like, we'll have to see. [00:17:10] swyx: Yeah. I mean, I don't know what the Microsoft or Shopify thing is, but I read this paper from Google where they have a monorepo that deploys into microservices, right? And then, uh, the other concept that I think about a lot is the Chaos Monkey concept from Netflix. Being able to create, like, this robust system where, um, uh, you know, you have the service discovery, you have the independent microservices discovery, and, uh, you know, probably going to be a fair amount of duplication. That's how an organic system sort of scales, uh, that you have that... I don't know how you call it. Slack? Robustness? Uh, d-duplication. I forget the-- and these are not exactly the terms [Mikhail: Hmm] I'm looking for, but I can't really think of the words. Okay. I was gonna go into Tangent and Tangle. Uh, so we sort of discussed the overall stats that, uh, Shopify has.
Uh, but, you know, I think some pretty cool stuff that you guys are working on is your ML experimentation, uh, and your sort of auto-research training pipeline. Presumably you're much closer to this one, because it's a sort of personal hobby of yours. How would you explain them together? I thought we have a slide that, like, uh, has the system diagram. [00:18:24] Mikhail Parakhin: Yeah. Tangle first, and then Tangent as a [00:18:27] swyx: Yeah... [00:18:28] Mikhail Parakhin: as a thing on top of Tangle. And, uh, Tangle is the third generation, I claim, of systems for running any data processing, with a bit of a skew toward ML experiments, but not necessarily. Any sort of data processing task where you need to iterate, share, and you have scale, so that you want maximum efficiency. You know how, like, normally you would work? Imagine you're a data scientist or an ML practitioner. You would get Jupyter notebooks, or maybe you would get, uh, your Python scripts, and you would manage the data, and you produce those TSV files, and you put them in some JFS or something. Then you would notice that, oh, it has these, uh, weird missing values. You go and write another script that, uh, goes and replaces them with, uh [00:19:20] swyx: Ah... [00:19:21] Mikhail Parakhin: dash S. And then you run some, "Oh, I need to filter bots." And so you run some LightGBM model that, uh, removes the bots. And then-- And then you kind of get into shape, and then you start experimenting, and you run multiple experiments, and then you're like, "Oh my God, this experiment is worse." You undo, and you cannot get to the previous result. And, like, "Ah, what did I do?" Then you finally get everything working. Then you start throwing it over the fence to production.
You replicate it, those things don't work, and then sometimes you don't notice that you forgot some feature naming, and the features don't match. But then, imagine you did everything, and then six months later you have to repeat it because now there's more data, or you wanted to do another pass, and you're like, "What did I do?" Or, like, "This script crashes now," or "the path has changed." And then you're trying to, like, you spend another month just doing digital archeology on your own, you know, history, right? Now multiply that by many, many teams. Now imagine you got an intern that you wanna ramp up. Now you have to show that intern, "Oh, you know, look, here's the folder, there's the scripts, you know, ask your Claude agent to, uh, figure it out." And then the Claude agent does something, and then you're, "Ah, yeah, right, right, it was the wrong folder. I forgot to tell you, I actually have this other thing I forgot myself." And that's, that's, like, the daily life we all know, uh, if you're a data scientist, machine learning practitioner, or, uh, even, like, any data-managing person. [00:21:00] swyx: Yeah. So I used to do this, uh, on the quant finance side, uh, in my hedge fund. So we did this before Airflow, and then, uh, obviously Airflow came along, and, uh, then more recently Dagster, uh, I would say, is, like, in my mind, what I would use for that shape of problem, uh, where you have to materialize assets and create a pipeline. [00:21:19] Mikhail Parakhin: And that's a very good segue, because... So Airflow is great, but Airflow is more about: you have something and you wanna repeatedly run it in production on schedule.
It's less about you as a team developing things and being able to share, and you grabbing the standard pipeline and saying, "Hey, I wanna change this tiny little component in the huge sea of data processing, and I wanna run ten experiments on this, and I wanna do hyperparameter optimization." All that is very hard to do with Airflow. It's very easy to do with Tangle. Tangle is more about, it's everything about a group of people running experiments, and it might be agents, too, nowadays. Uh, running experiments cheaply, collaborating, sharing results. Uh, you don't need to understand it fully. You grab, you clone somebody else's experiment or somebody else's pipeline, uh, change a small piece, run it, get it to production state, and then ship in one click. So then you don't have to port it into any other system to run in production. You can just run the same experiment. It's fully production-ready. And it has lots of... Again, as I said, it's a third-generation system. The original one, I would claim, was Aether; at least in my career, Aether was the first, uh, that pioneered this type of approach. And then there was, uh, Nirvana at Yandex, which did kind of a second take on this. And now this one aggregates the learnings from all of those, and Airflow as well, to get to the state where, when you try it, it feels kind of magical. Uh, 'cause now everything is based on content hashes. So even if the version changed, but the output didn't change, nothing is rerun. It's very efficient. If multiple people start experiments that need the same sort of data preprocessing, it's not repeated multiple times. It's automatically done only once. If you start ten experiments that all require, you know, some data preparation first as the first step, you don't have to coordinate for that.
You don't even have to know that other people are starting it. There's very easy composability, any language you want to use, and it's very visual. You can see everything immediately, edit it easily, assemble small things with just mouse clicks if you want to, and share and clone. And it's fully deterministic, in the sense that if you rerun it a second time, it will have exactly the same results. You will never have to do digital archeology. Full versioning and everything is also there.

[00:24:06] swyx: So people can check it out. It's open source; go to the GitHub repo. There's also a really good blog post about it. I think all of this is really appealing. The thing that sells me the most is that development-to-production transition, right? A lot of people haven't really solved that strictly. We develop really well in Python notebooks, but that's obviously not a production-ready process, so any way in which that is solved is very appealing. The other thing you mentioned that raised my eyebrows was content-based caching, which is very much an efficiency measure: recalculation happens only on content addressing, which makes sense. It surprised me that the savings could be this much, but maybe I just haven't worked at your scale, where there's so much duplication that people rerun things because they changed a single ID upstream.

[00:25:10] Mikhail Parakhin: It does, yeah. But it's not only that you rerun. The main savings come from the fact that you ran it, you got your job done, and you moved on. Then...
...somebody else in some department you don't know existed runs the same task, but on a newer version.

[00:25:27] swyx: Yeah.

[00:25:27] Mikhail Parakhin: Right now, in most organizations, you can't even find out about it, so you can't even measure that you're spending that time twice, right? Here, if everybody's on Tango, that's detected automatically, and it's detected that the output is the same. For that person, it just looks like their experiment suddenly jumped forward, right? That's because there's a network effect of multiple people helping each other.

[00:25:51] swyx: Yeah. This is one of those things where it's designed to be a platform from the beginning, rather than an individual developer's tool from the beginning, and everything streams down from there. That is the Tango orchestrator, and it manages jobs. We've seen a few versions of this, and these are obviously the unique approaches that you guys have figured out. And then there's Tangent.

[00:26:14] Mikhail Parakhin: Yeah. Tangent is basically an automatic research loop that can help and kind of do your work for you. Effectively, Andrej Karpathy recently popularized it as auto research. Remember he said he was speed running this... Yeah, you know the story. Here we're basically bringing the same capability into Tango. Tangent is just an agent that can run multiple experiments, figure out what can be changed, and keep rerunning, keep modifying, maximizing some goal, some loss function, whatever you need to achieve. And in general, I would say if you're not using an auto-research-like approach in whatever you do, literally whatever you do, then you're missing out.
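A toy version of the auto-research loop being described: an agent proposes a change to the current configuration, the experiment is run, and the change is kept only if the metric improves. The objective function and the mutation rule here are invented for illustration; in Tangent the proposer would be an LLM agent editing real pipeline components, not a random perturbation.

```python
import random

def run_experiment(cfg):
    """Pretend metric: best at lr=0.1, batch=64 (illustrative stand-in)."""
    return -((cfg["lr"] - 0.1) ** 2) - ((cfg["batch"] - 64) / 64) ** 2

def propose(cfg, rng):
    """Mutate the current best config slightly (stands in for the agent)."""
    new = dict(cfg)
    new["lr"] = max(1e-4, cfg["lr"] + rng.uniform(-0.05, 0.05))
    new["batch"] = max(1, cfg["batch"] + rng.choice([-16, 0, 16]))
    return new

def auto_research(cfg, budget=400, seed=0):
    rng = random.Random(seed)
    best, best_score = cfg, run_experiment(cfg)
    for _ in range(budget):           # most proposals fail; a few move the metric
        cand = propose(best, rng)
        score = run_experiment(cand)
        if score > best_score:        # keep only improvements
            best, best_score = cand, score
    return best, best_score
```

Even with a low hit rate per proposal, as in the "four hundred experiments, one success" anecdote later in the conversation, the loop improves the metric because the machines, not you, absorb the failed attempts.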
We saw it take off at Shopify like a wildfire: anything where you can put measurements in place can be done dramatically better. Our...

[00:27:19] swyx: Mm-hmm.

[00:27:20] Mikhail Parakhin: ...our speed of templatization of HTML, completely new UX templatization, reducing latency for Liquid themes. Our search: recently we moved, it's hard to even quote it, from eight hundred QPS to forty-two hundred QPS with the same quality, just from pure optimizations by an auto research loop that kept running and changing code in our index serving, on the same number of machines, just increasing throughput. We managed to improve the quality of gisting in our machine learning process. Gisting is the prompt compression technique that

[00:27:59] swyx: allows for

[00:28:00] Mikhail Parakhin: lower latency and actually slightly higher quality. So, literally, all different walks of life, and it doesn't have to be AI-related. We had a reduction in storage, because the agents would go and find datasets that are clearly derivative, and then you don't need to store things twice. We found, somewhat embarrassingly, that one of the largest tables was hashing random IDs into another random ID, and we literally...

[00:28:36] swyx: Oof.

[00:28:37] Mikhail Parakhin: ...kept only one. It was translating two random IDs hashed into each other.

[00:28:37] swyx: So it has access to the code as well, so it can check what the hell it is doing?

[00:28:42] Mikhail Parakhin: It can be run at two levels. At the superficial level, it can just use existing components and reshuffle them. You can grab XGBoost, grab some PyTorch module, grab other tools, and combine them.
At a deeper level, since Tango is all CLI-based underneath, every component is really a wrapped CLI call and a YAML file, so it can analyze code and create new components, and keep iterating as well. So you can have quick modifications of existing pipelines with components that are already there, pre-baked, or you can create new components and

[00:29:29] swyx: Yeah...

[00:29:29] Mikhail Parakhin: keep iterating on those. So auto research, again, this is probably the thing I've been most excited about in the last two months, and we see it taking off like a wildfire. Every day, every minute, I have somebody Slack messaging me, "Oh, look how much better I made it." And it's all through auto research.

[00:29:53] swyx: Is this democratized in some way, in the sense that is it your ML engineers and researchers doing this, or do your regular PMs and software engineers also have the ability to use Tangent?

[00:30:07] Mikhail Parakhin: This is an awesome question. Tango in general, and Tangent in particular, are extremely democratizing.

[00:30:15] swyx: 'Cause I don't need to know the details.

[00:30:16] Mikhail Parakhin: Yeah, exactly. They were initially used by ML and AI engineers, but then, literally as you said, the highest user right now is one of the PMs in our org, Sartak. He was number one by usage, because he's energetic and knowledgeable, and it unlocks a lot of capability when you don't have to change code manually.

[00:30:39] swyx: I mean, it kind of cuts the ML engineer out of the process, because the PMs have the domain knowledge and the ability to think from first principles about: okay, what results do I want?
And they even have access to the data that needs to go in. So in some ways this is the magic black box that we've always wanted for training and, I guess, hill climbing, whatever.

[00:31:04] Mikhail Parakhin: It's basically Claude Code for your AI development situation, right? Now you don't have to know exactly how the algorithms work. You can just bring your domain knowledge, expertise, and product knowledge, and iterate within Tangent until you've gotten the results that you need.

[00:31:21] swyx: In my previous roles, every time someone pitched AutoML, I was always like, "This is not going to work; it's always going to be a flop." Somehow it's working now. Presumably the answer is that now we have LLMs and they're good enough, right? It's an emergent property that we can do auto research. But it doesn't feel that satisfying. How come we didn't do this before, right? We just did parameter search and... I don't know. Maybe that's it.

[00:31:48] Mikhail Parakhin: Yeah. Bayesian optimization and hyperparameter optimization was the facet of AutoML that was used very actively, and incidentally it's also built into Tango. But, you know, I know Patrice Simard very well, and he was such a proponent of AutoML; he literally spent careers trying to democratize it. Without LLMs, it just turned out to be very hard. You would have flexibility within a certain narrow domain, but it was hard to scale wider. Now, with LLMs, suddenly it's like a magic wand, and suddenly everybody is an AutoML expert.

[00:32:28] swyx: Yeah, I think it's multiple things, right? I'm just going to bring up the chart again. LLMs can do the monitoring very well, the potentially unbounded, super unstructured part.
It can do the analysis very well. Basically, it's much more intelligence poured into every single step. Maybe nothing structurally changed about AutoML; this is just more intelligent and more unstructured.

[00:32:53] Mikhail Parakhin: Exactly.

[00:32:54] swyx: Any flaws that you've run into? Everyone is drinking the Kool-Aid: oh my God, time savings, performance improvements. What issues have come up?

[00:33:06] Mikhail Parakhin: This is really cool, but it's not a solution to all the world's problems, for sure. The limitations are usually... and this is where we get into a bit of subjective territory. I can only share what I've seen so far, and I'm sure the situation is changing; maybe after I say it, many people will reach out and say, "Hey, what about this? You didn't know that," and they'll probably be right. But what I've seen is that auto research is very good at doing kind of obvious things that you don't have the bandwidth to do, or didn't notice, or maybe aren't aware of, like some standard practices. It is not good at doing something completely out of distribution, something you have to think about for multiple days. So I set up an experiment once, on my hobby project, and I let it run for what ended up being several weeks, at full production scale, so slow runs. In the end, it performed over four hundred experiments, and only one was successful. I'm like, "Okay, that's good." But...

[00:34:18] swyx: But it saved time.

[00:34:19] Mikhail Parakhin: Yeah, it saved time. If I were doing four hundred experiments myself, my batting average, as I said, would have been much higher, I'm sure.
But also, first of all, it would have taken me three years to do four hundred experiments. And I didn't have to do them; the machines and the price of electricity did that. And I got one improvement. Honestly, when I was starting that experiment, my thinking was to go and show, "Hey, Andrej, maybe you just don't know how to optimize." And I was feeling super smart, because my problem had been optimized for many years; it was fully tuned. I didn't expect auto research to find anything at all. Yet it did. So instead of making fun of Andrej, I ended up a big, big supporter. Yeah, that's exactly the tweet. Yes.

[00:35:10] swyx: You and Tobi really go back and forth online a lot, which is really funny. Think of it as an eval for the optimality of the code it's running on. It almost reminds me of a Kolmogorov complexity thing; there's some optimal thing that you're trying to reduce down to, I guess. So you should congratulate yourself that you had ninety-nine percent optimality.

[00:35:36] Mikhail Parakhin: Exactly, yeah. I think Andrej really deserves a lot of credit for popularizing this approach. This is incredibly powerful and cool, and even him just mentioning it led to a lot of gains in a lot of places in the industry, so we should be thankful.

[00:35:56] swyx: Yeah. I think he also has... I don't know what it is. It is a simple, self-contained project that people can take and apply to other things, which is one thing, but there's also just the name. Somehow no one else managed to call their thing "auto research." Naming things is very important. I think that is mostly our coverage of Tango and Tangent.
I think obviously there's a lot of ML infra at Shopify that people can dive into. We're about to go into SimGym, but before I do that, any other broader comments around this whole effort? Where is it leading?

[00:36:36] Mikhail Parakhin: As a segue to SimGym: all those things start composing strongly. You see a huge unlock when you look at each one of the tools: they're extremely useful on their own. Tango is useful by itself. Auto research is useful by itself. SimGym is useful by itself. If you combine all three, you create a synergetic effect. I think that's why we wanted to cover them today, because if you go back even five years, this would have been unthinkable. Replicating it would have been either incredibly costly or impossible, right? Probably thousands of people would be required.

[00:37:20] swyx: Well, we have serverless intelligence now, right? So yes, you do have thousands of intelligences, just not humans. And that's close enough. Even if they're not AGI, they're close enough to do the tasks that you need them to do, and that's plenty for a lot of routine knowledge work. Okay, let's get into SimGym. This is one of those things I was surprised to see is apparently one of your most popular launches. I think of Sim AI, and Joon Sung Park, who did the Smallville thing; there's a small cottage industry of people trying to do the "simulated customer" thing. A lot of people maybe don't super trust this yet, because they're like, well, obviously the agents would just do what you prompt them to do, right? But tell us about the inspiration or origin story.
[00:38:10] Mikhail Parakhin: That's exactly the thing I wanted to cover, because if you don't have the historical data, all you can do is prompt agents in a vacuum, and they will do exactly what you prompt them to do. In fact, when I first proposed it, and this is a bit of my brainchild initially, if I can boast, even Tobi said, "But wouldn't they just repeat what you tell them?" And I'm like, "Yes, except Shopify has decades of history of how people made changes and what it resulted in, in terms of sales." So now what we can do is use that. It's noisy data: on websites, things are never in isolation. It's almost never an A/B experiment; it's always an A/A experiment, in the sense that at different times you run two different things. But if you aggregate everything together, and you apply denoising and a collaborative-filtering-like approach, you can extract a very clear signal. And then you can optimize your agents. That's why it took so long: almost a year of that optimization, of us just sitting and fiddling. We had internal goals for correlation; the internal goal was to hit zero point seven correlation with add-to-cart events, for example. So that if we ran a real A/B test, it should replicate the same sort of success, or lack thereof, that humans had. It took forever, and I don't think it's easily replicatable, because who else would have that data? You have to have those decades' worth of history. And the other thing you need is infrastructure and scale, right?
Because, again, what we found is that to get stat-sig results, you need to run a lot of simulations, a lot of agents, and those are expensive. You're taking actions in a real browser, because you want real friction. You want to get the image of what humans will see, because you want to detect effects like, "Hey, if I make my images larger, will I have more sales or fewer sales?" Usually people's intuition here, by the way, is: if I increase my images, I will have more, because they look nicer. Designers all love sparse layouts and big images. Usually your sales tank, right? But from the HTML, all the characters look the same; only the size tag looks different. So it's very hard. You have to take the visual information, you have to run this in a simulated browser environment on a big farm, and of course you have to have a very good, quite expensive, multimodal model. All of this is what's taken so long. And to share a personal fail there, Sean: we always had this large-company bias. Whenever we do something, we're like, "Hey, we'll run an experiment." We make a change, we run an experiment, and we see which one's better, or, "No, this is worse," and most of them are worse, so you discard it and keep iterating, hill climbing. And we thought, "Oh, smaller merchants cannot get stat-sig results. They cannot really run experiments, simply because in a week there wouldn't be enough data for them." We thought from this perspective. What we didn't realize is that most people don't have an A and a B. They just have one thing, and they need suggestions for what A and B should be.
So we first built this: we run a simulation on two separate themes and say, "Hey, which one is better?" We then morphed it, and very recently just released it: when you have just your site, your theme, we run over it and say, "Here's what the predicted values of your conversions are, and here's how we think you should modify it to increase your conversions." And then, circling back to what you started with, the proof is in the pudding. If we're not correlating with reality, people will not use it. And thankfully, we see literally every day more users than the previous day. So right now...

swyx: It's working.

Mikhail Parakhin: Yeah. Right now my problem is how to pay for it all. Our major thing is how to optimize the LLMs, do distillation, and run the headless browsers more cheaply, so that we can accommodate the increase in traffic.

[00:42:47] swyx: Yeah. I understand that you published a lot of technical detail at GTC, so I was just going to bring that up a little bit. Was this in conjunction with some kind of GTC presentation, or something like that?

[00:42:59] Mikhail Parakhin: Well, yeah, we did it in several places, but we had the engineering blog as well, yeah.

[00:43:05] swyx: Yeah. So you're running GPT-OSS.

[00:43:08] Mikhail Parakhin: That's an older version. Now we run a multimodal model. But yeah, we still run GPT-OSS as well.

[00:43:15] swyx: And then you have the VMs, and you also have Browserbase. I really like this one, where you said, "It violates almost every assumption that standard LLM serving is designed for." And then you had basically orders-of-magnitude differences between everything.

[00:43:29] Mikhail Parakhin: Exactly. Which was, you know, a bit of a challenge to implement, even simple things.
Since it violates all the assumptions, for example, multi-instance GPUs, MIGs, don't work as well. But we needed to get MIG to work, because otherwise it's way too expensive. So we had to deal with lots of infrastructure, and work with Fireworks and CentML to help with optimizations, and Browserbase, as you mentioned. It takes a village.

[00:44:04] swyx: Okay. So there's a lot of experimentation in the infrastructure so far, and you've published more or less what you have here. I guess I'm less familiar with CentML; I don't do that much work in this part of the stack. But why was it the preferred inference platform?

[00:44:22] Mikhail Parakhin: There are really three top companies, at least that I was aware of, that did LLM optimization: Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by NVIDIA. What they do is: if you have a model and you want to optimize it for a specific profile of usage, they would go and do it. We worked with those companies; this work was particularly with CentML and NVIDIA, to get the best possible results out of it. And sometimes you have to retune depending on what you want: sometimes maximum throughput, sometimes minimal latency, sometimes the cheapest, or some combination. So yeah, these are people who come and help you.

[00:45:14] swyx: I see. Yeah, I'm familiar with these people for the LLM, autoregressive stack.
But the other interesting category of these optimizers is the diffusion people, like fal, and Pruna has come up a lot recently as well, which I think is really underappreciated, at least by myself, because I thought, oh, all the workload would be LLMs, but actually there's a lot of diffusion as well.

[00:45:38] Mikhail Parakhin: Exactly.

[00:45:38] swyx: There's a lot here, so it's hard to cover. But I do think people underappreciate the importance of customer simulation, basically. This is something that I'm candidly still coming to terms with. Your team also prepared this really nice diagram. I assume this is AI-generated.

[00:46:00] Mikhail Parakhin: Yeah, it looks...

[00:46:01] swyx: Maybe it's not.

[00:46:01] Mikhail Parakhin: It looks Gemini-ish. Honestly, I don't know where they generated it; it looks like it's Google. But the interesting part, Sean, that we haven't covered, but I wanted to mention: if your store had previous customers, rather than being a new store where you're a new merchant just launching things, it helps tremendously with correlation and forecasting. We take your previous customers' behavior, and we create agents that replicate the specific distribution of customers that you get, and then we apply those to your changes. That raised the raw correlation, with add-to-cart events, or with conversion, or whatever it may be, quite dramatically. So replicating humans in general seems like an interesting, cool challenge.

[00:46:58] swyx: As a shareholder... I think if people are Shopify shareholders, they should really deeply understand this, because this is basically the moat.
The more you use Shopify, the more it will just automatically improve, right? You're doing the job for them.

[00:47:13] Mikhail Parakhin: Yeah, that's what we started with. Otherwise, if you're just a startup, I wouldn't do it if it were my startup. Without the data, it's exactly as you said: whatever you say in the prompt is what the agents will do.

[00:47:30] swyx: The statistician in me wants to satisfy the statistical intuition, I guess. To me, the word that comes to mind is ergodicity. Let's say one customer takes this path, another takes this path, another takes this path, right? The way I explain it in my mind is: okay, here's the ninety-fifth percentile, here's the fifth percentile, and here's the median. But what SimGym is potentially doing is modeling the in-between journeys as well, the ones that are dependent on the previous states. This may be a very RL-type conclusion, but basically, if you only did naive A/B testing, you'd only have the statistics at a certain point, and you'd only judge based on the overall summary statistics. Here you can actually model trajectories. Does that make sense?

[00:48:31] Mikhail Parakhin: That makes total sense, and maybe even more sense than you realize, because...

[00:48:38] swyx: Okay. Please.

[00:48:38] Mikhail Parakhin: Yes. Internally, we have this system; we talked about it briefly once at NeurIPS. We have a huge HSTU-based system that models whole companies and their possible paths.
What you're showing: at any point in time, you can either model the user's behavior, or you can think about the whole merchant, the company, as an entity that acts in the world, and model that as well. And then you can do counterfactuals. In your graph, your blue graph, imagine that somewhere in the middle you have an intervention. I give that person a coupon, or, I don't know, I send a personal thank-you card, or give a discount somewhere. Then you can do forward rollouts from that counterfactual: what would have happened with that intervention, or without it? And you can even change where in time that intervention happens, where in the journey. So we do this at Shopify scale for our merchants, and if we notice something they could be fixing, where there's a strong counterfactual, then, within Shopify policy, they basically get a notification like, "Hey, we think something is wrong with your, I don't know, Canadian sales. It looks like they're misconfigured. Here's what you need to do." Or, "We think you should set up this campaign with these parameters." And we do that at the buyer level too, to literally offer discounts or cashback or other things to buyers. I'm getting very excited; this is my area of interest, I guess, and hobby. Being able to model something as complex as human beings or companies, and model counterfactuals on it, where you can have interventions in the future and optimize when to make the intervention and what kind of intervention to make: it's such an unlock that was previously completely impossible. It was always dreamed of, but never... How would you even simulate it without LLMs or HSTUs? I think very, very exciting times.
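The counterfactual forward rollouts described above can be sketched in miniature: simulate many buyer journeys with and without an intervention at a given step, and compare outcomes. Everything here is invented for illustration; the per-step purchase probabilities stand in for whatever the real buyer model (an LLM or HSTU-based simulator) would produce.

```python
import random

def rollout(journey_len, coupon_at, seed, n=10_000):
    """Average conversion rate over n simulated journeys.

    coupon_at: step index at which the intervention happens, or None.
    """
    rng = random.Random(seed)
    conversions = 0
    for _ in range(n):
        p_buy = 0.02                   # baseline per-step purchase probability
        for step in range(journey_len):
            if step == coupon_at:      # intervention boosts purchase intent
                p_buy = 0.06
            if rng.random() < p_buy:
                conversions += 1       # buyer converted; journey ends
                break
    return conversions / n

baseline = rollout(journey_len=5, coupon_at=None, seed=1)
with_coupon = rollout(journey_len=5, coupon_at=2, seed=1)
uplift = with_coupon - baseline
```

Because the intervention point (`coupon_at`) is just a parameter, you can also sweep it to ask not only "should I intervene?" but "when in the journey is the intervention worth the most?", which is the optimization Mikhail describes.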
[00:50:59] swyx: I just wanted to maybe illustrate this. I'm not the best illustrator, but I am a conceptual statistics guy. And you cannot just do this with A/B tests; this is a dimensionality that A/B testing doesn't have, right? It doesn't have the change over time, the stochastic nature, and it doesn't have the context: here's all the context up to this point. Okay, cool. That's SimGym. You're going to burn a lot of tokens on this thing, but you're one of the only scale platforms in the world that can do this across a huge variety of workloads, right? I'm even curious on a human research level: does retail behave differently from clothing sales? Does that behave differently from electronics sales? I don't know. The Kardashian shoppers, do they differ from people who buy, I don't know, cars and whatever?

[00:51:55] Mikhail Parakhin: Very different: different sensitivities, different modes of shopping, different levels of what's important. And totally, you can do aggregations at a store level, or at a different category level. For the statisticians among us: I couldn't believe it, but recently we were looking at this and we had to bring back CRPs, the Chinese restaurant process. It's a way of aggregating, of naturally growing clustering, specifically to answer questions like the ones you were just posing, about whether buyers behave differently across categories. And I'm like, "I haven't seen a CRP since two thousand and one."

[00:52:37] swyx: What? No, I haven't seen this. This is not in my training.
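The Chinese restaurant process just mentioned is a clustering prior in which each new item joins an existing cluster ("table") with probability proportional to its current size, or starts a new cluster with probability proportional to a concentration parameter alpha, so the number of clusters grows naturally with the data instead of being fixed in advance. A minimal sampler (illustrative only; the real use at Shopify would sit inside a larger model):

```python
import random

def crp_assign(n_customers, alpha, seed=0):
    """Assign n_customers to tables via the Chinese restaurant process."""
    rng = random.Random(seed)
    tables = []          # tables[i] = number of customers at table i
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]    # existing table sizes, plus alpha for a new table
        r = rng.random() * sum(weights)
        for i, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if i == len(tables):
            tables.append(1)          # open a new table
        else:
            tables[i] += 1            # join an existing table (rich get richer)
        assignments.append(i)
    return assignments, tables
```

The "rich get richer" dynamic means a few large buyer segments emerge alongside a long tail of small ones, which is why it suits the category-behavior questions in the conversation.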
[00:52:44] Mikhail Parakhin: But yeah, it was a very popular theory, popular in NeurIPS and ML circles in the early two thousands, kind of nice. And now it has practical applications that we're resurrecting.

[00:53:03] swyx: Yeah, amazing. I can see how this is a fun job for you, where you get to apply all these things. Super cool. So anyone who knows what CRPs are and has always wanted to use them at work should definitely join Shopify. Okay, we have a lot, but I'm being mindful of the time, and I do want to cover some other things. I'll give you a choice: UCP or Liquid?

[00:53:30] Mikhail Parakhin: Liquid, I think. On UCP, you know, UCP is very important for us, and we have structured discussions you can read about, and we have blog posts, and we have a big release this week, in fact, with our catalog.

[00:53:46] swyx: Oh, okay.

[00:53:46] Mikhail Parakhin: Yeah.

[00:53:46] swyx: I mean, we can discuss the release briefly, because we'll release this after it's already announced, so whatever. There's a catalog that you guys are doing?

[00:53:55] Mikhail Parakhin: Yeah. We are bringing in the capabilities of the whole Shopify catalog. Basically, now you can search for products, you can do lookups by specific ID, you can do bulk lookups when you need to bring in multiple products. You don't need to know in advance what you're trying to show or to sell or check out.
Like, you can now, you can now have this decided at, at runtime, and this is a big area of investment for us, for both non-personalized and personalized searches, trying to provide basically a win-window into the whole universe of products that are being sold everywhere in the world. And Shopify is really, not exactly, but almost like a superset of any-anything being sold. Now we are bringing it into UCP and, uh, and, uh, identity linking is another big thing for us, uh, so that you, you can use, uh, like Google or whatever, whatever identity you have, uh, thereby minimizing friction. [00:54:56] swyx: Yeah. So [00:54:57] Mikhail Parakhin: yeah, big release for us. But Liquid AI of course we never talk about, and the problem might be more, more aligned with what we d-discussed previously on this chat. [00:55:07] swyx: Sure. The main thing that everyone understands about Liquid is that it is inspired by the worm, and I still don't know why. I'm curious on your explanation. I think you, you, uh, you can make things very approachable. And also I think like what is the potential of like the, the level of efficiency that you get out of Liquid? [00:55:23] Mikhail Parakhin: You- we're all familiar with transformer architectures. And, uh, for the longest time, there was a competing architecture, it's called the state space models. So, so SSMs, uh, you know, Chris, Chris RĂ©, one of the pioneers and, and lots of startups, uh, trying to make those a reality. They have, uh, significant benefits, the main being, uh, being much faster and, uh, lower footprint and not quadratic in length, you know, sort of, uh, linear in, in, uh, in your context length. But with state space models- They never quite made it. Like they're used-- They have, uh, certain niches where they thrive, their hybrid architectures are useful, but they never quite made it. And liquid neural networks are, you can think of them as a next step, like, uh, sort of, uh, state-space models squared.
Itâs non-transformer architecture thatâs more complicated than sta-state sp
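The scaling argument Mikhail makes can be seen in a toy recurrence: a state-space layer updates a hidden state once per token, so its cost is linear in context length, whereas attention compares every token pair, which is quadratic. A scalar sketch (illustrative only; real SSMs use learned matrices, and Mamba-style and liquid variants make the dynamics input-dependent):

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state-space model: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    One O(1) update per token, so total cost is O(T) in sequence length,
    versus O(T^2) for full attention over the same sequence."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update (linear recurrence)
        ys.append(c * h)    # readout
    return ys

# An impulse input decays geometrically through the state:
ys = ssm_scan([1.0, 0.0, 0.0], a=0.5)  # → [1.0, 0.5, 0.25]
```

The fixed-size state `h` is also why SSMs have a lower memory footprint at inference: there is no growing key/value cache, just the state carried forward.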
đŹ Training Transformers to solve 95% failure rate of Cancer Trials â Ron Alfa & Daniel Bear, Noetik
Today, we explain this piece of "clickbait" from our guest! TL;DR: 95% of cancer treatments fail to pass clinical trials [https://www.nature.com/articles/s41467-025-64552-2], but it may be a matching problem — if we better understood which patients have which tumors, and which tumors will respond to which treatments, success rates improve dramatically and millions of lives can be saved — with the treatments we ALREADY have. See our full episode [https://youtu.be/uqM8qjbLRHA] dropping today: Why Big Pharma is licensing AI Models

Tolstoy famously wrote, "All healthy cells are alike; each cancer cell is unhappy in its own way." Or something like that. Cancer might be the most misunderstood disease out there. It's not one disease, it's a family of diseases. Hundreds, maybe thousands, of unique diseases, each with its own underlying biology. With this lens, saying you'll "cure cancer" is like saying you'll solve Legos. We keep hearing AI will cure cancer, but sadly it may not be so easy. Today's guests — Ron Alfa [https://x.com/Ronalfa/status/2031083722980864010] and Daniel Bear [https://www.linkedin.com/in/daniel-bear-b79480279] from Noetik [https://www.noetik.ai/] — think they can use AI to break through a core bottleneck in the treatment development process. GSK recently signed a $50M deal for their technology [https://x.com/BiotechTV/status/2011577286634729785] that also includes an (undisclosed) long-term licensing deal for Noetik's models like the recently announced TARIO-2 [https://x.com/Ronalfa/status/2045579548977500197?s=20], an autoregressive transformer trained [https://x.com/owl_posting/status/2026313562721853730] on one of the largest tumor spatial transcriptomics datasets in the world. Whole-plex spatial transcriptomics is the richest way to read a tumor, and approximately ~0% of cancer patients going through standard care ever get one — and TARIO-2 can now predict an ~19,000-gene spatial map from the H&E assay every patient already has.
Most big AI plays in BioTech have focused on discovery, and usually result in an in-house development effort (meaning tools companies usually become drug companies). This deal stands out in that it is a software licensing deal, and represents a commitment to a platform rather than a drug. With attention on other software tools for drug development (see the Boltz episode [https://www.latent.space/p/boltz] and Isomorphic for example), it is starting to look like the appetite of Pharma for biotech tools has finally started to grow. Why the sudden interest?

Cancer is hard

Biology is hard, cancer is harder. But despite this, we've made incredible progress. So many cancers that would have been death sentences twenty years ago are routinely survivable. It used to be that our main strategy was just chemotherapy — poison you and hope the tumor dies before you do. Now, there are many treatments that actually kill a tumor and leave the rest of you intact! Immune checkpoint inhibitors like Keytruda and Opdivo disable the defenses of dozens of tumor types. CAR-T therapy adds modified T-cells to your blood that can target B-cell malignancies very accurately. Antibody Drug Conjugates such as trastuzumab emtansine combine a drug with an antibody, allowing it to target very specific (cancer) cells. We truly live in marvelous times. With that said, we still have a long way to go. For every type of cancer with a miracle treatment, we have many more that are still death sentences. The world spends $20-30 billion a year trying to cure cancers, with hundreds of clinical trials yearly. Yet, progress is slow, with a 95% failure rate in clinical trials [https://www.nature.com/articles/s41467-025-64552-2].

The lab doesn't translate to the clinic

Are we leaving something on the table? Enter Noetik and Ron Alfa. Ron's core thesis is that many of these "failed" treatments actually work! But we're not looking at the right patients with the right tumors.
If only we had a way to really understand the unique types of cancer biologies and which patients will respond to which treatments, we might be able to show a much higher success rate. Millions of lives (and billions of dollars) may ride on this.

The Hard Part: Blind Faith in Data Collection

Ron and Noetik had the conviction to spend almost two years just collecting data. Lots, and lots, and lots, of data. Noetik has acquired thousands of actual human tumors, and collects a large multimodal dataset of hundreds of millions of images that allows them to create a detailed map of the cell makeup in the local environment. These are real human tumors, not Frankenstein mouse models or immortal cell lines. This data is then fed into a massive self-supervised model, creating a "virtual cell [https://www.latent.space/p/biohub]". This model has a deep understanding of cancer biology — Noetik has worked carefully to show it can distinguish different types of tumors. Maybe even tumors we didn't identify as distinct previously! More recently they figured out how to scale up their model and data, and see no limit in their scaling laws! Noetik's models can simulate how a patient will respond to experimental treatments. They are working with partners to test promising drugs that were demonstrated to be safe, but not effective. If these models work as hoped, Noetik will bring new cancer treatments to patients without developing a new drug! Their models will also guide the discovery process towards drugs that are more likely to make it through clinical trials. You can imagine why this is so attractive to GSK.

We'll see...

Ron and Dan make pretty persuasive arguments that their models will truly assist in cohort selection in useful ways and this seems valuable. And we think it's pretty clear that * Translation from lab to clinic is the biggest bottleneck for drug development. * Better cohort selection using biomarkers is likely to improve translation from lab to clinic.
Noetik has already had some success here. Weâll see if theyâre able to translate that into a reliable advantage. Stepping back a bit from the technology, curing cancer is a pretty unambiguously positive application of AI. It is also a very hard problem to solve. Our guess is that most people have been impacted by cancer or will be at some point soon. And we hope that learning about the amazing work that companies like Noetik are doing will inspire a generation of AI engineers to work on the hardest and most exciting problems that society faces. Full Video Pod: This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe [https://www.latent.space/subscribe?utm_medium=podcast&utm_campaign=CTA_2]
Notionâs Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future â Simon Last & Sarah Sachs of Notion
For all those who missed out on London, see you in Miami [https://www.ai.engineer/miami] next week! Notion, the knowledge work decacorn [https://www.saastr.com/notion-and-growing-into-your-10b-valuation-a-masterclass-in-patience/], has been building AI tooling since before ChatGPT [https://www.notion.com/blog/introducing-notion-ai?utm_source=chatgpt.com], with many hits from Q&A in 2023 [https://www.notion.com/blog/introducing-q-and-a?utm_source=chatgpt.com] and unified AI in 2024 [https://www.notion.com/releases/2024-07-29] and Meeting Notes in 2025 [https://www.notion.com/blog/notion-ai-for-work?utm_source=chatgpt.com]. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0âs Custom Agents [https://youtu.be/KZ3hAy_XZwI?si=fqza-i0BAD2jYGyc&t=3133] - and they are finally embracing the Agent Lab playbook [https://www.latent.space/p/agent-labs?utm_source=publication-search]! Sarah Sachs [https://x.com/sarahmsachs] and Simon Last [https://x.com/simonlast] of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work. We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today â from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work. 
We discuss: * Sarah and Simonâs path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production * Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model * The âAgent Labâ thesis [https://www.latent.space/p/agent-labs?utm_source=publication-search]: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities * How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are * Why coding agents feel like the kernel of AGI, and how Notion is thinking about âsoftware factoriesâ made up of agents that spec, code, test, debug, review, and maintain codebases together * How Sarah runs AI engineering at Notion (ânotes from Token Town [https://x.com/sarahmsachs/status/2031473087791902991]â): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to swarm around fast-changing opportunities * The âSimon Vortex,â company hackathons, and why security gets pulled in early rather than late * How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents * Why prototypes have become much easier to build internally, and how âdemos over memosâ changes product development inside a tool the whole company already uses every day * Notionâs eval philosophy: regression tests, launch-quality evals, and âfrontier/headroomâ evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going * What a âModel Behavior Engineerâ is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than 
just software engineering * The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops * How the âsoftware factoryâ should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter * A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database * How agents compose inside Notion: shared databases as primitives, agents invoking other agents, âmanager agentsâ supervising dozens of specialized agents, and memory implemented simply as pages and databases * Notionâs take on MCP vs CLI: why Simon is bullish on CLIâs self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment * The evolution of Notionâs internal agent harness: from early JavaScript coding agents, to custom XML, to Markdown and SQL-like abstractions, to tool definitions, progressive disclosure, and a much shorter system prompt * Why Notion cares about teaching âthe top of the class,â building for sophisticated operators rather than abstracting away too much capability for everyone * How agent setup works today: agents that can configure themselves, inspect their own failures, and edit their own instructions â with guardrails around permissions * How Notion prices Custom Agents: credits as an abstraction over tokens, model type, serving tier, web search, and future sandbox costs; why usage-based pricing was necessary; and how âautoâ tries to match the right model to the right task * Why Notion is not eager to train a foundation model, where they do fine-tune and optimize today, and why retrieval/ranking is one of the most important investment areas as more searches come from 
agents rather than humans * Why Meeting Notes became one of Notionâs strongest growth loops: not just as transcription, but as high-signal data capture that powers search, custom agents, follow-up workflows, and the broader system of record for company collaboration * Why Notion is more interested in being the place where collaboration data lives than in building hardware themselves â and how wearables or other capture devices may eventually feed into that system Sarah SachsLinkedIn: https://www.linkedin.com/in/sarahmsachs [https://www.linkedin.com/in/sarahmsachs]X: https://x.com/sarahmsachs [https://x.com/sarahmsachs] Simon LastLinkedIn: https://www.linkedin.com/in/simon-last-41404140 [https://www.linkedin.com/in/simon-last-41404140]X: https://x.com/simonlast [https://x.com/simonlast] Full Video Episode Timestamps * 00:00:00 Introduction and launching Notion Custom Agents * 00:01:17 Why Notion rebuilt agents four or five times * 00:03:35 Building for where models are going, not just where they are * 00:05:32 The Agent Lab thesis, wrappers, and product intuition * 00:08:07 User journeys, leadership, and low-ego AI teams * 00:13:16 The Simon Vortex, hackathons, and bringing security in early * 00:16:39 Team structure, demos over memos, and building for agents * 00:20:25 Evals, Notionâs Last Exam, and the Model Behavior Engineer role * 00:27:37 Evals as an agent harness and the changing role of software engineers * 00:30:42 The software factory: specs, verification, and agent workflows * 00:32:18 Live demo: a custom agent for coworking space applications * 00:35:08 Composing agents, manager agents, and memory as pages * 00:38:15 Notion Mail, Gmail, native integrations, and tools * 00:39:43 MCP vs CLI and the cost of capability * 00:44:13 When Notion uses MCP vs building its own integrations * 00:47:43 The history of Notionâs agent harness rebuilds * 00:55:35 Power users, public tools, and the setup agent * 00:58:01 Self-fixing agents, permissions, and âflippyâ * 
01:01:13 Pricing, credits, and choosing the right model automatically * 01:09:01 Why Notion isn't training its own frontier model * 01:14:07 Retrieval, ranking, and search built for agents * 01:17:27 Meeting Notes as data capture and workflow automation * 01:21:18 Wearables, hardware, and Notion as the system of record * 01:23:45 Outro Transcript [00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space. [00:00:11] swyx: Hello. Hello. We're back in the beautiful studio that, uh, Alessio has set up for us with Simon and Sarah from Notion. Welcome. [00:00:18] Sarah Sachs: Thanks for having us. [00:00:19] Alessio: Thanks for having us. Yeah. [00:00:20] swyx: Congrats on the launch recently, the custom agents, finally it's here. How's it feel? [00:00:26] Sarah Sachs: We ship things slowly. So it had been in alpha for a little bit, and at the point at which it's an alpha, um, there's a group of people that are making sure it's ready for prod, and then there's a group of people working on the next thing. So sometimes some of these launches are a bit of delayed satisfaction, so it's quite nice to remind yourself of all the work you did, because we do have a habit of, like, being two or three milestones ahead. Uh, just 'cause you have to be, you know, you can't get complacent. Um, but it's been great that people understood how this is helpful. And I think that's just easier in general building AI tools today than it was two, three years ago. People kind of get it, and so that user education, um, there's just, it was our most successful launch in terms of free trials and converting people and things like that. It was really successful, so yeah. But there's a lot to build. [00:01:12] swyx: Making it free for three months helps. [00:01:16] Sarah Sachs: Yep.
[00:01:17] Simon Last: It was definitely super exciting for me because it's probably the fourth or fifth time that we rebuilt that. [00:01:22] swyx: Yes. [00:01:23] Simon Last: And I mean, [00:01:24] swyx: you've been building this since like 2022. [00:01:26] Simon Last: Yeah, I mean, like, it was even right when we got access to like GPT-4 in late 2022, one of the first ideas we had is like, oh, okay, let's make an agent, that, I, we used the word assistant at the time, there wasn't really the word, the word agent yet, but, oh, we'll give it access to all the tools that Notion can do, and then it, we run it in the background, like, like, do work for us. And then we just tried that many times and it just... was too early. Um, [00:01:48] swyx: I need to force you to like double click on that. What is too early? What didn't work? [00:01:52] Sarah Sachs: We were fine-tuning, like, before function calling came out. We were trying to fine-tune with the frontier labs and with Fireworks, like, a function calling model on Notion functions. This is right when I joined. I joined because, um, we needed a manager, as Simon needed to be able to go on vacation. So, uh, that's, that's around when I joined, so you can speak much more to it. [00:02:11] Simon Last: Yeah, we did partnerships with both Anthropic and OpenAI at different times, uh, to try to, at the time the, I mean, when we first tried, there wasn't even a concept of like tools yet. We, we sort of designed our own, like, like tool calling framework, and then we tried to fine-tune the models to, uh, to use it over multiple turns. Um, and because it, it didn't work well out of the box, I think. Yeah. The models are just too dumb and the context window was also way too short. [00:02:37] Alessio: Yeah. [00:02:37] Simon Last: Um, and yeah, we just kind of banged our head against it for a long time. Uh, unfortunately it was always like, there was always like sort of.
Glimmers that it was working, but, um, it never felt quite robust enough to be like a useful, delightful thing. Um, until I would say, uh, the big unlock was probably like Sonnet 3.6 or 3.7, uh, early last year. And that's when we started working on our agent, which we shipped last year. Um, and then, and then, uh, uh, custom agents, kind of a similar capability, and that, that one just took longer because we, we just wanted to get the reliability up a lot higher. 'cause it's actually running in the background. [00:03:14] Sarah Sachs: And the product interface of like permissions and understanding, you know, this custom agent is shared in a Slack channel with X group of people and has access to documents that are surfaced to Y group of people. And the intersection of X and Y might not be whole. And so how do you build the product around making sure administrators understand that permissioning took multiple swings. [00:03:35] Alessio: Everything is hard at the end of the day. Yeah. I'm curious, like, when the models are not working, how do you inform the product roadmap of like, okay, we should probably build, expecting the models to be better at some reasonable pace, but at the same time we need to, you know, you had a lot of customers in 2022. It's not like you were a new company or like no user base. [00:03:54] Simon Last: Yeah, I mean I think there's always the balance of, you know, like you want to be AGI-pilled and thinking ahead and building for where things are going. Uh, but also you wanna be like shipping useful things. And so we always try to like, like keep a balance there. You know, we, we try to take, like, a clear portfolio approach. You know, we're always working on multiple projects and, and we're always trying to work on, you know, maintaining things that have already shipped, like, like shipping new things that are like eminently working well and make them really good.
And, and then we wanna always have a few projects that are a little bit crazy. Um, [00:04:23] Alessio: and what are the AGI-pilled projects that you have today? I'm curious about, uh, you don't have to share exactly what you're working on, but I'm curious what are things today that maybe in 18 months people will be like, oh, obviously this was gonna work. [00:04:35] Sarah Sachs: 18 months. [00:04:37] Alessio: Yeah, 18 months is, you know, [00:04:37] Sarah Sachs: it's a long time and Yeah. Yeah. [00:04:39] Simon Last: I mean, there's a number of things happening. I think one thing that's becoming more clear is, I think, like, like, uh, coding agents are the kernel of AGI, sort of, everything is a coding agent. Mm-hmm. I think that's, that's sort of one, one direction. Um, and then, yeah, the exciting thing about that is sort of your agent can sort of bootstrap its own software and capabilities and actually debug and maintain them. And so yeah, we're, we're, we're thinking a lot about that. And then, yeah, like, like another category of things that I'm, I'm really excited about is, like, uh, we call it the software factory, also. People are using this, uh, this, this sort of word. Um, basically it just means can you create sort of like a, as automated as possible, a workflow for developing, debugging. Mm-hmm. Merging, reviewing, and maintaining a code base and a service where there's a bunch of agents working together inside, and like, like how does that work? [00:05:28] Sarah Sachs: If you think back to your initial question, like, why did this take so long? I think something, [00:05:32] swyx: I didn't say that, but Yes. Okay. Go ahead. [00:05:34] Sarah Sachs: Why, what, what changed over the three and a half years of trying [00:05:37] swyx: it? Exactly. Right. Because most people always say like, it didn't work yet. Then reasoning models came, then it worked. I was like, okay, let's go a little [00:05:43] Sarah Sachs: bit.
That's, I mean, that's part of it, but I think the other part of it, that I actually think is really what will set Notion apart for every new capability, is we have, like, two skills that are crucial when it comes to frontier capabilities. One is not letting yourself swim upstream. So like quickly realizing if you're just pressing against model capabilities versus not exposing the model to the right information, not having the right infrastructure set up. That in and of itself is the skill of intuition. And the second is to see, okay, you're not swimming upstream. Which direction is the river flowing, and what is like, how do we think ahead about the product and start building it even if it's not great yet, so that when it is there, we're ready for it. Right? And like those can sometimes feel like counterintuitive things. Like we can be trying to fine-tune a tool calling model when they don't exist yet. And the trick is to not do that for too long, but realize that there was something there. And we've had a lot of things which, like, um, we're just like not swimming in the right direction with the stream. I think we had multiple versions of transcription before we got meeting notes, right? Oh, I gotta talk [00:06:39] swyx: about that. Yeah. [00:06:40] Sarah Sachs: Yeah. Um, and so, I, I, I think that like we, we really closely partner with the frontier labs on capabilities, and we also have to have strong conviction on, as those capabilities move. Notion is about being the best place for you to collaborate and do your work. And how does that narrative change if the way that we work changes? Yeah. [00:06:58] swyx: Yeah. You told me you were a fan of the Agent Lab thesis, and this is, this is kind of it, right? [00:07:02] Sarah Sachs: Right. I show that thesis to so many candidates. Like I have it as like my Chrome autofill.
Um, at this point, like it's one of my most visited [00:07:10] swyx: because like, is this the, here's why you should work at Notion and not Open, OpenAI. I, it's like, [00:07:14] Sarah Sachs: here's, here's what's different about it. [00:07:16] swyx: Yeah. [00:07:16] Sarah Sachs: And here's why. It's not just a wrapper. I actually think more and more people understand it's not just a wrapper. [00:07:21] swyx: Yeah. [00:07:22] Sarah Sachs: Um, and by the way, like in the beginning, parts of what we build are wrappers on functionality. That works well, of course, but that's not really the most, um. I would say that's not the product that, that drives revenue. And that's not necessarily always what users need. [00:07:35] swyx: I mean, you know, Notion is the AWS wrapper, but like the, the wrapper is very beautiful and like very, very well polished. So [00:07:40] Sarah Sachs: like the analogy, [00:07:41] swyx: like [00:07:42] Sarah Sachs: the analogy that I've been coming back to is Datadog and AWS. [00:07:45] swyx: Yeah. [00:07:46] Sarah Sachs: So, uh, Datadog could not exist with, without cloud storage. Right. That it's kind of fundamental that that works. Um, and AWS has like a CloudWatch product, but Datadog is an expert on understanding how people want observability on the products they launch. And we're experts in understanding how people wanna collaborate, and that's really where our expertise lies. [00:08:04] swyx: Totally. [00:08:04] Sarah Sachs: Um, regardless of the tools that we use, [00:08:07] Alessio: I'm kind of curious how you think about implicit versus explicit expertise. I feel like Datadog is half and half implicit and explicit. It's like they understand across markets and industries what engineering teams usually look for. With Notion, it's almost like more of the expertise is at the edge, because you as a platform, you're like so horizontal that the end user is not really the same. Mm-hmm.
Like with Datadog, the end user is always like, yeah, an engineering lead, a kinda like SRE-related person. With Notion, it can be anything. So I'm curious how you put that expertise into a product versus, you know, obviously AWS cannot build Notion. It's, that doesn't quite work in this case, but [00:08:44] Simon Last: it's, it's a little bit differently shaped. I think, you know, a classic vertical SaaS, like Datadog, is kind of like that. They understand their individual customer very deeply. It's kinda a narrow slice. Um, Notion has always been super horizontal. And our, our task has always been to sort of balance these two somewhat opposing forces of, like, we're listening to our customers and what they want us to build. It's a broad slice. And then also we're thinking about like, okay, how do we decompose what they want into, uh, nice primitives that are, that are really nice to use and will, will get us like as much bang for the buck as possible. And then, you know, maintain the whole system, make it all, like, like, super clean and nice to use. [00:09:22] Sarah Sachs: We still have user journeys. I mean, we still focus on like core. I actually think the failure of our team is when we focus too much on what are cools that are, what are tools that are [00:09:31] Simon Last: mm-hmm. [00:09:31] Sarah Sachs: Cool tools. I actually think that's when we have the least velocity, because you still need some sort of focus on a user journey. So like for instance, we'll all sit down every Friday and look at the P99 of, like, the most token-exhaustive custom agent transcript and just look at why it didn't do well and cut a bunch of tasks. Like, we still focus on like, this has, like, this should work. Email triaging should work. Mm-hmm. Right. And similarly, like we were chatting, um, before we started filming, about, okay, how can I do PDF export? Well, that's functionality that then merits.
Maybe we should build a tool that has access to a computer sandbox and a file system and the ability to write code. Right? Right. Um, but it's because we're thinking about the fact that our users, to do their, to do their daily work, need to export PDFs, not because we're like, hmm, I think a computer tool could be cool. Like, let's just see what happens. Mm-hmm. Like we, we have to focus on some user journeys, otherwise we just don't have, like, enough strategy to, to prioritize. [00:10:29] swyx: I think there's a lot of like really strong opinions that you've had. Do you have like sort of like a Tao of Sarah Sachs? Like, you know, like what, how do you run your team? Like I feel like you just have accumulated all these strong opinions. Obviously part, part of this is your, your Token Town thing. [00:10:43] Sarah Sachs: I think the Tao of working with Sarah Sachs, um, you'd have to, it depends who you ask. Um, I think it depends if you're on my team or a partner Right. Or a vendor. [00:10:54] swyx: Yeah. There are other people who want to run their teams the way that you're, yeah, you're like bringing these things. And then also similarly, uh, Simon, when you did the custom agents demo, you had like, well, we've been using custom agents and here's the super long list of everything that we do. No humans ever read it. Right? That's what you said. I was like, [00:11:07] Sarah Sachs: yeah. So I think for, for me, um, something that I learned very quickly and became very comfortable with was that my job was not to be the ideas person or the technical expert. My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on, and had an avenue to prioritize what they thought was important. And I think that's true with all, all leadership, but I think especially on the AI team.
Almost all of our best ideas come from prototypes, from people that have a cool idea because they saw a user problem, and it's a huge disservice if all of those ideas have to pass, like, the sniff test of what me and a product partner or Simon and Ivan decided were the direction, right? Because a lot of what we're doing is leaning into capabilities, so. I think that's the first thing is like, I don't really view, like, the role of engineering leadership as, like, uh, hierarchical, nor has it ever been, but especially now, like very willing to change direction based on, um, like, proof is in the pudding. Yeah. And like, and I think we have rebuilt our harness three or four times. And when you do that, then the second rule of engineering leadership is, like, you need to build a team that's comfortable deleting their own code and is very low ego and is driven by what's best for the company. And, um, doesn't write design docs because they think it's their promotion packet. Right. And that's a culture that Notion had long before I joined, but like our willingness to just swarm on different problems and, um, redo things that we've built before because something has changed. Like, there's a lot of friction that can happen at companies when you do that. And it doesn't happen at Notion. And because it doesn't happen when new people join. Like they don't wanna be the ones that are saying, we shouldn't do this. I wrote that code. So then it's, you know, you, you create a culture that everyone adopts, and that culture comes directly, I think, from Simon and Ivan though, um, because they're very open-minded. [00:12:50] swyx: Anything that you, [00:12:50] Simon Last: you'd add? I'm not a manager, like, like, like Sarah is. Um, a lot of my role is really to try to think a little bit ahead, make sure that we're, we're building on the right capabilities and then like the prototyping stuff. And yeah, it's really, really critical to always just be starting again.
It's like, okay, this is a new thing. What does this mean? What if we just rethought everything or rewrote everything? And so I'm basically just doing that in a loop every six months. [00:13:16] swyx: Yeah. Do you believe in internal hackathons for this stuff? [00:13:19] Sarah Sachs: I think there's two different versions. So one is, we just have a solid bench of senior engineers that come and go on what we call the Simon Vortex, and productionizing what we built, right? Because when you're in the Simon Vortex, the velocity is super high, the direction changes daily, and it's meant to be the equivalent of a Skunk Works lab. We don't need to do hackathons for that. We need to have senior engineers that we trust to come in and out of those projects. For instance, management boundaries are really loose. You report to him, but you work for her right now. Yeah. That's something that, when we hire managers, it's important they don't care about, because we tend to form more structure... [00:13:54] swyx: Don't be too territorial. [00:13:55] Sarah Sachs: We form more structure after we ship things, not before, just historically. The second thing is we do have companywide hackathons. Actually, we just had our demo day this morning for the hackathon we had last week. That's more for people that aren't directly working on the project, feeling like they have the time to pause and learn how to make themselves more productive, or how they would use Notion custom agents to build something. Part of the hackathon was actually encouraging everyone across the company to build their own agentic tool-calling loop from scratch, following, like, an Every blog post on how to do it. [00:14:26] swyx: Just with the compound engineering one. Yeah. [00:14:28] Sarah Sachs: We want everyone in the company to use Claude Code, or whatever coding agent they please, and understand that fundamental. So we set aside a day and a half.
We, all leadership, encouraged everyone on their teams across the company to do it. So we have hackathons like that. I would say, kind of facetiously, everything we build is a little bit like a hackathon until it graduates and puts on big-boy pants and has a product ops rollout leader and assigned data scientists and stuff like that. [00:14:54] swyx: Security review, enterprise stuff. [00:14:56] Sarah Sachs: Actually, security review is one of the things that we bring in first, because it just slows us down way more otherwise and causes a lot of tension, and they build better product if they're involved early. So that is probably the first function to get involved in something. [00:15:09] swyx: That's the right PR-approved answer. [00:15:10] Sarah Sachs: No, but it's not just PR-approved. [00:15:13] swyx: It's actually real. It's actually real. [00:15:15] Sarah Sachs: I'm just saying, scar tissue. [00:15:15] swyx: Yeah. [00:15:16] Sarah Sachs: Because, you know, my background's also, I worked at Robinhood for a number of years. Yes. So compliance and things like that are a little bit more... you learn the hard way when it doesn't come naturally. [00:15:26] Simon Last: Yeah. I think the hackathon is really important for uplifting the general population, but if that's the only way you can build new things, you're kind of toast. It has to be the daily process, building these new things. And I think in the AI era, a lot more leverage accrues to the most curious and excited people. And so we're all about just activating that energy. If someone's prototyping something on the weekend that they're excited about and it's important, that should be the main thing that we're doing. Yeah. It's not a hackathon that we schedule once a quarter, it's just the daily process. Part of the culture.
[00:16:02] Sarah Sachs: I mean, that's how we shipped image generation in Notion. It was always this thing that would be kind of nice to have, but it wasn't really clear where it aligned in product priorities, and it'd be a lot of work. And we had someone on the database collections team, Jimmy, who was like, I really wanna do image generation for cover photos and inside Notion. And we're like, if you wanna build it, do it, please, we encourage you. We gave 'em all the resources: working directly with Gemini, being able to track the token usage, it working through our endpoints. We gave them eval support, everything, and then it became a full project. [00:16:34] Alessio: Yeah. [00:16:35] Sarah Sachs: That's why you can't have ego as a leader. That's how we work. [00:16:39] Alessio: What's the size of the team today, both engineering and overall? [00:16:43] Sarah Sachs: I manage the team that we'll call core AI capabilities and infrastructure. That's about 50 people. But then we have partner teams that do packaging, so how it shows up in the corner chat versus custom agents versus meeting notes. That's another 30, 40 people. And then every team that has a product surface at Notion that a user can interface with owns the tool that the agent interfaces with. The editor team, the team that did CRDTs for offline mode, is the same team that handles how two agents edit competing blocks. Mm-hmm. Right? It's the same problem. The team that built the underlying SQL engine is the same team that owns how the agent asks it to run a SQL query, and that it does it performantly. And so, from that regard, anyone working on product engineering is tasked with making things work for customers that are humans and agents, because over time the majority of our traffic will be coming from agents using our interface, not humans. And so...
Our objective is to make it so that the whole product org is building for agents. [00:17:40] Alessio: Yeah. How has it changed internally? The activation bar has kind of lowered a lot: anybody can create a prototype somewhat easily, especially in an existing code base. Have you raised the bar on what type of prototype people need to bring forward to be taken, not like seriously, but, you know what I mean? [00:17:58] Simon Last: Yeah. I think the bar is lowered in many ways. Like, one thing our team built that is really cool is, our design team made a whole separate GitHub repo called the Design Playground. And it's basically a bunch of helper components for quickly throwing together UIs. And it's become actually quite sophisticated. It has an agent in there, and that's pretty fun. So they pretty much don't do mocks, they just make full prototypes. [00:18:27] swyx: Here it is. It works. [00:18:28] Simon Last: They give you a URL. They're like, okay, all right, so we have to make the real production version of that. And then for engineers, a prototype looks like just making it a feature flag that actually works. That's sort of the bar. [00:18:39] Sarah Sachs: Something to understand that's really unique about Notion, one of the reasons I joined, we're super lucky: no one uses Notion in their job as much as people that work at Notion. [00:18:46] Simon Last: Of course. [00:18:47] Sarah Sachs: So I think there's very few companies, maybe if you worked on Chrome, I guess, but everything that we ship, we ship internally first and get a lot of really quick feedback. And also sometimes our dev instance is totally borked and you have to change a bunch of flags to get things done.
And that's kind of the deal. But everyone, so people that do IT ticketing, people that do supply-chain procurement, recruiting, everyone is using the same instance of Notion, with a lot of flags on for these prototypes people build. And so, Brian Levin, one of the designers on our team, I think evangelized this concept of demos over memos. [00:19:18] swyx: Ooh, too good. [00:19:20] Sarah Sachs: Which has been very good for building demos, and I think it's put a big pressure point on us to have really strong product conviction, because if anything can be demoed, you really need a strong filter to make sure that if you're doing X amount of work, you're focusing on one tower, you're not just building a really flat hill. Right. That's actually where I think there has to be more conviction from our PMs and our designers, and, well, the company really, to have conviction about what journey we're going on. [00:19:52] Simon Last: But overall, I feel like it works pretty well. Almost all the engineers have good enough taste to realize that this prototype doesn't actually make sense in the product, or it does. So it's not that common that I would see a prototype and it's like, oh, this makes no sense. Mm-hmm. People are doing reasonable things, and then it's just a matter of which things we build first, and then often just figuring out how to turn it on and off. In our experimental chat UI, there's probably like a hundred checkboxes in there. [00:20:22] Sarah Sachs: Kills me. [00:20:23] Simon Last: The things you can turn on and off. [00:20:25] Sarah Sachs: But I think, okay, so that is kind of true, Simon, but being the person that manages the evals team, there is a level of intensity that it adds to the platform team.
So, you know, if we're gonna do image generation in Notion, all of a sudden the way that we do attachments, and the way that our LLM completion layer, Cortex, talks and expects tokens back and is now getting images back, there's a lot of platform work that we do need to solidify a little bit. So sometimes it'll be in dev for a couple weeks before it makes it to prod, just because we still have to make it robust, make it HIPAA compliant, ZDR compliant, figure out the right contracting with the vendor, whatever it is. And we need to eval it, because we want the team to still maintain what they build. That's the one thing: if we have a bunch of prototypes, it can't just be a small group of people that then maintain whatever ends up shipping. So we have invested a lot of people in eval and model-behavior understanding teams. We call it agent dev velocity: your dev velocity building agents can be faster if we invest in that platform. And so we have a whole org dedicated to agent platform velocity, so that you can build your own eval and then maintain it once you ship it. So if a new model release comes out and we... [00:21:38] swyx: Every team maintains their own eval? [00:21:40] Sarah Sachs: We maintain the eval framework. Every team owns their own evals, and a lot of them we've integrated, opt-in, into CI, or we run them nightly, and we have a custom agent that pings the team to look at the major failures. That's really critical, because if we have all these different surfaces now, a lot of it's on the same agent harness, so it's easier to maintain. It's just packaging of different agent harnesses, but new functionality of the agent. Let's say that they deprecated Sonnet 4 or whatever it is and we need to auto-update... [00:22:11] swyx: Have they already? That's so, okay. Yeah. Actually, that wasn't that long ago.
[00:22:14] Alessio: They were just 3.5. [00:22:15] Sarah Sachs: 3.5 and 3.7 just got deprecated. [00:22:18] swyx: 3.7, 5.2, or, yeah. No. [00:22:20] Sarah Sachs: It's not. 5.2 is, five point... no. Yeah, five-four is 40% more expensive than five-two. So if they deprecated five-two, you would hear, you would hear from me about that one. But, uh, another conversation to have. [00:22:35] swyx: I have a cheeky evals question for you. Have you noticed any secret degradation from any of the major model providers? [00:22:40] Sarah Sachs: Secret degradation? [00:22:42] swyx: Like, during the workday, when it's high traffic, it suddenly gets dumber. [00:22:47] Sarah Sachs: Yeah. I mean, we definitely notice flakiness. We've definitely noticed, particularly for some providers, that things are slower during working hours. [00:22:57] swyx: So there's a latency argument. Yes. Not a quality argument. [00:22:59] Sarah Sachs: No. I think the quality difference that's interesting is, and really this gets into quantization, but companies that say they're selling the same model through different vendors, whether it be first party or Bedrock, Azure, et cetera: we do see different qualities sometimes, and that's not necessarily what's advertised. [00:23:21] swyx: Yeah. Kimi went to the point of, they shipped this eval across all the providers, and it was very obvious who was secretly quantizing. [00:23:28] Sarah Sachs: Yeah. [00:23:29] swyx: That's very embarrassing. [00:23:30] Sarah Sachs: You know, we hire subprocessors to figure that out for us. So we just wanna understand where it's regressing or where it's optimized. And sometimes we're okay with regressions that optimize latency, if they're the appropriate regressions.
Our job is to make sure we have the evals to understand the changes that are important to us. And even when we're partnering with labs on pre-releases of models, they'll send us multiple snapshots. And this is less about quantization and more just regressions: they have shipped models that were not the snapshots that we wanted, and they have changed the snapshots that they shipped based on the feedback that we give, because our feedback tends to be more enterprise-work focused and not coding-agent focused. And definitely those can be bummers, like, you know, we know this wasn't the version you wanted, but we'll help you make it work. I mean, we always make it work, but that definitely happens. [00:24:16] Alessio: Yeah. Do you have failing evals that you're just hoping will succeed eventually, when a good model comes out? [00:24:23] Sarah Sachs: I mean, yeah. I could talk about this for 60 minutes, so I will limit myself. I think it's a real issue when people say "evals" and it's just, that's quality. It's like saying "testing": it's not just unit tests, right? So we have the equivalent of unit tests, regression tests. Those live in CI; those have to pass a certain percent, you know, within some stochastic error rate. Then we have, as you're building a product, evals of: these aren't passing right now, and this is launch quality. So we have a report card, and we need to be at, you know, 80 or 90% on these categories of user journeys to launch. And then we have what we call frontier or headroom evals, where we actively wanna be at a 30% pass rate. And that's actually been an effort that we took on in partnership with Anthropic and OpenAI in the past maybe two or three months, because we actually hit a point where our evals were saturated and we weren't able to really give insightful feedback other than "it wasn't worse."
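The three eval tiers Sarah lays out (regression evals gating CI within some stochastic tolerance, launch report cards at roughly 80-90%, and frontier/headroom evals targeted near a 30% pass rate) could be sketched like this. This is a hypothetical illustration only; the names and exact thresholds are assumptions, not Notion's actual framework:

```python
from dataclasses import dataclass

@dataclass
class EvalSuite:
    name: str
    tier: str          # "regression" | "launch" | "frontier"
    pass_rate: float   # observed pass rate, 0.0-1.0

# Assumed per-tier targets, loosely based on the conversation.
TARGETS = {"regression": 0.95, "launch": 0.85, "frontier": 0.30}

def gate(suite: EvalSuite, tolerance: float = 0.02) -> bool:
    """True if the suite meets its tier's target within a stochastic tolerance."""
    return suite.pass_rate >= TARGETS[suite.tier] - tolerance

def is_saturated(suite: EvalSuite) -> bool:
    """A frontier eval near 100% no longer gives signal about headroom."""
    return suite.tier == "frontier" and suite.pass_rate > 0.80
```

The interesting inversion is the frontier tier: unlike CI gates, a high pass rate there is a problem, because a saturated eval can no longer tell the labs anything beyond "it wasn't worse."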
And not only is that not helpful for our partners, it's not helpful for us to understand where the stream is going, you know, going back to that analogy. And so we spent a lot of time thinking about what Notion's Last Exam looks like, right? Mm-hmm. Not just Humanity's Last Exam. Ooh, Notion's Last Exam. Mm-hmm. And there are a lot of dreams about what that would look like. I know we've talked a lot about benchmarking, swyx, but yeah, Notion's Last Exam is a big thing inside the company, and we have people full-time staffed to it exclusively. Mm. We have a data scientist, a model behavior engineer, and a full-time evals engineer dedicated just to the evals that we pass 30% of the time. [00:25:56] swyx: That's what you're hiring for? [00:25:57] Sarah Sachs: MBEs? I am hiring. [00:25:58] swyx: What is an MBE? [00:25:59] Sarah Sachs: Model behavior engineer. Model behavior engineers started with the title "data specialist" before I joined, when they were working with Simon on Google Sheets, and Simon just needed someone to look through Google Sheets and say, yes, no, this looks bad, this looks good. Right? And so we hired people with kind of diverse linguistics backgrounds. We had a linguistics PhD dropout and a Stanford new grad. And they're amazing, and they formed a new function, basically. And over time we've built a whole team, with a manager who's now kind of reinventing what that role is with coding agents. So they used to be kind of manually inspecting code; now they're primarily building agents that can write evals for themselves, or LLM judges. There's a really funny day, I can send you the picture, where Simon, about a year and a half ago, was teaching them how to use GitHub. They're at the whiteboard, and it was like, okay, I think it would be so much faster if our data specialists learned how to use GitHub and learned how to commit these things in the code.
And that was then, and now coding has become a lot more accessible. But moving forward, it's this mix of data scientist, PM, and prompt engineer, because there's craft in understanding even what models can and can't do. How do we define that headroom? How do we define what a good journey is? Is this model better or not? Why is this failing? There's some qualitative work, but then there's also a lot of instinct and taste to it, and that's not necessarily software engineering. And so we have very firm conviction, and have had for a number of years now, that that is its own career path, and we have always welcomed the misfits, so to speak. So we really firmly believe that you don't need an engineering background to be the best at this job, and that's what's quite unique about this particular role. [00:27:37] Simon Last: Yeah, this is something that I've been pretty excited about recently: we made an effort basically to treat the eval system as an agent harness. So if you think about it, you should be able to have an agent, end-to-end, download a dataset, run an eval, iterate on a failure, debug, and then implement a fix. And ultimately you should be able to drive the full process with a human sort of observing the outer system. So yeah, we went pretty hard on that, and that's worked extremely well so far. It's basically just turning it into a coding agent problem. [00:28:11] swyx: Your coding agent, or just whatever harness? [00:28:13] Simon Last: No, any coding agent. Yeah, Claude Code... it should be totally general. I think it would be a mistake to fix it on any particular coding agent. At the end of the day, it's just CLI tools. [00:28:21] Sarah Sachs: It's the same way that you would have a coding agent write the unit tests: you should have a coding agent write the eval.
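The loop Simon describes, an agent running the eval, iterating on failures, and patching until green while a human observes the outer system, might look like this minimal sketch. `run_eval` and `coding_agent_fix` are hypothetical stand-ins for the real CLI tools, not Notion's actual harness:

```python
def eval_fix_loop(run_eval, coding_agent_fix, max_iters: int = 5):
    """Run eval -> collect failures -> let the agent patch -> repeat.

    run_eval: callable returning a list of failing cases (e.g. by invoking
        an eval CLI and parsing its output).
    coding_agent_fix: callable given the failures (e.g. shelling out to any
        coding agent, which is why "it's just CLI tools").
    Returns (passed, iterations_used); a human reviews the resulting trail.
    """
    for i in range(1, max_iters + 1):
        failures = run_eval()
        if not failures:
            return True, i          # eval is green; stop
        coding_agent_fix(failures)  # agent debugs and implements a fix
    return False, max_iters         # budget exhausted; escalate to a human
```

The key design point is that the loop is agnostic to which coding agent does the fixing; only the two callables touch agent-specific tooling.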
[00:28:26] swyx: Yeah. [00:28:26] Sarah Sachs: But there's a lot of supervision in that still. We just don't believe that supervision has to come from software engineers, because a lot of it is kind of UX-y and whatever, and these are the people that also triage failures and tell us where we should be investing next. [00:28:40] swyx: Yeah. I'm gonna go ahead and ask a spicy question. Is there a day when there are no software engineers at Notion? [00:28:46] Simon Last: Um... [00:28:46] Sarah Sachs: What does it mean to be a software engineer? [00:28:47] swyx: Exactly. [00:28:48] Simon Last: I mean, I think the way things are going is, we're on some continuum where, if you look back three years ago, humans were typing all the code, and then we had autocomplete, so you're typing less of the code. Then we had agents filling in lines, and now we're getting into agents doing longer-range tasks, where one can debug and implement a fix and then verify it works and get your PR merged and deployed. I think we're just moving up the abstraction ladder, and then the human role becomes more about observing and maintaining the outer system. There's a stream of agents flowing through, making PRs: what's going off the rails? What do I need to approve? Is there a learning or memory mechanism that works? So it's kind of a hard engineering problem; there's a lot to do there. I think we're just moving up the stack. [00:29:34] Sarah Sachs: It's the same transition machine learning engineers have made, right? Like, I haven't looked at a PR curve in a while. [00:29:39] swyx: Yeah. You used to do this stuff, and now automated research can do it. [00:29:42] Sarah Sachs: Right. I think it depends on what you define as a software engineer. [00:29:46] swyx: Yes. That's changing, for sure.
[00:29:49] Sarah Sachs: I think every software engineer at Notion this summer went through, as one of our engineering leads at the company called it, the identity crisis that every manager goes through, where all of a sudden they realize their ability to write code is less important than their ability to delegate and context-switch. And I think that is a transition out of being a software engineer. [00:30:12] Simon Last: Yeah. There's a critical difference from being a manager, though, which is that the problem is actually very deeply technical. You know, humans are very fuzzy, and you can't treat a team of humans like a rigorous system where PRs flow through and can be in a blocked status, and then what happens when they're blocked, right? With a set of agents, you actually can do that. And I think there's a lot of interesting technical rigor that goes into that. It's a technical design problem, ultimately. [00:30:42] Alessio: What is the design of the software factory that you're building? [00:30:46] Simon Last: Yeah, I mean, we're trying a lot of different things. Ultimately you want to design a system that requires as little human intervention as possible, while still maintaining the invariants that you care about. So we're exploring a lot of different ideas there. I could talk about a few things I think are important. One thing I think is really important is having some kind of specification layer. You can just commit markdown files. Mm-hmm. That works pretty well, but... [00:31:15] swyx: It's nice to be Notion, man. I'm just saying, the natural home for specs is Notion. [00:31:21] Simon Last: Yeah, right. It can be a database of pages. Yeah.
I mean, it needs to be something that is human-readable and reviewable, and I think that's pretty key. Another really key component is the self-verification loop. Yes. You need really, really good testing layers, basically, and that's a really deep problem. But you get that right, and then it's kind of the workflow of: what happens when there's a bug? How does it flow into the system? Is it a subagent working on it? How does it make a PR, and how does that get reviewed? So there's the flow, or process. [00:31:56] swyx: Yeah. Cool. One thing we did work out before you guys came in was this demo, this agent demo. [00:32:04] Alessio: Every time we do an episode, we try the product, right? I don't think there's ever been an episode where I haven't tried it. [00:32:11] swyx: And we try... "try" is a big word. Since day one, Latent Space has been on Notion, but this is the net new thing. Yes. [00:32:18] Alessio: So this is for Nel Labs, which is the space we're in. So next week we're opening applications for tenants, so there's a web form; we got this form done here. Before, the workflow would be: I get an email, then I look at the person, like, should I spend time talking to this person? Then I respond, they respond back. So I built this. The name, it came up with on its own. How does it come up with its own name? [00:32:43] Simon Last: Yeah, that's a pretty good app name. It's just a random name generator. [00:32:47] Alessio: Oh, that's funny. [00:32:49] Simon Last: The fact that it picked that is kind of hilarious. I'm pretty sure it's just deterministic. [00:32:54] Sarah Sachs: "Resilient Collector." I think I've never looked at the code for that.
I've never second-guessed it. I think it's kind of a Mad Libs situation. [00:33:00] Simon Last: Yeah, I think you're right. It's totally deterministic. Oh, I thought it was great. Yes. Although, if you use the AI to set itself up, it can update its own name. Okay. [00:33:11] Sarah Sachs: How did you create it? Did you just prompt it? [00:33:13] Alessio: I did, yeah. I just said, check my inbox for applications for a coworking space, keep a people database. So it created the database for me, which I have here. And I guess a database is like a Notion table, because everything is Notion. And then whenever an email comes in, like here, it just creates a new row for the person. Mm-hmm. And then it uses web search to enrich the profile. So it searches the web and it's like, this is who this person is, this is when they say they wanna move in, and it updates everything else. I mean, it's not AGI, but to me, I don't wanna do this work. It took me maybe 15 minutes to set up the whole thing. And I really like that most of the information lives here, you know; it's not some other tool asking me [00:34:01] Sarah Sachs: Yeah. [00:34:01] Alessio: to bring my stuff there. I would've probably already created a Notion thing anyway. [00:34:06] Sarah Sachs: Mm-hmm. Most of our biggest use cases and gains are from that extra layer of human involvement in the process to make it just right. So one of our biggest use cases is bug triaging: if someone posts something in Slack, you can just have a custom agent that lives there that has its own routing constitution of what team this belongs to, creates a task in your task database, and then posts in that Slack channel, right?
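The triage flow Sarah describes (a routing "constitution" mapping a report to a team, a row filed in a task database, and a reply posted back in the channel) could be sketched like this. The keywords, team names, and reply format here are all invented for illustration, not Notion's internals:

```python
# Assumed routing rules: keyword -> owning team.
ROUTING = {
    "offline": "editor",
    "sql": "query-engine",
    "agent": "ai-platform",
}

def triage(message: str, task_db: list) -> str:
    """Route a report to a team, append a task row, return the channel reply."""
    team = next(
        (t for kw, t in ROUTING.items() if kw in message.lower()),
        "general",  # fallback when no rule matches
    )
    task_db.append({"team": team, "report": message})
    return f"Filed for @{team} (task #{len(task_db)})"
```

In the real product the "constitution" would presumably be a prompt the agent interprets rather than a keyword table, but the shape, classify then write to a database then reply, is the same.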
That's like one of the first things that we built internally, I think, and it's completely changed the way that Notion functions as a company. Nothing falls through... well, most things don't fall through the cracks. We don't know what we don't know. But it's not replacing people, it's replacing processes. [00:34:44] Alessio: Yeah. [00:34:44] Sarah Sachs: Right. [00:34:45] Alessio: And I'm curious how you think about composability of these things. So the other one I was working on is like a lease filler. So whenever somebody signs up as a tenant, it kind of fills out the lease for them. There should probably be some office manager agent that can handle the request, make the lease, and then give them access to the office and all of that. How do you think about that future? [00:35:08] Simon Last: Yeah, so there's two ways you can compose. One way is by using the data primitives. So you can have one agent writing to the database, and there's another agent that's watching the database. So that's one way they can coordinate that's a little bit more decoupled, and it works really well. Or you can couple them. So, I think it's actually not released yet, we're releasing it like next week: in the settings for an agent, you can give it access to invoke any other agent. [00:35:34] swyx: Hmm. [00:35:34] Simon Last: So you can have them just talk directly. [00:35:37] swyx: Is there a limit on, like, number of recursions, or just... [00:35:40] Simon Last: Um, probably. [00:35:42] swyx: You know what I mean? You can just get an infinite loop that way. [00:35:45] Simon Last: There's some kind of... yeah. [00:35:46] Sarah Sachs: I think there is actually a number somewhere. [00:35:49] swyx: I believe. I'm just, you know, like, someone's gonna screw up.
[00:35:51] Simon Last: You should try it and see. [00:35:53] swyx: Yeah. I mean, everything's gonna be paperclips. [00:35:55] Simon Last: Oh, yeah. But that's really useful. So, you know, I helped someone internally the other day. They had built over 30 custom agents for our go-to-market team, doing all kinds of different things. For example, researching, like filling in information about a customer, or triaging customer feedback, something like that. Literally over 30 of them. And then he even made a database of all the agents, and then he's like, okay, and now I'm getting over 70 notifications per day, just the agents being blocked on various things. And I was like, oh, okay, cool, the obvious thing to do there is to make a manager agent, [00:36:32] Sarah Sachs: Right. [00:36:33] Simon Last: that's gonna sort the blocks, be another abstraction layer in between you and your 30 agents. So yeah, we set him up with a manager agent that has access to invoke all the other agents, and it's sort of watching and observing them, and it just creates a layer of abstraction. So instead of 70 notifications per day, it's like five. And then the manager agent can help debug and fix any problems with the... [00:36:54] swyx: Is there a concept of like an inbox or something? You're basically saying that they can message each other? [00:37:00] Simon Last: Yeah. [00:37:01] Sarah Sachs: Well... [00:37:01] swyx: They use the system of record, which is Notion. [00:37:02] Sarah Sachs: So we... [00:37:03] Simon Last: Actually, yeah, we didn't make any special concepts at all.
[00:37:06] swyx: They're subscribed to the Notion notifications that I would've gotten. [00:37:09] Sarah Sachs: They can just write a task to a database that the other agent is listening to, or they can actually call a webhook to the agent, or they can just @ the agent. Okay. [00:37:17] Simon Last: Yeah, I mean, this is something that we're still working on. Generally, the way we do these things is, you first make it possible, maybe in a sort of janky way. So the way I set 'em up is, we created a new database that was sort of like issues the custom agents were experiencing, and then gave them all access to file an issue, and the manager has access to read the issues. And that works pretty well, essentially giving it its own internal issue tracker just for the agents. And then, if that becomes a concept that seems useful generally, maybe we will think about how to package it in. But generally we try to just keep it to composing the primitives if we can. You know, another example of this is, we have no built-in memory concept. Memory is just pages and databases. And so if you wanna give an agent memory, just give it a page and give it edit access to that page. [00:38:03] swyx: And the human can edit it, the agent can edit it. [00:38:04] Simon Last: Yeah. And that pattern works extremely well. And depending on the use case, you can have it be just a page, or it could be an entire database, or it can have sub-pages; it's pretty open-ended what you can do with that. [00:38:15] Alessio: So when I was setting this up, I connected my inbox and it was like, do you wanna use Gmail or Notion Mail? And I'm like, I don't wanna use either, I just want you to do it.
I'm curious how you think about Notion Mail, Notion Calendar, all of these kinds of UI/UX interfaces, full-stack...
[00:38:29] Simon Last: Notion.
[00:38:30] Alessio: Yeah. When, at the same time, you have the agents abstracting them away from you, in a way. How do you spend the product calories, so to speak?
[00:38:37] Simon Last: Yeah, I think it's pretty important that you don't have to use Notion Mail to connect to the mail capability. You can just connect to Gmail or whatever you want to use. And we're thinking of the mail service as being really great to the extent that it's really agent-built, right? So maybe the mail app is just sort of a prepackaged agent that helps you automate your inbox.
[00:39:00] Alessio: Yeah, the auto-labeling is great.
[00:39:03] Sarah Sachs: I think when we integrate with Gmail, for instance, we have a series of tools available via MCP or API to Gmail. When we integrate with Notion Mail, we have the Notion Mail engineering team build us the exact right tools that optimize latency, performance, and quality. They own that quality; there are product leads there who are directly thinking about the user problems that happen in mail. So when we build integrations and connections, we build natively first, and then think about extending them, partly because it's also easier to build natively first. So that tends to be how we phase things out.
[00:39:43] swyx: Talking about integrations, you prompted me, so I gotta ask: MCP versus CLI. What's going on? What's the...
[00:39:48] Simon Last: Yeah, my opinion. I'm definitely bullish and excited about CLIs. There are a few really cool things about CLIs. One really cool thing is that it's in the terminal environment, so it gets a bunch of extra power.
For example, it can paginate and cursor through long outputs. And it has progressive disclosure inherently: you don't see all the tools at once, you just see the CLI wrapper, and you can use the help commands and read files. And then I think the most important thing that's super cool is that it's also inherently bootstrappable. If there's an issue, the agent can debug and fix it itself, within the same environment where it uses the tool.
[00:40:30] swyx: Mm.
[00:40:30] Simon Last: Right. I saw a tweet this morning; someone said, my agent didn't have a browser, so I asked it to make itself a browser tool, and within a hundred lines of code it gave itself a little browser, wrapping the Chromium API. That's pretty incredible. And if there was a bug, it would just immediately try to fix it. On the other hand, if you use something like the Chrome DevTools MCP, I've had this issue where sometimes the transport gets messed up. If it gets messed up, the agent has no way to fix itself; it no longer has a browser, it's just broken. I think that's pretty fundamental. But I would say a lot of the bad things about MCP can be fixed. Progressive disclosure can be fixed with the right harness; it obviously doesn't make sense to show it all the tools all the time, and that's not really inherent to the MCP protocol, it's just how you wrap it and use it.
[00:41:16] swyx: There are many poorly built MCPs because we didn't know.
[00:41:19] Simon Last: Yeah, it was just early. The obvious thing to start with is to just show it all the tools, and then it's like, okay, now we have a hundred tools. And the tool calling actually works.
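The progressive-disclosure point can be made concrete with a tiny CLI wrapper: the agent first sees only the program name and its `--help` text, then drills into subcommands on demand, rather than being handed every tool schema up front. The tool names below are invented for illustration.

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    # A single entry point; individual tools are discovered via --help
    # and per-subcommand help, not listed all at once.
    parser = argparse.ArgumentParser(
        prog="agent-tool", description="Illustrative tool wrapper")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("search", help="search pages")
    sub.add_parser("create-page", help="create a new page")
    sub.add_parser("file-issue", help="file an issue for the manager agent")
    return parser

cli = build_cli()
args = cli.parse_args(["search"])
print(args.command)
```

The same binary also gives the agent a repair loop: if a subcommand misbehaves, the agent can read the script, patch it, and rerun it in the same shell, which is the bootstrapping property a broken MCP transport lacks.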
[00:41:28] swyx: So, let's... your success...
[00:41:29] Simon Last: ...give it a way to filter and search the tools. So yeah, broadly speaking, I'm really bullish on CLIs. I'm still bullish on MCPs in certain environments. In particular, I think MCP is really great when you want a narrow, lightweight agent. There are definitely a lot of use cases where you don't want a full coding agent with a compute runtime, and you also want it to be more tightly permissioned. MCP inherently has a really strong permission model: all you can do is call the tools. A CLI is a little bit murkier. Can it access the API token? Are you properly encrypting the token so the agent can't exfiltrate it? It introduces a lot of new issues, which are real and hard to solve. And MCP is just the dumb, simple thing that works, and it's pretty good.
[00:42:12] Sarah Sachs: I'll add two more perspectives, not on which works well for Notion, but on how Notion commits to both platforms. Notion is dedicated to being the best system of record for where people do their enterprise work, so we will always support our MCP insofar as other people are using MCPs. Regardless of our perspective, we've put a lot of effort into our MCP, and we have a fantastic team that we're building to do more there. The second thing I'll say is that lately I've been thinking a lot about making sure there's alignment between pricing and capability.
[00:42:43] swyx: Literally our next question.
[00:42:44] Sarah Sachs: Needing a language model to execute deterministic tasks feels wasteful, and requiring a language model to interface with third-party providers seems wasteful for tasks that don't require it, particularly because our custom agents use usage-based pricing.
We think of pricing as the barrier to entry for use of our product, and we're quite committed to making sure it's not wasteful. Not just because it's a bad deal for our customers, but because it's also bad business: we want to have as many buyers as possible, and there's an elasticity of demand. So if we can have our agents properly execute code that calls a CLI deterministically, it's a one-time cost, right? Versus constantly having a language model integrate with an MCP over and over, paying those repeated token fees. And if it's happening outside the cache window, you're paying for it over and over, and it's just unnecessary, and less deterministic, when it doesn't have to be.
[00:43:36] Alessio: Yeah, the open-endedness, I think, is the main thing. If I go write code to just call an API, I would never use an MCP. But then you need an MCP sometimes, when you know what to call but you don't want it to restart. The "it built a browser from scratch" thing is great when you're doing it on your own, but if your customers were having your AI write a browser from scratch every time, and you had to pay the token cost of that, yeah, you'd be like: no, no, the Chrome DevTools MCP is actually pretty great, just use that. I'm curious, how do you make that decision? Should it be just a straight API call, very narrow? Should it be an MCP? Should...
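Sarah's cost argument, one-time code generation versus paying model tokens on every MCP round-trip, reduces to simple break-even arithmetic. The token counts and prices below are made-up assumptions purely to show the shape of the trade-off, not real Notion or model pricing.

```python
# Assumed numbers, for illustration only.
TOKENS_PER_MCP_CALL = 2_000    # schema + arguments + result per run
CODEGEN_TOKENS = 20_000        # one-time cost to write a deterministic script
PRICE_PER_1K_TOKENS = 0.01     # assumed $ per 1K tokens

def mcp_cost(runs: int) -> float:
    # The model mediates every run, so cost grows linearly with runs.
    return runs * TOKENS_PER_MCP_CALL / 1000 * PRICE_PER_1K_TOKENS

def codegen_cost(runs: int) -> float:
    # Pay once to generate the code; later runs cost no tokens.
    return CODEGEN_TOKENS / 1000 * PRICE_PER_1K_TOKENS

breakeven = next(n for n in range(1, 10_000) if mcp_cost(n) >= codegen_cost(n))
print(breakeven)  # with these assumptions, codegen pays off after 10 runs
```

Prompt caching shifts the slope of the MCP line but not its linear shape, which is why tasks that run "over and over" outside the cache window favor the deterministic-code path.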