Latent Space: The AI Engineer Podcast

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

1 h 17 min · 15 April 2026

Description

For all those who missed out on London, see you in Miami [https://www.ai.engineer/miami] next week! Notion, the knowledge work decacorn [https://www.saastr.com/notion-and-growing-into-your-10b-valuation-a-masterclass-in-patience/], has been building AI tooling since before ChatGPT [https://www.notion.com/blog/introducing-notion-ai?utm_source=chatgpt.com], with many hits from Q&A in 2023 [https://www.notion.com/blog/introducing-q-and-a?utm_source=chatgpt.com] and unified AI in 2024 [https://www.notion.com/releases/2024-07-29] and Meeting Notes in 2025 [https://www.notion.com/blog/notion-ai-for-work?utm_source=chatgpt.com]. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0’s Custom Agents [https://youtu.be/KZ3hAy_XZwI?si=fqza-i0BAD2jYGyc&t=3133] - and they are finally embracing the Agent Lab playbook [https://www.latent.space/p/agent-labs?utm_source=publication-search]! Sarah Sachs [https://x.com/sarahmsachs] and Simon Last [https://x.com/simonlast] of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work. We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today — from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work. 
We discuss:

* Sarah and Simon’s path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production
* Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model
* The “Agent Lab” thesis [https://www.latent.space/p/agent-labs?utm_source=publication-search]: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities
* How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are
* Why coding agents feel like the kernel of AGI, and how Notion is thinking about “software factories” made up of agents that spec, code, test, debug, review, and maintain codebases together
* How Sarah runs AI engineering at Notion (“notes from Token Town [https://x.com/sarahmsachs/status/2031473087791902991]”): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to swarm around fast-changing opportunities
* The “Simon Vortex,” company hackathons, and why security gets pulled in early rather than late
* How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents
* Why prototypes have become much easier to build internally, and how “demos over memos” changes product development inside a tool the whole company already uses every day
* Notion’s eval philosophy: regression tests, launch-quality evals, and “frontier/headroom” evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going
* What a “Model Behavior Engineer” is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than just software engineering
* The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops
* How the “software factory” should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter
* A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database
* How agents compose inside Notion: shared databases as primitives, agents invoking other agents, “manager agents” supervising dozens of specialized agents, and memory implemented simply as pages and databases
* Notion’s take on MCP vs CLI: why Simon is bullish on CLI’s self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment
* The evolution of Notion’s internal agent harness: from early JavaScript coding agents, to custom XML, to Markdown and SQL-like abstractions, to tool definitions, progressive disclosure, and a much shorter system prompt
* Why Notion cares about teaching “the top of the class,” building for sophisticated operators rather than abstracting away too much capability for everyone
* How agent setup works today: agents that can configure themselves, inspect their own failures, and edit their own instructions, with guardrails around permissions
* How Notion prices Custom Agents: credits as an abstraction over tokens, model type, serving tier, web search, and future sandbox costs; why usage-based pricing was necessary; and how “auto” tries to match the right model to the right task
* Why Notion is not eager to train a foundation model, where they do fine-tune and optimize today, and why retrieval/ranking is one of the most important investment areas as more searches come from agents rather than humans
* Why Meeting Notes became one of Notion’s strongest growth loops: not just as transcription, but as high-signal data capture that powers search, custom agents, follow-up workflows, and the broader system of record for company collaboration
* Why Notion is more interested in being the place where collaboration data lives than in building hardware themselves, and how wearables or other capture devices may eventually feed into that system

Sarah Sachs

* LinkedIn: https://www.linkedin.com/in/sarahmsachs
* X: https://x.com/sarahmsachs

Simon Last

* LinkedIn: https://www.linkedin.com/in/simon-last-41404140
* X: https://x.com/simonlast

Full Video Episode

Timestamps

* 00:00:00 Introduction and launching Notion Custom Agents
* 00:01:17 Why Notion rebuilt agents four or five times
* 00:03:35 Building for where models are going, not just where they are
* 00:05:32 The Agent Lab thesis, wrappers, and product intuition
* 00:08:07 User journeys, leadership, and low-ego AI teams
* 00:13:16 The Simon Vortex, hackathons, and bringing security in early
* 00:16:39 Team structure, demos over memos, and building for agents
* 00:20:25 Evals, Notion’s Last Exam, and the Model Behavior Engineer role
* 00:27:37 Evals as an agent harness and the changing role of software engineers
* 00:30:42 The software factory: specs, verification, and agent workflows
* 00:32:18 Live demo: a custom agent for coworking space applications
* 00:35:08 Composing agents, manager agents, and memory as pages
* 00:38:15 Notion Mail, Gmail, native integrations, and tools
* 00:39:43 MCP vs CLI and the cost of capability
* 00:44:13 When Notion uses MCP vs building its own integrations
* 00:47:43 The history of Notion’s agent harness rebuilds
* 00:55:35 Power users, public tools, and the setup agent
* 00:58:01 Self-fixing agents, permissions, and “flippy”
* 01:01:13 Pricing, credits, and choosing the right model automatically
* 01:09:01 Why Notion isn’t training its own frontier model
* 01:14:07 Retrieval, ranking, and search built for agents
* 01:17:27 Meeting Notes as data capture and workflow automation
* 01:21:18 Wearables, hardware, and Notion as the system of record
* 01:23:45 Outro

Transcript

[00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I’m joined by swyx, editor of Latent Space.

[00:00:11] swyx: Hello. Hello. We’re back in the beautiful studio that, uh, Alessio has set up for us with Simon and Sarah from Notion. Welcome.

[00:00:18] Sarah Sachs: Thanks for having us.

[00:00:19] Alessio: Thanks for having us. Yeah.

[00:00:20] swyx: Congrats on the recent launch of custom agents. Finally it’s here. How’s it feel?

[00:00:26] Sarah Sachs: We ship things slowly. So it had been in alpha for a little bit, and at the point at which it’s in alpha, um, there’s a group of people that are making sure it’s ready for prod, and then there’s a group of people working on the next thing. So sometimes some of these launches are a bit delayed satisfaction, so it’s quite nice to remind yourself of all the work you did, because we do have a habit of, like, being two or three milestones ahead. Uh, just ’cause you have to be, you know, you can’t get complacent. Um, but it’s been great that people understood how this is helpful. And I think that’s just easier in general building AI tools today than it was two, three years ago. People kind of get it, and so that user education, um, there’s just, it was our most successful launch in terms of free trials and converting people and things like that. It was really successful, so yeah. But there’s a lot to build.

[00:01:12] swyx: Making it free for three months helps.

[00:01:16] Sarah Sachs: Yep.
[00:01:17] Simon Last: It was definitely super exciting for me because it’s probably the fourth or fifth time that we rebuilt that.

[00:01:22] swyx: Yes.

[00:01:23] Simon Last: And I mean,

[00:01:24] swyx: You’ve been building this since, like, 2022.

[00:01:26] Simon Last: Yeah, I mean, it was even right when we got access to GPT-4 in late 2022. One of the first ideas we had was, like, oh, okay, let’s make an agent. We used the word assistant at the time; there wasn’t really the word agent yet. But, oh, we’ll give it access to all the tools that Notion can do, and then it’ll run in the background and, like, do work for us. And then we just tried that many times and it just was too early.

[00:01:48] swyx: I need to force you to, like, double-click on that. What is too early? What didn’t work?

[00:01:52] Sarah Sachs: We were fine-tuning, like, before function calling came out. We were trying to fine-tune, with the frontier labs and with Fireworks, like, a function-calling model on Notion functions. This is right when I joined. I joined because, um, we needed a manager, as Simon needed to be able to go on vacation. So, uh, that’s around when I joined, so you can speak much more to it.

[00:02:11] Simon Last: Yeah, we did partnerships with both Anthropic and OpenAI at different times. I mean, when we first tried, there wasn’t even a concept of, like, tools yet. We sort of designed our own, like, tool-calling framework, and then we tried to fine-tune the models to, uh, use it over multiple turns. Um, and it didn’t work well out of the box, I think. Yeah. The models were just too dumb and the context window was also way too short.

[00:02:37] Alessio: Yeah.

[00:02:37] Simon Last: Um, and yeah, we just kind of banged our head against it for a long time. Uh, unfortunately, there was always, like, sort of
glimmers that it was working, but, um, it never felt quite robust enough to be, like, a useful, delightful thing. Um, until, I would say, the big unlock was probably, like, Sonnet 3.6 or 3.7, uh, early last year. And that’s when we started working on our agent, which we shipped last year. Um, and then custom agents is kind of a similar capability, and that one just took longer because we just wanted to get the reliability up a lot higher, ’cause it’s actually running in the background.

[00:03:14] Sarah Sachs: And the product interface of, like, permissions and understanding: you know, this custom agent is shared in a Slack channel with X group of people and has access to documents that are surfaced to Y group of people, and the intersection of X and Y might not be whole. And so how you build the product around making sure administrators understand that permissioning took multiple swings.

[00:03:35] Alessio: Everything is RBAC at the end of the day. Yeah. I’m curious, like, when the models are not working, how do you inform the product roadmap of, like, okay, we should probably build expecting the models to be better at some reasonable pace, but at the same time we need to, you know, you had a lot of customers in 2022. It’s not like you were a new company with no user base.

[00:03:54] Simon Last: Yeah, I mean, I think there’s always the balance of, you know, you want to be AGI-pilled and thinking ahead and building for where things are going. Uh, but also you wanna be shipping useful things. And so we always try to keep a balance there. You know, we try to take, like, a portfolio approach. We’re always working on multiple projects, and we’re always trying to work on, you know, maintaining things that have already shipped, shipping new things that are, like, eminently working well, and making them really good.
And then we wanna always have a few projects that are a little bit crazy. Um,

[00:04:23] Alessio: And what are the AGI-pilled projects that you have today? I’m curious about, uh, you don’t have to share exactly what you’re working on, but I’m curious what are things today that maybe in 18 months people will be like, oh, obviously this was gonna work.

[00:04:35] Sarah Sachs: 18 months.

[00:04:37] Alessio: Yeah, 18 months is, you know,

[00:04:37] Sarah Sachs: It’s a long time. Yeah.

[00:04:39] Simon Last: I mean, there’s a number of things happening. I think one thing that’s becoming more clear is, like, uh, coding agents are the kernel of AGI. Sort of, everything is a coding agent. I think that’s sort of one direction. Um, and then, yeah, the exciting thing about that is your agent can sort of bootstrap its own software and capabilities and actually debug and maintain them. And so, yeah, we’re thinking a lot about that. And then another category of things that I’m really excited about is, uh, what we call the software factory. Also, other people are using this, uh, this sort of word. Um, basically it just means: can you create, as automated as possible, a workflow for developing, debugging, merging, reviewing, and maintaining a codebase and a service, where there’s a bunch of agents working together inside, and, like, how does that work?

[00:05:28] Sarah Sachs: If you think back to your initial question, like, why did this take so long? I think something,

[00:05:32] swyx: I didn’t say that, but yes. Okay. Go ahead.

[00:05:34] Sarah Sachs: What changed over the three and a half years of trying it?

[00:05:37] swyx: Exactly. Right. Because most people always say, like, it didn’t work yet, then reasoning models came, then it worked. I was like, okay, let’s go a little bit.

[00:05:43] Sarah Sachs:
That’s, I mean, that’s part of it, but I think the other part of it, which I actually think is really what will set Notion apart for every new capability, is we have, like, two skills that are crucial when it comes to frontier capabilities. One is not letting yourself swim upstream. So, like, quickly realizing if you’re just pressing against model capabilities, versus not exposing the model to the right information or not having the right infrastructure set up. That in and of itself is a skill of intuition. And the second is to see, okay, you’re not swimming upstream: which direction is the river flowing, and how do we think ahead about the product and start building it even if it’s not great yet, so that when it is there, we’re ready for it? Right? And those can sometimes feel like counterintuitive things. Like, we can be trying to fine-tune a tool-calling model when they don’t exist yet. And the trick is to not do that for too long, but realize that there was something there. And we’ve had a lot of things where, um, we’re just, like, not swimming in the right direction with the stream. I think we had multiple versions of transcription before we got Meeting Notes, right?

[00:06:39] swyx: Oh, I gotta talk about that. Yeah.

[00:06:40] Sarah Sachs: Yeah. Um, and so I think that we really closely partner with the frontier labs on capabilities, and we also have to have strong conviction as those capabilities move. Notion is about being the best place for you to collaborate and do your work. And how does that narrative change if the way that we work changes? Yeah.

[00:06:58] swyx: Yeah. You told me you were a fan of the Agent Lab thesis, and this is kind of it, right?

[00:07:02] Sarah Sachs: Right. I show that thesis to so many candidates. Like, I have it as, like, my Chrome autofill.
Um, at this point, like, it’s one of my most visited.

[00:07:10] swyx: Because, like, is this the “here’s why you should work at Notion and not OpenAI”?

[00:07:14] Sarah Sachs: It’s like, here’s what’s different about it.

[00:07:16] swyx: Yeah.

[00:07:16] Sarah Sachs: And here’s why it’s not just a wrapper. I actually think more and more people understand it’s not just a wrapper.

[00:07:21] swyx: Yeah.

[00:07:22] Sarah Sachs: Um, and by the way, like, in the beginning, parts of what we build are wrappers on functionality that works well, of course. But I would say that’s not the product that drives revenue. And that’s not necessarily always what users need.

[00:07:35] swyx: I mean, you know, Notion is the AWS wrapper, but, like, the wrapper is very beautiful and very, very well polished. So

[00:07:40] Sarah Sachs: Like, the analogy,

[00:07:41] swyx: Like

[00:07:42] Sarah Sachs: the analogy that I’ve been coming back to is Datadog and AWS.

[00:07:45] swyx: Yeah.

[00:07:46] Sarah Sachs: So, uh, Datadog could not exist without cloud storage, right? It’s kind of fundamental that that works. Um, and AWS has, like, a CloudWatch product, but Datadog is an expert on understanding how people want observability on the products they launch. And we’re experts in understanding how people wanna collaborate, and that’s really where our expertise lies.

[00:08:04] swyx: Totally.

[00:08:04] Sarah Sachs: Um, regardless of the tools that we use.

[00:08:07] Alessio: I’m kind of curious how you think about implicit versus explicit expertise. I feel like Datadog is half and half implicit and explicit. It’s like they understand across markets and industries what engineering teams usually look for. With Notion, it’s almost like more of the expertise is at the edge, because you as a platform are so horizontal that the end user is not really the same.
Like, with Datadog, the end user is always, like, yeah, an engineering lead, a kind of SRE-related person. With Notion, it can be anything. So I’m curious how you put that expertise into a product versus, you know, obviously AWS cannot build Notion. That doesn’t quite work in this case, but

[00:08:44] Simon Last: It’s a little bit differently shaped. I think, you know, a classic vertical SaaS, like Datadog, is kind of like that. They understand their individual customer very deeply. It’s kind of a narrow slice. Um, Notion has always been super horizontal. And our task has always been to sort of balance these two somewhat opposing forces of, like, we’re listening to our customers and what they want us to build. It’s a broad slice. And then also we’re thinking about, like, okay, how do we decompose what they want into, uh, nice primitives that are really nice to use and will get us as much bang for the buck as possible. And then, you know, maintain the whole system, make it all, like, super clean and nice to use.

[00:09:22] Sarah Sachs: We still have user journeys. I mean, we still focus on, like, core. I actually think the failure of our team is when we focus too much on what are tools that are

[00:09:31] Simon Last: Mm-hmm.

[00:09:31] Sarah Sachs: cool tools. I actually think that’s when we have the least velocity, because you still need some sort of focus on a user journey. So, for instance, we’ll all sit down every Friday and look at the P99 of, like, the most token-exhaustive custom agent transcript, and just look at why it didn’t do well and cut a bunch of tasks. Like, we still focus on, like, this should work. Email triaging should work, right? And similarly, like, when we were chatting, um, before we started filming about, okay, how can I do PDF export? Well, that’s functionality that then merits.
Maybe we should build a tool that has access to a computer sandbox and a file system and the ability to write code, right? Um, but it’s because we’re thinking about the fact that our users, to do their daily work, need to export PDFs, not because we’re like, hmm, I think a computer tool could be cool, let’s just see what happens. Like, we have to focus on some user journeys, otherwise we just don’t have, like, enough strategy to prioritize.

[00:10:29] swyx: I think there’s a lot of, like, really strong opinions that you’ve had. Do you have, like, sort of a Tao of Sarah Sachs? Like, you know, how do you run your team? I feel like you just have accumulated all these strong opinions. Obviously part of this is your Token Town thing.

[00:10:43] Sarah Sachs: I think the Tao of working with Sarah Sachs is, um, it depends who you ask. Um, I think it depends if you’re on my team or a partner, right, or a vendor.

[00:10:54] swyx: Yeah. Other people want to run their teams the way that you’re, yeah, you’re, like, bringing these things. And then also, similarly, uh, Simon, when you did the custom agents demo, you had, like, well, we’ve been using custom agents and here’s the super long list of everything that we do. No humans ever read it, right? That’s what you said. I was like,

[00:11:07] Sarah Sachs: Yeah. So I think, for me, um, something that I learned very quickly and became very comfortable with was that my job was not to be the ideas person or the technical expert. My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on, and had an avenue to prioritize what they thought was important. And I think that’s true with all leadership, but I think especially on the AI team.
Almost all of our best ideas come from prototypes, from people that have a cool idea because they saw a user problem, and it’s a huge disservice if all of those ideas have to pass, like, the sniff test of what me and a product partner, or Simon and Ivan, decided was the direction, right? Because a lot of what we’re doing is leaning into capabilities. So I think that’s the first thing: I don’t really view the role of engineering leadership as, like, uh, hierarchical, nor has it ever been, but especially now, we’re very willing to change direction based on, um, like, the proof is in the pudding. Yeah. And I think we have rebuilt our harness three or four times. And when you do that, then the second rule of engineering leadership is, like, you need to build a team that’s comfortable deleting their own code and is very low ego and is driven by what’s best for the company. And, um, doesn’t write design docs because they think it’s their promotion packet, right? And that’s a culture that Notion had long before I joined, but, like, our willingness to just swarm on different problems and, um, redo things that we’ve built before because something has changed. Like, there’s a lot of friction that can happen at companies when you do that. And it doesn’t happen at Notion. And because it doesn’t happen, when new people join, they don’t wanna be the ones that are saying, we shouldn’t do this, I wrote that code. So then, you know, you create a culture that everyone thoughts, and that culture comes directly, I think, from Simon and Ivan though, um, because they’re very open-minded.

[00:12:50] swyx: Anything that you’d add?

[00:12:50] Simon Last: I’m not a manager like Sarah is. Um, a lot of my role is really to try to think a little bit ahead, make sure that we’re building on the right capabilities, and then, like, the prototyping stuff. And yeah, it’s really, really critical to always just be starting again.
It’s like, okay, this is a new thing. What does this mean? What if we just rethought everything or rewrote everything? And so I’m basically just doing that in a loop every six months.

[00:13:16] swyx: Yeah. Do you believe in internal hackathons for this stuff?

[00:13:19] Sarah Sachs: I think there’s, like, two different versions. So one is, like, we just have a solid bench of senior engineers that come and go between what we call the Simon Vortex and productionizing what we built, right? Because when you’re in the Simon Vortex, the velocity is super high. The direction changes daily, and it’s meant to be like the equivalent of a Skunk Works lab. We don’t need to do hackathons for that. We need to have senior engineers that we trust to come in and out of those projects. For instance, like, management boundaries are really loose. Like, you report to him, but you work for her right now. Yeah. That’s something that, when we hire managers, it’s important they don’t care about, because we tend to form more structures.

[00:13:54] swyx: Yeah. Don’t be too territorial.

[00:13:55] Sarah Sachs: We form more structure after we ship things, not before, just historically. Um, the second thing is we do have companywide hackathons. Actually, we just had our demo day this morning for the hackathon we had last week. That’s more for people that aren’t directly working on the project, feeling like they have the time to pause and learn how to make themselves more productive, or how they would use Notion custom agents to build something. Or, part of the hackathon was actually encouraging everyone across the company to build their own agentic tool-calling loop from scratch. Follow, like, an Every blog post on how to do it, I think, because we want

[00:14:26] swyx: Just with the compound engineering one. Yeah.

[00:14:28] Sarah Sachs: We want everyone in the company to use Claude Code, or whatever coding agent they please, and understand that fundamental. So we set aside a day and a half.
All of leadership encouraged everyone on their teams across the company to do it. So we have hackathons like that. I would say, kind of facetiously, everything we build is a little bit like a hackathon until it graduates and puts on big boy pants and has a product ops rollout leader and has an assigned data scientist and stuff like that.

[00:14:54] swyx: Security review, enterprise stuff.

[00:14:56] Sarah Sachs: Actually, security review is one of the things that we bring in first, because otherwise it just slows us down way more and, um, causes a lot of tension, and they build better product if they’re involved early. So, um, that is probably the first person to get involved in something.

[00:15:09] swyx: That’s the right PR-approved answer.

[00:15:10] Sarah Sachs: No, but it’s not just PR-approved. It’s, like, um,

[00:15:13] swyx: It’s actually real. It’s actually real. It’s like, um, I’m just saying

[00:15:15] Sarah Sachs: Scar tissue.

[00:15:15] swyx: Yeah.

[00:15:16] Sarah Sachs: Because, like, you know, my background’s also, I worked at Robinhood for a number of years. So, like, uh, compliance and things like that, um, you learn the hard way when it doesn’t come naturally.

[00:15:26] Simon Last: Yeah. I think the hackathon is really important for uplifting the general population, but, like, if that’s the only way you can build new things, you’re kind of toast. I mean, it has to be the daily process, like, you know, building these new things. Um, and I think in the AI era a lot more leverage accumulates to the most curious and excited people. And so we’re all about just, like, activating that energy. You know, if someone’s prototyping something on the weekend that they’re excited about and it’s important, that should be the main thing that we’re doing. Yeah. Um, it’s not a hackathon that we schedule once a quarter, it’s just, like, yeah, daily process. Part of the culture.
[00:16:02] Sarah Sachs: I mean, that’s how we shipped image generation in Notion now. It was always this thing that would be kind of nice to have, but it wasn’t really clear where that was necessarily aligned in product priorities. It’d be a lot of work. And we had someone on the database collections team, Jimmy, who was like, I really wanna do image generation for cover photos and inside Notion. And we’re like, if you wanna build it, like, do it, please. Like, we encourage you. We gave him all the resources of working directly with Gemini and being able to, like, track the token usage and it working through endpoints. We gave him eval support, everything, and then it became a full project.

[00:16:34] Alessio: Yeah.

[00:16:35] Sarah Sachs: That’s why you can’t have, like, ego as a leader. Like, that’s how we work.

[00:16:39] Alessio: What’s the size of the team today, both engineering and overall?

[00:16:43] Sarah Sachs: I manage, uh, the team that we’ll call core AI capabilities and infrastructure. That’s about 50 people. But then we have partner teams that do packaging: how it shows up in the corner chat versus custom agents versus meeting notes. That’s another 30, 40 people. And then every team that has a product surface at Notion that a user can interface with owns the tool that the agent interfaces with. The editor team, the team that did CRDT for offline mode, is the same team that handles how two agents, um, edit competing blocks, right? It’s the same problem. The team that built the underlying SQL engine is the same team that owns how the agent asks it to run a SQL query and does it performantly. And so, from that regard, anyone working on product engineering is tasked with making it work for customers that are humans and agents, because over time the majority of our traffic will be coming from agents using our interface, not humans.
Our objective is to make it so that the whole product org is building for agents.

[00:17:40] Alessio: Yeah. How has it changed internally? The activation bar is kind of lowered a lot. Like, anybody can create a prototype somewhat easily, especially if you’re in an existing codebase. Have you raised the bar on, like, what type of prototype people need to bring forward to be taken, not like seriously, but, like, you know what I mean?

[00:17:58] Simon Last: Yeah. I think the bar is lowered in many ways. Like, one thing that is really cool is our design team made a whole separate GitHub repo, uh, called the design playground. And it’s basically just to create a bunch of, like, helper components, uh, for quickly throwing together UIs. And it’s become actually quite sophisticated. Like, it has an agent in there and, uh, that’s pretty fun. So, like, they pretty much don’t do mocks, they just make, like, full prototypes.

[00:18:27] swyx: Here it is. It works.

[00:18:28] Simon Last: They give you, like, a URL. They’re like, okay, all right, so we have to make, like, the real production version of that. Um, and then for engineers, a prototype looks like just making it a feature flag that actually works. Like, that’s sort of the bar.

[00:18:39] Sarah Sachs: Something to understand that’s really unique about Notion, one of the reasons I joined, and we’re super lucky, is no one uses Notion in their job as much as people that work at Notion.

[00:18:46] Simon Last: Of course.

[00:18:47] Sarah Sachs: So I think there’s very few companies, maybe if you worked on Chrome, I guess, but, like, everything that we ship, we ship internally first and get a lot of really quick feedback. And also sometimes our dev instance is totally borked and you have to change a bunch of flags to get things done.
And that’s kind of like, but everyone, so people that do IT ticketing, people that do supply chain procurement, recruiting, everyone is using the same instance of Notion with a lot of flags on for these prototypes people build. Um, and so we have this, Brian Levin, one of the designers on our team, I think evangelized this concept of demos over memos. [00:19:18] swyx: Ooh, too good. [00:19:20] Sarah Sachs: Um, which has been, uh, very good for building demos, and I think it’s put a big pressure point on us to have really strong product conviction, because if anything can be demoed, you really need a strong filter of making sure that if, you know, you’re doing X amount of work, you’re focusing on one tower, you’re not just building a really flat hill. Right. That’s actually where I think there has to be more conviction from our PMs, um, and our designers and, well, the company really, to have conviction of what journey we’re going on. [00:19:52] Simon Last: But overall, I feel like it works pretty well. Like almost all the engineers have good enough taste to realize that, like, this prototype doesn’t actually make sense in the product, or it does. So it’s not that common that I would see a prototype and it’s like, oh, this makes no sense. Mm-hmm. It’s like, you know, people are doing reasonable things, and then it’s just a matter of which things we build first and then often just figuring out how to turn it on and off. In our experimental chat UI, there’s probably like a hundred checkboxes in there [00:20:22] Sarah Sachs: Kills me. [00:20:23] Simon Last: for the things you could turn on and off. [00:20:25] Sarah Sachs: Uh, but I think that, okay, so that is kind of true, Simon, but like being the person that manages the evals team, like there is a level of intensity that it adds to the platform team. 
So, you know, if we’re gonna do image generation in Notion, all of a sudden the way that we do attachments and the way that, um, our LLM completion, like, cortex talks and expects tokens back, and now it’s getting images back. Like there’s a lot of platform work that we do need to solidify a little bit. So sometimes it’ll be in dev for a couple weeks before it makes it to prod just because we still have to make it robust, make it HIPAA compliant, ZDR compliant, figure out the right contracting with the vendor, whatever it is. And we need to eval it because we want the team to still maintain what they build. That’s the one thing, is like if we have a bunch of prototypes, it can’t just be a small group of people that then maintain whatever N prototypes. So we have invested a lot of people in eval and model behavior understanding teams. We call it agent dev velocity. So your dev velocity building agents can be faster if we invest in that platform. And so we have a whole org dedicated to agent, um, platform velocity so that you can build your own eval and then maintain it once you ship it. So if a new model release comes out and we [00:21:38] swyx: Every team maintains their own eval? [00:21:40] Sarah Sachs: We maintain the eval framework. Every team owns their own evals, and a lot of them we’ve integrated to opt in to CI, or we run them nightly, and we have, uh, a custom agent that triggers a team to look at the major failures. That’s really critical because if we have all these different surfaces now, a lot of it’s on the same agent harness, so it’s easier to maintain. It’s just packaging of different agent harnesses, but new functionality of the agent. Let’s say that we wanna update, like, uh, you know, they deprecated Sonnet, um, 4 or whatever it is and we need to auto update. [00:22:11] swyx: Are they already? That’s so, okay. Yeah. Actually wasn’t that long ago. 
[00:22:14] Alessio: They were just 3.5. [00:22:15] Sarah Sachs: 3.5, 3.7 just got deprecated. [00:22:18] swyx: 3.7, 5.2 or, yeah. [00:22:20] Sarah Sachs: No, it’s not 5.2. Yeah, 5.4 is 40% more expensive than 5.2. So if they deprecated 5.2, you would hear from me about that one. Um, but, uh, another conversation to have. [00:22:35] swyx: I have a cheeky evals question for you. Have you noticed any secret degradation from any of the major model providers? [00:22:40] Sarah Sachs: Secret degradation? [00:22:42] swyx: Like, during the workday, when it’s high traffic, it suddenly gets dumber. [00:22:47] Sarah Sachs: Yeah. I mean, not just between the, I mean, we definitely notice flakiness. We’ve definitely noticed, particularly for some providers, that things are slower during working hours and [00:22:57] swyx: That’s a latency argument. Yes. Not a quality argument. [00:22:59] Sarah Sachs: No. I think the quality difference that’s interesting is, um, it’s really into, like, quantization, but companies that say they’re selling the same model through different vendors, whether it be through first party or Bedrock, Azure, et cetera. We do see different qualities sometimes, and that’s not necessarily what’s advertised. [00:23:21] swyx: Yeah. Kimi went to the point of like, they shipped this eval across all the providers and it was very obvious who was secretly quantizing and it was very, [00:23:28] Sarah Sachs: Yeah. But [00:23:29] swyx: that’s very embarrassing. [00:23:30] Sarah Sachs: You know, um, we hire Subprocess to figure that out for us. So we just wanna understand where it’s regressing or where it’s optimized. And sometimes we’re okay with regressions that optimize latency if they’re the appropriate regressions. 
Our job is to make sure we have the evals to understand the changes that are important to us. And even when we’re partnering with labs on pre-releases of models, they’ll send us multiple snapshots. And this is less about quantization, but more just regressions. Like they have shipped models that were not the snapshots that we wanted, and they have changed the snapshots that they shipped based on the feedback that we give. Because our feedback tends to be more enterprise work focused and not coding agent focused. And definitely those can be bummers, like, you know, uh, we know that this wasn’t the version you wanted, but we’ll help you make it work. I mean, we always make it work, but that definitely happens. [00:24:16] Alessio: Yeah. Do you have, um, failing evals that you’re just hoping, oh, that will have success eventually when a good model comes out? [00:24:23] Sarah Sachs: Uh, I mean, yeah. So I think, I mean, I could talk about this for 60 minutes, so I will limit myself. I think it’s a real issue when people say evals and it’s just like, that’s quality, that’s like unit, I mean, it’s like saying testing. It’s not just unit tests, right? So we have the equivalent of unit tests, regression tests. Those live in CI, those have to pass a certain percent, you know, within some stochastic error rate. Then we have, as you’re building a product, evals of, these aren’t passing right now, and this is launch quality. So we have a report card and we need to, on these categories, you know, be at 80 or 90% on all of these user journeys to launch. And then we have what we call frontier or headroom evals, where we actively wanna be at a 30% pass rate. And that’s actually been an effort that we took in partnership with Anthropic and OpenAI in the past maybe two or three months, because we actually hit a point where our evals were saturated and we weren’t able to really give insightful feedback other than it wasn’t worse. 
And not only is that not helpful for our partners, it’s not helpful for us to understand where the stream is going. You know, going back to that analogy. And so we spent a lot of time thinking about what Notion’s Last Exam looks like, right? Mm-hmm. Not just Humanity’s Last Exam. Ooh, Notion’s Last Exam. Mm-hmm. And, um, there’s a lot of, you know, dreams about what that would look like. I know we’ve talked a lot about benchmarking, um, swyx, but, uh, yeah. Notion’s Last Exam is a big thing inside the company and we have people staffed to it full-time, exclusively. Mm. We have a data scientist, a model behavior engineer, and a full-time, um, evals engineer just dedicated to the evals that we pass 30% of the time. [00:25:56] swyx: Which you’re hiring for? [00:25:57] Sarah Sachs: MBEs? I am hiring. [00:25:58] swyx: What is an MBE? [00:25:59] Sarah Sachs: A Model Behavior Engineer. Model Behavior Engineers started with the title data specialist before I joined, when they were working with Simon on, like, uh, Google Sheets, and Simon just needed someone to look through Google Sheets and say, yes, no, this looks bad, this looks good. Right? And so we hired people with kind of diverse linguistics backgrounds. We had like a linguistics PhD dropout. Mm-hmm. And a Stanford ate new grad. And they’re amazing. And they formed a new function basically. And over time we’ve built a whole team, um, with a manager who’s now kind of reinventing what that role is with coding agents. So they used to be kind of manually inspecting code. Now they’re primarily building agents that can write evals for themselves, or LLM judges. There’s a really funny day, I can send you the picture, where Simon, about a year and a half ago, was teaching them how to use GitHub. Um, and they’re on the whiteboard and it was like, okay, I think it would be so much faster if our data specialists learned how to use GitHub and learned how to commit these things in the code. 
And that was then, and now I think, you know, coding has become a lot more accessible. Um, but moving forward it’s this mix of data scientist, PM, and prompt engineer, because there’s craft in understanding even what models can and can’t do. How do we define that headroom? How do we define what a good journey is? Um, is this model better or not? Why is this failing? There’s some qualitative work, but then there’s also a lot of instinct and taste to it, and that’s not necessarily software engineering. And so we have very firm conviction, and we have had for a number of years now, that that is its own career path, and we have always welcomed the misfits, so to speak. So we really firmly believe that you don’t need an engineering background to be the best at this job. And that’s what’s quite unique about this particular role. [00:27:37] Simon Last: Yeah, this is something that I’ve been pretty excited about recently: we made an effort basically to treat the eval system as an agent harness. So if you think about it, you should be able to have an agent, end-to-end, download a dataset, run an eval, iterate on a failure, debug, and then implement a fix. And ultimately you should be able to, you know, drive the full process with a human sort of observing the, you know, the outer, uh, system. So yeah, we went pretty hard on that. And that’s worked extremely well so far. It’s basically just to turn it into a coding agent, uh, problem. [00:28:11] swyx: Your coding agent or just whatever harness? [00:28:13] Simon Last: Any coding agent. Yeah, code, Claude Code. It should be totally general. Yeah. I think it would be a mistake to fix it on any particular coding agent. At the end of the day, it’s just CLI tools. [00:28:21] Sarah Sachs: It’s like the same way that you would have a coding agent write the unit test. You should have a coding agent write the eval. 
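Editor's sketch: the three eval tiers Sarah describes in this conversation (regression tests gated in CI, launch-quality report cards at roughly 80 to 90%, and headroom evals held near a 30% pass rate) could be expressed as a small classifier over pass rates. This is a minimal illustration, not Notion's framework. The names `EvalResult` and `assess`, the regression floor, and the saturation cutoff are all assumptions; only the launch bar and the 30% headroom target come from the discussion.

```python
from dataclasses import dataclass

# Hypothetical thresholds. Only the ~80-90% launch bar and the ~30% headroom
# target are mentioned in the conversation; the other numbers are assumptions.
REGRESSION_FLOOR = 0.95      # assumed CI gate, leaving room for stochastic error
LAUNCH_BAR = 0.80            # lower end of the "80 or 90%" launch bar
HEADROOM_SATURATED = 0.60    # assumed point where a headroom suite stops being informative


@dataclass(frozen=True)
class EvalResult:
    suite: str
    tier: str    # "regression", "launch", or "headroom"
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def assess(result: EvalResult) -> str:
    """Map a suite's pass rate to an action, depending on its tier."""
    rate = result.pass_rate
    if result.tier == "regression":
        return "ok" if rate >= REGRESSION_FLOOR else "block-merge"
    if result.tier == "launch":
        return "launch-ready" if rate >= LAUNCH_BAR else "not-ready"
    if result.tier == "headroom":
        # A headroom suite is meant to stay hard; a high pass rate means
        # it has saturated and needs harder cases.
        return "saturated" if rate >= HEADROOM_SATURATED else "has-headroom"
    raise ValueError(f"unknown tier: {result.tier}")


if __name__ == "__main__":
    for r in [
        EvalResult("search-regressions", "regression", 97, 100),
        EvalResult("custom-agent-journeys", "launch", 85, 100),
        EvalResult("last-exam", "headroom", 30, 100),
    ]:
        print(r.suite, assess(r))
```

The point of keeping the tiers separate is that the same pass rate means different things: 30% is a failure for a regression suite and exactly the intended state for a headroom suite.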
[00:28:26] swyx: Yeah. [00:28:26] Sarah Sachs: But there’s a lot of supervision in that still. We just don’t believe that supervision has to come from software engineers, because a lot of it is, um, kind of UXR-y and whatever, and these are the people that also triage failures and tell us where we should be investing next. [00:28:40] swyx: Yeah. I’m gonna go ahead and ask a spicy question. Is there a date where there are no software engineers at Notion? [00:28:46] Simon Last: Um, [00:28:46] Sarah Sachs: What does it mean to be a software engineer? [00:28:47] swyx: Exactly. [00:28:48] Simon Last: I mean, I think the way things are going is like we’re on some continuum where, if you look back three years ago, humans were typing all the code, and then we had autocomplete, so you’re typing less of the code. Then we had sort of like agents filling in lines, and now we’re getting into agents doing longer range tasks where you can debug and implement a fix and then verify it works and, you know, get your PR even, like, merged and deployed. I think we’re sort of just moving up the abstraction ladder, and then the human role becomes more about observing and maintaining the outer system. There’s a stream of agents flowing through, like, merging PRs. What’s going off the rails? Like, what do I need to approve? Is there a learning or memory mechanism that works? So it’s kind of a hard engineering problem. There’s, you know, there’s a lot to do there. I think we’re just sort of moving up the stack. [00:29:34] Sarah Sachs: The same transition machine learning engineers have made, right? Like I haven’t looked at a PR curve in a while. [00:29:39] swyx: Yeah. You used to do this stuff and now, um, auto research can do it. [00:29:42] Sarah Sachs: Right? Like I think it depends on what you define as a software engineer. [00:29:46] swyx: Yes. That’s changing for sure. 
[00:29:49] Sarah Sachs: I think every software engineer at Notion this summer went through like this, um, sheer, um, one of our engineering leads of the company called it, like, every software engineer is going through the, uh, identity crisis that every manager goes through, where all of a sudden they realize their ability to write code is less important than their ability to delegate and context switch. And I think that is a transition out of being a software engineer. But [00:30:12] Simon Last: Yeah. Yeah, there’s a critical difference to being a manager, which is that it is actually very deeply technical. The problem, you know, humans are very fuzzy and you can’t treat a team of humans like a rigorous system where, you know, PRs flow through and can be in like a blocked status, and then what happens when they’re blocked, right. With a set of agents, you actually can do that. And I think there’s a lot of interesting technical rigor that goes into that. It’s a technical design problem, ultimately. [00:30:42] Alessio: What is the design of the software factory that you’re building? [00:30:46] Simon Last: Yeah, I mean, I think we’re trying a lot of different things. I mean, ultimately you want to design a system that requires as little human intervention as possible, but still maintaining the invariants that you care about. So yeah, we’re exploring a lot of different ideas there. I mean, I think I could talk about a few things I think are important there. Like, one thing I think is really important is, um, having some kind of specification layer. You can just commit markdown files. Mm-hmm. That works pretty well, but [00:31:15] swyx: It’s nice to be Notion, man. I’m just saying, like, the spec, like, yeah. The natural home for specs is Notion. [00:31:21] Simon Last: Yeah. Right. It can be a database of pages. Yeah. 
I mean, it needs to be something that is, you know, human readable and I viewable and I think that’s pretty key. Another really key component is the self-verification loop. Yes. You need really, really good testing layers, basically. And that’s a really deep, uh, problem. But by getting that right, you know, and then it’s kinda like the workflow of, like, what happens when there’s a bug? How does it flow into the system? Like, is it a subagent working on it? How does it make a PR, and how does that get reviewed and merged? And then, you know, so there’s like the flow or process. [00:31:56] swyx: Yeah. Cool. Uh, you know, one thing we did work out before you guys came in was this demo or this [00:32:01] Simon Last: agents [00:32:02] swyx: agent demo. Uh, [00:32:03] Simon Last: so every, [00:32:04] Alessio: every time we do an episode, we try the product. Right. I don’t think there’s ever been an episode where I haven’t tried it. Yeah. Um, [00:32:11] swyx: and try is a big word. Like since day one Latent Space has been on Notion, but this is the net new thing. Yes. [00:32:18] Alessio: So this is for Kernel Labs, which is the space we’re in. So next week we’re opening applications for tenants. So there’s a web form, let me, we got this form done here. Uh, so, uh, before, uh, the workflow would be I get an email, then I look at the person. It was like, should I spend time talking to this person? Then I respond, they respond back. So I built this. So the name, it came up with on its own. Can you maybe, how does it come up with its own name? [00:32:43] Simon Last: Yeah, that’s a pretty apt name. It is just a random, it’s a random name generator. [00:32:47] Alessio: Oh, that’s funny. It just came, [00:32:49] Simon Last: The fact that it picked that is kind of hilarious. I’m pretty sure it’s just determined, [00:32:54] Sarah Sachs: resilient collector. I think I’ve never looked at the code for that. 
I’ve never second guessed it. I think it’s kind of like a Mad Libs situation. [00:33:00] Simon Last: Yeah, I think you’re right. Yeah. It’s totally, uh, deterministic. Oh, I thought it was great. Yes. Although, if you use the AI to set itself up, it can update its own name, so. Okay. Um, [00:33:11] Sarah Sachs: How did you create it? Did you just do [00:33:12] Alessio: classroom? I, [00:33:13] Sarah Sachs: okay. [00:33:13] Alessio: I did, yeah. I’ll say just check my inbox for applications for a coworking space. Keep a people, so it created the database for me. Which I have here. And I guess database is like a Notion table because everything is Notion. Um, and then whenever, um, an email comes in, like here, it just creates a new row for the person. Mm-hmm. And then it uses web search to enrich the, mm-hmm, the profile. So it kind of searches the web and it’s like, this is who this person is, this is when they say they wanna move in, and kind of updates everything else. This is, I mean, it’s not AGI, but to me, I don’t wanna do this work. So it feels like, I mean, it took me maybe like 15 minutes to set up the whole thing. Um, and I really like that most of the information should live here. You know, it is not like some other tool asking me [00:34:01] Sarah Sachs: Yeah. [00:34:01] Alessio: to, like, bring my stuff there. It’s like I would’ve probably already created a Notion thing. [00:34:06] Sarah Sachs: Mm-hmm. [00:34:06] Alessio: So [00:34:07] Sarah Sachs: Most of our biggest use cases and gains are from that extra layer of human involvement in the process to make it so, right. And so like one of our biggest use cases is bug triaging. So if someone posts something in Slack, can you just have a custom agent that lives there that has its own routing constitution of what team this belongs to, creates a task in your task database, and then posts in that Slack channel, right? 
Like that’s one of the first things that we built internally, I think. And it’s completely changed the way that Notion functions as a company. Nothing falls through, well, most things don’t fall through the cracks. We don’t know what we don’t know. But it’s not replacing people, it’s replacing processes. [00:34:44] Alessio: Yeah. [00:34:44] Sarah Sachs: Right. [00:34:45] Alessio: And I’m curious how you think about composability of these things. So the other one I was working on is like a lease filler. So whenever somebody signs up as a tenant, it kind of fills out the lease for them. There should probably be some agent that is like an office manager agent, mm-hmm, that can handle the request, make the lease, and then, uh, give them a ADA access to the office and all of that. How do you think about that feature? [00:35:08] Simon Last: Yeah, so I mean, there’s two ways you can compose. One way is by using the data primitives. So you can, you know, have one agent, uh, be writing to the database and there’s another agent that’s watching the database. So that’s one way that they can coordinate that’s a little bit more decoupled and, mm-hmm, works really well. Or you can couple them. So I think it’s actually not released yet, we’re releasing it like next week: in the settings for an agent, you can give access to invoke any other agent. [00:35:34] swyx: Hmm. [00:35:34] Simon Last: So you can have them just, uh, talk directly. So [00:35:37] swyx: Was there a limit on, like, number of recursions or just, [00:35:40] Simon Last: um, probably, [00:35:42] swyx: you know what I mean? Like, you can just get an infinite loop that way. [00:35:45] Simon Last: There’s some kind of, yeah, [00:35:46] Sarah Sachs: I think there is actually a number somewhere. [00:35:49] swyx: I believe, I’m just, you know, like, someone’s gonna screw up. 
[00:35:51] Simon Last: You should try it and see. [00:35:53] swyx: Yeah. I mean, everything’s gonna be paperclips. [00:35:55] Simon Last: Oh, yeah. Yeah. But, uh, but that’s really useful. Yeah. So, you know, I helped, uh, someone internally the other day. They had built like over 30 custom agents for, uh, our go-to-market team doing all kinds of different things. You know, for example, researching, you know, filling in information about a customer, or triaging customer feedback, or, uh, something like that. Literally over 30 of them. And then he even made a database of all the agents, and then he is like, okay, and now I’m getting over 70 notifications per day where the agents are blocked on various things. Uh, and then I was like, oh, okay, cool. You know, the obvious thing to do there is to make a manager agent, [00:36:32] Sarah Sachs: right? [00:36:33] Simon Last: That’s gonna sort of be another abstraction layer in between your, uh, 30 agents. Uh, so yeah, we ended up with a manager agent that has access to invoke all the other agents, and it’s sort of watching and observing them, and it just creates a layer of abstraction. So instead of 70 notifications per day, it’s like five. And then the manager agent can help, uh, debug and fix any problems with the, [00:36:54] swyx: Is there a concept of like an inbox or something? Like, you’re basically saying that they can message each other? [00:37:00] Simon Last: Yeah. [00:37:01] Sarah Sachs: Well [00:37:01] swyx: they use the system of record, which is [00:37:02] Sarah Sachs: Notion, so we [00:37:03] Simon Last: actually, yeah, we didn’t make any special concepts at all. 
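Editor's sketch: the two composition styles Simon describes (agents coordinating through a shared database, and a manager agent that can invoke other agents) can be illustrated in a few lines, along with the recursion cap swyx asks about. This is a minimal sketch under stated assumptions, not Notion's implementation. `SharedDatabase`, `manager_pass`, `invoke`, and `MAX_INVOKE_DEPTH` are hypothetical names, and the cap of 3 is an arbitrary choice; the speakers only say that some limit exists.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_INVOKE_DEPTH = 3  # hypothetical cap; the conversation only says a limit exists


@dataclass
class BlockedItem:
    agent: str
    reason: str
    resolved: bool = False


@dataclass
class SharedDatabase:
    """Stand-in for a database that worker agents write to and a manager reads."""
    rows: list[BlockedItem] = field(default_factory=list)

    def file(self, agent: str, reason: str) -> None:
        self.rows.append(BlockedItem(agent, reason))

    def open_items(self) -> list[BlockedItem]:
        return [row for row in self.rows if not row.resolved]


def manager_pass(db: SharedDatabase, can_fix: Callable[[BlockedItem], bool]) -> list[str]:
    """Resolve what the manager can handle; return only what needs a human."""
    escalations = []
    for item in db.open_items():
        if can_fix(item):
            item.resolved = True
        else:
            escalations.append(f"{item.agent}: {item.reason}")
    return escalations


def invoke(name: str, registry: dict, task: str, depth: int = 0):
    """Call an agent, handing it a callback for invoking others with a depth cap."""
    if depth >= MAX_INVOKE_DEPTH:
        raise RuntimeError("agent invocation depth limit reached")

    def call_other(other: str, subtask: str):
        return invoke(other, registry, subtask, depth + 1)

    return registry[name](task, call_other)


if __name__ == "__main__":
    db = SharedDatabase()
    db.file("research-agent", "missing CRM access")
    db.file("feedback-agent", "rate limited")
    # The manager retries rate limits itself and escalates everything else.
    print(manager_pass(db, lambda item: item.reason == "rate limited"))
```

The decoupled style needs nothing beyond read and write access to the same rows, which matches the point that no special messaging concept was added. The direct-invocation style is where a depth cap matters, since two agents with access to each other can otherwise loop indefinitely.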
[00:37:06] swyx: They’re interested in the Notion notifications that I would’ve got. [00:37:09] Sarah Sachs: They can just write a task to a database that the other agent is tasked with listening to, or they can actually call a webhook to the agent, like they can just @ the agent. Okay. [00:37:17] Simon Last: Yeah, I mean, this is something that we’re still working on. I think, you know, generally the way we do these things is, you know, you first make it possible, maybe in a sort of janky way. So I think the way I set them up is, you know, we created a new database that was sort of like issues, mm-hmm, that the custom agents were experiencing, and then gave them all access to file an issue, and then the manager has access to read the issues. Um, and that works pretty well. Essentially, give it its own internal issue tracker just for the agents. And then, you know, if that becomes a concept that seems useful generally, maybe we will think of how to package it in. But I mean, generally we try to just keep it to composing the primitives if we can. You know, another example of this is we have no built-in memory concept. Memory is just pages and databases. And so if you wanna give it a memory, just give it a page and give it edit access to that page and the [00:38:03] swyx: Human can edit it. Agent can edit [00:38:04] Simon Last: it. Yeah. And so that pattern works extremely well. And you know, depending on the use case, you can have it be just a page, or it could be an entire database with, you know, or it can have sub pages. It’s pretty open what you can do with that. [00:38:15] Alessio: So when I was setting this up, uh, I connected my inbox and it was like, do you wanna use Gmail or Notion Mail? And I’m like, I don’t wanna use either, I just want you to do it. 
I’m curious how you think about, you know, Notion Mail, Notion Calendar, all of these kind of UI/UX interfaces, full stack [00:38:29] Simon Last: Notion. [00:38:30] Alessio: Yeah. When at the same time you have the agents abstracting them away from you in a way, you know, how do you spend the product calories, so to speak? [00:38:37] Simon Last: Yeah, I mean, I think it’s pretty important that you don’t have to use Notion Mail to connect to the mail capability. So we can just connect to Gmail or whatever you want, uh, to use. And we’re thinking of the mail service as being really great to the extent that it’s really agent-built, right? So maybe the mail app is just sort of a prepackaged agent that helps you automate your inbox. [00:39:00] Alessio: Yeah, the auto labeling is great. [00:39:03] Sarah Sachs: I think when we, um, integrate with Gmail, for instance, we have a series of tools that are available via MCP or API to Gmail. When we integrate with Notion Mail, we have the Notion Mail engineering team to build us the, um, exact right tools that optimize latency, optimize performance and quality. They own that quality. Um, there’s product leads there. They’re directly thinking about the user problems that happen in mail. So it tends to be, when we build integrations and connections, we build natively first. Um, and then think about, um, extending them generally, just because it’s also easier, mm-hmm, um, to build natively first. Um, so that tends to be how we phase things out. [00:39:43] swyx: Talking about integrations, you prompted me, so I gotta ask. MCP, CLI. What’s going on? What’s the opinion? [00:39:48] Simon Last: Yeah. I think, I mean, I’m definitely bullish and excited about CLI. I think there’s a few really cool things about CLI. So one really cool thing is, um, that it’s in the terminal environment, so it gets a bunch of extra power. 
So, you know, for example, it can paginate and cursor through long outputs. Um, and it has progressive disclosure inherently. Uh, so, you know, you don’t see all the tools at once. You just see the CLI wrapper and you can use the help commands and read files. And then I think the most important thing that’s super cool is that it’s also inherently, uh, bootstrapped. So if there’s an issue, uh, the agent can debug and fix itself within the same environment that it uses the tool. [00:40:30] swyx: Mm. [00:40:30] Simon Last: Right. Like, you know, I think I saw a tweet this morning. Someone said, you know, my agent didn’t have a browser, so I asked it to make a browser tool, and within a hundred lines of code, it gave itself a little browser, like, wrapping the Chromium API, um. That’s pretty incredible. And then if there was a bug, it would just immediately try to fix it. Mm-hmm. Right. On the other hand, you know, if you use the Chrome DevTools MCP, I’ve had this issue where sometimes the transport gets messed up. If it gets messed up, the agent has no way to fix itself. It no longer has a browser, it’s now broken. Right. I think that’s pretty fundamental, but I would say a lot of the bad things about it can be fixed. Uh, so I think, like the progressive disclosure, that can be fixed with the right harness. Like, it obviously doesn’t make sense to show it all the tools all the time. That’s not really inherent to the MCP protocol. It’s just how you wrap it and use it. [00:41:16] swyx: There’s many poorly built MCPs because we didn’t know. [00:41:19] Simon Last: Yeah, yeah. I mean it was just early. Like, the obvious thing, uh, you know, to start with is to just show it all the tools, and it’s like, okay, now we have a hundred tools. Yeah. And like the tool calling actually works. 
So let’s [00:41:28] swyx: Victim of your success. [00:41:29] Simon Last: give it a way to filter or search the tools. So yeah, I would say, broadly speaking, I’m really bullish on CLI. I’m still bullish on MCPs in certain environments. I think, in particular, MCP is really great for when you want sort of a narrow, lightweight agent. I think there’s definitely a lot of use cases where you don’t want a full coding agent with a compute runtime. And also you want it to be more tightly permissioned. MCP inherently has a really strong permission model, like all you can do is call the tools. A CLI is a little bit murkier. It’s like, can it access the API token? Are you, like, properly sort of re-encrypting the token so it can’t exfiltrate it? It introduces a lot of new issues, which are real and hard to solve. And MCP is just the dumb simple thing that works, and it’s pretty good. [00:42:12] Sarah Sachs: I’ll add two more perspectives, not from it working well for Notion, but how Notion commits to both platforms. Notion is dedicated to being the best system of record for where people do their enterprise work. So we will always support our MCP insofar as other people are using MCPs, right? So regardless of our perspective, we’ve put a lot of effort into our MCP and we have a fantastic team that we’re building, um, to do more there. And the second thing I’ll say, I think, um, we all think a lot, but lately I’ve been thinking a lot about making sure there’s a value alignment in pricing, um, with capability. [00:42:43] swyx: Literally our next question. [00:42:44] Sarah Sachs: And needing a language model to execute deterministic tasks feels wasteful, and relying on a language model to interface with third party providers seems wasteful for tasks that don’t require it. And particularly because our custom agents are using usage-based pricing. 
We think of pricing as the barrier of entry for use of our product, and we’re quite committed to making sure that it’s not wasteful. Um, not just because it’s a bad deal for our customers, but it’s also bad business. We wanna have as many buyers, like, there’s an elasticity of demand. And so if we can have our agents properly execute code that calls a CLI deterministically, it’s a one-time cost, right? Versus constantly having a language model integrate with an MCP over and over and over and paying those repeated token fees, and if it’s happening outside the cache window, then you’re paying for it over and over and over, and it’s just kind of unnecessary and less deterministic when it doesn’t have to be. [00:43:36] Alessio: Yeah, the open-endedness I think is the main thing. It’s like, well, if I go write code to just call an API, I would never use an MCP. But then you need an MCP sometimes when you know what to call, but you don’t want it to restart, versus, like, I think the “it built a browser from scratch” is great when you’re doing it on your own, but if your customers were having your AI write a browser from scratch every time and you had to pay the token cost of that, yeah, you’d be like, no, no. The Chrome DevTools MCP is actually pretty great. Just use that. I’m curious, how do you make that decision? Like should it be just a straight API call, very narrow? Should it be an MCP? Should
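Editor's sketch: the progressive disclosure idea Simon raises in this exchange (do not show the model a hundred tools at once; give it a way to filter or search, then read details on demand) could look something like the following. It is a minimal sketch, not the MCP protocol or Notion's harness. `Tool`, `ToolCatalog`, and every tool name in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str   # one line, cheap to show to the model
    schema: dict   # full parameter schema, shown only on request


class ToolCatalog:
    """Expose short summaries first and full schemas only when asked."""

    def __init__(self, tools: list[Tool]):
        self._tools = {tool.name: tool for tool in tools}

    def search(self, query: str, limit: int = 5) -> list[tuple[str, str]]:
        """Return (name, summary) pairs whose name or summary matches any query word."""
        words = query.lower().split()
        hits = [
            (tool.name, tool.summary)
            for tool in self._tools.values()
            if any(w in tool.name.lower() or w in tool.summary.lower() for w in words)
        ]
        return sorted(hits)[:limit]

    def describe(self, name: str) -> dict:
        """Return the full schema for one tool, analogous to running its help command."""
        return self._tools[name].schema


if __name__ == "__main__":
    catalog = ToolCatalog([
        Tool("mail.search", "Search messages in the inbox", {"query": "string"}),
        Tool("mail.send", "Send a message", {"to": "string", "body": "string"}),
        Tool("db.query", "Run a query against a database", {"sql": "string"}),
    ])
    print(catalog.search("mail"))         # only the matching summaries
    print(catalog.describe("mail.send"))  # full schema for one chosen tool
```

The context cost then scales with the tools the agent actually looks at rather than with the size of the catalog, which is the property a CLI gets for free from its help commands.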

Shopify’s AI Phase Transition: 2026 Usage Explosion, Unlimited Opus-4.6 Token Budget, Tangle, Tangent, SimGym — with Mikhail Parakhin, Shopify CTO

Early bird discounts for the San Francisco World’s Fair [https://www.ai.engineer/wf], the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP! From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection [https://www.latent.space/p/wtf2025?utm_source=publication-search], and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability. We also go inside Tangle [https://shopify.engineering/tangle], Tangent [https://apps.shopify.com/tangent-1], SimGym [https://apps.shopify.com/simgym], which are three major AI initiatives that Shopify is doing to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP [https://www.shopify.com/ucp], Liquid AI [https://www.liquid.ai/blog/liquid-ai-announces-multi-year-partnership-with-shopify-to-bring-sub-20ms-foundation-models-to-core-commerce-experiences], and why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify’s customer simulation defensible, and what he learned from the Sydney era at Bing. 
We discuss: * Mikhail’s path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify * Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company * Shopify’s internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools * Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output * Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation * Why AI coding can still lead to more bugs in production even if models write cleaner code on average than humans * Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point * How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era * Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed * What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start * Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams * What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more * Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers * Why AutoML finally feels real in the LLM era, and where auto-research still falls short today * Why Tangle, Tangent, and SimGym become much more powerful when combined into one system * What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify’s data gives it a moat * How SimGym evolved from comparing A/B variants to telling merchants what to change on a single 
live storefront to raise conversions * Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs * How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications * Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice * Shopify’s new UCP and catalog work, including runtime product search, bulk lookups, and identity linking * Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice * Where Liquid already works inside Shopify today, from low-latency query understanding to large-scale catalog and Sidekick Pulse workloads * Whether Liquid could become frontier-scale with enough compute, and why Shopify remains pragmatic and merit-based about model choice * Who Shopify is hiring right now across ML, data science, and distributed databases * The Sydney story at Bing, why its personality was not an accident, and what Mikhail learned from deliberately shaping AI character early on Mikhail Parakhin * LinkedIn: https://www.linkedin.com/in/mikhail-parakhin/ [https://www.linkedin.com/in/mikhail-parakhin/] * X: https://x.com/MParakhin [https://x.com/MParakhin] Timestamps 00:00:00 Introduction: Mikhail Parakhin, Microsoft, and Shopify 00:01:16 Why Shopify Is Talking More About AI 00:02:29 Internal AI Adoption at Shopify and the December Inflection 00:06:54 Token Budgets, Jensen Huang, and Why Usage Metrics Can Mislead 00:10:55 Why Shopify Built Its Own AI PR Review System 00:12:38 AI Coding, More Bugs, and the Real Deployment Bottleneck 00:14:11 Why Git, PRs, and CI/CD May Need to Change for Agents 00:18:24 Tangle: Shopify’s Reproducible ML and Data Workflow Engine 00:21:19 Why Tangle Is Different from Airflow 00:26:14 Tangent: Auto Research for Optimization and 
Experimentation 00:30:07 How Tangent Democratizes Experimentation Beyond ML Engineers 00:33:06 The Limits of Auto Research 00:36:36 Why Tangle, Tangent, and SimGym Compound Together 00:37:20 SimGym: Simulating Customers with Shopify’s Historical Data 00:42:47 The Infra Behind SimGym 00:46:00 Why SimGym Gets Better with Real Customer History 00:47:30 Counterfactuals, HSTU, and Modeling Merchant Trajectories 00:51:55 CRPs, Clustering, and Category-Level Customer Behavior 00:53:30 UCP, Shopify Catalog, and Identity Linking 00:55:07 Liquid AI: Why Shopify Uses Non-Transformer Models 00:59:13 Real Shopify Use Cases for Liquid 01:03:00 Can Liquid Scale into a Frontier Model? 01:09:49 Hiring at Shopify: ML, Data Science, and Databases 01:10:43 Sydney at Bing: Personality Shaping and AI Character 01:13:32 Closing Thoughts Transcript [00:00:00] swyx: Okay. We’re here in the studio, a remote studio, with Mikhail Parakhin, CTO of Shopify. Welcome. [00:00:08] Mikhail Parakhin: Thank you. Welcome. [00:00:10] swyx: I don’t even know if I should introduce you as CTO of Shopify. I feel like you have many identities. Uh, you led sort of the, the Bing ML team, I guess, uh, uh, or ads team. I, I don’t know, I don’t know, uh, you know, it’s, uh, people va-variously refer you as like CEO or, or, uh, I don’t know what that, that, that said previous role at Microsoft was. [00:00:29] Mikhail Parakhin: Uh, that was... Yeah, my previous role w- at Microsoft was the-- I actually was the CEO of one of Microsoft’s business units, which included, as I, you know, as we discussed, all the things that people like to laugh about, uh, including Windows and Edge and Bing and ads and everything. [00:00:47] swyx: Yeah, yeah. What a, what a, what a wild time. You’ve obviously, uh, done a lot since you landed at Shopify. 
Uh, one of the reasons I reached out was because you started promoting more sort of internal tooling, uh, primarily Tangle, but also a lot of people have seen and adopted Tobi’s QMD, uh, and obviously, I think, uh, Shopify has always been sort of leading in terms of, uh, engineering. I think more-- it’s just more recent that you guys have been more vocal about your sort of AI adoption. Is that, is that true? [00:01:16] Mikhail Parakhin: Well, I think AI tools in general are fairly recent development, uh, and we’ve-- Shopify, you know, at this stage of its development, we’re developing AI in-in-house and other, uh, building tools that use AI and, you know, interfacing with the wider AI community, uh, you know, are on the sort of the, uh, runaway trajectory. So it just did by sort of natural byproduct. We, we talk about it more also. We just, uh, just even yesterday, Andrej Karpathy was famous in tweeting about, oh, are there some, uh, ways, uh, that, that you can organize your agents to store the data and then, uh, look up the data so that you don’t have to research or, or lose context every- Yes time. And a little bit tongue in cheek, I tweeted that, “Hey, we’ve, we’ve done it much earlier, and we even have different approaches, Tobi and I.” Tobi, of course, is a big fan of QMD, and I’m more of a SQL, SQLite fan. But, uh, yeah, very similar things that we’ve already done here. The point is, yeah, we’re very dynamic, you know, explosively growing company, and we have to be at the forefront of AI adoption, obviously. [00:02:29] swyx: Yeah. Yeah. Um, you, your team kindly prepared some slides actually that we were gonna bring up on to, uh, the screen. I think I can, I can screen share, and then we can kind of go through some of the shocking stats that maybe, maybe put some numbers to what exactly is going on. So here we have, uh- An internal AI tool adoption chart. What are we looking at here? What ? 
[00:02:54] Mikhail Parakhin: Yeah, these are very interesting statistics. Uh, this is the number of daily active workers, you know, think of, uh, DAU, basically the active users of- [00:03:05] swyx: Yeah ... [00:03:05] Mikhail Parakhin: AI tools as a percentage of all the people in the company, right? And then- Yeah ... different AI tools. And, uh, you could see two things here. One is that the green is total. Uh, green is just total. So you could see that it approaches really % by now. It’s hard to do your job now without interacting deeply with at least one tool. You could see another interesting thing: just as many people commented, December was the phase transition when suddenly models got good enough that everything took off and started growing. Uh, many people noticed that small improvements accumulated into this big change in roughly the December timeframe. [00:03:52] swyx: Yeah. [00:03:52] Mikhail Parakhin: The other thing I would claim you could see is that, uh, CLI-based tools and tools that don’t require you to look at the code are becoming more popular, and you could see, yeah, various versions of, uh, Claude Code and Codex and Pi and internal development tools taking off. Uh, exactly, yeah, uh, and blue is our River, just an internal agent for coding, whereas tools, uh, that require IDEs such as, uh, GitHub Copilot or Cursor, they’re not exactly shrinking, but they’re not growing as fast. Like, uh, the red line is the IDE kind of tools. So you could see that they’re not experiencing as fast of a growth. [00:04:37] swyx: As I understand it, basically, every employee has their choice, right? Of choose whatever tool you use, and then you’re just kind of doing a, a daily sur-survey or something. [00:04:47] Mikhail Parakhin: Exactly. And, uh, we- Yeah ... the, the push is to get your job done, you can use any tool, and we effectively fund unlimited tokens for everybody.
Uh, we, we do, we do try to control the models that, uh, people use, but from the bottom, not from top. Like we basically say, “Hey, please don’t use anything less than Opus four point six.” [00:05:09] swyx: Oh . [00:05:10] Mikhail Parakhin: Some people, some people end up using GPT five point four extra high. Some people use Opus four point six. Um, uh, you know, uh, there are some, uh, there are plus and minuses in going for full one million context window versus not. But, uh, we try to discourage people from using anything less than that. [00:05:28] swyx: Yeah, yeah. Got it, got it. Uh, I mean, uh, that’s, you know... The, the next chart here, it really kind of shows the expansion and the sort of December twenty twenty-five inflection, right? That, uh, people are using a lot of tokens. I think it’s also really interesting that no one was kind of abusing it in twenty twenty-five. Like it was- Had comparatively, uh, to this year, there was almost no growth. I mean, it’s still like, you know, probably, probably gave fifty percent. [00:05:56] Mikhail Parakhin: Yeah. This is just a different scale. It’s still exponential- Yeah, yeah ...growth at just a different- ...rate of expansion. Uh, there was inflection point, and Sean, I would claim the, the super interesting part here is that you could see that the distribution becoming more and more skewed. Yes. The top percentiles grow faster. So that means- Yeah ...the people in the top ten percentile, they, their consumption grows faster than seventy-five and so forth. So, uh, the distribution skews more and more towards the highest users, which is... I don’t know what it tells me. It’s like it feels not ideal, to be honest. Or maybe it’s okay. We’ll see. [00:06:36] swyx: Why does it feel not ideal? Is, is it because of, um, quantity over quality, or what’s the concern? [00:06:42] Mikhail Parakhin: Because take it to the limit. 
That means, you know, if, if this rate of separation continued- Ah, yes ...a year, there will be one person consuming all the tokens. So it’s just, it’s kinda strange. [00:06:54] swyx: Yeah, I mean, um, uh, I, I think internal like teaching and all that, uh, will, will help sort of distribute things more widely. But in, in the early days, of course, the people who are sort of more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled. Maybe let’s, let’s call it that. I’ll just, I’ll just kinda quickly, uh, pause from the, the... You know, we will go back to the rest of the slides, but I just wanna, um, review, you know, there are a lot of CTOs of, of large companies like yourself where they’re all considering some kind of token budget, right? Like I think it’s something, something that Jensen Huang has been talking about, where like if your 200K engineer is not using 100K of tokens every year, like they’re, they’re underutilizing coding agents. Of course, Jensen Huang would say that, but like it seems a very quantity over quality approach and like some, some people are basically saying like, well, is this comparable to judging engineer quality by lines of code, right? Which we also know is like kind of flawed, but better than nothing. So I, I don’t know if you have like a sort of management take here on, on how to view this kind of, uh, metrics. [00:08:02] Mikhail Parakhin: Well, I mean, you’re, you’re baiting me. I, I like... This is my favorite topic. Uh, if you let me, I’ll probably talk for two hours on just this. I have a lot of things to say. Like I do think Jensen gotten a lot of bad press saying, “Oh, of course you’re, you know, this, uh, the- ...the cake seller says you don’t need enough cakes.” You know? Like, of course. Uh, but, uh, I actually, uh, think that’s undeserved. I think he, he’s actually right. Uh, I do think- He, [00:08:33] swyx: he’s directionally correct. [00:08:35] Mikhail Parakhin: Yeah. Yeah. 
He’s directionally correct for sure. Uh- [00:08:37] swyx: Who knows what the right number is? Yeah. [00:08:39] Mikhail Parakhin: The thing that I do, uh, want to say, and this is something that we learned through trial and error and is very important, is like two things. One is that it’s not about just consuming tokens. Uh, you can consume tokens and, in fact, the anti-pattern is running multiple agents, too many agents in parallel that don’t communicate with each other. That’s almost useless, uh, compared to just fewer agents and burns tokens very efficiently. Uh, setting up the right critique loop, especially with the high quality models, where one agent does something, the other one, ideally with a different model, critiques it, uh, suggests ways to improve it, the agent redoes it with this critique and, and so it takes much longer. So people don’t like it because latency goes up. You know, they have to wait while this debate is happening. But, uh, the quality of the code is much higher. And another thing, just since you mentioned like, look, uh, yeah, the overall budget is just like, uh, lines of code. Lines of code are exploding for everybody right now, partially because AI is really more verbose, but partially just because AI can write a lot more code, you know, doesn’t get tired. And so you have to have a very strong narrow waist during PR review. Otherwise, just the number of bugs will go through the roof. It’s, uh, it’s this unexpected consequence of just volume trumping everything. I would claim by now a good model writes code on average with fewer bugs than the average human. But since they write so much more of it, like more of it will make it into production. So you have to- You still [00:10:26] swyx: have [00:10:26] Mikhail Parakhin: more bugs. Yeah. You have to have very rigorous PR reviews, also automated of course. But, uh, yeah, you have to spend a lot of budget there.
Like this, for me, actually, the important metric is the ratio of budget spent during code generation versus, uh, spent on, uh, expensive tokens like GPT, uh, five point four Pro or, uh, Deep Think from Gemini, you know, checking on PR reviews. [00:10:55] swyx: Yeah, totally. Uh, I noticed in your chart you didn’t have any review tools. Do you just use like, let’s say, Claude Code to review? Or do you have another set of review tools like the Greptiles, the CodeRabbits, uh, Devin has a review tool. I don’t know if you’ve had those specialist review tools. [00:11:13] Mikhail Parakhin: You are a little bit jumping on my sore toe right now, because in the graphs I was only showing public tools. Uh, I haven’t found a good PR review tool that does what I think should be done. And, uh, partially my thinking is because it’s so... It just goes against both what people feel like emotionally they prefer and, uh, some of the, uh, you know, frankly, even the business models that the companies run. At PR review time, you want to run the largest models. That means, I don’t know, Codex or, uh, Claude Code is not gonna cut it. You need to have pro-level models if you really want to, uh, stem the tide of bugs going into production. And you need to spend a lot of time, the models taking turns, but you don’t want, like, a big swarm of, uh, agents. So in fact, you end up in a different, dualistic world where you generate not that many tokens. You, in fact, generate few tokens, but it takes a long time because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. So that’s why I feel like I haven’t found good tools, so we are using our own for PR review for now. [00:12:33] swyx: Yeah. Yeah. I mean, uh, I think a lot of companies are building their own, uh, especially to their needs, right? [00:12:38] Mikhail Parakhin: Mm-hmm.
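The critique loop Mikhail describes (one agent drafts, a second agent critiques, the first redoes the work with that critique, a few slow turns rather than a parallel swarm) can be sketched roughly as below. This is a minimal illustration, not Shopify’s review system: `critique_loop`, `toy_generate`, and `toy_critique` are hypothetical names, and the toy functions stand in for calls to two different models.

```python
from typing import Callable, Tuple

# Hypothetical signatures: in practice each would wrap a call to a different model.
Generator = Callable[[str, str], str]  # (task, critique) -> draft
Critic = Callable[[str, str], str]     # (task, draft) -> critique; "" means approved


def critique_loop(task: str, generate: Generator, critique: Critic,
                  max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate generator and critic turns until the critic approves
    or the round budget is spent. Returns (final draft, rounds used)."""
    feedback = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)   # redo the work using the last critique
        feedback = critique(task, draft)   # second agent reviews the new draft
        if not feedback:                   # empty critique: nothing left to fix
            return draft, round_no
    return draft, max_rounds


# Toy stand-ins so the sketch runs without any model API.
def toy_generate(task: str, feedback: str) -> str:
    if "docstring" in feedback:
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b): return a + b"


def toy_critique(task: str, draft: str) -> str:
    return "" if '"""' in draft else "add a docstring"


final_draft, rounds_used = critique_loop("write add()", toy_generate, toy_critique)
print(rounds_used)
print(final_draft)
```

The loop is sequential on purpose: latency grows with each round, which is the trade-off described in the conversation, while the number of tokens generated stays small.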
[00:12:38] swyx: Um, I, uh, you also have a chart here going back to the slides on, uh, PR merge growth, where we’re now at thirty percent, uh, month on month rather than ten percent. Uh, and also the, the estimated complexity is going up. You know, this is productivity, right? ‘Cause y- presumably there’s more stuff going into the code base and more, more features getting worked on. I’m curious about the backlog, right? Like the, the, the-- I actually don’t mind a pro-level model taking an hour or two hours to review my PR, because I’ve dealt with humans who take a week to review my PR, right? And I keep pinging them on Slack, “Hey, hey, review my PR.” So, you know, I think there’s some trade-off here where, like, it still doesn’t make sense. [00:13:18] Mikhail Parakhin: Exactly. That, that’s exactly m-my point. Uh, that on one hand, you can tolerate longer latencies at, uh, PR. On the other hand, like right now, the real problem is not in spending time waiting for PR. It’s real problem is since there’s so much more code than- Yeah ... uh, probability of at least some tests failing going up, and then you, like, keep de-failing, then you have to find the offending PR, evict it, retest it without that PR, and so deployment cycle becomes much longer. Uh, so it actually, in terms of the overall time to deploy, it’s total time savings if you spend more time on a longer model, like thinking for an hour, because then, then you, you don’t have to spend all that time during testing and rolling, you know, rolling back the deployment. [00:14:03] swyx: Yeah, totally. That’s still worth it. You know, you don’t look at the individual, look at the aggregate, and look at the, the, the change in the aggregate system. [00:14:11] Mikhail Parakhin: Exactly. [00:14:11] swyx: I’m kind of curious if, like, there’s this PR mentality and, like, c-- the, the, the CICD paradigm will be changed eventually. 
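The point above about test failures becoming more likely as PR volume grows can be made concrete with a small calculation. This is a sketch under a simplifying assumption that each PR breaks the build independently with the same probability; the 2% and 1% rates are hypothetical illustration values, not Shopify figures.

```python
def p_any_failure(p_break: float, n_prs: int) -> float:
    """Chance that at least one of n_prs independent PRs breaks the build."""
    return 1.0 - (1.0 - p_break) ** n_prs


# Hypothetical numbers for illustration only: a 2% per-PR break rate.
for n in (10, 50, 100):
    print(n, round(p_any_failure(0.02, n), 3))

# Effect of halving the per-PR break rate through slower, stricter review.
print(50, round(p_any_failure(0.01, 50), 3))
```

Under this model the chance of a failed deployment batch climbs quickly with the number of PRs in the batch, which is why an extra hour of review per PR can still reduce total time to deploy.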
Some people are like, obviously a lot of people want new GitHub, but I even wonder if, like, Git is the problem, right? Like, is that the bottleneck? Is the concept of a PR a bottleneck? Do you guys use stack diffs? I don’t know if, uh, that’s a, like, a merge queue stack diff type of thing. [00:14:34] Mikhail Parakhin: We, we use, we use Stacks, we u- we use Graphite. We worked with, uh, Graphite a lot. Uh, so we use Stack, uh, PRs. I think, uh, like that’s clearly the overall CICD in general, and the interaction with the code repository right now is the, clearly the sort of the, the main issue and the bottleneck for us, uh, and highest top of mind. I would say we probably need a different metaphor or different whole design of how to process it in new agentic world. I haven’t seen anything dramatically better yet. I, I think everybody right now is just trying to keep their head above the water ‘cause, ‘cause there, there’s so many PRs and then everybody’s CICD pipelines start creaking, the, the times are increasing, the number of bugs slipping by increasing, and you have to, have to clap on down. And so we are a little bit in this situation when we need to first stabilize that story and then start thinking, hey, what, what it could be a completely different and new world, which I haven’t... I know some people working on it. I haven’t seen something, like anything super compelling yet, but clearly the old thing were designed for humans will need to be morphed into something new. [00:15:53] swyx: One of the thing that I, I think about is kind of like the merge conflict is basically a global mutex on the whole system, right? And in, in hu- in human organizations, we do have something like that. It’s the company standup. But like, other than that, it’s like it’s actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information source, but somewhat lossy. 
Like it’s okay, you know, that, that not every delivery is like atomic consistency. Like we’re not dealing with a database sometimes. [00:16:27] Mikhail Parakhin: This is a very good point, uh, because since humans don’t write code too fast, you know that global mutex is not too bad. Once you- [00:16:36] swyx: Yes ... [00:16:37] Mikhail Parakhin: start writing code at the speed of machine, it becomes the, you know, the bottleneck. Then what do you do? Maybe, and I can’t believe I’m saying this because I, I’m long-- lifelong opponent of, uh, microservices, and I always thought that was, like, a really bad idea. And now that you’re saying it, like, maybe in new guys like microservices will make a comeback, you know, because then you, you can ship things independently in tiny things and, and the managing all that complexity automatically will be much easier. I don’t know. Like, we’ll s-- we’ll have to see. [00:17:10] swyx: Yeah. I mean, I don’t know what the Microsoft or, or Shopify thing is, but I, I read this paper from Google where they have a monorepo that deploys into microservices, right? And then, uh, the other concept that I think about a lot is the Chaos Monkey concept from, from Netflix. Being able to create, like, this robust system where, um, uh, you know, you, you have the service discovery, you have the, uh, the independent, independent microservices discovery and, and, uh, you know, probably going to be a fair amount of duplication. That’s how an organic system sort of scales, uh, that, that you have that... I don’t know how you call it. Slack? Robustness? Depend-- uh, d-duplication. I, I, I forget the-- I, I’m-- And this-- those-- these are not exactly the terms- Hmm ... I’m looking for, but I c-can’t really think of the words. Okay. I was gonna go into Tangent and Tangle. Uh, so, uh, we, we sort of discussed the overall stats that, uh, Shopify has. 
Uh, but, you know, I think some pretty cool stuff that you guys are working on is your ML experimentation, uh, and your sort of auto-research training pipeline. Presumably you’re much closer to this one because it’s a sort of personal hobby of yours. How would you explain them together? I thought we have a slide that, like, uh, has the system diagram. [00:18:24] Mikhail Parakhin: Yeah. Tangle first and then Tangent as a- [00:18:27] swyx: Yeah ... [00:18:28] Mikhail Parakhin: as a thing on top of Tangle. And, uh, Tangle is the third generation, I claim, of, uh, systems for, uh, running any data processing, with a bit of a skew toward ML experiments, but not necessarily. Any sort of data processing tasks where you need to iterate, share, and you have scale so that you want maximum efficiency. You know how, like, normally you would work. Imagine you’re a data scientist or an ML practitioner, you would get Jupyter notebooks or maybe you would get, uh, you know, your Python scripts, and you would manage the data, and you produce those TSV files, and you put them in some JFS or something. Then you would notice that, oh, it has these, uh, weird missing values. You go and write another script that, uh, goes and replaces them with, uh- [00:19:20] swyx: Ah ... [00:19:21] Mikhail Parakhin: dashes. And then you run some, uh, “Oh, I need to filter bots.” And so you run some LightGBM model that, uh, removes the bots. And then you kind of like get into shape, and then you start experimenting, and you run multiple experiments, and then you’re like, “Oh my God,” like, “this experiment is worse.” You undo, and you cannot get to the previous result. And like, “Ah, what did I do?” Like that. Again, then you finally like get everything working. Then you like start throwing it over the fence to production.
You, you replicate it, those things don’t work, and then sometimes you like don’t notice that you forgot some feature naming and the, the features don’t match. But then, like imagine you, you did everything, and then six months later you’re like, have to repeat it because now there’s more data, or you wanted to do another pass, and you’re like, “What, what did I do?” Or like, or like, “This script crashes now,” or the, “the path has changed.” And then, then you’re trying to, like you spend another month just doing ar- digital archeology on your own, you know, history, right? Now multiply that by many, many teams. Now imagine you got an intern that you wanna ramp up. Now you have to show that intern, “Oh, you know, look, here’s the folder, there’s the scripts, you know, ask your cloud agent to do, and then, uh, to, to figure it out.” And then cloud agent does something, and then you’re, “Ah, yeah, right, right, it was the wrong folder. I forgot to tell you, I actually have this other thing I forgot myself.” And, and that’s, that’s the, like, the daily life we all, uh, all know it, uh, if, if you’re a data scientist, machine practitioner, ma- machine learning practitioner or, uh, or even like any data managing, uh, person. [00:21:00] swyx: Yeah. So I, I used to do this, uh, f- uh, on the quant finance side, uh, in, in my hedge fund. So we did this before Airflow, and then, uh, obviously Airflow came along and, uh, then more recently Dagster, uh, I would say is like, in my mind, what I would use for that shape of problem, uh, where you had to materialize assets and create a pipeline. [00:21:19] Mikhail Parakhin: And that’s, that’s very good segue because... So Airflow is great, but Airflow is more about you, you have something and you wanna repeatedly run it in production on schedule. 
It’s less about you as a team developing things and being able to share, and you grabbing the standard pipeline and saying, “Hey, I wanna change this tiny little component in the huge sea of data processing, and I wanna run ten experiments on this, and I wanna do hyperparameter optimization.” All that is very hard to do with Airflow. It’s very easy to do with Tangle. Tangle is more about, it’s everything about a group of people running experiments, it might be agents too nowadays. Uh, running experiments cheaply, collaborating, sharing results. Uh, you don’t need to understand fully. You clone somebody else’s experiment or somebody else’s pipeline, uh, change a small piece, run it, like, get it to production state, and then ship in one click. So then... You don’t have to port it into any other system to run in production. You can just run the same experiment. It’s fully production ready. And, uh, it has lots of... Again, as I said, it’s a third generation system. The original one was, I would claim, Ether, uh, at least in my career, Ether was the first, uh, that pioneered this type of approach. And then there was, uh, Nirvana, uh, at Yandex, which did kind of a second take on this. And now this one aggregates the learnings from all of those, and Airflow as well, to get to the state where you try it and it feels kind of magical. Uh, ‘cause now everything is based on content, uh, hashes. So even if the version changed, if the output didn’t change, nothing is being rerun. It’s very efficient. If multiple people start experiments that need the same sort of data preprocessing, it’s not repeated multiple times. It’s automatically done only once. If you start ten experiments that all require, you know, some data preparation first as the first step, you don’t have to coordinate for that.
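The content-hash idea described here can be sketched as a cache keyed on what a step receives rather than on who ran it or which version produced it. This is an illustrative toy, not Tangle’s API or implementation: `content_key`, `run_step`, the in-memory `_cache`, and the `executions` counter are all hypothetical names introduced for the sketch.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}   # hypothetical stand-in for a store shared across teams
executions = {"count": 0}     # counts how many steps actually ran


def content_key(step_name: str, inputs: Any) -> str:
    """Hash the step name plus the serialized input content."""
    payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step_name: str, fn: Callable[[Any], Any], inputs: Any) -> Any:
    """Run fn(inputs) only if this exact content has not been processed before."""
    key = content_key(step_name, inputs)
    if key not in _cache:
        executions["count"] += 1
        _cache[key] = fn(inputs)
    return _cache[key]


def drop_missing(rows):
    return [r for r in rows if r is not None]


def total(rows):
    return sum(rows)


# Experiment 1 cleans the raw data and aggregates it: both steps execute.
clean_a = run_step("drop_missing", drop_missing, [1, None, 2, None, 3])
total_a = run_step("total", total, clean_a)

# Experiment 2, started by someone else on identical content: pure cache hits.
clean_b = run_step("drop_missing", drop_missing, [1, None, 2, None, 3])

# Experiment 3 has different raw input, so cleaning reruns, but its output is
# identical to clean_a, so the downstream step is not rerun.
clean_c = run_step("drop_missing", drop_missing, [None, 1, 2, 3])
total_c = run_step("total", total, clean_c)

print(executions["count"])
```

Because downstream keys depend on upstream output content, a change that leaves the output untouched stops propagating at that step, which matches the behavior described in the conversation.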
Like, you don’t have to know that other people are starting it. You now, it’s very easy compos-, uh, composability, any language you can u- uh, you wanna use, and it’s very visual. So you can see immediately, you can edit it easily, you can assemble small things with just even mouse clicks if you want to, and, uh, share, clone. And everybody knows also it’s fully kind of static in the sense that we rerun it second time, it will exactly have the same results. Like, you will never have to do digital archeology. So full versioning and everything is also there. [00:24:06] swyx: Uh, so, so people can, uh... It’s open source. Go to the GitHub repo and, and, uh, check it out. Uh, and it is also a really good, uh, blog post about it. I think all these is, like, really appealing. The, the, the, the thing that I think sells me the most about it is that, um, sort of development to production transition, right? Which I think, um, a lot of people haven’t really solved that, uh, strictly, right? Like, we develop really, really well in, in Python notebooks, but then, you know, that’s obviously not a sort of production ready process. I think that, like, any way in which that is solved, I think is, is very appealing. Then the other thing that you mentioned, which also raised my eyebrows, was content-based caching, which you mentioned is, is, um, you know, is ve-very much, uh, um, a sort of efficiency measure about, uh, you know, just like recalculation only on, on sort of content addressing Which I think makes sense. Uh, it surprised me that the savings could be this much, but maybe I just haven’t worked at your scale where there’s so much duplication, uh, that people just rerun because they change a single ID upstream. [00:25:10] Mikhail Parakhin: It does, yeah. But it’s not only you rerun. The, the main savings are coming from the fact that you ran it, you got your job done, and you moved on. Then- Yeah ... 
somebody else in some department you didn’t know existed runs the same task, but on a newer version.

[00:25:27] swyx: Yeah.

[00:25:27] Mikhail Parakhin: Right now, in most organizations, you can’t even find out about it, so you can’t even measure that you’re spending that time twice, right? Here, if everybody’s on Tangle, that’s detected automatically, and it’s detected that the output is the same. For that person, all it looks like is that the experiment just suddenly jumped forward. So there’s a network effect of multiple people helping each other.

[00:25:51] swyx: Yeah. This is one of those things where it’s designed to be a platform from the beginning rather than an individual developer’s tool, and everything streams down from there. That is the Tangle orchestrator, and it manages jobs. We’ve seen a few versions of this, and these are obviously the unique approaches that you guys have figured out. And then there’s Tangent.

[00:26:14] Mikhail Parakhin: Yeah. Tangent is basically an automatic auto research loop that can help and kind of do your work for you. Effectively, Andrej Karpathy recently popularized it with auto research. Remember, he said he was speed running this... you know the story. Here we’re basically bringing the same capability into Tangle. Tangent is just an agent that can run multiple experiments, figure out what can be changed, and keep rerunning and modifying until it maximizes some goal or optimizes some loss function, whatever you need to achieve. In general, I would say if you’re not using an auto-research-like approach in whatever you do, literally whatever you do, then you’re missing out.
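The loop described here (propose a change, run the experiment, keep it only if the metric improves) can be sketched in Python. This is a generic illustration, not Tangent's code: `auto_research`, `evaluate`, and `propose` are hypothetical names, and a seeded random perturbation stands in for the LLM agent that would actually propose modifications.

```python
import random


def auto_research(evaluate, propose, baseline, budget):
    """Run `budget` experiments, keeping a candidate only when it beats the best score."""
    best, best_score = baseline, evaluate(baseline)
    history = []
    for _ in range(budget):
        candidate = propose(best)
        score = evaluate(candidate)
        kept = score > best_score
        history.append((candidate, score, kept))
        if kept:
            best, best_score = candidate, score
    return best, best_score, history


def evaluate(config):
    # Stand-in metric (assumption): higher is better, peaking at learning_rate = 0.1.
    return -(config["learning_rate"] - 0.1) ** 2


rng = random.Random(0)


def propose(config):
    # Stand-in for the agent: randomly rescale one setting of the current best config.
    factor = rng.uniform(0.5, 1.5)
    return {"learning_rate": config["learning_rate"] * factor}


baseline = {"learning_rate": 1.0}
best, best_score, history = auto_research(evaluate, propose, baseline, budget=50)
kept_count = sum(1 for _, _, kept in history if kept)
print(best, best_score, f"{kept_count}/{len(history)} experiments kept")
```

Most proposals are discarded, which is expected for this kind of loop: every experiment is logged, and only strict improvements replace the current best.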
We saw it at Shopify taking off like wildfire: anything where you can put measurements can be done dramatically better. Our speed of HTML templatization, completely new UX templatization, reducing latency for Liquid themes. Our search: recently we moved from, it’s hard to even quote, 800 QPS to 4,200 QPS with the same quality, just by pure optimizations and an auto research loop that kept running and changing code in our index serve, on the same number of machines, just increasing the throughput. We managed to improve the quality of gisting and the machine learning process. Gisting is the prompt compression technique that allows for lower latency and actually slightly higher quality. So it’s literally all different walks of life, and it doesn’t have to be AI related. We had a reduction in storage because the agents would go and find data sets that are clearly derivative, and then you don’t need to store things twice. We found, somewhat embarrassingly, that one of the largest tables was hashing random IDs into other random IDs, and we kept only one. So it was translating, yeah, two random IDs hashed into each other.

[00:28:37] swyx: So it has access to the code as well, so it can check what the hell it is doing?

[00:28:42] Mikhail Parakhin: It can be run at two levels. At the superficial level, it can just use existing components and reshuffle them. You can grab XGBoost, and you can grab some PyTorch module, and then grab some other tools and combine them.
At a deeper level, since Tangle is all CLI-based underneath, and every component is really a wrapped CLI call and a YAML file, it can analyze code and create new components and keep on iterating as well. So you can have quick modifications of existing pipelines with components that are already there, pre-baked, or you can create new components and keep iterating on those. So auto research is, again, probably the thing I was most excited about happening in the last two months, and we see it taking off like wildfire. Every day, well, every minute, I have somebody sending a Slack message saying, “Oh, look how much better I made it.” And it’s all through auto research.

[00:29:53] swyx: Is this democratized in some way? Is it your ML engineers and researchers doing this, or do your regular PMs and software engineers also have the ability to use Tangent?

[00:30:07] Mikhail Parakhin: This is an awesome question. Tangle in general and Tangent in particular are extremely democratizing.

[00:30:15] swyx: Because I don’t need the details.

[00:30:16] Mikhail Parakhin: Yeah, exactly. They were initially used by ML and AI engineers, but then literally, as you said, PMs picked them up. The highest user right now is one of the PMs in our org, Sartak. He was number one by usage, because they’re just energetic and knowledgeable, and it unlocks a lot of capability where you don’t have to change code manually.

[00:30:39] swyx: I mean, it kind of cuts the ML engineer out of the process, because the PMs have the domain knowledge and the ability to think from first principles about, okay, what results do I want?
And they even have access to the data that needs to go in. So in some ways this is the magic black box that we’ve always wanted for training and for, I guess, hill climbing, whatever.

[00:31:04] Mikhail Parakhin: It’s basically Claude Code for your AI development situation, right? Now you don’t have to know exactly how the algorithms work. You can just bring your domain knowledge, expertise, and product knowledge, and iterate within Tangent until you’ve gotten the results that you need.

[00:31:21] swyx: In my previous roles, every time someone pitched AutoML, I’ve always been like, “This is not gonna work. It’s always gonna be a flop.” Somehow it’s working now. Presumably the answer is that now we have LLMs and they’re good enough, right? It’s an emergent property that we can do auto research. But it doesn’t feel that satisfying: how come we didn’t do this before? We just did parameter search and, I don’t know, maybe that’s it.

[00:31:48] Mikhail Parakhin: Yeah. Bayesian optimization and hyperparameter optimization was the one facet of AutoML that was used very actively, and incidentally it’s also built into Tangle. But, you know, I know Patrice Simard very well, and he was such a proponent of AutoML; he literally spent his career trying to democratize it. Without LLMs, it just turned out to be very hard. You would have flexibility within a certain narrow domain, but it was hard to scale wider. And now with LLMs it’s suddenly like a magic wand, and suddenly everybody is an AutoML expert.

[00:32:28] swyx: Yeah, I think it’s multiple things, right? I’m just gonna bring up the chart again. LLMs can do the monitoring very well. That is potentially very unbounded and super unstructured.
They can do the analysis very well, and basically there is much more intelligence poured into every single step. Maybe nothing has structurally changed about AutoML, but this is just more intelligent and more unstructured.

[00:32:53] Mikhail Parakhin: Exactly.

[00:32:54] swyx: Any flaws that you’ve run into? Everyone is drinking the Kool-Aid: oh my God, time savings, performance improvements. What issues have come up?

[00:33:06] Mikhail Parakhin: This is really cool, but it’s not a solution to all the world’s problems, for sure. And this is where we get into a bit of subjective territory. I can only share what I’ve seen so far, and I’m sure the situation is changing. Maybe after I say it, many people will reach out and say, “Hey, what about this? You don’t know that,” and they’ll probably be right. But what I’ve seen is that auto research is very good at doing kind of obvious things that you don’t have the bandwidth to do, or didn’t notice, or maybe standard practices you’re not aware of. It is not good at doing something completely out of distribution, something where you have to think for multiple days and do something that’s like none of this. I set up an experiment once on my sort of hobby thing, and I let it run. It ended up being a several-week run, at full production kind of scale, so slow runs, and in the end it performed over four hundred experiments, and only one was successful. I’m like, “Okay, that’s good.” But...

[00:34:18] swyx: But it saved time.

[00:34:19] Mikhail Parakhin: Yeah, it saved time. That was the thing. If I were doing four hundred experiments myself, my batting average would have been much higher, I’m sure.
But also, first of all, it would take me like three years to do four hundred experiments. And I didn’t have to do them. The machines just did that, for the price of electricity. And I got one improvement. Honestly, when I was starting that experiment, my thinking was to go and show that, “Hey, Andrej, maybe you just don’t know how to optimize.” And I was super smart, because my problem had been optimized for many years and was, like, fully improved, and I didn’t expect auto research to find anything at all. Yet it did. So instead of making fun of Andrej, I ended up a big supporter. Yeah, that’s exactly the tweet. Yes.

[00:35:10] swyx: You and Tobi really go back and forth online a lot, which is really funny. Think of it as an eval for the optimality of the code it’s running on. It almost reminds me of a Kolmogorov complexity thing: there’s some optimal thing that you’re trying to reduce down to, I guess. So you should congratulate yourself that you had ninety-nine percent optimality.

[00:35:36] Mikhail Parakhin: Exactly, yeah. I think Andrej really deserves a lot of credit for popularizing this approach. It’s incredibly powerful and cool, I think, and even him just mentioning it led to a lot of gains in a lot of places in the industry, so we should be thankful.

[00:35:56] swyx: Yeah. I think he also has a... I don’t know what it is. It is a simple, self-contained project that people can take and apply to other things, which is one thing, but also just the name. Somehow no one else managed to call their thing auto research. Naming things is very important. I think that is mostly our coverage of Tangle and Tangent.
Obviously there’s a lot of ML infra at Shopify that people can dive into. We’re about to go into SimGym, but before I do that, any other broader comments around this whole effort? Where is it leading to?

[00:36:36] Mikhail Parakhin: As a segue to SimGym, all those things start composing strongly. And you can see a huge unlock: you can look at each one of the tools and see, oh, they’re extremely useful. Tangle is useful by itself. Auto research is useful by itself. SimGym is useful by itself. If you combine all three, you create a synergistic effect. I think that’s why we wanted to cover them today: this is something that, if you go back even five years, would have been unthinkable. Replicating that would have been either incredibly costly or impossible, right? Probably thousands of people would have been required.

[00:37:20] swyx: Well, we have serverless intelligence, right? So yes, you do have thousands of intelligences, just not humans. And that’s close enough, right? Even if they’re not AGI, they’re close enough to do the task that you need them to do, and that’s plenty for a lot of routine knowledge work. Okay, let’s get into SimGym. This is one of those things where I was surprised to see it’s apparently one of your most popular launches. I think Sim AI, and Joon Sung Park, who did the Smallville thing... there’s a very small cottage industry of people trying to do the simulated customer thing. I think a lot of people maybe don’t super trust this yet, because they’re like, well, obviously the agents would just do what you prompt them to do, right? But maybe tell us about the inspiration or origin story.
[00:38:10] Mikhail Parakhin: That’s exactly the thing I wanted to cover, because if you don’t have the historical data, all you can do is prompt agents in a vacuum, and they will do exactly what you prompt them to do. In fact, when I first proposed it (this is a bit of my brainchild initially, if I can boast), even Tobi said, “But wouldn’t they just repeat what you tell them?” And I’m like, “Yes, except Shopify has decades of history of how people made changes and what that resulted in, in terms of sales.” It’s noisy data. These are usually small websites, and things are never in isolation. It’s almost never an A/B experiment. It’s always an “A/A” experiment (that term has two meanings, but basically you run two different things at different times). But if you aggregate everything together and apply denoising and a collaborative-filtering-like approach, you can extract a very clear signal. And then you can optimize your agents. That’s why it took so long. It took almost a year of that optimization, of just us sitting and fiddling. We had an internal goal to hit 0.7 correlation with add-to-cart events, for example: if we run a real A/B test experiment, the simulation should replicate the same sort of success that the humans had, or lack thereof. It took forever, and I don’t think that’s easily replicable, because who else would have that data? You have to have decades’ worth of historical data. And the other thing you need is infrastructure and scale, right?
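The 0.7 correlation goal mentioned above can be made concrete with a small Python sketch. The transcript does not say which correlation measure or data layout was used, so this assumes plain Pearson correlation between simulated and observed add-to-cart lift per change; `pearson`, `meets_target`, and the sample numbers are illustrative inventions, not Shopify data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def meets_target(simulated, observed, target=0.7):
    """True when simulated outcomes track observed outcomes at or above the target."""
    return pearson(simulated, observed) >= target


# Made-up add-to-cart lift per storefront change: simulated agents vs. real shoppers.
simulated_lift = [0.02, -0.01, 0.05, 0.00, -0.03, 0.04]
observed_lift = [0.03, -0.02, 0.04, 0.01, -0.01, 0.02]

print(round(pearson(simulated_lift, observed_lift), 3))
print(meets_target(simulated_lift, observed_lift))
```

A check like this only says whether the simulation ranks changes the way real shoppers did; it says nothing about why, which is why the historical data behind the agents matters.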
Because, again, what we found is that to get stat sig results, you need to run a lot of simulations, a lot of agents, and those are expensive things. You’re taking actions in the browser because you want real friction. You want to be able to get the image of what humans will see, because you want to detect effects like, “Hey, if I make my images larger, will I have more sales or fewer sales?” And usually people’s intuition here, by the way, is that if I increase my images, I will have more, because they look nicer. Designers all like sparse layouts and big images. Usually your sales tank, right? But from the HTML, all the characters look the same; only the size tag looks different. So it’s very hard. You have to take visual information, you have to run this in a simulated browser environment on a big farm, and of course you have to have a very expensive, good multimodal model. So all of this is what took so long. And to share my personal fail a little bit there, Shawn: we always had this large-company bias. Whenever we do something, we’re like, “Hey, we’ll run an experiment,” right? We make a change, run an experiment, and see which one is better, or, “No, this is worse,” and most of them are worse, so you discard them and keep iterating, hill climbing. And we’re like, “Oh, smaller merchants cannot get stat sig results. They cannot really run experiments, simply because in a week there would not be enough data for them.” So we thought about it from this perspective. What we didn’t realize is that most people don’t have an A and a B. They just have one thing, and they need suggestions of what A and B should be.
So we first built this as: we run a simulation on two separate themes and say, “Hey, which one is better?” We then morphed it, and very recently released it, so that when you have just your site, your theme, we run over it and say, “Here are the predicted conversion values, and here’s how we think you should modify it to increase your conversions.” And circling back to what you started with, the proof is in the pudding. If we are not correlating with reality, people will not be using it. Thankfully, we see literally every day more users than the previous day. So right now my problem is how to pay for it all, so our major thing is how to optimize the LLMs, do distillation, and run the headless and headful browsers cheaper, so that we can accommodate the increase in traffic.

[00:42:47] swyx: Yeah. I understand that you published a lot of technical detail at GTC, so I was just gonna bring it up a little bit. Was this in conjunction with some kind of GTC presentation, or something like that?

[00:42:59] Mikhail Parakhin: Well, yeah, we did it in several places, but we had the engineering blog as well.

[00:43:05] swyx: Yeah. So you’re running GPT OSS.

[00:43:08] Mikhail Parakhin: This is an older version. Now we run a multimodal model. But yeah, we still run GPT OSS as well for...

[00:43:15] swyx: And then you have the VMs, and you also have Browserbase. I really like this one where you said, “It violates almost every assumption that standard LLM serving is designed for.” And then you had basically orders-of-magnitude differences between everything.

[00:43:29] Mikhail Parakhin: Exactly. Which was a bit of a challenge to implement, even for simple things.
Since it violates all the assumptions, multi-instance GPUs, for example, MIGs, don’t work as well. But we needed to get MIG to work, because otherwise it’s way too expensive. So we had to deal with lots of infrastructure and work with Fireworks and CentML to help with optimizations, and Browserbase, as you mentioned. Yeah, it takes a village.

[00:44:04] swyx: Okay. So there’s a lot of experimentation in the infrastructure so far, and you’ve published more or less what you have here. I guess I’m less familiar with CentML. I don’t do that much work in this part of the stack. But why was it the preferred inference platform?

[00:44:22] Mikhail Parakhin: There used to be really three top companies, at least that I was aware of, that did LLM optimization: Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by NVIDIA. What they did is, if you have a model and you want to optimize it for a specific usage profile, they would go and do it. We work with those companies, and this was work particularly with CentML and NVIDIA to get the best possible results out of it. And sometimes you have to retune: sometimes you want maximum throughput, sometimes you want minimal latency, sometimes you want the cheapest, or some combination. So yeah, these are people who would come and help you.

[00:45:14] swyx: I see. Yeah, I’m familiar with these people for the LLM, you know, autoregressive stack.
But the other interesting category of these optimizers is the diffusion people, like fal, and Pruna has come up a lot recently as well, which I think is really underappreciated, at least by me, because I thought all the workload would be LLMs, but actually there’s a lot of diffusion as well.

[00:45:38] Mikhail Parakhin: Exactly.

[00:45:38] swyx: There’s a lot here, so it’s hard to cover. But I do think people underappreciate the importance of customer simulation. This is something that I’m candidly still coming to terms with. Your team also prepared this really nice diagram. I assume this is AI generated.

[00:46:00] Mikhail Parakhin: Yeah, it looks...

[00:46:01] swyx: Maybe it’s not.

[00:46:01] Mikhail Parakhin: Yeah, it looks Gemini-ish. Honestly, I don’t know where the hell they generated it. It looks like it’s Google. But the interesting part, Shawn, that we haven’t covered but I wanted to mention: if your store had previous customers, as opposed to being a new store from a new merchant just launching things, it helps tremendously with correlation and forecasting. We take your previous customers’ behavior, we create agents that replicate the specific distribution of customers that you get, and then we apply those to your changes. That raises the raw correlation with add-to-cart events, or with conversion, or whatever it may be, quite dramatically. So replicating humans in general seems like an interesting, cool challenge.

[00:46:58] swyx: As a shareholder, I think if people are Shopify shareholders, they should really deeply understand this, because this is basically the moat.
The more you use Shopify, the more it will just automatically improve, right? You’re doing the job for them.

[00:47:13] Mikhail Parakhin: Yeah, that’s what we started with. Otherwise, if you’re just a startup... I wouldn’t do it if it was my startup, because without the data, as you said, it’s exactly the case that whatever you say in the prompt, that’s what the agents will be doing.

[00:47:30] swyx: The statistician in me really wants to satisfy the statistical intuition, I guess. The word that comes to mind is ergodicity. Let’s say a customer takes this path, a customer takes this path, a customer takes this path, right? In my mind, the way I explain it is: okay, here’s the ninety-fifth percentile, here’s the fifth percentile, and here’s the median. But to me, what SimGym is potentially doing is that it can model the in-between journeys as well, which may be dependent on the previous states. This may be a very RL-type conclusion: if you only did naive A/B testing, you only have the statistics at a certain point, and you only judge based on the overall summary statistics. But here you can actually model trajectories. Does that make sense?

[00:48:31] Mikhail Parakhin: That makes total sense. That makes even more sense than maybe even you realize, because...

[00:48:38] swyx: Okay. Please, please.

[00:48:38] Mikhail Parakhin: Yes. So internally we have this system; we talked about it briefly once at NeurIPS. We have a huge HSTU-based system that models whole companies and their possible paths. And ...
what you are showing: at any point in time, you can either model the user’s behavior, or you can think about the whole merchant as a company, as an entity that acts in the world. You can model that as well. And then you can do counterfactuals. In your blue graph, imagine that somewhere in the middle you have an intervention. I give that person a coupon, or, I don’t know, I send a personal thank-you card, or give a discount somewhere. Then you can do forward rollouts from that counterfactual: what would have happened with that intervention or without it? And you can even change where in time, where in this journey, that intervention happens. We do this at Shopify scale for our merchants, and then if we notice something that they can be fixing, where there’s a strong counterfactual, like we have Shopify policy, they basically get a notification like, “Hey, we think something is wrong with your, I don’t know, Canadian sales. It looks like it’s misconfigured. Here’s what you need to do.” Or, “We think you have to set up this campaign with these parameters.” And we do that at the buyer level to literally offer discounts or cashback or other things to buyers. I’m getting very excited. This is my area of interest, I guess, and hobby. But being able to model something as complex as human beings or companies, and model counterfactuals on it, where you can have interventions in the future and optimize when to make an intervention and what kind of intervention to make, is such an unlock. Previously it was completely impossible. It was always dreamed of, but how would you even simulate it without LLMs or HSTUs? I think these are very, very exciting times.
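The idea of forward rollouts with and without an intervention, and of moving the intervention in time, can be sketched with a toy simulator in Python. Everything here is an assumption for illustration: the real system is described as HSTU-based, whereas this uses a hand-written buyer model with a fixed purchase probability, a fixed churn probability, and a one-step coupon boost. `rollout`, `uplift`, and `best_coupon_step` are hypothetical names.

```python
import random


def rollout(seed, steps, buy_p, churn_p, coupon_step=None, coupon_boost=0.0):
    """Simulate one buyer journey and return the number of purchases."""
    rng = random.Random(seed)
    purchases = 0
    for t in range(steps):
        u_churn, u_buy = rng.random(), rng.random()
        if u_churn < churn_p:
            break  # the buyer leaves and never returns
        p = buy_p + (coupon_boost if t == coupon_step else 0.0)
        if u_buy < p:
            purchases += 1
    return purchases


def uplift(n, steps, buy_p, churn_p, coupon_step, coupon_boost):
    """Average extra purchases per buyer caused by the coupon.

    Each treated rollout is paired with a control rollout that uses the same
    seed, so the only difference between the two journeys is the intervention.
    """
    total = 0
    for seed in range(n):
        treated = rollout(seed, steps, buy_p, churn_p, coupon_step, coupon_boost)
        control = rollout(seed, steps, buy_p, churn_p)
        total += treated - control
    return total / n


def best_coupon_step(n, steps, buy_p, churn_p, coupon_boost):
    """Pick the journey step where the coupon yields the largest estimated uplift."""
    return max(range(steps),
               key=lambda t: uplift(n, steps, buy_p, churn_p, t, coupon_boost))


for step in range(5):
    print(step, uplift(2000, 5, 0.1, 0.2, step, 0.5))
print("best step:", best_coupon_step(2000, 5, 0.1, 0.2, 0.5))
```

Pairing treated and control rollouts on the same random seed is a common variance-reduction choice: both journeys share every random draw, so any difference in purchases is attributable to the coupon alone.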
[00:50:59] swyx: I just wanted to maybe illustrate this. I’m not the best illustrator, but I am a conceptual statistics guy. And you cannot just do this with an A/B test. This is a dimensionality an A/B test doesn’t have, because it doesn’t have the change over time, the stochastic nature, and it doesn’t have the contextual “here’s all the context up to this point.” Okay, cool. That’s SimGym. You’re gonna burn a lot of tokens on this thing. But you’re one of the only scale platforms in the world that can do this across a huge variety of workloads, right? I’m even curious on a human research level: does retail behave differently from clothing sales? Does that behave differently from electronics sales? I don’t know. The Kardashian shoppers, do they differ from people who buy, I don’t know, cars or whatever?

[00:51:55] Mikhail Parakhin: Well, very different: different sensitivities, different modes of shopping, and different levels of what’s important. You can totally do aggregations at a store level. You can do aggregations at a category level. For the statisticians among us, I couldn’t believe it, but recently we were looking at it and we had to bring back CRPs, you know, the Chinese restaurant process. It’s a way of aggregating with naturally growing clusters. Specifically to answer questions like the ones you were just posing, on whether buyers behave differently across categories. And I’m like, “I haven’t seen CRP since two thousand and one.”

[00:52:37] swyx: What is... No, I haven’t seen this. This is not in my training.
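For readers who, like swyx, have not met the Chinese restaurant process: it is a standard prior over clusterings in which the number of clusters grows with the data. The Python sketch below samples from the textbook process only; the transcript does not describe how Shopify applies it, and the function name and `alpha` value are illustrative. Customer n+1 joins an existing table k with probability n_k / (n + alpha) and opens a new table with probability alpha / (n + alpha).

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table assignments from a CRP with concentration alpha > 0.

    Returns (assignments, table_sizes), where assignments[i] is the table
    index of customer i and table_sizes[k] is the head count at table k.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for n in range(n_customers):  # n customers are already seated
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                # Join table k, chosen with probability size / (n + alpha).
                table_sizes[k] += 1
                assignments.append(k)
                break
        else:
            # Open a new table, chosen with probability alpha / (n + alpha).
            table_sizes.append(1)
            assignments.append(len(table_sizes) - 1)
    return assignments, table_sizes


assignments, table_sizes = chinese_restaurant_process(1000, alpha=2.0)
print(len(table_sizes), "clusters; largest:", sorted(table_sizes, reverse=True)[:5])
```

Large tables attract more customers (the "rich get richer" effect), while `alpha` controls how readily new clusters open, so the number of clusters does not have to be fixed in advance.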
[00:52:44] Mikhail Parakhin: But yeah, it was a very popular kind of theory in NeurIPS and ICML circles in the early two thousands. Kind of nice. And now it has practical applications, so we were resurrecting it.

[00:53:03] swyx: Yeah, amazing. I can see how this is a fun job for you, where you get to apply all these things. Super cool. So anyone who knows what CRPs are and has always wanted to use them at work should definitely join Shopify. Okay, we have a lot, but I’m being mindful of the time. I did want to cover some other things. I’ll give you a choice: UCP or Liquid?

[00:53:30] Mikhail Parakhin: Liquid. On UCP, you know, UCP is very important for us, but on UCP we have structured discussions, and you can read about them. We have blog posts, and we have a big release this week, in fact, with our catalog.

[00:53:46] swyx: Oh, okay. I mean, we can discuss the release briefly, because we’ll release this after it’s already announced, so whatever. There’s a catalog that you guys are doing?

[00:53:55] Mikhail Parakhin: Yeah. We are bringing in the capabilities of the whole Shopify catalog. Basically, you can now search for products, you can do lookups by specific ID, and you can do bulk lookups when you need to bring in multiple products. You don’t need to know in advance what you’re trying to show or sell or check out.
You can now have this decided at runtime, and this is a big area of investment for us, for both non-personalized and personalized searches: trying to provide basically a window into the whole universe of products being sold everywhere in the world. Shopify is really, not exactly but almost, a superset of anything being sold. Now we are bringing it into UCP. Identity linking is another big thing for us, so that you can use Google or whatever identity you have, minimizing friction.

[00:54:56] swyx: Yeah.

[00:54:57] Mikhail Parakhin: So yeah, big release for us. But Liquid AI, of course, we’ve never talked about, and it is probably more aligned with what we discussed previously in this chat.

[00:55:07] swyx: Sure. The main thing that everyone understands about Liquid is that it is inspired by a worm, and I still don’t know why. I’m curious about your explanation; I think you can make things very approachable. And also, what is the potential level of efficiency that you get out of Liquid?

[00:55:23] Mikhail Parakhin: We’re all familiar with transformer architectures. And for the longest time there was a competing architecture called state space models, SSMs. Chris Ré was one of the pioneers, and there were lots of startups trying to make those a reality. They have significant benefits, the main ones being that they are much faster, have a lower footprint, and are not quadratic but linear in your context length. But state space models never quite made it. They have certain niches where they thrive, and hybrid architectures are useful, but they never quite made it. And liquid neural networks, you can think of them as a next step, sort of state space models squared.
It’s non-transformer architecture that’s more complicated than sta-state sp

22 April 2026 · 1 h 12 min
