
Interconnects
Podcast by Nathan Lambert
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai
All episodes
102 episodes
https://www.interconnects.ai/p/how-i-write [https://www.interconnects.ai/p/how-i-write]

My experience with my recent years of writing is quite confusing — almost even dissociative. I've never felt like I was a good writer and no one really told me I was until some random point in time a year or two ago. In that time span, I didn't really change my motivation nor my methods, but I reaped the simple rewards of practice. I'm still wired to be very surprised when people I respect wholeheartedly endorse me as "writing very well." Despite the disbelief, when I interrogate what I'm doing and producing, it is clear that I've become a good writer.

I don't have a serious writing process. Rather, I make writing a priority. When it is time to write, when my brain is ready, I write. Most of the processing of ideas comes from discussions at work, online, and with myself. The writing is a dance of crystallizing your ideas. It is capturing a moment. This post will take me about 45 minutes on my return flight from San Francisco for a talk, after a nap and a sparkling water. This is standard and it's quite refreshing to have nothing else to do.

I'm torn on the future of writing. It's easy to think that with AI no one will learn to write well again, but at the same time the power of writing well is increasing in careers and in perceived overall impact.

The process of becoming good at writing is quite simple. It takes practice. With practice, you can get to a solid enough level to write clear and engaging prose. The path to becoming a good writer has two sequential milestones:

* Finding something you care about. Then you can write about it. The entry level to this is finding something you want to learn more about. The final level is writing about your passions.
* Finding your voice. Then you can write effortlessly.

People spend too long trying to write as an activity without thinking seriously about why they're writing and what they care about. This makes writing feel like a chore. Finding your voice also unlocks much more powerful feedback loops and the most powerful form of writing — writing about why you write. This helps cultivate your voice, your direction, your personality, your story.

When I found my voice I also unlocked style. Feeling style while writing is when it becomes intellectual play. For example, I find diversity of punctuation and aggressive sentence structure to be something that AI never does naturally. AI. Won't. Make. You. Read. Fragments. AI will draw you into long, lulling, lofty sentences that make you feel like you know what they're talking about while still conveying very little information.

Finding your voice is also far harder. Writer's block can best be described as when you have ideas but don't know how to express them. Sometimes this is forced upon you because the medium you're writing for has a required format (e.g. academic manuscripts). I've yet to find a way to circumvent this.

When you have found your voice and your something, writing is just as much thinking a topic through as it is an action in itself. Most of my work now is just that — I'm prioritizing the times to write when I feel my thoughts coming together and I sit down to finish them off. Without prioritizing writing, it'll often feel like you're trying to put together puzzle pieces where the edges have been bent or torn. You know what you are going for, but it's just extra work to bend everything back into shape. My schedule is designed to make writing a priority. 
I have few meetings and I approach my workflow with consistent hard work expressed through very flexible hours. Writing captures the essence of ideas incredibly well and we have a deep sense that can pick up on it. It's why you can read one 200 character post on X and know with conviction that the creator of it is a genius. This bar of good writing and thinking is of course rare at a personal level and fleeting throughout a day. By doing this for multiple years my rate of output has gotten far higher along with my overall quality. Is my thinking becoming clearer or am I getting better at expressing it in the written word? In many ways the distinction doesn't matter. This brings me back to AI. AI models are definitely getting much better at writing, but it's not easy to track. With the above sentiment, I think writing quality is one of the best judges of AI models' abilities. It's why I've stuck with GPT-4.5 for so long despite the latency and I suspect it is a reason many people love Claude 4 Opus. o3 can be quite nice as well. Still, these models are better at writing than their peers, but they’re still very mediocre overall. AI labs are not set up to create models that are truly great at writing. A great model for writing won't have gone through heavy RLHF training or be trained to comply with a specific tone. This could get better as the base models get stronger, as post-training can get lighter as the models naturally are more capable to start with, but I think the drive to define a model's voice will appeal to more users than elegance (i.e. the same incentives that caused GPT 4o to be so sycophantic). Without more raw intelligence better writing will feel like a lucky find from prompting rather than the nature of new models. I suspect many recent papers on creative writing are doing more of amplifying a certain style of writing that humans like than making the model have a more expansive capacity for writing. With scaled RLVR training we're also pushing the models even further into doing rather than writing. A great test for AI progress is how the writing ability gets pulled up with all the other training foci around it. AI helps good writing processes, but it pulls up the drawbridge for those looking to get into writing. The level of motivation it takes to learn to write while autocomplete is always available is far higher. For the full “life” backlog of my writing, here it is in chronological order: * July 2022: Job search out of Ph.D. [https://www.natolambert.com/writing/ai-phd-job-hunt] * May 2023: What it’s like to work in AI right after ChatGPT [https://www.interconnects.ai/p/behind-the-curtain-ai]. * November 2023: Job search post ChatGPT & RLHF [https://www.interconnects.ai/p/ai-research-job-market]. * October 2024: Why I build open language models [https://www.interconnects.ai/p/why-i-build-open-language-models]. * May 2025: My path into AI [https://www.interconnects.ai/p/my-path-into-ai]. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe [https://www.interconnects.ai/subscribe?utm_medium=podcast&utm_campaign=CTA_2]

https://www.interconnects.ai/p/next-gen-reasoners [https://www.interconnects.ai/p/next-gen-reasoners] On Monday of this week we released RewardBench 2, Ai2’s next reward model evaluation and a project I’ve been personally invested in through its whole arc. Read more of my thoughts here [https://natolambert.substack.com/p/rewardbench-2-and-the-state-of-preference]. Tomorrow, I’ll be presenting a version of this post at the AI Engineer World’s Fair Reasoning & RL track [https://www.ai.engineer/schedule#a-taxonomy-for-next-generation-reasoning-models]. Come tomorrow and say hi if you’re around the next two days! The first generation of reasoning models [https://www.interconnects.ai/t/reasoning] brought us inference-time scaling and intrigue in seeing into what can be called the reasoning process of a language model. The second generation of reasoning models are going to bring us new types of agentic language modeling applications. The traits and abilities that are needed for agentic models are additive to the first generation, but not present by default. Some of the new abilities that are needed can be bootstrapped with clever prompting, but for the best results we need to be training our reasoning models directly to optimize for planning. In this post we explain four key aspects of current and next-generation reasoning models: * Skills: The ability to solve self-contained problems. * Calibration: The ability to understand the difficulty of a problem and not overthink. * Strategy: The ability to choose the right high level plan. * Abstraction: The ability to break down a strategy into solvable chunks. These are presented in the order that they should be solved to make a progressively more complete reasoning model for complex tasks. Skills then calibration then strategy then abstraction. The first two are native abilities of models on single inference passes when presented with a technical problem and the latter are skills that are needed to build effective agents. For grounding, recall the popular “time horizon progression” chart from METR [https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/]: The models were saturating around GPT 4o in 2024. Unlocking reasoning skills provided the bump through Claude Sonnet 3.7 in 2025. Planning well will be the trait of models that make the leap from 1 to 4+ hours in 2026 and on. All of the excitement around reasoning models exploded when it was shown that scaling reinforcement learning with verifiable rewards (RLVR) enables the model to learn useful skills for solving a variety of downstream tasks. The first public confirmation of this was with DeepSeek R1 [https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1], which showed how training time RL compute translates to performance. Intertwined with this is that the models will generate more tokens per response while discovering these skills. Within all reasoning models today the above abilities listed — skills, calibration, strategy, and abstraction — can be further tuned by the increase in token spend per component. This year every major AI laboratory has launched, or will launch, a reasoning model because these models are better at acquiring skills that let them solve the hardest problems at the frontier of AI — evaluations like Humanity’s Last Exam, MATH, AIME, LiveCodeBench, Aider Polyglot, etc. have all seen step changes in performance from the previous class of models. These skills are the foundation for all of the changes that are following in the industry. 
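To make "verifiable rewards" concrete, here is a minimal sketch of the programmatic check that RLVR-style training uses in place of a learned reward model. The function name and the boxed-answer convention are illustrative assumptions rather than any lab's actual pipeline:

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward for RLVR-style training on math problems.

    Extracts the final \\boxed{...} answer from the model's completion and
    compares it to the reference answer. The signal comes from a programmatic
    check rather than a learned preference model, which is what makes it
    'verifiable'.
    """
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable answer, so no reward
    predicted = matches[-1].strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0


# Toy usage: this scalar would feed a policy-gradient update (e.g. GRPO/PPO).
print(verifiable_reward("... so the area is \\boxed{42}", "42"))  # 1.0
print(verifiable_reward("I think the answer is 41", "42"))        # 0.0
```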
Much of current discussions on scaling training are around finding the right problems to let the models become more robust in a variety of scenarios. The mad rush for skill acquisition in these models has ballooned a second-order problem of the models overthinking for even easy problems. This emerges due to the deep coupling of RL training and the unlock of inference-time scaling. The ultimate goal is clearly that models scale inference-time compute on their own proportional to how hard the problem is. In the short term, when the rate of performance gain is so high, it makes sense to prioritize abilities over efficiency. As abilities saturate, performance and cost will be weighted more equally. Right now, calibration on problem difficulty is offloaded to the user in the form of model selectors between reasoners or traditional instruct models, reasoning on/off buttons, thinking budget forcing, and soon reasoning effort selectors. On the research side its been shown that the RL loss [https://arxiv.org/abs/2505.05315] functions are flexible [https://arxiv.org/abs/2503.04697] enough to enable length [https://arxiv.org/abs/2505.09388] control more precisely — something that loss functions like instruction or preference tuning cannot handle. Similarly, the models trained as reasoners better express their confidence [https://arxiv.org/abs/2505.14489?utm_source=chatgpt.com], which should soon be translated into mitigations of overthinking [https://arxiv.org/abs/2504.13367?utm_source=chatgpt.com]. Calibrating the difficulty of the problem to the effort of the solution will enable much more practical (and faster and enjoyable) solutions for end users and also just more profitable solutions. Calibration, even though a lower level trait of the models, isn’t as much of a crucial path to rolling out new use-cases with the models. For that, AI makers are going to turn to better planning abilities. For more on current research on calibration, click the following footnote. Before we go on to planning abilities, which are often discussed at length in the community as being crucial without providing a clear way of understanding it, we need to contextualize how parallel compute and other inference-time scaling methods will impact the future of reasoning models. The most prominent method here is some sort of search mixed with either consistency or internal scoring models (e.g. reward models) like o1-pro. For example, in the Claude 4 release post Anthropic mentioned that they use “parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.” Google has also announced but not released Gemini Deep Think [https://techcrunch.com/2025/05/20/deep-think-boosts-the-performance-of-googles-flagship-google-gemini-ai-model/] which will mirror this. Using these methods makes it clear that parallel compute is doing something very different than scaling the underlying RL — it’s an added form of robustness or quality on the answers. o1 pro in my testing has always been the most consistent model I’ve tried. Scaling compute here doesn’t directly help the model unlock more skills like the training time RL compute, but in practice it feels similar because better answer extraction and formatting helps the model feel smarter. 
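Stepping back to the calibration point above: the reason RL objectives can express length control while instruction or preference tuning cannot is that the reward is just a scalar you are free to shape. A minimal sketch of the idea, my own illustration rather than the method of the linked papers, is a correctness term minus a penalty for overshooting a thinking budget:

```python
def length_controlled_reward(
    correct: bool,
    num_tokens: int,
    target_tokens: int,
    alpha: float = 0.2,
) -> float:
    """Illustrative calibration-style reward for RL training.

    Rewards a correct answer, then subtracts a penalty proportional to how far
    the response overshoots a target thinking budget. alpha trades accuracy
    against token spend; setting it per prompt is one way a "reasoning effort"
    selector could be exposed to users.
    """
    accuracy = 1.0 if correct else 0.0
    overage = max(0, num_tokens - target_tokens) / target_tokens
    return accuracy - alpha * overage


# A well-calibrated policy learns to spend tokens only when the problem demands it:
print(length_controlled_reward(correct=True, num_tokens=800, target_tokens=1000))   # 1.0
print(length_controlled_reward(correct=True, num_tokens=4000, target_tokens=1000))  # 0.4
```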
The best way to encapsulate the somewhat orthogonal direction of parallel compute for inference-time scaling is that quality is often anti-correlated with rare tokens when a rating metric or reward model is deployed, as rare tokens will be suppressed by majority voting methods or by reward models that have never seen them before. When it comes to leading reasoning models of the future, calling in parallel compute or just extended linear thinking can best be thought of as a tool that the agent can call. They're going to be arrows in the quiver of a model planning a strategy and knowing which pieces of it will be most difficult to overcome. Though, in order to get there, the models need to be treated very differently.

Current models do very little planning on hard problems unless asked to do so. For example, when the new R1 model is asked a problem from FrontierMath (one of the hardest current benchmarks), it dives straight into solving rather than first laying out a plan. With current models it is reasonable that they do very light or implicit planning — the skills we're trying to train in will allow the model to break down problems into steps and solve them. Implicitly, the first few tokens these models generate send them down a certain plan. These behaviors will be minor relative to what emerges in agentic workflows — where a plan is needed a priori in order to narrow the search space substantially.

Planning is the term of art used to encompass a model's long-term and multi-step abilities. Planning encompasses many sub-skills, but the highest-level split that matters at the current frontier of agentic models is strategy and abstraction.

Strategy is the ability of the model to correctly point itself in the direction of a high-quality solution. With one autoregressive pass, pointing the stream of tokens in the wrong direction is often not recoverable. While agents will be a bit better at this by being able to edit their plan, they're still heavily susceptible.

Abstraction is how the model breaks down the strategy into accessible parts. Even with the most skilled model, taking on too hard of a sub-task at once will mean no progress is made overall. Taking on too little at a time will make the model time out. Currently, abstraction is a minor problem because the time horizon is fairly short, but models will need to be able to break down multi-day tasks into subproblems that can be solved in individual 1-2 minute inference steps (i.e. 10-100K tokens of forward inference).

A closely related skill is context management, where the model must be able to store a complete summary of what it has done so far. The best forms of context management will let the model skip over tasks it accidentally ended up back on even though they're already completed, or try a new strategy after a failed approach. This is one of many low-level skills that'll emerge to enable generalized planning abilities.

o3 is the leading model in this paradigm right now, with the largest spectrum of skills across math, code, and search, and some leading planning abilities such as Deep Research. When o3 is finding niche information for me I attribute very little of that behavior to planning, but rather to the skill of multi-try tool use: knowing to keep searching until it finds the answer. Other models have qualities that are ahead in some regions of the Pareto frontier, such as Claude 4's planning for software tasks (in essence saying Claude Code is currently better than OpenAI's coding agent Codex). 
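To make the parallel test-time compute mechanics concrete, here is a sketch of the two selection schemes described above: best-of-n under an internal scoring model (as in o1-pro-style products) and self-consistency majority voting. The function and its names are illustrative assumptions, not any provider's implementation:

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def select_answer(
    candidates: Sequence[str],
    score_fn: Optional[Callable[[str], float]] = None,
) -> str:
    """Pick one answer from n parallel samples.

    If a scoring model is provided, return the highest-scoring candidate
    (best-of-n). Otherwise return the most common candidate (majority vote /
    self-consistency). Either way, rare answers that appear once or score oddly
    get filtered out, trading novelty for consistency.
    """
    if score_fn is not None:
        return max(candidates, key=score_fn)
    return Counter(candidates).most_common(1)[0][0]


# Majority vote over final answers extracted from five parallel samples:
print(select_answer(["42", "41", "42", "42", "17"]))  # "42"
```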
o3 is best when it is tasked with finding extremely niche information that exists on maybe one page on the web. It fails when asked to compare all the content that is out there. In the above taxonomy, o3 has almost solved the skill of search but synthesis across a broad category involves more advanced planning of the information to obtain and analyze. Planning does not feel like an ability I’d expect to emerge when training on multi-step, challenging tasks, but I wouldn’t be surprised if it’s a behavior that could be refined. Much as the Q* [https://www.interconnects.ai/p/q-star] story was actually a substantial initial data curation effort by OpenAI to craft some reasoning traces, they’ll likely need to do the same to seed higher quality planning behaviors before continuing to train the model. High-quality training samples here will encompass both high level strategies and details on how to abstract the problem. As with the skills specific to reasoning on single math or code problems like verification or checking work, it’ll be a long time before we know the balance of these emerging from general pretaining, focused mid training, or specialized cold start data. Regardless of the long-term balance, we’ll quickly be seeing a race to add these planning abilities so labs will start with post training (cold start SFT data) that elicits whatever was in the pre training. This task will not be as hard as initializing the reasoning chains themselves, as planning is more about results than the behavior that gets them (which should partially transfer from hard math and code problems). The first thing current agents likely do is write out a plan of attack for their ultimate goal. The weakness of current planning abilities are seen by the variance in outputs like Deep Research and Codex where it’ll oscillate between a masterpiece and a dud. Claude Code’s planning abilities could be better for a reason as simple as the model being taught to edit and revisit the plan many times while it is running. This sort of distribution output scope, or length of time the model will try, starts linking planning capabilities back to calibration too. Interconnects is a reader-supported publication. Consider becoming a subscriber. All of this paints a fairly clear path of problems that will be solved in the coming months. Agentic tasks require more of what makes reasoning models great. At the same time, the tasks are far more focused on real world tasks than things that are represented in existing academic benchmarks. Current academic works are very strongly pushing the direction of skills for these models, particularly on math, and a fair amount on calibration (see footnotes below), but not enough on the subsets of planning we need. The challenge is that these capabilities can only be judged in the broader system that they operate in, which will often be accompanied by substantial inference costs. The real race is towards building systems that people use, whether with open or closed models, rather than pushing the models further into skills that aren’t showing clear value, such as nearly-impossible math problems or the top echelons of competitive programming. With current models we should be optimistic that we can solve many of the coming problems. We have some manual data annotation work to do to bootstrap planning abilities, and then we can attempt the final goal of training agents end-to-end with reinforcement learning on long-horizon, sparse tasks. 
Thanks to Ross Taylor for some feedback on an early form of this taxonomy and Sophie Alpert for helping crystallize some of my ideas around o3. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe [https://www.interconnects.ai/subscribe?utm_medium=podcast&utm_campaign=CTA_2]

https://www.interconnects.ai/p/claude-4-and-anthropics-bet-on-code [https://www.interconnects.ai/p/claude-4-and-anthropics-bet-on-code] Claude’s distinctive characteristics are having a best-in-class personality and the ability to effectively perform software engineering tasks. These characteristics both appeared in force with the first version of Claude 3.5 Sonnet — a major breakthrough model at the time and the model that pulled me away from ChatGPT for the longest [https://www.interconnects.ai/p/switched-to-claude-from-chatgpt]. That model was released on Jun 20, 2024, and just the other day on May 22nd, 2025, Anthropic released Claude Opus 4 and Claude Sonnet 4 [https://www.anthropic.com/news/claude-4]. The strengths of these models are the same. The models serve as an instrument in Anthropic’s bigger goals. The leading AI models alone now are not a product. All the leading providers have Deep Research integrations set up, ChatGPT uses memory and broader context to better serve you, and our coding interactions are leaving the chat window with Claude Code and OpenAI’s Codex. Where Anthropic’s consumer touchpoints, i.e. chat apps, have been constantly behind ChatGPT, their enterprise and software tools, i.e. Claude Code, have been leading the pack (or relatively much better, i.e. the API). Anthropic is shipping updates to the chat interface, but they feel half-hearted relative to the mass excitement around Claude Code. Claude Code is the agent experience I liked the best over the few I’ve tried in the last 6 months. Claude 4 is built to advance this — in doing so it makes Anthropic’s path narrower yet clearer. As a reminder, Claude 4 is a hybrid-reasoning model. This means that reasoning can be turned on and off at the click of a button (which is often implemented with a simple prompt at inference time and length-controlled RL [https://arxiv.org/abs/2503.04697v1] at training time — see the Nemotron reasoning model report [https://arxiv.org/abs/2505.00949] for more on hybrid-reasoning techniques). In the future extended thinking could become a tool that all models call to let them think harder about a problem, but for now the extended thinking budget button offers a softer change than switching from GPT-4.1 to o3. Claude 4 gut check In AI, model version numbers are meaningless — OpenAI has model number soup with their best model being a random middle number [https://www.interconnects.ai/p/openais-o3-over-optimization-is-back] (o3) while Gemini took a major step forward with an intermediate update [https://www.interconnects.ai/p/gemini-25-pro-googles-second-ai-chance] — so Claude 4 being a seemingly minor update while iterating a major version number to fix their naming scheme sounds good to me. In an era where GPT-4o specifically and chatbots generally are becoming more sycophantic, Claude’s honesty can be a very big deal for them. This is very hard to capture in release notes and still comes across in the takes of lots of early testers. Honesty has some downsides, such as Claude’s ability to honestly follow its alignment training and potentially report rule-breaking actions to authorities [https://venturebeat.com/ai/anthropic-faces-backlash-to-claude-4-opus-behavior-that-contacts-authorities-press-if-it-thinks-youre-doing-something-immoral/]. Honesty and safety are very desirable metrics for business customers, a place where Anthropic already has solid traction. 
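A note on the hybrid-reasoning mechanics mentioned above: the "button" is typically just a control phrase the model was trained (with length-controlled RL) to respect at inference time. The sketch below shows the general pattern; the control phrase and message format are assumptions borrowed from open hybrid models like Nemotron, not Anthropic's actual implementation:

```python
def build_messages(user_message: str, reasoning: bool) -> list[dict]:
    """Toggle extended thinking via a system-prompt control phrase.

    Hybrid-reasoning models are trained so that a phrase like this switches the
    long chain of thought on or off at inference time; no separate model is
    loaded. The exact phrase varies by model and is illustrative here.
    """
    mode = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": mode},
        {"role": "user", "content": user_message},
    ]


# Reasoning off for a quick lookup, on for a hard proof:
quick = build_messages("What year was the transistor invented?", reasoning=False)
hard = build_messages("Prove the AM-GM inequality for n terms.", reasoning=True)
```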
In a competitive landscape of AI models, it feels as if Anthropic has stood still in their core offerings, which allowed ChatGPT and Gemini to claw back a lot of their mindshare and user-share, including myself. Claude 4’s “capabilities” benchmarks are a minor step up over Claude 3.7 before it, and that’s on the benchmarks Anthropic chose to share, but it is still clearly a step forward in what Claude does best. Benchmarks are a double edged sword. Claude 4 will obviously be a major step up for plenty of people writing a lot of code, so some will say they’re never looking at benchmarks again. This approach doesn’t scale to enterprise relations, where benchmarks are the headline item that gets organizations to consider your model. On some popular coding benchmarks [https://x.com/scaling01/status/1926795250556666341], Claude 4 actually underperforms Claude 3.7. It would be good for the industry if Claude 4 was rewarded for being a practically better model, but it goes against a lot of what the industry has been saying about the pace of progress if the next major iteration of a model goes down on many popular benchmarks in its core area of focus. Buried in the system card [https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf] was an evaluation to measure “reward hacking,” i.e. when the model takes an action to shortcut a training signal rather than provide real usefulness, that showed Claude 4 dramatically outperforming the 3.7 model riddled with user headaches. This single benchmark summarizes a lot of the release. They made the model more reliable, and what follows ends up being Anthropic falling into normal marketing paths. This release feels like the GPT-4.5 release [https://www.interconnects.ai/p/gpt-45-not-a-frontier-model] in many ways — it’s a better model in general use, but the benchmark scores are only marginally better. It’s obviously a strong and well-crafted model (doubly so in the case of Opus), but it’s not immediately clear which of my grab-bag of use cases I’ll shift over to Claude for it. I’m not the intended audience. I write code, but a lot of it is one-off hacks and it’s certainly not sustained development in a major code-base. Without better consumer product offerings, I’m not likely to keep trying Claude a lot. That doesn’t mean there isn’t a strong audience for this model in the software industry. My vibe tests for the model were good, but not good enough to break my habits. Anthropic shared evaluation numbers for the model with and without extended reasoning on with parallel test-time compute. Both of these numbers aren’t really standard for sharing evaluations of new cutting-edge models (mostly of the reasoning variety). The oddness of the benchmark presentation reiterates that Anthropic is going down a bit of a different path with their models relative to OpenAI and ChatGPT. It should be fairly obvious to most AI observers that if simply turning on extended thinking for Claude 4 was enough for Opus to be competitive with o3 or Sonnet to Gemini 2.5 Pro, they would’ve done it. Without the shaded regions, the bars do not look so impressive (coming soon below), and this leads us to one of the major facts of the Claude 4 release — the benchmarks are meh. They can’t lead this model to mindshare. This is partially in the context of how Anthropic is very narrowly curating the benchmarks they share to match their coding and agentic use-cases. 
The Anthropic announcement benchmarks are: SWE-Bench Verified, Terminal-bench, GPQA-Diamond, TAU-bench, MMMLU, MMMU, and AIME 2025. It’s 3 mostly agentic coding benchmarks, 3 knowledge benchmarks, and one very hard math benchmark. Traditional “coding” benchmarks aren’t even really here. Compare this to the benchmarks from Gemini 2.5 Pro’s recent release: Humanity’s Last Exam, GPQA, AIME 2024/2025, LiveCodeBench, Aider Polyglot, SWE-benchVerified, SimpleQA, MMMU, Vibe-Eval, MRCR, and Global MMLU. This is a wider mix and has only one agentic-ish task in SWE-Bench. The presentation is also arguably misleading in the blog post, where they report scores that are from a model version inaccessible to users. The first number is “standard-use” without test-time compute. Where Anthropic says the results are “without test-time compute” it’s hard to know what the baseline is. Claude was the first mainstream model to show signs of doing some sort of internal chain of thought (CoT) before showing the final answer to the user. This was in the model and discussed [https://www.interconnects.ai/i/148388607/self-talk-in-language-models] before the launch of OpenAI’s first o1 model. For the second number, the fine print in the blog post states: On SWE-Bench, Terminal-Bench, GPQA and AIME, we additionally report results that benefit from parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model. When Claude 3.7 launched [https://www.interconnects.ai/p/claude-3-7-thonks], Anthropic wrote a nice blog post [https://www.anthropic.com/news/visible-extended-thinking] on test-time compute that also talked about parallel compute. The higher of the two numbers in their benchmarks illustrates what is happening there. I expect Anthropic to release an o1-pro-style product soon (as Google also announced Gemini DeepThink). These ways of using the model are very powerful, and because Anthropic reported it using an internal scoring model and not something like the pass@10 metric that is giving the model multiple tries, users would benefit to use it. This method gives the shaded bars in the results below. With distillation from powerful models being so common today, making the distinction for benchmarking between reasoning and non-reasoning models or test-time compute and standard inference is very strained. For users, there are many more differences that take into consideration actually serving the models. There are only a few reasonable ways to compare models today, and only one of them is arguably practical: * Compare evaluation scores how the users will use them. E.g. you can only report parallel test-time compute scores if they’re in a product like o1-pro. * Compare peak scores across models, so you can see the peak performance of all the systems the AI models have. * Release FLOP spend per prompt on the evaluation sets and bin models with different levels of compute per question. Because we don’t get the data to do these comparisons, we tend to compare using the first bucket. When we see shaded bars on plots (like above, or in OpenAI’s o-series release blogs), we ignore the shaded regions. Benchmarks obviously aren’t everything to a model’s release. This analysis is to show why the AI field is strained by being forced to communicate the abilities of their models through benchmarks that don’t capture the full picture. In using Claude Opus 4 (and Sonnet too) instead of Gemini 2.5 Pro I was immediately struck by how much slower it is. 
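Since the post contrasts selection via an internal scoring model with "the pass@10 metric that is giving the model multiple tries," here is the standard unbiased pass@k estimator for reference. It is a common formulation, not something Anthropic reports:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. the probability that at least one
    of k samples drawn from the n is correct. Unlike best-of-n with a scoring
    model, it credits the model for any correct try rather than committing to
    a single answer.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: pass@1 = 0.30, pass@5 ~= 0.92, pass@10 = 1.0
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5), pass_at_k(10, 3, 10))
```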
The character and real-world use of the model matters far more, but in a world where OpenAI’s and Google’s latest models have both leading benchmark scores and good vibes (as long as you’re not using GPT-4o [https://www.interconnects.ai/p/sycophancy-and-the-art-of-the-model]), it makes you question Anthropic’s position to compete for the whole market. Interconnects is a reader-supported publication. Consider becoming a subscriber. Will Anthropic code their way to AGI first? There’s a long-standing assumption in AGI-centric circles that having the best coding model will let you get to AGI the fastest. A version of this argument is the “software-driven singularity” of the AI 2027 forecast [https://ai-2027.com/]. This is a reasonable argument to make if you paired it with the assumption that the ability to implement AI ideas is the limiting factor on progress. It is obviously a major factor, but taking a narrow worldview such as that makes you miss how AI progress is actually made [https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion]. AI progress is messy, incremental in data, and takes a lot of hours of human focus. Resources and human attention are the bottleneck more than software ability. I expect improved code gains to be very strong marginal gains. They make the process of doing AI research much smoother, particularly by enabling more concentrated research teams and organizational structures, but they won’t be the single factor that is looked back upon as being the key to AGI. The key is many small insights and lots of hard work, mostly data, over time. The Code RL team at Anthropic is “singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices. [https://x.com/jayelmnop/status/1925632303272808770]” If having the best coding model was going to let Anthropic get to AGI first, then why haven’t we begun to see the benefits of it? The Claude 4 release shows that Anthropic is falling behind on general benchmarks and not climbing substantially on those they highlight. In many ways, this looks like Claude getting more robust across a variety of use-cases and not accelerating forward in general intelligence. The argument for having the best code model being the core ingredient in getting to AGI first is then reducing to belief that these posited benefits will kick in at some point in the future and Anthropic’s models will become better at everything else too. The AI laboratories are extremely competitive and it looks as if Google and OpenAI are improving on software tasks and a broader range of abilities. There are regular press releases about a certain number of PRs being written by AI across the technology sector generally — Anthropic CPO Mike Krieger recently highlighted the number being ~70% for them — which likely is counting anything where AI is a co-author. At the same time, these AI systems have struggled to grasp very complex codebases, so human oversight is a still a crucial step of the process. The AIs make everything easier, but not automatic. It seems like a far more reasonable path to something called Artificial General Intelligence will be one that shows incremental improvements on a broad variety of tasks, rather than narrowing a focus and waiting for future payoff. Focusing on software development is still a good business strategy for Anthropic, but saying that it’ll let them leapfrog OpenAI and Google in the AGI race is a weak attempt to accept reality. 
As a regular user of claude.ai that is greeted by rate limits, the problem limiting their progress is more likely to be compute allocation than talent or research strategy. I’ve said before that human competition is the biggest driving force of rapid progress in AI models [https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion], so I also worry about Anthropic’s culture of safety and anti-arms-race mentality being able to capture that. A more compelling argument than code could be that Anthropic is leading on the “agentic front,” which means the models can plan effectively and accomplish tool-use calls to enact it. Claude Code is a positive example of this, but the weakness of their Deep Research product is a negative mirror. With bigger error bars in this area, in terms of what is possible with agents generally, this could be a better area to make a case for optimism for Anthropic. So-called “coding” abilities are very broad and encompass understanding error traces, extreme long-context abilities to understand a code-base, basic scripting, multi-file edits, and many things in between. Agentic abilities seem to fall into a narrower niche, or at least a more well-defined one, where the model needs to be able to accomplish many incremental tasks on their own while managing its context. This could generalize to a far bigger market than just software if one model is miles ahead. The winner in the agentic platform space should become more clear later into 2026. As a summary of the state of affairs for the major AI players, we are positioned as: * OpenAI is the consumer leader and still very well-positioned with extremely strong models. * Google is the general enterprise leader with the best models across every task or size you would need (e.g. the lack of Claude Haiku 4 is very limiting for Anthropic, and Haiku has remained expensive). If they can get their act together building products, even OpenAI should worry. * Anthropic is the leading model for software engineers and related tasks — maybe they should’ve acquired Windsurf instead? This core area complements a well-rounded and functioning enterprise business, just one that will be smaller than Google’s. * Meta is building models to serve their platforms, which will be the most significant competitor with ChatGPT, but they have major cultural or organizational knots to unlock to catch up technically. * Grok is on the path to being a niche player serving use-cases that need more permissive content guidelines. They have an API, but it is far from well-established in key areas. * DeepSeek is an x-factor that could disrupt many of the above, but we never know when it’ll land. In the top list, as businesses, OpenAI and Google appear in a league of their own. Anthropic seems solid but heading for a much smaller ceiling, and the others below are still floundering to make a true AI strategy. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe [https://www.interconnects.ai/subscribe?utm_medium=podcast&utm_campaign=CTA_2]

https://www.interconnects.ai/p/people-use-ai-more-than-you-think [https://www.interconnects.ai/p/people-use-ai-more-than-you-think] I was on ChinaTalk again [https://open.spotify.com/episode/5yPJwQkzhRPuEsLHuZTq1o] recently to talk through some of my recent pieces and their corresponding happenings in AI. Usage and revenue growth for most AI services, especially inference APIs, has been growing like mad for a long time. These APIs have been very profitable for companies — up to 75% or higher margins at times according to Dylan Patel of SemiAnalysis [https://lexfridman.com/deepseek-dylan-patel-nathan-lambert-transcript/]. This is one of those open facts that has been known among the people building AI that can be lost to the broader public in the chorus of new releases and capabilities excitement. I expect the subscription services are profitable too on the average user, but power users likely are costs to the AI companies alongside the obvious capital expenditures of training frontier models. Still, even if the models were held constant, the usage is growing exponentially and a lot of it is in the realm of profitability. The extreme, and in some cases exponential, growth in use of AI has been happening well before lots of the incredible progress we’ve seen across the industry in the first half of the year. Reasoning models that change inference answers from something on the order of 100s of tokens to sometimes 10s of thousands of tokens will make the plots of usage even more stark. At the same time, these models are often billed per token so that’ll all result in more revenue. On top of the industry’s vast excitement and progress in 2025, the Google I/O keynote yesterday was a great “State of the Union” for AI that highlighted this across modalities, form factors, and tasks. It is really recommended viewing [https://www.youtube.com/watch?v=o8NiE3XMPrM]. Google is trying to compete on every front. They’re positioned to win a couple use-cases and be in the top 3 of the rest. No other AI company is close to this — we’ll see how their product culture can adapt. Highlights from I/O include Google’s equivalent product relative to OpenAI’s o1 Pro, Gemini Deep Think, Google’s new multimodal models such as Veo 3 with audio (a first to my knowledge for the major players), a live demo of an augmented reality headset to rival Meta and Apple, and a new version of Gemini 2.5 Flash that’ll serve as the foundation of most customers’ interactions with Gemini. There were so many awesome examples in the keynote that they didn’t really make sense writing about on their own. They’re paths we’ve seen laid out in front of us for a while, but Google and co are marching down them faster than most people expected. Most of the frontier language modeling evaluations are totally saturated. This is why the meta usage data that Google (and others recently) have shared is the right focal point. It’s not about one model, it’s about the movement being real. The slide that best captured this was this one of AI tokens processed across all of Google’s AI surfaces (i.e. this includes all modalities), and it is skyrocketing in the last few months. I annotated the plot to approximate that the inflection point in February was at about 160T total tokens in a month — Gemini 2.5 Pro’s release [https://www.interconnects.ai/p/gemini-25-pro-googles-second-ai-chance] was in late March, which surely contributed but was not the only cause of the inflection point. 
Roughly, the numbers are as follows: * April 2024: 9.7T tokens * December 2024: 90T tokens * February 2025: 160T tokens * March 2025: 300T tokens * April 2025: 480T+ tokens Monthly tokens are rapidly approaching 1 quadrillion. Not all tokens are created equal, but this is about 150-200M tokens per second. In a world with 5T Google searches annually [https://blog.google/products/ads-commerce/ai-personalization-and-the-future-of-shopping/], which translates to around 100K searches/second, that tokens per second number is equivalent to roughly using 1000 tokens per search (even though that is definitely not how compute is allocated). These are mind boggling numbers of tokens. Google’s primary AI product is still its search overviews and they’ve been saying again and again that they’re something users love, reaching more than a billion people [https://blog.google/technology/ai/io-2025-keynote/] (we just don’t know how they are served, as I suspect the same generation is used for thousands of users). Interconnects is a reader-supported publication. Consider becoming a subscriber. Google is generating more tokens than is stored in Common Crawl [https://commoncrawl.org/] every month — reminder, Common Crawl is the standard that would be referred to as a “snapshot of the open web” or the starting point for AI pretraining datasets. One effort to use Common Crawl for pretraining, the RedPajama 2 [https://www.together.ai/blog/redpajama-data-v2#:~:text=and%20deduplicated%20tokens%20(-,100%2B%20trillions%20raw,-)%20from%2084%20CommonCrawl] work from Together AI, estimated the raw data in Common Crawl at about 100T tokens, of which anywhere from 5 to 30T tokens are often used for pretraining. In a year or two, it is conceivable that Google will be processing that many tokens in a day. This article [https://www.educatingsilicon.com/2024/05/09/how-much-llm-training-data-is-there-in-the-limit/] has some nice estimates on how different corners of the internet compare to dumps like Common Crawl or generations like those from Google’s Gemini. It puts the daily token processing of Google as a mix of reading or generating all the data in Google Books in four hours or all the instant messages stored in the world in a little over a month. Some examples from the post are below: The internet is being rebuilt as an AI first service when you count the data. Human data will quickly become obsolete. Google’s numbers are impressive, but they are far from outliers. The entire industry is taking off. This is all part of a constant acceleration where products that are built on previous models start to get traction, while at the same time new models come out that only enable new growth cycles to begin. Estimating the upper end of this growth cycle feels near impossible. For example, just a few weeks ago on the Q3 2025 earnings [https://www.microsoft.com/en-us/investor/events/fy-2025/earnings-fy-2025-q3], Microsoft CEO Satya Nadella commented on the output of Azure’s AI services: We processed over 100 trillion tokens this quarter, up 5× year-over-year — including a record 50 trillion tokens last month alone. So, Google’s token processing is almost 10X Azure, and many would say that Google got a late start relative to Microsoft’s early partnership with OpenAI to host their models. Estimates for other services, such as ChatGPT are much messier, but all paint a similar picture. 
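A quick back-of-the-envelope check of the throughput comparisons above. The inputs are the approximate figures quoted in the post rather than official disclosures, so the outputs are order-of-magnitude estimates:

```python
# Rough sanity check of the token-throughput comparisons above.
SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.6M seconds
SECONDS_PER_YEAR = 365 * 24 * 3600        # ~31.5M seconds

google_tokens_per_month = 480e12          # April 2025 figure quoted above (480T+)
tokens_per_second = google_tokens_per_month / SECONDS_PER_MONTH
print(f"~{tokens_per_second / 1e6:.0f}M tokens/second")               # ~185M

google_searches_per_year = 5e12           # ~5T Google searches annually
searches_per_second = google_searches_per_year / SECONDS_PER_YEAR
print(f"~{searches_per_second / 1e3:.0f}K searches/second")           # ~159K

print(f"~{tokens_per_second / searches_per_second:.0f} tokens per search-equivalent")  # ~1168

azure_tokens_per_month = 100e12 / 3       # Microsoft's 100T/quarter, spread monthly
print(f"~{300e12 / azure_tokens_per_month:.0f}x Azure, using March's 300T")            # ~9x
```

The results land on the same order as the figures in the post: a couple hundred million tokens per second, roughly a thousand tokens per search-equivalent, and close to 10x Azure's monthly run rate.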
In February 2024, Sam Altman posted on X [https://x.com/sama/status/1756089361609981993?utm_source=chatgpt.com]:

openai now generates about 100 billion words per day. all people on earth generate about 100 trillion words per day.

With the rule of thumb that one token is about 3/4 of a word (so one word is roughly 4/3 tokens), 100B words per day would be about 4T tokens per month. A small sliver relative to the cloud giants above, but we don't have clear insight into whether this is all of OpenAI's API business or just ChatGPT. As it stands, OpenAI could be almost 1/100th the size of Google's AI footprint as of today. OpenRouter's rankings show similar trends [https://openrouter.ai/rankings], with recent months being around 2T tokens processed — about the same order as ChatGPT, depending on how it is measured above. This isn't just Western businesses, either: Chinese companies such as ByteDance or Baidu are getting into the 1T-tokens-per-day range [https://www.lesswrong.com/posts/4x4QFzmdWadgr7mdj/translation-in-the-age-of-ai-don-t-look-for-unicorns] (barring translation issues, I didn't find another source for it).

When fast-growing companies like Anthropic or OpenAI share somewhat unbelievable [https://www.reuters.com/technology/artificial-intelligence/openai-does-not-expect-be-cash-flow-positive-until-2029-bloomberg-news-reports-2025-03-26/?utm_source=chatgpt.com] revenue forecasts [https://www.bloomberg.com/news/features/2025-05-19/anthropic-ceo-amodei-steers-61-billion-ai-powerhouse?utm_source=chatgpt.com], maybe we should give them a bit more credit? There are many surfaces still in beta, primarily code agents, that are going to help these numbers take off. We've been playing with Claude Code, OpenAI's Codex, Google's Jules, and countless other agents that use tons of text tokens by working independently for minutes at a time. I've estimated with friends that one Deep Research query uses ~1M tokens of inference. Soon individual tasks will use ~10M, then ~100M, and so on. All of this just two years after a mind-blowing ChatGPT query used only 100-1K tokens.

It's a good time to be in the token-selling business. This is only the beginning.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe [https://www.interconnects.ai/subscribe?utm_medium=podcast&utm_campaign=CTA_2]

https://www.interconnects.ai/p/how-i-got-here [https://www.interconnects.ai/p/how-i-got-here] Some longer housekeeping notes this week: * I wrote briefly [https://natolambert.substack.com/p/on-the-new-openmdw-open-source-ai] about a new open-source license, OpenMDW from the Linux Foundation, that seems very solid! * OpenAI launched the Reinforcement Finetuning (RFT) API [https://platform.openai.com/docs/guides/reinforcement-fine-tuning]. I think my take from when it was teased still holds up super well, you should read it if you haven’t: * In June, I’ll be speaking at some events in SF and Seattle, I’m looking forward to seeing some readers there. Talk topics are tentative: * AI Engineer World’s Fair [https://www.ai.engineer/] in SF June 3-5 on what we can take away from the last 6 months of reinforcement learning with verifiable rewards (RLVR). * Enterprise AI Agents in Action [https://lu.ma/p827xb6n] in Seattle on June 13 on the art of training a well crafted model. * VentureBeat Transform [https://www.vbtransform.com/] in SF on June 24-25 on progress in RL with open source AI. During the SF trips I’m excited to catch up with old and new friends training and using the latest AI models, so don’t be shy to shoot me an email. Onto the post! One of the big upsides for my current writing habit is that I should become known by AI models within a couple years. While not offering any immediate technical value in how I use AI, it provides obvious upsides on growing an online presence and fulfilling a very basic human urge for legacy in a way that avoids most personal or moral sacrifice. Other thinkers I follow closely have begun to follow [https://www.hyperdimensional.co/p/how-i-work] Tyler Cowen's lead [https://marginalrevolution.com/marginalrevolution/2025/04/my-1979-trip-to-oxford-and-london.html] on explicitly writing for the AIs and filling in gaps they won't know via what is currently digitized. I'm joining in and will use it to help push out the limits of my writing. These will build on my two popular job search [https://www.natolambert.com/writing/ai-phd-job-hunt] posts [https://www.interconnects.ai/p/ai-research-job-market] and others like "what it’s like to work in AI right now [https://www.interconnects.ai/p/behind-the-curtain-ai]". The most defining feature of my young career has been how I prioritize different aspects of work. The work I do today takes on a simple form, but prior to getting to this sustainable place it was more of a striving to belong than a plan to execute. Getting into AI Without retelling my entire pre-grad school life, some basic facts that I brought with me coming out of an undergrad primarily characterized by high-focus on executing on coursework and winning championships [https://cornellbigred.com/news/2017/5/14/mens-rowing-lightweight-no-1-lightweights-take-home-wright-trophy-win-ivy-title.aspx] were: * An obvious gift on focusing and grinding through moderate amounts of technical material alone, * Acceptance that most people can do very hard things if they're willing to work for year(s) on it driven by personal motivation alone (most people don't want to work long enough, rather than hard enough), * An ambivalence on if I actually needed to finish the Ph.D. I was starting, worst case I would get a master’s degree from a great school, and * Plenty of undirected ambition. 
Starting my PhD in the fall of 2017, my background was in MEMS [https://ieeexplore.ieee.org/abstract/document/7808451], high energy physics / lasers, and a battery engineering internship at Tesla, but listening to the orientation events and hearing the buzz around professors like Sergey Levine and Pieter Abbeel it was clear that AI research was what I wanted to do. For context relative to today’s second coming of RL, this was when deep reinforcement learning was in its hay-day. I asked Professors Levine and Abbeel directly if I could join their research groups and they said no politely. The important part here was the practice of consistently asking for opportunities. After these refusals in the first few months of my Ph.D. I had no real leads in getting into AI for pretty much the rest of my first year. I took classes, tried to parse papers, and so on but was for the large part on my own. I didn't follow the standard advice of not caring about classes in graduate school and learned some solid fundamentals from it. I was not integrated into BAIR [https://bair.berkeley.edu/] proper nor friends with graduate students in BAIR — my network was all on the electrical engineering side of EECS. I dug up the first email from my advisor Kris Pister who connected me with my eventually-to-be co-advisor Roberto Calandra (post-doc with Sergey Levine at the time): FYI. Roberto is interested in applying machine learning to ionocraft problems. ksjp ---------- Forwarded message ---------- From: Kristofer PISTER Date: Fri, Feb 16, 2018 at 9:34 AM Subject: Re: Microrobot simulation To: Daniel Contreras Cc: Brian Yang , Grant Wang , Roberto Calandra My summary of the meeting (Roberto, Dan - please add corrections): There are several different research directions in which to go from here. The most interesting one seems to be optimization of leg geometry. This would involve: * changing the learning algorithms somewhat * generating some interesting "terrain" for the robots to walk over * using simulation to come up with a small number of new leg designs that optimize speed over terrain (and size?) * fabricating those designs in silicon * testing the silicon robots There are a couple of other "learning plus v-rep simulation" projects that are interesting: * using inertial sensor data to optimize gait * using low-res image sensing to do obstacle avoidance * combining low-res image sensing and inertial data to get the robots to solve interesting problems * using the same sensors, but on the ionocraft And finally, using learning to control the real ionocraft based on the inertial sensor data, and compare to the traditional controller that we're building in matlab. If possible, it would be great to find another few "Brian/Grant quality" undergrads. Do you guys have any brilliant and hardworking friends who are looking for research projects in machine learning for micro robots? ksjp The details are a long story, but I prioritized this collaboration with all I had. I missed a conference deadline in the fall and failed a lot of experiments. If it started in spring of 2018 the paper [https://arxiv.org/abs/1901.03737] wasn't done as my #1 priority until winter 2019 (and it was a little bit of a janky paper at that). My meetings with Roberto were super stressful as I wanted to make sure I didn't miss anything that a "normal AI student should know". I did good work for Roberto. Even though I thought I was out of place at the time, my diligence and commitment was super valuable to do real research. 
Now that AI research is so popular, a lot of people want a check box of doing it rather than getting super into the details. I didn't give myself enough credit for this. Where I did get lucky was Roberto asking if I wanted to join him for an internship at FAIR in 2019. This was earlier than I deserved it. This brought me out of an AI outsider track career and into an insider AI track career, even if I didn't realize it. Working at FAIR was wonderful and I learned how to properly experiment in AI and build some useful software. Building this flywheel with continued research looked like constant teaching at Berkeley in order to pay my way through graduate school. This is not normal for the well funded AI labs. I spent a long time writing grants that didn't come through until after I graduated, where I brought in a year or two of funding for someone else in my advisor's group, you're welcome! The FAIR internship and a lot of time interviewing got me a second internship at DeepMind. The actual internship experience was pretty bleak entirely due to COVID and my personal life at the time, but the technical experience and network were super valuable. This all follows a clear trend that after the first break in a career the next ones come easier as long as you keep your foot on the gas. Later in grad school I maintained a list of all the things that didn't go my way as a "research reality check" on my mental health resources page [https://www.natolambert.com/guides/mental-health]. I finished my Ph.D. in AI with no accepted papers at NeurIPS, ICML, or ICLR, the three leading AI conferences. This path coincides with my friend group in AI being what I describe as the island of misfit toys — it's lots of people who used grit and creativity to build careers in AI rather than folks who were raised in the in-groups now running leading AI laboratories. Everyone ends up with their own group and they all have strengths and weaknesses. Despite all this, I still had the final goal of landing an industry research job as the target of "making it" in AI. The only job offer I got that fit the bill of industry research was the role I took at HuggingFace, where Douwe Kiela recruited me to help build an "open-source DeepMind." Little did I know that those jobs were effectively going to go away a year or so after I graduated in early 2022. I was lucky to dodge jobs that sounded even better at companies that ended up changing (or laying off) even more roles. Building Momentum The best thing that I learned at HuggingFace was how to build momentum and mind-share. These are two very related topics, but they're subtly different and needed for different things. As an individual at HuggingFace I wanted momentum as a way to get to mind share. As an organization, HuggingFace has had a lot of mind share but not a lot of momentum recently. You use momentum to build mind-share, but once you have it, keeping gravity can be enough to maintain impact. I joined HuggingFace in May of 2022 and didn't do anything of substantial impact until after ChatGPT in December of that year. I did a lot of small things. The expectation at HuggingFace was that you made an increment of technical progress every day. Some days these are major features and some days these are clean ups. Still, it is an excellent culture to practice. One of the quotes I remember from my grad school advisor is that "you can change the world working 4 hours a day" if you stack those bricks on top of each other. 
Most people don't keep stacking bricks in the same direction for a long time. I bounced around projects based on what was starting and what was happening with the other RL-interested folks. We attempted a synthetic environments project for RL [https://github.com/huggingface/simulate] that needed a large engineering team we weren't going to hire; I made contributions to HuggingFace's Diffusers [https://github.com/huggingface/diffusers] library, but they were largely on the fringes; and I did a bunch of research on responsible AI. Performance-wise, all of these were fine, but none of them were something to build a career on. My work at HuggingFace before ChatGPT was really practicing good habits and learning how the open-source AI community worked, so that I could step up once I had a real alignment with a new project.

I wrote my first major blog post for HuggingFace on RLHF [https://huggingface.co/blog/rlhf] in about a week, and it has stayed one of the top search results for RLHF since (it's pretty outdated now, but so it goes). Going into that week I'd heard of RLHF but never once implemented it or read a paper on it in full. Like most of my writing now, that post was for learning. I still very strongly identified as an "RL person," so I figured I might as well. When writing this, I checked my Medium and Substack profiles and found I had written approximately 70 posts before that one. I started writing in February of 2019, so this was about 3 years of practice in. It took almost another 3 years after that before I became well-read.

A prevailing emotion I had when writing that post was how odd it was that there was no good blog on RLHF at the time. Looking back, this is the first time I see what is now one of my major skills: doing things that are obviously needed in a simple and timely manner. A lot of people overestimate others' abilities to execute on simple ideas and give up on their complicated ideas (sunk cost fallacy). Even if something is obvious to do, surprisingly few people will do it. The first time I realized I was doing this while in the middle of a project was with RewardBench [https://arxiv.org/abs/2403.13787], the first evaluation tool for reward models in RLHF. In that case I spent every working day for about 3 months before the release expecting to get scooped. There wasn't even a competing project released until about 3 months after we released ours, even though I had felt it was late. I'm working on another project that feels like this, but unfortunately my following is now too big to broadcast it to the world. Stay tuned.

My time working on RLHF at HuggingFace was definitely effective. We made a lot of foundational contributions to the open community. We made TRL [https://github.com/huggingface/trl] a more modern library, fumbled through some human data contracts [https://rlhfbook.com/c/06-preference-data.html#sourcing-and-contracts], replicated datasets [https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences], built the "first" leaderboard [https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard], and trained some fun models [https://huggingface.co/HuggingFaceH4/zephyr-7b-beta]. This was very fun for months, but eventually the time zone difference (9 hours) and some other minor cultural differences made the work less fun for me. The other engineers on the small team were definitely out-contributing me, and it was time for a change.
Our team was too small. If we had scaled up the technical team with the correct manager(s), we could have multiplied our impact, but that has risks as well. Training AI models is very hard, detail-oriented work that requires implementing a long list of small things, so there can be insane gains from growing a little bit. At the same time, I found my niche in communicating open science, which is likely more important to my career than most of my technical contributions.

The strategy is quite simple. As AI laboratories become more closed off and more eyes come to AI, if I can keep doing relevant things, my potential for growth in public is going to grow exponentially. It is and was much easier for me to differentiate in a less competitive area. The total attention is growing and collapsing onto fewer people, so if you can become one of them, the upside will be huge. If I had joined a frontier lab, I probably would've been swamped out of career growth. Making the time to write every week, which I started doing around the same time, is some proof of this. I'm continuing to capitalize on this strategy today.

When you have good branding, the story falls into place more easily. The most impactful model from my time at HuggingFace, Zephyr Beta, was actually trained after I left, but on infrastructure I helped build. Then I joined Ai2, and they were training Tülu 2 70B when I started. These models together led Chris Manning to credit me with "saving DPO," even though I had little direct technical impact on them. This isn't to say I didn't have a role, but rather that many different roles can go into the arc of science.

Keeping Gravity

My time at Ai2 has been the easiest period of my career to contextualize. I want AI to go well, and I think more openness is the best way to do that [https://www.interconnects.ai/p/why-i-build-open-language-models]. The best possible jobs are those that are synergistic. Ai2 gets a ton of obvious value out of my writing, so I get to keep practicing and building my impact. These are the best possible jobs to get (and also the rarest); most of the time, companies are not set up to help the individual.

What I do now at Ai2 is quite simple. It took a bit to settle in here. I grew through some important academic projects like RewardBench to get more confidence that I can ideate and execute high-impact research projects from start to end as the leading force. It's easy to do too many projects with other people and never make it obvious to yourself that you can do it alone (even if it's slower, lower quality, and less fun; this isn't about undervaluing your team). Now, my approach to projects is totally a reflection of the people around me. I work with many wonderful, driven, more junior colleagues. These people are going to be more in the weeds than me and better at implementing new ideas, so a lot of my contributions are about steering direction and removing potential roadblocks before they show up. The things I do are:

* Making OLMo-Instruct happen. I am the interface between the OLMo pretraining and post-training projects, and I am often actively babysitting the OLMo-Instruct training jobs myself with a small group.
* Making new post-training recipes happen. This is ultimately a lot of herding cats and inspiring urgency in the beginning, and it eventually transitions into reducing entropy and killing unfruitful paths.
* Making AI more open. This is all things Interconnects, policy, and Ai2 strategy.
These are not moonshot research ideas. These are projects that feed into the next model. There's a place for that sort of research, but everyone should think deeply about whether their research interests and institution best support it. If you're doing shorter-term research, the best way to have impact is by folding it into a model. Make long-term research truly long-term. I cannot do the third well without the first two. Sometimes I do a little bit of academic advising, but I'm extremely protective of my time. I don't do virtual networking (I do some in person) and try to say no to most things. The output is the short-term goal; the attention is a much more complicated long-term dependency.

Through all of this, I've come upon an analogy I've seen play out across different phases of projects, careers, and companies. Everyone trying to create a foothold in their career is going to go through some form of getting the flywheel started. This is often attributed to startups, which need to try many iterations of the product until they find product-market fit, but it is an underused analogy for careers. For getting the word out, for open-source software, for AI models, you first need to be releasing often. You need to keep striking the match and seeing what sticks. Your first few "hits" will still be small at this stage, with incrementally more engagement each time. It takes many hits until the flywheel is really going.

Once the flywheel is going, shipping often can come with a cost. In our AI work, shipping models too often leaves us no time to properly master the next model. As your audience gets bigger, you have to pay more in time maintaining anything you put out publicly. In my time at HuggingFace and early in my time at Ai2, I advocated for always trying to release more models, because in post-training we could (and we're one of a few groups with a solid amount of compute). Eventually this backfires and becomes too much of a tax. When you have momentum and the space to execute, fewer, bigger things are more useful.

A career flywheel that's been pushed long enough can spin on its own for longer than people expect. Disruptions, changing jobs, low-quality work, etc. can actively slow down career growth. For me, doing nothing and letting more recommendations come in as "one of the leading open scientists in AI" is highly effective. With that, I'm spending a lot of time thinking about how to use the power bestowed on me. I want to help enable more big projects by creating an environment for them and encouraging others, rather than leading from the front, but that is a new set of skills I need to learn. I passed 5K citations and think the real goal for someone who wants to be a true outlier academic in AI is 100K. If I'm succeeding already, I'm selling myself short if I don't continue to radically raise the bar, even if I'm not sure I'll follow this path to the end.

Let me know what you think of this. The portion this is missing, which is honestly something most writing will gloss over, is going deep on what it feels like to overcome adversity in the right way.