Semi Doped
Gimlet Labs runs an inference cloud built on heterogeneous silicon. Their software traces a PyTorch workload, segments it into its component parts, and schedules each piece onto the best-suited hardware, connecting chips from different vendors on a single high-speed fabric.

In this interview, Gimlet co-founder Natalie Serrino and former Intel executive Beltir walk through the architecture (graph trace, optimal split points, lowering each segment to TensorRT on NVIDIA and equivalents elsewhere), the three customer segments they sell into (frontier labs, sovereign clouds, AI natives), and a concrete demo: on GPT-OSS 120B at 8K input / 1K output, running the speculative decoder on a d-Matrix Corsair card while NVIDIA B200s handle the verifier shifts the throughput-vs-interactivity Pareto frontier roughly 4× over GPU-only speculative decode.

The most surprising takeaway: most Neoclouds gave significant equity to a single silicon vendor in exchange for capacity. Hardware amortization is around 70% of their annual costs, and the equity terms prevent them from diversifying their silicon. So the only software innovation they can ship is disaggregation on top of one vendor's stack, never across vendors. Gimlet's two-track model (deploying orchestration software inside customer data centers, plus running their own Neocloud built on mixed silicon) is the answer to that constraint.

Read the full transcript on Chipstrat.
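For readers new to speculative decoding, the draft/verifier split in the demo can be illustrated with a toy greedy version. This is a minimal sketch under stated assumptions, not Gimlet's or d-Matrix's implementation: `draft_model`, `verifier_model`, `greedy_decode`, and `speculative_decode` are hypothetical names, both "models" are plain Python functions over integer tokens, and the verification loop stands in for what would be a single batched forward pass on real hardware.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def verifier_model(context: List[Token]) -> Token:
    # Hypothetical "large" model: counts upward modulo 10.
    return (context[-1] + 1) % 10


def draft_model(context: List[Token]) -> Token:
    # Hypothetical "small" model: agrees with the verifier most of the
    # time, but guesses wrong whenever the last token is 2, 5, or 8.
    if context[-1] % 3 == 2:
        return 0
    return (context[-1] + 1) % 10


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(
    draft: NextToken,
    verifier: NextToken,
    prompt: List[Token],
    max_new: int,
    k: int = 4,
) -> List[Token]:
    # Greedy speculative decoding: the draft proposes up to k tokens,
    # the verifier keeps the longest prefix it agrees with and supplies
    # its own token at the first disagreement.
    out = list(prompt)
    target_len = len(prompt) + max_new
    while len(out) < target_len:
        proposal: List[Token] = []
        for _ in range(min(k, target_len - len(out))):
            proposal.append(draft(out + proposal))
        # Stand-in for one batched verifier pass over all k positions.
        for i, tok in enumerate(proposal):
            expected = verifier(out + proposal[:i])
            if tok != expected:
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)
    return out


baseline = greedy_decode(verifier_model, [0], 8)
speculative = speculative_decode(draft_model, verifier_model, [0], 8)
```

Because every emitted token is either confirmed or supplied by the verifier, the output matches verifier-only greedy decoding; any gain comes from how many draft tokens are accepted per verifier pass. In the demo described above, the draft role runs on the Corsair card and the verifier role on the B200s.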
Chapters:
0:00 Intro and the chips no one's connected before
0:33 Inference cloud for agents
1:02 From Intel to Gimlet
2:14 The case for heterogeneous inference
4:03 Disaggregating inference by resource profile
6:24 Tracing PyTorch into a schedulable graph
8:08 Connecting chips never connected before
10:52 CPUs as the agentic workhorse
12:01 Tool calls in the same data center as the LLM
13:21 Latency vs throughput on a shared fabric
14:57 Three customer buckets
15:54 Sovereigns: make an API call, not a porting project
19:37 "Cracked software is the platform"
22:24 Why merchant silicon vendors need partners
25:18 Hyperscalers outsourcing CapEx, not just kernels
28:49 AI natives: latency budgets, not just price
32:06 The d-Matrix partnership
33:31 The Pareto frontier chart
35:56 Speculative decode on Corsair: 4× shift
37:27 4× faster, or 3× more customers?
41:22 Why most Neoclouds can't follow this model
42:34 Gimlet's two-track business model
44:30 CoreWeave vs Together vs Gimlet
45:15 Series A and hiring

Relevant reading:
The Information on Gimlet helping OpenAI optimize for Cerebras: https://www.theinformation.com/newsletters/ai-agenda/startup-helping-openai-optimize-ai-cerebras-chips
Sachin Katti and Zain Asgar coauthored research at Stanford: https://arxiv.org/abs/2507.19635

Follow Chipstrat:
Newsletter: https://www.chipstrat.com
X: https://x.com/chipstrat
30 episodes