Real-world problems with STT | Klemen Simonic (Soniox) & Kwindla Kramer (Daily)

Description

In the Future of Voice AI series of interviews, I ask my guests three questions:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guests are Klemen Simonic [https://www.linkedin.com/in/klemensimonic/], Co-Founder & CEO at Soniox [https://soniox.com/], and Kwindla Hultman Kramer [https://x.com/kwindla], Co-Founder & CEO at Daily [https://www.daily.co/].

Klemen Simonic is the CEO and Co-Founder of Soniox, where he leads the development of advanced voice AI models built for real-world performance. He brings over 16 years of experience across industry and academia, with a deep focus on artificial intelligence. He has worked on cutting-edge AI systems at Facebook, Google, Stanford University, and the University of Ljubljana. Klemen has been developing AI technologies since his undergraduate years, spanning speech, language, and large-scale knowledge systems.

Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.

Takeaways

* Voice AI adoption is slow because real-time transcription still breaks on the most basic parts of a customer call.
* Real growth is happening quietly inside call centers, but teams won’t scale until transcription stops causing cascading errors.
* Even the top models fail on emails, addresses, and alphanumerics, which are the single points of failure in most B2B workflows.
* Consumer-grade demos hide the reality that long, multi-turn conversations still fall apart without rigorous context control.
* The move from POC to production fails not because of LLMs, but because engineering teams underestimate context management.
* A universal multilingual model can outperform single-language models by transferring entity knowledge across languages.
* Mixed-language conversations are the norm worldwide, and current systems break the moment a user switches language.
* Latency, accuracy, and cost must be solved at the same time; optimizing only one kills the use case.
* Feeding both sides of the conversation into STT gives models more context and improves accuracy.
* Domain-specific accuracy matters far more than general accuracy, and most models still fail in specialized environments.
* Industry “context boosting” tricks are hacks that break at scale; native learned context inside STT is the only path forward.
* Punctuation and intonation directly shape LLM reasoning, and stripping them for speed creates silent failure modes.
* Voice AI is shifting from speech-to-text to full speech understanding, and models that don’t evolve won’t survive.
* The future points toward fused audio plus LLM architectures that remove the brittle STT handoff entirely.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit voice-ai-newsletter.krisp.ai [https://voice-ai-newsletter.krisp.ai?utm_medium=podcast&utm_campaign=CTA_1]

Promptable Speech Language Models | Dylan Fox (Founder & CEO at AssemblyAI)

In the Future of Voice AI series of interviews, I ask my guests three questions:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guest is Dylan Fox [https://www.linkedin.com/in/dylanbfox/], Founder & CEO at AssemblyAI [https://www.assemblyai.com/].

Dylan started AssemblyAI [https://www.assemblyai.com/] in 2017, inspired by the potential of new voice-powered products like Amazon Alexa, as well as his experience working as a research engineer at Cisco on new AI products and features. He saw an opportunity to use new AI technology to make fundamental improvements in the way that computers can understand and extract value from voice data. AssemblyAI started in Y Combinator and has now grown into a Series C company with over $115 million in funding from notable investors like Accel, Insight Partners, and Smith Point Capital. Dylan lives in Brooklyn, NY.

AssemblyAI [https://www.assemblyai.com/] builds speech language models that serve as the foundational voice AI infrastructure for next-generation voice applications. Their models deliver industry-leading speech-to-text accuracy with superhuman speech understanding capabilities including speaker detection, summarization, PII redaction, and an LLM gateway, giving developers everything they need to build sophisticated voice AI products. Universal-3 Pro, the first speech language model optimized specifically for voice AI, goes further with advanced prompting capabilities that let developers customize model behavior for their exact use case. With both async and real-time streaming support, AssemblyAI integrates directly into voice agents, AI assistants, medical scribes, real-time call analysis systems, and more. Tens of thousands of developers rely on AssemblyAI's models to power voice AI applications used by millions of end users every day.

Takeaways

* Real-time is the new growth engine: in the last ~18–20 months, voice use cases crossed a reliability threshold where they actually work.
* The real barrier in real-time STT is not model quality, it’s running low-latency systems at massive scale without breaking.
* Voice AI is quietly expanding beyond agents into robotics, consumer hardware, ambient listening, and medical scribes, which widens the market fast.
* Streaming models will always be disadvantaged on “look-ahead,” so the core problem is making good calls with incomplete future context.
* The old quality-vs-speed tradeoff is shrinking because hardware and model optimizations are closing the gap between streaming and batch.
* The ‘98% accuracy’ claims are meaningless because benchmarks reward clean audio, not real phone chaos and edge cases.
* The industry needs hard voice evals where models look bad on purpose (WER ~50%) because that’s closer to real conditions.
* The bottleneck is not model quality, it’s operating low-latency voice systems at insane scale without falling over.
* Pricing is used as a growth lever: $0.21 per hour, prorated by the second, with automatic volume discounts.
* The “no reservations, no concurrency limits” promise is really a bet on infra superiority, not just model quality.
* Dylan’s open-source take is blunt: managing your own AI infra is a tax that slows shipping and kills competitiveness.
* Specialization beats multimodal generalists for reliability: a model trained 100% on STT tasks is less likely to go off the rails.
* Massive training data scale, not a sudden architecture breakthrough, is the main reason accuracy jumped in the last 2–3 years.
* Infrastructure is becoming the hidden moat: no rate limits and no concurrency negotiations remove a major bottleneck for teams shipping voice products.
* Real-world performance can move business metrics, like a 15–20% lift in voice agent booking conversions from better STT.
* Dylan’s adoption forecast is aggressive: we are at the start of a 100x curve, which means today’s usage is the floor, not the peak.

Feb 26, 2026 · 31 min
