The Future of Voice AI
In the Future of Voice AI series of interviews, I ask three questions to my guests:

- What problems do you currently see in Enterprise Voice AI?
- How does your company solve these problems?
- What solutions do you envision in the next 5 years?

This episode’s guests are Klemen Simonic [https://www.linkedin.com/in/klemensimonic/], Co-Founder & CEO at Soniox [https://soniox.com/], and Kwindla Hultman Kramer [https://x.com/kwindla], Co-Founder & CEO at Daily [https://www.daily.co/].

Klemen Simonic is the CEO and Co-Founder of Soniox, where he leads the development of advanced voice AI models built for real-world performance. He brings over 16 years of experience across industry and academia, with a deep focus on artificial intelligence. He has worked on cutting-edge AI systems at Facebook, Google, Stanford University, and the University of Ljubljana. Klemen has been developing AI technologies since his undergraduate years, spanning speech, language, and large-scale knowledge systems.

Kwin is CEO and co-founder of Daily, a developer platform for real-time audio, video, and AI. He has been interested in large-scale networked systems and real-time video since his graduate student days at the MIT Media Lab. Before Daily, Kwin helped to found Oblong Industries, which built an operating system for spatial, multi-user, multi-screen, multi-device computing.

Recap Video

Takeaways

* Voice AI adoption is slow because real-time transcription still breaks on the most basic parts of a customer call.
* Real growth is happening quietly inside call centers, but teams won’t scale until transcription stops causing cascading errors.
* Even the top models fail on emails, addresses, and alphanumerics, which are the single points of failure in most B2B workflows.
* Consumer-grade demos hide the reality that long, multi-turn conversations still fall apart without rigorous context control.
* POC to production fails not because of LLMs, but because engineering teams underestimate context management.
* A universal multilingual model can outperform single-language models by transferring entity knowledge across languages.
* Mixed-language conversations are the norm worldwide, and current systems break the moment a user switches language.
* Latency, accuracy, and cost must be solved at the same time; optimizing only one kills the use case.
* Feeding both sides of the conversation into STT gives models more context and improves accuracy.
* Domain-specific accuracy matters far more than general accuracy, and most models still fail in specialized environments.
* Industry “context boosting” tricks are hacks that break at scale; native learned context inside STT is the only path forward.
* Punctuation and intonation directly shape LLM reasoning, and stripping them for speed creates silent failure modes.
* Voice AI is shifting from speech-to-text to full speech understanding, and models that don’t evolve won’t survive.
* The future points toward fused audio plus LLM architectures that remove the brittle STT handoff entirely.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit voice-ai-newsletter.krisp.ai [https://voice-ai-newsletter.krisp.ai?utm_medium=podcast&utm_campaign=CTA_1]