xAI Launches Grok Speech APIs Undercutting Competitors by 60%
Zach Anderson
Apr 18, 2026 00:53
Elon Musk’s xAI releases Grok Speech to Text and Text to Speech APIs, with batch transcription priced at $0.10/hour, and claims the lowest error rates across enterprise transcription benchmarks.
Elon Musk’s xAI dropped two standalone audio APIs on April 17, positioning Grok’s speech technology as a direct competitor to ElevenLabs, Deepgram, and AssemblyAI at aggressive price points.
The Grok Speech to Text API costs $0.10 per hour of audio for batch processing and $0.20 per hour for real-time streaming. Text to Speech is priced at $4.20 per million characters. According to xAI, both run on the same speech stack used in Tesla vehicles and Starlink customer support.
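To put those rates in concrete terms, here is a minimal cost sketch in Python. The three constants are the prices quoted above. The function names are illustrative and not part of any xAI SDK, and the sketch assumes simple linear billing with no minimums, rounding, or volume tiers, which the article does not describe.

```python
# Rates as quoted in the article; they may change after launch.
STT_BATCH_PER_HOUR = 0.10       # USD per hour of audio, batch
STT_STREAMING_PER_HOUR = 0.20   # USD per hour of audio, real-time streaming
TTS_PER_MILLION_CHARS = 4.20    # USD per million characters synthesized


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated transcription cost in USD (hypothetical helper, linear billing assumed)."""
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated synthesis cost in USD (hypothetical helper, linear billing assumed)."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    # Example workload: 1,000 hours of call recordings and 2.5M characters of speech output.
    print(f"Batch STT:     ${stt_cost(1000):.2f}")
    print(f"Streaming STT: ${stt_cost(1000, streaming=True):.2f}")
    print(f"TTS:           ${tts_cost(2_500_000):.2f}")
```

Under these assumptions, streaming the same audio costs exactly twice as much as batch processing it, so workloads that can tolerate latency have a clear incentive to use batch.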
Benchmark Claims Worth Scrutinizing
On phone call entity recognition (names, account numbers, dates), xAI reports a 5.0% error rate for Grok STT versus 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. If those vendor-published figures hold up in production, Grok’s error rate would be less than half that of its nearest competitor.
The company demonstrated this with a tricky test case: transcribing Welsh and Irish names like “Anghared Llewelyn Bowen” and “Oisin MacGiolla Phadraig” alongside mortgage details. According to xAI, Grok transcribed the passage with zero errors, while competing models stumbled on the names and formatted dates inconsistently.
Video and podcast transcription shows tighter competition—Grok and ElevenLabs tied at 2.4% error rate, with Deepgram and AssemblyAI trailing slightly at 3.0% and 3.2% respectively.
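For readers who want to check such figures against their own audio, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below implements that standard definition in Python. xAI's exact scoring method (text normalization, entity-only scoring) is not described in the article, so results from this function are not directly comparable to the published numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the loan closes on march third"
    hypothesis = "the loan closes in march third"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because a single misspelled name counts as a full word error, short entity-heavy utterances can swing WER sharply, which is one reason entity benchmarks tend to show wider gaps than general transcription.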
Technical Features for Developers
Beyond raw transcription, xAI built in features that enterprise customers actually need: word-level timestamps, speaker diarization across multiple audio channels, and support for 25+ languages with seamless switching.
The Inverse Text Normalization feature automatically converts spoken numbers, dates, and currencies into proper formats. “Four one four five five five one two three four” becomes a phone number. “Six ninety-nine” becomes $6.99. Small detail, but it eliminates post-processing headaches.
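To illustrate the idea behind inverse text normalization, here is a deliberately tiny Python sketch that handles only the phone-number case. It is not how xAI implements the feature, which the article does not describe. The function name, the rule that a run of exactly ten spoken digits is a phone number, and the US-style output format are all assumptions made for illustration.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize_digit_runs(text: str) -> str:
    """Collapse runs of spoken digits; format 10-digit runs as US-style phone numbers."""
    output = []
    run = []  # original spoken words in the current run of digit words

    def flush() -> None:
        if len(run) == 1:
            # A lone digit word ("one of them") is left as spoken.
            output.append(run[0])
        elif run:
            digits = "".join(DIGIT_WORDS[word.lower()] for word in run)
            if len(digits) == 10:  # assumption: ten digits means a phone number
                output.append(f"({digits[:3]}) {digits[3:6]}-{digits[6:]}")
            else:
                output.append(digits)
        run.clear()

    for word in text.split():
        if word.lower() in DIGIT_WORDS:
            run.append(word)
        else:
            flush()
            output.append(word)
    flush()
    return " ".join(output)


if __name__ == "__main__":
    print(normalize_digit_runs("call four one four five five five one two three four today"))
```

A production system also has to resolve ambiguity from context, for example whether “six ninety-nine” is a price, a time, or a street number, which is where rule-based approaches like this one run out of road.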
Text to Speech includes inline tags for prosody control—whispers, laughs, sighs, emphasis, pacing adjustments. Developers can inject emotional nuance without wrestling with complex audio markup.
Strategic Context
This launch follows xAI’s acquisition of X Corp in March 2025 and comes as the company expands its infrastructure partnerships. Just two days before the API announcement, reports emerged that xAI plans to supply computing power to Cursor, the AI-powered coding startup.
The Colossus supercomputer, operational since December 2024, provides the backend muscle. xAI appears to be monetizing that capacity across multiple verticals—enterprise AI, developer tools, and now voice APIs.
For developers building voice agents or transcription tools, the pricing undercuts established players substantially. Whether Grok’s accuracy claims survive real-world deployment at scale remains the open question. The documentation and rate limits are available through xAI’s API console for those ready to test it.
