Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Audio Transcription

Last week, Microsoft AI announced MAI-Transcribe-1.5, the second iteration of the company’s in-house speech-to-text family. The model targets accuracy across 43 languages, varied accents, and noisy environments. Microsoft positions it for production transcription workloads.

What is MAI-Transcribe-1.5

MAI-Transcribe-1.5 is an automatic speech recognition (ASR) model. It takes audio as input and returns text. Microsoft built it in-house, not on a third-party base. The model handles 43 languages with a single system. It is optimized for diverse accents, dialects, and real-world acoustic conditions.

Microsoft is integrating it into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre. It is also available in Foundry, Microsoft’s model platform.

The Accuracy Case

Accuracy here is measured by word error rate (WER): the number of words substituted, dropped, or inserted, divided by the number of words in a reference transcript. Lower WER means fewer mistakes. Microsoft reports best-in-class WER across 43 languages on FLEURS, a standard multilingual transcription benchmark.
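To make the metric concrete, here is a minimal sketch of how WER is commonly computed: a word-level edit distance divided by the reference length. The function name and the whitespace tokenization are illustrative assumptions; benchmark harnesses such as FLEURS evaluations typically normalize text (casing, punctuation) before scoring, and this sketch does not.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative only: splits on whitespace and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{example:.3f}")  # 0.167
```

By this definition, a 2.4% WER corresponds to roughly 24 word errors per 1,000 reference words.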

On the Artificial Analysis leaderboard, the model posts a WER of 2.4%. That places it third on a competitive open benchmark. So the picture is split: Microsoft claims first place on FLEURS, while the model ranks third on Artificial Analysis.

The language expansion is the other accuracy story. Coverage grew from 25 languages to 43. The 18 new languages were added without compromising accuracy. Ten of them are South Asian, including Bengali, Tamil, and Telugu. Eight are European, such as Ukrainian, Greek, and Catalan.

Speed

MAI-Transcribe-1.5 leads on accuracy-times-speed on the Artificial Analysis leaderboard. It runs up to 5x faster than models of comparable accuracy. The effect is largest on long audio files. The model can transcribe an hour of audio in under 15 seconds.

Microsoft cites up to 5x speedups over Gemini 3.1, Scribe v2, and GPT-4o-Transcribe on long audio. Against the prior MAI-Transcribe-1, the Azure card lists up to 5.7x faster long-form inference. For batch pipelines processing large archives, that speed difference compounds quickly.
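The hour-in-under-15-seconds figure implies throughput of at least 240 times real time (3,600 s / 15 s). The sketch below works through that arithmetic and extends it to a batch archive. The function names, the 10,000-hour archive, and the assumption of strictly sequential processing at a constant rate are all hypothetical, chosen only to show how the gap scales; real throughput depends on deployment, concurrency, and audio length.

```python
def speed_vs_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def archive_processing_hours(archive_audio_hours: float,
                             seconds_per_audio_hour: float = 15.0) -> float:
    """Wall-clock hours to transcribe an archive one file at a time.

    Uses 15 s per hour of audio, the upper bound quoted in the article,
    and assumes (hypothetically) no parallelism and a constant rate.
    """
    return archive_audio_hours * seconds_per_audio_hour / 3600.0


# One hour of audio in 15 seconds of processing.
print(speed_vs_real_time(3600, 15))  # 240.0

# Hypothetical 10,000-hour archive at the quoted rate, and at a rate
# five times slower (75 s per audio hour) for comparison.
fast = archive_processing_hours(10_000)
slow = archive_processing_hours(10_000, seconds_per_audio_hour=75.0)
print(f"{fast:.1f} h vs {slow:.1f} h")  # 41.7 h vs 208.3 h
```

Under these assumptions, a 5x speed difference turns a job of under two days into one of more than a week.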

Keyword (Entity) Biasing: The Feature Worth Understanding

Generic transcribers often fail on domain-specific words. These include the names of people and products, medical terms, and internal acronyms. Those words frequently matter most to enterprise users.

MAI-Transcribe-1.5 adds keyword biasing, also called entity biasing. You supply a list of domain-specific keywords. The Azure card supports up to 200 keywords. The model biases its predictions toward that list. Critically, it does not blindly force matches. It uses shared context to decide when biasing should apply. Microsoft reports a 30% WER reduction on FLEURS when biasing is used.

A short example shows the effect. Without biasing, names render as “Sean,” “Oif,” and “Societal.” With a supplied name list, the model recovers “Shaun,” “Aoife,” and “Xochitl.” This is relevant for meetings, healthcare, and call centers with niche vocabulary.
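Microsoft has not published how the biasing works internally, so the following is only a toy illustration of one generic technique, n-best rescoring, in which candidate transcripts that contain a supplied keyword receive a score bonus. The function name, the candidate transcripts, the scores, and the bonus value are all invented for illustration; only the 200-keyword limit comes from the Azure card. The point is to show how biasing can tip a close decision without forcing a match when the evidence strongly favors another reading.

```python
MAX_KEYWORDS = 200  # limit cited from the Azure model card


def pick_with_keyword_bias(hypotheses, keywords, bonus=1.0):
    """Pick the best candidate transcript after adding a per-keyword bonus.

    hypotheses: list of (text, score) pairs; higher score = more likely.
    keywords:   domain terms to favor (case-insensitive, whole-word match).
    bonus:      hypothetical score boost per keyword occurrence.
    """
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError(f"at most {MAX_KEYWORDS} keywords are supported")
    keyset = {k.lower() for k in keywords}

    def biased_score(item):
        text, score = item
        hits = sum(1 for w in text.lower().split() if w.strip(".,!?") in keyset)
        return score + bonus * hits

    return max(hypotheses, key=biased_score)[0]


names = ["Shaun", "Aoife", "Xochitl"]

# Close call: the keyword bonus tips the decision toward the listed names.
close = [("thanks sean and oif", -4.0), ("thanks shaun and aoife", -4.5)]
print(pick_with_keyword_bias(close, []))     # thanks sean and oif
print(pick_with_keyword_bias(close, names))  # thanks shaun and aoife

# Clear evidence for an ordinary word: the bonus is too small to override it.
clear = [("the societal impact is large", -3.0),
         ("the xochitl impact is large", -9.0)]
print(pick_with_keyword_bias(clear, names))  # the societal impact is large
```

A production system would likely apply biasing during decoding rather than after it, and would weigh context in richer ways than a fixed bonus.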

Use Cases

The Azure model card lists concrete production scenarios. Each maps to a common engineering workload:

Video captions for media and content platforms.

Accessibility tools that depend on accurate captions.

Meeting transcription for Teams-style collaboration tools.

Call analysis for contact centers and support analytics.

Content creation workflows that need fast draft transcripts.

Voice agents that convert speech to text before reasoning.

Automatic language identification helps when the input language is unknown. The model detects the spoken language without a manual setting.

MAI-Transcribe-1.5 vs MAI-Transcribe-1

The table below compares the two generations using stated facts only.

| Attribute | MAI-Transcribe-1 | MAI-Transcribe-1.5 |
| --- | --- | --- |
| Languages covered | 25 | 43 |
| Keyword/entity biasing | Not listed | Up to 200 keywords |
| Long-form inference speed | Baseline | Up to 5.7x faster |
| Artificial Analysis WER | Not specified | 2.4% (ranked #3) |
| FLEURS position (per Microsoft) | State-of-the-art | Best-in-class across 43 languages |
| Automatic language identification | Not specified | Yes |
| Lifecycle | Prior release | Generally available (GA) |
| Input / Output | Audio / Text | Audio / Text |

Strengths and Limitations

Strengths:

43-language coverage from a single model, up from 25.

Keyword/entity biasing yields up to 30% WER reduction on FLEURS.

Sub-15-second transcription for an hour of audio.

Generally available now through Azure AI Foundry.

Robust on noisy, real-world audio, per Microsoft.

Limitations:

No diarization yet, so speaker labels are unavailable.

No native streaming API, so real-time use is limited.

Several accuracy, speed, and cost claims are first-party.

Ranked third on Artificial Analysis, behind two competitors.
