Modulate Launches Velma Transcribe: High-Performance Transcription For Real-World Conversations at 90% Lower Cost

TECHNOLOGY 18.03.2026

Modulate's ELM model architecture unlocks transcription for the masses, cutting costs by 10x while achieving industry-leading accuracy.

BOSTON, MA / ACCESS Newswire / March 18, 2026 / Modulate, the frontier conversational voice intelligence company, today announced Velma Transcribe, a speech-to-text API delivering high-accuracy, low-latency transcription at 90% lower cost per hour than other leading transcription providers. This significantly lower price point represents a fundamental shift in the economics of transcription. For a fraction of the cost, Modulate unlocks affordable speech-to-text transcription for every audio conversation in the world, empowering real-time voice agents, call center platforms, social apps, and more with industry-leading transcription tools at a global scale.

Built using Modulate's industry-leading Ensemble Listening Model (ELM) research, Velma Transcribe orchestrates an ensemble of specialized transcription models to improve accuracy, latency, and cost efficiency compared to any single model. In addition to the outstanding unit economics, Velma Transcribe achieves industry-leading results on widely used datasets, including Earnings-22 and the AMI Meeting Corpus. The result is a new standard for conversational audio transcription, combining strong accuracy on complex multi-speaker audio with dramatically improved unit economics for processing voice data at scale.

"Modulate is the world leader in using voice understanding AI, and our goal is to make the tools to understand audio available to anyone, at any scale," said Carter Huffman, CTO and Cofounder of Modulate. "Our full ensemble for conversation understanding, Velma, already outperforms LLMs in recognizing key behaviors, and now Velma Transcribe makes one of our core underlying capabilities available directly to developers who simply need accurate transcripts, not behavioral insights."

In addition, Velma Transcribe offers features built for enterprise use cases:

Emotion detection (20+ emotions)
Accent detection (20+ accents)
Multilingual (70+ languages)
PII redaction, diarization, streaming support, and more

Lower Transcription Costs by Up to 10x

Velma Transcribe reduces transcription costs to approximately $0.03 per hour of audio, roughly 90% lower than leading providers (about 86% to 93% lower against the listed prices below). These economics make it far more cost-effective for enterprise organizations to analyze and monetize their voice data.

$0.03 - Modulate Velma Transcribe
$0.40 - ElevenLabs Scribe v2
$0.31 - Deepgram Nova-3
$0.26 - Deepgram Nova-2
$0.21 - AssemblyAI Universal-3 Pro

*Based on publicly listed pricing as of March 18, 2026
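
As a rough illustration of the arithmetic behind these figures, the sketch below computes the fractional saving against each listed price and the cost of a given audio volume. The per-hour prices are taken from the list above; the function names and the 10,000-hour workload are hypothetical and chosen only for the example.

```python
# Per-hour prices (USD per hour of audio) as listed in the release.
PRICES = {
    "Modulate Velma Transcribe": 0.03,
    "ElevenLabs Scribe v2": 0.40,
    "Deepgram Nova-3": 0.31,
    "Deepgram Nova-2": 0.26,
    "AssemblyAI Universal-3 Pro": 0.21,
}

VELMA = "Modulate Velma Transcribe"


def savings_vs(baseline: str, candidate: str = VELMA) -> float:
    """Fractional cost reduction of `candidate` relative to `baseline`."""
    return 1.0 - PRICES[candidate] / PRICES[baseline]


def cost_for_hours(provider: str, hours: float) -> float:
    """Total cost in USD for `hours` of audio at the listed per-hour price."""
    return PRICES[provider] * hours


if __name__ == "__main__":
    hours = 10_000  # hypothetical monthly workload
    for name in PRICES:
        line = f"{name}: ${cost_for_hours(name, hours):,.2f}"
        if name != VELMA:
            line += f" (Velma saving: {savings_vs(name):.1%})"
        print(line)
```

With the listed figures, the saving ranges from roughly 86% (against AssemblyAI Universal-3 Pro) to 92.5% (against ElevenLabs Scribe v2).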

Compare the leading speech-to-text transcription companies on cost and accuracy at Speechtxt.com.

Top Marks for Conversational Audio Accuracy at Scale

Velma Transcribe is engineered for real-world conversations that challenge traditional systems, including overlapping speakers, interruptions, accents, and background noise. On the AMI Meeting Corpus dataset, a widely used benchmark for complex multi-speaker conversational audio, Velma avoids over 40% of the errors made by ElevenLabs and over 70% of the errors made by OpenAI GPT-4o-transcribe.
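
A claim of "avoiding" a share of another system's errors is usually read as a relative error reduction. The minimal sketch below shows that calculation, assuming the underlying metric is an error rate such as word error rate (the release does not name the metric). The function name and the sample rates are hypothetical and are not Modulate's published results.

```python
def relative_error_reduction(baseline_error: float, candidate_error: float) -> float:
    """Share of the baseline's errors that the candidate avoids.

    Both arguments are error rates on the same test set (for example,
    word error rate expressed as a fraction between 0 and 1).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - candidate_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical rates, for illustration only: a baseline at 20% and a
    # candidate at 12% on the same audio.
    reduction = relative_error_reduction(0.20, 0.12)
    print(f"Errors avoided: {reduction:.0%}")
```

Under this reading, "over 40% of the errors" means the candidate's error rate is below 60% of the baseline's on the same data.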

Huffman explains the top marks: "We've tuned Velma for conversational audio, including emotion and accent detection, leading to materially lower error rates on meeting and call data while delivering dramatic cost savings versus incumbent providers. That combination makes high-quality transcription practical at scale."

Built for Secure Enterprise Voice Production

Velma Transcribe includes all the capabilities developers expect and enterprise operations need, including:

Batch and streaming transcription endpoints with structured output and segment timestamps
Zero data stored, ensuring privacy-safe workflows
Sub-second streaming latency with partial transcripts for live applications and agent pipelines
Robust formatting optimized for conversational speech and long recordings
Broad language coverage across 70+ of the world's most commonly spoken languages
Personally Identifiable Information (PII) detection and redaction
Advanced transcription enrichments, including speaker diarization, emotion detection, and accent identification

Backed by Modulate's security practices and ISO 27001 certification, these capabilities allow developers to build secure, voice-enabled applications and help organizations extract insights from large volumes of conversational data.

Models that Listen and Understand

Velma Transcribe is part of Modulate's growing family of Velma 2.0 voice analytics models built to deliver a new, context-rich listening layer for AI systems. It represents the first step in Modulate's expanding developer API strategy, with additional capabilities planned across synthetic voice detection, emotion analysis, and deeper conversational intelligence. Together, these capabilities allow developers and enterprises to move beyond transcription to understand how conversations unfold, enabling applications such as fraud detection, customer sentiment analysis, compliance monitoring, and real-time decision support.

"The industry has spent years teaching AI how to generate and respond. The next frontier is teaching it how to listen," said Mike Pappas, CEO and Cofounder of Modulate. "Most systems today rely on transcription, reducing rich conversations to flat text and losing the signals humans naturally understand. Velma is the listening layer for AI, giving developers and enterprises the 'ears' needed to build voice-native applications that can capture the nuance and intent within spoken dialogue."

Availability and Pricing

Velma Transcribe is available today with batch and sub-second streaming transcription. Modulate pricing is usage-based and optimized for high-volume workloads: https://www.modulate.ai/pricing

About Modulate

Modulate is a voice intelligence company building AI models and APIs designed to understand real-world conversational audio at scale. Its technology combines speech recognition, acoustic analysis, and conversational context to deliver reliable, explainable, and cost-effective voice intelligence for developers and enterprises.

For more information or to get started, visit modulate.ai.

Media Contact

Megan Fasy
Grithaus Agency
(e) [email protected]
(m) +1 (617) 480-3674

###

SOURCE: Modulate

L.Rodriguez--TFWP