OpenAI Releases New Audio Models to Power Voice Agents

  • Published on March 21, 2025
  • In AI News

The company said these advancements stem from reinforcement learning techniques and extensive training with diverse audio datasets.


OpenAI has launched new speech-to-text and text-to-speech models in its API, providing developers with tools to build advanced voice agents. These models improve transcription accuracy and introduce customisation options for generated speech.

The new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, achieve lower word error rates and better language recognition than the earlier Whisper models.
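For developers who want to try the new transcription models, a call might look roughly like the sketch below. It assumes the official `openai` Python package is installed and an `OPENAI_API_KEY` environment variable is set. The helper names `build_transcription_request` and `transcribe` are hypothetical and used only for illustration; only the two model names come from the announcement.

```python
from pathlib import Path

# Model names as given in the announcement.
TRANSCRIBE_MODELS = ("gpt-4o-transcribe", "gpt-4o-mini-transcribe")


def build_transcription_request(model: str = "gpt-4o-transcribe") -> dict:
    """Return keyword arguments for a transcription call (hypothetical helper).

    The audio file itself is attached later, inside transcribe().
    """
    if model not in TRANSCRIBE_MODELS:
        raise ValueError(f"unexpected model: {model!r}")
    return {"model": model}


def transcribe(audio_path: str, model: str = "gpt-4o-transcribe") -> str:
    """Send one audio file for transcription and return the recognised text.

    Assumption: the `openai` package is installed and OPENAI_API_KEY is set.
    """
    # Imported lazily so the helper above can be used without the package.
    from openai import OpenAI

    client = OpenAI()
    with Path(audio_path).open("rb") as audio_file:
        result = client.audio.transcriptions.create(
            file=audio_file, **build_transcription_request(model)
        )
    return result.text
```

Calling `transcribe("meeting.mp3", model="gpt-4o-mini-transcribe")` would switch to the smaller model; the file name here is a placeholder.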

In its blog post, OpenAI said these advancements stem from reinforcement learning techniques and extensive training with diverse audio datasets. The models aim to improve transcription reliability in noisy environments, varying speech speeds, and different accents.

“Our latest speech-to-text models achieve lower word error rates across established benchmarks, reflecting improvements in transcription accuracy and language coverage,” OpenAI said.
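Word error rate (WER), the metric OpenAI cites, is conventionally the number of word-level substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not OpenAI's evaluation code. It only lowercases and splits on whitespace, whereas published benchmarks usually apply more careful text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, if the reference is "the cat sat on the mat" and the transcript drops one "the", there is one error against six reference words, so the WER is 1/6.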

Developers can now also control how the text-to-speech model speaks. The gpt-4o-mini-tts model allows developers to instruct the model to adopt different speaking styles, such as mimicking a customer service agent. This feature expands use cases in customer interactions and creative storytelling. However, OpenAI clarified that these models are limited to synthetic preset voices.
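A steerable text-to-speech request might be put together roughly as follows. This is a sketch under stated assumptions: the `openai` Python package is installed, `OPENAI_API_KEY` is set, and "coral" is one of the available preset voices. The helper names `build_speech_request` and `synthesize`, the output path and the example instruction are illustrative and do not come from the announcement.

```python
def build_speech_request(text: str, instructions: str, voice: str = "coral") -> dict:
    """Return keyword arguments for a speech request (hypothetical helper).

    `instructions` carries the speaking style, e.g. a customer service tone.
    The default voice is an assumption; any preset voice name could be used.
    """
    return {
        "model": "gpt-4o-mini-tts",  # model name from the announcement
        "voice": voice,
        "input": text,
        "instructions": instructions,
    }


def synthesize(text: str, instructions: str, out_path: str = "speech.mp3") -> None:
    """Generate speech and write it to out_path.

    Assumption: the `openai` package is installed and OPENAI_API_KEY is set.
    """
    # Imported lazily so the helper above can be used without the package.
    from openai import OpenAI

    client = OpenAI()
    request = build_speech_request(text, instructions)
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(out_path)
```

A call such as `synthesize("Your order has shipped.", "Speak like a calm, friendly customer service agent.")` would apply the style instruction to one of the preset voices.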

The company credits improvements in its audio models to pretraining with authentic datasets, advanced distillation methodologies, and reinforcement learning. Distillation techniques have enabled smaller models to retain conversational quality while reducing computational costs.

The new models are available to all developers through OpenAI’s API. OpenAI has also integrated these models with its Agents SDK to simplify development. For real-time, low-latency speech-to-speech applications, OpenAI recommends using its Realtime API.

Looking ahead, OpenAI plans to enhance the intelligence and accuracy of its audio models and explore custom voice options. The company is also engaging with policymakers, researchers, and developers on the implications of synthetic voices. Moreover, OpenAI intends to expand into video, enabling multimodal agentic experiences.


Siddharth Jindal

Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.
