- Published on March 21, 2025
- In AI News
OpenAI has launched new speech-to-text and text-to-speech models in its API, providing developers with tools to build advanced voice agents. These models improve transcription accuracy and introduce customisation options for generated speech.
The new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, deliver lower word error rates and better language recognition than the Whisper models.
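For context, transcription requests to these models go through OpenAI's existing `/v1/audio/transcriptions` endpoint, with the new model name swapped in. The sketch below assembles such a request using only the standard library; the endpoint and field names follow OpenAI's public REST API, while the file path and helper name are illustrative, not from the article.

```python
# Minimal sketch: building (not sending) a transcription request for the
# new gpt-4o-transcribe model. The audio file is sent as multipart form
# data in a real call; here we only assemble the fields.

API_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_transcription_request(audio_path: str,
                                model: str = "gpt-4o-transcribe") -> dict:
    """Assemble the endpoint URL and form fields for a transcription call."""
    return {
        "url": API_URL,
        "fields": {"model": model},
        "file": audio_path,  # uploaded as multipart/form-data when sent
    }

req = build_transcription_request("support_call.wav")
print(req["fields"]["model"])  # → gpt-4o-transcribe
```

Sending the request requires an `Authorization: Bearer <OPENAI_API_KEY>` header; the response body carries the transcribed text.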
In its blog post, OpenAI said these advancements stem from reinforcement learning techniques and extensive training with diverse audio datasets. The models aim to improve transcription reliability in noisy environments, varying speech speeds, and different accents.
“Our latest speech-to-text models achieve lower word error rates across established benchmarks, reflecting improvements in transcription accuracy and language coverage,” OpenAI said.
Developers can now also control how the text-to-speech model speaks. The gpt-4o-mini-tts model allows developers to instruct the model to adopt different speaking styles, such as mimicking a customer service agent. This feature expands use cases in customer interactions and creative storytelling. However, OpenAI clarified that these models are limited to synthetic preset voices.
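The style control described above is exposed as an extra field alongside the usual text-to-speech parameters. This is a hedged sketch of the JSON body for OpenAI's `/v1/audio/speech` endpoint; the `instructions` field carries the speaking-style prompt, and the voice name and example text are illustrative.

```python
# Minimal sketch: building the JSON payload for a styled text-to-speech
# request with gpt-4o-mini-tts. Only preset synthetic voices are allowed.

API_URL = "https://api.openai.com/v1/audio/speech"

def build_speech_request(text: str,
                         instructions: str,
                         voice: str = "coral",
                         model: str = "gpt-4o-mini-tts") -> dict:
    """Assemble the endpoint URL and JSON body for a speech call."""
    return {
        "url": API_URL,
        "json": {
            "model": model,
            "voice": voice,               # must be one of the preset voices
            "input": text,                # the text to speak
            "instructions": instructions, # speaking-style control
        },
    }

req = build_speech_request(
    "Your order has shipped and should arrive on Friday.",
    "Speak like a calm, upbeat customer service agent.",
)
print(req["json"]["model"])  # → gpt-4o-mini-tts
```

In a real call the response is raw audio bytes (e.g. MP3), which the client writes to a file or streams to a player.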
The company credits improvements in its audio models to pretraining with authentic datasets, advanced distillation methodologies, and reinforcement learning. Distillation techniques have enabled smaller models to retain conversational quality while reducing computational costs.
The new models are available to all developers through OpenAI’s API. OpenAI has also integrated these models with its Agents SDK to simplify development. For real-time, low-latency speech-to-speech applications, OpenAI recommends using its Realtime API.
Looking ahead, OpenAI plans to enhance the intelligence and accuracy of its audio models and explore custom voice options. The company is also engaging with policymakers, researchers, and developers on the implications of synthetic voices. Moreover, OpenAI intends to expand into video, enabling multimodal agentic experiences.
Siddharth Jindal
Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.