- Published on March 20, 2025
A new research paper introduces AudioX, a diffusion transformer model that generates audio and music from text, video, image, music, and audio inputs.

Researchers from the Hong Kong University of Science and Technology and Moonshot AI have teased a new AI model called AudioX that generates audio and music from multimodal inputs.
AudioX is described as a unified model offering flexible natural language control and seamless processing of inputs that include text, video, image, music, and audio. This differs from the standard domain-specific models that typically focus on a single modality or a limited set of input conditions.

The research paper lists use cases such as text-to-audio, video-to-audio, and combined text-and-video-to-audio generation. Notably, the model can also refine existing audio through a text prompt, improve unprocessed music, and generate music from scratch.
Netizens seem excited about the demo shared on the model’s GitHub repo, highlighting interesting use cases like generating audio for a tennis video:
“AudioX: Anything-to-Audio Generation. Mindblowing, I could not believe that tennis example it was just too good.” pic.twitter.com/EA8clWlqmF
The researchers mentioned that they aim to address the scarcity of high-quality multi-modal data, which has been a major bottleneck in the development of versatile audio generation systems. To tackle this, they curated two comprehensive datasets: vggsound-caps, with 190K audio captions based on the VGGSound dataset, and V2M-caps, with 6 million music captions derived from the V2M dataset.
“Extensive experimental results show that AudioX not only excels in intra-modal tasks but also significantly improves inter-modal performance, highlighting its potential to advance the field of multi-modal audio generation,” the research paper stated.
Currently, the code for the model is not available. The researchers mentioned it would be available on the GitHub page without specifying a timeframe or licence details.
Various text-to-music models and some text-to-speech models are already available and have seen creative uses in the AI space. It remains to be seen what further possibilities AudioX opens up.
Ankush Das
I am a tech aficionado and a computer science graduate with a keen interest in AI, Open Source, and Cybersecurity.