AI4Bharat Introduces BhasaAnuvaad, Speech Translation Dataset of 13 Indian Languages with 44,400 Hours of Data

  • Last updated November 13, 2024

AI4Bharat has launched BhasaAnuvaad, a speech translation dataset for Indian languages that covers 13 languages and approximately 44,400 hours of audio. It is the largest publicly accessible speech translation resource of its kind for Indian languages.

A GitHub repository accompanies the release.

BhasaAnuvaad aims to address gaps in existing speech translation benchmarks that often lack sufficient resources for Indian languages and struggle with India-specific challenges such as code-switching and dialect variation.

Recognising these needs, AI4Bharat also developed Indic-Spontaneous-Synth, a synthetic evaluation set. It highlights that current models, though effective on datasets like FLEURS, tend to underperform in realistic, spontaneous speech translation scenarios, underscoring the need for more robust datasets.

The dataset spans a broad spectrum of India’s linguistic landscape, covering Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Odia, Punjabi, Urdu, Assamese, and Nepali. The data originates from three key sources: existing public resources, large-scale web scraping, and synthetic data generation.

AI4Bharat’s roadmap includes a future release of a human-edited version of Indic-Spontaneous-Synth, as well as plans to expand BhasaAnuvaad with more data and to develop a dedicated speech translation model for Indian languages.

Recently, AI4Bharat, in partnership with IBM Research India, introduced MILU (Multi-task Indic Language Understanding Benchmark), an extensive new evaluation benchmark for Indic languages. 

This benchmark, developed under The AI Alliance, includes 85,000 multiple-choice questions across 11 Indian languages, covering eight diverse domains and over 40 subjects with an India-centric focus on both general and cultural knowledge.

Mohit Pandey

Mohit writes about AI in simple, explainable, and sometimes funny words. He has a keen interest in discussing AI with the people building it for India, and for Bharat, while also talking a little about AGI.


© Analytics India Magazine Pvt Ltd & AIM Media House LLC 2024
