Cohere’s Research Lab Introduces Maya to Bridge Language Gaps with Multilingual AI

  • Published on December 12, 2024
  • In AI News

Maya’s open-source model and inclusive focus advance AI by addressing the need for understanding diverse languages and cultures.

Cohere for AI, a research initiative of Cohere, recently introduced Maya, an open-source multilingual multimodal model built to address gaps in the capabilities of vision-language models (VLMs), particularly in low-resource languages.

The model aims to improve accessibility and cultural comprehension through better data quality and toxicity filtering. Maya and its datasets are available on GitHub for further development.

“Current datasets often contain toxic and culturally insensitive content, perpetuating biases and stereotypes. To our knowledge, no peer-reviewed research has systematically addressed this,” the researchers stated in the paper on building a multilingual and culturally aware dataset.

In the context of the Maya model, “toxicity-free” means removing harmful or offensive content from the training data.

The team created a pretraining dataset of 558,000 image-text pairs and expanded it to eight languages, including Arabic, Hindi, and Spanish. The dataset emphasises cultural diversity while mitigating toxicity using tools like Toxic-BERT and LLaVAGuard.
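To make the idea of toxicity filtering concrete, here is a minimal sketch of how flagged captions might be dropped from an image-text dataset. It is not the researchers' pipeline: the `ImageTextPair` type, the `filter_toxic_pairs` function, the 0.5 threshold, and the keyword-based `keyword_scorer` are all hypothetical stand-ins chosen so the example runs on its own. A real pipeline would replace the scorer with a trained classifier such as Toxic-BERT and would also screen the images themselves.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ImageTextPair:
    """Hypothetical record for one image-caption pair."""
    image_id: str
    caption: str
    language: str


def filter_toxic_pairs(
    pairs: Iterable[ImageTextPair],
    score_toxicity: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[ImageTextPair]:
    """Keep only pairs whose caption scores below the toxicity threshold."""
    return [p for p in pairs if score_toxicity(p.caption) < threshold]


# Stand-in scorer: a tiny blocklist instead of a trained model.
BLOCKLIST = {"idiot", "stupid"}


def keyword_scorer(text: str) -> float:
    """Return 1.0 if any blocklisted word appears in the text, else 0.0."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & BLOCKLIST else 0.0


sample = [
    ImageTextPair("img-1", "A child flying a kite on the beach.", "en"),
    ImageTextPair("img-2", "Look at this stupid driver!", "en"),
    ImageTextPair("img-3", "Un mercado lleno de frutas.", "es"),
]

clean = filter_toxic_pairs(sample, keyword_scorer)
print([p.image_id for p in clean])
```

Passing the scorer in as a function keeps the filtering logic independent of any particular toxicity model, so the same loop works whether the scores come from a blocklist or a classifier.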

On multilingual benchmarks, Maya outperforms existing models on certain tasks and in certain languages, such as Arabic, while offering performance comparable to larger models like PALO-13B. The study also highlights Maya’s effectiveness in tasks like image captioning and visual question answering.

Future plans for Maya include expanding its dataset to cover more languages, such as Bengali and Urdu, and improving its instruction-tuning capabilities. Researchers also aim to refine the model’s adaptability for complex reasoning tasks.

Maya’s open-source approach and focus on inclusivity mark a step forward in AI, addressing a critical need for models that understand diverse languages and cultural contexts. 

In August, Cohere launched Aya, a multilingual generative model that supports 101 languages, including Indian languages like Hindi and Marathi, with over 50% of them classed as lower-resourced. Aya outperformed mT0 and BLOOMZ across benchmarks while doubling language coverage. Developed collaboratively by 3,000 researchers in 119 countries, it was open-sourced to address AI dataset scarcity in vernacular languages.

Aditi Suresh

Aditi is a political science graduate interested in technology, AI, social media, and online culture.
