- Published on December 12, 2024
- In AI News
Maya’s open-source model and inclusive focus advance AI by addressing the need for understanding diverse languages and cultures.
Cohere for AI, a research initiative of Cohere, recently introduced Maya, an open-source multilingual multimodal model built to address gaps in vision-language models’ (VLMs) capabilities, particularly in low-resource languages.
The model aims to improve accessibility and cultural comprehension through better data quality and toxicity filtering. Both the model and its datasets are available on GitHub for further development.
“Current datasets often contain toxic and culturally insensitive content, perpetuating biases and stereotypes. To our knowledge, no peer-reviewed research has systematically addressed this,” the researchers stated in their paper on building a multilingual and culturally aware dataset.
In the context of the Maya model, “toxicity-free” means that harmful or offensive content has been removed from the training data.
The team created a pretraining dataset of 558,000 image-text pairs and expanded it to eight languages, including Arabic, Hindi, and Spanish. The dataset emphasises cultural diversity while mitigating toxicity using tools like Toxic-BERT and LLaVAGuard.
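To make the filtering idea concrete, here is a minimal Python sketch of threshold-based toxicity filtering over image-text pairs. It is not the researchers' pipeline: the `ImageTextPair` type, the `filter_pairs` function, the 0.5 threshold, and the keyword-based `toy_toxicity_score` are all hypothetical stand-ins. In practice, the scorer would be a learned classifier such as Toxic-BERT for captions, with a separate image safety check such as LLaVAGuard.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ImageTextPair:
    """One training example: an image reference plus its caption (hypothetical type)."""
    image_path: str
    caption: str
    language: str


# Hypothetical blocklist; a real pipeline would use a trained classifier instead.
BLOCKLIST = {"idiot", "stupid"}


def toy_toxicity_score(text: str) -> float:
    """Return 1.0 if a blocklisted word appears in the text, else 0.0 (illustrative only)."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & BLOCKLIST else 0.0


def filter_pairs(
    pairs: Iterable[ImageTextPair],
    score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[ImageTextPair], List[ImageTextPair]]:
    """Split pairs into (kept, dropped) according to the caption's toxicity score."""
    kept: List[ImageTextPair] = []
    dropped: List[ImageTextPair] = []
    for pair in pairs:
        if score(pair.caption) >= threshold:
            dropped.append(pair)
        else:
            kept.append(pair)
    return kept, dropped


pairs = [
    ImageTextPair("img_001.jpg", "A street market in Delhi.", "en"),
    ImageTextPair("img_002.jpg", "Look at this stupid sign!", "en"),
]
kept, dropped = filter_pairs(pairs, toy_toxicity_score)
print(len(kept), "kept;", len(dropped), "dropped")
```

Passing the scorer in as a function keeps the filtering loop independent of any particular model, so the toy scorer could be swapped for a real classifier without changing `filter_pairs`.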
Maya’s performance is notable in multilingual benchmarks. It outperforms existing models in certain tasks and languages, such as Arabic, while offering comparable performance to larger models like PALO-13B. The study also highlights Maya’s effectiveness in tasks like image captioning and visual question answering.
Future plans for Maya include expanding its dataset to include more languages like Bengali and Urdu and improving its instruction-tuning capabilities. Researchers also aim to refine the model’s adaptability for complex reasoning tasks.
Maya’s open-source approach and focus on inclusivity mark a step forward in AI, addressing a critical need for models that understand diverse languages and cultural contexts.
In August, Cohere launched Aya, a multilingual generative model that supported 101 languages, including Indian languages like Hindi and Marathi, with over 50% in lower-resourced categories. Aya outperformed mT0 and BLOOMZ across benchmarks while doubling language coverage. Developed collaboratively by 3,000 researchers in 119 countries, it was open-sourced to address AI dataset scarcity in vernacular languages.
Aditi Suresh
Aditi is a political science graduate, and is interested in technology, AI, social media, and online culture.