Cohere’s Research Lab introduces Maya to Bridge Language Gaps with Multilingual AI

Published on December 12, 2024

Maya’s open-source model and inclusive focus advance AI by addressing the need for understanding diverse languages and cultures.

Cohere for AI, a research initiative of Cohere, recently introduced Maya, an open-source multilingual multimodal model built to address gaps in the capabilities of vision-language models (VLMs), particularly in low-resource languages.

The model improves accessibility and cultural comprehension through better data quality and toxicity filtering. Both the model and its datasets are available on GitHub for further development.

“Current datasets often contain toxic and culturally insensitive content, perpetuating biases and stereotypes. To our knowledge, no peer-reviewed research has systematically addressed this,” the researchers stated in their paper on building a multilingual and culturally aware dataset.

In the context of the Maya model, “toxicity-free” means removing harmful or offensive content from the training data.

The team created a pretraining dataset of 558,000 image-text pairs and expanded it to eight languages, including Arabic, Hindi, and Spanish. This dataset emphasises cultural diversity while mitigating toxicity using tools like Toxic-BERT and LLaVAGuard.
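As a rough illustration of what caption-level toxicity filtering involves, the Python sketch below drops image-text pairs whose caption scores above a cut-off. It is not the researchers' pipeline: the `ImageTextPair` type, the `filter_toxic_pairs` helper, the 0.5 threshold and the keyword-based `toy_toxicity_score` are all assumptions made for this example. In practice the scorer would be a trained classifier such as Toxic-BERT, and images would be screened separately with a tool like LLaVAGuard.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ImageTextPair:
    """Hypothetical record type: one image and its caption."""
    image_path: str
    caption: str


def filter_toxic_pairs(
    pairs: Iterable[ImageTextPair],
    score_toxicity: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[ImageTextPair]:
    """Keep only pairs whose caption scores strictly below the threshold."""
    return [p for p in pairs if score_toxicity(p.caption) < threshold]


# Stand-in scorer so the sketch runs without downloading a model.
# A real pipeline would call a text classifier here instead.
_PLACEHOLDER_BLOCKLIST = {"badword"}


def toy_toxicity_score(caption: str) -> float:
    """Return 1.0 if any placeholder blocklisted word appears, else 0.0."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return 1.0 if words & _PLACEHOLDER_BLOCKLIST else 0.0


pairs = [
    ImageTextPair("img_001.jpg", "A child flying a kite on the beach."),
    ImageTextPair("img_002.jpg", "Some badword scrawled on a wall."),
]
clean_pairs = filter_toxic_pairs(pairs, toy_toxicity_score)
print([p.image_path for p in clean_pairs])
```

Passing the scorer in as a function keeps the filtering logic independent of any particular classifier, so the same loop could be reused for each language in the dataset.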

On multilingual benchmarks, Maya outperforms existing models in certain tasks and languages, such as Arabic, while performing comparably to larger models like PALO-13B. The study also highlights Maya’s effectiveness in tasks like image captioning and visual question answering.

Future plans for Maya include expanding its dataset to cover more languages, such as Bengali and Urdu, and improving its instruction-tuning capabilities. Researchers also aim to refine the model’s adaptability for complex reasoning tasks.

Maya’s open-source approach and focus on inclusivity mark a step forward in AI, addressing a critical need for models that understand diverse languages and cultural contexts.

In August, Cohere launched Aya, a multilingual generative model that supports 101 languages, including Indian languages like Hindi and Marathi, with more than 50% of them in lower-resourced categories. Aya outperformed mT0 and BLOOMZ across benchmarks while doubling language coverage. Developed collaboratively by 3,000 researchers in 119 countries, it was open-sourced to address AI dataset scarcity in vernacular languages.