Google Unveils PaliGemma 2 Vision-Language Models for Advanced Task Transfer

  • Last updated December 5, 2024
  • In AI News

These open-weight models facilitate fine-tuning across more than 30 transfer tasks, improving state-of-the-art results in fields such as molecular structure recognition, optical music score transcription, and table structure analysis.


Google has announced the launch of PaliGemma 2, a family of vision-language models (VLMs) based on the Gemma 2 architecture, building on its predecessor with broader task applicability.

The upgrade includes three model sizes (3B, 10B, and 28B) and three input resolutions (224×224, 448×448, and 896×896 pixels), designed to optimise transfer learning across diverse domains.

According to Google, the models were trained in three stages using Cloud TPU infrastructure to handle multimodal datasets spanning captioning, optical character recognition (OCR), and radiography report generation.

In their paper, the researchers explain, “We observed that increasing the image resolution and model size significantly impacts transfer performance, especially for document and visual-text recognition tasks.” The models achieved state-of-the-art accuracy on datasets such as HierText for OCR and GrandStaff for music score transcription.

The fine-tuning capabilities of PaliGemma 2 allow it to address applications beyond traditional benchmarks. The researchers noted that while increasing compute resources yields better results for most tasks, certain specialised applications benefit more from either higher resolution or larger model size, depending on task complexity.

PaliGemma 2 also emphasises accessibility, with models designed to operate on low-precision formats for on-device inference. Researchers highlight, “Quantization of models for CPU-only environments retains nearly equivalent quality, making it suitable for broader deployments.”
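The quantization the researchers describe maps floating-point weights to low-precision integers so the model can run efficiently on CPUs. As an illustrative sketch (not Google's actual pipeline), symmetric per-tensor int8 quantization of a weight matrix looks like this:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Toy "weight matrix" standing in for a model layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (0.5 * scale),
# which is why quality is largely retained after quantization.
max_err = np.abs(w - w_hat).max()
```

Each weight is stored in one byte instead of four, and the reconstruction error stays within half a quantization step, which is the intuition behind the "nearly equivalent quality" claim.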

Google DeepMind has introduced Genie 2, a large-scale foundation world model capable of generating diverse playable 3D environments. Genie 2 transforms a single image into interactive virtual worlds that can be explored by humans or AI using standard keyboard and mouse controls, facilitating the development of embodied AI agents.

Additionally, Google DeepMind has launched GenCast, an AI model that enhances weather predictions by providing faster and more accurate forecasts up to 15 days in advance, while also addressing uncertainties and risks.

Google has also unveiled its experimental AI model, Gemini-Exp-1121, positioned as a competitor to OpenAI’s GPT-4o. The company is gearing up to release Google Gemini 2, which is expected to compete with OpenAI’s forthcoming model, o1.



Siddharth Jindal

Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.
