- Last updated December 9, 2024
NVIDIA is at it again, this time with a segment-leading vision model that takes both images and videos as input

Recently, NVIDIA released a new family of open visual language models called NVILA, which focuses on optimised accuracy and efficiency. The model is said to reduce training costs by 4.5 times, fine-tuning memory usage by 3.4 times, and pre-filling and decoding latency by nearly 2 times. All of these numbers are in comparison with LLaVA OneVision, another large vision model built on the Qwen2 language model.
Going by the benchmark results, NVILA beats GPT-4o mini on a video benchmark and offers competitive performance against GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. It was also observed that NVILA performs better than most open models, scoring a marginal victory over Meta's Llama 3.2.
However, NVIDIA has not released the model on Hugging Face yet. "We will soon make our code and models available to facilitate reproducibility," said NVIDIA.
NVIDIA said that training a visual language model (VLM) is expensive, taking around 400 GPU days for a 7B model. It also stated that fine-tuning a VLM is 'memory-intensive', with a 7B VLM consuming over 64 GB of GPU memory.

Therefore, NVIDIA is using a technique called 'scale-then-compress', an approach that balances accuracy and efficiency in VLMs. Instead of shrinking photos and videos, NVILA scales them up, using high-resolution images and multiple frames from a video so that no details are lost.
The model then reduces the size of the inputs by squeezing the visual information into fewer tokens, grouping pixels together while retaining the important details.
“For example, doubling the resolution will double the number of visual tokens, which will increase both training and inference costs by more than 2×, as self-attention scales quadratically with the number of tokens. We can then cut this cost down by compressing spatial/temporal tokens,” said the authors in the paper detailing the model.
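To make that arithmetic concrete, here is a minimal sketch of the kind of spatial token compression described above. The 2x2 grouping factor, tensor shapes, and small projection layer are illustrative assumptions for this article, not NVIDIA's actual implementation.

```python
import torch
import torch.nn as nn

class SpatialTokenCompressor(nn.Module):
    """Illustrative 2x2 spatial token merge (assumed design, not NVILA's code).

    Groups each 2x2 neighbourhood of visual tokens into one token by
    concatenating their features, then projects back to the model width.
    This cuts the token count by 4x, so the quadratic self-attention cost
    over these tokens drops by roughly 16x.
    """

    def __init__(self, dim: int, group: int = 2):
        super().__init__()
        self.group = group
        self.proj = nn.Linear(dim * group * group, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h * w, dim) visual tokens laid out on an h x w grid
        b, _, d = tokens.shape
        g = self.group
        x = tokens.view(b, h, w, d)
        # Fold each g x g neighbourhood into the channel dimension.
        x = x.view(b, h // g, g, w // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // g) * (w // g), g * g * d)
        return self.proj(x)  # (batch, h * w / 4, dim)

# Example: a high-resolution image tokenised into a 32x32 grid (1,024 tokens)
# comes out as 256 tokens after one round of compression.
compressor = SpatialTokenCompressor(dim=1024)
visual_tokens = torch.randn(1, 32 * 32, 1024)
compressed = compressor(visual_tokens, h=32, w=32)
print(compressed.shape)  # torch.Size([1, 256, 1024])
```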
NVIDIA also showcased a few demos in which the model answered multiple queries about an image and a video. The outputs were compared with those of VILA 1.5, a model NVIDIA released earlier.


NVIDIA also detailed other techniques, like Dynamic-S2 for scaling, DeltaLoss-based dataset pruning, quantisation using FP8 precision, and more. The paper, available on arXiv, offers a breakdown of how these techniques aid the model. All of them were applied to an 8B-parameter model.
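As an illustration of one of these techniques, FP8 quantisation stores weights or activations in an 8-bit floating-point format to cut memory and bandwidth. The snippet below is a minimal sketch using PyTorch's float8_e4m3fn dtype purely for demonstration; the scaling scheme is an assumption and this is not NVIDIA's training recipe.

```python
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Cast a tensor to FP8 (e4m3) and back, simulating the precision loss."""
    # Scale into FP8's representable range (e4m3 max is 448) before casting.
    scale = x.abs().max().clamp(min=1e-12) / 448.0
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8.to(torch.float32) * scale

weights = torch.randn(4096, 4096)
quantised = fp8_roundtrip(weights)

# FP8 halves storage relative to FP16/BF16 and quarters it relative to FP32.
print("max abs error:", (weights - quantised).abs().max().item())
print("bytes (fp32 vs fp8):", weights.numel() * 4, "vs", weights.numel() * 1)
```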
Supreeth Koundinya
Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.
