- Last updated December 9, 2024
NVIDIA is at it again, this time with a segment-leading vision model that takes both images and videos as input

Recently, NVIDIA released a new family of open visual language models called NVILA, which focuses on optimised accuracy and efficiency. The model is said to reduce training costs by 4.5 times, fine-tuning memory usage by 3.4 times, and pre-filling and decoding latency by nearly 2 times. All of these numbers are in comparison with LLaVA OneVision, another large vision model built on the Qwen2 language model.
Going by the benchmark results, NVILA beats GPT-4o mini on a video benchmark and offers competitive performance against GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. It was also observed that NVILA performs better than most open models, scoring a marginal victory over Meta's Llama 3.2.
However, NVIDIA has not released the model on Hugging Face yet. "We will soon make our code and models available to facilitate reproducibility," said NVIDIA.
NVIDIA said that training a visual language model (VLM) is expensive, taking around 400 GPU days for a 7B model. It also stated that fine-tuning a VLM is 'memory-intensive', with a 7B VLM consuming over 64 GB of GPU memory.

Therefore, NVIDIA is using a technique called 'scale-then-compress', an approach that balances accuracy and efficiency in VLMs. Instead of shrinking photos and videos, NVILA scales them up, using high-resolution images and multiple frames from a video so that no details are lost.
The model then reduces the size of the inputs by squeezing the visual information into fewer tokens, grouping pixels together while retaining the important details.
“For example, doubling the resolution will double the number of visual tokens, which will increase both training and inference costs by more than 2×, as self-attention scales quadratically with the number of tokens. We can then cut this cost down by compressing spatial/temporal tokens,” said the authors in the paper detailing the model.
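To make that arithmetic concrete, here is a minimal sketch of the kind of spatial token compression described above. The 2x2 grouping factor, tensor shapes, and small projection layer are illustrative assumptions for this article, not NVIDIA's actual implementation.

```python
import torch
import torch.nn as nn

class SpatialTokenCompressor(nn.Module):
    """Illustrative 2x2 spatial token merge (assumed design, not NVILA's code).

    Groups each 2x2 neighbourhood of visual tokens into one token by
    concatenating their features, then projects back to the model width.
    This cuts the token count by 4x, so the quadratic self-attention cost
    over these tokens drops by roughly 16x.
    """

    def __init__(self, dim: int, group: int = 2):
        super().__init__()
        self.group = group
        self.proj = nn.Linear(dim * group * group, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h * w, dim) visual tokens laid out on an h x w grid
        b, _, d = tokens.shape
        g = self.group
        x = tokens.view(b, h, w, d)
        # Fold each g x g neighbourhood into the channel dimension.
        x = x.view(b, h // g, g, w // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // g) * (w // g), g * g * d)
        return self.proj(x)  # (batch, h * w / 4, dim)

# Example: a high-resolution image tokenised into a 32x32 grid (1,024 tokens)
# comes out as 256 tokens after one round of compression.
compressor = SpatialTokenCompressor(dim=1024)
visual_tokens = torch.randn(1, 32 * 32, 1024)
compressed = compressor(visual_tokens, h=32, w=32)
print(compressed.shape)  # torch.Size([1, 256, 1024])
```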
NVIDIA also showcased a few demos in which the model answered multiple queries about an image and a video. The outputs were compared with those of VILA 1.5, a model NVIDIA released earlier.


NVIDIA also detailed other techniques, like Dynamic-S2 for scaling, DeltaLoss-based dataset pruning, quantisation using FP8 precision, and more. The paper, available on arXiv, offers a breakdown of how these techniques aid the model. All of them were applied to an 8B-parameter model.
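As an illustration of one of these techniques, FP8 quantisation stores weights or activations in an 8-bit floating-point format to cut memory and bandwidth. The snippet below is a minimal sketch using PyTorch's float8_e4m3fn dtype purely for demonstration; the scaling scheme is an assumption and this is not NVIDIA's training recipe.

```python
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Cast a tensor to FP8 (e4m3) and back, simulating the precision loss."""
    # Scale into FP8's representable range (e4m3 max is 448) before casting.
    scale = x.abs().max().clamp(min=1e-12) / 448.0
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8.to(torch.float32) * scale

weights = torch.randn(4096, 4096)
quantised = fp8_roundtrip(weights)

# FP8 halves storage relative to FP16/BF16 and quarters it relative to FP32.
print("max abs error:", (weights - quantised).abs().max().item())
print("bytes (fp32 vs fp8):", weights.numel() * 4, "vs", weights.numel() * 1)
```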
Supreeth Koundinya
Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.
