AI has moved far beyond traditional models built on a single data source to systems that draw on multiple sources across different modalities. As noted by experts like Anindya Sengupta and Abhijit Guha from Fractal, the development of multimodal AI—systems that can process and integrate data from multiple modalities like text, images, audio, and video—has brought AI closer to mimicking human capabilities.
For many years, AI was largely limited to processing structured data. Guha pointed out that about five years ago, AI primarily relied on structured sources, often requiring extensive “data munging” before the data could be fed into models. Since then, AI has evolved, and so has its ability to process and interpret more complex forms of unstructured data, such as human language, images, and video.
Multimodal AI, as Sengupta explained, “works with text, image, and video”. The concept involves taking various types of inputs from different faculties (e.g., vision, hearing, and speech) and combining them into a unified model that can mimic human-like processing and decision-making.
Guha elaborated on this integration: “Multimodal AI is moving closer to how humans experience the world by integrating inputs from multiple sources—text, video, images—and processing them together.”
The Mechanics of Multimodal AI
While the underlying philosophy of AI—rooted in machine learning and number crunching—remains intact, multimodal AI introduces new challenges and techniques for integrating different data types. Guha explained that multimodal AI converts all inputs—text, image, or video—into numbers, which are then processed using machine learning algorithms.
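To make this concrete, here is a minimal, illustrative sketch (not Fractal's implementation) of what "converting inputs into numbers" can look like in practice. The tiny vocabulary, the bag-of-words encoding, and the pixel summary statistics are all simplifications standing in for the embeddings a real text or vision model would produce.

```python
import numpy as np

# Text: a toy bag-of-words encoding over a tiny fixed vocabulary.
vocab = ["claim", "damage", "vehicle", "injury"]
def text_to_vector(text: str) -> np.ndarray:
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in vocab], dtype=float)

# Image: in practice a CNN or vision transformer produces an embedding;
# here simple summary statistics of pixel values stand in for that.
def image_to_vector(pixels: np.ndarray) -> np.ndarray:
    return np.array([pixels.mean(), pixels.std(), pixels.max(), pixels.min()])

text_vec = text_to_vector("vehicle damage reported on the claim")
image_vec = image_to_vector(np.random.rand(64, 64))  # stand-in for a real photo

print(text_vec)   # word counts over the toy vocabulary
print(image_vec)  # four summary numbers describing the "image"
```

Once every modality is expressed as a numeric vector like this, the question becomes how and when those vectors are combined, which is where the fusion techniques below come in.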
He described three fusion techniques commonly used in multimodal AI:
- Early Fusion – This method involves merging different data sources at the initial stage before applying algorithms, allowing the model to process the data as a whole.
- Late Fusion – Here, data from various sources is processed independently, with the results being combined only at the end.
- Hybrid Fusion – This approach combines aspects of both early and late fusion, depending on the specific requirements of the data.
Each technique has its strengths, and the choice between them depends on the nature of the data and the specific task at hand.
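The following sketch contrasts early and late fusion using scikit-learn. It is a toy example under stated assumptions: the feature matrices are synthetic stand-ins for text and image embeddings, and the labels are random rather than real claim outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
text_features = rng.normal(size=(n, 8))    # e.g. embeddings of adjuster notes
image_features = rng.normal(size=(n, 16))  # e.g. embeddings of claim photos
labels = rng.integers(0, 2, size=n)        # 1 = fraudulent, 0 = genuine (synthetic)

# Early fusion: concatenate the modalities first, then train one model
# on the combined feature matrix.
early_X = np.concatenate([text_features, image_features], axis=1)
early_model = LogisticRegression(max_iter=1000).fit(early_X, labels)

# Late fusion: train one model per modality, then combine their outputs,
# here by averaging the predicted probabilities.
text_model = LogisticRegression(max_iter=1000).fit(text_features, labels)
image_model = LogisticRegression(max_iter=1000).fit(image_features, labels)
late_scores = (text_model.predict_proba(text_features)[:, 1]
               + image_model.predict_proba(image_features)[:, 1]) / 2

# Hybrid fusion mixes the two: some modalities are fused early,
# while others are folded in only at the decision stage.
```

In practice the trade-off is that early fusion lets the model learn interactions between modalities, while late fusion keeps each modality's pipeline independent and easier to maintain or swap out.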
Real-World Applications of Multimodal AI
The insurance industry offers a compelling example of multimodal AI’s capabilities. Sengupta recounted a project where AI was used to assess whether insurance claims were fraudulent. Previously, models considered only structured data like claim history and customer information. However, by integrating unstructured data, such as handwritten notes from claim adjusters, the accuracy of the model improved dramatically.
“With structured data alone, we saw the accuracy score, or KS, hovering around 50-56,” Sengupta explained. “But when we combined that with unstructured data, the KS jumped to 75-76.” This significant improvement illustrates the power of combining multiple data sources.
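The “KS” quoted here most likely refers to the Kolmogorov–Smirnov statistic expressed on a 0–100 scale, a common discrimination metric for fraud and credit models; that interpretation is an assumption, as is the synthetic data in the sketch below. The quoted 50–56 versus 75–76 figures come from the article, not from this code.

```python
import numpy as np

def ks_score(scores: np.ndarray, labels: np.ndarray) -> float:
    """Maximum gap between the cumulative score distributions of the
    fraudulent and genuine classes, scaled to 0-100."""
    order = np.argsort(scores)
    labels = labels[order]
    cum_pos = np.cumsum(labels) / labels.sum()            # cumulative frauds
    cum_neg = np.cumsum(1 - labels) / (1 - labels).sum()  # cumulative genuine
    return 100 * np.max(np.abs(cum_pos - cum_neg))

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)  # synthetic fraud labels
# Synthetic model scores that separate the two classes reasonably well:
scores = labels * rng.normal(1.0, 1.0, 1000) + (1 - labels) * rng.normal(0.0, 1.0, 1000)
print(round(ks_score(scores, labels), 1))  # the larger the gap, the better the model
```

The larger the KS, the better the model separates fraudulent from genuine claims, which is why the jump Sengupta describes is such a meaningful gain.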
Beyond insurance, multimodal AI is making waves in other sectors too. In human resource management, Fractal is developing an interview bot capable of analysing not only a candidate’s words, but also their body language and speech patterns. The bot can assess pauses, eye contact, and even detect potential fraud during virtual interviews.
In the automotive sector, Fractal has partnered with companies to generate marketing content by combining images of cars with automatically generated captions. Similarly, in healthcare, Fractal’s Vaidya AI integrates multiple data types—including prescription text, medical images, and X-rays—to give healthcare professionals a holistic understanding of a patient’s condition.
Challenges and Ethical Considerations
While multimodal AI’s potential is vast, its development is not without challenges. According to Sengupta, one of the biggest hurdles is the lack of large annotated datasets: “You can have the algorithms, but without the right data in a usable format, AI models will struggle to deliver accurate results.”
Data processing also presents technical challenges. Handling large volumes of video and audio data requires significant computational power and storage, making these systems costly to develop and deploy. Guha added that while multimodal applications are in high demand, the cost of creating such systems remains a barrier to widespread adoption, especially in the B2C space.
Ethical concerns also come into play, particularly when it comes to data security and bias. AI models are trained on data collected from human society, which inherently carries biases. Sengupta acknowledged this, explaining that at Fractal, they focus heavily on ensuring their tools are certified ‘Responsible AI’.
The goal is to develop models that minimise bias and follow strict ethical guidelines, even as they evolve to become more human-like.
The Future of Multimodal AI
Multimodal systems will likely become more prevalent across industries. According to Guha, one of the key advancements multimodal AI offers is “increasing the coverage of what AI can do”. By integrating multiple faculties—vision, language, and speech—AI can handle a wider array of tasks, offering more comprehensive solutions.
In the coming years, multimodal AI could reshape the landscape of artificial intelligence, pushing it closer to truly replicating human cognition and decision-making.