VideoSDK aims to provide developers with the infrastructure to integrate real-time AI, audio, and video into their applications.

Launched in 2021, developer-focused startup VideoSDK has been quietly transforming how application developers use real-time AI, video, and audio capabilities to solve various problems.
Now, the Surat-based company, founded by Arjun Kava and Sagar Kava, is stepping further into the AI world with the launch of its new small language model (SLM). The model is designed to bring on-device and cloud-enabled AI solutions to businesses.
In an exclusive interview with AIM, Arjun Kava, one of the minds behind VideoSDK’s AI leap, opened up about the motivation for building a new SLM, the challenges the team faced in integrating the solution, and what lies ahead for the startup.
What Does VideoSDK Do?
Kava explained that VideoSDK’s primary goal is to help companies automate communication-intensive tasks. Whether it is a customer service agent handling real-time conversations within the customer experience (CX) sector or a video KYC process in the banking, financial services, and insurance (BFSI) industry, VideoSDK is built to enhance the efficiency of these interactions.
The company offers developers the tools to embed real-time voice and video functionalities into their applications across various platforms, including Android, iOS, and the web. This enables the creation of applications with features similar to Google Meet, allowing users to connect with others globally. The company’s solutions cater to diverse needs, from regulated industries like BFSI and healthcare to social media, dating, and online proctoring.
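To illustrate the kind of integration described here, the sketch below shows roughly what embedding a real-time call into a web application looks like with a JavaScript/TypeScript SDK of this kind. The package name, configuration fields, and event names are assumptions modelled on common real-time SDK patterns, not details confirmed in this article.

```typescript
// Illustrative sketch only: the package name, config fields, and event names
// below are assumptions based on typical real-time SDK patterns, not an
// authoritative reference for VideoSDK's actual API.
import { VideoSDK } from "@videosdk.live/js-sdk";

async function joinCall(meetingId: string, token: string): Promise<void> {
  // Authenticate the client with a token generated on the application's server.
  VideoSDK.config(token);

  // Initialise a meeting session with microphone and camera enabled.
  const meeting = VideoSDK.initMeeting({
    meetingId,
    name: "Demo User",
    micEnabled: true,
    webcamEnabled: true,
  });

  // Log participants as they join, so their audio/video streams can be rendered.
  meeting.on("participant-joined", (participant: { displayName: string }) => {
    console.log(`${participant.displayName} joined the call`);
  });

  // Connect to the room; media starts flowing once the join succeeds.
  meeting.join();
}
```

In a production setup, the token would be minted server-side against the provider’s REST API, and participant streams would be attached to audio and video elements rather than logged.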
In its last funding round, VideoSDK secured a $1.2 million investment from GVFL, the lead investor. The company is strategically allocating this money towards product development and its go-to-market (GTM) strategy.
Notably, it has helped Groww, an online investment platform, achieve a 90% success rate for Video KYC.
A Real-Time Speech AI Model to Reduce Cost
VideoSDK has introduced NAMO-SSLM, a hybrid real-time speech AI model that combines on-device computing with cloud-enabled capabilities, along with vision and OCR support.
It may be similar to MoshiVis, an open-source speech model with visual understanding capabilities.
It comes in two parts: a Conversational Agent SDK and a Thinking Agent SDK.

Kava emphasised that the team designed the model’s architecture to make effective use of a device’s hardware, both CPU and GPU. The model can, for instance, run directly on iPhones or Android phones in real time, which he described as a significant benchmark for the team.
He further explained that this approach allows their small speech language model (SSLM) to solve a range of problems for application developers: on-device execution preserves privacy and cuts costs by requiring less computing power.
“For example, a bank can directly deploy this model within its CPU infrastructure in real time. It can save costs, and at the same time, it helps them to make sure that the data of a customer does not leave their infrastructure,” Kava shared.
According to him, NAMO-SSLM is nearly 20 times cheaper to run than comparable models from OpenAI and Anthropic. The model is designed to be language-agnostic and cost-effective to train and fine-tune, making it accessible for various industries and use cases.
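As a concrete illustration of the on-device pattern Kava describes, the sketch below runs an open speech-recognition model entirely on the client with the transformers.js library, so raw audio never leaves the user's machine. NAMO-SSLM's weights and runtime are not covered in detail here, so the model used in this example (Whisper tiny) is a stand-in for the pattern, not VideoSDK's model.

```typescript
// On-device inference sketch using transformers.js (a stand-in for the pattern,
// not NAMO-SSLM itself): the speech model is downloaded once and runs locally,
// so raw audio never leaves the user's device.
import { pipeline } from "@xenova/transformers";

async function transcribeLocally(audioUrl: string): Promise<string> {
  // Load a small open speech-recognition model that fits on consumer hardware.
  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-tiny.en"
  );

  // Run inference locally; only the resulting text stays in the application.
  const result = await transcriber(audioUrl);
  return (result as { text: string }).text;
}
```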
Kava also revealed that the startup will release the model’s weights and associated assets to position it as an open-source initiative, enabling the community to use the model for its own use cases.
Moreover, for those who want to scale with the model, the startup plans a cloud offering that helps developers deploy it, backed by the company’s expertise in areas such as regulatory compliance.
The SSLM’s development was inspired by Kava’s research experience at companies like AWS and Vimeo, where he focused on video analytics.
Challenges Behind Helping Developers, Businesses
VideoSDK encounters several key challenges as it scales its operations and technology. The primary hurdle is ensuring its SLM can be deployed effectively across a wide range of devices, particularly low-end handsets from numerous vendors prevalent in markets like India and the Middle East and North Africa (MENA) region. This necessitates ongoing innovation in its SDK to optimise performance across different hardware configurations.
Moreover, the company is trying to replicate its research at scale and aims to accelerate the training, validation, and benchmarking cycles for its AI models.
What’s Next?
VideoSDK plans to expand the reach of its SLM, targeting deployment across every available device by the end of the year.
The company also wants to establish itself as the category creator and leader in the real-time AI communication space. Within the year, it aims to reach six-figure usage, measured in minutes, for its real-time AI offering.
It remains to be seen how the company will position itself against other major players like Agora, Twilio Video, and others to compete in the global market.