
Ever since AI/ML workloads came into play, companies have been re-examining their container orchestration platforms and, in some cases, moving away from Kubernetes. The latest to do so is Juspay, a leading payments platform for merchants in India that also powers platforms like Namma Yatri.
Juspay’s Hyperswitch, an open-source payment switch written in Rust, relied heavily on Kafka for pushing events. However, as the team explained in its latest blog post on GitHub, the decision to move Kafka from Kubernetes (K8s) to Amazon EC2 was driven by the need to optimise performance, reduce costs, and simplify operations.
“After months of firefighting, we decided to move from Kubernetes to EC2, a transition that improved performance, simplified operations, and cut costs by 28%,” Neeraj Kumar, program manager at Juspay, said in the blog, highlighting the massive cost difference.
After switching to EC2, the per-instance cost dropped from $180 to $130 a month, a reduction of roughly 28%.
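The arithmetic behind the headline figure can be checked directly from the two per-instance costs quoted above:

```python
# Reported per-instance monthly cost before and after the migration
cost_on_k8s = 180  # USD, Kubernetes-based setup
cost_on_ec2 = 130  # USD, EC2-based setup

reduction = (cost_on_k8s - cost_on_ec2) / cost_on_k8s
print(f"Cost reduction: {reduction:.0%}")  # prints: Cost reduction: 28%
```

$50 saved on a $180 base is 27.8%, which the blog rounds to 28%.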
While Kubernetes initially provided a solid foundation for container orchestration, managing Kafka at scale proved more challenging than Juspay anticipated. Rising infrastructure costs, inefficiencies in resource allocation, and auto-scaling issues led to a critical reassessment of their infrastructure strategy.
What Were the Challenges?
Juspay’s decision to migrate Kafka from Kubernetes to EC2 ignited a lively discussion on Reddit, where engineers and architects weighed in on the trade-offs of managing stateful workloads in Kubernetes. While it may sound like Juspay abandoned Kubernetes entirely, in reality it only moved Kafka off the platform.
This highlights a broader trend where companies often realise that running databases, queues, or brokers on Kubernetes introduces unnecessary complexity. One of the major challenges Juspay faced with Kubernetes was resource allocation inefficiencies. Kubernetes dynamically managed resources, but in practice, this resulted in unexpected waste.
For instance, when they requested 2 CPU cores and 8GB of RAM, the resources actually available to the workload were slightly lower, at 1.8 CPU cores and 7.5GB of RAM, partly because Kubernetes reserves a slice of each node’s capacity for system components. While this discrepancy might seem minor, at scale it contributed to significant cost overruns. As Kumar put it, “Imagine paying for a full tank of fuel, but your car only gets 90% of it. Over time, those missing litres add up.”
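The kind of request the team describes would look something like this in a pod spec (a minimal sketch; the pod name and image are placeholders, not from Juspay’s setup):

```yaml
# Pod requesting the 2 CPU / 8GB described in the post.
# What the workload actually gets depends on each node's
# "allocatable" capacity: the node's total resources minus
# what the kubelet reserves for system daemons.
apiVersion: v1
kind: Pod
metadata:
  name: kafka-broker-example   # placeholder name
spec:
  containers:
  - name: kafka
    image: example/kafka:latest  # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "2"
        memory: 8Gi
```

The gap Juspay observed is the difference between what the spec requests and what the node can actually hand over after those reservations.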
This is similar to what Christian Weichel, co-founder and CTO of Gitpod, and Alejandro de Brito Fontes, staff engineer at the company, said three months ago. Kubernetes initially seemed like the obvious choice for Gitpod’s remote, standardised, and automated development environments, but scaling up was an issue for them as well.
Gitpod began developing Flex in January 2024 and launched it in October, part of an ongoing trend of companies building in-house products for the task. Built on Kubernetes-inspired principles like declarative APIs and control theory, Flex simplifies architecture, prioritises zero-trust security, and addresses the specific needs of development environments.
Ben Houston, founder and CTO of ThreeKit, an online visual commerce platform, illustrated another of Kubernetes’ challenges in his recent blog, explaining why he shifted from Kubernetes to Google Cloud Run. The primary reason was Kubernetes’ complexity and high cost, which outweighed its benefits for managing infrastructure at scale.
For Houston, Kubernetes required extensive provisioning, maintenance, and management, leading to significant DevOps overhead. Additionally, its slow autoscaling often resulted in over-provisioning and paying for unused resources.
Auto-scaling also posed difficulties. Kubernetes’ scaling mechanisms are designed for stateless applications, but Kafka is stateful. Instead of seamlessly scaling up when resources ran low, Kubernetes would restart Kafka nodes, leading to delays in message processing and increased latency during scaling events.
Managing these stateful workloads became an ongoing operational burden, further complicating Kafka’s stability.
The Shift to EC2
Discussions about the challenges of Kubernetes have been ongoing for some time. In a Hacker News thread about its viability, developers from different companies cited several reasons why Kubernetes can be cumbersome.
Initially, Juspay relied on Strimzi for Kafka cluster management, but this solution introduced its own set of issues. New Kafka nodes often failed to integrate seamlessly, requiring manual intervention for every scaling event. “Managing our Kafka clusters felt like playing whack-a-mole—every time we solved one issue, another would pop up,” they noted.
Faced with these challenges, Juspay opted to migrate Kafka from Kubernetes to EC2, a move that allowed for better control over resource allocation, auto-scaling, and cluster management. Instead of relying on third-party tools, they built an in-house Kafka Controller tailored to their needs.
This shift enabled the seamless integration of new Kafka nodes, automated scaling based on real-time workload analysis, and significantly improved cluster management with minimal manual intervention.
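Juspay has not published the controller’s code, but scaling “based on real-time workload analysis” typically reduces to threshold logic over broker metrics. The following is a purely hypothetical sketch of such a decision function; every name and threshold here is invented for illustration, not taken from Juspay’s controller:

```python
def desired_broker_count(current: int, avg_cpu: float, consumer_lag: int) -> int:
    """Hypothetical scaling decision for a Kafka cluster.

    Grows the cluster when average broker CPU or total consumer lag
    crosses a threshold, shrinks it when both are comfortably low.
    All thresholds are illustrative assumptions.
    """
    SCALE_UP_CPU = 0.75       # scale up above 75% average CPU
    SCALE_UP_LAG = 100_000    # or above 100k messages of consumer lag
    SCALE_DOWN_CPU = 0.30     # scale down below 30% CPU with no lag
    MIN_BROKERS = 3           # keep a replication-safe minimum

    if avg_cpu > SCALE_UP_CPU or consumer_lag > SCALE_UP_LAG:
        return current + 1
    if avg_cpu < SCALE_DOWN_CPU and consumer_lag == 0 and current > MIN_BROKERS:
        return current - 1
    return current
```

A controller loop would feed this function live metrics and then handle the stateful part Kubernetes struggled with: attaching storage, registering the new broker, and rebalancing partitions before routing traffic to it.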
Unlike Kubernetes, where provisioning was often imprecise and led to over-provisioning costs, EC2 allowed for precise resource allocation: the team could provision exactly the CPU and memory needed, avoiding unnecessary expenditure. Previously, their Kubernetes-based setup cost $180 per instance per month.
Juspay highlighted that while Kubernetes is excellent for stateless applications, stateful workloads like Kafka can introduce unnecessary complexity. Custom solutions, such as their in-house Kafka Controller, provided better control, automation, and reliability.
Mohit Pandey