Handling system instability due to HPA

Hariharan Anantharaman
5 min read · May 4, 2024

HPA (Horizontal Pod Autoscaler) is one of the essential and valuable features of Kubernetes. It is one of the main tenets that ensure the elasticity of the infrastructure, and it aids in significant cost optimization. However, for one of our use cases, HPA caused system instability, and we turned it off for the selected services.

Use case overview

Our solution is a multi-cloud system. While the application we develop and manage is hosted in AWS, a few of the enterprise’s upstream systems are hosted in GCP.

Our application listens to a PubSub topic to which one of the upstream systems posts messages. Each message contains only the primary key of a resource. Since the resource contains sensitive PII, it was decided that the full data would not be sent in the message. Instead, the data is available over an API and can be accessed via an HTTP endpoint. The traffic between AWS and GCP is routed via the enterprise data centre and does not go over the public internet.

Assume that each of the services runs in a separate pod.

Request flow

Below are the steps:

  1. An upstream system posts a message.
  2. Our application has a listener, “Service A”, which consumes the message.
  3. Upon receiving a message, Service A invokes a business service, “Service B”.
  4. Service B invokes an API provided by “Service C”, hosted in GCP.
  5. Once the response is received, validations and data clean-up are done.
  6. After the data clean-up, Service B invokes a product domain service, “Service D”.
  7. The product domain service saves the data in the database.

The reasons for such a heterogeneous architecture and the different services (product domain service, Service B, etc.) can be debated, but in principle they do not challenge the relevance of HPA in this context.

How the problem unfolded

The critical factor here is the time taken by Service B, which does a lot of work: it invokes a third-party service, gets the response, applies some business logic, and saves the data through Service D. Assume that the service hosted in GCP returns its response in 75 milliseconds.

On a normal day, the number of messages received is a function of the number of users accessing the application. However, due to operational reasons, it was decided to publish a large number of messages in a short time. To quantify: around 1 million messages were emitted (burst) by the source into PubSub.
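A quick back-of-envelope calculation shows why this burst matters. The 75 ms figure is from above; the pod counts and per-pod concurrency below are illustrative assumptions, not our actual configuration:

```python
# Back-of-envelope: how long does it take to drain 1 million messages?
# The 75 ms per-call latency is from the article; pod counts and per-pod
# concurrency are illustrative assumptions.

def drain_time_hours(messages, per_call_ms, pods, concurrency_per_pod):
    """Rough lower bound, ignoring queueing, retries, and network jitter."""
    parallel_calls = pods * concurrency_per_pod
    total_ms = messages * per_call_ms / parallel_calls
    return total_ms / 1000 / 3600

# Fully serial: one call at a time.
serial = drain_time_hours(1_000_000, 75, pods=1, concurrency_per_pod=1)

# Pre-scaled: e.g. 4 pods of Service B, 10 concurrent calls each (assumed).
scaled = drain_time_hours(1_000_000, 75, pods=4, concurrency_per_pod=10)

print(f"serial: {serial:.1f} h, pre-scaled: {scaled:.2f} h")
# → serial: 20.8 h, pre-scaled: 0.52 h
```

Even under optimistic assumptions, draining the backlog takes tens of minutes to hours, so the system sits under sustained pressure for the whole window.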

When the messages arrived, Service A, the initial entry point, recognized that there was a sudden need to process all of them. The HPA rules were triggered, and Service A occupied a lot of CPU. Service A delegates most of its processing to Service B, which in turn delegates to Service D (after the validation etc.).

Now the demand for resources from Service B is the same as the demand created by Service A, so Service B also tries to spin up new pod instances via HPA. However, since Service A has utilized most of the available capacity in the node, there is no spare capacity for Service B. K8S might trigger a new node to spin up as well, but since node provisioning takes time, this adds further delay.

The communication between Service A and Service B is via HTTP. Since Service B does not return results promptly, the requests triggered by Service A wait until the configured timeout and then fail. As the requests pile up in Service A, Service A also requests resources from the node to spin up new pods.

The image below describes it clearly.

Infra overview with HPA

Solution

Luckily, we identified the problem in one of the mock runs in a lower environment. Looking at the logs, it was evident that resource contention was happening. With a higher number of instances of Service A, the picture was clear. We did the following:

  1. Turned off the HPA for the services involved (Services A, B, and D).
  2. Pre-scaled each of these services.
  3. While pre-scaling, made sure that the downstream services had more instances. In this case, we provisioned/pre-scaled 5 instances of Service D, 4 instances of Service B, and 3 instances of Service A.
  4. We also confirmed with the GCP team hosting Service C that they were prepared for the sudden spike. They had also pre-scaled to avoid any delays in horizontally scaling up their pods and services.
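As a sketch, the turn-off-and-pre-scale steps correspond to deleting the HPA objects and pinning the replica counts on each Deployment. The deployment names here are assumed, and the `selector`/`template` fields are elided for brevity:

```yaml
# Sketch only: names are assumed; selector/template fields elided.
# First, delete the HPA objects so they no longer override the fixed counts:
#   kubectl delete hpa service-a service-b service-d
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-d      # most downstream service gets the most instances
spec:
  replicas: 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-b
spec:
  replicas: 4
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a      # entry point gets the fewest
spec:
  replicas: 3
```

Pinning more capacity downstream than upstream means Service A cannot pull in messages faster than Services B and D can absorb them.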

One impact was that the time taken to consume all 1 million messages increased. However, there were no failures. With a better CPU and RAM configuration, we were able to process the messages much faster while keeping the same ratio of instances of Services A, B, and D.

Other Possible Solutions

  1. See if the message handler can directly invoke the domain Service D. While this would give the best of both worlds, it might result in code duplication, or in the business rules being extracted to a JAR file and used by the dependent services. This has its own implications.
  2. Run the message listener service (Service A) and the downstream services (Service B and Service D) in separate clusters. This would avoid one pod consuming all the resources in a node.
  3. Define different HPA rules for Services A, B, and D. I am not sure if this is possible in K8S or EKS; any K8S experts can suggest. But the partner company that manages the infra, with an army of AWS-certified experts, didn’t offer such insights.
  4. Limit the maximum instances of A, B, and D. This would ensure that, even with HPA, one service does not bring down the others.
  5. Package Services A, B, and D in a single pod.
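On options 3 and 4: HPA objects are in fact defined per workload, so different rules and different ceilings for Services A, B, and D are possible in plain Kubernetes (and hence EKS). A minimal sketch, with names and numbers illustrative:

```yaml
# Sketch of option 4: cap each service's HPA so no single service can
# absorb the whole node's capacity. Names and numbers are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-a
  minReplicas: 2
  maxReplicas: 4        # hard ceiling; leaves room for B and D to scale
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

A low `maxReplicas` on Service A acts as back-pressure: it caps how fast messages are pulled in, leaving node capacity for Services B and D to scale.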

While these options exist, each has its own pros and cons and needs to be evaluated. During application development and infra design, such edge cases of sudden bursts were hardly discussed. This resulted in choosing the quickest solution during a crisis, not a solution that works on sunny and rainy days and is more reliable and consistent.
