How we migrated from GKE to Cloud Run and saved infrastructure costs


GKE is one of the most capable services on Google Cloud. However, just because your services are dockerized does not mean that GKE has to be the only choice of runtime for a scalable solution. As cloud providers, GCP in particular, keep adding new services to their portfolios, it is worth checking whether one of the newer services fits our requirements, solves our problem or makes our solution more efficient. In this post, we explain why we moved from GKE to Cloud Run for our lower environment, the steps involved, the benefits and the performance we observed. Cloud Run is GCP's serverless offering for running containers.

A Little Background about our Application

At Techolution, we provide end-to-end IoT solutions to enterprises. As part of our service, we developed an asset condition monitoring platform for a telecom player using Mainflux IoT. The customer wanted the solution to run in their own data centre and did not want a cloud-based solution, so Google IoT Core and Pub/Sub were ruled out. Our tech stack included the Mainflux IoT platform, Kafka, InfluxDB, MongoDB, Postgres (required for Mainflux) and Redis. The UI screens were developed as Angular applications and the backend services were developed in Java. Python was used for developing and exposing ML services. The UI application and the services were packaged as Docker images.

While the customer provided the Prod and UAT environments, we needed to build and maintain the DEV environment ourselves. For production and UAT, we had set up a Kubernetes cluster on the client's network.

Being partners with GCP, we decided to use GCP for our development environment. We set up GKE to run our backend services (around 10 services), using two clusters. One GKE cluster had preemptible VMs for the services and the UI application. The other had regular VMs for hosting Kafka and the Kafka listeners. The Mainflux IoT platform ran on a dedicated VM. We had sample devices that sent data to the IoT gateway in the DEV environment in real time.

Why we needed to Move away from GKE

While GKE provided a lot of flexibility, we realized that the dev environment alone was costing around $450–500 per month. That was after trimming the cluster size, host and VM configurations to the minimum required and moving from a multi-node Kafka cluster to a single-node Kafka cluster on a VM. While GKE master nodes were initially not charged, GCP has lately started billing for them as well. Since other internal initiatives at Techolution use the same tech stack, we needed a way to bring the cost down. We had two options to reduce the cost:

  1. Set up our own Kubernetes cluster on VMs (similar to the production and UAT environments).
  2. Leverage Cloud Run as the runtime for our containers.

While option 1 could have been ideal, it still did not guarantee cost savings. It also did not guarantee scalability for future products developed and deployed on the same tech stack. Given that this is only a dev environment, the load is never consistent. The dynamic, usage-based pricing of Cloud Run suggested that it would bring significant savings.

While Cloud Run could be a default choice for event-driven applications, its use for web applications had not been explored much. Since Cloud Run provides an HTTP endpoint for each service, we decided to try it out. Snahil, our DevOps engineer, was instrumental in turning the architects' migration plan into reality.

Migration Strategy

Following are the five key steps we went through:

  1. Evaluation
  2. Deployment
  3. Routing
  4. Validation
  5. Automation

Evaluation

We evaluated Cloud Run from two perspectives:

  1. Technical feasibility
  2. Whether it would really save cost

Cloud Run, being a serverless platform, comes with its limitations. The two key ones are:

  1. It cannot support WebSockets.
  2. Persistent memory is not available.

Since our UI is not driven by WebSockets, the first limitation did not apply. We used a managed Redis service and a dedicated VM for the databases, so storage is decoupled from the services. Since our pods do not have any other dependencies, Cloud Run passed the feasibility test.

Coming to cost savings, Cloud Run bills based on the CPU utilized. In a dev environment, there is no continuous traffic. Although we had set up monitors for downtime, the usage comes purely from developer testing, QA testing, automated site monitoring scripts and occasional demos. So it made sense to go for Cloud Run. The startup times were also negligible.
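To make the reasoning concrete, here is a rough sketch of the comparison between an always-on cluster and usage-based billing. It is only an illustration: every rate and volume in it is a hypothetical placeholder rather than a GCP list price, and it ignores memory charges, per-request fees, free tiers and concurrency.

```python
# All rates and volumes below are hypothetical placeholders, not GCP list prices.

def always_on_monthly_cost(hourly_rate: float, nodes: int,
                           hours_per_month: float = 730.0) -> float:
    """Cost of nodes that run all month, whether or not they serve traffic."""
    return hourly_rate * nodes * hours_per_month


def usage_based_monthly_cost(requests_per_month: int,
                             avg_seconds_per_request: float,
                             price_per_cpu_second: float) -> float:
    """Cost when only the CPU time spent serving requests is billed."""
    busy_cpu_seconds = requests_per_month * avg_seconds_per_request
    return busy_cpu_seconds * price_per_cpu_second


if __name__ == "__main__":
    fixed = always_on_monthly_cost(hourly_rate=0.05, nodes=3)
    usage = usage_based_monthly_cost(requests_per_month=200_000,
                                     avg_seconds_per_request=0.5,
                                     price_per_cpu_second=0.00003)
    print(f"always-on: ${fixed:.2f}  usage-based: ${usage:.2f}")
```

The point of the sketch is the shape of the two formulas: the first does not depend on traffic at all, while the second drops to zero when nobody is using the environment.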

If we were migrating a production load of microservices serving a customer-facing dashboard or apps, we would have based the decision on:

  1. Average number of parallel requests
  2. Peak load and average load
  3. Response time of requests (90th percentile and average)
  4. CPU utilization of the services in GKE cluster

If there are continuous requests (e.g. e-commerce sites), the cost benefit needs to be derived from real-time load projections and the GKE clusters that would have to be provisioned. If some services are invoked only occasionally (e.g. customer profile section, order status), they can be moved from GKE to Cloud Run to free up the GKE cluster, which in turn can reduce the cluster size and the bill.
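Some of the metrics listed above can be derived from a plain request log. The sketch below is one possible way to do it, assuming you can export each request's latency and its start and end timestamps. The function names and the nearest-rank percentile method are our choices for illustration, not something prescribed by GKE or Cloud Run.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def peak_concurrency(intervals):
    """Maximum number of requests in flight at once, given (start, end) pairs."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back requests are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def summarize(latencies_ms, intervals):
    """Average latency, 90th percentile latency and peak parallel requests."""
    return {
        "avg_ms": mean(latencies_ms),
        "p90_ms": percentile(latencies_ms, 90),
        "peak_parallel": peak_concurrency(intervals),
    }
```

CPU utilization is not covered here, since it comes from cluster monitoring rather than from the request log.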

Deployment

As with any migration, we started with one service. We deployed it manually using the GCP console to check that it worked. As part of the evaluation, we also tested the response time and the startup time in case of a cold start. Once the results were positive, we created a pipeline in Cloud Build to deploy to Cloud Run. One advantage of Cloud Run is that when a new version of a service is deployed, you can configure the percentage of traffic that goes to it. So for simple A/B testing that does not depend on the user profile, where we need to compare the performance of two versions of a service, Cloud Run provides the feature out of the box.
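To illustrate what a percentage-based traffic split means in practice, here is a small simulation. It is only a model of the idea, assuming each request is assigned independently at random in proportion to the configured percentages. It is not how Cloud Run implements the split internally, and the revision names are made up.

```python
import random


def pick_revision(weights, rng):
    """Pick a revision with probability proportional to its traffic percentage."""
    if sum(weights.values()) != 100:
        raise ValueError("traffic percentages must add up to 100")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, requests, seed=42):
    """Count how many of `requests` simulated calls land on each revision."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    counts = {name: 0 for name in weights}
    for _ in range(requests):
        counts[pick_revision(weights, rng)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical revision names: 90% to the stable one, 10% to the new one.
    print(simulate({"service-v1": 90, "service-v2": 10}, 10_000))
```

With a split like this, the new version sees enough traffic to compare response times while most requests still go to the version that is known to work.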

Routing

Now that all the services are deployed separately on Cloud Run, they need to be accessible to applications (web app, mobile apps, external clients etc.). Each service deployed on Cloud Run has its own unique URL. Obviously, we cannot hand these individual service URLs to clients, and we were also using a custom domain. So we had two options:

  1. Use an NGINX or Apache web server to route the requests based on patterns
  2. Use an API gateway

We went with an API gateway. Here again, GCP provided two options:

  1. Apigee
  2. Cloud Endpoints

Apigee is a proven product, but its free tier was very restrictive, so we chose Cloud Endpoints. Making Cloud Endpoints work with Cloud Run is a multi-step process. At a high level, the steps are as follows:

  1. Install the Open ESP image on Cloud Run and migrate to ESPv2 beta as detailed in the link here
  2. Create a YAML file conforming to OpenAPI version 2. You might need to provide the Cloud Run endpoint of the ESPv2 service created in the above step. Configure each service against its request pattern and map it to the appropriate Cloud Run backend.
  3. Deploy the configuration to Cloud Endpoints by running the deploy command with the YAML file. Note that even after the deployment, requests might still fail because the URL patterns do not match. Fix this by following all three steps (i.e. service deploy, building a new image and deploying the new image) as mentioned in this link
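With around ten services, writing the route mappings by hand gets repetitive. The sketch below shows one way the OpenAPI version 2 document could be assembled from a route table. The helper names, host name and backend URLs are hypothetical. It emits JSON rather than YAML only to stay within the Python standard library, and we have not verified this exact output against Cloud Endpoints, so treat it as a starting point.

```python
import json


def backend_operation(operation_id, backend_url, path_params=()):
    """One GET operation routed to a Cloud Run backend via x-google-backend."""
    return {
        "get": {
            "operationId": operation_id,
            "parameters": [
                {"in": "path", "name": name, "type": "string", "required": True}
                for name in path_params
            ],
            "x-google-backend": {
                "address": backend_url,
                "path_translation": "APPEND_PATH_TO_ADDRESS",
            },
            "responses": {"200": {"description": "A successful response"}},
        }
    }


def build_spec(title, gateway_host, routes):
    """routes maps a path template to (operation_id, backend_url, path_params)."""
    return {
        "swagger": "2.0",
        "info": {"title": title, "version": "1.0.0"},
        "host": gateway_host,
        "schemes": ["https"],
        "paths": {
            path: backend_operation(op_id, url, params)
            for path, (op_id, url, params) in routes.items()
        },
    }


if __name__ == "__main__":
    # Hypothetical host and backend URLs, for illustration only.
    spec = build_spec(
        "Dev gateway",
        "gateway.example.com",
        {"/asset/{resourceid}/{kpi}": (
            "getAssetKpi", "https://asset-service.example.com", ("resourceid", "kpi"))},
    )
    print(json.dumps(spec, indent=2))
```

Keeping the routes in one table also makes it easier to review what is exposed, since each operationId and backend address appears exactly once.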

Now your services are up and running. Since we already had a stable environment, we could not make the new environment live without testing. So we used the following approach:

  1. Create a new subdomain and point it to the Cloud Endpoints URL
  2. Test the new subdomain
  3. Once all the functionality works, change the CNAME record to point to the new endpoint. Cloud Run provides a detailed tutorial on using custom domains, which can also be followed to configure the Cloud Run service running ESPv2 to use the new custom domain name. If updating the CNAME is not possible, map the old domain to a static IP, run a web server (Apache or Nginx) on that IP and configure the request forwarding and redirect rules in it.
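Testing the new subdomain can be partly automated by requesting the same paths from the old and the new host and comparing the results. The sketch below compares only HTTP status codes, which is a deliberately shallow check. The function names are ours, and the fetcher is passed in as a parameter so the comparison logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def compare_endpoints(paths, old_base, new_base, fetch_status):
    """Return (path, old_status, new_status) for every path whose status differs."""
    mismatches = []
    for path in paths:
        old_status = fetch_status(old_base.rstrip("/") + path)
        new_status = fetch_status(new_base.rstrip("/") + path)
        if old_status != new_status:
            mismatches.append((path, old_status, new_status))
    return mismatches


def http_status(url, timeout=10.0):
    """Fetch a URL with urllib and return its status code, including 4xx and 5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

A call such as compare_endpoints(paths, old_url, new_url, http_status) should return an empty list before the CNAME is switched. Comparing response bodies as well would make the check stronger.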

In our project, we faced a few challenges in serving static resources (e.g. the JS and CSS files packaged with the Angular app). It took a while to find how the OpenAPI v2 standard handles file downloads and to configure it accordingly. Similarly, regular-expression-based matching also took a while to figure out.

Since the latest Swagger version used by service developers might differ from the OpenAPI version 2 standard used by ESPv2, I am providing templates for the use cases that required significant reading of the manuals and specification documents.

We used the pattern below to serve the files:

/js/{filename}:
  get:
    summary: <<summary>>
    operationId: <<operationId value>>
    parameters:
      - in: path
        name: filename
        type: string
        required: true
    produces:
      - text/javascript
    x-google-backend:
      address: <<cloudrun endpoint for application hosting the static files>>
      path_translation: APPEND_PATH_TO_ADDRESS
      protocol: h2
    responses:
      '200':
        description: A successful response
        schema:
          type: file

Even if you are using a CDN, you still need to map the routes for the static files, because once the TTL expires the CDN falls back to the origin to fetch the content.

We used the following OpenAPI 2 format for regex-based (path-templated) service declarations:

/asset/{resourceid}/{kpi}:
  get:
    summary: <<summary>>
    operationId: <<operationId>>
    parameters:
      - in: path
        name: resourceid
        type: string
        required: true
      - in: path
        name: kpi
        type: string
        required: true
    x-google-backend:
      address: <<Endpoint of service>>
      path_translation: APPEND_PATH_TO_ADDRESS
      protocol: h2
    responses:
      '200':
        description: A successful response
        schema:
          type: string
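One mistake that is easy to make in these declarations is a path parameter whose declared name differs from the placeholder in the path, for example `resource` declared against `{resourceid}`. The sketch below checks for that and also shows the matching rule we assume from OpenAPI path templating, where each `{name}` stands for exactly one path segment. It illustrates the rule only and is not the gateway's actual matching code. The helper names are ours.

```python
import re

# A placeholder is a name in braces that does not span a "/" boundary.
PLACEHOLDER = re.compile(r"\{([^{}/]+)\}")


def check_path_params(path_template, declared_names):
    """Return (missing, unknown): placeholders not declared, and declarations not in the path."""
    in_template = set(PLACEHOLDER.findall(path_template))
    declared = set(declared_names)
    return sorted(in_template - declared), sorted(declared - in_template)


def template_to_regex(path_template):
    """Compile a template so that each {name} matches exactly one path segment."""
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(path_template):
        parts.append(re.escape(path_template[position:match.start()]))
        parts.append(f"(?P<{match.group(1)}>[^/]+)")
        position = match.end()
    parts.append(re.escape(path_template[position:]))
    return re.compile("^" + "".join(parts) + "$")


if __name__ == "__main__":
    template = "/asset/{resourceid}/{kpi}"
    print(check_path_params(template, ["resource", "kpi"]))
    found = template_to_regex(template).match("/asset/pump-7/temperature")
    print(found.groupdict() if found else None)
```

Running a check like this over the whole file before deploying can save a round trip through the deploy and image rebuild steps.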

With this, we were able to move the entire web application and the backend microservices it depends on to Cloud Run. Our entire web application stack now runs on serverless infrastructure.

Results

We are just 10 days into the migration. Going by the daily billing rates and GCP billing predictions, we expect this month's bill to be about 40% lower. The traffic and the number of requests served remain the same as before. We used to be charged around $500 per month and we are now seeing somewhere around $300 per month after the migration.
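The percentage follows directly from the two monthly figures. The short helper below is just the arithmetic, written out so that it can be rerun once a full month of billing data is available.

```python
def savings_percent(before: float, after: float) -> float:
    """Percentage reduction from the old monthly bill to the new one."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


if __name__ == "__main__":
    # Figures from this post: about $500 per month before, about $300 after.
    print(f"{savings_percent(500, 300):.1f}% lower")
    # Against the lower end of the earlier $450-500 range the saving is smaller.
    print(f"{savings_percent(450, 300):.1f}% lower")
```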

Application performance is also satisfactory and matches the previous levels. The 90th percentile and average response times are almost the same as on GKE. From a user experience perspective too, we did not notice much difference.

If you are deploying on Cloud Run, do check whether the maximum number of container instances and the number of parallel requests per container suit your workload. By default, a service can scale up to 1000 container instances. This is similar to the max cluster size we set in GKE. Leaving it at the default might result in an unexpectedly large GCP bill in case of a DDoS attack.
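A simple way to pick a cap is to work backwards from the peak number of parallel requests you expect. The sketch below does that arithmetic. The 1.5x headroom factor and the example numbers are assumptions for illustration, and the 1000 ceiling is simply the default mentioned above.

```python
import math


def instances_needed(peak_parallel_requests: int, concurrency_per_instance: int) -> int:
    """Instances required to serve the peak if each handles `concurrency_per_instance` requests."""
    if concurrency_per_instance < 1:
        raise ValueError("concurrency_per_instance must be at least 1")
    return math.ceil(peak_parallel_requests / concurrency_per_instance)


def suggested_max_instances(peak_parallel_requests: int,
                            concurrency_per_instance: int,
                            headroom: float = 1.5,
                            platform_default: int = 1000) -> int:
    """Cap sized for the expected peak plus headroom, never above the platform default."""
    padded_peak = math.ceil(peak_parallel_requests * headroom)
    needed = instances_needed(padded_peak, concurrency_per_instance)
    return min(max(1, needed), platform_default)


if __name__ == "__main__":
    # Hypothetical numbers: 200 parallel requests at peak, 80 per container.
    print(suggested_max_instances(200, 80))
```

For a dev environment the resulting cap is usually a small single-digit number, which bounds the worst-case bill far more tightly than the default does.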

Is Cloud Run suitable for all Needs?

As long as Cloud Run meets your application stack's requirements, I would say that it can be used. To some extent, it keeps the architecture simple, especially for distributed and standalone services. Please be aware that there are a few technical limitations and use cases that Cloud Run does not support yet (e.g. persistent memory and storage).

Is Cloud Run suitable only for Lower environments?

From my experience, I would suggest using Cloud Run for production environments too. However, if you are using GKE for production and want to use Cloud Run only for lower environments, keep in mind that:

  1. The runtime and network architecture of your lower environment will differ from prod. I would recommend having at least one environment (e.g. UAT or Performance) with an architecture similar to prod, so that issues seen in production, or likely to happen there, can be caught early.
  2. The CI/CD pipeline for dev cannot be reused for the other environments. If your higher-environment CI/CD scripts just take the image from the repository and deploy it, this should not be an issue. If they do something more, note that you will have no lower environment in which to test changes to that pipeline.

A few things which could have been better

  1. Cloud Endpoints does not provide a web interface to create a new mapping, or at least to view the mappings currently deployed. Having one would make quick PoCs and validation easier. Though I can understand that Google wanted to enforce a script- and configuration-driven setup from day 1, a UI that at least shows the mappings would help in troubleshooting. From this perspective, I like the Azure API gateway.
  2. Though Cloud Endpoints lists the services, the operational metrics (number of requests, response time, response codes etc.) are not displayed on the Cloud Endpoints page. They are available only in the Cloud Run dashboard of the service running ESPv2.

#GKE #CloudRun #GKEtoCloudrunMigration #GCP #GoogleCloud

IoT Solution Architect