Autoscaling Kubernetes workloads with custom schedules

Context

Would you like to know how we cut annual cloud costs by over 45% for one of our major financial services clients? Read on!

Following on from a previous blog post, where we discovered how to make use of custom schedules in Azure DevOps (ADO) pipelines, we decided to use the same mechanism as part of our FinOps practices for one of our major clients.

This particular client has a large development studio with a route to live consisting of multiple test environments prior to production. Each environment has at least one Red Hat OpenShift cluster running a complex variety of Kubernetes workloads.

The challenge

The challenge was to reduce the compute costs for worker nodes across the non-production environments whilst maintaining a high degree of environment availability for development and testing teams. We saw an opportunity to take advantage of the OpenShift/Kubernetes machine autoscaler by scaling down workloads out of hours.

Reduce workloads > reduce OpenShift worker nodes > reduce cloud compute costs
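
For context, the OpenShift machine autoscaler adjusts the node count of a MachineSet between configured bounds as pod demand rises and falls. A minimal MachineAutoscaler resource might look like the following (the MachineSet name and replica bounds are illustrative, not our client's actual values):

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-eu-west-1a           # illustrative MachineSet to scale
  namespace: openshift-machine-api
spec:
  minReplicas: 1                    # floor for quiet, out-of-hours periods
  maxReplicas: 12                   # ceiling for peak working hours
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-eu-west-1a

With this in place, scaling workloads down to zero out of hours allows the cluster to drain and remove surplus worker nodes automatically.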

Whilst many workloads could be scaled down out of hours, not all of them could be. Some more critical or time-sensitive workloads were to remain running 24/7. Furthermore, requirements would differ per project and per environment, and the development studio would require final control over the autoscaling configuration of each environment and the projects within it.

The solution

At Frontier Digital, we have always maintained an everything-as-code approach to all of the platforms we develop. With this mindset in place, we quickly realised that the autoscaling configuration for each environment would have to be maintained in source control. We developed an easily consumable config mechanism in YAML: one YAML file per environment, with the OpenShift projects defined as follows:

projects:   
  - name: core-services
    exclusions:
      - account-batch-service
      - account-registration-service
  - name: payment-services
    exclusions:
      - payment-schedule-service
      - payment-transfer-service
      - transfer-automation-service

First of all, this is an opt-in approach to autoscaling: if a project is not specifically called out in configuration, it is ignored by the autoscaler pipeline. Here we can see that we have opted the core-services and payment-services projects into autoscaling. However, each of these projects has a number of exclusions. Exclusions tell the autoscaler to scale every deployment in the project except those listed in the exclusions array. This allows developers and testers to be in full control of which workloads are subject to autoscaling. A sketch of how the pipeline might consume this config is shown below.
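
To make the mechanism concrete, here is a minimal sketch of the scale-down logic in Python. The config file path is an assumption, and the oc CLI calls stand in for however you talk to the cluster; our production pipeline differs in detail:

import subprocess
import yaml  # PyYAML

# Load the per-environment config from the repository checkout
with open("config/test-environment.yaml") as f:  # illustrative path
    config = yaml.safe_load(f)

for project in config["projects"]:
    namespace = project["name"]
    exclusions = set(project.get("exclusions", []))

    # List every deployment in the opted-in project
    output = subprocess.run(
        ["oc", "get", "deployments", "-n", namespace, "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in output.splitlines():
        name = line.split("/", 1)[1]  # "deployment.apps/foo" -> "foo"
        if name in exclusions:
            continue  # leave excluded workloads running
        subprocess.run(
            ["oc", "scale", f"deployment/{name}", "--replicas=0",
             "-n", namespace],
            check=True,
        )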

In practice

Taking the example of the core-services project, we can see that there are a number of different deployments with differing replica counts.

[Screenshot: Payment Services]

The configuration for this test environment calls out core-services as a candidate project for the autoscaler. However, the following services are to be excluded:

  • account-batch-service
  • account-registration-service

When we run the autoscaler pipeline, the config is consumed from the repository:

[Screenshot: Pipeline logs]

Finally, when the pipeline has finished executing, we can see that the deployments have been scaled to zero, leaving the excluded deployments alone:

[Screenshot: Payment Services]

Keeping track

As we have previously noted, we cannot rely on a standard replica count across our deployments: many have differing replica counts. Our goals are to reduce cost whilst maintaining stability for our development and testing teams across the studio. When the autoscaler scales deployments back up, scaling to a pod count that is too high or too low would endanger one or both of these goals. In order to keep track of the previous state of the deployments, the autoscaler annotates each deployment upon scale-down:

[Screenshot: Annotations]

Information such as the original replica count is captured to aid the scale-up action, along with additional metadata such as the URL of the ADO pipeline run that executed the scale-down.
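
For illustration, a scaled-down deployment might carry annotations like these (the keys, values, and deployment name below are illustrative, not the exact ones we use):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
  annotations:
    autoscaler.example.com/original-replicas: "3"
    autoscaler.example.com/scaled-down-by: "https://dev.azure.com/org/project/_build/results?buildId=1234"
spec:
  replicas: 0

On the restore run, the autoscaler reads the original replica count back from the annotation and scales each deployment up to exactly that value.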

Scheduling

All of this is useful, but there has to be a mechanism to control each environment and its different scheduling requirements. Going back to the custom schedule trick, we are able to define many different schedules in our ADO pipeline:

schedules:
  - cron: "0 20 * * *"
    displayName: FAT shutdown
    branches:
      include:
        - master
    always: true

  - cron: "0 8 * * *"
    displayName: FAT restore
    branches:
      include:
        - master
    always: true

  - cron: "0 20 * * Fri"
    displayName: SIT shutdown
    branches:
      include:
        - master
    always: true

  - cron: "0 3 * * Mon"
    displayName: SIT restore
    branches:
      include:
        - master
    always: true

  - cron: "0 1 * * Sat"
    displayName: UAT shutdown
    branches:
      include:
        - master
    always: true

  - cron: "0 20 * * Sun"
    displayName: UAT restore
    branches:
      include:
        - master
    always: true

Since we can query the ADO API to determine which schedule kicked the pipeline off, the pipeline can use this information in a pre-flight step to decide which action to take:

[Screenshot: Pipeline]
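
A minimal sketch of such a pre-flight step in Python follows. The Builds REST endpoint and the predefined pipeline variables are real, but the triggerInfo key holding the schedule name is an assumption here, and the action mapping simply looks for "shutdown" or "restore" in the schedule's display name:

import os
import requests

# Predefined variables available to every ADO pipeline run; System.AccessToken
# must be mapped into the step's environment explicitly
org_url = os.environ["SYSTEM_COLLECTIONURI"]  # e.g. https://dev.azure.com/my-org/
project = os.environ["SYSTEM_TEAMPROJECT"]
build_id = os.environ["BUILD_BUILDID"]
token = os.environ["SYSTEM_ACCESSTOKEN"]

# Fetch the current run's details from the ADO Builds API
response = requests.get(
    f"{org_url}{project}/_apis/build/builds/{build_id}",
    params={"api-version": "7.0"},
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
build = response.json()

# Assumption: scheduled runs expose the schedule's display name in triggerInfo
schedule_name = build.get("triggerInfo", {}).get("scheduleName", "")

if "shutdown" in schedule_name.lower():
    action = "scale-down"
elif "restore" in schedule_name.lower():
    action = "scale-up"
else:
    raise SystemExit(f"Unrecognised schedule: {schedule_name!r}")

# Expose the decision to subsequent pipeline steps via a logging command
print(f"##vso[task.setvariable variable=autoscalerAction]{action}")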

Summary

This exercise has resulted in a huge cost saving for the development studio. They are no longer burning money on expensive compute when it is simply not in use. This, combined with a number of other FinOps practices, has allowed us to cut their annual cloud costs by over 45%. Get in touch if you’d like to find out more.
