Autoscaling in Kubernetes using Custom Metrics with HPA

EKS 3rd Week: Observability

Sigrid Jin
8 min read · Mar 30, 2024

Introduction

Kubernetes provides a built-in Horizontal Pod Autoscaler (HPA) that allows for easy setup of an autoscaling system based on CPU usage. However, depending on the nature of the service, it may be more appropriate to use different metrics for autoscaling.

Kubernetes offers the following autoscalers:

  1. Cluster Autoscaler (CA): Autoscales nodes based on the required resources (CPU, memory, etc.) to run scheduled pods. At Pingpong, we use Karpenter as an alternative to CA.
  2. Horizontal Pod Autoscaler (HPA): Adjusts the number of pod replicas based on traffic. Each pod requires the same amount of resources.
  3. Vertical Pod Autoscaler (VPA): Adjusts the resource requests and limits of individual pods based on traffic.

For stateless servers, HPA is generally the most suitable option. VPA requires pod restarts when changing resources, and there are limitations to the resources (CPU, memory) that a single node can hold. Before moving on to the implementation, let’s discuss the basic requirements.

What is Metrics Server and How Does it Work?

The Horizontal Pod Autoscaler (HPA) is a key feature of Kubernetes that automatically scales the number of pod replicas in a deployment, replica set, or stateful set based on observed CPU utilization or other select metrics.

Metrics Server is a cluster-level add-on that provides the resource metrics pipeline used by the autoscaling features. Its primary purpose is to collect metrics to support the implementation of autoscaling functionality. It plays a crucial role in securely transferring metrics from the kubelet process running on each node to the kube-api server.

  1. cAdvisor collects metrics from containers within a Pod.
  2. The kubelet collects the metrics exposed by cAdvisor.
  3. Metrics Server collects the metrics exposed by the kubelet.
  4. The kube-api server collects the metrics exposed by Metrics Server.

The HPA relies on the Metrics API to collect values, but the Metrics API is not served by Kubernetes out of the box. To use the HPA, you need to install a Metrics API server separately so that the desired Metrics API is available.
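For example, you can check whether a resource Metrics API server (such as Metrics Server) is registered with the API aggregation layer:

# The APIService should report Available=True once a Metrics API server is installed
kubectl get apiservice v1beta1.metrics.k8s.io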

The pipeline consists of the following components:

  1. Metric Source: Provides metrics such as CPU, memory, GPU, and request count. Typically, these metrics are provided by pods and nodes.
  2. Metric Collector: Collects metrics from the Metric Source. It is used for monitoring purposes, independent of the HPA.
  3. Metric API Server: Exposes the data from the Metric Collector as a Kubernetes API, making it accessible to the HPA.

Every Kubernetes Node has kubelet installed, which includes cAdvisor. cAdvisor acts as the Metric Collector, gathering CPU and memory usage data from Pods and Nodes, which is then exposed by kubelet.

Since cAdvisor is included in Kubernetes by default, you only need to install a Metrics Server to serve as the Metric API Server. You can verify that it works by running the kubectl top command.
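If Metrics Server is not installed yet, the upstream release manifest is the usual starting point (adjust for your distribution); once it is running, kubectl top should return values:

# Install Metrics Server from the upstream release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify the resource metrics pipeline end to end
kubectl top nodes
kubectl top pods -A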

You can view the actual response content from the API server by running the command kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq.

In some cases, you may need to measure server load using metrics other than CPU and memory. Kubernetes provides a way to use custom metrics with the HPA. Unlike resource metrics, custom metrics require a separate metric collector such as Prometheus, since cAdvisor does not provide them. Prometheus, paired with an adapter, can also back a Custom Metrics API server and additionally serve external metrics.

The HPA is managed by a controller within the Kubernetes Controller Manager. This controller periodically adjusts the number of replicas in a deployment to match the current demand, as indicated by the selected metrics. After deciding on the new desired number of replicas based on the target metrics, the HPA updates the specification of the controlled resource (e.g., Deployment) with this new replica count. This update is made through a modification request to the Kubernetes API server.
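Concretely, the desired replica count is derived from the ratio between the current and target metric values, following the algorithm described in the Kubernetes documentation:

desiredReplicas = ceil( currentReplicas * currentMetricValue / desiredMetricValue )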

Once the desired state is updated in the API server, the respective controller for the resource (Deployment Controller for Deployments, ReplicaSet Controller for ReplicaSets, etc.) notices the change in desired replicas. This controller then works to reconcile the actual state with the desired state.

How Prometheus Exporters Collect RPS Metrics

To begin with, RPS (requests per second) can be a nifty scaling metric if your workload involves significant I/O. The tooling is not limited to Prometheus; anything from Prometheus to Datadog works, as long as an appropriate Metric API server is installed and running.

Prometheus employs a service discovery mechanism to automatically find all resources within a Kubernetes cluster. Specifically, in an Istio setup, Prometheus is configured to recognize Istio’s configuration and automatically locate Envoy proxy endpoints to collect metrics from.

https://jerryljh.medium.com/prometheus-auto-service-discovery-73b736184999
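As a rough sketch of what such service discovery looks like, Istio's sample Prometheus configuration uses a pod-role kubernetes_sd_configs job to find Envoy metric ports (treat the job name and regex below as illustrative):

# prometheus.yml scrape_configs entry (sketch)
- job_name: 'envoy-stats'
  metrics_path: /stats/prometheus
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pod ports whose name ends with "-envoy-prom", i.e. the sidecar metrics port
    - source_labels: [__meta_kubernetes_pod_container_port_name]
      action: keep
      regex: '.*-envoy-prom'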

One approach is to expose an API endpoint on each server that reports the number of requests received and have Prometheus scrape that endpoint; in Python, this can be done with starlette_prometheus.
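A minimal sketch of that approach with starlette_prometheus (the app and the /metrics path follow the library's documented usage; adapt them to your service):

from starlette.applications import Starlette
from starlette_prometheus import PrometheusMiddleware, metrics

app = Starlette()

# Count and time every handled request
app.add_middleware(PrometheusMiddleware)

# Expose the collected metrics in Prometheus text format for scraping
app.add_route("/metrics", metrics)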

Alternatively, the istio_requests_total metric can be used to calculate the number of requests between servers, without adding any extra dependencies to the application projects.

https://medium.com/google-cloud/kubernetes-autoscaling-with-istio-metrics-76442253a45a

Istio automatically injects an Envoy proxy as a sidecar container into each service pod. This sidecar proxy intermediates all communication between services and collects a variety of metrics, such as the number of HTTP requests, latency, and HTTP status codes. The Envoy proxy exposes metrics through its built-in management server; they are accessible via the /stats endpoint and are provided in a format compatible with Prometheus.
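To see what a sidecar actually exposes, you can port-forward to the proxy's Prometheus endpoint (port 15090 is the sidecar's usual metrics port; the pod name is a placeholder):

kubectl -n probe port-forward pod/<probe-provision-pod> 15090:15090
curl -s localhost:15090/stats/prometheus | grep istio_requests_total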

The Prometheus Adapter is configured to map specific Prometheus queries to custom metric names that Kubernetes can recognize. The Kubernetes HPA can then query these custom metrics through the Metrics API and adjust the number of pods automatically based on the configured thresholds.
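A minimal sketch of such a mapping, assuming the test_requests_per_second recording rule defined later in this post, might look like this in the adapter's rule configuration (for example, under rules.custom in the prometheus-adapter Helm values):

rules:
  custom:
    - seriesQuery: 'test_requests_per_second{namespace!="",service!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          service: {resource: "service"}
      name:
        as: "test_requests_per_second"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'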

For instance, if the request rate per second for a specific service exceeds a predefined threshold, the HPA can use this information to increase the number of pods to distribute the load more effectively.

additionalPrometheusRules:
  - name: my-rule-file
    groups:
      - name: requests-per-second
        rules:
          - record: test_requests_per_second
            expr: >-
              sum(rate(istio_requests_total{source_app="istio-ingressgateway",destination_service="probe-provision.probe.svc.cluster.local",reporter="destination"}[1m]))
              / sum(kube_deployment_status_replicas_available{deployment="probe-provision"})
            labels:
              namespace: probe
              service: probe-provision

  1. istio_requests_total is a metric provided by Istio that represents the total number of requests processed by the Istio proxy.
  • Istio collects telemetry data from the mesh and stores it in Prometheus, a powerful monitoring and alerting system. One of the metrics captured by Istio is istio_requests_total, which allows you to determine the rate of requests per second received by a specific workload.
  • To query Prometheus for the requests-per-second (req/sec) rate of a workload named “podinfo” in the “test” namespace over the last minute, excluding requests with a 404 response code, you can use the PromQL query shown below.
sum(
  rate(
    istio_requests_total{
      destination_workload="podinfo",
      destination_workload_namespace="test",
      reporter="destination",
      response_code!="404"
    }[1m]
  )
)

2. The metric is filtered based on the following labels:

  • source_app="istio-ingressgateway": This filters the requests originating from the Istio Ingress Gateway.
  • destination_service="probe-provision.probe.svc.cluster.local": This filters the requests destined for the probe-provision service in the probe namespace.
  • reporter="destination": This specifies that the metric should be reported by the destination proxy.
  • In Istio, the istio_requests_total metric is recorded by both the source and destination proxies for each request. This means that the same request is counted twice - once by the source proxy and once by the destination proxy.
  • With reporter="destination", you only consider requests that have successfully reached the destination service (probe-provision in this case). This provides a more accurate representation of the actual traffic hitting the service, since using reporter="source" would include requests that were dropped or failed before reaching the destination service, leading to an inflated count.

3. The rate() function calculates the per-second rate of requests over a 1-minute window ([1m]).

4. The sum() function aggregates the rate values across all dimensions (labels) of the metric.

Nice! Let’s install prometheus-adapter; a minimal install sketch follows. Once the adapter is running, the command shown after it retrieves the list of available custom metrics. In the output, you should see the test_requests_per_second metric listed under the custom.metrics.k8s.io/v1beta1 API group.
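The release name, namespace, and Prometheus URL below are assumptions to adapt to your cluster; the flags follow the prometheus-community Helm chart's values:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc \
  --set prometheus.port=9090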

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
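You can also ask the custom metrics API for the current value of the metric on a specific object; the namespace and service below follow the earlier recording-rule example:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/probe/services/probe-provision/test_requests_per_second" | jq .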

Injecting Istio Sidecars

When Istio sidecar injection is enabled, each pod in the mesh runs an Envoy proxy sidecar container alongside the application container. This sidecar proxy intercepts all inbound and outbound traffic for the Pod, which is crucial for gathering metrics.

If a service targeted by a VirtualService does not have an Istio sidecar injected, the traffic directed to that service is not intercepted by Istio's Envoy sidecar proxy.

You can enable Istio injection for a specific namespace by labeling the namespace with istio-injection=enabled. To label the default namespace for sidecar injection, run the command below.

k label ns default istio-injection=enabled

After setting the label, if you create a deployment in the default namespace, you will notice that, without running any additional commands, each of its pods has two containers: the application container and the injected Istio sidecar.
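You can confirm the injection by listing the containers of one of the pods (the label selector below is illustrative):

kubectl get pod -l app=<your-app> -o jsonpath='{.items[0].spec.containers[*].name}'
# the output should include your application container plus istio-proxy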

However, when labeling an entire namespace, there may be cases where you want to exclude certain deployments from sidecar injection. In such cases, you can add the following configuration to the YAML file:

spec:
  template:
    metadata:
      labels:
        sidecar.istio.io/inject: "false"

By setting sidecar.istio.io/inject: "false", you can prevent sidecar injection for specific deployments, even if the namespace is labeled for injection.

To remove the label from a namespace, you can use the command k label ns default istio-injection-.

Defining HPA

For resource metrics, set spec.metrics[].type to "Resource"; for custom metrics attached to a Kubernetes object, as in this example, set it to "Object".

In the HPA definition, you can specify multiple metric criteria simultaneously, such as CPU and RPS. When multiple metrics are defined, the HPA computes a desired replica count for each metric and scales the pods to the highest of those values.

This YAML file defines an HPA that scales the deployment based on CPU utilization, memory utilization, and a custom metric for requests per second. The HPA will ensure that the number of replicas is adjusted automatically to meet the specified targets.

{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: xxx
  labels:
    {{- include "pylon.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: xxx
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Object
      object:
        metric:
          name: {{ .Release.Namespace }}_xxx_requests_per_second
        describedObject:
          apiVersion: v1
          kind: Service
          name: xxx
        target:
          type: Value
          value: {{ .Values.autoscaling.targetRequestsPerSecond }}
{{- end }}
# values.yaml
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
  targetRequestsPerSecond: 5

# helm template deployment-name ...

Tip) Scraping more frequently than every 1 minute

If you want Prometheus to scrape metrics at, say, a 10-second interval, this is configured in the Prometheus configuration file (prometheus.yml), where you can set the scrape_interval globally or per job.

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
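For example, a 10-second interval can be set directly in prometheus.yml, or, if you run the kube-prometheus-stack chart, via the Prometheus spec in its values (the field names below follow that chart; double-check them against your chart version):

# prometheus.yml
global:
  scrape_interval: 10s
  scrape_timeout: 10s

# or, in kube-prometheus-stack values.yaml
prometheus:
  prometheusSpec:
    scrapeInterval: 10s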
