

Kubernetes HPA: Use Cases, Limitations & Best Practices

Raz Goldenberg 25 October 2024 12 min read

The Kubernetes Horizontal Pod Autoscaler (HPA) adjusts the number of active pods in real time based on resource needs, ensuring efficient scaling for applications. This blog covers how HPA works, setup examples, best practices, and more, for a complete autoscaling strategy.

What is Kubernetes Horizontal Pod Autoscaler (HPA)?

The Kubernetes Horizontal Pod Autoscaler (HPA) is a built-in resource controller that automatically adjusts the number of running pods in a deployment, stateful set, or replication controller based on the current resource usage. It allows Kubernetes clusters to scale up when the demand increases and scale down when the demand decreases, ensuring resources are allocated efficiently without over-provisioning.

HPA can act on resource metrics such as CPU and memory utilization, custom metrics such as request rates or queue lengths, and external signals (e.g., from third-party APIs) to make scaling decisions. The goal of HPA is to maintain optimal performance while minimizing operational costs.

Kubernetes HPA vs VPA

In addition to HPA, Kubernetes also provides a Vertical Pod Autoscaler (VPA), which adjusts the resource requests and limits of containers rather than the number of pods. Here’s a quick comparison:

  • HPA scales the number of pods horizontally based on resource utilization (e.g., CPU, memory, custom or external metrics).
  • VPA adjusts the resource requests and limits of each pod vertically, resizing CPU or memory limits without changing the pod count.

In most scenarios, HPA and VPA complement each other, as they address different scaling needs.
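For illustration, here is a minimal VPA manifest. It assumes the VPA custom resources and controllers (shipped separately, via the Kubernetes autoscaler project) are installed in the cluster; the name web-application-vpa is a placeholder:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-application-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  updatePolicy:
    # "Auto" lets VPA evict pods and recreate them with updated requests;
    # "Off" produces recommendations without applying them
    updateMode: "Auto"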

How Does Kubernetes HPA Work?

HPA continuously monitors selected metrics and adjusts the pod count based on predefined thresholds. Kubernetes supports various metrics that can trigger Kubernetes autoscaling decisions, including resource metrics, object metrics, pod metrics, and external metrics.
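Under the hood, the HPA controller derives the desired replica count from the ratio between the current and target metric values: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]. For example, if 4 replicas average 140% CPU utilization against a 70% target, the HPA scales to ceil(4 * 140 / 70) = 8 replicas.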

1. Resource Metrics

Resource metrics are the most common and include basic system resources like CPU and memory. For example, HPA can be set to monitor the average CPU utilization across pods and scale the deployment if the average exceeds a given threshold.

Here’s an example of how to configure HPA based on CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

In this example, Kubernetes will monitor the CPU utilization of the web-application deployment and scale the number of pods to keep the average CPU usage around 70%. The minReplicas and maxReplicas fields define the lower and upper limits for scaling.

2. Object Metrics

Object metrics allow the HPA to scale pods based on values extracted from other Kubernetes objects, such as the number of requests in a service or the length of a queue in a message broker. This provides a way to trigger scaling based on more application-specific data.

For example, you could autoscale based on the number of HTTP requests being processed by an Ingress controller or a service. In this case, the object metric would need to reference the appropriate Kubernetes object (like an Ingress) that tracks the number of requests.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: http-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Object
    object:
      metric:
        name: requests_per_second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: web-ingress
      target:
        type: Value
        value: 1000

In this example, the HPA scales based on the rate of HTTP requests reported for the Ingress object web-ingress: if requests per second exceed 1000, the HPA adds pod replicas to handle the load. Note that a metric like requests_per_second is not exposed by Kubernetes out of the box; it must be provided by a custom metrics adapter, such as the Prometheus Adapter, that attaches metrics to the Ingress object.

3. Pod Metrics

Pod metrics focus on specific metrics that are exposed by the pods themselves. This can be useful for scaling based on application-specific performance data, such as the rate of transactions processed by a pod or memory usage beyond basic Kubernetes resource metrics.
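As a sketch, a Pods-type metric can be wired into an HPA as follows. It assumes a custom metrics adapter (such as the Prometheus Adapter) exposes a per-pod metric; the metric name transactions_processed_per_second is a placeholder for illustration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pod-metric-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: transactions_processed_per_second
      target:
        # AverageValue targets the average across all pods: the HPA adds
        # replicas until each pod handles roughly 100 transactions per second
        type: AverageValue
        averageValue: "100"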

4. External Metrics

External metrics are used when you want to scale based on external sources like cloud services or third-party APIs. For example, you could scale pods in response to the load on an external database, the number of messages in an external queue, or signals from a monitoring service.
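Here is a sketch of an External metric entry, which slots into the metrics list of an HPA spec like the ones above. It assumes an external metrics adapter publishes a queue-depth metric; the metric and label names are illustrative:

metrics:
- type: External
  external:
    metric:
      name: queue_messages_ready
      selector:
        matchLabels:
          queue: worker_tasks
    target:
      # Scale so that, on average, each pod handles no more than 30 queued messages
      type: AverageValue
      averageValue: "30"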

Kubernetes HPA Use Cases

HPA is ideal for dynamically scaling applications based on real-time demand. Below are some common use cases:

  • Microservices Architectures: In microservices-based architectures, where individual services handle specific tasks, traffic can vary significantly between services. HPA can ensure that each service has the right number of pods to handle the workload, adjusting the number of replicas based on CPU, memory, or custom metrics.
  • Web Applications with Variable Traffic: Web applications often experience fluctuating traffic patterns, such as spikes during certain times of the day or during promotional events. HPA helps in scaling the web app’s backend services based on real-time traffic, ensuring a smooth user experience even under heavy load.
  • Batch Processing Jobs: Batch processing workloads like data pipelines or machine learning tasks can vary widely in size. HPA can scale the processing pods up when there is a large batch of data to handle and scale them down once the batch is processed, optimizing resource use.
  • CI/CD Pipelines: In Continuous Integration/Continuous Deployment (CI/CD) systems, the workload can fluctuate heavily depending on how often code is committed and built. HPA can dynamically scale the resources needed to run builds, tests, and deployments, ensuring efficient use of compute power.
  • IoT and Data Streaming Applications: For IoT and data streaming systems, the volume of incoming data can fluctuate depending on external factors like the number of connected devices or the time of day. HPA can scale the services that process incoming data to match the varying load.

Limitations of HPA

While HPA is a powerful tool, it has some limitations that you need to be aware of when designing scalable applications.

  • Combining HPA and VPA: Using HPA and VPA together can cause instability if both scale based on the same metrics. HPA adjusts pod counts, while VPA modifies resource requests and limits. Changes from VPA can affect HPA’s scaling, leading to oscillations. To avoid this, configure them to use distinct metrics.
  • HPA for Stateless vs. Stateful Apps: HPA suits stateless apps, but scaling stateful apps with HPA is complex due to state preservation.
  • No IOPS, Bandwidth, or Storage Scaling: HPA doesn’t scale based on IOPS, bandwidth, or storage; custom metrics may help.
  • Limited for Spikes: HPA struggles with sudden demand spikes, causing delays in scaling.
  • Resource Waste or Termination: Frequent scaling up and down of the number of pods may cause cluster fragmentation, which leads to wasted resources.
  • Cluster Capacity Risks: Over-scaling may result in pending pods or resource shortages.

How to Set Up Kubernetes HPA

Setting up HPA is straightforward. Here’s a quick walkthrough for setting up a CPU-based autoscaler.

  • Ensure you have the metrics-server running in your Kubernetes cluster. The metrics-server collects CPU and memory usage from each node’s kubelet and exposes it through the Kubernetes Metrics API, which the HPA queries for resource-based scaling.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
  • Deploy your application. For example, a deployment named nginx-deployment.
kubectl create deployment nginx-deployment --image=nginx
  • Create an HPA that targets CPU usage.
kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=1 --max=10

This command will set up an HPA for the nginx-deployment, automatically scaling the pod count between 1 and 10 to keep CPU utilization around 50%.
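For teams that prefer declarative configuration, the same autoscaler can be expressed as a manifest using the autoscaling/v2 API (a sketch, equivalent in intent to the imperative command above):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50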

How to Manage HPAs with kubectl

You can manage and monitor HPAs easily using kubectl commands.

  • Check the status of the HPA:
kubectl get hpa

This command shows the current status of all HPAs, including metrics and pod counts.

  • Describe the HPA to get detailed information:
kubectl describe hpa nginx-deployment

Best Practices to Optimize Kubernetes HPA

To ensure that your HPA setup works efficiently, consider the following best practices.

1. Set Appropriate Metrics

Choosing the right metrics is key to efficient scaling. While CPU and memory are common metrics, they may not fully capture the performance of more complex applications. For business-critical workloads, consider custom metrics, such as request rates, response times, or queue lengths. Tools like Prometheus can be used to expose custom metrics, providing deeper insights into what should trigger scaling decisions.

2. Configure Min and Max Replicas

Setting realistic minimum and maximum replica values is critical. A too-low minimum can lead to performance degradation under load, while an excessively high maximum can waste resources. Analyze your application’s typical traffic patterns and growth forecasts to decide appropriate replica values. Ensure that the minimum number of pods keeps the app responsive during idle periods, and the maximum cap prevents resource exhaustion during peak loads.

3. Enable Cluster Auto-Scaling

Without cluster-level auto-scaling, your HPA may create more pods than the cluster can accommodate, leading to scheduling failures. Having a cluster autoscaler in your cluster ensures that your infrastructure can scale to handle additional pods when required. The cluster autoscaler can also downsize the cluster when demand decreases, saving costs while maintaining optimal resource utilization.

4. Leverage Scaling Delays

Avoid frequent scaling caused by temporary traffic spikes by introducing scaling delays. Kubernetes lets you configure stabilization windows, which make the HPA consider recommendations over a trailing period rather than reacting to every momentary reading. This reduces “flapping”: repeated scale-ups and scale-downs triggered by short-lived traffic that doesn’t justify more resources. See the sketch below.
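The autoscaling/v2 behavior field, which sits under the HPA spec alongside metrics, expresses such a policy; the values below are illustrative starting points, not recommendations:

behavior:
  scaleDown:
    # Act only on the highest recommendation seen over the last 5 minutes,
    # and remove at most 50% of the current pods per minute
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
  scaleUp:
    # Scale up immediately, but add at most 4 pods per minute
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60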

5. Regularly Monitor and Adjust Thresholds

The scaling behavior of your HPA should evolve as your application grows. Regularly monitor the performance of your application and adjust the thresholds based on real-world data. Traffic spikes, deployment changes, or code refactoring may necessitate updates to the scaling parameters. Use historical data to set realistic thresholds that minimize response times while maintaining resource efficiency.

6. Use Event-Based Scaling

Kubernetes Event-Driven Autoscaling (KEDA) can help you prepare for anticipated traffic surges by enabling faster, event-based scaling. KEDA works in conjunction with HPA and is particularly effective for applications that experience frequent, unpredictable spikes in demand. By responding to specific events and metrics, KEDA can scale resources dynamically, ensuring optimal performance during high-traffic periods.
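A minimal KEDA ScaledObject sketch: it assumes KEDA is installed in the cluster and that a Prometheus server is reachable at the address shown. The deployment name, query, and threshold are placeholders:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-application-scaler
spec:
  scaleTargetRef:
    name: web-application
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(rate(http_requests_total[2m]))
      # KEDA drives the underlying HPA to keep the query result
      # below roughly 100 per replica
      threshold: "100"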

7. Set Correct Pod Requests and Limits

Improperly configured pod resource requests and limits can lead to inefficient scaling. If requests are too high, you might end up with underutilized resources, and if they are too low, your pods may not get the resources they need, resulting in poor performance. Analyze resource usage over time to fine-tune your requests and limits, and ensure that scaling decisions are based on accurate estimates of the actual needs of your application.
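This interacts with HPA directly: a Utilization target is computed as a percentage of a container’s requests, so the request value determines what “70% utilization” actually means. A sketch of a container spec fragment with illustrative values:

containers:
- name: web
  image: nginx
  resources:
    requests:
      # HPA utilization targets are measured against this request, so a
      # 70% CPU target here corresponds to roughly 175m of actual usage
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi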

8. Work in Conjunction with the Cluster Autoscaler

For seamless scaling, HPA should work in conjunction with the cluster autoscaler. When HPA adds more pods, the cluster autoscaler must increase the node capacity if existing nodes lack sufficient resources. By integrating the two, you ensure that pod replicas have sufficient space to run. Without this integration, additional pods might stay in a pending state, waiting for resources, which can slow down your response to increased traffic.

9. Fine-Tune Readiness and Liveness Probes for Smooth Scaling

Incorrectly configured readiness and liveness probes can cause instability during scaling events. If your probes aren’t fine-tuned, new pods may be added but start receiving traffic before they’re fully ready, leading to failed requests. On the other hand, overly aggressive liveness probes might cause the pods to restart prematurely. Ensure that the probes accurately reflect your application’s startup time and readiness to avoid issues during autoscaling events.
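Below is a sketch of probes tuned with scaling in mind, for a container spec; the path, port, and timings are placeholders to adjust to your application’s real startup behavior:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # Hold traffic back until a newly scaled-up pod has warmed up
  initialDelaySeconds: 10
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # Be more patient before restarting: a slow start is not a deadlock
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3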

Summary of Best Practices for Optimizing Kubernetes HPA

  • Set Appropriate Metrics: Use relevant metrics beyond CPU and memory for effective scaling. Action: consider custom metrics like request rates or queue lengths.
  • Configure Min and Max Replicas: Balance performance and resource use by setting realistic replica values. Action: analyze traffic patterns to set min and max replicas effectively.
  • Enable Cluster Auto-Scaling: Avoid pod scheduling issues by enabling the cluster autoscaler to work alongside HPA. Action: ensure your cluster can handle additional pods when scaling up.
  • Leverage Stabilization Windows: Prevent frequent scaling by adding stabilization windows for temporary spikes. Action: use delays to minimize unnecessary scaling during short traffic surges.
  • Monitor and Adjust Thresholds: Adjust thresholds over time based on real-world application behavior. Action: regularly review and update based on historical data and growth.
  • Use Event-Driven Scaling: Prepare for traffic surges proactively by anticipating load increases. Action: implement tools like KEDA for event-driven scaling.
  • Set Correct Pod Requests & Limits: Optimize resource allocation by tuning pod requests and limits. Action: analyze usage data to set accurate resource limits for each pod.
  • Work in Conjunction with the Cluster Autoscaler: Seamlessly scale nodes and pods together for consistent performance. Action: ensure the autoscaler is configured to match HPA’s scaling actions.
  • Fine-Tune Probes for Smooth Scaling: Avoid instability by setting accurate readiness and liveness probes. Action: adjust probes to reflect true application readiness and stability.

Impact of HPA on Kubernetes Resource Costs

While HPA can optimize resource allocation, several factors can impact Kubernetes resource costs if not configured properly:

  • Overly Aggressive Thresholds: Can trigger unnecessary scaling and higher costs.
  • Insufficient Monitoring: Leads to scaling based on inaccurate data, inflating costs.
  • Autoscaler Misconfigurations: Causes inefficient scaling, increasing expenses.
  • Underutilized Replicas: Results in resource waste and added costs.
  • Short Cooldown Periods: Frequent scaling up/down increases operational costs.
  • Lack of Custom Metrics: Inefficient scaling using default metrics can drive up costs.
  • Resource Misconfiguration: Poor requests/limits cause overspending on cloud resources.

Conclusion

Kubernetes HPA is a robust solution for managing pod scaling based on real-time metrics. However, understanding its limitations, integrating it with other autoscaling tools like VPA and cluster autoscalers, and following best practices is crucial for ensuring efficient operation in production environments. By carefully configuring and monitoring HPA, you can achieve optimal scalability and performance while controlling resource costs in your Kubernetes clusters.

Ready to optimize your autoscaling? Try ScaleOps today!

Related Articles

Amazon EKS Auto Mode: What It Is and How to Optimize Kubernetes Clusters

Amazon recently introduced EKS Auto Mode, a feature designed to simplify Kubernetes cluster management. This new feature automates many operational tasks, such as managing cluster infrastructure, provisioning nodes, and optimizing costs. It offers a streamlined experience for developers, allowing them to focus on deploying and running applications without the complexities of cluster management.

Pod Disruption Budget: Benefits, Example & Best Practices

In Kubernetes, maintaining availability during planned and unplanned disruptions is critical for systems that require high uptime. Pod Disruption Budgets (PDBs) allow you to manage pod availability during disruptions by limiting how many pods of an application can be disrupted at a time, keeping vital services running during node upgrades, scaling, or failures. In this article, we discuss the main components of PDBs, their creation, use, and benefits, and close with best practices for tuning them for high availability.

ScaleOps Pod Placement – Optimizing Unevictable Workloads

When managing large-scale Kubernetes clusters, efficient resource utilization is key to maintaining application performance while controlling costs. But certain workloads, deemed “unevictable,” can hinder this balance. These pods—restricted by Pod Disruption Budgets (PDBs), safe-to-evict annotations, or their role in core Kubernetes operations—are anchored to nodes, preventing the autoscaler from adjusting resources effectively. The result? Underutilized nodes that drive up costs and compromise scalability. In this blog post, we dive into how unevictable workloads challenge Kubernetes autoscaling and how ScaleOps’ optimized pod placement capabilities bring new efficiency to clusters through intelligent automation.
