HPA vs VPA: Understanding Kubernetes Autoscaling and Why It’s Not Enough in 2025

Raz Goldenberg

The decision between scaling out and scaling up isn’t just technical; it’s architectural. 

It defines your cost structure, performance boundaries, and how your team will spend their time: optimizing systems for cost and performance, or fighting fires. 

In Kubernetes, Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) approach scaling from opposite directions. Each targets a different layer of your infrastructure, yet many teams treat them as interchangeable switches rather than strategic choices. 

The result is predictable: reactive scaling, unstable performance, and rising costs.

Teams that understand the architectural trade-offs behind HPA and VPA build clusters that scale cleanly under pressure. Those that don’t end up tuning thresholds and chasing capacity issues after every release.

This post breaks down the real differences between HPA and VPA, when each makes sense, and why most production teams need a unified, autonomous, context-aware approach to scaling. 

What is Horizontal Pod Autoscaling (HPA)?

HPA is a core feature in Kubernetes that automatically adjusts the number of pod replicas for a Deployment, ReplicaSet, or StatefulSet in response to observed metrics (CPU, memory, or custom metrics). 

At its core, HPA is a simple control loop running inside the Kubernetes controller manager. It’s defined by a HorizontalPodAutoscaler resource, which uses the autoscaling/v2 API group.
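
To make that concrete, here is a minimal HorizontalPodAutoscaler manifest; the web-api Deployment name and the 60% CPU target are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:                  # the workload whose replica count HPA manages
    apiVersion: apps/v1
    kind: Deployment
    name: web-api                  # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # keep average CPU near 60% of requested CPU
```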

How does HPA work?

By default, the HPA control loop runs every 15 seconds and does the following: 

  1. Metric collection: HPA polls the configured metrics (e.g., CPU usage, memory, or custom application metrics) through the Kubernetes metrics APIs.
  2. Desired replica calculation: It computes the desired replica count as desiredReplicas = ceil(currentReplicas * (currentMetric / desiredTarget)). If utilization is above your target, it scales up; below the target, it scales down. For example, 4 replicas averaging 90% CPU against a 60% target yield ceil(4 * 90/60) = 6 replicas.
  3. Scaling action: If the computed desiredReplicas differs from the current replica count, HPA adjusts the /scale subresource of the target workload, triggering pod creation or termination.

What is Vertical Pod Autoscaling (VPA)?

Where HPA assumes your pods are correctly sized and adds more of them, Vertical Pod Autoscaler assumes you have the right number of pods and focuses on getting their individual resource allocation right. While HPA responds to demand spikes by scaling horizontally (more replicas), VPA responds to resource inefficiency by scaling vertically (bigger pods). HPA is reactive to traffic patterns; VPA is reactive to resource utilization patterns.

How does VPA work?

VPA continuously analyzes your pods’ actual resource consumption over time and compares it against their current allocations. When pods consistently exceed their requests or remain significantly under-provisioned, VPA modifies their resource specifications. When applied, these changes typically require pod recreation.
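
As a sketch, assuming the VPA CRDs from the kubernetes/autoscaler project are installed (VPA is not part of core Kubernetes), a typical VerticalPodAutoscaler object looks like this; the target name and bounds are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker       # hypothetical Deployment name
  updatePolicy:
    updateMode: "Auto"       # evict and recreate pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"   # apply to all containers in the pod
        minAllowed:          # illustrative guardrails on recommendations
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
```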

Key Differences in Scaling Focus

HPA and VPA optimize different dimensions of workload behavior.

HPA focuses on workload volume. It assumes pods are already right-sized and adjusts the number of replicas to meet demand. This makes it ideal for handling traffic bursts, queue backlogs, and other scenarios where “more hands make light work.”

VPA focuses on per-pod capacity. It assumes the replica count is correct and adjusts each pod’s CPU and memory requests to match actual usage. It’s best for eliminating waste from over-provisioned pods and preventing throttling in under-provisioned ones.

In simple terms:

  • HPA solves for throughput by adding more pods.
  • VPA solves for efficiency by resizing existing pods.

A web API experiencing unpredictable traffic benefits from HPA.

A machine learning service running models of varying complexity benefits from VPA.

This difference in scaling philosophy shapes everything else – response speed, disruption patterns, configuration complexity, and ideal use cases. Understanding which dimension your workload stresses most is the key to choosing the right autoscaling strategy.

|  | HPA (Horizontal) | VPA (Vertical) |
| --- | --- | --- |
| Scaling Focus | Adjusts replica count | Adjusts individual pod resources |
| Response Method | Creates new pods | Modifies existing pod specs |
| Disruption Impact | Non-disruptive (additive scaling) | Requires pod recreation (brief interruption) |
| Response Time | 2–4 minutes typical | Hours/days for recommendations |
| Complexity | Metric thresholds & scaling policies | Resource pattern analysis & eviction management |
| Best For | Traffic spikes, queue processing | Steady workloads with unclear sizing |
| Failure Mode | Scales broken pods horizontally | Slow to adapt to changing patterns |

When to Choose HPA vs VPA: Matching Strategy to Workload

Choosing between HPA and VPA starts with understanding what’s driving your scaling pressure: throughput or efficiency.

HPA is the right choice when your bottleneck is throughput. Your per-request needs are stable, but request volume fluctuates. HPA adds more replicas to handle higher load, making it ideal for workloads that scale linearly with traffic. 

Use HPA for workloads such as: 

  • Web APIs that experience unpredictable traffic spikes
  • Queue processors handling variable message backlogs  
  • Event-driven microservices that respond to bursts of user activity 
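
For demand-driven workloads like these, HPA isn’t limited to CPU: with an external metrics adapter installed, it can scale on signals such as queue depth. A sketch, where the queue-worker Deployment, the queue_messages_ready metric, and the target of 30 messages per replica are all assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker               # hypothetical queue-processing Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: queue_messages_ready # hypothetical metric exposed by a metrics adapter
        target:
          type: AverageValue
          averageValue: "30"         # aim for ~30 pending messages per replica
```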

VPA fits when your bottleneck is efficiency. Your traffic is steady, but resource usage per request changes over time. VPA adjusts CPU and memory allocations to keep pods right-sized, reducing waste and avoiding throttling.

Use VPA for workloads such as: 

  • Batch jobs with varying computational complexity 
  • Data pipelines with changing input sizes 
  • Memory-heavy services where overprovisioning is costly 
  • Applications where resource needs are hard to estimate upfront 

Limitations of HPA and VPA

Both HPA and VPA are powerful, but each comes with its own set of operational challenges that can catch teams off guard.

HPA Limitations 

HPA reacts to changing demands but has several hidden pitfalls: 

  • Cold-start delays: When scaling up from a low replica count, new pods need time to pull images, start, and pass readiness checks. This gap between detection and readiness can cause temporary latency spikes or request failures.
  • Averages hide outliers: HPA typically scales on mean CPU utilization. It can’t see P99 latency or per-request performance degradation, meaning a subset of users may still experience slow responses.
  • Memory scaling pitfalls: Scaling based on memory usage often breaks caching efficiency, evicting pods that held warm caches and introducing unnecessary cold starts.
  • Reactive by design: Because HPA operates on short-term averages, it can’t predict future load patterns and may oscillate between over- and under-provisioning.
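
One partial mitigation for this oscillation is the behavior field in the autoscaling/v2 API, which dampens how aggressively HPA scales down. A sketch of that fragment of an HPA spec, with illustrative values:

```yaml
# Fragment of a HorizontalPodAutoscaler spec (autoscaling/v2)
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react to spikes immediately
  scaleDown:
    stabilizationWindowSeconds: 300   # act on the highest recommendation from the last 5 minutes
    policies:
      - type: Percent
        value: 50                     # remove at most 50% of current replicas...
        periodSeconds: 60             # ...per one-minute period
```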

VPA Limitations

VPA’s approach to right-sizing pods is effective but comes with trade-offs:

  • Disruptive scaling: Applying new recommendations requires pod evictions and restarts, which can interrupt service or violate PodDisruptionBudgets (a recommendation-only workaround is sketched after this list).
  • Slow adaptation: VPA learns from historical data, not real-time signals. Its recommendations may lag behind sudden workload changes, making it unsuitable for bursty traffic.
  • Blind to runtime conditions: VPA doesn’t account for transient factors like throttling, noisy neighbors, or short-lived spikes.
  • Learning curve for new workloads: It takes time to gather enough data for accurate recommendations. During that period, resource allocation can remain suboptimal.
  • Operational fragility: The admission controller adds scheduling latency and, if misconfigured, can block new pod creation cluster-wide.
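
One way to defuse the disruption and fragility risks above is to start VPA in recommendation-only mode, where it analyzes usage but never evicts. A minimal sketch, reusing the hypothetical batch-worker Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker
  updatePolicy:
    updateMode: "Off"        # compute recommendations only; never restart pods
```

Recommendations then surface in the object’s status (visible via kubectl describe vpa) and can be applied manually during planned rollouts.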

Shared Limitations of HPA and VPA

At the end of the day, both HPA and VPA operate without context. They react to raw utilization metrics rather than understanding why those metrics change. Neither considers application behavior, cost constraints, or cluster-level conditions such as node availability, noisy neighbors, or network contention. This lack of context-awareness often leads to scaling decisions that fix symptoms instead of root causes.

They also share mechanical constraints. Both rely on stabilization windows to prevent thrashing, which delays legitimate scaling actions, and both interact badly with misconfigured PodDisruptionBudgets (PDBs): VPA’s evictions can be blocked outright, and overly strict PDBs can stall the node-level rescheduling that scaling events trigger.
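
To make that failure mode concrete, here is a minimal PodDisruptionBudget; the label selector and replica assumptions are illustrative. If the protected Deployment runs only two replicas, this budget blocks every voluntary eviction, including VPA’s:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2          # with exactly 2 replicas running, no voluntary eviction is allowed
  selector:
    matchLabels:
      app: web-api         # hypothetical pod label
```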

Can You Run HPA and VPA Together? 

Technically yes, but it introduces a feedback loop that can (and probably will) destabilize your clusters. 

Here’s what happens: 

  1. VPA increases a pod’s CPU request based on usage
  2. The higher request lowers CPU utilization percentage 
  3. HPA interprets this as “underutilization” and scales down replicas
  4. Fewer replicas drive utilization back up, causing VPA to raise requests again 

This “tug of war” leads to rapid fluctuation in both pod count and pod size. In production environments, that kind of instability isn’t an option. 

The key is recognizing that HPA and VPA were designed as separate solutions for different problems. Running both requires explicit orchestration and guardrails.
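
The most common guardrail is to keep the two controllers off the same signal: let HPA scale replicas on CPU or a custom metric, while VPA manages only memory. A sketch of the VPA side, with an illustrative Deployment name:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api                        # hypothetical Deployment also targeted by an HPA
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]  # VPA adjusts memory only; HPA owns the CPU signal
```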

Beyond HPA and VPA: The Unified Approach 

Autoscaling in Kubernetes isn’t just about how many pods you run or how big they are. It’s about managing both dimensions continuously and intelligently. 

That’s where ScaleOps comes in. 

ScaleOps unifies horizontal scaling (replica count) and vertical rightsizing (pod resources) into a single autonomous control plane.

Instead of managing two reactive controllers, ScaleOps continuously optimizes both based on real-time workload behavior, historical data, and live cluster conditions.

Learn more about our automated, real-time pod rightsizing and replica optimization features.

How it works:

  • Monitors pod and node resource utilization in real time.
  • Learns from workload patterns to decide when to scale out or up.
  • Applies in-place adjustments with zero downtime.
  • Coordinates with node autoscalers like Karpenter to maintain balance between performance and cost.

The result: maximized performance and reliability, minimal waste, and no manual tuning.

The Bottom Line: Reactive vs Autonomous Scaling

HPA and VPA each solve one side of the autoscaling problem. Both are reactive, both depend on manual tuning, and both operate blind to broader context. Modern production environments need autonomous, application context-aware scaling that continuously manages resources across both dimensions.

That’s the role of ScaleOps, turning reactive loops into proactive, intelligent resource management that delivers consistent performance and cost efficiency without manual effort.

Ready to see ScaleOps in action? Experience how ScaleOps optimizes and improves on both HPA and VPA.
