
HPA’s Three Architectural Flaws (And Why Your Autoscaling Keeps Failing)

Nic Vermandé

Main takeaways

  • HPA is reactive and delayed, scaling only after CPU spikes, causing latency during predictable traffic surges
  • HPA and VPA conflict when resource requests change, creating unstable replica counts and the Kubernetes “death spiral”
  • ScaleOps eliminates this instability with coordinated rightsizing and proactive replica optimization that pre-warms capacity and stabilizes scaling behavior

The Promise vs. Reality of HPA

HPA is the most deployed autoscaler in Kubernetes. It’s also architecturally limited in ways that matter for production workloads.

The design is straightforward: HPA monitors resource utilization, compares it against a target threshold, and adjusts replica counts accordingly. Set averageUtilization: 70, and HPA scales your deployment when CPU usage crosses that line.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

This works well for static workloads with predictable traffic and unchanging resource requests. The problems emerge when you move beyond that baseline, which every production cluster eventually does.

HPA has three architectural limitations that compound in production environments:

First, HPA is purely reactive.

It scales based on observed symptoms (elevated CPU) rather than anticipated demand. Traffic arrives, CPU rises, metrics-server collects (up to 60 seconds later), HPA detects the threshold breach, then initiates scaling. By the time new pods pass readiness probes, users have already experienced degraded latency. Predictable patterns — Monday morning spikes, lunch rush traffic, end-of-month processing — trigger the same reactive scramble every time, even though the pattern is entirely foreseeable from historical data.

Second, HPA’s percentage math breaks when requests change.

The averageUtilization target calculates against resources.requests.cpu, not actual node capacity. When you rightsize, whether manually, via VPA, or through optimization tooling, the denominator in HPA’s calculation shifts. A pod using 150m CPU with 200m requests shows 75% utilization. Drop requests to 100m (a reasonable optimization), and the same 150m usage becomes 150% utilization. HPA interprets this as an emergency and scales out aggressively, even though actual resource consumption hasn’t changed.

Third, VPA’s histograms get polluted when HPA scales.

VPA maintains per-container usage histograms to inform rightsizing recommendations. When HPA adds replicas, load spreads across more pods, and per-pod utilization drops. VPA’s histogram sees “lower usage” and recommends smaller requests. Smaller requests trigger the percentage math problem. The loop accelerates: this is the documented VPA/HPA “death spiral” that Kubernetes upstream explicitly warns against.

These three issues don’t exist in isolation. They feed each other: reactive scaling amplifies the percentage math problem during traffic spikes, histogram pollution degrades recommendations over time, and the combination creates oscillating, unpredictable autoscaler behavior.

Throughout this article, we’ll use TaxiMetrics as our demonstration workload—a representative microservices application that exhibits these scaling pathologies under realistic traffic patterns. We’ll examine each flaw in detail, then show how ScaleOps’ Replica Optimization addresses all three as a unified system.

Flaw #1: HPA is Always Late

HPA operates on a fundamental architectural constraint: it scales based on observed resource utilization, not anticipated demand. This reactive model means scaling decisions always lag behind the events that trigger them.

The Metrics Pipeline

Understanding why HPA is late requires tracing the metrics pipeline: traffic arrives, the kubelet samples container usage, metrics-server scrapes those samples (at a 15-60 second resolution, depending on configuration), the HPA controller evaluates on its sync loop (every 15 seconds by default), and only then are new pods scheduled, images pulled, and readiness probes passed.

The cumulative delay from traffic spike to available capacity ranges from 30 seconds (best case: images cached, fast startup) to 3+ minutes (cold nodes, large images, slow readiness probes). During this window, existing pods absorb the full load increase, often resulting in elevated latency, throttling, or dropped requests.
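The compounding of these stages can be sketched numerically. The per-stage latency ranges below are illustrative assumptions, not measured values; they are chosen only to show how individually small delays add up to the 30-second-to-3-minute window described above:

```python
# Assumed (best, worst) latency in seconds for each stage of the HPA
# metrics pipeline. Illustrative values only -- real numbers vary per cluster.
PIPELINE_STAGES = {
    "kubelet usage sampling":      (5, 15),
    "metrics-server scrape":       (15, 60),
    "HPA controller sync loop":    (1, 15),
    "pod scheduling + image pull": (5, 60),
    "container start + readiness": (4, 60),
}

def cumulative_delay(stages):
    """Sum best-case and worst-case latency across all pipeline stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

best, worst = cumulative_delay(PIPELINE_STAGES)
print(f"Traffic spike to usable capacity: {best}s best case, {worst}s worst case")
```

With these assumed figures the total runs from 30 seconds to 210 seconds, and during that entire window the existing replicas absorb the full load.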

Symptom-Based Scaling

The core issue is that HPA scales on symptoms rather than causes:

| Signal Type | Example | When HPA Sees It |
| --- | --- | --- |
| Cause | Traffic increases 3x | Never (HPA doesn’t monitor traffic) |
| Symptom | CPU utilization hits 85% | 30-60 seconds after traffic spike |

By the time CPU rises enough to trigger scaling, the traffic spike has already impacted user experience.

Predictable Patterns, Repeated Panic

Most production traffic patterns are predictable. Business applications exhibit clear seasonality:

| Pattern | Frequency | Data Available |
| --- | --- | --- |
| Monday morning spike | Weekly | Months of historical data |
| Lunch rush | Daily | Repeats every 24 hours |
| End-of-month processing | Monthly | Predictable to the day |
| Seasonal peaks (Black Friday, Holidays, etc.) | Annually | Years of historical data |

HPA ignores all of this. It has no mechanism to learn from historical patterns or pre-warm capacity before predictable demand. Every Monday morning is treated as a novel event, triggering the same reactive scaling, even when the pattern has repeated for years.

The Over-Provisioning Tax

Teams recognize that HPA’s reactive nature creates reliability risk, which leads them to compensate with one of these mechanisms:

| Strategy | Trade-off |
| --- | --- |
| Set minReplicas artificially high | Paying for idle capacity 24/7 |
| Over-provision resource requests | Wasted compute, poor bin-packing efficiency |
| Disable HPA, use fixed replica counts | No elasticity, always provisioned for peak |
| Accept latency degradation during scaling | SLA impact, degraded user experience |

Each approach trades cost for reliability (or accepts reliability degradation). None addresses the underlying architectural limitation.

With TaxiMetrics, we observe this pattern consistently: traffic spikes that are entirely predictable from historical data still trigger reactive scaling, with P99 latency spiking during the 30-90 second window before new pods become available.

Flaw #2: The Percentage Math Problem

HPA’s averageUtilization setting appears straightforward: set it to 70, and HPA maintains utilization around that level. The implementation details matter.

How HPA Calculates Utilization

The averageUtilization metric calculates against resources.requests.cpu, not node capacity or container limits:

utilization = (current CPU usage) / (CPU requests) × 100

This means the denominator in HPA’s calculation is whatever value exists in your pod’s resource requests at that moment.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: taximetrics-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: taximetrics-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Calculated against requests, not capacity

The Math Breakdown

Consider a pod with stable resource consumption:

| Metric | Before Rightsizing | After Rightsizing |
| --- | --- | --- |
| CPU Requests | 200m | 100m |
| Actual Usage | 150m | 150m (unchanged) |
| Utilization | 150 ÷ 200 = 75% | 150 ÷ 100 = 150% |
| HPA Response | Stable (within tolerance of the 70% target) | Scale out immediately |

The workload hasn’t changed: the amount of traffic is identical and the actual CPU consumption is the same 150 millicores. But HPA’s percentage calculation shifted because the denominator changed.

With a 70% target, HPA calculates desired replicas using:

desiredReplicas = ceil(currentReplicas × (currentUtilization / targetUtilization))

Before rightsizing: ceil(2 × (75 / 70)) = ceil(2.14) = 3 replicas

After rightsizing: ceil(2 × (150 / 70)) = ceil(4.28) = 5 replicas

Same workload, same traffic, but HPA now wants 5 replicas instead of 3.
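This arithmetic is easy to reproduce. Here is a minimal sketch of the calculation in Python (illustrative only: the real controller also applies a tolerance band of roughly 10% and stabilization windows, both omitted here):

```python
import math

def utilization_pct(usage_m, requests_m):
    """HPA utilization: current usage over resources.requests, as a percent."""
    return usage_m / requests_m * 100

def desired_replicas(current_replicas, current_util_pct, target_util_pct):
    """HPA's core formula: ceil(currentReplicas * currentUtil / targetUtil)."""
    return math.ceil(current_replicas * (current_util_pct / target_util_pct))

usage = 150  # millicores of actual consumption, unchanged by rightsizing

before = utilization_pct(usage, 200)  # 75.0% against 200m requests
after = utilization_pct(usage, 100)   # 150.0% against 100m requests

print(desired_replicas(2, before, 70))  # 3 replicas before rightsizing
print(desired_replicas(2, after, 70))   # 5 replicas after rightsizing
```

Only the denominator changed between the two calls, yet the desired replica count jumps from 3 to 5.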

When This Happens

This behavior triggers whenever resource requests change:

  • VPA in Auto mode: VPA applies new requests, pods restart, HPA recalculates against new denominator
  • Manual rightsizing: Engineering adjusts requests based on observed usage
  • Optimization tooling: Any system that modifies resources.requests

The severity depends on how significantly requests change. A workload over-provisioned at 500m requests but using 150m would show 30% utilization. Rightsizing to 200m requests jumps utilization to 75%. Rightsizing further to 100m jumps to 150%.

The Documented Anti-Pattern

Kubernetes documentation explicitly warns against running VPA and HPA on the same resource metric. The controllers lack coordination because VPA modifies requests based on historical usage, while HPA scales based on current utilization percentage. When VPA changes requests, HPA’s math changes underneath it.

The standard workaround is to choose one:

  • Use HPA for horizontal scaling, accept inaccurate resource requests
  • Use VPA for rightsizing, disable HPA or use custom metrics

Neither option provides both accurate resource requests and stable horizontal scaling.

Flaw #3: The Histogram Aggregation Problem

The previous two flaws (reactive scaling and percentage math) create problems independently. This third flaw compounds them into a feedback loop.

How VPA Builds Recommendations

VPA uses historical usage data to recommend resource requests. For CPU, it maintains a decaying histogram of usage samples per container. The recommendation algorithm analyzes this histogram to suggest requests that would satisfy a target percentile (typically P90 or P95) of observed usage.

The key architectural detail: VPA collects samples per container, not per workload. Each pod’s container contributes individual data points to the histogram.

The Pollution Mechanism

When HPA scales a deployment from 2 to 4 replicas, the total workload remains constant but distributes across more pods:

| State | Replicas | Total Load | Per-Pod Load |
| --- | --- | --- | --- |
| Before scale-out | 2 | 300m CPU | 150m each |
| After scale-out | 4 | 300m CPU | 75m each |

VPA’s histogram now receives samples showing 75m usage per container instead of 150m. Over time, these lower samples shift the histogram distribution downward.

VPA’s recommendation logic sees “historical usage is lower” and recommends reduced requests, even though the workload’s total resource consumption hasn’t changed.
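The drift can be illustrated with a plain list of usage samples standing in for VPA's histogram. This is a deliberate simplification (real VPA keeps exponentially decaying, bucketed histograms per container, and the helper names here are hypothetical), but a flat sample list is enough to show the percentile sinking once scale-out samples dominate:

```python
import statistics

TOTAL_LOAD_M = 300  # total workload CPU in millicores, constant throughout

def per_pod_samples(replicas, n_samples):
    """Each container reports roughly total_load / replicas millicores."""
    return [TOTAL_LOAD_M / replicas] * n_samples

def p90(samples):
    """A P90-style sizing recommendation over the sample history."""
    return statistics.quantiles(samples, n=10)[-1]

# Phase 1: 2 replicas, so each pod runs at 150m
history = per_pod_samples(replicas=2, n_samples=100)
print(f"P90 before scale-out: {p90(history):.0f}m")  # 150m

# Phase 2: HPA scales to 4 replicas; samples now arrive at 75m per pod
history += per_pod_samples(replicas=4, n_samples=1900)
print(f"P90 once scale-out samples dominate: {p90(history):.0f}m")  # 75m
```

The recommendation halves even though total consumption never moved, which is exactly the pollution described above.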

The Feedback Loop

This histogram pollution connects directly to Flaw #2 (percentage math):

Each iteration accelerates the next. VPA continuously recommends smaller requests because the histogram data is polluted by HPA’s scaling decisions. HPA continuously scales out because VPA’s request changes break the percentage math.

The Inverse Problem

The loop also operates in reverse during scale-down:

  • Traffic decreases, HPA scales from 4 → 2 replicas
  • Load concentrates: per-pod usage doubles
  • VPA histogram sees higher samples, recommends larger requests
  • Larger requests reduce utilization percentage
  • HPA sees low utilization, scales down further
  • Oscillation between over-provisioned and under-provisioned states

Why This Is Documented as an Anti-Pattern

Kubernetes VPA documentation explicitly warns against running VPA and HPA on the same CPU or memory metric. The controllers operate independently with no coordination mechanism:

  • VPA modifies requests based on per-container historical data
  • HPA modifies replica count based on current aggregate utilization
  • Neither is aware of the other’s actions or their downstream effects

The standard guidance is to use HPA with custom or external metrics (queue depth, requests per second) while VPA manages resource requests. This separation prevents the feedback loop but requires additional metrics infrastructure.

ScaleOps Replica Optimization: The Unified Solution

The three flaws described above share a root cause: HPA and VPA operate independently with no coordination mechanism. ScaleOps addresses this through two integrated capabilities: Rightsizing (continuous resource request optimization) and Replica Optimization (horizontal scaling).

Flaw #1 Solved: Proactive Scaling

Standard HPA reacts to observed metrics. ScaleOps Replica Optimization takes a different approach: it replaces the static, manually-set minReplicas with a data-driven, continuously-updated value — lower when you’re over-provisioned, higher when a spike is coming.

| Scenario | Vanilla HPA | With Replica Optimization |
| --- | --- | --- |
| Monday morning spike | Scales after CPU rises, 30s-3min delay | minReplicas pre-warmed before spike |
| Daily lunch rush | Same reactive pattern every day | Pattern detected, capacity ready |
| End-of-month processing | Treats predictable spike as surprise | Historical baseline informs scaling floor |

When seasonality patterns are detected, Replica Optimization adjusts minReplicas before traffic arrives. Pods are running and ready when users show up.

Replicas remain stable after a Rightsizing event:

Flaw #2 Solved: Stable Scaling Through Rightsizing

When ScaleOps Rightsizing adjusts resource requests, Replica Optimization maintains consistent scaling behavior.

| Scenario | Vanilla HPA | With Replica Optimization |
| --- | --- | --- |
| Rightsizing reduces requests 200m → 100m | Utilization spikes to 150%, panic scaling | Scaling behavior unchanged |
| Manual request tuning | Unpredictable replica fluctuations | Consistent scaling response |
| Continuous optimization | Restart storms, oscillation | Requests and replicas managed in coordination |

The scaling intent — “scale when the workload needs more capacity” — remains stable regardless of what values exist in resources.requests. Request changes don’t trigger spurious scaling events.

Flaw #3 Solved: Accurate Recommendations Despite Scaling

ScaleOps Rightsizing analyzes resource consumption at the workload level, not per-container.

| Scenario | Vanilla VPA | With ScaleOps Rightsizing |
| --- | --- | --- |
| HPA scales 2 → 8 replicas | Histogram sees 75% less per-pod usage, recommends smaller requests | Recommendations stay stable |
| Traffic spike + scale out | Histogram skewed by temporary pod distribution | Workload profile stays accurate |
| Scale down after peak | Oscillating recommendations | Consistent sizing through all phases |

Replica count changes don’t pollute the data used for rightsizing recommendations. Whether the workload runs on 2 pods or 20, the recommendation reflects actual resource requirements.

The Unified System

These capabilities work together rather than as independent fixes:

  • Rightsizing optimizes resource requests based on actual workload behavior
  • Replica Optimization maintains stable horizontal scaling regardless of request values
  • Seasonality detection pre-warms capacity before predictable demand

This means resource requests reflect actual usage, horizontal scaling responds to real capacity needs, and predictable traffic patterns don’t cause repeated reactive scrambles.

TaxiMetrics: Before and After

Applying ScaleOps to the TaxiMetrics deployment demonstrates the difference:

Before (vanilla HPA + VPA):

  • Rightsizing event triggers replica spike from 3 → 7
  • VPA recommendations oscillate as HPA scales
  • Scheduled batch causes 45-second latency degradation while pods start

After (ScaleOps Rightsizing + Replica Optimization):

  • Rightsizing event: replica count unchanged
  • Recommendations stable through scaling events
  • Capacity pre-warmed for scheduled batch, no latency impact

Comparison of replicas, CPU requests and latency before and after ScaleOps:

Predictive Replica Optimization with ScaleOps:

Practical Implementation

Migration Path

ScaleOps is production-grade and scale-ready from day one. The typical adoption path:

Phase 1: Read-Only Mode

Enable both Rightsizing and Replica Optimization in read-only mode. No changes are applied to workloads: ScaleOps observes, analyzes, and generates optimization opportunities.

This phase reveals the gaps in current autoscaling (vertical and horizontal) behavior:

| Metric | What You’ll See |
| --- | --- |
| Recommended vs. actual requests | How far current requests are from optimal |
| Recommended vs. actual minReplicas | The seasonality gap: where Replica Optimization would pre-warm capacity |
| Predicted scaling events | When ScaleOps would have adjusted before traffic arrived |

The seasonality gap is particularly valuable: you can observe the time difference between when Replica Optimization would adjust minReplicas versus when your current HPA actually reacts. In automated mode, these align. But in read-only mode, the gap quantifies the latency impact you’re currently absorbing.

Phase 2: Automate

Once you’ve validated the recommendations match observed workload behavior, enable automation with one click. Rightsizing and Replica Optimization begin applying changes, and the gaps close.

No phased rollout required. No canary deployments of autoscaler configurations. The same system that generated accurate recommendations in read-only mode now applies them.

What’s Next: The Metrics Latency Problem

ScaleOps Replica Optimization addresses the death spiral: stable scaling through rightsizing, accurate recommendations despite replica changes, and proactive capacity management for predictable patterns.

But HPA has another architectural limitation that exists regardless of how you configure it: metrics latency.

The Metrics Pipeline

HPA relies on metrics-server for resource utilization data. The pipeline introduces cumulative delay: the kubelet samples usage, metrics-server scrapes on its resolution interval, and the HPA controller acts on its own sync loop, so every scaling decision is based on data that is already tens of seconds old.

The Limitation

For workloads where CPU and memory are accurate proxies for load, this pipeline works. For workloads where they aren’t (queue processors, latency-sensitive APIs, batch jobs with variable resource profiles), HPA scales on lagging indicators that don’t reflect actual capacity needs.

| Workload Type | Useful Scaling Metric | Available via metrics-server |
| --- | --- | --- |
| API service | Request latency, error rate | No |
| Queue processor | Queue depth, processing rate | No |
| CPU-bound batch | CPU throttling ratio | No |
| Memory-sensitive | Memory pressure, eviction signals | No |

KEDA: A Different Approach

KEDA (Kubernetes Event-Driven Autoscaler) addresses this by querying metrics sources directly (Prometheus, cloud provider APIs, message queues) with a configurable polling interval: 30 seconds by default, tunable lower.

# KEDA ScaledObject example
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: taximetrics-api
spec:
  scaleTargetRef:
    name: taximetrics-api
  pollingInterval: 15
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090  # your Prometheus endpoint
      query: sum(rate(http_requests_total{app="taximetrics"}[1m]))
      threshold: "100"

No metrics-server in the path. No 60-second staleness. Scale on the metrics that actually indicate whether your application needs more capacity.

The Metrics Question

This points to a broader architectural question that the Kubernetes community is still working through: what should autoscaling actually respond to?

CPU and memory are lagging indicators. By the time CPU spikes, the queue is already backing up. By the time memory pressure appears, the OOM killer is already circling. You’re scaling based on symptoms, not causes — driving by looking in the rearview mirror.

Leading indicators tell a different story:

| Lagging (HPA default) | Leading (requires custom metrics) |
| --- | --- |
| CPU utilization % | CPU throttling ratio |
| Memory usage | Memory pressure, eviction signals |
| Queue depth | Queue depth + growth rate |
| Average latency | P99 latency, error rate |

The metrics-server design is a deliberate trade-off. Kubernetes chose simplicity and universal compatibility over metric richness. Every cluster has CPU and memory. Not every cluster has Prometheus, or application-level instrumentation, or the operational maturity to define meaningful custom metrics.

But if you’re running production workloads at scale, you’ve probably already crossed that threshold. The question becomes: are you using metrics that actually predict capacity needs, or just reacting to resource exhaustion?

KEDA opens the door to leading metrics. ScaleOps Replica Optimization adds the intelligence layer with pattern detection, seasonality, and proactive scaling. But the fundamental shift is the same: moving from “scale when it hurts” to “scale before it matters.”

Next in This Series

This article covered HPA’s architectural limitations when combined with rightsizing — reactive scaling, percentage math instability, and histogram pollution — and how ScaleOps Rightsizing and Replica Optimization address them as a unified system.

But for workloads where CPU and memory aren’t accurate proxies for capacity needs, there’s a deeper question: should you be using HPA at all?

The next article explores KEDA in depth:

  • KEDA vs HPA + Prometheus Adapter — architecture, complexity, and failure modes
  • When to use which — decision framework by workload type
  • How ScaleOps integrates with event-driven scaling — combining KEDA with intelligent rightsizing

If your scaling decisions depend on queue depth, request latency, or custom application metrics, that’s the one to read.

Ready to see ScaleOps in action? Experience how Rightsizing and Replica Optimization eliminate the HPA/VPA death spiral — with read-only mode to validate before you automate.