This comprehensive guide explores Kubernetes HPA (Horizontal Pod Autoscaler) through real-world examples, custom metrics implementation, and battle-tested best practices. You’ll learn how to configure HPA effectively, troubleshoot common scaling issues, and bridge the gap between autoscaling theory and production reality.
The foundational promise of the cloud was always elasticity: a world where compute resources expanded and contracted perfectly to meet demand, like a living organism. For years, this magic was the domain of Platform-as-a-Service (PaaS) offerings, which hid the immense complexity of scaling behind a simple `git push`. They provided the dream, but at the cost of control.
Kubernetes changed the game. By offering a universal, open control plane (a true cloud operating system) it gave that control back to engineers. But in doing so, it also exposed the raw, complex machinery of elasticity that the PaaS model had so carefully abstracted away.
We now have direct access to the levers of scalability, but we’ve also inherited the responsibility of understanding their nuances. The core abstraction of our modern systems is the container, a lightweight, isolated process. But these processes are not static; they are ephemeral, volatile, and constantly changing in their resource needs based on user traffic, new code deployments, and shifting data patterns.
This creates a fundamental tension: how do you manage a dynamic, ever-changing reality with a declarative, state-based system?
Nowhere is this tension more visible than in Kubernetes’ most common and vital tool for elasticity: the Horizontal Pod Autoscaler (HPA). This article is a deep dive into that tension, complete with practical examples, custom metrics configurations, and production-ready best practices. We will explore not just what HPA is, but why its elegant, declarative model often breaks down when faced with the chaotic reality of production workloads. This is the field guide to the hard-won lessons that separate the textbook examples from a truly elastic, resilient, and cost-efficient system.
The Basics You’ll See Everywhere (But Still Matter)
Before diving into the gritty, production-learned truths, let’s quickly revisit the basics of HPA. Understanding its core mechanics is a prerequisite for seeing why it breaks down under real-world load.
The Kubernetes Horizontal Pod Autoscaler is a built-in controller whose job is to automatically adjust the number of running pods for a workload (such as a Deployment or StatefulSet). It is defined by a `HorizontalPodAutoscaler` resource, which uses the `autoscaling/v2` API group.
How HPA Works: The Control Loop
At its heart, HPA is a simple control loop that runs inside the Kubernetes Controller Manager. By default, every 15 seconds, it performs these steps:
- Get Metrics: HPA queries a set of Kubernetes APIs to fetch the metrics you’ve configured (e.g., CPU, memory, or custom metrics).
- Calculate Desired Replicas: It compares the current metric value to your desired target and calculates the ideal number of replicas using the formula `desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]`.
- Update Scale Subresource: If the desired replica count is different from the current count, HPA updates the `/scale` subresource of the target workload (e.g., the Deployment). This triggers the workload's controller to create or terminate pods. (A minimal example of the resource this loop reconciles follows this list.)
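To make the loop concrete, here is a minimal sketch of the kind of HPA object it reconciles; the Deployment name `web-api` and the 60% CPU target are assumptions for illustration. With this target, 3 replicas observed at 90% average utilization would be scaled to ceil(3 * 90 / 60) = 5 replicas.

```yaml
# Minimal autoscaling/v2 HPA; "web-api" is a hypothetical Deployment name
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # keep fleet-average CPU near 60% of requests
```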
HPA’s Critical Dependency: The Metrics APIs
HPA doesn’t collect metrics itself. It relies on a set of aggregated APIs that must be running in the cluster:
- For Resource Metrics (`metrics.k8s.io`): HPA's most basic requirement is the Metrics Server. This lightweight component scrapes CPU and memory stats from the kubelet on each node (which gets its data from cAdvisor) and exposes them through this core API. In most managed Kubernetes offerings, this is installed by default. In custom environments, it's typically one of the very first things a DevOps team adds to the cluster.
- For Pod & Custom Metrics (`custom.metrics.k8s.io`): To scale on more advanced, application-specific metrics, you need an "adapter" that provides this API. HPA automatically detects and uses this API as soon as it is registered in the cluster. The most common provider is the Prometheus Adapter, which translates PromQL queries into a format the HPA understands. This gives you access to powerful, business-relevant scaling (see the example manifest just after this list). Common use cases include:
  - Scaling a web server based on the number of HTTP requests per second (`requests_per_second`).
  - Scaling a worker pool based on the number of jobs in a processing queue (`jobs_in_queue`).
  - Scaling a streaming application based on consumer lag (`kafka_consumer_lag_seconds`).
- For External Metrics (`external.metrics.k8s.io`): To scale on metrics from outside the cluster (like an SQS queue length), you need another adapter. The open-source tool KEDA is the de facto standard for this, providing an easy way to scale on dozens of external systems.
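For contrast with the KEDA approach discussed next, here is a sketch of what scaling on a custom pod metric looks like with a plain HPA, assuming an adapter (such as the Prometheus Adapter) already exposes a `requests_per_second` metric for the pods of a hypothetical `web-service` Deployment:

```yaml
# Illustrative only: requires an adapter serving custom.metrics.k8s.io for this metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # aim for ~100 requests/second per pod
```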
Why KEDA is the Platform Engineer’s Choice
You might ask: “If my metric is in Prometheus, should I use the Prometheus Adapter or KEDA?” For the vast majority of use cases, the answer from the trenches is clear: start with KEDA.
The Prometheus Adapter is a powerful but low-level tool. It requires a complex, centralized `ConfigMap` to define your metric discovery rules, which can become an operational bottleneck.
KEDA is a complete, purpose-built autoscaling framework. It not only provides a simpler way to scale on Prometheus metrics via its own CRD (`ScaledObject`), but it also offers 60+ other built-in scalers for everything from Kafka to cloud provider queues. Most importantly, KEDA offers a killer feature the HPA ecosystem lacks natively: scale-to-zero.
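To show what that looks like in practice, here is a hedged sketch of a scale-to-zero configuration using KEDA's Kafka scaler; every name (Deployment, topic, consumer group, broker address) is a placeholder:

```yaml
# Sketch: scale a queue worker down to zero when there is no consumer lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker-scaler
spec:
  scaleTargetRef:
    name: orders-worker        # hypothetical worker Deployment
  minReplicaCount: 0           # scale-to-zero when the trigger is inactive
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.svc:9092
      consumerGroup: orders-worker
      topic: orders
      lagThreshold: "50"       # roughly one replica per 50 messages of lag
```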
That’s the textbook version. Now let’s move past the basics into the uncomfortable truths.
The 30-Second Truth:
- HPA takes 2-4 minutes to respond (your users have already left).
- It scales broken pods horizontally (2× broken = 2× problems).
- It sees averages, not P99 latency (10% of your users suffer invisibly).
- Memory scaling is a trap that kills warm caches and hurts performance.
- You’re probably wasting a lot of your budget on “buffer pod” insurance.
Let’s unpack each of these uncomfortable truths, one by one.
The 2-4 Minute Brownout Window
HPA’s first and most fundamental flaw is that it is not instantaneous. It’s a reactive system at the very end of a surprisingly long and lossy data pipeline. Before HPA can scale your application, a metric must travel through a multi-stage journey, with each step adding delays and abstractions that hide the truth of what’s happening in your application right now.
For every scaling event, your application and your users have to survive the “brownout window”: the critical minutes between when a problem starts and when new capacity is actually online and helping.
Here is the timeline that every engineer learns the hard way:
- T+0s: The traffic spike hits. Your current pods are now overloaded. Latency is climbing.
- T+15-30s: The Metrics Server finally scrapes the kubelets and its internal cache reflects the high CPU usage.
- T+30-45s: The HPA controller polls the Metrics Server, sees the problem, and makes a decision to scale.
- T+45-90s: The scheduler finds nodes for the new pods, and the kubelets begin pulling container images.
- T+90-180s (and beyond): The application itself starts up. For complex services, especially those on the JVM, this means class loading, JIT compilation, and hydrating caches.
Your P99 latency during those first 2-3 minutes is your actual SLA, not what HPA eventually delivers once the fire is already raging.
The latency in the pipeline is only half the problem. The other, more insidious issue is that the very data HPA receives is often a lie, or at least, a dangerous oversimplification.
You might think you’re monitoring “current CPU usage,” but you’re not. At the lowest level, the Linux kernel exposes a pod’s CPU time as a simple, ever-increasing number called a cumulative counter (`container_cpu_usage_seconds_total`). It’s like the odometer on a car: it only ever goes up.
To turn this into a useful “utilization” metric, a system like the Metrics Server has to take two snapshots in time and calculate the `rate()` of change. This is where the danger creeps in. By calculating a rate over a window (e.g., 60 seconds), you are inherently creating an average.
And averages hide disasters.
How a Spike Disappears in the Average
Let’s imagine a critical 60-second window for an e-commerce checkout service:
- For the first 50 seconds, usage is a calm 50%.
- For the final 10 seconds, a catastrophic spike drives usage to 300%.
The math for the 60-second average is: (50 seconds * 50% + 10 seconds * 300%) / 60 seconds = 91.6%.
HPA sees a high-but-not-critical 91.6%, while your application was on fire.
The nature of the metric measurement has smoothed away the disaster and rendered your autoscaler blind to the real user pain. It’s a fundamental architectural flaw that plays out in production clusters every single day.
How to Survive the Brownout Window: Hard-Won Fixes
The good news is that SREs have developed battle-tested patterns to mitigate these latency flaws. These are not solutions, but workarounds, insurance policies you take out against a reactive system.
- The Buffer Pod (N+2) Strategy: The most common strategy is to simply run more pods than your baseline requires. If you need N pods for average load, you run N+2. One pod acts as a buffer for unexpected spikes, giving the HPA time to react. The second provides redundancy in case of a node failure or pod crash.
You should think of this not as waste, but as availability insurance. You are paying a constant premium in idle capacity to buy yourself the 2-4 minutes you need for the HPA to catch up.
- Pre-warming with CronJobs: For predictable traffic spikes, like the 9 AM morning rush or a planned marketing email, the simplest solution is to tell Kubernetes about them ahead of time. A CronJob can proactively scale up your deployment’s replicas just before the event.
```yaml
# Crude but effective: predictive scaling without the ML.
# Note: The ServiceAccount for this CronJob needs a Role/RoleBinding
# with permission to "patch" the target deployment's "scale" subresource.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: myapp-scale-up-for-peak
spec:
  schedule: "0 8 * * 1-5" # 8 AM on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-scaler-sa
          containers:
          - name: kubectl-scaler
            image: bitnami/kubectl:latest
            command: ["kubectl", "scale", "deployment/myapp", "--replicas=20"]
          restartPolicy: OnFailure
```
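For completeness, here is a sketch of the RBAC that the comment in the manifest above alludes to. It reuses the `deployment-scaler-sa` ServiceAccount and `myapp` Deployment names from the example; the exact verbs may need adjusting for your kubectl version.

```yaml
# Minimal Role/RoleBinding so the CronJob's ServiceAccount can scale the Deployment
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
subjects:
- kind: ServiceAccount
  name: deployment-scaler-sa
  namespace: default           # assumption: adjust to the CronJob's namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployment-scaler
```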
These tactics are essential stopgaps for managing HPA’s reactive delays. But they are only treating the symptoms. Next, let’s look at a more fundamental problem: what happens when the pods HPA is scaling are wrong to begin with?
The “More Broken Pods” Problem
The initial scaling delay is a painful but understandable problem. The second flaw of the HPA is far more deceptive: it assumes the pod template it’s scaling is correct. HPA is a multiplier of your deployment’s quality: if your pod spec is flawed, HPA will happily and efficiently multiply that flaw across your cluster.
How does a team end up with flawed, undersized pods in the first place? It’s a story every SRE knows well. It starts with a best-guess `requests` value set by a developer months ago. Over time, reality diverges. A new feature adds a computationally expensive API call. A code refactor changes the memory access pattern. User traffic slowly creeps up. The original `requests` value becomes a dangerous lie, a piece of technical debt baked into your deployment manifest.
This is one of the most expensive lessons a team can learn. Let’s say your pods are now constantly CPU throttled because their `requests` are set to 100m, but the application actually needs 200m to handle the current load. Here’s what happens:
- HPA sees the existing pods running at 100% of their requested CPU (because they are throttled at their limit).
- It dutifully scales the deployment from 3 to 6 pods to try and bring the average utilization down.
- You now have 6 pods all throttling at 100m instead of 3.
Your cloud bill has just doubled, but your performance problem hasn’t improved at all. In fact, it’s often worse, as you now have twice as many pods fighting for resources and creating scheduling pressure, and a much larger blast radius to troubleshoot when you’re debugging.
How to Fix the “More Broken Pods” Problem
This is the point in the journey where many engineers turn to the Vertical Pod Autoscaler (VPA). In theory, VPA is the perfect solution to this problem. Its entire purpose is to analyze a pod’s historical usage and recommend the correct `requests` values, solving the “original sin” of manual, best-guess sizing.
However, as we’ve detailed in our VPA deep dive, VPA has its own set of dangerous flaws. Its recommendations are slow to adapt, its `Auto` mode is too disruptive for most production workloads, and it is blind to real-time events like throttling.
Because VPA alone can’t be trusted to automate this safely, seasoned engineers fall back on a manual pre-flight check. Before ever enabling HPA on a workload, they use VPA in “recommendation-only” mode and combine it with direct observation to validate the pod’s vertical sizing. This manual diligence is the only way to prevent HPA from amplifying a hidden flaw like this one:
```yaml
# A ticking time bomb for HPA
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 512Mi
```
This configuration creates a pod with a “Burstable” Quality of Service class. On paper, it looks efficient: you’re only reserving a small amount of CPU and memory (`requests`), which allows for dense packing of your nodes, but you’re allowing the pod to burst up to a higher `limit` if needed.
The problem is that this spec is a form of technical debt. It’s a static hypothesis about workload behavior. When a new feature is deployed that increases the average CPU need to 150m and the burst to 250m, this manifest becomes a ticking time bomb.
By default, HPA sees utilization as a percentage of the `request`. When the pod’s usage hits 100m, the HPA sees 100% utilization and correctly decides to scale up. But it scales up by creating perfect clones of this broken pod, each with the same 200m performance ceiling. The “explosion” happens during the first real traffic spike: the application tries to burst to 250m, but every single pod in the deployment hits the hard 200m limit and begins to throttle. HPA’s reaction adds more pods that still throttle, leaving the problem unsolved.
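Under the numbers assumed in this scenario (roughly 150m average and 250m bursts), a rightsized spec might look like the sketch below. The values are illustrative; in practice they should come from VPA recommendations or load-test profiling, not copied blindly.

```yaml
# Rightsized for the scenario above (illustrative values only)
resources:
  requests:
    cpu: 150m      # matches the new average need
    memory: 256Mi
  limits:
    cpu: 300m      # headroom above the observed 250m bursts
    memory: 512Mi
```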
Here’s a pre-flight check to avoid these issues:
- CPU Throttling: Dive into your monitoring platform and inspect `container_cpu_cfs_throttled_seconds_total`. This metric is a cumulative counter of the total time a container was ready to run but was denied CPU time by the kernel because it had hit its limit. An increasing `rate()` on this metric is the single most important signal of CPU starvation. A healthy pod ready for horizontal scaling should have this at or near zero. (A sample alerting rule for this and the next signal follows this list.)
- Memory Headroom: Watch for pods whose `container_memory_working_set_bytes` is consistently hugging the memory `limit`. This specific metric represents the memory that the application is actively using and cannot be easily freed. If it’s too close to the limit, you risk OOMKills. Scaling memory-starved pods just multiplies crash loops, where each restart means a cold cache and a painful latency hit for your users.
- Node Contention: Before you double the number of pods, check if your nodes can even handle them. Use `kubectl top nodes` to see if your cluster itself is resource-constrained. If you add more pods than your nodes have capacity for, you’ll trigger Cluster Autoscaler (or Karpenter), which can take several minutes to provision new nodes, adding even more delay to your scaling event.
- Steady-State Validation: Don’t just look at idle pods. Run a controlled load test against a single replica to confirm that your pod spec can actually sustain the target throughput you expect. Scaling pods that are already unhealthy under load is like pouring water into a leaking bucket.
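To keep a standing eye on the first two signals in this checklist, one option is an alerting rule. The sketch below assumes you run the prometheus-operator and its `PrometheusRule` CRD; the thresholds are placeholders to tune per workload.

```yaml
# Sketch: alert on sustained CPU throttling and memory pressure (prometheus-operator assumed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-preflight-signals
spec:
  groups:
  - name: hpa-preflight
    rules:
    - alert: ContainerCPUThrottling
      expr: rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m]) > 0.1
      for: 10m
      annotations:
        summary: "Container is being CPU throttled; fix requests/limits before relying on HPA"
    - alert: ContainerMemoryNearLimit
      expr: |
        container_memory_working_set_bytes{container!=""}
          / on(namespace, pod, container) container_spec_memory_limit_bytes{container!=""} > 0.9
      for: 10m
      annotations:
        summary: "Working set is above 90% of the memory limit; OOMKill risk"
```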
The “Average is a Lie” Blind Spot
So, you’ve survived the brownout window with buffer pods, and you’ve diligently rightsized your pods to prevent HPA from scaling a broken template. You should be safe now, right?
This is where the third and most subtle flaw of the HPA emerges: its fundamental reliance on averages. HPA is architecturally designed to optimize for the health of the fleet, but your users don’t experience the fleet. They experience a single pod. And when one of those pods is having a bad day, HPA is completely blind to the user pain it’s causing.
Why Averages Betray Your Users: Living in the Tail
Your service’s health isn’t defined by its average performance. It’s defined by its worst-case performance: the P99 and P99.9 latencies. These tail latencies are what your most engaged (and often most valuable) users experience. HPA, by optimizing for the fleet-wide average CPU, is fundamentally misaligned with this reality.
This problem is made worse by the uneven nature of load distribution in Kubernetes. Factors like `kube-proxy` routing, sticky sessions, or even DNS caching can create “hot pods” that receive a disproportionate amount of traffic or computationally expensive requests. The headroom you see in the fleet average is an illusion. It’s not evenly shareable, and it does nothing to help the one pod that’s on fire.
Your monitoring dashboard might look perfectly healthy. HPA, querying the metrics-server for the average CPU utilization across all pods, sees a calm, stable number well below its target.
CPU Average: 40% ✅ "Healthy"
P50 Latency: 50ms ✅ "Great"
P99 Latency: 5000ms ❗️ "ON FIRE"
This dashboard isn’t a bug, it’s a feature of averages. If you have ten pods, and nine are idle at 10% CPU while one is stuck in a hot loop at 100%, the fleet-wide average is a perfectly acceptable 19%. HPA sees no reason to act. Meanwhile, 10% of your users are being routed to that one “hot pod” and are experiencing 5-second page loads or outright timeouts. HPA reports that the system is green while your P99 latency is burning and your support channels are lighting up.
Scaling on What Matters
The only way to solve the “hot pod” problem is to stop scaling on lagging infrastructure metrics like average CPU. Instead, you must scale on leading, business-relevant metrics that are closer to the actual user experience.
If your P99 latency is caused by a backlog of work, scale on the queue length. KEDA makes this easy. Instead of a simple CPU-based HPA, a production-grade scaling configuration for a web service would look like this:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-service-scaler
spec:
  scaleTargetRef:
    name: web-service-deployment
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
  # Trigger 1: Scale on the number of active requests (a business metric)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.svc:9090
      metricName: http_requests_per_second
      query: 'sum(rate(http_requests_total{deployment="web-service"}[2m]))'
      threshold: '100' # Start scaling when RPS per pod exceeds 100
  # Trigger 2: A CPU backstop for unexpected issues
  - type: cpu
    metricType: Utilization
    metadata:
      value: "80"
```
This configuration is smarter. It primarily scales on the metric that actually impacts users (RPS), and only uses CPU as a secondary safety net.
💡 Quick Win: The One Change That Prevents 50% of HPA Disasters
If you do nothing else after reading this, add stabilization windows to your HPA’s `behavior` spec. This single change will eliminate “flapping” (scaling up and down erratically) and prevent premature pod kills during choppy traffic.
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60   # Don't panic-scale on transient spikes
  scaleDown:
    stabilizationWindowSeconds: 300  # Don't kill pods for at least 5 mins
```
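If you want to go one step further, the same `behavior` block also accepts rate-limiting policies. The values below are illustrative starting points, not universal recommendations:

```yaml
# Stabilization windows plus rate-limiting policies (illustrative values)
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Percent
      value: 100        # at most double the replica count...
      periodSeconds: 60 # ...per minute
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Pods
      value: 2          # remove at most 2 pods per minute
      periodSeconds: 60
```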
The Memory Scaling Trap
To understand why scaling on memory is dangerous, and why it’s a fundamentally different problem than scaling on CPU, we have to look at how Kubernetes and the underlying Linux kernel treat different types of resources. This is a key distinction that trips up even experienced teams.
- CPU is a preemptive and compressible resource. “Preemptive” means the kernel’s scheduler can interrupt, or preempt, a process at any time to give another process a turn on the CPU. “Compressible” means that when there is contention, the kernel can simply give each process smaller time slices. The application slows down and latency rises, but it doesn’t necessarily crash. This ability to “compress” time makes CPU a plausible (though flawed) metric for a reactive scaler.
- Memory is a non-preemptive and incompressible resource. A process requests a block of memory, and the kernel grants it. It cannot be taken back or given to another process until it’s freed. “Incompressible” means that if a process tries to allocate more memory than its limit, there is nothing to “compress.” The kernel has no choice but to kill the process instantly with an Out Of Memory (OOM) signal. (Pods can also be evicted by the kubelet when a node is under memory pressure, but the effect for your users is the same: in-flight work disappears.) An OOMKill doesn’t just reset the pod: it vaporizes in-flight requests, drops connection pools, and destroys warm caches. Every time a warm pod gets killed, the replacement starts cold, multiplying cache misses cluster-wide and dragging down tail latency for everyone.
This distinction is why the HPA’s reactive model fails with memory. By the time the Metrics Server even reports a memory spike, the pod may have already been OOMKilled, making memory a fundamentally “too-late” signal.
The trap is baited by another common misconception: that high memory usage is always a sign of distress. For many modern applications with managed runtimes (like the JVM or Go), high memory usage can simply be normal garbage collection behavior or a sign of a healthy, warm cache, not a signal of actual memory pressure. For a JVM, high memory may even be a good thing, indicating fewer expensive GC cycles. This isn’t just a Java problem. Node.js and Python runtimes can exhibit similar behavior due to GC patterns and memory fragmentation. When you tell a reactive, average-based tool like HPA to scale down on memory, you are effectively punishing your best-performing pods.
To give you some color, here’s a plausible scenario:
A team spent nine frantic days chasing what they believed was a memory leak in a critical Java service. Pods would start up, their memory usage would climb to ~80% of the limit as they built their caches, and then stay there. HPA, configured to scale down if memory dropped below 50%, saw these healthy pods as “over-provisioned.”
When a new pod was added during a small CPU-based scale-up, traffic would rebalance slightly, and the memory usage of one of the original, warm pods would dip. HPA would see this dip, declare the pod “underutilized,” and kill it.
The result was a constant, vicious cycle: a pod would finally become performant with a warm cache, only to be targeted for termination by the very autoscaler that was supposed to ensure stability. The team wasn’t hunting a leak, they were witnessing HPA systematically destroying their application’s performance.
How to Fix the Memory Trap
Avoid memory as a primary HPA metric. For most workloads, it is too laggy and too destructive. It only makes sense in rare, deliberately memory-bound systems such as in-memory databases (Redis), certain streaming workloads, or ML inference with fixed model sizes. Even then, it requires careful tuning.
Think of memory as a sizing problem, not a scaling signal. Get it right once (with VPA in recommendation mode or manual profiling), give it enough headroom to stay stable, and then let HPA react to the real pressure points like CPU or custom metrics.
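The “get it right once” step can be partially automated. Here is a minimal sketch of VPA in recommendation-only mode, assuming the VPA CRDs are installed and a hypothetical Deployment named `web-service`; it surfaces suggested `requests` without ever evicting pods:

```yaml
# VPA in recommendation-only mode: read the suggestions, apply them yourself
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  updatePolicy:
    updateMode: "Off"   # never evict or mutate pods; only publish recommendations
```

Read the result with `kubectl describe vpa web-service-vpa` and fold the recommendation into your manifest as part of a normal deploy, rather than letting any controller apply it live.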
From Reactive Firefighting to Proactive Optimization
The hard-won fixes we’ve discussed (buffer pods, pre-warming CronJobs, and custom metrics) are the marks of a seasoned engineer. They are essential tactics for surviving the limitations of a purely reactive HPA. But they are fundamentally workarounds, not solutions. They add layers of complexity and operational toil, and they often trade one problem for another, like wasting money on buffer pods to compensate for slow reaction times.
The core architectural flaws remain:
- HPA is reactive, always arriving minutes after a performance problem has already started.
- HPA is blind, unable to distinguish a satisfied pod from a throttling one.
- HPA is uncoordinated, blindly scaling undersized pods and fighting with other controllers.
A truly autonomous system doesn’t just react faster: it changes the game entirely. Instead of fighting fires, it prevents them.
This is where ScaleOps provides a new architectural layer. It acts as an intelligent control plane for your entire autoscaling ecosystem, transforming HPA from a simple, reactive tool into a proactive, context-aware system.
- It solves the delay problem with predictive scaling that learns your traffic patterns and pre-warms your application before the spike hits.
- It solves the sizing problem by using real-time throttling and OOMKill signals to continuously rightsize your pods, ensuring HPA always scales the correct building blocks.
- It solves the context problem by correlating pod, node, and application metrics to understand the why behind performance issues, not just the what.
Stop managing your autoscalers and let an intelligent platform manage your resources. If you’re tired of paying the “buffer pod tax” and reacting to incidents that your autoscaler should have prevented, it’s time to see what a proactive approach looks like.
Book a Demo with a Scaling Specialist or Explore the Full Platform.