Key Takeaways
- CPU utilization percentage has a 30-second to 3-minute delay through the Horizontal Pod Autoscaler pipeline — by the time it triggers horizontal scaling, users have already experienced degradation.
- CPU throttling ratio provides a kernel-level signal approximately 25 seconds ahead of the Metrics Server pipeline — it is the earliest real-time indicator available for CPU-bound workloads with limits set.
- PSI (Pressure Stall Information) measures CPU contention without requiring limits, available as a beta feature since Kubernetes 1.34 via the KubeletPSI feature gate.
- In KEDA’s MAX-across-triggers evaluation model, deriv() is structurally unable to gate horizontal scaling decisions in most configurations — the unit mismatch between messages and messages-per-second means the backlog trigger almost always produces more replicas.
- predict_linear triggered first horizontal scale-out 2.66 seconds earlier than backlog-only in controlled GKE cluster testing — with no code changes and no custom instrumentation required.
- Memory pressure is primarily a vertical scaling signal — adding replicas does not reduce per-pod memory consumption, and by the time memory utilization triggers horizontal scaling, pods are already being OOMKilled.
The Spectrum of Kubernetes Leading Metrics
The Horizontal Pod Autoscaler (HPA) shipped in Kubernetes 1.1, back in 2015. At the time, CPU and memory were effectively the only scaling signals available. Metrics Server didn’t exist yet. The Custom Metrics API wouldn’t arrive until Kubernetes 1.6, two years later. “Scale when CPU exceeds 80%” was a reasonable default because most clusters ran a handful of stateless API servers behind a load balancer.
That world doesn’t exist anymore.
Today’s production clusters run fundamentally different workload types side by side: CPU-bound API servers, queue-based workers consuming from NATS or Kafka, memory-heavy caches like Redis, ML inference services with variable GPU utilization, and latency-sensitive gRPC endpoints with strict SLA budgets. Each of these workload types has different failure modes, different pressure signals, and different scaling needs. But most teams are still horizontally scaling on the same metric they configured when they first set up HPA: CPU utilization percentage.
CPU utilization is a lagging metric. It tells you what already happened, not what’s about to happen. By the time average CPU crosses your threshold, your users have already experienced degraded latency. Instead of scaling proactively, you’re reacting to symptoms.
The Kubernetes Custom Metrics API that could fix this has been available since 2017. But most teams never made the switch because the configuration is painful, the documentation is scattered across Prometheus Adapter READMEs and half-finished blog posts, and there’s no clear guidance on which metric to use for which workload.
This article is that guidance — a practical map of Kubernetes leading metrics for every major workload type.
A leading metric in Kubernetes horizontal scaling is a signal that indicates emerging stress before it impacts user experience — as opposed to lagging metrics like CPU utilization that reflect what already happened. The distinction matters because your Horizontal Pod Autoscaler can only be as fast as the signals you give it.
But “leading” is itself misleading. It implies all Kubernetes leading metrics are equally predictive. They’re not. The reality is a spectrum of signal latency:
- Truly real-time: kernel-level signals like CPU throttling ratio, which detect contention approximately 25 seconds before Metrics Server reports it through the HPA pipeline
- Early-warning: tail latency indicators like P99 that degrade before averages move — but by the time they shift, some users already had bad requests
- Predictive trends: queue growth rate and memory pressure acceleration, signaling trajectory rather than current state
- Historical averages: CPU and memory utilization — where most teams are stuck
The right metric depends on your workload type. There is no universal best:
| Workload Type | Lagging (what most teams use) | Leading (what to use instead) | Native Scaling Approach | With ScaleOps | PromQL Example |
| --- | --- | --- | --- | --- | --- |
| CPU-bound | CPU utilization % | Throttling ratio, P99 latency | HPA + Prometheus Adapter | Replica Optimization: pre-warms minReplicas from learned patterns; trigger math stays stable through rightsizing | rate(container_cpu_cfs_throttled_seconds_total...) |
| Queue-bound | Queue depth | Growth rate + productivity (composite) | KEDA composite trigger | KEDA integration: replica floor + trigger synchronisation when CPU trigger sits alongside queue triggers | predict_linear(nats_consumer_num_pending...) |
| Memory-bound | Memory utilization % | Memory pressure, eviction rate | HPA or VPA (Off mode) | Rightsizing: sets memory requests from observed working set, prevents OOMKill cliff | container_memory_working_set_bytes / limit |
| Latency-bound | Average latency | Error rate, timeout ratio, P99 | HPA + Custom Metrics API | Rightsizing + Replica Optimization: request accuracy reduces baseline latency; replica floor absorbs traffic spikes before P99 degrades | histogram_quantile(0.99, ...) |
And beyond leading metrics, there’s a predictive layer — scaling based on learned traffic patterns and seasonality rather than reacting to signals at all. That requires getting the metrics right first. We’ll come back to it at the end.
This is not a “what is HPA” tutorial. It assumes you’ve configured horizontal pod scaling before and are familiar with the autoscaling/v2 API. What follows is the metrics layer that sits underneath HPA, KEDA, and VPA — the signals that determine whether your scaling decisions are chasing symptoms or catching problems early.
CPU Throttling Ratio: The Kernel Knows Before Your Horizontal Pod Autoscaler
To understand why CPU utilization percentage is late, you need to trace the full metrics pipeline that feeds horizontal pod scaling decisions.
A traffic spike hits your application:
- CPU usage rises immediately at the process level.
- The kubelet scrapes container metrics every 10 to 15 seconds.
- Metrics Server polls the kubelet roughly every 60 seconds.
- The Horizontal Pod Autoscaler queries Metrics Server on its default 15-second sync period.
- It detects the threshold breach, calculates desired replicas, and the scheduler places new pods.
- Those pods pull their image, start, and pass readiness probes.
Total elapsed time from traffic spike to new capacity serving requests: 30 seconds to over 3 minutes.
The Linux kernel, meanwhile, knew the pod was constrained approximately 25 seconds into that chain. It was already throttling processes via CFS bandwidth control. That information was available, but nothing in the horizontal scaling pipeline asked for it.
CFS (Completely Fair Scheduler) enforces CPU limits by allocating a bandwidth quota per scheduling period (typically 100ms). When a container exhausts its quota within a period, the kernel throttles it: the process sleeps until the next period. This throttling is recorded in kernel counters that cAdvisor exposes to Prometheus. That metric is container_cpu_cfs_throttled_seconds_total.
There’s also a simple PromQL query that turns this into a scaling signal:
rate(container_cpu_cfs_throttled_seconds_total[2m])
/ (rate(container_cpu_cfs_throttled_seconds_total[2m])
+ rate(container_cpu_usage_seconds_total[2m]))
This gives you the throttling ratio: the fraction of CPU time your container spent being throttled rather than executing. It’s a direct kernel-level pressure indicator, not a derived average.
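To feed this ratio into the Horizontal Pod Autoscaler, Prometheus Adapter needs a rule that exposes it through the Custom Metrics API. A minimal sketch of such a rule follows — the metric name cpu_throttling_ratio is illustrative, and the rule assumes the adapter's standard config format with cAdvisor metrics labeled by namespace and pod:

```yaml
# Sketch of a Prometheus Adapter rule (goes in the adapter's config.yaml).
# Exposes the throttling ratio as a per-pod custom metric named
# "cpu_throttling_ratio" — the name is an assumption for illustration.
rules:
  - seriesQuery: 'container_cpu_cfs_throttled_seconds_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "cpu_throttling_ratio"
    metricsQuery: |
      rate(container_cpu_cfs_throttled_seconds_total{<<.LabelMatchers>>}[2m])
      / (rate(container_cpu_cfs_throttled_seconds_total{<<.LabelMatchers>>}[2m])
      + rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>}[2m]))
```

Once the adapter serves this, `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"` should list the metric, and an HPA can target it like any other pod metric.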
As a starting point, you can use the following thresholds:
- Below 5% is healthy
- Between 5% and 10% is a warning: something is tightening.
- Above 10%, scale now.
These thresholds are not universal; tune them against your latency SLOs. But they’re a defensible starting point for most CPU-bound services.
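Wired into an HPA, the 10% tier looks like the following sketch. It assumes Prometheus Adapter already exposes the ratio as a per-pod custom metric — the metric name cpu_throttling_ratio and the replica bounds are illustrative placeholders, not prescriptions:

```yaml
# Hypothetical HPA using the throttling ratio as a Pods custom metric.
# Assumes Prometheus Adapter serves "cpu_throttling_ratio" (illustrative name).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: cpu_throttling_ratio
        target:
          type: AverageValue
          averageValue: "100m"   # 0.10 — the "scale now" tier above
```

Kubernetes quantities express the ratio in milli-units, so 100m corresponds to a throttling ratio of 0.10.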
The limits controversy — and what to do if you don’t set them
CPU throttling ratio only works if you set CPU limits. The “remove all limits” camp — popularized by several prominent voices in the Kubernetes community — argues that limits cause unnecessary throttling and should be eliminated. They have a point about throttling. But they rarely mention what you lose.
If you remove CPU limits, you lose the single best real-time signal the Linux kernel gives you about container pressure. That’s a tradeoff worth making consciously, as long as you have full awareness of what you’re giving up — not by default because the internet said limits are evil.
But here’s what changed: you no longer have to choose between limits and observability.
Since Kubernetes 1.34, PSI (Pressure Stall Information) is available as a beta feature, enabled by default via the KubeletPSI feature gate. PSI measures something fundamentally different from CFS throttling. Where throttling ratio tells you “this container hit its CPU quota and was forced to wait,” PSI tells you “this container’s tasks wanted to run but couldn’t because the CPU was busy.” Throttling requires limits, but that’s not the case for PSI.
The metric is container_pressure_cpu_waiting_seconds_total, exposed per-container via the kubelet’s cAdvisor Prometheus endpoint. The PromQL query:
rate(container_pressure_cpu_waiting_seconds_total{container="your-app"}[2m])
This gives you the fraction of time your container’s tasks were stalled waiting for CPU — regardless of whether limits are set. It works on any cluster running cgroupv2 with a Linux kernel 4.20 or newer, which at this point is effectively every modern Kubernetes distribution.
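Before using PSI as a scaling input, it is worth alerting on it first to learn your workloads' baseline. A sketch of a Prometheus alerting rule — the 10% threshold simply mirrors the throttling-ratio tiers above and should be tuned against your own SLOs:

```yaml
# Sketch: alert when a container's tasks spend >10% of time stalled on CPU.
# Threshold and alert name are illustrative starting points, not prescriptions.
groups:
  - name: cpu-pressure
    rules:
      - alert: ContainerCPUPressureHigh
        expr: |
          rate(container_pressure_cpu_waiting_seconds_total{container!=""}[2m]) > 0.10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} stalled waiting for CPU more than 10% of the time"
```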
The practical model becomes three tiers:
- You set CPU limits: Use throttling ratio. It’s the most precise signal — CFS-level, kernel-native, measures exactly how much your container is being capped.
- You don’t set CPU limits: Use PSI. It measures contention without requiring limits. You lose the capping signal but gain a pain signal.
- You want the full picture: Use both. Throttling tells you “this container is being artificially constrained by its quota.” PSI tells you “this container is experiencing real resource contention.” These are different questions with different operational implications.
ScaleOps ingests both throttling and PSI signals as part of its workload observation — translating kernel-level pressure data into rightsizing and replica decisions without requiring you to configure Prometheus Adapter or custom HPA metrics manually.
One caveat that the PSI documentation doesn’t make obvious: PSI currently cannot distinguish between pressure caused by genuine resource contention and pressure caused by CPU throttling from limits you explicitly configured. A pod with a 20m CPU limit running a compute-heavy workload will show 99% CPU pressure — technically accurate (the pod is under pressure), but the pressure is self-inflicted by the limit, not by competing neighbors. At the per-container cgroup level this is less of a concern since you’re looking at the container’s own experience. But if you aggregate PSI to the node level for scheduling decisions, this conflation becomes a real problem. KEP-4205 in the Kubernetes enhancements repo documents this limitation in detail.
Gotcha — burstable workloads
Throttling ratio has a blind spot for Burstable QoS class workloads with a wide gap between requests and limits. A pod running at 300m CPU with requests set to 100m and limits set to 500m will show 0% throttling — it’s well within its limit. But it’s consuming 3x its requested allocation, and if the node gets busy, it’ll be the first to lose that burst headroom. The throttling ratio says “everything is fine” because technically it is — until it isn’t. For Burstable workloads, combine throttling ratio with actual usage-to-request ratio, or better yet, PSI — which will register contention the moment burst headroom disappears.
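One way to close that blind spot with standard metrics is a usage-to-request recording rule. This sketch assumes kube-state-metrics is installed (it provides kube_pod_container_resource_requests); the recorded metric name is illustrative:

```yaml
# Sketch: per-container CPU usage-to-request ratio. A value of 3.0 means
# the Burstable pod from the example above (300m used, 100m requested)
# is living entirely on burst headroom despite showing 0% throttling.
groups:
  - name: burstable-headroom
    rules:
      - record: container:cpu_usage_to_request:ratio
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_usage_seconds_total{container!=""}[2m])
          )
          /
          sum by (namespace, pod, container) (
            kube_pod_container_resource_requests{resource="cpu"}
          )
```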
Queue Growth Rate: The Gatekeeper That Doesn’t Drive KEDA Scaling
“If the queue is growing, we should scale.”
This is one of the most intuitive ideas in horizontal scaling for queue-based workloads. It’s also incomplete in a way that most teams never discover — until their KEDA configuration does something unexpected.
Queue depth tells you the stock: how many messages are waiting. Queue growth rate (the derivative) tells you the trend: is the backlog getting worse or stabilising? Together, they seem like they should give you everything you need. In practice, the way KEDA evaluates composite triggers creates a dynamic most people don’t anticipate.
Here’s a real KEDA ScaledObject configuration from a production-style setup. I used this in a Cloud Native Days France 2026 talk on advanced horizontal scaling with HPA, VPA, and KEDA, where I presented deriv() as a guardrail on top of backlog-based scaling:
triggers:
  # Stock: absolute backlog
  - type: prometheus
    metadata:
      query: |
        jetstream_consumer_num_pending{
          stream_name="MEMES", consumer_name="meme-backend"}
      threshold: "10"
  # Trend: growth rate as qualifier
  - type: prometheus
    metadata:
      query: |
        (deriv(nats_consumer_num_pending{
          stream_name="MEMES"}[5m]) > 10)
        AND
        (avg(memegenerator_pod_productivity) < 0.6)
Notice the structure. The backlog trigger is the primary replica calculator — KEDA divides pending messages by the threshold to compute desired replicas. The deriv() trigger is an AND condition: it only fires when the queue is actively growing AND per-pod productivity has dropped. It qualifies, but it doesn’t drive.
This matters because of how KEDA evaluates multiple triggers: it takes the MAX across all triggers. Whichever trigger computes the highest replica count wins.
After the talk, I went back and tested this properly. What I found challenged the approach I’d presented on stage.
The unit comparison trap
The reason deriv() rarely contributes becomes obvious when you look at the units. The backlog trigger works in messages: 500 pending messages ÷ threshold of 10 = 50 desired replicas. The deriv trigger works in messages per second: a growth rate of 200 msg/s ÷ threshold of 10 msg/s = 20 desired replicas.
These are fundamentally different units. KEDA doesn’t know that. It compares 50 and 20, takes the max, and scales to 50. Backlog wins — not because it’s a better signal, but because its unit produces a larger number at typical production scales.
But if you miscalibrate the thresholds, growth rate can accidentally become the driver. Set the backlog threshold to 100 and the deriv threshold to 5, and suddenly at moderate load the derivative produces more replicas than the backlog. Your “qualifier” just became the primary signal without you realising it.
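The replica math is worth making explicit. A sketch of the two triggers with the arithmetic in comments, using the numbers from the example above:

```yaml
# Replica math under KEDA's MAX-across-triggers model (illustrative numbers).
triggers:
  - type: prometheus              # stock trigger — unit: messages
    metadata:
      query: jetstream_consumer_num_pending{stream_name="MEMES"}
      threshold: "10"             # 500 pending / 10      -> 50 replicas
  - type: prometheus              # trend trigger — unit: messages/second
    metadata:
      query: deriv(nats_consumer_num_pending{stream_name="MEMES"}[5m])
      threshold: "10"             # 200 msg/s / 10 msg/s  -> 20 replicas
# KEDA scales to max(50, 20) = 50: backlog wins.
# Miscalibrate — backlog threshold 100, deriv threshold 5 — and at the same
# load you get max(500/100, 200/5) = max(5, 40) = 40: the "qualifier" now drives.
```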
Controlled Testing: deriv() vs Backlog in KEDA’s MAX-Across-Triggers Model
I ran controlled A/B experiments on a live GKE cluster with NATS JetStream, using the same meme-generator application from the talk.
The first finding was sobering: the production configuration’s deriv PromQL was invalid. The query used deriv((pending + ack_pending)[2m]), which Prometheus rejects — ranges are only allowed on vector selectors. KEDA logged PartialTriggerError with a high failure count. The deriv branch was effectively dead in production. It was difficult to notice at first, because the backlog trigger handled everything on its own.
After correcting the query and running controlled tests with the consumer disabled (to isolate true queue dynamics), the results were consistent: deep backlog with flat growth — backlog dominated every time. Shallow backlog with acceleration — deriv won briefly at 1-second sampling resolution, but at HPA/KEDA decision-window cadence, backlog still produced more replicas.
The deriv trigger wasn’t gating anything. It wasn’t contributing to any scaling decision. It was dead weight in the ScaledObject — invisible to the operator, invisible in KEDA’s status, and structurally unable to influence the outcome due to MAX semantics.
The concept was right. The implementation was wrong.
I was right that trajectory matters. You absolutely should care whether the queue is getting worse or stabilising. A queue at 1,000 messages with deriv() at zero is a fundamentally different situation from a queue at 100 messages with deriv() at 50 msg/s. The first might drain on its own. The second will be at 1,100 in 20 seconds.
But deriv() inside KEDA’s MAX-across-triggers model is the wrong vehicle for expressing that insight. KEDA’s evaluation model can only scale MORE — it takes the highest replica count. It can’t express “scale LESS because this other signal says wait.” And the AND clause embedded inside the PromQL query is opaque: the operator never sees “growth rate says yes, productivity says no” in any dashboard or KEDA status output. The guardrail is hidden.
The strongest practical alternative I found was predict_linear. Instead of measuring instantaneous rate of change, it forecasts where the backlog will be in 30 seconds:
clamp_min(predict_linear(jetstream_consumer_num_pending{stream_name="MEMES"}[2m], 30), 0)
In a clean comparison, predict_linear triggered first horizontal scale-out 2.66 seconds earlier than backlog-only. No code changes required. No custom instrumentation. Just a smarter PromQL query that captures the same trajectory concept deriv() was trying to capture — but expressed as a forecast that KEDA can actually use to compute replicas.
That 2.66 seconds might not sound like much. In a microburst scenario where your queue goes from 0 to 500 in 10 seconds, it’s the difference between scaling before users notice and scaling after your P99 has already degraded.
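As a KEDA trigger, the predict_linear variant is a drop-in replacement for the backlog query — replicas are computed from the forecast backlog 30 seconds out rather than the current backlog. The serverAddress is a placeholder for your Prometheus endpoint:

```yaml
# Sketch: backlog-forecast trigger. Same unit as the stock trigger (messages),
# so the threshold arithmetic stays coherent under MAX-across-triggers.
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: |
        clamp_min(predict_linear(
          jetstream_consumer_num_pending{stream_name="MEMES"}[2m], 30), 0)
      threshold: "10"
```

Because the forecast is still denominated in messages, it competes with a plain backlog trigger on equal terms — unlike deriv(), which loses the MAX comparison on units alone.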
Beyond PromQL: Coordinated Scaling with ScaleOps
predict_linear is the best you can do with standard PromQL and no code changes. But it’s still forecasting from a short window. It catches ramps in progress but can’t anticipate a traffic pattern it hasn’t seen yet.
Queue horizontal scaling isn’t just about one threshold. In production, queue pressure, CPU demand, and baseline replica needs evolve together. The configuration you deployed three months ago may no longer reflect how your workload actually behaves. This is where the set-and-forget nature of KEDA becomes visible: threshold, minReplicaCount, and resources.requests.cpu are all static values encoding assumptions about a workload that changes over time.
ScaleOps continuously reconciles those layers. It rightsizes resource requests based on observed workload behaviour — CPU and memory consumption, replica history — which for queue workloads naturally correlates with queue activity. It adjusts the replica floor from learned patterns, so capacity is already warm when recurring load arrives rather than waiting for the queue to fill first. And when a CPU utilization trigger sits alongside the queue trigger in the same ScaledObject (a common pattern where CPU acts as a safety net), ScaleOps keeps that trigger’s scaling intent stable as requests change, so the two triggers don’t drift out of alignment after rightsizing.
If your queue traffic is genuinely unpredictable, i.e., no recurring patterns, no periodicity, then learned behaviour won’t help and you need faster reaction instead. ScaleOps Burst Reaction addresses this by shifting from historical baselines to real-time resource usage when it detects sustained spikes, giving each pod more headroom to process messages faster even before horizontal scaling kicks in. In a follow-up article, I’ll also show how custom instrumentation like queue-wait histograms, semaphore saturation gauges, and time-to-drain estimates can push queue-based horizontal scaling from reactive to genuinely proactive for those cases. Standard PromQL gets you 80% of the way. The last 20% requires knowing not just how many messages are waiting, but how long they’ve been waiting and how fast your pods are actually draining them.
P99 Latency: Early Warning, Not a Leading Metric
P99 latency occupies an awkward position in the metrics spectrum. It is closer to leading than CPU utilization — tail behaviour shifts before averages move — but by the time your 99th percentile degrades, 1% of your users have already experienced bad requests. The signal is early, but the damage has started.
That distinction matters for horizontal scaling decisions. CPU throttling ratio tells you the kernel is constraining the container before users feel anything. Queue growth rate tells you demand is accelerating before the backlog is deep enough to cause latency. P99 tells you latency has already degraded for the tail. It is the earliest lagging signal, not a leading one.
The PromQL:
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[2m])) by (le))
This requires application-level histogram instrumentation. Unlike CPU metrics (which the kubelet provides automatically) or queue depth (which your message broker exports natively), P99 latency demands that your application records request durations into Prometheus histogram buckets. That means code you have to write, or a service mesh you have to deploy. It is not free.
The instrumentation has a gotcha that affects almost every team at least once: bucket boundaries. Prometheus histograms use predefined buckets — [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] by default. histogram_quantile interpolates within those buckets. If your SLO is 200ms but your nearest bucket boundaries are 100ms and 250ms, you cannot distinguish a 110ms response from a 240ms one — they land in the same bucket. The quantile calculation assumes uniform distribution within the bucket, which for tail latency is almost never accurate.
For scaling decisions, this means your trigger threshold needs to account for bucket resolution, not just the SLO number. A P99 target of 200ms with a 250ms bucket boundary will never fire until you have already exceeded your SLO.
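Wiring P99 into HPA typically goes through Prometheus Adapter's external metrics API. A sketch — it assumes an adapter rule already exposes the histogram_quantile expression under the name http_p99_latency_seconds (an illustrative name, not a standard one):

```yaml
# Sketch: P99 latency as an External metric. The target is deliberately set
# on a default bucket boundary (0.1s) below the 200ms SLO, so the trigger
# is not blinded by interpolation within the 100ms–250ms bucket.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: http_p99_latency_seconds
        target:
          type: Value
          value: "100m"   # 0.1s — aligned with a default bucket boundary
```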
Where P99 earns its place: as a confirmation signal alongside throttling or queue metrics for request-serving workloads. If throttling ratio says the container is under pressure AND P99 is climbing, you have high confidence that horizontal scaling is warranted. Not just CPU contention, but user-visible impact. For batch processors and queue consumers, P99 of individual request handling time is less meaningful — queue depth and processing rate tell a more direct story.
Memory Pressure: The Forgotten Kubernetes Autoscaling Signal
Memory behaves fundamentally differently from CPU as a scaling signal. CPU degrades gradually — latency climbs, throttling increases, requests slow down, and the gradient gives you time to observe and react. Memory consumption typically grows over time as connection pools fill, caches warm, and buffers accumulate, with no equivalent of CPU throttling to indicate proximity to a limit. A container moves from 70% of its memory limit to 85% to 95%, and then the kernel’s OOM killer terminates it. There is no graceful degradation step between “approaching the limit” and “process killed.”
This characteristic makes memory one of the most difficult metrics to use as a horizontal scaling trigger. By the time memory pressure is high enough to cross an HPA threshold, pods are already being OOMKilled.
The metric to watch is container_memory_working_set_bytes, not container_memory_usage_bytes. The distinction is important: usage_bytes includes reclaimable filesystem cache — memory the kernel can reclaim under pressure without terminating anything. working_set_bytes is the memory the container is actively using and the kernel cannot reclaim. This is what the OOM killer evaluates. Scaling decisions based on usage_bytes will fire too early, because cache inflation looks like memory pressure when it is not.
The PromQL for the ratio that matters:
container_memory_working_set_bytes{container="your-app"}
/ container_spec_memory_limit_bytes{container="your-app"}
Even with the correct metric, using memory as a horizontal scaling trigger is structurally limited. Adding replicas does not reduce per-pod memory consumption. If your application has a memory leak or an unbounded cache, ten pods will each consume the same amount. Memory pressure is primarily a vertical scaling signal: the correct response is larger memory requests, not more pods.
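Because the correct response is vertical, the natural native tool is a VPA running in recommendation-only mode — it surfaces memory targets from observed working-set without evicting anything. A minimal sketch:

```yaml
# Sketch: VPA in "Off" mode — no automatic pod eviction; the recommender
# publishes memory targets you can apply to requests manually or via CI.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: your-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: your-app
        controlledResources: ["memory"]
```

Recommendations then appear under the object's status (`kubectl describe vpa your-app`), giving you an observed working-set baseline instead of a developer guess.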
This is where continuous rightsizing matters more than for any other metric. Developer-guessed memory requests — which is how most deployments start — are almost always inaccurate. Too low, and you get OOMKills under production load. Too high, and you waste capacity cluster-wide. ScaleOps sets memory requests from observed usage patterns, maintaining headroom above the working set without leaving the gap that allows memory consumption to reach the limit undetected.
One development that changes this equation: Kubernetes swap support (beta since 1.28, with LimitedSwap behaviour stabilising through 1.33-1.35) gives memory-pressured containers a degradation path that does not end in immediate termination. Performance degrades, but the process survives. I covered the tradeoffs, benchmarks, and decision framework in my KubeCon EU 2026 talk, “To Swap or Not to Swap – Memory Management Design Patterns for AI Workloads in Kubernetes 1.34+”.
From Reactive Metrics to Predictive Kubernetes Autoscaling
Every Kubernetes leading metric covered in this article is still fundamentally reactive. Throttling ratio is faster than CPU utilization. predict_linear is faster than raw backlog. PSI works where throttling does not. But each of these responds to something that has already started happening — the signal is earlier in the chain, not ahead of it.
Predictive horizontal scaling operates differently: it raises capacity before any metric fires, based on learned workload patterns rather than real-time signals. The reactive trigger becomes the safety net rather than the primary mechanism.
This is what ScaleOps Replica Optimization does. It replaces the static, manually-set minReplicas with a continuously-updated one — lower when observed patterns indicate over-provisioning, higher when a recurring load pattern is approaching. The result is that horizontal scaling triggers fire less often, because the baseline capacity already reflects what the workload needs.
A concrete starting point: add CPU throttling ratio as a metric on a single workload this week and compare its timing against your existing CPU utilization trigger. When you see how much earlier the kernel reports pressure, the case for moving up the metrics spectrum becomes clear.
ScaleOps works with your existing HPA and KEDA configuration — no migration, no rearchitecture. Start free to see how your current scaling signals compare to what your workloads actually need, or book a demo to see Replica Optimization running on your own cluster data.
Kubernetes Leading Metrics: Frequently Asked Questions
What is the difference between leading and lagging metrics in Kubernetes?
A leading metric signals emerging stress before it impacts users — CPU throttling ratio, queue growth rate, or PSI contention data. A lagging metric reports what already happened — CPU utilization percentage, average response time, or error rate. Kubernetes horizontal scaling is only as fast as the metric it reacts to.
Does CPU throttling ratio work without CPU limits?
No. CFS throttling only occurs when a container exceeds its CPU limit quota, so removing limits eliminates the signal. Since Kubernetes 1.34, PSI (Pressure Stall Information) provides an alternative that works without limits — container_pressure_cpu_waiting_seconds_total measures CPU contention regardless of whether limits are set. The practical model is three tiers: throttling ratio if you set limits, PSI if you don’t, both for the full picture.
When should I use KEDA instead of HPA for custom metrics?
KEDA adds value when you need scale-to-zero, composite triggers combining multiple signals, or native scaler support for message queues and streams. For a single Prometheus-based custom metric — such as throttling ratio or P99 latency — HPA with Prometheus Adapter involves fewer moving parts and works well. KEDA’s strength is orchestrating multiple event sources, not replacing HPA for single-metric horizontal scaling.