TL;DR
- Traditional autoscalers optimize in a vacuum; because they are purely reactive, combining them often creates instability and wasted resources.
- Real optimization needs coordination and automation. Insight is useless without action: look for platforms that rightsize in real time, prevent scaler conflicts, and reclaim node capacity safely.
Let’s be honest: nobody adopts Kubernetes to spend their days tweaking YAML files. The goal has always been to move faster, build resilient systems and scalable applications, and let the platform handle the complexity as the de facto cloud OS. Yet here we are in 2025, and for many, “managing Kubernetes” has become a full-time job of fighting fires and trying to decipher a cloud bill that looks like a phone number.
The core of resource optimization is simple: give every single workload exactly the resources it needs, precisely when it needs them. No more, no less.
Get it right, and you’ll achieve a model of operational excellence: your costs plummet, your applications run with predictable performance, and your engineers can focus on building products. Get it wrong, and you’re not just wasting money, you’re actively hurting your business. Over-provisioned applications burn cash and hide inefficiencies. Under-provisioned applications suffer from CPU throttling and OOMKills, leading to poor user experience and late-night pages for your on-call team.
The path to optimization isn’t about finding one magic tool. It’s about mastering the fundamental mechanics of how Kubernetes manages resources, and understanding where each popular solution falls short.
Key Aspects of Kubernetes Resource Optimization
Before we compare software, let’s expose the uncomfortable truths about Kubernetes autoscaling that vendors won’t tell you.
Rightsizing (Scaling Up & Down): VPA’s Dirty Secrets
The Vertical Pod Autoscaler (VPA) is Kubernetes’ native attempt at rightsizing. On paper, it’s brilliant. But in practice, it’s a minefield of gotchas. VPA consists of three components that seem reasonable until you understand their production implications:
The Recommender: A Slow Historian in a Real-Time World
The Recommender syncs once per minute by default. That might sound fast, but its real limitation is reactive learning: memory recommendations rely on an eight-day aggregation window, and CPU uses an exponentially weighted average. That means both lag behind real-time demands. If you’re setting up a new deployment, brace yourself for a week of inefficiency. During that period, your service runs on suboptimal guesses, either burning cash on oversized requests or throttling performance on undersized ones.
VPA does recommend CPU, but only based on average usage. It offers no defense against sudden CPU surges, and you still suffer throttling. Its memory recommendations use a fixed 95th percentile, which gives you a historical baseline but can’t anticipate spikes. When a flash sale or traffic surge hits the 98th percentile, VPA has no answer. Its only “fast” response is to react after your pod has already been OOMKilled, worsening the experience for your users.
The Updater: A Dangerous Balancing Act
The Updater evicts pods when their requests drift more than 10% from the recommendation. It uses the eviction API, so it respects your PodDisruptionBudgets (PDBs), but that’s the trap!
- If you set `maxUnavailable` to 0 or tighten your PDB too much, you effectively disable VPA. It sees the recommendation, can’t evict, and your pods sit unoptimized forever.
- If you loosen your PDB, VPA gains permission to kill pods on every 60-second sync. It has zero context about your traffic patterns, the other pods in the same deployment, or your nodes. It will restart a critical pod in the middle of your Cyber Monday peak if the numbers line up.
You’re forced into an impossible manual trade-off: either sacrifice optimization or risk VPA triggering an incident. It’s not intelligence, it’s a loop with a PDB check. You can mitigate this by running VPA in “recommendation only” mode and handling updates yourself.
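As a minimal sketch of that safer mode (field names are from the autoscaling.k8s.io/v1 API; the workload name is illustrative):

```yaml
# Sketch: VPA publishes recommendations, but the Updater never evicts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa            # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout              # illustrative workload
  updatePolicy:
    updateMode: "Off"           # recommend only; you apply changes on your own schedule
```

The recommendation lands in the VPA object’s status, so you can roll it out through your normal deployment pipeline, outside traffic peaks.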
The Admission Controller: Architectural Risks
VPA’s `MutatingWebhookConfiguration` intercepts all pod creation requests cluster-wide. This design presents two primary operational risks:
- Performance Overhead: Every pod creation, even for pods without a VPA object, triggers a synchronous check by the VPA admission controller. Kubernetes waits for this webhook before scheduling any pod. Because it evaluates resource-policy bounds in real time, it adds noticeable latency and can slow scheduling across the entire cluster. A more resilient architecture would handle these expensive calculations asynchronously, allowing the webhook to simply apply a pre-approved result without adding computational overhead to the critical path.
- Single Point of Failure: By default, the webhook’s `failurePolicy` is set to Fail. If the API server can’t reach the VPA admission controller, for example if the controller pod is down, there is a network partition, or there is a TLS error, the server rejects every pod creation request. That means any VPA controller outage stops all new pods cluster-wide, creating a serious reliability risk. For that reason, you should change the `failurePolicy` to Ignore.
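If you take that route, the change is a one-field edit on the webhook registration. The object and webhook names below are what a typical VPA install creates, so verify them in your cluster before patching:

```yaml
# Fragment of the VPA admission controller's webhook registration (other required
# fields omitted). With failurePolicy: Ignore, an unreachable webhook means pods are
# admitted unmutated instead of being rejected cluster-wide.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config      # name may differ depending on how VPA was installed
webhooks:
  - name: vpa.k8s.io            # verify against your install
    failurePolicy: Ignore       # instead of Fail, which rejects pods while the webhook is down
```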
Scaling Out & In: HPA’s Hidden Flaws
If VPA is the tortoise, HPA is The Flash, reevaluating every 15 seconds on a single metric. Its core logic is just simple math: `desiredReplicas = ceil(currentReplicas * (currentMetric / desiredMetric))`. This ratio-based feedback loop has no predictive capability and suffers from fundamental architectural flaws that lead to both waste and poor performance.
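To make the math concrete: with 4 replicas averaging 90% CPU utilization against a 60% target, HPA computes ceil(4 × 90 / 60) = 6 replicas; if usage then dips to 25%, the same formula immediately suggests ceil(4 × 25 / 60) = 2. Every decision is just that ratio, recomputed from scratch.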
Stateless by Design, HPA Has No Memory
HPA has no access to historical data. It cannot distinguish a predictable daily traffic ramp-up from a sudden flash sale or a DDoS attack. It is architecturally blind to trends, seasonality, or momentum. Every decision is based on a single, isolated snapshot of the immediate past, forcing it to be purely reactive. It’s a fundamental design choice that guarantees the HPA will always be one step behind your application’s real needs.
The Stabilization Window: Enforced Waste as a Feature
To prevent flapping (scaling up and down erratically), HPA enforces a five-minute stabilization window for scale-down operations. But this isn’t a smart, adaptive delay. Because the HPA is stateless and blind to historical trends, this crude delay is its only defense against self-inflicted instability. The control loop simply records its own past scaling recommendations and will refuse to scale down if it recommended a higher replica count at any point within the window.
This forces engineers into a conundrum:
- Set a low window: You chase cost savings but accept the operational chaos of flapping replicas as the HPA overreacts to every minor dip in traffic.
- Keep the high default: You buy stability but are forced to knowingly burn cash, holding onto expensive, idle pods for five full minutes after every single spike.
It’s a system that uses guaranteed over-provisioning as its primary tool to manage instability. So, how do you solve that design issue?
Use multiple layers of defense (like confidence scoring based on data availability), and make the stabilization window only a safety backstop, not your primary, wasteful control.
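If you do tune the window yourself, the knob lives in the HPA’s behavior section of the autoscaling/v2 API. The manifest below is an illustrative sketch for the taxi-api service from our test app, not a recommendation:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: taxi-api                  # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: taxi-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # default is 300 (the five-minute window)
      policies:
        - type: Percent
          value: 50                     # shed at most half the replicas per period
          periodSeconds: 60
```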
The Tolerance Dead Zone
HPA is intentionally designed to be unresponsive. The controller is governed by a tolerance flag (`--horizontal-pod-autoscaler-tolerance`), which defaults to 10% and creates a “dead zone” where it will do nothing unless the calculated metrics are significantly off target.
This has dangerous real-world consequences. If your utilization target is 80%, HPA will refuse to act until average usage climbs past 88%. Since it is blind to a pod’s true performance limits and only sees the arbitrary `request` value you provided, it forces the running pods to absorb 100% of this load increase without any help. By the time HPA finally crosses its tolerance threshold, your pods are already running hot, user latency is climbing, and service health is degrading.
As before, this problem doesn’t hit as hard when you properly rightsize your workloads and include a data-driven safety buffer, making your application resilient to spikes from the start. This turns the HPA’s dead zone into a harmless safeguard against small swings. You get a system that’s both stable and responsive, without forcing engineers to pick one over the other.
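For completeness: the dead zone is a kube-controller-manager flag, so it is only tunable on self-managed control planes; most managed offerings don’t expose it. A sketch of where it lives:

```yaml
# Fragment of a self-managed control plane's static pod manifest
# (e.g. /etc/kubernetes/manifests/kube-controller-manager.yaml).
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        - --horizontal-pod-autoscaler-tolerance=0.05   # shrink the dead zone from the 10% default
        # ...existing flags unchanged
```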
Combining VPA and HPA: The Death Spiral?
Running VPA and HPA together on the same metrics is one of Kubernetes’ most notorious anti-patterns, creating a feedback loop that guarantees cost-inefficiency and instability. The two controllers are fundamentally incompatible because they operate on the same data with opposing goals, leading to a predictable death spiral:
- VPA sees low average CPU usage and correctly recommends shrinking the pod’s `requests`.
- After the pods restart with the new, smaller requests, HPA (which calculates utilization as a percentage of that request) suddenly sees a massive spike. The same real-world usage now represents a much higher percentage.
- HPA panics, scaling out the workload with far too many replicas.
- Now, with the same load spread across an excessive number of pods, the absolute CPU usage per pod plummets.
- VPA looks at this new, chronically low usage and concludes its previous recommendation wasn’t aggressive enough. It recommends shrinking the `requests` again, intensifying the cycle.
VPA shrinks pod requests but HPA interprets that as high utilization and scales out, creating a costly, unstable feedback loop. Most teams work around this by moving HPA off the native CPU metric, typically using a Prometheus Adapter and custom queries. But you can also avoid the clash by running VPA in recommendation-only mode or by specializing each scaler (e.g., VPA on memory, HPA on business metrics).
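A sketch of the specialization approach: VPA constrained to memory so it never touches the CPU requests that HPA divides by, while HPA scales on a pods-level business metric (the metric name assumes you have exposed it through something like Prometheus Adapter):

```yaml
# VPA limited to memory via controlledResources.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: taxi-api-memory-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: taxi-api
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # leave CPU requests alone
---
# HPA driven by a business metric instead of CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: taxi-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: taxi-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed custom metric
        target:
          type: AverageValue
          averageValue: "100"
```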
KEDA: Event-Driven Scaling with Event-Driven Problems
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA to scale on external events like queue length. While powerful, it inherits HPA’s reactive nature and introduces its own production traps.
The Polling Paradox and the Latency Trap
KEDA is not a real-time system; it polls. With a default 30-second polling interval, a queue can grow from zero to thousands before KEDA even registers it. By the time it reacts and schedules new pods, your application is already scrambling to recover.
Its signature scale-to-zero feature is an operational nightmare for synchronous or user-facing workloads. The first request pays a steep price, up to 45–90 seconds, waiting for polling, scheduling, image pull, and startup. That almost always results in a gateway timeout for the users. On top of this, every KEDA trigger needs its own `TriggerAuthentication` and `Secret`, creating a sprawling credentials map that’s hard to audit and a constant security concern.
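A sketch of the knobs in question, using a Kafka trigger (broker, topic, and threshold values are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: taxi-worker-scaler        # illustrative
spec:
  scaleTargetRef:
    name: taxi-worker             # illustrative Deployment
  pollingInterval: 10             # default is 30s; shorter reacts faster at the cost of more polling
  cooldownPeriod: 120             # wait before scaling back down (or to zero)
  minReplicaCount: 1              # keep a warm replica for user-facing paths instead of scale-to-zero
  maxReplicaCount: 30
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092   # placeholder
        consumerGroup: taxi-workers                # placeholder
        topic: taxi-events                         # placeholder
        lagThreshold: "50"        # the implicit per-pod capacity promise discussed below
      authenticationRef:
        name: kafka-trigger-auth  # TriggerAuthentication object backed by a Secret
```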
The VPA + KEDA Fallacy: Horizontally Scaling a Vertical Problem
Pairing VPA with KEDA on an external metric seems elegant, as it avoids the CPU-conflict death spiral. But the controllers remain context-blind and clash over a hidden variable: a pod’s true throughput.
A KEDA `lagThreshold` is an implicit promise of per-pod capacity. When VPA rightsizes your CPU request from 1000m to 300m, it silently breaks that promise. KEDA continues scaling based on the old assumption, unaware of the reduced power.
When a traffic burst hits, these under-resourced pods throttle on CPU and processing slows. KEDA sees the queue lag exploding and adds more replicas, but they are all equally starved. You end up wasting budget on useless pods while the queue grows. A truly coordinated system would restore CPU headroom first, then scale out.
The SRE’s Dilemma and the Path to True Autonomy
The fundamental flaw uniting Kubernetes’ native autoscalers is that they are a collection of disconnected, context-blind tools, each optimizing for a single metric in a vacuum. They are architecturally incapable of coordinating to solve a holistic problem. VPA fights HPA, and both can silently sabotage KEDA by changing the one variable they don’t manage: the throughput of a single pod.
This leaves savvy SREs trapped in a cycle of managing the managers, forced to implement a series of expensive workarounds to prevent these systems from destroying each other:
- The Buffer Strategy: They deliberately overprovision pods with excess CPU and memory. This expensive insurance policy is the only way to create the safety headroom that VPA would otherwise remove and that KEDA implicitly relies on to handle bursts.
- The Two-Tiered Contraption: They build fragile, hand-tuned hybrid scaling models where KEDA manages the 0-N replica range and HPA manages N-to-Max. This complex setup requires constant manual re-tuning and creates a dependency that can shatter on the next Kubernetes upgrade.
- Reactive Firefighting: They create complex alerts that attempt to correlate vertical pressure (like CPU throttling) with horizontal symptoms (like queue lag). This is a purely reactive stance; by the time the alert fires, the incident has already begun.
All of these workarounds are a cry for a missing architectural layer: a single, coordinated control plane that understands the full picture. A truly autonomous system understands the causal relationship between metrics. It knows that a pod’s throughput is a direct function of its resources. It stabilizes the vertical dimension with data-driven rightsizing before making horizontal decisions. It provides real-time, vertical healing to prevent a CPU bottleneck from ever becoming a horizontal scaling crisis.
Ultimately, the goal isn’t to get better at tuning three disconnected tools, it’s about relying on an intelligent system that makes them work together in harmony.
Notable Kubernetes Resource Optimization Software
Now that we’ve dissected the fundamental flaws in Kubernetes’ native autoscaling, let’s examine some of the tools and platforms that promise a solution. The ultimate goal of optimization is a trifecta of benefits: improved performance, greater stability, and most importantly significant cost savings.
However, achieving real cost savings isn’t as simple as shrinking pod resource requests. Rightsizing is only the first step: it converts wasted, reserved capacity back into usable node space, but it doesn’t automatically lower your cloud bill. You still need to reclaim that freed capacity. That’s why a node-level autoscaler like Karpenter or Cluster Autoscaler is a hard prerequisite. These tools perform the final step: they detect underutilized nodes after optimization and terminate them, turning theoretical efficiency into actual savings (provided your chosen solution offers an efficient bin-packing algorithm).
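With Karpenter, for example, the reclamation side is a consolidation policy on a NodePool; the fields below are from Karpenter’s v1 API, and the values are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # repack and terminate underutilized nodes
    consolidateAfter: 5m                            # settle time before consolidating, to avoid churn
  template:
    spec:
      nodeClassRef:                                 # cloud-specific NodeClass; details omitted
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
```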
With this in mind, the ecosystem of optimization tools has evolved into three distinct categories: VPA-enhanced recommendation engines (Goldilocks), cost visibility platforms (OpenCost, Kubecost OSS), and event-driven scaling enhancers (KEDA extensions). Understanding where each solution excels, and more importantly, where it fails, is key for making informed architectural decisions.
Side Note
For testing, we’ve developed a Kubernetes application (TaxiMetrics) processing NYC taxi data through a complete ML pipeline. The workload includes:
- Persistent services: API orchestration (job-controller), user-facing API (taxi-api with HPA), ML model serving, Redis caching, and PostgreSQL database storing 3.3M+ records.
- Batch jobs: Parallel data processors (3 workers processing 1.1M records each in 5-6 minutes) and ML trainer (training 5 models on 1M+ records in 6-7 minutes).
The application exhibits real production patterns: 10x CPU over-provisioning in persistent services (4m actual vs 200m requested), 90% memory waste in ML jobs (800MB used vs 8GB allocated), and storage inefficiency (50GB PVC with <5GB actual usage). These documented inefficiencies across 3.3M records provide an ideal testbed for comparing optimization tool effectiveness.
1. Goldilocks: A Snapshot in Time, A To-Do List in Practice
Goldilocks presents itself as the friendly, risk-free entry point to rightsizing. It wraps Kubernetes’ native VPA in a clean UI, making its recommendations easy to consume. But this simplicity is a veneer over VPA’s inherent limitations: it makes the problems easier to see but does nothing to solve them.
How It Works:
Goldilocks is fundamentally a VPA management UI. It works by automatically deploying VPA resources in a recommendation-only (`updateMode: "Off"`) mode for every workload in a labeled namespace. It then queries these VPA objects and presents their recommendations in a central dashboard, bypassing the need for any pod evictions or mutating webhooks.
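Opting a namespace in is a single label (the label key is Goldilocks’ documented toggle; the namespace name is from our test app):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: taximetrics               # our test namespace
  labels:
    goldilocks.fairwinds.com/enabled: "true"   # Goldilocks creates updateMode "Off" VPAs for workloads here
```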
Real-World Testing Results
What Works as Expected:
- Instant Recommendations: Guidance appears in under a minute, providing immediate feedback on newly deployed workloads.
- Safety by Default: Because it never enables VPA’s `Auto` mode, there is zero risk of accidental pod evictions or service disruptions.
- Clear Visualization: The dashboard successfully translates VPA’s complex output into an accessible “Guaranteed” vs. “Burstable” view, which is a clear improvement over raw VPA objects.
The Uncomfortable Surprises:
- The Manual Toil Engine: Goldilocks is a read-only tool. It generates a perpetual to-do list of sizing recommendations that an engineer must manually apply via `kubectl patch` or a GitOps workflow (see the sketch after this list). It doesn’t reduce operational load; it creates it.
- HPA Conflict Creates Useless Recommendations: When an HPA is managing a workload, Goldilocks’s advice becomes misleading. It doesn’t analyze the pod’s true usage; instead, it tends to recommend a value that mirrors the HPA’s own utilization target, defeating the purpose of rightsizing.
- Ephemeral Workload Blindness: The tool is ineffective for ephemeral jobs. Since VPA objects are tied to the Job controller of our TaxiMetrics application, their recommendations are often lost or aggregated incorrectly once the Job completes and is garbage-collected. It cannot group multiple runs of a data pipeline into a single, logical recommendation.
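Acting on a recommendation therefore looks like this, for every container of every workload, forever. The numbers below are hypothetical stand-ins for a Goldilocks suggestion on our over-provisioned job-controller:

```yaml
# patch-job-controller.yaml -- hypothetical values copied from the dashboard
spec:
  template:
    spec:
      containers:
        - name: job-controller
          resources:
            requests:
              cpu: 25m            # was 200m requested against ~4m actual usage
              memory: 256Mi
            limits:
              memory: 512Mi
# Applied manually (or turned into a GitOps pull request):
#   kubectl patch deployment job-controller --patch-file patch-job-controller.yaml
```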
2. OpenCost: The High-Maintenance Forensic Accountant
OpenCost provides clarity on exactly where your money is going. It’s a forensic accounting engine that translates resource consumption into dollars and cents.
How It Works:
OpenCost deploys a single metrics-exporter pod. This pod queries kubelet for real-time usage data and integrates with your cloud provider’s billing API (e.g., via a GCP BigQuery export). It then synthesizes this data into a detailed set of Prometheus metrics, adding high-cardinality labels for every pod, namespace, and label combination.
Real-World Testing Results
What Works as Expected:
- Unflinching Cost Clarity: It excels at its core mission. Seeing that our job-controller had a 0.21% CPU efficiency provided an undeniable, dollar-driven impetus for optimization.
- Granular Attribution: It successfully breaks down costs by any label, allowing you to answer questions like “How much does the `team: backend` project cost us?”
The Hidden Costs:
- The Prometheus Tax: OpenCost’s value comes at a steep price for your monitoring stack. In our small five-service cluster, it added over 50,000 new Prometheus time series, consuming 3GB of storage in just 24 hours. This is a significant, ongoing operational cost and load that most teams don’t plan for.
- Painful IAM Setup: Configuring the cloud billing integration is a significant hurdle. Expect to spend hours wrestling with IAM roles and waiting for permissions to propagate before the tool even starts reporting costs.
- Provides Problems, Not Solutions: It is 100% reactive. It’s exceptionally good at telling you how much money you wasted yesterday, but it provides zero actionable advice on how to stop wasting it tomorrow.
3. Kubecost OSS: The Resource-Hungry Dashboard
Kubecost OSS bundles OpenCost’s engine with a UI, basic rightsizing hints, and additional checks for things like orphaned resources and oversized storage. It aims to be an all-in-one cost visibility solution.
How It Works:
Kubecost deploys a full cost-management stack into your cluster, including its core cost-analyzer, a bundled Prometheus, Grafana, and other components. It analyzes the same cost and usage data as OpenCost but presents it in its own UI, layered with simple percentile-based recommendations.
Real-World Testing Results
What Works as Expected:
- Unified View: It successfully combines cost data with basic health checks, flagging unused disks and abandoned deployments in one place.
- Beyond Compute: Unlike other tools, its ability to identify and recommend savings on oversized Persistent Volume Claims is a unique and valuable feature.
The Operational Burden:
- Staggering Resource Footprint: The convenience comes at a shocking resource cost. The full Kubecost stack peaked at nearly 4 CPU cores in our testing, consuming far more resources than the entire application it was meant to optimize.
- Slow Time-to-Value: The system was slow to populate. It took 13 minutes for basic metrics to appear and over 25 minutes for its first rightsizing “hints” to surface, a lifetime compared to other tools.
- Superficial Advice: The rightsizing recommendations are rudimentary and, like Goldilocks, create another to-do list of manual changes for the engineering team.
4. ScaleOps: From Insight to Autonomous Optimization
Where other tools stop at observation, ScaleOps is a production-grade platform built to close the loop and run at scale. It continuously and autonomously optimizes resources with full node and application context awareness, moving beyond reports and to-do lists to provide safe, continuous resource management.
How It Works:
ScaleOps deploys a self-hosted control plane into your cluster. Once installed, it immediately begins monitoring your environment and generates workload rightsizing recommendations without any manual steps required. It also integrates seamlessly with HPA to proactively optimize replica counts, ensuring your application performance and cluster stability stay at peak levels.
Real-World Testing Results
Within an hour, ScaleOps had automatically patched our pods, slashing cluster CPU requests by 65% and memory by 45%, all while consuming under 200m of CPU itself, making it over 19x more resource-efficient than Kubecost.
It achieves this by directly solving the problems the others create:
- It Automates the To-Do List: It doesn’t just recommend; it safely applies changes via rolling updates or in-place resizes, respecting PDBs and Zero-Downtime policies.
- It Is HPA-Native: ScaleOps intelligently coordinates with HPA, using its deep workload analysis to prevent the conflict spirals that plague other tools. HPA becomes a stable, predictable scaler because it’s finally working with accurate, context-aware data.
- It Understands Complex Workloads: It provides the right abstractions for deep visibility and control over the ephemeral batch jobs that were invisible to VPA-based tools.
Outside of the scope of our test, ScaleOps has much more to offer. The platform’s capabilities extend beyond rightsizing to include proactive Fast Reaction for absorbing sudden traffic spikes, intelligent bin-packing to maximize node density from the moment a pod is scheduled, and deep integration with Karpenter to turn rightsized capacity into decommissioned nodes and realized savings.
For mission-critical services, it provides true zero-downtime updates for single-replica pods, in-place resizing to avoid disruptive evictions, and a Spot-aware optimization engine that integrates seamlessly with Karpenter, allowing you to run production workloads on the cheapest available Spot Instances without sacrificing stability. It’s this combination of proactive, infrastructure-aware, and safety-first features that elevates ScaleOps to a complete control plane for autonomous cloud efficiency.
Stop Analyzing, Start Automating
For years, the approach to Kubernetes cost management has been a hamster wheel of manual toil disguised as insight. You can start with Goldilocks to get a list of manual VPA changes. Layer in OpenCost to get a bill for the problems you haven’t fixed yet. And finally, deploy Kubecost’s resource-hungry stack to get a slightly prettier to-do list. They are all symptoms of a broken, reactive workflow that ends the same way: with an engineer manually patching YAML.
This isn’t a path to efficiency. The fundamental flaw of these tools is that they leave the most critical and highest-risk step, the actual optimization, entirely up to you.
ScaleOps was built to break this cycle. It is NOT another dashboard. It is an autonomous optimization platform that replaces the entire manual workflow of analysis, prioritization, and patching. It connects deep, contextual insight directly to safe, automated action, transforming your team’s focus from firefighting cluster costs to building the products that drive your business.
Why spend another day manually tuning what your platform should be handling for you? Stop managing your autoscalers and let our platform manage your resources.
- Get started with a full-featured free trial, or
- Book a demo to see how we can solve your most complex optimization challenges.