The two phases of Kubernetes cost optimization
At first, everything in your Kubernetes cluster looks fine. Workloads are running, autoscaling is on, the dashboards are green. Then the cloud bill arrives, and it is 40% higher than last quarter for the same traffic. Nothing broke. Nothing changed. The bill just kept growing.
This pattern is the default state of unmanaged Kubernetes infrastructure, and it has a specific cause. Kubernetes cost optimization is not a single project. It is two phases, and most teams only finish the first one.
- Kubernetes cost optimization fails in 2026 for predictable reasons. The bill keeps growing, manual cleanup gets undone within months, and the autoscalers everyone enables conflict with each other in ways most teams never audit.
- The bill comes back even after you cut it. Within 3-6 months of a manual cleanup pass, unmanaged clusters return to most of their pre-cleanup state.
- The four-autoscaler problem is where most cost erosion starts. Cluster Autoscaler, Karpenter, HPA, and VPA conflict in non-obvious ways.
- GPU rightsizing is the single largest underexploited cost lever. NVIDIA documents 0-10% GPU compute utilization for lightweight models on dedicated GPUs.
- Multi-cloud and multi-cluster make the problem qualitatively harder. Three clouds means three billing stacks, three allocation engines, three reconciliation workflows.
- Kubernetes cost optimization is two phases, and most teams only finish the first. Phase 1 cleanup delivers 30-50% savings. Phase 2 continuous management is what holds them. Performance protection comes before cost savings, always. Cost optimization that compromises application reliability is technical debt with a smaller cloud bill.
Phase 1 is cleanup. Setting sensible CPU and memory requests, enabling the right autoscalers, moving fault-tolerant workloads to Spot, cleaning up orphaned resources. This work typically delivers 30-50% savings, and it is mostly one-time effort. The wins are visible, the methodology is well documented, and most teams complete it.
Phase 2 is continuous management. Workloads drift. New services launch with default requests that nobody tunes. HPA, VPA, Cluster Autoscaler, and Karpenter interact in non-obvious ways. GPU and AI inference workloads break static assumptions entirely. Three to six months after Phase 1 ends, the bill quietly climbs back. Phase 2 is the work that prevents this, and most teams never start it because the drift is gradual and invisible.
The scale of the underlying problem is well documented. Spectro Cloud’s 2025 State of Production Kubernetes report, conducted independently by Adience, found cost overtook skills and security as the top Kubernetes challenge for 42% of organizations, with 88% reporting a year-over-year rise in total Kubernetes TCO. Sysdig’s Cloud-Native Security and Usage Report found that on average 69% of CPU cores in Kubernetes clusters were unused, with organizations running roughly 150 nodes potentially overspending by close to $1 million per year, and the largest deployments wasting upwards of $10 million annually on underutilized CPU alone. The waste is not theoretical. It is what unmanaged Kubernetes looks like at production scale.
Three forces make the Phase 2 problem worse in 2026 than it was two years ago:
- Workload heterogeneity. Stateless APIs, batch jobs, stateful services, inference endpoints, and GPU workloads now coexist in the same clusters. Generic rightsizing advice that worked for stateless services breaks on the others.
- Four-way autoscaler interactions. Cluster Autoscaler, Karpenter, HPA, and VPA are all in production use, often in the same cluster. Mixing them wrong is one of the most common reasons Phase 1 savings unwind.
- Multi-cluster sprawl. Dev, staging, production, regional, and per-team clusters multiply, and the resource management challenges multiply with them. Manual tuning that worked for one cluster does not scale to twenty.
This guide covers both phases. Sections 2 and 3 explain where Kubernetes spend actually comes from, including the storage and network costs most teams underweight. Section 4 walks through the Phase 1 cleanup checklist that delivers the first 30-50%. Sections 5 and 6 cover the four-autoscaler problem and the Phase 2 audit workflow. Section 7 covers FinOps and cost allocation. Section 8 covers cloud billing reconciliation across providers and clusters. Section 9 covers the workload patterns where generic best practices fall short. Sections 10 through 13 cover tools, pitfalls, and answers to the questions teams ask most.
Where Kubernetes spend actually comes from
Before any optimization work makes sense, the cluster bill has to be broken into its component parts. Most teams know “compute is the big one” and stop there. The other categories are smaller in percentage terms and larger in absolute dollars than expected, because they grow with usage patterns nobody is actively watching.
Kubernetes spend breaks into four categories. Each has a different waste profile, a different primary lever, and a different audit cadence.
| Cost driver | Typical share of cluster spend | Primary lever | Common waste pattern |
| --- | --- | --- | --- |
| Compute (worker nodes) | 60-80% | Rightsizing + bin packing + Spot | Overprovisioned requests, low node utilization, idle nodes held warm by stale requests |
| Storage | 10-20% | PV rightsizing, storage class selection, snapshot retention | Orphaned PVs, oversized volumes, wrong storage class, snapshot accumulation |
| Network egress | 5-15% | Topology-aware routing, cross-AZ awareness, ingress consolidation | Hidden cross-zone traffic, external API egress, per-service load balancers |
| Control plane | 2-10% | Cluster consolidation, namespace-based multi-tenancy | Cluster sprawl across teams and environments |
Compute
Compute is where 60-80% of the cluster bill lives, which is why Phase 1 cleanup levers 1-3 (rightsizing, bin packing, Spot) target it first. The waste pattern in compute is well understood: pod requests set above actual usage cause Cluster Autoscaler or Karpenter to provision more nodes than the workload needs, and those nodes do not get reclaimed because the requests do not change. Compute optimization is the largest single savings lever in Kubernetes cost management, and proper cost monitoring is what makes the waste visible enough to act on.
Storage
Storage is 10-20% of the bill and rarely audited. The cost mechanics differ from compute: storage is provisioned in fixed sizes that are hard to shrink, charged continuously regardless of utilization, and accumulates silently as workloads scale up and down. Most storage waste does not appear in compute dashboards because it sits in unbound PVs and oversized volumes that nobody is watching. Section 3 covers the specific patterns.
Network egress
Network is 5-15% of the bill and is the most surprising category for most teams. The cost is invisible at the workload level because pods do not show network spend in their metrics. It shows up in the cloud bill under a line item the platform team owns but did not directly cause. The largest hidden cost is usually cross-availability-zone traffic between services that should have been topology-aware. Section 3 covers the patterns.
Control plane
Control plane costs are small per cluster (managed Kubernetes services typically charge around $0.10 per hour per cluster on EKS and GKE Standard, and AKS offers a free tier alongside its paid tiers) but multiply quickly. The waste pattern is cluster sprawl: every team spins up its own cluster, every environment gets its own cluster, every region gets its own cluster. Twenty clusters at $100/month each is $24,000/year before any compute, storage, or network costs are counted. Consolidation through namespace-based multi-tenancy is the lever, and it pairs directly with the FinOps work in Section 7.
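The sprawl arithmetic is easy to rerun with your own fleet size. A minimal sketch, assuming a flat per-cluster fee and a 730-hour month; both helper names are hypothetical and exist only for illustration:

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_fee_from_hourly(hourly_rate: float) -> float:
    """Convert an hourly cluster management fee into an approximate monthly fee."""
    return hourly_rate * HOURS_PER_MONTH


def annual_control_plane_cost(clusters: int, monthly_fee_per_cluster: float) -> float:
    """Yearly control plane spend for a fleet, before compute, storage, or network."""
    return clusters * monthly_fee_per_cluster * 12
```

Calling `annual_control_plane_cost(20, 100)` reproduces the $24,000/year figure; substituting the output of `monthly_fee_from_hourly` gives the fee-only floor for a given provider rate.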
The four categories also drift at different rates. Compute waste accumulates monthly as workloads change. Storage and network waste accumulate quarterly as orphaned resources pile up. Control plane waste accumulates annually as new clusters get spun up and never decommissioned. Each needs its own audit cadence, which is the foundation for the Phase 2 workflow in Section 6.
Storage and network cost optimization
Storage and network rarely come up in the first cost-cut conversation. They reliably come up in the second, after compute optimization has hit diminishing returns and the team starts looking for the next 10-15%.
Both categories share a structural problem: the cost is real but invisible at the workload level. A pod does not show its persistent volume cost in its metrics. A service does not show its cross-zone egress cost in its dashboards. The waste only surfaces in the cloud bill, where it appears under line items the platform team owns but did not directly cause.
Storage cost optimization
Orphaned persistent volumes
When a PersistentVolumeClaim is deleted, the underlying PersistentVolume often stays around depending on the reclaim policy. The default reclaim policy for dynamically provisioned volumes is Delete, but plenty of clusters have Retain set for safety, which means deleted PVCs leave their PVs behind. StatefulSets that are scaled down also leave their PVs in place by design (the volumes are meant to persist across scaling events) but they often outlive the StatefulSet itself.
Audit quarterly: list all PVs not currently bound to a PVC. The query is one kubectl command per cluster. Typical waste in unaudited clusters: 5-15% of total storage spend, sometimes higher in environments that frequently spin up and tear down workloads.
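One way to script the audit is to post-process the JSON that `kubectl get pv -o json` prints. The sketch below assumes that output has been captured as a string; the `unbound_pvs` helper is a hypothetical name, and it only reports candidates, it does not delete anything:

```python
import json


def unbound_pvs(pv_list_json: str) -> list[dict]:
    """Return a summary of every PersistentVolume whose phase is not Bound."""
    items = json.loads(pv_list_json).get("items", [])
    candidates = []
    for pv in items:
        spec = pv.get("spec", {})
        phase = pv.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":  # Available, Released, and Failed PVs all land here
            candidates.append({
                "name": pv["metadata"]["name"],
                "phase": phase,
                "capacity": spec.get("capacity", {}).get("storage"),
                "reclaim_policy": spec.get("persistentVolumeReclaimPolicy"),
            })
    return candidates
```

Released volumes with a Retain policy are the usual suspects; confirm with the owning team before removing any of them.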
Oversized persistent volumes
PVs provisioned at 100Gi while the workload uses 4Gi are common. The cause is conservative defaults in Helm charts, copy-paste from other deployments, or precautionary sizing for workloads whose data footprint was unknown at deployment time. Once provisioned, most cloud providers do not allow PVs to shrink without recreating the volume and restoring data, so the oversizing persists indefinitely.
The lever is not to shrink existing PVs. The lever is to rightsize new PVs at deployment time and audit storage utilization quarterly so the pattern does not repeat. For workloads where data footprint genuinely grows over time, use volume expansion (allowVolumeExpansion: true in the StorageClass) to grow on demand instead of provisioning the maximum upfront.
Storage class selection
The same workload running on gp3 versus gp2 on AWS, Premium SSD versus Standard SSD on Azure, or balanced PD versus SSD PD on GCP can differ in cost by 40-60% with no performance impact for non-database workloads. Most clusters default to the highest-performance storage class for everything, which is the right choice for production databases and the wrong choice for log volumes, scratch space, build artifacts, and any non-latency-sensitive workload.
Audit current StorageClass usage by workload type. Move non-critical workloads to the lower-cost class. For workloads that genuinely need high-performance storage, keep them where they are.
Snapshot retention
Default snapshot policies accumulate silently. A daily snapshot retained for 90 days is 90 snapshots per volume, each billed for the data it stores. Multiplied across a cluster with hundreds of PVs, and especially on volumes with high data churn, snapshot spend can exceed live volume cost.
Audit retention windows quarterly. Tier snapshots aggressively: keep daily snapshots for 7-14 days, weekly snapshots for 30-90 days, monthly snapshots for compliance-driven retention windows only.
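The tiering rule can be expressed as a small keep-or-delete predicate. This is a sketch under stated assumptions: the weekly snapshot is taken to be Sunday's, the monthly snapshot is the one from the first of the month, and the 365-day monthly window stands in for whatever your compliance policy requires. None of these choices come from a specific backup tool.

```python
from datetime import date


def keep_snapshot(taken: date, today: date,
                  daily_days: int = 14,
                  weekly_days: int = 90,
                  monthly_days: int = 365) -> bool:
    """Decide whether a snapshot survives a tiered retention policy."""
    age = (today - taken).days
    is_weekly = taken.weekday() == 6   # Sunday (Monday is 0)
    is_monthly = taken.day == 1        # first of the month
    if age <= daily_days:
        return True                    # daily tier: keep everything recent
    if age <= weekly_days:
        # Monthly snapshots must survive this tier to exist later on.
        return is_weekly or is_monthly
    if age <= monthly_days:
        return is_monthly
    return False


def retained(snapshots: list[date], today: date) -> list[date]:
    """Filter a list of snapshot dates down to the ones the policy keeps."""
    return [s for s in snapshots if keep_snapshot(s, today)]
```

Running `retained` over 90 days of daily snapshots shows how many of them the tiered policy actually needs compared with a flat 90-day window.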
Network cost optimization
Cross-zone traffic
Multi-availability-zone clusters that do not use topology-aware routing pay for cross-AZ traffic on every service-to-service call. The default Kubernetes Service abstraction does not prefer pods in the same zone as the caller, so a service in zone A calling a service spread evenly across A, B, and C will hit zone B and zone C pods two-thirds of the time. Cross-zone traffic costs $0.01 per GB in each direction on AWS (effectively $0.02 per GB transferred), with similar pricing across the major clouds, which sounds trivial until it is multiplied by every microservice call in a high-traffic cluster.
The lever is topology-aware routing. On Kubernetes 1.34 and later, set trafficDistribution: PreferSameZone on the Service spec to prefer same-zone endpoints when capacity allows. Earlier versions (1.27-1.33) use the service.kubernetes.io/topology-mode: Auto annotation or, on versions that support the field, trafficDistribution: PreferClose. Service meshes (Istio, Linkerd, Cilium) provide more granular topology control. For most clusters, enabling topology-aware routing on internal services is the single largest network savings lever available.
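To size the opportunity before changing any Service, the two-thirds figure generalizes to (zones - 1) / zones when endpoints are spread evenly. A rough estimator, assuming even spread, no existing zone preference, and the AWS-style $0.01 per GB charged in each direction; the function names are hypothetical:

```python
def cross_zone_fraction(zones: int) -> float:
    """Share of calls that leave the caller's zone with evenly spread endpoints."""
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return (zones - 1) / zones


def monthly_cross_zone_cost(gb_per_month: float, zones: int,
                            rate_per_gb_each_way: float = 0.01) -> float:
    """Estimated monthly charge for service-to-service traffic crossing zones."""
    # Each GB that crosses a zone boundary is billed on both sides.
    return gb_per_month * cross_zone_fraction(zones) * rate_per_gb_each_way * 2
```

For example, `monthly_cross_zone_cost(100_000, 3)` estimates the bill for 100 TB of internal traffic per month in a three-zone cluster. Real traffic is rarely spread evenly, so treat the result as an order-of-magnitude check.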
External egress
Egress to the internet is charged at full bandwidth rates, typically $0.08-0.12 per GB on the major clouds. The largest sources in most Kubernetes clusters are external API calls (SaaS integrations, third-party services), container image pulls from public registries, and outbound webhook traffic.
The levers: use VPC endpoints or private connectivity for SaaS integrations where the provider supports it, mirror commonly pulled images to a private registry inside the cluster’s VPC, and batch outbound webhook traffic where the application design allows. For a deeper read on the network category, see Reduce Network Traffic Costs in Your Kubernetes Cluster.
Inter-cluster traffic
Multi-cluster meshes or service-to-service calls between clusters in different VPCs or projects incur both egress and ingress charges, often across cloud regions. This is a deliberate architectural cost and there is no single lever to eliminate it, but the typical pattern is that inter-cluster traffic grew organically as the cluster topology evolved and nobody audited whether the calls still needed to cross cluster boundaries.
Audit inter-cluster traffic patterns annually. Workloads that started as separate clusters for organizational reasons often consolidate cleanly once the original reason no longer applies.
Load balancer proliferation
Every Service of type LoadBalancer provisions a cloud load balancer with its own hourly cost ($16-25/month on the major clouds). Clusters with 50 LoadBalancer services pay $800-1,250/month in load balancer fees alone, often for services that could have shared a single ingress controller.
The lever is ingress consolidation. Use an Ingress controller (NGINX, Traefik, AWS ALB Controller, GCP HTTP Load Balancer Controller) to route multiple services through a single cloud load balancer. Reserve LoadBalancer services for workloads that genuinely need their own entry point.
Storage and network are the second-pass categories. They will not deliver the same savings as compute optimization, but they are also less likely to come back. Compute waste regenerates monthly as workloads change. Storage and network waste, once cleaned up, stays cleaner for longer because the patterns that cause it (oversized PVs, missing topology hints, LB-per-service) are architectural decisions rather than per-workload tuning.
Phase 1 — The cleanup checklist (one-time savings, 30-50%)
Most Kubernetes clusters have never been through a deliberate cleanup. Requests and limits were set once at deployment and never revisited. Autoscalers were enabled with default thresholds. Nobody ever audited which workloads were idle. This is the work that delivers the largest one-time savings in Kubernetes cost optimization.
The eight levers below are ordered by typical savings impact, not by difficulty. Run them in this order, measure after each, and move on once the cluster has stabilized at the new baseline. The levers can be applied manually as a discrete project, or applied automatically by a workload management platform that handles both Phase 1 cleanup and Phase 2 continuous management from day one. The methodology is the same in both cases; the operational model is what differs.
1. Rightsize CPU and memory requests
This is where 30-50% of compute savings live. Most workloads run with requests set well above actual usage, either because the original developer padded the values to avoid throttling and OOM kills, or because requests were copied from a template and never revisited. Use Prometheus, Kubecost, OpenCost, or Vertical Pod Autoscaler recommendations to measure actual usage over a 7-30 day window, then rightsize requests to match real consumption with sensible headroom (typically 20-30% above the 95th percentile).
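The sizing rule (95th percentile plus headroom) can be sketched as follows. It assumes usage samples have already been exported from your metrics system as a list of numbers in a single unit (cores or bytes); the nearest-rank percentile method and both function names are choices made for this illustration, not something a particular tool prescribes.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of usage samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def recommended_request(samples: list[float],
                        headroom: float = 0.25,
                        pct: float = 95.0) -> float:
    """Request = chosen percentile of observed usage plus fractional headroom."""
    return percentile(samples, pct) * (1 + headroom)
```

A `headroom` of 0.25 sits in the middle of the 20-30% range; latency-sensitive services may warrant more.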
Limits are a separate decision. Setting CPU limits often causes throttling that hurts performance more than it saves money. Setting memory limits prevents OOM kills from cascading across nodes. The current practice in most production clusters is to set memory limits but not CPU limits.
VPA can apply rightsizing recommendations automatically, but it interacts with HPA in ways that need attention before turning auto-apply on. Section 5 covers the interactions and the right combination for each workload pattern.
2. Bin-pack workloads onto fewer nodes
Once requests are rightsized, the cluster usually has more nodes than it needs. Node-level utilization typically improves by 20-40% after the first rightsizing pass, which means workloads can consolidate onto fewer, fuller nodes. This is the largest savings lever after rightsizing because compute is 60-80% of a typical Kubernetes bill, and idle node capacity is pure waste.
There are two ways to do this. The manual path: drain underutilized nodes and let the scheduler repack workloads. The automated path: enable a node provisioner like Karpenter that consolidates aggressively, or use a workload-aware platform that handles bin-packing continuously.
3. Move fault-tolerant workloads to Spot
Spot Instances are 60-90% cheaper than on-demand for the same compute. The catch is that they can be interrupted on short notice: two minutes on AWS, and less on some other clouds. Fault-tolerant workloads handle this fine: stateless web services, batch jobs, CI runners, ML training, anything with checkpoints or replicas. Stateful workloads (databases, message queues, single-replica services) should stay on-demand unless your team has built explicit failover patterns.
The lever is not just “use Spot.” It is matching the right workload to the right capacity type, designing for interruption with PodDisruptionBudgets and topology spread, and accepting that some workloads will always cost full price. Per-workload Spot strategy (which workloads can run on Spot at any given moment, when to fall back, how to anticipate interruptions) gets harder as cluster scale and workload diversity grow, and is one of the levers continuous management platforms handle automatically.
4. Add commitment discounts on stable baseline
Cloud providers offer discounts of 30-60% for committing to baseline capacity for one or three years: AWS Savings Plans, Google Cloud Committed Use Discounts, Azure Reservations. This is close to free money on capacity you would buy anyway, with one caveat: you have to be honest about what is actually stable. Overcommit and you pay for capacity you do not use. Undercommit and you leave savings on the table.
The right framing: measure your minimum 30-day baseline across the cluster, commit to 70-80% of that, and let on-demand or Spot cover the rest.
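As a sketch of that framing, assuming you have one usage figure per day (for example, average vCPUs in use) in chronological order; `commitment_target` is a hypothetical helper and the 0.75 default is simply the midpoint of the 70-80% range:

```python
def commitment_target(daily_usage: list[float],
                      coverage: float = 0.75,
                      window_days: int = 30) -> float:
    """Capacity to commit to: a fraction of the lowest day in the recent window."""
    if not daily_usage:
        raise ValueError("no usage data")
    recent = daily_usage[-window_days:]  # most recent days only
    baseline = min(recent)               # the floor the cluster never dropped below
    return baseline * coverage
```

Anything above the returned figure is left to on-demand or Spot capacity.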
5. Sleep dev and staging clusters off-hours
Non-production environments typically sit idle 12-16 hours per workday and all weekend. Shutting them down outside business hours delivers 50-70% savings on non-prod compute. The simplest implementation: a CronJob that scales deployments to zero at 7pm and back up at 7am, plus a node group that scales down behind it. More sophisticated patterns use namespace-level sleep policies with developer overrides for after-hours debugging.
This lever is almost always underused because no single team owns non-prod infrastructure, so nobody is incentivized to optimize it.
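The savings range follows directly from the schedule. A quick check, assuming compute cost is proportional to the hours an environment is up and that it runs only on workdays; the function name is hypothetical:

```python
HOURS_PER_WEEK = 7 * 24


def off_hours_savings(start_hour: int = 7, stop_hour: int = 19,
                      workdays: int = 5) -> float:
    """Fraction of weekly compute hours avoided by sleeping outside the window."""
    on_hours = (stop_hour - start_hour) * workdays
    return 1 - on_hours / HOURS_PER_WEEK
```

The default 7am-7pm weekday schedule keeps the environment up for 60 of 168 hours, which lands inside the 50-70% range quoted above.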
6. Enforce ResourceQuotas and LimitRanges per namespace
Quotas do not save money directly. They prevent the next regression. A ResourceQuota caps total CPU and memory consumption per namespace. A LimitRange sets default and maximum request values for any pod that does not specify them. Together they make it impossible for a single team or workload to silently inflate the cluster bill.
Apply quotas to every namespace, not just shared ones. Set defaults aggressively (small) and let teams raise them with justification. The friction is the feature.
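A minimal sketch of what the pair looks like, generated as JSON so the same script can stamp it onto every namespace. All names and numeric values here are illustrative assumptions, not recommendations. The LimitRange deliberately sets no CPU default limit or maximum, in line with the memory-limits-only practice described under lever 1.

```python
import json


def namespace_guardrails(namespace: str) -> list[dict]:
    """Build a ResourceQuota and a LimitRange for one namespace."""
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        "spec": {"hard": {              # namespace-wide caps (illustrative)
            "requests.cpu": "8",
            "requests.memory": "16Gi",
            "limits.memory": "32Gi",
        }},
    }
    limit_range = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {"limits": [{
            "type": "Container",
            # Applied to containers that omit requests: small on purpose.
            "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
            # Applied to containers that omit a memory limit.
            "default": {"memory": "256Mi"},
            # Hard ceiling per container.
            "max": {"memory": "4Gi"},
        }]},
    }
    return [quota, limit_range]


def to_kubectl_json(namespace: str) -> str:
    """Wrap both objects in a v1 List that kubectl can apply in one call."""
    manifest = {"apiVersion": "v1", "kind": "List",
                "items": namespace_guardrails(namespace)}
    return json.dumps(manifest, indent=2)


if __name__ == "__main__":
    print(to_kubectl_json("team-a"))  # "team-a" is a placeholder namespace
```

The printed JSON can be piped to `kubectl apply -f -` once the values have been reviewed.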
7. Clean up orphaned PVs, unused images, and idle workloads
Kubernetes does not garbage collect aggressively by default. Persistent Volume Claims get deleted but their PVs linger. Container images pile up on node disks. Deployments get scaled to zero and forgotten. This silent accumulation typically represents 5-15% of monthly waste.
Audit quarterly: list PVs not bound to any PVC, deployments with zero replicas for more than 30 days, images on nodes that have not been pulled in 90 days. Delete or archive accordingly.
8. Pick the right autoscaler for each workload pattern
This is the lever that determines whether Phase 1 holds or unravels. Cluster Autoscaler and Karpenter are not interchangeable. HPA and VPA solve different problems. Mixing them wrong is one of the most common reasons cost optimization work erodes within months.
Section 5 covers the four-autoscaler problem in detail. The short version: use Cluster Autoscaler for stable workloads with predictable instance shapes, use Karpenter for bursty or Spot-heavy workloads, use HPA for traffic-driven stateless services, and configure VPA according to the combinations covered in Section 5.
Phase 1 is finite work. A well-resourced team can complete all eight levers in 4-8 weeks on a single cluster. The savings are real and they show up on the next monthly bill. What this section does not promise is that the savings will hold. That is Phase 2.
How Cluster Autoscaler, Karpenter, HPA, and VPA actually fit together
The four autoscalers in production Kubernetes are not interchangeable. They solve different problems, scale different things, and respond to different signals. Most teams enable two or three of them with default settings, never audit the interactions, and then wonder why the cluster bill keeps climbing despite “autoscaling being on.”
Tweaking dials on individual autoscalers is not the answer. Understanding what each one does, where they conflict, and which combinations make sense for which workload patterns is the actual lever.
What each autoscaler does
| Autoscaler | What it scales | Trigger | Scope | Best for | Cost impact |
| --- | --- | --- | --- | --- | --- |
| Cluster Autoscaler | Nodes | Pending pods that cannot be scheduled | Node groups with predefined instance types | Stable workloads, predictable instance shapes, single-cloud environments | Removes idle nodes when utilization drops, slower scale-up than Karpenter |
| Karpenter | Nodes | Pending pods that cannot be scheduled | Just-in-time provisioning across any instance type | Bursty workloads, Spot-heavy clusters, instance-flexible workloads | Faster scale-up, better instance fit, deeper Spot savings, aggressive consolidation |
| Horizontal Pod Autoscaler (HPA) | Pod replica count | CPU, memory, or custom metrics per deployment | Per-deployment, within a namespace | Stateless, traffic-driven workloads (APIs, web services) | Prevents over-replicated baselines, scales out during traffic spikes |
| Vertical Pod Autoscaler (VPA) | CPU and memory requests per pod | Historical usage data | Per-pod, applied at pod restart | Workloads with stable but unknown resource needs | Reduces per-pod overprovisioning, reduces node count indirectly |
Cluster Autoscaler and Karpenter both manage nodes, but they are not redundant. Cluster Autoscaler works within static node groups defined upfront. Karpenter provisions nodes on demand across any instance type the workload can run on. Cluster Autoscaler has the longer deployment history. Karpenter has rapidly gained adoption since 2023 in environments where instance flexibility, Spot strategy, and faster scale-up matter more than the operational simplicity of static node groups.
Both are valid. The right choice depends on workload patterns, not on which is newer.
HPA and VPA solve genuinely different problems. HPA changes how many replicas exist. VPA changes how big each replica is. They are not alternatives to each other; they are complements that operate on different axes.
The interaction problem
The reason Phase 1 cost savings unwind in production is almost always autoscaler conflict, not bad rightsizing. Three patterns to watch for:
Pattern 1: HPA scaling out while VPA sizes up the same pods. HPA decides the deployment needs 10 replicas instead of 4 based on CPU usage. VPA simultaneously decides each pod needs 2x the CPU request based on historical data. The result is a 5x increase in the CPU the cluster has to reserve (2.5x the replicas times 2x the request per pod) when the real demand only justified 2-3x. This is the most common cause of “autoscaling spiraled” incidents.
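The mitigation depends on what tooling is in place. With stock VPA, run it in recommendation-only mode on any deployment that also has HPA, apply its recommendations manually during regular review cycles, and auto-apply VPA only on workloads without HPA. Workload-aware platforms that coordinate vertical and horizontal scaling resolve the conflict directly: they reconcile replica count and per-pod sizing as a single decision instead of two independent loops, which makes running HPA and continuous vertical management together safe.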
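The compounding in Pattern 1 is multiplicative, which is why two individually reasonable decisions produce an unreasonable total. A small illustration, with a hypothetical function name:

```python
def combined_cpu_growth(old_replicas: int, new_replicas: int,
                        request_multiplier: float) -> float:
    """Growth in total requested CPU when replicas and per-pod requests both rise."""
    if old_replicas <= 0:
        raise ValueError("old_replicas must be positive")
    replica_growth = new_replicas / old_replicas
    return replica_growth * request_multiplier
```

Going from 4 to 10 replicas is 2.5x on its own; doubling each pod's request on top of that gives `combined_cpu_growth(4, 10, 2.0)`, a 5x jump in requested CPU.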
Pattern 2: Cluster Autoscaler holding nodes warm because of stale pod requests. A workload has historically used 4 CPU but its pod request is set to 8 CPU. Cluster Autoscaler sees the request and refuses to consolidate the node, because the request makes the workload look bigger than it is. The cluster carries 30-50% more nodes than it needs, indefinitely.
The mitigation: rightsize requests aggressively (Phase 1 lever #1) before relying on Cluster Autoscaler’s consolidation logic. Cluster Autoscaler is only as smart as the request data it reads. The same constraint applies to Karpenter consolidation.
Pattern 3: Karpenter consolidating nodes during HPA scale-up. Karpenter’s consolidation logic moves pods to fewer, larger nodes when utilization drops. HPA simultaneously decides to scale out because of a traffic spike. Pods get evicted from consolidating nodes at the exact moment new pods are being scheduled. Application latency spikes, sometimes triggering further HPA scale-out.
The mitigation: use PodDisruptionBudgets to set minimum availability during voluntary evictions, tune Karpenter’s consolidation timing to avoid peak traffic windows, and ensure HPA target utilization values leave enough headroom that one evicted pod does not push the deployment past its scaling threshold.
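The headroom check in the last point can be approximated before touching any configuration. This sketch assumes load redistributes evenly across the surviving pods and uses the HPA's default 10% tolerance around the target; it ignores readiness delays and stabilization windows, so treat it as a first-pass screen. The function names are hypothetical.

```python
HPA_TOLERANCE = 0.1  # HPA default: no scaling while the ratio is within 10% of 1.0


def utilization_after_eviction(replicas: int, current_utilization: float,
                               evicted: int = 1) -> float:
    """Per-pod utilization once the evicted pods' load lands on the survivors."""
    if evicted >= replicas:
        raise ValueError("cannot evict every replica")
    return current_utilization * replicas / (replicas - evicted)


def eviction_triggers_scale_out(replicas: int, current_utilization: float,
                                target_utilization: float,
                                evicted: int = 1) -> bool:
    """True if losing `evicted` pods pushes the HPA ratio past its tolerance."""
    after = utilization_after_eviction(replicas, current_utilization, evicted)
    return after / target_utilization > 1 + HPA_TOLERANCE
```

With a 70% target, a 4-replica deployment running at 60% crosses the threshold when one pod is evicted, while a 10-replica deployment at the same utilization does not. Small deployments need proportionally more headroom.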
Which combinations make sense
The four-autoscaler problem is solvable. The general principles:
- Cluster Autoscaler + HPA is the safest combination for most production workloads. Predictable, well-understood, low interaction risk.
- Karpenter + HPA is the right choice for bursty, Spot-heavy, or AI inference workloads. Requires more tuning but unlocks significant Spot savings.
- VPA in recommendation-only mode belongs on every deployment as observability, not auto-apply. Auto-apply VPA only on workloads without HPA, and only after testing in non-prod.
- Cluster Autoscaler + Karpenter in the same cluster is possible but adds complexity. Most teams choose one for a given cluster. If both are running, partition workloads by node group clearly.
The autoscaler choice is the foundation. What runs on top of that foundation is the lever that determines whether Phase 1 savings hold into Phase 2. Static autoscaler configuration responds to the inputs it is given (pod requests, replica counts, scheduling pressure). When those inputs are stale or wrong, the autoscalers compound the waste instead of reducing it. This is true whether you run Cluster Autoscaler or Karpenter for node management, and it is the gap that continuous workload management (Section 6) is designed to fill.
For a deeper opinion piece on why static autoscaler configuration breaks at production scale, see Tweaking Dials Isn’t Enough for Optimizing Kubernetes Costs. For a head-to-head comparison of Cluster Autoscaler and Karpenter, see Karpenter vs Cluster Autoscaler: 2026 Comparison Guide.
Phase 2 — The Kubernetes cost optimization audit workflow
Phase 1 is a project with a finish line. Phase 2 is a loop without one. The work is structurally different, and most teams never start the loop because nobody owns it.
Phase 1 has a clear ending: every workload has rightsized requests, the right autoscaler is enabled, Spot is in use where appropriate, orphaned resources are cleaned up, quotas are enforced. The savings show up on the next monthly bill. The team moves on to other priorities.
Phase 2 has no ending. New services launch every week with default requests that nobody tunes. Existing workloads drift as traffic patterns change. Autoscalers respond to inputs that may or may not still be accurate. The cluster slowly returns to its pre-Phase-1 state, not because anything broke, but because nothing is actively holding it in place.
The audit workflow below is the work that holds Phase 1 savings in place. Run it monthly at minimum. Quarterly is the absolute longest interval before drift becomes material.
The monthly loop
Step 1. Measure actual utilization across the last 30 days. Pull CPU, memory, and GPU utilization for every workload in the cluster. Tools: Prometheus with kube-state-metrics, Kubecost or OpenCost-style cost monitoring, or the cloud provider’s native monitoring (CloudWatch Container Insights on EKS, Cloud Monitoring on GKE, Azure Monitor on AKS). The output is a per-workload table of actual usage versus requested resources.
Step 2. Rank workloads by absolute dollar waste. This is the step most teams get wrong. A workload running at 5% CPU utilization sounds like a high-priority fix until you discover it costs $40/month. A workload running at 60% utilization with a $20,000/month footprint is the bigger target even though its waste percentage is smaller. Rank by absolute dollars wasted, not percentage waste.
Step 3. Flag idle workloads. Any workload under 5% utilization for more than 7 consecutive days is a candidate for removal, scaling to zero, or consolidation. Some of these are intentional (failover replicas, disaster recovery instances). Most are not. Cross-reference with the team that owns the workload before taking action.
Step 4. Compare actual usage to requests and limits for the top 10 wasteful workloads. For each high-dollar workload, calculate the ratio of requested resources to actual P95 utilization. Anything above 3x is overprovisioned. Anything above 5x is severely overprovisioned and should be the first to rightsize.
Step 5. Rightsize, consolidate, or remove. Apply one of three actions per workload: rightsize requests to match actual usage with appropriate headroom, consolidate into a shared deployment if the workload is duplicative, or remove if the workload is idle and nobody claims it. For workloads with HPA enabled, use the stock VPA + HPA guidance from Section 5 or rely on a workload-aware platform that coordinates both.
Step 6. Recheck node utilization. After rightsizing, node utilization should improve. Bin-packing logic in Cluster Autoscaler or Karpenter should consolidate workloads onto fewer nodes. If node count does not drop after a rightsizing pass, audit the autoscaler configuration for the stale-request problem covered in Section 5 Pattern 2.
Step 7. Audit autoscaler interactions. Look for HPA/VPA conflicts on workloads with both enabled. Look for Cluster Autoscaler or Karpenter holding nodes warm because of stale pod requests. Look for Karpenter consolidating during HPA scale-up events. The three patterns from Section 5 cover the most common failure modes.
Step 8. Repeat next month. The first audit after Phase 1 cleanup typically finds the largest savings. Subsequent audits find smaller, more incremental savings. After three or four monthly cycles, the cluster reaches a steady state where each audit catches the drift from new services, traffic changes, and autoscaler interactions rather than the original waste.
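Steps 2 through 4 are mechanical enough to script once the per-workload table from Step 1 exists. The sketch below makes several assumptions: the `Workload` record and its field names are hypothetical, waste is approximated from CPU alone as the share of the request that P95 usage never touches, and the ratio in Step 4 is read as requested resources divided by P95 usage.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    name: str
    monthly_cost: float    # dollars attributed to this workload
    requested_cpu: float   # cores requested
    p95_cpu: float         # cores used at the 95th percentile
    daily_utilization: list[float] = field(default_factory=list)  # 0.0-1.0 per day

    @property
    def wasted_dollars(self) -> float:
        """Step 2: dollars spent on requested capacity that P95 usage never reaches."""
        used_share = min(self.p95_cpu / self.requested_cpu, 1.0)
        return self.monthly_cost * (1 - used_share)


def rank_by_waste(workloads: list[Workload]) -> list[Workload]:
    """Step 2: biggest absolute dollar waste first, regardless of percentage."""
    return sorted(workloads, key=lambda w: w.wasted_dollars, reverse=True)


def is_idle_candidate(daily_utilization: list[float],
                      threshold: float = 0.05, min_days: int = 7) -> bool:
    """Step 3: under the threshold for more than `min_days` consecutive days."""
    longest = current = 0
    for value in daily_utilization:
        current = current + 1 if value < threshold else 0
        longest = max(longest, current)
    return longest > min_days


def classify(workload: Workload) -> str:
    """Step 4: bucket by the ratio of requested CPU to P95 usage."""
    if workload.p95_cpu <= 0:
        return "severely overprovisioned"  # requests with no measurable usage
    ratio = workload.requested_cpu / workload.p95_cpu
    if ratio > 5:
        return "severely overprovisioned"
    if ratio > 3:
        return "overprovisioned"
    return "within range"
```

Feeding the two workloads from Step 2 through `rank_by_waste` puts the $20,000/month service ahead of the $40/month one, even though the smaller one has the worse utilization percentage.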
Where this loop falls apart
Most teams know the workflow above. The reason it does not run monthly in most clusters is operational, not technical.
No single owner. Platform teams own the cluster. Application teams own the workloads. Cost optimization sits in the gap between them. Nobody is measured on cluster utilization, so nobody runs the audit.
Visibility tools generate alerts, but require humans to act. Kubecost, OpenCost, and cloud-native monitoring can identify overprovisioned workloads automatically. They cannot rightsize the workloads. The gap between visibility and action is where most savings get stuck.
Workload count grows faster than the audit cycle. A cluster with 50 workloads can be audited manually in a few hours. A cluster with 500 workloads cannot. A cluster with 5,000 workloads is impossible to audit manually at any frequency that matters.
New services launch faster than they get audited. In an environment where teams ship multiple times per week, every audit is already chasing a moving target. By the time the audit completes, new workloads have launched with default requests.
Autoscaler interactions are not obvious. Even teams that run the audit regularly often miss the HPA/VPA conflicts and stale-request problems from Section 5. The patterns are subtle, and the symptoms (a slightly higher bill than expected, slightly more nodes than the workload count suggests) are easy to ignore.
When manual audits stop being enough
For clusters with fewer than 100 workloads on a single cloud with stable traffic patterns, monthly manual audits can be tractable. The team can run them in a half-day each month if the cadence holds. In practice, most teams find the cadence harder to maintain than the math suggests, because the audit competes with feature work and the cost of skipping a month is invisible until quarters later.
For clusters with hundreds of workloads, multi-cluster topologies, AI inference workloads, or rapid deployment velocity, monthly manual audits stop working. The work either does not get done (most common), gets done late (second most common), or gets done but cannot keep up with the rate of change.
This is the boundary where continuous workload management becomes the natural answer. Platforms that monitor every workload continuously, coordinate vertical and horizontal scaling decisions, and apply changes automatically remove the operational bottleneck that breaks Phase 2 in most production environments. They do not replace Cluster Autoscaler or Karpenter; they layer on top of whichever node provisioner is already in place and feed it accurate inputs. Section 9 covers the tool category in more detail.
The audit workflow above is not optional even with automation in place. It is the framework for understanding what good looks like, whether the cluster is operating within it, and where the remaining gaps are. Automation handles the rate-of-change problem. The framework handles the accountability problem.
FinOps for Kubernetes: chargeback, showback, and cost allocation
Phase 1 and Phase 2 are technical work. FinOps is the layer that determines whether anyone actually does either of them.
Most Kubernetes cost optimization fails not because teams cannot rightsize workloads, but because no team is accountable for the bill. Platform teams own the cluster but not the workloads consuming it. Application teams own the workloads but not the cluster budget. Cost optimization sits in the gap, which is why audits do not run and savings do not stick.
FinOps closes the accountability gap. The technical foundation is cost allocation: attributing every dollar of cluster spend to a team, service, environment, or business unit. The organizational foundation is chargeback or showback: making that attribution visible enough that the people consuming the cluster see what they are spending.
Cost allocation by namespace, label, and workload
Kubernetes cost allocation works through three primary signals: namespace, labels, and workload owner. Each has different operational properties.
Namespace-level allocation is the simplest and most common pattern. Every workload runs in a namespace, every namespace maps to a team or service, and total namespace cost is the sum of pod resource consumption within it. Tools like Kubecost and OpenCost handle this natively. The limitation is that namespace boundaries do not always match accountability boundaries: shared namespaces (kube-system, monitoring, ingress-nginx) generate costs that no single team owns.
Label-based allocation lets teams attribute cost more granularly. Standardize on labels like team, service, environment, and cost-center, apply them to every workload, and allocate cost by label values. This handles the shared-namespace problem (the monitoring stack can be labeled by which team owns it) and supports multi-dimensional reporting (cost by team AND by environment in a single query). The limitation is enforcement: labels only work if every workload has them, which requires admission controllers or policy enforcement to maintain.
Workload-owner allocation combines both. Every workload has an explicit owner (team, on-call rotation, or service registry entry), allocation queries roll up by owner, and the cluster bill maps cleanly to organizational responsibility. This is the model most production teams converge on.
The lever for allocation is consistency. Pick a labeling standard, enforce it via admission controllers (OPA Gatekeeper, Kyverno), and audit compliance quarterly. Without consistent labels, allocation reports are guesses.
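As a minimal sketch of what label enforcement can look like, the policy below uses Kyverno to reject pods that are missing the four allocation labels named earlier. The policy name and message are hypothetical, and the placement of the failure-action field has changed across Kyverno releases, so treat the exact schema as an assumption to check against the installed version.

```yaml
# Hypothetical Kyverno policy: reject pods without cost-allocation labels.
# Assumes a Kyverno version that accepts spec.validationFailureAction;
# newer releases prefer a per-rule failure action instead.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-allocation-labels   # illustrative name
spec:
  validationFailureAction: Enforce       # use Audit first to measure compliance
  background: true
  rules:
    - name: check-cost-labels
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must set the team, service, environment, and cost-center labels."
        pattern:
          metadata:
            labels:
              # "?*" means the label must exist and be non-empty
              team: "?*"
              service: "?*"
              environment: "?*"
              cost-center: "?*"
```

Matching on Pod is deliberate: by default Kyverno also generates equivalent rules for pod controllers such as Deployments, so the check applies to the pod template labels that allocation tools typically read. Starting in Audit mode and switching to Enforce once compliance is high avoids blocking deploys on day one.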
Showback versus chargeback
Both share the same technical foundation. The difference is what happens after the report is generated.
Showback reports cost back to teams without billing them. The team that runs the recommendations service, for example, sees that it costs $40,000/month, or 30% of the cluster bill, and is expected to act on it. Showback works when engineering teams have intrinsic motivation to manage spend. It fails when teams see their costs and shrug.
Chargeback actually bills teams for their consumption. The recommendations team gets a $40,000/month line item on their internal budget and has to justify it. Works when teams have budget authority to make tradeoffs and allocation reporting is accurate enough to defend. Fails when teams dispute the numbers or have no authority to act on them.
Most production organizations land on a hybrid: showback for engineering visibility, chargeback for finance reporting.
Budget guardrails through ResourceQuotas and LimitRanges
Cost allocation tells teams what they spent. Budget guardrails prevent the spending in the first place. Kubernetes provides two native primitives.
ResourceQuota caps total resource consumption per namespace. A namespace with cpu: 100 cannot collectively request more than 100 CPU across all its workloads, regardless of how many pods get deployed. Once the quota is hit, the API server rejects new pods at admission with an explicit exceeded-quota error. For pods created by a Deployment, that error shows up in the ReplicaSet's events rather than as a Pending pod.
LimitRange sets default request and limit values for containers that do not specify them, and enforces minimum and maximum values on those that do. A namespace with a LimitRange setting the default CPU request to 100m and the maximum to 2 will apply 100m to any container missing a request, and reject any pod with a container asking for more than 2 CPU.
Together they make it operationally hard to silently inflate the cluster bill. A misconfigured deployment that tries to request 1000 CPU gets rejected at admission. A new team that does not specify requests gets reasonable defaults instead of unbounded consumption.
Here is a concrete example for a development namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-team-quota
  namespace: dev-team-alpha
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    limits.cpu: "100"
    limits.memory: "200Gi"
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-team-limits
  namespace: dev-team-alpha
spec:
  limits:
    - type: Container
      default:
        cpu: 200m
        memory: 256Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: 4
        memory: 8Gi
This pair caps the dev-team-alpha namespace at 50 CPU and 100Gi memory in aggregate requests, applies sensible defaults to any pod that does not specify requests, and rejects any single container requesting more than 4 CPU or 8Gi memory. The friction is the feature: teams that need more capacity have to make the request explicitly, which surfaces the cost decision instead of hiding it.
Apply quotas and limit ranges to every namespace, not just shared ones. Set defaults aggressively (small) and let teams raise them with justification. The audit cadence is quarterly.
Where FinOps fits in the two-phase model
Phase 1 cleanup gets the cluster into a baseline of efficiency. Phase 2 management keeps it there. FinOps determines whether the work happens at all.
Without cost allocation, teams cannot see their drift. Without showback or chargeback, they have no incentive to fix it. Without budget guardrails, the cluster has no mechanism to prevent regression. The technical work in Phases 1 and 2 only delivers durable savings when the FinOps foundation is in place.
For teams starting fresh, the order matters: establish allocation first (so the data is trustworthy), pick showback or chargeback (so the data has consequences), apply ResourceQuotas and LimitRanges (so the data drives default behavior), and only then run the technical optimization loops. Running the loops without the FinOps layer produces savings that erode within a quarter.
Cloud billing reconciliation across providers and clusters
Kubernetes cost allocation and cloud billing are two different problems that have to reconcile to the same number. Kubecost and OpenCost tell teams what they spent inside the cluster. The cloud provider’s billing tools tell finance what the company was charged. When these numbers do not match (and they often do not), chargeback credibility collapses.
Each cloud has its own billing stack, with its own data model, its own export mechanism, and its own gaps when it comes to Kubernetes context.
AWS Cost and Usage Report (CUR) is the most granular billing data AWS provides. It exports line-level usage to S3 on an hourly cadence, with full resource IDs, tags, and pricing dimensions. AWS Cost Explorer is the visual layer on top, useful for spend trends and forecasting but limited for deep allocation work. The gap on EKS: CUR sees EC2 instances and EBS volumes, not pods. Kubernetes-level allocation has to come from Kubecost, OpenCost, or a workload management platform that joins pod-level consumption to instance-level billing.
GCP Billing exports detailed usage to BigQuery, where teams can query spend by project, label, or SKU. GKE Cost Allocation extends this with namespace and workload-level visibility inside GKE clusters, and it is the most mature native Kubernetes cost reporting of any cloud. The gap: it only covers GKE. Multi-cloud teams still need an external allocation layer to compare GKE spend to EKS and AKS spend in a single view.
Azure Cost Management provides cost analysis and budget alerts in the portal, with exports to storage accounts for downstream analysis. AKS-specific allocation requires either the AKS cost analysis preview features or third-party tooling. The gap is similar to AWS: Azure billing sees VMs and managed disks, not Kubernetes pods.
The reconciliation problem compounds with multi-cluster sprawl. A platform team running clusters across AWS, GCP, and Azure has three separate billing stacks, three separate cost allocation engines, and three different ways to map cluster cost back to teams. Kubecost and OpenCost handle this inside each cluster but do not consolidate across clouds. Native cloud billing tools handle each cloud but do not understand Kubernetes.
This is where workload management platforms with integrated billing become operationally valuable. ScaleOps connects to each cloud provider’s billing data, layers it against per-workload utilization across every cluster the platform manages, and produces a unified multi-cluster view of Kubernetes spend that reconciles to the cloud bill. The same view shows where waste is concentrated by workload, by namespace, and by cluster, so the optimization decisions and the chargeback reports use the same numbers. For teams running Kubernetes on more than one cloud or more than a handful of clusters, this consolidation removes a category of reconciliation work that otherwise consumes a measurable share of the FinOps team’s monthly cycle.
The audit cadence is monthly at minimum. Reconcile Kubernetes-internal allocation totals against the cloud bill line items, flag any gaps above 5%, and trace the gaps to either labeling problems (workloads without owners) or shared infrastructure costs (control plane fees, load balancers, egress) that need explicit allocation rules.
When generic rightsizing isn’t enough: GPU, stateful, batch, and bursty workloads
The rightsizing advice in Section 4 assumes a particular workload shape: stateless, CPU and memory bound, with steady traffic patterns and the ability to absorb a pod restart without consequence. This is what runs in most production clusters most of the time. It is also where the easiest cost savings live.
The rest of the cluster is harder. Stateful databases, batch jobs, GPU workloads, and bursty inference endpoints each break one or more of those assumptions. Generic best practices either underperform or actively cause problems when applied to them. The next 15-20% of cluster savings, after the generic rightsizing pass, lives in treating these workload categories distinctly.
GPU workloads
GPU memory is typically the binding constraint, not GPU compute. NVIDIA’s own technical guidance notes that GPU compute utilization in production Kubernetes environments often hovers in the 0-10% range when lightweight models run on dedicated GPUs, because the scheduler maps a model to one or more whole GPUs and cannot easily share GPUs across models. A workload requesting one full A100 might use a small fraction of its compute capacity and a portion of its memory, paying for capacity it cannot use. Generic VPA does not help because VPA operates on CPU and memory requests, not GPU allocation.
The levers for GPU cost optimization are structurally different:
- Fractional GPU allocation. NVIDIA Multi-Instance GPU (MIG), NVIDIA Multi-Process Service (MPS), and time-slicing let multiple workloads share a single GPU. The right choice depends on workload pattern: MIG for hard isolation between workloads, MPS for compatible workloads that can share GPU context, time-slicing for development and inference workloads that tolerate scheduling latency. The mechanics, tradeoffs, and configuration for each are covered in Kubernetes GPU Sharing: MIG vs. MPS vs. Time-Slicing Explained.
- GPU memory rightsizing. Most GPU workloads request full GPUs out of habit. Measuring actual GPU memory usage and rightsizing to fractional allocations typically delivers 50-70% savings on GPU-bound workloads.
- Inference cost-per-token tracking. For LLM inference workloads, cost-per-token is a more useful metric than cost-per-GPU-hour. Tools that track this enable optimization decisions that GPU-hour metrics cannot.
GPU rightsizing is the single largest underexploited cost lever in 2026 Kubernetes clusters. The 2025 baseline of running every model on a dedicated full GPU is no longer defensible, and the platforms to manage fractional allocation are mature enough for production. For a deeper read on AI inference sizing, see Kubernetes Swap for AI Inference.
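To make the fractional allocation lever concrete, here is a minimal sketch of a pod that requests a single MIG slice instead of a whole GPU. It assumes an A100 40GB partitioned with MIG and an NVIDIA device plugin configured to advertise each slice as its own resource (the "mixed" strategy); the pod name and image are hypothetical, and the resource name will differ on other GPU models or profiles.

```yaml
# Hypothetical inference pod requesting one 1g.5gb MIG slice rather than a full GPU.
# Assumes the cluster advertises nvidia.com/mig-1g.5gb (A100 40GB, mixed MIG strategy).
apiVersion: v1
kind: Pod
metadata:
  name: small-model-inference          # illustrative name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:1.0   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          # Extended resources are set under limits; the request defaults to the same value.
          nvidia.com/mig-1g.5gb: 1
```

With this shape, up to seven such pods can share one A100 with hardware isolation between them, which is the pattern the memory-rightsizing lever above depends on. Whether a 5GB slice is large enough is a per-model measurement, not something the manifest can decide.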
Stateful workloads
Stateful workloads (databases, message queues, anything with persistent volumes) cannot be rightsized aggressively without downtime risk. A pod restart that flushes a 64GB buffer cache might cost more in performance degradation than the rightsizing saves in compute. VPA’s default behavior of restarting pods to apply new resource requests is particularly dangerous here.
The levers for stateful workloads:
- Provision conservatively, audit slowly. Set initial requests with generous headroom (50-100% above expected peak). Audit utilization over multi-week windows, not daily. Apply changes during planned maintenance windows, not continuously.
- Run VPA in recommendation-only mode. Use VPA as observability for stateful workloads, never as auto-apply. The recommendations tell you what to change. The when and how is a human decision.
- Use dedicated node pools. Separate stateful workloads onto their own node pools to avoid bin-packing them with stateless workloads. This prevents disruption from Karpenter consolidation events and Cluster Autoscaler scale-down decisions that would otherwise evict pods with persistent storage attached.
- Use PodDisruptionBudgets aggressively. Set minAvailable to the highest value the workload can tolerate during voluntary disruptions. For single-replica databases, this means maxUnavailable: 0, which effectively pins the pod.
Stateful workload cost optimization is slower and more cautious than stateless. The savings are real but the methodology is different.
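The recommendation-only VPA and PodDisruptionBudget levers above might look like the following for a single-replica database. This is a sketch under assumptions: the StatefulSet name, namespace, and labels are hypothetical, and the VPA object requires the Vertical Pod Autoscaler components to be installed in the cluster.

```yaml
# VPA in recommendation-only mode: it computes recommendations but never evicts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: orders-db-vpa            # illustrative name
  namespace: data
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: orders-db              # hypothetical stateful workload
  updatePolicy:
    updateMode: "Off"            # observe only; a human applies changes
---
# PDB that blocks voluntary evictions of the single database replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-db-pdb
  namespace: data
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: orders-db
```

One consequence worth planning for: with maxUnavailable set to 0, node drains will wait on this pod indefinitely, so maintenance windows need an explicit step that relaxes the budget or fails the database over first.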
Batch and CI workloads
Batch jobs, CI runners, ML training pipelines, and any workload that uses the cluster heavily for short windows and then disappears are ideal Spot candidates. They are also the workloads most underused on Spot because teams default to running them on the same node pools as production services.
The levers for batch workloads:
- Dedicated node pools with aggressive Spot. Set up a node pool exclusively for batch workloads with 100% Spot allocation. Use node taints to keep production workloads off it, and tolerations on batch workloads to allow them in.
- Checkpoint where possible. Batch jobs that can checkpoint their progress survive Spot interruptions cleanly. Jobs that cannot checkpoint have to restart from the beginning, which negates the cost savings if interruptions are frequent.
- Use Karpenter for batch. Karpenter’s instance flexibility means batch workloads can run on whatever Spot capacity is cheapest at any given moment, not just the instance types pre-defined in static node groups. This is where Karpenter’s advantages over Cluster Autoscaler matter most.
- Set explicit completion deadlines. Use Kubernetes Jobs with activeDeadlineSeconds to prevent runaway batch jobs from accumulating cost indefinitely.
Batch workloads are often the highest-percentage savings opportunity in a cluster, even if the absolute dollars are smaller than the compute savings on production services.
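The batch levers above can be sketched as a Spot-only node pool plus a Job that tolerates its taint and carries a deadline. The example assumes Karpenter's v1 API on AWS with an existing EC2NodeClass named default; the workload-class label and taint, the Job name, the image, and the one-hour deadline are all illustrative choices, not values from any particular setup.

```yaml
# Spot-only Karpenter NodePool reserved for batch work (assumes Karpenter v1 on AWS).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    metadata:
      labels:
        workload-class: batch          # hypothetical label used for node selection
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                  # assumed to exist already
      taints:
        - key: workload-class
          value: batch
          effect: NoSchedule           # keeps production pods off these nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
  limits:
    cpu: "500"                         # ceiling on total CPU this pool may provision
---
# Batch Job that lands on the pool above and cannot run longer than one hour.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  activeDeadlineSeconds: 3600          # Job is terminated once this elapses
  backoffLimit: 4                      # retries, e.g. after a Spot interruption
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        workload-class: batch
      tolerations:
        - key: workload-class
          operator: Equal
          value: batch
          effect: NoSchedule
      containers:
        - name: report
          image: registry.example.com/nightly-report:1.0   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

Leaving instance types unconstrained in the NodePool is what lets Karpenter pick from whatever Spot capacity is available. The deadline applies to the Job as a whole, retries included, so it should be sized for the worst acceptable run rather than the typical one.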
Bursty workloads
Bursty workloads (inference endpoints, traffic-spike services, scheduled jobs that consume cluster capacity in irregular bursts) are where HPA tuning matters most and where it most often fails. Set HPA too aggressively and the cluster thrashes. Set it too conservatively and the burst fails. The default HPA settings are usually wrong for bursty patterns.
The levers for bursty workloads:
- Karpenter for fast node provisioning. Bursty workloads need nodes to come online in 60-90 seconds, not 3-5 minutes. Karpenter typically provisions 2-3x faster than Cluster Autoscaler. For inference and traffic-spike workloads, this is the difference between absorbing a burst and timing out.
- KEDA for event-driven scaling. Kubernetes Event-Driven Autoscaling (KEDA) extends HPA to scale on signals that resource-metric HPA cannot use on its own: queue depth, Kafka lag, custom application metrics, scheduled cron triggers. KEDA creates and manages an HPA object under the hood, so it complements HPA rather than removing it. For workloads where CPU and memory are lagging indicators, KEDA is the right choice over a plain CPU-based HPA.
- Predictive scaling for seasonal patterns. Workloads with predictable traffic patterns (daily peaks, weekly cycles, monthly batch runs) benefit from predictive scaling that warms capacity ahead of the burst rather than reacting to it. This is platform-specific and not standard Kubernetes, but several workload management platforms support it.
- HPA tuning beyond defaults. Stabilization windows, scale-up policies, and scale-down policies in HPA configuration are usually left at defaults. Tuning them for the specific burst pattern of each workload prevents thrash and improves response time.
Bursty workload optimization is where the four-autoscaler problem from Section 5 gets sharpest. The interactions are most visible during bursts because everything is changing at once. Phase 2 audits should specifically examine bursty workloads for autoscaler conflicts.
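As a rough illustration of the HPA tuning and KEDA levers above, the manifests below show a burst-oriented HPA behavior block and a Kafka-lag ScaledObject. They target two different hypothetical Deployments on purpose, because KEDA manages its own HPA and a hand-written HPA on the same target would conflict with it. Every number here is an assumption to be tuned against the workload's real burst pattern.

```yaml
# HPA tuned for bursts: scale up immediately, scale down slowly.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api                  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to a burst without waiting
      selectPolicy: Max                # use whichever policy allows more pods
      policies:
        - type: Percent
          value: 100                   # at most double every 30s
          periodSeconds: 30
        - type: Pods
          value: 4                     # or add 4 pods every 30s
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 600  # wait 10 minutes before shrinking
      policies:
        - type: Percent
          value: 10                    # remove at most 10% of pods per minute
          periodSeconds: 60
---
# KEDA ScaledObject scaling a separate consumer Deployment on Kafka lag.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: queue-worker                 # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092   # placeholder address
        consumerGroup: queue-worker
        topic: orders
        lagThreshold: "100"            # target lag per replica
```

The asymmetry is the point of the design: fast scale-up absorbs the burst, and the long scale-down window prevents the thrash that comes from releasing capacity between closely spaced spikes.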
When workload categories overlap
Most production workloads fit one category cleanly. Some fit two or more, and the optimization approach has to combine the relevant levers.
An ML inference service is both GPU-bound and bursty: it needs fractional GPU allocation and Karpenter-fast scale-up. A data pipeline that processes batch jobs against a stateful database is both batch and stateful: it needs Spot capacity for the workers and conservative rightsizing for the database. A web application with a peak traffic season is bursty for nine months and stateful-adjacent during a heavy promotional period: it needs KEDA scaling year-round and dedicated capacity reservations during peaks.
The categories are not rigid. They are starting points for thinking about which levers apply to which workloads.
Kubernetes cost optimization tools
The tool landscape for Kubernetes cost optimization has three distinct categories. Most teams need at least one tool from each. Conflating the categories or assuming a single tool covers all of them is one of the most common mistakes in tool selection.
The three categories:
- Cost visibility tools measure and allocate spend. They tell you where the money is going. They do not change the cluster.
- Node provisioners add and remove nodes in response to scheduling pressure. They determine what infrastructure runs underneath the workloads.
- Workload management platforms rightsize requests, manage replicas, and coordinate the autoscalers. They determine whether the workloads consume infrastructure efficiently. Within this category, autonomous workload management platforms apply changes continuously and automatically rather than surfacing recommendations. The most advanced of these are application context-aware: they understand workload behavior and anticipate resource needs instead of reacting to threshold breaches.
Commitment discount programs (AWS Savings Plans, Google Cloud Committed Use Discounts, Azure Reservations) sit alongside these three categories as a fourth lever: discounted pricing on capacity that the tools above help size correctly.
Tool comparison
| Tool | Category | What it does | Best for | Limitation |
| --- | --- | --- | --- | --- |
| ScaleOps | Autonomous workload management | Continuously rightsizes pod requests and replicas based on application context, anticipates traffic spikes, coordinates HPA and vertical scaling, optimizes pod placement, manages Spot allocation per workload, uses native Kubernetes In-Place Pod Resize to apply changes without restarts or evictions | Production Kubernetes at scale, multi-cluster environments, AI inference clusters | Commercial |
| Kubecost | Cost visibility | Per-namespace, per-workload, per-label cost allocation. Native integration with Prometheus and cloud billing APIs | Showback and chargeback reporting, FinOps allocation | Visibility only, does not change cluster state |
| OpenCost | Cost visibility (CNCF) | Open-source cost allocation engine, foundation Kubecost is built on | Teams wanting an open-source allocation engine without commercial support | Visibility only, requires self-hosting and integration work |
| Karpenter | Node provisioner | Just-in-time node provisioning across any instance type, aggressive consolidation, deep Spot integration | Bursty workloads, Spot-heavy clusters, instance-flexible environments | Node layer only, does not address workload-level waste |
| Cluster Autoscaler | Node provisioner | Adds and removes nodes from predefined node groups based on scheduling pressure | Stable workloads with predictable instance shapes, single-cloud environments | Slower scale-up than Karpenter, less instance flexibility |
| AWS Savings Plans / GCP CUDs / Azure Reservations | Commitment discounts | 30-60% discount on committed baseline capacity for 1 or 3 year terms | Predictable baseline workloads | Does not address waste, only discounts existing spend |
| Grafana + Prometheus | Visualization | Custom dashboards for cluster metrics, cost-related panels when integrated with Kubecost or OpenCost | Engineering teams wanting full control over dashboards and alerting | DIY, no allocation logic out of the box |
Notes on the categories
Cost visibility tools answer where the money is going. They do not change the cluster. The most common mistake in tool selection is buying a visibility tool and expecting it to optimize the cluster. Visibility is necessary but not sufficient. Reports generate awareness, but they do not rightsize workloads.
Node provisioners answer how nodes get added and removed. Cluster Autoscaler and Karpenter are alternatives, not complements, in most clusters. They respond to the same trigger (unschedulable pods) and differ in how they fulfill it. The choice between them depends on workload patterns and operational preferences, covered in Section 5. Both are valid foundations.
Workload management platforms answer whether workloads consume infrastructure efficiently. This is where the continuous Phase 2 work happens, and it is the category most teams underweight. The cluster can have perfect visibility and an excellent node provisioner and still bleed money if workload requests are stale, replicas are over-provisioned, and HPA and VPA are fighting each other. Workload management platforms exist to handle this layer.
Commitment discounts answer how to pay less for the capacity you keep. They do not optimize workloads or change provisioning. They reduce the rate you pay for the baseline. Most teams underuse commitment discounts because measuring “true baseline” is harder than it looks, and the lever only works after Phase 1 cleanup has stabilized the cluster.
A note on ScaleOps positioning
ScaleOps is an autonomous cloud and AI resource management platform that runs on top of existing Kubernetes infrastructure. Two characteristics define the platform.
The first is autonomy: ScaleOps does not surface recommendations for engineering teams to act on, it applies workload changes continuously and automatically as resource needs change. Rightsizing happens without pod restarts. Replica decisions stay coordinated with HPA. Spot placement is managed per workload in real time.
The second is application context-awareness: ScaleOps understands the behavior of each workload it manages. It learns traffic patterns, recognizes the difference between a stateless API and a batch job and a stateful service, anticipates spikes before they hit the autoscaler’s threshold, and adapts resource allocation in time with how the workload actually behaves. The result is that workloads get the resources they need when they need them, not before and not after, and not the static padding that every “set it once and forget it” rightsizing pass leaves behind.
ScaleOps does not replace Cluster Autoscaler or Karpenter; it works alongside whichever node provisioner is already in place. It does not replace HPA; it coordinates with HPA so vertical and horizontal scaling decisions stay consistent. Teams adopting ScaleOps do not need to migrate off existing autoscalers, redesign their node provisioning strategy, or change which cloud they run on. The platform adds an autonomous, context-aware intelligence layer over the existing foundation.
ScaleOps supports EKS, GKE, AKS, OpenShift, and self-managed Kubernetes across cloud, on-premise, and hybrid deployments.
Tool selection by team size and workload pattern
Smaller teams (under 50 workloads, single cloud, stable traffic):
- Cost visibility: Kubecost or OpenCost
- Node provisioner: Cluster Autoscaler with default settings is usually sufficient
- Workload management: Manual monthly audits using VPA recommendations are tractable at this scale
- Commitments: AWS Savings Plans, GCP CUDs, or Azure Reservations on stable baseline
Mid-sized teams (50-500 workloads, multi-cluster, mixed workload patterns):
- Cost visibility: Kubecost for chargeback reporting, Prometheus for engineering dashboards
- Node provisioner: Cluster Autoscaler or Karpenter depending on Spot strategy and workload patterns
- Workload management: Manual audits start to break down; workload management platforms become worth evaluating
- Commitments: Baseline coverage with commitment discounts, on-demand for variable capacity
Large teams (500+ workloads, multi-cluster, AI workloads, rapid deployment velocity):
- Cost visibility: Kubecost for allocation and chargeback, custom Grafana dashboards for operational visibility
- Node provisioner: Often a mix, with Karpenter for bursty and Spot-heavy workloads and Cluster Autoscaler for stable production
- Workload management: Continuous platforms become operational necessity, not optimization
- Commitments: Sophisticated commitment portfolio with dedicated FinOps owner
The pattern across all sizes: visibility, node provisioning, and workload management are three separate decisions. Conflating them is the root cause of most tool selection mistakes. For a deeper benchmark on the workload management category specifically, see The 6 Best Kubernetes Cost Optimization Tools: 2025 Benchmark.
Common pitfalls
Most Kubernetes cost optimization advice focuses on what to do. The pitfalls below are what teams reliably get wrong even after they have read the best practices. They are framed around the two-phase model so that the failure modes track to where the work actually breaks.
Buying tools without understanding which category solves which problem
The three-category framework in Section 9 (visibility, node provisioning, workload management) exists because teams reliably buy tools that cannot solve the problem they actually have. A common pattern: a team is told their Kubernetes bill is too high, they buy Kubecost or OpenCost expecting it to fix the bill, six months later the bill is the same and the team is frustrated. Kubecost and OpenCost report cost. They do not act on it.
The same mistake plays out in the other direction. A team buys an autonomous workload management platform and then expects it to also handle chargeback reporting, FinOps allocation, or commitment discount optimization. Those are visibility-layer and contract-layer problems, not workload-layer problems.
The fix is to match the tool to the problem. If the cluster bleeds compute through stale requests and over-replicated workloads, the answer is workload management (autonomous or manual). If the cluster’s spend is opaque to the teams generating it, the answer is cost visibility plus a FinOps process. If node provisioning is slow or expensive, the answer is the right node provisioner for the workload mix. If the bill is too high on stable capacity, the answer is commitment discounts. These are four distinct problems with four distinct solutions, and conflating them is the most expensive mistake in tool selection.
Doing Phase 1 and assuming you are done
The cluster bill drops 35% in the first month after Phase 1 cleanup, the team celebrates, leadership reallocates the savings, and nobody schedules the next audit. Three to six months later the bill is back to its pre-Phase-1 baseline plus inflation, and no one can explain why.
The pattern is so consistent it is the operational case for Phase 2. Phase 1 savings do not hold without something actively preventing drift, whether that is a monthly audit cadence or a continuous management platform.
Optimizing for percentage waste instead of absolute dollars
A workload running at 5% utilization with a $40/month footprint sounds like an obvious cleanup target. A workload running at 60% utilization with a $20,000/month footprint sounds like it is doing fine. The first looks dramatic in percentage terms. The second is the actual cost problem.
Always rank optimization targets by absolute dollar waste, not percentage. The dashboards that highlight percentage utilization are useful for spotting outliers but they bias toward small workloads. The dashboards that highlight dollar waste are what drive the savings.
Treating Spot as plug-and-play
Spot Instances are 60-90% cheaper than on-demand, but Spot interruptions are real. Teams that move workloads to Spot without designing for interruption (PodDisruptionBudgets, replica counts above the interruption threshold, checkpointing for batch workloads, topology spread across multiple instance types) eventually get caught by a Spot capacity event. They lose more in performance and reliability than they saved in compute, and they migrate back to on-demand.
The lever is not just “use Spot.” It is matching the right workload to the right capacity type, designing for interruption with PodDisruptionBudgets and topology spread, and accepting that some workloads will always cost full price. Per-workload Spot strategy (which workloads can run on Spot at any given moment, when to fall back, how to anticipate interruptions) gets harder as cluster scale and workload diversity grow, and is one of the levers continuous management platforms handle automatically.
Mixing autoscalers without understanding interactions
Cluster Autoscaler, Karpenter, HPA, and VPA all do useful work. They also conflict in non-obvious ways, covered in Section 5. The most common failure mode is enabling HPA and auto-applying VPA on the same deployment, which causes the autoscaling spiral pattern. The second most common is running both Cluster Autoscaler and Karpenter in the same cluster without partitioning workloads by node group, which produces unpredictable provisioning decisions.
The fix is to read Section 5 before turning autoscalers on, not after. For a head-to-head on the node-provisioner choice, see Karpenter vs Cluster Autoscaler: 2026 Comparison Guide.
Ignoring storage and network because they are not compute
Compute is 60-80% of the bill. Storage and network are 15-35% combined. Teams that optimize compute aggressively and ignore the other categories leave significant savings unrealized, and the savings that remain available in storage and network tend to be more durable because they are architectural decisions rather than per-workload tuning. Section 3 covers the specific patterns.
The pattern across teams that handle this well: a quarterly audit of storage and network costs on the same cadence as the compute audit, with a different owner so the work does not get deprioritized.
Chasing 100% utilization
The goal of Kubernetes cost optimization is not to drive every node to 100% utilization. The goal is to eliminate waste while preserving the headroom that workloads need to absorb traffic spikes, accommodate autoscaler decisions, and handle the brief periods of overcommitment that any healthy cluster experiences.
Targeting 100% utilization causes autoscaler thrash, latency spikes during traffic bursts, and reliability incidents when the cluster has no slack to absorb unexpected load. The right target depends on workload pattern: 70-80% average node utilization for stateless workloads with HPA, 50-60% for bursty workloads, 40-50% for workloads with strict latency requirements. Performance protection comes first; cost savings come from the work that remains after the headroom is preserved.
Trusting allocation reports without verifying the labels
FinOps reports are only as accurate as the labels and namespaces they rely on. Teams that adopt Kubecost or OpenCost without enforcing consistent labeling end up with allocation reports that miss 20-40% of cluster cost in “unallocated” categories. The “unallocated” bucket is where chargeback credibility goes to die.
The fix is to enforce labeling through admission controllers (OPA Gatekeeper or Kyverno) before launching allocation reports, audit labeling compliance quarterly, and treat the “unallocated” bucket as a bug to fix rather than a category to accept.
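Before launching reports, it can help to estimate how much spend would land in "unallocated" under the current labels. The sketch below assumes a hypothetical export of workloads with a monthly cost and a label map; the required label keys (`team`, `cost-center`) are placeholders for whatever the organization's allocation model uses, and the function is not part of Kubecost or OpenCost.

```python
REQUIRED_LABELS = ("team", "cost-center")  # hypothetical label keys


def unallocated_share(workloads, required=REQUIRED_LABELS):
    """Fraction of total cost carried by workloads missing any required label."""
    total = sum(w["monthly_cost"] for w in workloads)
    if total == 0:
        return 0.0
    unallocated = sum(
        w["monthly_cost"]
        for w in workloads
        # A missing key or an empty value both count as unlabeled.
        if any(not w["labels"].get(key) for key in required)
    )
    return unallocated / total


# Hypothetical export: one fully labeled workload, two with gaps.
workloads = [
    {"name": "checkout", "monthly_cost": 6000.0,
     "labels": {"team": "payments", "cost-center": "cc-101"}},
    {"name": "search", "monthly_cost": 3000.0,
     "labels": {"team": "discovery"}},
    {"name": "legacy-cron", "monthly_cost": 1000.0, "labels": {}},
]

print(f"unallocated share: {unallocated_share(workloads):.0%}")
```

Tracking this number over time is one way to treat the "unallocated" bucket as a bug count rather than a reporting category.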
Letting one team own everything
Kubernetes cost optimization is structurally cross-functional. Platform teams understand the cluster. Application teams understand the workloads. Finance understands the budget. FinOps understands the allocation model. Cost optimization that gets assigned to one team without authority or visibility across the others fails consistently, regardless of which team it is.
The pattern across teams that handle this well: a named cross-functional owner with monthly review cadence, allocation reports visible to all relevant teams, and budget conversations that reference the data rather than working around it.
Frequently asked questions about Kubernetes cost optimization
What is Kubernetes cost optimization and why does it matter?
Kubernetes cost optimization is the practice of reducing cloud spend on Kubernetes clusters by rightsizing workloads, configuring autoscalers correctly, choosing the right capacity types, and continuously managing resource consumption as workloads change. It matters because the default state of Kubernetes is overprovisioned: industry data consistently shows average CPU utilization in unmanaged clusters sits in the single digits to low teens, with the rest paid for and unused. At production scale, that waste represents hundreds of thousands to millions of dollars per year in cluster spend that no workload actually requires.
How do I reduce Kubernetes costs without hurting application performance?
The order matters. Start with cost monitoring (Kubecost, OpenCost, or cloud-native tools) to identify where the bill comes from. Then run the Phase 1 cleanup levers in priority order: rightsize CPU and memory requests based on actual 95th percentile usage with 20-30% headroom, bin-pack workloads onto fewer nodes, move fault-tolerant workloads to Spot Instances, and apply commitment discounts to stable baseline capacity. Set memory limits to prevent OOM cascades but avoid CPU limits unless throttling is acceptable, since CPU limits often hurt application performance more than they save in compute cost. After cleanup, continuous workload management prevents the savings from eroding as workloads drift.
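As a rough illustration of the rightsizing step, the sketch below derives a CPU request from usage samples using a nearest-rank 95th percentile plus headroom. The function names and the sample values are hypothetical, and the 25% default is simply the midpoint of the 20-30% range above; a real pipeline would pull samples from a metrics store over a representative window.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommended_request(samples, headroom=0.25):
    """P95 of observed usage plus a headroom fraction (0.25 means 25%)."""
    return percentile(samples, 95) * (1 + headroom)


# Hypothetical CPU usage samples in millicores for one container.
usage_millicores = [120, 135, 150, 140, 480, 130, 145, 155, 138, 142,
                    160, 150, 148, 152, 139, 141, 500, 137, 149, 151]

print(f"P95 usage: {percentile(usage_millicores, 95)}m")
print(f"suggested request: {recommended_request(usage_millicores):.0f}m")
```

Using P95 rather than the maximum keeps a single outlier from setting the request, while the headroom term covers growth between rightsizing passes.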
Why are resource requests and limits so important for Kubernetes cost optimization?
Kubernetes schedules on requests, not actual usage, and the node count that results is what the cloud provider bills. A pod requesting 4 CPU and using 0.5 CPU consumes four full CPUs of cluster capacity from the scheduler’s perspective, which translates directly into node count and cloud cost. Limits are a separate mechanism that caps how much a container can consume, which matters for stability (preventing one workload from starving others) but does not directly drive cost. Most Kubernetes cost optimization work focuses on rightsizing requests to match actual usage with appropriate headroom, because that is the lever that determines how many nodes the cluster needs. For deeper context on sizing, see Kubernetes Capacity Planning.
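The link between requests and node count can be made concrete with some simplified arithmetic. The sketch below assumes a hypothetical deployment of 20 replicas on 16-CPU nodes and ignores memory, DaemonSets, system-reserved capacity, and bin-packing fragmentation, so treat the output as an order-of-magnitude illustration only.

```python
import math


def nodes_needed(replicas: int, cpu_request_per_pod: float, node_cpu: float) -> int:
    """Nodes required to fit the summed CPU requests (simplified, CPU only)."""
    total_requested = replicas * cpu_request_per_pod
    return math.ceil(total_requested / node_cpu)


# Hypothetical deployment: 20 replicas scheduled onto 16-CPU nodes.
replicas, node_cpu = 20, 16.0

# Each pod requests 4 CPU but uses about 0.5 CPU.
as_configured = nodes_needed(replicas, cpu_request_per_pod=4.0, node_cpu=node_cpu)

# Same pods rightsized to observed usage plus 25% headroom (0.625 CPU).
rightsized = nodes_needed(replicas, cpu_request_per_pod=0.5 * 1.25, node_cpu=node_cpu)

print(f"nodes with 4 CPU requests: {as_configured}")
print(f"nodes with 0.625 CPU requests: {rightsized}")
```

Actual usage is identical in both cases; only the requests change, and the node count follows the requests.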
When should I use Spot Instances for Kubernetes workloads?
Spot Instances are 60-90% cheaper than on-demand and can be interrupted with two minutes’ notice. They are appropriate for fault-tolerant workloads: stateless web services with multiple replicas, batch jobs that checkpoint progress, CI runners, ML training pipelines, and any workload designed to absorb interruption. They are not appropriate for single-replica stateful services, databases without failover patterns, or workloads where an interruption on two minutes’ notice causes data loss or significant customer impact. The lever is not “use Spot” universally; it is matching the right workload pattern to the right capacity type, designing the workload to handle interruption through PodDisruptionBudgets and topology spread, and accepting that some workloads will always cost full price.
How do Cluster Autoscaler, Karpenter, HPA, and VPA work together?
Each scales a different thing. Cluster Autoscaler and Karpenter manage nodes (they are alternatives, not complements). HPA manages pod replica counts per deployment. VPA manages per-pod CPU and memory requests. The combinations that work in production: Cluster Autoscaler + HPA for stable workloads with predictable instance shapes, Karpenter + HPA for bursty or Spot-heavy workloads, and VPA in recommendation-only mode as observability on every deployment. The combination that breaks: HPA + auto-applied VPA on the same deployment, which causes an autoscaling spiral where vertical and horizontal scaling compound each other. Either run VPA in recommendation-only mode, restrict auto-apply VPA to workloads without HPA, or use an autonomous workload management platform that coordinates vertical and horizontal scaling as a single decision.
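A quick audit for the breaking combination might look like the sketch below. It assumes a hypothetical inventory (a list of dicts with `has_hpa` and `vpa_update_mode` fields) that has already been exported from the cluster; the mode strings mirror the VPA `updateMode` values, where `Off` is recommendation-only. The check is deliberately coarse and flags every HPA paired with a non-`Off` VPA, without looking at which metric the HPA scales on.

```python
def find_hpa_vpa_conflicts(deployments):
    """Return names of deployments that have an HPA and a VPA that applies changes."""
    conflicts = []
    for d in deployments:
        vpa_mode = d.get("vpa_update_mode")  # None means no VPA is attached
        applies_changes = vpa_mode is not None and vpa_mode != "Off"
        if d.get("has_hpa") and applies_changes:
            conflicts.append(d["name"])
    return conflicts


# Hypothetical inventory of three deployments.
inventory = [
    {"name": "api", "has_hpa": True, "vpa_update_mode": "Auto"},    # conflict
    {"name": "web", "has_hpa": True, "vpa_update_mode": "Off"},     # recommendation-only
    {"name": "worker", "has_hpa": False, "vpa_update_mode": "Auto"} # VPA without HPA
]

print(find_hpa_vpa_conflicts(inventory))
```

Anything the check returns is a candidate for switching the VPA to recommendation-only mode or removing one of the two autoscalers.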
Kubecost vs native cloud billing tools: which is better for Kubernetes cost optimization?
They solve different problems. Native cloud billing (AWS Cost Explorer, GCP Cost Management, Azure Cost Management) reports cost at the cloud-account level: total spend by service, by region, by instance type. It does not understand Kubernetes-specific abstractions like namespaces, workloads, or labels, so it cannot allocate cluster cost back to the teams or services that generated it. Kubecost and OpenCost report cost at the Kubernetes-workload level: spend by namespace, by deployment, by label, with allocation logic that maps cluster cost to organizational responsibility. Most production teams need both: native cloud billing for cross-service spend visibility, Kubecost or OpenCost for Kubernetes-internal allocation and chargeback. Neither of them changes the cluster; both report on it.
What are the best practices for Amazon EKS cost optimization?
The core practices map directly to general Kubernetes cost optimization with EKS-specific levers: rightsize pod requests and use either Cluster Autoscaler or Karpenter for node management, run fault-tolerant workloads on Spot Instances through EC2 Spot integration, apply AWS Savings Plans or Reserved Instances to stable baseline capacity, enforce ResourceQuotas and LimitRanges per namespace, and use CloudWatch Container Insights or Kubecost for visibility. EKS-specific considerations: managed node groups versus self-managed node groups have different cost and operational tradeoffs, Fargate is more expensive per pod-hour than EC2 but eliminates node management overhead for low-volume workloads, and EKS control plane fees ($0.10 per hour per cluster) add up across multi-cluster environments. For a deeper EKS-focused guide, see Amazon EKS Cost Optimization. For the AKS equivalent, see AKS Workload Optimization.
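To see how the control plane fee adds up, here is a small worked calculation. It takes the $0.10 per hour figure quoted above as a default parameter and assumes an average month of 730 hours; the cluster count of 25 is hypothetical.

```python
HOURS_PER_MONTH = 730  # approximate average month, an assumption


def control_plane_cost(clusters: int, hourly_fee: float = 0.10,
                       hours: int = HOURS_PER_MONTH) -> float:
    """Monthly control plane fees across a fleet of clusters."""
    return clusters * hourly_fee * hours


# Hypothetical fleet of 25 clusters (per-team, per-environment sprawl).
print(f"1 cluster:   ${control_plane_cost(1):,.2f}/month")
print(f"25 clusters: ${control_plane_cost(25):,.2f}/month")
```

The per-cluster fee is small, but it scales linearly with cluster count and is paid regardless of how much work each cluster does.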
How can I identify idle or over-provisioned workloads in Kubernetes?
Run the audit workflow in Section 6: pull 30-day actual utilization for every workload, rank by absolute dollar waste rather than percentage waste, flag any workload under 5% utilization for more than 7 days as idle, and calculate the ratio of requested resources to actual P95 usage for the rest. Anything above 3x is overprovisioned; anything above 5x is severely overprovisioned. The tools that collect this data are Prometheus with kube-state-metrics, Kubecost, OpenCost, or cloud-native monitoring (CloudWatch Container Insights, Cloud Monitoring, Azure Monitor). The friction is not data collection but acting on the data: most teams have the dashboards but lack the operational cadence to rightsize the top offenders every month. At production scale this is where continuous workload management becomes necessary.
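The classification thresholds can be expressed as a small function. This sketch reads the ratio as requested resources divided by P95 usage, which is the direction in which 3x and 5x indicate overprovisioning; the function name, parameters, and sample values are hypothetical, and the inputs are assumed to come from whichever monitoring tool is already in place.

```python
def classify(requested: float, p95_usage: float,
             avg_utilization: float, low_util_days: int) -> str:
    """Label a workload using the idle and overprovisioning thresholds above."""
    # Idle: under 5% utilization for more than 7 days.
    if avg_utilization < 0.05 and low_util_days > 7:
        return "idle"
    # Guard against division by zero when no usage was observed.
    ratio = requested / p95_usage if p95_usage > 0 else float("inf")
    if ratio > 5:
        return "severely overprovisioned"
    if ratio > 3:
        return "overprovisioned"
    return "ok"


# Hypothetical workloads: (requested CPU, P95 CPU, avg utilization, days under 5%).
print(classify(4.0, 0.5, 0.10, 0))    # requests are 8x P95 usage
print(classify(2.0, 0.5, 0.20, 0))    # requests are 4x P95 usage
print(classify(1.0, 0.8, 0.70, 0))    # requests are 1.25x P95 usage
print(classify(1.0, 0.01, 0.01, 10))  # near-zero usage for 10 days
```

Combining these labels with a dollar-waste ranking gives a short monthly list of workloads worth rightsizing first.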
Key takeaways for Kubernetes cost optimization
Kubernetes cost optimization is two distinct kinds of work. Most teams complete the first kind and never start the second, which is why the bill comes back.
Phase 1 cleanup delivers 30-50% savings. Rightsizing CPU and memory requests, bin-packing workloads onto fewer nodes, moving fault-tolerant workloads to Spot, enforcing ResourceQuotas and LimitRanges, applying commitment discounts to stable baseline capacity, and choosing the right autoscaler for each workload pattern. This is finite, well-documented work. A focused team can complete it on a single cluster in 4-8 weeks, and the savings show up on the next monthly bill.
Phase 2 continuous management prevents the drift. Workloads change. New services launch with default requests. Traffic patterns shift. Cluster Autoscaler, Karpenter, HPA, and VPA respond to inputs that may or may not still be accurate. Within 3-6 months of Phase 1 cleanup, an unmanaged cluster returns to most of its pre-cleanup state, not because anything broke, but because nothing is actively holding it in place. Phase 2 is the work that prevents this, whether through monthly manual audits at smaller scale or autonomous workload management platforms at production scale.
The four-autoscaler problem is where most Phase 2 erosion starts. HPA and VPA conflict on the same deployment when both are auto-applied. Cluster Autoscaler holds nodes warm because of stale pod requests. Karpenter consolidates during HPA scale-up. These interactions are subtle, the symptoms are easy to ignore, and they compound waste over time. Audit the autoscaler interactions monthly, not annually.
FinOps is the accountability layer that makes the technical work happen. Without consistent cost allocation by namespace, label, or workload owner, teams cannot see their drift. Without showback or chargeback, they have no incentive to fix it. Without ResourceQuotas and LimitRanges as policy enforcement, the cluster has no mechanism to prevent regression. Phase 1 and Phase 2 technical work only delivers durable savings when the FinOps foundation is in place.
Storage and network are the second-pass categories. Compute optimization delivers the largest savings. Storage and network deliver smaller savings that are more durable, because the patterns that cause them (orphaned PVs, missing topology hints, load balancer proliferation) are architectural decisions rather than per-workload tuning. Audit them quarterly on the same cadence as the compute audit, with a different owner.
GPU rightsizing is the single largest underexploited cost lever in 2026 Kubernetes clusters. NVIDIA’s own technical guidance documents that GPU compute utilization in production Kubernetes environments often sits in the 0-10% range when lightweight models run on dedicated GPUs, and the platforms for fractional GPU allocation through NVIDIA MIG, MPS, and time-slicing are mature enough for production.
Performance protection comes before cost savings, always. Set memory limits to prevent OOM cascades, avoid CPU limits unless throttling is acceptable, use PodDisruptionBudgets to protect availability during voluntary disruptions, target 70-80% average node utilization rather than 100%, and keep stateful workloads on conservative rightsizing schedules. Kubernetes cost optimization that compromises application performance is not cost optimization; it is technical debt with a smaller cloud bill.