Key takeaways
- Kubernetes GPU cost optimization is the practice of measuring real GPU compute and memory utilization, then matching workload allocation to actual demand rather than reserving entire devices.
- Without optimization, production inference workloads typically run at 5% to 20% GPU utilization while you pay for 100% of allocated capacity.
- Kubernetes allocates GPUs as devices, but inference workloads consume GPU compute and memory unevenly. That structural gap is where waste accumulates.
- GPU telemetry, specifically streaming multiprocessor (SM) utilization and framebuffer memory, combined with cost attribution identifies which workloads are wasting GPU capacity.
- ScaleOps GPU Platform delivers continuous optimization through GPU Observability, Automated Fractional GPUs, GPU Memory Optimization, GPU Replica Optimization, and Batch Inference Optimization.
What Is GPU Cost Optimization?
GPU cost optimization is the practice of measuring real GPU compute and memory utilization across Kubernetes workloads and matching allocation to actual demand. Kubernetes GPU cost optimization addresses a structural gap: the scheduler treats GPUs as whole devices, but inference workloads consume GPU compute and memory unevenly, leaving expensive capacity allocated but unused.
Why Kubernetes GPU Costs Get Out of Control
GPU costs spiral when AI and ML platform teams assume a reserved GPU is being actively used. When a pod requests a GPU, Kubernetes assigns it and the bill starts. The workload may use only a small fraction of that device for most of the day.
That waste compounds because GPU capacity is billed by the hour, even when a workload uses only a fraction of the device. The exact rate varies by provider, region, commitment, availability, support model, and instance configuration, but the principle is consistent: low utilization turns every always-on GPU into a recurring cost center.
The table below offers pricing context for the waste examples that follow. Verify current rates before building a business case.
| Provider category | Example GPU | Typical 2026 on-demand range | What to verify before using the GPU-hour estimate |
| Hyperscaler | H100 | Often mid single digits to low double digits per GPU-hour after normalizing multi-GPU instances, e.g., about $4–$6/GPU-hour in some AWS/GCP-style configurations and higher in some Azure-style bundled instances | Region, GPU count per VM, bundled CPU and memory, reservations, support, data transfer |
| Neo-cloud | H100 | Often low-to-mid single digits per GPU-hour, e.g., around $2.50–$6.16/GPU-hour depending on provider, term, and cluster size | Availability, uptime commitments, storage, networking, support, capacity guarantees |
| Hyperscaler or marketplace | A100 | Usually lower than H100 but highly variable, with public ranges often around sub-$1 to about $5/GPU-hour depending on marketplace, memory size, and availability | GPU memory size, spot or preemptible terms, commitment model, reserved capacity |
In production, inference workloads commonly run at only 5–20% GPU utilization. In typical unoptimized deployments, most of the paid compute capacity sits idle while the bill reflects 100% allocation.
ScaleOps pro tip: Use ScaleOps AI Inference Observability to connect raw GPU spend directly to workload ownership, mapping which teams and services generate unnecessary costs and which ones actually use the hardware.
The Allocation Problem That Kubernetes Cannot See
The standard Kubernetes device plugin model treats GPUs as either allocated or available. It has no visibility into whether an allocated GPU is 5%, 50%, or 95% utilized after scheduling. This breaks cost efficiency in four ways:
- GPU capacity is reserved at the device level.
- Streaming multiprocessor (SM) utilization can stay low.
- Video RAM (VRAM) can be overreserved.
- Idle GPU capacity cannot be reused unless sharing or fractional allocation is configured.
Standard Kubernetes tooling misses this waste because CPU and memory are native scheduling resources, while GPU compute and framebuffer memory require specialized telemetry. A node can look fully allocated even when expensive GPU capacity lies mostly idle.
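For example, this Deployment runs a single pod that reserves one full GPU:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      containers:
        - name: model
          image: example.com/embedding-service:latest
          resources:
            limits:
              # Whole-device request: one GPU, regardless of actual usage.
              nvidia.com/gpu: 1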
Nothing in that request tells Kubernetes whether the model uses 2 GB or 70 GB of VRAM, whether SM utilization is 10% or 90%, or whether the service receives one request per minute or thousands per second. GPU telemetry and cost attribution reveal that difference.
Why HPA and VPA Do Not Solve GPU Optimization
HPA and VPA are useful tools, but neither was built around GPU memory behavior.
HPA can use GPU custom metrics when they are exposed and mapped correctly. If HPA sees CPU but not GPU pressure, it under-scales a GPU-bound service. If it sees only node-level GPU averages, it misses which pod or model endpoint drives the pressure.
VPA tunes CPU and memory requests but does not handle VRAM sizing or fractional GPU allocation. A model server can consume excessive GPU memory even when its CPU and container memory requests look reasonable.
These gaps become visible when inference workloads hit GPU memory limits. A service can appear healthy from a CPU and container memory perspective, while excessive VRAM reservation or sudden KV cache growth causes model-server errors, CUDA out-of-memory errors, failed requests, or container crashes.
GPU optimization requires its own telemetry and allocation logic. Standard HPA and VPA behavior alone is not enough.
A GPU-aware HPA configuration looks like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Pods
      pods:
        metric:
          # Custom per-pod metric; it must be exposed through a metrics adapter.
          name: gpu_sm_utilization
        target:
          type: AverageValue
          # Target average across pods, in the metric's own units.
          averageValue: "65"
The hard part is not writing the HPA spec. It is implementing trustworthy GPU telemetry and a scaling policy that accounts for latency, memory, and request shape.
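How a target such as averageValue: "65" becomes a replica count follows the scaling rule documented for the HPA: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance (0.1 by default). The Python sketch below reproduces only that arithmetic. The function name is hypothetical, the replica bounds mirror the manifest above, and real HPA behavior also includes stabilization windows and pod-readiness handling that are left out here.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas=2, max_replicas=12, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current_avg / target_avg)."""
    ratio = current_avg / target_avg
    # No scaling while the ratio stays inside the tolerance band around 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))


# Four pods averaging 91 against a target of 65 scale out to six.
print(desired_replicas(4, 91.0, 65.0))
```

The same rule scales in when the average drops well below the target, which is why a noisy or node-level GPU metric can cause replica flapping.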
ScaleOps pro tip: Use ScaleOps Automated Fractional GPUs to manage pod-level GPU allocation dynamically based on real workload demand. Where HPA and VPA fall short on GPU memory behavior, Automated Fractional GPUs applies MIG-aware optimization and GPU sharing continuously in production, replacing the manual, node-by-node time-slicing that teams typically rely on.
How to Monitor GPU Utilization in Kubernetes
NVIDIA Data Center GPU Manager (DCGM) Exporter is the standard NVIDIA tool for collecting GPU telemetry in Kubernetes. It exposes NVIDIA DCGM metrics in a Prometheus-compatible format, making them available to dashboards in Grafana or any Prometheus-compatible observability stack.
Key metric categories to monitor:
- GPU utilization or SM activity for compute activity
- Framebuffer memory used for VRAM pressure
- Memory bandwidth or copy utilization for memory-bound workloads
- PCIe or NVLink throughput for data movement bottlenecks
- Power draw and cost per job, request, token, or model endpoint, where available
A basic ServiceMonitor for a Prometheus Operator setup looks like this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      # Must match the labels on your DCGM Exporter Service.
      app: dcgm-exporter
  namespaceSelector:
    matchNames:
      - gpu-operator
  endpoints:
    # Must match the named port on that Service.
    - port: metrics
      interval: 15s
Dashboards should connect metrics to pods, namespaces, models, services, teams, and owners, not just nodes. A node-level chart in Grafana shows that a GPU is busy. It cannot show whether the cost comes from search, recommendations, internal notebooks, or a forgotten staging endpoint.
Useful dashboards surface answers to operational questions:
- Which workloads reserve GPU memory but do little compute?
- Which namespaces carry the highest idle GPU cost?
- Which models are latency-bound versus memory-bound?
- Which owners should review baseline replicas or allocation policy?
- Which endpoints have the highest cost per token, request, or job?
This is where monitoring becomes cost management. GPU metrics point to a specific next action: reduce replicas, change placement, tune memory reservation, or assign an owner to review the workload.
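As a rough illustration of moving from node-level to workload-level views, the Python sketch below groups exporter samples by namespace and pod and flags pods that hold a lot of VRAM while doing little compute. The sample text is invented for the example. DCGM_FI_DEV_GPU_UTIL (percent) and DCGM_FI_DEV_FB_USED (MiB) are DCGM field names, but the namespace and pod labels appear only when the exporter is configured to map GPUs to Kubernetes pods, and both thresholds are assumptions.

```python
import re
from collections import defaultdict

# Invented sample in Prometheus text format; real output carries more labels.
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",namespace="search",pod="embed-0"} 12
DCGM_FI_DEV_FB_USED{gpu="0",namespace="search",pod="embed-0"} 61440
DCGM_FI_DEV_GPU_UTIL{gpu="1",namespace="staging",pod="demo-0"} 3
DCGM_FI_DEV_FB_USED{gpu="1",namespace="staging",pod="demo-0"} 2048
"""

SAMPLE_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def per_pod_metrics(text):
    """Group metric samples by (namespace, pod) instead of by node."""
    pods = defaultdict(dict)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match is None:  # skips comment and blank lines
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels))
        key = (labels.get("namespace", "unknown"), labels.get("pod", "unknown"))
        pods[key][name] = float(value)
    return dict(pods)


def memory_heavy_compute_light(pods, max_util_pct=20.0, min_fb_mib=40960.0):
    """Return pods that occupy a lot of VRAM while showing little compute."""
    return sorted(
        key for key, metrics in pods.items()
        if metrics.get("DCGM_FI_DEV_GPU_UTIL", 0.0) <= max_util_pct
        and metrics.get("DCGM_FI_DEV_FB_USED", 0.0) >= min_fb_mib
    )


print(memory_heavy_compute_light(per_pod_metrics(SAMPLE)))
```

In practice the same grouping is usually done with Prometheus queries; the point of the sketch is the join key, namespace and pod rather than node.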
GPU Compute Utilization vs. Memory Utilization
Before deciding whether a GPU workload is wasteful, separate compute usage from memory usage. SM is short for streaming multiprocessor, the part of the GPU that runs parallel compute work. A single utilization number hides the real bottleneck, especially in inference environments where models can reserve large amounts of VRAM without keeping GPU compute units busy.
The main signals:
- SM utilization (or SM activity where available) shows how busy the GPU compute units are.
- Framebuffer memory shows how much VRAM is occupied.
- Memory bandwidth and copy-engine metrics show whether the workload spends more time moving data than computing.
Read these signals together before deciding whether to consolidate, resize, tune batching, or change allocation strategy.
| Metric | What it tells you | Cost interpretation | Action to consider |
| SM utilization | How active the GPU compute units are | Low SM suggests paid compute capacity is idle. | Consolidate, share, tune batching, or reduce replicas. |
| Framebuffer memory used | How much VRAM is occupied | High VRAM can block co-location even when compute is low. | Right-size memory reservation, manage model placement, tune KV cache. |
| Memory bandwidth or copy utilization | How much data movement is happening | Low SM with high movement may indicate a memory-bound workload. | Optimize batching, data layout, transfer path, or serving settings. |
| PCIe/NVLink throughput | Whether data transfer is a bottleneck | High transfer pressure can make extra GPUs appear necessary. | Reduce host-device movement or improve placement. |
| Cost per job, request, token, or endpoint | How spend maps to business output | Unit cost reveals waste better than node cost alone. | Attribute spend by model, namespace, service, or owner. |
This four-state diagnostic view helps during reviews:
| Pattern | What it means | Optimization response |
| High SM + high framebuffer | GPU is likely well used. | Check latency and saturation before consolidating. |
| Low SM + high framebuffer | Workload is memory-reserved but compute-light. | Investigate smaller models, KV cache settings, fractional allocation, or memory rightsizing. |
| Low SM + high memory bandwidth | Workload may be memory-bound. | Optimize batching, data movement, or serving configuration before adding GPUs. |
| Low SM + low framebuffer | Strong candidate for optimization. | Use sharing, consolidation, fractional allocation, or replica reduction. |
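The diagnostic table can be turned into a small triage helper. The Python sketch below is one possible encoding: the function name, the 0 to 1 input scale, and all three thresholds are assumptions for illustration, and the high-SM, low-framebuffer case is not covered by the table, so it is returned for manual review.

```python
def classify_gpu_pattern(sm_util, fb_used, mem_bw_util,
                         sm_high=0.5, fb_high=0.7, bw_high=0.6):
    """Map three 0-1 utilization fractions to a row of the diagnostic table.

    The thresholds are illustrative assumptions; tune them per workload.
    """
    if sm_util >= sm_high:
        if fb_used >= fb_high:
            return "high SM + high framebuffer: likely well used"
        # Not a row in the table: compute-busy with spare memory.
        return "high SM + low framebuffer: outside the table, review manually"
    if mem_bw_util >= bw_high:
        return "low SM + high memory bandwidth: possibly memory-bound"
    if fb_used >= fb_high:
        return "low SM + high framebuffer: memory-reserved, compute-light"
    return "low SM + low framebuffer: strong optimization candidate"


# A model server holding 85% of VRAM at 10% SM activity.
print(classify_gpu_pattern(sm_util=0.10, fb_used=0.85, mem_bw_util=0.20))
```

Checking memory bandwidth before framebuffer for low-SM workloads is a design choice in this sketch: it surfaces memory-bound workloads before anyone resizes their memory.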
The goal is not 100% utilization. A constantly saturated GPU has no room to absorb queueing, which hurts p99 latency. A GPU with no memory headroom can fail under a burst. The practical goal is to reduce unused paid capacity while preserving service-level objectives.
ScaleOps pro tip: Use ScaleOps GPU Platform’s GPU Memory Optimization to manage overprovisioned GPU memory reservations that block sharing even when actual usage is lower.
Calculate the Real Cost of Low GPU Utilization in Kubernetes
GPU waste becomes actionable when you convert utilization into dollars. Start with a simple calculation:
hourly GPU cost × 24 × 365 = annual GPU cost
annual GPU cost × idle percentage = estimated unused-capacity cost
waste by namespace, service, model, or owner = action model for optimization
Consider how quickly low GPU utilization becomes annual waste:
At 15% average utilization, 85% of allocated capacity is not performing useful work. Ten always-on GPUs can incur six-figure unused-capacity costs before discounts, commitments, support, storage, or data transfer.
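The calculation above can be sketched in a few lines of Python. The $4.00 per GPU-hour rate is an assumption chosen from within the ranges in the pricing table, not a quoted price, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def annual_gpu_cost(hourly_rate, gpu_count=1):
    """hourly GPU cost x 24 x 365, scaled by the number of always-on GPUs."""
    return hourly_rate * HOURS_PER_YEAR * gpu_count


def unused_capacity_cost(hourly_rate, gpu_count, avg_utilization):
    """annual GPU cost x idle percentage."""
    idle_fraction = 1.0 - avg_utilization
    return annual_gpu_cost(hourly_rate, gpu_count) * idle_fraction


rate = 4.00  # assumed USD per GPU-hour, before discounts or commitments
total = annual_gpu_cost(rate, gpu_count=10)
idle = unused_capacity_cost(rate, gpu_count=10, avg_utilization=0.15)
print(f"annual cost: ${total:,.0f}")
print(f"unused capacity: ${idle:,.0f}")
```

Under these assumptions, ten always-on GPUs cost about $350,400 a year, of which about $297,840 pays for capacity that does no useful work.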
Attribution is the next step. A cluster-level waste number is interesting. A namespace-level or service-level number changes behavior. If the recommendation service has $40K of unused capacity and the experimentation namespace has $25K, each team has a concrete optimization target.
FinOps Attribution: Chargeback and Showback for GPU Spend
Visibility without accountability leaves waste in place. Two FinOps practices, showback and chargeback, connect GPU spend to the teams and services that generate it.
Showback surfaces GPU cost data to team owners without billing them directly. It creates visibility and prompts review without a hard financial consequence. Chargeback allocates costs to the responsible team’s budget, creating a direct incentive to optimize.
Both approaches require cost data at the namespace, workload, label, and annotation level. Node-level cost data is not granular enough. Teams cannot identify their own waste or take targeted action without workload-level attribution.
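A minimal Python sketch of workload-level attribution, assuming per-workload records are already available from telemetry and billing data. The namespaces, workload names, rates, and utilization figures are hypothetical, and idle cost is approximated as annual cost times the idle fraction.

```python
from collections import defaultdict

HOURS_PER_YEAR = 24 * 365

# Hypothetical records; in practice these come from GPU telemetry and billing.
WORKLOADS = [
    {"namespace": "recommendations", "workload": "ranker",
     "gpus": 2, "hourly_rate": 4.0, "avg_utilization": 0.20},
    {"namespace": "recommendations", "workload": "reranker",
     "gpus": 1, "hourly_rate": 4.0, "avg_utilization": 0.50},
    {"namespace": "experimentation", "workload": "notebooks",
     "gpus": 1, "hourly_rate": 4.0, "avg_utilization": 0.10},
]


def idle_cost_by(records, key="namespace"):
    """Sum estimated annual idle GPU cost per group, largest first."""
    totals = defaultdict(float)
    for record in records:
        annual = record["gpus"] * record["hourly_rate"] * HOURS_PER_YEAR
        totals[record[key]] += annual * (1.0 - record["avg_utilization"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for group, cost in idle_cost_by(WORKLOADS):
    print(f"{group}: ${cost:,.0f} estimated idle GPU cost per year")
```

Passing key="workload" produces the same report per workload, which is the granularity showback and chargeback reports need.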
ScaleOps GPU Observability surfaces workload-level GPU utilization down to the pod, giving teams the granular signal chargeback and showback require. Cost Monitoring extends that to invoice-accurate spend visibility across all clusters, down to namespace, label, GPU, and network, so platform and finance teams can act on the full picture.
Inference vs. Training Cost Profiles
Inference and training consume GPU capacity differently. Applying the same strategy to both makes inference too expensive or training unreliable.
| Workload type | Traffic pattern | Typical session or job length | Memory footprint | Best optimization tactic | Expected savings impact |
| Inference | Bursty, service-driven, often multi-service | Long-running endpoints with variable request volume | Model weights plus KV cache; often memory-reserved | Sharing, fractional allocation, replica tuning, batching, serving-layer tuning | High when many endpoints sit partly idle |
| Training | Scheduled, queued, or campaign-based | Longer-running jobs with clearer start and end points | Larger and more sustained; often full-GPU or multi-GPU | Queueing, checkpointing, job placement, capacity planning, spot where safe | High when jobs can tolerate scheduling flexibility |
Start with workload behavior, then choose GPU type, allocation model, and scheduling policy.
Practical Kubernetes GPU Cost Optimization Checklist
| Optimization tactic | Expected savings impact | Best-fit workload type | Risk to watch for |
| Right-size GPU type to model size, latency target, and memory footprint | Potentially high when models run on oversized GPUs | Inference and smaller training jobs | Smaller GPUs can hurt latency, batch size, or model fit. |
| Measure GPU or SM activity and framebuffer utilization before changing placement | Potentially high since it prevents blind consolidation | All GPU workloads | Node-level averages can hide pod-level contention. |
| Choose a sharing strategy based on isolation and performance risk | Potentially high for multi-service inference clusters | Inference, notebooks, internal services | Weak isolation can affect tenants or SLOs. |
| Use MIG for stronger isolation | Medium to high where tenants need hardware-level boundaries | Production multi-tenant inference | Static profiles can leave part of the GPU unused if slices are larger than workload demand. |
| Use MPS for trusted compatible CUDA inference workloads | Medium where CUDA workloads benefit from concurrency | Same-team or same-trust-boundary inference services | Fatal faults and interference can affect shared clients. |
| Use time-slicing for low-risk oversubscription | Medium in dev/test and bursty environments | Notebooks, experiments, low-criticality services | There is no provision for memory or fault isolation. |
| Plan baseline vs. burst GPU demand | Potentially high in variable traffic environments | Inference and batch | Conservative baselines can become permanent waste. |
| Use pre-warmed replicas or proactive capacity only where scale-up delay creates measured impact | Medium when cold starts affect reliability | Latency-sensitive inference | Warm capacity can become idle spend. |
| Avoid fixed GPU profiles when demand changes frequently | Potentially high in fast-moving AI platforms | Multi-model inference environments | Manual profile management can lag workload behavior. |
ScaleOps pro tip: ScaleOps GPU Platform automates the tactics in this checklist continuously, managing fractional allocation, memory optimization, and replica scaling based on live workload behavior rather than manual policies set once and left to drift. For a deeper decision matrix on sharing strategies before changing production allocation policies, see our dedicated MIG vs. MPS vs. time-slicing guide.
Model Serving Patterns and Cold-Start Cost
Model serving settings shape GPU cost after placement. Serving stacks like vLLM improve throughput through batching, scheduling, and KV cache management. vLLM’s PagedAttention approach manages the KV cache as paged memory, enabling more efficient GPU memory use across concurrent requests.
Monitor real framebuffer usage regardless. vLLM can reserve device memory through settings like gpu_memory_utilization, and an aggressive reservation blocks sharing even when compute stays low.
Serving behavior connects directly to cost:
- Batch size affects SM utilization.
- KV cache affects GPU memory pressure.
- Concurrency affects latency and memory footprint.
- Traffic shape determines whether replicas sit idle.
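To see why KV cache and concurrency drive GPU memory pressure, a common back-of-the-envelope estimate multiplies keys and values across every layer and token. The Python sketch below uses an illustrative decoder shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) that is an assumption rather than a specific model, and it ignores paging overhead, quantized caches, and framework reservations.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# Illustrative shape, not a specific model: 16 concurrent sequences of 8,192 tokens.
tokens_in_flight = 16 * 8192
size = kv_cache_bytes(tokens_in_flight, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB of KV cache")
```

Under these assumptions the cache alone needs about 16 GiB on top of the model weights, and doubling concurrency or context length doubles it.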
Cold starts represent a distinct cost factor. GPU scale-up and model startup can produce several minutes of paid idle time before requests are served.
ScaleOps pro tip: Cold-start engineering reduces startup delay. Kubernetes GPU cost optimization reduces ongoing waste after workloads are running. For the full 3 to 8 minute startup path, covering node provisioning, image pulls, model download, CUDA initialization, and weight transfer, read our GPU Cold Starts article.
How ScaleOps GPU Platform Turns GPU Visibility into Continuous Optimization
The practical GPU optimization workflow has four steps: observe, interpret, allocate, and attribute.
- Observe GPU compute and memory.
- Interpret SM versus framebuffer patterns.
- Manage fractional allocation and memory reservations.
- Tie utilization back to ownership and cost.
ScaleOps GPU Platform is built for this continuous workflow in Kubernetes inference environments. Teams running self-hosted open models on GPU Platform typically see 50 to 70% GPU cost savings. For teams moving off frontier-model APIs, that layers on top of the 10x to 30x cost reduction from hosting open models directly.
GPU Platform brings together five capabilities:
| ScaleOps GPU Platform Capability | What it Does |
| GPU Observability | Workload-level visibility into GPU compute and memory utilization, time to first token, and inference performance |
| Automated Fractional GPUs | Dynamic, demand-based fractional GPU allocation with MIG-aware optimization, replacing manual time-slicing |
| GPU Memory Optimization | Right-sizes GPU memory for serving frameworks like vLLM that claim most GPU memory by default, recovering capacity without hurting performance |
| GPU Replica Optimization | Scales inference replicas to real demand, removing idle GPU capacity while protecting latency |
| Batch Inference Optimization | Aggressive scheduling and consolidation for latency-tolerant batch inference, the highest-savings class of GPU workload |
GPU Platform also works alongside ScaleOps Core Platform capabilities, including Kubernetes Cost Monitoring, Karpenter Optimization, Spot Optimization, and Replica Optimization, where they support the GPU optimization workflow.
Rather than asking teams to manually tune every model endpoint, GPU Platform builds an operating loop that keeps GPU allocation aligned with real demand. The result is inference infrastructure that runs reliably at scale, with continuous GPU cost optimization as the outcome.
Key Steps for GPU Cost Optimization in Kubernetes
Kubernetes GPU cost optimization is an allocation and utilization problem, not a discounting problem. Discounts lower the rate. They do not fix a cluster where expensive GPUs sit allocated and mostly idle.
The structural issue: Kubernetes allocation is device-based, while workload demand is compute- and memory-shaped. Uncovering real waste requires GPU telemetry that separates compute activity from memory utilization. A workload can reserve or consume large amounts of VRAM while leaving GPU compute capacity underused.
The optimization path follows the signal: read SM and framebuffer patterns together, match strategies to workload type, attribute spend to the teams and services that own it, and build a continuous loop rather than a one-time audit.
ScaleOps GPU Platform delivers that loop for Kubernetes inference workloads. Book a ScaleOps demo to see how adaptive GPU optimization keeps inference fast and reliable, and reduces GPU costs as a result.
Frequently Asked Questions
How do you monitor GPU utilization in Kubernetes?
Use NVIDIA DCGM Exporter to expose GPU metrics to Prometheus, then connect those metrics to pods, namespaces, services, models, and owner-related labels or annotations. In Grafana, track SM utilization, framebuffer memory, memory bandwidth, transfer throughput, and power alongside unit cost metrics such as cost per request, token, job, or endpoint.
What is DCGM Exporter in Kubernetes?
DCGM Exporter is NVIDIA’s Prometheus-compatible exporter for GPU telemetry. It exposes metrics from NVIDIA Data Center GPU Manager, allowing Kubernetes teams to monitor GPU utilization, memory usage, power, health, and workload behavior.
Why is my GPU utilization low in Kubernetes?
GPU utilization is often low because Kubernetes allocates GPUs as whole devices while workloads use compute and memory unevenly. Common causes include oversized models, excessive VRAM reservation, idle replicas, small batches, low traffic, poor placement, and lack of GPU sharing.
How do I reduce GPU costs in Kubernetes?
Measure SM and framebuffer utilization, attribute spend to services and owners, consolidate low-utilization workloads, choose the right sharing method, tune serving settings, reduce unnecessary memory reservation, and use fractional allocation where isolation and performance requirements allow.
Can HPA or VPA optimize GPU workloads in Kubernetes?
Not by themselves. HPA can use GPU custom metrics when correctly exposed, but it does not optimize GPU performance by default. VPA focuses on CPU and memory requests, not VRAM sizing or fractional GPU allocation.
GPU time-slicing vs. MIG for Kubernetes cost optimization: which is better?
Time-slicing suits low-risk oversubscription, development, notebooks, and bursty workloads that can tolerate weak isolation. MIG suits production workloads that require stronger memory and fault isolation. The right choice depends on tenant risk, latency sensitivity, and hardware support.
What is the NVIDIA device plugin, and why does it limit GPU cost optimization?
The NVIDIA device plugin exposes GPUs to Kubernetes as schedulable resources such as nvidia.com/gpu. This enables GPU scheduling, but the standard device-plugin model does not make Kubernetes aware of real GPU compute or memory utilization after allocation.