GPU Cost Optimization in Kubernetes: From Waste to Efficient AI Infrastructure

Konstantin Zelmanovich

Key takeaways

  • Kubernetes GPU cost optimization is the practice of measuring real GPU compute and memory utilization, then matching workload allocation to actual demand rather than reserving entire devices.
  • Without optimization, production inference workloads typically run at 5% to 20% GPU utilization while you pay for 100% of allocated capacity.
  • Kubernetes allocates GPUs as devices, but inference workloads consume GPU compute and memory unevenly. That structural gap is where waste accumulates.
  • GPU telemetry (SM utilization and framebuffer memory) combined with cost attribution identifies which workloads are wasting GPU capacity.
  • ScaleOps GPU Platform delivers continuous optimization through GPU Observability, Automated Fractional GPUs, GPU Memory Optimization, GPU Replica Optimization, and Batch Inference Optimization.

What Is GPU Cost Optimization?

GPU cost optimization is the practice of measuring real GPU compute and memory utilization across Kubernetes workloads and matching allocation to actual demand. Kubernetes GPU cost optimization addresses a structural gap: the scheduler treats GPUs as whole devices, but inference workloads consume GPU compute and memory unevenly, leaving expensive capacity allocated but unused.

Why Kubernetes GPU Costs Get Out of Control

GPU costs spiral when AI and ML platform teams assume a reserved GPU is being actively used. When a pod requests a GPU, Kubernetes assigns it and the bill starts. The workload may use only a small fraction of that device for most of the day.

That waste compounds because GPU capacity is billed by the hour, even when a workload uses only a fraction of the device. The exact rate varies by provider, region, commitment, availability, support model, and instance configuration, but the principle is consistent: low utilization turns every always-on GPU into a recurring cost center.

The table below offers pricing context for the waste examples that follow. Verify current rates before building a business case.

| Provider category | Example GPU | Typical 2026 on-demand range | What to verify before using the GPU-hour estimate |
| --- | --- | --- | --- |
| Hyperscaler | H100 | Often mid single digits to low double digits per GPU-hour after normalizing multi-GPU instances, e.g., about $4–$6/GPU-hour in some AWS/GCP-style configurations and higher in some Azure-style bundled instances | Region, GPU count per VM, bundled CPU and memory, reservations, support, data transfer |
| Neo-cloud | H100 | Often low-to-mid single digits per GPU-hour, e.g., around $2.50–$6.16/GPU-hour depending on provider, term, and cluster size | Availability, uptime commitments, storage, networking, support, capacity guarantees |
| Hyperscaler or marketplace | A100 | Usually lower than H100 but highly variable, with public ranges often around sub-$1 to about $5/GPU-hour depending on marketplace, memory size, and availability | GPU memory size, spot or preemptible terms, commitment model, reserved capacity |

In production, inference workloads commonly run at only 5–20% GPU utilization. In typical unoptimized deployments, most paid compute capacity sits idle while the bill reflects 100% allocation.

ScaleOps pro tip: Use ScaleOps AI Inference Observability to connect raw GPU spend directly to workload ownership, mapping which teams and services generate unnecessary costs and which ones actually use the hardware.

The Allocation Problem That Kubernetes Cannot See

The standard Kubernetes device plugin model treats GPUs as either allocated or available. It has no visibility into whether an allocated GPU is 5%, 50%, or 95% utilized after scheduling. This breaks cost efficiency in four ways:

  • GPU capacity is reserved at the device level.
  • Streaming multiprocessor (SM) utilization can stay low.
  • Video RAM (VRAM) can be overreserved.
  • Idle GPU capacity cannot be reused unless sharing or fractional allocation is configured.

Standard Kubernetes tooling misses this waste because CPU and memory are native scheduling resources, while GPU compute and framebuffer memory require specialized telemetry. A node can look fully allocated even when expensive GPU capacity lies mostly idle.

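For example, this Deployment runs a single pod that reserves one full GPU:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      containers:
      - name: model
        image: example.com/embedding-service:latest
        resources:
          limits:
            # Reserves one whole GPU device; says nothing about actual usage.
            nvidia.com/gpu: 1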

Nothing in that request tells Kubernetes whether the model uses 2 GB or 70 GB of VRAM, whether SM utilization is 10% or 90%, or whether the service receives one request per minute or thousands per second. GPU telemetry and cost attribution reveal that difference.

Why HPA and VPA Do Not Solve GPU Optimization

HPA and VPA are useful tools, but neither was built around GPU memory behavior.

HPA can use GPU custom metrics when they are exposed and mapped correctly. If HPA sees CPU but not GPU pressure, it under-scales a GPU-bound service. If it sees only node-level GPU averages, it misses which pod or model endpoint drives the pressure.

VPA tunes CPU and memory requests but does not handle VRAM sizing or fractional GPU allocation. A model server can consume excessive GPU memory even when its CPU and container memory requests look reasonable.

These gaps become visible when inference workloads hit GPU memory limits. A service can appear healthy from a CPU and container memory perspective, while excessive VRAM reservation or sudden KV cache growth causes model-server errors, CUDA out-of-memory errors, failed requests, or container crashes.

GPU optimization requires its own telemetry and allocation logic. Standard HPA and VPA behavior alone is not enough.

A GPU-aware HPA configuration looks like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 12
  metrics:
  - type: Pods
    pods:
      metric:
        # Custom metric name; it must be exposed per pod through a custom metrics adapter.
        name: gpu_sm_utilization
      target:
        type: AverageValue
        averageValue: "65"


The hard part is not writing the HPA spec. It is implementing trustworthy GPU telemetry and a scaling policy that accounts for latency, memory, and request shape.
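To make the scaling behavior concrete, here is a minimal Python sketch of the replica calculation HPA applies to a Pods metric with an AverageValue target: the ratio of the observed per-pod average to the target, scaled by the current replica count and clamped to the configured bounds. The function name and sample readings are hypothetical, and the sketch deliberately ignores stabilization windows, missing metrics, and pod readiness, all of which the real controller handles.

```python
import math


def desired_replicas(pod_values, target_average, min_replicas, max_replicas,
                     tolerance=0.1):
    """Approximate the HPA replica calculation for a per-pod metric.

    pod_values: one metric reading per running pod (e.g. SM utilization in %).
    tolerance: HPA skips scaling when the ratio is this close to 1.0
               (0.1 is the Kubernetes default).
    """
    current = len(pod_values)
    if current == 0:
        return min_replicas
    ratio = (sum(pod_values) / current) / target_average
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # Close enough to target: no change.
    else:
        desired = math.ceil(current * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))


# Three busy pods against the 65% target from the spec above.
busy = desired_replicas([90, 85, 95], target_average=65,
                        min_replicas=2, max_replicas=12)
# Two mostly idle pods: the result is held at minReplicas.
idle = desired_replicas([20, 25], target_average=65,
                        min_replicas=2, max_replicas=12)
```

Note the second case: with minReplicas set to 2, idle pods never scale below two GPUs, which is why baseline replica counts deserve the same scrutiny as the metric itself.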

ScaleOps pro tip: Use ScaleOps Automated Fractional GPUs to manage pod-level GPU allocation dynamically based on real workload demand. Where HPA and VPA fall short on GPU memory behavior, Automated Fractional GPUs applies MIG-aware optimization and GPU sharing continuously in production, replacing the manual, node-by-node time-slicing that teams typically rely on.

How to Monitor GPU Utilization in Kubernetes

NVIDIA Data Center GPU Manager (DCGM) Exporter is the standard NVIDIA tool for collecting GPU telemetry in Kubernetes. It exposes NVIDIA DCGM metrics in a Prometheus-compatible format, making them available to dashboards in Grafana or any Prometheus-compatible observability stack.

Key metric categories to monitor:

  • GPU utilization or SM activity for compute activity
  • Framebuffer memory used for VRAM pressure
  • Memory bandwidth or copy utilization for memory-bound workloads
  • PCIe or NVLink throughput for data movement bottlenecks
  • Power draw and cost per job, request, token, or model endpoint, where available

A basic ServiceMonitor for a Prometheus Operator setup looks like this:

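apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  # Must match the labels on the DCGM Exporter Service in your cluster.
  selector:
    matchLabels:
      app: dcgm-exporter
  # Namespace where the DCGM Exporter Service runs.
  namespaceSelector:
    matchNames:
    - gpu-operator
  endpoints:
  # Must match the named port on that Service.
  - port: metrics
    interval: 15s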

Dashboards should connect metrics to pods, namespaces, models, services, teams, and owners, not just nodes. A node-level chart in Grafana shows that a GPU is busy. It cannot show whether the cost comes from search, recommendations, internal notebooks, or a forgotten staging endpoint.

Useful dashboards surface answers to operational questions:

  • Which workloads reserve GPU memory but do little compute?
  • Which namespaces carry the highest idle GPU cost?
  • Which models are latency-bound versus memory-bound?
  • Which owners should review baseline replicas or allocation policy?
  • Which endpoints have the highest cost per token, request, or job?

This is where monitoring becomes cost management. GPU metrics point to a specific next action: reduce replicas, change placement, tune memory reservation, or assign an owner to review the workload.
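As a rough illustration of pod-level attribution, the Python sketch below parses a few lines of Prometheus text output and groups GPU metrics by namespace and pod. DCGM_FI_DEV_GPU_UTIL (percent) and DCGM_FI_DEV_FB_USED (MiB) are real DCGM field names, but the sample values, the pod names, the thresholds, and the assumption that the exporter attaches namespace and pod labels to each sample are illustrative; check the labels your exporter actually emits.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; real output carries more labels and metrics.
SAMPLE = '''
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",namespace="search",pod="embed-7f9c"} 8
DCGM_FI_DEV_FB_USED{gpu="0",namespace="search",pod="embed-7f9c"} 61440
DCGM_FI_DEV_GPU_UTIL{gpu="1",namespace="recs",pod="ranker-5d2a"} 72
DCGM_FI_DEV_FB_USED{gpu="1",namespace="recs",pod="ranker-5d2a"} 30720
'''

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def per_pod_metrics(text):
    """Group metric samples by (namespace, pod); comment lines are skipped."""
    result = defaultdict(dict)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels))
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        result[key][name] = float(value)
    return dict(result)


def memory_heavy_idle(metrics, util_below=20.0, fb_above_mib=40_000):
    """Pods holding a lot of framebuffer memory while doing little compute."""
    return sorted(
        key for key, m in metrics.items()
        if m.get("DCGM_FI_DEV_GPU_UTIL", 0.0) < util_below
        and m.get("DCGM_FI_DEV_FB_USED", 0.0) > fb_above_mib
    )


metrics = per_pod_metrics(SAMPLE)
candidates = memory_heavy_idle(metrics)
```

In practice the same grouping is usually done in PromQL or in the dashboard layer; the point is that the join key is the pod, not the node.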

GPU Compute Utilization vs. Memory Utilization

Before deciding whether a GPU workload is wasteful, separate compute usage from memory usage. SM is short for streaming multiprocessor, the part of the GPU that runs parallel compute work. A single utilization number hides the real bottleneck, especially in inference environments where models can reserve large amounts of VRAM without keeping GPU compute units busy.

The main signals:

  • SM utilization (or SM activity where available) shows how busy the GPU compute units are.
  • Framebuffer memory shows how much VRAM is occupied.
  • Memory bandwidth and copy-engine metrics show whether the workload spends more time moving data than computing.

Read these signals together before deciding whether to consolidate, resize, tune batching, or change allocation strategy.

| Metric | What it tells you | Cost interpretation | Action to consider |
| --- | --- | --- | --- |
| SM utilization | How active the GPU compute units are | Low SM suggests paid compute capacity is idle. | Consolidate, share, tune batching, or reduce replicas. |
| Framebuffer memory used | How much VRAM is occupied | High VRAM can block co-location even when compute is low. | Right-size memory reservation, manage model placement, tune KV cache. |
| Memory bandwidth or copy utilization | How much data movement is happening | Low SM with high movement may indicate a memory-bound workload. | Optimize batching, data layout, transfer path, or serving settings. |
| PCIe/NVLink throughput | Whether data transfer is a bottleneck | High transfer pressure can make extra GPUs appear necessary. | Reduce host-device movement or improve placement. |
| Cost per job, request, token, or endpoint | How spend maps to business output | Unit cost reveals waste better than node cost alone. | Attribute spend by model, namespace, service, or owner. |

This four-state diagnostic view helps during reviews:

| Pattern | What it means | Optimization response |
| --- | --- | --- |
| High SM + high framebuffer | GPU is likely well used. | Check latency and saturation before consolidating. |
| Low SM + high framebuffer | Workload is memory-reserved but compute-light. | Investigate smaller models, KV cache settings, fractional allocation, or memory rightsizing. |
| Low SM + high memory bandwidth | Workload may be memory-bound. | Optimize batching, data movement, or serving configuration before adding GPUs. |
| Low SM + low framebuffer | Strong candidate for optimization. | Use sharing, consolidation, fractional allocation, or replica reduction. |
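The four patterns above can be encoded as a small triage function. This Python sketch is one possible reading of the table, not a definitive rule set: the 60% cutoff for "high" is an arbitrary assumption that should be tuned per workload, and the function name and return labels are hypothetical. Memory bandwidth is checked before framebuffer because a workload can match both low-SM rows at once.

```python
def classify_gpu_pattern(sm_pct, fb_pct, mem_bw_pct, high=60.0):
    """Map averaged GPU signals (all in percent) to a diagnostic pattern.

    high: assumed cutoff between "low" and "high"; tune it per workload.
    """
    sm_high = sm_pct >= high
    fb_high = fb_pct >= high
    bw_high = mem_bw_pct >= high

    if sm_high and fb_high:
        return "well-used"
    if sm_high:
        # Compute-heavy with spare memory: not one of the four table rows.
        return "compute-heavy"
    if bw_high:
        return "memory-bound"
    if fb_high:
        return "memory-reserved"
    return "optimization-candidate"


# Suggested first response per pattern, paraphrasing the table.
RESPONSES = {
    "well-used": "Check latency and saturation before consolidating.",
    "memory-bound": "Tune batching, data movement, or serving configuration.",
    "memory-reserved": "Review KV cache settings and memory rightsizing.",
    "optimization-candidate": "Consider sharing, consolidation, or fewer replicas.",
    "compute-heavy": "Review manually; memory headroom may allow co-location.",
}

pattern = classify_gpu_pattern(sm_pct=10, fb_pct=85, mem_bw_pct=20)
next_step = RESPONSES[pattern]
```

Feed the function averages over a representative window rather than instantaneous samples, since a single scrape can land in the middle of a burst or a lull.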

The goal is not 100% utilization. A constantly saturated GPU leaves no room to absorb queueing and can push p99 latency past its target. A GPU with no memory headroom can fail under a burst. The practical goal is to reduce unused paid capacity while preserving service-level objectives.

ScaleOps pro tip: Use ScaleOps GPU Platform’s GPU Memory Optimization to manage overprovisioned GPU memory reservations that block sharing even when actual usage is lower.

Calculate the Real Cost of Low GPU Utilization in Kubernetes

GPU waste becomes actionable when you convert utilization into dollars. Start with a simple calculation:

hourly GPU cost × 24 × 365 = annual GPU cost
annual GPU cost × idle percentage = estimated unused-capacity cost
waste by namespace, service, model, or owner = action model for optimization

Two scenarios show how quickly low GPU utilization becomes annual waste:

At 15% average utilization, 85% of allocated capacity is not performing useful work. Ten always-on GPUs can incur six-figure unused-capacity costs before discounts, commitments, support, storage, or data transfer.
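The arithmetic behind that statement is easy to reproduce. The Python sketch below applies the two formulas above to ten always-on GPUs at 15% average utilization. The $4.00 per GPU-hour rate is an assumed placeholder taken from the low end of the hyperscaler H100 range quoted earlier, not a quoted price; substitute your own normalized rate.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; always-on capacity


def annual_gpu_cost(hourly_rate, gpu_count=1):
    """hourly GPU cost x 24 x 365, multiplied across the fleet."""
    return hourly_rate * HOURS_PER_YEAR * gpu_count


def unused_capacity_cost(hourly_rate, gpu_count, avg_utilization):
    """annual GPU cost x idle percentage (avg_utilization is a 0-1 fraction)."""
    idle_fraction = 1.0 - avg_utilization
    return annual_gpu_cost(hourly_rate, gpu_count) * idle_fraction


# Assumed inputs: $4.00/GPU-hour, 10 GPUs, 15% average utilization.
total = annual_gpu_cost(4.00, gpu_count=10)
waste = unused_capacity_cost(4.00, gpu_count=10, avg_utilization=0.15)
print(f"annual spend: ${total:,.0f}, unused capacity: ${waste:,.0f}")
# prints "annual spend: $350,400, unused capacity: $297,840"
```

This treats utilization as a single average, which overstates recoverable savings: some headroom is needed for bursts and latency targets, so read the result as an upper bound on waste rather than a savings forecast.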

Attribution is the next step. A cluster-level waste number is interesting. A namespace-level or service-level number changes behavior. If the recommendation service has $40K of unused capacity and the experimentation namespace has $25K, each team has a concrete optimization target.

FinOps Attribution: Chargeback and Showback for GPU Spend

Visibility without accountability leaves waste in place. Two FinOps practices, chargeback and showback, connect GPU spend to the teams and services that generate it.

Showback surfaces GPU cost data to team owners without billing them directly. It creates visibility and prompts review without a hard financial consequence. Chargeback allocates costs to the responsible team’s budget, creating a direct incentive to optimize.

Both approaches require cost data at the namespace, workload, label, and annotation level. Node-level cost data is not granular enough. Teams cannot identify their own waste or take targeted action without workload-level attribution.
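A showback report is, at its core, a group-by over workload-level records. The Python sketch below rolls hypothetical workload records up to the namespace level and ranks namespaces by unused-capacity cost. The namespaces, GPU counts, utilization figures, and the $4.00 hourly rate are all invented for illustration; in a real pipeline they would come from GPU telemetry and billing data.

```python
from collections import defaultdict

HOURS_PER_YEAR = 24 * 365

# Hypothetical workload records: one row per GPU-backed workload.
WORKLOADS = [
    {"namespace": "recommendations", "gpus": 2, "hourly_rate": 4.0, "avg_utilization": 0.40},
    {"namespace": "experimentation", "gpus": 1, "hourly_rate": 4.0, "avg_utilization": 0.25},
    {"namespace": "search", "gpus": 1, "hourly_rate": 4.0, "avg_utilization": 0.70},
]


def showback_by_namespace(workloads):
    """Sum annual and unused GPU cost per namespace, largest waste first."""
    report = defaultdict(lambda: {"annual_cost": 0.0, "unused_cost": 0.0})
    for w in workloads:
        annual = w["gpus"] * w["hourly_rate"] * HOURS_PER_YEAR
        row = report[w["namespace"]]
        row["annual_cost"] += annual
        row["unused_cost"] += annual * (1.0 - w["avg_utilization"])
    return sorted(report.items(),
                  key=lambda item: item[1]["unused_cost"],
                  reverse=True)


report = showback_by_namespace(WORKLOADS)
for namespace, row in report:
    print(f"{namespace}: ${row['unused_cost']:,.0f} unused "
          f"of ${row['annual_cost']:,.0f}")
```

The same report becomes chargeback once the annual_cost column is posted against each team's budget instead of only being displayed.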

ScaleOps GPU Observability surfaces workload-level GPU utilization down to the pod, giving teams the granular signal chargeback and showback require. Cost Monitoring extends that to invoice-accurate spend visibility across all clusters, down to namespace, label, GPU, and network, so platform and finance teams can act on the full picture.

Inference vs. Training Cost Profiles

Inference and training consume GPU capacity differently. Applying the same strategy to both makes inference too expensive or training unreliable.

| Workload type | Traffic pattern | Typical session or job length | Memory footprint | Best optimization tactic | Expected savings impact |
| --- | --- | --- | --- | --- | --- |
| Inference | Bursty, service-driven, often multi-service | Long-running endpoints with variable request volume | Model weights plus KV cache; often memory-reserved | Sharing, fractional allocation, replica tuning, batching, serving-layer tuning | High when many endpoints sit partly idle |
| Training | Scheduled, queued, or campaign-based | Longer-running jobs with clearer start and end points | Larger and more sustained; often full-GPU or multi-GPU | Queueing, checkpointing, job placement, capacity planning, spot where safe | High when jobs can tolerate scheduling flexibility |

Start with workload behavior, then choose GPU type, allocation model, and scheduling policy.

Practical Kubernetes GPU Cost Optimization Checklist

| Optimization tactic | Expected savings impact | Best-fit workload type | Risk to watch for |
| --- | --- | --- | --- |
| Right-size GPU type to model size, latency target, and memory footprint | Potentially high when models run on oversized GPUs | Inference and smaller training jobs | Smaller GPUs can hurt latency, batch size, or model fit. |
| Measure GPU or SM activity and framebuffer utilization before changing placement | Potentially high since it prevents blind consolidation | All GPU workloads | Node-level averages can hide pod-level contention. |
| Choose a sharing strategy based on isolation and performance risk | Potentially high for multi-service inference clusters | Inference, notebooks, internal services | Weak isolation can affect tenants or SLOs. |
| Use MIG for stronger isolation | Medium to high where tenants need hardware-level boundaries | Production multi-tenant inference | Static profiles can leave part of the GPU unused if slices are larger than workload demand. |
| Use MPS for trusted compatible CUDA inference workloads | Medium where CUDA workloads benefit from concurrency | Same-team or same-trust-boundary inference services | Fatal faults and interference can affect shared clients. |
| Use time-slicing for low-risk oversubscription | Medium in dev/test and bursty environments | Notebooks, experiments, low-criticality services | There is no provision for memory or fault isolation. |
| Plan baseline vs. burst GPU demand | Potentially high in variable traffic environments | Inference and batch | Conservative baselines can become permanent waste. |
| Use pre-warmed replicas or proactive capacity only where scale-up delay creates measured impact | Medium when cold starts affect reliability | Latency-sensitive inference | Warm capacity can become idle spend. |
| Avoid fixed GPU profiles when demand changes frequently | Potentially high in fast-moving AI platforms | Multi-model inference environments | Manual profile management can lag workload behavior. |
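The MIG risk in the checklist, static profiles that leave part of the GPU unused, can be shown with a small profile-fitting sketch in Python. The profile list uses the nominal MIG profile names for an A100 40GB card (compute slices and memory in GB); usable memory per slice is somewhat lower in practice, and other GPUs expose different profiles, so treat the table and the function name as illustrative assumptions.

```python
# Nominal MIG profiles for an A100 40GB: (name, compute slices, memory in GB).
A100_40GB_PROFILES = [
    ("1g.5gb", 1, 5),
    ("2g.10gb", 2, 10),
    ("3g.20gb", 3, 20),
    ("4g.20gb", 4, 20),
    ("7g.40gb", 7, 40),
]


def smallest_fitting_profile(mem_gb_needed, compute_slices_needed=1):
    """Return (profile name, stranded memory in GB), or None if nothing fits.

    Profiles are scanned from smallest to largest, so the first match is
    the tightest fit for both memory and compute.
    """
    for name, slices, mem_gb in A100_40GB_PROFILES:
        if mem_gb >= mem_gb_needed and slices >= compute_slices_needed:
            return name, mem_gb - mem_gb_needed
    return None


# A workload needing 6 GB cannot use the 5 GB slice, so it strands 4 GB.
small = smallest_fitting_profile(6)
# A workload needing 12 GB jumps to a 20 GB slice and strands 8 GB.
medium = smallest_fitting_profile(12)
```

Workloads that sit just above a profile boundary are the ones worth reviewing first: a modest reduction in memory footprint can move them to a smaller slice.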

ScaleOps pro tip: ScaleOps GPU Platform automates the tactics in this checklist continuously, managing fractional allocation, memory optimization, and replica scaling based on live workload behavior rather than manual policies set once and left to drift. For a deeper decision matrix on sharing strategies before changing production allocation policies, see our dedicated MIG vs. MPS vs. time-slicing guide.

Model Serving Patterns and Cold-Start Cost

Model serving settings shape GPU cost after placement. Serving stacks like vLLM improve throughput through batching, scheduling, and KV cache management. vLLM’s PagedAttention approach manages the KV cache as paged memory, enabling more efficient GPU memory use across concurrent requests.

Monitor real framebuffer usage regardless. vLLM can reserve device memory through settings like gpu_memory_utilization, and an aggressive reservation blocks sharing even when compute stays low.

Serving behavior connects directly to cost:

  • Batch size affects SM utilization.
  • KV cache affects GPU memory pressure.
  • Concurrency affects latency and memory footprint.
  • Traffic shape determines whether replicas sit idle.
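To see how reservation and real demand can diverge, the Python sketch below compares the memory a server reserves under a gpu_memory_utilization-style fraction with a back-of-the-envelope estimate of what the model needs. The per-token KV cache formula (keys plus values, per layer, per KV head) is the standard estimate for transformer decoders. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit cache), the 13 GiB weight size, the 80 GiB device, and the token count are assumptions chosen for illustration, and the estimate ignores activations and framework overhead.

```python
GIB = 2**30


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def reserved_vs_needed_gib(total_gib, gpu_memory_utilization, weights_gib,
                           tokens_in_flight, bytes_per_token):
    """Compare reserved device memory with an estimate of actual demand."""
    reserved = total_gib * gpu_memory_utilization
    needed = weights_gib + tokens_in_flight * bytes_per_token / GIB
    return reserved, needed


# Assumed model shape, roughly a 7B-parameter decoder with a 16-bit cache.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Assumed deployment: 80 GiB device, 0.9 reservation, 16,384 tokens in flight.
reserved, needed = reserved_vs_needed_gib(
    total_gib=80, gpu_memory_utilization=0.9,
    weights_gib=13, tokens_in_flight=16_384, bytes_per_token=per_token,
)
headroom = reserved - needed  # Memory reserved but not needed at this load.
```

Under these assumptions most of the reservation is idle at the stated load, which is the situation that blocks co-location even though the GPU looks "full" to the scheduler.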

Cold starts represent a distinct cost factor. GPU scale-up and model startup can produce several minutes of paid idle time before requests are served.

ScaleOps pro tip: Cold-start engineering reduces startup delay. Kubernetes GPU cost optimization reduces ongoing waste after workloads are running. For the full 3 to 8 minute startup path, covering node provisioning, image pulls, model download, CUDA initialization, and weight transfer, read our GPU Cold Starts article.

How ScaleOps GPU Platform Turns GPU Visibility into Continuous Optimization

The practical GPU optimization workflow has four steps: observe, interpret, allocate, and attribute.

  1. Observe GPU compute and memory.
  2. Interpret SM versus framebuffer patterns.
  3. Manage fractional allocation and memory reservations.
  4. Tie utilization back to ownership and cost.

ScaleOps GPU Platform is built for this continuous workflow in Kubernetes inference environments. Teams running self-hosted open models on GPU Platform typically see 50 to 70% GPU cost savings. For teams moving off frontier-model APIs, that layers on top of the 10x to 30x cost reduction from hosting open models directly.

GPU Platform brings together five capabilities:

| ScaleOps GPU Platform Capability | What it Does |
| --- | --- |
| GPU Observability | Workload-level visibility into GPU compute and memory utilization, time to first token, and inference performance |
| Automated Fractional GPUs | Dynamic, demand-based fractional GPU allocation with MIG-aware optimization, replacing manual time-slicing |
| GPU Memory Optimization | Right-sizes GPU memory for serving frameworks like vLLM that claim most GPU memory by default, recovering capacity without hurting performance |
| GPU Replica Optimization | Scales inference replicas to real demand, removing idle GPU capacity while protecting latency |
| Batch Inference Optimization | Aggressive scheduling and consolidation for latency-tolerant batch inference, the highest-savings class of GPU workload |

GPU Platform also works alongside ScaleOps Core Platform capabilities, including Kubernetes Cost Monitoring, Karpenter Optimization, Spot Optimization, and Replica Optimization, where they support the GPU optimization workflow.

Rather than asking teams to manually tune every model endpoint, GPU Platform builds an operating loop that keeps GPU allocation aligned with real demand. The result is inference infrastructure that runs reliably at scale, with continuous GPU cost optimization as the outcome.

Key Steps for GPU Cost Optimization in Kubernetes

Kubernetes GPU cost optimization is an allocation and utilization problem, not a discounting problem. Discounts lower the rate. They do not fix a cluster where expensive GPUs sit allocated and mostly idle.

The structural issue: Kubernetes allocation is device-based, while workload demand is compute- and memory-shaped. Uncovering real waste requires GPU telemetry that separates compute activity from memory utilization. A workload can reserve or consume large amounts of VRAM while leaving GPU compute capacity underused.

The optimization path follows the signal: read SM and framebuffer patterns together, match strategies to workload type, attribute spend to the teams and services that own it, and build a continuous loop rather than a one-time audit.

ScaleOps GPU Platform delivers that loop for Kubernetes inference workloads. Book a ScaleOps demo to see how adaptive GPU optimization keeps inference fast and reliable, and reduces GPU costs as a result.

Frequently Asked Questions

How do you monitor GPU utilization in Kubernetes?

Use NVIDIA DCGM Exporter to expose GPU metrics to Prometheus, then connect those metrics to pods, namespaces, services, models, and owner-related labels or annotations. View them in Grafana alongside SM utilization, framebuffer memory, memory bandwidth, transfer throughput, power, and unit cost metrics such as cost per request, token, job, or endpoint.

What is DCGM Exporter in Kubernetes?

DCGM Exporter is NVIDIA’s Prometheus-compatible exporter for GPU telemetry. It exposes metrics from NVIDIA Data Center GPU Manager, allowing Kubernetes teams to monitor GPU utilization, memory usage, power, health, and workload behavior.

Why is my GPU utilization low in Kubernetes?

GPU utilization is often low because Kubernetes allocates GPUs as whole devices while workloads use compute and memory unevenly. Common causes include oversized models, excessive VRAM reservation, idle replicas, small batches, low traffic, poor placement, and lack of GPU sharing.

How do I reduce GPU costs in Kubernetes?

Measure SM and framebuffer utilization, attribute spend to services and owners, consolidate low-utilization workloads, choose the right sharing method, tune serving settings, reduce unnecessary memory reservation, and use fractional allocation where isolation and performance requirements allow.

Can HPA or VPA optimize GPU workloads in Kubernetes?

Not by themselves. HPA can use GPU custom metrics when correctly exposed, but it does not optimize GPU performance by default. VPA focuses on CPU and memory requests, not VRAM sizing or fractional GPU allocation.

GPU time-slicing vs. MIG for Kubernetes cost optimization: which is better?

Time-slicing suits low-risk oversubscription, development, notebooks, and bursty workloads that can tolerate weak isolation. MIG suits production workloads that require stronger memory and fault isolation. The right choice depends on tenant risk, latency sensitivity, and hardware support.

What is the NVIDIA device plugin, and why does it limit GPU cost optimization?

The NVIDIA device plugin exposes GPUs to Kubernetes as schedulable resources such as nvidia.com/gpu. This enables GPU scheduling, but the standard device-plugin model does not make Kubernetes aware of real GPU compute or memory utilization after allocation.