GPU sharing in Kubernetes lets multiple pods use the same physical GPU, rather than forcing each pod to reserve a full device. The three main NVIDIA-supported options are time-slicing, Multi-Process Service (MPS), and Multi-Instance GPU (MIG). Each solves that problem differently. Choosing the wrong one for your workload means paying for capacity you don’t need, or taking on failure risk you can’t afford.
Key Takeaways: GPU Sharing in Kubernetes
- Full-GPU allocation wastes capacity when inference workloads use only a small part of the device.
- Time-slicing is the easiest way to start sharing GPUs, but it does not isolate memory or faults.
- MIG gives you hardware-level isolation, but only on supported GPUs and with planned static profiles.
- MPS can improve concurrency for trusted CUDA workloads, but a single client fault can affect other clients sharing the GPU.
- Dynamic resource allocation (DRA) represents the Kubernetes-native approach for device allocation and sharing intent, while the device plugin and GPU Operator path remain common in existing clusters.
- ScaleOps automatically detects each AI workload’s behavior, assigns the right fractional GPU policy, and continuously re-optimizes as usage evolves, with no manual configuration required.
Why Kubernetes GPU Sharing Matters
Kubernetes GPU sharing matters because the traditional GPU device model doesn’t expose partial GPUs by default. The NVIDIA device plugin advertises GPUs as integer resources, such as nvidia.com/gpu. When a pod requests nvidia.com/gpu: 1, Kubernetes schedules it onto a node with an available GPU and treats that device as allocated.
This creates a GPU waste tax for inference workloads: ten services that each use 15% of a GPU still need ten physical GPUs with full-GPU allocation. The real demand is closer to 1.5 GPUs, but Kubernetes still reserves ten devices unless sharing is configured.
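The arithmetic behind that estimate can be written out as a rough sketch. The symbols are illustrative notation, not from any NVIDIA or Kubernetes source, and the bound assumes perfect packing while ignoring GPU memory limits and latency headroom:

```latex
\[
N_{\text{full}} = 10, \qquad
D = 10 \times 0.15 = 1.5 \ \text{GPUs}, \qquad
N_{\text{shared}} \ge \lceil D \rceil = \lceil 1.5 \rceil = 2
\]
```

Here N_full is the number of devices reserved under full-GPU allocation, D is the aggregate demand, and N_shared is the smallest number of whole devices that could carry that demand if it packed perfectly. In practice the shared count is higher once memory footprints and burst headroom are accounted for.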
Sharing improves utilization, but it also changes the failure model. A notebook can usually tolerate slower GPU access or a failed experiment. A regulated customer-facing endpoint cannot, because one workload’s memory pressure or CUDA fault could affect another tenant. That is why each workload type needs a sharing method that matches its isolation and risk requirements.
ScaleOps Tip
Before choosing a sharing method, look at pod-level GPU memory, compute telemetry, time to first token (TTFT), key-value (KV) cache behavior, request volume, and latency. ScaleOps gives you that workload view before you lock teams into static slices or shared slots.
Time-Slicing in Kubernetes
Time-slicing is usually the simplest way for multiple Kubernetes workloads to share a GPU. Instead of partitioning the hardware, it lets GPU processes take turns on a single physical device via CUDA context switching.
In Kubernetes, you enable this behavior through the NVIDIA GPU Operator and device plugin by setting a replicas count for the GPU resource in a device plugin ConfigMap. You can apply that configuration cluster-wide or target only nodes labeled for a specific GPU model, so teams can enable oversubscription gradually. You can also apply time-slicing inside an existing MIG slice when you need more schedulable slots than the hardware partitions MIG exposes (up to seven, depending on the GPU model).
The example below shows a device plugin ConfigMap that advertises four shared GPU slots for each physical GPU:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: true
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 4
With renameByDefault: true, a one-GPU node advertises 4 nvidia.com/gpu.shared resources. A pod then requests one shared slot:
resources:
  limits:
    nvidia.com/gpu.shared: 1
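For context, the limits fragment above sits inside a container spec. The following is a minimal sketch of a complete Pod, assuming the time-slicing ConfigMap with renameByDefault: true is already active on the node. The pod name and image are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-notebook          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: notebook
      image: registry.example.com/notebook:latest   # placeholder image
      resources:
        limits:
          # One time-sliced slot. This grants access to a shared GPU,
          # not a guaranteed quarter of its compute or memory.
          nvidia.com/gpu.shared: 1
```

Requesting a single slot per container matches the failRequestsGreaterThanOne: true setting in the ConfigMap.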
It works across a broad range of NVIDIA GPUs and is useful for notebooks, experiments, internal demos, development clusters, and low-criticality inference.
Time-slicing does not isolate memory. It also does not isolate faults. GPU time is shared across GPU processes, not cleanly by Kubernetes pod or replica, so you have limited control over proportional compute.
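⚠️ Warning: Requesting more than one time-sliced GPU does not guarantee more compute. It only gives the pod access to a shared GPU. With failRequestsGreaterThanOne: true, as in the ConfigMap above, the device plugin rejects requests for more than one shared slot instead of silently granting them.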
MIG in Kubernetes
Multi-Instance GPU (MIG) partitions a supported NVIDIA GPU (Ampere generation or newer, compute capability 8.0+) into up to seven hardware-isolated GPU instances, each of which can be scheduled independently. Each instance behaves like a smaller GPU with dedicated memory, dedicated compute resources, stronger fault isolation, and more predictable performance boundaries.
MIG is supported on NVIDIA GPUs starting with the Ampere generation. You still need to verify the exact GPU model before planning a cluster, because the supported profiles and maximum number of MIG instances vary by product. Common datacenter examples include A100, A30, H100, H200, B200, and GB200. A100, H100, H200, B200, and GB200 support up to seven MIG instances. A30 supports up to four.
After MIG is enabled on a supported GPU, the NVIDIA device plugin can expose MIG profiles as separate Kubernetes resource names. In the A100 40 GB example below, the pod requests one 1g.5gb MIG resource, which represents one MIG instance with one GPU slice and 5 GB of GPU memory:
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
MIG is usually the best starting point for production multi-tenant workloads, regulated environments, and services that need clear blast-radius control.
As a static, hardware-level partitioning method, MIG uses predefined profiles, so you must plan slice sizes in advance. Oversized profiles can leave part of the GPU unused because the slice is larger than the workload actually needs. Undersized profiles block workloads that need more memory or compute. Reconfiguring MIG profiles typically requires the GPU to be free of running workloads, so profile changes become an operational event rather than a runtime decision.
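As a sketch of what that up-front planning looks like with the GPU Operator, the fragments below assume the operator's MIG manager is installed with its default profile definitions. The node name is hypothetical, and the profile name all-1g.5gb applies to an A100 40 GB; other GPU models use different profile names:

```yaml
# Fragment of the GPU Operator ClusterPolicy: the "mixed" strategy exposes each
# MIG profile under its own resource name, such as nvidia.com/mig-1g.5gb.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  mig:
    strategy: mixed
---
# Label as it would appear on the node object. The MIG manager watches this
# label and repartitions the node's GPUs to the named profile.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1                     # hypothetical node name
  labels:
    nvidia.com/mig.config: all-1g.5gb  # seven 1g.5gb instances per A100 40 GB
```

In practice the label is usually set with kubectl rather than by applying a Node manifest. Because changing it triggers reconfiguration, drain GPU workloads from the node first.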
ScaleOps Tip
Compare actual GPU memory and compute usage against the MIG profile size before standardizing on a profile. ScaleOps continuously analyzes each workload’s real GPU consumption, automatically assigns the right fractional policy, and re-optimizes as usage changes, so workloads stop holding capacity in oversized slices without anyone having to retune manually.
MPS in Kubernetes
NVIDIA Multi-Process Service (MPS) is a runtime service for cooperative CUDA multi-process and multi-application workloads on NVIDIA GPUs.
In Kubernetes, MPS fits trusted CUDA workloads that benefit from concurrent execution on the same GPU. That can include inference services with compatible CUDA behavior, small kernels, and workloads owned by the same team or within the same trust boundary.
In the NVIDIA device plugin, MPS sharing uses a replicas setting. Unlike time-slicing, MPS uses the control daemon to limit each client to an equal fraction of GPU memory and compute capacity. That gives you more explicit memory and compute partitioning than time-slicing, but weaker isolation than MIG.
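A minimal sketch of the replicas-based MPS setting described above, modeled on the device plugin's sharing configuration. The ConfigMap name is hypothetical, and whether your GPU Operator or device plugin version accepts this block is an assumption to verify against its release notes:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config             # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            # Four MPS clients per physical GPU, each limited to an
            # equal share of GPU memory and compute.
            replicas: 4
```

With replicas: 4, each client is capped at roughly a quarter of the device, so a workload that needs more than that share of memory will not fit even if the GPU is otherwise idle.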
NVIDIA marks MPS support in the device plugin as experimental. NVIDIA’s MPS documentation also notes that a fatal fault from one client can cause the MPS server to enter a FAULT state and affect other clients sharing the affected GPU.
MPS can improve concurrency for compatible CUDA workloads, but it does not adapt to shifting workload demand on its own. In current NVIDIA device plugin modes, time-slicing and MPS are mutually exclusive. The NVIDIA GPU Operator also configures MPS and MIG as separate node-level strategies, so plan them as alternatives rather than layering one on the other. Then use workload-aware optimization above the chosen sharing method to keep GPU utilization aligned with runtime demand.
How to Choose Between MIG, MPS, and Time-Slicing in Kubernetes
The deciding factor in choosing between MIG, MPS, and time-slicing is not raw utilization. It is the isolation and failure model that each workload can tolerate:
| Sharing Method | Isolation Level | Memory Isolation | Fault Isolation | GPU Hardware Required | Max Instances or Clients | Best-Fit Workload | Use When |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Time-slicing | No isolation | No | No | NVIDIA GPU supported by the device plugin sharing configuration | Configurable replicas | Dev/test, notebooks, low-risk bursty workloads | You need fast oversubscription and can tolerate contention |
| MIG | Hardware isolation | Yes | Yes | Supported Ampere+ GPUs | Up to seven; exact limit depends on GPU model | Production multi-tenant or regulated workloads | You need predictable boundaries and stronger isolation |
| MPS | Software-level concurrency | Memory and compute limits, but not hardware isolation | Fatal faults can affect shared clients | Supported NVIDIA CUDA workloads on full GPUs | Configurable clients and equal fractions | Trusted compatible inference workloads | You want concurrent CUDA execution inside one trust boundary |
Use the comparison above as a starting point, then test with your own latency, memory, and failure patterns. The scenarios below map common workload types to a first choice:
| Scenario | Recommended Starting Point | Why |
| --- | --- | --- |
| Production inference | MIG or carefully tested MPS | Depends on isolation needs and CUDA compatibility |
| Trusted internal inference | MPS | One team owns the clients and can test interference |
| Dev/test notebooks | Time-slicing | Easy access matters more than strict isolation |
| Regulated workloads | MIG | Hardware boundaries are easier to reason about |
| Dynamic AI workloads | ScaleOps with the chosen sharing method | Static profiles often change too slowly |
Utilization and performance tradeoffs
Time-slicing improves utilization for idle or bursty workloads, but it can introduce latency and memory contention. MIG improves predictability and isolation, but oversized profiles can still waste capacity. MPS can improve concurrency for compatible inference workloads, but it requires careful testing for interference, memory behavior, and fault containment.
Track KV cache size, prefix cache hit rate, prefill time, TTFT, decode time, generated tokens, waiting requests, request volume, and p95 and p99 latency. Each sharing method affects utilization and performance differently, so start by considering the operational trade-offs before choosing a default.
For a deeper benchmark comparison, analyze the data points presented in NVIDIA’s KubeCon NA 2024 GPU sharing benchmark study.
ScaleOps Tip
ScaleOps continuously evaluates sharing decisions as traffic, KV cache size, GPU memory reservation, and model behavior change.
How DRA is changing Kubernetes GPU allocation
Dynamic resource allocation (DRA) is the Kubernetes-native direction for GPU allocation and sharing intent. It moves the ecosystem beyond opaque integer resources and introduces objects such as DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice.
DRA graduated to GA in Kubernetes 1.34 and is enabled by default in Kubernetes 1.34 and later. Similarly, OpenShift 4.21 shipped DRA GA based on the upstream Kubernetes 1.34 implementation.
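To make those objects concrete, here is a minimal sketch of a DRA request using the GA resource.k8s.io/v1 API. It assumes a DRA driver is installed and publishes a DeviceClass named gpu.nvidia.com. That class name, the object names, and the image are assumptions for illustration:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                 # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            # Assumed DeviceClass name; check what your DRA driver installs.
            deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-example            # hypothetical name
spec:
  restartPolicy: Never
  # A ResourceClaim is generated from the template for this pod.
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: app
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        claims:
          - name: gpu              # references the pod-level claim above
```

Unlike an integer nvidia.com/gpu limit, the claim is a first-class object, so the scheduler matches it against the devices that drivers advertise in ResourceSlice objects.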
NVIDIA also announced at KubeCon Europe 2026 that it is donating the NVIDIA DRA Driver for GPUs to the CNCF, moving the driver from vendor-governed software toward community ownership under the Kubernetes project.
AKS engineering has documented DRA concepts and NVIDIA DRA driver examples, but AKS support should still be treated as version-dependent.
The practical caveat is simple: device plugin plus GPU Operator is still the more familiar path in many existing GPU clusters. DRA is where the ecosystem is moving, so it is worth understanding before you stand up new GPU clusters.
Review our deep-dive architectural guide to Kubernetes DRA to understand specific DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice implementation details.
Limitations of Static GPU Sharing Methods in Kubernetes
MIG, MPS, and time-slicing are allocation mechanisms. They divide or share a GPU, but they do not decide whether a workload still deserves the same share tomorrow.
Monitoring dashboards can show GPU waste, but visibility is not optimization. They can tell you that a device is underused, but they do not resize a MIG profile, reduce an oversized memory reservation, or change scaling behavior when demand shifts.
Static sharing leaves three common waste patterns:
- Idle slices: Peak-sized workloads sit mostly idle inside a MIG slice or time-sliced slot.
- Memory blocking: Workloads hold more GPU memory than they use, which blocks co-location even when compute headroom exists.
- Static inflexibility: MIG profiles, MPS thread percentages, and memory limits do not adapt when batch sizes, model versions, context lengths, or traffic patterns change.
When teams manage these boundaries through manual YAML changes, every adjustment becomes a trial-and-error cycle across profile size, memory reservation, replica count, and latency impact. That slows tuning and increases the risk of disruption when workloads are already running.
This is especially visible in inference stacks such as vLLM. GPU memory settings can reserve a large fraction of the GPU’s memory for model execution and the KV cache. That reservation can block sharing even when actual compute use is low, leaving expensive AI hardware underused when static allocations no longer match real inference demand.
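As an illustration, vLLM exposes this reservation through its --gpu-memory-utilization flag, which defaults to 0.9. The sketch below lowers it so that another workload can fit on the same time-sliced GPU. The pod name, model, and the 0.40 value are illustrative assumptions, not tuned recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-shared                  # hypothetical name
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        - --model
        - facebook/opt-125m          # small example model, for illustration only
        # Fraction of GPU memory vLLM may claim for weights and KV cache.
        # At the 0.9 default, a second workload rarely fits on the device.
        - --gpu-memory-utilization
        - "0.40"
      resources:
        limits:
          nvidia.com/gpu.shared: 1   # time-sliced slot, as configured earlier
```

Lowering the fraction shrinks the KV cache, which can reduce the number of concurrent sequences the server can hold, so validate TTFT and throughput before adopting a lower value.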
How ScaleOps Optimizes GPU Sharing in Kubernetes
ScaleOps sits above static Kubernetes GPU sharing as a workload-aware, autonomous optimization layer. Where MIG, MPS, and time-slicing allocate capacity, ScaleOps continuously manages it, with no manual reconfiguration required as workload behavior changes.
Automated Fractional GPUs. ScaleOps automatically detects whether a workload is real-time, near-real-time, or batch, attaches the right fractional GPU policy, and auto-tunes policy parameters to preserve workload behavior while improving GPU efficiency. It continuously reacts to application changes and re-optimizes as usage patterns evolve.
GPU Memory Optimization. ScaleOps reduces unnecessary GPU memory capture so memory-bound workloads, including vLLM-based inference stacks, stop holding capacity they never use. Freeing that reserved memory allows more workloads to participate in GPU sharing without blocking co-location.
AI Replica Optimization. ScaleOps optimizes minimum replicas, scaling thresholds, and autoscaling triggers using GPU-native signals (KV cache size, waiting requests, and request volume) instead of generic CPU metrics. It scales eligible workloads to zero during low-demand periods and reduces cold-start impact when demand returns, so teams avoid paying for always-on GPU capacity that sits idle overnight.
AI Inference Visibility. ScaleOps surfaces GPU cost, utilization, prefill time, TTFT, decode time, prefix cache hit rate, and request volume in one place, giving teams the visibility to troubleshoot inference issues without building ad hoc dashboards.
Batch Inference Optimization. For batch workloads, ScaleOps optimizes execution timing and uses lower-cost capacity to meet SLA requirements without dedicated always-on GPU allocation.
From Static Partitioning to Autonomous GPU Sharing
Kubernetes GPU sharing is a practical tradeoff between utilization, isolation, and complexity. For dynamic AI environments, static hardware partitioning is only the first step. ScaleOps helps you move from fixed GPU assumptions to continuous, workload-aware optimization.
Stop partitioning hardware based on fixed assumptions and static configurations. ScaleOps provides the workload-aware resource automation your shared GPU environments require to eliminate idle capacity dynamically, without asking your team to manually retune policies as workloads change. Automate Kubernetes GPU optimization with ScaleOps.
GPU Sharing in Kubernetes: Frequently Asked Questions
When should I choose MIG, MPS, or time-slicing for Kubernetes GPU workloads?
- MIG: Use MIG when you need hardware-level isolation between tenants, including memory isolation and stronger fault isolation. MIG requires a supported GPU from the Ampere generation or newer (compute capability 8.0+), supports up to seven instances depending on the GPU model, and uses static profiles that you plan in advance.
- MPS: Use MPS for trusted CUDA workloads that benefit from concurrent execution on a full GPU inside one trust boundary. MPS can improve concurrency by partitioning memory and compute fractions in software, but a fatal fault from one client can affect other clients sharing the GPU.
- Time-slicing: Use time-slicing when you want the simplest way to oversubscribe GPUs for dev/test, notebooks, experiments, and low-risk inference. It does not isolate memory or faults, and contention can impact latency.
How do I enable GPU time-slicing in Kubernetes?
You enable time-slicing through the NVIDIA GPU Operator and device plugin by setting a replicas count for the GPU resource in a ConfigMap, then requesting nvidia.com/gpu.shared in your pod spec. You can scope that ConfigMap cluster-wide or target specific nodes by label, allowing gradual rollout.
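A sketch of the two scoping options, assuming the GPU Operator's ClusterPolicy and the time-slicing-config ConfigMap shown earlier (whose only key is any). The node name is hypothetical:

```yaml
# Cluster-wide: point the device plugin at the ConfigMap and set a default key.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any       # omit this line to opt nodes in one at a time by label
---
# Per node: this label selects which ConfigMap key the node's device plugin
# uses. It is shown as it would appear on the node object.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1       # hypothetical node name
  labels:
    nvidia.com/device-plugin.config: any
```

Starting without a default and labeling a few nodes lets you observe contention on a small pool before enabling sharing everywhere.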
Can I combine MIG and time-slicing on the same GPU?
Yes. You can enable MIG to carve the GPU into hardware-isolated slices, then apply time-slicing inside a MIG slice to oversubscribe that slice when you need more schedulable slots than the hardware partitions MIG exposes (up to seven, depending on the GPU model). Changing MIG profiles typically requires the GPU to be free of running workloads, so treat profile changes as an operational event. Time-slicing still does not add memory or fault isolation beyond the MIG slice boundary.
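A minimal sketch of that combination, assuming an A100 40 GB already partitioned into 1g.5gb instances and the mixed MIG strategy. The ConfigMap name and replica count are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-time-slicing-config    # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
          # Oversubscribe each 1g.5gb MIG instance with two time-sliced slots.
          - name: nvidia.com/mig-1g.5gb
            replicas: 2
```

Pods sharing one MIG instance contend for that instance's 5 GB of memory and its compute slice, while remaining hardware-isolated from pods on other instances.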
What is Dynamic Resource Allocation (DRA) for GPUs in Kubernetes?
DRA is the Kubernetes-native direction for device allocation and sharing intent, using objects such as DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice. DRA complements MIG, MPS, and time-slicing by providing a Kubernetes-native way to model and request devices, rather than replacing the underlying GPU sharing mechanisms themselves. It graduated to GA in Kubernetes 1.34.
What is the difference between MIG and MPS in Kubernetes?
MIG partitions the GPU in hardware with dedicated memory and stronger fault isolation. MPS lets compatible CUDA clients share a single GPU via daemon-managed memory and compute partitioning. MIG is better for isolation. MPS is better for trusted concurrency.
Is time-slicing safe for production GPU workloads?
Time-slicing can work for low-criticality production workloads, but it is risky for strict multi-tenant environments because it has no memory or fault isolation. Test latency, memory contention, and failure behavior before adoption.