GPU Sharing in Kubernetes: MIG vs MPS vs Time-Slicing

GPU sharing in Kubernetes lets multiple pods use the same physical GPU, rather than forcing each pod to reserve a full device. The three main NVIDIA-supported options are time-slicing, Multi-Process Service (MPS), and Multi-Instance GPU (MIG). Each solves that problem differently. Choosing the wrong one for your workload means paying for capacity you don’t need, or taking on failure risk you can’t afford.

Key Takeaways: GPU Sharing in Kubernetes

Full-GPU allocation wastes capacity when inference workloads use only a small part of the device.
Time-slicing is the easiest way to start sharing GPUs, but it does not isolate memory or faults.
MIG gives you hardware-level isolation, but only on supported GPUs and with planned static profiles.
MPS can improve concurrency for trusted CUDA workloads, but a single client fault can affect other clients sharing the GPU.
Dynamic resource allocation (DRA) represents the Kubernetes-native approach for device allocation and sharing intent, while the device plugin and GPU Operator path remain common in existing clusters.
ScaleOps automatically detects each AI workload’s behavior, assigns the right fractional GPU policy, and continuously re-optimizes as usage evolves, with no manual configuration required.

Why Kubernetes GPU Sharing Matters

Kubernetes GPU sharing matters because the traditional GPU device model doesn’t expose partial GPUs by default. The NVIDIA device plugin advertises GPUs as integer resources, such as nvidia.com/gpu. When a pod requests nvidia.com/gpu: 1, Kubernetes schedules it onto a node with an available GPU and treats that device as allocated.
This creates a GPU waste tax for inference workloads: ten services that each use 15% of a GPU still need ten physical GPUs with full-GPU allocation. The real demand is closer to 1.5 GPUs, but Kubernetes still reserves ten devices unless sharing is configured.
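As a rough worked version of that arithmetic, assuming compute utilization is the only constraint (GPU memory fit and burst headroom are ignored here and would raise the real number):

```latex
\[
\underbrace{10 \times 0.15}_{\text{aggregate demand}} = 1.5 \text{ GPUs}
\quad\Rightarrow\quad
\lceil 1.5 \rceil = 2 \text{ GPUs with sharing, versus } 10 \text{ with full-GPU allocation}
\]
```

Under that simplifying assumption, eight of the ten reserved devices are avoidable.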
Sharing improves utilization, but it also changes the failure model. A notebook can usually tolerate slower GPU access or a failed experiment. A regulated customer-facing endpoint cannot, because one workload’s memory pressure or CUDA fault could affect another tenant. That is why each workload type needs a sharing method that matches its isolation and risk requirements.

ScaleOps Tip

Before choosing a sharing method, look at pod-level GPU memory, compute telemetry, time to first token (TTFT), key-value (KV) cache behavior, request volume, and latency. ScaleOps gives you that workload view before you lock teams into static slices or shared slots.

Time-Slicing in Kubernetes

Time-slicing is usually the simplest way for multiple Kubernetes workloads to share a GPU. Instead of partitioning the hardware, it lets GPU processes take turns on a single physical device via CUDA context switching.

In Kubernetes, you enable this behavior through the NVIDIA GPU Operator and device plugin by setting a replicas count for the GPU resource in a device plugin ConfigMap. You can apply that ConfigMap cluster-wide or use node labels to target only the nodes that carry a specific GPU model, so teams can enable oversubscription gradually. You can also apply time-slicing inside an existing MIG slice when you need more than the seven hardware partitions MIG exposes.

The example below shows a device plugin ConfigMap that advertises four shared GPU slots for each physical GPU:

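apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  # "any" is the name of this config; the GPU Operator selects it by this key.
  any: |-
    version: v1
    flags:
      # No MIG-specific resource names are advertised by this config.
      migStrategy: none
    sharing:
      timeSlicing:
        # Advertise replicas as nvidia.com/gpu.shared instead of nvidia.com/gpu.
        renameByDefault: true
        # Reject pods that request more than one shared replica.
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          # Advertise four schedulable slots per physical GPU.
          replicas: 4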

With renameByDefault: true, a one-GPU node advertises 4 nvidia.com/gpu.shared resources. A pod then requests one shared slot:

resources:
  limits:
    nvidia.com/gpu.shared: 1
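
For context, the fragment above would sit inside a full pod spec. The manifest below is a minimal sketch: the pod name and container image are placeholders, not values from any real deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-example                        # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: registry.example.com/cuda-app:latest   # placeholder image
    resources:
      limits:
        # One of the four shared slots advertised per physical GPU.
        nvidia.com/gpu.shared: 1
```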

Time-slicing works across a broad range of NVIDIA GPUs and is useful for notebooks, experiments, internal demos, development clusters, and low-criticality inference.

Time-slicing does not isolate memory. It also does not isolate faults. GPU time is shared across GPU processes, not cleanly by Kubernetes pod or replica, so you have limited control over proportional compute.

⚠️ Warning: Requesting more than one time-sliced GPU does not guarantee more compute. It only gives the pod access to a shared GPU, which is why the example above sets failRequestsGreaterThanOne: true to reject such requests.
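
The node-targeted rollout mentioned earlier in this section can be sketched as a ConfigMap with one named config per GPU model, plus a ClusterPolicy fragment that points the device plugin at it. The ConfigMap name, the config keys, and the replica counts are illustrative assumptions. Check the field names against the GPU Operator version you run.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-fine   # hypothetical name
  namespace: gpu-operator
data:
  # One named config per GPU model; key names are illustrative.
  a100-40gb: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
---
# Fragment of the GPU Operator ClusterPolicy, intended to be merged
# into the existing object rather than applied as a complete resource.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config-fine
```

With no default config set, a node would opt in only when it carries a matching label, for example nvidia.com/device-plugin.config=tesla-t4, which keeps oversubscription limited to the node pools you choose.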

MIG in Kubernetes

Multi-Instance GPU (MIG) partitions a supported NVIDIA GPU into up to seven hardware-isolated GPU instances. Each instance can be scheduled independently and behaves like a smaller GPU with dedicated memory, dedicated compute resources, stronger fault isolation, and more predictable performance boundaries.

MIG is supported on NVIDIA GPUs starting with the Ampere generation. You still need to verify the exact GPU model before planning a cluster, because the supported profiles and maximum number of MIG instances vary by product. Common datacenter examples include A100, A30, H100, H200, B200, and GB200. A100, H100, H200, B200, and GB200 support up to seven MIG instances. A30 supports up to four.

After MIG is enabled on a supported GPU, the NVIDIA device plugin can expose MIG profiles as separate Kubernetes resource names. In the A100 40 GB example below, the pod requests one 1g.5gb MIG resource, which represents one MIG instance with one GPU slice and 5 GB of GPU memory:

resources:
  limits:
    nvidia.com/mig-1g.5gb: 1

MIG is usually the best starting point for production multi-tenant workloads, regulated environments, and services that need clear blast-radius control.

As a static, hardware-level partitioning method, MIG uses predefined profiles, so you must plan slice sizes in advance. Oversized profiles can leave part of the GPU unused because the slice is larger than the workload actually needs. Undersized profiles block workloads that need more memory or compute. Reconfiguring MIG profiles typically requires the GPU to be free of running workloads, so profile changes become an operational event rather than a runtime decision.
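
To make the planning step concrete, here is a sketch of a custom MIG layout in the format used by the GPU Operator's MIG manager. It assumes an A100 40 GB, where two 1g.5gb, one 2g.10gb, and one 3g.20gb instance together consume all seven compute slices and all 40 GB. The ConfigMap name and the layout name are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config          # hypothetical name
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # Hypothetical layout name; a node selects it through a label.
      mixed-small-medium:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 2
            "2g.10gb": 1
            "3g.20gb": 1
```

In this sketch, a node would pick up the layout through a label such as nvidia.com/mig.config=mixed-small-medium, and the per-profile resource names (for example nvidia.com/mig-1g.5gb) assume the mixed MIG strategy. Any change to the layout is still the operational event described above, because the GPU must be drained first.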

ScaleOps Tip

Compare actual GPU memory and compute usage against the MIG profile size before standardizing on a profile. ScaleOps continuously analyzes each workload’s real GPU consumption, automatically assigns the right fractional policy, and re-optimizes as usage changes, so workloads stop holding capacity in oversized slices without anyone having to retune manually.

MPS in Kubernetes

NVIDIA Multi-Process Service (MPS) is a runtime service for cooperative CUDA multi-process and multi-application workloads on NVIDIA GPUs.

In Kubernetes, MPS fits trusted CUDA workloads that benefit from concurrent execution on the same GPU. That can include inference services with compatible CUDA behavior, small kernels, and workloads owned by the same team or within the same trust boundary.

In the NVIDIA device plugin, MPS sharing uses a replicas setting. Unlike time-slicing, MPS uses the control daemon to limit each client to an equal fraction of GPU memory and compute capacity. That gives you more explicit memory and compute partitioning than time-slicing, but weaker isolation than MIG.

NVIDIA marks MPS support in the device plugin as experimental. NVIDIA’s MPS documentation also notes that a fatal fault from one client can cause the MPS server to enter a FAULT state and affect other clients sharing the affected GPU.
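
A minimal sketch of an MPS sharing config, assuming a device plugin version that includes the experimental MPS mode. It mirrors the time-slicing ConfigMap shown earlier, with the ConfigMap name chosen for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config                 # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          # Four clients per GPU, each limited to an equal
          # fraction of memory and compute.
          replicas: 4
```

With four replicas, each client would be limited to roughly a quarter of the GPU's memory and compute. Those limits are enforced in software, so the fault behavior described above still applies.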

MPS can improve concurrency for compatible CUDA workloads, but it does not adapt to shifting workload demand on its own. In current NVIDIA device plugin modes, time-slicing and MPS are mutually exclusive. In the NVIDIA GPU Operator, MPS and MIG are configured as separate node-level strategies and cannot be combined, so plan them as alternatives. Then use workload-aware optimization above the chosen sharing method to keep GPU utilization aligned with runtime demand.

How to Choose Between MIG, MPS, and Time-Slicing in Kubernetes

The deciding factor when choosing among MIG, MPS, and time-slicing in Kubernetes is not raw utilization. It is the isolation and failure model that each workload can tolerate:

Sharing Method	Isolation Level	Memory Isolation	Fault Isolation	GPU Hardware Required	Max Instances or Clients	Best-Fit Workload	Use When
Time-slicing	No isolation	No	No	NVIDIA GPU supported by the device plugin sharing configuration	Configurable replicas	Dev/test, notebooks, low-risk bursty workloads	You need fast oversubscription and can tolerate contention
MIG	Hardware isolation	Yes	Yes	Supported Ampere+ GPUs	Up to seven; exact limit depends on GPU model	Production multi-tenant or regulated workloads	You need predictable boundaries and stronger isolation
MPS	Software-level concurrency	Memory and compute limits, but not hardware isolation	Fatal faults can affect shared clients	Supported NVIDIA CUDA workloads on full GPUs	Configurable clients and equal fractions	Trusted compatible inference workloads	You want concurrent CUDA execution inside one trust boundary

Use the scenarios below as a starting point, then test against your own latency, memory, and failure patterns:

Scenario	Recommended Starting Point	Why
Production inference	MIG or carefully tested MPS	Depends on isolation needs and CUDA compatibility
Trusted internal inference	MPS	One team owns the clients and can test interference
Dev/test notebooks	Time-slicing	Easy access matters more than strict isolation
Regulated workloads	MIG	Hardware boundaries are easier to reason about
Dynamic AI workloads	ScaleOps with the chosen sharing method	Static profiles often change too slowly

Utilization and performance tradeoffs

Time-slicing improves utilization for idle or bursty workloads, but it can introduce latency and memory contention. MIG improves predictability and isolation, but oversized profiles can still waste capacity. MPS can improve concurrency for compatible inference workloads, but it requires careful testing for interference, memory behavior, and fault containment.

Track KV cache size, prefix cache hit rate, prefill time, TTFT, decode time, generated tokens, waiting requests, request volume, and p95 and p99 latency. Each sharing method shifts these metrics differently, so weigh the operational tradeoffs before choosing a default.

For a deeper benchmark comparison, analyze the data points presented in NVIDIA’s KubeCon NA 2024 GPU sharing benchmark study.

ScaleOps Tip

ScaleOps continuously evaluates sharing decisions as traffic, KV cache size, GPU memory reservation, and model behavior change.

How DRA is changing Kubernetes GPU allocation

Dynamic resource allocation (DRA) is the Kubernetes-native direction for GPU allocation and sharing intent. It moves the ecosystem beyond opaque integer resources and introduces objects such as DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice.

DRA graduated to GA in Kubernetes 1.34 and is enabled by default in Kubernetes 1.34 and later. Similarly, OpenShift 4.21 shipped DRA GA based on the upstream Kubernetes 1.34 implementation.
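
To show how those objects fit together, here is a minimal sketch of a ResourceClaimTemplate and a pod that consumes it, written against the resource.k8s.io/v1 API. The DeviceClass name gpu.nvidia.com assumes the NVIDIA DRA driver is installed and has created that class. The object names and the container image are placeholders.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                 # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          # Assumed DeviceClass, created by the NVIDIA DRA driver.
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-example            # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      claims:
      - name: gpu                  # refers to the pod-level claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Instead of counting an opaque integer resource, the pod references a claim, and the scheduler matches that claim against the devices that drivers publish in ResourceSlice objects.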

NVIDIA also announced at KubeCon Europe 2026 that it is donating the NVIDIA DRA Driver for GPUs to the CNCF, moving the driver from vendor-governed software toward community ownership under the Kubernetes project.

AKS engineering has documented DRA concepts and NVIDIA DRA driver examples, but AKS support should still be treated as version-dependent.

The practical caveat is simple: device plugin plus GPU Operator is still the more familiar path in many existing GPU clusters. DRA is where the ecosystem is moving, so it is worth understanding before you stand up new GPU clusters.

Review our deep-dive architectural guide to Kubernetes DRA to understand specific DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice implementation details.

Limitations of Static GPU Sharing Methods in Kubernetes

MIG, MPS, and time-slicing are allocation mechanisms. They divide or share a GPU, but they do not decide whether a workload still deserves the same share tomorrow.

Monitoring dashboards can show GPU waste, but visibility is not optimization. They can tell you that a device is underused, but they do not resize a MIG profile, reduce an oversized memory reservation, or change scaling behavior when demand shifts.

Static sharing leaves three common waste patterns:

Idle slices: Peak-sized workloads sit mostly idle inside a MIG slice or time-sliced slot.
Memory blocking: Workloads hold more GPU memory than they use, which blocks co-location even when compute headroom exists.
Static inflexibility: MIG profiles, MPS thread percentages, and memory limits do not adapt when batch sizes, model versions, context lengths, or traffic patterns change.

When teams manage these boundaries through manual YAML changes, every adjustment becomes a trial-and-error cycle across profile size, memory reservation, replica count, and latency impact. That slows tuning and increases the risk of disruption when workloads are already running.

This is especially visible in inference stacks such as vLLM. GPU memory settings can reserve a large fraction of the GPU’s memory for model execution and the KV cache. That reservation can block sharing even when actual compute use is low, leaving expensive AI hardware underused when static allocations no longer match real inference demand.
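
As an illustration of that reservation, vLLM exposes a --gpu-memory-utilization flag whose default of 0.9 claims most of the device. The Deployment below is a sketch that lowers it. The object names, the model identifier, and the 0.5 value are assumptions for illustration, not tuned recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-example               # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-example
  template:
    metadata:
      labels:
        app: vllm-example
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=example-org/example-model   # placeholder model ID
        # Fraction of GPU memory vLLM may reserve for weights and KV cache.
        - --gpu-memory-utilization=0.5
        resources:
          limits:
            nvidia.com/gpu.shared: 1
```

Lowering the fraction frees memory for co-located workloads but shrinks the room available for the KV cache, so it should be validated against context length and concurrency. Under time-slicing the cap is also cooperative, because nothing in hardware enforces it.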

How ScaleOps Optimizes GPU Sharing in Kubernetes

ScaleOps sits above static Kubernetes GPU sharing as an autonomous optimization layer for GPU and inference workloads. It keeps inference fast and shared GPU environments reliable as workloads change, then delivers GPU cost savings of 50 to 70 percent as a result of that automation. Where MIG, MPS, and time-slicing allocate capacity, ScaleOps acts in production, managing allocation in real time based on actual consumption rather than producing recommendations a team applies manually.

Automated Fractional GPUs. ScaleOps detects whether a workload is real-time, near-real-time, or batch, attaches the right fractional GPU policy, and auto-tunes policy parameters to preserve workload behavior while improving GPU efficiency. Allocation is demand-based and MIG-aware, replacing manual, node-by-node time-slicing, and it re-optimizes as usage patterns evolve.

GPU Memory Optimization. ScaleOps right-sizes GPU memory for serving frameworks like vLLM that claim most GPU memory by default, recovering capacity without hurting performance. Freeing that reserved memory lets more workloads participate in GPU sharing without blocking co-location.

GPU Replica Optimization. ScaleOps scales inference replicas to real demand using GPU-native signals (KV cache size, waiting requests, and request volume) instead of generic CPU metrics. It scales eligible workloads to zero during low-demand periods and reduces cold-start impact when demand returns, so teams protect latency while removing always-on GPU capacity that sits idle overnight.

Batch Inference Optimization. For latency-tolerant batch inference, the highest-savings class of GPU workload, ScaleOps applies aggressive scheduling and consolidation and uses lower-cost capacity to meet SLA requirements without dedicated always-on GPU allocation.

GPU Observability. ScaleOps reads workload-level GPU compute and memory utilization, prefill time, TTFT, decode time, prefix cache hit rate, and request volume, and acts on those signals to drive the optimizations above. The same data gives teams the operational context to resolve inference issues without building ad hoc dashboards.

From Static Partitioning to Autonomous GPU Sharing

Kubernetes GPU sharing is a practical tradeoff between utilization, isolation, and complexity. For dynamic AI environments, static hardware partitioning is only the first step. ScaleOps helps you move from fixed GPU assumptions to continuous, workload-aware optimization.

Stop partitioning hardware based on fixed assumptions and static configurations. ScaleOps provides the workload-aware resource automation your shared GPU environments require to eliminate idle capacity dynamically, without asking your team to manually retune policies as workloads change. Automate Kubernetes GPU optimization with ScaleOps.

Book a demo now.

GPU Sharing in Kubernetes: Frequently Asked Questions

When should I choose MIG, MPS, or time-slicing for Kubernetes GPU workloads?

MIG: Use MIG when you need hardware-level isolation between tenants, including memory isolation and stronger fault isolation. MIG requires a supported GPU from the Ampere generation or later, supports up to seven instances depending on the GPU model, and uses static profiles that you plan in advance.
MPS: Use MPS for trusted CUDA workloads that benefit from concurrent execution on a full GPU inside one trust boundary. MPS can improve concurrency by partitioning memory and compute fractions in software, but a fatal fault from one client can affect other clients sharing the GPU.
Time-slicing: Use time-slicing when you want the simplest way to oversubscribe GPUs for dev/test, notebooks, experiments, and low-risk inference. It does not isolate memory or faults, and contention can impact latency.

How do I enable GPU time-slicing in Kubernetes?

You enable time-slicing through the NVIDIA GPU Operator and device plugin by setting a replicas count for the GPU resource in a ConfigMap, then requesting the shared resource in your pod spec (nvidia.com/gpu.shared when renameByDefault is true). You can scope that ConfigMap cluster-wide or target specific nodes by label, allowing gradual rollout.

Can I combine MIG and time-slicing on the same GPU?

Yes. You can enable MIG to carve the GPU into hardware-isolated slices, then apply time-slicing inside a MIG slice to oversubscribe that slice when you need more than the seven hardware partitions MIG exposes. Changing MIG profiles typically requires the GPU to be free of running workloads, so treat profile changes as an operational event. Time-slicing still does not add memory or fault isolation beyond the MIG slice boundary.
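
A sketch of that combination, assuming the mixed MIG strategy and a GPU already partitioned into 1g.5gb instances. The ConfigMap name and the replica count are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-time-slicing-config    # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
        # Time-slice each 1g.5gb MIG instance between two pods.
        - name: nvidia.com/mig-1g.5gb
          replicas: 2
```

Pods that share one instance in this way compete for the same 5 GB of memory, while staying isolated from workloads in other MIG instances.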

What is Dynamic Resource Allocation (DRA) for GPUs in Kubernetes?

DRA is the Kubernetes-native direction for device allocation and sharing intent, using objects such as DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice. DRA complements MIG, MPS, and time-slicing by providing a Kubernetes-native way to model and request devices, rather than replacing the underlying GPU sharing mechanisms themselves. It graduated to GA in Kubernetes 1.34.

What is the difference between MIG and MPS in Kubernetes?

MIG partitions the GPU in hardware with dedicated memory and stronger fault isolation. MPS lets compatible CUDA clients share a single GPU via daemon-managed memory and compute partitioning. MIG is better for isolation. MPS is better for trusted concurrency.

Is time-slicing safe for production GPU workloads?

Time-slicing can work for low-criticality production workloads, but it is risky for strict multi-tenant environments because it has no memory or fault isolation. Test latency, memory contention, and failure behavior before adoption.
