Running vLLM on Kubernetes in production is a different problem from running vllm serve on a workstation. The model is the easy part. The work is everything between a single process holding a GPU and a resilient inference service that other teams can call, that scales when traffic arrives, recovers when a pod dies, and does not quietly waste the most expensive hardware in the cluster.
This guide is a single worked example carried end to end. We deploy Llama-3-8B on a GKE cluster with NVIDIA L4 GPUs, expose it as a service, send it a real request, watch its metrics, scale it under load, shard a larger variant across GPUs, and then deal with the failure modes and the cost. Every section advances the same deployment rather than showing a disconnected snippet, so by the end you have a running, observable, scalable vLLM service and the reasoning behind each decision. It is written for platform engineers who are comfortable with Kubernetes but newer to LLM serving.
One decision shapes everything downstream, so it is worth stating as the thesis this guide argues: serve vLLM as a Deployment, not a DaemonSet. A surprising number of tutorials reach for a DaemonSet because one pod per GPU node feels tidy. It also pins your replica count to your node count and makes it impossible to scale on GPU utilization, which is the entire reason to run inference on Kubernetes rather than a static VM. A Deployment keeps replica scaling and node provisioning as two independent, composable levers. Everything below assumes that model, and the scaling section is where the choice pays off. Once the service is running, the harder question becomes how much of each GPU you are actually using — a question a layer like ScaleOps AI Infra answers, and one we return to at the end.
This is not a whiteboard exercise. Everything below was validated on a live GKE cluster — Kubernetes 1.35, NVIDIA L4 24 GB nodes for single-GPU serving and a pair of A100-40 GB cards (NVLink) for the multi-GPU sections — running vLLM v0.8.5 on the V1 engine, with KEDA, the kube-prometheus-stack, and the NVIDIA DCGM exporter. The numbers and the failure modes below are from that run, and where a result depends on the environment it is flagged as ours rather than presented as universal.
Key Takeaways
- vLLM’s core innovation, PagedAttention, cuts KV-cache memory waste from the 60–80% typical of naive serving to roughly 4%, which is what lets one GPU hold larger batches and serve more throughput.
- GPUs become schedulable through the NVIDIA device plugin, a DaemonSet that advertises nvidia.com/gpu; vLLM pods then request it like any other resource.
- Deploy vLLM as a Deployment behind a ClusterIP Service, not a DaemonSet, so replica scaling is decoupled from node count.
- Scale in two independent layers: KEDA drives replica scaling from GPU utilization, while Cluster Autoscaler or Karpenter handles node autoscaling for GPU nodes.
- vLLM exposes a Prometheus /metrics endpoint with the signals that actually matter for inference (time-to-first-token, prefix cache hit rate, and queue depth), not just GPU utilization.
- Models too large for one GPU are served with tensor parallelism (--tensor-parallel-size) across NVLink-connected GPUs on a single node.
- The incidents come from cold-start model downloads, CUDA graph compilation, and an over-aggressive --gpu-memory-utilization, not the model itself.
- A single inference replica holds a whole physical GPU it rarely saturates; right-sizing fractional GPU allocation against real per-pod utilization is where most GPU spend is recovered.
What vLLM is and why it changes GPU economics
vLLM is an open-source inference and serving engine: it runs a large language model efficiently on datacenter GPUs and exposes an OpenAI-compatible HTTP API through the vllm serve command. It supports 200+ model architectures natively and is the serving layer most model and hardware vendors now target first.
The reason it matters for a platform team comes down to memory. LLM inference produces a KV cache — the keys and values already computed for the tokens in a sequence — and that cache grows as generation proceeds. Earlier serving systems allocated KV-cache memory in large contiguous blocks per request, leaving most of it stranded: only 20–40% was ever used, the rest lost to internal fragmentation and over-reservation. PagedAttention borrows the operating system’s virtual-memory trick. It partitions the KV cache into fixed-size blocks, like pages, and maps logical blocks to physical blocks through a block table, allocating on demand rather than up front. Per the vLLM PagedAttention paper, the result is roughly 4% waste instead of 60–80%, which translates directly into larger batch sizes and higher throughput on the same GPU.
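The block-table idea can be sketched in a few lines. This is a toy allocator, not vLLM’s implementation: the block size of 16 tokens, the class name PagedKVCache, and the three request lengths are assumptions chosen for illustration, and the percentages it prints describe only this toy, not the paper’s measurements.

```python
BLOCK_SIZE = 16  # tokens per KV block (an assumption of this sketch)


class PagedKVCache:
    """Toy block-table allocator: physical blocks are handed out on demand."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}             # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * BLOCK_SIZE:
            # All mapped blocks are full (or none exist yet): map one more.
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def wasted_fraction(self) -> float:
        reserved = sum(len(t) for t in self.block_tables.values()) * BLOCK_SIZE
        return 1 - sum(self.lengths.values()) / reserved


def contiguous_wasted_fraction(lengths: list[int], max_len: int) -> float:
    """Waste when every request reserves max_len token slots up front."""
    return 1 - sum(lengths) / (len(lengths) * max_len)


lengths = [100, 37, 250]  # made-up sequence lengths
cache = PagedKVCache(num_physical_blocks=64)
for i, n in enumerate(lengths):
    for _ in range(n):
        cache.append_token(f"req-{i}")

paged_waste = cache.wasted_fraction()
contiguous_waste = contiguous_wasted_fraction(lengths, max_len=2048)
print(f"paged: {paged_waste:.1%}, contiguous: {contiguous_waste:.1%}")
```

With paging, the only waste is the unfilled tail of each request’s last block, which is why it stays small regardless of how long the maximum context is.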
That 60–80%-to-4% drop is worth holding onto, because the same waste reappears one level up: PagedAttention reclaims memory stranded inside a single GPU, while at the fleet level whole GPUs sit idle because Kubernetes treats a card as indivisible and hands each pod the entire thing. Reclaiming that cluster-level waste with fractional GPU allocation is what ScaleOps AI Infra does, and §10 gets concrete about how.
Two related mechanics explain vLLM’s behavior under load. Continuous batching admits and retires requests at the token level rather than waiting for a fixed batch to finish, so one long generation does not block short ones. The scheduling consequence is that a single replica behaves as a high-concurrency server rather than a one-request worker, so you scale it on GPU saturation, not request count (§6), and one pod absorbs a lot of traffic before you need a second. Prefix caching, on by default in the V1 engine, shares the KV blocks of a common prompt prefix across requests, so a shared system prompt is computed once and reused. That cache lives in a single pod’s GPU memory, so the benefit is per-replica: a plain round-robin Service scatters requests that share a prefix across pods and each one recomputes it, which is why prefix-aware routing becomes the real lever once you scale out (§8). Hold onto prefix caching — it reappears in the monitoring and cost sections, because how often it hits is one of the most useful numbers you can watch.
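To make the per-replica nature of the prefix cache concrete, here is a small simulation comparing round-robin with a hash-affinity router. It is only a sketch: the pod names, the two system prompts, and the route_by_prefix helper are hypothetical, and real prefix-aware schedulers also weigh load and cache eviction, which this ignores.

```python
import hashlib
from itertools import cycle

PODS = ["vllm-0", "vllm-1", "vllm-2"]  # hypothetical replica names


def route_by_prefix(system_prompt: str, pods: list[str]) -> str:
    """Pick a pod from a stable hash of the shared prefix."""
    digest = hashlib.sha256(system_prompt.encode()).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]


def count_prefix_computations(assignments: list[tuple[str, str]]) -> int:
    """Each distinct (pod, prefix) pair is computed once, then reused from that pod's cache."""
    return len(set(assignments))


prompts = ["You are a support agent."] * 6 + ["You are a SQL assistant."] * 6

rr = cycle(PODS)
round_robin = [(next(rr), p) for p in prompts]
prefix_aware = [(route_by_prefix(p, PODS), p) for p in prompts]

rr_computations = count_prefix_computations(round_robin)
affinity_computations = count_prefix_computations(prefix_aware)
print(rr_computations, affinity_computations)
```

Round-robin ends up computing every prefix on every pod, while the affinity router computes each prefix once; the gap widens as replicas are added.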
This maps cleanly onto Kubernetes because the unit you schedule — a pod that requests one or more GPUs and exposes an HTTP port — lines up with how vLLM wants to run: one server process per GPU or group of GPUs, load-balanced behind a Service. That is exactly what we build next.
Prerequisites: The Cluster We Are Building On
Our example runs on GKE with a GPU node pool of NVIDIA L4 cards (24 GB each), which comfortably holds Llama-3-8B in FP16 with room for the KV cache. Three things have to be in place before any vLLM manifest will schedule.
First, a GPU node pool. On a managed platform this is a node pool with a GPU machine type, and the nodes are tainted so only workloads that explicitly tolerate the taint land on expensive hardware. Our pool carries the taint nvidia.com/gpu=present:NoSchedule.
Second, the NVIDIA device plugin, which runs as a DaemonSet on every GPU node and advertises the GPUs to the kubelet as an allocatable resource named nvidia.com/gpu. This is the one DaemonSet that belongs in this picture — its job is per-node hardware registration, which is what a DaemonSet is for. On GKE the plugin and driver are installed for you when you add a GPU node pool; on self-managed clusters you install the NVIDIA GPU Operator, which also manages the driver and container toolkit, per NVIDIA’s documentation.
Third, the driver and container toolkit on the node, so containers reach the GPU through the NVIDIA runtime. Confirm GPUs are schedulable before deploying anything:
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
# A GPU node should report 1 (or more) under GPU, not <none>.
If that column is empty, no vLLM pod will ever leave Pending, and no manifest tuning fixes it — the problem is below vLLM, in the device plugin or driver. With the column reporting 1 on our L4 nodes, we are ready to deploy. In our test, a fresh L4 node took about six minutes to provision and report nvidia.com/gpu allocatable — a number worth remembering for §6, where node autoscaling has to add that capacity under load.
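The same check can be scripted, for example as a pre-deploy gate. The sketch below parses the JSON shape that kubectl get nodes -o json returns; the sample document and node names are hand-written for illustration, and the helper name allocatable_gpus is ours.

```python
import json


def allocatable_gpus(node_list: dict) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count (0 when the resource is absent)."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        result[node["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hand-written sample in the shape of `kubectl get nodes -o json`; names are made up.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-a"},
   "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
  {"metadata": {"name": "cpu-node-b"},
   "status": {"allocatable": {"cpu": "4"}}}
]}
""")

gpus = allocatable_gpus(sample)
nodes_without_gpu = [name for name, count in gpus.items() if count == 0]
print(gpus, nodes_without_gpu)
```

Against a live cluster, the input would come from piping kubectl get nodes -o json into json.load(sys.stdin) instead of the inline sample.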
Deploy vLLM as a Deployment, Then Call It
Here is the manifest that runs one vLLM replica serving Llama-3-8B on one L4, fronted by a ClusterIP Service. This is the spine of the whole guide; later sections modify this object rather than introduce new ones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: inference
spec:
  replicas: 1 # start at 1; KEDA scales this in §6
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      tolerations:
        - key: nvidia.com/gpu # land on the tainted GPU node pool
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.5 # pin a tested tag; :latest drifts and surprises you on restart
          args:
            - "--model"
            - "NousResearch/Meta-Llama-3-8B-Instruct" # ungated Llama-3-8B mirror; the official meta-llama repo 403s until Meta approves you (see below)
            - "--gpu-memory-utilization"
            - "0.90" # fraction of the L4's 24 GB for weights + KV cache; headroom matters (§4)
            - "--max-model-len"
            - "8192" # cap context; larger = more KV cache reserved per request
          ports:
            - containerPort: 8000 # vLLM's OpenAI-compatible server listens here
          env:
            - name: HUGGING_FACE_HUB_TOKEN # Llama is gated; this is required
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: "1" # one physical L4 per replica
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60 # weights are still downloading and loading; /health first passed at ~6 min in our test
            periodSeconds: 10
            failureThreshold: 60 # ~10 min window so a slow cold start is not mistaken for a crash-loop (§9)
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          emptyDir: {} # swap for a PVC to avoid re-downloading weights on restart (§9)
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: inference
spec:
  type: ClusterIP # internal only; front with an Ingress/Gateway for external traffic
  selector:
    app: vllm-llama3-8b
  ports:
    - port: 80
      targetPort: 8000 # clients hit :80, vLLM listens on :8000
Apply both, wait for the pod to pass readiness (the first start is slow — that is the cold-start tax we dissect in §9), and send it a real request. vLLM speaks the OpenAI API, so the call is ordinary:
kubectl -n inference port-forward svc/vllm-llama3-8b 8000:80 &
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Meta-Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Explain Kubernetes in one sentence."}],
"max_tokens": 64
}'
{
"id": "chatcmpl-3f0c…",
"object": "chat.completion",
"choices": [
{"index": 0, "message": {"role": "assistant",
"content": "Kubernetes is an open-source platform that automates deploying, scaling, and operating containerized applications across a cluster of machines."},
"finish_reason": "stop"}
],
"usage": {"prompt_tokens": 17, "completion_tokens": 24, "total_tokens": 41}
}
That usage block is the first place tokens show up, and it is worth noticing now because token throughput becomes the thing we monitor and, ultimately, pay for. Other workloads in the cluster reach the model at http://vllm-llama3-8b.inference.svc.cluster.local/v1/chat/completions, and the Service load-balances across however many replicas exist.
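For callers inside the cluster, the same request can be made from application code. The following is a minimal standard-library sketch, assuming the Service URL and model name given above; the helper names build_chat_request and chat are ours, and a production client would add retries and streaming.

```python
import json
import urllib.request

# In-cluster Service URL from the text; use http://localhost:8000 when port-forwarding.
BASE_URL = "http://vllm-llama3-8b.inference.svc.cluster.local"
MODEL = "NousResearch/Meta-Llama-3-8B-Instruct"


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> tuple[str, dict]:
    """Send one request; return the reply text and the token usage block."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"], body["usage"]
```

Calling chat("Explain Kubernetes in one sentence.") from a pod in the cluster returns the assistant text together with the usage dictionary, so token counts can be logged per call from day one.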
Two practical notes from actually running this. First, the model is gated: Meta’s meta-llama/Meta-Llama-3-8B-Instruct returns 403 ... not in the authorized list even with a valid HUGGING_FACE_HUB_TOKEN until Meta approves your account, so the manifest above uses the ungated NousResearch/Meta-Llama-3-8B-Instruct mirror, an identical drop-in you can run immediately. Second, the first start is slow: in our test the pod sat 0/1 Running for about 6.3 minutes — pulling the ~16 GB of weights, loading them onto the GPU, and capturing CUDA graphs — before /health passed. That is why the readiness probe uses a long failureThreshold rather than a tight delay; §9 covers how to shrink the wait.
Now the thesis in practice. This is a Deployment, not a DaemonSet, because a DaemonSet runs exactly one pod per matching node — its replica count is whatever your node count happens to be. You cannot ask for three replicas, and no controller can add a fourth in response to load. KEDA, which is how we scale on GPU utilization in §6, scales the replicas of a Deployment; it has nothing to act on with a DaemonSet. The DaemonSet-per-node pattern is fine for a fixed fleet where you scale only by adding nodes, but it forecloses replica scaling, so it is the wrong default for an inference service. Keep replica scaling on the Deployment and node provisioning on the cluster autoscaler, and the two never fight.
Production Configuration That Matters
Our pod serves requests, but a few flags separate a server that holds up from one that falls over on the first real burst. Each of these tunes the same vllm-llama3-8b Deployment.
--gpu-memory-utilization is the fraction of GPU memory vLLM claims for weights plus KV cache. It is tempting to push it to 0.95 to maximize batch size. In production, leave headroom (0.85–0.90), because the figure is a target, not a hard ceiling on transient activity, and a traffic spike against 0.95 is exactly how you get an out-of-memory kill mid-request. On our 24 GB L4 at 0.90, Llama-3-8B’s ~16 GB of FP16 weights leave roughly 5–6 GB for KV cache, which sets how many concurrent requests fit. One observed subtlety that matters later: vLLM claims that memory up front. At 0.90 on the L4, DCGM showed roughly 23 GiB held constantly, idle or saturated, because the KV-cache pool is reserved at startup rather than grown on demand. That is good for predictability, but it means external GPU-memory right-sizing has almost nothing to reclaim from a running vLLM server, a point §10 returns to. The headroom itself need not be guesswork, though: the per-pod KV-cache usage that ScaleOps AI Inference Observability surfaces (§5) shows how much of that reserved pool a workload actually touches under load, turning the 0.85–0.90 you picked by feel into a number backed by data.
--max-model-len bounds context length, which bounds how much KV cache one request can reserve. Setting it higher than you need silently reduces how many requests share the GPU.
--quantization is the lever for smaller or cheaper GPUs and larger models. An AWQ-quantized Llama-3-8B drops to roughly 6 GB and fits a 16 GB T4 with room to spare: the same model, a third of the GPU. If your example needs to run on T4s rather than L4s, add --quantization awq and point --model at an AWQ build; the rest of the manifest is unchanged.
A quick way to reason about GPU choice: FP16 weights take roughly 2 GB per billion parameters, and you need headroom on top for the KV cache, which grows with context length and concurrency. That gives a rough floor for sizing the node pool:
| Model | FP16 weights (approx.) | FP16 fits on | AWQ 4-bit (approx.) | AWQ fits on |
| --- | --- | --- | --- | --- |
| Llama-3-8B | ~16 GB | one L4 / A10G (24 GB) | ~6 GB | one T4 (16 GB) |
| Llama-3-70B | ~140 GB | tensor-parallel across H100s | ~40 GB | tensor-parallel across 2× L4 / A10G |
The table is a starting point, not a guarantee: leave room for the KV cache, and confirm the real footprint on your hardware before committing a node pool. This is also why --gpu-memory-utilization and --max-model-len are levers rather than set-and-forget values, since together they decide how much of the remaining memory becomes usable KV cache.
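The sizing arithmetic is easy to run for your own hardware. The sketch below uses the rule of thumb above plus Llama-3-8B’s published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128). It is an upper-bound estimate: it ignores activations, CUDA graph memory, and runtime overhead, so real capacity will be lower.

```python
GB = 1e9


def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """FP16 stores 2 bytes per parameter, hence ~2 GB per billion parameters."""
    return params_billion * bytes_per_param


def kv_budget_gb(gpu_gb: float, utilization: float, weight_gb: float) -> float:
    """Memory left for KV cache inside the --gpu-memory-utilization envelope."""
    return gpu_gb * utilization - weight_gb


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Keys and values (the factor 2) for every layer and KV head, in FP16 by default."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget = kv_budget_gb(gpu_gb=24, utilization=0.90, weight_gb=weights_gb(8))
cacheable_tokens = int(budget * GB // per_token)
full_context_requests = cacheable_tokens // 8192  # requests at --max-model-len 8192

print(per_token, round(budget, 1), cacheable_tokens, full_context_requests)
```

On these assumptions each token costs 128 KiB of KV cache, so the roughly 5.6 GB budget on an L4 holds on the order of forty thousand tokens, or about five requests at the full 8192-token context. Most requests use far less than the cap, which is why continuous batching fits many more in practice.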
Prefix caching is on by default in V1. You do not enable it, but you should understand it: a shared system prompt across requests is computed once. For chat and agent workloads that prepend the same instructions every call, this is a real throughput win, and §5 shows how to confirm it is actually hitting.
Structured / JSON output (guided decoding) is supported natively and is worth enabling at the request layer when downstream consumers expect strict JSON, rather than parsing-and-retrying in application code.
Finally, the readiness probe. vLLM exposes /health, but a fresh pod is not ready the instant the container starts: it downloads the model, loads weights onto the GPU, and captures CUDA graphs. A readiness probe only keeps the pod out of the Service until /health passes; it never restarts the container. The restart risk comes from liveness probes: add one with tight timing and Kubernetes kills a pod that was merely starting, producing a crash loop that looks like a vLLM bug and is actually a probe-timing bug. Our initialDelaySeconds: 60 plus a generous failureThreshold reflects the startup time, and any liveness or startup probe you add should allow at least as long. In our run the pod needed about six minutes before /health passed; §9 explains why startup is slow and how to shrink it.
Monitoring: See What the GPU and the Model Are Doing
You cannot scale on a signal you are not collecting, and Kubernetes exposes neither GPU utilization nor inference health on its own. Two layers fill the gap, and our example needs both.
The first is GPU hardware metrics from NVIDIA’s DCGM exporter, deployed as a DaemonSet on GPU nodes and scraped by Prometheus. The metric that matters for scaling is DCGM_FI_DEV_GPU_UTIL — the percentage of time the GPU’s compute engines were busy — and the exporter labels it with pod and namespace, so you can attribute utilization to our vllm-llama3-8b replica. A minimal ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      honorLabels: true # critical: without this Prometheus overwrites the metric's pod/namespace and the §6 KEDA query silently matches nothing
Two findings from standing this up on GKE, because the one-line version hides real friction. First, the upstream DCGM exporter is not a drop-in on managed Kubernetes. On GKE it failed in four separate, mostly silent ways: its DaemonSet does not tolerate the nvidia.com/gpu taint, so it scheduled on zero nodes; its priorityClassName: system-node-critical was rejected by GKE’s critical-pod quota; its distroless image swallowed the error; and even once running it logged NVML doesn't exist because GKE injects the GPU driver only into containers that request a GPU, and the exporter requests none. The fix is to use GKE Managed DCGM, the NVIDIA GPU Operator, or a GPU-aware DCGM build (ScaleOps ships one) instead of the bare upstream DaemonSet.
Second, and easy to miss: the ServiceMonitor needs honorLabels: true (above). Without it, Prometheus relabels the scraped metric with the exporter pod’s own identity, overwriting the pod and namespace of the vLLM pod the metric actually describes. The KEDA query in §6 then matches nothing and KEDA silently never scales — no error, just a service that never grows under load. We watched the query come back empty before the flag and return the right pod after.
The second layer is vLLM’s own telemetry. vLLM serves a Prometheus endpoint at /metrics, and the signals there are the ones that actually tell you whether inference is healthy — far more than raw GPU busy-ness:
- vllm:time_to_first_token_seconds: TTFT, the latency a user feels before the first token streams back.
- vllm:gpu_cache_usage_perc: how full the KV cache is, your real saturation signal.
- vllm:num_requests_running and vllm:num_requests_waiting: running vs queued, the early warning that you are out of capacity.
- vllm:prefix_cache_hits_total and vllm:prefix_cache_queries_total: the prefix cache hit rate from §4, which tells you whether your shared prompts are paying off.
A query like avg(DCGM_FI_DEV_GPU_UTIL{namespace="inference", pod=~"vllm-llama3-8b.*"}) gives the average GPU utilization across replicas — both a dashboard number and the input KEDA scales on next. But GPU utilization alone is a blunt instrument: a pod can read 90% busy while TTFT quietly climbs because the KV cache is thrashing. Watching TTFT, queue depth, and cache hit rate alongside utilization is what separates “the GPU is busy” from “the service is healthy.”
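As a quick sanity check outside Prometheus, the hit rate and queue depth can be read straight from a scrape of /metrics. This sketch parses the text exposition format using the metric names listed above. The sample payload and its values are invented, and the parser assumes lines carry no trailing timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Sum samples per metric name from Prometheus text exposition, ignoring labels."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Invented sample in the shape of a /metrics response.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 3.0
vllm:prefix_cache_queries_total{model_name="demo"} 1000.0
vllm:prefix_cache_hits_total{model_name="demo"} 750.0
"""

totals = parse_metrics(SAMPLE)
hit_rate = totals["vllm:prefix_cache_hits_total"] / totals["vllm:prefix_cache_queries_total"]
print(f"prefix cache hit rate: {hit_rate:.0%}, waiting: {totals['vllm:num_requests_waiting']:.0f}")
```

Both counters are cumulative since process start, so this gives a lifetime ratio; on a dashboard the usual approach is to divide the two rates over a window instead.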
For alerting, a p95 time-to-first-token threshold is usually more meaningful than an average GPU-utilization threshold, because it tracks what users actually feel. vLLM exposes TTFT as a histogram, so the p95 is a standard histogram_quantile:
histogram_quantile(0.95,
sum(rate(vllm:time_to_first_token_seconds_bucket{namespace="inference"}[5m])) by (le))
Alert when that crosses your latency budget. It is a more precise signal than a raw utilization threshold: TTFT degrades only as the queue builds, whereas a GPU can read busy long before requests start backing up, so a utilization alert tends to fire on load that users never feel.
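It helps to know what that quantile estimate does with the buckets, because the answer is only as fine as the bucket boundaries. The function below is a simplified sketch of the linear interpolation behind histogram_quantile. The bucket counts are invented, and the real PromQL function works on rates and handles more edge cases.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets by linear interpolation.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # quantile falls in the open-ended bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Invented cumulative TTFT buckets: (upper bound in seconds, observations <= bound).
ttft_buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (2.5, 99), (math.inf, 100)]

p90 = histogram_quantile(0.90, ttft_buckets)
p95 = histogram_quantile(0.95, ttft_buckets)
print(round(p90, 3), round(p95, 3))
```

Because the estimate interpolates inside one bucket, a latency budget that sits between two widely spaced bucket bounds will be tracked only coarsely.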
This is the first natural place ScaleOps fits, and the hook is a genuinely hard problem you just set up. DCGM sees the device; vLLM’s /metrics sees the model; and the moment several models share a GPU, neither can tell you which pod is burning what. ScaleOps AI Inference Observability is the layer that stitches the two together per-pod across a fleet — surfacing per-pod GPU utilization and memory even on a shared card, folding in vLLM’s own signals (TTFT, prefix cache hit rate, running and waiting requests, queue size), and mapping GPU spend back to the workload driving it. It reads the same /metrics you just exposed instead of replacing your Prometheus stack, so it is additive, not a migration.
Scaling: KEDA for Replicas, Cluster Autoscaler for Nodes
Scaling our service well means keeping two layers distinct: how many vLLM replicas run, and how many GPU nodes exist to run them on.
Replica scaling is KEDA’s job. A ScaledObject with a Prometheus trigger drives the replica count of the vllm-llama3-8b Deployment from the DCGM utilization metric:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama3-8b
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-llama3-8b # the Deployment; this is why §3 is not a DaemonSet
  minReplicaCount: 1 # keep one warm; scale-to-zero pays the full cold-start tax (§9)
  maxReplicaCount: 8
  cooldownPeriod: 300 # GPU pods are costly to churn; scale down deliberately
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{namespace="inference", pod=~"vllm-llama3-8b.*"})
        threshold: "70" # target ~70% average GPU utilization per replica
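Send sustained load at the service and the average utilization crosses 70%; KEDA adds replicas, and the Service spreads requests across them. In our run, GPU utilization pinned at 95–98% under concurrency 16, KEDA scaled the Deployment from one replica to two, and it scaled back to one once the load stopped. Slow scale-down matters more here than for a typical web service: a GPU pod is expensive and slow to start, so you want it to scale up promptly but down slowly rather than thrash a node in and out of the pool. One precision on the mechanism: KEDA’s cooldownPeriod applies only to the final step down to zero replicas, so with minReplicaCount: 1 the step from two replicas back to one is paced by the HPA’s scale-down stabilization window, which also defaults to 300 seconds.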
To see this work, put sustained load on the service and watch the controllers respond. A simple load generator pointed at the OpenAI endpoint is enough:
# 2,000 requests, 16 concurrent, against the running service
hey -n 2000 -c 16 -m POST \
-H "Content-Type: application/json" \
-d '{"model":"NousResearch/Meta-Llama-3-8B-Instruct","messages":[{"role":"user","content":"Summarize the CAP theorem."}],"max_tokens":512}' \
http://localhost:8000/v1/chat/completions
In a second terminal, watch the scaling objects move:
watch -n 2 kubectl get scaledobject,hpa,pods -n inference
KEDA manages a standard HorizontalPodAutoscaler under the hood, so what you see is the HPA’s current metric climbing past the target and the replica count following it up:
NAME MIN MAX REPLICAS
scaledobject.keda.sh/vllm-llama3-8b 1 8 2
NAME TARGETS REPLICAS
horizontalpodautoscaler.../keda-hpa-... 96%/70% 2
NAME READY STATUS
vllm-llama3-8b-7c9d...-2xr4q 1/1 Running
vllm-llama3-8b-7c9d...-mq7wd 0/1 ContainerCreating # second replica, paying the cold-start tax
Two things are worth watching on the metrics side while this happens. Average DCGM_FI_DEV_GPU_UTIL climbs above the 70% target, which is what triggered the scale-up, and vllm:num_requests_waiting spikes on the saturated replica before the second replica absorbs the queue. When the load generator stops, utilization falls, and about 300 seconds later the replica count drops back, slowly and on purpose, because spinning a GPU pod back up is not free. With minReplicaCount: 1 that delay comes from the HPA’s default scale-down stabilization window rather than from cooldownPeriod, which KEDA applies only when scaling to zero. That replica stuck in ContainerCreating is the cold-start cost made visible, and it is the concrete reason minReplicaCount stays at 1 rather than 0.
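It is worth working through how the replica count follows from the query. The sketch below assumes KEDA’s default metricType of AverageValue, under which the HPA computes ceil(metric / threshold) for an external metric. It ignores the HPA’s tolerance band and stabilization, and the function name is ours.

```python
import math


def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count for an external metric with an AverageValue target, clamped to bounds."""
    raw = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, raw))


# avg() across replicas, as in the ScaledObject query: the value cannot exceed 100.
with_avg_at_96 = desired_replicas(96, 70, 1, 8)
with_avg_at_100 = desired_replicas(100, 70, 1, 8)

# sum() across four busy replicas: an alternative query shape, not validated in this guide.
with_sum_of_four = desired_replicas(4 * 96, 70, 1, 8)

print(with_avg_at_96, with_avg_at_100, with_sum_of_four)
```

If that assumption holds for your KEDA version, the result matches the one-to-two scale-up observed in our run, and it also suggests that an avg() query tops out at two replicas however high maxReplicaCount is set, since the average never exceeds 100. A sum() query, whose value grows with the number of busy replicas, is one way around that. Confirm the behavior against your own load test before relying on either shape.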
Node autoscaling is a separate concern you do not implement yourself — you let Cluster Autoscaler or Karpenter handle it. When KEDA adds a replica and no L4 node has a free nvidia.com/gpu, the new pod sits Pending, and the cluster autoscaler provisions another GPU node. The layers compose cleanly: KEDA decides how many replicas, the autoscaler decides how many nodes, and neither needs to understand the other’s internals.
There is a caveat this series covers in depth. When node autoscaling provisions a fresh GPU node, the new vLLM pod pays a cold-start cost — image pull, model download, weight load, CUDA graph capture — before it serves, which is why minReplicaCount stays above zero for latency-sensitive services. That latency is the subject of GPU Cold Starts in Kubernetes. In our test it took about 6.3 minutes for a fresh L4 pod to go from scheduled to serving, on top of the roughly six minutes to provision the node itself.
A second node-layer reality the happy path hides: GPU capacity is not guaranteed. When KEDA scaled up and the autoscaler asked for a node, GCE returned out of resources for both L4 and A100 in-zone, and the new pod simply stayed Pending — the configuration was correct, the capacity was not there. Treat FailedScaleUp and out-of-resources as an expected failure mode: alert on it, and use multi-zone GPU node pools or capacity reservations in production. Capacity did recover for us on retry, with an A100 node provisioning in about 66 seconds once the stockout cleared.
But there is one scaling subtlety that bites teams running GPU sharing. The KEDA query above reads device-level GPU utilization, which is fine when each replica owns a whole L4. The moment you pack several models onto one GPU, DCGM_FI_DEV_GPU_UTIL reports the device, not the pod, and your per-workload scaling signal disappears. Recovering per-pod GPU utilization from a shared device — a number the hardware does not report directly — is precisely what ScaleOps GPU Usage-Based HPA (AI Replica Optimization) provides: it surfaces per-pod GPU utilization as HPA-ready custom metrics even when workloads share a GPU, so each model scales on its own real consumption rather than a device-level average. It plugs into the same HPA and KEDA setup — better inputs to the controllers you already run, not a replacement for them.
Multi-GPU: Tensor Parallelism for Bigger Models
Suppose the example grows: the team wants Llama-3-70B, whose weights plus KV cache do not fit on one L4. You shard it across several GPUs on one node with tensor parallelism. vLLM splits each layer’s tensors across the GPUs and runs them in parallel, exchanging activations between them. The flag is --tensor-parallel-size, and its value must equal the number of GPUs the pod requests:
containers:
  - name: vllm
    image: vllm/vllm-openai:v0.8.5
    args:
      - "--model"
      - "meta-llama/Meta-Llama-3-70B-Instruct"
      - "--tensor-parallel-size"
      - "4" # shard across 4 GPUs
    resources:
      limits:
        nvidia.com/gpu: "4" # must equal --tensor-parallel-size
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm # NCCL talks over shared memory; the default 64 MiB hangs TP
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
Three constraints decide whether this performs. First, tensor parallelism is chatty — the GPUs exchange data on every forward pass — so they must be connected by a high-bandwidth interconnect (NVLink) on the same node; spreading a tensor-parallel group across nodes over ordinary networking bottlenecks on communication. Second, the GPU count and --tensor-parallel-size must agree, or the pod fails to start in a confusing way. Third, a non-obvious one we hit: tensor parallelism communicates through shared memory, and a container’s default /dev/shm of 64 MiB is too small, so NCCL hangs at startup — mount an in-memory emptyDir at /dev/shm, as the manifest above does. We validated this path end to end with Llama-3-8B at --tensor-parallel-size 2 on two NVLink-connected A100-40 GB cards: NCCL 2.21.5, two workers at world size 2, serving completions. Note what tensor parallelism is for: fitting a model too large for one GPU. For more throughput on a model that already fits, add replicas (§6), not GPUs per replica.
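The core of tensor parallelism fits in a toy example: split a weight matrix by columns, let each shard multiply independently, then gather the pieces. This pure-Python sketch illustrates only the column-parallel half of the scheme. The matrices are made up, and the real engine runs the shards on separate GPUs and exchanges results over NCCL.

```python
def matmul(x: list[list[float]], w: list[list[float]]) -> list[list[float]]:
    """Plain dense matrix multiply: (rows x k) @ (k x cols)."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Give each shard an equal slice of the weight matrix's columns."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, tp_size: int):
    shards = split_columns(w, tp_size)                 # each "GPU" holds 1/tp_size of the columns
    partials = [matmul(x, shard) for shard in shards]  # computed independently per shard
    # Gather step: concatenate every shard's output columns back into full rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

sharded = tensor_parallel_matmul(x, w, tp_size=2)
reference = matmul(x, w)
per_gpu_weight_gb = 140 / 4  # ~140 GB of FP16 weights spread over 4 GPUs

print(sharded == reference, per_gpu_weight_gb)
```

The gather step runs for every layer on every forward pass, which is the traffic that makes interconnect bandwidth the deciding constraint.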
Scaling Out: Prefill–Decode Disaggregation
There is a ceiling to single-node tensor parallelism and replica scaling, and beyond it the frontier is splitting the phases of inference across pods. Inference has two phases with very different hardware profiles. Prefill processes the entire prompt in one forward pass and is compute-bound: it wants raw FLOPs. Decode then generates tokens one at a time, each step reading the whole KV cache, so it is memory-bandwidth-bound and latency-sensitive. Run both on the same GPU and they interfere: a long prompt’s prefill monopolizes compute and stalls the token-by-token decode of every other in-flight request.
Prefill–decode disaggregation runs the two phases on separate pools sized for their respective bottlenecks, then transfers the KV cache from the prefill workers to the decode workers. A prefill pool can run on compute-dense GPUs and scale with prompt volume, while a decode pool optimizes for memory bandwidth and many concurrent token streams. The reference architecture in the vLLM ecosystem is llm-d, which adds LLM-aware, prefix-cache-aware request scheduling (routing each request to the worker most likely to already hold its prefix, which ties straight back to the prefix caching from §4) and uses a KV-connector interface with NVIDIA’s NIXL for the cache transfer. That transfer is not free, so disaggregation pays off only when the interference it removes outweighs the cost of moving the cache, which is the case for workloads with long prompts and high concurrency.
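The cost of moving the cache can be estimated before building anything. The sketch below sizes the KV cache for one prompt using Llama-3-8B’s architecture (32 layers, 8 KV heads, head dimension 128, FP16) and divides by a link bandwidth. The two bandwidth figures are placeholders for illustration, not measurements from this guide.

```python
def kv_cache_bytes(prompt_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Keys plus values for every token, layer, and KV head."""
    return prompt_tokens * 2 * layers * kv_heads * head_dim * dtype_bytes


def transfer_seconds(num_bytes: int, link_gb_per_s: float) -> float:
    """Idealized transfer time: payload divided by sustained bandwidth."""
    return num_bytes / (link_gb_per_s * 1e9)


size = kv_cache_bytes(8192, layers=32, kv_heads=8, head_dim=128)

# Placeholder link speeds in GB/s; substitute what your network actually sustains.
for label, bandwidth in [("slow link", 1.0), ("fast link", 50.0)]:
    print(label, round(transfer_seconds(size, bandwidth), 3), "s")
```

A full 8192-token prompt carries exactly 1 GiB of KV cache on these assumptions, so the transfer adds about a second at 1 GB/s and a couple of hundredths of a second at 50 GB/s. Comparing that figure with the decode stalls you measure is one way to judge whether disaggregation can pay for itself.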
This is the one place in the guide where we could not get a clean end-to-end result, and that is itself the lesson: disaggregation is not a drop-in. Both paths carry hard prerequisites. llm-d’s production disaggregation expects RDMA networking (InfiniBand or RoCE) between pods and roughly eight GPUs, so standard GKE L4 and A100 pools will not do — it needs RDMA-capable nodes such as a3-megagpu H100s with GPUDirect. The lighter vLLM-native NixlConnector (NIXL over UCX, no RDMA) is deployable on a single two-GPU node and validates the mechanic in principle, but it is explicitly experimental: in our attempt on vllm/vllm-openai:latest, the prefill and decode pods both started and loaded NIXL, but the NIXL handshake listener thread crashed with ValueError: not enough values to unpack, the KV leases expired, and the decode side pulled the cache from zero remote workers. If you go here, pin a known-good version and budget for rough edges rather than expecting it to work first try.
What both paths have in common is orchestration: disaggregation means standing up separate prefill and decode pools, a KV connector between them, and prefix-aware routing in front, all assembled by hand on top of raw Deployments. If you would rather operate at a higher level than that, KServe runs vLLM as a declarative InferenceService — the same vLLM container underneath — and is built to manage this kind of multi-component serving:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
  namespace: inference
spec:
  predictor:
    minReplicas: 1 # same warm-floor logic as the KEDA setup in §6
    model:
      modelFormat:
        name: vllm # KServe's built-in vLLM serving runtime
      args:
        - "--model"
        - "NousResearch/Meta-Llama-3-8B-Instruct"
      resources:
        limits:
          nvidia.com/gpu: "1"
KServe gives you request-driven scaling, canary rollouts, and a path to disaggregated topologies without hand-rolling the orchestration, at the cost of another layer to operate. The honest guidance holds: reach for any of this only when you have a measured reason, namely that prefill and decode contention is your bottleneck and the replica-scaled Deployment from §6 can no longer hold your latency targets. For most teams deploying vLLM Kubernetes workloads, §3 through §7 is the whole job, and §8 is where you go when the numbers say so.
Operational Pitfalls
The failures that take down a vLLM service are rarely the model. They are operational and predictable, and our example has already brushed against most of them.
Cold-start model downloads. A pod’s first action is pulling the model weights — roughly 16 GB for Llama-3-8B — from the model hub onto the node. With the emptyDir cache from §3, that repeats on every restart and every new node, and in our test it dominated the ~6-minute cold start. Back the cache with a PersistentVolume, or bake the model into the image, so a restart does not re-download. This is the single biggest contributor to the cold-start tax that made minReplicaCount: 1 non-negotiable in §6.
CUDA graph compilation. vLLM captures CUDA graphs at startup to speed up the steady state, and that capture takes time before the pod is ready — which is why §3’s probe pairs a short initialDelaySeconds with a large failureThreshold (we used 60 checks at 10-second intervals, a ten-minute window). If pods restart during startup, suspect the probe before you suspect vLLM.
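The probe arithmetic is worth checking before you blame vLLM. This sketch approximates the window as initialDelaySeconds plus failureThreshold times periodSeconds and compares it with a measured cold start. The 30-second initial delay and the 1.5x headroom factor are hypothetical values chosen for illustration. The 60 checks at 10 seconds and the ~6.3-minute cold start are the figures from our run.

```python
import math


def probe_window_seconds(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate time a container has to become healthy before the probe gives up."""
    return initial_delay_s + period_s * failure_threshold


def min_failure_threshold(startup_s: float, initial_delay_s: int, period_s: int,
                          headroom: float = 1.5) -> int:
    """Smallest failureThreshold that covers the startup time with some headroom."""
    budget_s = startup_s * headroom - initial_delay_s
    return max(1, math.ceil(budget_s / period_s))


COLD_START_S = 378     # the ~6.3-minute cold start from this guide
INITIAL_DELAY_S = 30   # ASSUMPTION: hypothetical "short" initial delay

window = probe_window_seconds(INITIAL_DELAY_S, period_s=10, failure_threshold=60)
needed = min_failure_threshold(COLD_START_S, INITIAL_DELAY_S, period_s=10)
print(f"window={window}s, cold start={COLD_START_S}s, minimum failureThreshold={needed}")
```

If the window comes out smaller than your measured cold start, the probe will give up on a pod that was on its way to healthy.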
The --enforce-eager tradeoff. Setting --enforce-eager skips CUDA graph capture, lowering memory use and shortening startup at the cost of somewhat slower per-token inference in the steady state. It is a reasonable lever when you are memory-constrained or churning pods frequently, but make it a deliberate choice, not a default you leave on.
V1 engine defaults. The V1 engine (the 2025 rearchitecture) changed defaults: prefix caching is on, torch.compile integration is enabled, and the engine loop was rebuilt for roughly a 2x throughput improvement. These are good defaults, but if you are following an older tutorial that toggles them by hand you can end up fighting the engine. Read the release notes for the tag you pinned; we ran v0.8.5 on the V1 engine, with prefix caching and torch.compile on by default exactly as described above.
Scale-up latency. Every time a new replica spins up, on a traffic burst or a fresh node, the first requests wait out the ~6.3-minute cold start from §3. The traffic that triggered the scale-up gets a multi-minute response instead of a sub-second one, and scaling to zero makes the first request after idle pay it in full. The choice is to keep warm replicas around (the minReplicaCount floor, which costs money) or eat the latency. Collapsing that cold start is exactly what ScaleOps Model Performance Optimization targets, accelerating model load and keeping self-hosted models warm for real-time inference. This complements the minReplicaCount floor rather than replacing the scaling logic.
The Cost Problem: Rightsizing GPU Inference
The last problem does not show up as an incident, only as a bill. Our vllm-llama3-8b replica requests nvidia.com/gpu: 1 and gets a whole L4, but inference traffic is bursty, and outside peak hours that GPU sits well under its capacity. This underuse is typical: even well-run clusters rarely push GPU utilization past 20–30%, because Kubernetes sees a GPU as atomic — 1 allocated, 0 available — with no concept of fractional use. A cluster running ten inference services, each using 15% of a GPU, needs ten physical GPUs at full price despite using the equivalent of 1.5.
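The ten-service example works out as follows. This is a small sketch of the allocation gap, with utilization given in whole percent. The packing lower bound it reports is purely a compute figure and ignores GPU memory entirely.

```python
import math


def gpu_allocation_gap(utilization_pct: list) -> dict:
    """Compare whole-GPU allocation with the compute actually consumed."""
    allocated = len(utilization_pct)       # one whole GPU per service
    used = sum(utilization_pct) / 100      # GPU-equivalents of real work
    return {
        "allocated_gpus": allocated,
        "used_gpu_equivalents": used,
        "idle_fraction": 1 - used / allocated,
        "packing_lower_bound": math.ceil(used),  # compute only, memory ignored
    }


gap = gpu_allocation_gap([15] * 10)
print(gap)
```

Ten allocated GPUs doing 1.5 GPUs of work leaves 85% of the paid-for compute idle. The lower bound of two cards is a ceiling on the savings rather than a target, because it assumes the workloads could share memory as freely as they share compute.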
Closing that gap means matching allocation to real utilization, but a vLLM server complicates the obvious move. As §4 showed, vLLM reserves its VRAM up front — roughly 23 GiB of the L4 held constantly, idle or busy — so external GPU-memory right-sizing has almost nothing to reclaim from a running vLLM server; the memory is committed by design. For a vLLM fleet, the real levers are running the right number of GPUs rather than reclaiming slices of each, and tuning --gpu-memory-utilization down to the smallest pool that still holds peak concurrency. This is where ScaleOps GPU Platform fits, and it is the natural home for the per-pod signals from §5:
- AI Replica Optimization (GPU Usage-Based HPA) is the primary lever for a vLLM fleet: it surfaces per-pod GPU utilization as HPA-ready metrics, including on shared GPUs, so each model scales on its real consumption. It is the §6 capability doing double duty as cost control.
- Automated Fractional GPUs and GPU Memory Optimization pay off most for GPU work that does not pre-claim all its memory: smaller models, notebooks, batch jobs, or vLLM servers deliberately tuned to a low --gpu-memory-utilization so several share a card. ScaleOps reports up to a 70% reduction in GPU waste from that kind of fractional bin-packing, but temper the expectation for a single vLLM server already holding 90% of its card.
- Batch Inference Optimization maximizes throughput for batch, non-latency-critical inference, getting more tokens out of the hardware you do run.
All of it works with the device plugin, the scheduler, and Cluster Autoscaler or Karpenter rather than replacing them — the existing primitives stay where they are and get better inputs. The honest boundaries: the fractional allocation is advisory at the scheduling layer (it prevents over-packing by declared fractions, not hardware isolation) and applies to single-GPU workloads, not the multi-GPU tensor-parallel pods from §7. For a vLLM fleet specifically, lead with replica and node scaling, and treat fractional memory sharing as the lever for the workloads around vLLM more than for the vLLM servers themselves.
There is also a token-level thread worth pulling. Prefix caching (§4) is vLLM reusing the KV blocks of repeated prompt prefixes; the prefix-cache hit rate you monitor in §5 is, in effect, a measure of how much token computation you are not paying for. ScaleOps surfaces those prompt-caching opportunities and ties token-level signals to per-pod cost — and for teams running agentic workloads, where each loop carries the full conversation forward and spend compounds silently, the same lens extends through ScaleOps agent solutions to per-agent token spend and budget control. For a self-hosted vLLM service, though, the win is concrete and local: right-size the GPU, keep the model warm, and watch the prefix cache do its job.
The honest summary of all of this: deploying vLLM on Kubernetes is, mechanically, the easy 80% — a Deployment behind a Service, a GPU request, KEDA on DCGM_FI_DEV_GPU_UTIL, and the cluster autoscaler for nodes, all of which we ran end to end on GKE with L4 and A100 GPUs. The other 20% is the operational detail that decides whether it survives production: the DCGM exporter is not a drop-in on managed Kubernetes, your scaling query silently breaks without honorLabels, vLLM holds its VRAM by design so memory savings come from replica scaling rather than fractional memory, GPU stockouts will leave autoscaled pods Pending, and the scale-out paths carry hard prerequisites — a bigger /dev/shm for NCCL, and RDMA plus multiple GPUs for llm-d. You can get the 80% running in an afternoon; budget for the 20%.
ScaleOps AI Infra installs with a single Helm flag and plugs into the KEDA and node-autoscaling setup from §6 without rearchitecture — same Deployment, same Service, same /metrics.
Try ScaleOps free → to see how much of your inference GPU footprint is allocated but idle before you change a single pod spec.
Book a demo → to walk through fractional GPU allocation and per-pod inference observability against your own vLLM workloads with our team.
vLLM Kubernetes Troubleshooting
Most vLLM Kubernetes failures are operational rather than model bugs, and they cluster into a handful of recognizable symptoms. This table consolidates the ones covered above into a quick symptom-to-fix reference.
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Pod stuck in Pending | No allocatable nvidia.com/gpu: device plugin or driver missing, or no free GPU | Confirm the NVIDIA device plugin DaemonSet and the node’s nvidia.com/gpu; let Cluster Autoscaler or Karpenter add a GPU node (§2, §6) |
| Pod crash-loops during startup | The /health probe gives up before model load and CUDA graph capture finish | Widen the probe window: raise failureThreshold (or initialDelaySeconds) on the /health probe (§3) |
| OOMKilled under load | --gpu-memory-utilization set too high to absorb a traffic burst | Lower to 0.85–0.90 and cap --max-model-len (§4) |
| Slow first response after a scale-up or new node | Cold-start model download and weight load | Back the Hugging Face cache with a PVC or bake the model into the image; keep minReplicaCount at 1 or higher (§6, §9) |
| ImagePullBackOff | Untagged or unreachable vLLM image, or missing registry auth | Pin a tested vllm/vllm-openai tag and check pull secrets (§3) |
| 403 or gated-model error at load | Missing or unauthorized HUGGING_FACE_HUB_TOKEN | Mount a valid Hugging Face token Secret with access to the model (§3) |
| Pod will not start with tensor parallelism | --tensor-parallel-size does not equal the nvidia.com/gpu count, or the GPUs are not NVLink-connected | Match the count and schedule onto a single NVLink node (§7) |
| Clients cannot reach the model | Service port/targetPort mismatch or wrong namespace DNS | Map Service :80 to targetPort: 8000 and use the …inference.svc.cluster.local name (§3) |
vLLM on Kubernetes: Quick Reference
| vLLM server arg | What it controls | Production note |
| --- | --- | --- |
| --model | Model to serve (HF id or local path) | Gated models need HUGGING_FACE_HUB_TOKEN |
| --gpu-memory-utilization | Fraction of VRAM for weights + KV cache | 0.85–0.90; 0.95 risks OOM under burst |
| --max-model-len | Max context length | Higher = more KV cache reserved per request |
| --quantization | Quantization method (e.g. awq) | Fits larger models or smaller/cheaper GPUs (T4) |
| --tensor-parallel-size | GPUs to shard one replica across | Must equal nvidia.com/gpu; NVLink, same node |
| --enforce-eager | Skip CUDA graph capture | Lower memory + faster start, slower steady-state |

| Concern | Primitive | Scales / acts on |
| --- | --- | --- |
| Replica count | KEDA ScaledObject → Deployment | DCGM_FI_DEV_GPU_UTIL |
| GPU node count | Cluster Autoscaler / Karpenter | Pending GPU pods |
| Fitting a big model | --tensor-parallel-size | one replica, many NVLink GPUs |
| Inference health | vLLM /metrics | TTFT, prefix cache hit rate, queue depth |
| Per-pod metrics on shared GPUs | ScaleOps GPU Usage-Based HPA | per-pod GPU utilization |
| Idle GPU spend | ScaleOps Automated Fractional GPUs | per-pod DCGM utilization |
Frequently Asked Questions
Can vLLM run on Kubernetes without a GPU?
In practice, no. vLLM is built to serve models on GPUs, and a production vLLM deployment requests nvidia.com/gpu. CPU execution exists for experimentation and tiny models, but it is not a path for real inference traffic.
Should I deploy vLLM as a Deployment or a DaemonSet?
Deploy vLLM as a Deployment. A DaemonSet pins one pod per node and cannot have its replicas scaled by KEDA, which removes the ability to scale on GPU utilization. A Deployment behind a ClusterIP Service, with Cluster Autoscaler or Karpenter handling node count, is the production pattern.
How do I autoscale vLLM on Kubernetes?
Autoscaling vLLM uses two layers. KEDA performs replica scaling on the DCGM_FI_DEV_GPU_UTIL metric from the DCGM exporter, and Cluster Autoscaler or Karpenter performs node autoscaling when new replicas have nowhere to schedule. Keep minReplicaCount above zero for latency-sensitive services because of GPU cold-start cost.
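KEDA drives a standard HorizontalPodAutoscaler, so the replica math follows the documented HPA rule: desired replicas equal the ceiling of current replicas times the ratio of the current metric to the target. The sketch below applies that rule with a floor and a ceiling. The target of 70% and the maximum of 8 replicas are hypothetical values, the HPA's tolerance band and stabilization windows are left out, and how a KEDA trigger's query result maps onto the formula depends on the trigger's metric type.

```python
import math


def desired_replicas(current_replicas: int, avg_gpu_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA-style scaling rule, clamped to a warm floor and a hard ceiling."""
    raw = math.ceil(current_replicas * avg_gpu_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# ASSUMPTION: hypothetical 70% utilization target.
print(desired_replicas(2, avg_gpu_util=95, target_util=70))  # burst: scale out
print(desired_replicas(3, avg_gpu_util=10, target_util=70))  # quiet: scale in
print(desired_replicas(1, avg_gpu_util=0, target_util=70))   # idle: floor holds
```

The floor is what keeps an idle service from dropping to zero and making the next caller pay the full cold start.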
What GPU do I need to serve Llama-3-8B with vLLM?
Serving Llama-3-8B in FP16 needs about 16 GB of GPU memory plus KV-cache headroom, so a 24 GB card such as an NVIDIA L4 or A10G is comfortable. An AWQ-quantized build of the same model drops to roughly 6 GB and runs on a 16 GB T4.
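The sizing behind those numbers is simple arithmetic, sketched here. Weight memory is parameters times bits per parameter. The KV-cache budget is whatever remains of the --gpu-memory-utilization fraction after weights, with activation and runtime overhead ignored for simplicity. The raw 4-bit figure is a floor, and the roughly 6 GB quoted for the AWQ build presumably includes quantization metadata and any layers kept at higher precision.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def kv_budget_gb(card_gb: float, weights_gb: float, gpu_memory_utilization: float) -> float:
    """Memory left for KV cache inside vLLM's pool; ignores activations and overhead."""
    return card_gb * gpu_memory_utilization - weights_gb


fp16 = weight_gb(8, 16)      # Llama-3-8B in FP16
int4_floor = weight_gb(8, 4) # raw 4-bit weights, before any overhead

print(f"FP16 weights: {fp16:.0f} GB, 4-bit floor: {int4_floor:.0f} GB")
print(f"L4 24 GB, FP16, 0.90: {kv_budget_gb(24, fp16, 0.90):.1f} GB for KV cache")
print(f"T4 16 GB, AWQ ~6 GB, 0.90: {kv_budget_gb(16, 6, 0.90):.1f} GB for KV cache")
```

If the KV budget comes out small or negative, the model either will not load or will have too little room for concurrent requests, which is the signal to quantize, shorten --max-model-len, or move to a larger card.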
Why is my vLLM pod taking minutes to become ready?
A vLLM pod is slow to start because, on first run, it downloads the model, loads the weights onto the GPU, and captures CUDA graphs, none of which is instant. Give the probe a generous window (§3 pairs a short initialDelaySeconds with a large failureThreshold), cache weights on a PersistentVolume, and see the GPU cold-starts article in this series to cut the delay.
How do I monitor vLLM inference performance, not just GPU usage?
Monitoring vLLM inference means scraping vLLM’s /metrics endpoint, not only GPU utilization. The signals that show service health are vllm:time_to_first_token_seconds, vllm:gpu_cache_usage_perc, vllm:num_requests_waiting, and the prefix-cache hit counters.
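As a quick sanity check outside Prometheus, the text that /metrics returns can be parsed directly. This is a minimal sketch that assumes one series per metric name, as on a pod serving a single model. The sample payload is illustrative and its values are made up. Only the metric names come from the list above.

```python
# ASSUMPTION: illustrative scrape body with made-up values, not output from our run.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama3-8b"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama3-8b"} 0.42
vllm:time_to_first_token_seconds_sum{model_name="llama3-8b"} 12.5
vllm:time_to_first_token_seconds_count{model_name="llama3-8b"} 50.0
"""


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, raw_value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]
        values[name] = float(raw_value)  # last series wins; fine for one model per pod
    return values


def mean_ttft_seconds(values: dict):
    """Lifetime mean time to first token from the histogram's sum and count."""
    count = values.get("vllm:time_to_first_token_seconds_count", 0.0)
    if count == 0:
        return None
    return values["vllm:time_to_first_token_seconds_sum"] / count


metrics = parse_metrics(SAMPLE)
print("waiting:", metrics["vllm:num_requests_waiting"])
print("mean TTFT (s):", mean_ttft_seconds(metrics))
```

The mean here covers the whole life of the process. For alerting you would normally compute a windowed percentile in Prometheus from the histogram buckets instead.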
How do I serve a model too large for one GPU?
A model too large for one GPU is served with tensor parallelism: set --tensor-parallel-size to the GPU count and request the matching nvidia.com/gpu, with the GPUs NVLink-connected on one node. Beyond a single node, prefill–decode disaggregation with llm-d is the scale-out path.
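The count mismatch is easy to catch before a pod is ever scheduled. The sketch below checks a container spec, represented as a plain dict, for agreement between the flag and the GPU limit. The function name and the dict layout are illustrative rather than part of any tool, and the model id is a placeholder.

```python
def tensor_parallel_mismatch(container: dict):
    """Return an error string if --tensor-parallel-size != nvidia.com/gpu, else None."""
    args = container.get("args", [])
    tp_size = 1  # vLLM's default when the flag is absent
    for i, arg in enumerate(args):
        if arg == "--tensor-parallel-size" and i + 1 < len(args):
            tp_size = int(args[i + 1])
        elif arg.startswith("--tensor-parallel-size="):
            tp_size = int(arg.split("=", 1)[1])
    limits = container.get("resources", {}).get("limits", {})
    gpus = int(limits.get("nvidia.com/gpu", 0))
    if tp_size != gpus:
        return f"--tensor-parallel-size={tp_size} but nvidia.com/gpu={gpus}"
    return None


# ASSUMPTION: hypothetical container specs for illustration.
good = {"args": ["--model", "example/model", "--tensor-parallel-size", "2"],
        "resources": {"limits": {"nvidia.com/gpu": "2"}}}
bad = {"args": ["--model", "example/model", "--tensor-parallel-size=2"],
       "resources": {"limits": {"nvidia.com/gpu": "1"}}}

print(tensor_parallel_mismatch(good))
print(tensor_parallel_mismatch(bad))
```

A check like this fits naturally in CI or an admission step. It does not verify that the GPUs share a node or an NVLink domain, which still has to come from scheduling constraints.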
Why is one GPU per replica wasteful, and what can I do about it?
Inference is bursty, so a replica that owns a whole GPU rarely saturates it, and Kubernetes cannot natively share a GPU across pods. Per-pod GPU observability plus fractional allocation — for example ScaleOps AI Infra — lets several models share a card based on real utilization, recovering the idle capacity.