Key takeaways
- VPA uses decaying weighted histograms, not simple averages, to generate CPU and memory recommendations per container. It needs at least 24–48 hours of data to produce meaningful output.
- There are now four update modes: Off, Initial, Recreate, and InPlaceOrRecreate. The Auto mode is deprecated since VPA 1.4.0.
- VPA’s biggest production risk is the death spiral with HPA: VPA lowers requests, HPA’s percentage math changes underneath it, replicas explode, per-pod histograms skew downward, and the loop accelerates.
- InPlaceOrRecreate with in-place pod resize (GA in K8s 1.35) removes the restart penalty, but VPA’s recommender logic is still reactive and context-blind.
- Always set minAllowed and maxAllowed boundaries, separate VPA and HPA onto different metrics, and start in Off mode before enabling enforcement.
The Kubernetes Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests for running pods based on observed historical usage. Instead of scaling the number of replicas (which is what the Horizontal Pod Autoscaler does), VPA resizes individual pods — making it the go-to tool for workloads where adding more replicas either doesn’t help or isn’t possible.
In practice, most teams deploy VPA in recommendation-only mode, use its output to manually set resource requests at deploy time, and then ignore it. The tool that was supposed to continuously optimize resources becomes a calculator you consult once. This article explains why that happens, what VPA actually does under the hood, and how to configure it so it works in production rather than against it.
What changed in 2025–2026: In-place pod vertical scaling graduated to GA in Kubernetes 1.35 (December 2025), and VPA 1.2+ introduced the InPlaceOrRecreate update mode. Together, these eliminate VPA’s most significant operational cost: forced pod restarts. The Auto update mode was deprecated in VPA 1.4.0 and is now an alias for Recreate.
What Is the Kubernetes Vertical Pod Autoscaler (VPA)?
The Kubernetes VPA is a set of Custom Resource Definitions (CRDs) and three cooperating controllers that run as an add-on in your cluster. Unlike HPA, the Vertical Pod Autoscaler is not part of the core Kubernetes API — you must install it separately before you can create VerticalPodAutoscaler resources.
VPA manages two resource parameters per container:
- Requests — the guaranteed amount of CPU and memory a container is scheduled with. The scheduler uses these values to find a node with sufficient capacity.
- Limits — the ceiling on CPU and memory a container can consume. Exceeding a memory limit triggers an OOMKill; exceeding a CPU limit triggers CFS throttling.
The controlledValues field in the VPA spec determines whether VPA manages only requests (RequestsOnly, the default) or both requests and limits (RequestsAndLimits). This distinction matters: if you set RequestsAndLimits and VPA raises the limit beyond what the node can satisfy, you create a scheduling problem rather than solving one.
VPA is most useful for workloads where horizontal replica scaling is either impossible or doesn’t address the bottleneck — single-replica stateful workloads, memory-heavy caches, JVM-based applications where heap sizing drives resource needs, or batch jobs with variable resource profiles across runs. In the broader Kubernetes autoscaling ecosystem, VPA fills the gap that HPA and KEDA cannot: right-sizing individual pods rather than adjusting replica count.
VPA Architecture: How the Three Components Work Together
Kubernetes VPA’s three components form a continuous feedback loop: observe, recommend, enforce.
Recommender
The Recommender is the analytical core. It polls resource usage metrics from the Kubernetes Metrics Server and maintains a decaying weighted histogram per container. This is not a simple moving average — the histogram tracks the distribution of resource consumption over time, with more recent samples weighted more heavily than older ones.
For memory, the Recommender targets the 95th percentile of observed usage, providing headroom for occasional spikes without grossly over-provisioning. For CPU, it uses a different approach that accounts for burst patterns and sustained load characteristics. The Recommender also reacts to OOMKill events by raising the memory target.
By default, the Recommender retains 8 days of historical data. It needs at least 24 to 48 hours of observations before generating meaningful recommendations. Applications with weekly usage patterns — like business-hours-only services — benefit from extending this observation window to capture full operational cycles.
The recommendation output includes four values per resource: target (what VPA recommends), lowerBound and upperBound (the confidence interval), and uncappedTarget (the recommendation before minAllowed/maxAllowed boundaries are applied).
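The decaying-weight idea can be sketched in a few lines of Python. This is an illustrative model, not VPA's actual implementation: each sample's weight halves every `half_life_h` hours (the half-life value and the `weighted_percentile` helper are assumptions for the sketch), so the percentile estimate is dominated by recent observations.

```python
# Illustrative model of a decaying weighted histogram. NOT VPA's actual
# implementation: each sample's weight halves every `half_life_h` hours,
# so recent observations dominate the percentile estimate.
def weighted_percentile(samples, fraction, half_life_h=24.0):
    """samples: iterable of (age_hours, usage) pairs.
    Returns the smallest usage at or above the weighted `fraction` quantile."""
    weighted = sorted((usage, 0.5 ** (age / half_life_h)) for age, usage in samples)
    total = sum(w for _, w in weighted)
    cumulative = 0.0
    for usage, weight in weighted:
        cumulative += weight
        if cumulative >= fraction * total:
            return usage
    return weighted[-1][0]

# Three days of ~200Mi usage followed by ten hours at ~400Mi: the recent
# samples' higher weight pulls the p95 estimate up to the new level quickly,
# where an unweighted p95 over the raw samples would lag far behind.
old_samples = [(72 + i, 200) for i in range(50)]    # ~3 days old
recent_spike = [(i * 0.5, 400) for i in range(20)]  # last 10 hours
print(weighted_percentile(old_samples + recent_spike, 0.95))
```

Note that with only the old samples, the same call returns 200: the decay does not invent headroom, it only re-weights what was observed.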
Updater
The Updater watches running pods and compares their current resource requests against the Recommender’s output. When the delta exceeds an internal threshold, the Updater takes action — either evicting the pod (forcing a restart with updated values) or patching the pod’s resource spec in place, depending on the update mode.
The Updater respects PodDisruptionBudgets (PDBs) when evicting pods. If eviction would violate a PDB, the Updater backs off and retries later. This is an important safety mechanism for multi-replica deployments, but it also means single-replica workloads without a PDB can be evicted with no protection.
Admission Controller
The Admission Controller is a mutating webhook that intercepts pod creation requests. When a new pod matches a VPA target, the webhook patches the pod’s resource requests (and optionally limits) to match the current recommendation before the pod is admitted to the cluster. This ensures that newly scheduled pods start with optimal sizing rather than whatever was hardcoded in the Deployment spec.
The Metrics Pipeline
Understanding VPA’s latency requires understanding the full metrics pipeline:
- kubelet scrapes container resource usage from cgroups (every 10–15 seconds)
- Metrics Server polls kubelets and aggregates data (~60-second intervals)
- VPA Recommender reads from Metrics Server and updates histograms
- Recommendation calculation produces new target values
- Updater (if enabled) compares and acts
This pipeline introduces a minimum of 60–90 seconds of observability lag before VPA even knows a spike occurred. For a spike-to-action timeline, add eviction scheduling time (or in-place patch propagation) on top.
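The lag floor falls straight out of the stage intervals above. A back-of-the-envelope sum, using the per-stage figures this article assumes rather than measured values:

```python
# Worst-case per-stage delays, in seconds, before VPA can even observe a
# spike. The figures are this article's assumptions, not measurements.
pipeline_lag_s = {
    "kubelet cgroup scrape interval": 15,      # upper end of the 10-15s range
    "Metrics Server aggregation": 60,          # ~60s polling interval
    "Recommender poll of Metrics Server": 15,  # assumed Recommender poll gap
}
total = sum(pipeline_lag_s.values())
print(f"observability lag floor: ~{total}s before any eviction or resize")
```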
VPA Update Modes: Off, Initial, Recreate, and InPlaceOrRecreate
VPA supports four update modes. Choosing the right one depends on your workload’s tolerance for disruption and your Kubernetes version.
Off
VPA calculates recommendations but does not apply them. Pods are never evicted or patched. This is the safest starting point for any new VPA deployment — it lets you observe what VPA would do before you let it act.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"
Inspect recommendations with:
kubectl get vpa my-app-vpa -o jsonpath='{.status.recommendation}' | jq
Initial
VPA sets resource requests only at pod creation time. Running pods are never evicted. This is a good choice for workloads where mid-life restarts are risky (databases, caches, leader-election services) but you still want new pods to start with better defaults than whatever was in the original manifest.
updatePolicy:
  updateMode: "Initial"
Recreate
VPA actively evicts pods when their current resource requests diverge significantly from the recommendation. Evicted pods are recreated by their owning controller (Deployment, StatefulSet, etc.), and the Admission Controller applies the updated recommendation to the new pod spec.
updatePolicy:
  updateMode: "Recreate"
This is the mode that makes VPA a continuous optimization loop — but it comes with operational cost. Restarting a JVM clears JIT compilation. Restarting PostgreSQL triggers WAL replay. Restarting Redis flushes the in-memory cache. For many production workloads, the restart cost exceeds the benefit of more accurate resource requests.
InPlaceOrRecreate
Introduced in VPA 1.2+, this mode attempts an in-place resize first — patching the pod’s resource spec via the resize subresource without restarting it. If the in-place update fails (for example, the node lacks capacity or the container’s resizePolicy requires a restart for that resource), VPA falls back to eviction.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "InPlaceOrRecreate"
This mode requires Kubernetes 1.33+ (in-place resize beta) or 1.35+ (GA). You also need to configure resizePolicy on the container spec to control restart behavior per resource:
containers:
- name: my-app
  image: my-app:latest
  resizePolicy:
  - resourceName: cpu
    restartPolicy: NotRequired
  - resourceName: memory
    restartPolicy: NotRequired
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
Setting restartPolicy: NotRequired for both CPU and memory means the container can be resized live. If you need a restart for memory changes (common for JVM applications where heap is set at startup), set memory’s policy to RestartContainer.
Auto (Deprecated)
Auto was deprecated in VPA 1.4.0 and is now an alias for Recreate. It was originally introduced to allow for future expansion of automatic update strategies. If you see Auto in existing configurations, migrate to either Recreate or InPlaceOrRecreate depending on your Kubernetes version.
Vertical Pod Autoscaler Update Modes: Comparison Table
| Mode | Evicts pods? | In-place resize? | When to use |
| --- | --- | --- | --- |
| Off | No | No | Observation, dry-run, manual application |
| Initial | No | No | Stateful workloads, leader-election services |
| Recreate | Yes | No | Stateless workloads tolerant of restarts |
| InPlaceOrRecreate | Fallback only | Yes (K8s 1.33+) | Production workloads on K8s 1.35+ |
| Auto (deprecated) | Yes | No | Migrate to Recreate or InPlaceOrRecreate |
Kubernetes VPA Limitations: What Breaks in Production
VPA has well-documented benefits — automated rightsizing, reduced over-provisioning, lower infrastructure cost. The limitations are less discussed but more consequential in production.
Reactive Latency
VPA’s recommendation pipeline has inherent lag. By the time VPA detects a memory spike, the OOMKill may have already happened. The metrics pipeline alone introduces 60–90 seconds of delay (kubelet scrape → Metrics Server poll → Recommender read). On top of that, the Recommender uses historical averages, not instantaneous values, so a sudden spike takes time to shift the histogram meaningfully.
In-place resize (K8s 1.35) makes the application of recommendations fast enough for proactive scaling. But VPA’s Recommender is still purely reactive — it only responds to what has already occurred. There is no prediction, no pattern recognition, no seasonality awareness.
Histogram Aggregation Skew
This is VPA’s most subtle failure mode and the one that causes the most unexpected behavior in production.
VPA maintains histograms per container, not per workload. When HPA scales a deployment from 2 replicas to 8, the load spreads across more pods, and each individual container now handles roughly 25% of the previous load. VPA’s histograms register this as reduced per-pod usage, and the Recommender produces smaller requests.
Those smaller requests then cause HPA’s percentage math to change (same absolute usage / smaller requests = higher utilization percentage), which triggers more scale-out, which spreads load further, which makes VPA recommend even smaller requests. This is a feedback loop.
The VPA/HPA Death Spiral
The histogram aggregation problem compounds with a separate percentage math issue into what is commonly called the death spiral. Here is the sequence:
- VPA lowers CPU requests from 200m to 100m based on observed usage.
- Actual CPU usage remains 150m per pod.
- Before: 150m / 200m = 75% utilization. After: 150m / 100m = 150% utilization.
- HPA sees 150% and panics — scales out aggressively.
- Load spreads across more replicas. Each pod now uses less CPU.
- VPA’s per-container histograms see lower usage. Recommends even smaller requests.
- The cycle accelerates.
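The arithmetic of the spiral can be demonstrated with a toy simulation. Everything here is an illustrative assumption: VPA is reduced to "set requests to observed per-pod usage plus 5% headroom" and HPA to its standard desired-replicas formula. The direction of the feedback is the point:

```python
import math

TOTAL_LOAD_M = 300       # fixed total CPU demand across the workload (millicores)
TARGET_UTILIZATION = 80  # HPA target, as a percent of requests

replicas, request_m = 2, 200
for step in range(5):
    usage_per_pod = TOTAL_LOAD_M / replicas
    utilization = 100 * usage_per_pod / request_m
    print(f"step {step}: replicas={replicas} request={request_m}m "
          f"usage/pod={usage_per_pod:.0f}m utilization={utilization:.0f}%")
    # HPA: desired = ceil(replicas * currentUtilization / target)
    replicas = max(replicas, math.ceil(replicas * utilization / TARGET_UTILIZATION))
    # Stand-in for VPA: shrink requests toward the (now lower) per-pod usage
    request_m = max(25, round(1.05 * TOTAL_LOAD_M / replicas))
```

With total load held constant, replicas climb while requests shrink to a fraction of their starting value. Neither controller is wrong by its own math, which is exactly why boundaries and metric separation are needed.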
The upstream Kubernetes documentation explicitly warns against running VPA and HPA together on the same resource metric. The controllers lack any coordination mechanism — VPA modifies requests based on historical usage, HPA scales based on current utilization as a percentage of those requests. When VPA changes requests, HPA’s math changes underneath it.
No Workload Context
VPA sees resource consumption numbers. It does not understand what those numbers mean. When memory usage climbs steadily, VPA cannot distinguish between a memory leak (where scaling up just delays the inevitable OOMKill), valid cache expansion (where the application will stabilize), or JVM heap behavior (where the GC cycle creates a sawtooth pattern).
This means VPA will scale up blindly for leaks (wasting money) or hesitate to scale down after a temporary spike (staying stuck at a high-water mark). It has no mechanism to ask “is this usage growth healthy or pathological?”
No Predictive Capability
VPA’s Recommender looks backward. It has no concept of time-of-day patterns, weekly traffic cycles, or seasonal spikes. A workload that reliably peaks at 9 AM every Monday gets no advance preparation — VPA will react to the spike after it happens, every single week.
This is arguably the most significant gap in VPA’s design. In-place pod resize made the enforcement mechanism fast enough for proactive scaling. The recommender logic has not caught up.
Kubernetes VPA Best Practices for Production
Start in Off Mode
Deploy every new VPA in Off mode first. Observe recommendations for at least one full traffic cycle (typically a week) before enabling enforcement. This prevents VPA from acting on incomplete data and gives you time to validate that its recommendations make sense for your workload.
Set minAllowed and maxAllowed Boundaries
Without boundaries, VPA can recommend arbitrarily large or small resource values. A runaway recommendation can either schedule your pod onto a node that is too expensive or starve it of resources during a transient usage dip.
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: "RequestsOnly"
Use resourcePolicy for Multi-Container Pods
If your pod has sidecar containers (Envoy proxy, log collectors, init containers), configure separate policies per container. Without this, VPA applies a blanket recommendation that may over-size sidecars or under-size the main application.
resourcePolicy:
  containerPolicies:
  - containerName: my-app
    minAllowed:
      cpu: "250m"
      memory: "512Mi"
    maxAllowed:
      cpu: "4"
      memory: "8Gi"
  - containerName: envoy-sidecar
    mode: "Off"
Setting a sidecar to mode: "Off" excludes it from VPA management entirely.
Choose controlledValues Deliberately
RequestsOnly (the default) adjusts requests while leaving limits untouched. This is safer if you have limits set as a safety net and don’t want VPA raising them. RequestsAndLimits adjusts both, maintaining the ratio between them. Use this only if you have a well-understood limit strategy and are confident VPA’s recommendations won’t push limits beyond what your nodes can satisfy.
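The ratio-preserving behavior of RequestsAndLimits is simple arithmetic. A sketch under the assumption that VPA scales the limit by the original request:limit proportion (the numbers here are invented for illustration):

```python
# Assumption for illustration: with RequestsAndLimits, VPA preserves the
# original request:limit proportion when it moves the request.
original_request_m, original_limit_m = 250, 500  # 1:2 ratio from the manifest
new_request_m = 400                              # VPA's new target
ratio = original_limit_m / original_request_m
new_limit_m = round(new_request_m * ratio)
print(f"new request: {new_request_m}m, new limit: {new_limit_m}m")
```

A limit of 800m may exceed what some nodes can allocate, which is the scheduling hazard described above.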
Separate VPA and HPA Metrics
If you run both VPA and HPA on the same deployment, do not let them share metrics. Configure VPA to manage CPU and memory requests while HPA scales on a different signal — queue depth, request rate, or a custom metric exposed through the Custom Metrics API.
# VPA handles sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "InPlaceOrRecreate"
---
# HPA handles replica count on a custom metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
Monitor VPA-Specific Prometheus Metrics
Track these metrics to understand Kubernetes VPA’s operational health:
- vpa_recommender_recommendation_latency — how long the Recommender takes to calculate recommendations. Rising latency suggests the Recommender is overloaded.
- vpa_updater_evictions_total — number of pod evictions triggered by VPA. Unexpected spikes indicate oscillating recommendations.
- kube_verticalpodautoscaler_status_recommendation_target — the current recommendation values, useful for dashboarding and alerting.
VPA with Cluster Autoscaler and Karpenter
VPA changes pod resource requests, which directly affects node scheduling demand. When VPA raises requests, pods may no longer fit on their current nodes. Cluster Autoscaler or Karpenter will detect unschedulable pods and provision new nodes accordingly. When VPA lowers requests, pods become smaller — enabling better bin-packing and, over time, node consolidation as Karpenter or Cluster Autoscaler removes underutilized nodes. This interaction is generally safe and beneficial, but be aware that VPA-driven request changes can trigger node churn if recommendations oscillate. Setting minAllowed/maxAllowed boundaries on VPA prevents the worst oscillation scenarios.
Match Update Mode to Workload Type
- Stateless, restart-tolerant services → Recreate or InPlaceOrRecreate
- Databases, caches, leader-election services → Initial or InPlaceOrRecreate (with appropriate resizePolicy)
- Batch jobs → Initial (sized once per run)
- Observing before committing → Off
Conclusion
Kubernetes VPA solves a real problem — resource requests in production clusters are almost always wrong, either over-provisioned (wasting capacity) or under-provisioned (risking OOMKills and throttling). The Vertical Pod Autoscaler’s histogram-based recommender provides data-driven sizing that adapts over time.
The challenge is that VPA’s recommendation logic is reactive, context-blind, and designed per-container rather than per-workload. In clusters where HPA also operates, the lack of coordination between the two controllers creates real production risk. In-place pod resize removes the disruption cost of applying recommendations, but it does not fix the recommendation quality.
For most teams, the practical path is: deploy Kubernetes VPA in Off mode, set minAllowed/maxAllowed boundaries, observe for a week, then enable InPlaceOrRecreate on Kubernetes 1.35+ for workloads where you trust the recommendations. Keep VPA and HPA on separate metrics. Monitor vpa_updater_evictions_total for oscillation. And accept that for workloads with complex resource profiles — JVM applications, GPU inference, or anything with strong time-of-day patterns — the Vertical Pod Autoscaler’s upstream recommender may never be sufficient on its own.
Frequently Asked Questions
What are the four VPA update modes?
Off (recommendation only, no pod changes), Initial (sets resources at pod creation, no mid-life updates), Recreate (evicts and recreates pods to apply changes), and InPlaceOrRecreate (attempts live resize, falls back to eviction). The Auto mode was deprecated in VPA 1.4.0 and is an alias for Recreate.
Can I run VPA and HPA together?
Yes, but you must use different metrics for each. Let VPA manage CPU and memory sizing while HPA scales replica count on a separate signal such as request rate or queue depth. Running both on CPU or memory creates a feedback loop where VPA changes requests and HPA’s percentage math destabilizes.
Does VPA support in-place pod resize?
Yes, since VPA 1.2+. The InPlaceOrRecreate update mode uses the Kubernetes resize subresource to patch pods without restarting them. This requires Kubernetes 1.33+ (beta) or 1.35+ (GA) and appropriate resizePolicy configuration on the container spec.
Is VPA safe for stateful applications?
It can be, with the right configuration. Use Initial mode (resources set only at pod creation) or InPlaceOrRecreate (live resize with eviction fallback). Avoid Recreate mode for stateful workloads where restarts disrupt caches, JIT compilation, or leader elections. Always set minAllowed and maxAllowed boundaries.
How long does VPA need to generate accurate recommendations?
The Recommender needs at least 24 to 48 hours of data before generating meaningful recommendations. For workloads with weekly traffic patterns, allow a full week. The Recommender retains 8 days of historical data by default.
What is the VPA/HPA death spiral?
A feedback loop where VPA lowers requests based on historical usage, which changes HPA’s utilization percentage calculation (same absolute usage / smaller requests = higher percentage). HPA scales out, spreading load across more pods. VPA’s per-container histograms see lower per-pod usage and recommend even smaller requests. The cycle accelerates until intervention.
Will VPA change my resource limits?
Only if you configure controlledValues: "RequestsAndLimits" in the resource policy. The default (RequestsOnly) adjusts requests and leaves limits unchanged.