
Kubernetes v1.35 Deep Dive: In-Place Resize GA, Gang Scheduling & the Cgroup v2 Cliff

Nic Vermandé

Kubernetes 1.34 gave us the building blocks: DRA went GA, PSI metrics landed in beta, swap support graduated. The primitives arrived. But v1.35 is different. This is the first release where it really feels like the project is leaning into its identity as an operating system kernel. Not a platform. Not an abstraction layer. A kernel.

Think about what Linux gives you: cgroups, mmap, epoll are powerful primitives, with zero opinions on how to use them. You want a database? Write it yourself. Kubernetes v1.35 follows that pattern. It hands you in-place cgroup mutation, structured device claims, coordinated pod placement. What it doesn’t hand you is the intelligence to drive them. That’s user space. That’s your problem.

The headline depends on who you are. Stateful workloads finally get In-Place Resize GA. AI/ML teams get Gang Scheduling alpha and continued DRA ecosystem refinement. Batch platforms get Opportunistic Batching, scheduling 10,000-pod Jobs in seconds. Platform engineers get… a migration mandate.

But there’s a recurring pattern: the native controllers still can’t use these primitives effectively. Gang Scheduling lands as alpha, but the native scheduler lacks queue intelligence. DRA is GA, but defining a ResourceClaim still requires fluency in CEL expressions that would make a compiler engineer wince.

Kubernetes v1.35 gives you the syscalls. The question is: who’s writing the user space?

This release also draws hard lines. cgroup v1 support is gone – not deprecated, removed. containerd 1.x reaches end of life. IPVS mode is formally deprecated. For platform teams running legacy node fleets, v1.35 is less of an upgrade and more of a modernization mandate. (If you’re still running CentOS 7 nodes… I’m sorry. This is going to hurt.)

Fair warning: this one’s long. I’m covering a lot—In-Place Resize GA, Gang Scheduling alpha, HPA improvements, security features like Structured Auth Config and Pod Certificates, breaking changes, the works. And yes, I’ll mention ScaleOps and where we fill the gaps.

That’s my reward for writing 5,000 words about KEPs. Deal with it 🙂.

Let’s dig in.

The End of the Restart Tax: In-Place Pod Resizing

The graduation of In-Place Pod Resizing to GA is arguably the most significant feature for stateful workload efficiency in Kubernetes history. It addresses the central inefficiency of resource management: the “Restart Tax.”

Tested on: Kubernetes v1.35.0-rc.1, Docker runtime 28.4.0, cgroup v2.

Container-Level Resize (KEP-1287) – GA

Historically, the resources field in a Pod spec was immutable. Changing a CPU limit from 1 core to 2 cores meant terminating the Pod and recreating it. This rendered VPA practically unusable for production-critical workloads because of multiple factors:

  • Disruption: Restarting a JVM clears JIT compilation. Similarly, restarting PostgreSQL triggers WAL replay, and restarting Redis flushes the cache.
  • Risk: The new Pod might fail to schedule (capacity), fail readiness probes (cold start), or land on a worse node.

That means engineers relegated VPA to “recommendation mode”, using it as a one-off sizing tool at deploy time. But apps have seasonal patterns, traffic spikes, and growth curves. So teams oversize requests to leave headroom, and clusters end up at 30-40% utilization. The tool that was supposed to optimize resources became a calculator you use once and ignore thereafter.

With KEP-1287 GA, the API server allows mutation of spec.containers[*].resources via the Pod resize subresource. Kubelet evaluates feasibility, and applies the change asynchronously. On our tested stack, the container kept running (no restartCount/containerID change) and we observed the memory cgroup limit change directly from inside the container.

Here’s how it works in practice:

# Resize via the new /resize subresource (required in GA)
kubectl patch pod my-app --subresource resize --type='merge' -p '{
  "spec": {
    "containers": [{
      "name": "app",
      "resources": {
        "requests": {"memory": "512Mi"},
        "limits": {"memory": "1Gi"}
      }
    }]
  }
}'

New status conditions can track the asynchronous process (they may be brief on a fast “happy path” – easiest to observe under resource pressure):

# Resize state (may be empty for fast resizes)
kubectl get pod my-app -o jsonpath='{.status.conditions}' \
| jq '.[] | select(.type | startswith("PodResize"))'

# Desired vs actual (what you asked vs what's applied)
kubectl get pod my-app -o jsonpath='{.spec.containers[0].resources}' | jq .
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].resources}' | jq .

# Advanced: what the node has admitted/allocated
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].allocatedResources}' | jq .

During a resize, Kubernetes maintains three distinct resource views that can temporarily diverge:

  • Desired: spec.containers[*].resources – what you asked for
  • Configured: status.containerStatuses[*].resources – what’s actually applied to the container
  • Admitted: status.containerStatuses[*].allocatedResources – what the node has committed

The PodResize* conditions exist but flash by quickly on successful resizes. To observe them reliably, force an infeasible scenario, for example by requesting more memory than the node can provide.
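
For example, here’s a sketch of forcing that state on the my-app pod from above (512Gi is just an arbitrarily large value – pick anything beyond your nodes’ allocatable memory):

# Deliberately infeasible resize – no node can satisfy this
kubectl patch pod my-app --subresource resize -p '{
  "spec": {
    "containers": [{
      "name": "app",
      "resources": {
        "requests": {"memory": "512Gi"},
        "limits": {"memory": "1024Gi"}
      }
    }]
  }
}'

# The conditions now persist long enough to inspect (expect something like PodResizePending with reason Infeasible)
kubectl get pod my-app -o jsonpath='{.status.conditions}' \
| jq '.[] | select(.type | startswith("PodResize"))'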

QoS Class Protection

A Pod’s QoS class (Guaranteed, Burstable, BestEffort) is determined at creation and is immutable by design. Consider the following scenario: you try to resize a Guaranteed pod (requests == limits) by increasing limits only, which would make it Burstable. What happens?

The API server rejects it:

Pod QOS Class may not change as a result of resizing

This is a guardrail. Kubernetes prevents you from accidentally changing a pod’s scheduling priority and eviction behavior mid-flight. Any resize that would shift QoS class: Guaranteed → Burstable, Burstable → Guaranteed, BestEffort → anything, is blocked.

So, when resizing, pick values that preserve the original QoS class rules. For Guaranteed pods, always resize both requests and limits together to maintain equality.
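
For example, a sketch for a Guaranteed pod (pod and container names are illustrative): bump CPU on requests and limits in the same patch so they stay equal:

# Guaranteed pod: move CPU from 1 to 2 cores – requests and limits stay equal, QoS class unchanged
kubectl patch pod my-guaranteed-app --subresource resize -p '{
  "spec": {
    "containers": [{
      "name": "app",
      "resources": {
        "requests": {"cpu": "2"},
        "limits": {"cpu": "2"}
      }
    }]
  }
}'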

The Memory Shrink Hazard

Increasing memory in-place is generally safe. Decreasing memory is where things get interesting, but not in the way you might expect.

If a container has a 4GB limit and is using 3GB, what happens when you resize to 2GB? You might expect an immediate OOM kill. In practice (observed on v1.35.0-rc.1), the kubelet is smarter:

  • The resize enters PodResizeInProgress with reason: Error
  • Messages like: attempting to set pod memory limit below current usage
  • The cgroup limit is not decreased – the container keeps running at the original limit

This is a protective behavior, but it leaves you in operational limbo: your desired spec says 2GB, the container is still running at 4GB, and the resize is stuck. Someone on your team will absolutely try this in prod and spend an hour wondering why the pod won’t converge. Don’t ask me how I know…

For predictable shrink behavior, use resizePolicy: RestartContainer for memory:

apiVersion: v1
kind: Pod
metadata:
  name: safe-resize-app
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired  # Hot resize — no restart
    - resourceName: memory
      restartPolicy: RestartContainer  # Restart on memory change — clean slate

This gives you hot CPU scaling while ensuring memory changes trigger a clean restart rather than a stuck resize.

Practical tip: When patching resources on a Pod with resizePolicy, prefer strategic merge patch, kubectl edit --subresource resize, or server-side apply. That’s because JSON merge patch (--type='merge') replaces arrays wholesale and can accidentally wipe your resizePolicy configuration.
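
For example, against the safe-resize-app Pod above, the default patch type (strategic merge) merges the containers list by name and leaves resizePolicy untouched (values are illustrative):

# No --type flag: kubectl defaults to strategic merge patch, so resizePolicy survives
kubectl patch pod safe-resize-app --subresource resize -p '{
  "spec": {
    "containers": [{
      "name": "app",
      "resources": {
        "requests": {"cpu": "750m"},
        "limits": {"cpu": "1500m"}
      }
    }]
  }
}'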

Pod-Level Resize (KEP-5419) – Alpha

Kubernetes 1.32 introduced pod-level resources (pod.spec.resources), useful for workloads with complex internal resource management. This includes sidecars that dynamically allocate memory, or init containers sharing resources with main containers.

KEP-5419 extends in-place resize to these pod-level resources. However, this is alpha and sits behind several feature gates – the API server rejects pod-level resize attempts unless they’re enabled.

# Three feature gates required (validated on v1.35.0-rc.1):
--feature-gates=PodLevelResources=true                          # Enables pod.spec.resources field
--feature-gates=InPlacePodLevelResourcesVerticalScaling=true    # Enables /resize for pod-level
--feature-gates=NodeDeclaredFeatures=true                       # Dependency — kubelet won't start without it

Note: Some distros may already have NodeDeclaredFeatures enabled by default, but on our Minikube RC1 validation it was required explicitly. The API server error message references InPlacePodLevelResourcesVerticalScalingEnabled, but that gate name doesn’t work. Use InPlacePodLevelResourcesVerticalScaling (without “Enabled”).

# Resize pod-level resources (when enabled)
kubectl patch pod my-app --subresource resize --type='json' -p='[
  {"op": "replace", "path": "/spec/resources/requests/memory", "value": "4Gi"},
  {"op": "replace", "path": "/spec/resources/limits/memory", "value": "8Gi"}
]'
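
For reference, here’s a minimal sketch of a Pod carrying pod-level resources that the patch above would target (field shape per the pod-level resources feature introduced in 1.32; images and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  resources:              # Pod-level budget shared across all containers
    requests:
      cpu: "2"
      memory: "2Gi"
    limits:
      memory: "4Gi"
  containers:
  - name: app
    image: myapp:latest
  - name: sidecar
    image: logging-agent:latest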

Native VPA: The Mechanism Works, The Intelligence Doesn’t

KEP-1287 provides the primitive. Native VPA has evolved to use it. updateMode: InPlaceOrRecreate patches running pods via the /resize subresource without eviction.

In our validation, the updater logged “In-place patched pod /resize subresource”, the pod UID and containerID stayed the same, restartCount remained 0, and requests/limits changed live.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: InPlaceOrRecreate  # Uses /resize when possible
    minReplicas: 1  # Default is 2 — single-replica workloads need this

There’s also an operational gotcha here. The VPA updater defaults to --min-replicas=2 as an availability safeguard, and won’t update workloads with fewer than 2 replicas. That means VPA calculates recommendations (visible in .status.recommendation), but never applies them. You’ll stare at the status wondering why nothing is happening. For single-replica workloads, explicitly set minReplicas: 1 in the VPA spec.
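
To see this in action, compare what the recommender computed with what a target pod actually got (a sketch; the app=my-app label selector is illustrative – use whatever labels your Deployment’s pods carry):

# Is the recommender producing anything?
kubectl get vpa my-app-vpa -o jsonpath='{.status.recommendation}' | jq .

# What did the target pods actually end up with?
kubectl get pods -l app=my-app -o jsonpath='{.items[0].spec.containers[0].resources}' | jq .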

So the mechanism works. Architecturally, the control logic is where it falls short:

  • Reactive Latency: VPA relies on Metrics Server (polled every 15-60 seconds) and a recommender analyzing historical averages. By the time it detects a spike and issues a patch, the OOM may have already occurred.
  • Lack of Context: VPA sees “high memory usage.” It doesn’t know if that’s a memory leak, valid cache expansion, or JVM heap behavior. It scales up blindly (cost waste) or hesitates to scale down (stuck at high-water mark).
  • No Predictive Capability: In-place resize is fast enough for proactive scaling. VPA only reacts to what already happened.

My bet: teams will enable InPlaceOrRecreate, celebrate the zero-downtime updates, then watch VPA recommend something nonsensical within a week. The API is production-ready. The recommender intelligence isn’t. (I’ll come back to who fills this gap later—you knew the plug was coming.)

Native Gang Scheduling (Alpha)

For distributed training, workloads have an “All-or-Nothing” requirement. A job needing 100 GPUs that gets 95 cannot start. If it holds those 95 while waiting for 5 more, it creates deadlocks and starves other jobs.

Previously, this required external schedulers like Volcano or Kueue. Kubernetes v1.35 introduces Gang Scheduling directly via the new Workload API.

Validated on: v1.35.0-rc.1 (Minikube). We confirmed gang semantics work: on a single-node cluster with 7 allocatable CPUs, a gang requiring 8 CPUs kept all pods Pending (no partial allocation), while a non-gang Job with the same requirements partially scheduled.

Enabling Gang Scheduling requires two feature gates on both the API server and scheduler. GenericWorkload enables the new Workload API, and GangScheduling activates the scheduler plugin that enforces all-or-nothing placement. You also need to enable the alpha API group:

# kube-apiserver (alpha API + gates)
--feature-gates=GenericWorkload=true,GangScheduling=true
--runtime-config=scheduling.k8s.io/v1alpha1=true

# kube-scheduler (gates; plugin enabled by default when gate is on)
--feature-gates=GenericWorkload=true,GangScheduling=true

A Workload defines pod groups and their gang policies. Pods link to the Workload via spec.workloadRef, a first-class Pod field, not an annotation. The scheduler correlates all pods referencing the same Workload and holds them until the entire gang can be placed:

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: distributed-training
  namespace: ml-workloads
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 10  # All-or-nothing: need all 10 to start
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  parallelism: 10
  completions: 10
  template:
    spec:
      workloadRef:           # Real Pod field, not an annotation
        name: distributed-training
        podGroup: workers
      containers:
      - name: trainer
        image: pytorch/pytorch:latest
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

The scheduler sees spec.workloadRef and holds all pods until it can schedule the entire gang. No partial allocation, no deadlock, no wasted cycles spinning on 7 of 10 workers.
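
To confirm the behavior yourself, check that the workers stay Pending as a group rather than partially scheduling (a sketch; it assumes the Job runs in ml-workloads next to its Workload, uses the job-name label the Job controller adds automatically, and assumes the Workload resource is exposed as workloads under scheduling.k8s.io):

# All 10 workers should be Pending together until the full gang fits – never a partial mix
kubectl get pods -n ml-workloads -l job-name=pytorch-training

# Inspect the Workload object the scheduler correlates them against
kubectl get workloads.scheduling.k8s.io -n ml-workloads distributed-training -o yaml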

The Native Scheduler Gap

While v1.35 adds the API, the native scheduler’s implementation is basic compared to Volcano or Kueue:

  • No queue management or fair-share policies
  • No sophisticated backfill capabilities
  • No preemption intelligence for gang workloads

For serious AI supercomputing, external orchestrators remain necessary to manage the economics of the queue, even as Kubernetes manages the mechanics of the gang. (You can probably guess where this is going.)

Opportunistic Batching: Faster Scheduling for Homogeneous Workloads (Beta)

When you submit 1,000 identical pods, the scheduler traditionally evaluates each one individually. KEP-5598 introduces opportunistic batching to speed this up – but it’s not what you might expect.

What batching actually is: A small, opportunistic cache inside the scheduler. When it schedules a pod, it may keep the ranked node list briefly. For the very next pod with the same signature, it returns a hint: “try node X first.” It’s not “schedule 1,000 pods at once”. It’s “maybe skip some work for the next identical pod.”

This is Beta and enabled by default in v1.35. We confirmed on v1.35.0-rc.1:

# Feature enablement confirmed
kubernetes_feature_enabled{name="OpportunisticBatching",stage="BETA"} 1

Two Hurdles to Actually Getting Batching

Hurdle 1: Pods must be “signable.” The scheduler computes a signature from fields that affect placement. If any plugin can’t produce a signature fragment, the whole signature is nil and the pod becomes “not batchable.”

We hit this immediately: PodTopologySpread refused to sign pods when system-default topology constraints were enabled. Every pod became unbatchable—not because our pods differed, but because a default plugin blocked signatures.

The fix: Disable system-default constraints if you want batching for batch workloads:

# /etc/kubernetes/kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: List
      defaultConstraints: []

Hurdle 2: Pods must “fill” nodes. The scheduler only reuses the cached ordering if the previous chosen node becomes infeasible for the next pod. If the node still has room, it flushes the cache (node_not_full) because reusing the hint might cause suboptimal packing.

Translation: Batching works cleanly for “fat” pods (one per node). For “tiny” pods that pack densely, batching can’t safely reuse hints.

Diagnostic Metrics

Monitor these patterns (names validated on RC1):

  • scheduler_batch_attempts_total{result="hint_used"} – Batching is working
  • scheduler_batch_cache_flushed_total{reason="node_not_full"} – Pods too small; batching can’t help
  • scheduler_batch_cache_flushed_total{reason="pod_not_batchable"} – Signatures blocked (check plugin defaults)

Long story short, batching shines for GPU-hungry distributed training jobs where each worker fills a node. For microservices that pack 50-to-a-node, don’t expect benefits – and that’s by design.

This pairs naturally with Gang Scheduling: a distributed training job with node-filling workers gets gang semantics for all-or-nothing placement and batching for faster scheduling decisions.

HPA Gets Granular: Configurable Tolerance (Beta)

The global 10% HPA tolerance has been a pain point forever. For a 1000-replica deployment, that’s 100 pods of dead zone where HPA won’t react.

KEP-4951, graduating to Beta in v1.35, lets you configure tolerance per-HPA, and differently for scale-up vs scale-down. This is how it’s configured:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 10
  maxReplicas: 500
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      tolerance: 0.02  # 2% — respond faster to traffic spikes
      stabilizationWindowSeconds: 60
    scaleDown:
      tolerance: 0.15  # 15% — conservative scale-down, avoid thrashing
      stabilizationWindowSeconds: 300

The asymmetric pattern (tight scale-up, loose scale-down) maps to how humans handle incidents: scale up on smoke, scale down on proof.

On feature gates: The HPAConfigurableTolerance gate exists, but on v1.35.0-rc.1 the API accepts tolerance fields even with the gate disabled. The gate likely controls controller behavior, not schema admission. If you’re on v1.35, the fields should just work.

Practical Gotchas

Tolerance is stored as Quantity, not float. You write tolerance: 0.02, but the API canonicalizes it to 20m. Similarly, 0.15 becomes 150m. If your diffs look “weird” after apply, this is why. Normalize to the canonical form in your manifests if it bothers you.

Two HPAs targeting the same workload lead to a silent failure. If two HPAs select the same pods, the controller refuses to act with AmbiguousSelector. It looks like “HPA is broken” but it’s actually a guardrail preventing conflicting scaling decisions.
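
To confirm that’s what you’re hitting, check the HPA’s conditions (a sketch; expect ScalingActive set to False with a reason along the lines of AmbiguousSelector):

# The guardrail shows up as a status condition, not an event storm
kubectl get hpa web-frontend -o jsonpath='{.status.conditions}' \
| jq '.[] | select(.type == "ScalingActive")'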

Expect a warm-up period. Right after creating an HPA, you may see transient warnings like “did not receive metrics for targeted pods” while metrics-server caches catch up. Don’t over-interpret the first minute.

Breaking Changes: The Modernization Cliff

Kubernetes v1.35 draws hard lines. Unlike API version removals that can be mitigated by rewriting manifests, these changes target the Kubelet-to-kernel interface. They require infrastructure-level action.

Cgroup v1 Removal (KEP-5573)

This is not a deprecation warning. It’s a hard failure mode.

Default Behavior: If the Kubelet detects cgroup v1 on startup, expect it to fail. Treat this as an infrastructure upgrade, not a manifest change.

Check your nodes this way:

stat -fc %T /sys/fs/cgroup

# cgroup2fs = You're good (cgroup v2)
# tmpfs = You have a problem (cgroup v1)

We validated all our v1.35.0-rc.1 nodes are cgroup v2. If your fleet includes cgroup v1 nodes, validate the exact failure mode on your infrastructure before upgrading.

A failCgroupV1: false option exists in KubeletConfiguration, but using it is operationally dangerous. It defers the inevitable and locks you out of features that require v2 (Memory QoS, certain swap configurations, PSI metrics).

Impact on Legacy Fleets: CentOS 7 (EOL June 2024), RHEL 7, Ubuntu 18.04 – all default to cgroup v1. And even on modern distros, a Kubelet still configured with cgroupDriver: cgroupfs instead of systemd is usually a leftover from a cgroup v1-era setup – audit those nodes before you upgrade.
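
A quick fleet audit goes a long way here (these node fields are standard, so this works on any cluster):

# OS image and kernel per node – EOL distros are your cgroup v1 suspects
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage,KERNEL:.status.nodeInfo.kernelVersion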

Why This Matters Beyond Compliance: Cgroup v2’s unified hierarchy unlocks Pressure Stall Information (PSI), which acts as a “killer metric” for autoscaling by telling you not just that CPU is high, but that processes are stalling waiting for CPU. A real game changer for your operational dashboards.

Containerd 1.x: Final Warning (KEP-4033)

Kubernetes v1.35 is the last version supporting containerd 1.x. In 1.36, it’s gone.

Check your runtime versions this way:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

Containerd 2.0 removes support for Docker Schema 1 images. Ancient images (pushed 5+ years ago) still lurking in production manifests will fail to pull.

So scan before upgrading:

# Find Schema 1 images in your registries (--raw returns the manifest, which carries schemaVersion)
skopeo inspect --raw docker://your-registry/old-image:tag | jq '.schemaVersion'

Configuration Breaking Changes: Containerd 2.0 removes deprecated registry.configs and registry.auths structures in config.toml. Automated node upgrade scripts injecting old configs will crash the runtime. Treat containerd 2.0 as a compatibility change, not just a version bump.

IPVS Mode Deprecation (KEP-5495)

For years, IPVS was the go‑to recommendation for very large Kubernetes clusters because iptables didn’t scale.

But keeping IPVS behavior perfectly aligned with iptables semantics and newer Service features turned out to be complex and awkward for kube‑proxy. The long‑term path is nftables: a more modern, programmable backend that fixes iptables’ scaling issues and is intended to eventually replace both iptables and IPVS as the primary kube‑proxy mode on Linux.

Check your kube-proxy mode:

# Check your kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# If mode: ipvs, plan migration:
# 1. Test nftables in staging
# 2. For smaller clusters, iptables may suffice
# 3. IPVS removal targeted for 1.38

If you explicitly set IPVS, you’re on a deprecation clock. Start testing now, but it’s not a v1.35 emergency.
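
When you do migrate, the change itself is small – it lives in the kube-proxy configuration. A sketch for a kubeadm-style cluster (the ConfigMap layout may differ on managed distributions):

# KubeProxyConfiguration section of the kube-proxy ConfigMap
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "nftables"   # was "ipvs"

After editing the ConfigMap, roll the kube-proxy DaemonSet (kubectl -n kube-system rollout restart daemonset kube-proxy) and verify Services still resolve before touching the next cluster.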

Image Pull Credential Verification (KEP-2535) – Beta, Default Enabled

A long-standing multi-tenant security gap is finally closed. Previously, if Tenant A pulled a private image with valid credentials, Tenant B could use that cached image without any credentials – Kubelet only verified on first pull.

In v1.35, the KubeletEnsureSecretPulledImages feature is enabled by default. We confirmed this via kubelet metrics on our v1.35.0-rc.1 nodes. Kubelet now re-validates credentials for every pod, even if the image is cached locally.

Impact: Image cache is no longer a “free pass” in multi-tenant clusters. If a pull secret expires or rotates, pods that previously started fine (due to caching) will now fail with ImagePullBackOff. So, monitor pull secret expiry and treat cache-dependent startups as a bug.
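
Two quick checks that catch this early (nothing here is version-specific):

# Pods that were coasting on the image cache and now fail the credential check
kubectl get pods -A | grep -E 'ImagePullBackOff|ErrImagePull'

# Recent pull failures with reasons – look for authorization errors rather than network ones
kubectl get events -A --field-selector reason=Failed | grep -i pull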

If you need to adjust verification policy, it’s configured via KubeletConfiguration. On most clusters, this lives in a ConfigMap (often kubelet-config-* in kube-system) or as a file on each node:

# Check current kubelet config (if using ConfigMap)
kubectl get cm -n kube-system -l k8s-app=kubelet -o yaml | grep imagePull

# Or on the node itself
cat /var/lib/kubelet/config.yaml | grep imagePull

The available policies:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imagePullCredentialsVerificationPolicy: AlwaysVerify
# Options:
#   AlwaysVerify - Check credentials for every pod (default in v1.35)
#   NeverVerify - Old behavior, skip verification for cached images
#   NeverVerifyAllowlistedImages - Skip only for specific image patterns

WebSocket Streaming: New RBAC Requirement

When you run kubectl exec, attach, or port-forward, you’re opening a bidirectional stream between your terminal and a container. Kubernetes originally used SPDY (a Google protocol that predates HTTP/2) for this streaming. But SPDY is deprecated, poorly supported by modern proxies and load balancers, and increasingly problematic in production environments.

Kubernetes has been migrating these streaming connections to WebSocket (stable, widely supported, proxy-friendly). In v1.35, this transition includes a security tightening: the WebSocket upgrade (from a plain HTTP connection) is now treated as a create action on the subresource, not just get.

Impact: RBAC policies that worked with SPDY might not grant WebSocket upgrade permissions. If “read-only” roles suddenly can’t exec into pods after upgrading, it’s RBAC, not networking.

To check whether a service account has the required permissions, use kubectl auth can-i with the subresource flag:

# Test if a role can exec after the change
kubectl auth can-i create pods --subresource=exec --as=system:serviceaccount:default:my-sa -n default

# Previously only needed get:
kubectl auth can-i get pods --subresource=exec --as=system:serviceaccount:default:my-sa -n default

The fix: Grant create on pods/exec (and similarly for attach, portforward) to roles that need interactive access. You can temporarily disable this check with AuthorizePodWebsocketUpgradeCreatePermission=false, but plan to update your RBAC policies.
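
A minimal Role that restores interactive access under the new semantics might look like this (name and namespace are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-interactive-access
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods/exec", "pods/attach", "pods/portforward"]
  verbs: ["get", "create"]   # create is what the WebSocket upgrade now requires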

Deployment Rollouts: Tracking Terminating Pods (Beta)

Ever seen a rolling update trigger QuotaExceeded errors despite having capacity? The Deployment controller was ignoring terminating pods when counting replicas.

The Scenario:

  1. 100 replicas, maxSurge: 25%
  2. Controller creates 25 new pods (125 total)
  3. Controller terminates 25 old pods (100 running + 25 terminating)
  4. Controller sees “100 running” and creates 25 MORE new pods
  5. Reality: 100 running + 25 terminating + 25 new = quota explosion

v1.35 improves rollout observability with .status.terminatingReplicas (Beta, enabled by default). We validated this field exists and works on v1.35.0-rc.1:

# Check terminating count during rollouts
kubectl get deployment my-app -o jsonpath='{.status.terminatingReplicas}'

Under slow termination and tight quota, the deployment can hit quota failures (ReplicaFailure). Now you can at least see the overlap:

# Watch all replica states during a rollout
kubectl get deployment my-app -o jsonpath='{.status.replicas} ready={.status.readyReplicas} terminating={.status.terminatingReplicas}'

Note: Some rollout behavior knobs proposed upstream (like podReplacementPolicy) may land in a later release – don’t assume they’re available in v1.35. We tested and the API server rejects podReplacementPolicy as an unknown field on v1.35.0-rc.1.

Security: Structured Authentication Config (GA)

Authentication configuration via --oidc-* flags has been rigid: single provider, requires restart to change, limited validation. Kubernetes v1.35 graduates Structured Authentication Configuration to GA.

The new approach uses a dedicated config file that defines JWT issuers, claim mappings, and validation rules. Here’s an example supporting two identity providers: a corporate IdP for humans and a GitLab instance for CI/CD pipelines:

# /etc/kubernetes/auth-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AuthenticationConfiguration
jwt:
# Production IdP
- issuer:
    url: https://okta.example.com
    audiences:
    - production-cluster
  claimMappings:
    username:
      expression: 'claims.email.split("@")[0]'
    groups:
      expression: 'claims.groups.map(g, "okta:" + g)'
  claimValidationRules:
  - expression: 'claims.exp - claims.iat <= 3600'
    message: "Token lifetime cannot exceed 1 hour"

# CI/CD IdP  
- issuer:
    url: https://gitlab.example.com
    audiences:
    - ci-cluster
  claimMappings:
    username:
      claim: preferred_username
    groups:
      claim: roles

To use this config, pass it to the API server via the --authentication-config flag. This replaces the legacy --oidc-* flags entirely:
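
# kube-apiserver flag (path matches the config file above)
--authentication-config=/etc/kubernetes/auth-config.yaml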

What you get:

  • Multiple JWT providers simultaneously (no more “one IdP per cluster”)
  • Config reload without API server restart (designed for dynamic updates)
  • CEL expressions for claim validation and transformation
  • Advanced claim mappings with custom logic

We confirmed the feature is enabled by default via metrics (StructuredAuthenticationConfiguration=1). If you’re still juggling multiple OIDC flag sets and restarting API servers every time you add a provider, this is the upgrade to prioritize.

Quick Hits: Other Notable Features

I’m not going to give every feature the full treatment, because you’d stop reading (and it’s quite long already), and I’d stop caring…

Here’s what else matters:

Coordinated Container Restarts (Alpha)

Kubernetes is adding first-class restart semantics for multi-container Pods via restartPolicyRules. When tightly-coupled containers need to restart together (worker-1 crashes, worker-2 should restart too), this provides the primitive. Useful for distributed training pods with shared state. Feature gate: RestartAllContainersOnContainerExits (off by default on our RC1 cluster – treat as alpha plumbing you must explicitly enable).

Constrained Impersonation (Alpha)

Granular impersonation control is coming. Instead of “can impersonate anyone” or “can impersonate no one,” you’ll be able to limit to specific service accounts and actions. Directionally important for least-privilege security. Feature gate: ConstrainedImpersonation (alpha, off by default).

Pod Certificates (Beta)

Native pod identity is inching toward “no sidecar required” for basic mTLS flows. Kubelet generates certs, requests signing, mounts them via projected volumes (podCertificate), and auto-rotates. We validated the API surface exists, but the gate was off by default on our RC1 cluster. Feature gate: PodCertificateRequest (may be distro-dependent).

Extended Toleration Operators (Alpha)

Tolerations can now express numeric intent with Gt and Lt operators. You can say “only schedule on nodes with SLA > 95%” and auto-evict if it drops. The comparison is numeric, so your taint value must be parseable as a number. Feature gate: TaintTolerationComparisonOperators (alpha, off by default).

Node Declared Features (Alpha)

Nodes can advertise their supported feature gates via status.declaredFeatures. The scheduler uses this to avoid placing pods on nodes that can’t run them – a real answer for mixed-version clusters during upgrades. On our RC1 cluster, the field was empty (gate off by default). Don’t assume you’ll see .status.declaredFeatures populated until you explicitly enable it. Feature gate: NodeDeclaredFeatures.
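
A quick check across the fleet (field name as described above; empty output simply means the gate isn’t enabled yet):

# Per-node declared features – expect blanks until NodeDeclaredFeatures is enabled
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.declaredFeatures}{"\n"}{end}'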

Primitives Need Intelligence

A few years ago we expected upstream Kubernetes to deliver production-grade autoscaling and intelligent scheduling. Native VPA would get smarter, HPA would understand seasonality, and the scheduler would learn topology economics. That hasn’t happened at production scale.

Kubernetes v1.35 clarifies the trajectory: Kubernetes is the kernel, not the user space.

The project focuses on robust, low-level primitives:

  • In-Place Resize: The syscall to change cgroups
  • DRA: The ioctl to map hardware
  • Gang Scheduling: The semaphore for coordinated allocation
  • PSI Metrics: Kernel counters that tell you about contention and pressure

But the logic to drive these primitives effectively, like when to resize, where to place AI workloads, and how to bin-pack for cost, is increasingly left to the user. Native controllers (HPA, VPA, default scheduler) are becoming reference implementations, not production-grade optimization engines. And that’s probably the right design decision for upstream, but the optimization logic is now on you.

Here’s where we come in:

The Gap in In-Place Resize

Native VPA now supports InPlaceOrRecreate. The mechanism works, but the recommender doesn’t. It applies a single global percentile and treats every workload the same.

ScaleOps adds the missing control plane:

  • Granular Policy: Detect workload profiles and apply different recommendation and update strategies for databases, latency-sensitive services, and batch jobs.
  • Burst Reaction: VPA is a historian, ScaleOps is a first responder. Our platform detects real-time spikes and adapts instantly to prevent throttling or OOMs.
  • Node Context: Every resize is validated against node capacity and disruption risk to prevent Pending pods, avoid unnecessary node additions, and protect cluster performance and stability.

The Gap in AI Infrastructure

DRA improves hardware binding, but a pod that requests nvidia.com/gpu: 1 often locks the entire device even if it uses 20-30% of it. That’s the GPU waste tax most clusters pay.

ScaleOps AI Infra shifts from static allocation to dynamic, workload-aware optimization:

  • Continuous Rightsizing: We track actual GPU memory and compute consumption to identify the real workload’s footprint, not the guess you made at deployment time.
  • Dynamic GPU Sharing: We safely co-locate compatible workloads on the same GPU based on their actual behavior, without the static rigidity of MIG or the performance unpredictability of naive time-slicing.
  • Workload-Level Visibility: Turn opaque GPU allocation into workload-level cost attribution so teams can act on real numbers.

The Gap in Horizontal Scaling

HPA configurable tolerance helps, but HPA is fundamentally reactive and tied to raw request signals, which causes thrashing and late reactions.

ScaleOps decouples the logic:

  • Predictive Scaling: The platform learns about workload seasonality and automatically allocates capacity before predictable demand spikes.
  • Stability without hand-tuning: defaults are asymmetric (fast up / slow down) and remain consistent even as pod sizing evolves. This leads to less churn, better bin packing, and fewer runaway nodes.

The Gap in Node Optimization

Finally, primitives like PodDisruptionBudgets (PDB) often become blockers that freeze your cluster topology. ScaleOps Smart Pod Placement analyzes the context of these “unevictable” pods and safely consolidates them to improve node savings that the scheduler typically leaves on the table.

Kubernetes v1.35 Upgrade Checklist

If you’re the unlucky person owning the upgrade runbook, here’s the checklist I’d actually paste into my internal doc:

Before Upgrade

  • 🔴 BLOCKER – Cgroup v2 on all nodes: stat -fc %T /sys/fs/cgroup → must show cgroup2fs
  • 🔴 BLOCKER – Containerd 2.0+: kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.containerRuntimeVersion}'
  • 🟠 HIGH – No Docker Schema 1 images: scan registries with skopeo inspect for schemaVersion: 1
  • 🟠 HIGH – kube-proxy not using IPVS: kubectl get cm kube-proxy -n kube-system -o yaml | grep mode
  • 🟠 HIGH – containerd config.toml updated: remove deprecated registry.configs entries. Don’t discover this during a Friday night node cycle.
  • 🟠 HIGH – Image pull secrets valid: verify secrets aren’t expired – the credential check is now mandatory
  • 🟠 HIGH – RBAC for exec/attach: ensure roles have the create verb for pods/exec, pods/attach
  • 🟡 MEDIUM – Audit for removed beta APIs: run kubepug or pluto against manifests

After Upgrade

  • Verify PSI available: cat /proc/pressure/memory on nodes
  • Test in-place resize: create a test pod and patch it via --subresource resize
  • Check scheduling batching: monitor the scheduler_batch_attempts_total{result="hint_used"} metric

Summary: Feature Maturity Reference

  • In-Place Pod Resize (container) – KEP-1287, GA: Zero-disruption vertical scaling
  • In-Place Pod Resize (pod-level) – KEP-5419, Alpha: Future extension for complex pods
  • Opportunistic Batching – KEP-5598, Beta (default): Faster scheduling for node-filling pods
  • Gang Scheduling – KEP-4671, Alpha: Distributed training coordination
  • HPA Configurable Tolerance – KEP-4951, Beta: Granular autoscaling control
  • Structured Auth Config – KEP-3331, GA: Multi-provider OIDC, no restarts
  • Pod Certificates – KEP-4317, Beta: Native mTLS without sidecars
  • Image Pull Verification – KEP-2535, Beta (default): Multi-tenant security
  • Constrained Impersonation – KEP-5284, Alpha: Least-privilege impersonation
  • Extended Toleration Operators – KEP-5471, Alpha: SLA-based scheduling
  • Node Declared Features – KEP-5328, Alpha: Safe mixed-version upgrades
  • Deployment Terminating Pods – KEP-3973, Beta: Rollout observability
  • Coordinated Container Restart – KEP-5532, Alpha: Distributed workload reliability
  • Cgroup v1 Removal – KEP-5573, Enforced: Upgrade blocker
  • IPVS Deprecation – KEP-5495, Deprecated: Start planning the nftables migration


Features Graduating from 1.34

Several features covered in our Kubernetes 1.34 blog continue their graduation path:

  • DRA Core (KEP-4381) – Went GA in 1.34, feature gate now locked in v1.35 (cannot be disabled)
  • PSI Metrics (KEP-4870) – Beta in 1.34, continues refinement in v1.35
  • Swap Support (KEP-2400) – GA in 1.34, stable in v1.35
  • Image Volumes (KEP-4639) – Beta in 1.33, stable path continues
  • User Namespaces (KEP-127) – On-by-default beta in 1.33, continues hardening

Conclusion

Kubernetes v1.35 is the release where the primitives finally feel “good enough.” In-Place Resize is GA. Gang Scheduling has real semantics. The kernel is mature. But this release also makes something clear: Kubernetes won’t ship the user-space optimization logic for you. Native VPA still recommends like it’s 2019. The scheduler still lacks queue intelligence. The syscalls are ready. Who’s writing the control loops is up to you.

If you’re upgrading (and you should be), here’s the shortlist: validate cgroup v2 on every node, confirm containerd 2.0+, test the /resize subresource on a real workload, audit RBAC for the new create verb on exec/attach, and verify your image pull secrets actually work under mandatory credential checks. The full checklist is above. Print it out. Tape it to someone’s monitor.

And if you’d rather not build that user space yourself? That’s what we do at ScaleOps. We handle the rightsizing decisions, horizontal scaling stability, and node/GPU bin-packing so you don’t have to. What stays yours: PDBs, deployment strategies, rollout policies, and the final call on what “good enough” actually means for your workloads.

Thanks for making it through 5,000 words on KEPs. Seriously. If this saved you a weekend of upgrade surprises or helped you explain cgroup v2 to your manager, it was worth writing.

Now go upgrade something. And if it breaks, you know where to find me.
