chore(deps): update kube client dependencies #2019

Description

@elezar

Problem Statement

OpenShell's Rust Kubernetes client stack is significantly behind current upstream releases. The workspace currently pins kube and kube-runtime to 0.90, and k8s-openapi to 0.21.1 with the Kubernetes v1_26 generated API feature. Current upstream releases are kube/kube-runtime/kube-client/kube-core 4.0.0 and k8s-openapi 0.28.0, so updating is a multi-major compatibility task rather than a mechanical lockfile refresh.

The goal of this spike is to define the scope and risks for updating the Kubernetes client dependencies while preserving OpenShell's Kubernetes gateway behavior, sandbox lifecycle management, service-account bootstrap authentication, certificate generation, and Kubernetes e2e coverage.

Technical Context

The Kubernetes dependency surface is intentionally small but security-sensitive. openshell-driver-kubernetes uses kube-rs to construct in-cluster or inferred clients, create/list/get/delete/watch Agent Sandbox CRs, read Kubernetes Events, and list Nodes for GPU capacity checks. openshell-server uses kube-rs for the generate-certs Kubernetes Secret workflow and for the in-cluster ServiceAccount TokenReview bootstrap authenticator.

Upstream version checks performed during this spike:

  • cargo info kube@4.0.0: latest kube is 4.0.0, released with Rust MSRV 1.88.0; this matches OpenShell's workspace rust-version = "1.88".
  • cargo info kube-runtime@4.0.0, kube-client@4.0.0, kube-core@4.0.0: related kube-rs crates are aligned at 4.0.0.
  • cargo info k8s-openapi@0.28.0: latest k8s-openapi is 0.28.0; available Kubernetes feature flags are v1_32 through v1_36, with latest = v1_36.
  • kube-rs 4.0.0 release notes: adds Kubernetes v1_36 support via k8s-openapi 0.28, enables regular client retries by default, changes timeout behavior, makes client tracing opt-in, and preserves the prior ErrorResponse to Status migration from the 3.x line.
  • k8s-openapi v0.28.0 release notes: adds v1_36, drops support for Kubernetes 1.31, and lists corresponding API server versions v1.32.13 through v1.36.2.

Version Target Summary

Option | kube stack | k8s-openapi | Kubernetes API features | Kubernetes API EOL dates | YAML dependency status | Assessment
--- | --- | --- | --- | --- | --- | ---
Current | kube/kube-runtime/kube-client 0.90.0 | 0.21.1; workspace selects v1_26 | Crate supports v1_24-v1_29; OpenShell currently selects v1_26 | v1_24: 2023-07-28; v1_25: 2023-10-28; v1_26: 2024-02-28; v1_27: 2024-07-16; v1_28: 2024-10-22; v1_29: 2025-02-28 | kube-client depends on serde_yaml; current lockfile resolves serde_yaml 0.9.34+deprecated | Historical baseline only. The selected v1_26 API feature is long past EOL and the stack keeps the deprecated YAML parser.
Lower-churn fallback | kube/kube-runtime/kube-client 2.0.1 | 0.26.0 | v1_30-v1_34 | v1_30: 2025-07-15; v1_31: 2025-11-11; v1_32: 2026-02-28; v1_33: 2026-06-28; v1_34: 2026-10-27 | Still depends on serde_yaml; does not move to serde_saphyr | Useful if maintainers want a smaller dependency jump, but it leaves both the YAML cleanup and later kube-rs API churn for a follow-up.
Best pre-4.0 stepping stone | kube/kube-runtime/kube-client 3.1.0 | 0.27.0 | v1_31-v1_35 | v1_31: 2025-11-11; v1_32: 2026-02-28; v1_33: 2026-06-28; v1_34: 2026-10-27; v1_35: 2027-02-28 | Still depends on serde_yaml; does not move to serde_saphyr | Strongest intermediate target before 4.0. It absorbs major API migrations while avoiding the kube 4.0 default retry/read-timeout/tracing behavior changes.
Latest / YAML cleanup target | kube/kube-runtime/kube-client 4.0.0 | 0.28.0 | v1_32-v1_36 | v1_32: 2026-02-28; v1_33: 2026-06-28; v1_34: 2026-10-27; v1_35: 2027-02-28; v1_36: 2027-06-28 | Removes serde_yaml and uses serde-saphyr/serde_saphyr through kube-client | Cleanest dependency posture and the first target that removes the deprecated YAML parser, but includes kube 4.0 behavior changes around retries, timeouts, and tracing.

The version choice depends on whether this work is primarily a compatibility modernization or a dependency cleanup. If the goal is to de-risk the kube-rs API migration, 3.1.0 is the best initial target before 4.0. If removing serde_yaml is in scope for this spike, 4.0.0 is the first version that actually replaces it with serde_saphyr.

Required validation: do not treat the k8s-openapi feature range as the OpenShell runtime support matrix by itself. k8s-openapi selects the generated Rust Kubernetes API schema, while OpenShell's documented runtime minimum is Kubernetes 1.29+ with RBAC enabled. The implementation must validate the selected kube/k8s-openapi target against that documented minimum, or explicitly update the docs and release notes if maintainers decide to raise the minimum supported Kubernetes version.
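The validation rule above can be expressed as a small check. The following is a minimal sketch, not OpenShell code: FeatureRange, Coverage, and check_documented_minimum are hypothetical names, and the ranges used in the usage note are the ones listed in the Version Target Summary.

```rust
/// Inclusive range of Kubernetes 1.x minors that a k8s-openapi release
/// exposes as `v1_NN` feature flags (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FeatureRange {
    pub min_minor: u32,
    pub max_minor: u32,
}

/// Result of comparing the documented runtime minimum with a feature range.
#[derive(Debug, PartialEq, Eq)]
pub enum Coverage {
    /// A generated API feature exists for the documented minimum.
    Covered,
    /// The documented minimum is older than every generated API feature, so
    /// it must be validated against a real API server or the docs must change.
    NeedsValidation { oldest_feature_minor: u32 },
}

/// Compare the documented minimum Kubernetes minor with a feature range.
pub fn check_documented_minimum(documented_min_minor: u32, range: FeatureRange) -> Coverage {
    if documented_min_minor >= range.min_minor {
        Coverage::Covered
    } else {
        Coverage::NeedsValidation {
            oldest_feature_minor: range.min_minor,
        }
    }
}
```

With the documented minimum of 1.29, the current range (v1_24-v1_29) is covered, while the 0.26.0, 0.27.0, and 0.28.0 ranges all return NeedsValidation, which is the case the paragraph above asks implementers to handle explicitly.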

Affected Components

Component | Key Files | Role
--- | --- | ---
Workspace dependency pins | Cargo.toml, Cargo.lock | Pins kube, kube-runtime, and k8s-openapi; selects the generated Kubernetes API feature.
Kubernetes compute driver | crates/openshell-driver-kubernetes/src/driver.rs, crates/openshell-driver-kubernetes/Cargo.toml | Creates and watches Agent Sandbox CRs, lists Nodes for GPU validation, maps Kubernetes Events to platform events, and converts sandbox specs into Kubernetes JSON.
Gateway Kubernetes auth | crates/openshell-server/src/auth/k8s_sa.rs, crates/openshell-server/src/lib.rs | Builds in-cluster clients, validates projected ServiceAccount tokens via TokenReview, reads Pods and owning Sandbox CRs, and enables the sandbox JWT bootstrap path.
Gateway cert generation | crates/openshell-server/src/certgen.rs | Creates and reads Kubernetes Secret resources for gateway TLS and sandbox JWT signing materials.
Kubernetes docs and e2e | docs/reference/sandbox-compute-drivers.mdx, crates/openshell-driver-kubernetes/README.md, .github/workflows/e2e-kubernetes-test.yml, tasks/test.toml | Documents Kubernetes behavior and runs Kind-based Kubernetes e2e coverage.

Technical Investigation

Architecture Overview

The gateway selects the Kubernetes compute runtime through ComputeRuntime::new_kubernetes, after parsing [openshell.drivers.kubernetes] and applying gateway defaults. The driver constructs a normal kube client plus a separate watch client with no read timeout, then creates dynamic Api<DynamicObject> handles for the Agent Sandbox CRD. Lifecycle RPCs call get, list, create, delete, and watcher::watcher; Kubernetes Events are watched in parallel and translated into progress/platform events.

The ServiceAccount bootstrap path is constructed only when the gateway is running in-cluster and has a sandbox JWT issuer. It uses kube::Client::try_default(), Api<TokenReview>, Api<Pod>, and Api<DynamicObject> to verify a sandbox pod's projected token, live pod UID, ownerReference, owning Sandbox CR UID, and sandbox-id label before minting a gateway JWT.
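The order of checks described above can be sketched without any Kubernetes client types. This is an illustrative model only: every struct, field, and function name below is hypothetical, and the supported apiVersion/kind pairs are passed in by the caller because this issue does not list them.

```rust
/// Identity asserted by a successful TokenReview (illustrative fields only).
pub struct TokenIdentity {
    pub pod_uid: String,
}

pub struct OwnerRef {
    pub api_version: String,
    pub kind: String,
    pub uid: String,
}

/// The pod as currently read back from the API server.
pub struct LivePod {
    pub uid: String,
    pub owner_refs: Vec<OwnerRef>,
}

pub struct SandboxCr {
    pub uid: String,
    pub sandbox_id_label: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootstrapError {
    PodUidMismatch,
    NoSupportedOwner,
    SandboxUidMismatch,
    MissingSandboxId,
}

/// Run the checks in order and return the sandbox id only if all pass.
pub fn verify_bootstrap(
    token: &TokenIdentity,
    pod: &LivePod,
    sandbox: &SandboxCr,
    supported_owners: &[(&str, &str)],
) -> Result<String, BootstrapError> {
    // 1. The token must belong to the pod that is live right now.
    if token.pod_uid != pod.uid {
        return Err(BootstrapError::PodUidMismatch);
    }
    // 2. The pod must be owned by a supported Sandbox apiVersion/kind.
    let owner = pod
        .owner_refs
        .iter()
        .find(|o| {
            supported_owners
                .iter()
                .any(|(api_version, kind)| o.api_version == *api_version && o.kind == *kind)
        })
        .ok_or(BootstrapError::NoSupportedOwner)?;
    // 3. The ownerReference must point at the Sandbox CR that was fetched.
    if owner.uid != sandbox.uid {
        return Err(BootstrapError::SandboxUidMismatch);
    }
    // 4. The CR must carry a non-empty sandbox-id label.
    sandbox
        .sandbox_id_label
        .as_deref()
        .filter(|id| !id.is_empty())
        .map(str::to_owned)
        .ok_or(BootstrapError::MissingSandboxId)
}
```

Keeping the decision logic free of client types like this is one way to make it testable across the dependency update; whether the real authenticator is structured this way is not stated in this issue.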

The certgen path uses Client::try_default() and typed Api<Secret> operations to implement idempotent TLS/JWT Secret creation in Helm hook contexts.
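The idempotent read-then-create pattern can be modeled against an in-memory store. This is a sketch under stated assumptions: SecretStore, MemoryStore, and ensure_secret are hypothetical stand-ins for the typed Api<Secret> calls, and treating a create conflict as "already present" is an assumption about how a race between two hook runs would be handled, not something this issue confirms.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    /// Stand-in for an HTTP 409 from the API server.
    Conflict,
    Other(String),
}

/// Minimal stand-in for the get_opt/create subset of a Secret API.
pub trait SecretStore {
    fn get_opt(&self, name: &str) -> Option<Vec<u8>>;
    fn create(&mut self, name: &str, data: Vec<u8>) -> Result<(), CreateError>;
}

#[derive(Default)]
pub struct MemoryStore {
    secrets: HashMap<String, Vec<u8>>,
}

impl SecretStore for MemoryStore {
    fn get_opt(&self, name: &str) -> Option<Vec<u8>> {
        self.secrets.get(name).cloned()
    }

    fn create(&mut self, name: &str, data: Vec<u8>) -> Result<(), CreateError> {
        if self.secrets.contains_key(name) {
            return Err(CreateError::Conflict);
        }
        self.secrets.insert(name.to_string(), data);
        Ok(())
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Created,
    AlreadyPresent,
}

/// Create the secret only if it is absent; never overwrite existing material.
pub fn ensure_secret<S: SecretStore>(
    store: &mut S,
    name: &str,
    generate: impl FnOnce() -> Vec<u8>,
) -> Result<Outcome, CreateError> {
    if store.get_opt(name).is_some() {
        return Ok(Outcome::AlreadyPresent);
    }
    match store.create(name, generate()) {
        Ok(()) => Ok(Outcome::Created),
        // Assumption: another writer won the race, so the secret now exists.
        Err(CreateError::Conflict) => Ok(Outcome::AlreadyPresent),
        Err(other) => Err(other),
    }
}
```

Running ensure_secret twice with the same name returns Created and then AlreadyPresent, and the second call leaves the stored bytes untouched. That is the property to re-verify after the dependency update changes how a conflict is reported.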

Code References

Location | Description
--- | ---
Cargo.toml:111 | Workspace pins kube = { version = "0.90", features = ["runtime", "derive"] }.
Cargo.toml:112 | Workspace pins kube-runtime = "0.90".
Cargo.toml:113 | Workspace pins k8s-openapi = { version = "0.21.1", features = ["v1_26"] }.
crates/openshell-driver-kubernetes/Cargo.toml:26 | Kubernetes driver depends on workspace kube.
crates/openshell-driver-kubernetes/Cargo.toml:27 | Kubernetes driver depends on workspace kube-runtime.
crates/openshell-driver-kubernetes/Cargo.toml:28 | Kubernetes driver depends on workspace k8s-openapi.
crates/openshell-server/Cargo.toml:31 | Server depends on workspace kube.
crates/openshell-server/Cargo.toml:32 | Server depends on workspace k8s-openapi.
crates/openshell-driver-kubernetes/src/driver.rs:57 | Maps KubeError::Api(api).code == 409 to AlreadyExists; kube-rs 3.x/4.x prefers Status helpers such as conflict/not-found predicates.
crates/openshell-driver-kubernetes/src/driver.rs:211 | Builds in-cluster or inferred kube config for driver clients.
crates/openshell-driver-kubernetes/src/driver.rs:219 | Sets explicit connect/read/write timeouts for normal client operations.
crates/openshell-driver-kubernetes/src/driver.rs:225 | Sets separate watch-client timeouts, including read_timeout = None. kube 4.0 changed default timeout behavior, so this should be revalidated.
crates/openshell-driver-kubernetes/src/driver.rs:259 | Builds dynamic Agent Sandbox CR APIs from GroupVersionKind and ApiResource.
crates/openshell-driver-kubernetes/src/driver.rs:271 | Lists typed Node resources to validate GPU capacity.
crates/openshell-driver-kubernetes/src/driver.rs:302 | Fetches individual Sandbox CRs and handles Kubernetes 404s.
crates/openshell-driver-kubernetes/src/driver.rs:338 | Lists Sandbox CRs and converts dynamic objects to driver sandboxes.
crates/openshell-driver-kubernetes/src/driver.rs:381 | Creates dynamic Sandbox CRs with generated JSON spec.
crates/openshell-driver-kubernetes/src/driver.rs:469 | Deletes Sandbox CRs and treats Kubernetes 404 as already deleted.
crates/openshell-driver-kubernetes/src/driver.rs:510 | Checks Sandbox existence via api.get.
crates/openshell-driver-kubernetes/src/driver.rs:525 | Watches Sandbox CRs and Kubernetes Event resources with watcher::watcher.
crates/openshell-driver-kubernetes/src/driver.rs:729 | Maps Kubernetes Event objects to driver platform events.
crates/openshell-driver-kubernetes/src/driver.rs:747 | Converts Event timestamps with t.0.timestamp_millis(); kube/k8s-openapi changed timestamp internals from chrono to jiff in later releases.
crates/openshell-server/src/lib.rs:315 | Enables K8s SA bootstrap only in-cluster and only when sandbox JWT issuing is enabled.
crates/openshell-server/src/lib.rs:321 | Builds default kube client for the K8s SA bootstrap authenticator.
crates/openshell-server/src/lib.rs:729 | Constructs the Kubernetes compute runtime.
crates/openshell-server/src/auth/k8s_sa.rs:158 | Builds typed TokenReview/Pod APIs and dynamic Sandbox CR API.
crates/openshell-server/src/auth/k8s_sa.rs:195 | Creates TokenReview objects through the Kubernetes API.
crates/openshell-server/src/auth/k8s_sa.rs:223 | Reads the sandbox pod with get_opt.
crates/openshell-server/src/auth/k8s_sa.rs:261 | Reads the owning Sandbox CR with get_opt.
crates/openshell-server/src/auth/k8s_sa.rs:371 | Validates pod ownerReferences against supported Agent Sandbox apiVersion/kind.
crates/openshell-server/src/auth/k8s_sa.rs:403 | Validates owning Sandbox CR UID and sandbox-id label.
crates/openshell-server/src/certgen.rs:123 | Constructs kube client and typed Secret API for Kubernetes cert generation.
crates/openshell-server/src/certgen.rs:140 | Reads existing JWT Secret with get_opt.
crates/openshell-server/src/certgen.rs:161 | Creates JWT Secret with typed Api<Secret>::create.
crates/openshell-server/src/certgen.rs:181 | Reads existing TLS Secrets with get_opt.
crates/openshell-server/src/certgen.rs:244 | Creates TLS and JWT Secrets in Kubernetes mode.
.github/workflows/e2e-kubernetes-test.yml:115 | Runs the Kind-backed Kubernetes e2e workflow.
docs/reference/sandbox-compute-drivers.mdx:286 | Published docs currently describe the Agent Sandbox CR integration. Update if supported Kubernetes/API assumptions change.

Current Behavior

OpenShell currently compiles against kube-rs 0.90 and k8s-openapi 0.21.1. The generated Kubernetes API surface is selected with k8s-openapi feature v1_26. Kube API errors are matched directly through KubeError::Api(api).code, and Kubernetes Event timestamps are treated as chrono-like values with timestamp_millis().
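Because the timestamp type behind Event objects changes across the update, it may help to pin the expected output with a conversion that does not depend on either time library. The sketch below is an assumption-laden illustration: unix_millis is a hypothetical helper, and it takes seconds plus sub-second nanoseconds as plain integers rather than calling any chrono or jiff accessor.

```rust
/// Convert a Unix timestamp given as whole seconds plus sub-second
/// nanoseconds (0..1_000_000_000) into whole milliseconds.
///
/// Returns `None` if the result does not fit in an `i64`.
pub fn unix_millis(seconds: i64, subsec_nanos: u32) -> Option<i64> {
    let whole = seconds.checked_mul(1_000)?;
    // Sub-millisecond precision is truncated, not rounded.
    whole.checked_add(i64::from(subsec_nanos / 1_000_000))
}
```

A unit test can compare the value produced by map_kube_event_to_platform against a fixed expectation such as this one for a known timestamp, both before and after the update, so a silent change in units or rounding shows up as a test failure.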

The driver has explicit 30-second API operation timeouts and a watch client with no read timeout. kube 4.0 changes default timeout/retry behavior upstream, so the implementation should verify that OpenShell's explicit timeout strategy still bounds non-watch calls and does not unintentionally multiply retries around the gateway's existing tokio::time::timeout wrappers.
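The interaction between client-level retries and an outer deadline is simple arithmetic, sketched below. RetryBudget is a hypothetical type and the attempt count and backoff are made-up inputs; the actual kube 4.0 retry policy values are not stated in this issue and should be taken from the upstream release notes.

```rust
use std::time::Duration;

/// Hypothetical description of a client retry policy.
pub struct RetryBudget {
    /// Total attempts, including the first one.
    pub attempts: u32,
    /// Timeout applied to each individual attempt.
    pub per_attempt_timeout: Duration,
    /// Fixed pause between attempts (a simplification of real backoff).
    pub backoff: Duration,
}

impl RetryBudget {
    /// Upper bound on elapsed time if every attempt runs to its timeout.
    pub fn worst_case(&self) -> Duration {
        let pauses = self.attempts.saturating_sub(1);
        self.per_attempt_timeout * self.attempts + self.backoff * pauses
    }

    /// True if the whole retry sequence can finish inside the outer deadline.
    pub fn fits_within(&self, outer_deadline: Duration) -> bool {
        self.worst_case() <= outer_deadline
    }
}
```

For example, three attempts at 30 seconds each with a 1 second pause have a 92 second worst case, so a 30 second outer wrapper would cut the sequence off during the first attempt. The outer wrapper still bounds latency in that case, but the retries never get a chance to help, which is the kind of interaction the paragraph above asks implementers to check.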

What Would Need to Change

  • Update workspace dependency pins in Cargo.toml and refresh Cargo.lock for kube, kube-runtime, and k8s-openapi. Do not update only one of these; kube-rs release notes explicitly say to upgrade k8s-openapi with kube to avoid conflicts.
  • Choose the k8s-openapi Kubernetes feature target deliberately. 0.28.0 supports v1_32 through v1_36; the current workspace uses v1_26, so this may change the documented minimum tested Kubernetes API surface even if OpenShell only uses stable core resources.
  • Replace direct KubeError::Api(api).code matching with the current kube-rs Status-based helpers or equivalent non-deprecated checks. Affected cases include 409 conflict and 404 not found handling in the Kubernetes driver, and any API-version probing/fallback code if PR #2009 (feat(kubernetes): support agent-sandbox v1beta1) lands first.
  • Update Kubernetes Event timestamp handling for k8s-openapi's chrono to jiff transition. map_kube_event_to_platform should continue producing millisecond Unix timestamps.
  • Recheck kube::Config timeout fields and Client::try_from/Client::try_default construction against kube 4.0. Explicit OpenShell timeouts should remain intentional after kube's default read-timeout changes.
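  • Revalidate watcher behavior. The current code assumes watcher::watcher(...).try_next() yields Event::Applied, Event::Deleted, and Event::Restarted variants for both Sandbox CRs and Events. Later kube-rs releases reworked the watcher Event enum (Apply and Delete, with Init, InitApply, and InitDone in place of Restarted, as far as the upstream changelog indicates), so every match arm on these variants needs to be revisited.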
  • Confirm typed Kubernetes resources still compile and serialize as expected: Node, Event, Pod, TokenReview, TokenReviewSpec, TokenReviewStatus, UserInfo, Secret, ObjectMeta, and ByteString.
  • Update crate README and published docs only if the dependency update changes the supported Kubernetes minor range, the recommended Agent Sandbox install, or runtime behavior visible to operators.
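One way to limit the blast radius of the error-handling change in the list above is to route every status check through a single classifier, so only one place has to learn the new kube-rs error shape. The sketch below is hypothetical: ApiFailure, classify, and delete_result are illustrative names, and the numeric code is assumed to be extracted from whatever error or Status value the chosen kube-rs version returns.

```rust
/// Driver-level view of a failed Kubernetes API call (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ApiFailure {
    AlreadyExists,
    NotFound,
    Other(u16),
}

/// Map an HTTP status code from the API server to a driver-level failure.
pub fn classify(code: u16) -> ApiFailure {
    match code {
        409 => ApiFailure::AlreadyExists,
        404 => ApiFailure::NotFound,
        other => ApiFailure::Other(other),
    }
}

/// Delete is idempotent: `None` means the call succeeded, and a 404 means
/// the object was already gone, so both count as success.
pub fn delete_result(failure_code: Option<u16>) -> Result<(), ApiFailure> {
    match failure_code.map(classify) {
        None | Some(ApiFailure::NotFound) => Ok(()),
        Some(other) => Err(other),
    }
}
```

With this shape, the create path maps 409 to AlreadyExists and the delete path treats 404 as success, matching the behavior listed in the Code References table, while the extraction of the code itself stays in one adapter function.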

Alternative Approaches Considered

  1. Jump directly to latest: kube/kube-runtime 4.0.0 and k8s-openapi 0.28.0. This is the cleanest dependency posture and matches OpenShell's current Rust 1.88 MSRV, but it requires resolving all API breaks and deciding whether the Kubernetes feature target should be v1_32, v1_36, or another supported minor.
  2. Stage through an intermediate major, such as kube 2.0.1 or 3.1.0 from the Version Target Summary. This may make compile errors easier to isolate, but it creates extra dependency churn and still leaves OpenShell behind current kube-rs.
  3. Update only k8s-openapi or only kube-rs. This is not recommended; kube-rs release notes repeatedly warn to upgrade k8s-openapi with kube to avoid conflicts.

Patterns to Follow

  • Keep Kubernetes dependencies centralized in workspace dependencies, as they are today.
  • Preserve explicit operation timeouts around Kubernetes API calls in the driver. The code currently wraps get/list/create/delete calls with tokio::time::timeout and uses a separate no-read-timeout watch client.
  • Keep Kubernetes API use contained to openshell-driver-kubernetes and the narrow server paths that already require it: certgen and ServiceAccount bootstrap auth.
  • Keep unit coverage close to the affected code, following existing tests in crates/openshell-driver-kubernetes/src/driver.rs, crates/openshell-server/src/auth/k8s_sa.rs, and crates/openshell-server/src/certgen.rs.
  • Run Kubernetes e2e for behavioral validation, not only compile/unit tests. The risk is integration behavior against a real API server and Agent Sandbox controller.

Proposed Approach

Update the kube-rs stack together in one branch, targeting the current upstream major unless maintainers choose an intermediate version for compatibility reasons. Start by changing workspace dependency pins and selecting a k8s-openapi feature target, then fix compile breaks in the Kubernetes driver, K8s SA authenticator, and certgen paths. Treat kube-rs retry/timeout changes as behavior changes to verify, not just compile fallout. Once unit tests pass, run the Kubernetes e2e path against Kind and confirm sandbox create/watch/delete, ServiceAccount bootstrap, certgen hook behavior, and Kubernetes Event progress mapping.

Scope Assessment

  • Complexity: Medium
  • Confidence: Medium - dependency graph is small, but the update crosses multiple kube-rs major releases and changes generated Kubernetes API versions.
  • Estimated files to change: 5-10
  • Issue type: chore

Risks & Open Questions

  • Which k8s-openapi feature should OpenShell use after the update: v1_32, v1_36, or another supported minor? This determines the generated API surface and may affect documented Kubernetes compatibility.
  • Does OpenShell need to continue claiming support for clusters older than the k8s-openapi 0.28 feature range? If yes, maintainers may need to choose an older kube-rs target or document the new tested minimum.
  • The current published Kubernetes setup docs require Kubernetes 1.29+ with RBAC enabled. Any target whose generated API feature no longer includes v1_29 must be validated against a Kubernetes 1.29 API server, or the documented minimum must be raised intentionally.
  • kube 4.0 enables regular client retries by default. Confirm this does not interact badly with OpenShell's explicit 30-second tokio::time::timeout wrappers or produce surprising latency under API server failures.
  • kube 4.0 changes default read timeout behavior. Confirm non-watch operations remain bounded and watch streams remain healthy.
  • k8s-openapi timestamp internals changed through the chrono to jiff migration. Ensure Kubernetes Event progress timestamps remain correct.
  • Coordinate with PR #2009 (feat(kubernetes): support agent-sandbox v1beta1) if it lands first, because that PR touches the same dynamic Sandbox API construction and API-error fallback paths for Agent Sandbox v1beta1/v1alpha1 support.
  • LSM impact: none expected. This dependency update does not touch process identity, /proc, file labels, binary execution, or inter-process visibility. SELinux/AppArmor-sensitive behavior should be limited to existing sandbox runtime/e2e environments, not kube client calls themselves.

Test Considerations

  • Run mise exec -- cargo check -p openshell-driver-kubernetes and mise exec -- cargo check -p openshell-server after dependency changes to catch API breaks quickly.
  • Run mise exec -- cargo test -p openshell-driver-kubernetes --lib for driver conversion, event mapping, GPU validation, and Kubernetes spec rendering tests.
  • Run mise exec -- cargo test -p openshell-server auth::k8s_sa --lib for ServiceAccount bootstrap logic.
  • Run mise exec -- cargo test -p openshell-server certgen --lib or the relevant server unit subset for Kubernetes Secret generation logic.
  • Run the Kubernetes e2e path with mise run e2e:kubernetes, or rely on the test:e2e-kubernetes CI workflow if local Kind/k3d is unavailable.
  • Validate the chosen dependency target against the documented Kubernetes minimum, currently Kubernetes 1.29+ with RBAC enabled. At minimum, render the Helm chart for Kubernetes 1.29 and run a Kubernetes e2e smoke test against a 1.29-compatible API server, covering sandbox create/watch/delete, TokenReview bootstrap, Secret certgen, Node listing, Event watching, and Agent Sandbox CR access.
  • Validate supervisor sideload behavior across the version boundary: Kubernetes < 1.35 should render/use init-container; Kubernetes >= 1.35 should render/use image-volume unless explicitly overridden. If testing image-volume on 1.33 or 1.34, the ImageVolume feature gate must be enabled.
  • If PR #2009 (feat(kubernetes): support agent-sandbox v1beta1) lands first, include e2e coverage for both Agent Sandbox API generations or ensure the existing version matrix continues to pass.
  • No docs/reference/gateway-config.mdx update is expected unless the update changes gateway TOML fields, driver-specific config options, defaults, or Helm rendering. Update docs/reference/sandbox-compute-drivers.mdx, crates/openshell-driver-kubernetes/README.md, or Kubernetes setup docs if minimum supported Kubernetes/API assumptions change.
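The supervisor sideload expectation in the list above can be captured as a small table-driven helper for e2e assertions. This is a sketch only: SideloadMode, expected_sideload_mode, and image_volume_gate_required are hypothetical names, and the gate result for 1.35 and later is inferred from image-volume being the default there rather than stated outright.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SideloadMode {
    InitContainer,
    ImageVolume,
}

/// Expected sideload mode for a Kubernetes 1.x minor, honoring an explicit
/// override when one is configured.
pub fn expected_sideload_mode(minor: u32, explicit: Option<SideloadMode>) -> SideloadMode {
    explicit.unwrap_or(if minor >= 35 {
        SideloadMode::ImageVolume
    } else {
        SideloadMode::InitContainer
    })
}

/// Whether the ImageVolume feature gate must be enabled to test image-volume.
/// `None` means the minor is outside what this issue describes.
pub fn image_volume_gate_required(minor: u32) -> Option<bool> {
    match minor {
        33 | 34 => Some(true),
        // Inferred: image-volume is the default from 1.35 onward.
        m if m >= 35 => Some(false),
        _ => None,
    }
}
```

An e2e matrix can call expected_sideload_mode with the cluster minor and compare it with the rendered pod spec, and use image_volume_gate_required to decide whether the Kind cluster config needs the feature gate.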

Created by spike investigation. Use build-from-issue to plan and implement.
