feat(recipes): update AKS H100 Dynamo recipe to match working cluster state #700
Merged
mchmarny merged 9 commits on May 4, 2026
ayuskauskas
reviewed
Apr 28, 2026
9617e29 to
539d299
539d299 to
9bf84c2
Contributor
Author
|
This branch has been rebased onto Conflicts resolved during rebase:
|
Contributor
|
@Jont828 this PR now has merge conflicts with |
Signed-off-by: Jont828 <jt572@cornell.edu>
Plan to update the h100-aks-ubuntu-inference-dynamo recipe to match the working AKS ND-H100 cluster state. Covers version bumps (GPU operator v26.3.0, network operator v26.1.0, dynamo v1.0.1), NicClusterPolicy manifest for MOFED + RDMA device plugin, NFD NodeFeatureRule for Mellanox IB detection, and dynamo values rewrite for the 1.0 schema.

Signed-off-by: Jont828 <jt572@cornell.edu>
… state

Brings h100-aks-ubuntu-inference-dynamo in sync with a working AKS ND-H100 cluster running Dynamo. Changes span the full RDMA stack and Dynamo v0.9 → v1.0.1 migration.

RDMA / InfiniBand (aks.yaml + network-operator manifests):
- Add NFD NodeFeatureRule manifest (hook-weight 1): labels ConnectX-7 nodes (device IDs 101c/101e) with pci-15b3.present=true; replaces chart's generic Ethernet class rule (deployNodeFeatureRules: false)
- Add NicClusterPolicy manifest (hook-weight 5): MOFED driver doca3.2.0-25.10-1.2.8.0-2, RDMA shared device plugin (hca_shared_devices_a, rdmaHcaMax 1000, vendor 15b3/mlx5_core), DOCA telemetry; folds standalone rdma-shared-dp-ds DaemonSet into CR
- Add ib-node-config DaemonSet manifest: loads IB kernel modules and sets LimitMEMLOCK=infinity on containerd/kubelet systemd units; node selector changed from pool-specific label to pci-15b3.present for portability; TODO comment for Skyhook replacement
- Add network-operator/values-aks.yaml with deployNodeFeatureRules: false
- Version pins: gpu-operator v26.3.0, network-operator v26.1.0, kube-prometheus-stack 83.7.0
- gpu-operator/values-aks.yaml: add driver.rdma.useHostMofed: true (use MOFED from network-operator, skip peermem source build), nfd.enabled: false (avoid duplicate NFD DaemonSets)

Dynamo v0.9 → v1.0.1:
- Remove dynamo-crds componentRef (v1.0 platform manages CRDs itself)
- Rewrite dynamo-platform/values.yaml for v1.0 global.* subchart keys: grove.install: true, kai-scheduler.install: false / enabled: true (use AICR-managed external instance), etcd.install: false
- Remove nats.enabled: false (NATS is active in v1.0; operator injects natsAddress into DynamoGraphDeployments)
- Fix Prometheus endpoint: kube-prometheus-kube-prome-prometheus (derived from fullnameOverride: kube-prometheus in AICR values)
- Restore NATS JetStream PVC override: storageClassName: managed-csi
- registry.yaml: dynamo-platform defaultVersion 0.9.1 → 1.0.1

Docs: replace implementation plan with post-implementation summary covering RDMA stack layers, Dynamo migration rationale, and version table.

Signed-off-by: Jont828 <jt572@cornell.edu>
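For orientation, the first two RDMA layers named in this commit could look roughly like the sketch below. It is not the content of the merged manifests. The device IDs, vendor ID, label, hook weights, MOFED version, resource name, and rdmaHcaMax come from the commit message. The metadata name of the NodeFeatureRule, the rule name, the Helm hook type, the fully qualified label key, and every image name, repository, and device plugin version are assumptions or placeholders.

```yaml
# Sketch of an NFD rule that labels nodes exposing Mellanox PCI devices.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: mellanox-ib-aks                        # hypothetical name
  annotations:
    helm.sh/hook: post-install,post-upgrade    # assumed hook type
    helm.sh/hook-weight: "1"                   # weight from the commit message
spec:
  rules:
    - name: mellanox-connectx-present          # hypothetical rule name
      labels:
        # Assumed fully qualified form of pci-15b3.present=true
        feature.node.kubernetes.io/pci-15b3.present: "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["15b3"]}
            device: {op: In, value: ["101c", "101e"]}
---
# Sketch of the NicClusterPolicy that network-operator reconciles.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  annotations:
    helm.sh/hook: post-install,post-upgrade    # assumed hook type
    helm.sh/hook-weight: "5"                   # weight from the commit message
spec:
  ofedDriver:
    image: doca-driver                         # assumed image name
    repository: nvcr.io/nvidia/mellanox        # assumed repository
    version: doca3.2.0-25.10-1.2.8.0-2         # version from the commit message
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin          # assumed image name
    repository: registry.example.invalid/rdma  # placeholder repository
    version: v0.0.0-placeholder                # placeholder version
    config: |
      {
        "configList": [
          {
            "resourceName": "hca_shared_devices_a",
            "rdmaHcaMax": 1000,
            "selectors": {
              "vendors": ["15b3"],
              "drivers": ["mlx5_core"]
            }
          }
        ]
      }
```

The DOCA telemetry section is omitted because the commit message does not give its settings.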
Skyhook is being renamed to Nodewright. Nodewright is adding AKS support (NVIDIA/nodewright-packages#39) but RDMA/MEMLOCK setup is not yet included. Update the TODO comment to track the right upstream project.

Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Jont828 <jt572@cornell.edu>
The AKS recipe unconditionally deploys the full RDMA/InfiniBand stack via network-operator. Users running H100s without InfiniBand can now opt out at bundle time with:

  --set networkoperator:enabled=false \
  --set gpuoperator:driver.rdma.useHostMofed=false

Add CheckHostMofedWithoutNetworkOperator validation that warns when network-operator is disabled but useHostMofed is not overridden.

Add a ConfigMap kill switch to the ib-node-config DaemonSet so operators can disable host-level RDMA setup at runtime without redeploying.
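One way a ConfigMap kill switch of this kind can work is sketched below. This is a generic pattern and not the merged ib-node-config-aks.yaml. The ConfigMap name, namespace, key, mount path, pod labels, image, polling interval, and fully qualified node label key are all hypothetical. The real host-level steps (loading kernel modules, writing systemd drop-ins) are replaced by an echo, and the privileges they would need are left out.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ib-node-config-switch        # hypothetical name
  namespace: network-operator        # hypothetical namespace
data:
  enabled: "true"                    # set to "false" to skip host-level RDMA setup
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ib-node-config
  namespace: network-operator        # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: ib-node-config            # hypothetical label
  template:
    metadata:
      labels:
        app: ib-node-config
    spec:
      nodeSelector:
        # Assumed fully qualified form of pci-15b3.present=true
        feature.node.kubernetes.io/pci-15b3.present: "true"
      containers:
        - name: config
          image: busybox:1.36        # placeholder image
          command:
            - /bin/sh
            - -c
            - |
              # Re-read the switch on every pass so that edits to the
              # ConfigMap take effect without restarting the pod.
              while true; do
                if [ "$(cat /etc/ib-node-config/enabled 2>/dev/null)" = "false" ]; then
                  echo "kill switch set: skipping host RDMA setup"
                else
                  echo "placeholder: load ib_umad/rdma_ucm, set LimitMEMLOCK=infinity"
                fi
                sleep 60
              done
          volumeMounts:
            - name: switch
              mountPath: /etc/ib-node-config
              readOnly: true
      volumes:
        - name: switch
          configMap:
            name: ib-node-config-switch
```

Mounting the ConfigMap as a volume, rather than reading it through an environment variable, is what lets the switch change at runtime: the kubelet refreshes mounted ConfigMap files, while environment variables are fixed when the container starts.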
fc54e87 to
e5a85bb
The assert-recipe.yaml had grove listed twice in componentRefs, causing a slice length mismatch (18 expected vs 17 actual) in the E2E test.
ayuskauskas
approved these changes
May 1, 2026
Summary
Updates the h100-aks-ubuntu-inference-dynamo recipe to reproduce the state of a working AKS ND-H100 cluster running Dynamo. Adds the full RDMA/InfiniBand stack and migrates Dynamo from v0.9 to v1.0.1.
Motivation / Context
The recipe was missing the RDMA stack required for GPUDirect on AKS ND-series nodes, had no version pins, and targeted Dynamo v0.9 (now v1.0.1 with a restructured values schema). This change makes aicr bundle produce a deployable configuration that matches a known-good cluster.
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
docs/, examples/
recipes/
Implementation Notes
RDMA stack (5 layers):
- NFD NodeFeatureRule (nfd-network-rule.yaml, hook-weight 1): labels ConnectX-7 nodes (PCI device IDs 101c/101e, vendor 15b3) with pci-15b3.present=true. Replaces the chart's generic Ethernet class rule (deployNodeFeatureRules: false in values-aks.yaml).
- NicClusterPolicy (nic-cluster-policy-aks.yaml, hook-weight 5): reconciled by network-operator to deploy the MOFED driver (doca3.2.0-25.10-1.2.8.0-2), the RDMA shared device plugin (rdma/hca_shared_devices_a, rdmaHcaMax: 1000), and DOCA telemetry. Folds the standalone rdma-shared-dp-ds DaemonSet that was manually deployed on the working cluster into the CR.
- ib-node-config DaemonSet (ib-node-config-aks.yaml): host-level setup. Loads the ib_umad/rdma_ucm kernel modules and sets LimitMEMLOCK=infinity on the containerd/kubelet systemd units so container processes inherit unlimited memlock (no IPC_LOCK needed in pod specs). Node selector uses pci-15b3.present=true (portable) rather than the working cluster's pool-specific label. Includes a TODO to replace with Nodewright (formerly Skyhook) once AKS RDMA support lands upstream (NVIDIA/nodewright-packages#39).
- driver.rdma.useHostMofed: true: tells GPU operator to use the MOFED installed by network-operator instead of building nvidia-peermem from source. nfd.enabled: false prevents duplicate NFD DaemonSets.
- Version pins: gpu-operator: v26.3.0, network-operator: v26.1.0, kube-prometheus-stack: 83.7.0, matching the working cluster.
Dynamo v0.9 → v1.0.1:
- dynamo-crds componentRef removed: the v1.0 platform chart manages CRDs itself.
- dynamo-platform/values.yaml rewritten for v1.0 global.* subchart keys: grove.install: true (Dynamo manages grove), kai-scheduler.install: false / enabled: true (use AICR-managed external instance), etcd.install: false (k8s-native discovery).
- nats.enabled: false removed: NATS is active in v1.0; the operator injects natsAddress into DynamoGraphDeployment workloads.
- Prometheus endpoint fixed: kube-prometheus-kube-prome-prometheus (derived from fullnameOverride: kube-prometheus in AICR's kube-prometheus-stack values).
- NATS JetStream PVC override restored: storageClassName: managed-csi.
- registry.yaml dynamo-platform defaultVersion: 0.9.1 → 1.0.1.
Testing
make test
make lint
Recipe compilation and bundle inspection against the working cluster confirmed correct values structure.
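As a point of reference for the values structure mentioned here, the three values files might contain fragments along the lines of the sketch below. The key names and settings are taken from the implementation notes. The nesting is an assumption: the PR text does not show where deployNodeFeatureRules sits in network-operator/values-aks.yaml, and placing the Dynamo subchart keys directly under global is one reading of "global.* subchart keys". None of this is copied from the merged files.

```yaml
# gpu-operator/values-aks.yaml (sketch)
driver:
  rdma:
    useHostMofed: true     # use MOFED installed by network-operator
nfd:
  enabled: false           # avoid duplicate NFD DaemonSets
---
# network-operator/values-aks.yaml (sketch)
# Assumption: the key lives under nfd, as in the upstream chart's values.
nfd:
  deployNodeFeatureRules: false
---
# dynamo-platform/values.yaml (sketch, Dynamo v1.0 schema)
# Assumption: subchart toggles are nested directly under global.
global:
  grove:
    install: true          # Dynamo manages grove
  kai-scheduler:
    install: false         # do not install a second copy
    enabled: true          # use the AICR-managed external instance
  etcd:
    install: false         # k8s-native discovery
```

The NATS JetStream PVC override (storageClassName: managed-csi) and the Prometheus endpoint are left out because the description does not give their key paths.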
Risk Assessment
Rollout notes: AKS-only changes. Other cloud overlays (EKS, GKE) are untouched. Dynamo version bump only affects the AKS Dynamo overlay; the dynamo-crds registry entry is unchanged for other overlays that may reference it.
Checklist
- make test (with -race)
- make lint
- git commit -S