
feat(recipes): update AKS H100 Dynamo recipe to match working cluster state #700

Merged
mchmarny merged 9 commits into NVIDIA:main from Jont828:jont828/aks-h100-dynamo-recipe-update on May 4, 2026


@Jont828 Jont828 commented Apr 27, 2026

Contributor

Summary

Updates the h100-aks-ubuntu-inference-dynamo recipe to reproduce the state of a working AKS ND-H100 cluster running Dynamo. Adds the full RDMA/InfiniBand stack and migrates Dynamo from v0.9 to v1.0.1.

Motivation / Context

The recipe was missing the RDMA stack required for GPUDirect on AKS ND-series nodes, had no version pins, and targeted Dynamo v0.9 (now v1.0.1 with a restructured values schema). This change makes aicr bundle produce a deployable configuration that matches a known-good cluster.

Fixes: N/A
Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: Recipes (recipes/)

Implementation Notes

RDMA stack (5 layers):

  1. NFD NodeFeatureRule (nfd-network-rule.yaml, hook-weight 1) — labels ConnectX-7 nodes (PCI device IDs 101c/101e, vendor 15b3) with pci-15b3.present=true. Replaces the chart's generic Ethernet class rule (deployNodeFeatureRules: false in values-aks.yaml).

  2. NicClusterPolicy (nic-cluster-policy-aks.yaml, hook-weight 5) — reconciled by network-operator to deploy: MOFED driver (doca3.2.0-25.10-1.2.8.0-2), RDMA shared device plugin (rdma/hca_shared_devices_a, rdmaHcaMax: 1000), DOCA telemetry. Folds the standalone rdma-shared-dp-ds DaemonSet that was manually deployed on the working cluster into the CR.

  3. ib-node-config DaemonSet (ib-node-config-aks.yaml) — host-level: loads ib_umad/rdma_ucm kernel modules, sets LimitMEMLOCK=infinity on containerd/kubelet systemd units so container processes inherit unlimited memlock (no IPC_LOCK needed in pod specs). Node selector uses pci-15b3.present=true (portable) rather than the working cluster's pool-specific label. Includes a TODO to replace with Nodewright (formerly Skyhook) once AKS RDMA support lands upstream (NVIDIA/nodewright-packages#39).

  4. driver.rdma.useHostMofed: true — tells GPU operator to use MOFED installed by network-operator instead of building nvidia-peermem from source. nfd.enabled: false prevents duplicate NFD DaemonSets.

  5. Version pins: gpu-operator: v26.3.0, network-operator: v26.1.0, kube-prometheus-stack: 83.7.0. These match the working cluster.
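
As a rough illustration of layers 1 and 2, the two manifests might look something like the sketch below. This is not the content of the files in this PR. The metadata names, the Helm hook type, the image names and repositories, and the omitted plugin version are assumptions made for illustration. The hook weights, PCI IDs, label, MOFED version, resource name, and rdmaHcaMax are taken from the description, and the vendor/driver selector from the commit message.

```yaml
# Sketch only: names, hook type, and image coordinates are assumptions.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: mellanox-ib-nodes                      # hypothetical name
  annotations:
    helm.sh/hook: post-install,post-upgrade    # assumed hook type
    helm.sh/hook-weight: "1"                   # from the description
spec:
  rules:
    - name: mellanox-ib-present                # hypothetical rule name
      labels:
        # Label as written in the description; the full key on the node
        # may carry an NFD namespace prefix.
        pci-15b3.present: "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["15b3"]}
            device: {op: In, value: ["101c", "101e"]}
---
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  annotations:
    helm.sh/hook: post-install,post-upgrade    # assumed hook type
    helm.sh/hook-weight: "5"                   # from the description
spec:
  ofedDriver:
    image: doca-driver                         # assumed image name
    repository: nvcr.io/nvidia/mellanox        # assumed repository
    version: doca3.2.0-25.10-1.2.8.0-2         # from the description
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin          # assumed image name
    repository: ghcr.io/mellanox               # assumed repository
    # version: intentionally omitted; not stated in this PR
    config: |
      {
        "configList": [
          {
            "resourceName": "hca_shared_devices_a",
            "rdmaHcaMax": 1000,
            "selectors": {
              "vendors": ["15b3"],
              "drivers": ["mlx5_core"]
            }
          }
        ]
      }
  # DOCA telemetry section omitted from this sketch.
```

The weights order the two resources so that the node label exists before network-operator reconciles the policy; the resource is then expected to be advertised as rdma/hca_shared_devices_a, as named in the description.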

Dynamo v0.9 → v1.0.1:

  • dynamo-crds componentRef removed — v1.0 platform chart manages CRDs itself.
  • dynamo-platform/values.yaml rewritten for v1.0 global.* subchart keys: grove.install: true (Dynamo manages grove), kai-scheduler.install: false / enabled: true (use AICR-managed external instance), etcd.install: false (k8s-native discovery).
  • nats.enabled: false removed — NATS is active in v1.0; operator injects natsAddress into DynamoGraphDeployment workloads.
  • Prometheus endpoint corrected: kube-prometheus-kube-prome-prometheus (derived from fullnameOverride: kube-prometheus in AICR's kube-prometheus-stack values).
  • NATS JetStream PVC override restored: storageClassName: managed-csi.
  • registry.yaml dynamo-platform defaultVersion: 0.9.1 → 1.0.1.
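
The values changes listed above can be pictured as a fragment like the following. It is a sketch under assumptions: the nesting under global is one reading of "global.* subchart keys", and the path to the JetStream PVC setting is assumed from the upstream NATS chart layout rather than taken from this PR. Note that a later comment in this thread changes the grove setting after a rebase.

```yaml
# Sketch of dynamo-platform/values.yaml as described at this point in the PR.
# Key nesting is an assumption; only the leaf values come from the description.
global:
  grove:
    install: true        # Dynamo manages grove
  kai-scheduler:
    install: false       # do not install a bundled copy
    enabled: true        # use the AICR-managed external instance
  etcd:
    install: false       # k8s-native discovery

# NATS stays enabled (no nats.enabled: false override).
nats:
  config:
    jetstream:
      fileStore:
        pvc:
          storageClassName: managed-csi   # value from the description; path assumed
```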

Testing

make test
make lint

Recipe compilation and bundle inspection against the working cluster confirmed correct values structure.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: AKS-only changes. Other cloud overlays (EKS, GKE) are untouched. Dynamo version bump only affects the AKS Dynamo overlay; dynamo-crds registry entry is unchanged for other overlays that may reference it.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@Jont828 Jont828 requested review from a team as code owners April 27, 2026 23:29
@copy-pr-bot

copy-pr-bot Bot commented Apr 27, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

This comment was marked as resolved.


Comment thread recipes/components/dynamo-platform/values.yaml Outdated
@Jont828 Jont828 force-pushed the jont828/aks-h100-dynamo-recipe-update branch from 9617e29 to 539d299 Compare April 29, 2026 00:16

@Jont828

Jont828 commented Apr 29, 2026

Contributor Author

This branch has been rebased onto chore/bump-dynamo-platform-1.0.1, which handles the Dynamo version bump (0.9.x → 1.0.2), the dynamo-crds removal, and the grove component addition across all overlays. That PR should merge first so that this one carries only the AKS-specific changes (RDMA stack, network-operator manifests, gpu-operator values).

Conflicts resolved during rebase:

  • recipes/components/dynamo-platform/values.yaml — took the base branch's structure (grove as separate AICR component with install: false / enabled: true, correct prometheus endpoint kube-prometheus-prometheus)
  • recipes/overlays/h100-aks-ubuntu-inference-dynamo.yaml — took the base branch's version (1.0.2, grove componentRef, no dead etcd/nats overrides)
  • recipes/registry.yaml — took 1.0.2 defaultVersion

@github-actions github-actions Bot added size/XL and removed size/L labels Apr 29, 2026

@github-actions

Contributor

@Jont828 this PR now has merge conflicts with main. Please rebase to resolve them.

Jont828 added 3 commits April 30, 2026 16:39
Signed-off-by: Jont828 <jt572@cornell.edu>
Plan to update the h100-aks-ubuntu-inference-dynamo recipe to match
the working AKS ND-H100 cluster state. Covers version bumps (GPU
operator v26.3.0, network operator v26.1.0, dynamo v1.0.1),
NicClusterPolicy manifest for MOFED + RDMA device plugin, NFD
NodeFeatureRule for Mellanox IB detection, and dynamo values rewrite
for the 1.0 schema.

Signed-off-by: Jont828 <jt572@cornell.edu>
… state

Brings h100-aks-ubuntu-inference-dynamo in sync with a working AKS
ND-H100 cluster running Dynamo. Changes span the full RDMA stack and
Dynamo v0.9 → v1.0.1 migration.

RDMA / InfiniBand (aks.yaml + network-operator manifests):
- Add NFD NodeFeatureRule manifest (hook-weight 1): labels ConnectX-7
  nodes (device IDs 101c/101e) with pci-15b3.present=true; replaces
  chart's generic Ethernet class rule (deployNodeFeatureRules: false)
- Add NicClusterPolicy manifest (hook-weight 5): MOFED driver
  doca3.2.0-25.10-1.2.8.0-2, RDMA shared device plugin
  (hca_shared_devices_a, rdmaHcaMax 1000, vendor 15b3/mlx5_core),
  DOCA telemetry; folds standalone rdma-shared-dp-ds DaemonSet into CR
- Add ib-node-config DaemonSet manifest: loads IB kernel modules and
  sets LimitMEMLOCK=infinity on containerd/kubelet systemd units;
  node selector changed from pool-specific label to pci-15b3.present
  for portability; TODO comment for Skyhook replacement
- Add network-operator/values-aks.yaml with deployNodeFeatureRules: false
- Version pins: gpu-operator v26.3.0, network-operator v26.1.0,
  kube-prometheus-stack 83.7.0
- gpu-operator/values-aks.yaml: add driver.rdma.useHostMofed: true
  (use MOFED from network-operator, skip peermem source build),
  nfd.enabled: false (avoid duplicate NFD DaemonSets)

Dynamo v0.9 → v1.0.1:
- Remove dynamo-crds componentRef (v1.0 platform manages CRDs itself)
- Rewrite dynamo-platform/values.yaml for v1.0 global.* subchart keys:
  grove.install: true, kai-scheduler.install: false / enabled: true
  (use AICR-managed external instance), etcd.install: false
- Remove nats.enabled: false (NATS is active in v1.0; operator injects
  natsAddress into DynamoGraphDeployments)
- Fix Prometheus endpoint: kube-prometheus-kube-prome-prometheus
  (derived from fullnameOverride: kube-prometheus in AICR values)
- Restore NATS JetStream PVC override: storageClassName: managed-csi
- registry.yaml: dynamo-platform defaultVersion 0.9.1 → 1.0.1

Docs: replace implementation plan with post-implementation summary
covering RDMA stack layers, Dynamo migration rationale, and version table.

Signed-off-by: Jont828 <jt572@cornell.edu>
Jont828 added 4 commits April 30, 2026 16:39
Skyhook is being renamed to Nodewright. Nodewright is adding AKS support
(NVIDIA/nodewright-packages#39) but RDMA/MEMLOCK setup is not yet
included. Update the TODO comment to track the right upstream project.

Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Jont828 <jt572@cornell.edu>
Signed-off-by: Jont828 <jt572@cornell.edu>
The AKS recipe unconditionally deploys the full RDMA/InfiniBand stack
via network-operator. Users running H100s without InfiniBand can now
opt out at bundle time with:

  --set networkoperator:enabled=false \
  --set gpuoperator:driver.rdma.useHostMofed=false

Add CheckHostMofedWithoutNetworkOperator validation that warns when
network-operator is disabled but useHostMofed is not overridden. Add a
ConfigMap kill switch to the ib-node-config DaemonSet so operators can
disable host-level RDMA setup at runtime without redeploying.
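
For orientation, the gpu-operator settings that this opt-out interacts with can be sketched as a values fragment. The two keys and their values are taken from this PR's description of gpu-operator/values-aks.yaml; the comments are an interpretation, and the real file may contain more.

```yaml
# Sketch of the relevant part of gpu-operator/values-aks.yaml.
driver:
  rdma:
    # Default in the AKS recipe: rely on MOFED installed by network-operator.
    # The opt-out above overrides this to false when network-operator is disabled.
    useHostMofed: true
nfd:
  # Avoids a duplicate NFD DaemonSet, per the description.
  enabled: false
```

Leaving useHostMofed at true while disabling network-operator is the combination that the new validation warns about.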
@Jont828 Jont828 force-pushed the jont828/aks-h100-dynamo-recipe-update branch from fc54e87 to e5a85bb Compare April 30, 2026 20:39
The assert-recipe.yaml had grove listed twice in componentRefs, causing
a slice length mismatch (18 expected vs 17 actual) in the E2E test.
@mchmarny mchmarny enabled auto-merge (squash) May 4, 2026 22:57

@mchmarny mchmarny left a comment

Member

/lgtm

@mchmarny mchmarny merged commit 4e602f2 into NVIDIA:main May 4, 2026
79 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Jun 11, 2026
25 tasks

3 participants