Skip to content

chore(slinky-engine): Use PodSelector to replace scontrol show partition#294

Closed
ravisoundar wants to merge 1 commit intomainfrom
rs-dynamic-nodes
Closed

chore(slinky-engine): Use PodSelector to replace scontrol show partition#294
ravisoundar wants to merge 1 commit intomainfrom
rs-dynamic-nodes

Conversation

@ravisoundar
Copy link
Copy Markdown
Collaborator

Description

Currently, scontrol show partition is used to get the nodes from slurm to generate the topology.
This PR, instead, uses the PodSelector configured per partition to identify the pods and their corresponding slurm nodes.
When the podSelector is not specified, the implementation falls back to the scontrol show partition approach.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@ravisoundar ravisoundar requested a review from dmitsh as a code owner April 23, 2026 21:35
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR replaces the scontrol show partition call in the Slinky engine with a PodSelector-based node discovery path: each topology entry can now carry a podSelector in its Other config field, which is decoded and used to list matching pods and map them to SLURM node names. When no podSelector is configured the existing scontrol fallback is preserved. Most of the prior review concerns (missing MatchExpressions guard, topology key vs. partition name lookup, test coverage gap) are addressed in this revision.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/quality suggestions that do not affect correctness.

All previously flagged P1 issues are addressed: the MatchExpressions-only guard, topology-key vs partition-name lookup (fixed with the new for-loop), and the test coverage gap (tests now call GenerateOutput). Remaining comments are non-blocking.

pkg/translate/yaml.go — the blockSizes → block_sizes rename is a breaking schema change for existing ConfigMap consumers; coordinate or document the migration before deploying to existing clusters.

Important Files Changed

Filename Overview
pkg/engines/slinky/engine.go Adds PodSelector-based node discovery per partition in getPartitionNodes; refactors getClusterNodes into a reusable getMatchingNodes helper. Non-deterministic node ordering (P2) and one unnecessary node-list fetch per partition (P2) are minor concerns.
pkg/engines/slinky/engine_test.go Tests refactored to call GenerateOutput (exercising the full GetTranslateConfiggetPartitionNodes path); test topology uses Partition: key which masks partition-name-vs-topology-key divergence (P2).
pkg/engines/slurm/slurm.go Adds Other map[string]interface{} with mapstructure:",remain" to Topology struct to capture extra config fields like podSelector without breaking existing fields.
pkg/translate/yaml.go Renames YAML tag blockSizesblock_sizes in BlockTopo; breaking schema change for existing ConfigMap consumers (noted in prior review).
pkg/translate/yaml_test.go Expected YAML strings updated to reflect block_sizes rename and narrower per-partition node sets.
.github/workflows/docker.yml Adds buildkitd-config: /etc/buildkit/buildkitd.toml to Docker Buildx setup; assumes the file exists on the runner.

Sequence Diagram

sequenceDiagram
    participant GO as GenerateOutput
    participant GTC as GetTranslateConfig
    participant GPN as getPartitionNodes (slinky)
    participant GMN as getMatchingNodes
    participant K8S as Kubernetes API
    participant SC as scontrol (fallback)

    GO->>GTC: GetTranslateConfig(ctx, params, topologyNodeFinder)
    loop for each topology/partition
        GTC->>GPN: getPartitionNodes(ctx, sect.Partition, params)
        GPN->>GPN: search Topologies for matching Partition field → topoName
        alt podSelector found in topo.Other
            GPN->>GMN: getMatchingNodes(ctx, client, ns, nodeListOpt, podListOpt)
            GMN->>K8S: GetNodes (nodeListOpt)
            GMN->>K8S: List Pods (podListOpt / podSelector)
            K8S-->>GMN: pods with slurm.node.name labels
            GMN-->>GPN: nodeMap (k8s→SLURM names)
            GPN-->>GTC: ' Nodes=slurm1,slurm2,...'
        else no podSelector
            GPN->>SC: scontrol show partition name
            SC-->>GPN: raw partition info
            GPN-->>GTC: raw scontrol output
        end
        GTC->>GTC: parsePartitionNodes → cluset.ExpandList → cluset.Compact
    end
    GTC-->>GO: translate.Config
    GO->>GO: generateDynamicNodesOutput / reconciliation
Loading

Reviews (5): Last reviewed commit: "chore(slinky-engine): Use PodSelecotr to..." | Re-trigger Greptile

Comment thread pkg/engines/slinky/engine.go Outdated
Comment thread pkg/translate/yaml.go
Comment thread pkg/engines/slinky/engine_test.go Outdated
Comment thread pkg/engines/slinky/engine.go Outdated
Comment thread pkg/engines/slinky/engine.go
@ravisoundar ravisoundar force-pushed the rs-dynamic-nodes branch 2 times, most recently from dc437a5 to 0766007 Compare April 24, 2026 16:48
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants