
Honor a host's declared primary interface when picking its boot device #2657

@chet

Description


An operator can already point a host at any boot interface after the fact with set-primary-interface (#2314). What's missing is the declared, up-front intent: ExpectedHostNic.primary is meant to say "this NIC is the boot interface," but our ingestion automation ignores it and picks by DPU discovery order instead. This is the piece that lets a host declare "neither DPU is north/south primary -- this integrated NIC is" and have it stick -- exactly what a host with a DPU in NIC mode, or a DPU present-but-unused, needs (see #870).

We can't require primary to be set -- tens of thousands of existing machines have never needed it -- so the rule is: when it's declared, it wins; when it isn't, today's automation stands.

Current behavior

  • pick_boot_interface (crates/api-model/src/machine/mod.rs:280) already reads the primary_interface flag first, then falls back to the lowest-MAC non-underlay interface. The selection (read) side is fine.
  • The flag is written by three independent paths, and only one honors the declared intent -- and only as a demotion:
    • Ingestion: configure_host_machine (crates/site-explorer/src/machine_creator.rs:751) marks the first DPU's host interface primary and demotes the rest, purely by discovery order. It already receives machine_data -- the ExpectedMachineData with host_nics[].primary -- but never reads it for this decision.
    • DHCP: discover.rs:255 reads the declared primary into is_primary_nic and passes it to find_or_create_machine_interface. But machine_interface.rs:464 only ever sets the flag to false (demoting a non-declared NIC) -- there is no promote-to-true branch. A declared NIC stays primary only by inheriting the creation default of true.
    • Operator RPC: set_primary_interface (crates/api-db/src/machine_interface.rs:171) -- the post-hoc override added by #2314 ("feat: make any host interface the primary, not just a DPU").
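The DHCP asymmetry is easier to see in a minimal Rust model of the flag update. This is a sketch, not the code at machine_interface.rs:464. The function names are hypothetical, and the declared intent is modeled as an Option<bool>: None means the host declares no primary, Some(true) means this NIC is declared, and Some(false) means another NIC is. If an earlier write cleared a declared NIC's flag, a demote-only update leaves it false. A symmetric update restores it.

```rust
// Hypothetical model of the DHCP write path's flag update (not the real code).
// `declared`: None = host declares no primary, Some(true) = this NIC is the
// declared primary, Some(false) = some other NIC is.

/// Behavior as described in the issue: the declared intent can only demote.
fn dhcp_update_demote_only(current: bool, declared: Option<bool>) -> bool {
    match declared {
        Some(false) => false, // demote a NIC that is not the declared primary
        _ => current,         // no promote branch: keep whatever the row had
    }
}

/// Symmetric variant: a declaration both promotes and demotes; no declaration
/// leaves the existing automation's choice alone.
fn dhcp_update_symmetric(current: bool, declared: Option<bool>) -> bool {
    match declared {
        Some(is_primary) => is_primary,
        None => current,
    }
}

fn main() {
    // Suppose an earlier write (e.g. ingestion by discovery order) left the
    // declared NIC demoted, and DHCP now sees it.
    let after_earlier_write = false;
    assert!(!dhcp_update_demote_only(after_earlier_write, Some(true))); // stays demoted
    assert!(dhcp_update_symmetric(after_earlier_write, Some(true)));    // restored
    // With nothing declared, both variants keep today's value.
    assert!(dhcp_update_symmetric(true, None));
    println!("ok");
}
```

Even the symmetric version only fixes one row. Keeping exactly one primary still requires something that looks at all of the host's interfaces together.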

The change

  • Make the declared ExpectedHostNic.primary authoritative across the writers, ideally behind one reconcile decider -- "given this host's interfaces and its declared primary, exactly one is primary" -- that both ingestion and DHCP route through. Precedence: declared primary > DPU takeover > lowest-MAC non-underlay.
  • configure_host_machine must not promote a DPU interface when a different NIC is declared primary.
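Here is one possible shape for the reconcile decider, as a minimal Rust sketch. Iface, decide_primary, and reconcile are hypothetical names. The dpu_takeover field stands in for "the interface configure_host_machine would pick today." MACs are [u8; 6] so that array ordering matches numeric MAC order. Each tier breaks ties by lowest MAC, so the result does not depend on discovery or arrival order.

```rust
/// Hypothetical, simplified view of one host interface.
#[derive(Debug, Clone)]
struct Iface {
    mac: [u8; 6],           // array Ord == numeric MAC order
    underlay: bool,
    dpu_takeover: bool,     // stand-in for today's DPU-order pick
    declared_primary: bool, // from ExpectedHostNic.primary
    primary: bool,          // the primary_interface flag being reconciled
}

/// Lowest MAC among the given candidates, if any.
fn lowest_mac<'a>(it: impl Iterator<Item = &'a Iface>) -> Option<[u8; 6]> {
    it.map(|n| n.mac).min()
}

/// Precedence: declared primary > DPU takeover > lowest-MAC non-underlay.
fn decide_primary(ifaces: &[Iface]) -> Option<[u8; 6]> {
    lowest_mac(ifaces.iter().filter(|n| n.declared_primary))
        .or_else(|| lowest_mac(ifaces.iter().filter(|n| n.dpu_takeover)))
        .or_else(|| lowest_mac(ifaces.iter().filter(|n| !n.underlay)))
}

/// Sets the flag on exactly the winning interface and clears it on the rest.
fn reconcile(ifaces: &mut [Iface]) {
    let winner = decide_primary(ifaces);
    for n in ifaces.iter_mut() {
        n.primary = Some(n.mac) == winner;
    }
}

/// Test helper: builds an interface with MAC 02:00:00:00:00:<last>.
fn nic(last: u8, underlay: bool, dpu_takeover: bool, declared_primary: bool) -> Iface {
    Iface { mac: [0x02, 0, 0, 0, 0, last], underlay, dpu_takeover, declared_primary, primary: false }
}

fn main() {
    // The DPU interface (0x10) would win takeover; the integrated NIC (0x20) is declared.
    let mut host = vec![nic(0x10, false, true, false), nic(0x20, false, false, true)];
    reconcile(&mut host);
    assert!(!host[0].primary && host[1].primary);
    println!("ok");
}
```

In this sketch, ingestion and DHCP would both call reconcile with the host's full interface set instead of flipping individual flags. That way, neither writer can leave a second primary behind.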

Open questions to settle while implementing

  • How (and whether) configure_host_machine runs for a DPU in NIC mode. In NIC mode no DPU snapshot is attached, so this path may not run for it at all -- which would mean NIC-mode hosts already fall through to the DHCP/declared/lowest-MAC logic. Confirm before assuming the takeover overrides the declared NIC in NIC mode.
  • The pre-ownership window: DHCP creates these rows with machine_id = NULL (the None branch of find_or_create_machine_interface), and the one_primary_per_machine partial index does not constrain NULL machine_ids. So with no declared primary, multiple newly-leased NICs can each default to primary_interface = true, and pick_boot_interface returns whichever it finds first. Decide whether the reconcile decider should also settle this.
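The pre-ownership gap can be shown with a small Rust model. It assumes one_primary_per_machine behaves like a unique index on machine_id restricted to primary rows; the issue only says it does not constrain NULL machine_ids. Row, violates_one_primary_per_machine, and first_primary are hypothetical names. Under the default SQL treatment of NULLs as distinct in unique indexes, two unowned primary rows pass the check. The "first primary found" then depends on row order, and the same flags become a violation once both rows belong to one machine.

```rust
use std::collections::HashSet;

/// Hypothetical row shape: machine_id is None until the host is ingested.
#[derive(Debug)]
struct Row {
    mac: [u8; 6],
    machine_id: Option<u32>,
    primary: bool,
}

/// Models a unique index on machine_id limited to primary rows.
/// Rows with machine_id = None never collide (NULLs are distinct).
fn violates_one_primary_per_machine(rows: &[Row]) -> bool {
    let mut seen = HashSet::new();
    rows.iter()
        .filter(|r| r.primary)
        .filter_map(|r| r.machine_id)
        .any(|id| !seen.insert(id)) // a repeated machine_id is a violation
}

/// Stand-in for "return whichever primary row is found first".
fn first_primary(rows: &[Row]) -> Option<[u8; 6]> {
    rows.iter().find(|r| r.primary).map(|r| r.mac)
}

fn main() {
    // Two NICs leased before ingestion, both defaulting to primary.
    let mut rows = vec![
        Row { mac: [0x02, 0, 0, 0, 0, 0x20], machine_id: None, primary: true },
        Row { mac: [0x02, 0, 0, 0, 0, 0x10], machine_id: None, primary: true },
    ];
    assert!(!violates_one_primary_per_machine(&rows)); // accepted
    assert_eq!(first_primary(&rows), Some([0x02, 0, 0, 0, 0, 0x20]));
    rows.reverse();
    assert_eq!(first_primary(&rows), Some([0x02, 0, 0, 0, 0, 0x10])); // order-dependent

    // Once both rows are owned by the same machine, the same flags collide.
    for r in rows.iter_mut() {
        r.machine_id = Some(7);
    }
    assert!(violates_one_primary_per_machine(&rows));
    println!("ok");
}
```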

Done when

  • A host that declares an integrated NIC as primary boots from it even when a DPU is ingested in DPU mode.
  • No declared primary -> behavior is unchanged from today.
  • Tests cover: declared primary beats DPU takeover; declared primary survives regardless of DHCP arrival order; absent declared primary keeps today's automation. test_dhcp_marks_non_primary_mac_as_non_primary is the seed.
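The arrival-order requirement lends itself to a permutation test. This Rust sketch is self-contained and hypothetical: Event, apply, and final_primary are illustrative names, not the existing test harness or the seed test. Underlay interfaces are left out of the lowest-MAC fallback for brevity. The test replays the same DHCP and ingestion events in all six orders and checks that the declared NIC always ends up primary. It then checks that the DPU takeover stands when no primary is declared.

```rust
use std::collections::BTreeMap;

/// Hypothetical write-path events for one host.
#[derive(Clone, Copy)]
enum Event {
    Dhcp { mac: u64, declared: bool }, // a NIC leases; `declared` from ExpectedHostNic.primary
    DpuIngested { mac: u64 },          // ingestion marks a DPU host interface
}

/// Facts accumulated per MAC; BTreeMap keeps MACs sorted ascending.
#[derive(Default, Clone, Copy)]
struct Facts {
    declared: bool,
    dpu_takeover: bool,
}

fn apply(host: &mut BTreeMap<u64, Facts>, ev: Event) {
    match ev {
        Event::Dhcp { mac, declared } => host.entry(mac).or_default().declared |= declared,
        Event::DpuIngested { mac } => host.entry(mac).or_default().dpu_takeover = true,
    }
}

/// Lowest MAC whose facts satisfy `f`.
fn first_where(host: &BTreeMap<u64, Facts>, f: impl Fn(&Facts) -> bool) -> Option<u64> {
    host.iter().find(|&(_, x)| f(x)).map(|(m, _)| *m)
}

/// Declared > DPU takeover > lowest MAC, recomputed from all facts seen so far.
fn primary(host: &BTreeMap<u64, Facts>) -> Option<u64> {
    first_where(host, |x| x.declared)
        .or_else(|| first_where(host, |x| x.dpu_takeover))
        .or_else(|| host.keys().next().copied())
}

fn final_primary(events: &[Event]) -> Option<u64> {
    let mut host = BTreeMap::new();
    for &ev in events {
        apply(&mut host, ev);
    }
    primary(&host)
}

fn main() {
    let evs = [
        Event::Dhcp { mac: 0x10, declared: false }, // DPU-side host interface
        Event::Dhcp { mac: 0x20, declared: true },  // integrated NIC, declared primary
        Event::DpuIngested { mac: 0x10 },
    ];
    let orders = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [1, 2, 0], [2, 0, 1], [2, 1, 0]];
    for o in orders {
        let seq: Vec<Event> = o.iter().map(|&i| evs[i]).collect();
        assert_eq!(final_primary(&seq), Some(0x20)); // declared wins in every order
    }
    // No declared primary: today's DPU takeover stands.
    let no_decl = [Event::Dhcp { mac: 0x20, declared: false }, Event::DpuIngested { mac: 0x10 }];
    assert_eq!(final_primary(&no_decl), Some(0x10));
    println!("ok");
}
```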

Part of #870.

Metadata

  • Assignees: no one assigned
  • Labels: none
  • Type: none
  • Projects: Status In Progress
  • Milestone: none
  • Relationships: none yet
  • Development: no branches or pull requests
