bug: gb200 server: stuck machine in terminating due to TPM cert issue #2669

@brusmith-nvidia

Version

v0.9.3-0-gd09a7dd35

Describe the bug.

The machine is being terminated by the tenant. The termination state machine stalls in this state:
MachineValidation { machine_validation: MachineValidating { context: "Cleanup", id: MachineValidationId { uuid: 9d3b9aa2-de6a-48ab-a412-70c41bbf2c64 }, completed: 1, total: 1, is_enabled: true } }

The console on the host shows:
Jun 15 22:03:25 scout forge-scout[3831]: IGNORING SERVER CERT, Please ensure that I am removed to actually validate TLS.

Minimum reproducible example

MachineValidatingState::MachineValidating {
    context,
    id,
    ..
} => {
    if !rebooted(&mh_snapshot.host_snapshot) {
        // ... retry reboot ...
    }
    // ...
    if machine_validation_completed(&mh_snapshot.host_snapshot) {
        if mh_snapshot.host_snapshot.failure_details.cause == FailureCause::NoError {
            // success → reboot → HostInit/Discovered
        } else {
            // → Failed
        }
    }
    // Completion not observed yet: stay in this state and keep waiting for scout.
    Ok(StateHandlerOutcome::do_nothing())
}
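
To make the stall easier to reason about, the decision in the excerpt above can be reduced to a small pure function. This is only a sketch: `Outcome`, `next_step`, and this cut-down `FailureCause` are hypothetical stand-ins invented for illustration, not the real state handler types, and the reboot-retry branch is left out because the excerpt elides it.

```rust
// Hypothetical, simplified stand-ins for illustration only.
#[derive(Debug, PartialEq)]
enum FailureCause {
    NoError,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Mirrors `StateHandlerOutcome::do_nothing()`: stay in MachineValidating.
    Wait,
    /// Validation finished without error: reboot and move on.
    RebootToHostInit,
    /// Validation finished with an error.
    Failed,
}

/// Decide the next step from the two inputs the excerpt inspects.
fn next_step(validation_completed: bool, cause: &FailureCause) -> Outcome {
    if !validation_completed {
        // Nothing else can move the machine forward from here.
        return Outcome::Wait;
    }
    if *cause == FailureCause::NoError {
        Outcome::RebootToHostInit
    } else {
        Outcome::Failed
    }
}
```

Under this model, the only way to stay in the state indefinitely is for the completion check to keep returning false, so that check (and whatever scout must report to satisfy it) looks like the place to start debugging.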

Relevant log output

systemctl status forge-scout
● forge-scout.service - Scout Service
     Loaded: loaded (/usr/lib/systemd/system/forge-scout.service; enabled; preset: enabled)
     Active: active (running) since Fri 2026-06-12 16:45:05 UTC; 3 days ago
    Process: 3281 ExecStartPre=/opt/forge/forge-scout-pre.sh (code=exited, status=0/SUCCESS)
   Main PID: 3831 (forge-scout)
      Tasks: 1 (limit: 629145)
     Memory: 17.5M (peak: 30.8M)
        CPU: 22.179s
     CGroup: /system.slice/forge-scout.service
             └─3831 /opt/forge/forge-scout --api=https://carbide-api.forge --machine-interface-id=b9cafcf5-489c-4438-bf6b-1e989246d9cc

Jun 15 22:03:25 scout forge-scout[3831]: IGNORING SERVER CERT, Please ensure that I am removed to actually validate TLS.
Jun 15 22:03:25 scout forge-scout[3831]: level=ERROR msg="Error attempting to discover_machine (attempt: 4636): code: \'The system is not in a state required for the operation\'s execution\', message: \"Machine topology machine_id foreign key violation: error returned from database: insert or update on table \\\"machine_topologies\\\" violates foreign key constraint \\\"machine_topologies_machine_id_fkey\\\"\"" location="crates/host-support/src/registration.rs:144"
Jun 15 22:04:25 scout forge-scout[3831]: level=INFO msg="Attempting to discover_machine (attempt: 4637)" location="crates/host-support/src/registration.rs:130"
Jun 15 22:04:25 scout forge-scout[3831]: IGNORING SERVER CERT, Please ensure that I am removed to actually validate TLS.
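
A quick sanity check on the attempt counter: the service has been active since 2026-06-12 16:45:05, and attempt 4636 was logged at Jun 15 22:03:25. The two attempts shown are 60 seconds apart, so the sketch below assumes a fixed 60-second retry interval and assumes the journal timestamps are in the same timezone as the `Active:` line. `elapsed_secs` is a hypothetical helper written for this note, not part of forge-scout.

```rust
/// A timestamp within a single month: (day, hour, minute, second).
type Stamp = (u64, u64, u64, u64);

/// Seconds between two timestamps that fall in the same month.
/// Assumes `end` is not earlier than `start`.
fn elapsed_secs(start: Stamp, end: Stamp) -> u64 {
    let to_secs = |(d, h, m, s): Stamp| ((d * 24 + h) * 60 + m) * 60 + s;
    to_secs(end) - to_secs(start)
}

/// Number of whole retry intervals that fit into the elapsed time.
fn expected_attempts(start: Stamp, end: Stamp, retry_interval_secs: u64) -> u64 {
    elapsed_secs(start, end) / retry_interval_secs
}

// Usage with the values from the log above:
//   elapsed_secs((12, 16, 45, 5), (15, 22, 3, 25))           == 278_300
//   expected_attempts((12, 16, 45, 5), (15, 22, 3, 25), 60)  == 4_638
```

The logged count (4636) is within two of the 4638 intervals that fit into the uptime, which is consistent with scout retrying `discover_machine` once a minute for essentially the entire time the service has been up.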

Other/Misc.

No response

Code of Conduct

  • I agree to follow NVIDIA Infra Controller's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Assignees

No one assigned

    Labels

    bug: A defect in existing software (deprecated - use issue type, but it's needed for reporting now)

    Type

    No fields configured for Bug.

    Projects

    Status
    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests