fix(runtimes): set MPI SSH auth secret volume default mode to 0640#3368

Closed
harxhist wants to merge 3 commits into kubeflow:master from harxhist:fix(runtimes)/mpi-ssh-secret-perms

Conversation

@harxhist


What this PR does?:

  • Set DefaultMode to 0640 on the MPI SSH auth Secret volume in the MPI plugin so mounted secret files (private key, authorized_keys) have restricted permissions.
  • Update unit, framework, and integration test expectations to include DefaultMode: 0640 for MPI SSH auth volumes.
  • In the framework component-builder test, add cmpopts.IgnoreFields(corev1.SecretVolumeSource{}, "DefaultMode") so the generic comparison focuses on Secret data; DefaultMode is asserted in MPI-specific tests.

Why we need it?

Explicit default mode improves security for MPI SSH keys (owner read/write, group read, others none) and avoids relying on cluster default behavior.
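For readers less used to octal modes, here is a minimal sketch of what 0640 means, using only Go's standard `io/fs` package. The helper name `permString` is hypothetical and is not part of this PR.

```go
package main

import (
	"fmt"
	"io/fs"
)

// permString renders the permission bits of an octal mode in `ls -l` style.
// Hypothetical helper for illustration only; it is not part of the PR.
func permString(mode uint32) string {
	return fs.FileMode(mode).Perm().String()
}

func main() {
	// Owner read/write, group read, others none.
	fmt.Println(permString(0o640)) // -rw-r-----
}
```

The same helper can be used to compare candidate modes such as 0600 or 0644 side by side.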

Which issue this PR fixes

Fixes #3262

Checklist:

  • Unit tests passed
  • Integration tests passed
  • E2E tests passed
  • Linter & fmt ran

Copilot AI review requested due to automatic review settings March 19, 2026 14:32
@google-oss-prow


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions


🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot AI left a comment

Contributor


Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@Sridhar1030 Sridhar1030 left a comment

Member


Thanks for the PR. The change to set explicit defaultMode 0640 on the MPI SSH auth Secret volume matches the issue and is a sensible hardening step. Test updates look consistent with the behavior change.

Just one small, non-blocking nit: the global IgnoreFields(..., "DefaultMode") in the framework component-builder test could mask defaultMode on other Secret volumes in the future; spelling out 0640 on the expected Secret volumes and dropping the ignore would make the test stricter.

@google-oss-prow


@Sridhar1030: changing LGTM is restricted to collaborators

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Harsh <harxhist@gmail.com>
@google-oss-prow


@Sridhar1030: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test


@jiayuzhao05 jiayuzhao05 left a comment


I think there is one important behavioral risk to verify before merge.

DefaultMode is applied to the entire MPI SSH auth secret volume, which means the same 0640 permission will be used for the private key, public key, and authorized_keys. That may be problematic for the private key specifically. In many SSH/OpenSSH environments, private keys are expected to be more restrictive, and group-readable permissions can trigger “permissions are too open” style errors.

So while the patch correctly updates the generated volume spec and the related tests, it does not yet prove that SSH runtime behavior remains valid with 0640 for the private key. I would recommend setting per-item modes instead of a single volume-wide DefaultMode.
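To make the difference between a volume-wide default and per-item modes concrete, here is a minimal sketch of how the effective mode of each projected file is resolved. The `item` and `secretVolume` types are simplified, hypothetical stand-ins for the Kubernetes `KeyToPath` and `SecretVolumeSource` types, and the key and path names are illustrative; this is not the plugin's code.

```go
package main

import "fmt"

// item is a simplified stand-in for KeyToPath (assumption: only the
// fields relevant to file modes are modelled).
type item struct {
	Key  string
	Path string
	Mode *int32 // nil means "inherit the volume-wide default"
}

// secretVolume is a simplified stand-in for SecretVolumeSource.
type secretVolume struct {
	DefaultMode *int32
	Items       []item
}

// kubernetesDefaultMode is the mode Kubernetes applies when defaultMode is unset.
const kubernetesDefaultMode int32 = 0o644

// effectiveMode returns the per-item mode if set, otherwise the volume
// default, otherwise the Kubernetes default of 0644.
func effectiveMode(v secretVolume, it item) int32 {
	if it.Mode != nil {
		return *it.Mode
	}
	if v.DefaultMode != nil {
		return *v.DefaultMode
	}
	return kubernetesDefaultMode
}

// ptr returns a pointer to the given mode value.
func ptr(m int32) *int32 { return &m }

func main() {
	vol := secretVolume{
		DefaultMode: ptr(0o640),
		Items: []item{
			{Key: "ssh-privatekey", Path: "id_rsa", Mode: ptr(0o600)}, // explicit per-item mode
			{Key: "ssh-publickey", Path: "id_rsa.pub"},                // inherits DefaultMode
		},
	}
	for _, it := range vol.Items {
		fmt.Printf("%s -> %04o\n", it.Path, effectiveMode(vol, it))
	}
}
```

With a single DefaultMode and no per-item modes, every file in the volume resolves to the same value, which is the behaviour this comment questions for the private key.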

@google-oss-prow


@jiayuzhao05: changing LGTM is restricted to collaborators


@harxhist

Author

I think there is one important behavioral risk to verify before merge.

DefaultMode is applied to the entire MPI SSH auth secret volume, which means the same 0640 permission will be used for the private key, public key, and authorized_keys. That may be problematic for the private key specifically. In many SSH/OpenSSH environments, private keys are expected to be more restrictive, and group-readable permissions can trigger “permissions are too open” style errors.

So while the patch correctly updates the generated volume spec and the related tests, it does not yet prove that SSH runtime behavior remains valid with 0640 for the private key. I would recommend setting per-item modes instead of a single volume-wide DefaultMode.

@jiayuzhao05

Thanks for catching this, that makes sense.

I’ll switch from using a single defaultMode to setting specific modes per key. The private key will use a stricter permission (0600), while the public key and authorized_keys will use 0644.

I’ll update the PR accordingly 👍

Signed-off-by: Harsh <harxhist@gmail.com>
@google-oss-prow google-oss-prow Bot added size/L and removed size/M labels Mar 21, 2026
@harxhist harxhist requested a review from jiayuzhao05 March 21, 2026 10:29

@jiayuzhao05 jiayuzhao05 left a comment


This looks good to me now. This revision addresses my previous concern.

@google-oss-prow


@jiayuzhao05: changing LGTM is restricted to collaborators


@harxhist

Author

@jinchihe @kuizhiqing
Please review this PR.

@Sridhar1030

Member

Looks good to me too.

@andreyvelich andreyvelich left a comment

Member


Thanks for the contribution, @harxhist! This review was generated using AI tooling. Inline comments below — one is a likely runtime regression that I'd like to confirm before merge.

One PR-level note: the title and the "What this PR does" section still describe DefaultMode: 0640, but the implementation now uses per-key modes (0600 private, 0644 shared). Please update so the eventual commit message matches the code.

corev1ac.KeyToPath().
	WithKey(corev1.SSHAuthPrivateKey).
	WithPath(constants.MPISSHPrivateKeyFile).
	WithMode(constants.MPISSHSecretPrivateKeyFileMode),

Member


Likely runtime regression — please verify before merge.

Mode 0600 makes the private key readable only by its owner. Secret-volume files are mounted root:root by default, but the bundled deepspeed-distributed and mlx-distributed runtimes set runAsUser: 1000 (mpiuser) — see manifests/base/runtimes/deepspeed_distributed.yaml:36,46 and the matching mlx_distributed.yaml. Neither sets fsGroup, so mpiuser cannot read /home/mpiuser/.ssh/id_rsa and mpirun→SSH will fail at the kernel DAC layer before OpenSSH's sshkey_perm_ok check is reached.

Issue #3262 explicitly recommended 0640 for this exact reason. Options:

  • (a) revert this constant to 0640 (matches the title + issue analysis), or
  • (b) keep 0600 and add a pod-level fsGroup to the bundled runtimes so the mpiuser group can read the key.

Current unit tests only assert the rendered spec — please confirm via an E2E run against deepspeed-distributed before merge.
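The access-control argument above can be sketched as a simplified model of the Unix read-permission check. This is an assumption-laden illustration: `canRead` is a hypothetical helper that ignores root's capability overrides, ACLs, and security modules, and the group membership of the real container process depends on the image and security context, which this sketch does not determine.

```go
package main

import "fmt"

// canRead models the classic Unix discretionary access check for reads:
// the owner class is tried first, then the group class, then "other".
// Only the first matching class is consulted.
// Simplification: root overrides, ACLs, and LSMs are not modelled.
func canRead(mode uint32, fileUID, fileGID, uid uint32, gids []uint32) bool {
	if uid == fileUID {
		return mode&0o400 != 0
	}
	for _, g := range gids {
		if g == fileGID {
			return mode&0o040 != 0
		}
	}
	return mode&0o004 != 0
}

func main() {
	const root, mpiuser = 0, 1000

	// File owned root:root with 0600, process running as UID 1000.
	fmt.Println(canRead(0o600, root, root, mpiuser, []uint32{mpiuser}))

	// Same ownership with 0640: readable only if the process is in the
	// file's group (assumed here to be GID 0).
	fmt.Println(canRead(0o640, root, root, mpiuser, []uint32{root}))
}
```

Under this model, 0600 on a root-owned file denies a non-root reader regardless of its groups, while 0640 grants access only when the reader belongs to the file's group. An E2E run remains the way to confirm the actual behaviour.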

MPISSHSecretPrivateKeyFileMode int32 = 0600

// MPISSHSecretSharedSSHFileMode is the mode for the mounted MPI SSH public key and authorized_keys files (0644).
MPISSHSecretSharedSSHFileMode int32 = 0644

Member


Naming suggestion for consistency with the surrounding MPISSHPublicKey* family:

  • MPISSHSecretPrivateKeyFileMode → MPISSHPrivateKeyFileMode (the Secret infix is noise; every MPI SSH file already lives in the Secret)
  • MPISSHSecretSharedSSHFileMode → MPISSHPublicKeyFileMode (SharedSSH is opaque; both id_rsa.pub and authorized_keys are public-key material)

Minor: consider 0o600 / 0o644 notation (Go 1.13+) or include the decimal in the doc comment — makes the octal intent explicit.
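As a small illustration of the notation point, the sketch below shows that both literal forms denote the same value and prints the decimal equivalent. The constant and function names are hypothetical and chosen only for this example.

```go
package main

import "fmt"

const (
	legacyOctal int32 = 0600  // leading-zero octal literal
	modernOctal int32 = 0o600 // 0o prefix, accepted since Go 1.13
)

// describe formats a mode with an explicit 0o prefix plus its decimal value,
// which is the kind of detail a doc comment could carry.
func describe(mode int32) string {
	return fmt.Sprintf("%O (decimal %d)", mode, mode)
}

func main() {
	fmt.Println(legacyOctal == modernOctal)
	fmt.Println(describe(modernOctal))
	fmt.Println(describe(0o644))
}
```

The `0o` form avoids the risk of a reader mistaking the literal for decimal 600.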

WithPath(constants.MPISSHAuthorizedKeys),
),
WithSecret(
corev1ac.SecretVolumeSource().

Member


Defense-in-depth suggestion: now that the test-side IgnoreFields(SecretVolumeSource{}, "DefaultMode") was removed, consider also setting WithDefaultMode(0o400) on the SecretVolumeSource. Per-item modes override everything today, but a future contributor adding a 4th item without WithMode(...) would silently inherit Kubernetes' 0644 default.
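One complementary way to catch the "fourth item without a mode" case is a test-side guard that lists items lacking an explicit mode. The sketch below is only an illustration: `keyToPath` is a simplified, hypothetical stand-in for the Kubernetes `KeyToPath` type, and `known_hosts` is an invented example entry, not something the PR adds.

```go
package main

import "fmt"

// keyToPath is a simplified stand-in for the Kubernetes KeyToPath type.
type keyToPath struct {
	Path string
	Mode *int32 // nil means the item falls back to the volume default
}

// pathsMissingMode returns the paths of items with no explicit mode,
// i.e. the ones that would silently inherit the volume-wide default.
func pathsMissingMode(items []keyToPath) []string {
	var missing []string
	for _, it := range items {
		if it.Mode == nil {
			missing = append(missing, it.Path)
		}
	}
	return missing
}

func main() {
	private, public := int32(0o600), int32(0o644)
	items := []keyToPath{
		{Path: "id_rsa", Mode: &private},
		{Path: "id_rsa.pub", Mode: &public},
		{Path: "authorized_keys", Mode: &public},
		{Path: "known_hosts"}, // hypothetical fourth item added without a mode
	}
	fmt.Println(pathsMissingMode(items))
}
```

A restrictive volume-wide default and a guard like this address the same risk from two sides: one limits the damage at runtime, the other flags the omission in tests.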

@andreyvelich

Member

We are looking for someone to pick up this PR to finalize it.
/good-first-issue

@andreyvelich

Member

Fixed by: #3649
/close

@google-oss-prow google-oss-prow Bot closed this Jun 24, 2026
@google-oss-prow


@andreyvelich: Closed this PR.


Development

Successfully merging this pull request may close these issues.

Set defaultMode on MPI SSH Secret volume to restrict private key permissions

5 participants