docs: clarify undeploy.sh orphan-CRD behavior and manual cleanup#669
docs: clarify undeploy.sh orphan-CRD behavior and manual cleanup#669yuanchen8911 wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
undeploy.sh intentionally skips CRDs that don't carry helm release annotations (chart crds/ subdir, operator-runtime-installed, or out-of-band). That is the documented shared-cluster safety default, but nothing in the docs told users these CRDs survive undeploy or that a clean redeploy on a single-tenant cluster needs manual CRD cleanup. Add an "Orphan-CRD behavior (by design)" subsection under the undeploy.sh section that: - Calls out why these CRDs are intentionally left behind - Names the two redeploy failure modes (helm import refusal, CRD delete hanging on unreconciled finalizers) - Gives the manual finalizer-clear + CRD-delete recipe for single-tenant clusters, with a warning against running it on shared clusters
📝 WalkthroughWalkthroughThe CLI reference docs for Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/user/cli-reference.md`:
- Around line 1370-1374: Replace the bold labels in this subsection with proper
markdown headings: change "**Orphan-CRD behavior (by design):**" into a heading
(e.g., "### Orphan-CRD behavior (by design)"), and change "**Consequence for
redeploy on single-tenant clusters:**" into a heading (e.g., "### Consequence
for redeploy on single-tenant clusters"); ensure the subsequent paragraphs
remain intact and update any heading level to match surrounding document
structure for navigability and consistency.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 2ec59ac6-aae7-4f1c-a586-ea2b2e60270a
📒 Files selected for processing (1)
docs/user/cli-reference.md
…ight claim Two issues Codex caught on PR NVIDIA#669: 1. The finalizer-clear loop silently skipped cluster-scoped CRs. jq emitted namespace first (empty for cluster-scoped), so `read -r ns name` received a leading space, assigned the CR name to `ns`, left `name` empty, and the `[[ -z "${name}" ]] && continue` guard dropped the iteration. CRDs like `clusterpolicies.nvidia.com` and `clustertrainingruntimes.trainer.kubeflow.org` would have kept hanging on finalizers even after running the documented recipe. Emit name first so `read -r name ns` binds correctly in both scopes. 2. The doc overstated post-flight coverage. undeploy.sh only surfaces CRDs that carry an in-bundle Helm release annotation, or whose exact name is in the script's `extra_crds_for_release` list — arbitrary out-of-band CRDs are not warned about. Narrow the claim so operators don't treat the post-flight warning as a complete orphan inventory.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
docs/user/cli-reference.md (1)
1370-1374:⚠️ Potential issue | 🟡 MinorConvert these bold labels into proper markdown headings.
Line 1370 and Line 1374 are substantial subsections and should use heading syntax for anchorability and navigation consistency.
As per coding guidelines: “Promote bold labels to headings only when the topic has substantial content (≥ ~8 lines).”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user/cli-reference.md` around lines 1370 - 1374, Replace the bold subsection labels "**Orphan-CRD behavior (by design):**" and "**Consequence for redeploy on single-tenant clusters:**" with proper markdown headings (e.g., "### Orphan-CRD behavior (by design)" and "### Consequence for redeploy on single-tenant clusters") to make them anchorable and consistent with the rest of the CLI reference; ensure you pick the same heading level used by neighboring subsections and keep the original punctuation/phrasing intact while removing the bold markup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/user/cli-reference.md`:
- Around line 1370-1374: Replace the bold subsection labels "**Orphan-CRD
behavior (by design):**" and "**Consequence for redeploy on single-tenant
clusters:**" with proper markdown headings (e.g., "### Orphan-CRD behavior (by
design)" and "### Consequence for redeploy on single-tenant clusters") to make
them anchorable and consistent with the rest of the CLI reference; ensure you
pick the same heading level used by neighboring subsections and keep the
original punctuation/phrasing intact while removing the bold markup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a07dcfc9-1eaf-403c-aa58-1ecc31aaaf6f
📒 Files selected for processing (1)
docs/user/cli-reference.md
|
Thanks for the thorough write-up, @yuanchen8911. The orphan-CRD problem you've documented is real, but I think documenting workarounds for Position: we should remove The undeploy script has been accumulating complexity (#602 added pre-flight/post-flight hardening, #661 proposed 350+ lines for orphan-CRD cleanup, and now this PR documents manual recovery steps for gaps that remain). Each iteration reinforces that AICR is taking on CRD lifecycle management — something Helm itself deliberately avoids. Helm doesn't delete CRDs on By shipping What I'd propose instead:
This was discussed in #610 and came up again when #661 was closed. I think it's time to commit to that direction rather than continuing to harden a script that's growing beyond its intended scope. Happy to discuss further — this is a design-level call that affects the whole team, so let's align before we go either way. |
|
Agreed, @mchmarny. Moving I'll open an issue to track the move (plus the CI/e2e sweep that currently depends on bundle |
|
Tracking issue filed: #671. |
Summary
Document that
undeploy.shintentionally leaves behind CRDs not covered by Helm annotation discovery, and give users the manual cleanup recipe required before a clean redeploy on single-tenant clusters.Motivation / Context
undeploy.sh(since #602) only deletes CRDs whosemeta.helm.sh/release-nameandmeta.helm.sh/release-namespaceannotations both match an in-bundle release. That is the correct shared-cluster safety default — but it means CRDs installed via a chart'scrds/subdir, installed by an operator at runtime, or added out-of-band survive undeploy. These orphan CRDs block a subsequentdeploy.shon single-tenant clusters in two concrete ways:helm upgrade --installrefuses to adopt a pre-existing CRD that carries no release annotation.kubectl delete crdfrom ever completing.Today's docs don't mention this at all — the only orphan-related section under
undeploy.shcovers webhooks. Users discovering the failure have no documented recovery path.Fixes: N/A
Related: #602, #661 (closed — superseded by this docs-only clarification after maintainer feedback preferred keeping cleanup out of the generated bundle artifact)
Type of Change
Component(s) Affected
docs/,examples/)Implementation Notes
Adds one new subsection —
**Orphan-CRD behavior (by design):**— under the existingUndeploy Script Behavior (undeploy.sh)block indocs/user/cli-reference.md. The section:<plural>.<group>, handling both cluster-scoped and namespaced CRs via shell parameter expansion (${ns:+-n "${ns}"}).No code changes.
Testing
tools/check-docs-sidebar # OK: all docs files have sidebar entriesDoc-only change — no user-visible behavior, no Go/YAML/CI touched. Per the
CLAUDE.local.md"Doc-only and infra-only verification" guidance, the scoped sidebar check is sufficient here;make qualifyis skipped because test/e2e/Go-lint cannot regress from a single-file markdown addition.Risk Assessment
Rollout notes: N/A.
Checklist
make qualify— see Testing)git commit -S)