Skip to content

docs: clarify undeploy.sh orphan-CRD behavior and manual cleanup#669

Closed
yuanchen8911 wants to merge 2 commits intoNVIDIA:mainfrom
yuanchen8911:docs/clarify-undeploy-orphan-crd-behavior
Closed

docs: clarify undeploy.sh orphan-CRD behavior and manual cleanup#669
yuanchen8911 wants to merge 2 commits intoNVIDIA:mainfrom
yuanchen8911:docs/clarify-undeploy-orphan-crd-behavior

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Document that undeploy.sh intentionally leaves behind CRDs not covered by Helm annotation discovery, and give users the manual cleanup recipe required before a clean redeploy on single-tenant clusters.

Motivation / Context

undeploy.sh (since #602) only deletes CRDs whose meta.helm.sh/release-name and meta.helm.sh/release-namespace annotations both match an in-bundle release. That is the correct shared-cluster safety default — but it means CRDs installed via a chart's crds/ subdir, installed by an operator at runtime, or added out-of-band survive undeploy. These orphan CRDs block a subsequent deploy.sh on single-tenant clusters in two concrete ways:

  1. helm upgrade --install refuses to adopt a pre-existing CRD that carries no release annotation.
  2. A remaining CR with an unreconciled finalizer (its controller is already uninstalled) prevents kubectl delete crd from ever completing.

Today's docs don't mention this at all — the only orphan-related section under undeploy.sh covers webhooks. Users discovering the failure have no documented recovery path.

Fixes: N/A
Related: #602, #661 (closed — superseded by this docs-only clarification after maintainer feedback preferred keeping cleanup out of the generated bundle artifact)

Type of Change

  • Documentation update

Component(s) Affected

  • Docs/examples (docs/, examples/)

Implementation Notes

Adds one new subsection — **Orphan-CRD behavior (by design):** — under the existing Undeploy Script Behavior (undeploy.sh) block in docs/user/cli-reference.md. The section:

  • States why these CRDs are left behind and ties the decision back to shared-cluster safety.
  • Names the two redeploy failure modes (helm import refusal, CRD delete hang on finalizers).
  • Provides the minimal finalizer-clear + CRD-delete recipe, scoped per <plural>.<group>, handling both cluster-scoped and namespaced CRs via shell parameter expansion (${ns:+-n "${ns}"}).
  • Explicitly warns against running the recipe on shared/multi-tenant clusters, since the manual steps do not re-check Helm release/namespace ownership.

No code changes.

Testing

tools/check-docs-sidebar   # OK: all docs files have sidebar entries

Doc-only change — no user-visible behavior, no Go/YAML/CI touched. Per the CLAUDE.local.md "Doc-only and infra-only verification" guidance, the scoped sidebar check is sufficient here; make qualify is skipped because test/e2e/Go-lint cannot regress from a single-file markdown addition.

Risk Assessment

  • Low — Documentation addition, no behavior change, trivially reversible.

Rollout notes: N/A.

Checklist

  • Tests pass locally (doc-only change; scoped sidebar check run instead of full make qualify — see Testing)
  • Linter passes (N/A for markdown-only change; no YAML/Go touched)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A)
  • I updated docs if user-facing behavior changed (this PR is the doc update)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

undeploy.sh intentionally skips CRDs that don't carry helm release
annotations (chart crds/ subdir, operator-runtime-installed, or
out-of-band). That is the documented shared-cluster safety default, but
nothing in the docs told users these CRDs survive undeploy or that a
clean redeploy on a single-tenant cluster needs manual CRD cleanup.

Add an "Orphan-CRD behavior (by design)" subsection under the
undeploy.sh section that:
- Calls out why these CRDs are intentionally left behind
- Names the two redeploy failure modes (helm import refusal,
  CRD delete hanging on unreconciled finalizers)
- Gives the manual finalizer-clear + CRD-delete recipe for
  single-tenant clusters, with a warning against running it on
  shared clusters
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Walkthrough

The CLI reference docs for undeploy.sh were expanded to precisely define CRD cleanup semantics: CRDs are deleted only when both Helm release name and namespace annotations match an in-bundle release. CRDs installed via chart crds/, runtime operators, or otherwise out-of-band are preserved unless their exact CRD name is listed in the bundle’s extra_crds_for_release. The docs also describe single-tenant redeploy effects (Helm refusing to adopt unannotated CRDs and finalizer-blocked kubectl delete crd) and add a manual procedure to clear CR finalizers before deleting CRDs, plus a multi-tenant warning.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: documenting orphan-CRD behavior and manual cleanup procedures in undeploy.sh, which aligns with the primary objective of this doc-only PR.
Description check ✅ Passed The description is comprehensive and directly related to the changeset. It explains the motivation (CRDs left behind by undeploy.sh design), the two failure modes blocking redeploy, and the manual recovery procedure documented in the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/cli-reference.md`:
- Around line 1370-1374: Replace the bold labels in this subsection with proper
markdown headings: change "**Orphan-CRD behavior (by design):**" into a heading
(e.g., "### Orphan-CRD behavior (by design)"), and change "**Consequence for
redeploy on single-tenant clusters:**" into a heading (e.g., "### Consequence
for redeploy on single-tenant clusters"); ensure the subsequent paragraphs
remain intact and update any heading level to match surrounding document
structure for navigability and consistency.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2ec59ac6-aae7-4f1c-a586-ea2b2e60270a

📥 Commits

Reviewing files that changed from the base of the PR and between 4e158cf and 8735e89.

📒 Files selected for processing (1)
  • docs/user/cli-reference.md

Comment thread docs/user/cli-reference.md
…ight claim

Two issues Codex caught on PR NVIDIA#669:

1. The finalizer-clear loop silently skipped cluster-scoped CRs. jq
   emitted namespace first (empty for cluster-scoped), so `read -r ns name`
   received a leading space, assigned the CR name to `ns`, left `name`
   empty, and the `[[ -z "${name}" ]] && continue` guard dropped the
   iteration. CRDs like `clusterpolicies.nvidia.com` and
   `clustertrainingruntimes.trainer.kubeflow.org` would have kept hanging
   on finalizers even after running the documented recipe. Emit name
   first so `read -r name ns` binds correctly in both scopes.

2. The doc overstated post-flight coverage. undeploy.sh only surfaces
   CRDs that carry an in-bundle Helm release annotation, or whose exact
   name is in the script's `extra_crds_for_release` list — arbitrary
   out-of-band CRDs are not warned about. Narrow the claim so operators
   don't treat the post-flight warning as a complete orphan inventory.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
docs/user/cli-reference.md (1)

1370-1374: ⚠️ Potential issue | 🟡 Minor

Convert these bold labels into proper markdown headings.

Line 1370 and Line 1374 are substantial subsections and should use heading syntax for anchorability and navigation consistency.

As per coding guidelines: “Promote bold labels to headings only when the topic has substantial content (≥ ~8 lines).”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user/cli-reference.md` around lines 1370 - 1374, Replace the bold
subsection labels "**Orphan-CRD behavior (by design):**" and "**Consequence for
redeploy on single-tenant clusters:**" with proper markdown headings (e.g., "###
Orphan-CRD behavior (by design)" and "### Consequence for redeploy on
single-tenant clusters") to make them anchorable and consistent with the rest of
the CLI reference; ensure you pick the same heading level used by neighboring
subsections and keep the original punctuation/phrasing intact while removing the
bold markup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/user/cli-reference.md`:
- Around line 1370-1374: Replace the bold subsection labels "**Orphan-CRD
behavior (by design):**" and "**Consequence for redeploy on single-tenant
clusters:**" with proper markdown headings (e.g., "### Orphan-CRD behavior (by
design)" and "### Consequence for redeploy on single-tenant clusters") to make
them anchorable and consistent with the rest of the CLI reference; ensure you
pick the same heading level used by neighboring subsections and keep the
original punctuation/phrasing intact while removing the bold markup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a07dcfc9-1eaf-403c-aa58-1ecc31aaaf6f

📥 Commits

Reviewing files that changed from the base of the PR and between 8735e89 and 7e34fb1.

📒 Files selected for processing (1)
  • docs/user/cli-reference.md

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 24, 2026 14:42
@mchmarny
Copy link
Copy Markdown
Member

Thanks for the thorough write-up, @yuanchen8911. The orphan-CRD problem you've documented is real, but I think documenting workarounds for undeploy.sh moves us in the wrong direction. I'd like to step back and revisit the broader question.

Position: we should remove undeploy.sh from the generated bundle entirely.

The undeploy script has been accumulating complexity (#602 added pre-flight/post-flight hardening, #661 proposed 350+ lines for orphan-CRD cleanup, and now this PR documents manual recovery steps for gaps that remain). Each iteration reinforces that AICR is taking on CRD lifecycle management — something Helm itself deliberately avoids.

Helm doesn't delete CRDs on helm uninstall for a specific reason: deleting a CRD triggers a cascading delete of every CR of that type cluster-wide. That's an irreversible, destructive operation that can wipe another tenant's workloads, or even the operator's own data resources in a shared namespace. This is exactly the kind of blast-radius problem that deployers (Argo CD, Flux, Helmfile) are purpose-built to manage — they understand ownership, sync state, and rollback semantics in ways a generated shell script cannot replicate.

By shipping undeploy.sh as part of the bundle artifact, we're implicitly taking on responsibility for safe teardown — including CRD lifecycle, finalizer resolution, and multi-tenancy safety — without the machinery to do it correctly. Documenting the gaps (as this PR does) makes that responsibility more visible, but it doesn't close the gaps. Users following the manual finalizer-clearing recipe on the wrong cluster is a real risk.

What I'd propose instead:

  • Remove undeploy.sh from the generated bundle. Teardown is the deployer's job.
  • For internal dev workflows (Kind, KWOK, CI evidence runs), move cleanup logic to tools/ where it's clearly scoped as a developer utility, not a shipped artifact.
  • If we need to document anything in user-facing docs, it should be guidance on how to use the deployer's native uninstall capabilities (e.g., Argo CD app deletion, Flux garbage collection).

This was discussed in #610 and came up again when #661 was closed. I think it's time to commit to that direction rather than continuing to harden a script that's growing beyond its intended scope.

Happy to discuss further — this is a design-level call that affects the whole team, so let's align before we go either way.

@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Agreed, @mchmarny. Moving undeploy.sh to tools/ for dev use and dropping it from the generated bundle is the right direction — aligns with the #610 consensus and addresses the blast-radius concern you raised. Closed the PR.

I'll open an issue to track the move (plus the CI/e2e sweep that currently depends on bundle undeploy.sh), and a separate one to document deployer-native uninstall (Argo CD app deletion, Flux GC) for users. Worth a quick team alignment on whether deploy.sh should follow the same path — happy to defer that decision to the tracking issue.

@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Tracking issue filed: #671.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants