Skip to content

Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394

Merged
mason-sharp merged 4 commits intomainfrom
DOC-ACE
Mar 24, 2026
Merged

Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394
mason-sharp merged 4 commits intomainfrom
DOC-ACE

Conversation

@ibrarahmad
Copy link
Contributor

@ibrarahmad ibrarahmad commented Mar 17, 2026

Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.

SPOC-141

@ibrarahmad ibrarahmad requested a review from mason-sharp March 17, 2026 15:55
@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new “Two of Three Nodes Fail” recovery scenario and full workflow; expands recovery steps and validation; introduces a “Risk of Lost Transactions” section with mitigation guidance; standardizes phase headings and imperative phrasing; clarifies --preserve-origin behavior and updates diagrams and prose.

Changes

Cohort / File(s) Summary
Disaster Recovery Documentation
docs/recovery/catastrophic_node_failure.md
Added a third scenario (“Two of Three Nodes Fail”) and a dedicated recovery procedure; added Spock cleanup sequencing and rebuilt-node workflow using ZODAN add_node; introduced “Risk of Lost Transactions” with mitigation options and RPO guidance; standardized phase headings and imperative phrasing; clarified --preserve-origin semantics (origin ID and commit timestamp precision); converted checklist-style items to sentence-form validation steps; updated diagrams and contextual wording.

Poem

🐰 Beneath the cluster's midnight hum I peered,

Two went quiet, one brave node persevered.
I scrubbed Spock traces, then stitched each lane,
Re-added nodes, kept origins' tiny grain.
A hop, a diff — the garden grows again.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.' accurately and specifically summarizes the three main changes added in this PR: a new multi-node failure scenario, lost transactions documentation, and source-failover guidance.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The pull request description directly addresses the changeset by outlining three specific additions: a new two-of-three-nodes recovery scenario, a section on transaction replication risks, and improved troubleshooting guidance.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch DOC-ACE

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)

282-312: Add a one-line origin-semantics note in the data-loss section

Since this section discusses origin-filtered recovery, consider explicitly clarifying that local writes are origin 0, while unavailable origin metadata is NULL/unknown—they are not the same state.

Based on learnings: origin = 0 means local change; NULL/unknown means origin metadata unavailable and must not be conflated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 282 - 312, Add a
one-line clarification to the "Risk of Lost Transactions" / origin-filtered
recovery discussion stating the origin semantics: explicitly note that origin =
0 denotes a local write, whereas unavailable origin metadata is represented as
NULL (unknown) and these two states must not be conflated; reference the symbols
"origin = 0" and "NULL" in that sentence for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 25-31: The intro now lists three scenarios but nearby text still
uses the phrase "both cases"; update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.
- Around line 852-857: Update the failover retry step that instructs rerunning
table-diff and table-repair (mentions table-diff and table-repair and the
example --source-of-truth n4) to require reusing the original recovery flags:
add explicit text requiring --preserve-origin, --recovery-mode and the same
--until cutoff(s) in the command examples and prose so operators know to include
them alongside --source-of-truth to ensure equivalent repairs on retry.

---

Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 282-312: Add a one-line clarification to the "Risk of Lost
Transactions" / origin-filtered recovery discussion stating the origin
semantics: explicitly note that origin = 0 denotes a local write, whereas
unavailable origin metadata is represented as NULL (unknown) and these two
states must not be conflated; reference the symbols "origin = 0" and "NULL" in
that sentence for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d89fc73-0b46-4e16-8ba5-07e6c2b1eaf1

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae4205 and e5ccf42.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 654-655: Typo fix: add a missing space between the inline code
token and the following word so "`--preserve-origin`clause" becomes
"`--preserve-origin` clause" in the docs; update the text near the occurrence of
`--preserve-origin` to insert the space for correct readability.
- Around line 122-123: The document contains duplicate section headings "###
Implementing a Recovery" which create identical anchors; locate both occurrences
of the heading text "### Implementing a Recovery" and make them unique by
appending a scenario-specific suffix (for example "### Implementing a Recovery —
Catastrophic Node Failure" and "### Implementing a Recovery — <other scenario
name>"), and then update any internal links/anchors that reference those
headings to use the new unique anchor names; ensure the two headings are
distinct and descriptive so anchors are no longer duplicated.
- Around line 127-136: The numbered list in the recovery doc skips from "2." to
"4."; update the markdown so the sequence is continuous by changing the "4."
line (the step starting "When the initial sync completes for each node, validate
with `table-diff`...") to "3." (or insert a proper step 3 if additional content
is needed) so the list reads 1, 2, 3; modify the stanza containing "Clean up
Spock..." / "Rebuild n2 and n3..." / "When the initial sync completes..."
accordingly.
- Line 839: Fix the typo in the markdown heading "Origin Netadata Missing for
Some Rows" by changing "Netadata" to "Metadata" so the heading reads "Origin
Metadata Missing for Some Rows"; update that exact heading string in
docs/recovery/catastrophic_node_failure.md.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 358af2bc-3dfc-4a38-a545-d50a5d6f42ec

📥 Commits

Reviewing files that changed from the base of the PR and between e5ccf42 and 8bf4b33.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
docs/recovery/catastrophic_node_failure.md (1)

33-35: ⚠️ Potential issue | 🟡 Minor

Inconsistent plural reference: "both cases" should be "these cases" or "all three scenarios".

Line 33 states "The same idea applies in both cases" but the introduction now lists three scenarios (lines 25-31). Update "both cases" to "these cases" or "all three scenarios" for consistency.

Suggested fix
-The same idea applies in both cases; you identify what is missing on the
+The same idea applies in these cases; you identify what is missing on the
 surviving node(s), then repair the nodes using a node with complete data, 
 preserving the origin ID and timestamp for every repaired row.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 33 - 35, Replace the
inconsistent phrase "The same idea applies in both cases" with a plural that
matches the three listed scenarios; locate the sentence containing the exact
text "The same idea applies in both cases" and update it to "The same idea
applies in these cases" or "The same idea applies in all three scenarios" so the
pluralization matches the earlier enumeration and preserves the rest of the
sentence ("you identify what is missing on the surviving node(s), then repair
the nodes using a node with complete data, preserving the origin ID and
timestamp for every repaired row").
🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)

296-299: Optional: Strengthen "very small" phrasing.

Line 297 states the window "is very small (often sub-second)" - consider revising to be more direct: "is typically sub-second under normal conditions".

Suggested refinement
 The size of this window depends on replication lag, network conditions,
-and transaction volume. Under normal conditions it is very small (often
-sub-second), but it can grow during heavy write bursts, network
+and transaction volume. Under normal conditions it is typically sub-second,
+but it can grow during heavy write bursts, network
 disruptions, or long-running transactions on the publisher.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 296 - 299, Replace
the phrase "very small (often sub-second)" with a clearer, more direct wording
such as "typically sub-second under normal conditions" in the sentence that
currently reads "The size of this window depends on replication lag, network
conditions, and transaction volume. Under normal conditions it is very small
(often sub-second), but..."—update that sentence to use the suggested wording to
strengthen clarity and tone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 33-35: Replace the inconsistent phrase "The same idea applies in
both cases" with a plural that matches the three listed scenarios; locate the
sentence containing the exact text "The same idea applies in both cases" and
update it to "The same idea applies in these cases" or "The same idea applies in
all three scenarios" so the pluralization matches the earlier enumeration and
preserves the rest of the sentence ("you identify what is missing on the
surviving node(s), then repair the nodes using a node with complete data,
preserving the origin ID and timestamp for every repaired row").

---

Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 296-299: Replace the phrase "very small (often sub-second)" with a
clearer, more direct wording such as "typically sub-second under normal
conditions" in the sentence that currently reads "The size of this window
depends on replication lag, network conditions, and transaction volume. Under
normal conditions it is very small (often sub-second), but..."—update that
sentence to use the suggested wording to strengthen clarity and tone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 66dac58f-1a42-4493-a268-5992c1bd6957

📥 Commits

Reviewing files that changed from the base of the PR and between 8bf4b33 and f6a4c42.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

Ibrar Ahmed and others added 3 commits March 24, 2026 09:53
…ce to CNF doc.

Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild,
a section on the risk of unreplicated transactions due to asynchronous
logical replication, and stronger troubleshooting steps when the source
node fails mid-replay.
Also adding more detail to a paragraph
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/recovery/catastrophic_node_failure.md (1)

430-443: ⚠️ Potential issue | 🟡 Minor

Nested list items should use standard Markdown indentation for portable rendering.

Lines 432 and 439 contain nested numbered list items indented with four spaces under "On each surviving node (n2, n3, n4, and n5):". This indentation level may render as code blocks in some Markdown renderers instead of continuing as list items. Use 2-3 space indentation instead for better cross-renderer compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 430 - 443, The
nested numbered list lines under "On each surviving node (n2, n3, n4, and n5):"
use four-space indentation which can render as code blocks; change the
indentation for the nested list items (the lines describing spock.sub_drop() and
spock.node_drop()) from four spaces to 2 (or 3) spaces so they are recognized as
continued list items, preserving the numbered ordering and keeping the
references to spock.sub_drop() and spock.node_drop() intact.
♻️ Duplicate comments (1)
docs/recovery/catastrophic_node_failure.md (1)

867-870: ⚠️ Potential issue | 🟠 Major

Re-run instructions must preserve the original recovery envelope.

At Line 867–Line 870, retry guidance only calls out --source-of-truth. Please explicitly require reusing the original --preserve-origin, --recovery-mode, and the same per-origin --until cutoff(s), otherwise retries can produce non-equivalent repairs.

Suggested doc patch
-3. Pick a new source of truth (e.g., n4 or n5) and re-run `table-diff`
-   and `table-repair` with `--source-of-truth n4` for any tables that
-   were not yet fully repaired. Tables that were already repaired and
-   validated do not need to be re-done.
+3. Pick a new source of truth (e.g., n4 or n5) and re-run `table-diff`
+   and `table-repair` for any tables not yet fully repaired, using the
+   same recovery options as before (including `--preserve-origin`,
+   `--recovery-mode`, and the same `--until` cutoff timestamp[s] used per
+   failed origin), plus `--source-of-truth n4` (or your chosen survivor).
+   Tables already repaired and validated do not need to be re-done.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 867 - 870, Update
the retry instructions for table-diff and table-repair to require reusing the
original recovery envelope flags: when re-running with --source-of-truth (e.g.,
n4), explicitly state that operators must also pass the original
--preserve-origin, --recovery-mode, and the same per-origin --until cutoff(s)
that were used in the initial repair, so retries produce equivalent repairs and
do not change the recovery envelope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 845-849: Update the "Origin Metadata Missing" section to clearly
distinguish local-origin rows (origin = 0) from truly missing origin metadata
(NULL in spock.resolutions.local_origin or 'unknown' in PostgreSQL logs): state
that origin = 0 indicates a local change (not replication) and should not be
treated as metadata loss, while NULL/'unknown' means origin metadata is
unavailable and requires repair; also call out the relevant config/flags to
check (`track_commit_timestamp = on`, diff run with `--preserve-origin`) and
suggest inspecting the diff file and ACE logs for NULL/'unknown' markers rather
than treating origin = 0 as an error.

---

Outside diff comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 430-443: The nested numbered list lines under "On each surviving
node (n2, n3, n4, and n5):" use four-space indentation which can render as code
blocks; change the indentation for the nested list items (the lines describing
spock.sub_drop() and spock.node_drop()) from four spaces to 2 (or 3) spaces so
they are recognized as continued list items, preserving the numbered ordering
and keeping the references to spock.sub_drop() and spock.node_drop() intact.

---

Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 867-870: Update the retry instructions for table-diff and
table-repair to require reusing the original recovery envelope flags: when
re-running with --source-of-truth (e.g., n4), explicitly state that operators
must also pass the original --preserve-origin, --recovery-mode, and the same
per-origin --until cutoff(s) that were used in the initial repair, so retries
produce equivalent repairs and do not change the recovery envelope.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee9367de-264d-4616-9318-92e096da9a78

📥 Commits

Reviewing files that changed from the base of the PR and between f6a4c42 and 23b6643.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)

872-875: Consider explicitly mentioning all recovery flags in failover guidance.

When instructing operators to re-run table-diff and table-repair after a source-of-truth failover, the current text mentions only --source-of-truth n4. For consistency and to prevent operator errors during incident response, consider explicitly stating that they should also reuse the original recovery flags: --preserve-origin, --recovery-mode, and the same --until cutoff timestamp(s).

While the main procedure (Phases 3 and 4) documents these flags extensively, being explicit in the troubleshooting section reduces the chance of mistakes when operators are working under pressure.

📝 Suggested clarification
 3. Pick a new source of truth (e.g., n4 or n5) and re-run `table-diff`
    and `table-repair` with `--source-of-truth n4` for any tables that
-   were not yet fully repaired. Tables that were already repaired and
-   validated do not need to be re-done.
+   were not yet fully repaired, using the same recovery options as
+   before (including `--preserve-origin`, `--recovery-mode`, and the
+   same `--until` cutoff timestamp). Tables that were already repaired
+   and validated do not need to be re-done.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 872 - 875, Update
the recovery guidance so the operator is told to re-run table-diff and
table-repair not only with --source-of-truth (e.g., --source-of-truth n4) but
also include and reuse the original recovery flags --preserve-origin,
--recovery-mode, and the same --until cutoff timestamps; amend the sentence
referencing re-running table-diff and table-repair to explicitly list these
flags so operators know to carry forward --preserve-origin, --recovery-mode, and
--until when performing the failover repair steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 872-875: Update the recovery guidance so the operator is told to
re-run table-diff and table-repair not only with --source-of-truth (e.g.,
--source-of-truth n4) but also include and reuse the original recovery flags
--preserve-origin, --recovery-mode, and the same --until cutoff timestamps;
amend the sentence referencing re-running table-diff and table-repair to
explicitly list these flags so operators know to carry forward
--preserve-origin, --recovery-mode, and --until when performing the failover
repair steps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4292c751-f3c8-4089-8f0d-6aace7a087a8

📥 Commits

Reviewing files that changed from the base of the PR and between 23b6643 and 8fb83b3.

📒 Files selected for processing (1)
  • docs/recovery/catastrophic_node_failure.md

@mason-sharp mason-sharp merged commit a116148 into main Mar 24, 2026
10 checks passed
@mason-sharp mason-sharp deleted the DOC-ACE branch March 24, 2026 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants