Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394
Add multi-node failure, lost transactions, and source-failover guidance to CNF doc.#394mason-sharp merged 4 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new “Two of Three Nodes Fail” recovery scenario and full workflow; expands recovery steps and validation; introduces a “Risk of Lost Transactions” section with mitigation guidance; standardizes phase headings and imperative phrasing; clarifies Changes
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)
282-312: Add a one-line origin-semantics note in the data-loss sectionSince this section discusses origin-filtered recovery, consider explicitly clarifying that local writes are origin
0, while unavailable origin metadata isNULL/unknown—they are not the same state.Based on learnings: origin = 0 means local change; NULL/unknown means origin metadata unavailable and must not be conflated.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 282 - 312, Add a one-line clarification to the "Risk of Lost Transactions" / origin-filtered recovery discussion stating the origin semantics: explicitly note that origin = 0 denotes a local write, whereas unavailable origin metadata is represented as NULL (unknown) and these two states must not be conflated; reference the symbols "origin = 0" and "NULL" in that sentence for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 25-31: The intro now lists three scenarios but nearby text still
uses the phrase "both cases"; update any occurrences of "both cases" in
docs/recovery/catastrophic_node_failure.md to a plural-consistent phrase such as
"these cases" or "any of these scenarios" so the wording matches the three-item
list, and scan surrounding sentences for pronouns that assume only two cases and
adjust them likewise.
- Around line 852-857: Update the failover retry step that instructs rerunning
table-diff and table-repair (mentions table-diff and table-repair and the
example --source-of-truth n4) to require reusing the original recovery flags:
add explicit text requiring --preserve-origin, --recovery-mode and the same
--until cutoff(s) in the command examples and prose so operators know to include
them alongside --source-of-truth to ensure equivalent repairs on retry.
---
Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 282-312: Add a one-line clarification to the "Risk of Lost
Transactions" / origin-filtered recovery discussion stating the origin
semantics: explicitly note that origin = 0 denotes a local write, whereas
unavailable origin metadata is represented as NULL (unknown) and these two
states must not be conflated; reference the symbols "origin = 0" and "NULL" in
that sentence for clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7d89fc73-0b46-4e16-8ba5-07e6c2b1eaf1
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 654-655: Typo fix: add a missing space between the inline code
token and the following word so "`--preserve-origin`clause" becomes
"`--preserve-origin` clause" in the docs; update the text near the occurrence of
`--preserve-origin` to insert the space for correct readability.
- Around line 122-123: The document contains duplicate section headings "###
Implementing a Recovery" which create identical anchors; locate both occurrences
of the heading text "### Implementing a Recovery" and make them unique by
appending a scenario-specific suffix (for example "### Implementing a Recovery —
Catastrophic Node Failure" and "### Implementing a Recovery — <other scenario
name>"), and then update any internal links/anchors that reference those
headings to use the new unique anchor names; ensure the two headings are
distinct and descriptive so anchors are no longer duplicated.
- Around line 127-136: The numbered list in the recovery doc skips from "2." to
"4."; update the markdown so the sequence is continuous by changing the "4."
line (the step starting "When the initial sync completes for each node, validate
with `table-diff`...") to "3." (or insert a proper step 3 if additional content
is needed) so the list reads 1, 2, 3; modify the stanza containing "Clean up
Spock..." / "Rebuild n2 and n3..." / "When the initial sync completes..."
accordingly.
- Line 839: Fix the typo in the markdown heading "Origin Netadata Missing for
Some Rows" by changing "Netadata" to "Metadata" so the heading reads "Origin
Metadata Missing for Some Rows"; update that exact heading string in
docs/recovery/catastrophic_node_failure.md.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 358af2bc-3dfc-4a38-a545-d50a5d6f42ec
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
There was a problem hiding this comment.
♻️ Duplicate comments (1)
docs/recovery/catastrophic_node_failure.md (1)
33-35:⚠️ Potential issue | 🟡 MinorInconsistent plural reference: "both cases" should be "these cases" or "all three scenarios".
Line 33 states "The same idea applies in both cases" but the introduction now lists three scenarios (lines 25-31). Update "both cases" to "these cases" or "all three scenarios" for consistency.
Suggested fix
-The same idea applies in both cases; you identify what is missing on the +The same idea applies in these cases; you identify what is missing on the surviving node(s), then repair the nodes using a node with complete data, preserving the origin ID and timestamp for every repaired row.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 33 - 35, Replace the inconsistent phrase "The same idea applies in both cases" with a plural that matches the three listed scenarios; locate the sentence containing the exact text "The same idea applies in both cases" and update it to "The same idea applies in these cases" or "The same idea applies in all three scenarios" so the pluralization matches the earlier enumeration and preserves the rest of the sentence ("you identify what is missing on the surviving node(s), then repair the nodes using a node with complete data, preserving the origin ID and timestamp for every repaired row").
🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)
296-299: Optional: Strengthen "very small" phrasing.Line 297 states the window "is very small (often sub-second)" - consider revising to be more direct: "is typically sub-second under normal conditions".
Suggested refinement
The size of this window depends on replication lag, network conditions, -and transaction volume. Under normal conditions it is very small (often -sub-second), but it can grow during heavy write bursts, network +and transaction volume. Under normal conditions it is typically sub-second, +but it can grow during heavy write bursts, network disruptions, or long-running transactions on the publisher.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 296 - 299, Replace the phrase "very small (often sub-second)" with a clearer, more direct wording such as "typically sub-second under normal conditions" in the sentence that currently reads "The size of this window depends on replication lag, network conditions, and transaction volume. Under normal conditions it is very small (often sub-second), but..."—update that sentence to use the suggested wording to strengthen clarity and tone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 33-35: Replace the inconsistent phrase "The same idea applies in
both cases" with a plural that matches the three listed scenarios; locate the
sentence containing the exact text "The same idea applies in both cases" and
update it to "The same idea applies in these cases" or "The same idea applies in
all three scenarios" so the pluralization matches the earlier enumeration and
preserves the rest of the sentence ("you identify what is missing on the
surviving node(s), then repair the nodes using a node with complete data,
preserving the origin ID and timestamp for every repaired row").
---
Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 296-299: Replace the phrase "very small (often sub-second)" with a
clearer, more direct wording such as "typically sub-second under normal
conditions" in the sentence that currently reads "The size of this window
depends on replication lag, network conditions, and transaction volume. Under
normal conditions it is very small (often sub-second), but..."—update that
sentence to use the suggested wording to strengthen clarity and tone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 66dac58f-1a42-4493-a268-5992c1bd6957
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
…ce to CNF doc. Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.
Also adding more detail to a paragraph
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/recovery/catastrophic_node_failure.md (1)
430-443:⚠️ Potential issue | 🟡 MinorNested list items should use standard Markdown indentation for portable rendering.
Lines 432 and 439 contain nested numbered list items indented with four spaces under "On each surviving node (n2, n3, n4, and n5):". This indentation level may render as code blocks in some Markdown renderers instead of continuing as list items. Use 2-3 space indentation instead for better cross-renderer compatibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 430 - 443, The nested numbered list lines under "On each surviving node (n2, n3, n4, and n5):" use four-space indentation which can render as code blocks; change the indentation for the nested list items (the lines describing spock.sub_drop() and spock.node_drop()) from four spaces to 2 (or 3) spaces so they are recognized as continued list items, preserving the numbered ordering and keeping the references to spock.sub_drop() and spock.node_drop() intact.
♻️ Duplicate comments (1)
docs/recovery/catastrophic_node_failure.md (1)
867-870:⚠️ Potential issue | 🟠 MajorRe-run instructions must preserve the original recovery envelope.
At Line 867–Line 870, retry guidance only calls out
--source-of-truth. Please explicitly require reusing the original--preserve-origin,--recovery-mode, and the same per-origin--untilcutoff(s), otherwise retries can produce non-equivalent repairs.Suggested doc patch
-3. Pick a new source of truth (e.g., n4 or n5) and re-run `table-diff` - and `table-repair` with `--source-of-truth n4` for any tables that - were not yet fully repaired. Tables that were already repaired and - validated do not need to be re-done. +3. Pick a new source of truth (e.g., n4 or n5) and re-run `table-diff` + and `table-repair` for any tables not yet fully repaired, using the + same recovery options as before (including `--preserve-origin`, + `--recovery-mode`, and the same `--until` cutoff timestamp[s] used per + failed origin), plus `--source-of-truth n4` (or your chosen survivor). + Tables already repaired and validated do not need to be re-done.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 867 - 870, Update the retry instructions for table-diff and table-repair to require reusing the original recovery envelope flags: when re-running with --source-of-truth (e.g., n4), explicitly state that operators must also pass the original --preserve-origin, --recovery-mode, and the same per-origin --until cutoff(s) that were used in the initial repair, so retries produce equivalent repairs and do not change the recovery envelope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 845-849: Update the "Origin Metadata Missing" section to clearly
distinguish local-origin rows (origin = 0) from truly missing origin metadata
(NULL in spock.resolutions.local_origin or 'unknown' in PostgreSQL logs): state
that origin = 0 indicates a local change (not replication) and should not be
treated as metadata loss, while NULL/'unknown' means origin metadata is
unavailable and requires repair; also call out the relevant config/flags to
check (`track_commit_timestamp = on`, diff run with `--preserve-origin`) and
suggest inspecting the diff file and ACE logs for NULL/'unknown' markers rather
than treating origin = 0 as an error.
---
Outside diff comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 430-443: The nested numbered list lines under "On each surviving
node (n2, n3, n4, and n5):" use four-space indentation which can render as code
blocks; change the indentation for the nested list items (the lines describing
spock.sub_drop() and spock.node_drop()) from four spaces to 2 (or 3) spaces so
they are recognized as continued list items, preserving the numbered ordering
and keeping the references to spock.sub_drop() and spock.node_drop() intact.
---
Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 867-870: Update the retry instructions for table-diff and
table-repair to require reusing the original recovery envelope flags: when
re-running with --source-of-truth (e.g., n4), explicitly state that operators
must also pass the original --preserve-origin, --recovery-mode, and the same
per-origin --until cutoff(s) that were used in the initial repair, so retries
produce equivalent repairs and do not change the recovery envelope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ee9367de-264d-4616-9318-92e096da9a78
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/recovery/catastrophic_node_failure.md (1)
872-875: Consider explicitly mentioning all recovery flags in failover guidance.When instructing operators to re-run table-diff and table-repair after a source-of-truth failover, the current text mentions only
--source-of-truth n4. For consistency and to prevent operator errors during incident response, consider explicitly stating that they should also reuse the original recovery flags:--preserve-origin,--recovery-mode, and the same--untilcutoff timestamp(s).While the main procedure (Phases 3 and 4) documents these flags extensively, being explicit in the troubleshooting section reduces the chance of mistakes when operators are working under pressure.
📝 Suggested clarification
3. Pick a new source of truth (e.g., n4 or n5) and re-run `table-diff` and `table-repair` with `--source-of-truth n4` for any tables that - were not yet fully repaired. Tables that were already repaired and - validated do not need to be re-done. + were not yet fully repaired, using the same recovery options as + before (including `--preserve-origin`, `--recovery-mode`, and the + same `--until` cutoff timestamp). Tables that were already repaired + and validated do not need to be re-done.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/recovery/catastrophic_node_failure.md` around lines 872 - 875, Update the recovery guidance so the operator is told to re-run table-diff and table-repair not only with --source-of-truth (e.g., --source-of-truth n4) but also include and reuse the original recovery flags --preserve-origin, --recovery-mode, and the same --until cutoff timestamps; amend the sentence referencing re-running table-diff and table-repair to explicitly list these flags so operators know to carry forward --preserve-origin, --recovery-mode, and --until when performing the failover repair steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 872-875: Update the recovery guidance so the operator is told to
re-run table-diff and table-repair not only with --source-of-truth (e.g.,
--source-of-truth n4) but also include and reuse the original recovery flags
--preserve-origin, --recovery-mode, and the same --until cutoff timestamps;
amend the sentence referencing re-running table-diff and table-repair to
explicitly list these flags so operators know to carry forward
--preserve-origin, --recovery-mode, and --until when performing the failover
repair steps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4292c751-f3c8-4089-8f0d-6aace7a087a8
📒 Files selected for processing (1)
docs/recovery/catastrophic_node_failure.md
Add a new scenario for two-of-three-nodes-fail recovery via ZODAN rebuild, a section on the risk of unreplicated transactions due to asynchronous logical replication, and stronger troubleshooting steps when the source node fails mid-replay.
SPOC-141