Skip to content

ENG-2756: Self-healing DSR watchdog — requeue all no-subtask_id cases through retry mechanism#7684

Open
JadeCara wants to merge 15 commits intomainfrom
ENG-2756-fix-watchdog-false-positive-pending-tasks-3
Open

ENG-2756: Self-healing DSR watchdog — requeue all no-subtask_id cases through retry mechanism#7684
JadeCara wants to merge 15 commits intomainfrom
ENG-2756-fix-watchdog-false-positive-pending-tasks-3

Conversation

@JadeCara
Copy link
Contributor

@JadeCara JadeCara commented Mar 17, 2026

Ticket ENG-2756

Description Of Changes

Reviewer note: This PR looks large (~900 additions) but most of it is noise:

  • specs/DsrWatchdog.tla (+443 lines) — TLA+ formal spec, not production code
  • tests/task/test_requeue_interrupted_tasks.py — refactored for readability (factory fixtures, constants, removed try/finally blocks); the actual behavioral changes are 4 tests out of 22

The real change is ~55 lines in request_service.py.

Extends the DSR watchdog fix to be fully self-healing. Previously, in_processing tasks with no cache key (the Celery task started and updated the DB status, but the cache key is missing due to Redis eviction or worker crash) were hard-cancelled, bypassing _handle_privacy_request_requeue entirely.

Now all no-subtask_id cases route through _handle_privacy_request_requeue, which applies the configured retry limit (privacy_request_requeue_retry_count, default 3). Cancellation only fires when retries are exhausted — never on first encounter.

Three scenarios are now handled uniformly:

  • pending + no upstream (root task never dispatched) → requeue
  • pending + upstream complete (parent died before dispatch) → requeue
  • in_processing + no cache key (cache eviction / worker crash) → requeue

Also fixes a bug introduced in the previous commit: _has_async_tasks_awaiting_external_completion was evaluated after the pending+not_awaiting_upstream early-return, meaning pending tasks with async_type=callback/polling were incorrectly requeued instead of being left alone. The async guard is now evaluated before the dispatch decision and covers both pending and in_processing tasks.

_has_async_tasks_awaiting_external_completion is now wrapped in a try/except — a transient DB error skips the current PR (fail safe) instead of aborting the whole watchdog pass.

The fix is verified with a TLA+ formal spec (specs/DsrWatchdog.tla). The new invariant CancelImpliesRetriesExhausted (pr_status = "error" => retry_count >= MaxRetries) passes on 74 states with MaxRetries=2. The buggy spec fails in 3 states (cancel fires at retry_count=0).

Code Changes

  • src/fides/api/service/privacy_request/request_service.py — Reordered if not subtask_id guards: async check now before dispatch decision; in_processing+no_cache sets should_requeue=True instead of calling _cancel_interrupted_tasks_and_error_privacy_request; async check wrapped in try/except (fail safe on DB error)
  • tests/task/test_requeue_interrupted_tasks.py — Refactored for readability (factory fixtures, mock path constants, removed try/finally); updated 2 tests (cancel→requeue for in_processing+no_cache); added tests: over_retry_limit_is_canceled, in_processing+no_cache+async_task_is_not_canceled, async_check_db_error_skips_request; 22 tests total
  • specs/DsrWatchdog.tla — Updated: retry_count/MaxRetries state variable + constant; WatchdogDecisionFixed no longer has a direct cancel path; new CancelImpliesRetriesExhausted invariant replaces CancelImpliesGenuinelyStuck
  • specs/DsrWatchdog.cfg / specs/DsrWatchdog_buggy.cfg — Updated configs with MaxRetries=2 constant

Steps to Confirm

  1. Run tests/task/test_requeue_interrupted_tasks.py — all 22 tests should pass
  2. Verify TLA+ spec: java -XX:+UseParallelGC -jar ~/tla2tools.jar -deadlock -config specs/DsrWatchdog.cfg specs/DsrWatchdog.tla should report "No error has been found" (74 states)
  3. To reproduce the fixed behavior: create a privacy request, start a request task (set status to in_processing), delete its cache key from Redis, trigger poll_for_exited_privacy_request_tasks — the request should be requeued (logged as "requeueing"), not cancelled

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration This indicates that a change includes a database migration label to the entry if your change includes a DB migration
    • Add a high-risk This issue suggests changes that have a high-probability of breaking existing code label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • All UX related changes have been reviewed by a designer
    • No UX review needed
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

Jade Wibbels and others added 4 commits March 17, 2026 16:36
…omplete

The ENG-2756 fix protected pending tasks awaiting upstream deps from
being incorrectly cancelled. This extends the fix to also protect
pending tasks where upstream IS complete but the task was never
dispatched — the parent Celery task died between upstream completing
and this task being queued.

A pending task with no subtask ID has never been dispatched to Celery
regardless of upstream status. Only in_processing tasks with no cache
key are genuinely stuck and warrant cancellation.

Also adds a TLA+ spec (specs/DsrWatchdog.tla) that formally verifies
the corrected watchdog decision logic across all interleavings of
parent task death, task dispatch, and watchdog execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y mechanism

Previously, in_processing tasks with no cache key were hard-cancelled, bypassing
_handle_privacy_request_requeue entirely. Now ALL no-subtask_id cases (both
pending+upstream_complete and in_processing+no_cache) set should_requeue=True and
go through the retry mechanism. Cancellation only fires when retry_count exceeds
privacy_request_requeue_retry_count (default 3).

Also fixes a bug introduced in the previous commit: the async-tasks check
(_has_async_tasks_awaiting_external_completion) was placed after the
pending+not_awaiting_upstream early-return, meaning pending tasks with async_type
callback/polling were incorrectly requeued. The check is now evaluated before the
pending dispatch decision.

TLA+ spec updated: WatchdogDecisionFixed no longer has a direct cancel path.
New invariant CancelImpliesRetriesExhausted replaces CancelImpliesGenuinelyStuck —
cancellation requires retry_count >= MaxRetries. Fixed spec: 74 states, all pass.
Buggy spec: fails in 3 states (cancel at retry_count=0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update comment to list all three no-subtask_id scenarios (root task
  with no upstream was missing from the two listed)
- Add test: in_processing+no_cache+async_task_present → skipped (the
  async guard now applies to in_processing tasks too, not just pending)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vercel
Copy link
Contributor

vercel bot commented Mar 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
fides-plus-nightly Ready Ready Preview, Comment Mar 20, 2026 0:02am
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
fides-privacy-center Ignored Ignored Mar 20, 2026 0:02am

Request Review

Jade Wibbels and others added 2 commits March 17, 2026 17:39
- Add make_privacy_request and make_request_task factory fixtures —
  handle creation and teardown, eliminating all inline try/finally blocks
- Add _CANCEL / _REQUEUE / _QUEUE / _IN_FLIGHT constants for mock paths
- Move increment/reset/get_privacy_request_retry_count imports to top level
- Collapse two identical async-type tests into one parametrized test
- Merge TestEnhancedRequeueInterruptedTasks into TestRequeueInterruptedTasks
- Remove fragile test_requeue_with_retry_count_logging (tested old branch,
  asserted on log message text)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JadeCara JadeCara marked this pull request as ready for review March 18, 2026 00:44
@JadeCara JadeCara requested a review from a team as a code owner March 18, 2026 00:44
@JadeCara JadeCara requested review from johnewart and removed request for a team March 18, 2026 00:44
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR completes the DSR watchdog's self-healing behaviour by routing all no-subtask_id cases (pending+upstream_complete, pending+no_upstream, and in_processing+no_cache) through _handle_privacy_request_requeue instead of the previous hard-cancel path. Cancellation now only fires once the configured retry limit (privacy_request_requeue_retry_count, default 3) is exhausted. The async guard (_has_async_tasks_awaiting_external_completion) is now also wrapped in a try/except so a transient DB error fails safe by skipping the request rather than aborting the whole watchdog pass.

  • The core service change is ~55 lines and well-reasoned; the remaining ~845 added lines are TLA+ formal spec and test refactoring.
  • The TLA+ spec with CancelImpliesRetriesExhausted is a strong correctness guarantee and the buggy config doubles as a regression test.
  • Test refactoring introduces factory fixtures and centralised mock-path constants, making the 22 tests considerably easier to read. Four new tests and two updated tests correctly cover the changed behaviour.
  • One style violation: pr (2 chars) is used as the local variable name throughout the new fixtures and tests, contrary to the team convention of using full variable names (rule violation also seen at make_privacy_request fixture teardown and every call site).
  • The PR exceeds 500 line changes; however, as the PR description clearly explains, the bulk is non-production TLA+ spec and test refactoring, so this is acceptable with the justification provided.

Confidence Score: 4/5

  • Safe to merge after addressing the pr variable naming style issue; production logic is correct and formally verified.
  • The production code change is small, well-reasoned, and backed by both a comprehensive test suite (22 tests, all scenarios covered) and a TLA+ formal proof. No logic bugs were found. The only issue is a style convention violation — the new test code uses pr (2 chars) as a variable name pervasively, against the team's full-name convention. This does not affect correctness but is a consistent rule violation worth fixing before merge.
  • tests/task/test_requeue_interrupted_tasks.py — pr short variable name used throughout new fixtures and tests.

Important Files Changed

Filename Overview
src/fides/api/service/privacy_request/request_service.py Core watchdog change (~55 lines): routes all no-subtask_id cases (pending+upstream_complete, pending+no_upstream, in_processing+no_cache) through _handle_privacy_request_requeue instead of directly calling cancel. Async guard moved before the dispatch decision and wrapped in try/except for fail-safe behaviour. Logic is sound, well-commented, and verified by TLA+ spec.
tests/task/test_requeue_interrupted_tasks.py Refactored for readability using factory fixtures and centralised mock-path constants; two tests changed (cancel→requeue for in_processing+no_cache); four new tests added. Minor style violation: the pr variable name (2 chars) is used pervasively throughout new test code and the fixture, contrary to the team convention requiring full variable names.
specs/DsrWatchdog.tla New TLA+ formal spec modelling the fixed watchdog logic. Includes both buggy and fixed variants, the CancelImpliesRetriesExhausted safety invariant, and an EventualResolution liveness property. Well-documented and complete.

Last reviewed commit: "Merge branch 'main' ..."

- Fix stale test docstring (said "skips" but cancel is asserted)
- Distinguish root task vs upstream-complete log messages for pending tasks with no subtask_id

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JadeCara
Copy link
Contributor Author

@greptile please review

/code-review

The pending+awaiting_upstream=True case is already handled by the
`continue` guard above, so awaiting_upstream is always False at the
log site. Collapse to a single message with an explanatory comment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JadeCara
Copy link
Contributor Author

@greptile please review

@JadeCara
Copy link
Contributor Author

/code-review

- Wrap _has_async_tasks_awaiting_external_completion in try/except;
  a transient DB error now skips the PR (fail safe) instead of
  aborting the whole watchdog pass
- Add test: test_async_check_db_error_skips_request
- Remove duplicate test_in_processing_request_task_with_no_cached_subtask_id_is_requeued
  (identical to test_in_processing_task_with_no_cache_key_is_requeued)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JadeCara JadeCara requested a review from galvana March 18, 2026 15:59
@JadeCara
Copy link
Contributor Author

@greptile please review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants