ENG-2756: Self-healing DSR watchdog — requeue all no-subtask_id cases through retry mechanism#7684
ENG-2756: Self-healing DSR watchdog — requeue all no-subtask_id cases through retry mechanism#7684
Conversation
…omplete The ENG-2756 fix protected pending tasks awaiting upstream deps from being incorrectly cancelled. This extends the fix to also protect pending tasks where upstream IS complete but the task was never dispatched — the parent Celery task died between upstream completing and this task being queued. A pending task with no subtask ID has never been dispatched to Celery regardless of upstream status. Only in_processing tasks with no cache key are genuinely stuck and warrant cancellation. Also adds a TLA+ spec (specs/DsrWatchdog.tla) that formally verifies the corrected watchdog decision logic across all interleavings of parent task death, task dispatch, and watchdog execution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y mechanism Previously, in_processing tasks with no cache key were hard-cancelled, bypassing _handle_privacy_request_requeue entirely. Now ALL no-subtask_id cases (both pending+upstream_complete and in_processing+no_cache) set should_requeue=True and go through the retry mechanism. Cancellation only fires when retry_count exceeds privacy_request_requeue_retry_count (default 3). Also fixes a bug introduced in the previous commit: the async-tasks check (_has_async_tasks_awaiting_external_completion) was placed after the pending+not_awaiting_upstream early-return, meaning pending tasks with async_type callback/polling were incorrectly requeued. The check is now evaluated before the pending dispatch decision. TLA+ spec updated: WatchdogDecisionFixed no longer has a direct cancel path. New invariant CancelImpliesRetriesExhausted replaces CancelImpliesGenuinelyStuck — cancellation requires retry_count >= MaxRetries. Fixed spec: 74 states, all pass. Buggy spec: fails in 3 states (cancel at retry_count=0). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update comment to list all three no-subtask_id scenarios (root task with no upstream was missing from the two listed) - Add test: in_processing+no_cache+async_task_present → skipped (the async guard now applies to in_processing tasks too, not just pending) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
|
- Add make_privacy_request and make_request_task factory fixtures — handle creation and teardown, eliminating all inline try/finally blocks - Add _CANCEL / _REQUEUE / _QUEUE / _IN_FLIGHT constants for mock paths - Move increment/reset/get_privacy_request_retry_count imports to top level - Collapse two identical async-type tests into one parametrized test - Merge TestEnhancedRequeueInterruptedTasks into TestRequeueInterruptedTasks - Remove fragile test_requeue_with_retry_count_logging (tested old branch, asserted on log message text) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR completes the DSR watchdog's self-healing behaviour by routing all
Confidence Score: 4/5
Important Files Changed
Last reviewed commit: "Merge branch 'main' ..." |
- Fix stale test docstring (said "skips" but cancel is asserted) - Distinguish root task vs upstream-complete log messages for pending tasks with no subtask_id Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@greptile please review /code-review |
The pending+awaiting_upstream=True case is already handled by the `continue` guard above, so awaiting_upstream is always False at the log site. Collapse to a single message with an explanatory comment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@greptile please review |
|
/code-review |
- Wrap _has_async_tasks_awaiting_external_completion in try/except; a transient DB error now skips the PR (fail safe) instead of aborting the whole watchdog pass - Add test: test_async_check_db_error_skips_request - Remove duplicate test_in_processing_request_task_with_no_cached_subtask_id_is_requeued (identical to test_in_processing_task_with_no_cache_key_is_requeued) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@greptile please review |
Ticket ENG-2756
Description Of Changes
Extends the DSR watchdog fix to be fully self-healing. Previously,
in_processingtasks with no cache key (the Celery task started and updated the DB status, but the cache key is missing due to Redis eviction or worker crash) were hard-cancelled, bypassing_handle_privacy_request_requeueentirely.Now all no-subtask_id cases route through
_handle_privacy_request_requeue, which applies the configured retry limit (privacy_request_requeue_retry_count, default 3). Cancellation only fires when retries are exhausted — never on first encounter.Three scenarios are now handled uniformly:
pending+ no upstream (root task never dispatched) → requeuepending+ upstream complete (parent died before dispatch) → requeuein_processing+ no cache key (cache eviction / worker crash) → requeueAlso fixes a bug introduced in the previous commit:
_has_async_tasks_awaiting_external_completionwas evaluated after thepending+not_awaiting_upstreamearly-return, meaning pending tasks withasync_type=callback/pollingwere incorrectly requeued instead of being left alone. The async guard is now evaluated before the dispatch decision and covers bothpendingandin_processingtasks._has_async_tasks_awaiting_external_completionis now wrapped in a try/except — a transient DB error skips the current PR (fail safe) instead of aborting the whole watchdog pass.The fix is verified with a TLA+ formal spec (
specs/DsrWatchdog.tla). The new invariantCancelImpliesRetriesExhausted(pr_status = "error" => retry_count >= MaxRetries) passes on 74 states withMaxRetries=2. The buggy spec fails in 3 states (cancel fires atretry_count=0).Code Changes
src/fides/api/service/privacy_request/request_service.py— Reorderedif not subtask_idguards: async check now before dispatch decision;in_processing+no_cachesetsshould_requeue=Trueinstead of calling_cancel_interrupted_tasks_and_error_privacy_request; async check wrapped in try/except (fail safe on DB error)tests/task/test_requeue_interrupted_tasks.py— Refactored for readability (factory fixtures, mock path constants, removed try/finally); updated 2 tests (cancel→requeue forin_processing+no_cache); added tests:over_retry_limit_is_canceled,in_processing+no_cache+async_task_is_not_canceled,async_check_db_error_skips_request; 22 tests totalspecs/DsrWatchdog.tla— Updated:retry_count/MaxRetriesstate variable + constant;WatchdogDecisionFixedno longer has a direct cancel path; newCancelImpliesRetriesExhaustedinvariant replacesCancelImpliesGenuinelyStuckspecs/DsrWatchdog.cfg/specs/DsrWatchdog_buggy.cfg— Updated configs withMaxRetries=2constantSteps to Confirm
tests/task/test_requeue_interrupted_tasks.py— all 22 tests should passjava -XX:+UseParallelGC -jar ~/tla2tools.jar -deadlock -config specs/DsrWatchdog.cfg specs/DsrWatchdog.tlashould report "No error has been found" (74 states)in_processing), delete its cache key from Redis, triggerpoll_for_exited_privacy_request_tasks— the request should be requeued (logged as "requeueing"), not cancelledPre-Merge Checklist
CHANGELOG.mdupdatedmaindowngrade()migration is correct and works