Databricks workflow repair airflow3 by Beat-Nick · Pull Request #2 · Beat-Nick/airflow

Beat-Nick · 2026-06-29T19:39:42Z

Add Databricks-native retry settings to task operators

Summary

Adds first-class Databricks task retry settings to
DatabricksNotebookOperator and DatabricksTaskOperator:
max_retries, min_retry_interval_millis, and retry_on_timeout.

These are Databricks task-level retries, not Airflow task retries. Databricks
reruns the failed task attempt inside the same job run; Airflow retries rerun
the operator.

The payload shape change is gated on explicit retry configuration, so existing
standalone tasks keep their current runs/submit payload unless users opt in by
setting a Databricks retry field.

This follows the recovery-model discussion in
apache/airflow#68358: native
task retries handle transient task failures first, while workflow repair remains
separate follow-up work for run-level recovery.

Details

The retry fields live on Databricks Jobs API tasks, so the implementation sits
in DatabricksTaskBaseOperator and applies to both standalone submits and tasks
inside DatabricksWorkflowTaskGroup.

For standalone DatabricksNotebookOperator and DatabricksTaskOperator,
_get_run_json() switches to the tasks[] submit form only when a retry field
is configured through operator arguments or, for DatabricksTaskOperator,
task_config. This is required because Databricks ignores these fields at the
top level of runs/submit; they must be placed on a SubmitTask.
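
The gating described above can be sketched roughly as follows. This is an illustration only, not the PR's `_get_run_json()`: the helper name `build_run_json`, the legacy top-level payload shape, and the use of `None` to mean "not configured" are all assumptions.

```python
from typing import Any

# The three Databricks task-level retry fields named in this PR.
RETRY_FIELDS = ("max_retries", "min_retry_interval_millis", "retry_on_timeout")


def build_run_json(
    task_key: str,
    task_body: dict[str, Any],
    retry_settings: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical helper: build a runs/submit payload for one task."""
    # Assumption: None means the user did not set the field.
    configured = {
        name: value
        for name, value in retry_settings.items()
        if name in RETRY_FIELDS and value is not None
    }
    if not configured:
        # No retry field set: keep the existing top-level payload shape.
        return {"run_name": task_key, **task_body}
    # Retry fields must sit on a task entry, so switch to the tasks[] form.
    task_entry = {"task_key": task_key, **task_body, **configured}
    return {"run_name": task_key, "tasks": [task_entry]}
```

Note that an explicit `max_retries=0` still counts as configured in this sketch, so it switches the payload to the `tasks[]` form even though it does not enable another attempt.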

Monitoring becomes retry-aware only when the effective Databricks max_retries
permits another native attempt (-1 or a positive integer). In that mode:

- Standalone operators wait on the submit run, whose terminal state includes all
  Databricks retry attempts.
- Workflow task operators re-resolve the latest attempt for the same task_key
  and treat a failed attempt as final only after the parent workflow run is
  terminal.
- Deferrable workflow monitoring passes workflow_run_id and
  databricks_task_key to DatabricksExecutionTrigger, so on_kill can cancel
  the latest retry attempt instead of a stale attempt id.
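
A minimal sketch of the workflow-task rule above, under stated assumptions: the run dictionary is assumed to list one `tasks[]` entry per attempt, each carrying `task_key`, `run_id`, `attempt_number`, and `state`. The helper names are hypothetical and are not the functions added by this PR.

```python
from typing import Any, Optional

# Life cycle states treated as terminal here; WAITING_FOR_RETRY and BLOCKED
# are deliberately absent, so they count as still in progress.
TERMINAL_LIFE_CYCLE_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def is_terminal(state: dict[str, Any]) -> bool:
    return state.get("life_cycle_state") in TERMINAL_LIFE_CYCLE_STATES


def latest_attempt(run: dict[str, Any], task_key: str) -> Optional[dict[str, Any]]:
    """Return the highest-numbered attempt for task_key, or None if absent."""
    attempts = [t for t in run.get("tasks", []) if t.get("task_key") == task_key]
    if not attempts:
        return None
    return max(attempts, key=lambda t: t.get("attempt_number", 0))


def failed_for_good(run: dict[str, Any], task_key: str) -> bool:
    """A failed attempt is final only once the parent run is terminal."""
    attempt = latest_attempt(run, task_key)
    if attempt is None or not is_terminal(run.get("state", {})):
        return False
    state = attempt.get("state", {})
    return is_terminal(state) and state.get("result_state") != "SUCCESS"
```

Resolving by `task_key` on every poll, rather than caching a run id, is what lets a canceller target the newest attempt instead of a stale one.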

Explicit settings that do not enable retries, such as max_retries=0,
retry_on_timeout=False, or min_retry_interval_millis alone, still land in
the task payload but keep existing single-attempt monitoring behavior.
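
The condition for switching to retry-aware monitoring can be written as a small predicate. The function name is hypothetical; the rule itself (-1 or a positive integer) is taken from the description above.

```python
from typing import Optional


def native_retries_enabled(max_retries: Optional[int]) -> bool:
    """True when Databricks may start another native attempt.

    -1 means retry indefinitely; a positive value is a retry budget.
    None (unset) and 0 both leave single-attempt monitoring in place.
    """
    return max_retries is not None and (max_retries == -1 or max_retries > 0)
```

Under this rule, setting only `retry_on_timeout` or `min_retry_interval_millis` changes the payload but not the monitoring mode, because `max_retries` is still unset.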

Changes

- Adds retry settings to DatabricksNotebookOperator and
  DatabricksTaskOperator.
- Preserves DatabricksTaskOperator precedence: direct operator arguments
  override matching task_config fields, and the operator-managed task_key
  cannot be shadowed by task_config.
- Updates sync and deferrable monitoring to wait for the final Databricks retry
  outcome.
- Accepts WAITING_FOR_RETRY and BLOCKED as non-terminal RunState life cycle
  states.
- Adds tests for payload generation, argument precedence, sync and deferrable
  monitoring, trigger serialization, and waiting through WAITING_FOR_RETRY.
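
The precedence rule for DatabricksTaskOperator listed above could look roughly like this. It is a sketch, not the operator's code: `merge_task_settings` is a hypothetical name, and treating `None` as "argument not passed" is an assumption.

```python
from typing import Any, Optional


def merge_task_settings(
    task_key: str,
    task_config: dict[str, Any],
    **operator_args: Optional[Any],
) -> dict[str, Any]:
    """Hypothetical merge: operator arguments win, task_key is operator-managed."""
    merged = dict(task_config)  # copy so the caller's dict is not mutated
    # Direct operator arguments override matching task_config fields.
    merged.update({k: v for k, v in operator_args.items() if v is not None})
    # Written last so a task_key inside task_config cannot shadow it.
    merged["task_key"] = task_key
    return merged
```

For example, `merge_task_settings("etl", {"max_retries": 1, "task_key": "other"}, max_retries=3)` yields `max_retries` of 3 and `task_key` of `"etl"`.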

DatabricksSubmitRunOperator and DatabricksCreateJobsOperator remain raw
payload pass-through operators; users can already set per-task retry fields in
their task payloads.
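
For comparison, a raw payload with per-task retry fields might look like the dictionary below. The run name, task key, notebook path, and values are placeholders chosen for illustration; only the three retry field names come from this PR.

```python
# Illustrative payload for a pass-through operator; all values are placeholders.
submit_payload = {
    "run_name": "example-run",
    "tasks": [
        {
            "task_key": "example_task",
            "notebook_task": {"notebook_path": "/Shared/example"},
            # Retry fields sit on the task entry, not at the top level.
            "max_retries": 2,
            "min_retry_interval_millis": 60_000,
            "retry_on_timeout": False,
        }
    ],
}
```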

Was generative AI tooling used to co-author this PR?

Yes - Codex (GPT-5)

Generated-by: Codex (GPT-5) following the guidelines

Beat-Nick and others added 16 commits June 9, 2026 09:27

init

b330186

rename cordinator, update docstring

729c868

more docstring trimming

79d9c2d

More cleanup

8f87d18

trim LOC, reduce exceptions, hardcode grace polls

c6f6956

tweak seralization test

52f1644

keep language consistent

f2777fc

rename params for clarity

bb6b773

Fix race conditions and consolidate grace poll constant

da23363

Update rst

1995a18

rework cordinator polling for repair

40550d9

tweak new parameter name for simplicity

8d6d13f

Merge branch 'apache:main' into databricks-workflow-repair-airflow3

882032a

Unify repair poll deadline to single clock

d0dc257

Update find_new_workflow_task_attempt to require a start_time

45d91da

fix test import error

b01e19c

Beat-Nick closed this Jun 29, 2026

Beat-Nick wants to merge 16 commits into main from databricks-workflow-repair-airflow3
