Databricks workflow repair airflow3#2
Closed
Beat-Nick wants to merge 16 commits into
Add Databricks-native retry settings to task operators
Summary
Adds first-class Databricks task retry settings to
`DatabricksNotebookOperator` and `DatabricksTaskOperator`: `max_retries`,
`min_retry_interval_millis`, and `retry_on_timeout`.

These are Databricks task-level retries, not Airflow task retries. Databricks
reruns the failed task attempt inside the same job run; Airflow `retries`
rerun the operator.

The payload shape change is gated on explicit retry configuration, so existing
standalone tasks keep their current `runs/submit` payload unless users opt in
by setting a Databricks retry field.
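
To make the opt-in gating concrete, here is a minimal sketch of how a payload builder might switch shapes. It is not the provider's `_get_run_json()`: the helper name `build_submit_payload`, its parameters, and the exact flat payload layout are assumptions for illustration only.

```python
from typing import Any, Optional

# Retry field names taken from the PR summary.
RETRY_FIELDS = ("max_retries", "min_retry_interval_millis", "retry_on_timeout")


def build_submit_payload(
    run_name: str,
    task_key: str,
    task_spec: dict[str, Any],
    retry_settings: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build a runs/submit body for a single task."""
    # Keep only retry fields that were explicitly set (None means "not configured").
    configured = {
        name: value
        for name, value in (retry_settings or {}).items()
        if name in RETRY_FIELDS and value is not None
    }
    if not configured:
        # No opt-in: the flat single-task payload is left unchanged (assumed shape).
        return {"run_name": run_name, **task_spec}
    # Opt-in: retry fields sit on the task entry, not at the top level.
    return {
        "run_name": run_name,
        "tasks": [{"task_key": task_key, **task_spec, **configured}],
    }


spec = {"notebook_task": {"notebook_path": "/Shared/example"}}
legacy = build_submit_payload("demo", "nb", spec)
opted_in = build_submit_payload("demo", "nb", spec, {"max_retries": 2})
```

In this sketch an explicit `max_retries=0` still counts as configured, so it also produces the `tasks[]` form.
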
This follows the recovery-model discussion in
apache/airflow#68358: native
task retries handle transient task failures first, while workflow repair remains
separate follow-up work for run-level recovery.
Details
The retry fields live on Databricks Jobs API tasks, so the implementation sits
in `DatabricksTaskBaseOperator` and applies to both standalone submits and
tasks inside `DatabricksWorkflowTaskGroup`.

For standalone `DatabricksNotebookOperator` and `DatabricksTaskOperator`,
`_get_run_json()` switches to the `tasks[]` submit form only when a retry field
is configured through operator arguments or, for `DatabricksTaskOperator`,
`task_config`. This is required because Databricks ignores these fields at the
top level of `runs/submit`; they must be placed on a `SubmitTask`.

Monitoring becomes retry-aware only when the effective Databricks
`max_retries` permits another native attempt (`-1` or a positive integer). In
that mode:

- [...] Databricks retry attempts.
- [...] `task_key` and treat a failed attempt as final only after the parent
  workflow run is terminal.
- [...] `workflow_run_id` and `databricks_task_key` to
  `DatabricksExecutionTrigger`, so `on_kill` can cancel the latest retry
  attempt instead of a stale attempt id.
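
The rule that a failed attempt counts as final only once the parent workflow run is terminal could look roughly like the sketch below. The function names and the set of failing result states are assumptions for illustration; the terminal lifecycle states are the ones the Databricks Jobs API commonly reports as finished.

```python
from typing import Optional

# Lifecycle states treated as terminal; WAITING_FOR_RETRY and BLOCKED are not in this set.
TERMINAL_LIFE_CYCLE_STATES = frozenset({"TERMINATED", "SKIPPED", "INTERNAL_ERROR"})

# Result states treated as a failed attempt (assumed for this sketch).
FAILED_RESULT_STATES = frozenset({"FAILED", "TIMEDOUT"})


def is_terminal(life_cycle_state: str) -> bool:
    """Return True when a run can no longer change state."""
    return life_cycle_state in TERMINAL_LIFE_CYCLE_STATES


def attempt_failure_is_final(
    attempt_result_state: Optional[str],
    parent_life_cycle_state: str,
) -> bool:
    """Hypothetical check: a failed attempt is final only if the parent run is done."""
    failed = attempt_result_state in FAILED_RESULT_STATES
    # While the parent run is still live, Databricks may start another attempt.
    return failed and is_terminal(parent_life_cycle_state)
```

A monitor built on this check would keep polling after a failed attempt until the parent run reaches a terminal state, then read the latest attempt's outcome.
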

Explicit settings that do not enable retries, such as `max_retries=0`,
`retry_on_timeout=False`, or `min_retry_interval_millis` alone, still land in
the task payload but keep existing single-attempt monitoring behavior.
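
The condition for switching monitoring modes can be written as a small predicate. This is a sketch under the description above, where only `-1` or a positive `max_retries` enables retry-aware monitoring; the names `permits_native_retry` and `monitoring_mode` are hypothetical.

```python
from typing import Any, Optional


def permits_native_retry(max_retries: Optional[int]) -> bool:
    """Return True when Databricks may start another attempt for the task."""
    if max_retries is None:
        # Field not configured: treated as no native retries.
        return False
    # -1 or any positive count allows at least one more attempt.
    return max_retries == -1 or max_retries > 0


def monitoring_mode(retry_settings: dict[str, Any]) -> str:
    """Hypothetical helper: choose the monitoring behavior from retry settings."""
    if permits_native_retry(retry_settings.get("max_retries")):
        return "retry-aware"
    return "single-attempt"
```

Under this sketch, `{"max_retries": 0}`, `{"retry_on_timeout": False}`, and `{"min_retry_interval_millis": 1000}` all map to single-attempt monitoring, while `{"max_retries": -1}` and `{"max_retries": 3}` map to retry-aware monitoring.
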
Changes
- [...] `DatabricksNotebookOperator` and `DatabricksTaskOperator`.
- [...] `DatabricksTaskOperator` precedence: direct operator arguments
  override matching `task_config` fields, and the operator-managed `task_key`
  cannot be shadowed by `task_config`.
- [...] outcome.
- [...] `WAITING_FOR_RETRY` and `BLOCKED` as non-terminal `RunState`
  lifecycle states.
- [...] monitoring, trigger serialization, and waiting through
  `WAITING_FOR_RETRY`.

`DatabricksSubmitRunOperator` and `DatabricksCreateJobsOperator` remain raw
payload pass-through operators; users can already set per-task retry fields in
their task payloads.
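
The `DatabricksTaskOperator` precedence rule described in the changes can be illustrated with a short merge function. This is a sketch, not the operator's code: `resolve_task_settings` and its parameters are hypothetical, and `None` is assumed to mean that an operator argument was not passed.

```python
from typing import Any


def resolve_task_settings(
    task_key: str,
    task_config: dict[str, Any],
    operator_args: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical merge of task_config with direct operator arguments."""
    # Start from the user-supplied task_config.
    merged = dict(task_config)
    # Direct operator arguments override matching task_config fields.
    merged.update({k: v for k, v in operator_args.items() if v is not None})
    # task_key is operator-managed, so it is written last and cannot be shadowed.
    merged["task_key"] = task_key
    return merged


resolved = resolve_task_settings(
    "managed_key",
    {"task_key": "user_key", "max_retries": 1, "retry_on_timeout": True},
    {"max_retries": 3, "retry_on_timeout": None},
)
```

Here `resolved` keeps `retry_on_timeout` from `task_config`, takes `max_retries` from the operator argument, and carries the operator-managed `task_key`.
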
Was generative AI tooling used to co-author this PR?
Generated-by: Codex (GPT-5) following the guidelines