Databricks expose task repair params #3
Draft
Beat-Nick wants to merge 2 commits into
Add Databricks-native retry settings to task operators
Summary
Adds first-class Databricks task retry settings to `DatabricksNotebookOperator` and `DatabricksTaskOperator`: `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout`.

These are Databricks task-level retries, not Airflow task retries. Databricks reruns the failed task attempt inside the same job run; Airflow `retries` rerun the operator.

The payload shape change is gated on explicit retry configuration, so existing standalone tasks keep their current `runs/submit` payload unless users opt in by setting a Databricks retry field.

This follows the recovery-model discussion in apache/airflow#68358: native task retries handle transient task failures first, while workflow repair remains separate follow-up work for run-level recovery.
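The gating described above can be illustrated with a minimal sketch. It is not the PR's actual `_get_run_json()` code: `build_run_json`, its arguments, and the flat payload layout (`run_name` plus the task spec at the top level) are assumptions made for illustration only.

```python
from typing import Any, Dict

# Retry fields named in this PR description.
RETRY_FIELDS = ("max_retries", "min_retry_interval_millis", "retry_on_timeout")


def build_run_json(
    run_name: str,
    task_key: str,
    task_spec: Dict[str, Any],
    retry_settings: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper: choose the submit payload shape.

    Assumption: the existing payload is a flat dict with the task spec
    at the top level. The real operator code may differ.
    """
    # Keep only retry fields that were explicitly set (None means "not set").
    configured = {
        key: value
        for key, value in retry_settings.items()
        if key in RETRY_FIELDS and value is not None
    }
    if not configured:
        # No opt-in: the payload shape is unchanged.
        return {"run_name": run_name, **task_spec}
    # Opt-in: retry fields are placed on the task entry, not the top level.
    task = {"task_key": task_key, **task_spec, **configured}
    return {"run_name": run_name, "tasks": [task]}
```

In this sketch an explicit `max_retries=0` still counts as configured, so it selects the `tasks[]` form even though it does not enable any retry.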
Details
The retry fields live on Databricks Jobs API tasks, so the implementation sits in `DatabricksTaskBaseOperator` and applies to both standalone submits and tasks inside `DatabricksWorkflowTaskGroup`.

For standalone `DatabricksNotebookOperator` and `DatabricksTaskOperator`, `_get_run_json()` switches to the `tasks[]` submit form only when a retry field is configured through operator arguments or, for `DatabricksTaskOperator`, `task_config`. This is required because Databricks ignores these fields at the top level of `runs/submit`; they must be placed on a `SubmitTask`.

Monitoring becomes retry-aware only when the effective Databricks `max_retries` permits another native attempt (`-1` or a positive integer). In that mode:

- […] `task_key` and treat a failed attempt as final only after the parent workflow run is terminal.
- […] `workflow_run_id` and `databricks_task_key` to `DatabricksExecutionTrigger`, so `on_kill` can cancel the latest retry attempt instead of a stale attempt id.

Explicit settings that do not enable retries, such as `max_retries=0`, `retry_on_timeout=False`, or `min_retry_interval_millis` alone, still land in the task payload but keep existing single-attempt monitoring behavior.

Changes
- […] `DatabricksNotebookOperator` and `DatabricksTaskOperator`.
- […] `DatabricksTaskOperator` precedence: direct operator arguments override matching `task_config` fields, and the operator-managed `task_key` cannot be shadowed by `task_config`.
- […] `WAITING_FOR_RETRY` and `BLOCKED` as non-terminal `RunState` life cycle states.
- […] `WAITING_FOR_RETRY`.
- `DatabricksSubmitRunOperator` and `DatabricksCreateJobsOperator` remain raw payload pass-through operators; users can already set per-task retry fields in their task payloads.

Was generative AI tooling used to co-author this PR?
Generated-by: Codex (GPT-5) following the guidelines
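
For reviewers, the retry-enablement rule under Details and the precedence rule under Changes can be sketched roughly as below. All three function names are hypothetical and do not come from the PR. The sketch assumes that an operator argument left as `None` means "not set".

```python
from typing import Any, Dict, Optional


def native_retries_enabled(max_retries: Optional[int]) -> bool:
    """True when Databricks may start another native attempt.

    Per the description: -1 or a positive integer enables retries;
    0 or an unset value does not.
    """
    return max_retries is not None and (max_retries == -1 or max_retries > 0)


def merge_task_settings(
    task_config: Dict[str, Any],
    operator_args: Dict[str, Any],
    managed_task_key: str,
) -> Dict[str, Any]:
    """Hypothetical merge: operator arguments win over task_config."""
    merged = dict(task_config)
    # Assumption: None marks an operator argument that was not provided.
    merged.update({k: v for k, v in operator_args.items() if v is not None})
    # The operator-managed task_key always wins over task_config.
    merged["task_key"] = managed_task_key
    return merged


def attempt_failure_is_final(
    max_retries: Optional[int], parent_run_terminal: bool
) -> bool:
    """Decide whether a failed attempt should end monitoring."""
    if not native_retries_enabled(max_retries):
        # Single-attempt behavior: the first failure is final.
        return True
    # Retry-aware mode: wait until the parent run is terminal.
    return parent_run_terminal
```

With these definitions, `max_retries=0` combined with `retry_on_timeout=False` keeps the single-attempt path, while `max_retries=-1` defers the failure decision until the parent run reaches a terminal state.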