Databricks expose task repair params #3

Draft
Beat-Nick wants to merge 2 commits into main from databricks-expose-task-repair-params
Beat-Nick (Owner) commented Jun 29, 2026

Add Databricks-native retry settings to task operators

Summary

Adds first-class Databricks task retry settings to DatabricksNotebookOperator and DatabricksTaskOperator: max_retries, min_retry_interval_millis, and retry_on_timeout.

These are Databricks task-level retries, not Airflow task retries. Databricks reruns the failed task attempt inside the same job run; Airflow retries rerun the operator.

The payload shape change is gated on explicit retry configuration, so existing standalone tasks keep their current runs/submit payload unless users opt in by setting a Databricks retry field.

This follows the recovery-model discussion in apache/airflow#68358: native task retries handle transient task failures first, while workflow repair remains separate follow-up work for run-level recovery.

Details

The retry fields live on Databricks Jobs API tasks, so the implementation sits in DatabricksTaskBaseOperator and applies to both standalone submits and tasks inside DatabricksWorkflowTaskGroup.

For standalone DatabricksNotebookOperator and DatabricksTaskOperator, _get_run_json() switches to the tasks[] submit form only when a retry field is configured through operator arguments or, for DatabricksTaskOperator, task_config. This is required because Databricks ignores these fields at the top level of runs/submit; they must be placed on a SubmitTask.
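The gating described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `build_run_json`, its parameters, and the assumption that the existing payload carries the task spec at the top level of the request are all hypothetical.

```python
from typing import Any

# Retry fields named in this PR description.
RETRY_FIELDS = ("max_retries", "min_retry_interval_millis", "retry_on_timeout")


def build_run_json(
    task_spec: dict[str, Any],
    task_key: str,
    retry_settings: dict[str, Any],
    run_name: str = "example-run",
) -> dict[str, Any]:
    """Hypothetical helper: pick the payload shape based on retry configuration."""
    # Only fields the user explicitly set (not None) count as "configured".
    configured = {
        name: value
        for name, value in retry_settings.items()
        if name in RETRY_FIELDS and value is not None
    }
    if not configured:
        # No retry field set: keep the existing shape (assumed top-level task spec).
        return {"run_name": run_name, **task_spec}
    # At least one retry field set: place it on a task entry inside tasks[].
    return {
        "run_name": run_name,
        "tasks": [{"task_key": task_key, **task_spec, **configured}],
    }


legacy = build_run_json({"notebook_task": {"notebook_path": "/demo"}}, "t1", {})
opted_in = build_run_json(
    {"notebook_task": {"notebook_path": "/demo"}}, "t1", {"max_retries": 2}
)
```

Here `legacy` has no `tasks` key, while `opted_in` carries `max_retries` on its single task entry.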

Monitoring becomes retry-aware only when the effective Databricks max_retries permits another native attempt (-1 or a positive integer). In that mode:

  • Standalone operators wait on the submit run, whose terminal state includes all Databricks retry attempts.
  • Workflow task operators re-resolve the latest attempt for the same task_key and treat a failed attempt as final only after the parent workflow run is terminal.
  • Deferrable workflow monitoring passes workflow_run_id and databricks_task_key to DatabricksExecutionTrigger, so on_kill can cancel the latest retry attempt instead of a stale attempt id.

Explicit settings that do not enable retries, such as max_retries=0, retry_on_timeout=False, or min_retry_interval_millis alone, still land in the task payload but keep existing single-attempt monitoring behavior.
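As a rough sketch of the rule above, the decision to use retry-aware monitoring depends only on the effective `max_retries`. The function names below are hypothetical and are not taken from the PR's code.

```python
from typing import Optional


def retries_enabled(max_retries: Optional[int]) -> bool:
    """True when Databricks may start another native attempt (-1 or positive)."""
    return max_retries is not None and (max_retries == -1 or max_retries > 0)


def failed_attempt_is_final(max_retries: Optional[int], parent_run_terminal: bool) -> bool:
    """Hypothetical check for a workflow task whose latest attempt has failed."""
    if not retries_enabled(max_retries):
        # Single-attempt monitoring: the first failure is the outcome.
        return True
    # Retry-aware monitoring: wait until the parent workflow run is terminal.
    return parent_run_terminal
```

With this sketch, `max_retries=0` or an unset value keeps single-attempt behavior, regardless of `retry_on_timeout` or `min_retry_interval_millis`.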

Changes

  • Adds retry settings to DatabricksNotebookOperator and DatabricksTaskOperator.
  • Preserves DatabricksTaskOperator precedence: direct operator arguments override matching task_config fields, and the operator-managed task_key cannot be shadowed by task_config.
  • Updates sync and deferrable monitoring to wait for the final Databricks retry outcome.
  • Accepts WAITING_FOR_RETRY and BLOCKED as non-terminal RunState life cycle states.
  • Adds tests for payload generation, argument precedence, sync and deferrable monitoring, trigger serialization, and waiting through WAITING_FOR_RETRY.
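The precedence rule for DatabricksTaskOperator listed above could look something like this. It is a minimal sketch with a hypothetical `merge_task_config` helper, assuming unset operator arguments are represented as `None`.

```python
from typing import Any


def merge_task_config(
    task_config: dict[str, Any],
    operator_args: dict[str, Any],
    task_key: str,
) -> dict[str, Any]:
    """Hypothetical merge: operator arguments win, task_key is operator-managed."""
    merged = dict(task_config)
    # Direct operator arguments override matching task_config fields.
    merged.update({k: v for k, v in operator_args.items() if v is not None})
    # Applied last so a task_key supplied in task_config cannot shadow it.
    merged["task_key"] = task_key
    return merged


merged = merge_task_config(
    {"max_retries": 1, "task_key": "from_config", "retry_on_timeout": True},
    {"max_retries": 3, "retry_on_timeout": None},
    task_key="managed_key",
)
```

In this example `max_retries` comes from the operator argument, `retry_on_timeout` falls back to `task_config`, and `task_key` is the operator-managed value.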

DatabricksSubmitRunOperator and DatabricksCreateJobsOperator remain raw payload pass-through operators; users can already set per-task retry fields in their task payloads.

Was generative AI tooling used to co-author this PR?
  • Yes - Codex (GPT-5)

Generated-by: Codex (GPT-5) following the guidelines

@Beat-Nick Beat-Nick force-pushed the databricks-expose-task-repair-params branch from e93052b to c9908a4 Compare June 30, 2026 14:19