Databricks workflow repair airflow3 #2

Closed
Beat-Nick wants to merge 16 commits into main from databricks-workflow-repair-airflow3

@Beat-Nick (Owner)

Add Databricks-native retry settings to task operators

Summary

Adds first-class Databricks task retry settings to
DatabricksNotebookOperator and DatabricksTaskOperator:
max_retries, min_retry_interval_millis, and retry_on_timeout.

These are Databricks task-level retries, not Airflow task retries. Databricks
reruns the failed task attempt inside the same job run; Airflow retries rerun
the operator.

The payload shape change is gated on explicit retry configuration, so existing
standalone tasks keep their current runs/submit payload unless users opt in by
setting a Databricks retry field.

This follows the recovery-model discussion in
apache/airflow#68358: native
task retries handle transient task failures first, while workflow repair remains
separate follow-up work for run-level recovery.

Details

The retry fields live on Databricks Jobs API tasks, so the implementation sits
in DatabricksTaskBaseOperator and applies to both standalone submits and tasks
inside DatabricksWorkflowTaskGroup.

For standalone DatabricksNotebookOperator and DatabricksTaskOperator,
_get_run_json() switches to the tasks[] submit form only when a retry field
is configured through operator arguments or, for DatabricksTaskOperator,
task_config. This is required because Databricks ignores these fields at the
top level of runs/submit; they must be placed on a SubmitTask.
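The gating described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `build_run_json` is a hypothetical stand-in for `_get_run_json()`, and the surrounding payload keys (`run_name`, `notebook_task`) are assumed for the example.

```python
from __future__ import annotations

from typing import Any

# Retry field names as listed in the PR summary.
RETRY_FIELDS = ("max_retries", "min_retry_interval_millis", "retry_on_timeout")


def build_run_json(
    task_key: str,
    task_body: dict[str, Any],
    retry_settings: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Hypothetical stand-in for _get_run_json(); not the PR's actual code."""
    retry_settings = retry_settings or {}
    # A field counts as configured when it is explicitly set, even to 0 or False.
    configured = {
        name: retry_settings[name]
        for name in RETRY_FIELDS
        if retry_settings.get(name) is not None
    }
    if not configured:
        # No opt-in: keep the payload with the task body at the top level.
        return {"run_name": task_key, **task_body}
    # Opt-in: the retry fields must sit on the task entry inside tasks[].
    return {
        "run_name": task_key,
        "tasks": [{"task_key": task_key, **task_body, **configured}],
    }


notebook = {"notebook_task": {"notebook_path": "/Shared/example"}}
legacy_payload = build_run_json("ingest", notebook)
retry_payload = build_run_json(
    "ingest", notebook, {"max_retries": 2, "retry_on_timeout": True}
)
```

Here `legacy_payload` keeps `notebook_task` at the top level, while `retry_payload` moves it under `tasks[0]` next to the retry fields.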

Monitoring becomes retry-aware only when the effective Databricks max_retries
permits another native attempt, that is, -1 (retry indefinitely) or a positive
integer. In that mode:

  • Standalone operators wait on the submit run, whose terminal state includes all
    Databricks retry attempts.
  • Workflow task operators re-resolve the latest attempt for the same task_key
    and treat a failed attempt as final only after the parent workflow run is
    terminal.
  • Deferrable workflow monitoring passes workflow_run_id and
    databricks_task_key to DatabricksExecutionTrigger, so on_kill can cancel
    the latest retry attempt instead of a stale attempt id.
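The workflow-task rule above (re-resolve the latest attempt, and treat a failure as final only once the parent run is terminal) might look like the sketch below. The run dictionary shape (`tasks[]` entries carrying `task_key`, `run_id`, `attempt_number`, and `state`) and the set of terminal states are assumptions made for illustration; the helper names are hypothetical and do not come from the PR.

```python
from __future__ import annotations

from typing import Any

# Life cycle states treated as terminal in this sketch (an assumption);
# WAITING_FOR_RETRY and BLOCKED are deliberately absent.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def latest_attempt(run: dict[str, Any], task_key: str) -> dict[str, Any] | None:
    """Return the highest-numbered attempt for task_key, or None if absent."""
    attempts = [t for t in run.get("tasks", []) if t.get("task_key") == task_key]
    return max(attempts, key=lambda t: t.get("attempt_number", 0), default=None)


def failure_is_final(run: dict[str, Any], task_key: str) -> bool:
    """A failed attempt is final only after the parent run is terminal."""
    attempt = latest_attempt(run, task_key)
    if attempt is None:
        return False
    attempt_failed = attempt.get("state", {}).get("result_state") == "FAILED"
    parent_terminal = run.get("state", {}).get("life_cycle_state") in TERMINAL_STATES
    return attempt_failed and parent_terminal


failed = {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}
first = {"task_key": "ingest", "run_id": 101, "attempt_number": 0, "state": failed}
second = {"task_key": "ingest", "run_id": 102, "attempt_number": 1, "state": failed}

# Parent still running: Databricks may schedule another attempt.
running_run = {"state": {"life_cycle_state": "RUNNING"}, "tasks": [first]}
# Parent terminal with every attempt failed: the failure is final.
exhausted_run = {"state": {"life_cycle_state": "TERMINATED"}, "tasks": [first, second]}
```

Resolving by `task_key` rather than a stored attempt id is also what lets a cancel target the current attempt (run 102 in `exhausted_run`) rather than the stale one.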

Explicit settings that do not enable retries, such as max_retries=0,
retry_on_timeout=False, or min_retry_interval_millis alone, still land in
the task payload but keep existing single-attempt monitoring behavior.
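The split between "lands in the payload" and "enables retry-aware monitoring" can be summarized with a small predicate. `retry_mode` is a hypothetical helper written for this description, assuming `None` means a field was not configured.

```python
from __future__ import annotations


def retry_mode(
    max_retries: int | None = None,
    min_retry_interval_millis: int | None = None,
    retry_on_timeout: bool | None = None,
) -> tuple[bool, bool]:
    """Return (in_payload, retry_aware) for a set of retry settings."""
    # Any explicitly configured field is sent to Databricks.
    in_payload = any(
        value is not None
        for value in (max_retries, min_retry_interval_millis, retry_on_timeout)
    )
    # Only a max_retries value that allows another attempt changes monitoring.
    retry_aware = max_retries is not None and (max_retries == -1 or max_retries > 0)
    return in_payload, retry_aware


cases = {
    "nothing set": retry_mode(),
    "max_retries=0": retry_mode(max_retries=0),
    "retry_on_timeout=False": retry_mode(retry_on_timeout=False),
    "interval only": retry_mode(min_retry_interval_millis=60_000),
    "max_retries=-1": retry_mode(max_retries=-1),
    "max_retries=3": retry_mode(max_retries=3),
}
```

Only the last two cases yield `(True, True)`; the three explicit-but-inert settings yield `(True, False)`, and an unconfigured operator yields `(False, False)`.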

Changes

  • Adds retry settings to DatabricksNotebookOperator and
    DatabricksTaskOperator.
  • Preserves DatabricksTaskOperator precedence: direct operator arguments
    override matching task_config fields, and the operator-managed task_key
    cannot be shadowed by task_config.
  • Updates sync and deferrable monitoring to wait for the final Databricks retry
    outcome.
  • Accepts WAITING_FOR_RETRY and BLOCKED as non-terminal RunState life
    cycle states.
  • Adds tests for payload generation, argument precedence, sync and deferrable
    monitoring, trigger serialization, and waiting through WAITING_FOR_RETRY.
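The precedence rule for DatabricksTaskOperator can be illustrated with a short merge sketch. `resolve_task_fields` is a hypothetical helper, and treating `None` as "argument not set" is an assumption of this example rather than something the PR states.

```python
from __future__ import annotations

from typing import Any


def resolve_task_fields(
    task_key: str, task_config: dict[str, Any], **operator_args: Any
) -> dict[str, Any]:
    """Merge task_config with direct operator arguments (illustrative only)."""
    merged = dict(task_config)
    # Direct operator arguments override matching task_config fields.
    merged.update({k: v for k, v in operator_args.items() if v is not None})
    # The operator-managed task_key always wins over task_config.
    merged["task_key"] = task_key
    return merged


resolved = resolve_task_fields(
    "operator_key",
    {"task_key": "shadow", "max_retries": 1, "retry_on_timeout": True},
    max_retries=3,
    min_retry_interval_millis=None,
)
```

In `resolved`, `max_retries` comes from the operator argument, `retry_on_timeout` is inherited from `task_config`, and the `"shadow"` task key is replaced by the operator's own.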

DatabricksSubmitRunOperator and DatabricksCreateJobsOperator remain raw
payload pass-through operators; users can already set per-task retry fields in
their task payloads.

Was generative AI tooling used to co-author this PR?
  • Yes - Codex (GPT-5)

Generated-by: Codex (GPT-5) following the guidelines

@Beat-Nick closed this Jun 29, 2026