Keep the restart_workers monitor alive on transient errors (and drop debug prints) #694

Open
lollinng wants to merge 1 commit into Lightning-AI:main from lollinng:fix/worker-monitor-resilience

lollinng commented Jun 5, 2026

Problem

The restart_workers watchdog in _start_worker_monitoring.monitor() wraps its entire while not self._shutdown_event.is_set() loop in a single try/except:

def monitor():
    try:
        while not self._shutdown_event.is_set():
            ...           # map dead workers, relaunch them
    except Exception as e:
        print(e)

So if any single iteration raises — a transient worker→API mapping error, a failed launch_single_inference_worker, etc. — the exception breaks out of the while loop, the monitor thread exits, and restart_workers self-healing is permanently disabled for the rest of the server's lifetime. The loop also shipped three leftover debug prints (misspelled [monoriting]).

Fix

Move the try/except inside the loop so a single bad iteration is logged (logger.exception) and the watchdog keeps monitoring. Replace the [monoriting] debug prints with logger calls. The intentional return on restart_workers=False (graceful shutdown) is preserved.
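
The control-flow shape described above can be sketched as follows. This is a minimal, standalone illustration and not the code in this PR: `run_monitor`, `check_workers`, and `interval` are hypothetical names, and the callback stands in for the real "map dead workers, relaunch them" body.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def run_monitor(shutdown_event: threading.Event, check_workers, interval: float = 1.0) -> None:
    """Hypothetical stand-in for monitor(): errors are handled per iteration."""
    while not shutdown_event.is_set():
        try:
            # Assumption: the callback returns False to request a graceful
            # stop, mirroring the intentional return on restart_workers=False.
            if check_workers() is False:
                return
        except Exception:
            # Log the traceback and fall through to the next iteration
            # instead of letting the exception end the thread.
            logger.exception("Worker monitor iteration failed; continuing")
        # Event.wait acts as a sleep that ends early once shutdown is signalled.
        shutdown_event.wait(interval)
```

Because the `try` sits inside the `while`, an exception costs one iteration rather than the whole thread, and the early `return` remains the only deliberate non-shutdown exit path.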

Verification

Modeled both control-flow shapes with a body that raises once then succeeds:

OLD: ran 0 iterations then the monitor thread DIED (self-healing disabled)
NEW: ran 3 iterations — survived the transient error and kept monitoring
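
A comparison of this kind could be modeled roughly as below. This is a reconstruction under assumptions, not the script used for the PR: the helper names are hypothetical, and the cap of four attempts is chosen only so that the loop terminates.

```python
def make_body():
    """Return a loop body that raises on its first call and succeeds afterwards."""
    state = {"calls": 0}

    def body():
        state["calls"] += 1
        if state["calls"] == 1:
            raise RuntimeError("transient worker->API mapping error")

    return body


def old_shape(body, max_attempts):
    """try/except around the whole loop: the first exception ends the loop."""
    completed = 0
    try:
        for _ in range(max_attempts):
            body()
            completed += 1
    except Exception:
        pass  # the loop is already gone; nothing keeps monitoring
    return completed


def new_shape(body, max_attempts):
    """try/except inside the loop: a failed iteration is skipped, the loop goes on."""
    completed = 0
    for _ in range(max_attempts):
        try:
            body()
            completed += 1
        except Exception:
            continue  # the real fix logs here; the model just moves on
    return completed


old_completed = old_shape(make_body(), max_attempts=4)
new_completed = new_shape(make_body(), max_attempts=4)
```

With four attempts, the old shape completes no iterations because the first call raises and exits the loop, while the new shape loses only that first attempt and completes the remaining three.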

The restart_workers watchdog wrapped its entire while loop in a single
try/except, so any exception in one iteration (e.g. a transient worker->API
mapping error or a failed relaunch) exited the loop and killed the monitor
thread, silently disabling self-healing for the rest of the server's life.
Move the try/except inside the loop so one bad iteration is logged and the
watchdog keeps running. Also replace the leftover '[monoriting]' debug prints
with logger calls.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>