feat(session): switch a session's agent mid-flight #2412
Add the ability to change a session's agent (harness) without losing the
worktree. Previously a session's harness was fixed at spawn and could never
change; mixing agents on one task (e.g. codex to test, claude-code to write,
a cheaper model for cost) was impossible.
A switch keeps the same git worktree (all code + uncommitted work preserved)
and launches the new agent fresh — there is no native resume, since a
different harness cannot read the outgoing agent's session. An optional model
override rides the same path.
Two cases are handled:
- Live session: swap in place. The old agent is torn down only AFTER the new
launch command validates, so a bad/unknown harness never disrupts the
running session. A BeginSwitch/EndSwitch guard on the lifecycle manager makes
the reaper ignore the brief runtime gap so it is not mistaken for a crash.
- Terminated session (e.g. the agent exited): relaunch-as. The worktree is
restored and the new agent launched fresh under it — the way to bring a
finished task back under a different agent (plain restore keeps the harness).
lifecycle.MarkSwitched atomically changes the persisted harness, points at the
new runtime handle, and clears the harness-specific AgentSessionID (which
MarkSpawned's merge cannot), resetting activity/first-signal so the new agent
re-proves its hook pipeline.
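
A small sketch of why a merge cannot clear `AgentSessionID` while an overwrite can. The struct fields and both functions are simplified assumptions made for illustration; the real `MarkSpawned` and `MarkSwitched` live in the lifecycle package and also handle locking and activity state, which is omitted here.

```go
package main

// SessionMetadata is a trimmed, hypothetical version of the persisted record.
type SessionMetadata struct {
	Harness        string
	RuntimeHandle  string
	AgentSessionID string
}

// mergeNonEmpty copies only non-empty fields, mirroring the merge behaviour
// described for MarkSpawned. An empty source field means "leave as is", so
// there is no way to express "clear this field".
func mergeNonEmpty(dst *SessionMetadata, src SessionMetadata) {
	if src.Harness != "" {
		dst.Harness = src.Harness
	}
	if src.RuntimeHandle != "" {
		dst.RuntimeHandle = src.RuntimeHandle
	}
	if src.AgentSessionID != "" {
		dst.AgentSessionID = src.AgentSessionID
	}
}

// markSwitched overwrites instead of merging, so the stale harness-specific
// session ID is dropped together with the harness change.
func markSwitched(dst *SessionMetadata, harness, handle string) {
	dst.Harness = harness
	dst.RuntimeHandle = handle
	dst.AgentSessionID = ""
}
```

After `mergeNonEmpty`, the old agent's session ID would still be attached to the new harness; `markSwitched` removes it in the same step that changes the harness.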
Surfaces:
- Backend: sessionmanager.SwitchHarness + lifecycle guard/MarkSwitched.
- API: POST /sessions/{id}/switch {harness, model?} (400/404/409 mapped);
OpenAPI spec + frontend schema.ts regenerated.
- CLI: `ao session switch <id> --harness <agent> [--model ...]`.
- UI: the session inspector's Overview "Agent" row is now a dropdown that
switches the agent in place (or relaunches a terminated one); merged sessions
stay read-only.
Tests cover: live swap clears AgentSessionID, unknown harness leaves the agent
running, terminated relaunch-as, create-failure terminates cleanly, the reaper
guard suppresses termination during a switch, and MarkSwitched semantics.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nikhilachale left a comment

- P1: switch-in-progress guard is not atomic, so duplicate switches can still race
  internal/session_manager/manager.go:591 checks IsSwitching, but internal/session_manager/manager.go:656 sets the guard later and BeginSwitch is idempotent with no failure return. Two concurrent requests can both pass the check, both enter BeginSwitch, and then both destroy/create runtimes against the same worktree. The terminated path is worse because internal/session_manager/manager.go:691 never begins the guard at all, so two relaunch-as switches can create two runtimes. This needs an atomic TryBeginSwitch/compare-and-set style API used by both live and terminated switch paths.
- P2: terminated switch bypasses the existing not-resumable guard
  Restore refuses promptless, unresumable worker sessions via restoreArgv at internal/session_manager/manager.go:1433, but relaunchTerminatedWithHarness always calls fresh GetLaunchCommand with meta.Prompt at internal/session_manager/manager.go:707. That means ao session switch can resurrect a terminated worker with no prompt/native resume context into a blank agent session, which the existing restore path deliberately prevents. If switch-relaunch is meant to preserve restore semantics except for harness selection, it should reject the same promptless worker case.
…ability

P1: the switch-in-progress guard was a check-then-act (IsSwitching + later BeginSwitch) and the terminated path never claimed it, so two concurrent switches could both proceed and race two teardown/relaunch cycles over one worktree. Replace with an atomic lifecycle.TryBeginSwitch (single-critical-section compare-and-set) claimed once at the top of SwitchHarness for both the live and terminated paths, released via defer.

P2: relaunchTerminatedWithHarness always fresh-launched with meta.Prompt, bypassing restoreArgv's guard. Since the new harness cannot native-resume the old agent's session, a terminated worker with no saved prompt would blank-relaunch, which Restore deliberately refuses. Reject the same promptless-worker case with ErrNotResumable (orchestrators stay promptless by design).

Tests: reject concurrent switch, terminated promptless worker rejected; lifecycle TryBeginSwitch compare-and-set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thanks @nikhilachale, both good catches, fixed in 551742c.
P1 (atomic switch guard): Replaced the check-then-act (
P2 (terminated resumability):
Full backend
# Conflicts:
#   backend/internal/httpd/apispec/openapi.yaml
#   frontend/src/api/schema.ts
#   frontend/src/renderer/components/SessionInspector.tsx
@nikhilachale re-requesting your review; both of your comments are addressed (atomic
…-used agents

Two fixes on the agent-switch path.

Medium (review): terminated relaunch-as restored the worktree to ws.Path but MarkSwitched only updated the runtime handle, leaving the old Metadata.WorkspacePath/Branch. A changed session prefix or managed root could restore to a different path while the stored session still pointed at the old one, breaking later terminal/workspace/cleanup ops. MarkSwitched now takes full SessionMetadata and persists WorkspacePath/Branch from the launch.

Bug: switching to a harness that had already run the session relaunched it fresh, colliding with the agent's own prior native session. Claude Code pins a deterministic --session-id, so a fresh relaunch failed with "Session ID <uuid> is already in use". Sessions now track the set of harnesses they've launched (SessionMetadata.LaunchedHarnesses); a previously-used harness RESUMES (via the adapter's restore command) while a new one launches fresh. The promptless-worker guard now applies only to fresh launches (a resume needs no saved prompt).

Tests: MarkSwitched persists workspace path/branch + launched set; resume for a previously-used harness; fresh launch (and set update) for a new harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fixed in 1e6d9a3.
Medium (workspace path not persisted):
Bonus: also fixed a related runtime bug I hit while testing. Switching to a harness that had already run the session relaunched it fresh and collided with the agent's own prior native session; Claude Code pins a deterministic
Backend
…pdown to authed agents

A terminated agent's tmux session can outlive the agent process (the keep-alive shell keeps it open), so its deterministic session name stays taken. The terminated relaunch-as path skipped Destroy (it assumed no live runtime) and went straight to Create, which failed with "duplicate session <id>" (surfaced as a 500). It now tears down any leftover runtime handle before Create; Destroy is idempotent, so an already-gone session is a no-op. The live path already did this.

Test: terminated relaunch with a lingering handle destroys it before Create.

UI: the Agent-row switch dropdown now lists only agents whose local auth probe passed (the catalog's authorized set, same source the spawn dialogs use) instead of every known harness, so users don't pick an agent that just fails at launch. Empty/loading states render a disabled hint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
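
A toy model of the "Destroy before Create" fix, assuming a runtime whose session names are unique. The `fakeRuntime` type and `relaunch` helper are hypothetical and stand in for the tmux-backed runtime; the duplicate-name error text is only modelled loosely on the one quoted above.

```go
package main

import "fmt"

// fakeRuntime models a runtime with unique, deterministic session names.
type fakeRuntime struct {
	sessions map[string]bool
}

// Create fails when the name is still taken, like a lingering tmux session.
func (r *fakeRuntime) Create(name string) error {
	if r.sessions[name] {
		return fmt.Errorf("duplicate session %s", name)
	}
	r.sessions[name] = true
	return nil
}

// Destroy is idempotent: removing a session that is already gone is a no-op.
func (r *fakeRuntime) Destroy(name string) error {
	delete(r.sessions, name)
	return nil
}

// relaunch always tears down any leftover handle before creating the new one.
func relaunch(r *fakeRuntime, name string) error {
	if err := r.Destroy(name); err != nil {
		return err
	}
	return r.Create(name)
}
```

Because `Destroy` is safe to call on a missing session, the relaunch path does not need to know whether the old runtime actually survived.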
…y triggers
The previous commit tracked LaunchedHarnesses on SessionMetadata to drive the
resume-vs-fresh switch decision, but the SQLite store maps metadata to explicit
columns and had no column for it — so it was silently dropped on write and read
back empty. The resume branch never fired, and switching back to a
previously-used deterministic-id agent (Claude Code) still fresh-launched and
collided ("Session ID <uuid> is already in use").
Add a durable launched_harnesses column (migration 0022), thread it through the
InsertSession/UpdateSession/Select queries (sqlc regenerated), and serialise the
harness set as a comma-separated string in the store. The set now round-trips,
so a previously-used harness resumes instead of colliding, surviving daemon
restarts (the agent's on-disk session does too).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
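
A minimal sketch of the comma-separated round trip described in this commit. The function names are hypothetical, and representing the set as `map[string]bool` is an assumption; the store code in the PR may use a different type. Sorting is added here only to make the encoded string deterministic.

```go
package main

import (
	"sort"
	"strings"
)

// encodeHarnesses serialises the set as a sorted, comma-separated string so
// it fits a single text column.
func encodeHarnesses(set map[string]bool) string {
	names := make([]string, 0, len(set))
	for name, ok := range set {
		if ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return strings.Join(names, ",")
}

// decodeHarnesses parses the column value back into a set. An empty column
// yields an empty set rather than a set containing "".
func decodeHarnesses(s string) map[string]bool {
	set := make(map[string]bool)
	for _, name := range strings.Split(s, ",") {
		if name = strings.TrimSpace(name); name != "" {
			set[name] = true
		}
	}
	return set
}
```

This encoding relies on harness names never containing a comma, which holds for the names used in this thread.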
What & why

A session's agent (`harness`) is currently fixed at spawn (`SessionRecord.Harness`, resolved once by `effectiveHarness` in `session_manager`) and can never change. Mixing agents on a single task (e.g. codex to test, claude-code to write, a cheaper model for cost) is impossible without throwing the work away and respawning.

This PR lets you change a session's agent in place, keeping the same git worktree (all code + uncommitted work preserved). The new agent launches fresh: there is no native resume, since a different harness cannot read the outgoing agent's session. An optional `model` override rides the same path.

Behaviour

Two cases are handled:

- … `BeginSwitch`/`EndSwitch` guard on the lifecycle manager makes the reaper ignore the brief window where the old runtime is gone and the new one isn't up yet; otherwise a "dead" probe would wrongly terminate the session mid-switch.
- … `merged` sessions stay locked.

`lifecycle.MarkSwitched` atomically changes the persisted harness, points at the new runtime handle, and clears the harness-specific `AgentSessionID` (which `MarkSpawned`'s metadata merge cannot, since it only sets non-empty fields), resetting activity/first-signal so the new agent re-proves its hook pipeline.

Surfaces

- Backend: `sessionmanager.SwitchHarness` (live + terminated paths) and `lifecycle` guard + `MarkSwitched`.
- API: `POST /sessions/{id}/switch` `{harness, model?}`: `400` unknown harness, `404` unknown session, `409` switch-in-progress. OpenAPI spec + `frontend/src/api/schema.ts` regenerated (pinned `openapi-typescript@7.4.4`).
- CLI: `ao session switch <id> --harness <agent> [--model ...]`.
- UI: … `dropdown-menu` primitive and `AGENT_OPTIONS`; `merged` sessions stay read-only.

Tests

- … `AgentSessionID`
- `MarkSwitched` changes harness, clears `AgentSessionID`, resets first-signal

`go build ./...`, `go vet ./...`, the full backend `go test ./...`, and frontend `tsc` all pass.

Notes / follow-ups

- … `harness` column and `UpdateSession` already exist.

🤖 Generated with Claude Code