
feat(session): switch a session's agent mid-flight#2412

Open
Pulkit7070 wants to merge 9 commits into AgentWrapper:main from Pulkit7070:feat/switch-agent

Conversation

@Pulkit7070


What & why

A session's agent (harness) is currently fixed at spawn (SessionRecord.Harness, resolved once by effectiveHarness in session_manager) and can never change. Mixing agents on a single task — e.g. codex to test, claude-code to write, a cheaper model for cost — is impossible without throwing the work away and respawning.

This PR lets you change a session's agent in place, keeping the same git worktree (all code + uncommitted work preserved). The new agent launches fresh — there is no native resume, since a different harness cannot read the outgoing agent's session. An optional model override rides the same path.

Behaviour

Two cases are handled:

  • Live session → swap in place. The old agent is torn down only after the new launch command validates, so a bad/unknown harness never disrupts the running session. A BeginSwitch/EndSwitch guard on the lifecycle manager makes the reaper ignore the brief window where the old runtime is gone and the new one isn't up yet — otherwise a "dead" probe would wrongly terminate the session mid-switch.
  • Terminated session → relaunch-as. When an agent exits (e.g. codex ends its process on completion) the session is marked terminated. Switching such a session restores its worktree and launches the chosen agent fresh under it — the way to bring a finished task back under a different agent (plain restore keeps the original harness). Only fully merged sessions stay locked.

lifecycle.MarkSwitched atomically changes the persisted harness, points at the new runtime handle, and clears the harness-specific AgentSessionID (which MarkSpawned's metadata merge cannot, since it only sets non-empty fields), resetting activity/first-signal so the new agent re-proves its hook pipeline.
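
To illustrate why a merge that only sets non-empty fields cannot clear AgentSessionID, here is a minimal sketch. The SessionMetadata struct below is a trimmed, hypothetical stand-in, and mergeNonEmpty and markSwitched are illustrative names, not the real MarkSpawned or MarkSwitched implementations.

```go
package main

import "fmt"

// SessionMetadata is a hypothetical, trimmed-down stand-in for the real record.
type SessionMetadata struct {
	Harness        string
	RuntimeHandle  string
	AgentSessionID string
}

// mergeNonEmpty mimics a merge that copies only non-empty fields, so an
// empty AgentSessionID in the update leaves the old value in place.
func mergeNonEmpty(cur, upd SessionMetadata) SessionMetadata {
	if upd.Harness != "" {
		cur.Harness = upd.Harness
	}
	if upd.RuntimeHandle != "" {
		cur.RuntimeHandle = upd.RuntimeHandle
	}
	if upd.AgentSessionID != "" {
		cur.AgentSessionID = upd.AgentSessionID
	}
	return cur
}

// markSwitched overwrites harness and handle and explicitly clears the
// harness-specific AgentSessionID (an assumed shape for the real call).
func markSwitched(cur SessionMetadata, harness, handle string) SessionMetadata {
	cur.Harness = harness
	cur.RuntimeHandle = handle
	cur.AgentSessionID = ""
	return cur
}

func main() {
	old := SessionMetadata{Harness: "codex", RuntimeHandle: "rt-1", AgentSessionID: "abc"}
	merged := mergeNonEmpty(old, SessionMetadata{Harness: "claude-code", RuntimeHandle: "rt-2"})
	switched := markSwitched(old, "claude-code", "rt-2")
	fmt.Printf("merge keeps stale id: %q\n", merged.AgentSessionID)
	fmt.Printf("switch clears id: %q\n", switched.AgentSessionID)
}
```

With the merge, the new harness would inherit the outgoing agent's session ID; the explicit overwrite avoids that.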

Surfaces

  • Backend: sessionmanager.SwitchHarness (live + terminated paths) and lifecycle guard + MarkSwitched.
  • API: POST /sessions/{id}/switch {harness, model?}. Returns 400 for an unknown harness, 404 for an unknown session, and 409 when a switch is already in progress. OpenAPI spec + frontend/src/api/schema.ts regenerated (pinned openapi-typescript@7.4.4).
  • CLI: ao session switch <id> --harness <agent> [--model ...].
  • UI: the session inspector's Overview → Agent row is now a dropdown that switches the agent in place (or relaunches a terminated one), built from the existing dropdown-menu primitive and AGENT_OPTIONS; merged sessions stay read-only.
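
The status mapping for the API surface can be sketched roughly as follows. The sentinel error names are taken from the flow diagram later in this thread, but their definitions here, the statusFor helper, the 200 success code, and the 500 fallback are assumptions for illustration, not the handler's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel errors; the real ones live in the backend, and these
// definitions are stand-ins.
var (
	ErrUnknownHarness   = errors.New("unknown harness")
	ErrNotFound         = errors.New("session not found")
	ErrSwitchInProgress = errors.New("switch in progress")
)

// statusFor maps a switch error to an HTTP status. errors.Is is used so
// wrapped errors still match their sentinel.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK // assumed success code
	case errors.Is(err, ErrUnknownHarness):
		return http.StatusBadRequest
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrSwitchInProgress):
		return http.StatusConflict
	default:
		return http.StatusInternalServerError // assumed fallback
	}
}

func main() {
	wrapped := fmt.Errorf("switch s1: %w", ErrSwitchInProgress)
	fmt.Println(statusFor(wrapped))
}
```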

Tests

  • live swap clears AgentSessionID
  • unknown harness leaves the running agent untouched (validate-before-destroy)
  • terminated session relaunches under the new agent (restore-as)
  • create-failure terminates cleanly (no live-session-with-dead-handle)
  • reaper guard suppresses termination during a switch
  • MarkSwitched changes harness, clears AgentSessionID, resets first-signal

go build ./..., go vet ./..., the full backend go test ./..., and frontend tsc all pass.

Notes / follow-ups

  • No migration: the harness column and UpdateSession already exist.
  • The target agent's CLI must be installed on PATH — the existing pre-flight binary check rejects otherwise, surfaced inline in the UI.
  • Context handoff (carrying the outgoing agent's transcript into the new one) is intentionally out of scope; this leaves a clean seam (the fresh launch's prompt) for it.

🤖 Generated with Claude Code

Pulkit7070 and others added 2 commits July 4, 2026 15:07
Add the ability to change a session's agent (harness) without losing the
worktree. Previously a session's harness was fixed at spawn and could never
change; mixing agents on one task (e.g. codex to test, claude-code to write,
a cheaper model for cost) was impossible.

A switch keeps the same git worktree (all code + uncommitted work preserved)
and launches the new agent fresh — there is no native resume, since a
different harness cannot read the outgoing agent's session. An optional model
override rides the same path.

Two cases are handled:
- Live session: swap in place. The old agent is torn down only AFTER the new
  launch command validates, so a bad/unknown harness never disrupts the
  running session. A BeginSwitch/EndSwitch guard on the lifecycle manager makes
  the reaper ignore the brief runtime gap so it is not mistaken for a crash.
- Terminated session (e.g. the agent exited): relaunch-as. The worktree is
  restored and the new agent launched fresh under it — the way to bring a
  finished task back under a different agent (plain restore keeps the harness).

lifecycle.MarkSwitched atomically changes the persisted harness, points at the
new runtime handle, and clears the harness-specific AgentSessionID (which
MarkSpawned's merge cannot), resetting activity/first-signal so the new agent
re-proves its hook pipeline.

Surfaces:
- Backend: sessionmanager.SwitchHarness + lifecycle guard/MarkSwitched.
- API: POST /sessions/{id}/switch {harness, model?} (400/404/409 mapped);
  OpenAPI spec + frontend schema.ts regenerated.
- CLI: `ao session switch <id> --harness <agent> [--model ...]`.
- UI: the session inspector's Overview "Agent" row is now a dropdown that
  switches the agent in place (or relaunches a terminated one); merged sessions
  stay read-only.

Tests cover: live swap clears AgentSessionID, unknown harness leaves the agent
running, terminated relaunch-as, create-failure terminates cleanly, the reaper
guard suppresses termination during a switch, and MarkSwitched semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@nikhilachale nikhilachale self-requested a review July 4, 2026 14:36

@nikhilachale nikhilachale left a comment

Collaborator


  1. P1: switch-in-progress guard is not atomic, so duplicate switches can still race
    internal/session_manager/manager.go:591 checks IsSwitching, but internal/session_manager/manager.go:656 sets the guard later and BeginSwitch is idempotent with no failure return. Two concurrent requests can both pass the check, both enter BeginSwitch, and then both destroy/create runtimes against the same worktree. The terminated path is worse because internal/session_manager/manager.go:691 never begins the guard at all, so two relaunch-as switches can create two runtimes. This needs an atomic TryBeginSwitch/compare-and-set style API used by both live and terminated switch paths.

  2. P2: terminated switch bypasses the existing not-resumable guard
    Restore refuses promptless, unresumable worker sessions via restoreArgv at internal/session_manager/manager.go:1433, but relaunchTerminatedWithHarness always calls fresh GetLaunchCommand with meta.Prompt at internal/session_manager/manager.go:707. That means ao session switch can resurrect a terminated worker with no prompt/native resume context into a blank agent session, which the existing restore path deliberately prevents. If switch-relaunch is meant to preserve restore semantics except for harness selection, it should reject the same promptless worker case.

Pulkit7070 and others added 2 commits July 4, 2026 23:56
…ability

P1: the switch-in-progress guard was a check-then-act (IsSwitching + later
BeginSwitch) and the terminated path never claimed it, so two concurrent
switches could both proceed and race two teardown/relaunch cycles over one
worktree. Replace with an atomic lifecycle.TryBeginSwitch (single-critical-
section compare-and-set) claimed once at the top of SwitchHarness for both the
live and terminated paths, released via defer.

P2: relaunchTerminatedWithHarness always fresh-launched with meta.Prompt,
bypassing restoreArgv's guard. Since the new harness cannot native-resume the
old agent's session, a terminated worker with no saved prompt would blank-
relaunch — which Restore deliberately refuses. Reject the same promptless-
worker case with ErrNotResumable (orchestrators stay promptless by design).

Tests: reject concurrent switch, terminated promptless worker rejected;
lifecycle TryBeginSwitch compare-and-set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Pulkit7070

Author

Thanks @nikhilachale — both good catches, fixed in 551742c.

P1 (atomic switch guard): Replaced the check-then-act (IsSwitching + later BeginSwitch) with an atomic lifecycle.TryBeginSwitch — a single-critical-section compare-and-set. It's now claimed once at the top of SwitchHarness for both the live and terminated paths and released via defer, so two concurrent requests can't both pass, and the terminated path is no longer unguarded. Added TestSwitchHarness_RejectsConcurrentSwitch and a compare-and-set test on the lifecycle side.
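
For readers following along, a single-critical-section compare-and-set of the kind described for P1 might look like the sketch below. The switchGuard type and its map-based state are assumptions made for illustration; only the TryBeginSwitch and EndSwitch names come from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// switchGuard is a hypothetical stand-in for the lifecycle manager's
// switch-in-progress state.
type switchGuard struct {
	mu        sync.Mutex
	switching map[string]bool
}

func newSwitchGuard() *switchGuard {
	return &switchGuard{switching: make(map[string]bool)}
}

// TryBeginSwitch checks and sets the flag under one lock, so exactly one
// of several concurrent callers gets true.
func (g *switchGuard) TryBeginSwitch(id string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.switching[id] {
		return false
	}
	g.switching[id] = true
	return true
}

// EndSwitch releases the claim; the caller would typically defer it.
func (g *switchGuard) EndSwitch(id string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.switching, id)
}

func main() {
	g := newSwitchGuard()
	var wg sync.WaitGroup
	var mu sync.Mutex
	claimed := 0
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if g.TryBeginSwitch("s1") { // the winner never releases in this demo
				mu.Lock()
				claimed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println("claims:", claimed) // prints "claims: 1"
}
```

The difference from the earlier IsSwitching-then-BeginSwitch sequence is that the check and the set cannot be interleaved by another request.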

P2 (terminated resumability): relaunchTerminatedWithHarness now mirrors restoreArgv's guard — since the new harness can't native-resume the old agent's session, a terminated worker with no saved prompt would blank-relaunch, which Restore deliberately refuses. It now returns ErrNotResumable for that case (orchestrators stay promptless by design). Added TestSwitchHarness_TerminatedPromptlessWorkerRejected.

Full backend go test ./... + frontend tsc pass.

# Conflicts:
#	backend/internal/httpd/apispec/openapi.yaml
#	frontend/src/api/schema.ts
#	frontend/src/renderer/components/SessionInspector.tsx
@Pulkit7070

Author

Flow — SwitchHarness

One entry point, forking on whether the agent is still alive, converging on one atomic write. Validate-before-destroy + an atomic guard mean a bad agent never disrupts a running session and the reaper never mistakes the swap for a crash.

flowchart TD
  A["POST /sessions/{id}/switch { harness, model? }"] --> B{session found +<br/>worktree present?}
  B -- no --> E1["reject: ErrNotFound /<br/>ErrIncompleteHandle"]
  B -- yes --> C{"TryBeginSwitch(id)<br/>atomic compare-and-set"}
  C -- already switching --> E2["reject: ErrSwitchInProgress"]
  C -- claimed (defer EndSwitch) --> D{"validate agent:<br/>known + adapter + binary on PATH"}
  D -- invalid --> E3["reject: ErrUnknownHarness /<br/>binary not found"]
  D -- ok --> S{session terminated?}

  S -- "no · LIVE (swap in place)" --> L1["prepareWorkspace"]
  L1 --> L2["runtime.Destroy old handle"]
  L2 --> L3["runtime.Create new"]
  L3 -- fail --> LF["MarkTerminated<br/>(no dead-handle-live)"]

  S -- "yes · TERMINATED (relaunch-as)" --> T0{promptless worker?}
  T0 -- yes --> E4["reject: ErrNotResumable"]
  T0 -- no --> T1["workspace.Restore"]
  T1 --> T2["prepareWorkspace → runtime.Create new"]

  L3 --> M["lifecycle.MarkSwitched<br/>set harness · point at new handle ·<br/>clear AgentSessionID · reset activity"]
  T2 --> M
  M --> R["EndSwitch (defer) →<br/>updated session streamed over CDC<br/>(terminal re-attaches, UI reflects new agent)"]

While the guard is held, the lifecycle reducer ignores this session's reaper "dead" probe — that's what makes the brief runtime gap safe. The worktree (code + uncommitted work) is preserved throughout; only the runtime handle is disposable.
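
A rough sketch of how a reducer could ignore the "dead" probe while the guard is held is shown below. The lifecycle struct, OnDeadProbe, and the in-memory maps are hypothetical; the real reducer is not shown in this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// lifecycle is a hypothetical, minimal stand-in for the lifecycle manager.
type lifecycle struct {
	mu         sync.Mutex
	switching  map[string]bool
	terminated map[string]bool
}

func newLifecycle() *lifecycle {
	return &lifecycle{switching: map[string]bool{}, terminated: map[string]bool{}}
}

// TryBeginSwitch claims the guard for a session; false if already held.
func (l *lifecycle) TryBeginSwitch(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.switching[id] {
		return false
	}
	l.switching[id] = true
	return true
}

// EndSwitch releases the guard.
func (l *lifecycle) EndSwitch(id string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.switching, id)
}

// OnDeadProbe handles a reaper "dead" probe and reports whether it
// terminated the session. A probe that arrives mid-switch is ignored.
func (l *lifecycle) OnDeadProbe(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.switching[id] {
		return false
	}
	l.terminated[id] = true
	return true
}

func main() {
	lc := newLifecycle()
	lc.TryBeginSwitch("s1")
	fmt.Println("during switch:", lc.OnDeadProbe("s1"))
	lc.EndSwitch("s1")
	fmt.Println("after switch:", lc.OnDeadProbe("s1"))
}
```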

@Pulkit7070

Pulkit7070 commented Jul 5, 2026

Author

@nikhilachale re-requesting your review: both of your comments are addressed (atomic TryBeginSwitch guard for P1, promptless-worker ErrNotResumable for P2), merge conflicts are resolved, and the flow diagram is posted above. Ready for another pass.

@Pulkit7070 Pulkit7070 requested a review from nikhilachale July 5, 2026 05:04
…-used agents

Two fixes on the agent-switch path.

Medium (review): terminated relaunch-as restored the worktree to ws.Path but
MarkSwitched only updated the runtime handle, leaving the old
Metadata.WorkspacePath/Branch. A changed session prefix or managed root could
restore to a different path while the stored session still pointed at the old
one, breaking later terminal/workspace/cleanup ops. MarkSwitched now takes full
SessionMetadata and persists WorkspacePath/Branch from the launch.

Bug: switching to a harness that had already run the session relaunched it
fresh, colliding with the agent's own prior native session — Claude Code pins a
deterministic --session-id, so a fresh relaunch failed with "Session ID <uuid>
is already in use". Sessions now track the set of harnesses they've launched
(SessionMetadata.LaunchedHarnesses); a previously-used harness RESUMES (via the
adapter's restore command) while a new one launches fresh. The promptless-worker
guard now applies only to fresh launches (a resume needs no saved prompt).

Tests: MarkSwitched persists workspace path/branch + launched set; resume for a
previously-used harness; fresh launch (and set update) for a new harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Pulkit7070

Author

Fixed in 1e6d9a3.

Medium (workspace path not persisted): MarkSwitched now takes the full domain.SessionMetadata and persists WorkspacePath/Branch from the launch, so the terminated relaunch-as records the restored worktree instead of leaving the stale one. Added a MarkSwitched test that passes a different path/branch and asserts the session metadata is updated.

Bonus — also fixed a related runtime bug I hit while testing: switching to a harness that had already run the session relaunched it fresh and collided with the agent's own prior native session — Claude Code pins a deterministic --session-id, so a fresh relaunch failed with Session ID <uuid> is already in use. Sessions now track the set of harnesses they've launched (SessionMetadata.LaunchedHarnesses); a previously-used harness resumes via the adapter's restore command, a new one launches fresh. The promptless-worker ErrNotResumable guard now applies only to fresh launches (a resume needs no saved prompt). Tests cover resume-vs-fresh both ways.
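
The resume-vs-fresh decision for the terminated relaunch path, together with the narrowed promptless-worker guard, can be sketched like this. The session struct, the launchMode type, and planLaunch are hypothetical names for illustration; only LaunchedHarnesses and ErrNotResumable come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotResumable is a stand-in definition for the backend's sentinel.
var ErrNotResumable = errors.New("session not resumable")

type launchMode int

const (
	launchFresh launchMode = iota
	launchResume
)

// session is a hypothetical, trimmed view of the stored session.
type session struct {
	IsOrchestrator    bool
	Prompt            string
	LaunchedHarnesses map[string]bool
}

// planLaunch resumes a harness that already ran this session and
// fresh-launches a new one. The promptless-worker guard applies only to
// the fresh branch, since a resume needs no saved prompt.
func planLaunch(s session, target string) (launchMode, error) {
	if s.LaunchedHarnesses[target] {
		return launchResume, nil
	}
	if !s.IsOrchestrator && s.Prompt == "" {
		return launchFresh, ErrNotResumable
	}
	return launchFresh, nil
}

func main() {
	s := session{LaunchedHarnesses: map[string]bool{"claude-code": true}}
	mode, err := planLaunch(s, "claude-code")
	fmt.Println(mode == launchResume, err)
	_, err = planLaunch(s, "codex")
	fmt.Println(err)
}
```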

Backend go test ./... + go vet pass.

Pulkit7070 and others added 2 commits July 5, 2026 20:02
…pdown to authed agents

A terminated agent's tmux session can outlive the agent process (the keep-alive
shell keeps it open), so its deterministic session name stays taken. The
terminated relaunch-as path skipped Destroy (it assumed no live runtime) and
went straight to Create, which failed with "duplicate session <id>" (surfaced
as a 500). It now tears down any leftover runtime handle before Create — Destroy
is idempotent, so an already-gone session is a no-op. The live path already did
this. Test: terminated relaunch with a lingering handle destroys it before Create.

UI: the Agent-row switch dropdown now lists only agents whose local auth probe
passed (the catalog's authorized set, same source the spawn dialogs use) instead
of every known harness, so users don't pick an agent that just fails at launch.
Empty/loading states render a disabled hint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
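
The leftover-runtime fix in the commit above relies on Destroy being idempotent. A minimal sketch with an in-memory runtime follows; fakeRuntime, relaunch, and the session name are hypothetical and do not reflect the real tmux-backed adapter.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrDuplicateSession = errors.New("duplicate session")

// fakeRuntime is a hypothetical in-memory stand-in for a runtime that
// keys sessions by a deterministic name.
type fakeRuntime struct {
	sessions map[string]bool
}

// Create fails if the name is still taken, as a lingering session would cause.
func (r *fakeRuntime) Create(name string) error {
	if r.sessions[name] {
		return fmt.Errorf("%w %s", ErrDuplicateSession, name)
	}
	r.sessions[name] = true
	return nil
}

// Destroy is idempotent: removing an already-gone session is a no-op.
func (r *fakeRuntime) Destroy(name string) error {
	delete(r.sessions, name)
	return nil
}

// relaunch tears down any leftover handle before creating the new one.
func relaunch(r *fakeRuntime, name string) error {
	if err := r.Destroy(name); err != nil {
		return err
	}
	return r.Create(name)
}

func main() {
	// The name is still taken because the keep-alive shell outlived the agent.
	r := &fakeRuntime{sessions: map[string]bool{"ao-s1": true}}
	fmt.Println("create only:", r.Create("ao-s1"))
	fmt.Println("destroy then create:", relaunch(r, "ao-s1"))
}
```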
…y triggers

The previous commit tracked LaunchedHarnesses on SessionMetadata to drive the
resume-vs-fresh switch decision, but the SQLite store maps metadata to explicit
columns and had no column for it — so it was silently dropped on write and read
back empty. The resume branch never fired, and switching back to a
previously-used deterministic-id agent (Claude Code) still fresh-launched and
collided ("Session ID <uuid> is already in use").

Add a durable launched_harnesses column (migration 0022), thread it through the
InsertSession/UpdateSession/Select queries (sqlc regenerated), and serialise the
harness set as a comma-separated string in the store. The set now round-trips,
so a previously-used harness resumes instead of colliding, surviving daemon
restarts (the agent's on-disk session does too).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
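
The comma-separated round trip described in the commit above could be sketched as follows. encodeHarnesses and decodeHarnesses are hypothetical helper names, and the sketch assumes harness names never contain a comma. The empty column needs care because strings.Split on an empty string yields one empty element.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// encodeHarnesses serialises the set as a sorted, comma-separated string
// so the stored value is deterministic.
func encodeHarnesses(set map[string]bool) string {
	names := make([]string, 0, len(set))
	for name := range set {
		names = append(names, name)
	}
	sort.Strings(names)
	return strings.Join(names, ",")
}

// decodeHarnesses parses the column back into a set, skipping empty
// elements so an empty column decodes to an empty set.
func decodeHarnesses(col string) map[string]bool {
	set := make(map[string]bool)
	for _, name := range strings.Split(col, ",") {
		if name = strings.TrimSpace(name); name != "" {
			set[name] = true
		}
	}
	return set
}

func main() {
	col := encodeHarnesses(map[string]bool{"codex": true, "claude-code": true})
	fmt.Println(col)                      // prints "claude-code,codex"
	fmt.Println(len(decodeHarnesses(""))) // prints "0"
}
```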

2 participants