fix(transport): close stdout pipe FD before process.wait() to avoid ~10s shutdown deadlock (#728)#1017
Open
rr-develop wants to merge 1 commit into
Conversation
…10s shutdown deadlock When ``Query.close()`` cancels the stdout reader and then ``SubprocessCLITransport.close()`` sends EOF on stdin without closing the read-end of stdout, the ``claude`` CLI subprocess can block in the kernel writing its final stream-json record to a full pipe buffer that nobody is reading. SIGTERM does not help — Node.js leaves write() SA_RESTART- interruptible — so close() waits the full 5s + 5s + SIGKILL. Empirical finding (4 standalone repro scripts on Linux/Python 3.12): on the asyncio backend, ``TextReceiveStream.aclose()`` does NOT close the kernel pipe FD. ``StreamReaderWrapper.aclose()`` only sets a Python-level exception on the underlying StreamReader; the FD owned by ``_UnixReadPipeTransport`` stays open and the child remains blocked in write(). The only reliable way to close the read-end FD is to close the asyncio subprocess transport's stdout pipe transport directly via the documented ``BaseSubprocessTransport.get_pipe_transport(1).close()`` (public since Python 3.4). We do this in a new private helper ``_close_stdout_pipe_fd`` that walks a small chain of private attrs (``self._process._process._transport``), guarded by a broad try/except with a WARNING-level log on failure. Failure falls back to the legacy SIGTERM/SIGKILL path — i.e. no regression vs. today. The kept ``await self._stdout_stream.aclose()`` is load-bearing: pipe_transport.close() schedules ``_call_connection_lost`` via ``call_soon()``, so the FD is not actually closed until the loop runs another step; the await gives the loop that step before ``process.wait()``. Tests ----- ``tests/test_subprocess_cli_close_does_not_deadlock.py`` — 3 new regression tests using a real fake-claude binary (Python script, chmod 0o755 explicit). Exercises the pipe-full + SA_RESTART scenario and asserts close() returns in < 2 s (vs ~10 s pre-fix). Happy-path and idempotency tests included. All existing tests pass (727 passed, 30 pre-existing trio-backend failures unrelated to this change). Fixes: anthropics#728
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #728.
SubprocessCLITransport.close()could hang for ~10 s when theclaudeCLI subprocess blocks in the kernel writing to a full stdout pipe that the parent has stopped reading.Query.close()cancels the stdout reader before callingtransport.close();transport.close()then sends EOF on stdin but leaves the stdout read-end open. The child's finalwrite()blocks (S-state, interruptible sleep). SIGTERM doesn't help because Node.js leaveswrite()SA_RESTART-interruptible, so close() burns the full 5 s graceful + 5 s SIGTERM budget and only SIGKILL ends the deadlock.Empirical finding
On the asyncio backend (Linux, Python 3.12, anyio 4.10),
TextReceiveStream.aclose()does not close the kernel pipe FD.StreamReaderWrapper.aclose()only sets a Python-levelClosedResourceErroron the underlyingStreamReader; the FD owned by_UnixReadPipeTransportstays open and the child remains blocked inwrite().Reproduced via 4 standalone scripts (real subprocess, real kernel pipes, no mocks): the child still in process table,
proc.wait()hangs at the 10 s budget. After patching to closeproc._process._transport.get_pipe_transport(1)directly, the child gets EPIPE and is reaped within 500 ms.Fix
New private helper
_close_stdout_pipe_fd()walksself._process._process._transport.get_pipe_transport(1)and calls.close()on it.BaseSubprocessTransport.get_pipe_transport(int)is public/stable since Python 3.4. Failure to walk the chain is logged at WARNING (not DEBUG) and falls back to the legacy SIGTERM/SIGKILL path — no regression vs. today, just no improvement.In
close()we call the helper between the existing stdin/stderr aclose block andprocess.wait(). The keptawait self._stdout_stream.aclose()is load-bearing:pipe_transport.close()schedules_call_connection_lostviacall_soon(), so the FD isn't actually closed until the loop runs another step; theawaitgives the loop that step beforeprocess.wait()enters its grace period.Tests
tests/test_subprocess_cli_close_does_not_deadlock.py— 3 new regression tests using a real fake-claudebinary (Python script,chmod 0o755):close()returns in < 2 s. Pre-fix: ~10 s. Post-fix: ~0.5 s.close()calls in a row must not raise.No mocks of the transport or anyio internals. The autouse fixture sets
CLAUDE_AGENT_SDK_SKIP_VERSION_CHECK=1because the fake CLI cannot answer a--versionprobe (the SDK's own intended escape hatch).All existing tests pass (727 passed, 30 pre-existing failures in
*_triotests — trio backend not installed in CI here).Notes for reviewers
_process._process._transport. We tried only public anyio surfaces first; the empirical section above documents why they're insufficient on asyncio. The traversal is fully guarded; a future anyio rewrite will degrade to the current SIGTERM/SIGKILL behaviour, with a clear WARNING log pointing at this issue.self._stdout_stream = Nonereset at the bottom ofclose()is kept — it's a no-op on the success path (we already cleared it) but still clears state if the new block was skipped entirely (constructor never reachedconnect()).This fix has been running for several hours on our internal deep-worker fork (50+ summarization rotations / 100+ transport closes) with no recurrences of the freeze. Happy to address any review comments.