Skip to content

fix(transport): close stdout pipe FD before process.wait() to avoid ~10s shutdown deadlock (#728)#1017

Open
rr-develop wants to merge 1 commit into
anthropics:mainfrom
rr-develop:upstream-fix/subprocess-stdout-close-deadlock
Open

fix(transport): close stdout pipe FD before process.wait() to avoid ~10s shutdown deadlock (#728)#1017
rr-develop wants to merge 1 commit into
anthropics:mainfrom
rr-develop:upstream-fix/subprocess-stdout-close-deadlock

Conversation

@rr-develop
Copy link
Copy Markdown

Summary

Fixes #728. SubprocessCLITransport.close() could hang for ~10 s when the claude CLI subprocess blocks in the kernel writing to a full stdout pipe that the parent has stopped reading. Query.close() cancels the stdout reader before calling transport.close(); transport.close() then sends EOF on stdin but leaves the stdout read-end open. The child's final write() blocks (S-state, interruptible sleep). SIGTERM doesn't help because Node.js leaves write() SA_RESTART-interruptible, so close() burns the full 5 s graceful + 5 s SIGTERM budget and only SIGKILL ends the deadlock.

Empirical finding

On the asyncio backend (Linux, Python 3.12, anyio 4.10), TextReceiveStream.aclose() does not close the kernel pipe FD. StreamReaderWrapper.aclose() only sets a Python-level ClosedResourceError on the underlying StreamReader; the FD owned by _UnixReadPipeTransport stays open and the child remains blocked in write().

Reproduced via 4 standalone scripts (real subprocess, real kernel pipes, no mocks): the child still in process table, proc.wait() hangs at the 10 s budget. After patching to close proc._process._transport.get_pipe_transport(1) directly, the child gets EPIPE and is reaped within 500 ms.

Fix

New private helper _close_stdout_pipe_fd() walks self._process._process._transport.get_pipe_transport(1) and calls .close() on it. BaseSubprocessTransport.get_pipe_transport(int) is public/stable since Python 3.4. Failure to walk the chain is logged at WARNING (not DEBUG) and falls back to the legacy SIGTERM/SIGKILL path — no regression vs. today, just no improvement.

In close() we call the helper between the existing stdin/stderr aclose block and process.wait(). The kept await self._stdout_stream.aclose() is load-bearing: pipe_transport.close() schedules _call_connection_lost via call_soon(), so the FD isn't actually closed until the loop runs another step; the await gives the loop that step before process.wait() enters its grace period.

Tests

tests/test_subprocess_cli_close_does_not_deadlock.py — 3 new regression tests using a real fake-claude binary (Python script, chmod 0o755):

  1. Pipe-full + SA_RESTART: child ignores SIGTERM, blocks on stdin, then dumps 256 KB into an unread stdout pipe on EOF. Asserts close() returns in < 2 s. Pre-fix: ~10 s. Post-fix: ~0.5 s.
  2. Happy path: trivial child that exits on stdin EOF — asserts no regression.
  3. Idempotency: two close() calls in a row must not raise.

No mocks of the transport or anyio internals. The autouse fixture sets CLAUDE_AGENT_SDK_SKIP_VERSION_CHECK=1 because the fake CLI cannot answer a --version probe (the SDK's own intended escape hatch).

All existing tests pass (727 passed, 30 pre-existing failures in *_trio tests — trio backend not installed in CI here).

Notes for reviewers

  • Why private API? We're touching _process._process._transport. We tried only public anyio surfaces first; the empirical section above documents why they're insufficient on asyncio. The traversal is fully guarded; a future anyio rewrite will degrade to the current SIGTERM/SIGKILL behaviour, with a clear WARNING log pointing at this issue.
  • Why WARNING (not DEBUG)? Production typically runs on asyncio + uvloop. Silent fallback would hide a regression for months.
  • Trailing self._stdout_stream = None reset at the bottom of close() is kept — it's a no-op on the success path (we already cleared it) but still clears state if the new block was skipped entirely (constructor never reached connect()).

This fix has been running for several hours on our internal deep-worker fork (50+ summarization rotations / 100+ transport closes) with no recurrences of the freeze. Happy to address any review comments.

…10s shutdown deadlock

When ``Query.close()`` cancels the stdout reader and then
``SubprocessCLITransport.close()`` sends EOF on stdin without closing the
read-end of stdout, the ``claude`` CLI subprocess can block in the kernel
writing its final stream-json record to a full pipe buffer that nobody
is reading. SIGTERM does not help — Node.js leaves write() SA_RESTART-
interruptible — so close() waits the full 5s + 5s + SIGKILL.

Empirical finding (4 standalone repro scripts on Linux/Python 3.12):
on the asyncio backend, ``TextReceiveStream.aclose()`` does NOT close the
kernel pipe FD. ``StreamReaderWrapper.aclose()`` only sets a Python-level
exception on the underlying StreamReader; the FD owned by
``_UnixReadPipeTransport`` stays open and the child remains blocked in
write().

The only reliable way to close the read-end FD is to close the asyncio
subprocess transport's stdout pipe transport directly via the documented
``BaseSubprocessTransport.get_pipe_transport(1).close()`` (public since
Python 3.4). We do this in a new private helper ``_close_stdout_pipe_fd``
that walks a small chain of private attrs
(``self._process._process._transport``), guarded by a broad try/except
with a WARNING-level log on failure. Failure falls back to the legacy
SIGTERM/SIGKILL path — i.e. no regression vs. today.

The kept ``await self._stdout_stream.aclose()`` is load-bearing:
pipe_transport.close() schedules ``_call_connection_lost`` via
``call_soon()``, so the FD is not actually closed until the loop runs
another step; the await gives the loop that step before ``process.wait()``.

Tests
-----
``tests/test_subprocess_cli_close_does_not_deadlock.py`` — 3 new
regression tests using a real fake-claude binary (Python script, chmod
0o755 explicit). Exercises the pipe-full + SA_RESTART scenario and
asserts close() returns in < 2 s (vs ~10 s pre-fix). Happy-path and
idempotency tests included.

All existing tests pass (727 passed, 30 pre-existing trio-backend
failures unrelated to this change).

Fixes: anthropics#728
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SubprocessCLITransport.close() hangs indefinitely if subprocess shutdown handler blocks

1 participant