Skip to content

worker: remove fd from epoll on the loop thread before recycling Conn#216

Open
kwsantiago wants to merge 1 commit into
karlseguin:masterfrom
privkeyio:fix/epoll-del-before-free
Open

worker: remove fd from epoll on the loop thread before recycling Conn#216
kwsantiago wants to merge 1 commit into
karlseguin:masterfrom
privkeyio:fix/epoll-del-before-free

Conversation

@kwsantiago

Copy link
Copy Markdown
Contributor

Problem

In non-blocking mode, an HTTP connection's fd is armed level-triggered (monitorRead: EPOLL.IN | RDHUP, no ONESHOT) with the raw Conn pointer as epoll userdata. When a request finishes with .close/.unknown, processHTTPData closes the socket on a worker thread to (implicitly) drop it from epoll, then signals the loop; processSignal later recycles the Conn via the MemoryPool (disown), with no explicit EPOLL_CTL_DEL.

Closing the fd on the worker thread races epoll_wait on the loop thread. Under connection churn the fd can remain armed past the point where disown returns the Conn to the pool, so a later batch delivers a .recv carrying a pointer to freed memory. protocol is the first field of Conn and the MemoryPool free-list clobbers it, so the recycled node reads as { .http = null } and http_conn.getState() dereferences null. We hit this in production as a recurring SIGSEGV in getState (@atomicLoad(&self._state)) at the _state offset, during normal operation, every few hours.

The deferred-signal change only orders signal-vs-recv within a single batch; it does not cover a .recv delivered in a later batch for an already-recycled Conn. HTTPConn.disown already does a synchronous EPOLL_CTL_DEL for the res.disown() path, for the same reason; the .close/.unknown path just relied on the worker-thread close() instead.

Fix

Move teardown of .close/.unknown connections onto the loop thread: drop the worker-thread posix.close, and in processSignal remove the fd from the event loop (EPOLL_CTL_DEL / EV.DELETE) and close it before disown recycles the Conn. Since the fd is still open at that point, the DEL is valid, and no later batch can reference the freed Conn. A remove helper is added to both the epoll and kqueue loops, mirroring HTTPConn.disown.

The signal-vs-recv-in-one-batch case the worker-thread close used to guard is already handled by deferring signal processing to the end of the batch plus the getState() .handover check in the recv path.

Builds clean on 0.16; full test suite (125 tests) passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant