[BUG] httpz epoll .recv segfault: getState on null/freed HTTPConn (worker disown lacks EPOLL_CTL_DEL) #120

Description

@wksantiago

Summary

Production SIGSEGV on 2026-06-27 09:02:54 (node on v0.5.5, normal operation, not shutdown):

Segmentation fault at address 0x2f8
httpz worker.zig:1693:40 in getState -> @atomicLoad(State, &self._state, .acquire)

0x2f8 is the offset of HTTPConn._state, so self (the http_conn pointer) is null/freed when the epoll loop dispatches a .recv event for it.
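The arithmetic behind that address can be checked with a small C sketch. The struct below is a hypothetical stand-in for HTTPConn: only the 0x2f8 offset comes from the report, and the padding array replaces whatever fields really precede _state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for HTTPConn. Only the 0x2f8 offset of the state
 * field is taken from the report; the padding is a placeholder. */
struct http_conn {
    unsigned char other_fields[0x2f8];
    uint8_t state; /* plays the role of HTTPConn._state */
};

/* Address that a load of base->state would touch, computed without
 * dereferencing base. */
uintptr_t state_address(uintptr_t base) {
    return base + offsetof(struct http_conn, state);
}
```

With a base of 0, state_address returns 0x2f8, which is why a fault at that exact address points at a null self rather than at a random wild pointer.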

Caller (worker.zig ~591):

.recv => |conn| switch (conn.protocol) {
    .http => |http_conn| switch (http_conn.getState()) { ... }
}

conn.protocol is tagged .http but http_conn is null or garbage: a Conn was freed or reused while epoll still delivered a .recv event for it.

Root cause

This is in vendored karlseguin/http.zig (pin 8dc6441, == upstream master == our v0.5.7). The worker.zig getState body and the .recv caller block are byte-identical between the v0.5.5 pin (5d1b4e2) and v0.5.7, so v0.5.5 -> v0.5.7 does not fix this crash.

  • epoll stores raw @intFromPtr(conn) as userdata (monitorRead, worker.zig:1298/1303).
  • Conn is recycled via std.heap.MemoryPool conn_mem_pool (decl :423); worker-level disown() does conn_mem_pool.destroy(conn) at :911.
  • MemoryPool free-list stores a next-ptr in the first bytes of a freed node, which alias Conn.protocol (first field, :1513). For the node at the tail of the free list (the one freed while the list was empty) that ptr is null => protocol reads as {.http = null}.
  • .recv handler (:595-599) then does conn.protocol.http -> http_conn.getState() (:1697) -> @atomicLoad(&null._state) faults at offset 0x2f8. Exact match.
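The aliasing described in the list above can be modeled in C. This is a sketch under assumptions, not the Zig implementation: the layout of protocol (payload pointer first, tag after it) is assumed, and pooled_conn, pool, and pool_destroy are hypothetical names standing in for Conn, conn_mem_pool, and MemoryPool.destroy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum proto_tag { PROTO_HTTP, PROTO_WEBSOCKET };

/* Assumed layout: payload pointer first, tag after it. */
struct protocol {
    void *payload;
    enum proto_tag tag;
};

struct pooled_conn {
    struct protocol protocol; /* first field, as in the report */
    int fd;
};

struct pool {
    void *free_list; /* head of the intrusive free list */
};

/* Push a node onto the free list by writing the old head into the
 * node's first bytes, which overlap protocol.payload. */
void pool_destroy(struct pool *p, struct pooled_conn *c) {
    memcpy(c, &p->free_list, sizeof p->free_list);
    p->free_list = c;
}

/* Returns 1 if a stale read of the freed node sees {HTTP, NULL}. */
int stale_read_sees_http_null(void) {
    static int dummy_http_conn;
    struct pooled_conn c = { { &dummy_http_conn, PROTO_HTTP }, 7 };
    struct pool p = { NULL };

    pool_destroy(&p, &c); /* list was empty, so the stored next is NULL */

    /* What an event loop holding a stale userdata pointer would read. */
    return c.protocol.tag == PROTO_HTTP && c.protocol.payload == NULL;
}
```

The tag survives the free because only the first pointer-sized bytes are overwritten, so the dispatch still takes the HTTP branch and then dereferences a null payload.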

Ordering bug: the worker-level disown (:900-912) frees the Conn (http_conn_pool.release + conn_mem_pool.destroy) without issuing an EPOLL_CTL_DEL of its own; it relies on a prior close() having removed the fd from the epoll set as a side effect. HTTPConn.disown (:1721-1746) does call epoll_ctl DEL, which shows the maintainers know epoll needs synchronous removal, but the worker disown path omits it. processHTTPData .close/.unknown (:838) closes the socket on a worker thread and then signals; the actual free happens later in the event loop (processSignal :782). Between that close() and a concurrent epoll_wait there is an inherent race window under max_connections churn. The deferred-signal fix (:576-657) only orders signal-vs-recv within one batch and does not cover cross-batch or worker-close races.
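The intended ordering (deregister, then close, then free) can be sketched against the Linux epoll API in C. The names conn, monitor_read, and deregister_and_free are hypothetical and only mirror the roles described above; the pipe stands in for a client socket.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

struct conn { int fd; }; /* hypothetical minimal Conn */

/* Register conn for read events with its raw pointer as userdata. */
int monitor_read(int epfd, struct conn *c) {
    struct epoll_event ev = { .events = EPOLLIN, .data = { .ptr = c } };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Remove the fd from the epoll set while it is still valid, then close
 * it, and only then release the memory the userdata pointed at. */
void deregister_and_free(int epfd, struct conn *c) {
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    free(c);
}

/* Returns the number of events reported after the release, or -1 if
 * the setup did not behave as expected. */
int events_after_release(void) {
    int fds[2];
    struct epoll_event out[4];
    if (pipe(fds) != 0) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    struct conn *c = malloc(sizeof *c);
    if (c == NULL) return -1;
    c->fd = fds[0];
    if (monitor_read(epfd, c) != 0) return -1;

    /* Make the read end readable so an event is pending. */
    if (write(fds[1], "x", 1) != 1) return -1;
    int before = epoll_wait(epfd, out, 4, 0);

    deregister_and_free(epfd, c);
    int after = epoll_wait(epfd, out, 4, 0);

    close(fds[1]);
    close(epfd);
    return before == 1 ? after : -1;
}
```

Once the fd is deleted, later epoll_wait calls no longer report it. Events already copied out by an earlier epoll_wait call are not retracted, so this ordering narrows the window but does not by itself protect a batch that is already being processed.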

Fix direction

  • EPOLL_CTL_DEL before close, synchronously on the owning still-valid fd, centralized in Conn.close() (:1540) + the processHTTPData .close branch; or
  • robust fix: generation guard / stop recycling Conn through a clobbering MemoryPool.
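One possible shape for the generation guard, sketched in C: instead of a raw pointer, the epoll userdata carries a slot index plus a generation counter, and the event loop drops any event whose generation no longer matches. Everything here (conn_slot, MAX_CONNS, slot_acquire, slot_release, slot_token, slot_lookup) is hypothetical and not part of http.zig.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNS 4

struct conn_slot {
    uint32_t generation; /* bumped on every release */
    int in_use;
    int fd;
};

static struct conn_slot slots[MAX_CONNS];

/* Pack slot index and current generation into a 64-bit token that
 * would fit in epoll_event.data.u64. */
uint64_t slot_token(uint32_t index) {
    return ((uint64_t)index << 32) | slots[index].generation;
}

/* Claim the first free slot; returns its index, or -1 if full. */
int slot_acquire(int fd) {
    for (int i = 0; i < MAX_CONNS; i++) {
        if (!slots[i].in_use) {
            slots[i].in_use = 1;
            slots[i].fd = fd;
            return i;
        }
    }
    return -1;
}

/* Release a slot and invalidate every token handed out for it. */
void slot_release(uint32_t index) {
    slots[index].in_use = 0;
    slots[index].generation++;
}

/* Resolve a token from an event; NULL means the event is stale. */
struct conn_slot *slot_lookup(uint64_t token) {
    uint32_t index = (uint32_t)(token >> 32);
    uint32_t generation = (uint32_t)token;
    if (index >= MAX_CONNS) return NULL;
    struct conn_slot *s = &slots[index];
    if (!s->in_use || s->generation != generation) return NULL;
    return s;
}
```

A token taken before slot_release stops resolving afterwards, even when the same slot has been handed to a new connection, so a late .recv event becomes a no-op instead of a dereference of recycled memory. The trade-off is an extra lookup per event and a fixed slot table in place of pointer userdata.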

The fix belongs upstream in karlseguin/http.zig; carry it via an upstream PR or a temporary fork pin in build.zig.zon (archive url + hash).

Repro

Likely reproducible under connection churn at or near max_connections (1000).

Not a duplicate of
