Claude/review daemon architecture k merv#478
…, ControlService

Stands up SparkDaemon per the services architecture plan: a long-lived out-of-process service for shared cache / session / build state across multiple engine and editor instances. Phase 1 lands the transport + dispatch framework only — asset / shader / collab / build services are separate, independently shippable phases.

- `DaemonProtocol.h` — wire format: 8-byte LE header (4 B size + 2 B service + 2 B msgtype), 16 MiB payload cap, stable ServiceId constants
- `DaemonFraming.h` — header-only SendFrame / RecvFrame shared by client and server; the shutdown flag only aborts waiting loops, never in-flight I/O
- `DaemonClient.h/cpp` — engine-side client with Connect / Request / SendOneWay / Ping / Disconnect; an internal mutex makes it safe to share across threads
- `SparkDaemon/` — standalone executable; non-blocking accept + poll() so Stop() exits within 500 ms without cross-thread fd games
- `ControlService` — built-in ping / version / shutdown, always registered
- 10 new tests (6 codec + 4 loopback); 5511 / 5512 of the full suite pass (1 pre-existing flaky warn, no regressions)

POSIX-only for now — Windows Connect() / Run() return "not yet implemented" until the named-pipe transport lands with the first Windows-shipping service.

https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
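A minimal sketch of the 8-byte LE header described above (4-byte size, 2-byte service id, 2-byte message type, 16 MiB cap); the helper names are hypothetical, not the actual `DaemonFraming.h` API:

```cpp
#include <array>
#include <cstdint>

// Hypothetical encode/decode for the frame header layout described in the
// commit: [0..3] payload size, [4..5] service id, [6..7] message type, all LE.
constexpr std::uint32_t kMaxPayloadBytes = 16u * 1024u * 1024u;  // 16 MiB cap

std::array<std::uint8_t, 8> EncodeHeader(std::uint32_t size,
                                         std::uint16_t service,
                                         std::uint16_t msgType) {
    std::array<std::uint8_t, 8> h{};
    for (int i = 0; i < 4; ++i) h[i]     = static_cast<std::uint8_t>(size    >> (8 * i));
    for (int i = 0; i < 2; ++i) h[4 + i] = static_cast<std::uint8_t>(service >> (8 * i));
    for (int i = 0; i < 2; ++i) h[6 + i] = static_cast<std::uint8_t>(msgType >> (8 * i));
    return h;
}

bool DecodeHeader(const std::array<std::uint8_t, 8>& h,
                  std::uint32_t& size, std::uint16_t& service, std::uint16_t& msgType) {
    size = 0; service = 0; msgType = 0;
    for (int i = 3; i >= 0; --i) size    = (size << 8) | h[i];
    for (int i = 1; i >= 0; --i) service = static_cast<std::uint16_t>((service << 8) | h[4 + i]);
    for (int i = 1; i >= 0; --i) msgType = static_cast<std::uint16_t>((msgType << 8) | h[6 + i]);
    return size <= kMaxPayloadBytes;  // reject oversized frames before allocating
}
```

Byte-wise shifts keep the encoding host-endianness-independent, which matters for a wire format shared across platforms.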
First real service on top of the Phase 1 foundation. The daemon now
stores compiled shader blobs keyed by (sourceHash, target, stage) and
serves them to engine instances that connect over AF_UNIX. In-memory
only for this slice; on-disk persistence and actual HLSL compilation
(the big payoff) are later phases.
- `ShaderServiceProtocol.h` — request/response structs + codec, reusing the header-only `BinaryWriter` / `BinaryReader`
- `ShaderServiceClient.{h,cpp}` — typed engine-side facade; stateless, shares the mutex inside `DaemonClient`
- `SparkDaemon/src/ShaderService.{h,cpp}` — `unordered_map<Key,blob>` under a mutex, hit/miss counters as relaxed atomics; overwrites keep the totalBytes bookkeeping correct
- Target/stage are `uint8_t` on the wire so the daemon has zero coupling to `Spark::Graphics::ShaderTarget` — engine casts at the call site
- Registered in `SparkDaemon` `main.cpp` alongside `ControlService`
- 6 new loopback tests (roundtrip, miss, stats, clear, overwrite, target-distinctness); full suite 5518 / 5518
Phase 1 touch-ups baked in:
- Message type `0x00FF` (`ErrorResponse`) is now reserved across every service, not just `Control` — `DaemonClient::Request` detects it uniformly by message type regardless of service ID, so per-service wrappers don't need bespoke error handling. `DaemonProtocol.h` comment documents the reservation.
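The cross-service reservation above amounts to classifying a reply by message type alone; a minimal sketch with hypothetical names:

```cpp
#include <cstdint>

// Hypothetical sketch: 0x00FF is reserved as ErrorResponse for every
// ServiceId, so the client needs no per-service error knowledge.
constexpr std::uint16_t kMsgErrorResponse = 0x00FF;

struct FrameHeader {
    std::uint16_t service;
    std::uint16_t msgType;
};

bool IsErrorResponse(const FrameHeader& h) {
    return h.msgType == kMsgErrorResponse;  // service id deliberately ignored
}
```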
https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
Warm cache now survives daemon restarts. Opt-in via the new `--cache-dir` flag; without it, the service keeps its Phase 2a in-memory-only behaviour, so unit tests and ephemeral deployments are unchanged.

- `ShaderService::Initialize(cacheDir)` — creates the directory if missing, scans for `<hex16>_<dec3>_<dec3>.blob` files and loads each into the in-memory map; malformed filenames / wrong extensions are silently skipped
- `HandlePutCacheEntry` — copies the blob out of the critical section and does the I/O unlocked (via a `.tmp` stage + atomic `rename`) so concurrent Gets don't serialise behind a Put's disk write
- `HandleClearCache` — also wipes every `.blob` file from the cache dir
- `SparkDaemon` CLI: `SparkDaemon [--socket <path>] [--cache-dir <path>]`, reports `N entries loaded` at startup
- 5 new tests (put-writes-file, reload-from-disk, clear-removes-files, overwrite-replaces-contents, malformed-filename-ignored); 11 Shader tests pass in total; full suite 5522 / 5523 with one pre-existing flaky warn

https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
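The ".tmp stage + atomic rename" pattern used by the Put handler can be sketched as follows (hypothetical function name; the real handler also updates the in-memory map):

```cpp
#include <filesystem>
#include <fstream>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical sketch: write to a sibling .tmp file, then rename over the
// final path. Readers either see the old file or the complete new one,
// never a partially written blob — which is what lets the Put do its I/O
// outside the critical section.
bool WriteBlobAtomic(const fs::path& finalPath, const std::vector<char>& blob) {
    fs::path tmpPath = finalPath;
    tmpPath += ".tmp";
    {
        std::ofstream out(tmpPath, std::ios::binary | std::ios::trunc);
        if (!out.write(blob.data(), static_cast<std::streamsize>(blob.size())))
            return false;
    }  // close before rename so all bytes reach the OS
    std::error_code ec;
    fs::rename(tmpPath, finalPath, ec);  // atomic replace on POSIX
    if (ec) {
        fs::remove(tmpPath, ec);         // best-effort cleanup of the stage file
        return false;
    }
    return true;
}
```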
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7fd814a3a4
```cpp
std::lock_guard lock(m_threadsMutex);
m_clientThreads.emplace_back([this, connFd] { HandleConnection(static_cast<std::intptr_t>(connFd)); });
```
Reap completed client threads while server is running
Each accepted connection spawns a std::thread that is stored in m_clientThreads, but these threads are only joined after Run() exits. In a long-lived daemon with reconnecting clients, finished joinable threads accumulate indefinitely, leaking thread resources and eventually causing thread-creation failures or process instability; worker threads need to be detached or periodically joined/pruned during runtime.
```cpp
{
    if (errno == EINTR)
        continue;
    break;
```
Return an error when accept loop exits on fatal socket failures
This break path is used for non-recoverable poll errors, but the function later falls through to return {} (success), so callers cannot distinguish a clean shutdown from a runtime socket failure. If poll()/accept() fails with hard errors (for example fd exhaustion), the daemon stops serving requests while reporting success, which hides outages and defeats upstream recovery logic.
```cpp
unsigned target = std::stoul(stem.substr(17, 3));
unsigned stage = std::stoul(stem.substr(21, 3));
```
Enforce full numeric parsing for cache filename fields
ParseBlobFilename uses std::stoul without validating that all three characters were consumed, so malformed stems like "..._1a2_003" are accepted as target 1 instead of being rejected. That violates the documented strict filename grammar and can load unintended files as valid cache entries, leading to key collisions and incorrect cache statistics on startup.
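A strict fixed-width parse along the lines this review asks for might look like the following sketch (hypothetical helper, not the actual `ParseBlobFilename` code):

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical strict parser: every character in the field must be a decimal
// digit, so a stem fragment like "1a2" is rejected rather than silently
// parsed as 1 the way std::stoul would.
std::optional<unsigned> ParseFixedDecimal(const std::string& stem,
                                          std::size_t pos, std::size_t len) {
    if (pos + len > stem.size()) return std::nullopt;   // field must be complete
    unsigned value = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const char c = stem[pos + i];
        if (c < '0' || c > '9') return std::nullopt;    // reject any non-digit
        value = value * 10 + static_cast<unsigned>(c - '0');
    }
    return value;
}
```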
…validate-by-path)
Second real service on the daemon. Follows the ShaderService pattern
closely but with the shape changes the asset cache needs: path strings
as keys, platform variants, and an `InvalidateAsset(path)` RPC that
drops every platform entry for a source path in one round-trip (matches
the architecture plan's motivation — "when a source PNG changes, one
call invalidates the Windows / Linux / Android variants").
- `AssetServiceProtocol.h` — message enum + payload codec reusing `BinaryWriter` / `BinaryReader`
- `AssetServiceClient.{h,cpp}` — typed facade: `GetAsset` / `PutAsset` / `InvalidateAsset` / `GetCacheStats`
- `SparkDaemon/src/AssetService.{h,cpp}` — in-memory map keyed by `(string path, uint8_t platform)`; optional disk persistence via FNV-1a-hashed filenames with path-in-file headers so arbitrary-length paths round-trip (tested with 500+ char paths)
- `SparkDaemon` CLI: new `--asset-cache-dir <path>` flag; independent of the existing `--cache-dir` for shaders
- 10 new loopback + disk tests; full suite 5532 / 5533 with one pre-existing flaky warn
Phase 3a validates the service pattern scales — third service (Collab
or Build) can follow the same template verbatim.
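For reference, a standard FNV-1a 64-bit hash like the one the commit says derives the cache filenames (the exact filename scheme is the service's own; this is just the well-known hash):

```cpp
#include <cstdint>
#include <string>

// Standard FNV-1a 64-bit. Because filenames are hash-derived, the full path
// is stored inside the file header (per the commit) so arbitrary-length
// paths round-trip even though the filename is fixed-width.
std::uint64_t Fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV-1a 64-bit prime
    }
    return h;
}
```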
https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
❌ CI Error Report

Failed jobs: clang-tidy, coverage, linux-asan, linux-clang-Debug, linux-gcc-Debug, linux-msan, linux-tsan, macos-Debug, macos-Release

Build errors: other errors (1), compiler warnings (3) — full output in CI. Updated: 2026-04-16T23:33:38Z — this comment is updated in-place, not duplicated.
…gine integration)
First engine-side consumer of the daemon. ShaderDiskCache::Lookup now
consults an optional ShaderServiceClient before reading local disk;
Store writes locally and pushes to the daemon in parallel. Multiple engine
instances (editor, CLI tool, game preview) pointed at the same daemon
share a warm shader cache cross-process — the Phase 1/2 infrastructure
finally delivers end-to-end value.
Opt-in and backwards-compatible: unwired caches (SetDaemonClient never
called) behave identically to pre-daemon builds; the daemon pointer is
default-null.
- `DaemonConnection.{h,cpp}` — process-wide singleton owning a `DaemonClient`. `TryConnect(socketPath)` probes `stat()` before `connect()` so missing daemons cost nothing; idempotent
- `ShaderDaemonBridge.{h,cpp}` — `Encode/DecodeCompiledShaderBlob` versioned codec (`kShaderDaemonBlobVersion=1`) so the daemon round-trips full metadata (entry point, errors, reflection counters), not just bytecode
- `ShaderDiskCache` — optional `ShaderServiceClient*` + `SetDaemonClient`/`GetDaemonHits`/`GetDaemonMisses`. `Lookup` is daemon-first with fall-through on miss/decode/transport error; `Store` dual-writes (local first, daemon best-effort). Client pointer is snapshotted under the mutex and used unlocked so network round-trips don't serialise local disk I/O. Local on-disk format unchanged (bytecode-only) — only the daemon path carries full metadata, preserving backwards compat
- 8 new tests: codec round-trip, version rejection, cross-cache daemon hit with metadata intact, daemon-miss fall-through, no-daemon-behaves-as-before, DaemonConnection missing-socket / live / idempotent
- Full suite 5541 / 5541
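The daemon-first lookup order with fall-through described above can be sketched like this (illustrative names; the real `ShaderDiskCache::Lookup` works on cache keys and blobs, and snapshots the client pointer under its mutex):

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using Blob = std::vector<std::uint8_t>;
using Fetch = std::function<std::optional<Blob>(std::uint64_t)>;

// Hypothetical sketch: try the (optional) daemon first, fall through to the
// local disk path on a miss or any transport/decode failure. An unwired
// cache (null daemonGet) takes exactly the pre-daemon path.
std::optional<Blob> CacheLookup(const Fetch& daemonGet,  // may be empty
                                const Fetch& diskGet,
                                std::uint64_t key) {
    if (daemonGet) {
        if (auto blob = daemonGet(key))
            return blob;      // daemon hit: local disk never touched
    }
    return diskGet(key);      // unwired, miss, or error: identical to before
}
```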
Engine startup doesn't yet call `DaemonConnection::TryConnect` — that
lands with a follow-up behind a CVar so daemon use stays opt-in until
we decide where in the boot path to probe.
https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
…to-attach)
Turn Phase 3b's plumbing on from the real engine boot path. When
`spark.daemon.enabled=1` and a daemon is reachable on the configured
socket, the engine's shader cache automatically shares compiled
bytecode across every connected instance.
- `DaemonLifecycle.{h,cpp}` — two free functions, `InitializeDaemonLifecycle` / `ShutdownDaemonLifecycle`; reads the new `spark.daemon.enabled` + `spark.daemon.socket_path` CVars, calls `DaemonConnection::TryConnect`, attaches a process-wide `ShaderServiceClient` to the global `ShaderDiskCache`
- `SparkEngine.cpp` — calls them from `InitConsole()` after `InitGameplaySystems`, and from `ShutdownEngine()` before `ShutdownGameplaySystems` (so the cache clears its client pointer while the connection is still live)
- Both functions are idempotent and silently no-op when the CVar is off or the daemon is unreachable — every code path runs identically with or without a daemon
- 5 new tests: disabled-by-default noop, missing-socket noop, attach-when-available, idempotent Init, idempotent Shutdown; full suite 5545 / 5546 with one pre-existing flaky warn
Default off: the daemon is a power-user optimisation. Opt-in via
`spark.daemon.enabled 1` in a console script or `+spark.daemon.enabled 1`
on the command line.
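The idempotence contract above boils down to a guard flag; a minimal sketch with the CVar and connection checks collapsed into boolean parameters (names hypothetical):

```cpp
#include <atomic>

// Hypothetical sketch: Init attaches at most once and silently no-ops when
// the CVar is off or the daemon is unreachable; Shutdown detaches and is
// safe to call repeatedly, or before Init ever ran.
std::atomic<bool> g_attached{false};

bool InitializeDaemonLifecycle(bool cvarEnabled, bool daemonReachable) {
    if (!cvarEnabled || !daemonReachable)
        return false;                             // silent no-op paths
    bool expected = false;
    g_attached.compare_exchange_strong(expected, true);  // first caller wins
    return true;                                  // repeat calls are no-ops
}

void ShutdownDaemonLifecycle() {
    g_attached.store(false);                      // harmless when never attached
}
```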
https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
…iki docs

Closing-out slice for the daemon branch. Four independent additions:

- **Auto-spawn.** New CVars `spark.daemon.auto_spawn` (default false) and `spark.daemon.binary_path` (default empty → `./SparkDaemon`). When `TryConnect` fails and auto-spawn is on, the lifecycle helper uses `Process::Builder(binary).Detached().Launch()`, waits up to 2 s for the socket file, and retries connect — with a filesystem-exists preflight so missing binaries fail fast and quietly.
- **Control::StatsRequest.** New message `0x0007 → 0x0008` returning a `DaemonStats` struct (integer-second uptime, protocol version, sorted list of registered ServiceIds). `DaemonServer::SnapshotStats` produces it; `ControlService` takes a `std::function<DaemonStats()>` provider at construction so it doesn't need a back-pointer to the server. `main.cpp` wires a lambda at startup.
- **Concurrent-clients test.** `TestDaemonConcurrent.cpp` runs two threads against the same daemon (writer PutCacheEntry × 50, pinger Ping × 50), then verifies a third client reads back index-tagged blobs for every key the writer stored — catching any cross-connection response interleaving in the per-connection dispatch threads. Plus stats round-trip + truncation-rejection codec tests.
- **Documentation.** `wiki/SparkDaemon.md` covers architecture, wire protocol, every shipped service, an ops guide, how to add a new service, platform status, and the future-phases roadmap. Linked from `wiki/_Sidebar.md` next to SparkConsole.

5 new tests; full suite 5550 / 5551 with one pre-existing flaky warn. No regressions.

Out of scope for this commit (explicitly future phases): Phase 4 Collab broker, Windows named-pipe transport, AssetPipeline engine wiring, LRU eviction, file-watching push notifications, in-daemon shader compilation.

https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
Both cache services now bound total bytes via a canonical list+map LRU, with opt-in CLI flags. Disk files for evicted entries are removed in lockstep with the in-memory map, so the on-disk cache never outgrows the in-memory bound across restarts.

- `ShaderService` + `AssetService` swapped `unordered_map<Key, blob>` for `list<Entry>` + `unordered_map<Key, list_iter>`. Every Get hit splices the node to the front; every Put/insert runs `EvictUntilUnderBudget`, which pops the back until `m_totalBytes <= m_maxBytes`
- New `SetMaxBytes(N)` method on both services (0 = unbounded, default). Calling it after `Initialize(cacheDir)` immediately trims a newly-loaded cache down to the budget, in memory and on disk
- Handles the single-blob-over-budget corner case — the new entry self-evicts, leaving the cache in a consistent state with `totalBytes = 0`
- New `m_evictionCount` atomic; not yet plumbed through `GetCacheStats` (wire-format bump — follow-up)
- `SparkDaemon` CLI: new `--shader-cache-max-mb` and `--asset-cache-max-mb` flags (value in MB, 0/omitted = unbounded). Startup banner prints the active caps
- 10 new tests in `TestDaemonLRU.cpp` driving the services directly (no wire layer — LRU is a pure data-structure invariant, and the wire round-trip adds 500 ms/test for zero extra coverage). Full suite 5561 / 5561, zero warns

https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey
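A minimal in-memory sketch of the list+map LRU described above (class and member names illustrative; the real services also mirror evictions to disk and track `m_evictionCount`):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical byte-budgeted LRU: the list holds entries in recency order
// (front = most recent), the map holds iterators into it, and every insert
// evicts from the back until the budget holds. 0 = unbounded.
class LruByteCache {
public:
    void SetMaxBytes(std::size_t maxBytes) { m_maxBytes = maxBytes; Evict(); }

    void Put(const std::string& key, std::vector<std::uint8_t> blob) {
        if (auto it = m_index.find(key); it != m_index.end()) {
            m_totalBytes -= it->second->blob.size();   // overwrite: drop old bytes
            m_entries.erase(it->second);
            m_index.erase(it);
        }
        m_totalBytes += blob.size();
        m_entries.push_front(Entry{key, std::move(blob)});
        m_index[key] = m_entries.begin();
        Evict();  // a single over-budget blob self-evicts here
    }

    const std::vector<std::uint8_t>* Get(const std::string& key) {
        auto it = m_index.find(key);
        if (it == m_index.end()) return nullptr;
        m_entries.splice(m_entries.begin(), m_entries, it->second);  // refresh recency
        return &it->second->blob;
    }

    std::size_t TotalBytes() const { return m_totalBytes; }

private:
    struct Entry { std::string key; std::vector<std::uint8_t> blob; };

    void Evict() {
        if (m_maxBytes == 0) return;                   // unbounded
        while (m_totalBytes > m_maxBytes && !m_entries.empty()) {
            m_totalBytes -= m_entries.back().blob.size();
            m_index.erase(m_entries.back().key);
            m_entries.pop_back();
        }
    }

    std::list<Entry> m_entries;
    std::unordered_map<std::string, std::list<Entry>::iterator> m_index;
    std::size_t m_totalBytes = 0;
    std::size_t m_maxBytes   = 0;
};
```

`std::list::splice` moves the node without invalidating the iterator stored in the map, which is what makes the list+map pairing the canonical LRU shape.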
Three issues raised by Codex plus the `check-thirdparty-manifest` CI
false positive it didn't see:
- **P1 — Thread reaping (DaemonServer.cpp:166).** `m_clientThreads` previously held every worker `std::thread` for the lifetime of the accept loop; in a long-lived daemon with reconnecting clients that list grew unbounded. Replaced with `std::list<ClientWorker>`; each worker carries an `atomic<bool> done` flipped on `HandleConnection` exit. The accept loop's poll-timeout branch (and every new-connection path) calls `ReapFinishedWorkers()` which joins + erases any entries where `done == true`. Full teardown still joins all remaining workers unconditionally.
- **P1 — Fatal error signaling (DaemonServer.cpp:147).** Non-recoverable `poll()`/`accept()` errors used to fall through a bare `break` into `return {}`, so callers couldn't distinguish a clean shutdown from a real outage. Added a local `fatalError` string; hard errors populate it and the function returns `std::unexpected(...)` at teardown so upstream recovery logic can react.
- **P2 — Strict filename parsing (ShaderService.cpp:266 + AssetService.cpp).** `std::stoul` silently stopped at the first non-digit, so stems like `..._1a2_003` parsed as target `1` instead of being rejected. Both `ShaderService::ParseBlobFilename` and `AssetService::ReadBlobFile` now character-check every position in the stem (hex digits in the hash slot, decimal digits in the target/stage/platform slots) before calling the conversion.
- **CI — `check-thirdparty-manifest` false positive.** The script's wiring regex matches `_DIR` in `CMAKE_SOURCE_DIR` inside the new `add_subdirectory(SparkDaemon)` block — no actual third-party change was made, but the check fails because `ThirdParty/dependencies.lock` wasn't touched. Bumped the `Last sync:` comment line documenting that SparkDaemon adds no third-party wiring.
4 new regression tests in `TestDaemonCodexFixes.cpp`: 50-reconnect reaping smoke test, bad-socket-path returns-error contract, and malformed filename rejection for both services. Full suite 5565 / 5565, zero warns.
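The reaping scheme in the first bullet can be sketched as follows (illustrative names; the real server also guards the list with `m_threadsMutex`, omitted here for brevity):

```cpp
#include <atomic>
#include <cstddef>
#include <list>
#include <memory>
#include <thread>

// Hypothetical sketch of per-worker done flags plus periodic reaping: each
// worker flips its atomic flag on exit, and the accept loop joins + erases
// finished entries instead of hoarding joinable threads forever.
struct ClientWorker {
    std::thread       thread;
    std::atomic<bool> done{false};
};

class WorkerPool {
public:
    void Spawn(void (*handler)()) {
        auto w = std::make_unique<ClientWorker>();
        ClientWorker* raw = w.get();
        raw->thread = std::thread([handler, raw] {
            handler();
            raw->done.store(true);      // flag checked by the accept loop
        });
        m_workers.push_back(std::move(w));
    }

    // Called from the accept loop's poll-timeout branch and on new connections.
    void ReapFinishedWorkers() {
        for (auto it = m_workers.begin(); it != m_workers.end();) {
            if ((*it)->done.load()) {
                (*it)->thread.join();   // finished: join returns immediately
                it = m_workers.erase(it);
            } else {
                ++it;
            }
        }
    }

    ~WorkerPool() {                     // full teardown still joins everything
        for (auto& w : m_workers)
            if (w->thread.joinable()) w->thread.join();
    }

    std::size_t Size() const { return m_workers.size(); }

private:
    std::list<std::unique_ptr<ClientWorker>> m_workers;
};
```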
https://claude.ai/code/session_01PE9TtP8X7pFgM9h7F89pey