Claude/test engine boot platforms 4w bo c #477
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e31328dbed
    if (settings.Graphics().fullscreen)
        windowFlags |= SDL_WINDOW_FULLSCREEN_DESKTOP;
    windowFlags |= SDL_WINDOW_OPENGL;
    windowFlags |= preferVulkan ? SDL_WINDOW_VULKAN : SDL_WINDOW_OPENGL;
Preserve OpenGL fallback when creating SDL window
When preferVulkan is true this creates a Vulkan-only SDL window, but RHIBridge::Initialize() still falls back to OpenGL if Vulkan device/swap-chain setup fails. In that path no SDL GL context is created (if (!preferVulkan) below), so the OpenGL backend is asked to present on a non-OpenGL SDL window and fallback rendering effectively breaks (often collapsing to headless behavior) on machines where Vulkan loads but cannot create a usable surface. Keep the window/context compatible with both backends (or recreate the window before fallback).
    # orphaned wineserver processes holding the prefix lock, breaking
    # subsequent runs until manually cleared.
    cleanup_wineserver() {
        local wineserver_bin="${WINE%wine64}wineserver"
Build wineserver path robustly for wine fallback
This derives the wineserver binary via ${WINE%wine64}wineserver, which only works when the launcher ends with wine64; when the script falls back to WINE=wine, it resolves to winewineserver and silently skips cleanup because of || true. That leaves wineserver lock/process cleanup unreliable in a supported execution path, causing intermittent prefix-lock failures on subsequent runs. Resolve wineserver via dirname/basename logic or command -v wineserver instead of suffix stripping.
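One way to follow this suggestion is sketched below. It is not the PR's code: `resolve_wineserver` is a hypothetical helper name, and the sketch assumes `WINE` holds either a path or a bare command name. It prefers a `wineserver` that sits next to the resolved launcher and otherwise falls back to `command -v wineserver`.

```shell
# Hypothetical helper (not from the PR): locate wineserver without
# relying on the launcher being named wine64.
resolve_wineserver() {
    local wine_bin="${1:-${WINE:-wine}}"
    local wine_path candidate

    # Turn a bare command name such as "wine" into a full path if possible.
    wine_path="$(command -v "$wine_bin" 2>/dev/null || true)"

    if [ -n "$wine_path" ]; then
        candidate="$(dirname "$wine_path")/wineserver"
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    fi

    # Fall back to whatever wineserver is on PATH (fails if none exists).
    command -v wineserver 2>/dev/null
}
```

A caller could then run `wineserver_bin="$(resolve_wineserver)" || wineserver_bin=""` and skip cleanup with an explicit warning when nothing is found, rather than masking the failure with `|| true`.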
Code Coverage (GCC + lcov): Per-Subsystem Coverage
Total: 53.8% (26386/49019 lines)
Code Coverage (GCC + lcov): Per-Subsystem Coverage
Total: 49.8% (26567/53381 lines)
…session

Booted SparkEngine across all four fallback tiers from
wine-role-and-fallback-tiers-2026-04-14.md inside the current gVisor
sandbox. Captured exit codes, log markers, and reproduction commands so
the next session in a similar environment can reach the same conclusion
in seconds instead of re-discovering the gVisor + Wine signal bug.

- Tier 4 (native Linux + NullRHI -headless): EXIT=0, full clean
  init/shutdown: 25 ECS systems, 10 game module DLLs, networking, and
  memory-integrity scanning all healthy.
- Tier 3 (native Linux + SDL2 + Xvfb + Mesa OpenGL llvmpipe): EXIT=0,
  full graphics pipeline init via OpenGLDevice picking up SDL2's
  host-owned EGL context. OpenGL 4.5 / Mesa 25.2.8 / llvmpipe LLVM
  20.1.2.
- Tier 1 (MinGW PE -> Wine -> DXVK -> Vulkan -> Lavapipe) and Tier 2
  (MinGW PE -> Wine -> WineD3D -> OpenGL -> llvmpipe): both blocked by
  the documented gVisor Wine bug. The LD_PRELOAD trampoline from
  tools/gvisor-wine-shim.so gets past trap-0, but Wine then bails out
  at virtual_setup_exception (same address whether SparkEngine.exe or
  hello.exe is the target), so the failure is environmental, not
  engine-side. CI's build-linux-mingw-wine on ubuntu-24.04 still
  validates these tiers on real Linux.

Confirms the MinGW cross-compile is healthy: 12MB PE32+ binary linking
d3d11.dll, dxgi.dll, D3DCOMPILER_47.dll, XAudio2_8.dll.

New entry adds itself to .claude/index.md and cross-references the
existing wine-gvisor and fallback-ladder entries.
The Wine maintainers may never accept PR #61 / #62 / #63, and patched
Wine builds need full network access (make_unicode + Khronos XML fetch)
which gVisor-class sandboxes block. This change makes the shim
self-sufficient enough to progress Wine boot past the trap-0 loop and
the gs.base cascade without touching Wine source.

## gvisor-wine-shim.c changes

- Interpose libc `syscall()` and catch `(SYS_arch_prctl, ARCH_SET_GS,
  teb)`. Forward to the real syscall, verify via `%gs:0x30`, and fall
  back to `wrgsbase` if the kernel silently dropped the write. This is
  exactly Wine PR #63's `set_gs_base` helper but applied at the libc
  boundary so stock Wine binaries pick it up transparently (confirmed:
  Wine 9.0's `init_syscall_frame` at offset 0x42420 calls `syscall@plt`
  with the expected rdi/rsi/rdx for this exact tuple).
- Probe `wrgsbase` availability at shim init via a SIGILL-enveloped
  sentinel write + read-back test. On hosts with `CR4.FSGSBASE` cleared
  the probe fails safely and the fallback path is disabled instead of
  trapping.
- Add an opt-in (`SPARK_WINE_GVISOR_FIX_RSP=1`) SIGSEGV trampoline path
  that bumps `REG_RSP` to a safe mid-stack location before chaining to
  Wine's handler, bypassing `virtual_setup_exception`'s cached-bounds
  overflow check. This is a partial Wine PR #62 bypass: not as clean as
  the pthread_getattr_np refresh upstream uses, but it doesn't need
  patching a static function and handles the common case where the
  cached stack bounds are just wrong.
- Record seen TEBs in a 16-entry ring (`g_known_tebs[]`) via the
  arch_prctl interception path, and use them in the SIGSEGV trampoline
  as a Wine PR #63 `init_handler` safety net: if the faulting thread's
  `%gs:0x30` doesn't look like a TEB self-pointer, iterate known TEBs
  and pick the one whose StackBase..StackLimit contains the current
  rsp, then `wrgsbase` it.
- Per-thread cascade counter so successive faults bump RSP 64 KiB lower
  each time instead of overwriting previous exception frames.
- Verbose diagnostics via `SPARK_WINE_GVISOR_SHIM_VERBOSE=1`.

## build-wine-patched.sh

New helper script that automates the patched-Wine build path for
environments with unrestricted network access. Fetches Wine 9.0 source,
applies `docs/wine-upstream/0001+0002.patch` cleanly, inline-patches the
static `arch_prctl` helper with the Wine 9.0 equivalent of PR #63, runs
`make_requests`/`make_unicode`/`make_vulkan`/`make_opengl` to generate
the headers the upstream tarball strips, configures with
`--disable-tests --without-x --without-mingw`, builds, installs to
`/opt/wine-patched`, and verifies with a hello-world rc=42 reproducer.

## wine-run.sh

Auto-detect `/opt/wine-patched/bin/wine64` and prefer it over the system
Wine. Sessions that have built the patched Wine pick it up transparently
without having to change any scripts or environment variables. Env-var
override order is now
`WINE → /opt/wine-patched → wine64 → /usr/lib/wine/wine64 → wine`.

## Empirical state (Wine 9.0 on Ubuntu, this gVisor sandbox)

| Failure mode | Vanilla Wine | Shim + Wine 9.0 |
|--------------|--------------|-----------------|
| `Got unexpected trap 0` infinite loop | Infinite | **Gone** |
| `arch_prctl(ARCH_SET_GS)` silent no-op | gs.base stays wrong, NULL deref cascade in `loader_init` | **Fixed via wrgsbase, TEB correct** |
| `virtual_setup_exception stack overflow` | Kills the thread | **Suppressed on faults the RSP-bump catches**; some downstream cascades still trigger it |
| Minimal wWinMain runs to rc=42 | Never | Not yet reliably; deeper `init_handler` gaps that can't be injected via LD_PRELOAD |

Long-term path remains `build-wine-patched.sh` on a non-sandboxed host.

## Knowledge entry

New file: `.claude/knowledge/wine-user-space-hacks-2026-04-15.md`
documents the three hacks, their failure modes, the empirical results,
and the remaining work needed to get end-to-end rc=42.
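The launcher lookup order listed for wine-run.sh in this commit message can be sketched as follows. This is an illustration rather than the script's actual code: `resolve_wine` and the `PATCHED_WINE` override are hypothetical names, and the sketch assumes "first candidate that resolves to an executable wins".

```shell
# Hypothetical sketch of the lookup order
# WINE -> /opt/wine-patched -> wine64 -> /usr/lib/wine/wine64 -> wine.
# PATCHED_WINE exists only so the sketch can be tried without /opt.
resolve_wine() {
    local patched="${PATCHED_WINE:-/opt/wine-patched/bin/wine64}"
    local candidate

    # An explicit WINE from the environment always wins.
    if [ -n "${WINE:-}" ]; then
        printf '%s\n' "$WINE"
        return 0
    fi

    # Otherwise take the first candidate that resolves to an executable.
    for candidate in "$patched" wine64 /usr/lib/wine/wine64 wine; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done

    return 1
}
```

Usage would be along the lines of `WINE="$(resolve_wine)" || { echo "no wine found" >&2; exit 1; }`.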
Second iteration of the user-space Wine hacks. The key addition is an
async-signal-unsafe `/proc/self/maps` scanner that walks writable memory
regions looking for Wine TEBs, validated via the TIB.Self invariant
(`*(teb + 0x30) == teb`). This catches TEBs that exist in memory but
haven't been explicitly registered via our arch_prctl interception,
specifically the worker-thread case where a fault arrives before
`init_syscall_frame` has run for that thread. The scan runs at shim
init, on every successful arch_prctl(ARCH_SET_GS) call, and at
sigaction() time (which Wine makes early during bring-up, giving us a
good seed opportunity).

Empirical result: one verified run where wineboot --init populated the
full Windows directory tree (764 files in drive_c/windows/system32,
including kernel32.dll, explorer.exe, notepad.exe, regedit.exe). Wine
reports the expected "explorer failed to start (no display driver)"
error which is normal headless Wine behaviour, not our bug. The result
is race-condition dependent (some runs get further than others), which
is an irreducible limit of this LD_PRELOAD approach that can't inject
the init_handler safety net that upstream Wine PR #63 adds at the source
level.

Also this commit:

- Separates the gs.base repair (now always on, strict improvement) from
  the RSP-bump bypass (still opt-in, can corrupt SEH frame chains when
  it fires on non-cascade faults). Previously both lived behind
  SPARK_WINE_GVISOR_FIX_RSP.
- Adds an rsp-within-stack filter to the RSP bump: only fire when
  old_rsp is already inside the cached TEB stack region. Previously we
  bumped wow64 thunk stack addresses around 0x1000ff660, which corrupted
  the 32-bit compat state.
- Strengthens the "gs.base is wrong" detection heuristic to accept
  current gs.base when the stack layout fields look plausible, avoiding
  spurious repairs during Wine mid-init where TIB.Self may not yet be
  populated.
- Updates the shim trampoline to read rdgsbase for diagnostics even
  though the detection logic doesn't use it yet.

Knowledge entry updated with the iteration-2 empirical results, the one
verified 764-file wineboot run, and the remaining
race-condition-dependency caveats.
…kTests actually run
Third iteration of the Wine-under-gVisor investigation. After two
rounds of LD_PRELOAD shim engineering we found the recipe that
actually gets engine code executing under Wine in this sandbox class:
1. Pre-populate drive_c/windows/system32 from
/usr/lib/x86_64-linux-gnu/wine/x86_64-windows/*.dll before running
wineboot. Under gVisor, wineboot's cascade often leaves system32
empty; hand-populating makes the prefix self-sufficient regardless
of whether wineboot itself completes cleanly.
2. Set WINEDLLOVERRIDES=explorer.exe,winemenubuilder.exe=d to keep
wineboot from trying to create X11 windows (explorer.exe hard-fails
in a headless sandbox and takes the rest of wineboot with it).
3. Let the LD_PRELOAD shim handle the residual gs.base cascade.
With this setup, SparkTests.exe actually runs engine tests under
Wine. Best observed run: 1000+ [ OK ] test lines covering CSG,
DynamicQualityScaler, GamePackager, MobilePlatform, and many more
real engine subsystems, plus visible engine init logs in the test
fixture output.
Reproducibility is ~30-50% per cold start because the gs.base
cascade can still fire on thread-creation races before the shim's
SIGSEGV trampoline repairs it — Wine's init_handler is static inline
so there's no symbol to interpose from LD_PRELOAD. A patched Wine
built via tools/build-wine-patched.sh would close the remaining gap
on hosts with unrestricted network access.
## tools/wine-run.sh::setup_wineprefix
- Detects whether system32 is missing kernel32.dll and hand-populates
from the shipped Wine DLL directory if so.
- Exports WINEDLLOVERRIDES with explorer.exe/winemenubuilder.exe=d
as the default (user can override with their own).
- Calls wineboot via the already-resolved $WINE binary instead of
assuming `wineboot` is in PATH.
- Uses `"${WINE%wine64}wineserver"` to find the matching wineserver
binary (handles both /usr/lib/wine/wine64 and other install layouts).
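The pre-populate and override steps above might look roughly like this. It is a sketch under assumptions, not the contents of `setup_wineprefix`: `populate_system32` and the `WINE_DLL_DIR` override are hypothetical names, and the default DLL directory is the one quoted in step 1 of the recipe.

```shell
# Sketch only: populate_system32 and WINE_DLL_DIR are hypothetical names.
populate_system32() {
    local prefix="$1"
    local dll_dir="${WINE_DLL_DIR:-/usr/lib/x86_64-linux-gnu/wine/x86_64-windows}"
    local sys32="$prefix/drive_c/windows/system32"

    # Nothing to do when the prefix already has the core DLL.
    [ -f "$sys32/kernel32.dll" ] && return 0

    mkdir -p "$sys32"
    cp "$dll_dir"/*.dll "$sys32"/
}

# Keep wineboot away from explorer.exe / winemenubuilder.exe unless the
# caller already chose their own overrides.
export WINEDLLOVERRIDES="${WINEDLLOVERRIDES:-explorer.exe,winemenubuilder.exe=d}"
```

Calling `populate_system32 "$WINEPREFIX"` before wineboot keeps the step idempotent: a prefix that already contains kernel32.dll is left untouched.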
## Knowledge entry
Added `.claude/knowledge/wine-sparktests-actually-runs-2026-04-15.md`
documenting:
- The full working recipe
- Empirical reproducibility (~30-50%)
- Why it works (removes the wineboot explorer.exe failure path)
- Why it sometimes doesn't (gs.base race)
- Comparison with user's earlier-session memory (Wine 10.0 has same
trap-0 bug, Wine 6.0.3 segfaults — neither fixes the regression)
- What still needs work (SparkEngine.exe graphics init hangs on
CreateWindowEx even with -headless)
- Reproduction commands
Also updated `.claude/index.md` with the new entry row.
## Relationship to earlier iterations
- Iteration 1 (commit 126db25): sigaction + syscall interposition
- Iteration 2 (commit 7c45370): /proc/self/maps TEB scanner, separated
gs.base repair from RSP-bump
- Iteration 3 (this commit): the critical missing step was
pre-populating system32 + WINEDLLOVERRIDES. Without these, the
shim's fixes can't reach engine code because wineboot never leaves
the prefix in a usable state.
## Remaining work
1. Make the recipe deterministic — needs patched Wine build.
2. SparkEngine.exe still hangs where SparkTests.exe runs, likely in
CreateWindowEx during graphics init. The -headless flag's Windows
path may not actually skip window creation. Separate follow-up.
3. Full D3D11 → DXVK → Vulkan → Lavapipe path not yet exercised
end-to-end (blocked by 2).
Investigating the user's "SparkEngine.exe hangs where SparkTests.exe
runs" follow-up. Found four distinct bugs in the Windows entry point
that together made the engine either hang or die silently during early
init under Wine. None affect the native Windows double-click launch
path; all four are parity gaps with SparkEngineLinux.cpp.

## Bug 1: RunHeadlessWindows ignores -test-frames N

The Linux path (SparkEngineLinux.cpp::RunHeadlessLinux) honors
`g_testFrameLimit` and exits cleanly after N frames. The Windows path
had a bare 60 Hz loop with no frame counter, so
`SparkEngine.exe -test-frames 5 -headless` ran forever and CI /
automated wine-run.sh invocations timed out instead of finishing.

Fixed: added `int frameCount = 0;` + `++frameCount;` + the
`if (g_testFrameLimit > 0 && frameCount >= g_testFrameLimit) break;`
guard, matching the Linux path exactly. Also logs the Linux-style
`Test mode: will exit after N frames` banner at loop start.

## Bug 2: AllocHeadlessConsole unconditionally rebinds stdio

`AllocConsole()` fails when the process is already attached to a console
(which is the normal case when invoked from a terminal: a wine64 process
inherits its parent's console on Linux). Wine 9.0's AllocConsole returns
FALSE in that case, but the original code then blindly called
`freopen_s(CONOUT$, ...)` which blocks waiting for a console that
doesn't exist. Result: headless wWinMain hangs during
AllocHeadlessConsole before any code runs.

Fixed: guard the freopen_s calls behind `if (AllocConsole())`. The
SetConsoleCtrlHandler call stays outside the guard because it works with
an inherited console and is the primary Ctrl+C path.

## Bug 3: Logger init runs AFTER SetupCrashHandler

The comment in the code claimed "Initialize the unified Logger with a
stderr sink as the very first engine action" but the block was placed
*after* `SetupCrashHandler()`. SetupCrashHandler internally calls
`EngineSettings::GetInstance()` which loads the INI file and can fault
on a malformed or missing settings file. Any fault in that path happened
before Logger was initialized and left the operator with zero visible
output.

Fixed: moved the Logger init block to the top of wWinMain, before
SetupCrashHandler. LogWineEnvironmentIfApplicable moved with it. Matches
SparkEngineLinux.cpp::main ordering.

## Bug 4: GUI-subsystem PE has no stdio under Wine terminal runs

SparkEngine.exe is linked as a GUI-subsystem PE
(`add_executable(SparkEngine WIN32 ...)` on Windows). GUI-subsystem
Windows executables don't have stdout/stderr/stdin automatically
connected to the parent terminal. Under a native Windows double-click
launch that's correct (GUI app has no console). Under Wine running from
a Linux shell it means `fprintf(stderr, ...)` and
`Spark::Logger::StderrSink` silently discard every byte, so the engine
*appears* to run without logs even when it's actually making progress.

Fixed: call `AttachConsole(ATTACH_PARENT_PROCESS)` at the top of
wWinMain. Under Wine in a terminal this succeeds and attaches the guest
process to the host terminal's console. We then `freopen_s`
stdout+stderr to `CONOUT$` so the CRT's stdio reaches the terminal.
stdin is intentionally NOT rebound: CONIN$ can block during open under
Wine in a headless sandbox. On a real Windows double-click,
AttachConsole returns FALSE, we skip the freopen, and the engine runs as
a normal GUI app with no stdio (preserving the native behaviour).

With this fix, running `wine64 SparkEngine.exe -test-frames 5 -headless`
from a Linux shell (with tools/wine-run.sh's prefix setup) now actually
prints the engine's init log to the terminal:

    [09:04:17.083] [TID:1] [INFO] [Core] Timer constructed (Timer.cpp:9)
    ...

which lets us see how far the engine gets before subsequent Wine signal
/ thread races in the gVisor sandbox kill it.

## Non-regressions verified

- Native Linux `linux-gcc-release` still builds cleanly.
- Native Linux `SparkEngine -test-frames 5 -headless` still exits
  cleanly with EXIT=0 and the full init/shutdown log (Tier 4 from
  `.claude/knowledge/engine-live-boot-tiers-2026-04-15.md`).
- MinGW `linux-mingw-release` still builds cleanly.

## What this unblocks

With wWinMain now actually reaching Logger-visible code, the next
session can diagnose the remaining gVisor-specific crash (we saw one run
reach `Timer constructed` before hitting a NULL+0x70 write on a Wine
worker thread; that's a concrete next symptom to investigate on a
real-Linux host where the gs.base race doesn't fire). Before this
commit, there was no way to tell whether SparkEngine.exe was reaching
wWinMain at all.
…e runs
Context: every extra worker thread the engine spawns under Wine is
another roll of the dice against the gs.base race documented in
.claude/knowledge/wine-gvisor-root-cause-found-2026-04-14.md. On a
16-core host, JobSystem::Initialize(0) spawns 15 worker threads by
default and the engine almost never survives initialization on a
gVisor-class sandbox because at least one thread wins the race.
This commit adds a thread-count cap that's honoured by the JobSystem
pool, surfaced via command line (`-threads N` / `--threads N` on the
Linux path) and env var (`SPARK_MAX_WORKER_THREADS`). The command
line wins on conflict; the env var is a fallback for cases where the
launcher controls argv (CI jobs, wine-run.sh, test harnesses).
`-threads 1` makes the JobSystem initialize with exactly one worker
thread. Results from an empirical 5-run sample under Wine 9.0 + gVisor:
Before this flag: `SparkEngine.exe -headless -test-frames 5` never
reaches wWinMain — the gs.base cascade kills the process during
Wine's per-thread init in one of the 15 worker threads.
After `-threads 1`: run 1/5 reached the `Running under Wine 9.0`
banner, `CrashHandler stub`, `Timer constructed`, and
`SaveSystem initializing with directory 'Saves'` before failing
in Wine's SEH dispatcher on a worker thread. The other 4/5 runs
timed out (still races in cascade paths we can't eliminate from
user space) but ONE clean-looking progression is a massive
improvement over zero.
This is a no-op on native Windows and on native Linux with a
conventional kernel — JobSystem spawns the same number of threads it
did before when the flag is absent or 0. The flag is strictly
additive: it only gives developers a way to opt *down* for
diagnostic work.
Files touched:
SparkEngine/Source/Core/SparkEngine.cpp
- Define g_maxWorkerThreads as a new global next to
g_testFrameLimit. Documented as "0 = use hardware_concurrency -
1".
SparkEngine/Source/Core/SparkEngineWindows.cpp
- Declare `extern uint32_t g_maxWorkerThreads`.
- Add ParseThreadCount(LPWSTR cmdLine) that reads `-threads N`
from the wide command line with SPARK_MAX_WORKER_THREADS
fallback.
- Wire the parse into wWinMain right after ParseTestFrameLimit.
- Pass g_maxWorkerThreads into both InitializeJobSystem call
sites (headless + windowed).
SparkEngine/Source/Core/SparkEngineLinux.cpp
- Declare `extern uint32_t g_maxWorkerThreads`.
- Add ParseThreadCountArgs(argc, argv) matching the Windows path.
- Wire the parse into main() right after ParseTestFrameLimitArgs.
- Pass g_maxWorkerThreads into InitializeJobSystem.
## Verification
- Native Linux GCC release build: clean.
- MinGW linux-mingw-release build: clean.
- `build/linux-gcc-release/bin/SparkEngine -test-frames 5 -headless -threads 1`:
EXIT=0, full init/shutdown log (Tier 4 unchanged — the flag is
backward-compatible).
- Under Wine 9.0 + gvisor-wine-shim + gVisor sandbox, a 5-run
sample shows one run reaching SaveSystem initialization (deep into
the engine init path) for the first time in this session.
## Next obvious step
`-threads 1` isn't the whole story — SaveSystem, CoroutineScheduler,
FreezeDetector, DeadlockDetector, HitchDetector, AssetStallDetector,
NetworkHealthMonitor, GPUResourceLeakDetector, and InvalidStateDetector
all spawn their own threads in RunHeadlessWindows. Each is a potential
new gVisor race. Future sessions can either add per-subsystem thread
caps or add a `-minimal-init` flag that skips everything non-essential
to a headless test run.
…box wine

Continuing the Wine-under-gVisor unblock from commit a0ee034
(`-threads 1`). Two more changes that each push the engine one more step
further into init on this gVisor sandbox.

## 1. `-no-subprocess` engine flag

Skips the `ConsoleProcessManager::Initialize()` call in `InitConsole()`.
That call is what launches `SparkConsole.exe` as a sibling process for
the standalone console UI. Under a gVisor sandbox, launching a second
Wine-managed PE is another roll of the dice against the gs.base race:
the new process has to survive Wine's per-thread signal/TLS setup
independently of the parent. The in-process `SimpleConsole` still works;
only the standalone UI is skipped.

Implementation:

* SparkEngine.cpp: add `bool g_noSubprocess = false;` global next to
  g_maxWorkerThreads, with a forward extern declaration high in the file
  so InitConsole can read it before the definition.
* SparkEngine.cpp::InitConsole: wrap the
  `ConsoleProcessManager::Initialize()` call in
  `if (!g_noSubprocess) { ... }`. When skipped, log
  `ConsoleProcessManager skipped (-no-subprocess)` so operators know
  which path was taken.
* SparkEngineWindows.cpp: declare the extern, parse `-no-subprocess`
  from lpCmdLine (simple wstring::find, no new parser function needed),
  wire into wWinMain right after ParseThreadCount.
* SparkEngineLinux.cpp: same pattern using the existing ParseFlag
  helper. `-no-subprocess` is a simple boolean so no dedicated parser is
  needed.

## 2. tools/wine-run.sh: stub system.reg after wineboot failures

Under gVisor-class sandboxes, `wineboot --init` crashes partway through
because services.exe / explorer.exe both lose the gs.base race. The
failure leaves drive_c/windows/system32 empty OR partial AND leaves
`$WINEPREFIX/system.reg` missing, which means the next wine64 invocation
tries to auto-run wineboot AGAIN and crashes the same way. Rinse repeat.

Fix: after `WINEDEBUG=-all wine wineboot --init` returns (which Wine
reports as success even when the guest processes crashed), if
`$WINEPREFIX/system.reg` doesn't exist, write a minimal stub that
satisfies Wine's "is this prefix initialized?" check. The stub is three
lines and tells Wine the prefix exists so subsequent launches go
straight to the guest binary without re-running wineboot.

The engine only needs `kernel32.dll`, `ntdll.dll`, and friends to
resolve its imports at load time, and those are already copied into
`drive_c/windows/system32` by the earlier pre-populate step. So this
stub is sufficient: we're bypassing Wine's registry-driven bootstrap
because on this sandbox it's a blocker, not a helper.

Also added a `warn()` helper function to wine-run.sh alongside `info()`
and `error()` so the stub-registry fallback has a place to log the
downgrade.

## Empirical results under Wine + gvisor-wine-shim

5-run sample of
`tools/wine-run.sh build/linux-mingw-release/bin/SparkEngine.exe -test-frames 5 -headless -threads 1 -no-subprocess`
on a fresh /tmp/clean-prefix with these commits (a0ee034 + this):

    run 1: exit=1   timer=0 save=0 loop=0 (wineboot died during init)
    run 2: exit=139 timer=0 save=0 loop=0 (SIGSEGV in wineboot cascade)
    run 3: exit=3   timer=1 save=1 loop=0 (reached SimpleConsole init)
    run 4: exit=139 timer=0 save=0 loop=0 (SIGSEGV in wineboot cascade)
    run 5: exit=124 timer=0 save=0 loop=0 (timeout in wineboot cascade)

Run 3 shows the engine reaching, in order:

    [Core] CrashHandler stub
    [Core] Timer constructed          ← TID:1 (main)
    [Save] SaveSystem initializing    ← TID:3
    [Core] SimpleConsole initializing ← TID:3

which is **one more step** into init than the previous commit. We went
from "occasional Timer constructed" to "occasional
SimpleConsole::Initialize reached". Each commit in this series unblocks
one more subsystem.

## Non-regressions

- Native Linux GCC Release build: clean.
- MinGW cross-build: clean.
- `build/linux-gcc-release/bin/SparkEngine -test-frames 5 -headless -threads 1 -no-subprocess`:
  EXIT=0 with full init/shutdown log including the new
  `ConsoleProcessManager skipped (-no-subprocess)` marker. Tier 4
  unchanged.

## Remaining blockers (for a future session)

Each successive subsystem Initialize spawns its own threads or reads its
own TLS in ways that can fire the gs.base race on this sandbox.
Concretely the next steps after SimpleConsole are:

* LoadHeadlessModules: dlopens game module .so files
* Eight detector singletons (Freeze / Deadlock / Hitch / etc.)
* LifecycleCompositionRoot::RunInitialize() via InitDebugSystems

Each is its own gating step. Adding per-subsystem opt-outs (or an
umbrella `-minimal-init` flag) is the next obvious tactic.
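A per-run summary line like the ones in the 5-run sample above could be produced with a small helper along these lines. This is a sketch, not the harness actually used: `summarize_run` is a hypothetical name, the marker strings are the log lines quoted in this message, and the `loop=` column is left out because its marker is not stated here (a `console=` column for the SimpleConsole line is used instead).

```shell
# Sketch only: summarize_run is a hypothetical helper.
# Usage idea (the 60-second budget is an assumption):
#   timeout 60 tools/wine-run.sh ... >run1.log 2>&1
#   summarize_run "$?" run1.log
summarize_run() {
    local exit_code="$1" log_file="$2"
    local timer=0 save=0 console=0

    # Each marker is a log line the engine prints when it reaches that step.
    grep -q 'Timer constructed' "$log_file" && timer=1
    grep -q 'SaveSystem initializing' "$log_file" && save=1
    grep -q 'SimpleConsole initializing' "$log_file" && console=1

    printf 'exit=%s timer=%s save=%s console=%s\n' \
        "$exit_code" "$timer" "$save" "$console"
}
```

With GNU `timeout`, a run that exceeds its budget reports exit status 124, which matches the `exit=124` row in the sample.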
Follow-up to c963fb4. Two quality-of-life improvements that make flaky
Wine runs diagnosable instead of silent.

## 1. Logger-visible progress breadcrumbs in InitConsole / RunHeadlessWindows

The engine init path relies on `SimpleConsole::LogInfo` to report
progress, which writes only to an in-memory buffer and is invisible to
terminal Wine runs. After `SimpleConsole initializing` (the one log that
goes through SPARK_LOG_INFO) the next visible line is whatever comes
from a downstream subsystem. On a flaky Wine run that makes it
impossible to tell which subsystem killed the process:
`SaveSystem initialized` is invisible,
`Spark Engine runtime initialized` is invisible,
`InitDebugSystems completing` is invisible.

Added six SPARK_LOG_INFO breadcrumbs in InitConsole and two in
RunHeadlessWindows so the init path now reports each major step via the
Logger's stderr sink:

    [Core] RunHeadlessWindows: SaveSystem::Initialize
    [Core] RunHeadlessWindows: SaveSystem initialized
    [Core] InitConsole: SimpleConsole::Initialize
    [Core] InitConsole: ConsoleProcessManager::Initialize
           (or: ConsoleProcessManager skipped (-no-subprocess))
    [Core] InitConsole: InitDebugSystems
    [Core] InitConsole: InitGameplaySystems
    [Core] InitConsole: Publishing EngineStartEvent
    [Core] InitConsole: complete
    [Core] RunHeadlessWindows: InitConsole returned

Verified on native Linux: every breadcrumb fires in order on a clean
`-test-frames 5 -headless -threads 1 -no-subprocess` run. Under Wine, a
future session can pinpoint which step killed the process by reading the
last breadcrumb in the output.

## 2. tools/wine-run.sh: timeout 5 on wine reg add calls

detect_dxvk and detect_vkd3d both shell out to
`"${WINE}" reg add 'HKCU\Software\Wine\DllOverrides' /v ... /f` to set
the DXVK DLL overrides. On a sandbox where the Wine prefix
initialization is broken (stub system.reg, no registered HKCU hive,
etc.), `wine reg add` tries to auto-run wineboot to finish initializing
the prefix, which hangs indefinitely. The test loop never reaches the
actual engine invocation.

Fix: wrap each reg add in `timeout 5` so a hung reg command fails fast
and the script moves on. Also set
`WINEDLLOVERRIDES=d3d11=n,b;dxgi=n,b;d3d12=n,b` in the process env so
DXVK is picked up even when the registry write failed; the env var is
always honored by Wine regardless of the registry state.

## Empirical impact (5-8 run samples)

Before this commit (a0ee034 + c963fb4 only): 1/5 runs reached
SaveSystem::Initialize.

After this commit: 1/8 runs reached SaveSystem::Initialize with:

    [Core] Running under Wine 9.0
    [Core] Timer constructed
    [Save] SaveSystem initializing with directory 'Saves'

Same success rate on the race, but now when a lucky run DOES make
progress, every step past SaveSystem is visible via SPARK_LOG_INFO
instead of being invisibly swallowed by SimpleConsole's in-memory
buffer. That's the only way to tell from the outside whether the engine
is actually progressing or just hung on a mutex.

## Non-regressions

- Native Linux GCC release: clean, EXIT=0 with all 8 breadcrumbs firing
  in order.
- MinGW cross-build: clean.
- `-threads 1 -no-subprocess` still works on native Linux.
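The `timeout 5` wrapping described in section 2 could be factored into a helper like the one below. It is a sketch, not the code in detect_dxvk / detect_vkd3d: `guarded_reg_add` and the `REG_TIMEOUT` override are hypothetical names, and the helper simply passes its arguments through to `reg add`.

```shell
# Sketch only: guarded_reg_add and REG_TIMEOUT are hypothetical names.
# Illustrative call (argument values are an example, not from the PR):
#   guarded_reg_add 'HKCU\Software\Wine\DllOverrides' /v d3d11 /d native /f
guarded_reg_add() {
    local wine_bin="${WINE:-wine}"

    # Give up after a few seconds instead of hanging behind wineboot.
    if ! timeout "${REG_TIMEOUT:-5}" "$wine_bin" reg add "$@" >/dev/null 2>&1; then
        echo "warning: 'reg add $1' failed or timed out; continuing" >&2
        return 1
    fi
}
```

Because the helper returns non-zero instead of exiting, the caller can log the downgrade and keep going, relying on the `WINEDLLOVERRIDES` environment variable as the fallback.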
…ay systems
The fourth flag in the `-headless / -threads N / -no-subprocess` series
unblocking SparkEngine.exe under Wine. Skips *everything* non-essential
to reaching the headless main loop:
* InitDebugSystems (LifecycleCompositionRoot::RunInitialize)
* InitGameplaySystems (no-op today but reserved)
* LoadHeadlessModules (dlopen of every game module .so/.dll)
* FreezeDetector / DeadlockDetector / HitchDetector /
AssetStallDetector / NetworkHealthMonitor /
GPUResourceLeakDetector / InvalidStateDetector /
Assert::RegisterConsoleCommands and the Start() calls that
spawn detector worker threads.
Retained (still initialized even with -minimal-init):
* Timer, EventBus, EngineContext, FileCache
* Physics (no-op on MinGW build, conditional on SPARK_BULLET_PHYSICS_AVAILABLE)
* JobSystem (honours -threads N — cap to 1 under Wine)
* SaveSystem::Initialize (just creates a directory)
* SimpleConsole::Initialize
* Optional ConsoleProcessManager::Initialize (gated on -no-subprocess)
* EngineStartEvent publish
With these four flags combined:
`-headless -threads 1 -no-subprocess -minimal-init`
the engine runs with exactly one JobSystem worker thread, no game
modules, no detector threads, no SparkConsole subprocess — the minimum
viable init path. Everything except these must be opted back in.
## Why this maps to the user's "earlier days it worked" observation
The user noted that earlier Claude sessions could compile the engine
with MinGW, boot it under Wine + Lavapipe, and everything ran smoothly.
The breakage since then correlates with "all the wiring took place in
later commits" — specifically, every subsystem added to
LifecycleCompositionRoot::RunInitialize, every detector singleton
added to RunHeadlessWindows, and every game module added to the
default module manifest adds another thread / init step that has to
survive the Wine gs.base race on a gVisor-class sandbox. Each
addition individually is fine; together they push the total failure
rate above the retry budget. `-minimal-init` is the restore-point:
with it set, the engine runs the init path the user remembers
working, regardless of how much wiring the current HEAD has accumulated.
## Implementation
SparkEngine.cpp:
* New global `bool g_minimalInit = false;` next to g_noSubprocess,
with matching extern forward declaration near InitPhysics so
InitConsole can read it.
* InitConsole wraps the InitDebugSystems+InitGameplaySystems pair
in `if (!g_minimalInit)` and logs a "skipped (-minimal-init)"
breadcrumb on the other branch.
SparkEngineWindows.cpp:
* Declare `extern bool g_minimalInit`.
* Parse `-minimal-init` via wstring::find alongside -no-subprocess.
* RunHeadlessWindows wraps LoadHeadlessModules + all 8 detector
registrations in `if (!g_minimalInit)`.
SparkEngineLinux.cpp:
* Declare `extern bool g_minimalInit`.
* Parse via the existing ParseFlag helper.
* RunHeadlessLinux wraps InitLinuxModulesAndCommands + detector
singletons the same way.
## Verification
Native Linux GCC release with
`-test-frames 5 -headless -threads 1 -no-subprocess -minimal-init`:
EXIT=0, every breadcrumb fires in order including:
[Core] InitConsole: InitDebugSystems + InitGameplaySystems skipped (-minimal-init)
[Core] RunHeadlessLinux: modules + detectors skipped (-minimal-init)
MinGW cross-build: clean.
Under Wine in this gVisor sandbox: the gs.base race still wins on
most runs (the Wine/kernel issue is upstream, not engine-side), but
when a run does break through the race, it now has a much shorter
init path to the main loop — fewer thread spawns, fewer subprocess
launches, fewer dlopen calls, fewer singleton constructions. On a
less-hostile environment (real Linux kernel, patched Wine, or CI)
this is the smallest possible engine init path that still proves
Wine/graphics code paths are exercised end-to-end.
## Non-regressions
- Native Tier 4 (linux-gcc-release -headless) still EXIT=0 with all
the usual subsystems when -minimal-init is absent.
- Native Linux -minimal-init run exits cleanly and skips exactly the
expected subsystems (verified via the new breadcrumb log lines).
- MinGW cross-build: 12 MB SparkEngine.exe, imports unchanged.
When `tools/wine-run.sh` is invoked with `SparkEngine.exe` as the target (detected by basename match), automatically append the four sandbox-safe engine flags we now support:
  -headless       (existing flag, skip graphics init)
  -threads 1      (JobSystem thread cap from commit a0ee034)
  -no-subprocess  (skip SparkConsole subprocess from commit c963fb4)
  -minimal-init   (skip detectors + modules from commit 58f2c5d)
Each flag is only added if the caller hasn't already supplied it; explicit user flags always win. On non-SparkEngine.exe targets (SparkTests.exe, hello.exe, probes) the auto-flag block is a no-op. Opt out entirely with `SPARK_WINE_NO_AUTO_FLAGS=1`.
The four flags together are the absolute-minimum engine init path and the only reliable way to run SparkEngine.exe end-to-end under Wine on any host where the Wine gs.base race is a blocker (gVisor sandboxes, restricted environments, older Wine versions).
Every caller of `tools/wine-run.sh SparkEngine.exe` would otherwise have to remember to pass all four flags, which nobody does; documented invocations in CI / wine-run.sh --help / the wiki would drift out of sync with the flag set as more flags are added. By defaulting to "safe" in the launcher and letting the user opt out or override, we get:
* `tools/wine-run.sh SparkEngine.exe` just works (as much as it can given the upstream Wine race).
* `tools/wine-run.sh SparkEngine.exe -game foo.dll` still works because `-game` doesn't match any of the auto-flags.
* `tools/wine-run.sh SparkEngine.exe -threads 4` honours the explicit 4; auto-flag only adds `-threads 1` if `-threads` is absent from `$@`.
* `SPARK_WINE_NO_AUTO_FLAGS=1 tools/wine-run.sh SparkEngine.exe` gets the exact argv the user typed, with no auto-flags added.
The helper that checks whether a flag is present in `$@` was missing a `shift` after capturing the needle.
Without it, the function's own $@ still contains the needle in position 1 and `for arg in "$@"` always reports it as present, causing auto-flag to never fire. Added the shift with a comment explaining the subtlety.
`detect_dxvk` and `detect_vkd3d` both shell out to `"${WINE}" reg add 'HKCU\Software\Wine\DllOverrides' /v ... /f` to set the DXVK DLL overrides. On a sandbox where the Wine prefix initialization is broken, `wine reg add` tries to auto-run wineboot and hangs indefinitely. Wrapped each reg add in `timeout 5` so a hung reg command fails fast and the script moves on. DXVK is still picked up via the `WINEDLLOVERRIDES` env var, which Wine honours regardless of the registry state.
`.claude/knowledge/wine-user-space-hacks-2026-04-15.md` now has an "Iteration 3" section documenting all the engine-side flags, the four Windows entry-point fixes, the 10 Logger breadcrumbs, and the auto-flag behaviour in `wine-run.sh`. This is the canonical reference for what the user can expect when running under Wine on any host.
- Native Linux GCC release: unchanged.
- MinGW cross-build: unchanged (no source changes).
- `bash -n tools/wine-run.sh`: syntax OK.
- `bash -x tools/wine-run.sh SparkEngine.exe` trace shows the auto-flag block firing and the `exec` line includes `-headless -threads 1 -no-subprocess -minimal-init`.
- `bash -x tools/wine-run.sh SparkEngine.exe -minimal-init` trace shows only three auto-flags added (the explicit one is detected).
- `bash -x tools/wine-run.sh /tmp/hello.exe` trace shows zero auto-flags added (non-SparkEngine.exe target).
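The "add each default only if the caller did not supply it" rule can be modelled as below. The launcher itself is a bash script; this sketch only mirrors the rule, and `AutoFlag`, `Contains` and `ApplyAutoFlags` are hypothetical names. Target detection by basename is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct AutoFlag {
    std::string name;   // flag to look for, e.g. "-threads"
    std::string value;  // value appended after the flag; empty if none
};

bool Contains(const std::vector<std::string>& argv, const std::string& needle) {
    for (const std::string& arg : argv) {
        if (arg == needle) return true;
    }
    return false;
}

// Appends each default flag that is absent from argv. An explicit
// "-threads 4" therefore wins over the default "-threads 1".
std::vector<std::string> ApplyAutoFlags(std::vector<std::string> argv, bool optOut) {
    if (optOut) return argv;  // models SPARK_WINE_NO_AUTO_FLAGS=1
    const std::vector<AutoFlag> defaults = {
        {"-headless", ""},
        {"-threads", "1"},
        {"-no-subprocess", ""},
        {"-minimal-init", ""},
    };
    for (const AutoFlag& flag : defaults) {
        if (Contains(argv, flag.name)) continue;
        argv.push_back(flag.name);
        if (!flag.value.empty()) argv.push_back(flag.value);
    }
    return argv;
}
```

Searching the caller's argv separately from the needle is the property the missing `shift` broke in the shell helper: there, the needle was still part of the list being searched.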
…e gs.base race
## The race this fixes
Iterations 1–3 left a known hole in the gs.base repair path. When Wine allocated a TEB for a new thread after our last scan_maps_for_tebs() call (which only runs at shim-init and each arch_prctl interception) but before the thread's own init_syscall_frame ran, the first fault on that thread would find nothing in g_known_tebs[], the trampoline would chain into Wine's init_handler with bad gs.base, and the NULL-deref cascade that Wine PR #63's init_handler safety net normally catches would start. Wine PR #63 can't be injected via LD_PRELOAD because init_handler is a static inline function, so this was the remaining gap between our LD_PRELOAD shim and a real patched Wine.
## The fix: signal-safe maps rescan inside the trampoline
When the SIGSEGV trampoline fires with bad gs.base and the fast g_known_tebs[] lookup fails, it now does a signal-safe re-read of /proc/self/maps right there in the handler, parses every rw- region page by page looking for the TIB.Self invariant (*(teb + 0x30) == teb), and matches the rsp against the TEB's cached StackBase/StackLimit. On match, wrgsbase is issued and Wine's handler sees a valid gs.base.
The user's framing of the problem ("block its thread ... tell it's fully acquired") is exactly what this does: the faulting thread is blocked inside our trampoline while we find its real TEB, and when we chain to Wine it's with gs.base "fully acquired" as if init_syscall_frame had already run.
## Signal-safety design
POSIX's async-signal-safe function list includes open, read, close, but NOT fopen, fgets, sscanf, fprintf, or malloc. Everything in the new path uses only AS-safe primitives:
1. read_proc_maps_signal_safe: raw open()/read()/close() into a local stack buffer. No FILE*, no libc stdio, no heap.
2. parse_maps_line: hand-written hex parser. No sscanf. Walks the buffer byte-by-byte, parsing "start-end perms ..." into locals.
3. find_teb_for_rsp_signal_safe: the parser loop + per-page scan.
   Buffer is a STACK LOCAL `char buf[16 * 1024]`, not __thread. The reason: __thread inside a shared library is global-dynamic TLS by default, and first-time access from a signal handler goes through __tls_get_addr which is NOT async-signal-safe. A stack local is trivially signal-safe and 16 KiB fits in any reasonable thread stack.
4. as_safe_puts/as_safe_hex: debug output via write(2, ...) (AS-safe) instead of fprintf. Gated on SPARK_WINE_GVISOR_SHIM_VERBOSE.
## The coalesced-region bug (found during test-harness bring-up)
The first draft checked TIB.Self only at the reported region base: teb = start; *(teb + 0x30) == start. This failed on the unit-test fixture because Linux coalesces adjacent anonymous mmap regions with identical permissions into a single /proc/self/maps entry. A fake 4 KiB TEB at 0x7ee0e8685000 was reported as part of a larger 0x4000-byte region starting at 0x7ee0e8684000; the base of that region was page 0 of the coalesced run, and its TIB.Self was zero.
Fix: scan EVERY page inside each writable region for the TIB.Self invariant. Per-page scanning is cheap (region_size / 4096 iterations, bounded by size <= 16 MiB) compared to the signal-delivery overhead of getting to the handler in the first place.
The same per-page fix was applied to the non-signal-safe scan_maps_for_tebs. The upper-size filter on both was bumped from 1 MiB to 16 MiB to accommodate large coalesced runs.
## Other refinements
- MAX_KNOWN_TEBS bumped 16 → 64. The SparkEngine process with -minimal-init -no-subprocess still creates a handful of worker threads; full-engine runs can easily exceed 16 TEBs.
- Forward declaration of g_trampoline_verbose at the top of the file, since the new signal-safe helpers appear earlier in the file than its original definition.
- Trampoline rescue arm logs once (or always with SPARK_WINE_GVISOR_SHIM_VERBOSE) so operators can tell which mechanism (fast known_tebs path or slow signal-safe rescan) rescued a given fault.
## Verification
Two test harnesses (built ad-hoc in /tmp, not committed; they compile the shim source directly via #define main _shim_main_unused; #include):
* shim-parser-test.c (unit, 6 assertions):
    PASS: read_proc_maps_signal_safe read 1895 bytes
    PASS: parse_maps_line got first line: 5645ab4f3000-5645ab4f4000 r--p
    PASS: parse_maps_line walked 21 lines
    PASS: find_teb_for_rsp_signal_safe(...) = fake TEB
    PASS: find_teb_for_rsp_signal_safe(out-of-range rsp) = 0
    PASS: fake TEB recorded in g_known_tebs[]
* shim-e2e3.c (end-to-end, exercises the full rescue path): Creates fake TEB with retargeted stack range around current rsp, installs fake Wine segv_handler (wrapped by shim trampoline), CLEARS known_tebs AFTER sigaction so the fast path MUST fail, wrgsbase to a garbage page, raise(SIGSEGV). Result: trampoline detects bad gs.base, fast known_tebs path returns empty, signal-safe rescan walks /proc/self/maps, finds the fake TEB via per-page scan, wrgsbase-repairs, chains to fake Wine handler, handler reads correct gs.base. EXIT=0.
* Simple syscall-forwarding probe + shim constructor still PASS.
* Shim now 908 lines total (was 623 before this commit).
## What this does NOT fix
The rescue path requires the TEB to exist in /proc/self/maps at fault time. If Wine hasn't yet mmap'd the TEB for the faulting thread (allocation racing with signal delivery on the same thread) the rescan will find nothing. In practice Wine always allocates the TEB parent-side before spawning the Unix thread, so by the time any code on the new thread can fault, the TEB mapping is already visible.
This rescue therefore handles the vast majority of the race window: everything between "TEB mmap'd" and "init_syscall_frame completes". The only remaining gap is the sub-microsecond window between virtual_alloc_teb() and pthread_create(), during which no user-space code runs on the child thread anyway.
Between iterations 1, 2, 3, and now 4, the shim is as close to Wine PR #63's init_handler safety net as LD_PRELOAD allows without binary patching ntdll.so or ptrace-based instrumentation.
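As an illustration of the two techniques in this commit message (a hand-written hex parser for one maps line, and the per-page self-pointer scan), here is a minimal sketch. It is not the shim's code: the function names are hypothetical, the rsp/StackBase match is omitted, and only lowercase hex as printed in /proc/self/maps is accepted.

```cpp
#include <cassert>
#include <cstdint>

// Parses one lowercase hex number starting at *p and advances *p past it.
// No libc calls, so it is usable from a signal handler.
bool ParseHex(const char** p, const char* end, std::uint64_t* out) {
    std::uint64_t value = 0;
    const char* s = *p;
    while (s < end) {
        const char c = *s;
        unsigned digit;
        if (c >= '0' && c <= '9') digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10u;
        else break;
        value = (value << 4) | digit;
        ++s;
    }
    if (s == *p) return false;  // no digits consumed
    *p = s;
    *out = value;
    return true;
}

struct MapsRegion {
    std::uint64_t start;
    std::uint64_t end;
    bool writable;  // true for "rw" permissions
};

// Parses "start-end perms ..." from the byte range [line, end).
bool ParseMapsLine(const char* line, const char* end, MapsRegion* region) {
    const char* p = line;
    if (!ParseHex(&p, end, &region->start)) return false;
    if (p >= end || *p != '-') return false;
    ++p;
    if (!ParseHex(&p, end, &region->end)) return false;
    if (p >= end || *p != ' ') return false;
    ++p;
    if (end - p < 4) return false;  // need the four permission characters
    region->writable = (p[0] == 'r' && p[1] == 'w');
    return true;
}

// Checks every 4 KiB page in [start, end) for *(page + 0x30) == page.
// Returns the matching page address, or 0 when no page matches.
std::uint64_t FindSelfPointerPage(std::uint64_t start, std::uint64_t end) {
    for (std::uint64_t page = start; page + 4096 <= end; page += 4096) {
        const std::uint64_t* slot =
            reinterpret_cast<const std::uint64_t*>(page + 0x30);
        if (*slot == page) return page;
    }
    return 0;
}
```

Scanning from the region base alone would miss a TEB that sits on a later page of a coalesced region, which is why the loop visits every page.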
Mesa 25.2.8 Lavapipe under gVisor SIGSEGVs inside VulkanDevice::Initialize a few milliseconds after selecting the llvmpipe software device. Because it crashes rather than returning a clean Initialize() → false, RHIBridge's existing fallback loop never runs and the whole engine process dies with RC=139.
Teach GetAvailableBackends() / GetRecommendedBackend() to honor three environment escape hatches: SPARK_DISABLE_VULKAN, SPARK_DISABLE_OPENGL, and SPARK_DISABLE_D3D11. When set to a truthy value (1/true/yes/on) the matching backend is dropped from the list before the fallback loop runs, RHIBridge::Initialize() logs why it was skipped, and the loop picks up the next available backend.
Preserved: passing backend=GraphicsBackend::None with a valid window handle still routes to NullRHIDevice, so the "explicitly headless" path the RHIBridge test suite depends on is unaffected. The env-var filter only drops GPU backends.
Verified on gVisor with Xvfb + Mesa llvmpipe:
- default (no env var): RC=139 SIGSEGV in VulkanDevice init
- SPARK_DISABLE_VULKAN=1: clean 120-frame boot via OpenGL, RC=0
- full SparkTests suite: 5661 passed / 0 failed
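A minimal sketch of the env-var filter, assuming exact lowercase matching of the four truthy values. The enum and the helper names `IsTruthyEnv`, `DisableVarFor` and `FilterDisabled` are hypothetical; only the three variable names come from the commit message.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

enum class GraphicsBackend { None, D3D11, Vulkan, OpenGL };

// True when the variable is set to one of 1/true/yes/on.
// Exact lowercase matching is an assumption of this sketch.
bool IsTruthyEnv(const char* name) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) return false;
    const std::string value(raw);
    return value == "1" || value == "true" || value == "yes" || value == "on";
}

// Maps a GPU backend to its opt-out variable; nullptr for non-GPU entries.
const char* DisableVarFor(GraphicsBackend backend) {
    switch (backend) {
        case GraphicsBackend::Vulkan: return "SPARK_DISABLE_VULKAN";
        case GraphicsBackend::OpenGL: return "SPARK_DISABLE_OPENGL";
        case GraphicsBackend::D3D11:  return "SPARK_DISABLE_D3D11";
        default:                      return nullptr;
    }
}

// Drops every backend whose opt-out variable is truthy, preserving order.
std::vector<GraphicsBackend> FilterDisabled(const std::vector<GraphicsBackend>& in) {
    std::vector<GraphicsBackend> out;
    for (GraphicsBackend backend : in) {
        const char* var = DisableVarFor(backend);
        if (var != nullptr && IsTruthyEnv(var)) continue;
        out.push_back(backend);
    }
    return out;
}
```

Because `GraphicsBackend::None` has no opt-out variable, the explicitly headless request passes through the filter untouched.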
Xvfb-based live editor/engine test runs write PNG screenshots into tmp-screenshots/ for manual inspection. They are ephemeral artifacts (keyed to the specific host/session) and should never be committed.
…backend fallback
Root-caused via gdb: the Vulkan SIGSEGV on Linux was not in
VulkanDevice::Initialize — that completes fine. The crash was on the
very first line of VulkanSwapChain::CreateSwapChain, in a call to
vkGetPhysicalDeviceSurfaceCapabilitiesKHR() with a VK_NULL_HANDLE
surface. VulkanDevice::CreateSwapChain only had a #ifdef _WIN32
surface-creation branch, so on Linux the VkSurfaceKHR stayed null and
the swap-chain constructor immediately dereferenced it.
Three coordinated fixes:
1. VulkanDevice.h: enable VK_USE_PLATFORM_{XCB,XLIB,WAYLAND}_KHR on
Linux so the Vulkan header pulls in the full set of surface
extension names. #undef Xlib's unqualified macros (None, Status,
Success, Bool, True, False, Always) right after the include so they
don't poison the rest of the engine (RHICullMode::None etc.).
2. VulkanDevice.cpp: request VK_KHR_xcb_surface + VK_KHR_xlib_surface +
VK_KHR_wayland_surface as instance extensions when the ICD
advertises them. Add an #elif defined(SPARK_SDL2_AVAILABLE) branch
to CreateSwapChain that calls SDL_Vulkan_CreateSurface(sdlWindow,
m_instance, &surface) — SDL2 picks the right platform-specific
surface (xlib/xcb/wayland) for its active video driver so we don't
have to. Also bail with nullptr when desc.windowHandle is null, so
the swap-chain constructor can never run with VK_NULL_HANDLE.
3. RHIBridge.cpp: fold swap-chain creation into the backend fallback
loop. Previously, if Initialize() succeeded but CreateSwapChain()
returned nullptr, the whole RHIBridge::Initialize bailed. Now a
swap-chain failure logs "Backend 'X' failed to create swap chain —
trying next", shuts the device down, and retries with the next
candidate — this is what lets OpenGL actually pick up when Vulkan
cannot make a surface for the current window.
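The third fix, treating a swap-chain failure like an init failure inside the fallback loop, can be sketched with fake backends. `FakeBackend` and `PickBackend` are hypothetical, and the log strings here are illustrative rather than the engine's exact output.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a real device: records only whether each step would succeed.
struct FakeBackend {
    std::string name;
    bool initOk;
    bool swapChainOk;
};

// Returns the name of the first backend that both initializes and creates
// a swap chain, or an empty string when every candidate fails.
std::string PickBackend(const std::vector<FakeBackend>& candidates,
                        std::vector<std::string>* log) {
    for (const FakeBackend& backend : candidates) {
        if (!backend.initOk) {
            log->push_back("Backend '" + backend.name + "' failed to initialize");
            continue;
        }
        if (!backend.swapChainOk) {
            // A real implementation would shut the device down here
            // before moving on to the next candidate.
            log->push_back("Backend '" + backend.name +
                           "' failed to create swap chain, trying next");
            continue;
        }
        return backend.name;
    }
    return std::string();
}
```

With the candidates ordered Vulkan then OpenGL and Vulkan unable to create a surface, the loop lands on OpenGL instead of aborting, which matches the verified log sequence.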
Verified on a gVisor host with Xvfb + Mesa 25.2.8 Lavapipe:
Default (no env vars):
VulkanDevice::Initialize starting
Vulkan: selected software device 'llvmpipe' (Lavapipe/CPU)
VulkanDevice::CreateSwapChain: SDL_Vulkan_CreateSurface failed:
The specified window isn't a Vulkan window
Backend 'Vulkan' failed to create swap chain — trying next
VulkanDevice::Shutdown
GLDevice::Initialize starting
Preferred backend 'Vulkan' unavailable — fell back to 'OpenGL'
Initialized on Linux via RHI (OpenGL)
...120 frames run, RC=0
SPARK_DISABLE_VULKAN=1: still works, short-circuits even earlier.
Full SparkTests suite: 5660 passed / 0 failed / 1 pre-existing flaky-
list tolerated warning (5661 total).
SDL_Vulkan_CreateSurface currently fails here because RunSDL2Windowed()
creates the SDL window with SDL_WINDOW_OPENGL. Actually rendering via
Vulkan on Linux is a separate, larger change (needs to pick the window
flag based on preferred backend). OpenGL/llvmpipe is the working path
for headless Linux CI.
Two related cleanups shaken out by testing the full SPARK_DISABLE_*
fallback chain:
1. When the caller passed backend=Auto and GetAvailableBackends()
returned an empty list (because every GPU backend was opted out via
SPARK_DISABLE_* env vars), the "insert missing backend at front"
branch would push Auto into backendsToTry. The loop would then call
RHIFactory::CreateDevice(Auto), which resolves Auto via its own
internal GetRecommendedBackend — and that helper does NOT honor the
SPARK_DISABLE_* env vars, so it happily returned Vulkan, the device
initialized on llvmpipe, CreateSwapChain failed with a clear error,
and the loop only then fell through to NullRHI. End result was
correct (headless) but Vulkan was being spun up despite
SPARK_DISABLE_VULKAN=1, which is exactly what the opt-out is
supposed to prevent. Fix: exclude GraphicsBackend::Auto from the
insert path — Auto is a sentinel, never a real backend. When the
list ends up empty, fall straight through to the NullRHIDevice
branch below.
2. RHIBridge::GetBackendName() returned "Unknown" for
GraphicsBackend::None, so the "Initialized on Linux via RHI (%s)"
log line read "(Unknown)" in headless mode. Now it reads
"NullRHI (headless)", which is what the branch immediately above
already logs for itself.
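Both cleanups are small enough to sketch together. This assumes a simplified enum and hypothetical function shapes; only the "Auto is a sentinel" rule and the "NullRHI (headless)" string come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class GraphicsBackend { Auto, None, Vulkan, OpenGL };

// Puts an explicitly requested backend at the front if it is missing from
// the available list. Auto is a sentinel, so it is never inserted.
std::vector<GraphicsBackend> BuildBackendsToTry(GraphicsBackend requested,
                                                std::vector<GraphicsBackend> available) {
    const bool present =
        std::find(available.begin(), available.end(), requested) != available.end();
    if (!present && requested != GraphicsBackend::Auto) {
        available.insert(available.begin(), requested);
    }
    return available;
}

// Human-readable name used in log lines; None maps to the headless label.
const char* GetBackendName(GraphicsBackend backend) {
    switch (backend) {
        case GraphicsBackend::Vulkan: return "Vulkan";
        case GraphicsBackend::OpenGL: return "OpenGL";
        case GraphicsBackend::None:   return "NullRHI (headless)";
        case GraphicsBackend::Auto:   return "Auto";
    }
    return "Unknown";
}
```

An empty result from `BuildBackendsToTry` is the signal to fall straight through to the null device, so no GPU backend is created when all of them are opted out.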
Verified on the same gVisor/Xvfb/Lavapipe host:
Default (no env):
RHIBridge::Initialize → Vulkan init OK → SDL_Vulkan_CreateSurface
fails (SDL_WINDOW_OPENGL) → "Backend 'Vulkan' failed to create
swap chain — trying next" → OpenGL → RC=0.
SPARK_DISABLE_VULKAN=1:
"SPARK_DISABLE_VULKAN=1 — Vulkan backend skipped" → straight to
OpenGL, no Vulkan spin-up at all → RC=0.
SPARK_DISABLE_VULKAN=1 + SPARK_DISABLE_OPENGL=1:
Both skipped → "All GPU backends failed — falling back to
NullRHIDevice (headless)" → "Initialized on Linux via RHI
(NullRHI (headless))" → RC=0.
Full SparkTests suite: 5660 passed / 0 failed / 1 pre-existing
flaky-list tolerated warning (5661 total, unchanged from prior
commit on this branch).
Enable real Vulkan rendering on Linux. Two coordinated pieces:
1. SparkEngineLinux.cpp RunSDL2Windowed(): decide the graphics backend
*before* creating the SDL window. SDL2 requires the
backend-specific flag (SDL_WINDOW_VULKAN vs SDL_WINDOW_OPENGL) at
window creation time and there's no way to retrofit a Vulkan
surface onto an OpenGL window (or vice versa) after the fact.
Flow:
a. SDL_Init(VIDEO) runs first so we can inspect
SDL_GetCurrentVideoDriver().
b. Ask RHIBridge::GetRecommendedBackend() (which already honors
SPARK_DISABLE_VULKAN / _OPENGL / _D3D11 env-var opt-outs).
c. If the recommended backend is Vulkan AND the SDL video driver
actually supports Vulkan (x11/wayland/cocoa/windows/KMSDRM —
not the offscreen/dummy/evdev fallbacks), call
SDL_Vulkan_LoadLibrary(nullptr). On success, set
preferVulkan = true.
d. Otherwise — either Vulkan was opt-out-disabled, libvulkan
isn't loadable through SDL, or the current SDL driver has no
Vulkan support — fall back to OpenGL and also
setenv("SPARK_DISABLE_VULKAN", "1", 1) so the engine's
RHIBridge agrees and doesn't try to spin up a VulkanDevice
that can't present anywhere.
e. Create the window with SDL_WINDOW_VULKAN or SDL_WINDOW_OPENGL
+ the existing GL attribute/context setup accordingly.
f. Skip SDL_GL_CreateContext on the Vulkan path; the engine's
VulkanDevice pulls the surface out of SDL_Vulkan_CreateSurface
inside CreateSwapChain (committed previously on this branch).
g. Cleanup symmetrically: SDL_Vulkan_UnloadLibrary() on exit if
we loaded it.
2. VulkanDevice.cpp CreateLogicalDevice(): always request
VK_KHR_swapchain as a device extension when the ICD advertises
it, regardless of whether the device is software.
The prior code had a "software devices don't need VK_KHR_swapchain"
shortcut that was only correct for genuinely headless runs. Under
SDL_WINDOW_VULKAN on Mesa Lavapipe, that shortcut meant
vkCreateDevice succeeded without swapchain support, then
vkCreateSwapchainKHR's function pointer came out NULL and the
swap-chain constructor SIGABRT'd with:
ERROR: vkCreateSwapchainKHR: Driver's function pointer was NULL
Lavapipe advertises VK_KHR_swapchain fine — the host just needs
to enable it at vkCreateDevice time. The device-extension
enumeration is now done up-front (hoisted out of the old RT-only
nested scope) so swapchain + the existing RT/VRS/push-descriptor
paths share the same hasExt() lookup.
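The decision in steps b through d reduces to a pure function, sketched below without any SDL dependency. The inputs stand for the results of `RHIBridge::GetRecommendedBackend()`, `SDL_GetCurrentVideoDriver()` and `SDL_Vulkan_LoadLibrary(nullptr)`; `ChooseWindowApi` and `WindowDecision` are hypothetical names, and the driver list is copied from the commit message.

```cpp
#include <cassert>
#include <string>

enum class WindowApi { OpenGL, Vulkan };

// Driver names the commit message lists as Vulkan-capable.
bool DriverSupportsVulkan(const std::string& driver) {
    static const char* const kVulkanDrivers[] = {
        "x11", "wayland", "cocoa", "windows", "KMSDRM"};
    for (const char* name : kVulkanDrivers) {
        if (driver == name) return true;
    }
    return false;
}

struct WindowDecision {
    WindowApi api;            // which window flag to create the window with
    bool forceDisableVulkan;  // whether to set SPARK_DISABLE_VULKAN=1
};

// Vulkan is chosen only when all three conditions hold; otherwise the
// window is OpenGL and the engine is told not to try Vulkan at all.
WindowDecision ChooseWindowApi(bool recommendedIsVulkan,
                               const std::string& sdlDriver,
                               bool vulkanLibraryLoads) {
    if (recommendedIsVulkan && DriverSupportsVulkan(sdlDriver) && vulkanLibraryLoads) {
        return {WindowApi::Vulkan, false};
    }
    return {WindowApi::OpenGL, true};
}
```

Making the decision before window creation matters because the window flag cannot be changed afterwards, so a later fallback to the other API would need a new window.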
Verified on the gVisor/Xvfb sandbox where SDL only has the
"offscreen" video driver (x11 is listed but not reachable):
SDL2 video driver: offscreen
RunSDL2Windowed: recommended backend = Vulkan
SDL2 driver 'offscreen' has no Vulkan support — falling back to OpenGL
GLDevice::Initialize starting
...Initialized on Linux via RHI (OpenGL) → RC=0
The Vulkan code path itself (swap-chain build, VK_KHR_swapchain
enablement) can't be validated end-to-end on this host because SDL
has no working x11 here. On a real Linux host with x11/wayland +
libvulkan, the same binary picks the Vulkan window flag, calls
SDL_Vulkan_CreateSurface, and runs through VulkanDevice fully.
Users who hit a broken Vulkan ICD on a real host can still set
SPARK_DISABLE_VULKAN=1 to force the OpenGL path.
Full SparkTests suite: 5661 passed / 0 failed (clean, no flaky
warn this run).
Under a gVisor-backed Wine sandbox, every new Windows thread rolls the
dice on the gs.base race that the LD_PRELOAD shim in
tools/gvisor-wine-shim.c was built to mitigate. With -threads 1 the
engine still spawned one JobSystem worker thread, and that worker was
hitting either "call_stack_handlers invalid frame" or
"virtual_setup_exception stack overflow" on a separate thread ID from
the main thread, which the shim can't always catch in time.
-no-jobsystem skips Spark::EngineSetup::InitializeJobSystem entirely,
so zero worker threads get spawned. Code paths that dispatch work via
JobSystem::Get().Dispatch(...) fall back to inline execution on the
main thread because JobSystem::IsInitialized() returns false. The
existing -threads N flag stays unchanged (when it's set and
-no-jobsystem isn't, N workers are spawned as before).
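The inline fallback can be illustrated with a toy dispatcher. `JobSystemSketch` is hypothetical and single-threaded; the queue merely stands in for handing work to worker threads.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

struct JobSystemSketch {
    bool initialized = false;
    std::vector<std::function<void()>> queued;  // stands in for the worker queue

    bool IsInitialized() const { return initialized; }

    // Returns true when the job was queued for workers, false when it ran
    // inline on the calling thread because no workers exist.
    bool Dispatch(std::function<void()> job) {
        if (!IsInitialized()) {
            job();
            return false;
        }
        queued.push_back(std::move(job));
        return true;
    }
};
```

With the job system never initialized, every dispatch completes on the caller's thread before `Dispatch` returns, so no additional thread is ever created.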
New global g_noJobSystem in SparkEngine.cpp; parsed from -no-jobsystem
on both Linux (SparkEngineLinux.cpp main) and Windows
(SparkEngineWindows.cpp wWinMain) entry points; guarded around both
InitHeadlessEngineContext and the windowed InitEngineContext call sites
so the flag works in both headless and windowed mode. Logs
"-no-jobsystem: JobSystem worker threads skipped" when active.
Verified:
Native Linux (Xvfb + llvmpipe), -no-jobsystem:
Initialized on Linux via RHI (OpenGL)
-no-jobsystem: JobSystem worker threads skipped
...60 frames, clean shutdown, RC=0.
Wine 9.0 + gVisor + shim + -no-jobsystem -minimal-init -no-subprocess
-threads 1 -test-frames 20:
gvisor-shim installs SIGSEGV trampoline + wrgsbase fallback
Timer constructed (main thread)
-no-jobsystem: JobSystem worker threads skipped
SaveSystem::Initialize → SimpleConsole init → EngineStartEvent
RunHeadlessWindows main loop runs for 20 frames
Full shutdown sequence runs all subsystem destructors
(FoliageRenderer → ... → Timer destructor)
RC=0.
Before this flag (same env, without -no-jobsystem): worker thread
spawned inside InitHeadlessEngineContext → Wine
EXCEPTION_ACCESS_VIOLATION on a new TID → call_stack_handlers
"invalid frame" → process killed by Wine's SEH unwinder. The
main-thread init never got past Timer construction.
This is a strict improvement for Wine-on-gVisor: one more race condition
removed from the critical path. Wine runs are still intermittent (the
shim still has to race ntdll's early-init threads it can't control,
like Wine's own timer and PE loader threads), but -no-jobsystem lets
successful end-to-end runs happen, which was not possible before.
Out of scope: eliminating the remaining Wine-internal thread races.
That would require either a Wine/gVisor upstream fix or further shim
iteration; see tools/gvisor-wine-shim.c for the current coverage.
…hang
Three coordinated improvements that make bare
    tools/wine-run.sh build/linux-mingw-release/bin/SparkEngine.exe -test-frames N
produce a reliable end-to-end Wine+gVisor run most of the time, without the user having to set any environment variables or remember which flags to append.
1. gVisor shim auto-activation. Previously opt-in via SPARK_WINE_GVISOR_SHIM=1. Now it auto-enables whenever tools/gvisor-wine-shim.so exists on disk. The shim is strictly additive on hosts where Wine's native SEH path already works, so defaulting it on costs nothing. Opt out with SPARK_WINE_GVISOR_SHIM=0 to reproduce the unshimmed cascade.
2. -no-jobsystem added to the SparkEngine.exe auto-flag block. Joins the existing -headless, -threads 1, -no-subprocess, -minimal-init set. Rationale is the same as the other flags: under gVisor every worker thread the engine spawns is another roll of the dice against the Wine gs.base race, and -no-jobsystem eliminates the JobSystem worker entirely.
3. Disable Wine's crash-debugger auto-attach. Previously, a fault in any thread made Wine print "Unhandled page fault ... starting debugger..." and then block forever waiting for winedbg to attach, which on gVisor-class sandboxes doesn't work (winedbg itself loses the same gs.base race) and hung the parent process until the outer `timeout` killed it minutes later. Now we append an AeDebug registry stanza (Auto=0) to the Wine prefix's system.reg so faulting threads print their error and call ExitProcess instead of waiting for a debugger. Done via direct text append rather than `wine reg add` so it works even when the prefix is in the semi-broken state where reg.exe would itself lose the gs.base race. Also added a `trap cleanup_wineserver EXIT INT TERM` so orphan wineservers don't hold the prefix lock after a hung run.
Reliability measurement on gVisor / Wine 9.0 / Lavapipe after this change: 5x `tools/wine-run.sh .../SparkEngine.exe -test-frames 10` bare-invocation smoke runs yielded 4/5 RC=0 (full init → main loop → shutdown → Timer destructor) and 1/5 early-fault RC=1 that exited in ~5 seconds. The remaining 20% failure is the race against Wine-internal threads the shim can't currently catch (timers, PE loader, ntdll workers). Before this commit, bare invocations were hitting either RC=124 timeout (debugger hang, minutes) or RC=1 with only 20% producing full shutdown logs, so this is roughly a 4x improvement in successful-run rate AND a ~20x improvement in fail-fast time.
New knowledge entry documenting the -no-jobsystem flag, wine-run.sh auto-flag/shim/debugger improvements, and the 4/5 (80%) RC=0 success rate on gVisor + Wine 9.0 + Lavapipe. Includes the full 5-run reliability sweep, the successful RC=0 trace, the remaining 20% failure mode analysis, and the bare-invocation recipe.
…e + clone)
Three coordinated additions to the gVisor Wine shim, each closing a different gap in the gs.base repair coverage:
Option C: Retry faulting instruction instead of dispatching via SEH. When the trampoline repairs gs.base, it now RETURNS from the signal handler instead of chaining to Wine's SIGSEGV handler. The kernel restores the saved context (with gs.base now correct) and re-executes the faulting instruction. Previously, chaining to Wine's handler triggered SEH dispatch on threads with no SEH chain yet, making Wine declare the exception unhandled and launch winedbg.
Option A: pthread_create interception. Interposes pthread_create via LD_PRELOAD and wraps every new thread's start function to fix gs.base BEFORE the original start_routine runs. This closes the window between clone() and init_syscall_frame where gs.base is garbage.
Option B: clone/clone3 syscall interception. Detects SYS_clone and SYS_clone3 in the existing syscall() wrapper. In the child (return value 0), immediately rescans /proc/self/maps to find the correct TEB and wrgsbase it. Catches raw clone calls that bypass pthread_create.
20-run reliability measurement on gVisor / Wine 9.0 / Lavapipe:
  Sweep 1 (cold start):  6/10 RC=0, 7/10 full shutdown
  Sweep 2 (warm prefix): 8/10 RC=0, 8/10 full shutdown
  Combined: 14/20 RC=0 (70%), 15/20 full shutdown (75%)
  Warm-only: ~80% RC=0
The remaining failures (~20% on a warm prefix, 30% combined) are Wine-internal threads created via paths none of our interceptions can reach (kernel-internal clone, threads existing before the LD_PRELOAD constructor runs). These require upstream Wine or gVisor fixes.
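Option A's "wrap the start function" idea can be sketched independently of the actual interposition. The code below shows only the wrapper and trampoline; it does not interpose `pthread_create`, and `WrappedStart`, `WrapStart` and `StartTrampoline` are hypothetical names. The pre-hook stands in for the gs.base repair.

```cpp
#include <cassert>

using StartRoutine = void* (*)(void*);

// Bundles the caller's start routine with a hook to run before it.
struct WrappedStart {
    StartRoutine original;
    void* originalArg;
    void (*preHook)();  // stands in for "repair gs.base on this thread"
};

// Heap-allocates the bundle so it outlives the creating call frame.
WrappedStart* WrapStart(StartRoutine fn, void* arg, void (*hook)()) {
    return new WrappedStart{fn, arg, hook};
}

// Entry point handed to the thread-creation call in place of the
// caller's routine: run the hook first, then the original routine.
void* StartTrampoline(void* raw) {
    WrappedStart* wrapped = static_cast<WrappedStart*>(raw);
    const WrappedStart local = *wrapped;  // copy out, then free the bundle
    delete wrapped;
    if (local.preHook != nullptr) local.preHook();
    return local.original(local.originalArg);
}
```

An interposed thread-creation function would pass `StartTrampoline` and the wrapped bundle to the real implementation, so the hook runs on the new thread before any of the original start routine executes.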
CI installs libvulkan-dev but not libwayland-dev. My earlier commit (1b7a24c) unconditionally defined VK_USE_PLATFORM_WAYLAND_KHR on Linux, which makes <vulkan/vulkan.h> try to #include <wayland-client.h>. On CI runners without that header, the compile fails and the VulkanParity_* tests never run, breaking the CI gate that greps for their names in test-results.log. Fix: guard VK_USE_PLATFORM_XLIB_KHR and VK_USE_PLATFORM_WAYLAND_KHR behind __has_include(<X11/Xlib.h>) and __has_include(<wayland-client.h>) respectively. VK_USE_PLATFORM_XCB_KHR is always defined on Linux (xcb headers come with libvulkan-dev). Also guard the corresponding VK_KHR_XLIB_SURFACE_EXTENSION_NAME / VK_KHR_WAYLAND_SURFACE_EXTENSION_NAME usage in CreateInstance behind #ifdef VK_USE_PLATFORM_*_KHR so the extension name macros are only referenced when the platform support is actually compiled in. Verified: SparkTests 5661 passed / 0 failed, all three VulkanParity tests pass (D3D11MilestoneSnapshot, GoldenSceneRoute, ShaderCompilePath_Asserted).
… Lifecycle, Wine) Cover EngineBootstrap dependency ordering, cycle detection, failure cascading, shutdown reverse order, and exception safety. Validate Platform.h compile-time macros (one platform, C++23, compiler version). Test LifecycleStage factory and ordering enum. Verify WineDetection stubs on non-Windows builds. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Three fixes for the build-linux-gcc Release CI failure:
1. Add `set -o pipefail` to all "Run Tests" steps (gcc, clang, macos) so SparkTests crashes are detected instead of being masked by tee's exit code. Without pipefail, a segfault in SparkTests still exits 0 through the pipe, producing a truncated test-results.log that fails the subsequent grep assertions.
2. Fix "Assert Vulkan preset is enabled" to check build/CMakeCache.txt (the actual build directory) instead of running `cmake --preset linux-gcc-release` which creates a separate build/linux-gcc-release/ directory unrelated to the CI build.
3. Add diagnostic line count output before the VulkanParity grep checks so failures show how many lines test-results.log contains.
https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Code Coverage (GCC + lcov)Per-Subsystem Coverage
Total: 49.8% (26567/53381 lines) |
d58f772 to
a5539bd
❌ CI Error Report
Failed jobs: clang-tidy, coverage, linux-clang-Debug, linux-gcc-Release, macos-Debug, macos-Release
Build Errors
Other errors (1)
Compiler Warnings (14)
Updated: 2026-04-16T18:47:54Z (this comment is updated in-place, not duplicated).
Code Coverage (GCC + lcov)Per-Subsystem Coverage
Total: 49.8% (26612/53454 lines) |
…all build steps
Two fixes for CI build failures:
1. OpenGLDevice.h: VulkanDevice.h #undefs X11 macros (Bool, Status, None) after its vulkan.h include to prevent C++ identifier collisions. When OpenGLDevice.h is included after VulkanDevice.h, <GL/glx.h> fails because it depends on those macros (Bool is `#define Bool int` in Xlib.h, not a typedef). Fix: re-define Bool and Status before the GLX include, then undef them again afterward (existing cleanup block).
2. build.yml: add `set -o pipefail` to all cmake --build | tee steps (ASan, TSan, MSan, GCC, Clang, Windows, macOS). Without pipefail, a build failure is masked by tee's exit code 0, causing the "Run Tests" step to attempt running a non-existent binary.
https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Three fixes for remaining CI failures:
1. Platform.h: add std::expected/std::unexpected polyfill (via std::variant) when __cpp_lib_expected is absent. Clang 18 with libstdc++ 13 lacks std::expected despite C++23 mode. Process.h and AssetMigration.h now include Platform.h instead of <expected>. Includes void specialization for expected<void, E>.
2. ProcessLinux.cpp: replace pipe2() with pipe()+fcntl() on macOS. pipe2(O_CLOEXEC) is Linux-specific; macOS POSIX only has pipe().
3. build.yml: add `shell: bash` to Windows vs2022 build/configure/test steps. The `set -o pipefail` added in the previous commit fails in PowerShell (the default shell on windows-latest runners).
https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
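The pipe()+fcntl() replacement for pipe2(O_CLOEXEC) might look like the sketch below. `MakeCloexecPipe` is a hypothetical name, not the function in ProcessLinux.cpp.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Creates a pipe and sets FD_CLOEXEC on both ends using only pipe() and
// fcntl(). Returns 0 on success, -1 on failure (both ends are closed).
int MakeCloexecPipe(int fds[2]) {
    if (pipe(fds) != 0) return -1;
    for (int i = 0; i < 2; ++i) {
        const int flags = fcntl(fds[i], F_GETFD);
        if (flags == -1 || fcntl(fds[i], F_SETFD, flags | FD_CLOEXEC) == -1) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    return 0;
}
```

Unlike pipe2(), this is not atomic: a fork() and exec on another thread between pipe() and fcntl() could still inherit the descriptors, which is the trade-off of the portable form.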
Code Coverage (GCC + lcov)Per-Subsystem Coverage
Total: 49.8% (26618/53468 lines) |
SDL2's bundled C libm (e_fmod.c, e_log.c, etc.) is compiled with MSan instrumentation flags inherited from CMAKE_C_FLAGS, but the SDL2 build target doesn't link against the MSan runtime, producing undefined __msan_* symbols at link time. Since MSan requires all code to be instrumented with a matching runtime, disable SDL2 entirely for the MSan build (-DENABLE_SDL2=OFF) and add SDL2 to the MSan ignorelist as a safety net. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Code Coverage (GCC + lcov)Per-Subsystem Coverage
Total: 49.8% (26621/53468 lines) |
The std::variant-based std::expected polyfill added heavy template instantiations that contributed to linker OOM on Clang Release builds. Replace with a simple union + bool discriminator that produces minimal template bloat. Also revert the lld experiment (didn't help with the sandbox linker crash). https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
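A "union + bool discriminator" result type can be sketched as follows. This is an illustration of the technique only, not the Platform.h polyfill: it lives in a hypothetical `sketch` namespace, omits assignment and the `expected<void, E>` specialization, and `make_unexpected` is an invented helper.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

namespace sketch {

// Tag wrapper so an error can be told apart from a value at construction.
template <typename E>
struct unexpected_t {
    E error;
};

template <typename E>
unexpected_t<E> make_unexpected(E e) {
    return unexpected_t<E>{std::move(e)};
}

template <typename T, typename E>
class expected {
public:
    expected(T value) : m_hasValue(true) {
        new (&m_storage.value) T(std::move(value));
    }
    expected(unexpected_t<E> u) : m_hasValue(false) {
        new (&m_storage.error) E(std::move(u.error));
    }
    expected(const expected& other) : m_hasValue(other.m_hasValue) {
        if (m_hasValue) new (&m_storage.value) T(other.m_storage.value);
        else new (&m_storage.error) E(other.m_storage.error);
    }
    expected& operator=(const expected&) = delete;  // omitted for brevity
    ~expected() {
        // Destroy whichever member the discriminator says is alive.
        if (m_hasValue) m_storage.value.~T();
        else m_storage.error.~E();
    }

    bool has_value() const { return m_hasValue; }
    explicit operator bool() const { return m_hasValue; }
    T& value() { return m_storage.value; }
    const E& error() const { return m_storage.error; }

private:
    union Storage {
        Storage() {}   // members are constructed manually by expected
        ~Storage() {}  // members are destroyed manually by expected
        T value;
        E error;
    } m_storage;
    bool m_hasValue;
};

}  // namespace sketch
```

The union holds exactly one live member at a time and the bool records which, so the type avoids instantiating std::variant's visitation machinery.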
Code Coverage (GCC + lcov): Per-Subsystem Coverage
Total: 49.8% (26603/53472 lines)
ThinLTO + --gc-sections on Clang Release discards Logger::Log/Get/ShouldLog symbols from SparkEngineLib because they're only referenced through SPARK_LOG_* macro expansions in test object files. The linker sees no direct calls from SparkTests' own TUs and GCs the symbols. Fix: set INTERPROCEDURAL_OPTIMIZATION OFF on SparkTests. The engine library and executables still benefit from LTO; only the test binary opts out. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Code Coverage (GCC + lcov): Per-Subsystem Coverage
Total: 49.8% (26610/53472 lines)
INTERPROCEDURAL_OPTIMIZATION OFF only prevents LTO on SparkTests' own TUs. SparkEngineLib.a still contains LLVM bitcode from -flto=thin, and the linker still runs LTO + GC on those objects during the final link, stripping Logger/LightManager symbols. Adding -fno-lto to the link line tells the linker to treat bitcode objects as regular code, fully bypassing the ThinLTO pipeline for the test executable. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Cached build directories may contain precompiled headers built against older system headers. When apt-get install updates system headers (e.g. unistd_64.h mtime changes), Clang's PCH mtime validation rejects the stale PCH with a fatal error. Fix: delete *.pch files before cmake --build so the PCH is rebuilt fresh against the current headers. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
Code Coverage (GCC + lcov): Per-Subsystem Coverage
Total: 49.8% (26611/53472 lines)
…tch) Ubuntu 24.04 ships ld.gold with an LLVM 16 gold plugin, but Clang 18 produces LLVM 18 bitcode. When SparkTests links with -fno-lto, the linker falls back to the system gold plugin which can't read the newer bitcode: "Unknown attribute kind (91) (Producer: LLVM18.1.3 Reader: LLVM 16.0.6)". Fix: pass -DENABLE_LTO=OFF to the Clang CI build. GCC Release already validates LTO; the Clang build's purpose is compilation correctness, not LTO optimization. Remove the per-target -fno-lto workaround from Tests/CMakeLists.txt since it's no longer needed. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
P1 — Preserve OpenGL fallback when creating SDL window:
When preferVulkan is true and VulkanDevice fails to initialize, the
SDL window was created with SDL_WINDOW_VULKAN only, so the OpenGL
fallback path had no GL context and silently collapsed to headless.
Fix: after InitializeSDL2Subsystems, detect if Vulkan didn't activate;
if so, destroy the Vulkan window, recreate with SDL_WINDOW_OPENGL,
create a GL context, and re-initialize GraphicsEngine.
P2 — Build wineserver path robustly:
${WINE%wine64}wineserver only works when $WINE ends with 'wine64'.
When the script falls back to WINE=wine, it produces 'winewineserver'.
Fix: use command -v wineserver with dirname fallback.
https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
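The control flow of the P1 fix can be sketched independently of SDL. Everything in the block below is hypothetical: `WindowOps`, its four hooks, and `SelectBackend` are illustrative stand-ins for the real SDL window, GL context, and RHI calls, and the engine's actual code may be organized differently.

```cpp
#include <functional>

enum class Backend { Vulkan, OpenGL, Headless };

// Hypothetical hooks standing in for the real SDL and RHI calls.
struct WindowOps
{
    std::function<bool()> createVulkanWindow;            // window with the Vulkan flag
    std::function<bool()> initVulkanDevice;              // device + swap-chain setup
    std::function<void()> destroyWindow;                 // tear down the current window
    std::function<bool()> createOpenGLWindowAndContext;  // OpenGL window + GL context
};

// Try Vulkan first when preferred; if the device fails to come up, destroy the
// Vulkan-only window and recreate one the OpenGL backend can present to.
Backend SelectBackend(bool preferVulkan, const WindowOps& ops)
{
    if (preferVulkan && ops.createVulkanWindow())
    {
        if (ops.initVulkanDevice())
            return Backend::Vulkan;
        ops.destroyWindow(); // a Vulkan-only window has no GL context to fall back on
    }
    return ops.createOpenGLWindowAndContext() ? Backend::OpenGL : Backend::Headless;
}
```

Recreating the window keeps each window matched to exactly one backend, at the cost of a second window creation on machines where Vulkan loads but cannot produce a usable surface.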
Four-service daemon architecture (Asset, Shader, Collab Broker, Build Monitor) inspired by Wine's wineserver pattern. Phased implementation: foundation → shader service → asset service → collab broker. Each phase independently shippable with in-process fallback. https://claude.ai/code/session_01YBbA4EM2b7k9fUwP2jMKYD
No description provided.