Fix crashes in multi-threaded process teardown#3167
Fix crashes in multi-threaded process teardown#3167louen wants to merge 7 commits intoml-explore:mainfrom
Conversation
Merge in TDGNUI/mlx from fix/168797669/mlx_leak_resources to mlx-2p
Squashed commit of the following:
commit 090d19edb47f506b52d62c7f63e5b6e86ec933bc
Author: Valentin Roussellet <vroussellet@apple.com>
Date: Tue Feb 17 15:45:18 2026 -0800
proposed fix
angeloskath
left a comment
There was a problem hiding this comment.
This looks great!
I left a comment regarding metal::set_enabled. Regarding CI you can run the test after the CPP tests in GitHub actions. I would also make the test easier on the GPU. A 2k by 2k matmul is probably unnecessarily large.
| } | ||
|
|
||
| void set_enabled(bool enabled) { | ||
| metal::enabled.store(enabled, std::memory_order_release); |
There was a problem hiding this comment.
That is probably not the right way to do that. I understand that it changes the default stream from gpu to cpu when metal is disabled but the stream is still there. People could still submit to it and if that causes a failure it can still cause a failure.
Eg what should the following code do?
auto s = mx::default_stream(mx::default_device());
mx::metal::set_enabled(false);
auto x = mx::add(mx::array(1), mx::array(2), s);
mx::eval(x)Just so that I don't sound pedantic. It is common to grab a stream specific to a task. For instance in mlx-lm we have generation_stream for well generation. That means if I was generating in a thread and the program exited and set the above, it would still submit to the GPU even though it was "disabled".
|
@louen how do you feel about removing the flag and merging this? |
This PR is a proposed fix for issue #3126 , where exceptions or segfaults can be triggered during the teardown of a multi-thread process using MLX.
A typical example would be an interactive application where UI runs on the main thread and MLX work is being done in the background on a separate thread.
If the app exits while the GPU is busy with a compute kernel, that kernel can return, and execute its completion call back after static objects have been destroyed by the runtime. Unfortunately in many case these callbacks try to access those singleton static objects, either the
Deviceor theScheduler.Proposed changes
This PR implements a mitigation strategy in two parts
deleted. This ensure that the objects remain alive even after the main thread's static object destruction has completed.metalnamespace which allows a user to disable the metal backend. This prevents additional work from being enqueued on the GPU while the application is shutting down.Tests
A new test executable has been added with the minimal repro case from #3126.
Before proposed changes:
After:
Showing the effect of the two fixes
Checklist
Put an
xin the boxes that apply.pre-commit run --all-filesto format my code / installed pre-commit prior to committing changes