Fix crashes in multi-threaded process teardown by louen · Pull Request #3167 · ml-explore/mlx

louen · 2026-02-24T23:26:55Z

This PR is a proposed fix for issue #3126 , where exceptions or segfaults can be triggered during the teardown of a multi-thread process using MLX.

A typical example would be an interactive application where UI runs on the main thread and MLX work is being done in the background on a separate thread.
If the app exits while the GPU is busy with a compute kernel, that kernel can return, and execute its completion call back after static objects have been destroyed by the runtime. Unfortunately in many case these callbacks try to access those singleton static objects, either the Device or the Scheduler.

Proposed changes

This PR implements a mitigation strategy in two parts

intentionally leaking singleton objects, by replacing the static instance with a pointer, which is never deleted. This ensure that the objects remain alive even after the main thread's static object destruction has completed.
adding a global atomic toggle flag to metal namespace which allows a user to disable the metal backend. This prevents additional work from being enqueued on the GPU while the application is shutting down.

Tests

A new test executable has been added with the minimal repro case from #3126.

Before proposed changes:

> ./build/tests/test_teardown
START
...
Step 200
Step 210
Main thread exiting.
Step 220
libc++abi: terminating due to uncaught exception of type std::out_of_range: vector
zsh: abort      ./tests/test_teardown

After:

 ./tests/test_teardown
START
...
Step 200
Step 210
Main thread exiting.

Showing the effect of the two fixes

No more crash at exit
Step 220 is not being run after the exit signal

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

Merge in TDGNUI/mlx from fix/168797669/mlx_leak_resources to mlx-2p Squashed commit of the following: commit 090d19edb47f506b52d62c7f63e5b6e86ec933bc Author: Valentin Roussellet <vroussellet@apple.com> Date: Tue Feb 17 15:45:18 2026 -0800 proposed fix

angeloskath

This looks great!

I left a comment regarding metal::set_enabled. Regarding CI you can run the test after the CPP tests in GitHub actions. I would also make the test easier on the GPU. A 2k by 2k matmul is probably unnecessarily large.

angeloskath · 2026-02-25T00:26:26Z

mlx/backend/metal/metal.cpp

+}
+
+void set_enabled(bool enabled) {
+  metal::enabled.store(enabled, std::memory_order_release);


That is probably not the right way to do that. I understand that it changes the default stream from gpu to cpu when metal is disabled but the stream is still there. People could still submit to it and if that causes a failure it can still cause a failure.

Eg what should the following code do?

auto s = mx::default_stream(mx::default_device()); mx::metal::set_enabled(false); auto x = mx::add(mx::array(1), mx::array(2), s); mx::eval(x)

Just so that I don't sound pedantic. It is common to grab a stream specific to a task. For instance in mlx-lm we have generation_stream for well generation. That means if I was generating in a thread and the program exited and set the above, it would still submit to the GPU even though it was "disabled".

angeloskath · 2026-02-27T02:09:13Z

@louen how do you feel about removing the flag and merging this?

louen added 7 commits February 24, 2026 13:25

change flag to toggle

b1f6a5a

update comment

136eb2c

format

72f89a4

add test

9197602

Fix test

7fc5c71

add breaker to test

0b37a64

angeloskath requested changes Feb 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix crashes in multi-threaded process teardown#3167

Fix crashes in multi-threaded process teardown#3167
louen wants to merge 7 commits intoml-explore:mainfrom
louen:val/fix_crash_quit

louen commented Feb 24, 2026 •

edited

Loading

Uh oh!

angeloskath left a comment

Uh oh!

angeloskath Feb 25, 2026

Uh oh!

angeloskath commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

louen commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed changes

Tests

Checklist

Uh oh!

angeloskath left a comment

Choose a reason for hiding this comment

Uh oh!

angeloskath Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

angeloskath commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

louen commented Feb 24, 2026 •

edited

Loading