
Fix crashes in multi-threaded process teardown #3167

Open

louen wants to merge 7 commits into ml-explore:main from louen:val/fix_crash_quit

Conversation

@louen (Contributor) commented Feb 24, 2026

This PR is a proposed fix for issue #3126, where exceptions or segfaults can be triggered during the teardown of a multi-threaded process using MLX.

A typical example is an interactive application where the UI runs on the main thread and MLX work is done in the background on a separate thread.
If the app exits while the GPU is busy with a compute kernel, that kernel can finish and execute its completion callback after the runtime has already destroyed static objects. Unfortunately, in many cases these callbacks try to access those static singletons, such as the Device or the Scheduler.

Proposed changes

This PR implements a mitigation strategy in two parts:

  • Intentionally leaking singleton objects, by replacing the static instance with a pointer that is never deleted. This ensures that the objects remain alive even after the main thread's static object destruction has completed.
  • Adding a global atomic toggle flag to the metal namespace which allows a user to disable the Metal backend. This prevents additional work from being enqueued on the GPU while the application is shutting down.
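The two mitigations above can be sketched in a few lines of standalone C++. Note that the names below (`metal_sketch`, `Scheduler`, `pending_`) are illustrative and are not the actual MLX symbols; this is a minimal model of the pattern, not the PR's implementation.

```cpp
#include <atomic>

namespace metal_sketch {

// Part 2: process-wide toggle, defaulting to "enabled".
std::atomic<bool> enabled{true};

inline bool is_enabled() {
  return enabled.load(std::memory_order_acquire);
}

inline void set_enabled(bool on) {
  enabled.store(on, std::memory_order_release);
}

// Part 1: an intentionally leaked singleton. The pointer is allocated
// once and never deleted, so the object survives the destruction of
// ordinary static objects at process exit and is still valid when a
// late completion callback touches it.
class Scheduler {
 public:
  static Scheduler& instance() {
    static Scheduler* s = new Scheduler(); // intentionally leaked
    return *s;
  }
  bool enqueue() {
    if (!is_enabled()) {
      return false; // shutdown in progress: refuse new work
    }
    ++pending_;
    return true;
  }
  int pending() const { return pending_; }

 private:
  Scheduler() = default;
  int pending_ = 0;
};

} // namespace metal_sketch
```

During shutdown, the application would call `set_enabled(false)` before joining worker threads, so any in-flight callback that reaches `enqueue()` is turned away rather than touching a half-destroyed runtime.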

Tests

A new test executable has been added with the minimal repro case from #3126.

Before proposed changes:

> ./build/tests/test_teardown
START
...
Step 200
Step 210
Main thread exiting.
Step 220
libc++abi: terminating due to uncaught exception of type std::out_of_range: vector
zsh: abort      ./tests/test_teardown

After:

 ./tests/test_teardown
START
...
Step 200
Step 210
Main thread exiting.

This shows the effect of the two fixes:

  • No more crash at exit
  • Step 220 is not run after the exit signal

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Merge in TDGNUI/mlx from fix/168797669/mlx_leak_resources to mlx-2p

Squashed commit of the following:

commit 090d19edb47f506b52d62c7f63e5b6e86ec933bc
Author: Valentin Roussellet <vroussellet@apple.com>
Date:   Tue Feb 17 15:45:18 2026 -0800

    proposed fix
@angeloskath (Member) left a comment


This looks great!

I left a comment regarding metal::set_enabled. Regarding CI, you can run the test after the CPP tests in GitHub Actions. I would also make the test easier on the GPU; a 2k-by-2k matmul is probably unnecessarily large.

}

void set_enabled(bool enabled) {
  metal::enabled.store(enabled, std::memory_order_release);
}

That is probably not the right way to do that. I understand that it changes the default stream from GPU to CPU when Metal is disabled, but the stream is still there. People could still submit to it, and if that causes a failure, it still fails.

E.g., what should the following code do?

auto s = mx::default_stream(mx::default_device());
mx::metal::set_enabled(false);
auto x = mx::add(mx::array(1), mx::array(2), s);
mx::eval(x);

Just so that I don't sound pedantic: it is common to grab a stream specific to a task. For instance, in mlx-lm we have generation_stream for, well, generation. That means if I were generating in a thread and the program exited and set the flag as above, it would still submit to the GPU even though it was "disabled".
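The concern above can be illustrated with a toy model (this is not the real MLX API; `toy`, `Stream`, and `default_stream` are hypothetical names): a disable flag that only steers which stream `default_stream()` hands out does nothing to a stream handle that was captured before the flag flipped.

```cpp
#include <atomic>

// Toy model of the review concern: disabling the backend changes which
// stream default_stream() returns, but a previously captured stream
// handle still accepts submissions with no flag check at all.
namespace toy {

enum class Device { cpu, gpu };

struct Stream {
  Device device;
  int submitted = 0;
  void submit() { ++submitted; } // no flag check here: work is accepted
};

std::atomic<bool> gpu_enabled{true};
Stream cpu_stream{Device::cpu};
Stream gpu_stream{Device::gpu};

// Only the *selection* of a stream honors the flag.
inline Stream& default_stream() {
  return gpu_enabled.load() ? gpu_stream : cpu_stream;
}

} // namespace toy
```

A worker thread that grabbed `default_stream()` while the GPU was enabled keeps submitting to the GPU stream after the flag flips, which is exactly the per-task-stream scenario (e.g. a generation_stream) described above.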

@angeloskath (Member) commented

@louen how do you feel about removing the flag and merging this?
