Skip to content

Fix ASAN issues in sharness tests#7538

Open
chu11 wants to merge 8 commits into
flux-framework:masterfrom
chu11:issue7537_asan_checks
Open

Fix ASAN issues in sharness tests#7538
chu11 wants to merge 8 commits into
flux-framework:masterfrom
chu11:issue7537_asan_checks

Conversation

@chu11

@chu11 chu11 commented Apr 14, 2026

Copy link
Copy Markdown
Member

Fix various ASAN failures in sharness tests. Most fixes are by adding NO_ASAN to tests, in which ASAN leads to unexpected behavior changes. One use-after-free was corrected in libterminus. (Edit: Changed my mind, will move that fix to #7539)

Note that a number of hangs still exist in sharness with ASAN. Those are being dealt with separately.

@chu11 chu11 force-pushed the issue7537_asan_checks branch from 11d4f72 to e84367c Compare April 14, 2026 22:27

@grondo grondo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! However, since the testsuite still doesn't run under ASan, I'd say keep the PR open and wait until we have a fully passing make check with ASan enabled, then title the PR "testsuite: allow entire testsuite to run with address sanitizers enabled"

grep "size=2 nodes=2" flux-${id4}.out &&
grep "size=2 nodes=2" flux-${id5}.out
'
test_expect_success MULTICORE 'flux batch: exclusive flag worked' '

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message is difficult to parse: perhaps

t: set NO_ASAN on tests that depend on ASAN-skipped tests

Comment thread t/t0001-basic.t
grep "^test two commands" help2.out &&
grep "a test two" help2.out
'
# N.B. Set NO_ASAN, as there appears to be errors in 'man', which is executed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message suggestion: Remove interesting/quirky from problem description - "test specific reasons" is sufficient to fully describe the problem being fixed.

@grondo

grondo commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

Uhoh, the restart check is failing now that we're actually generating a restart dump from the last tag.

  FAIL: t2219-job-manager-restart.t 27 - successfully started from flux-v0.84.0.tar.bz2
  #	restart_flux /usr/src/t/job-manager/dumps/valid/flux-v0.84.0.tar.bz2

I highly doubt we've broken restarts since v0.84.0, so this is probably a testing issue.

@grondo

grondo commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

The test passed locally, so I'm not sure what went wrong. Unfortunately there was no detailed *.output file because the test doesn't obey FLUX_TESTS_LOGFILE. I've opened #7540 to rectify that (and maybe we'll get the same failure in that PR)

@chu11 chu11 force-pushed the issue7537_asan_checks branch from e84367c to 494c415 Compare April 21, 2026 20:23
@chu11 chu11 changed the title Fix various ASAN failures in sharness tests Fix ASAN issues in sharness tests Apr 21, 2026
@chu11 chu11 force-pushed the issue7537_asan_checks branch 2 times, most recently from 79d97fe to c5951f2 Compare April 21, 2026 20:32
@chu11

chu11 commented Apr 21, 2026

Copy link
Copy Markdown
Member Author

just re-pushed, this lot of fixes

  • skips sharness tests that failed under ASAN
  • skips / handle tests that hung under ASAN
  • skip / handle tests that would pass but result in a asan log file being generated
  • in general, make sure make check can pass sharness when ASAN is enabled

Note: exception with issue #7556, which we can workaround in CI / workflow we setup

Note that I did not setup asan CI workflow to run sharness. We can do that in another PR. Due to the fact ASAN takes a long time to run, it's possible we may want to debate if we want to run in CI or not. Or perhaps we want to architect it to run a unique way in CI (i.e. split it up into multiple runners so it collectively runs faster).

@chu11 chu11 force-pushed the issue7537_asan_checks branch 2 times, most recently from 2ff9148 to 892f8c6 Compare April 21, 2026 22:14
@chu11

chu11 commented Apr 23, 2026

Copy link
Copy Markdown
Member Author

side note, #7558 shows that asan on other systems (maybe other versions of libasan) may have troubles. So perhaps we should consider this PR one that fixes asan issues so that sharness can succeed on rhel8 w/ asan.

chu11 and others added 8 commits May 28, 2026 16:38
Problem: A test in t2714-python-cli-batch.t assumes a prior test
has been executed.  But that prior test will not run if ASAN is
configured.

Set NO_ASAN on the test to ensure it isn't executed like prior
dependent ones.
Problem: Tests in t0006-module-exec.t and t2614-job-shell-doom.t
test how segfaults are reported.  Under ASAN, segfault signals may be
handled / reported differently.

Skip tests that expect a specific message when a segfault occurs.  Under
ASAN, that specific message cannot be expected.
Problem: A test in t0006-module-exec.t and t3306-system-routercrash.t
simulate a segfault by sending a SIGSEGV signal to crash a module / broker.
With ASAN, the signal could be captured, the expected result may not happen,
and an asan log will be generated indicating a segfault happened.  Follow up
tests depend on the SIGSEGV crashing a module / broker, so we can't just skip
the test.  These tests are specifically covering SIGSEGV, so we don't want to
just change the signal.

Under ASAN, instead send a SIGKILL to crash the broker / module.  This will
ensure follow on tests continue to work as expected under ASAN.  We will still
get ample coverage of the SIGSEGV case under non-ASAN workflows.
Problem: Several tests are skipped when ASAN is enabled, but they
have no comments / explanation why.

Add comments explaining that the tests in t0005-exec.t are skipped
because ASAN causes a segfault to be reported differently than we
normally would expect.  Tests in t0016-cron-faketime.t and
t3001-mpi-personalities.t are skipped because they change LD_PRELOAD.
Tests in t2714-python-cli-batch.t are skipped because of slowness.
Problem: Some tests file under ASAN due to problems with underlying
commands.  This can be worked around if ASAN was not set on those
specific commands.

Add a no-asan-wrapper.sh script.  The simple script will unset
LD_PRELOAD and ASAN_OPTIONS before running a command.
Problem: The 'ps' and 'pkill' command hangs when ASAN is enabled.

Run 'ps' and 'pkill' under the no-asan script.
Problem: The 'dd' command has shown to have problems when run under
asan.

Wrap all calls to 'dd' under the no-asan-wrapper.sh script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Problem: A few tests fail or hang under ASAN for test specific reasons.

Set NO_ASAN on those tests.
@chu11 chu11 force-pushed the issue7537_asan_checks branch from 892f8c6 to c52774f Compare May 28, 2026 23:39
@chu11

chu11 commented May 28, 2026

Copy link
Copy Markdown
Member Author

re-pushed with a lot more fixes, most notably to make sharness pass on fedora40 with asan, not counting the python tests that fail with #7624.

The big change is the addition of a new util script no-asan-wrapper.sh.

unset LD_PRELOAD
unset ASAN_OPTIONS

"$@"

And for problematic commands, we call $NOASAN ps or $NOASAN dd.

I went back on forth if I should replace dd with $NOASAN dd or if I should make a global var at the top of test files and do DD="$NOASAN dd" and replace with $DD.

I settled on doing "$NOASAN dd". It was partially influenced by the fact I didn't want to make a mountain of PS="$NOASAN ps", etc. variants everywhere. But could see others want to go with the other style instead. Happy to change based on the collective option of reviewers.

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.01%. Comparing base (f85f01f) to head (c52774f).
⚠️ Report is 81 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7538      +/-   ##
==========================================
+ Coverage   83.99%   84.01%   +0.02%     
==========================================
  Files         576      576              
  Lines       97215    97215              
==========================================
+ Hits        81651    81678      +27     
+ Misses      15564    15537      -27     

see 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants