This is an archive of the discontinued LLVM Phabricator instance.

[asan] Reduce flakiness of heavy recovery mode tests.
ClosedPublic

Authored by m.ostapenko on Jul 28 2016, 9:10 AM.

Details

Summary

People keep complain about spurious fails on heavy halt_on_error* tests. After r275539 I've managed to reproduce a failure on halt_on_error-torture.cc, reported by Kostya:

/home/max/build/llvm/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -fsanitize-recover=address -pthread /home/max/src/llvm/projects/compiler-rt/test/asan/TestCases/Posix/halt_on_error-torture.cc -o /home/max/build/llvm/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Posix/Output/halt_on_error-torture.cc.tmp
env ASAN_OPTIONS=halt_on_error=false:suppress_equal_pcs=false /home/max/build/llvm/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Posix/Output/halt_on_error-torture.cc.tmp 10 20 >10.txt 2>&1 || true
FileCheck --check-prefix=CHECK-COLLISION /home/max/src/llvm/projects/compiler-rt/test/asan/TestCases/Posix/halt_on_error-torture.cc < 10.txt || FileCheck --check-prefix=CHECK-NO-COLLISION /home/max/src/llvm/projects/compiler-rt/test/asan/TestCases/Posix/halt_on_error-torture.cc < 10.txt

-- 
Exit Code: 1

Command Output (stderr):
-- 
/home/max/src/llvm/projects/compiler-rt/test/asan/TestCases/Posix/halt_on_error-torture.cc:76:22: error: expected string not found in input
 // CHECK-COLLISION: AddressSanitizer: nested bug in the same thread, aborting
                     ^
<stdin>:1:1: note: scanning from here
=================================================================
^
<stdin>:22:10: note: possible intended match here
SUMMARY: AddressSanitizer: use-after-poison /home/max/src/llvm/projects/compiler-rt/test/asan/TestCases/Posix/halt_on_error-torture.cc:45:14 in run(void*)
         ^
/home/max/src/llvm/projects/compiler-rt/test/asan/TestCases/Posix/halt_on_error-torture.cc:77:25: error: expected string not found in input
 // CHECK-NO-COLLISION: All threads terminated
                        ^
<stdin>:1:1: note: scanning from here
=================================================================
^
<stdin>:3:31: note: possible intended match here
WRITE of size 1 at 0x7f36b00fedb0 thread T1
                              ^

-- 

********************

Here what I found out:
When we run halt_on_error-torture.cc with 10 threads and 20 iterations with halt_on_error=false:suppress_equal_pcs=false, we write 200 reports to 10.txt file and sometimes have collisions. We have CHECK-COLLISION check that greps 'AddressSanitizer: nested bug in the same thread, aborting' message in 10.txt, but for some reason it doesn't contain this line. If I don't redirect stderr > 10.txt 'AddressSanitizer: nested bug in the same thread, aborting' is printed to my screen as expected.

This happens because we hit on race in WriteToFile function called from ScopedInErrorReport constructor:

u32 current_tid = GetCurrentTidOrInvalid();
if (reporting_thread_tid_ == current_tid ||
    reporting_thread_tid_ == kInvalidTid) {
  // This is either asynch signal or nested error during error reporting.
  // Fail simple to avoid deadlocks in Report().

  // Can't use Report() here because of potential deadlocks
  // in nested signal handlers.
  const char msg[] = "AddressSanitizer: nested bug in the same thread, "
                     "aborting.\n";
  WriteToFile(kStderrFd, msg, sizeof(msg));

  internal__exit(common_flags()->exitcode);
}

Here we have concurrent write of "AddressSanitizer: nested bug in the same thread, aborting.\n" message in one thread and report writing from another thread. There's no guarantee of write operation atomicity unless we open corresponding file with O_APPEND flag in Unix, thus the "AddressSanitizer: nested bug in the same thread, aborting.\n" message might be missed that leads to test failure if collision occurred.

Ideally, this race in ScopedInErrorReport should be eliminated, but for now we can fix heavy recovery mode tests by implicitly setting O_APPEND for opened files (use >> instead of > for stderr redirection).

Diff Detail

Repository
rL LLVM

Event Timeline

m.ostapenko retitled this revision from to [asan] Reduce flakiness of heavy recovery mode tests..
m.ostapenko updated this object.
m.ostapenko added reviewers: kcc, eugenis, dvyukov.
m.ostapenko set the repository for this revision to rL LLVM.
m.ostapenko added a project: Restricted Project.
m.ostapenko added subscribers: llvm-commits, ygribov.

There's no guarantee of write operation atomicity unless we open corresponding file with O_APPEND flag in Unix

Note that this behavior depends on particular kernel version and filesystem in use e.g. we were unable to repro with ext4 on fresh kernel.

Ideally, this race in ScopedInErrorReport should be eliminated

I tried to do this once but failed. Note that race only happens when recoverable mode is enabled and two threads try to print error message at the same time. In that case the race may happen but won't cause any error - it'll just terminate execution with informative message.

m.ostapenko added a comment.EditedJul 28 2016, 9:51 AM

Ideally, this race in ScopedInErrorReport should be eliminated

I tried to do this once but failed. Note that race only happens when recoverable mode is enabled and two threads try to print error message at the same time. In that case the race may happen but won't cause any error - it'll just terminate execution with informative message.

Yeah. Also we have suppress_equal_pcs=true set by default that reduces race probability.

eugenis edited edge metadata.Jul 28 2016, 1:33 PM

Is this the same write() atomicity issue as in test/msan/fork.cc ?
Does piping the output directly to FileCheck help?

Is this the same write() atomicity issue as in test/msan/fork.cc ?

Yeah, exactly the same.

Does piping the output directly to FileCheck help?

I'm afraid we can't use pipe here, because we have 2 possible outputs:

// RUN: FileCheck --check-prefix=CHECK-COLLISION %s < 10.txt || FileCheck --check-prefix=CHECK-NO-COLLISION %s < 10.txt
.....
  // CHECK-COLLISION: AddressSanitizer: nested bug in the same thread, aborting
  // CHECK-NO-COLLISION: All threads terminated
eugenis accepted this revision.Jul 29 2016, 11:02 AM
eugenis edited edge metadata.

LGTM
thanks!

This revision is now accepted and ready to land.Jul 29 2016, 11:02 AM
This revision was automatically updated to reflect the committed changes.