This is an archive of the discontinued LLVM Phabricator instance.

tsan: add new trace
ClosedPublic

Authored by dvyukov on Aug 11 2021, 9:22 AM.

Details

Summary

Add structures for the new trace format,
functions that serialize and add events to the trace
and trace replaying logic.

Depends on D107910.

Diff Detail

Event Timeline

dvyukov created this revision.Aug 11 2021, 9:22 AM
dvyukov requested review of this revision.Aug 11 2021, 9:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 11 2021, 9:22 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript

I want to add few more tests, but the rest of the code is ready. So I mailed it to get any high-level feedback sooner.

It's pretty large, but it's the minimal amount of code that's testable end-to-end.

(review incomplete, but this is what I have so far)

compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
583

clang-tidy: please fix

684

clang-tidy: please fix

compiler-rt/lib/tsan/rtl/tsan_trace.h
73

You could make this an 'enum class EventType : u64'.

And then the bitfield below could actually be:

EventType type : 3;

because bitfield works for enum.

And the enum values could simply be s/EventType/k/.

Concrete suggestion: https://github.com/melver/llvm-project/commit/6a71c35bbf660788899e3302bedd2748113fb153

Note:

  • This found some switch statements didn't handle all values (unsure if intentional)
  • The test's lookup of EvenType was getting confused because one was not prefixed with v3::
dvyukov updated this revision to Diff 366059.Aug 12 2021, 11:43 AM
dvyukov marked 2 inline comments as done.

fix review comments

dvyukov added inline comments.Aug 12 2021, 11:45 AM
compiler-rt/lib/tsan/rtl/tsan_trace.h
73

Amended. Thanks.

I thought this will increase bitfield size:
EventType type : 3;
but apparently it doesn't.

This found some switch statements didn't handle all values (unsure if intentional)

Yes, this was intentional (there is nothing to do in that case), but I added it to make it explicit.

vitalybuka accepted this revision.Aug 12 2021, 1:37 PM
vitalybuka added 1 blocking reviewer(s): melver.
melver accepted this revision.Aug 13 2021, 12:55 AM
melver added inline comments.
compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
569

Could use kSizeLogX.

-  u64 size_log = size == 1 ? 0 : size == 2 ? 1 : size == 4 ? 2 : 3;
+  const u64 size_log = size == 1   ? kSizeLog1
+                       : size == 2 ? kSizeLog2
+                       : size == 4 ? kSizeLog4
+                                   : kSizeLog8;

sanitizer_common also has Log2(), which uses __builtin_ctzll(). Could it be faster? However, it'd require turning its CHECK()s into DCHECKs().

625

Could use UNUSED bool res = ... and instead drop this?

634

Could use UNUSED?

compiler-rt/lib/tsan/rtl/tsan_rtl.h
916

Other code below uses reinterpret_cast<Event *>(atomic_load_relaxed(&thr->trace_pos)); -- also switch to reintepret_cast?

compiler-rt/lib/tsan/rtl/tsan_rtl_report.cpp
607–608

This comparison is repeated several times and is quite distracting from the rest of the logic. Could at least the check that the accesses are overlapping be pulled out e.g. into a constexpr bool IsWithinAccess(addr1, size2, addr2, size2)?

And just to verify, this is checking that access of addr|size is contained in ev_addr|ev_size (and not just overlapping)?

654

Brief comment why this is also unhandled.

This revision is now accepted and ready to land.Aug 13 2021, 12:55 AM
dvyukov updated this revision to Diff 366306.Aug 13 2021, 10:09 AM
dvyukov marked 5 inline comments as done.

addressed comments

dvyukov added inline comments.Aug 13 2021, 10:12 AM
compiler-rt/lib/tsan/rtl/tsan_rtl.cpp
569

size is constant in all performance-critical callers, so this is as fast as it can be.
The new runtime removes kSizeLog* consts, I think this is the only place that still uses logs, so I would not add more uses.
But I will try to add some tests for precise size handling instead.

compiler-rt/lib/tsan/rtl/tsan_rtl_report.cpp
607–608

Done. I've added IsWithinAccess.

Re contained vs overlapping, there was a comment in the case above (I just did not duplicate it 3 times):

// Note: addr/size must be fully contained in ev_addr/ev_size
// because a memory access is always traced once, but can be
// split into multiple accesses in shadow.

But now it's on IsWithinAccess.

dvyukov updated this revision to Diff 366333.Aug 13 2021, 12:04 PM

add more tests

dvyukov updated this revision to Diff 366421.Aug 14 2021, 3:24 AM

add more tests

I've added more tests. I will submit on Monday.

This revision was landed with ongoing or failed builds.Aug 16 2021, 1:24 AM
Closed by commit rGc97318996fc1: tsan: add new trace (authored by dvyukov). · Explain Why
This revision was automatically updated to reflect the committed changes.
dyung added a subscriber: dyung.Aug 16 2021, 3:09 AM

Hi, I believe that your change is causing a build failure on one of the bots, can you please take a look?

https://lab.llvm.org/buildbot/#/builders/37/builds/6103


In file included from ../rtl/tsan_rtl.h:42,
from tsan_go.cpp:13:
../rtl/tsan_trace.h:90:20: error: ‘__tsan::v3::Event::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3;
^
../rtl/tsan_trace.h:124:20: error: ‘__tsan::v3::EventAccessExt::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3; // = EventType::kAccessExt
^
../rtl/tsan_trace.h:140:20: error: ‘__tsan::v3::EventAccessRange::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3; // = EventType::kAccessRange
^
../rtl/tsan_trace.h:156:20: error: ‘__tsan::v3::EventLock::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3; // = EventType::kLock or EventType::kRLock
^
../rtl/tsan_trace.h:169:20: error: ‘__tsan::v3::EventUnlock::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3; // = EventType::kUnlock
^
../rtl/tsan_trace.h:179:20: error: ‘__tsan::v3::EventTime::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3; // = EventType::kTime
^

Hi, I believe that your change is causing a build failure on one of the bots, can you please take a look?

https://lab.llvm.org/buildbot/#/builders/37/builds/6103


In file included from ../rtl/tsan_rtl.h:42,
from tsan_go.cpp:13:
../rtl/tsan_trace.h:90:20: error: ‘__tsan::v3::Event::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3;
^

I think GCC version 8.3 is bad (which the build bot uses). GCC 8.4 seems to have fixed the issue: https://godbolt.org/z/66xYK1xMf

FTR, one bot failed with:

/home/buildbots/ppc64le-sanitizer/sanitizer-ppc64le/build/llvm-project/compiler-rt/lib/tsan/tests/unit/tsan_trace_test.cpp:153:12: error: unused variable 'res' [-Werror,-Wunused-variable]
    bool res = RestoreStack(thr->tid, v3::EventType::kLock, thr->sid,
         ^
/home/buildbots/ppc64le-sanitizer/sanitizer-ppc64le/build/llvm-project/compiler-rt/lib/tsan/tests/unit/tsan_trace_test.cpp:194:12: error: unused variable 'res' [-Werror,-Wunused-variable]
    bool res =

https://lab.llvm.org/buildbot/#/builders/19/builds/6003/steps/24/logs/stdio

I've sent https://reviews.llvm.org/D108118 to fix this.

In builds with the -Werror -Wmissing-field-initializers option(s) enabled, the following instantiation will cause an error.

Thread::Params tests[] = {
    {1, 0, 1, true},  {4, 0, 2, true},
    {4, 2, 2, true},  {8, 3, 1, true},
    {2, 1, 1, true},  {1, 1, 1, false},
    {8, 5, 4, false}, {4, static_cast<uptr>(-1l), 4, false},
};

Could the patch be updated to explicitly initialize the last int field in the Thread::Params struct?

quinnp added a subscriber: quinnp.EditedAug 16 2021, 11:36 AM

This patch is causing some failures on the clang-ppc64le-rhel buildbot. Here is the failing build. Specifically in the compiler-rt/lib/tsan/tests/unit/tsan_trace_test.cpp test case:

  • missing field 'type' initializer from line 128 to line 131

Hi, I believe that your change is causing a build failure on one of the bots, can you please take a look?

https://lab.llvm.org/buildbot/#/builders/37/builds/6103


In file included from ../rtl/tsan_rtl.h:42,
from tsan_go.cpp:13:
../rtl/tsan_trace.h:90:20: error: ‘__tsan::v3::Event::type’ is too small to hold all values of ‘enum class __tsan::v3::EventType’ [-Werror]
EventType type : 3;
^

I think GCC version 8.3 is bad (which the build bot uses). GCC 8.4 seems to have fixed the issue: https://godbolt.org/z/66xYK1xMf

That may be the case, but unless GCC 8.3 is an unsupported compiler (I don't believe it is), can we get a fix, a workaround or get this change reverted? It is causing the bot I linked to be red, and also one of our internal bots are failing because of this problem.

I think GCC version 8.3 is bad (which the build bot uses). GCC 8.4 seems to have fixed the issue: https://godbolt.org/z/66xYK1xMf

That may be the case, but unless GCC 8.3 is an unsupported compiler (I don't believe it is), can we get a fix, a workaround or get this change reverted? It is causing the bot I linked to be red, and also one of our internal bots are failing because of this problem.

This is not breaking any default builds. This is only failing because Werror is on. By default it isn't. It can be disabled with -DCOMPILER_RT_ENABLE_WERROR=OFF, which I'd recommend for the bots still using GCC 8.3, given we know it's buggy.

I think GCC version 8.3 is bad (which the build bot uses). GCC 8.4 seems to have fixed the issue: https://godbolt.org/z/66xYK1xMf

That may be the case, but unless GCC 8.3 is an unsupported compiler (I don't believe it is), can we get a fix, a workaround or get this change reverted? It is causing the bot I linked to be red, and also one of our internal bots are failing because of this problem.

This is not breaking any default builds. This is only failing because Werror is on. By default it isn't. It can be disabled with -DCOMPILER_RT_ENABLE_WERROR=OFF, which I'd recommend for the bots still using GCC 8.3, given we know it's buggy.

We probably should not leave it to the user

Simple change should fix it.

struct Event {
  u64 type : 3;

  EventType GetType() const {
    return static_cast<EventType>(type);
  }
};

@dvyukov To avoid revert I've done in commit above. Feel free to change if you have a better solution.

Simple change should fix it.

struct Event {
  u64 type : 3;

  EventType GetType() const {
    return static_cast<EventType>(type);
  }
};

We probably should not leave it to the user

Simple change should fix it.

struct Event {
  u64 type : 3;

  EventType GetType() const {
    return static_cast<EventType>(type);
  }
};

This seems simple enough, but still would be nice if the code doesn't need to suffer too much because of known buggy compilers. Would it be reasonable to force Werror off for known bad compiler versions? In this case GCC 9.2 or earlier (except 8.4 or later from 8 series which also include the fix).

dyung added a comment.Aug 16 2021, 4:55 PM

@dvyukov To avoid revert I've done in commit above. Feel free to change if you have a better solution.

Simple change should fix it.

struct Event {
  u64 type : 3;

  EventType GetType() const {
    return static_cast<EventType>(type);
  }
};

This doesn't seem to fix the problem on our internal bot (and likely also the upstream one).

Interestingly, @melver's solution of -DCOMPILER_RT_ENABLE_WERROR=OFF doesn't work either.

It would appear that -Werror is being hardcoded in line 62 of compiler-rt/lib/tsan/go/buildgo.sh. If I remove that, I can get the compilation to "succeed".

I'll put up that workaround shortly.

This patch still fails on the PPC64LE-RHEL Bot (https://lab.llvm.org/buildbot/#/builders/57/builds/9429). Could we please revert this patch or add a follow up fix to it? (The fix should be to explicitly initialize all fields in the struct within the initializer list (in lines 128 - 131) of the Thread::Params struct)

This patch still fails on the PPC64LE-RHEL Bot (https://lab.llvm.org/buildbot/#/builders/57/builds/9429). Could we please revert this patch or add a follow up fix to it? (The fix should be to explicitly initialize all fields in the struct within the initializer list (in lines 128 - 131) of the Thread::Params struct)

Resolved by https://reviews.llvm.org/D108207.

We probably should not leave it to the user

Simple change should fix it.

struct Event {
  u64 type : 3;

  EventType GetType() const {
    return static_cast<EventType>(type);
  }
};

This seems simple enough, but still would be nice if the code doesn't need to suffer too much because of known buggy compilers. Would it be reasonable to force Werror off for known bad compiler versions? In this case GCC 9.2 or earlier (except 8.4 or later from 8 series which also include the fix).

I will have to respectfully disagree here. "Buggy compilers" are a fact of life. We are all developers on a software project like any other and just as developers on other projects sometimes have to put in code workarounds for bugs in Clang (which we develop), we too have to do the same for bugs in compilers we support for building our project.

Turning on -Werror is a very useful thing and any suggestion to turn it off for bots that turn it on needs very solid justification. There have been many subtle issues identified by bots that use this option and turning it off completely just because of one issue is ill-advised IMHO. Not to mention that as @dyung mentioned, it doesn't fix the problem.

While I can certainly sympathize with not wanting to put in code to work around bugs that are not yours, I am afraid that this is something we all have to do sometimes.

All in all, if there is no obvious immediate fix that will bring the bot back to green, I recommend pulling this patch and any follow-up fixes that depend on it. Once you think you have a complete fixed patch, you can post it here and we'll be happy to test it ahead of the commit on the very machine that the bot runs on.

The newly added unit tests are failing on Darwin. Can you please look into it? Would it possible to revert this until the issue is resolved?

TsanUnitTest-x86_64-Test --gtest_filter=Trace.MemoryAccessSize
--
Note: Google Test filter = Trace.MemoryAccessSize
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Trace
[ RUN      ] Trace.MemoryAccessSize
access_size=1, offset=0, size=1, res=1, type=0
ThreadSanitizer:DEADLYSIGNAL
==49176==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000000f8 (pc 0x00010a190c52 bp 0x7e8000080b70 sp 0x7e8000080b00 T570136)
==49176==The signal is caused by a READ memory access.
==49176==Hint: address points to the zero page.
    #0 __tsan::v3::TraceSwitchPart(__tsan::ThreadState*) tsan_rtl.cpp:682 (TsanUnitTest-x86_64-Test:x86_64+0x1000bcc52)
    #1 __tsan::v3::TraceFunc(__tsan::ThreadState*, unsigned long) tsan_rtl.cpp:630 (TsanUnitTest-x86_64-Test:x86_64+0x1000bd2b1)
    #2 __tsan::Trace_MemoryAccessSize_Test::TestBody()::Thread::Func(void*) tsan_trace_test.cpp:93 (TsanUnitTest-x86_64-Test:x86_64+0x10003fe77)
    #3 _pthread_start <null>:2 (libsystem_pthread.dylib:x86_64+0x6108)
    #4 thread_start <null>:2 (libsystem_pthread.dylib:x86_64+0x1b8a)

==49176==Register values:
rax = 0x0000000000000000  rbx = 0x0000000000000000  rcx = 0x0000000000000008  rdx = 0x0000000000000000  
rdi = 0x000000010ea8e000  rsi = 0x0000000000001000  rbp = 0x00007e8000080b70  rsp = 0x00007e8000080b00  
 r8 = 0x0000000063000000   r9 = 0x0000000000000000  r10 = 0x000000010ea8e000  r11 = 0x0000000000000206  
r12 = 0x0000000000000000  r13 = 0x0000000000000000  r14 = 0x0000000000001000  r15 = 0x0000000000000000  
ThreadSanitizer can not provide additional info.
SUMMARY: ThreadSanitizer: SEGV tsan_rtl.cpp:682 in __tsan::v3::TraceSwitchPart(__tsan::ThreadState*)
==49176==ABORTING

Refer to https://green.lab.llvm.org/green/job/clang-stage1-RA/23374/testReport/junit/ThreadSanitizer-Unit/.

The newly added unit tests are failing on Darwin. Can you please look into it? Would it possible to revert this until the issue is resolved?

Refer to https://green.lab.llvm.org/green/job/clang-stage1-RA/23374/testReport/junit/ThreadSanitizer-Unit/.

The test and the code is new, and the only user of the code is the test. So I think it's fair to just disable the test on Darwin until Dmitry has a chance to look at it. I'll try and do this later today.