This is an archive of the discontinued LLVM Phabricator instance.

[XRay][compiler-rt] Use pthread for initializing thread-local data
ClosedPublic

Authored by dberris on Sep 20 2017, 12:23 AM.

Details

Summary

We avoid using C++11's thread_local keyword on non-trivially
destructible objects because it may introduce deadlocks when the C++
runtime registers destructors calling std::malloc(...). The deadlock may
happen when the allocator implementation is itself XRay instrumented.

To avoid having to call malloc(...) and free(...) in particular, we use
pthread_once, pthread_create_key, and pthread_setspecific to instead
manually register the cleanup implementation we want.

The code this replaces used an RAII type that implements the cleanup
functionality in the destructor, that was then initialized as a
function-local thread_local object. While it works in usual situations,
unfortunately it breaks when using a malloc implementation that itself
is XRay-instrumented.

Event Timeline

dberris created this revision.Sep 20 2017, 12:23 AM
dblaikie added inline comments.Sep 20 2017, 7:41 AM
lib/xray/xray_fdr_logging_impl.h
110

Presumably this should use LLVM_ALIGNAS?

Is this related to the rest of the patch, if so, how? Is it for correctness or performance?

143

"so document the semantics that we want to achieve here" seems weird - that sounds like a patch note or feature request (& somewhat redundant, it feels like - the reader can probably see that the semantics are documented below), rather than a statement of fact.

"the semantics are documented below" or maybe "The following semantics are intended:" or skip this comment entirely? Not sure.

147

"strive to" may be removed if this is a hard requirement? "should not call ... "

148–150

Maybe remove the personalization (here and in the following)? "This means functions like malloc() and free() should not be called while instrumenting"? Though I think the first sentence probably already says that?

173–181

Maybe this goes into more detail than needed if the XRay runtime in general shouldn't be using the standard library for all these reasons - that's a general requirement that wouldn't need to be rejustified at every use case/implementation detail?

187–188

Seems the goal in general was/is to move away from C++ standard library usage in the runtime library, right? Is there an attempt to pick and choose which parts can be relied upon, or is the overall goal to actually remove C++ standard library usage entirely, like other parts of compiler-rt?

If not, then maybe std::aligned_union<1, ThreadLocalData> might be marginally simpler here, but only marginally. (& I wouldn't say it's strictly better - since a union of only one thing is a bit strange)

dberris updated this revision to Diff 116128.Sep 20 2017, 6:14 PM
dberris marked 5 inline comments as done.
  • fixup: address comments

Thanks for the review!

lib/xray/xray_fdr_logging_impl.h
110

It's related in that we'd like to use aligned uninitialised storage to store one of these objects. We'd the buffer to not only be correctly sized but also aligned correctly, since we're going to interpret the buffer's address as a different type (punning? aliasing? I can never get the right terminology for that). The performance benefit is a side-effect of the correctness issue (since we have non-trivially destructible things in it).

BTW, I can't use LLVM_ALIGNAS because that seems to be undefined in compiler-rt. :/

143

I like skipping it. Stream of consciousness comments at the end of the day don't come out very well. :)

173–181

This is specifically for the function, since the original implementation (and the potentially more obvious solution) uses a thread-exit guard type (i.e. RAII but using a thread-local). Other parts of the implementation have avoided using memory allocation routines because of performance reasons but not entirely for correctness.

187–188

I think I should document why we're using a thread-local uninitialized memory at all in this situation. :)

First, the type ThreadLocalData isn't trivially destructable. If this was a thread_local object, we end up with the same problem we had in the first place -- having a thread_local object with a non-trivial destructor may cause the C++ runtime to call std::malloc(...) and introduce a deadlock when initialising the implementation as well as when destroying the object at thread-exit time.

Second, having uninitialised buffer space makes the initialisation of the memory trivial. We do want uninitialised bytes because we don't need to zero out the bytes, since we're going to initialize it once using placement new anyway.

So unfortunately, this is the solution that works and fits within the requirements we laid out. Updated with a bit more documentation on specifically why we're using an uninitialised aligned buffer. :)

Removing the dependency on standard C++ library things that have runtime issues I think is a good goal, but is orthogonal to this particular change (since really we're just avoiding the use of thread_local non-trivially destructible objects).

dberris updated this revision to Diff 116912.Sep 27 2017, 5:35 PM
  • fixup: address comments
pelikan accepted this revision.Oct 2 2017, 10:11 PM
pelikan added inline comments.
lib/xray/xray_fdr_logging_impl.h
193

I reckon this should get the UNUSED sanitizer_common macro, which expands to __attribute__((unused)) to stop the compiler whinging about it.

Plus, CodingStandards.rst wants it to be called "Unused" (same above - Key).

222

This line can go away with the UNUSED macro up there.

This revision is now accepted and ready to land.Oct 2 2017, 10:11 PM
pelikan added a comment.EditedOct 2 2017, 10:22 PM

Technically, the pthread functions can return EAGAIN or ENOMEM but it (ehm) "should not happen in this case". Really. I mean, come on!

Maybe a Report() in that case is sufficient, to explain any weirdness in behaviour.

EDIT: Actually, EAGAIN from pthread_key_create(3) is *per process*, so we can run out of them.

dberris updated this revision to Diff 117472.Oct 2 2017, 11:11 PM
  • Rebase
    • fixup: use UNUSED, restore alignment for cache-line friendliness, and report on errors found when pthread_create_key fails
This revision was automatically updated to reflect the committed changes.

Hi,

This change appears to be causing three test failures on several buildbots:

FAIL: XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.Apply (36360 of 36408)
******************** TEST 'XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.Apply' FAILED ********************
Note: Google Test filter = BufferQueueTest.Apply
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BufferQueueTest
[ RUN      ] BufferQueueTest.Apply
/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/projects/compiler-rt/lib/xray/tests/unit/buffer_queue_test.cc:107: Failure
      Expected: Buffers.getBuffer(B)
      Which is: 4-byte object <01-00 00-00>
To be equal to: BufferQueue::ErrorCode::Ok
      Which is: 4-byte object <00-00 00-00>
[  FAILED  ] BufferQueueTest.Apply (0 ms)
[----------] 1 test from BufferQueueTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] BufferQueueTest.Apply

 1 FAILED TEST

********************
PASS: XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.ErrorsWhenFinalising (36361 of 36408)
FAIL: XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.GetAndRelease (36362 of 36408)
******************** TEST 'XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.GetAndRelease' FAILED ********************
Note: Google Test filter = BufferQueueTest.GetAndRelease
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BufferQueueTest
[ RUN      ] BufferQueueTest.GetAndRelease
/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/projects/compiler-rt/lib/xray/tests/unit/buffer_queue_test.cc:36: Failure
      Expected: Buffers.releaseBuffer(Buf)
      Which is: 4-byte object <01-00 00-00>
To be equal to: BufferQueue::ErrorCode::Ok
      Which is: 4-byte object <00-00 00-00>
[  FAILED  ] BufferQueueTest.GetAndRelease (0 ms)
[----------] 1 test from BufferQueueTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] BufferQueueTest.GetAndRelease

 1 FAILED TEST

********************
PASS: XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.API (36363 of 36408)
FAIL: XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.GetUntilFailed (36364 of 36408)
******************** TEST 'XRay-Unit :: unit/./XRayBufferQueueTest-powerpc64le-Test/BufferQueueTest.GetUntilFailed' FAILED ********************
Note: Google Test filter = BufferQueueTest.GetUntilFailed
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BufferQueueTest
[ RUN      ] BufferQueueTest.GetUntilFailed
/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/projects/compiler-rt/lib/xray/tests/unit/buffer_queue_test.cc:48: Failure
      Expected: Buffers.releaseBuffer(Buf0)
      Which is: 4-byte object <01-00 00-00>
To be equal to: BufferQueue::ErrorCode::Ok
      Which is: 4-byte object <00-00 00-00>
[  FAILED  ] BufferQueueTest.GetUntilFailed (0 ms)
[----------] 1 test from BufferQueueTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] BufferQueueTest.GetUntilFailed

 1 FAILED TEST

********************

That's very interesting. I can't reproduce this locally though (and I suspect it's related to the other change, rL314766).