This is an archive of the discontinued LLVM Phabricator instance.

[XRay][compiler-rt] XRay Buffer Queue
ClosedPublic

Authored by dberris on Nov 2 2016, 12:01 AM.

Details

Summary

This implements a simple buffer queue to manage a pre-allocated pool of
fixed-size buffers for holding XRay records. We need this to support
Flight Data Recorder (FDR) mode. We also implement this as a sub-library
first, to allow it to be developed and tested before it is actually used
in an implementation.

Some important properties of the buffer queue:

  • Thread-safe enqueueing/dequeueing of fixed-size buffers.
  • Pre-allocation of buffers at construction.
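
(For illustration, a minimal sketch of a queue with these properties. All names and signatures here are hypothetical, not the actual interface in lib/xray/xray_buffer_queue.h.)

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>

// Illustrative sketch only -- the real interface differs in detail.
class BufferQueueSketch {
public:
  struct Buffer {
    void *Data = nullptr;
    std::size_t Size = 0;
  };

  // Pre-allocates N buffers of Size bytes each at construction.
  BufferQueueSketch(std::size_t Size, std::size_t N);

  // Hands out a free buffer; returns 0 on success, a nonzero error
  // code once finalize() has been called or the queue is exhausted.
  int getBuffer(Buffer &Buf);

  // Returns a filled buffer to the back of the queue (FIFO reuse).
  int releaseBuffer(Buffer &Buf);

  // Denies all subsequent getBuffer() requests.
  int finalize();

private:
  std::mutex Mutex;
  std::deque<Buffer> Buffers;
  std::atomic<bool> Finalizing{false};
};
```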

Diff Detail

Repository
rL LLVM

Event Timeline

dberris updated this revision to Diff 76674.Nov 2 2016, 12:01 AM
dberris retitled this revision from to [XRay][compiler-rt] XRay Buffer Queue.
dberris updated this object.
dberris added reviewers: majnemer, rSerge, echristo.
dberris added a subscriber: llvm-commits.
dberris updated this revision to Diff 77460.Nov 10 2016, 2:10 AM
  • Add proper files for testing
rSerge added inline comments.Nov 15 2016, 8:06 AM
lib/xray/CMakeLists.txt
9 ↗(On Diff #77460)

Could there be a comment here explaining what FDR is?

lib/xray/tests/unit/buffer_queue_test.cc
20 ↗(On Diff #77460)

The man page at http://man7.org/linux/man-pages/man2/getpagesize.2.html says: "Portable applications should employ sysconf(_SC_PAGESIZE) instead of getpagesize()".
Also, AFAIK, page sizes can be large (gigabytes). Would this code still work then?
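
(For reference, the portable query the man page recommends looks like this; a sketch, not the test's actual code.)

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
  // sysconf is the portable spelling; getpagesize() is a BSD extension.
  long PageSize = sysconf(_SC_PAGESIZE);
  if (PageSize == -1)
    PageSize = 4096; // Fall back to a common default if the query fails.
  std::printf("page size: %ld bytes\n", PageSize);
}
```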

lib/xray/xray_buffer_queue.cc
24 ↗(On Diff #77460)

Why not do just one malloc and then hand out pointers at offsets into it?
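
(A sketch of the single-allocation layout being suggested here; illustrative only, and the reply below explains why the patch keeps per-buffer allocations instead.)

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// One allocation carved into N fixed-size buffers: a single point of
// failure and a single free() (of Ptrs.front()), but the buffers no
// longer have individual ownership or per-buffer failure checks.
std::vector<char *> carveBuffers(std::size_t BufferSize, std::size_t N) {
  std::vector<char *> Ptrs;
  char *Base = static_cast<char *>(std::malloc(BufferSize * N));
  if (Base == nullptr)
    return Ptrs; // The single allocation failed; no buffers at all.
  Ptrs.reserve(N);
  for (std::size_t I = 0; I < N; ++I)
    Ptrs.push_back(Base + I * BufferSize);
  return Ptrs;
}
```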

32 ↗(On Diff #77460)

Shouldn't this be checked inside the mutex lock?

46 ↗(On Diff #77460)

You are making it a queue, so the least recently used memory region is popped. This is bad for CPU cache. Why not make it a stack, so that the most recently used buffer is returned first?
It would only take changing this to push_front here... and perhaps renaming the data structure.

dberris updated this revision to Diff 78133.Nov 15 2016, 8:00 PM
dberris marked 4 inline comments as done.
  • Address review comments
dberris added inline comments.Nov 15 2016, 8:03 PM
lib/xray/tests/unit/buffer_queue_test.cc
20 ↗(On Diff #77460)

This is really just a test; the page size is a convenient way of getting large-ish values. :)

lib/xray/xray_buffer_queue.cc
24 ↗(On Diff #77460)

The idea is to have differentiated buffers that can each be treated as an individual chunk of memory. It also allows us to check for when any single malloc fails.

32 ↗(On Diff #77460)

Nope, Finalizing is atomic, and is already synchronised -- so we avoid locking the mutex when the BufferQueue is already finalizing.
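
(A sketch of the fast-path check being described; the shape is assumed, the actual code is in xray_buffer_queue.cc.)

```cpp
#include <atomic>
#include <mutex>

std::atomic<bool> Finalizing{false};
std::mutex Mutex;

bool getBufferSketch() {
  // Fast path: once finalization has begun, refuse without ever
  // touching the mutex.
  if (Finalizing.load(std::memory_order_acquire))
    return false;
  std::lock_guard<std::mutex> Guard(Mutex);
  // ... pop a free buffer from the deque under the lock ...
  return true;
}
```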

46 ↗(On Diff #77460)

The point of the queue is to ensure a temporal bound: we can keep around buffers that were filled earlier. We explicitly want to keep the data in those buffers, so that records of operations that happened in the past are preserved.

This is a key property that the flight data recorder mode described in the whitepaper actually requires. If we made this a stack, only the most recent operations would ever be kept (as opposed to a running log of what has happened in the past).
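
(A toy illustration of that temporal bound, with hypothetical numbers: FIFO reuse over a fixed set of slots always retains the most recently filled window of history.)

```cpp
#include <cstdio>
#include <deque>

int main() {
  // Four buffer slots reused FIFO: fill "epochs" 1..6, recycling the
  // front (oldest) buffer whenever no free slot remains.
  std::deque<int> Window; // The epoch currently stored in each buffer.
  for (int Epoch = 1; Epoch <= 6; ++Epoch) {
    if (Window.size() == 4)
      Window.pop_front(); // The oldest history is sacrificed first.
    Window.push_back(Epoch);
  }
  for (int E : Window)
    std::printf("%d ", E); // Prints "3 4 5 6": the recent window survives.
  std::printf("\n");
}
```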

rSerge added inline comments.Nov 16 2016, 11:28 AM
lib/xray/xray_buffer_queue.cc
24 ↗(On Diff #77460)

Ok, I see.

32 ↗(On Diff #77460)

What if finalize() is called between Finalizing.load(std::memory_order_acquire) and std::lock_guard<std::mutex> Guard(Mutex); here? The description for this data structure claims that getBuffer requests must be denied after finalize() is called, but this doesn't happen in this scenario. To enforce that invariant, the finalization flag must be set and read with mutex locked, so the flag itself doesn't have to be atomic (because the mutex provides the necessary acquire/release semantics).
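
(A sketch of the alternative being proposed here: a plain flag with mutex-only synchronisation. Names are illustrative.)

```cpp
#include <mutex>

std::mutex Mutex;
bool Finalizing = false; // Plain bool: only ever touched under Mutex.

bool getBufferSketch() {
  std::lock_guard<std::mutex> Guard(Mutex);
  if (Finalizing) // No window for finalize() to sneak in between the
    return false; // check and the queue operation.
  // ... pop a buffer under the same lock ...
  return true;
}

void finalizeSketch() {
  std::lock_guard<std::mutex> Guard(Mutex);
  Finalizing = true;
}
```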

46 ↗(On Diff #77460)

So is this data structure just dropping the least recently used data? I didn't get that initially.

dberris added inline comments.Nov 16 2016, 5:32 PM
lib/xray/xray_buffer_queue.cc
32 ↗(On Diff #77460)

What if finalize() is called between Finalizing.load(std::memory_order_acquire) and std::lock_guard<std::mutex> Guard(Mutex); here? The description for this data structure claims that getBuffer requests must be denied after finalize() is called, but this doesn't happen in this scenario.

But it does, right?

If one thread called getBuffer(...) just before another thread called finalize() then the ongoing getBuffer(...) should continue because finalize() had not technically been called yet when it started. We don't intend to make finalize() block on outstanding/ongoing getBuffer(...) calls (or the other way around).

How do I make the documentation on the function clear about this?

To enforce that invariant, the finalization flag must be set and read with mutex locked, so the flag itself doesn't have to be atomic (because the mutex provides the necessary acquire/release semantics).

On some platforms, std::mutex operations can be implemented much more heavily than an atomic bool (hence the avoidance of locking the mutex in the first place). I'm also wondering whether a relaxed load would suffice for the check (on x86 at least it would be faster), paired with a release store on the finalize() side.

At this point I'm trying to avoid having to manually implement a wait-free or lock-free circular buffer, but maybe that's what this needs to end up becoming. ;)

Let me think about it a little more.
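
(For the documentation question above, one possible wording of the guarantee, as a suggested doc comment rather than what was committed. The signature reuses the hypothetical BufferQueueSketch interface from the earlier sketch.)

```cpp
// Provides a buffer to write XRay records into.
//
// Guarantee: if a call to finalize() completes before a call to
// getBuffer() begins, the getBuffer() call fails. A getBuffer() call
// already in flight when finalize() starts may still succeed; neither
// function blocks waiting for the other.
int getBuffer(Buffer &Buf);
```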

rSerge added inline comments.Nov 17 2016, 10:47 AM
lib/xray/xray_buffer_queue.cc
32 ↗(On Diff #77460)

Have you considered http://en.cppreference.com/w/cpp/atomic/atomic_flag ? There is an example there of how to acquire and release it, so to get a "spin lock" you just need to spin more.
I am not sure whether "relaxed" would work, but you can avoid two acquire operations in a row (one for Finalizing, one for the mutex): just lock on the atomic flag, check whether we are finalizing via a normal variable, do the work, and release the spin lock. It should not spin much, because (if I understood correctly) the finalize operation is rare and getBuffer is not called from multiple threads concurrently.
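
(The cppreference pattern being referred to, adapted into a sketch: a single acquire operation guarding both the flag check and the queue work.)

```cpp
#include <atomic>

std::atomic_flag SpinFlag = ATOMIC_FLAG_INIT;
bool Finalizing = false; // Ordinary bool, guarded by the spin lock.

bool getBufferSpinSketch() {
  // Acquire the spin lock: one acquire operation instead of an
  // acquire load plus a mutex lock.
  while (SpinFlag.test_and_set(std::memory_order_acquire)) {
    // Spin until the holder releases the flag.
  }
  bool Denied = Finalizing;
  if (!Denied) {
    // ... pop a buffer from the queue ...
  }
  SpinFlag.clear(std::memory_order_release);
  return !Denied;
}
```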

dberris planned changes to this revision.Nov 17 2016, 4:11 PM

I'll be adding more representative use-case tests (making sure we're doing the right thing when functions are called in multiple threads).

lib/xray/xray_buffer_queue.cc
32 ↗(On Diff #77460)

Have you considered http://en.cppreference.com/w/cpp/atomic/atomic_flag ? There is an example there of how to acquire and release it, so to get a "spin lock" you just need to spin more.

I have considered atomic_flag, but see no advantage to using that instead of std::atomic<bool>. I also don't need to spin-lock here, I just need to make sure that as soon as finalize() is done, all future getBuffer() calls will fail reliably.

I am not sure whether "relaxed" would work, but you can avoid two acquire operations in a row (one for Finalizing, one for the mutex): just lock on the atomic flag, check whether we are finalizing via a normal variable, do the work, and release the spin lock. It should not spin much, because (if I understood correctly) the finalize operation is rare and getBuffer is not called from multiple threads concurrently.

getBuffer() is meant to be called from all threads that need a buffer (think of each thread writing to its own buffer), so this has to be synchronised appropriately. Now I could just use a spinlock, assuming that the pop_front() operation on a std::deque<...> is fast enough, but we would pay dearly if a thread were preempted in the middle of that critical section (i.e. while other threads are spinning).

I suppose I should add a set of tests to make sure we're fine here, let me do that next.

dberris updated this revision to Diff 78465.Nov 17 2016, 9:40 PM
  • Add a multi-threaded test
dberris updated this revision to Diff 78467.Nov 17 2016, 10:04 PM
  • Fix conditional on spin
dberris updated this revision to Diff 78687.Nov 20 2016, 9:51 PM
  • Make the launch policy explicitly async
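
(A sketch of what such a multi-threaded test can look like, using an explicitly async launch policy as in the update above. Names reuse the hypothetical BufferQueueSketch; this is not the test as committed.)

```cpp
#include <future>

#include "gtest/gtest.h"

TEST(BufferQueueSketch, ConcurrentGetAndFinalize) {
  BufferQueueSketch Buffers(4096, 10); // 10 buffers of 4096 bytes each.
  // std::launch::async forces a real thread instead of allowing the
  // runtime to defer the call until F.get().
  auto F = std::async(std::launch::async, [&] {
    BufferQueueSketch::Buffer Buf;
    return Buffers.getBuffer(Buf);
  });
  Buffers.finalize();
  // Either outcome is legal: the getBuffer() may have begun before or
  // after finalize(). The queue must simply never corrupt its state.
  (void)F.get();
}
```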
rSerge accepted this revision.Nov 24 2016, 4:11 AM
rSerge edited edge metadata.

LGTM

lib/xray/xray_buffer_queue.cc
32 ↗(On Diff #77460)

Ok, I see.

This revision is now accepted and ready to land.Nov 24 2016, 4:11 AM
This revision was automatically updated to reflect the committed changes.
dberris reopened this revision.Nov 24 2016, 8:05 PM

Broke the build on armv7 and aarch64.

This revision is now accepted and ready to land.Nov 24 2016, 8:05 PM
rSerge added a comment.EditedNov 25 2016, 10:14 AM

I forgot one argument in favor of a spinlock: to alleviate the problem of sudden long-running operations (such as thread preemption), after spinning N times (where N is chosen so that 95% of the time the guarded operation finishes faster), the spinlock acquisition code should call std::this_thread::yield(), so as not to consume CPU when other threads have real work to do. Even though the CPU load with this approach is higher than with syscall-based mutexes, that load is effectively at background priority. And the responsiveness of a spinlock is sub-nanosecond, while for a mutex it is on the order of a microsecond.
But see for yourself; perhaps it's not a good idea to optimize prematurely.

I forgot one argument in favor of a spinlock: to alleviate the problem of sudden long-running operations (such as thread preemption), after spinning N times (where N is chosen so that 95% of the time the guarded operation finishes faster), the spinlock acquisition code should call std::this_thread::yield(), so as not to consume CPU when other threads have real work to do. Even though the CPU load with this approach is higher than with syscall-based mutexes, that load is effectively at background priority. And the responsiveness of a spinlock is sub-nanosecond, while for a mutex it is on the order of a microsecond.
But see for yourself; perhaps it's not a good idea to optimize prematurely.

Yeah, I think we can do this later, once we have a better idea of the cost of the short critical section involved in updating the data structure (the deque). The spin lock should be fine with a yield after a few iterations, but I can't yet tell whether this will be worth optimising. :)
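
(For the record, the spin-then-yield acquisition described above could look roughly like this; a sketch, where N is an arbitrary tuning knob, not a measured value.)

```cpp
#include <atomic>
#include <thread>

std::atomic_flag SpinFlag = ATOMIC_FLAG_INIT;

void lockWithBackoff() {
  int Spins = 0;
  while (SpinFlag.test_and_set(std::memory_order_acquire)) {
    // After N fruitless iterations, assume the holder was preempted
    // and yield so threads with real work get the CPU.
    if (++Spins > 64) // N = 64 is an arbitrary placeholder, not tuned.
      std::this_thread::yield();
  }
}

void unlockSpin() { SpinFlag.clear(std::memory_order_release); }
```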

Trying to submit again now that we have some workarounds for the arm and aarch64 builds failing and being flaky.

This revision was automatically updated to reflect the committed changes.
rSerge added inline comments.Dec 6 2016, 12:08 PM
compiler-rt/trunk/lib/xray/tests/unit/buffer_queue_test.cc
28

Here and in the other places, ASSERT_NE(Buffers.getBuffer(Buf), 0) would give more information in case the test fails: the error code would make it into the report produced by gtest.
Forcing it to a boolean produces a less informative report:

Value of: Buffers.getBuffer(Buf)
Actual: true
Expected: false

http://lab.llvm.org:8011/builders/clang-native-aarch64-full/builds/100/steps/ninja%20check%202/logs/FAIL%3A%20XRay-Unit%3A%3ABufferQueueTest.GetAndRelease
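
(A sketch of the kind of value-comparing assertion being suggested. It assumes, hypothetically, that getBuffer() returns 0 on success and a nonzero error code otherwise; whether the right comparison is EQ or NE, and against which value, depends on the real error-code convention. The point is that comparing against a value makes gtest print the actual code instead of a bare true/false.)

```cpp
#include "gtest/gtest.h"

TEST(BufferQueueSketch, GetBufferReportsErrorCode) {
  BufferQueueSketch Buffers(4096, 10);
  BufferQueueSketch::Buffer Buf;
  // On failure, gtest reports the actual returned code, not just "true".
  ASSERT_EQ(Buffers.getBuffer(Buf), 0);
}
```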

dberris added inline comments.Dec 6 2016, 4:19 PM
compiler-rt/trunk/lib/xray/tests/unit/buffer_queue_test.cc
28

Good point -- I'll go improve these in a separate patch.

Thanks @rSerge!