This is an archive of the discontinued LLVM Phabricator instance.

[XRay][compiler-rt] Use a hand-written circular buffer in BufferQueue
ClosedPublic

Authored by dberris on Sep 20 2017, 9:45 PM.

Details

Summary

This change removes the dependency on using a std::deque<...> for the
storage of the buffers in the buffer queue. We instead implement a
fixed-size circular buffer that's resilient to exhaustion, and preserves
the semantics of the BufferQueue.

We're moving away from using std::deque<...> for two reasons:

  • We want to remove dependencies on the STL for data structures.
  • We want the data structure we use to not require re-allocation in the normal course of operation.

The internal implementation of the buffer queue uses heap-allocated
arrays that are initialized once when the BufferQueue is created, and
re-uses slots in the buffer array as buffers are returned in order.

We also change the lock used in the implementation to a spinlock
instead of a blocking mutex. We reason that since the release operations
now take very little time in the critical section, that a spinlock would
be appropriate.

This change is related to D38073.

Event Timeline

dberris created this revision.Sep 20 2017, 9:45 PM
dblaikie edited edge metadata.Sep 27 2017, 11:06 AM

Does this class have unit test coverage?

lib/xray/xray_buffer_queue.cc
28

Remove the Mutex() here, since that's the default (& probably () for the Finalizing initialization for consistency)?

49–51

Why the change in locks/locking scheme?

69

std::none_of?

97–98

Maybe here and other similar loops:

for (auto I = Buffers.get(), E = I + BufferCount; I != E; ++I)

(there's an LLVM style guide rule/suggestion to cache the end)

dberris updated this revision to Diff 116930.Sep 28 2017, 12:31 AM
dberris marked 4 inline comments as done.

Address comment.

Does this class have unit test coverage?

Yes.

lib/xray/xray_buffer_queue.cc
28

We can't use (...) for Finalizing because it's an aggregate (no constructors).

49–51

We explain it in the commit description -- since we no longer need to be allocating memory, and we posit that the bodies of these functions are now quicker, that we'd have a better chance to not causing too much latency due to contention for these functions.

Why all the logging & test changes in the latest update? Did they slip in accidentally?

lib/xray/xray_buffer_queue.cc
30

Looks like LLVM uses lowercase 'i' for integer counters much more often than uppercase 'I' (the uppercase is often used for iterators, but not so often used for integer counters).

(& probably "!=" rather than "<" I would guess is more idiomatic in LLVM)

dberris updated this revision to Diff 117087.Sep 28 2017, 9:20 PM
dberris marked an inline comment as done.
  • fixup: I -> i.

Why all the logging & test changes in the latest update? Did they slip in accidentally?

Not accidental -- it turns out that sometimes, f2() doesn't last long enough to be logged, and despite the threshold for deltas being 0 it seems we are still able to get to that state.

That seems like a separate "bug" that I need to investigate.

Adding the fences in the test made it exhibit the problem less, so I kept it.

The atomic global is there to ensure that there's an observable dependency between the threads writing to the variable and the output of the main function.

dberris planned changes to this revision.Sep 28 2017, 9:25 PM

Ah, I see one bit that did slip in to the logging implementation. Let me break that out as a separate NFC change.

dberris updated this revision to Diff 117088.Sep 28 2017, 9:31 PM

Rebase after NFC, breaking out logging-specific refactoring.

pelikan accepted this revision.Oct 2 2017, 7:31 PM

I'm wondering how much will all the malloc(3) calls in one thread impact the thread locality work done in tcmalloc. Better ask a tcmalloc expert though, I've lost track of what's going on.

This revision is now accepted and ready to land.Oct 2 2017, 7:31 PM
This revision was automatically updated to reflect the committed changes.
dberris reopened this revision.Oct 3 2017, 10:10 PM

Reverted as it caused unit test failures, for some reason weren't caught in local testing.

Re-opening now to keep track of review history.

This revision is now accepted and ready to land.Oct 3 2017, 10:10 PM
dberris updated this revision to Diff 117626.Oct 3 2017, 10:14 PM
  • fixup: keep track of live buffers indepdendent of the pointers
dberris updated this revision to Diff 117627.Oct 3 2017, 10:17 PM
  • fixup: remove unused include
This revision was automatically updated to reflect the committed changes.
dblaikie added inline comments.Oct 9 2017, 11:14 AM
compiler-rt/trunk/lib/xray/xray_buffer_queue.h
45 ↗(On Diff #117475)

Shouldn't we be avoiding adding more standard library memory allocation to compiler-rt? (I thought that was one of the goals)

& that reminds me: I think the reason to avoid /all/ C++ standard library in compiler-rt is because of the possibility of mismatch between the standard library used to build compiler-rt and the one used to build user code. If theer is such a mismatch then the comdat functions will be chosen essentially at random and either the user code or the compiler-rt code using such a library function could end up quite broken. Right?

dberris added inline comments.Oct 9 2017, 5:01 PM
compiler-rt/trunk/lib/xray/xray_buffer_queue.h
45 ↗(On Diff #117475)

What's the harm with using std::unique_ptr<T[]> here? That it will use the standard allocator? Is there an alternative, should we be using raw pointers instead? Should we be allocating it with new[] or with malloc directly?