Page MenuHomePhabricator

[XRay] Use optimistic logging model for FDR mode

Authored by dberris on Nov 1 2017, 8:35 PM.



Before this change, the FDR mode implementation relied on at thread-exit
handling to return buffers back to the (global) buffer queue. This
introduces issues with the initialisation of the thread_local objects
which, even through the use of pthread_setspecific(...) may eventually
call into an allocation function. Similar to previous changes in this
line, we're finding that there is a huge potential for deadlocks when
initialising these thread-locals when the memory allocation
implementation is also xray-instrumented.

In this change, we limit the call to pthread_setspecific(...) to provide
a non-null value to associate to the key created with
pthread_key_create(...). While this doesn't completely eliminate the
potential for the deadlock(s), it does allow us to still clean up at
thread exit when we need to. The change is that we don't need to do more
work when starting and ending a thread's lifetime. We also have a test
to make sure that we actually can safely recycle the buffers in case we
end up re-using the buffer(s) available from the queue on multiple
thread entry/exits.

This change cuts across both LLVM and compiler-rt to allow us to update
both the XRay runtime implementation as well as the library support for
loading these new versions of the FDR mode logging. Version 2 of the FDR
logging implementation makes the following changes:

  • Introduction of a new 'BufferExtents' metadata record that's outside of the buffer's contents but are written before the actual buffer. This data is associated to the Buffer handed out by the BufferQueue rather than a record that occupies bytes in the actual buffer.
  • Removal of the "end of buffer" records. This is in-line with the changes we described above, to allow for optimistic logging without explicit record writing at thread exit.

The optimistic logging model operates under the following assumptions:

  • Threads writing to the buffers will potentially race with the thread attempting to flush the log. To avoid this situation from occuring, we make sure that when we've finalized the logging implementation, that threads will see this finalization state on the next write, and either choose to not write records the thread would have written or write the record(s) in two phases -- first write the record(s), then update the extents metadata.
  • We change the buffer queue implementation so that once it's handed out a buffer to a thread, that we assume that buffer is marked "used" to be able to capture partial writes. None of this will be safe to handle if threads are racing to write the extents records and the reader thread is attempting to flush the log. The optimism comes from the finalization routine being required to complete before we attempt to flush the log.

This is a fairly significant semantics change for the FDR
implementation. This is why we've decided to update the version number
for FDR mode logs. The tools, however, still need to be able to support
older versions of the log until we finally deprecate those earlier

Additional changes:

  • Move per-thread key initialisation to when we initialise the logging implementation
  • Define a grace period for finalisation (default to 100ms).
  • Use aligned allocations for the buffers handed out by the BufferQueue.

Diff Detail


Event Timeline

dberris created this revision.Nov 1 2017, 8:35 PM
dberris updated this revision to Diff 121248.Nov 2 2017, 1:38 AM
  • fixup: reintroduce thread exit handler, and test
dberris updated this revision to Diff 121251.Nov 2 2017, 1:56 AM
dberris edited the summary of this revision. (Show Details)

Re-wording a bit to explain the retention of one call to pthread_setspecific.

dberris updated this revision to Diff 121260.Nov 2 2017, 2:29 AM
  • fixup: some changes;

Okay, now it's ready for a look. I'm done tweaking this for now. :)

Thanks in advance!

dberris planned changes to this revision.Nov 5 2017, 7:16 PM

Actually, I just realised there's an issue with this approach -- I may need to ensure that the size_t being written doesn't straddle a cache line so that partial writes couldn't cause the thread reading the data to see an incorrect value of the extents. My options for fixing this are:

  • Attaching additional metadata to the Buffer type, that's cache-line aligned instead of using the metadata record in the buffer that's character-aligned.
  • Create a dedicated section in the Buffer that's treated as metadata, and is guaranteed cache-aligned. This will mean that we use an aligning allocator (posix_memalign perhaps) to get a suitably-aligned metadata section (buffer header?) that gets updated with the useful data about the buffer.

I'll think about this a little more, and update with a more correct solution.

dberris updated this revision to Diff 122623.Nov 13 2017, 2:25 AM

Update implementation to be more thread-safe, despite accepting potential races
in the implementation.

dberris updated this revision to Diff 122624.Nov 13 2017, 2:32 AM
dberris edited the summary of this revision. (Show Details)

Update description to reflect new state of the change.

dberris updated this revision to Diff 122625.Nov 13 2017, 2:34 AM
dberris edited the summary of this revision. (Show Details)

Fix description.

This is now ready for a look @pelikan @kpw.

Thanks in advance!

dberris updated this revision to Diff 122627.Nov 13 2017, 2:42 AM
  • fixup: remove unnecessary 'version' propagation

Okay, I'm going to stop updating this now and wait for review. :)

pelikan edited edge metadata.Nov 15 2017, 8:05 PM

partial review only

167 ↗(On Diff #122627)

I can't see a call to munmap(2). Does this mean each TEST_F() is run in a separate process?

79 ↗(On Diff #122627)

I'd be inclined to use a semaphore instead but I suspect its overhead would be too high due to MOESI traffic for the memory of its internal count. Please confirm.

142–146 ↗(On Diff #122627)

We might benefit from turning this into a writev(2):

struct iov { iov_vec = {{ExtendsRecord, 8}, {B.Buffer, Extents} }, iov_cnt = 2 };

294 ↗(On Diff #122627)

I trust it's safe to incrementExtents after the memcpy(3) because this will never run concurrently in multiple threads.

345–350 ↗(On Diff #122627)

I think it would improve readability if this indented left a bit, as in:

if (TLD.BQ == nullptr) return;

then the rest.

still not 100% finished - I'll get to it tomorrow. I am convinced the overall concept will work though.

294 ↗(On Diff #122627)

It's correct because incrementExtents is acq_rel so before that technically nothing happened :-) Marking as done.

56 ↗(On Diff #122627)

You speak of MemPtr but your words mean nothing.

195 ↗(On Diff #122627)

Since we have different types of "records", how about calling the variable Metadata?

316–317 ↗(On Diff #122627)

Plus million for this coding style. Much better than wrapping in the middle of the screen.

dberris updated this revision to Diff 123271.Nov 16 2017, 4:32 PM
dberris marked 9 inline comments as done.
  • fixup: Address some review comments
167 ↗(On Diff #122627)

Good catch -- it just so happens that the lit test runner does it one test case per run. Fixed to register the map to the ScopedFileCloserAndDeleter.

79 ↗(On Diff #122627)

We don't want a semaphore because that will cause synchronisation across threads. This explicitly not require synchronisation, and allow threads to observe some other signalling (atomic) mechanism instead, i.e. whether the buffer queue is finalising.

142–146 ↗(On Diff #122627)

Good idea. Maybe changing the API of retryingWriteAll to support a number of ranges might be useful to do. Probably not in this change.

316–317 ↗(On Diff #122627)

Thank clang-format -- this only happened because we removed the MemPtr argument.

pelikan accepted this revision.Nov 20 2017, 9:56 PM
pelikan added inline comments.
316–317 ↗(On Diff #122627)

It didn't do that before, and I *really* tried to make it.

This revision is now accepted and ready to land.Nov 20 2017, 9:56 PM
This revision was automatically updated to reflect the committed changes.
dberris marked an inline comment as done.