Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

[compiler-rt][XRay] Initial per-thread inmemory logging implementation
ClosedPublic

Authored by dberris on Jul 5 2016, 12:26 AM.

Details

Summary

Depends on D21612 which implements the building blocks for the compiler-rt
implementation of the XRay runtime. We use a naive in-memory log of fixed-size
entries that get written out to a log file when the buffers are full, and when
the thread exits.

This implementation lays some foundations on to allowing for more complex XRay
records to be written to the log in subsequent changes. It also defines the format
that the function call accounting tool in D21987 will start building upon.

Once D21987 lands, we should be able to start defining more tests using that tool
once the function call accounting tool becomes part of the llvm distribution.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dberris updated this revision to Diff 66433.Aug 1 2016, 11:33 PM

Rebase again, with fix to math for stack adjustment.

majnemer added inline comments.Aug 9 2016, 10:05 PM
include/xray/xray_records.h
41–42 ↗(On Diff #67112)

Should these be bitfields?

lib/xray/xray_inmemory_log.cc
47 ↗(On Diff #67112)

Are we following LLVM convention for XRay? If so, I'd make this function start with a lowercase letter.

52–53 ↗(On Diff #67112)

This should be Written < 0. I don't believe errno will be set to anything good if write returns 0; I think you'd observe a previous errno value.

73 ↗(On Diff #67112)

Why noexcept?

90 ↗(On Diff #67112)

Ditto regarding llvm convention.

90 ↗(On Diff #67112)

Should the long be a long long?

91 ↗(On Diff #67112)

Should we pass in O_CLOEXEC?

95–96 ↗(On Diff #67112)

How do you feel about these potentially failing?

98 ↗(On Diff #67112)

Shouldn't this be s64?

156 ↗(On Diff #67112)

Any reason not to just use 0?

175 ↗(On Diff #67112)

I'd just use unsigned here to match how __rdtscp is defined.

dberris updated this revision to Diff 67464.Aug 9 2016, 10:37 PM
dberris marked 11 inline comments as done.

Rebase + style comments and some correctness fixes

include/xray/xray_records.h
41–42 ↗(On Diff #67112)

Good idea, yep.

lib/xray/xray_inmemory_log.cc
47 ↗(On Diff #67112)

Yes, we should be -- sorry I haven't quite internalised the rules. :/

52–53 ↗(On Diff #67112)

Good catch, thanks. Done.

73 ↗(On Diff #67112)

Habbit... removed.

91 ↗(On Diff #67112)

Good idea -- yes, done.

95–96 ↗(On Diff #67112)

Not very good... is it better now?

majnemer added inline comments.Aug 15 2016, 8:20 PM
include/xray/xray_records.h
40–42 ↗(On Diff #68126)

Why not give these bitfields an underlying type of bool?

lib/xray/xray_inmemory_log.cc
71 ↗(On Diff #68126)

Should we assert that Fd > 0?

77 ↗(On Diff #68126)

Why do we have this fsync? Might deserve a comment.

95–99 ↗(On Diff #68126)

This seems wrong in the face of partial reads. Perhaps a retryingRead is called for?

dberris updated this revision to Diff 68136.Aug 15 2016, 11:30 PM
dberris marked 3 inline comments as done.
  • Address review comments
  • Make use of the naive log configurable
  • Add test to make sure options from args works
dberris added inline comments.Aug 15 2016, 11:31 PM
lib/xray/xray_inmemory_log.cc
71 ↗(On Diff #68126)

Instead of asserting, I check that Fd is valid before using it in the write in the destructor.

77 ↗(On Diff #68126)

Explained a little in a comment, essentially because we're not doing any sort of reference counting (for simplicity's sake) we rely on the sync to flush the data at a thread's exit. We can implement reference counting, but that would complicate this a bit more than necessary IMO.

majnemer added inline comments.Aug 15 2016, 11:37 PM
include/xray/xray_records.h
39–42 ↗(On Diff #68136)

You could gain another byte for Padding by sending this between Type and CycleFrequency.

lib/xray/xray_inmemory_log.cc
70 ↗(On Diff #68136)

ssize_t?

dberris updated this revision to Diff 68141.Aug 15 2016, 11:59 PM
dberris marked an inline comment as done.
  • Re-order some flags in XRayFileHeader
dberris updated this revision to Diff 68142.Aug 16 2016, 12:02 AM
dberris marked an inline comment as done.
  • Use ssize_t for BytesRead
rSerge requested changes to this revision.Aug 22 2016, 8:17 AM
rSerge added a reviewer: rSerge.
rSerge added a subscriber: rSerge.

Please, see inline.

include/xray/xray_records.h
47 ↗(On Diff #68142)

So CycleFrequency is an 8-byte variable not aligned to 8 bytes. This may cause portability issues to other CPUs, not only performance drops for unaligned memory access, but they say on some CPUs unaligned access is completely forbidden.

lib/xray/xray_inmemory_log.cc
71 ↗(On Diff #68142)

So the loop runs while BytesRead<0 ? Shouldn't it run while BytesToRead > 0?
Then in an if, before checking errno, you can check whether BytesRead < 0.

90 ↗(On Diff #68142)

So retryingWrite writes all bytes, while retryingRead reads some bytes up to the given number of bytes, right? Then their parameters should indicate this, or the functions should be renamed to e.g. retryingWriteAll and retryingReadSome .

130 ↗(On Diff #68142)

This crashes if Buffer contains formatting, e.g. %d (random memory read) or worse %n (random memory write), but no formatting arguments are given. Consider changing to `fprintf(stderr, "%s", Buffer);

135 ↗(On Diff #68142)

(Minor) Why not to have BuffLen as a local static const? This would remove "action at a distance" anti-pattern.

137 ↗(On Diff #68142)

Why not thread_local ? How do you ensure no 2 threads race to check whether Fd is initialized and, if not, initializing it?

139 ↗(On Diff #68142)

Why do you call it Printf? f in printf stays for formatted, I guess. But PrintToStdErr doesn't support formatting. Consider renaming to SetPrintAndReportCallback .

141 ↗(On Diff #68142)

Would be UB in a race condition.

142 ↗(On Diff #68142)

How the magic 246 number is calculated?

143 ↗(On Diff #68142)

Is 255 logically equal to sizeof(TmpFilename)-1 ? The latter would be preferred over a magic number, and it's easier then to change just one place - the declaration of TmpFilename variable.

147 ↗(On Diff #68142)

Consider making "XXXXXX" a constant, replacing 6 with its sizeof-1.

159–163 ↗(On Diff #68142)

What if the system has multiple CPU sockets, with different CPUs installed so that TSC frequencies are different for different cores?

166 ↗(On Diff #68142)

Why not to report an error?

This revision now requires changes to proceed.Aug 22 2016, 8:17 AM
dberris updated this revision to Diff 68969.Aug 23 2016, 5:08 AM
dberris edited edge metadata.
dberris marked 12 inline comments as done.

Address comments

dberris added inline comments.Aug 23 2016, 5:10 AM
include/xray/xray_records.h
47 ↗(On Diff #68142)

Are you aware of specific CPUs where this would be an issue?

I'm happy to change it, but I'd like to understand whether this is an actual problem, or just a hypothetical one.

Note that we don't read this, we only ever write it down, so the performance hit isn't as important as maximising the packing of the bytes that eventually make it to disk. The 15 byte padding gives us 15 more bytes to use for later extending the information in the file header.

lib/xray/xray_inmemory_log.cc
137 ↗(On Diff #68142)

C++11 ensures this for us, that the first time this function is called, this static variable is initialised exactly once -- this is meant to be synchronised across all threads too.

139 ↗(On Diff #68142)

This API is defined by the common sanitizer tools -- I'm just a using it. :)

There are places where Report(...) is called with a format string and additional arguments (like printf), and underneath it dispatches to a callback when these formatted print's are actually invoked. We're providing something that deals with the string that's post-formatted.

141 ↗(On Diff #68142)

C++11 ensures this is not a racy initialisation.

142 ↗(On Diff #68142)

I was being conservative here, giving an allowance of 10 bytes difference between the size of the temporary filename's buffer size total and the base filename provided as a flag. Fixed it to do relative calculations instead of magic numbers.

159–163 ↗(On Diff #68142)

I'm not sure how this works in non-x86, but we assume that all the CPUs will run in the same rate, and use that as an approximation for the TSC frequency rate. This means if we're on this branch of the conditional, we don't get the reliable number and take a guess anyway.

sdardis added inline comments.
include/xray/xray_records.h
48 ↗(On Diff #68969)

Apologies for the drive-by comment.

Pre MIPSR6 requires different instructions (e.g. swl/swr to store 4 bytes) to write unaligned data if it is not naturally aligned to its size. Attempting to access 2/4/8 byte words with a normal load/store triggers a CPU level address exception if unaligned.

For MIPSR6 unaligned support is required, but it is implementation dependent how this is achieved. For example, an implementation could support unaligned access within a cacheline but require OS intervention if the unaligned access crosses a cacheline.

IIRC x86, PowerPC, AArch64 have native support for unaligned access. SPARC, ARM typically don't.

dberris planned changes to this revision.Aug 23 2016, 5:52 AM
dberris added inline comments.
include/xray/xray_records.h
48 ↗(On Diff #68969)

Apologies for the drive-by comment.

No need for apologies, comments most appreciated!

Pre MIPSR6 requires different instructions (e.g. swl/swr to store 4 bytes) to write unaligned data if it is not naturally aligned to its size. Attempting to access 2/4/8 byte words with a normal load/store triggers a CPU level address exception if unaligned.

For MIPSR6 unaligned support is required, but it is implementation dependent how this is achieved. For example, an implementation could support unaligned access within a cacheline but require OS intervention if the unaligned access crosses a cacheline.

IIRC x86, PowerPC, AArch64 have native support for unaligned access. SPARC, ARM typically don't.

Cool, this is good to know. Happy to change this implementation to be more portable to more platforms. :)

dberris updated this revision to Diff 68988.Aug 23 2016, 6:49 AM
dberris edited edge metadata.
  • Undo changes to CMakeLists.txt
  • Align CycleFrequency to 8 byte boundaries.
dberris marked 2 inline comments as done.Aug 23 2016, 6:52 AM

Address comments on alignment of CycleFrequency.

rSerge accepted this revision.Aug 23 2016, 7:26 AM
rSerge edited edge metadata.

Looks good to me.

lib/xray/xray_inmemory_log.cc
160–164 ↗(On Diff #68988)

I meant a x86/x86_64 system, consumer motherboards can easily have 2 sockets to install 2 different CPUs. Non-x86/x86_64 system is a harder question: a replacement for TSC/RDTSC would be needed.

This revision is now accepted and ready to land.Aug 23 2016, 7:26 AM
dberris updated this revision to Diff 69321.Aug 25 2016, 11:40 PM
dberris edited edge metadata.

Rebase before landing.

This revision was automatically updated to reflect the committed changes.