This is an archive of the discontinued LLVM Phabricator instance.

Differential D44620

[XRay][profiler] Part 4: Profiler Mode Wiring
ClosedPublic

Authored by dberris on Mar 18 2018, 10:24 PM.

Details

Reviewers

eizan
kpw
pelikan

Commits

rGcfd7eec3d83e: [XRay][profiler] Part 4: Profiler Mode Wiring
rCRT334469: [XRay][profiler] Part 4: Profiler Mode Wiring
rL334469: [XRay][profiler] Part 4: Profiler Mode Wiring

Summary

This is part of the larger XRay Profiling Mode effort.

This patch implements the wiring required to enable us to actually
select the xray-profiling mode, and install the handlers to start
measuring the time and frequency of the function calls in call stacks.
The current way to get the profile information is by working with the
XRay API to __xray_process_buffers(...).

In subsequent changes we'll implement profile saving to files, similar
to how the FDR and basic modes operate, as well as means for converting
this format into those that can be loaded/visualised as flame graphs. We
will also be extending the accounting tool in LLVM to support
stack-based function call accounting.

We also continue with the implementation to support building small
histograms of latencies for the FunctionCallTrie::Node type, to allow
us to actually approximate the distribution of latencies per function.
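
As a rough illustration of the per-node latency histograms mentioned above, here is a minimal sketch that assumes power-of-two buckets over TSC deltas. The `LatencyHistogram` type, the bucket count, and the bucketing scheme are all hypothetical; the summary does not say how the histograms are laid out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: a small fixed-size histogram with power-of-two buckets.
// Bucket 0 holds deltas of 0 or 1 tick, bucket i (i >= 1) holds deltas in
// [2^i, 2^(i+1)), and the last bucket absorbs everything larger.
struct LatencyHistogram {
  static constexpr std::size_t kBuckets = 16; // assumed size, not from the patch
  std::array<std::uint64_t, kBuckets> Counts{};

  // Map a TSC delta to its bucket by counting how often it can be halved.
  static std::size_t bucketFor(std::uint64_t Delta) {
    std::size_t Bucket = 0;
    while (Delta > 1 && Bucket + 1 < kBuckets) {
      Delta >>= 1;
      ++Bucket;
    }
    return Bucket;
  }

  // Updating is a single increment, which keeps the per-call cost low.
  void record(std::uint64_t Delta) { ++Counts[bucketFor(Delta)]; }
};
```

A fixed array like this needs no allocation on the hot path, which is one reason such a layout might suit a per-node structure; whether the real implementation makes the same trade-off is not stated here.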

Depends on D45758 and D46998.

Diff Detail

Build Status

Buildable 17172
Build 17172: arc lint + arc unit

Event Timeline

dberris created this revision. Mar 18 2018, 10:24 PM

Herald added a subscriber: mgorny. Mar 18 2018, 10:24 PM

ping @echristo or @kpw?

dberris mentioned this in D45474: [XRay][clang+compiler-rt] Support build-time mode selection. Apr 9 2018, 11:24 PM

dberris mentioned this in rL329772: [XRay][clang+compiler-rt] Support build-time mode selection. Apr 10 2018, 6:31 PM

dberris mentioned this in rC329772: [XRay][clang+compiler-rt] Support build-time mode selection.

dberris mentioned this in rCRT329772: [XRay][clang+compiler-rt] Support build-time mode selection.

Sorry for the late review, and for destroying your diff.

What I think would also make interesting test cases:

when the call sequence is A → B → C → setjmp(3) ↓ B ↓ A → D → longjmp(3), where the exit from A happens only after the loop has run N times
or (alternatively) strange situations involving C++ exceptions

compiler-rt/lib/xray/tests/unit/allocator_test.cc
36 ↗	(On Diff #138879)	ASSERT_EQ(A.Counter, 1);
39 ↗	(On Diff #138879)	ASSERT_EQ(A.Counter, 1);
compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
39–40 ↗	(On Diff #138879)	For readability, can we have the TSCs to be 100, 200, 300 etc.? Now the numbers look the same. (and I see a test below does that already)
compiler-rt/lib/xray/xray_allocator.h
12–13 ↗	(On Diff #138879)	I would at least add a TODO for adding support for replacing this allocator with any of the security-checking ones. (ubsan? efence? valgrind? I keep forgetting the names of them. compiler-rt IIRC has some "secure" allocator too.) We don't want to become another OpenSSL.
66 ↗	(On Diff #138879)	static_assert(Size <= 16); to make sure we don't run out of bits when a CPU manufacturer goes crazy?
compiler-rt/lib/xray/xray_function_call_trie.h
24–27 ↗	(On Diff #138879)	I'd reword/reorder this slightly, to make it clearer what's actually being stored. FunctionCallTrie represents stack traces of XRay instrumented functions that we've encountered, where a node corresponds to a function call and the path from the root to that node represents its stack trace.
84 ↗	(On Diff #138879)	IIUC, a comma before "then", or a new sentence.
90 ↗	(On Diff #138879)	New line not necessary?
116–117 ↗	(On Diff #138879)	Why are these not unsigned? There's no such thing as a negative count or negative time spent in a function. ShadowStackEntry has the time as u64. (I'd make the function ID unsigned too.)
117 ↗	(On Diff #138879)	Please put something like "// TSC ticks" at the end of the line, or introduce u64 typedefs for TSC ticks/deltas to avoid someone mistaking it for nanoseconds or the like.
131 ↗	(On Diff #138879)	It's not clear to me why does this need FId when the Node pointer below has it as well. Please add a brief comment if it's really necessary.
152 ↗	(On Diff #138879)	Um, why is this line necessary? :-) Line 230 should work without Allocators:: too.
255–256 ↗	(On Diff #138879)	Why is the comment related to the function not above the function's first line? Same for exitFunction().
315–319 ↗	(On Diff #138879)	It's not clear to me from this comment that this function, unlike mergeInto, may create duplicate entries.
compiler-rt/lib/xray/xray_profile_collector.cc
31 ↗	(On Diff #138879)	I'm not sure a spinning lock is the best idea when deepCopying a huge tree, but have no data to prove anything.
129 ↗	(On Diff #138879)	Why not just "auto FId"?
134–138 ↗	(On Diff #138879)	Are you sure "static" is needed here? With just a local zero variable, the compiler may inline the memcpy and turn it into memset, whereas you're telling it to load a thing far away in memory. That said, I reckon "internal_memset(NextPtr, 0, 4); NextPtr += 4;" would be both easier to read and faster.
151–154 ↗	(On Diff #138879)	Same here, memset(NextPtr, 0, 8); NextPtr += 8;
compiler-rt/lib/xray/xray_profile_collector.h
10 ↗	(On Diff #138879)	instruementation has an typoe in it -> instrumentation Please fix other files as well.
31–32 ↗	(On Diff #138879)	Why is this a class with public static methods, when it doesn't have any data or subclasses and therefore should be a namespace?
compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc
16–17	These local macros don't make it that much shorter or more readable. Consider either removing "XRAY_" or dropping them.
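
To make the FunctionCallTrie description from the comment on lines 24–27 concrete, here is a deliberately simplified sketch. It uses `std::map` and `std::vector` instead of the XRay allocators and segmented arrays, and `MiniTrie`, `Node`, `enter`, `exit`, and `find` are hypothetical stand-ins, not the types in xray_function_call_trie.h. The unwinding behaviour on exit is an assumption based on the "missing intermediary" tests discussed in this review.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <map>
#include <memory>
#include <vector>

// Hypothetical node: one function call position in a stack trace.
struct Node {
  std::int32_t FId = 0;
  std::int64_t CallCount = 0;
  std::int64_t CumulativeTSC = 0; // TSC ticks, not nanoseconds
  std::map<std::int32_t, std::unique_ptr<Node>> Callees;
};

// The path from the root to a node is the stack trace of that call.
class MiniTrie {
  struct ShadowEntry {
    std::uint64_t EntryTSC;
    Node *N;
  };
  Node Root;                      // sentinel, not a real function
  std::vector<ShadowEntry> Stack; // shadow stack of active calls

public:
  void enter(std::int32_t FId, std::uint64_t TSC) {
    Node *Parent = Stack.empty() ? &Root : Stack.back().N;
    auto &Slot = Parent->Callees[FId];
    if (!Slot) {
      Slot = std::make_unique<Node>();
      Slot->FId = FId;
    }
    Stack.push_back({TSC, Slot.get()});
  }

  // Unwind until the matching function is found, so that a missing
  // intermediate exit does not leave stale entries on the shadow stack.
  void exit(std::int32_t FId, std::uint64_t TSC) {
    while (!Stack.empty()) {
      ShadowEntry Top = Stack.back();
      Stack.pop_back();
      ++Top.N->CallCount;
      Top.N->CumulativeTSC += static_cast<std::int64_t>(TSC - Top.EntryTSC);
      if (Top.N->FId == FId)
        break;
    }
  }

  // Look up the node reached by following a stack trace from the root.
  const Node *find(std::initializer_list<std::int32_t> Path) const {
    const Node *Cur = &Root;
    for (std::int32_t FId : Path) {
      auto It = Cur->Callees.find(FId);
      if (It == Cur->Callees.end())
        return nullptr;
      Cur = It->second.get();
    }
    return Cur;
  }
};
```

For example, after `enter(1, 100); enter(2, 200); exit(2, 300); exit(1, 400);` the node for the trace 1 → 2 has one call and 100 ticks, and there is no top-level node for function 2.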

pelikan added inline comments. Apr 11 2018, 12:33 PM

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
91–92 ↗	(On Diff #138879)	f0 → f1 → setjmp ↓(f1→f0) longjmp ↓(f1↓f0) longjmp ↓(f1↓f0) should generate lots of exits but only one entry. Or when the profiling starts in a signal handler. So it shouldn't be "impossible", just "infrequent" :-)
106–110 ↗	(On Diff #138879)	Test name says "MissingIntermediaryEntry" but this is missing an intermediary exit.
145–150 ↗	(On Diff #138879)	Same, please use multiples of 100 for TSCs. I wonder whether we should test the case where the TSC time series is not non-decreasing, due to TSC mismatches when rescheduling among poorly synchronized CPU packages.
compiler-rt/lib/xray/tests/unit/profile_collector_test.cc
21 ↗	(On Diff #138879)	did you mean: "the only one we actually care about"? If we "only care" about it, what more can we do about it? :-)
47 ↗	(On Diff #138879)	I would at least assert the buffer's size is within some reasonable bounds - has the trailing bit and a function list. Maybe also the zero sentinels in places.
77 ↗	(On Diff #138879)	Again, some assertions to make sure both threads are reflected in that buffer would be nice. Doesn't have to be too strict.
compiler-rt/lib/xray/tests/unit/segmented_array_test.cc
71 ↗	(On Diff #138879)	Please also test what happens when you do Array <TestData> data; auto it = data.begin(); it--; Because I think you'll find Offset will be SIZE_MAX. Not sure we want that.
compiler-rt/lib/xray/xray_allocator.h
94 ↗	(On Diff #138879)	assume NewChain == nullptr. (or BackingStore for that matter)
compiler-rt/lib/xray/xray_profile_collector.cc
50 ↗	(On Diff #138879)	Why do we need the volatile? It's a global, there's very little optimization the compiler can do anyway... I'd like to see what I missed, thinking it'd be OK without it.
compiler-rt/lib/xray/xray_segmented_array.h
42 ↗	(On Diff #138879)	So, actually, I never liked linked lists where the prev/next pointers are in a separate region of memory, because that tends to worsen the cache miss rate when you walk through the list, and when the points at which these Chunk things are allocated are reasonably randomized along with the actual data allocations to confuse the CPU prefetcher. Which is why I've always been using LIST/TAILQ versions from queue(3) as they embed these to the structures they're listing. I'm not saying you should rewrite all of this now, but have you thought about putting the prev/next into the T somehow? Is that even possible to do with C++ templates?
44 ↗	(On Diff #138879)	Why is this necessary, and we can't just use N?
61 ↗	(On Diff #138879)	InternalAlloc can return nullptr.
153 ↗	(On Diff #138879)	I suppose these were your debugging statements which can go away (and below).
242 ↗	(On Diff #138879)	tautology
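
The concern about `it--` on `begin()` in segmented_array_test.cc comes down to unsigned wrap-around. The sketch below is independent of XRay: `OffsetIterator` is a hypothetical stand-in that tracks its position as a `size_t`, showing the unchecked decrement next to one possible guarded alternative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical iterator-like type that stores its position as an unsigned offset.
struct OffsetIterator {
  std::size_t Offset = 0;

  // Unchecked: decrementing at offset 0 wraps to SIZE_MAX. The arithmetic is
  // well-defined for unsigned types, but the result is not a valid position.
  void uncheckedDecrement() { --Offset; }

  // Guarded: refuses to move before the beginning and reports whether it moved.
  bool checkedDecrement() {
    if (Offset == 0)
      return false;
    --Offset;
    return true;
  }
};
```

Whether a container should guard against this or treat it as a contract violation is a design choice; the guarded form simply makes the failure observable.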

fixup: Address comments

compiler-rt/lib/xray/tests/unit/allocator_test.cc
36 ↗	(On Diff #138879)	This is not a valid assertion, since `Counter` is not relevant to the allocator's observable properties.
39 ↗	(On Diff #138879)	Same.
compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
39–40 ↗	(On Diff #138879)	Can you explain better why using 100, 200, 300 as opposed to 1, 2, 3 is better?
91–92 ↗	(On Diff #138879)	Right, changed to "rare". For now, we've not made special support for setjmp/longjmp instrumentation. While there's a possibility we can do that in the future, we're not counting on being able to differentiate that for now. The signal handler case is precisely the one we're looking to support here, but that's just one case.
145–150 ↗	(On Diff #138879)	Good point. I'll leave a TODO for that. In particular we actually need to keep track of the CPU ID instead of just the TSC when we're building the shadow stack. That will let us track the migration of the thread(s). It's still not clear to me why using multiples of 100 is important. These are just arbitrary numbers anyway, it shouldn't matter what order of magnitude they are.
compiler-rt/lib/xray/tests/unit/profile_collector_test.cc
21 ↗	(On Diff #138879)	I don't see the difference. English is hard. "the only one we care about" == "the one we actually only care about" There are other use-cases for this collection API, some of which we don't cover in this unit test (yet). In particular, we could be collecting snapshots of the function call trie for a function every so often, and associating a timestamp to that, so we can show profiles over time instead of a single profile.
47 ↗	(On Diff #138879)	Some of these details aren't really relevant to the unit test. For example: The size of the block is dependent on how we've decided to serialise the data. Yes we can assert that the size is not zero (doing that now). The function list could be in any order. It's not a relevant feature of the API. What we care about is that we're able to get the data. The concern of parsing this data is not really at this level of the unit test (I'd rather we have an actual end-to-end test that would get this information). We're testing that we can get a buffer that's not the empty buffer, which tells us enough information to say that this API in particular is holding its promises based on the preconditions and postconditions.
compiler-rt/lib/xray/tests/unit/segmented_array_test.cc
71 ↗	(On Diff #138879)	That's technically testing for undefined behaviour -- i.e. outside of the contract of the container. :)
compiler-rt/lib/xray/xray_allocator.h
12–13 ↗	(On Diff #138879)	We already do that, because we're relying on the underlying allocator for sanitizer_common. There's already a way to provide alternate implementations of those in that regard. All the backing store we have is gotten from sanitizer_common as opposed to using our own calls to mmap directly.
compiler-rt/lib/xray/xray_function_call_trie.h
116–117 ↗	(On Diff #138879)	There are some potentially subtle issues with using unsigned in these variables. Some of them are: Forcing a value to be unsigned causes the compiler to implement modular arithmetic, even if we don't ever expect that values will wrap-around. Doing zero-sign extension is not cheap. We want to make these values as cheap as possible to update. Also, we cannot make the function ID unsigned, because the value we're getting from XRay is a signed number (int32_t). The conversion will not be faithful, and we've avoided unsigned in those cases for similar reasons. If we decide in the future that we can actually get away with unsigned values for function ids (which I think we can) then we can change all the XRay implementations to take unsigned values for the function id, etc. -- most of which is not really worth the cost.
131 ↗	(On Diff #138879)	This is an optimisation, so that we don't actually need to reach into the pointer just to get the function id of the function at the top of the stack.
152 ↗	(On Diff #138879)	This is necessary because users of this nested type need to access these exported types. In this case, because NodeRefAllocatorType is part of the FunctionCallTrie type, we're re-exporting this type through Allocators which is a public type.
255–256 ↗	(On Diff #138879)	Because the comment is an implementation detail, it's explaining what it's doing rather than what users need to expect (i.e. it's not documentation of the contract, it's documentation of the implementation detail).
315–319 ↗	(On Diff #138879)	Good point. Updated the comment and the implementation to make it clear that we're not destroying the state of the FunctionCallTrie in `O`.
compiler-rt/lib/xray/xray_profile_collector.cc
31 ↗	(On Diff #138879)	Note, the intent here is to use the `GlobalMutex` to lock operations on the `ThreadTries` vector. We shouldn't be holding a lock on the `GlobalMutex` while in the process of copying the FunctionCallTrie.
compiler-rt/lib/xray/xray_profile_collector.h
31–32 ↗	(On Diff #138879)	Good point. It started as a class that had member variables, until it evolved to a global implementation, which is better just as a namespace.
compiler-rt/lib/xray/xray_segmented_array.h
42 ↗	(On Diff #138879)	What you're talking about is intrusive lists. These work only if you're doing a linked list of elements, but in this case it's the chunks we're linking together. Each chunk will have a block, which is what we're managing. All the chunks come from the same region of memory.
44 ↗	(On Diff #138879)	Usability -- because you can do: Chunk C; assert(C.Size > 0);
compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc
16–17	Yeah, unfortunately without these clang-format gets confused. :D
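
The reply on xray_profile_collector.cc line 31 says the mutex should guard only the `ThreadTries` vector and not be held while copying a trie. A minimal sketch of that locking discipline follows, using `std::mutex` rather than the sanitizer spin mutex; `Collector`, `Profile`, `post`, and `drainAndSum` are hypothetical names, and summing integers stands in for the expensive copy.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-thread profile.
using Profile = std::vector<int>;

class Collector {
  std::mutex Mutex; // guards ThreadProfiles only
  std::vector<Profile> ThreadProfiles;

public:
  // Posting only appends under the lock; the caller built P beforehand.
  void post(Profile P) {
    std::lock_guard<std::mutex> Lock(Mutex);
    ThreadProfiles.push_back(std::move(P));
  }

  // Take ownership of everything posted so far while holding the lock,
  // then do the expensive traversal after the lock has been released.
  long drainAndSum() {
    std::vector<Profile> Local;
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Local.swap(ThreadProfiles);
    }
    long Sum = 0;
    for (const Profile &P : Local)
      for (int V : P)
        Sum += V;
    return Sum;
  }
};
```

With the critical section reduced to a swap, the cost of a spinning lock no longer scales with the size of the data being processed.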

Harbormaster completed remote builds in B17101: Diff 142590. Apr 15 2018, 8:13 PM

Can this possibly be split up? It's way too long to easily review and seems to at least have two sets of functionality.

Split into smaller parts.

Harbormaster completed remote builds in B17169: Diff 142893. Apr 18 2018, 1:18 AM

Retitled, and updated to reflect breakup into smaller parts.

Adding back llvm-commits.

Harbormaster completed remote builds in B17172: Diff 142896. Apr 18 2018, 1:34 AM

dberris added a child revision: D45998: [XRay][profiler] Part 5: Profiler File Writing. Apr 23 2018, 10:22 PM

This one is more straightforward than the previous in the change. The main ideas all fall into place nicely, but I have pointed out some details that could use some attention.

compiler-rt/lib/xray/xray_profiler.cc
41	Can you expand the initialism to thread-local data in a comment? I always think TopLevelDomain when I see this and after jumping to definition that would help refresh my memory.
51	Don't you need pthread_create first for ProfilingKey?
105	Seems to me this should be memory_order_acquire_release if we want mutual exclusion of profileCollectorService::reset() from another thread.
142–143	Maybe check verbosity and Report?
156–168	I think this should be wrapped in an "if (TLD.FCT)" block. It's definitely an edge case, but if ProfilerLogStatus is INITIALIZED, and the atomic load happens and then, before the next statement, a context switch and an update to FINALIZING happen, then the TLD can be uninitialized, but the status check won't know about the FINALIZING state and we'll dereference a null TLD.FCT. I think there is a similar problem with moving the GetTLD() before the atomic_load for the transition from FINALIZED to INITIALIZING. You could solve it by having GetTLD return the TLD reference and status from the load. It also might be worth documenting that InternalAlloc won't return null (as opposed to the FCT allocator), so we're relying on that.
189–190	Not scoped to profiling mode: This strategy still makes me uncomfortable for cases where not much is instrumented (e.g. event tracing), but I don't have a better solution fleshed out. It kind of feels like a cooperative scheduling problem where threads should check periodically if they're cancelled. Maybe we could have an option to add sleds that just do a finalizing check without instrumenting so that sparse instrumentation is able to respect the grace period.
198	Is this protected from simultaneous calls to postCurrentThreadFCT if other threads take a while to see they're finalized?
210–213	Do you have that graph of valid state transitions? I thought it was OK to go from FINALIZED back to INITIALIZED without going back to UNINITIALIZED.
229–230	Can these error for bad/missing flags?
240	Ahh. Can you just make a comment near the pthread_setspecific that INITIALIZE is responsible for calling pthread_key_create?
compiler-rt/test/xray/TestCases/Posix/profiling-multi-threaded.cc
4–5	Is it xray-profiler or xray-profiling? Does the flag not match the mode string in code? You assert on xray-profiling and set the mode to that in code below.
8	Do you need a thing to exclude Windows since you're calling readtsc()?
29	Yes please. ;)
compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc
5	Similar flag confusion.
47	Could be illustrative and increase coverage to have a test case that verifies that profiling mode can turn back on after a "round."
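
The race described in the comment on lines 156–168 can be sketched outside of XRay. The example below assumes the fix kpw suggests: the status is loaded once, the same value drives both the lazy setup and the caller's decision, and the handler still checks the trie pointer before using it. It uses `std::atomic` instead of the sanitizer atomics, passes the per-thread data explicitly so it can be exercised in one thread, and all names (`LogStatus`, `ThreadData`, `Snapshot`, `getSnapshot`, `handleEvent`) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum Status : std::int32_t {
  UNINITIALIZED,
  INITIALIZING,
  INITIALIZED,
  FINALIZING,
  FINALIZED
};

// Hypothetical global status, standing in for ProfilerLogStatus.
std::atomic<std::int32_t> LogStatus{UNINITIALIZED};

struct Trie {
  int Events = 0;
};

struct ThreadData {
  Trie *FCT = nullptr;
};

// The status that was observed, together with the data it was used to set up.
struct Snapshot {
  std::int32_t StatusSeen;
  ThreadData &TLD;
};

// Load the status exactly once and base the lazy setup on that value, so the
// caller cannot act on a different status than the one used here.
Snapshot getSnapshot(ThreadData &TLD, Trie &Storage) {
  std::int32_t S = LogStatus.load(std::memory_order_acquire);
  if (S == INITIALIZED && TLD.FCT == nullptr)
    TLD.FCT = &Storage;
  return Snapshot{S, TLD};
}

// Returns true only if the event was recorded.
bool handleEvent(ThreadData &TLD, Trie &Storage) {
  Snapshot Snap = getSnapshot(TLD, Storage);
  // Defensive null check, as suggested by the "if (TLD.FCT)" wrapping.
  if (Snap.StatusSeen != INITIALIZED || Snap.TLD.FCT == nullptr)
    return false;
  ++Snap.TLD.FCT->Events;
  return true;
}
```

This only removes the window between two separate loads; it does not by itself address what happens to data recorded while another thread is finalizing.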

in the chain* I meant.

fixup: Use updated name for flags
fixup: address comments by kpw@, rename to use profiling instead of profiler

dberris planned changes to this revision. Jun 1 2018, 2:19 AM

dberris added inline comments.

compiler-rt/lib/xray/xray_profiler.cc
41	Renamed to ProfilingData instead.
142–143	Yeah, we'll need to refactor this to instead use the same reentrance guard across all the implementations (FDR and Basic). I'll add a dependency to the C++ ABI changes which has those changes.
189–190	Yep, we talked about this offline and we'll need to fix this across the implementations anyway. Let's fix that later.
198	serialize() has internal synchronisation, so we're relying on that synchronisation to do the right thing.
210–213	Yeah, unfortunately it seems that we're going to need to make it so that once an implementation has flushed, it should go back to UNINITIALIZED. Either that, or we're going to have to be a bit more clever about this.
229–230	Yes, but we're really just ignoring them here -- the parser will already report if the verbosity is high enough.
compiler-rt/test/xray/TestCases/Posix/profiling-multi-threaded.cc
4–5	There are two parts here -- there's the mode, which in this case is a name for the implementation. We're using "profiler" to be consistent with "flight data recorder" and "basic". We could just make this "xray-profiling" all throughout, which I think would make it much simpler -- and pretend that "FDR" is "flight data recording" instead. ;)
8	Good point. Yes, need to require Linux for now.
compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc
47	Good call, let me do that in the next round.
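
The state transitions discussed for lines 210–213 can be written down as a small state machine. This sketch assumes the cycle described in the reply, where a flush returns the implementation to UNINITIALIZED so that a second round can start. It uses `std::atomic` with acquire-release ordering on the compare-exchange, in the spirit of the comment on line 105; `ModeState` and its methods are hypothetical and not part of the XRay API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum LogStatus : std::int32_t {
  UNINITIALIZED,
  INITIALIZING,
  INITIALIZED,
  FINALIZING,
  FINALIZED
};

class ModeState {
  std::atomic<std::int32_t> Status{UNINITIALIZED};

  // Move From -> To only if the current status is From. Exactly one of
  // several racing callers can succeed.
  bool transition(std::int32_t From, std::int32_t To) {
    return Status.compare_exchange_strong(From, To, std::memory_order_acq_rel,
                                          std::memory_order_acquire);
  }

public:
  bool init() {
    if (!transition(UNINITIALIZED, INITIALIZING))
      return false;
    // Setup work would happen here, visible to others after the store below.
    Status.store(INITIALIZED, std::memory_order_release);
    return true;
  }

  bool finalize() {
    if (!transition(INITIALIZED, FINALIZING))
      return false;
    // Collection and serialization would happen here.
    Status.store(FINALIZED, std::memory_order_release);
    return true;
  }

  // Flushing is only valid after finalization and resets the cycle.
  bool flush() { return transition(FINALIZED, UNINITIALIZED); }

  std::int32_t current() const {
    return Status.load(std::memory_order_acquire);
  }
};
```

Under this model, calling `init()` directly after `finalize()` fails, and a second round of profiling requires `flush()` first.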

Harbormaster completed remote builds in B18809: Diff 149413. Jun 1 2018, 2:19 AM

dberris edited the summary of this revision. Jun 1 2018, 2:22 AM

dberris removed a reviewer: echristo.

dberris added a parent revision: D46998: [XRay][compiler-rt] Remove reliance on C++ ABI features.

Rebase after removing ABI reliance and refactoring of RecursionGuard.

fixup: s/__sanitizer:://g + clang-format
fixup: use the common recursion guard
fixup: remove C++ ABI dependency
fixup: do two rounds of profiling
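
For readers unfamiliar with the recursion guard mentioned in these fixups, the following sketch shows the general idea as a scoped object over a thread-local flag. It is not the XRay `RecursionGuard` itself, whose interface is not shown in this review; `ScopedGuard`, `InHandler`, and `handler` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical per-thread flag marking that the handler is already running.
thread_local bool InHandler = false;

// Sets the flag on construction if it was clear, and clears it again on
// destruction, so every return path releases the guard.
class ScopedGuard {
  bool &Flag;
  bool Acquired;

public:
  explicit ScopedGuard(bool &F) : Flag(F), Acquired(!F) {
    if (Acquired)
      Flag = true;
  }
  ~ScopedGuard() {
    if (Acquired)
      Flag = false;
  }
  ScopedGuard(const ScopedGuard &) = delete;
  ScopedGuard &operator=(const ScopedGuard &) = delete;

  // True if this guard owns the flag, false if we re-entered.
  explicit operator bool() const { return Acquired; }
};

// Returns how many handler bodies actually ran, including nested attempts.
int handler(int Depth) {
  ScopedGuard Guard(InHandler);
  if (!Guard)
    return 0; // re-entrant call: do nothing
  int Ran = 1;
  if (Depth > 0)
    Ran += handler(Depth - 1); // simulated re-entry from inside the handler
  return Ran;
}
```

Compared with setting and clearing a flag by hand on each exit path, the scoped form cannot forget to reset the flag on an early return.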

This is now ready for a look @kpw.

I'm going to have to download the renamed files to diff locally. Is there a way to do this in Differential that I'm missing?

compiler-rt/test/xray/TestCases/Posix/profiling-multi-threaded.cc
8	I think this still must be done before you submit.

In D44620#1127115, @kpw wrote:

I'm going to have to download the renamed files to diff locally. Is there a way to do this in Differential that I'm missing?

I'm not sure, but it looks like Differential already sees the rename/merge -- is there something else you're looking for?

compiler-rt/test/xray/TestCases/Posix/profiling-multi-threaded.cc
8	I looked into this and we already only enable all the tests for Linux from the lit configurations.

kpw accepted this revision. Jun 11 2018, 6:35 PM

This revision is now accepted and ready to land. Jun 11 2018, 6:35 PM

Closed by commit rL334469: [XRay][profiler] Part 4: Profiler Mode Wiring (authored by dberris). Jun 11 2018, 8:34 PM

This revision was automatically updated to reflect the committed changes.

dberris marked an inline comment as done.

Herald added a subscriber: delcypher. Jun 11 2018, 8:34 PM

Revision Contents

Path	Size
compiler-rt/lib/xray/CMakeLists.txt	16 lines
compiler-rt/lib/xray/xray_profiler.cc	296 lines
compiler-rt/test/xray/TestCases/Posix/profiling-multi-threaded.cc	53 lines
compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc	48 lines

Diff 142896

compiler-rt/lib/xray/CMakeLists.txt

[12 lines not shown] set(XRAY_FDR_MODE_SOURCES
   xray_buffer_queue.cc
   xray_fdr_logging.cc)

 set(XRAY_BASIC_MODE_SOURCES
   xray_inmemory_log.cc)

 set(XRAY_PROFILER_MODE_SOURCES
   xray_profile_collector.cc
+  xray_profiler.cc
   xray_profiler_flags.cc)

 # Implementation files for all XRay architectures.
 set(x86_64_SOURCES
   xray_x86_64.cc
   xray_trampoline_x86_64.S)

 set(arm_SOURCES
[161 lines not shown] foreach(arch ${XRAY_SUPPORTED_ARCH})
   # Basic mode runtime archive (addon for clang_rt.xray)
   add_compiler_rt_runtime(clang_rt.xray-basic
     STATIC
     ARCHS ${arch}
     CFLAGS ${XRAY_CFLAGS}
     DEFS ${XRAY_COMMON_DEFINITIONS}
     OBJECT_LIBS RTXrayBASIC
     PARENT_TARGET xray)
+  # Profiler Mode runtime
   add_compiler_rt_runtime(clang_rt.xray-profiler
     STATIC
     ARCHS ${arch}
     CFLAGS ${XRAY_CFLAGS}
     DEFS ${XRAY_COMMON_DEFINITIONS}
     OBJECT_LIBS RTXrayPROFILER
     PARENT_TARGET xray)
   endforeach()
 endif() # not Apple

 if(COMPILER_RT_INCLUDE_TESTS)
   add_subdirectory(tests)
 endif()

compiler-rt/lib/xray/xray_profiler.cc

This file was added.

//===-- xray_profiling.cc --------------------------------------- C++ --===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file is a part of XRay, a dynamic runtime instrumentation system.
//
// This is the implementation of a profiling handler.
//
//===----------------------------------------------------------------------===//
#include <memory>

#include "sanitizer_common/sanitizer_atomic.h"
#include "sanitizer_common/sanitizer_flags.h"
#include "xray/xray_interface.h"
#include "xray/xray_log_interface.h"

#include "xray_flags.h"
#include "xray_profile_collector.h"
#include "xray_profiler_flags.h"
#include "xray_tsc.h"
#include "xray_utils.h"
#include <pthread.h>

namespace __xray {

namespace {

__sanitizer::atomic_sint32_t ProfilerLogFlushStatus = {
    XRayLogFlushStatus::XRAY_LOG_NOT_FLUSHING};

__sanitizer::atomic_sint32_t ProfilerLogStatus = {
    XRayLogInitStatus::XRAY_LOG_UNINITIALIZED};

__sanitizer::SpinMutex ProfilerOptionsMutex;

struct alignas(64) ProfilingTLD {
[Inline comment] kpw (Done): Can you expand the initialism to thread-local data in a comment? I always think TopLevelDomain when I see this and after jumping to definition that would help refresh my memory.
[Inline comment] dberris (Not Done): Renamed to ProfilingData instead.
  FunctionCallTrie::Allocators *Allocators = nullptr;
  FunctionCallTrie *FCT = nullptr;
};

static pthread_key_t ProfilingKey;

ProfilingTLD &getThreadLocalData() XRAY_NEVER_INSTRUMENT {
  thread_local ProfilingTLD TLD;
  thread_local bool UNUSED Once = [] {
    pthread_setspecific(ProfilingKey, &TLD);
[Inline comment] kpw (Done): Don't you need pthread_create first for ProfilingKey?
    return false;
  }();

  // We need to check whether the global flag to finalizing/finalized has been
  // switched. If it is, then we ought to not actually initialise the data.
  auto Status = __sanitizer::atomic_load(&ProfilerLogStatus,
                                         __sanitizer::memory_order_acquire);
  if (Status == XRayLogInitStatus::XRAY_LOG_FINALIZING ||
      Status == XRayLogInitStatus::XRAY_LOG_FINALIZED)
    return TLD;

  // If we're live, then we re-initialize TLD if the pointers are not null.
  if (UNLIKELY(TLD.Allocators == nullptr && TLD.FCT == nullptr)) {
    TLD.Allocators = reinterpret_cast<FunctionCallTrie::Allocators *>(
        InternalAlloc(sizeof(FunctionCallTrie::Allocators)));
    new (TLD.Allocators) FunctionCallTrie::Allocators();
    *TLD.Allocators = FunctionCallTrie::InitAllocators();
    TLD.FCT = reinterpret_cast<FunctionCallTrie *>(
        InternalAlloc(sizeof(FunctionCallTrie)));
    new (TLD.FCT) FunctionCallTrie(*TLD.Allocators);
  }

  return TLD;
}

} // namespace

const char *profilerCompilerDefinedFlags() XRAY_NEVER_INSTRUMENT {
#ifdef XRAY_PROFILER_DEFAULT_OPTIONS
  return SANITIZER_STRINGIFY(XRAY_PROFILER_DEFAULT_OPTIONS);
#else
  return "";
#endif
}

__sanitizer::atomic_sint32_t ProfileFlushStatus = {
    XRayLogFlushStatus::XRAY_LOG_NOT_FLUSHING};

XRayLogFlushStatus profilingFlush() XRAY_NEVER_INSTRUMENT {
  // When flushing, all we really do is reset the global state, and only when
  // the log has already been finalized.
  if (__sanitizer::atomic_load(&ProfilerLogStatus,
                               __sanitizer::memory_order_acquire) !=
      XRayLogInitStatus::XRAY_LOG_FINALIZED) {
    if (__sanitizer::Verbosity())
      Report("Not flushing profiles, profiler not been finalized.\n");
    return XRayLogFlushStatus::XRAY_LOG_NOT_FLUSHING;
  }

  s32 Result = XRayLogFlushStatus::XRAY_LOG_NOT_FLUSHING;
  if (!__sanitizer::atomic_compare_exchange_strong(
          &ProfilerLogFlushStatus, &Result,
          XRayLogFlushStatus::XRAY_LOG_FLUSHING,
          __sanitizer::memory_order_release)) {
[Inline comment] kpw (Done): Seems to me this should be memory_order_acquire_release if we want mutual exclusion of profileCollectorService::reset() from another thread.
    if (__sanitizer::Verbosity())
      Report("Not flushing profiles, implementation still finalizing.\n");
  }

  profileCollectorService::reset();

  __sanitizer::atomic_store(&ProfilerLogStatus,
                            XRayLogFlushStatus::XRAY_LOG_FLUSHED,
                            __sanitizer::memory_order_release);

  return XRayLogFlushStatus::XRAY_LOG_FLUSHED;
}

namespace {

thread_local volatile bool ReentranceGuard = false;

void postCurrentThreadFCT(ProfilingTLD &TLD) {
  if (TLD.Allocators == nullptr || TLD.FCT == nullptr)
    return;

  profileCollectorService::post(*TLD.FCT, GetTid());
  TLD.FCT->~FunctionCallTrie();
  TLD.Allocators->~Allocators();
  InternalFree(TLD.FCT);
  InternalFree(TLD.Allocators);
  TLD.FCT = nullptr;
  TLD.Allocators = nullptr;
}

} // namespace

void profilingHandleArg0(int32_t FuncId,
                         XRayEntryType Entry) XRAY_NEVER_INSTRUMENT {
  unsigned char CPU;
  auto TSC = readTSC(CPU);
  if (ReentranceGuard)
    return;
[Inline comment] kpw (Not Done): Maybe check verbosity and Report?
[Inline comment] dberris (Not Done): Yeah, we'll need to refactor this to instead use the same reentrance guard across all the implementations (FDR and Basic). I'll add a dependency to the C++ ABI changes which has those changes.
  ReentranceGuard = true;

  auto Status = __sanitizer::atomic_load(&ProfilerLogStatus,
                                         __sanitizer::memory_order_acquire);
  auto &TLD = getThreadLocalData();
  if (UNLIKELY(Status == XRayLogInitStatus::XRAY_LOG_FINALIZED ||
               Status == XRayLogInitStatus::XRAY_LOG_FINALIZING)) {
    postCurrentThreadFCT(TLD);
    ReentranceGuard = false;
    return;
  }

  switch (Entry) {
  case XRayEntryType::ENTRY:
  case XRayEntryType::LOG_ARGS_ENTRY:
    TLD.FCT->enterFunction(FuncId, TSC);
    break;
  case XRayEntryType::EXIT:
  case XRayEntryType::TAIL:
    TLD.FCT->exitFunction(FuncId, TSC);
    break;
  default:
    // FIXME: Handle bugs.
    break;
  }
[Inline comment] kpw (Not Done): I think this should be wrapped in an "if (TLD.FCT)" block. It's definitely an edge case, but if ProfilerLogStatus is INITIALIZED, and the atomic load happens and then, before the next statement, a context switch and an update to FINALIZING happen, then the TLD can be uninitialized, but the status check won't know about the FINALIZING state and we'll dereference a null TLD.FCT. I think there is a similar problem with moving the GetTLD() before the atomic_load for the transition from FINALIZED to INITIALIZING. You could solve it by having GetTLD return the TLD reference and status from the load. It also might be worth documenting that InternalAlloc won't return null (as opposed to the FCT allocator), so we're relying on that.

  ReentranceGuard = false;
}

void profilingHandleArg1(int32_t FuncId, XRayEntryType Entry,
                         uint64_t) XRAY_NEVER_INSTRUMENT {
  return profilingHandleArg0(FuncId, Entry);
}

				XRayLogInitStatus profilingFinalize() XRAY_NEVER_INSTRUMENT {
				s32 CurrentStatus = XRayLogInitStatus::XRAY_LOG_INITIALIZED;
				if (!__sanitizer::atomic_compare_exchange_strong(
				&ProfilerLogStatus, &CurrentStatus,
				XRayLogInitStatus::XRAY_LOG_FINALIZING,
				__sanitizer::memory_order_release)) {
				if (__sanitizer::Verbosity())
				Report("Cannot finalize profile, the profiler is not initialized.\n");
				return static_cast<XRayLogInitStatus>(CurrentStatus);
				}

				// Wait a grace period to allow threads to see that we're finalizing.
				__sanitizer::SleepForMillis(profilerFlags()->xray_profiling_grace_period_ms);
				kpw: Not scoped to profiling mode: this strategy still makes me uncomfortable for cases where not much is instrumented (e.g. event tracing), but I don't have a better solution fleshed out. It kind of feels like a cooperative scheduling problem where threads should check periodically if they're cancelled. Maybe we could have an option to add sleds that just do a finalizing check without instrumenting, so that sparse instrumentation is able to respect the grace period.
				dberris: Yep, we talked about this offline and we'll need to fix this across the implementations anyway. Let's fix that later.

				// We also want to make sure that the current thread's data is cleaned up, if
				// we have any.
				auto &TLD = getThreadLocalData();
				postCurrentThreadFCT(TLD);

				// Then we force serialize the log data.
				profileCollectorService::serialize();
				kpw: Is this protected from simultaneous calls to postCurrentThreadFCT if other threads take a while to see they're finalized?
				dberris: serialize() has internal synchronisation, so we're relying on that synchronisation to do the right thing.

				__sanitizer::atomic_store(&ProfilerLogStatus,
				XRayLogInitStatus::XRAY_LOG_FINALIZED,
				__sanitizer::memory_order_release);
				return XRayLogInitStatus::XRAY_LOG_FINALIZED;
				}

				XRayLogInitStatus
				profilingLoggingInit(size_t BufferSize, size_t BufferMax, void *Options,
				size_t OptionsSize) XRAY_NEVER_INSTRUMENT {
				s32 CurrentStatus = XRayLogInitStatus::XRAY_LOG_UNINITIALIZED;
				if (!__sanitizer::atomic_compare_exchange_strong(
				&ProfilerLogStatus, &CurrentStatus,
				XRayLogInitStatus::XRAY_LOG_INITIALIZING,
				__sanitizer::memory_order_release)) {
				kpw: Do you have that graph of valid state transitions? I thought it was OK to go from FINALIZED back to INITIALIZED without going back to UNINITIALIZED.
				dberris: Yeah, unfortunately it seems that we're going to need to make it so that once an implementation has flushed, it should go back to UNINITIALIZED. Either that, or we're going to have to be a bit more clever about this.
				if (__sanitizer::Verbosity())
				Report(
				"Cannot initialize already initialised profiling implementation.\n");
				return static_cast<XRayLogInitStatus>(CurrentStatus);
				}

				// Here we use the sanitizer flag parsing mechanism. When PR36790 is fixed,
				// migrate to using a different API for configuration.
				{
				SpinMutexLock Lock(&ProfilerOptionsMutex);
				FlagParser ConfigParser;
				auto *F = profilerFlags();
				F->setDefaults();
				registerProfilerFlags(&ConfigParser, F);
				const char *ProfilerCompileFlags = profilerCompilerDefinedFlags();
				ConfigParser.ParseString(ProfilerCompileFlags);
				ConfigParser.ParseString(static_cast<const char *>(Options));
				kpw: Can these error for bad/missing flags?
				dberris: Yes, but we're really just ignoring them here -- the parser will already report if the verbosity is high enough.
				if (Verbosity())
				ReportUnrecognizedFlags();
				}

				// We need to reset the profile data collection implementation now.
				profileCollectorService::reset();

				// We need to set up the at-thread-exit handler.
				static bool UNUSED Once = [] {
				pthread_key_create(&ProfilingKey, +[](void *) {
				kpw: Ahh. Can you just make a comment near the pthread_setspecific that INITIALIZE is responsible for calling pthread_key_create?
				// This is the thread-exit handler.
				auto &TLD = getThreadLocalData();
				if (TLD.Allocators == nullptr && TLD.FCT == nullptr)
				return;

				postCurrentThreadFCT(TLD);
				});
				return false;
				}();

				__xray_log_set_buffer_iterator(profileCollectorService::nextBuffer);
				__xray_set_handler(profilingHandleArg0);
				__xray_set_handler_arg1(profilingHandleArg1);

				__sanitizer::atomic_store(&ProfilerLogStatus,
				XRayLogInitStatus::XRAY_LOG_INITIALIZED,
				__sanitizer::memory_order_release);
				if (__sanitizer::Verbosity())
				Report("XRay Profiling init successful.\n");

				return XRayLogInitStatus::XRAY_LOG_INITIALIZED;
				}

				bool profilingDynamicInitializer() XRAY_NEVER_INSTRUMENT {
				// Set up the flag defaults from the static defaults and the compiler-provided
				// defaults.
				{
				SpinMutexLock Lock(&ProfilerOptionsMutex);
				auto *F = profilerFlags();
				F->setDefaults();
				FlagParser ProfilingParser;
				registerProfilerFlags(&ProfilingParser, F);
				const char *ProfilerCompileFlags = profilerCompilerDefinedFlags();
				ProfilingParser.ParseString(ProfilerCompileFlags);
				}

				XRayLogImpl Impl{
				profilingLoggingInit,
				profilingFinalize,
				profilingHandleArg0,
				profilingFlush,
				};
				auto RegistrationResult = __xray_log_register_mode("xray-profiling", Impl);
				if (RegistrationResult != XRayLogRegisterStatus::XRAY_REGISTRATION_OK &&
				__sanitizer::Verbosity())
				Report(
				"Cannot register XRay Profiling mode to 'xray-profiling'; error = %d\n",
				RegistrationResult);
				if (!__sanitizer::internal_strcmp(flags()->xray_mode, "xray-profiling"))
				__xray_set_log_impl(Impl);
				return true;
				}

				} // namespace __xray

				static auto UNUSED Unused = __xray::profilingDynamicInitializer();
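
The status handling in profilingLoggingInit and profilingFinalize, together with the exchange above about which transitions are valid, can be modelled outside the runtime. The following is a minimal sketch, not the patch's code: it uses std::atomic instead of the sanitizer atomics, the Status enum and ModeStatus class are hypothetical stand-ins, and the flush() step returning to UNINITIALIZED is an assumption taken from dberris's reply rather than from anything the patch implements.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for XRayLogInitStatus; the names mirror the patch.
enum Status : int {
  UNINITIALIZED,
  INITIALIZING,
  INITIALIZED,
  FINALIZING,
  FINALIZED
};

class ModeStatus {
public:
  // UNINITIALIZED -> INITIALIZING -> INITIALIZED
  Status init() { return transition(UNINITIALIZED, INITIALIZING, INITIALIZED); }

  // INITIALIZED -> FINALIZING -> FINALIZED
  Status finalize() { return transition(INITIALIZED, FINALIZING, FINALIZED); }

  // Assumption: flushing is what returns the mode to UNINITIALIZED.
  bool flush() {
    int Expected = FINALIZED;
    return State.compare_exchange_strong(Expected, UNINITIALIZED,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire);
  }

private:
  // Claims the From -> Via edge with a compare-exchange, then publishes To.
  // If another status is observed, that status is returned unchanged, which
  // is what the patch does with CurrentStatus.
  Status transition(Status From, Status Via, Status To) {
    int Expected = From;
    if (!State.compare_exchange_strong(Expected, Via,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire))
      return static_cast<Status>(Expected);
    // ... the setup or teardown work would happen here ...
    State.store(To, std::memory_order_release);
    return To;
  }

  std::atomic<int> State{UNINITIALIZED};
};

// Runs two full init/finalize/flush rounds; returns true if both succeed.
bool runTwoRounds() {
  ModeStatus M;
  for (int Round = 0; Round < 2; ++Round) {
    Status AfterInit = M.init();
    Status AfterFinalize = M.finalize();
    bool Flushed = M.flush();
    if (AfterInit != INITIALIZED || AfterFinalize != FINALIZED)
      return false;
    assert(Flushed && "flush must succeed right after finalize");
    if (!Flushed)
      return false;
  }
  return true;
}
```

Under this model, init() called on a finalized but unflushed mode reports FINALIZED instead of re-initializing, which matches the behaviour kpw asks about.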

compiler-rt/test/xray/TestCases/Posix/profiling-multi-threaded.cc

This file was added.

				// Check that we can get a profile from a multi-threaded application, on
				// demand through the XRay logging implementation API.
				//
				// FIXME: Make -fxray-modes=xray-profiling part of the default?
				// RUN: %clangxx_xray -std=c++11 %s -o %t -fxray-modes=xray-profiler
				kpw: Is it xray-profiler or xray-profiling? Does the flag not match the mode string in code? You assert on xray-profiling and set the mode to that in code below.
				dberris: There are two parts here -- there's the mode, which in this case is a name for the implementation. We're using "profiler" to be consistent with "flight data recorder" and "basic". We could just make this "xray-profiling" all throughout, which I think would make it much simpler -- and pretend that "FDR" is "flight data recording" instead. ;)
				// RUN: %run %t
				//
				// UNSUPPORTED: target-is-mips64,target-is-mips64el
				kpw: Do you need a thing to exclude Windows since you're calling readtsc()?
				dberris: Good point. Yes, need to require Linux for now.
				kpw: I think this still must be done before you submit.
				dberris: I looked into this and we already only enable all the tests for Linux from the lit configurations.

				#include "xray/xray_interface.h"
				#include "xray/xray_log_interface.h"
				#include <cassert>
				#include <cstdio>
				#include <string>
				#include <thread>

				#define XRAY_ALWAYS_INSTRUMENT [[clang::xray_always_instrument]]
				#define XRAY_NEVER_INSTRUMENT [[clang::xray_never_instrument]]

				XRAY_ALWAYS_INSTRUMENT void f2() { return; }
				XRAY_ALWAYS_INSTRUMENT void f1() { f2(); }
				XRAY_ALWAYS_INSTRUMENT void f0() { f1(); }

				using namespace std;

				volatile int buffer_counter = 0;

				XRAY_NEVER_INSTRUMENT void process_buffer(const char *, XRayBuffer) {
				// FIXME: Actually assert the contents of the buffer.
				kpw: Yes please. ;)
				++buffer_counter;
				}

				XRAY_ALWAYS_INSTRUMENT int main(int, char **) {
				assert(__xray_log_select_mode("xray-profiling") ==
				XRayLogRegisterStatus::XRAY_REGISTRATION_OK);
				assert(__xray_log_get_current_mode() != nullptr);
				std::string current_mode = __xray_log_get_current_mode();
				assert(current_mode == "xray-profiling");
				assert(__xray_patch() == XRayPatchingStatus::SUCCESS);
				assert(__xray_log_init(0, 0, nullptr, 0) ==
				XRayLogInitStatus::XRAY_LOG_INITIALIZED);
				std::thread t0([] { f0(); });
				std::thread t1([] { f0(); });
				f0();
				t0.join();
				t1.join();
				assert(__xray_log_finalize() == XRayLogInitStatus::XRAY_LOG_FINALIZED);
				assert(__xray_log_process_buffers(process_buffer) ==
				XRayLogFlushStatus::XRAY_LOG_FLUSHED);
				// We're running three threads, so we expect three buffers.
				assert(buffer_counter == 3);
				assert(__xray_log_flushLog() == XRayLogFlushStatus::XRAY_LOG_FLUSHED);
				}
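
The expectation of three buffers follows from how per-thread data reaches the collector: the thread-exit handler registered through pthread_key_create posts the data of each exiting thread, and profilingFinalize posts the data of the thread that calls it. A minimal sketch of that mechanism with plain POSIX calls, assuming one posted buffer per thread; PostedBuffers, postThreadData, recordCall and countBuffersForThreeThreads are hypothetical stand-ins, not names from the patch:

```cpp
#include <pthread.h>

#include <atomic>
#include <cassert>
#include <thread>

namespace {

std::atomic<int> PostedBuffers{0}; // hypothetical collector: counts buffers
pthread_key_t ExitKey;
pthread_once_t KeyOnce = PTHREAD_ONCE_INIT;

// Stand-in for postCurrentThreadFCT: hands one thread's data to the collector.
void postThreadData(void *) {
  PostedBuffers.fetch_add(1, std::memory_order_relaxed);
}

void createKey() { pthread_key_create(&ExitKey, postThreadData); }

// Stand-in for an instrumented call: registers the thread on first use.
void recordCall() {
  pthread_once(&KeyOnce, createKey);
  // POSIX only runs the key destructor for threads whose slot is non-null,
  // so a thread that never records anything never posts a buffer.
  if (pthread_getspecific(ExitKey) == nullptr)
    pthread_setspecific(ExitKey, &PostedBuffers);
}

} // namespace

// Mirrors the shape of the test: two worker threads plus the calling thread.
int countBuffersForThreeThreads() {
  PostedBuffers.store(0);
  std::thread T0(recordCall);
  std::thread T1(recordCall);
  recordCall();
  T0.join(); // the workers' destructors have run once join() returns
  T1.join();
  // The calling thread has not exited, so its data is posted by hand, as
  // profilingFinalize does, and its slot is cleared afterwards.
  postThreadData(nullptr);
  pthread_setspecific(ExitKey, nullptr);
  assert(pthread_getspecific(ExitKey) == nullptr);
  return PostedBuffers.load();
}
```

Calling countBuffersForThreeThreads() from main yields one buffer per thread. Clearing the slot at the end keeps the model from posting the calling thread's data a second time.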

compiler-rt/test/xray/TestCases/Posix/profiling-single-threaded.cc

This file was added.

				// Check that we can get a profile from a single-threaded application, on
				// demand through the XRay logging implementation API.
				//
				// FIXME: Make -fxray-modes=xray-profiling part of the default?
				// RUN: %clangxx_xray -std=c++11 %s -o %t -fxray-modes=xray-profiler
				kpw: Similar flag confusion.
				// RUN: %run %t
				//
				// UNSUPPORTED: target-is-mips64,target-is-mips64el

				#include "xray/xray_interface.h"
				#include "xray/xray_log_interface.h"
				#include <cassert>
				#include <cstdio>
				#include <string>

				#define ALWAYS_INSTRUMENT [[clang::xray_always_instrument]]
				#define NEVER_INSTRUMENT [[clang::xray_never_instrument]]
				pelikan: These local macros don't make it that much shorter or more readable. Consider either removing "XRAY_" or dropping them.
				dberris: Yeah, unfortunately without these clang-format gets confused. :D

				ALWAYS_INSTRUMENT void f2() { return; }
				ALWAYS_INSTRUMENT void f1() { f2(); }
				ALWAYS_INSTRUMENT void f0() { f1(); }

				using namespace std;

				volatile int buffer_counter = 0;

				NEVER_INSTRUMENT void process_buffer(const char *, XRayBuffer) {
				// FIXME: Actually assert the contents of the buffer.
				++buffer_counter;
				}

				ALWAYS_INSTRUMENT int main(int, char **) {
				assert(__xray_log_select_mode("xray-profiling") ==
				XRayLogRegisterStatus::XRAY_REGISTRATION_OK);
				assert(__xray_log_get_current_mode() != nullptr);
				std::string current_mode = __xray_log_get_current_mode();
				assert(current_mode == "xray-profiling");
				assert(__xray_patch() == XRayPatchingStatus::SUCCESS);
				assert(__xray_log_init(0, 0, nullptr, 0) ==
				XRayLogInitStatus::XRAY_LOG_INITIALIZED);
				f0();
				assert(__xray_log_finalize() == XRayLogInitStatus::XRAY_LOG_FINALIZED);
				f0();
				assert(__xray_log_process_buffers(process_buffer) ==
				XRayLogFlushStatus::XRAY_LOG_FLUSHED);
				assert(buffer_counter == 1);
				assert(__xray_log_flushLog() == XRayLogFlushStatus::XRAY_LOG_FLUSHED);
				kpw: Could be illustrative and increase coverage to have a test case that verifies that profiling mode can turn back on after a "round."
				dberris: Good call, let me do that in the next round.
				}
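
The test calls f0() a second time after __xray_log_finalize() and still expects a single buffer, which relies on the handler ignoring calls once the mode is no longer INITIALIZED. A rough sketch of such a handler, assuming a thread-local reentrancy guard like the ReentranceGuard in the patch; the sketch namespace, the counter standing in for the function call trie, and every name below are hypothetical:

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

enum class Status : int { Uninitialized, Initialized, Finalized };

std::atomic<int> LogStatus{static_cast<int>(Status::Uninitialized)};
thread_local bool InHandler = false; // reentrancy guard
thread_local int RecordedCalls = 0;  // stand-in for the per-thread call trie

// Stand-in for the function entry/exit handler.
void handleCall() {
  if (InHandler)
    return; // invoked from code the handler itself called; ignore
  InHandler = true;
  // Only record while the mode is INITIALIZED; calls made before
  // initialization or after finalization are dropped.
  if (LogStatus.load(std::memory_order_acquire) ==
      static_cast<int>(Status::Initialized))
    ++RecordedCalls;
  InHandler = false;
}

// Mirrors the test: one call while initialized, one after finalizing.
int callsRecordedAroundFinalize() {
  RecordedCalls = 0;
  LogStatus.store(static_cast<int>(Status::Initialized),
                  std::memory_order_release);
  handleCall();
  LogStatus.store(static_cast<int>(Status::Finalized),
                  std::memory_order_release);
  handleCall(); // corresponds to the second f0() in the test
  assert(!InHandler && "the guard must be released on every path");
  return RecordedCalls;
}

} // namespace sketch
```

Only the first call is counted here. The sketch does not address the race kpw raises, where the status changes between the load and the use of the thread-local data.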