This is an archive of the discontinued LLVM Phabricator instance.

Differential D45757

[XRay][profiler] Part 2: XRay Function Call Trie
ClosedPublic

Authored by dberris on Apr 18 2018, 1:03 AM.

Details

Reviewers

echristo
pelikan
kpw

Commits

rG980d93d0e094: [XRay][profiler] Part 2: XRay Function Call Trie
rCRT332313: [XRay][profiler] Part 2: XRay Function Call Trie
rL332313: [XRay][profiler] Part 2: XRay Function Call Trie

Summary

This is part of the larger XRay Profiling Mode effort.

This patch implements a central data structure for capturing statistics
about XRay instrumented function call stacks. The FunctionCallTrie
type does the following things:

It keeps track of a shadow function call stack of XRay instrumented functions as they are entered (function enter event) and as they are exited (function exit event).

When a function is entered, the shadow stack records the entry TSC, and we update the trie (or prefix tree) representing the current function call stack. If we haven't encountered this function call before, this creates a unique node for the function in this position on the stack. We also update the list of callees of the parent function to reflect this newly found path.

When a function is exited, we compute statistics (TSC deltas, function call count frequency) for the associated function(s) up the stack as we unwind to find the matching entry event.
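
The behaviour described above can be sketched in a few lines of standalone C++. This is a minimal illustration under stated assumptions, not the code in the patch: `SketchTrie`, `SketchNode` and `root()` are hypothetical names, standard containers stand in for the XRay Allocator and Array types, and "local time" is assumed to mean elapsed time minus time spent in callees that have already exited. Dropping an exit that has no matching entry is also an assumption.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for the FunctionCallTrie described above.
// Standard containers replace the XRay Allocator and Array types.
struct SketchNode {
  int32_t FId = 0;
  uint64_t CallCount = 0;
  uint64_t CumulativeLocalTime = 0;
  std::map<int32_t, std::unique_ptr<SketchNode>> Callees;
};

class SketchTrie {
  struct Frame {
    SketchNode *Node;
    uint64_t EntryTSC;
    uint64_t CalleeTime; // Time spent in callees that have already exited.
  };
  std::map<int32_t, std::unique_ptr<SketchNode>> Roots;
  std::vector<Frame> ShadowStack;

public:
  // Function enter event: find or create the node for FId at this position
  // in the call stack, then push a shadow stack frame with the entry TSC.
  void enterFunction(int32_t FId, uint64_t TSC) {
    auto &Siblings =
        ShadowStack.empty() ? Roots : ShadowStack.back().Node->Callees;
    auto &Slot = Siblings[FId];
    if (!Slot) {
      Slot = std::make_unique<SketchNode>();
      Slot->FId = FId;
    }
    ShadowStack.push_back({Slot.get(), TSC, 0});
  }

  // Function exit event: unwind to the matching entry, treating every frame
  // popped on the way as if it exited at the same TSC.
  void exitFunction(int32_t FId, uint64_t TSC) {
    bool Found = false;
    for (const Frame &F : ShadowStack)
      Found = Found || F.Node->FId == FId;
    if (!Found)
      return; // Assumption: an exit with no matching entry is dropped.

    while (!ShadowStack.empty()) {
      Frame Top = ShadowStack.back();
      ShadowStack.pop_back();
      uint64_t Elapsed = TSC - Top.EntryTSC;
      Top.Node->CallCount++;
      Top.Node->CumulativeLocalTime += Elapsed - Top.CalleeTime;
      if (!ShadowStack.empty())
        ShadowStack.back().CalleeTime += Elapsed;
      if (Top.Node->FId == FId)
        break;
    }
  }

  // Returns the root node for FId, or nullptr if there is none.
  const SketchNode *root(int32_t FId) const {
    auto It = Roots.find(FId);
    return It == Roots.end() ? nullptr : It->second.get();
  }
};
```

With enters for functions 1, 2 and 3 at TSC 0, 100 and 200, an exit for 3 at 300 and an exit for 1 at 400, this sketch attributes local times of 100, 200 and 100 to functions 3, 2 and 1. The patch's own tests may expect different numbers, since its accounting rule need not match the assumption made here.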

This builds upon the XRay Allocator and Array types in Part 1 of
this series of patches.

Depends on D45756.

Diff Detail

Build Status

Buildable 18081
Build 18081: arc lint + arc unit

Event Timeline

dberris created this revision. Apr 18 2018, 1:03 AM

Herald added a subscriber: mgorny. Apr 18 2018, 1:03 AM

dberris added a child revision: D45758: [XRay][profiler] Part 3: Profile Collector Service. Apr 18 2018, 1:05 AM

Rebase

Harbormaster completed remote builds in B17523: Diff 144479. Apr 29 2018, 6:58 AM

This was easier to review than I expected. I found it easier to follow than the allocator/array CL. Sorry for the long delay!

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
57	Might as well test MissingFunctionExit as well for symmetry.
84–92	The way this test is written, the two roots look exactly identical. There is a possible error case where the trie returns the same root twice that is undetected. You might have one function spend more time so that this is detected. Then you might want the test to assert ignoring order. Up to you whether it's worth it.
141–143	Imho, this would be easier to interpret if the nodes captured CumulativeTreeTime instead of CumulativeLocalTime. Then it would be F3 -> 100, F2 -> 300, F1 -> 400.
191–193	Is sharing an initialized allocator expected to be OK? Why not use the same allocator for FunctionCallTrie Merged below then?
219–221	Only check the root? There are only two other nodes to check, might as well verify them.
compiler-rt/lib/xray/xray_function_call_trie.h
38	visulaise -> visualize
96	There are no reference members here. Maybe call it NodeIdPair?
218	Don't think so. This looks like the end of "struct Allocators {" to me.
319	To me, it is less intuitive to capture CumulativeLocalTime than CumulativeTreeTime in each Node. With one you can compute the other given the callees, but if I'm analyzing function latencies, I care about time spent in the callees just as much. Why did you make this choice?
337–338	Should this be called with non-empty destinations? Can we just CHECK that it is not? Having duplicate roots stinks.
358	Do you have to handle the Allocator failing here with a null check? Are you just relying on those cases artificially pruning the traversal?
381	What's the thread safety of this and deepCopy? It seems like they shouldn't be called when functions are being intercepted. How can we make sure that invariant is preserved?
394–402	Should this be pulled out of the root loop so we can use a single stack and array instead of 1 per root node?
409–410	You have some TODOs elsewhere to update the histograms. Might put one here as well.
compiler-rt/lib/xray/xray_profiler_flags.cc
36	Is the "#Name" intentional? I'm not very fluent in macros, but that popped out.

fixup: Address comments by kpw@

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
84–92	Good point -- I've instead asserted that R0 and R1 don't have the same function id.
141–143	I actually thought about counting both, but realised that there's a property here which allows us to count just one of them to derive the other. I initially thought about it this way:

    CTT(N) = CLT(N) + sigma(i = 0->N) CTT(callee(N)[i])

is equivalent to:

    CLT(N) = CTT(N) - sigma(i = 0->N) CTT(callee(N)[i])

Where `CTT` is "Cumulative Tree Time" and `CLT` is "CumulativeLocalTime". Can we prove that this property holds, and that measuring just `CTT` is sufficient to derive `CLT`? To do this properly I'll introduce a notation:

    (f -> f') @ t[n+0]
    (f <- f') @ t[n+1]

Where `f` and `f'` are function IDs, and `->` represents a "calls" relationship, '<-' represents an "exits" relationship, and `t` is a timestamp (we denote `n` to be the order of timestamps by appearance). Given the following sequence of events:

    (f1 -> f2) @ t[0]
    (f2 -> f3) @ t[1]
    (f2 <- f3) @ t[2]
    (f2 -> f3) @ t[3]
    (f2 <- f3) @ t[4]
    (f1 <- f2) @ t[5]

Here, if we think about the cumulative local times, we might think that:

    CTT(f1) = CTT(f2) + CLT(f1)
    CTT(f2) = CTT(f3) + CLT(f2)
    CTT(f3) = 0 + CLT(f3)

As per formula above. If we expand this:

    CTT(f3) = 0 + (t[4] - t[3]) + (t[2] - t[1])
    CTT(f2) = CTT(f3) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
    CTT(f1) = CTT(f2) + 0

Let's expand it further:

    CTT(f2) = 0 + (t[4] - t[3]) + (t[2] - t[1]) + (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])
    CTT(f2) = (t[5] - t[4]) + (t[4] - t[3]) + (t[3] - t[2]) + (t[2] - t[1]) + (t[1] - t[0]) + 0
    CTT(f2) = t[5] - t[0]

This is what we'd expect for computing `CTT` from `CLT`. Can we do the reverse though?

    CLT(f2) = CTT(f2) - CLT(f3)
    CLT(f2) = CTT(f2) - (CTT(f3) + 0)
    CLT(f2) = (t[5] - t[0]) - (0 + (t[4] - t[3]) + (t[2] - t[1]) + 0)
    CLT(f2) = (t[5] - t[0]) - ((t[4] - t[3]) + (t[2] - t[1]))
    CLT(f2) = (t[5] - t[4]) + (t[3] - t[2]) + (t[1] - t[0])

QED

There's an argument for doing either, but we make the trade-off to covering Cumulative Local Time instead at runtime for the following reasons:

- Using CLT reduces the risk of us overflowing counters.
- We need CLT anyway for generating the histogram of latency for a particular function in the stack context.
- CLT allows us to better account for when we're un-winding the stack in case we find an exit for a function that was entered "higher up".

Does that make sense? Now I kind-of want to write up that formula somewhere more persistent, rather than just in a review thread. :)
191–193	Yes. We're not using the same allocator for the FunctionCallTrie merging test because we want to make sure that functionality works: we do that when we're transferring the FunctionCallTrie from a thread to the central service (in the next patch). We want to have thread-local allocators, and the central storage service will then use a single FunctionCallTrie for the "merged" version of the FunctionCallTries for all the threads.
compiler-rt/lib/xray/xray_function_call_trie.h
319	Explained in the comment above... with some math/proof. :)
337–338	Yes, fixed with a DCHECK.
358	Good question. I'm definitely relying on the artificial pruning. Writing a comment as a TODO to figure out what to do in case of failure.
381	Good question. They are thread-compatible (will need external synchronisation).
compiler-rt/lib/xray/xray_profiler_flags.cc
36	Yes, this turns the argument into a string.
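
The CTT/CLT relationship argued in the 141–143 reply above can be checked numerically. The sketch below is an illustration only: `TimedNode`, `treeTime` and `localTime` are hypothetical names, and the timestamps t[0..5] = 0, 10, 30, 40, 70, 100 are made-up values for the f1 -> f2 -> f3 event sequence in that reply. They give CLT(f3) = 50 and CLT(f2) = 50.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node type for illustration; not the patch's Node.
struct TimedNode {
  uint64_t CLT = 0; // Cumulative local time.
  std::vector<TimedNode> Callees;
};

// CTT(N) = CLT(N) + sum of CTT(c) over the callees c of N.
inline uint64_t treeTime(const TimedNode &N) {
  uint64_t Total = N.CLT;
  for (const TimedNode &C : N.Callees)
    Total += treeTime(C);
  return Total;
}

// The reverse direction: CLT(N) = CTT(N) - sum of CTT(c), given a tree time
// for N that was measured or computed separately.
inline uint64_t localTime(uint64_t CTT, const TimedNode &N) {
  uint64_t CalleeTree = 0;
  for (const TimedNode &C : N.Callees)
    CalleeTree += treeTime(C);
  return CTT - CalleeTree;
}
```

With those values, `treeTime` for f2 yields 100, which equals t[5] - t[0], and `localTime(100, f2)` recovers 50. The same identity does not carry over to latency histograms, which is the point the reviewer raises later in the thread.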

Harbormaster completed remote builds in B18048: Diff 146562. May 14 2018, 2:33 AM

kpw accepted this revision. May 14 2018, 8:39 AM

kpw added inline comments.

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
141–143	Thanks for the detailed response. I agree with all of your points except for the claim that CLT is the right choice for latency histograms. Responding to your points in turn:

- Yes, CLT and CTT can each be derived from the other.
- Overflow is less likely with CLT. I haven't done the math for a typical CPU to approximate the risk for a 32 bit uint.
- For either CLT or CTT, the "higher up" exit is handled the same way: as if there were simultaneous exits for all the functions until the matching function id.
- And finally, latency histograms of CTT can't be used to derive CLT and vice-versa. This is where we actually have to make a choice between the two or compute both.

Imagine a function "processRpc(const MyRequest&, MyResponse*)". When I want to know the 95th percentile of its latency, I am interested in CTT, not CLT. All the time spent in lower levels of the stack affects the observed latency. Callers care about how long the function blocks their execution. The person trying to optimize a function may care about CLT when choosing which code path to target.

Let's cross this bridge when we add histograms.
compiler-rt/lib/xray/xray_function_call_trie.h
337–338	You should update the comment to say that the operation should not be called with a non-empty destination. You've tightened the contract.
381	I'm curious how this will work for the data collector service ensuring that function traces won't interfere with flushing.
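
The overflow question raised in the 141–143 comment can be approximated with simple arithmetic. The sketch below assumes, purely for illustration, a counter that accumulates raw TSC ticks at 3 GHz; the tick rate and the helper name `secondsToWrap` are assumptions, not values from the patch.

```cpp
#include <cmath>

// Seconds of accumulated time before an unsigned counter of the given width
// wraps, if it accumulates ticks at TicksPerSecond.
inline double secondsToWrap(int Bits, double TicksPerSecond) {
  return std::ldexp(1.0, Bits) / TicksPerSecond; // 2^Bits / rate
}
```

Under that assumption a 32-bit counter wraps after roughly 1.4 seconds of accumulated ticks, while a 64-bit counter lasts on the order of 195 years, so the counter width matters far more than the choice between CLT and CTT.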

This revision is now accepted and ready to land. May 14 2018, 8:39 AM

fixup: Address comments by kpw@
fixup: Rename flags to remove xray_profiling_ prefix.

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc
141–143	> And finally, latency histograms of CTT can't be used to derive CLT and vice-versa. This is where we actually have to make a choice between the two or compute both.

Yes!

> Let's cross this bridge when we add histograms.

Agreed. Thanks!
compiler-rt/lib/xray/xray_function_call_trie.h
381	Yeah, Part 3 actually does external synchronisation to ensure that when posting, we're making a copy and merging to a global. There's some interesting work happening there, but not as interesting as the stuff we're doing here. ;)
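
"Thread-compatible with external synchronisation" can be illustrated with a small sketch. This is not the Part 3 implementation: `SketchCollector` and `SketchProfile` are hypothetical, and a profile is reduced to per-function call counts so that the copy-then-merge step stays visible.

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical stand-in: a "profile" is just call counts keyed by function id.
using SketchProfile = std::map<int32_t, uint64_t>;

class SketchCollector {
  std::mutex Mu;
  SketchProfile Global;

public:
  // Called by a thread that is no longer recording into Local. The mutex is
  // the external synchronisation; the profile type itself provides none.
  void post(const SketchProfile &Local) {
    SketchProfile Copy = Local; // Copy first, outside the lock.
    std::lock_guard<std::mutex> Lock(Mu);
    for (const auto &Entry : Copy)
      Global[Entry.first] += Entry.second; // Merge into the global profile.
  }

  // Returns a copy of the merged profile taken under the lock.
  SketchProfile snapshot() {
    std::lock_guard<std::mutex> Lock(Mu);
    return Global;
  }
};
```

The caller remains responsible for ensuring that no instrumentation handler is still writing to `Local` while `post` reads it, which is the invariant the reviewer asks about.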

Closed by commit rL332313: [XRay][profiler] Part 2: XRay Function Call Trie (authored by dberris). May 14 2018, 5:46 PM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: delcypher. May 14 2018, 5:46 PM

Revision Contents

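Path                                                          Size
compiler-rt/lib/xray/CMakeLists.txt                           29 lines
compiler-rt/lib/xray/tests/CMakeLists.txt                     2 lines
compiler-rt/lib/xray/tests/unit/CMakeLists.txt                2 lines
compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc    253 lines
compiler-rt/lib/xray/xray_function_call_trie.h                446 lines
compiler-rt/lib/xray/xray_profiler_flags.h                    39 lines
compiler-rt/lib/xray/xray_profiler_flags.cc                   40 lines
compiler-rt/lib/xray/xray_profiler_flags.inc                  26 lines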

Diff 146723

compiler-rt/lib/xray/CMakeLists.txt

[12 lines hidden; context: set(XRAY_FDR_MODE_SOURCES]
      xray_fdr_flags.cc
      xray_buffer_queue.cc
      xray_fdr_logging.cc)

  set(XRAY_BASIC_MODE_SOURCES
      xray_basic_flags.cc
      xray_basic_logging.cc)

+ set(XRAY_PROFILER_MODE_SOURCES
+     xray_profiler_flags.cc)

  # Implementation files for all XRay architectures.
  set(x86_64_SOURCES
      xray_x86_64.cc
      xray_trampoline_x86_64.S)

  set(arm_SOURCES
      xray_arm.cc
[64 lines hidden; context: add_compiler_rt_object_libraries(RTXrayFDR]
      CFLAGS ${XRAY_CFLAGS}
      DEFS ${XRAY_COMMON_DEFINITIONS})
    add_compiler_rt_object_libraries(RTXrayBASIC
      OS ${XRAY_SUPPORTED_OS}
      ARCHS ${XRAY_SUPPORTED_ARCH}
      SOURCES ${XRAY_BASIC_MODE_SOURCES}
      CFLAGS ${XRAY_CFLAGS}
      DEFS ${XRAY_COMMON_DEFINITIONS})
+   add_compiler_rt_object_libraries(RTXrayPROFILER
+     OS ${XRAY_SUPPORTED_OS}
+     ARCHS ${XRAY_SUPPORTED_ARCH}
+     SOURCES ${XRAY_PROFILER_MODE_SOURCES}
+     CFLAGS ${XRAY_CFLAGS}
+     DEFS ${XRAY_COMMON_DEFINITIONS})

    # We only support running on osx for now.
    add_compiler_rt_runtime(clang_rt.xray
      STATIC
      OS ${XRAY_SUPPORTED_OS}
      ARCHS ${XRAY_SUPPORTED_ARCH}
      OBJECT_LIBS RTXray
                  RTSanitizerCommon
[18 lines hidden; context: add_compiler_rt_runtime(clang_rt.xray-basic]
      OS ${XRAY_SUPPORTED_OS}
      ARCHS ${XRAY_SUPPORTED_ARCH}
      OBJECT_LIBS RTXrayBASIC
      CFLAGS ${XRAY_CFLAGS}
      DEFS ${XRAY_COMMON_DEFINITIONS}
      LINK_FLAGS ${SANITIZER_COMMON_LINK_FLAGS} ${WEAK_SYMBOL_LINK_FLAGS}
      LINK_LIBS ${XRAY_LINK_LIBS}
      PARENT_TARGET xray)
+   add_compiler_rt_runtime(clang_rt.xray-profiler
+     STATIC
+     OS ${XRAY_SUPPORTED_OS}
+     ARCHS ${XRAY_SUPPORTED_ARCH}
+     OBJECT_LIBS RTXrayPROFILER
+     CFLAGS ${XRAY_CFLAGS}
+     DEFS ${XRAY_COMMON_DEFINITIONS}
+     LINK_FLAGS ${SANITIZER_COMMON_LINK_FLAGS} ${WEAK_SYMBOL_LINK_FLAGS}
+     LINK_LIBS ${XRAY_LINK_LIBS}
+     PARENT_TARGET xray)
  else() # not Apple
    foreach(arch ${XRAY_SUPPORTED_ARCH})
      if(NOT CAN_TARGET_${arch})
        continue()
      endif()
      add_compiler_rt_object_libraries(RTXray
        ARCHS ${arch}
        SOURCES ${XRAY_SOURCES} ${${arch}_SOURCES} CFLAGS ${XRAY_CFLAGS}
        DEFS ${XRAY_COMMON_DEFINITIONS})
      add_compiler_rt_object_libraries(RTXrayFDR
        ARCHS ${arch}
        SOURCES ${XRAY_FDR_MODE_SOURCES} CFLAGS ${XRAY_CFLAGS}
        DEFS ${XRAY_COMMON_DEFINITIONS})
      add_compiler_rt_object_libraries(RTXrayBASIC
        ARCHS ${arch}
        SOURCES ${XRAY_BASIC_MODE_SOURCES} CFLAGS ${XRAY_CFLAGS}
        DEFS ${XRAY_COMMON_DEFINITIONS})
+     add_compiler_rt_object_libraries(RTXrayPROFILER
+       ARCHS ${arch}
+       SOURCES ${XRAY_PROFILER_MODE_SOURCES} CFLAGS ${XRAY_CFLAGS}
+       DEFS ${XRAY_COMMON_DEFINITIONS})

      # Common XRay archive for instrumented binaries.
      add_compiler_rt_runtime(clang_rt.xray
        STATIC
        ARCHS ${arch}
        CFLAGS ${XRAY_CFLAGS}
        DEFS ${XRAY_COMMON_DEFINITIONS}
        OBJECT_LIBS ${XRAY_COMMON_RUNTIME_OBJECT_LIBS} RTXray
[9 lines hidden; context: foreach(arch ${XRAY_SUPPORTED_ARCH})]
      # Basic mode runtime archive (addon for clang_rt.xray)
      add_compiler_rt_runtime(clang_rt.xray-basic
        STATIC
        ARCHS ${arch}
        CFLAGS ${XRAY_CFLAGS}
        DEFS ${XRAY_COMMON_DEFINITIONS}
        OBJECT_LIBS RTXrayBASIC
        PARENT_TARGET xray)
+     add_compiler_rt_runtime(clang_rt.xray-profiler
+       STATIC
+       ARCHS ${arch}
+       CFLAGS ${XRAY_CFLAGS}
+       DEFS ${XRAY_COMMON_DEFINITIONS}
+       OBJECT_LIBS RTXrayPROFILER
+       PARENT_TARGET xray)
    endforeach()
  endif() # not Apple

  if(COMPILER_RT_INCLUDE_TESTS)
    add_subdirectory(tests)
  endif()

compiler-rt/lib/xray/tests/CMakeLists.txt

[66 lines hidden; context: macro(add_xray_unittest testname)]
    endif()
  endmacro()

  if(COMPILER_RT_CAN_EXECUTE_TESTS)
    if (APPLE)
      add_xray_lib("RTXRay.test.osx"
        $<TARGET_OBJECTS:RTXray.osx>
        $<TARGET_OBJECTS:RTXrayFDR.osx>
+       $<TARGET_OBJECTS:RTXrayPROFILER.osx>
        $<TARGET_OBJECTS:RTSanitizerCommon.osx>
        $<TARGET_OBJECTS:RTSanitizerCommonLibc.osx>)
    else()
      foreach(arch ${XRAY_SUPPORTED_ARCH})
        add_xray_lib("RTXRay.test.${arch}"
          $<TARGET_OBJECTS:RTXray.${arch}>
          $<TARGET_OBJECTS:RTXrayFDR.${arch}>
+         $<TARGET_OBJECTS:RTXrayPROFILER.${arch}>
          $<TARGET_OBJECTS:RTSanitizerCommon.${arch}>
          $<TARGET_OBJECTS:RTSanitizerCommonLibc.${arch}>)
      endforeach()
    endif()
    add_subdirectory(unit)
  endif()

compiler-rt/lib/xray/tests/unit/CMakeLists.txt

  add_xray_unittest(XRayBufferQueueTest SOURCES
    buffer_queue_test.cc xray_unit_test_main.cc)
  add_xray_unittest(XRayFDRLoggingTest SOURCES
    fdr_logging_test.cc xray_unit_test_main.cc)
  add_xray_unittest(XRayAllocatorTest SOURCES
    allocator_test.cc xray_unit_test_main.cc)
  add_xray_unittest(XRaySegmentedArrayTest SOURCES
    segmented_array_test.cc xray_unit_test_main.cc)
+ add_xray_unittest(XRayFunctionCallTrieTest SOURCES
+   function_call_trie_test.cc xray_unit_test_main.cc)

compiler-rt/lib/xray/tests/unit/function_call_trie_test.cc

This file was added.

				//===-- function_call_trie_test.cc ----------------------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file is a part of XRay, a function call tracing system.
				//
				//===----------------------------------------------------------------------===//
				#include "gtest/gtest.h"

				#include "xray_function_call_trie.h"

				namespace __xray {

				namespace {

				TEST(FunctionCallTrieTest, Construction) {
				// We want to make sure that we can create one of these without the set of
				// allocators we need. This will by default use the global allocators.
				FunctionCallTrie Trie;
				}

				TEST(FunctionCallTrieTest, ConstructWithTLSAllocators) {
				// FIXME: Support passing in configuration for allocators in the allocator
				// constructors.
				profilerFlags()->setDefaults();
				FunctionCallTrie::Allocators Allocators = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(Allocators);
				}

				TEST(FunctionCallTrieTest, EnterAndExitFunction) {
				profilerFlags()->setDefaults();
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(A);

				Trie.enterFunction(1, 1);
				Trie.exitFunction(1, 2);

				// We need a way to pull the data out. At this point, until we get a data
				// collection service implemented, we're going to export the data as a list of
				// roots, and manually walk through the structure ourselves.

				const auto &R = Trie.getRoots();

				ASSERT_EQ(R.size(), 1u);
				ASSERT_EQ(R.front()->FId, 1);
				ASSERT_EQ(R.front()->CallCount, 1);
				ASSERT_EQ(R.front()->CumulativeLocalTime, 1u);
				}

				TEST(FunctionCallTrieTest, MissingFunctionEntry) {
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(A);
				Trie.exitFunction(1, 1);
				const auto &R = Trie.getRoots();

				ASSERT_TRUE(R.empty());
				}

				TEST(FunctionCallTrieTest, MissingFunctionExit) {
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(A);
				Trie.enterFunction(1, 1);
				const auto &R = Trie.getRoots();

				ASSERT_TRUE(R.empty());
				}

				TEST(FunctionCallTrieTest, MultipleRoots) {
				profilerFlags()->setDefaults();
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(A);

				// Enter and exit FId = 1.
				Trie.enterFunction(1, 1);
				Trie.exitFunction(1, 2);

				// Enter and exit FId = 2.
				Trie.enterFunction(2, 3);
				Trie.exitFunction(2, 4);

				const auto &R = Trie.getRoots();
				ASSERT_FALSE(R.empty());
				ASSERT_EQ(R.size(), 2u);

				// Make sure the roots have different IDs.
				const auto R0 = R[0];
				const auto R1 = R[1];
				ASSERT_NE(R0->FId, R1->FId);

				// Inspect the roots that they have the right data.
				ASSERT_NE(R0, nullptr);
				EXPECT_EQ(R0->CallCount, 1u);
				EXPECT_EQ(R0->CumulativeLocalTime, 1u);

				ASSERT_NE(R1, nullptr);
				EXPECT_EQ(R1->CallCount, 1u);
				EXPECT_EQ(R1->CumulativeLocalTime, 1u);
				}

				// While missing an intermediary entry may be rare in practice, we still enforce
				// that we can handle the case where we've missed the entry event somehow, in
				// between call entry/exits. To illustrate, imagine the following shadow call
				// stack:
				//
				// f0@t0 -> f1@t1 -> f2@t2
				//
				// If for whatever reason we see an exit for `f2` @ t3, followed by an exit for
				// `f0` @ t4 (i.e. no `f1` exit in between) then we need to handle the case of
				// accounting local time to `f2` from d = (t3 - t2), then local time to `f1`
				// as d' = (t3 - t1) - d, and then local time to `f0` as d'' = (t3 - t0) - d'.
				TEST(FunctionCallTrieTest, MissingIntermediaryExit) {
				profilerFlags()->setDefaults();
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(A);

				Trie.enterFunction(1, 0);
				Trie.enterFunction(2, 100);
				Trie.enterFunction(3, 200);
				Trie.exitFunction(3, 300);
				Trie.exitFunction(1, 400);

				// What we should see at this point is all the functions in the trie in a
				// specific order (1 -> 2 -> 3) with the appropriate count(s) and local
				// latencies.
				const auto &R = Trie.getRoots();
				ASSERT_FALSE(R.empty());
				ASSERT_EQ(R.size(), 1u);

				const auto &F1 = *R[0];
				ASSERT_EQ(F1.FId, 1);
				ASSERT_FALSE(F1.Callees.empty());

				const auto &F2 = *F1.Callees[0].NodePtr;
				ASSERT_EQ(F2.FId, 2);
				ASSERT_FALSE(F2.Callees.empty());

				const auto &F3 = *F2.Callees[0].NodePtr;
				ASSERT_EQ(F3.FId, 3);
				ASSERT_TRUE(F3.Callees.empty());

				// Now that we've established the preconditions, we check for specific aspects
				// of the nodes.
				EXPECT_EQ(F3.CallCount, 1);
				EXPECT_EQ(F2.CallCount, 1);
				EXPECT_EQ(F1.CallCount, 1);
				EXPECT_EQ(F3.CumulativeLocalTime, 100);
				EXPECT_EQ(F2.CumulativeLocalTime, 300);
				EXPECT_EQ(F1.CumulativeLocalTime, 100);
				}

				// TODO: Test that we can handle cross-CPU migrations, where TSCs are not
				// guaranteed to be synchronised.
				TEST(FunctionCallTrieTest, DeepCopy) {
				profilerFlags()->setDefaults();
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Trie(A);

				Trie.enterFunction(1, 0);
				Trie.enterFunction(2, 1);
				Trie.exitFunction(2, 2);
				Trie.enterFunction(3, 3);
				Trie.exitFunction(3, 4);
				Trie.exitFunction(1, 5);

				// We want to make a deep copy and compare notes.
				auto B = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Copy(B);
				Trie.deepCopyInto(Copy);

				ASSERT_NE(Trie.getRoots().size(), 0u);
				ASSERT_EQ(Trie.getRoots().size(), Copy.getRoots().size());
				const auto &R0Orig = *Trie.getRoots()[0];
				const auto &R0Copy = *Copy.getRoots()[0];
				EXPECT_EQ(R0Orig.FId, 1);
				EXPECT_EQ(R0Orig.FId, R0Copy.FId);

				ASSERT_EQ(R0Orig.Callees.size(), 2u);
				ASSERT_EQ(R0Copy.Callees.size(), 2u);

				const auto &F1Orig =
				*R0Orig.Callees
				.find_element(
				[](const FunctionCallTrie::NodeIdPair &R) { return R.FId == 2; })
				->NodePtr;
				const auto &F1Copy =
				*R0Copy.Callees
				.find_element(
				[](const FunctionCallTrie::NodeIdPair &R) { return R.FId == 2; })
				->NodePtr;
				EXPECT_EQ(&R0Orig, F1Orig.Parent);
				EXPECT_EQ(&R0Copy, F1Copy.Parent);
				}

				TEST(FunctionCallTrieTest, MergeInto) {
				profilerFlags()->setDefaults();
				auto A = FunctionCallTrie::InitAllocators();
				FunctionCallTrie T0(A);
				FunctionCallTrie T1(A);

				// 1 -> 2 -> 3
				T0.enterFunction(1, 0);
				T0.enterFunction(2, 1);
				T0.enterFunction(3, 2);
				T0.exitFunction(3, 3);
				T0.exitFunction(2, 4);
				T0.exitFunction(1, 5);

				// 1 -> 2 -> 3
				T1.enterFunction(1, 0);
				T1.enterFunction(2, 1);
				T1.enterFunction(3, 2);
				T1.exitFunction(3, 3);
				T1.exitFunction(2, 4);
				T1.exitFunction(1, 5);

				// We use a different allocator here to make sure that we're able to transfer
				// data into a FunctionCallTrie which uses a different allocator. This
				// reflects the intended usage scenario for when we're collecting profiles that
				// aggregate across threads.
				auto B = FunctionCallTrie::InitAllocators();
				FunctionCallTrie Merged(B);

				T0.mergeInto(Merged);
				T1.mergeInto(Merged);

				ASSERT_EQ(Merged.getRoots().size(), 1u);
				const auto &R0 = *Merged.getRoots()[0];
				EXPECT_EQ(R0.FId, 1);
				EXPECT_EQ(R0.CallCount, 2);
				EXPECT_EQ(R0.CumulativeLocalTime, 10);
				EXPECT_EQ(R0.Callees.size(), 1u);

				const auto &F1 = *R0.Callees[0].NodePtr;
				EXPECT_EQ(F1.FId, 2);
				EXPECT_EQ(F1.CallCount, 2);
				EXPECT_EQ(F1.CumulativeLocalTime, 6);
				EXPECT_EQ(F1.Callees.size(), 1u);

				const auto &F2 = *F1.Callees[0].NodePtr;
				EXPECT_EQ(F2.FId, 3);
				EXPECT_EQ(F2.CallCount, 2);
				EXPECT_EQ(F2.CumulativeLocalTime, 2);
				EXPECT_EQ(F2.Callees.size(), 0u);
				}

				} // namespace

				} // namespace __xray
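
The values the MergeInto test expects (call counts of 2 and times of 10, 6 and 2) follow from adding the per-thread counters node by node along matching paths. The following is a minimal sketch of that idea using standard containers rather than the XRay `Array` and allocator types. `SketchNode`, `mergeNode` and `makeThreadTrie` are hypothetical names, the sketch models a single root, and it recurses where the patch uses an explicit DFS stack.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for FunctionCallTrie::Node. Callees are keyed by
// function ID in a std::map instead of an Array<NodeIdPair>.
struct SketchNode {
  int64_t CallCount = 0;
  int64_t CumulativeLocalTime = 0;
  std::map<int32_t, std::unique_ptr<SketchNode>> Callees;
};

// Adds the counters of From into Into and creates any callee that Into
// does not have yet, so that every path in From also exists in Into.
void mergeNode(const SketchNode &From, SketchNode &Into) {
  Into.CallCount += From.CallCount;
  Into.CumulativeLocalTime += From.CumulativeLocalTime;
  for (const auto &KV : From.Callees) {
    auto &Target = Into.Callees[KV.first];
    if (!Target)
      Target = std::make_unique<SketchNode>();
    mergeNode(*KV.second, *Target);
  }
}

// Builds the trie that each thread in the test records: 1 -> 2 -> 3, with
// the times 5, 3 and 1 that the test's TSC values produce in exitFunction.
SketchNode makeThreadTrie() {
  SketchNode F1;
  F1.CallCount = 1;
  F1.CumulativeLocalTime = 5;
  auto F2 = std::make_unique<SketchNode>();
  F2->CallCount = 1;
  F2->CumulativeLocalTime = 3;
  auto F3 = std::make_unique<SketchNode>();
  F3->CallCount = 1;
  F3->CumulativeLocalTime = 1;
  F2->Callees[3] = std::move(F3);
  F1.Callees[2] = std::move(F2);
  return F1;
}
```

Merging two results of `makeThreadTrie()` into an empty `SketchNode` gives the same per-node totals that the test checks on `Merged`.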

compiler-rt/lib/xray/xray_function_call_trie.h

This file was added.

				//===-- xray_function_call_trie.h -------------------------------*- C++ -*-===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file is a part of XRay, a dynamic runtime instrumentation system.
				//
				// This file defines the interface for a function call trie.
				//
				//===----------------------------------------------------------------------===//
				#ifndef XRAY_FUNCTION_CALL_TRIE_H
				#define XRAY_FUNCTION_CALL_TRIE_H

				#include "xray_profiler_flags.h"
				#include "xray_segmented_array.h"
				#include <utility>

				namespace __xray {

				/// A FunctionCallTrie represents the stack traces of XRay instrumented
				/// functions that we've encountered, where a node corresponds to a function and
				/// the path from the root to the node its stack trace. Each node in the trie
				/// will contain some useful values, including:
				///
				/// * The cumulative amount of time spent in this particular node/stack.
				/// * The number of times this stack has appeared.
				/// * A histogram of latencies for that particular node.
				///
				/// Each node in the trie will also contain a list of callees, represented using
				/// an Array<NodeIdPair> -- each NodeIdPair instance will contain the function
				/// ID of the callee, and a pointer to the node.
				///
				/// If we visualise this data structure, we'll find the following potential
				/// representation:
				kpw: visulaise -> visualize
				///
				/// [function id node] -> [callees] [cumulative time]
				/// [call counter] [latency histogram]
				///
				/// As an example, when we have a function in this pseudocode:
				///
				/// func f(N) {
				/// g()
				/// h()
				/// for i := 1..N { j() }
				/// }
				///
				/// We may end up with a trie of the following form:
				///
				/// f -> [ g, h, j ] [...] [1] [...]
				/// g -> [ ... ] [...] [1] [...]
				/// h -> [ ... ] [...] [1] [...]
				/// j -> [ ... ] [...] [N] [...]
				///
				/// If for instance the function g() called j() like so:
				///
				/// func g() {
				/// for i := 1..10 { j() }
				/// }
				///
				/// We'll find the following updated trie:
				///
				/// f -> [ g, h, j ] [...] [1] [...]
				/// g -> [ j' ] [...] [1] [...]
				/// h -> [ ... ] [...] [1] [...]
				/// j -> [ ... ] [...] [N] [...]
				/// j' -> [ ... ] [...] [10] [...]
				///
				/// Note that we'll have a new node representing the path `f -> g -> j'` with
				/// isolated data. This isolation gives us a means of representing the stack
				/// traces as a path, as opposed to a key in a table. The alternative
				/// implementation here would be to use a separate table for the path, and use
				/// hashes of the path as an identifier to accumulate the information. We've
				/// moved away from this approach as it takes a lot of time to compute the hash
				/// every time we need to update a function's call information as we're handling
				/// the entry and exit events.
				///
				/// This approach allows us to maintain a shadow stack, which represents the
				/// currently executing path, and on function exits quickly compute the amount
				/// of time elapsed from the entry, then update the counters for the node
				/// already represented in the trie. This necessitates an efficient
				/// representation of the various data structures (the list of callees must be
				/// cache-aware and efficient to look up, and the histogram must be compact and
				/// quick to update) to enable us to keep the overheads of this implementation
				/// to the minimum.
				class FunctionCallTrie {
				public:
				struct Node;

				// We use a NodeIdPair type instead of a std::pair<...> to not rely on the
				// standard library types in this header.
				struct NodeIdPair {
				Node *NodePtr;
				kpw: There's no reference members here. Maybe call it NodeIdPair?
				int32_t FId;

				// Constructor for inplace-construction.
				NodeIdPair(Node *N, int32_t F) : NodePtr(N), FId(F) {}
				};

				using NodeIdPairArray = Array<NodeIdPair>;
				using NodeIdPairAllocatorType = NodeIdPairArray::AllocatorType;

				// A Node in the FunctionCallTrie gives us a list of callees, the cumulative
				// number of times this node actually appeared, the cumulative amount of time
				// for this particular node including its children call times, and just the
				// local time spent on this node. Each Node will have the ID of the XRay
				// instrumented function that it is associated to.
				struct Node {
				Node *Parent;
				NodeIdPairArray Callees;
				int64_t CallCount;
				int64_t CumulativeLocalTime; // Typically in TSC deltas, not wall-time.
				int32_t FId;

				// We add a constructor here to allow us to inplace-construct through
				// Array<...>'s AppendEmplace.
				Node(Node *P, NodeIdPairAllocatorType &A, int64_t CC, int64_t CLT,
				int32_t F)
				: Parent(P), Callees(A), CallCount(CC), CumulativeLocalTime(CLT),
				FId(F) {}

				// TODO: Include the compact histogram.
				};

				private:
				struct ShadowStackEntry {
				int32_t FId; // We're copying the function ID into the stack to avoid having
				// to reach into the node just to get the function ID.
				uint64_t EntryTSC;
				Node *NodePtr;

				// We add a constructor here to allow us to inplace-construct through
				// Array<...>'s AppendEmplace.
				ShadowStackEntry(int32_t F, uint64_t T, Node *N)
				: FId(F), EntryTSC(T), NodePtr(N) {}
				};

				using NodeArray = Array<Node>;
				using RootArray = Array<Node *>;
				using ShadowStackArray = Array<ShadowStackEntry>;

				public:
				// We collate the allocators we need into a single struct, as a convenience to
				// allow us to initialize these as a group.
				struct Allocators {
				using NodeAllocatorType = NodeArray::AllocatorType;
				using RootAllocatorType = RootArray::AllocatorType;
				using ShadowStackAllocatorType = ShadowStackArray::AllocatorType;
				using NodeIdPairAllocatorType = NodeIdPairAllocatorType;

				NodeAllocatorType *NodeAllocator = nullptr;
				RootAllocatorType *RootAllocator = nullptr;
				ShadowStackAllocatorType *ShadowStackAllocator = nullptr;
				NodeIdPairAllocatorType *NodeIdPairAllocator = nullptr;

				Allocators() {}
				Allocators(const Allocators &) = delete;
				Allocators &operator=(const Allocators &) = delete;

				Allocators(Allocators &&O)
				: NodeAllocator(O.NodeAllocator), RootAllocator(O.RootAllocator),
				ShadowStackAllocator(O.ShadowStackAllocator),
				NodeIdPairAllocator(O.NodeIdPairAllocator) {
				O.NodeAllocator = nullptr;
				O.RootAllocator = nullptr;
				O.ShadowStackAllocator = nullptr;
				O.NodeIdPairAllocator = nullptr;
				}

				Allocators &operator=(Allocators &&O) {
				{
				auto Tmp = O.NodeAllocator;
				O.NodeAllocator = this->NodeAllocator;
				this->NodeAllocator = Tmp;
				}
				{
				auto Tmp = O.RootAllocator;
				O.RootAllocator = this->RootAllocator;
				this->RootAllocator = Tmp;
				}
				{
				auto Tmp = O.ShadowStackAllocator;
				O.ShadowStackAllocator = this->ShadowStackAllocator;
				this->ShadowStackAllocator = Tmp;
				}
				{
				auto Tmp = O.NodeIdPairAllocator;
				O.NodeIdPairAllocator = this->NodeIdPairAllocator;
				this->NodeIdPairAllocator = Tmp;
				}
				return *this;
				}

				~Allocators() {
				// Note that we cannot use delete on these pointers, as they need to be
				// returned to the sanitizer_common library's internal memory tracking
				// system.
				if (NodeAllocator != nullptr) {
				NodeAllocator->~NodeAllocatorType();
				InternalFree(NodeAllocator);
				}
				if (RootAllocator != nullptr) {
				RootAllocator->~RootAllocatorType();
				InternalFree(RootAllocator);
				}
				if (ShadowStackAllocator != nullptr) {
				ShadowStackAllocator->~ShadowStackAllocatorType();
				InternalFree(ShadowStackAllocator);
				}
				if (NodeIdPairAllocator != nullptr) {
				NodeIdPairAllocator->~NodeIdPairAllocatorType();
				InternalFree(NodeIdPairAllocator);
				}
				}
				};
				kpw: Don't think so. This looks like the end of "struct Allocators {" to me.

				// TODO: Support configuration of options through the arguments.
				static Allocators InitAllocators() {
				Allocators A;
				auto NodeAllocator = reinterpret_cast<Allocators::NodeAllocatorType *>(
				InternalAlloc(sizeof(Allocators::NodeAllocatorType)));
				new (NodeAllocator) Allocators::NodeAllocatorType(
				profilerFlags()->per_thread_allocator_max, 0);
				A.NodeAllocator = NodeAllocator;

				auto RootAllocator = reinterpret_cast<Allocators::RootAllocatorType *>(
				InternalAlloc(sizeof(Allocators::RootAllocatorType)));
				new (RootAllocator) Allocators::RootAllocatorType(
				profilerFlags()->per_thread_allocator_max, 0);
				A.RootAllocator = RootAllocator;

				auto ShadowStackAllocator =
				reinterpret_cast<Allocators::ShadowStackAllocatorType *>(
				InternalAlloc(sizeof(Allocators::ShadowStackAllocatorType)));
				new (ShadowStackAllocator) Allocators::ShadowStackAllocatorType(
				profilerFlags()->per_thread_allocator_max, 0);
				A.ShadowStackAllocator = ShadowStackAllocator;

				auto NodeIdPairAllocator =
				reinterpret_cast<Allocators::NodeIdPairAllocatorType *>(
				InternalAlloc(sizeof(Allocators::NodeIdPairAllocatorType)));
				new (NodeIdPairAllocator) Allocators::NodeIdPairAllocatorType(
				profilerFlags()->per_thread_allocator_max, 0);
				A.NodeIdPairAllocator = NodeIdPairAllocator;
				return A;
				}

				private:
				NodeArray Nodes;
				RootArray Roots;
				ShadowStackArray ShadowStack;
				NodeIdPairAllocatorType *NodeIdPairAllocator = nullptr;

				const Allocators &GetGlobalAllocators() {
				static const Allocators A = [] { return InitAllocators(); }();
				return A;
				}

				public:
				explicit FunctionCallTrie(const Allocators &A)
				: Nodes(*A.NodeAllocator), Roots(*A.RootAllocator),
				ShadowStack(*A.ShadowStackAllocator),
				NodeIdPairAllocator(A.NodeIdPairAllocator) {}

				FunctionCallTrie() : FunctionCallTrie(GetGlobalAllocators()) {}

				void enterFunction(int32_t FId, uint64_t TSC) {
				// This function primarily deals with ensuring that the ShadowStack is
				// consistent and ready for when an exit event is encountered.
				if (UNLIKELY(ShadowStack.empty())) {
				auto NewRoot =
				Nodes.AppendEmplace(nullptr, *NodeIdPairAllocator, 0, 0, FId);
				if (UNLIKELY(NewRoot == nullptr))
				return;
				Roots.Append(NewRoot);
				ShadowStack.AppendEmplace(FId, TSC, NewRoot);
				return;
				}

				auto &Top = ShadowStack.back();
				auto TopNode = Top.NodePtr;

				// If we've seen this callee before, then we just access that node and place
				// that on the top of the stack.
				auto Callee = TopNode->Callees.find_element(
				[FId](const NodeIdPair &NR) { return NR.FId == FId; });
				if (Callee != nullptr) {
				CHECK_NE(Callee->NodePtr, nullptr);
				ShadowStack.AppendEmplace(FId, TSC, Callee->NodePtr);
				return;
				}

				// This means we've never seen this stack before, create a new node here.
				auto NewNode =
				Nodes.AppendEmplace(TopNode, *NodeIdPairAllocator, 0, 0, FId);
				if (UNLIKELY(NewNode == nullptr))
				return;
				TopNode->Callees.AppendEmplace(NewNode, FId);
				ShadowStack.AppendEmplace(FId, TSC, NewNode);
				return;
				}

				void exitFunction(int32_t FId, uint64_t TSC) {
				// When we exit a function, we look up the ShadowStack to see whether we've
				// entered this function before. We do as little processing here as we can,
				// since most of the hard work would have already been done at function
				// entry.
				if (UNLIKELY(ShadowStack.empty()))
				return;

				uint64_t CumulativeTreeTime = 0;
				while (!ShadowStack.empty()) {
				auto &Top = ShadowStack.back();
				auto TopNode = Top.NodePtr;
				auto TopFId = TopNode->FId;
				auto LocalTime = TSC - Top.EntryTSC;
				kpw: To me, it is less intuitive to capture CumulativeLocalTime than CumulativeTreeTime in each Node. With one you can compute the other given the callees, but if I'm analyzing function latencies, I care about time spent in the callees just as much. Why did you make this choice?
				dberris (author): Explained in the comment above... with some math/proof. :)
				TopNode->CallCount++;
				TopNode->CumulativeLocalTime += LocalTime - CumulativeTreeTime;
				CumulativeTreeTime += LocalTime;
				ShadowStack.trim(1);

				// TODO: Update the histogram for the node.
				if (TopFId == FId)
				break;
				}
				}

				const RootArray &getRoots() const { return Roots; }

				// The deepCopyInto operation will update the provided FunctionCallTrie by
				// re-creating the contents of this particular FunctionCallTrie in the other
				// FunctionCallTrie. It will do this using a Depth First Traversal from the
				// roots, and while doing so recreating the traversal in the provided
				// FunctionCallTrie.
				//
				kpw: Should this be called with non-empty destinations? Can we just CHECK it is not. Having duplicate roots stinks.
				dberris (author): Yes, fixed with a DCHECK.
				kpw: You should update the comment to say that the operation should not be called with a non-empty destination. You've tightened the contract.
				// This operation will not destroy the state in `O`, and thus may cause some
				// duplicate entries in `O` if it is not empty.
				//
				// This function is not thread-safe, and may require external
				// synchronisation of both "this" and |O|.
				//
				// This function must not be called with a non-empty FunctionCallTrie |O|.
				void deepCopyInto(FunctionCallTrie &O) const {
				DCHECK(O.getRoots().empty());
				for (const auto Root : getRoots()) {
				// Add a node in O for this root.
				auto NewRoot = O.Nodes.AppendEmplace(
				nullptr, *O.NodeIdPairAllocator, Root->CallCount,
				Root->CumulativeLocalTime, Root->FId);
				O.Roots.Append(NewRoot);

				// We then push the root into a stack, to use as the parent marker for new
				// nodes we push in as we're traversing depth-first down the call tree.
				struct NodeAndParent {
				FunctionCallTrie::Node *Node;
				kpw: Do you have to handle the Allocator failing here with a null check? Are you just relying on those cases artificially pruning the traversal?
				dberris (author): Good question. I'm definitely relying on the artificial pruning. Writing a comment as a TODO to figure out what to do in case of failure.
				FunctionCallTrie::Node *NewNode;
				};
				using Stack = Array<NodeAndParent>;

				typename Stack::AllocatorType StackAllocator(
				profilerFlags()->stack_allocator_max, 0);
				Stack DFSStack(StackAllocator);

				// TODO: Figure out what to do if we fail to allocate any more stack
				// space. Maybe warn or report once?
				DFSStack.Append(NodeAndParent{Root, NewRoot});
				while (!DFSStack.empty()) {
				NodeAndParent NP = DFSStack.back();
				DCHECK_NE(NP.Node, nullptr);
				DCHECK_NE(NP.NewNode, nullptr);
				DFSStack.trim(1);
				for (const auto Callee : NP.Node->Callees) {
				auto NewNode = O.Nodes.AppendEmplace(
				NP.NewNode, *O.NodeIdPairAllocator, Callee.NodePtr->CallCount,
				Callee.NodePtr->CumulativeLocalTime, Callee.FId);
				DCHECK_NE(NewNode, nullptr);
				NP.NewNode->Callees.AppendEmplace(NewNode, Callee.FId);
				DFSStack.Append(NodeAndParent{Callee.NodePtr, NewNode});
				kpw: What's the thread safety of this and deepCopy? It seems like they shouldn't be called when functions are being intercepted. How can we make sure that invariant is preserved?
				dberris (author): Good question. They are thread-compatible (will need external synchronisation).
				kpw: I'm curious how this will work for the data collector service ensuring that function traces won't interfere with flushing.
				dberris (author): Yeah, Part 3 actually does external synchronisation to ensure that when posting, we're making a copy and merging to a global. There's some interesting work happening there, but not as interesting as the stuff we're doing here. ;)
				}
				}
				}
				}

				// The mergeInto operation will update the provided FunctionCallTrie by
				// traversing the current trie's roots and updating (i.e. merging) the data in
				// the nodes with the data in the target's nodes. If the node doesn't exist in
				// the provided trie, we add a new one in the right position, and inherit the
				// data from the original (current) trie, along with all its callees.
				//
				// This function is not thread-safe, and may require external
				// synchronisation of both "this" and |O|.
				void mergeInto(FunctionCallTrie &O) const {
				struct NodeAndTarget {
				FunctionCallTrie::Node *OrigNode;
				FunctionCallTrie::Node *TargetNode;
				};
				using Stack = Array<NodeAndTarget>;
				typename Stack::AllocatorType StackAllocator(
				profilerFlags()->stack_allocator_max, 0);
				kpw: Should this be pulled out of the root loop so we can use a single stack and array instead of 1 per root node?
				Stack DFSStack(StackAllocator);

				for (const auto Root : getRoots()) {
				Node *TargetRoot = nullptr;
				auto R = O.Roots.find_element(
				[&](const Node *Node) { return Node->FId == Root->FId; });
				if (R == nullptr) {
				TargetRoot = O.Nodes.AppendEmplace(nullptr, *O.NodeIdPairAllocator, 0,
				kpw: You have some TODOs elsewhere to update the histograms. Might put one here as well.
				0, Root->FId);
				O.Roots.Append(TargetRoot);
				} else {
				TargetRoot = *R;
				}

				DFSStack.Append(NodeAndTarget{Root, TargetRoot});
				while (!DFSStack.empty()) {
				NodeAndTarget NT = DFSStack.back();
				DCHECK_NE(NT.OrigNode, nullptr);
				DCHECK_NE(NT.TargetNode, nullptr);
				DFSStack.trim(1);
				// TODO: Update the histogram as well when we have it ready.
				NT.TargetNode->CallCount += NT.OrigNode->CallCount;
				NT.TargetNode->CumulativeLocalTime += NT.OrigNode->CumulativeLocalTime;
				for (const auto Callee : NT.OrigNode->Callees) {
				auto TargetCallee = NT.TargetNode->Callees.find_element(
				[&](const FunctionCallTrie::NodeIdPair &C) {
				return C.FId == Callee.FId;
				});
				if (TargetCallee == nullptr) {
				auto NewTargetNode = O.Nodes.AppendEmplace(
				NT.TargetNode, *O.NodeIdPairAllocator, 0, 0, Callee.FId);
				TargetCallee =
				NT.TargetNode->Callees.AppendEmplace(NewTargetNode, Callee.FId);
				}
				DFSStack.Append(NodeAndTarget{Callee.NodePtr, TargetCallee->NodePtr});
				}
				}
				}
				}
				};

				} // namespace __xray

				#endif // XRAY_FUNCTION_CALL_TRIE_H
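
The unwinding loop in `exitFunction` can be easier to follow in isolation. Below is a minimal sketch, assuming flat per-function counters in a `std::map` instead of one trie node per call path. `SketchEntry`, `SketchProfile`, `onEnter` and `onExit` are hypothetical names that do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical shadow stack entry: the function ID and the TSC at entry.
struct SketchEntry {
  int32_t FId;
  uint64_t EntryTSC;
};

struct SketchProfile {
  std::vector<SketchEntry> ShadowStack;
  std::map<int32_t, int64_t> CallCount; // Keyed by function ID.
  std::map<int32_t, int64_t> Time;      // Keyed by function ID.

  void onEnter(int32_t FId, uint64_t TSC) {
    ShadowStack.push_back({FId, TSC});
  }

  // Follows the loop in exitFunction: pop entries until the one matching
  // FId is found, crediting every popped entry along the way. Entries that
  // never saw their own exit event are closed out by this unwinding.
  void onExit(int32_t FId, uint64_t TSC) {
    uint64_t CumulativeTreeTime = 0;
    while (!ShadowStack.empty()) {
      SketchEntry Top = ShadowStack.back();
      ShadowStack.pop_back();
      uint64_t Elapsed = TSC - Top.EntryTSC;
      CallCount[Top.FId] += 1;
      Time[Top.FId] += static_cast<int64_t>(Elapsed - CumulativeTreeTime);
      CumulativeTreeTime += Elapsed;
      if (Top.FId == FId)
        break;
    }
  }
};
```

With the event sequence from the MergeInto test (enter 1, 2, 3 at TSC 0, 1, 2 and exit 3, 2, 1 at TSC 3, 4, 5), every exit matches the top of the stack, so each function is credited with its full elapsed time: 1, 3 and 5. The subtraction of `CumulativeTreeTime` only has an effect when an exit event forces more than one entry to be popped.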

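The header comment in xray_function_call_trie.h describes how `j` called from `f` and `j` called from `g` end up as separate nodes (`j` and `j'`). A minimal sketch of that path isolation follows, assuming standard containers and a synthetic root node in place of the patch's `Roots` array. `PathNode` and `PathTrie` are hypothetical names, and the sketch counts calls on entry, whereas the patch increments `CallCount` on exit.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical trie node: one node per distinct call path.
struct PathNode {
  int64_t CallCount = 0;
  std::map<int32_t, std::unique_ptr<PathNode>> Callees;
};

struct PathTrie {
  PathNode Root;                 // Synthetic root above all real functions.
  std::vector<PathNode *> Stack; // Shadow stack: nodes on the current path.

  // Looks up FId among the callees of the node on top of the shadow stack
  // and creates a node if this path has not been seen before.
  void onEnter(int32_t FId) {
    PathNode *Parent = Stack.empty() ? &Root : Stack.back();
    auto &Slot = Parent->Callees[FId];
    if (!Slot)
      Slot = std::make_unique<PathNode>();
    Slot->CallCount += 1;
    Stack.push_back(Slot.get());
  }

  void onExit() {
    assert(!Stack.empty());
    Stack.pop_back();
  }
};
```

Replaying the pseudocode from the header comment (with `f` calling `g`, `h`, and then `j` N times, and `g` calling `j` ten times) leaves two distinct nodes for `j`: one under `f` with a count of N and one under `f -> g` with a count of 10.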
compiler-rt/lib/xray/xray_profiler_flags.h

This file was added.

				//===-- xray_profiler_flags.h -----------------------------------*- C++ -*-===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file is a part of XRay, a dynamic runtime instrumentation system.
				//
				// XRay profiler runtime flags.
				//===----------------------------------------------------------------------===//

				#ifndef XRAY_PROFILER_FLAGS_H
				#define XRAY_PROFILER_FLAGS_H

				#include "sanitizer_common/sanitizer_flag_parser.h"
				#include "sanitizer_common/sanitizer_internal_defs.h"

				namespace __xray {

				struct ProfilerFlags {
				#define XRAY_FLAG(Type, Name, DefaultValue, Description) Type Name;
				#include "xray_profiler_flags.inc"
				#undef XRAY_FLAG

				void setDefaults();
				};

				extern ProfilerFlags xray_profiler_flags_dont_use_directly;
				inline ProfilerFlags *profilerFlags() {
				return &xray_profiler_flags_dont_use_directly;
				}
				void registerProfilerFlags(FlagParser *P, ProfilerFlags *F);

				} // namespace __xray

				#endif // XRAY_PROFILER_FLAGS_H

compiler-rt/lib/xray/xray_profiler_flags.cc

This file was added.

				//===-- xray_profiler_flags.cc ----------------------------------*- C++ -*-===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file is a part of XRay, a dynamic runtime instrumentation system.
				//
				// XRay profiler runtime flags.
				//===----------------------------------------------------------------------===//

				#include "xray_profiler_flags.h"
				#include "sanitizer_common/sanitizer_common.h"
				#include "sanitizer_common/sanitizer_flag_parser.h"
				#include "sanitizer_common/sanitizer_libc.h"
				#include "xray_defs.h"

				namespace __xray {

				// Storage for the profiler flags.
				ProfilerFlags xray_profiler_flags_dont_use_directly;

				void ProfilerFlags::setDefaults() XRAY_NEVER_INSTRUMENT {
				#define XRAY_FLAG(Type, Name, DefaultValue, Description) Name = DefaultValue;
				#include "xray_profiler_flags.inc"
				#undef XRAY_FLAG
				}

				void registerProfilerFlags(FlagParser *P,
				ProfilerFlags *F) XRAY_NEVER_INSTRUMENT {
				#define XRAY_FLAG(Type, Name, DefaultValue, Description) \
				RegisterFlag(P, #Name, Description, &F->Name);
				#include "xray_profiler_flags.inc"
				kpw: Is the "#Name" intentional? I'm not very fluent in macros, but that popped out.
				dberris (author): Yes, this turns the argument into a string.
				#undef XRAY_FLAG
				}

				} // namespace __xray
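
The flag files rely on the X-macro pattern: one list of `XRAY_FLAG(...)` entries is expanded several times with different definitions of the macro, and `#Name` stringifies the flag name for registration. A self-contained sketch of the same pattern is shown below. It assumes an inline list macro instead of a separate `.inc` file, and `SKETCH_FLAGS`, `SketchFlags`, `SketchFlagNames`, `buffer_max` and `grace_ms` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical flag list. In the patch the list lives in
// xray_profiler_flags.inc and is pulled in with #include.
#define SKETCH_FLAGS(X) \
  X(unsigned long, buffer_max, 2ul << 20, "Maximum buffer size.") \
  X(int, grace_ms, 100, "Grace period in milliseconds.")

struct SketchFlags {
// First expansion: declare one data member per flag.
#define X(Type, Name, DefaultValue, Description) Type Name;
  SKETCH_FLAGS(X)
#undef X

  // Second expansion: assign every flag its default value.
  void setDefaults() {
#define X(Type, Name, DefaultValue, Description) Name = DefaultValue;
    SKETCH_FLAGS(X)
#undef X
  }
};

// Third expansion: #Name turns the macro argument into a string literal,
// which is how a flag parser can learn the textual name of each flag.
static const char *const SketchFlagNames[] = {
#define X(Type, Name, DefaultValue, Description) #Name,
    SKETCH_FLAGS(X)
#undef X
};
```

Adding a flag then means adding one line to the list; the member, its default and its registered name stay in sync because they are all generated from the same entry.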

compiler-rt/lib/xray/xray_profiler_flags.inc

This file was added.

				//===-- xray_profiler_flags.inc ---------------------------------*- C++ -*-===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// XRay profiling runtime flags.
				//
				//===----------------------------------------------------------------------===//
				#ifndef XRAY_FLAG
				#error "Define XRAY_FLAG prior to including this file!"
				#endif

				XRAY_FLAG(uptr, per_thread_allocator_max, 2 << 20,
				"Maximum size of any single per-thread allocator.")
				XRAY_FLAG(uptr, global_allocator_max, 2 << 24,
				"Maximum size of the global allocator for profile storage.")
				XRAY_FLAG(uptr, stack_allocator_max, 2 << 24,
				"Maximum size of the traversal stack allocator.")
				XRAY_FLAG(int, grace_period_ms, 100,
				"Profile collection will wait this much time in milliseconds before "
				"resetting the global state. This gives a chance to threads to "
				"notice that the profiler has been finalized and clean up.")

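The defaults in xray_profiler_flags.inc are written as shifts, which can be misread: `2 << 20` is 2 * 2^20, not 2^20. The short check below spells the arithmetic out. It assumes the sizes are in bytes, which the flag descriptions do not state, and the constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MiB = uint64_t{1} << 20;

// Values copied from the XRAY_FLAG defaults. Treating them as byte counts
// is an assumption made for this sketch.
constexpr uint64_t PerThreadAllocatorMax = uint64_t{2} << 20;
constexpr uint64_t GlobalAllocatorMax = uint64_t{2} << 24;
constexpr uint64_t StackAllocatorMax = uint64_t{2} << 24;

static_assert(PerThreadAllocatorMax == 2 * MiB, "2 << 20 is 2 * 2^20");
static_assert(GlobalAllocatorMax == 32 * MiB, "2 << 24 is 32 * 2^20");
static_assert(StackAllocatorMax == 32 * MiB, "2 << 24 is 32 * 2^20");
```

Under the byte assumption, the per-thread allocator default is 2 MiB and the global and traversal stack allocator defaults are 32 MiB each.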
Diff 146723
