This is an archive of the discontinued LLVM Phabricator instance.

[XRay] Support AArch64 in compiler-rt
ClosedPublic

Authored by rSerge on Nov 8 2016, 12:29 PM.

Event Timeline

rSerge updated this revision to Diff 77237.Nov 8 2016, 12:29 PM
rSerge retitled this revision from to [XRay] Support AArch64 in compiler-rt.
rSerge updated this object.
rSerge added reviewers: dberris, rengolin.
rSerge added subscribers: iid_iunknown, llvm-commits.
rSerge updated this object.Nov 8 2016, 12:32 PM
dberris requested changes to this revision.Nov 8 2016, 7:07 PM
dberris edited edge metadata.

Thanks for doing this @rSerge! Do you have a plan for supporting tail-exit sleds either now or at a later time?

I ask because the x86_64 code has already started differentiating tail-exit sleds by having its own trampoline. Maybe this distinction isn't necessary in ARM, but was wondering whether there was something that needed to be done here.

This revision now requires changes to proceed.Nov 8 2016, 7:07 PM

Thanks for doing this @rSerge! Do you have a plan for supporting tail-exit sleds either now or at a later time?

I ask because the x86_64 code has already started differentiating tail-exit sleds by having its own trampoline. Maybe this distinction isn't necessary in ARM, but was wondering whether there was something that needed to be done here.

I saw your recent commit with tail-exit sleds, https://reviews.llvm.org/D26020, and I plan to implement this too, but not in the current series of commits. ARM32 doesn't have this either, so I would rather commit ARM32 and AArch64 tail-exit sled handling together than diverge between ARM32 and AArch64 in the current series. ARM32 also still doesn't support Thumb CPU mode, so some prioritization is needed as to which to do first; it's definitely not all doable at once.

I see. I'll defer to @rengolin for the ARM-specific bits.

rengolin added inline comments.Nov 11 2016, 7:13 AM
lib/xray/xray_interface.cc
29

I'd rather have the repetition here. This is looking odd... :)

lib/xray/xray_trampoline_AArch64.S
32

can't you compare the address before saving and restoring all those registers?

rSerge updated this revision to Diff 77674.Nov 11 2016, 2:42 PM
rSerge edited edge metadata.
rSerge marked an inline comment as done.
rSerge added inline comments.
lib/xray/xray_interface.cc
29

Changing.

lib/xray/xray_trampoline_AArch64.S
32

This situation is rare. Usually, if the handler is set to nullptr, the user code is unpatched, so the trampolines are not called, and there is no performance issue here.
However, there is a race condition issue. It is important to check for a nullptr handler as late as possible before the call. If we pushed the registers after checking for a nullptr handler, we would increase the chances that the handler is not nullptr when it is checked but nullptr when it is called.
In any case, we currently require that the handler function work correctly even if it is called after its removal via the XRay API.

rengolin added inline comments.Nov 14 2016, 3:53 AM
lib/xray/xray_trampoline_AArch64.S
32

This situation is rare. Usually, if the handler is set to nullptr, the user code is unpatched, so the trampolines are not called, and there is no performance issue here.

Right, makes sense.

However, there is a race condition issue. It is important to check for a nullptr handler as late as possible before the call. If we pushed the registers after checking for a nullptr handler, we would increase the chances that the handler is not nullptr when it is checked but nullptr when it is called.

I'm not sure I follow.

You have made the last update (the write of the initial branch) atomic, to make sure there are no race conditions.

Is this about users calling the thunk while the update is ongoing, who can see the patched code but still don't have a function pointer?

In any case, if there is a race condition, it needs to be fixed, not work by chance.

dberris added inline comments.Nov 14 2016, 4:00 AM
lib/xray/xray_trampoline_AArch64.S
32

I'm not sure I follow.

You have made the last update (the write of the initial branch) atomic, to make sure there are no race conditions.

Is this about users calling the thunk while the update is ongoing, who can see the patched code but still don't have a function pointer?

In any case, if there is a race condition, it needs to be fixed, not work by chance.

In x86_64 at least, there's no real "race" here because the pointer is implemented as a std::atomic<function_ptr>. The sequence here though at a high level should be:

  1. The code executing somehow knew to get to this trampoline.
  2. While some threads are running through this trampoline, another updates the global atomic pointer to nullptr.
  3. All threads that encounter the load for the function pointer should see the updated value of the pointer (since the store "happens before" the load of the pointer).

This is at least the reason why the load of the pointer happens after the registers have been stashed onto the stack.

Does that make sense?
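
For illustration, here is a rough C++ analogue of that sequence (a sketch only, not code from the patch; the real trampoline is written in assembly, and TrampolineEntrySketch and FuncId are placeholder names, with the header path assumed):

#include <atomic>
#include <cstdint>
#include "xray/xray_interface.h"  // for XRayEntryType (path assumed)

namespace __xray {
// The global handler pointer defined in xray_interface.cc.
extern std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction;
}

// Conceptual body of the entry trampoline, after the registers have been stashed.
static void TrampolineEntrySketch(int32_t FuncId) {
  auto Handler = __xray::XRayPatchedFunction.load();  // one load into a "register"
  if (Handler == nullptr)  // corresponds to CMP X2, #0 / BEQ in the assembly
    return;                // registers are restored and the trampoline returns
  // The handler may have been removed between the load and this call;
  // handlers are required to tolerate such spurious invocations.
  Handler(FuncId, XRayEntryType::ENTRY);
}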

rengolin added inline comments.Nov 14 2016, 4:22 AM
lib/xray/xray_trampoline_AArch64.S
32
  2. While some threads are running through this trampoline, another updates the global atomic pointer to nullptr.

Hum, I wasn't expecting the pointer to move back to nullptr at any time.

In this case, there's still the race condition that the value changes in between the comparison (CMP X2, #0) and the jump (BEQ), which is, admittedly, very unlikely but non-zero.

dberris added inline comments.Nov 14 2016, 4:42 AM
lib/xray/xray_trampoline_AArch64.S
32

Hum, I wasn't expecting the pointer to move back to nullptr at any time.

Yeah, that's a feature (see __xray_remove_handler()) to allow for turning off XRay logging at runtime in a sequence that guarantees things will continue to work. To disable XRay at runtime, users may:

  1. Call __xray_remove_handler() to ensure that the currently installed logging handler is removed atomically (in a cross-platform manner).
  2. Call __xray_unpatch() to return the state of the sleds to "neutral".

Both of these are optional and could be performed in any order (thus both operations need to be thread-safe); a usage sketch follows below.
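
A minimal sketch of that shutdown sequence, assuming the declarations from xray/xray_interface.h (StopXRayLogging is a hypothetical wrapper, not part of the API):

#include "xray/xray_interface.h"

// Hypothetical helper showing the order described above.
void StopXRayLogging() {
  // 1. Atomically remove the installed handler. Trampolines that have already
  //    loaded the old pointer may still invoke it once more.
  __xray_remove_handler();
  // 2. Optionally return the patched sleds to their "neutral" state.
  __xray_unpatch();
}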

In this case, there's still the race condition that the value changes in between the comparison (CMP X2, #0) and the jump (BEQ), which is, admittedly, very unlikely but non-zero.

The way we handle this in x86_64, at least, is to load the value from the global into a register, then perform the comparison against an immediate value (0). It looks like this code is already doing almost exactly the same thing -- maybe we need to ensure that the load is synchronised? My unfamiliarity with ARM64 is showing here.

rSerge marked an inline comment as done.Nov 14 2016, 7:52 AM
rSerge added inline comments.
lib/xray/xray_trampoline_AArch64.S
32

There is a possibility that the handler is removed or changed (in another thread) after the current thread's BEQ but before the old handler is called, or even while the current thread is inside the handler. I think we had this discussion for x86_64 XRay and decided that the handler code must be implemented so that it gracefully handles the situation where the handler function is called after it has been removed or changed via the XRay API. Completely eliminating the possibility that the old handler is called would require heavy synchronization on the XRay side, and it would still not eliminate the possibility that the handler is executing in one thread while being removed in another. Such heavy synchronization is undesirable for a tracing component (XRay), so it seems better to impose a restriction on the handler code to allow spurious calls (just as condition variables allow spurious wakeups, because avoiding them is too costly).
I just minimized the chances of a spurious handler call by moving the handler check as close as possible to the call.
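
A handler written under that restriction might look like the following sketch (ExampleHandler and ShuttingDown are hypothetical names; only the tolerance for late, spurious calls is the point):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include "xray/xray_interface.h"

static std::atomic<bool> ShuttingDown{false};  // set by the code that removes the handler

void ExampleHandler(int32_t FuncId, XRayEntryType Type) {
  // The trampoline may still call us shortly after __xray_remove_handler(),
  // so never assume that resources tied to the handler's lifetime are still valid.
  if (ShuttingDown.load(std::memory_order_acquire))
    return;
  std::fprintf(stderr, "xray: function %d, entry type %d\n",
               static_cast<int>(FuncId), static_cast<int>(Type));
}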

rSerge added inline comments.Nov 14 2016, 7:55 AM
lib/xray/xray_trampoline_AArch64.S
32

Here's the discussion we had earlier on this: https://groups.google.com/d/msg/llvm-dev/Ft1XUeiSKgw/iABdpOTSCAAJ

rengolin edited edge metadata.Nov 14 2016, 8:24 AM

Right, so I think the current scenario is:

  1. Disabled traces won't get there at all, as they'll branch +32 directly.
  2. Enabled traces will make use of the push/pop sequence, so no slow down.
  3. In rare cases, the handler will be removed in between an enabled call, which adds sync problems.

To fix 3, there are three ways:

  1. Make that an atomic pointer, complicating the thunk code and making it bigger and slower.
  2. Make sure any update to the function pointer is atomic with regard to its use, which complicates the tracing logic.
  3. Reduce the probability in the tracing code (some atomics/volatile) and "hope" that the distance between CMP and BEQ is small enough.

We seem to be going for 3.

Given that this only affects XRay, I don't mind this implementation, as long as the users are aware of it. The XRay documentation [1] doesn't seem to cover that.

@dberris, it's up to you, really. :)

cheers,
--renato

[1] http://llvm.org/docs/XRay.html

Right, so I think the current scenario is:

  1. Disabled traces won't get there at all, as they'll branch +32 directly.
  2. Enabled traces will make use of the push/pop sequence, so no slow down.
  3. In rare cases, the handler will be removed in between an enabled call, which adds sync problems.

For 3, there's a few things working together:

  • When tracing is enabled, the sleds have been overwritten to jump/call into the trampolines. Code here defines the trampolines. Note that the sleds can be made to jump/call into the trampolines even if the global log handler is nullptr, because that's already defined to be an std::atomic<function_ptr>.
  • The global XRayPatchedFunction is the installed "log handler". This may be nullptr and the trampolines should be loading this in an atomic manner (potentially synchronised on some platforms) before calling it. In x86_64 we're almost assured that this is safe with a normal mov instruction. This is independent of whether XRay instrumentation is on/off (i.e. it's not required that the sleds are patched or unpatched for this pointer to be defined).
  • The trampolines (defined in assembly for complete control) must check whether the function pointer in XRayPatchedFunction is valid (i.e. not nullptr) before invoking it. The trampolines still have to abide by the calling conventions of the platform, and therefore must not assume that the pointer is not nullptr (and must gracefully handle the case when it is nullptr).

To fix 3, there are three ways:

  1. Make that an atomic pointer, complicating the thunk code and making it bigger and slower.

We already do this for x86_64, where it amounts to a normal mov instruction. Maybe something else needs to be done for ARM64?

  2. Make sure any update to the function pointer is atomic with regard to its use, which complicates the tracing logic.

This is already the case in the C++ code (xray_interface.cc, see __xray_set_handler(...)). I'm not sure whether the ARM64 assembly should do anything different to load the function pointer into a register in a "sequentially consistent" manner, or even with acquire/release semantics.

  3. Reduce the probability in the tracing code (some atomics/volatile) and "hope" that the distance between CMP and BEQ is small enough.

We seem to be going for 3.

At least in x86_64, because we copy the pointer at one point into a register and compare the register, the original data may have already changed after the copy. Like Serge mentioned, we implicitly require that the handler function be able to handle spurious invocations.

Given that this only affects XRay, I don't mind this implementation, as long as the users are aware of it. The XRay documentation [1] doesn't seem to cover that.

@dberris, it's up to you, really. :)

Thanks -- I'll fix the documentation to at least make it clear what the requirements/expectations on the log handler function ought to be.

It's the ARM64 assembly that I don't know well enough to say whether there's a difference between a synchronised/atomic load of a pointer-sized value into a register and a normal load operation. If the way it's written now is already thread-safe (even if it's relaxed), then this is fine by me. :)

cheers,
--renato

Thanks Renato!

[1] http://llvm.org/docs/XRay.html

dberris accepted this revision.Nov 15 2016, 5:20 PM
dberris edited edge metadata.

Just to affirm that I'm fine with this going in, as long as we're confident that the load on ARM64 through the atomic pointer doesn't need to be synchronised further.

This revision is now accepted and ready to land.Nov 15 2016, 5:20 PM

Normal loads are not atomic on aarch64, so I still need to understand what the x86 code does and what the guarantee is.

rSerge added a comment.EditedNov 16 2016, 4:43 AM

Normal loads are not atomic on aarch64, so I still need to understand what the x86 code does and what the guarantee is.

Do you mean that LDR X2, [X1] may load part of the data pointed to by X1, then get interrupted so that another thread changes the other part of the data, and then LDR X2, [X1] continues reading, so that it ends up with combined/corrupted data?

Or do you mean weak ordering of the instructions being executed on AArch64? The latter should not be a problem, because each instruction in this sequence carries a dependency on the previous one:

LDR X1, =_ZN6__xray19XRayPatchedFunctionE
LDR X2, [X1]
CMP X2, #0
BEQ FunctionEntry_restore

Do you mean that LDR X2, [X1] may load part of the data pointed to by X1, then get interrupted so that another thread changes the other part of the data, and then LDR X2, [X1] continues reading, so that it ends up with combined/corrupted data?

No. I mean in terms of loads and stores. I was trying to understand what guarantees x86 provides, to make sure they're the same on AArch64.

LDR X1, =_ZN6__xray19XRayPatchedFunctionE
LDR X2, [X1]
CMP X2, #0
BEQ FunctionEntry_restore

I'm not thinking about this sequence, but about multiple cores executing the same code.

If you guarantee that the symbol _ZN6__xray19XRayPatchedFunctionE will always be there (I think you can, as you're not patching that one) and that calling it is *always* safe, even if tracing has already been disabled (for the case where you disable it and remaining threads that have already loaded the address into X1 still jump to it), then it should be fine.

If enabling/disabling the trace must be atomic (i.e., things will break horribly if _ZN6__xray19XRayPatchedFunctionE is called after being disabled), then you need a DMB, because the store is in one thread and the load in another.

Makes sense?

We require that the handler function handle spurious calls well (i.e. calls made after it has been removed or changed to another handler). Perhaps this is not documented anywhere except our discussions on the mailing list, but it is so for all the currently supported CPUs.
_ZN6__xray19XRayPatchedFunctionE is a global variable, so it should always be there. Its definition and setting are in the xray_interface.cc file (not touched by my current changes).
Definition:

// This is the function to call when we encounter the entry or exit sleds.
std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction{nullptr};

Setting:

__xray::XRayPatchedFunction.store(entry, std::memory_order_release);
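
In C++ terms, that release store pairs with an acquire (or stronger) load on the reader side; the fragment below is purely illustrative, not code from the patch (the actual trampoline performs the load in assembly, and FuncId stands for the function id supplied by the patched sled):

// Illustrative reader side of the release store above.
void (*Handler)(int32_t, XRayEntryType) =
    __xray::XRayPatchedFunction.load(std::memory_order_acquire);
if (Handler != nullptr)
  Handler(FuncId, XRayEntryType::ENTRY);
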
rengolin accepted this revision.Nov 16 2016, 6:11 AM
rengolin edited edge metadata.

It should be fine, then. Adding that comment somewhere would be nice, though.

rSerge updated this revision to Diff 78214.Nov 16 2016, 10:28 AM
rSerge edited edge metadata.

Added a comment about the requirement on the XRay user handler function.
Repeated the changes from https://reviews.llvm.org/D26597: Disable XRay instrumentation of the XRay runtime.

@dberris, @rengolin, could you check whether the latest changes (mentioned in the comment above) are OK and, if so, commit them to mainline?

The comment looks good to me, thanks!

This revision was automatically updated to reflect the committed changes.