This is an archive of the discontinued LLVM Phabricator instance.

Differential D23933

[XRay] ARM 32-bit no-Thumb support in compiler-rt
ClosedPublic

Authored by rSerge on Aug 26 2016, 10:21 AM.

Details

Reviewers

dberris
rengolin
t.p.northover
zatrazz
asl

Commits

rGd1617cdc492a: [XRay] ARM 32-bit no-Thumb support in compiler-rt
rG5332645c6d17: [XRay] ARM 32-bit no-Thumb support in compiler-rt
rCRT281971: [XRay] ARM 32-bit no-Thumb support in compiler-rt
rCRT280890: [XRay] ARM 32-bit no-Thumb support in compiler-rt
rL281971: [XRay] ARM 32-bit no-Thumb support in compiler-rt
rL280890: [XRay] ARM 32-bit no-Thumb support in compiler-rt

Summary

This is a port of XRay to ARM 32-bit, without Thumb support yet.
This is one of 3 commits to different repositories of XRay ARM port. The other 2 are:

https://reviews.llvm.org/D23931 (LLVM)
https://reviews.llvm.org/D23932 (Clang test)

Diff Detail

Repository: rL LLVM

Event Timeline

rSerge updated this revision to Diff 69394. Aug 26 2016, 10:21 AM

rSerge retitled this revision to [XRay] ARM 32-bit no-Thumb support in compiler-rt.

rSerge updated this object.

rSerge added reviewers: dberris, rengolin, asl, t.p.northover.

rSerge added a subscriber: llvm-commits.

Herald added subscribers: dberris, samparker, kubamracek and 2 others. Aug 26 2016, 10:21 AM

rSerge added a parent revision: D23931: [XRay] ARM 32-bit no-Thumb support in LLVM. Aug 26 2016, 10:21 AM

rSerge updated this object. Aug 26 2016, 10:24 AM

rengolin added a reviewer: zatrazz. Aug 26 2016, 11:17 AM

iid_iunknown added a subscriber: iid_iunknown. Aug 26 2016, 11:33 AM

Rebased after https://reviews.llvm.org/D21982 and ported logging to ARM: replaced the RDTSC instruction with clock_gettime().
Fixed a bug where the length of the x86_64 sled (11-12 bytes) was passed to mprotect() on ARM, where the sled size is 28 bytes. This sometimes caused a segmentation fault when patching at runtime.
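
The mprotect() bug described here comes down to range arithmetic: the length must cover the whole sled for the target architecture, rounded out to page boundaries. A minimal sketch of that arithmetic follows, assuming a hypothetical helper `pageRangeForSled` (not part of this patch) and a power-of-two page size. How the runtime actually computes the range is not shown in this review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from the patch: the page-aligned range that
// has to be made writable in order to patch SledSize bytes at SledAddr.
struct PageRange {
  std::uintptr_t Begin; // page-aligned start address
  std::size_t Length;   // a multiple of PageSize
};

inline PageRange pageRangeForSled(std::uintptr_t SledAddr,
                                  std::size_t SledSize, std::size_t PageSize) {
  assert(PageSize != 0 && (PageSize & (PageSize - 1)) == 0 &&
         "page size must be a power of two");
  const std::uintptr_t Mask = ~static_cast<std::uintptr_t>(PageSize - 1);
  // Round the start down and the end up to page boundaries.
  const std::uintptr_t Begin = SledAddr & Mask;
  const std::uintptr_t End = (SledAddr + SledSize + PageSize - 1) & Mask;
  return {Begin, static_cast<std::size_t>(End - Begin)};
}
```

With 4096-byte pages and a sled that starts 16 bytes before a page boundary, a 12-byte length covers one page, while the 28-byte ARM sled needs two. That is the kind of case in which too short a length leaves part of the sled write-protected.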

rSerge added a parent revision: D21982: [compiler-rt][XRay] Initial per-thread inmemory logging implementation. Aug 26 2016, 3:36 PM

dberris requested changes to this revision. Aug 28 2016, 6:13 PM

dberris edited edge metadata.

dberris added inline comments.

lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗	(On Diff #69440)	Is this required to make this change work? Or should this really happen as an isolated change?
lib/xray/xray_arm.cc
28–31 ↗	(On Diff #69440)	The Coding Standards seem to require that variables be camel case starting with a capital letter. http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly
109 ↗	(On Diff #69440)	On ARM, does `std::memory_order_release` turn into writes that have fences after to ensure they're visible? Or am I confusing ARM for an architecture that only has relaxed memory order semantics?
lib/xray/xray_interface.cc
30 ↗	(On Diff #69440)	Good question. I may have miscounted. We can fix that later, once this lands (or if you can change and test to make sure it doesn't break, I'm fine with it).

This revision now requires changes to proceed. Aug 28 2016, 6:13 PM

Please, see my responses inline. I'll upload the updated patch in a few minutes.

lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗	(On Diff #69440)	Without this change, XRay for ARM doesn't get cross-compiled from Windows to ARM-Linux.
lib/xray/xray_arm.cc
28–31 ↗	(On Diff #69440)	Changing to an enum. Isn't it better to leave the register parameters of instructions separated by underscore, rather than making a name like `PO_PushR0Lr`?
109 ↗	(On Diff #69440)	`std::memory_order_release` should do what the standard requires on any compiler and CPU, unless there is a bug in them. Indeed, x86_64 is strongly ordered, so for the CPU `std::memory_order_relaxed` is always sufficient (but not for the compiler: it may reorder). However, ARM is weakly ordered, so at least `std::memory_order_release` is required here. From http://en.cppreference.com/w/cpp/atomic/memory_order : "All writes in the current thread are visible in other threads that acquire the same atomic variable...". The problem is that we cannot force the CPU on the other cores to perform an acquire operation when fetching instructions. However, an ARM CPU always fetches the instruction at `pc+8`, decodes the instruction at `pc+4` and executes the instruction at `pc` (program counter register). So as far as I know there is no reordering problem here. However, during unpatching we cannot fill the 6 tail instructions with NOPs (and I think the same applies to x86/x86_64), because a concurrent core may have already fetched the first instruction of the patch and therefore relies on the rest of the instructions in the patch being correct.
lib/xray/xray_interface.cc
30 ↗	(On Diff #69440)	It may break, again, something to do with alignment, you may even need to increase mprotect length to 18 bytes. There may be a chance that when writing the last byte of 11-byte sled, the CPU may need to access a separate 64-bit word in memory. I don't know whether it will get permission denied in case only the first byte of this word is writeable and the other 7 bytes are write-protected.
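
The write ordering argued for in the `memory_order_release` comment can be sketched as follows: fill in the tail of the sled first, then publish the first word with a release store. This is a hedged illustration only. `publishWords` is a hypothetical name, and it works on an array of `std::atomic<uint32_t>` rather than on real code memory. It orders the writer's stores as data; it does not settle how other cores fetch instructions, which is the question left open in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: write words 1..Count-1 first, then publish word 0.
// The release store on word 0 keeps the earlier stores from being reordered
// after it, as seen by a thread that acquires word 0.
inline void publishWords(std::atomic<std::uint32_t> *Slot,
                         const std::uint32_t *Words, std::size_t Count) {
  assert(Slot != nullptr && Words != nullptr);
  if (Count == 0)
    return;
  for (std::size_t I = 1; I < Count; ++I)
    Slot[I].store(Words[I], std::memory_order_relaxed);
  Slot[0].store(Words[0], std::memory_order_release);
}
```

A 28-byte ARM sled corresponds to seven 32-bit words, so `Count` would be 7 in that case.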

Changed the opcodes from constants to an enum.

I'll defer to someone else who understands the ARM assembly parts.

You might also consider extending the file header type to indicate what platform the trace was generated from, so tools can determine what to do with a trace that comes from a specific CPU. I'm happy to do it later, but wanted to know your thoughts on how we might encode that information appropriately.

lib/xray/xray_arm.cc
29–32 ↗	(On Diff #69610)	I think the coding standards say they should look like types, so just CamelCase. I don't make up the rules, but I just try to follow them. :)
38 ↗	(On Diff #69610)	Shouldn't these functions be camelCase? As in `getMoveMask(...)` according to the guide? I understand you're just following the conventions of the files around, and those mistakes are mine -- but do you mind changing them before landing?
lib/xray/xray_inmemory_log.cc
30 ↗	(On Diff #69610)	`static constexpr` instead? Also, please follow the naming conventions for this one too. Another thing -- couldn't you just use std::chrono for the constant here? http://en.cppreference.com/w/cpp/chrono/duration

This revision is now accepted and ready to land. Aug 29 2016, 4:41 PM

Sorry, it was holiday in UK today...

I'll have a look at the patches tomorrow.

Cheers,
Renato

Hi,

I have a number of comments and requests. One general remark, also, is to comment on the top of the assembly functions what's the function signature in C, so that I know how to review the function's code. Otherwise, it's very hard to understand all possibilities.

cheers,
--renato

lib/xray/xray_arm.cc
22 ↗	(On Diff #69610)	Can't you define this on a top-level header and implement on an arch-specific cpp file? I don't think these things should be changing between arches.
50 ↗	(On Diff #69610)	Why haven't you used inline assembly here? This is really unreadable and error prone.
110 ↗	(On Diff #69610)	All modern ARM CPUs are multi-issue, out-of-order, so you cannot guarantee ordering without a data/memory barrier. ARMv8 has better atomic support.
lib/xray/xray_inmemory_log.cc
30 ↗	(On Diff #69610)	Better still, get the proper value? I know it can be a tad different from hardware to hardware, but not even trying isn't really helpful.
70 ↗	(On Diff #69610)	I'd use #defined x86_64 instead, and replicate to all arches it's supposed to work later. We don't want broken fall-back logic, as it's really hard to find bugs later.
188 ↗	(On Diff #69610)	I don't understand this... Is this just hard-coding to 1GHz? /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq Also works on most ARM and AArch64 boards.
lib/xray/xray_trampoline_arm.S
1 ↗	(On Diff #69610)	You're going to need more than that. Assemblers are very picky on what's valid, and Clang is especially so. You'll need to put the minimum requirements on the header (cpu, fpu, arch, thumb-interop, etc). You'll also need to put ".arch" to extend support on functions that use new instructions where the header flags don't support them. Example:

.syntax unified
.arch armv6t2
.fpu vfpv2
...
v7_only_func:
.fpu vfpv3
VMOV ...

This will mean you can use this on v6T2 onward, and that "v7_only_func" can only be used by arches with vfpv3, guaranteed by the dynamic dispatch. See libunwind and other compiler-RT ARM functions for this behaviour.
6 ↗	(On Diff #69610)	Why not? Vectorizers can and do use Q regs...
7 ↗	(On Diff #69610)	The comment character is "@" not "//". This will only work if compiled by a C++ compiler, not an assembler.
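
The sysfs suggestion made for xray_inmemory_log.cc line 188 can be sketched as below. This is an illustration and not code from the patch: `parseMaxFreqHz` and `readMaxFreqHz` are hypothetical names, and the sketch assumes that the file holds a single decimal value in kHz, as the Linux cpufreq sysfs interface reports it.

```cpp
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: parse a cpufreq value given in kHz and return Hz.
// Returns 0 when the stream does not start with a number.
inline std::uint64_t parseMaxFreqHz(std::istream &In) {
  std::uint64_t KHz = 0;
  if (!(In >> KHz))
    return 0;
  return KHz * 1000;
}

// Hypothetical helper: read the maximum CPU frequency from sysfs.
// Returns 0 when the file is missing or unreadable.
inline std::uint64_t readMaxFreqHz(
    const std::string &Path =
        "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq") {
  std::ifstream File(Path);
  if (!File)
    return 0;
  return parseMaxFreqHz(File);
}
```

A caller would need a fallback for the zero case, since the file is not present on every system.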

rengolin added inline comments. Aug 30 2016, 6:38 AM

lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗	(On Diff #69610)	Have you tested this on Linux and Mac, to make sure it also works there?
lib/xray/xray_arm.cc
38 ↗	(On Diff #69610)	Yes, please. Let's not add different styles if we don't have to. Camel case, caps for variables, no caps for functions. Enum values have format INI_Name, with "INI" the initials of the enum's name, all caps, and the Name a unique identifier within the enum.

This revision now requires changes to proceed. Aug 30 2016, 6:38 AM

I've responded inline and will upload a new diff in a minute.

You might also consider extending the file header type to indicate what platform the trace was generated from, so tools can determine what to do with a trace that comes from a specific CPU. I'm happy to do it later, but wanted to know your thoughts on how we might encode that information appropriately.

I think we should look for some enumeration in compiler-rt or LLVM listing the CPU architectures, so to make XRay CPU codes consistent with that. But so far XRayFileHeader and XRayRecord don't seem to differ between CPUs, or do they?

lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗	(On Diff #69610)	I can test on Ubuntu. I don't have access to a Mac.
lib/xray/xray_arm.cc
22 ↗	(On Diff #69610)	Moving to xray_interface_internal.h .
29–32 ↗	(On Diff #69610)	Ok, changing to CamelCase.
38 ↗	(On Diff #69610)	The nearby code of sanitizers (ASAN, sanitizer_common) mostly names the functions starting with a capital letter. Do you still think I should name functions starting with a lowercase letter?
50 ↗	(On Diff #69610)	Because it is not possible. This code patches the user program at runtime with different instructions depending on the data in the user program. There doesn't seem to be anything we can put as inline assembly in compiler-rt code. It may be possible to use assembly strings, but that would require linking an assembler into the user program.
110 ↗	(On Diff #69610)	I used `memory_order_release` here, on the writer side. According to C++11 standard, this should prevent reordering previous writes past this point (inserting fences if necessary, etc.). There is little we can do on the reader side, as there is no data reader: it is the CPU fetching instructions on the other side. Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?
lib/xray/xray_inmemory_log.cc
30 ↗	(On Diff #69610)	What do you mean by getting the proper value? I think that getting here anything other than just simple "1 billion" would be too unexpected, and we would need error checking for that. Furthermore, getting it from other compile units may result in initialization order issues. It is easier and more reliable to just have it as a 1 billion constant.
30 ↗	(On Diff #69610)	I'm renaming this to get rid of `c` prefix. I think that pulling the whole `chrono` just for nanoseconds per second number may be a waste of compile time.
70 ↗	(On Diff #69610)	Changing.
188 ↗	(On Diff #69610)	No, this is not hard-coding to 1GHz. x86_64 uses `RDTSCP` instruction in the numerator, that is why the denominator is CPU frequency. There is nothing similar for ARM in user mode. So we fall back on `clock_gettime()` system call. It provides time in nanoseconds. That is why we use 1 billion as the denominator. Shall I rename the variable from `CPUFrequency` to something like `TicksPerSecond` so that it is more comprehensive on CPUs without instructions like `RDTSC`?
lib/xray/xray_trampoline_arm.S
1 ↗	(On Diff #69610)	I guess that the dynamic dispatch doesn't help us, because we are calling the function from machine code written at run-time into the code of user functions. Adding `.arch armv7` and `.fpu vfpv3` .
6 ↗	(On Diff #69610)	That can happen. But are `Q` registers used for passing parameters and returning values? Perhaps my assembly comment is misleading: here (in `__xray_FunctionEntry`) we need to push&pop every register which may be used for passing parameters. And in `__xray_FunctionExit` we need to push&pop every register which may be used for returning values from C/C++ functions.
7 ↗	(On Diff #69610)	Changing.
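
The `clock_gettime()` fallback described for xray_inmemory_log.cc line 188 amounts to treating nanoseconds as ticks, with one billion ticks per second as the denominator. A minimal sketch under stated assumptions: `toNanoseconds` and `readTicks` are hypothetical names, and `CLOCK_MONOTONIC` is chosen here only for illustration, since the clock id used by the patch is not shown in this review.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Ticks per second when the tick source is a nanosecond clock.
constexpr std::uint64_t NanosecondsPerSecond = 1000ULL * 1000 * 1000;

// Flatten a timespec into a single nanosecond count.
inline std::uint64_t toNanoseconds(const timespec &TS) {
  assert(TS.tv_nsec >= 0 &&
         static_cast<std::uint64_t>(TS.tv_nsec) < NanosecondsPerSecond);
  return static_cast<std::uint64_t>(TS.tv_sec) * NanosecondsPerSecond +
         static_cast<std::uint64_t>(TS.tv_nsec);
}

// Hypothetical stand-in for a cycle-counter read on targets without one.
// Returns 0 if the clock cannot be read.
inline std::uint64_t readTicks() {
  timespec TS;
  if (clock_gettime(CLOCK_MONOTONIC, &TS) != 0)
    return 0;
  return toNanoseconds(TS);
}
```

Dividing a tick difference by `NanosecondsPerSecond` then gives seconds, in the same way that dividing a cycle-counter difference by the counter frequency does on x86_64.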

rSerge updated this revision to Diff 69744. Aug 30 2016, 1:19 PM

rSerge edited edge metadata.

rSerge marked an inline comment as done.

Marked the done comments according to the diff just uploaded.

Thanks for the changes, some more comments...

lib/xray/xray_arm.cc
39 ↗	(On Diff #69744)	This is a new file, it should use LLVM's policy.
51 ↗	(On Diff #69744)	Of course. Ignore me. Though, this is the same as the one below, and you could merge them both by passing the register name and ORRing [reg << 12] with the instruction, and making sure reg < 15.
111 ↗	(On Diff #69744)	"Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?"
I'm not sure what you mean. Many Cortex-AR cores are OOO. That's their design, you can't change that. Or maybe you mean "out of order amongst threads", which is not what I'm talking about. Since this is in C++, I'm guessing the compiler will "do the right thing" (tm) with regards to memory barriers, and the core being OOO makes no difference here. Probably just a nomenclature clash around "OOO" between ourselves... :)
lib/xray/xray_inmemory_log.cc
188 ↗	(On Diff #69744)	I still find this confusing... Is this 10^9 just a normalising factor, to get compatible numbers? If anything, this line needs a serious comment explaining why this is what it is. Also, clock_gettime() will return a system wide, sequential and consistent number, while RDTSCP will return a counter that is internal to each CPU (and will be different across CPUs), thus prone to problems while context-switching. Regardless, if you want CPU frequency, you can do exactly what you've done to x86.
lib/xray/xray_trampoline_arm.S
7 ↗	(On Diff #69744)	Right, so it's not C/C++, it's AAPCS (the ARM Procedure Call Standard). As long as you're not passing NEON vectors as arguments, Q registers are not used (see arm_neon.h), and d0-d7 should take care of all VFP registers.
34 ↗	(On Diff #69744)	A8.8.132 POP (ARM): "ARM deprecates the use of this instruction with both the LR and the PC in the list."
40 ↗	(On Diff #69744)	Same again, if you're not using NEON vectors, this is fine.
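
The register-merging suggestion for xray_arm.cc line 51 (ORing `reg << 12` into the instruction) can be checked with a small worked example. The sketch below reuses the base opcodes 0xE3000000 (MOVW) and 0xE3400000 (MOVT) and the immediate layout from the xray_arm.cc listing in this revision. The function names `movImm16Field`, `encodeMovw` and `encodeMovt` are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Spread a 16-bit immediate 0xWXYZ into the ARM MOVW/MOVT fields:
// imm4 (W) goes to bits 19..16 and imm12 (XYZ) to bits 11..0.
inline std::uint32_t movImm16Field(std::uint32_t Imm16) {
  return (Imm16 & 0xfff) | ((Imm16 & 0xf000) << 4);
}

// MOVW R<Reg>, #<lower 16 bits of Value>
inline std::uint32_t encodeMovw(std::uint8_t Reg, std::uint32_t Value) {
  assert(Reg <= 15 && "register number must be 0 to 15");
  return 0xE3000000u | (std::uint32_t(Reg) << 12) | movImm16Field(Value);
}

// MOVT R<Reg>, #<upper 16 bits of Value>
inline std::uint32_t encodeMovt(std::uint8_t Reg, std::uint32_t Value) {
  assert(Reg <= 15 && "register number must be 0 to 15");
  return 0xE3400000u | (std::uint32_t(Reg) << 12) |
         movImm16Field(Value >> 16);
}
```

For example, loading 0x12345678 into `ip` (r12) gives the pair 0xE305C678 (MOVW) and 0xE341C234 (MOVT). The register number lands in bits 15..12 in both words.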

rSerge updated this revision to Diff 69891. Aug 31 2016, 12:10 PM

rSerge edited edge metadata.

rSerge marked 2 inline comments as done. Aug 31 2016, 12:13 PM

rSerge added inline comments.

lib/xray/xray_arm.cc
39 ↗	(On Diff #69744)	Changing.
51 ↗	(On Diff #69744)	Ok, changing.
111 ↗	(On Diff #69744)	You are thinking about data: the CPU executes out of order the instructions which manipulate data. On the data side we only write, and `memory_order_release` should prevent reordering. But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes. So I mean the scenario where the CPU is reading instructions themselves from the code segment ("fetching") in order to then decode the instructions and finally execute. Can it fetch instruction at `pc+4` earlier than the instruction at `pc`? As I understood from ARM specification, it can't: ARM CPU is always fetching the instruction at `pc`, decoding the instruction at `pc-4` and executing the instruction at `pc-8`.
lib/xray/xray_inmemory_log.cc
188 ↗	(On Diff #69744)	10^9 is the number of nanoseconds per second. It can be viewed as a normalizing factor, to get measurements in seconds. I would prefer something of higher resolution than clock_gettime() (they say on the internet that its resolution is only 1 ms, while `RDTSCP` resolution is around 1 ns), but I don't know how to do it on ARM. I searched on the internet and figured out that the cycle counter on ARM is 1) not available in user mode and 2) changes frequency when the CPU frequency changes. In contrast, RDTSCP on x86 is available in user mode and has a constant frequency, independent of CPU power-saving / turbo frequency adjustments. I'm adding a comment.
lib/xray/xray_trampoline_arm.S
34 ↗	(On Diff #69744)	The list contains only `pc`, not both `lr` and `pc`.

rengolin added inline comments. Sep 1 2016, 7:24 AM

lib/xray/xray_arm.cc
112 ↗	(On Diff #69891)	Right, it's a bit more complicated than that... A good quick source of all factors: https://community.arm.com/groups/processors/blog/2011/03/22/memory-access-ordering--an-introduction

"But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes."
So, you need to tell the other cores to wait until you write, then you need to store-release, then they can fetch. Otherwise, they'll fetch NOPs.

"Can it fetch instruction at pc+4 earlier than the instruction at pc?"
In theory, no. In practice, maybe. ARM has separate caches for code and data. If core0 reads 'pc' - 16, and the Icache line is, say, 32, then the NOPs are in core0's cache. Before core0 reaches the 'pc', core1 gets it, sets a load-acquire, and jumps to your thunk. At that time, you really want core0 to stop before reaching that specified 'pc', or it'll execute NOPs. Once core1 has written its shim, it then store-releases and core0 can continue, now executing your inserted code. In summary, you need a barrier. Since this is about code fetching, you need an instruction barrier (ISB) not a data barrier (DMB).

"ARM CPU is always fetching the instruction at pc, decoding the instruction at pc-4 and executing the instruction at pc-8."
On the same core, instructions are (again, in theory) fetched and decoded "in order", but they're stored in a queue, which gets dispatched at any convenient time. So there is no concept of 'pc+8' at all. The cores will also speculatively fetch, decode and even execute (ex. branch prediction, peephole, etc). So, there is absolutely no guarantee that any instruction will be fetched, decoded or executed before another, unless they have a strict dependency relationship, either by data dependency, atomic instructions or barriers.
lib/xray/xray_inmemory_log.cc
188 ↗	(On Diff #69891)	Ah, I see. I didn't know RDTSCP had a fixed frequency. In that case, a comment explaining it would be most welcome.
lib/xray/xray_trampoline_arm.S
35 ↗	(On Diff #69891)	Sorry, ignore me.
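
The point about separate code and data caches can be illustrated from the writer's side. The sketch below is not what this patch does. It assumes a GCC or Clang toolchain and uses the compiler builtin `__builtin___clear_cache` to request instruction-cache maintenance for the range that was just written. `writeCodeAndSync` is a hypothetical name, and the destination is assumed to be writable already. It does not by itself make other cores wait or execute a barrier, which is the coordination problem described in the comment on xray_arm.cc line 112.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: copy instruction words into place, then ask the
// compiler runtime to synchronise the instruction cache for that range.
// On targets with coherent instruction caches the builtin may be a no-op.
inline void writeCodeAndSync(std::uint32_t *Dest, const std::uint32_t *Words,
                             std::size_t Count) {
  assert(Dest != nullptr && Words != nullptr);
  const std::size_t Bytes = Count * sizeof(std::uint32_t);
  std::memcpy(Dest, Words, Bytes);
  char *Begin = reinterpret_cast<char *>(Dest);
  __builtin___clear_cache(Begin, Begin + Bytes);
}
```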

Fixed Ubuntu x86_64 build. Implemented the changes requested in code review comments.

rSerge marked an inline comment as done. Sep 1 2016, 11:57 AM

rSerge added inline comments.

lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗	(On Diff #69891)	Tested on Ubuntu x86_64.
lib/xray/xray_inmemory_log.cc
188 ↗	(On Diff #69891)	Adding.

This looks good to me, thanks for all the changes!

If @dberris is happy, I'm happy. :)

cheers,
--renato

PS: I may have missed a few things, but we can fix as we go, when support gets better.

This revision is now accepted and ready to land. Sep 1 2016, 1:55 PM

Still, LGTM -- thanks @rSerge!

Rebased to the latest revision. I don't have commit access rights. Could someone commit?

Herald added a subscriber: beanz. Sep 7 2016, 11:26 AM

Landing this now.

Closed by commit rL280890: [XRay] ARM 32-bit no-Thumb support in compiler-rt (authored by dberris). Sep 7 2016, 5:37 PM

This revision was automatically updated to reflect the committed changes.

dberris reopened this revision. Sep 8 2016, 9:31 PM

This revision is now accepted and ready to land. Sep 8 2016, 9:31 PM

Reverted in rL280969, need to resolve comments in D23931 before trying to land again.

This revision now requires changes to proceed. Sep 8 2016, 9:31 PM

Removed .arch armv7 directive

Herald added a subscriber: mgorny. Sep 16 2016, 6:25 AM

Fixed patch file format.

So, you're forcing vfpv3, which is armv7-only. AFAICS, you're only using VPUSH and VPOP, which is available since vfpv2 (which is also available in v6), so maybe a better fix would be to use:

.arch armv6t2
.fpu vfpv2

which should work on armv7, too.

Since this is the restriction we have inside the code, it would be more clear this way. Can you do a quick test with those directives?

cheers,
--renato

@rengolin -- do we need to wait for the test, or can we do that post-commit?

lib/xray/xray_arm.cc
1 ↗	(On Diff #71636)	nit: s/xray_arm.cpp/xray_arm.cc/

This revision is now accepted and ready to land. Sep 18 2016, 5:44 PM

The test is to make sure it won't break the bots again. Should be quick, as he had done it before.

In D23933#544817, @rengolin wrote:
So, you're forcing vfpv3, which is armv7-only. AFAICS, you're only using VPUSH and VPOP, which is available since vfpv2 (which is also available in v6), so maybe a better fix would be to use:
.arch armv6t2
.fpu vfpv2
which should work on armv7, too.

Since this is the restriction we have inside the code, it would be more clear this way. Can you do a quick test with those directives?

cheers,
--renato

armv6t2 shouldn't work because MOVW and MOVT instructions are available only since armv7.
I can test with .fpu vfpv2, though this is not quick (compilation and moving between VMs takes substantial time).

lib/xray/xray_arm.cc
1 ↗	(On Diff #71636)	Sorry, I'm not that good with the lingo. What is the meaning of this comment?

In D23933#546267, @rSerge wrote:

armv6t2 shouldn't work because MOVW and MOVT instructions are available only since armv7.

Movw/Movt are Thumb2 instructions and were introduced in ARMv6T2.

I can test with .fpu vfpv2, though this is not quick (compilation and moving between VMs takes substantial time).

I'm not worried about the assembly code working on v6T2 or VFPv2, I'm worried about the toolchain coping with the options.

You just need to get the complete command line with a recent enough cross-toolchain (4.8+) and try on the resulting file.

cheers,
--renato

PS: You should really get the ARM ARMs: http://llvm.org/docs/CompilerWriterInfo.html

lib/xray/xray_arm.cc
1 ↗	(On Diff #71636)	It means the name on the comment is wrong and you have to replace (s///) with the right one. You're calling it `xray_arm.cc` but has `xray_arm.cpp` in the header.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

lib/xray/xray_arm.cc
1 ↗	(On Diff #71636)	I just did it by example. Ok, I'm fixing the comment

Implemented the changes requested in the code review comments.

In D23933#546615, @rSerge wrote:
I've tested
.arch armv6t2
.fpu vfpv2
with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

Sorry, I wasn't clear. These two lines should work on any toolchain, my point is if that makes it break with your gnu toolchain because of the same issue (minimal ISA support assumed) in *conjunction* with the rest of the code.

In D23933#546713, @rengolin wrote:
In D23933#546615, @rSerge wrote:
I've tested
.arch armv6t2
.fpu vfpv2
with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.
Sorry, I wasn't clear. These two lines should work on any toolchain, my point is if that makes it break with your gnu toolchain because of the same issue (minimal ISA support assumed) in *conjunction* with the rest of the code.

That error doesn't seem to happen, at least with the toolchains I've tested.

In D23933#547444, @rSerge wrote:

That error doesn't seem to happen, at least with the toolchains I've tested.

Perfect, let's try again. :)

@dberris, you have already committed the other two patches again, right? Would you do the honours?

In D23933#547464, @rengolin wrote:

In D23933#547444, @rSerge wrote:

That error doesn't seem to happen, at least with the toolchains I've tested.

Perfect, let's try again. :)

@dberris, you have already committed the other two patches again, right? Would you do the honours?

Yep -- this one's the last piece. Happy to land now. :)

Closed by commit rL281971: [XRay] ARM 32-bit no-Thumb support in compiler-rt (authored by dberris). Sep 20 2016, 7:44 AM

This revision was automatically updated to reflect the committed changes.

rSerge added a child revision: D24799: [XRay] Check in Clang whether XRay supports the target when -fxray-instrument is passed. Sep 21 2016, 7:28 AM

rSerge added a child revision: D25030: [XRay] Support for for tail calls for ARM no-Thumb. Sep 28 2016, 8:49 AM

Revision Contents

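Path	Size
compiler-rt/trunk/cmake/config-ix.cmake	2 lines
compiler-rt/trunk/lib/sanitizer_common/scripts/gen_dynamic_list.py	3 lines
compiler-rt/trunk/lib/xray/CMakeLists.txt	8 lines
compiler-rt/trunk/lib/xray/xray_arm.cc	131 lines
compiler-rt/trunk/lib/xray/xray_inmemory_log.cc	52 lines
compiler-rt/trunk/lib/xray/xray_interface.cc	141 lines
compiler-rt/trunk/lib/xray/xray_interface_internal.h	22 lines
compiler-rt/trunk/lib/xray/xray_trampoline_arm.S	65 lines
compiler-rt/trunk/lib/xray/xray_x86_64.cc	116 lines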

Diff 71933

compiler-rt/trunk/cmake/config-ix.cmake

[155 lines not shown]
 set(ALL_PROFILE_SUPPORTED_ARCH ${X86} ${X86_64} ${ARM32} ${ARM64} ${PPC64}
     ${MIPS32} ${MIPS64})
 set(ALL_TSAN_SUPPORTED_ARCH ${X86_64} ${MIPS64} ${ARM64} ${PPC64})
 set(ALL_UBSAN_SUPPORTED_ARCH ${X86} ${X86_64} ${ARM32} ${ARM64}
     ${MIPS32} ${MIPS64} ${PPC64} ${S390X})
 set(ALL_SAFESTACK_SUPPORTED_ARCH ${X86} ${X86_64} ${ARM64} ${MIPS32} ${MIPS64})
 set(ALL_CFI_SUPPORTED_ARCH ${X86} ${X86_64} ${MIPS64})
 set(ALL_ESAN_SUPPORTED_ARCH ${X86_64})
 set(ALL_SCUDO_SUPPORTED_ARCH ${X86_64})
-set(ALL_XRAY_SUPPORTED_ARCH ${X86_64})
+set(ALL_XRAY_SUPPORTED_ARCH ${X86_64} ${ARM32})

 if(APPLE)
   include(CompilerRTDarwinUtils)

   find_darwin_sdk_dir(DARWIN_osx_SYSROOT macosx)
   find_darwin_sdk_dir(DARWIN_iossim_SYSROOT iphonesimulator)
   find_darwin_sdk_dir(DARWIN_ios_SYSROOT iphoneos)
   find_darwin_sdk_dir(DARWIN_watchossim_SYSROOT watchsimulator)
[347 lines not shown]

compiler-rt/trunk/lib/sanitizer_common/scripts/gen_dynamic_list.py

[13 lines not shown]
 # gen_dynamic_list.py libclang_rt.san.a [ files ... ]
 #
 #===------------------------------------------------------------------------===#
 import argparse
 import os
 import re
 import subprocess
 import sys
+import platform

 new_delete = set([
   '_Znam', '_ZnamRKSt9nothrow_t', # operator new[](unsigned long)
   '_Znwm', '_ZnwmRKSt9nothrow_t', # operator new(unsigned long)
   '_Znaj', '_ZnajRKSt9nothrow_t', # operator new[](unsigned int)
   '_Znwj', '_ZnwjRKSt9nothrow_t', # operator new(unsigned int)
   '_ZdaPv', '_ZdaPvRKSt9nothrow_t', # operator delete[](void *)
   '_ZdlPv', '_ZdlPvRKSt9nothrow_t', # operator delete(void *)
[15 lines not shown]
 def get_global_functions(library):
   nm = os.environ.get('NM', 'nm')
   nm_proc = subprocess.Popen([nm, library], stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
   nm_out = nm_proc.communicate()[0].decode().split('\n')
   if nm_proc.returncode != 0:
     raise subprocess.CalledProcessError(nm_proc.returncode, nm)
   func_symbols = ['T', 'W']
   # On PowerPC, nm prints function descriptors from .data section.
-  if os.uname()[4] in ["powerpc", "ppc64"]:
+  if platform.uname()[4] in ["powerpc", "ppc64"]:
     func_symbols += ['D']
   for line in nm_out:
     cols = line.split(' ')
     if len(cols) == 3 and cols[1] in func_symbols :
       functions.append(cols[2])
   return functions

 def main(argv):
[50 lines not shown]

compiler-rt/trunk/lib/xray/CMakeLists.txt

 # Build for the XRay runtime support library.

 set(XRAY_SOURCES
   xray_init.cc
   xray_interface.cc
   xray_flags.cc
   xray_inmemory_log.cc
   )

 set(x86_64_SOURCES
+    xray_x86_64.cc
     xray_trampoline_x86_64.S
     ${XRAY_SOURCES})

+set(arm_SOURCES
+    xray_arm.cc
+    xray_trampoline_arm.S
+    ${XRAY_SOURCES})
+
+set(armhf_SOURCES ${arm_SOURCES})
+
 include_directories(..)
 include_directories(../../include)

 set(XRAY_CFLAGS ${SANITIZER_COMMON_CFLAGS})

 set(XRAY_COMMON_DEFINITIONS XRAY_HAS_EXCEPTIONS=1)

 add_compiler_rt_object_libraries(RTXray
[22 lines not shown]

compiler-rt/trunk/lib/xray/xray_arm.cc

				//===-- xray_arm.cc ---------------------------------------------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file is a part of XRay, a dynamic runtime instrumentation system.
				//
				// Implementation of ARM-specific routines (32-bit).
				//
				//===----------------------------------------------------------------------===//
				#include "xray_interface_internal.h"
				#include "sanitizer_common/sanitizer_common.h"
				#include <atomic>
				#include <cassert>

				namespace __xray {

				// The machine codes for some instructions used in runtime patching.
				enum class PatchOpcodes : uint32_t
				{
				PO_PushR0Lr = 0xE92D4001, // PUSH {r0, lr}
				PO_BlxIp = 0xE12FFF3C, // BLX ip
				PO_PopR0Lr = 0xE8BD4001, // POP {r0, lr}
				PO_B20 = 0xEA000005 // B #20
				};

				// 0xUUUUWXYZ -> 0x000W0XYZ
				inline static uint32_t getMovwMask(const uint32_t Value) {
				return (Value & 0xfff) | ((Value & 0xf000) << 4);
				}

				// 0xWXYZUUUU -> 0x000W0XYZ
				inline static uint32_t getMovtMask(const uint32_t Value) {
				return getMovwMask(Value >> 16);
				}

				// Writes the following instructions:
				// MOVW R<regNo>, #<lower 16 bits of the |Value|>
				// MOVT R<regNo>, #<higher 16 bits of the |Value|>
				inline static uint32_t* write32bitLoadReg(uint8_t regNo, uint32_t* Address,
				const uint32_t Value) {
				// This is a fatal error: we cannot just report it and continue execution.
				assert(regNo <= 15 && "Register number must be 0 to 15.");
				// MOVW R, #0xWXYZ in machine code is 0xE30WRXYZ
				*Address = (0xE3000000 | (uint32_t(regNo)<<12) | getMovwMask(Value));
				Address++;
				// MOVT R, #0xWXYZ in machine code is 0xE34WRXYZ
				*Address = (0xE3400000 | (uint32_t(regNo)<<12) | getMovtMask(Value));
				return Address + 1;
				}
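The encoding comments above (0xE30WRXYZ for MOVW, 0xE34WRXYZ for MOVT) can be cross-checked with a small standalone sketch. The helper names `splitImm16`, `encodeMovw` and `encodeMovt` are hypothetical and not part of the patch; only the bit layout follows the code above, where the 16-bit immediate is split into a 4-bit field at bits 19:16 and a 12-bit field at bits 11:0.

```cpp
#include <cassert>
#include <cstdint>

// Splits a 16-bit immediate 0xWXYZ into the 0x000W0XYZ layout used by the
// ARM (A32) MOVW/MOVT encodings. Hypothetical helper, not from the patch.
constexpr uint32_t splitImm16(uint32_t Imm16) {
  return (Imm16 & 0xfff) | ((Imm16 & 0xf000) << 4);
}
static_assert(splitImm16(0x5678) == 0x00050678, "imm4 lands in bits 19:16");

// Encodes MOVW R<Reg>, #<lower 16 bits of Value>.
inline uint32_t encodeMovw(uint8_t Reg, uint32_t Value) {
  assert(Reg <= 15 && "Register number must be 0 to 15.");
  return 0xE3000000u | (uint32_t(Reg) << 12) | splitImm16(Value & 0xffff);
}

// Encodes MOVT R<Reg>, #<upper 16 bits of Value>.
inline uint32_t encodeMovt(uint8_t Reg, uint32_t Value) {
  assert(Reg <= 15 && "Register number must be 0 to 15.");
  return 0xE3400000u | (uint32_t(Reg) << 12) | splitImm16(Value >> 16);
}
```

For the value 0x12345678, this yields 0xE3050678 for `MOVW r0` and 0xE341C234 for `MOVT ip` (register 12).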

				// Writes the following instructions:
				// MOVW r0, #<lower 16 bits of the |Value|>
				// MOVT r0, #<higher 16 bits of the |Value|>
				inline static uint32_t *Write32bitLoadR0(uint32_t *Address,
				const uint32_t Value) {
				return write32bitLoadReg(0, Address, Value);
				}

				// Writes the following instructions:
				// MOVW ip, #<lower 16 bits of the |Value|>
				// MOVT ip, #<higher 16 bits of the |Value|>
				inline static uint32_t *Write32bitLoadIP(uint32_t *Address,
				const uint32_t Value) {
				return write32bitLoadReg(12, Address, Value);
				}

				inline static bool patchSled(const bool Enable, const uint32_t FuncId,
				const XRaySledEntry &Sled, void (*TracingHook)()) {
				// When |Enable| == true,
				// We replace the following compile-time stub (sled):
				//
				// xray_sled_n:
				// B #20
				// 6 NOPs (24 bytes)
				//
				// With the following runtime patch:
				//
				// xray_sled_n:
				// PUSH {r0, lr}
				// MOVW r0, #<lower 16 bits of function ID>
				// MOVT r0, #<higher 16 bits of function ID>
				// MOVW ip, #<lower 16 bits of address of TracingHook>
				// MOVT ip, #<higher 16 bits of address of TracingHook>
				// BLX ip
				// POP {r0, lr}
				//
				// Replacement of the first 4-byte instruction should be the last and atomic
				// operation, so that the user code which reaches the sled concurrently
				// either jumps over the whole sled, or executes the whole sled when the
				// latter is ready.
				//
				// When |Enable| == false, we set back the first instruction in the sled to be
				// B #20

				uint32_t *FirstAddress = reinterpret_cast<uint32_t *>(Sled.Address);
				if (Enable) {
				uint32_t *CurAddress = FirstAddress + 1;
				CurAddress =
				Write32bitLoadR0(CurAddress, reinterpret_cast<uint32_t>(FuncId));
				CurAddress =
				Write32bitLoadIP(CurAddress, reinterpret_cast<uint32_t>(TracingHook));
				*CurAddress = uint32_t(PatchOpcodes::PO_BlxIp);
				CurAddress++;
				*CurAddress = uint32_t(PatchOpcodes::PO_PopR0Lr);
				std::atomic_store_explicit(
				reinterpret_cast<std::atomic<uint32_t> *>(FirstAddress),
				uint32_t(PatchOpcodes::PO_PushR0Lr), std::memory_order_release);
				} else {
				std::atomic_store_explicit(
				reinterpret_cast<std::atomic<uint32_t> *>(FirstAddress),
				uint32_t(PatchOpcodes::PO_B20), std::memory_order_release);
				}
				return true;
				}
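The ordering described in the comment above (fill in the tail of the sled first, then publish the first word last with a release store) can be modelled on an ordinary array. This is only a sketch under assumptions: `kSledWords`, `enableSled` and `disableSled` are hypothetical names, and the array stands in for executable memory. It shows the store ordering only; it does not model instruction-cache maintenance, which self-modifying code on ARM generally also has to consider.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSledWords = 7;            // 28-byte sled, 4 bytes per word
constexpr uint32_t kBranchOverSled = 0xEA000005; // B #20 (PO_B20 above)
constexpr uint32_t kPushR0Lr = 0xE92D4001;       // PUSH {r0, lr} (PO_PushR0Lr above)

// Writes words 1..6 first, then publishes word 0 with a release store, so a
// reader that observes the new first word also observes the rest of the sled.
inline void enableSled(std::atomic<uint32_t> *Sled,
                       const uint32_t (&Body)[kSledWords - 1]) {
  assert(Sled != nullptr);
  for (std::size_t I = 1; I < kSledWords; ++I)
    Sled[I].store(Body[I - 1], std::memory_order_relaxed);
  Sled[0].store(kPushR0Lr, std::memory_order_release);
}

// Restores only the first word; the remaining words stay in place but are
// skipped by the branch.
inline void disableSled(std::atomic<uint32_t> *Sled) {
  assert(Sled != nullptr);
  Sled[0].store(kBranchOverSled, std::memory_order_release);
}
```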

				bool patchFunctionEntry(const bool Enable, const uint32_t FuncId,
				const XRaySledEntry &Sled) {
				return patchSled(Enable, FuncId, Sled, __xray_FunctionEntry);
				}

				bool patchFunctionExit(const bool Enable, const uint32_t FuncId,
				const XRaySledEntry &Sled) {
				return patchSled(Enable, FuncId, Sled, __xray_FunctionExit);
				}

				} // namespace __xray

compiler-rt/trunk/lib/xray/xray_inmemory_log.cc

Show All 18 Lines
#include <cstdio>		#include <cstdio>
#include <fcntl.h>		#include <fcntl.h>
#include <mutex>		#include <mutex>
#include <sys/stat.h>		#include <sys/stat.h>
#include <sys/syscall.h>		#include <sys/syscall.h>
#include <sys/types.h>		#include <sys/types.h>
#include <thread>		#include <thread>
#include <unistd.h>		#include <unistd.h>

		#if defined(__x86_64__)
#include <x86intrin.h>		#include <x86intrin.h>
		#elif defined(__arm__)
		static const int64_t NanosecondsPerSecond = 1000LL * 1000 * 1000;
		#else
		#error "Unsupported CPU Architecture"
		#endif /* CPU architecture */

#include "sanitizer_common/sanitizer_libc.h"		#include "sanitizer_common/sanitizer_libc.h"
#include "xray/xray_records.h"		#include "xray/xray_records.h"
#include "xray_flags.h"		#include "xray_flags.h"
#include "xray_interface_internal.h"		#include "xray_interface_internal.h"

// __xray_InMemoryRawLog will use a thread-local aligned buffer capped to a		// __xray_InMemoryRawLog will use a thread-local aligned buffer capped to a
// certain size (32kb by default) and use it as if it were a circular buffer for		// certain size (32kb by default) and use it as if it were a circular buffer for
Show All 20 Lines	while (auto Written = write(Fd, Begin, TotalBytes)) {
}		}
TotalBytes -= Written;		TotalBytes -= Written;
if (TotalBytes == 0)		if (TotalBytes == 0)
break;		break;
Begin += Written;		Begin += Written;
}		}
}		}

		#if defined(__x86_64__)
static std::pair<ssize_t, bool> retryingReadSome(int Fd, char *Begin,		static std::pair<ssize_t, bool> retryingReadSome(int Fd, char *Begin,
char *End) {		char *End) {
auto BytesToRead = std::distance(Begin, End);		auto BytesToRead = std::distance(Begin, End);
ssize_t BytesRead;		ssize_t BytesRead;
ssize_t TotalBytesRead = 0;		ssize_t TotalBytesRead = 0;
while (BytesToRead && (BytesRead = read(Fd, Begin, BytesToRead))) {		while (BytesToRead && (BytesRead = read(Fd, Begin, BytesToRead))) {
if (BytesRead == -1) {		if (BytesRead == -1) {
if (errno == EINTR)		if (errno == EINTR)
Show All 26 Lines	static bool readValueFromFile(const char *Filename, long long *Value) {
bool Result = false;		bool Result = false;
if (Line[0] != '\0' && (*End == '\n' || *End == '\0')) {		if (Line[0] != '\0' && (*End == '\n' || *End == '\0')) {
*Value = Tmp;		*Value = Tmp;
Result = true;		Result = true;
}		}
return Result;		return Result;
}		}

		#endif /* CPU architecture */

class ThreadExitFlusher {		class ThreadExitFlusher {
int Fd;		int Fd;
XRayRecord *Start;		XRayRecord *Start;
size_t &Offset;		size_t &Offset;

public:		public:
explicit ThreadExitFlusher(int Fd, XRayRecord *Start, size_t &Offset)		explicit ThreadExitFlusher(int Fd, XRayRecord *Start, size_t &Offset)
: Fd(Fd), Start(Start), Offset(Offset) {}		: Fd(Fd), Start(Start), Offset(Offset) {}
▲ Show 20 Lines • Show All 45 Lines • ▼ Show 20 Lines	if (Fd == -1) {
TmpFilename);		TmpFilename);
return -1;		return -1;
}		}
if (Verbosity())		if (Verbosity())
fprintf(stderr, "XRay: Log file in '%s'\n", TmpFilename);		fprintf(stderr, "XRay: Log file in '%s'\n", TmpFilename);

// Get the cycle frequency from SysFS on Linux.		// Get the cycle frequency from SysFS on Linux.
long long CPUFrequency = -1;		long long CPUFrequency = -1;
		#if defined(__x86_64__)
if (readValueFromFile("/sys/devices/system/cpu/cpu0/tsc_freq_khz",		if (readValueFromFile("/sys/devices/system/cpu/cpu0/tsc_freq_khz",
&CPUFrequency)) {		&CPUFrequency)) {
CPUFrequency *= 1000;		CPUFrequency *= 1000;
} else if (readValueFromFile(		} else if (readValueFromFile(
"/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq",		"/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq",
&CPUFrequency)) {		&CPUFrequency)) {
CPUFrequency *= 1000;		CPUFrequency *= 1000;
} else {		} else {
Report("Unable to determine CPU frequency for TSC accounting.");		Report("Unable to determine CPU frequency for TSC accounting.");
}		}
		#elif defined(__arm__)
		// There is no instruction like RDTSCP in user mode on ARM. ARM's CP15 does
		// not have a constant frequency like TSC on x86(_64), it may go faster
		// or slower depending on CPU turbo or power saving mode. Furthermore,
		// to read from CP15 on ARM a kernel modification or a driver is needed.
		// We can not require this from users of compiler-rt.
		// So on ARM we use clock_gettime() which gives the result in nanoseconds.
		// To get the measurements per second, we scale this by the number of
		// nanoseconds per second, pretending that the TSC frequency is 1GHz and
		// one TSC tick is 1 nanosecond.
		CPUFrequency = NanosecondsPerSecond;
		#else
		#error "Unsupported CPU Architecture"
		#endif /* CPU architecture */

// Since we're here, we get to write the header. We set it up so that the		// Since we're here, we get to write the header. We set it up so that the
// header will only be written once, at the start, and let the threads		// header will only be written once, at the start, and let the threads
// logging do writes which just append.		// logging do writes which just append.
XRayFileHeader Header;		XRayFileHeader Header;
Header.Version = 1;		Header.Version = 1;
Header.Type = FileTypes::NAIVE_LOG;		Header.Type = FileTypes::NAIVE_LOG;
Header.CycleFrequency =		Header.CycleFrequency =
Show All 11 Lines	if (Fd == -1)
return;		return;
thread_local __xray::ThreadExitFlusher Flusher(		thread_local __xray::ThreadExitFlusher Flusher(
Fd, reinterpret_cast<__xray::XRayRecord *>(InMemoryBuffer), Offset);		Fd, reinterpret_cast<__xray::XRayRecord *>(InMemoryBuffer), Offset);
thread_local pid_t TId = syscall(SYS_gettid);		thread_local pid_t TId = syscall(SYS_gettid);

// First we get the useful data, and stuff it into the already aligned buffer		// First we get the useful data, and stuff it into the already aligned buffer
// through a pointer offset.		// through a pointer offset.
auto &R = reinterpret_cast<__xray::XRayRecord *>(InMemoryBuffer)[Offset];		auto &R = reinterpret_cast<__xray::XRayRecord *>(InMemoryBuffer)[Offset];
unsigned CPU;
R.RecordType = RecordTypes::NORMAL;		R.RecordType = RecordTypes::NORMAL;
		#if defined(__x86_64__)
		{
		unsigned CPU;
R.TSC = __rdtscp(&CPU);		R.TSC = __rdtscp(&CPU);
R.CPU = CPU;		R.CPU = CPU;
		}
		#elif defined(__arm__)
		{
		timespec TS;
		int result = clock_gettime(CLOCK_REALTIME, &TS);
		if(result != 0)
		{
		Report("clock_gettime() returned %d, errno=%d.", result, int(errno));
		TS.tv_sec = 0;
		TS.tv_nsec = 0;
		}
		R.TSC = TS.tv_sec * NanosecondsPerSecond + TS.tv_nsec;
		R.CPU = 0;
		}
		#else
		#error "Unsupported CPU Architecture"
		#endif /* CPU architecture */
R.TId = TId;		R.TId = TId;
R.Type = Type;		R.Type = Type;
R.FuncId = FuncId;		R.FuncId = FuncId;
++Offset;		++Offset;
if (Offset == BuffLen) {		if (Offset == BuffLen) {
std::lock_guard<std::mutex> L(LogMutex);		std::lock_guard<std::mutex> L(LogMutex);
auto RecordBuffer = reinterpret_cast<__xray::XRayRecord *>(InMemoryBuffer);		auto RecordBuffer = reinterpret_cast<__xray::XRayRecord *>(InMemoryBuffer);
retryingWriteAll(Fd, reinterpret_cast<char *>(RecordBuffer),		retryingWriteAll(Fd, reinterpret_cast<char *>(RecordBuffer),
Show All 10 Lines
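The ARM branch above turns a `timespec` into a single tick count and reports a frequency of 1 GHz, so that one tick equals one nanosecond. A minimal sketch of that conversion follows; `toNanoTicks` and `readNanoTicks` are hypothetical names, and the clock is a parameter so that a caller could try `CLOCK_MONOTONIC` as well as the `CLOCK_REALTIME` used in the patch (the realtime clock can be adjusted while a trace is running).

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

constexpr int64_t kNanosPerSecond = 1000LL * 1000 * 1000;

// Converts a timespec into a nanosecond count, i.e. ticks of a pretend
// 1 GHz counter. Hypothetical helper, not from the patch.
inline uint64_t toNanoTicks(const timespec &TS) {
  assert(TS.tv_nsec >= 0 && TS.tv_nsec < kNanosPerSecond);
  return static_cast<uint64_t>(TS.tv_sec) * kNanosPerSecond +
         static_cast<uint64_t>(TS.tv_nsec);
}

// Reads the given POSIX clock; returns 0 on failure, mirroring the fallback
// in the patch, which zeroes the timespec when clock_gettime() fails.
inline uint64_t readNanoTicks(clockid_t Clock) {
  timespec TS;
  if (clock_gettime(Clock, &TS) != 0)
    return 0;
  return toNanoTicks(TS);
}
```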

compiler-rt/trunk/lib/xray/xray_interface.cc

Show All 20 Lines
#include <errno.h>		#include <errno.h>
#include <limits>		#include <limits>
#include <sys/mman.h>		#include <sys/mman.h>

#include "sanitizer_common/sanitizer_common.h"		#include "sanitizer_common/sanitizer_common.h"

namespace __xray {		namespace __xray {

		#if defined(__x86_64__)
		// FIXME: The actual length is 11 bytes. Why was length 12 passed to mprotect() ?
		static const int16_t cSledLength = 12;
		#elif defined(__arm__)
		static const int16_t cSledLength = 28;
		#else
		#error "Unsupported CPU Architecture"
		#endif /* CPU architecture */

// This is the function to call when we encounter the entry or exit sleds.		// This is the function to call when we encounter the entry or exit sleds.
std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction{nullptr};		std::atomic<void (*)(int32_t, XRayEntryType)> XRayPatchedFunction{nullptr};

// MProtectHelper is an RAII wrapper for calls to mprotect(...) that will undo		// MProtectHelper is an RAII wrapper for calls to mprotect(...) that will undo
// any successful mprotect(...) changes. This is used to make a page writeable		// any successful mprotect(...) changes. This is used to make a page writeable
// and executable, and upon destruction if it was successful in doing so returns		// and executable, and upon destruction if it was successful in doing so returns
// the page into a read-only and executable page.		// the page into a read-only and executable page.
//		//
Show All 22 Lines	~MProtectHelper() {
if (MustCleanup) {		if (MustCleanup) {
mprotect(PageAlignedAddr, MProtectLen, PROT_READ | PROT_EXEC);		mprotect(PageAlignedAddr, MProtectLen, PROT_READ | PROT_EXEC);
}		}
}		}
};		};

} // namespace __xray		} // namespace __xray

extern "C" {
// The following functions have to be defined in assembler, on a per-platform
// basis. See xray_trampoline_*.s files for implementations.
extern void __xray_FunctionEntry();
extern void __xray_FunctionExit();
}

extern std::atomic<bool> XRayInitialized;		extern std::atomic<bool> XRayInitialized;
extern std::atomic<__xray::XRaySledMap> XRayInstrMap;		extern std::atomic<__xray::XRaySledMap> XRayInstrMap;

int __xray_set_handler(void (*entry)(int32_t, XRayEntryType)) {		int __xray_set_handler(void (*entry)(int32_t, XRayEntryType)) {
if (XRayInitialized.load(std::memory_order_acquire)) {		if (XRayInitialized.load(std::memory_order_acquire)) {
__xray::XRayPatchedFunction.store(entry, std::memory_order_release);		__xray::XRayPatchedFunction.store(entry, std::memory_order_release);
return 1;		return 1;
}		}
▲ Show 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	XRayPatchingStatus ControlPatching(bool Enable) {
});		});

// Step 1: Compute the function id, as a unique identifier per function in the		// Step 1: Compute the function id, as a unique identifier per function in the
// instrumentation map.		// instrumentation map.
XRaySledMap InstrMap = XRayInstrMap.load(std::memory_order_acquire);		XRaySledMap InstrMap = XRayInstrMap.load(std::memory_order_acquire);
if (InstrMap.Entries == 0)		if (InstrMap.Entries == 0)
return XRayPatchingStatus::NOT_INITIALIZED;		return XRayPatchingStatus::NOT_INITIALIZED;

int32_t FuncId = 1;		const uint64_t PageSize = GetPageSizeCached();
static constexpr uint8_t CallOpCode = 0xe8;		if((PageSize == 0) || ( (PageSize & (PageSize-1)) != 0) ) {
static constexpr uint16_t MovR10Seq = 0xba41;		Report("System page size is not a power of two: %lld", PageSize);
static constexpr uint16_t Jmp9Seq = 0x09eb;		return XRayPatchingStatus::FAILED;
static constexpr uint8_t JmpOpCode = 0xe9;		}
static constexpr uint8_t RetOpCode = 0xc3;
		uint32_t FuncId = 1;
uint64_t CurFun = 0;		uint64_t CurFun = 0;
for (std::size_t I = 0; I < InstrMap.Entries; I++) {		for (std::size_t I = 0; I < InstrMap.Entries; I++) {
auto Sled = InstrMap.Sleds[I];		auto Sled = InstrMap.Sleds[I];
auto F = Sled.Function;		auto F = Sled.Function;
if (CurFun == 0)		if (CurFun == 0)
CurFun = F;		CurFun = F;
if (F != CurFun) {		if (F != CurFun) {
++FuncId;		++FuncId;
CurFun = F;		CurFun = F;
}		}

// While we're here, we should patch the nop sled. To do that we mprotect		// While we're here, we should patch the nop sled. To do that we mprotect
// the page containing the function to be writeable.		// the page containing the function to be writeable.
void *PageAlignedAddr =		void *PageAlignedAddr =
reinterpret_cast<void *>(Sled.Address & ~((2 << 16) - 1));		reinterpret_cast<void *>(Sled.Address & ~(PageSize-1));
std::size_t MProtectLen =		std::size_t MProtectLen =
(Sled.Address + 12) - reinterpret_cast<uint64_t>(PageAlignedAddr);		(Sled.Address + cSledLength) - reinterpret_cast<uint64_t>(PageAlignedAddr);
MProtectHelper Protector(PageAlignedAddr, MProtectLen);		MProtectHelper Protector(PageAlignedAddr, MProtectLen);
if (Protector.MakeWriteable() == -1) {		if (Protector.MakeWriteable() == -1) {
printf("Failed mprotect: %d\n", errno);		printf("Failed mprotect: %d\n", errno);
return XRayPatchingStatus::FAILED;		return XRayPatchingStatus::FAILED;
}		}

static constexpr int64_t MinOffset{std::numeric_limits<int32_t>::min()};		bool Success = false;
static constexpr int64_t MaxOffset{std::numeric_limits<int32_t>::max()};		switch(Sled.Kind) {
if (Sled.Kind == XRayEntryType::ENTRY) {		case XRayEntryType::ENTRY:
// FIXME: Implement this in a more extensible manner, per-platform.		Success = patchFunctionEntry(Enable, FuncId, Sled);
// Here we do the dance of replacing the following sled:		break;
//		case XRayEntryType::EXIT:
// xray_sled_n:		Success = patchFunctionExit(Enable, FuncId, Sled);
// jmp +9		break;
// <9 byte nop>		default:
//		Report("Unsupported sled kind: %d", int(Sled.Kind));
// With the following:
//
// mov r10d, <function id>
// call <relative 32bit offset to entry trampoline>
//
// We need to do this in the following order:
//
// 1. Put the function id first, 2 bytes from the start of the sled (just
// after the 2-byte jmp instruction).
// 2. Put the call opcode 6 bytes from the start of the sled.
// 3. Put the relative offset 7 bytes from the start of the sled.
// 4. Do an atomic write over the jmp instruction for the "mov r10d"
// opcode and first operand.
//
// Prerequisite is to compute the relative offset to the
// __xray_FunctionEntry function's address.
int64_t TrampolineOffset =
reinterpret_cast<int64_t>(__xray_FunctionEntry) -
(static_cast<int64_t>(Sled.Address) + 11);
if (TrampolineOffset < MinOffset || TrampolineOffset > MaxOffset) {
Report("XRay Entry trampoline (%p) too far from sled (%p); distance = "
"%ld\n",
__xray_FunctionEntry, reinterpret_cast<void *>(Sled.Address),
TrampolineOffset);
continue;
}
if (Enable) {
*reinterpret_cast<uint32_t *>(Sled.Address + 2) = FuncId;
*reinterpret_cast<uint8_t *>(Sled.Address + 6) = CallOpCode;
*reinterpret_cast<uint32_t *>(Sled.Address + 7) = TrampolineOffset;
std::atomic_store_explicit(
reinterpret_cast<std::atomic<uint16_t> *>(Sled.Address), MovR10Seq,
std::memory_order_release);
} else {
std::atomic_store_explicit(
reinterpret_cast<std::atomic<uint16_t> *>(Sled.Address), Jmp9Seq,
std::memory_order_release);
// FIXME: Write out the nops still?
}
}

if (Sled.Kind == XRayEntryType::EXIT) {
// FIXME: Implement this in a more extensible manner, per-platform.
// Here we do the dance of replacing the following sled:
//
// xray_sled_n:
// ret
// <10 byte nop>
//
// With the following:
//
// mov r10d, <function id>
// jmp <relative 32bit offset to exit trampoline>
//
// 1. Put the function id first, 2 bytes from the start of the sled (just
// after the 1-byte ret instruction).
// 2. Put the jmp opcode 6 bytes from the start of the sled.
// 3. Put the relative offset 7 bytes from the start of the sled.
// 4. Do an atomic write over the jmp instruction for the "mov r10d"
// opcode and first operand.
//
// Prerequisite is to compute the relative offset to the
// __xray_FunctionExit function's address.
int64_t TrampolineOffset =
reinterpret_cast<int64_t>(__xray_FunctionExit) -
(static_cast<int64_t>(Sled.Address) + 11);
if (TrampolineOffset < MinOffset || TrampolineOffset > MaxOffset) {
Report("XRay Exit trampoline (%p) too far from sled (%p); distance = "
"%ld\n",
__xray_FunctionExit, reinterpret_cast<void *>(Sled.Address),
TrampolineOffset);
continue;		continue;
}		}
if (Enable) {		(void)Success;
*reinterpret_cast<uint32_t *>(Sled.Address + 2) = FuncId;
*reinterpret_cast<uint8_t *>(Sled.Address + 6) = JmpOpCode;
*reinterpret_cast<uint32_t *>(Sled.Address + 7) = TrampolineOffset;
std::atomic_store_explicit(
reinterpret_cast<std::atomic<uint16_t> *>(Sled.Address), MovR10Seq,
std::memory_order_release);
} else {
std::atomic_store_explicit(
reinterpret_cast<std::atomic<uint8_t> *>(Sled.Address), RetOpCode,
std::memory_order_release);
// FIXME: Write out the nops still?
}
}
}		}
XRayPatching.store(false, std::memory_order_release);		XRayPatching.store(false, std::memory_order_release);
PatchingSuccess = true;		PatchingSuccess = true;
return XRayPatchingStatus::SUCCESS;		return XRayPatchingStatus::SUCCESS;
}		}

XRayPatchingStatus __xray_patch() { return ControlPatching(true); }		XRayPatchingStatus __xray_patch() { return ControlPatching(true); }

XRayPatchingStatus __xray_unpatch() { return ControlPatching(false); }		XRayPatchingStatus __xray_unpatch() { return ControlPatching(false); }
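The page-alignment step in ControlPatching can be isolated as follows. This is a sketch with hypothetical names (`isPowerOfTwo`, `ProtectRange`, `pageRangeFor`); it reproduces the power-of-two test and the mask with `PageSize - 1` from the right-hand side of the diff. The removed line masked with `(2 << 16) - 1` instead, which aligns to a 128 KiB boundary rather than to the system page size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when X is a nonzero power of two, the same test ControlPatching
// applies to the cached page size.
constexpr bool isPowerOfTwo(uint64_t X) {
  return X != 0 && (X & (X - 1)) == 0;
}

// Hypothetical result type: the arguments one would hand to mprotect().
struct ProtectRange {
  uint64_t Start;
  std::size_t Length;
};

// Computes the page-aligned start and the length that covers the bytes
// [Address, Address + SledLength).
inline ProtectRange pageRangeFor(uint64_t Address, uint64_t SledLength,
                                 uint64_t PageSize) {
  assert(isPowerOfTwo(PageSize));
  const uint64_t Start = Address & ~(PageSize - 1);
  return {Start, static_cast<std::size_t>(Address + SledLength - Start)};
}
```

With 4 KiB pages, a 28-byte ARM sled at 0x10FF8 gives a start of 0x10000 and a length of 4116 bytes, so a sled that straddles a page boundary ends up with both pages inside the protected range.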

compiler-rt/trunk/lib/xray/xray_interface_internal.h

	Show All 10 Lines
	//			//
	// Implementation of the API functions. See also include/xray/xray_interface.h.			// Implementation of the API functions. See also include/xray/xray_interface.h.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	#ifndef XRAY_INTERFACE_INTERNAL_H			#ifndef XRAY_INTERFACE_INTERNAL_H
	#define XRAY_INTERFACE_INTERNAL_H			#define XRAY_INTERFACE_INTERNAL_H

	#include "xray/xray_interface.h"			#include "xray/xray_interface.h"
				#include "sanitizer_common/sanitizer_platform.h"
	#include <cstddef>			#include <cstddef>
	#include <cstdint>			#include <cstdint>

	extern "C" {			extern "C" {

	struct XRaySledEntry {			struct XRaySledEntry {
				#if SANITIZER_WORDSIZE == 64
	uint64_t Address;			uint64_t Address;
	uint64_t Function;			uint64_t Function;
	unsigned char Kind;			unsigned char Kind;
	unsigned char AlwaysInstrument;			unsigned char AlwaysInstrument;
	unsigned char Padding[14]; // Need 32 bytes			unsigned char Padding[14]; // Need 32 bytes
				#elif SANITIZER_WORDSIZE == 32
				uint32_t Address;
				uint32_t Function;
				unsigned char Kind;
				unsigned char AlwaysInstrument;
				unsigned char Padding[6]; // Need 16 bytes
				#else
				#error "Unsupported word size."
				#endif
	};			};

	}			}

	namespace __xray {			namespace __xray {

	struct XRaySledMap {			struct XRaySledMap {
	const XRaySledEntry *Sleds;			const XRaySledEntry *Sleds;
	size_t Entries;			size_t Entries;
	};			};

				bool patchFunctionEntry(const bool Enable, const uint32_t FuncId, const XRaySledEntry& Sled);
				bool patchFunctionExit(const bool Enable, const uint32_t FuncId, const XRaySledEntry& Sled);

	} // namespace __xray			} // namespace __xray

				extern "C" {
				// The following functions have to be defined in assembler, on a per-platform
				// basis. See xray_trampoline_*.S files for implementations.
				extern void __xray_FunctionEntry();
				extern void __xray_FunctionExit();
				}

	#endif			#endif
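The "Need 32 bytes" and "Need 16 bytes" comments in XRaySledEntry can be checked at compile time. The sketch below uses two hypothetical mirror structs, `SledEntry64` and `SledEntry32`, because the real struct selects only one layout through SANITIZER_WORDSIZE; the field lists are copied from the header above.

```cpp
#include <cstddef>
#include <cstdint>

// Mirror of the 64-bit layout (hypothetical name).
struct SledEntry64 {
  uint64_t Address;
  uint64_t Function;
  unsigned char Kind;
  unsigned char AlwaysInstrument;
  unsigned char Padding[14]; // 8 + 8 + 1 + 1 + 14 = 32 bytes
};

// Mirror of the 32-bit layout (hypothetical name).
struct SledEntry32 {
  uint32_t Address;
  uint32_t Function;
  unsigned char Kind;
  unsigned char AlwaysInstrument;
  unsigned char Padding[6]; // 4 + 4 + 1 + 1 + 6 = 16 bytes
};

static_assert(sizeof(SledEntry64) == 32, "64-bit sled entry must be 32 bytes");
static_assert(sizeof(SledEntry32) == 16, "32-bit sled entry must be 16 bytes");
static_assert(offsetof(SledEntry64, Kind) == 16, "Kind follows two 8-byte fields");
static_assert(offsetof(SledEntry32, Kind) == 8, "Kind follows two 4-byte fields");
```

Checks of this kind matter because the runtime reads these entries from a section emitted by the compiler, so both sides have to agree on the record size.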

compiler-rt/trunk/lib/xray/xray_trampoline_arm.S

				.syntax unified
				.arch armv6t2
				.fpu vfpv2
				.code 32
				.global _ZN6__xray19XRayPatchedFunctionE
				@ Word-aligned function entry point
				.p2align 2
				@ Let C/C++ see the symbol
				.global __xray_FunctionEntry
				@ It preserves all registers except r0, r12(ip), r14(lr) and r15(pc)
				@ Assume that "q" part of the floating-point registers is not used
				@ for passing parameters to C/C++ functions.
				.type __xray_FunctionEntry, %function
				@ In C++ it is extern "C" void __xray_FunctionEntry(uint32_t FuncId) with
				@ FuncId passed in r0 register.
				__xray_FunctionEntry:
				PUSH {r1-r3,lr}
				@ Save floating-point parameters of the instrumented function
				VPUSH {d0-d7}
				MOVW r1,#:lower16:_ZN6__xray19XRayPatchedFunctionE
				MOVT r1,#:upper16:_ZN6__xray19XRayPatchedFunctionE
				LDR r2, [r1]
				@ Handler address is nullptr if handler is not set
				CMP r2, #0
				BEQ FunctionEntry_restore
				@ Function ID is already in r0 (the first parameter).
				@ r1=0 means that we are tracing an entry event
				MOV r1, #0
				@ Call the handler with 2 parameters in r0 and r1
				BLX r2
				FunctionEntry_restore:
				@ Restore floating-point parameters of the instrumented function
				VPOP {d0-d7}
				POP {r1-r3,pc}

				@ Word-aligned function entry point
				.p2align 2
				@ Let C/C++ see the symbol
				.global __xray_FunctionExit
				@ Assume that d1-d7 are not used for the return value.
				@ Assume that "q" part of the floating-point registers is not used for the
				@ return value in C/C++.
				.type __xray_FunctionExit, %function
				@ In C++ it is extern "C" void __xray_FunctionExit(uint32_t FuncId) with
				@ FuncId passed in r0 register.
				__xray_FunctionExit:
				PUSH {r1-r3,lr}
				@ Save the floating-point return value of the instrumented function
				VPUSH {d0}
				@ Load the handler address
				MOVW r1,#:lower16:_ZN6__xray19XRayPatchedFunctionE
				MOVT r1,#:upper16:_ZN6__xray19XRayPatchedFunctionE
				LDR r2, [r1]
				@ Handler address is nullptr if handler is not set
				CMP r2, #0
				BEQ FunctionExit_restore
				@ Function ID is already in r0 (the first parameter).
				@ 1 means that we are tracing an exit event
				MOV r1, #1
				@ Call the handler with 2 parameters in r0 and r1
				BLX r2
				FunctionExit_restore:
				@ Restore the floating-point return value of the instrumented function
				VPOP {d0}
				POP {r1-r3,pc}
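Apart from saving and restoring registers, both trampolines do the same thing: load the handler pointer, skip the call if it is null, and otherwise call it with the function ID and an event type. A C++ rendering of that logic is sketched below. `EntryType`, `PatchedFunction` and `dispatch` are stand-ins, the enumerator values 0 and 1 are taken from the `MOV r1, #0` and `MOV r1, #1` instructions above, and the sketch uses an acquire load whereas the assembly uses a plain `LDR`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed to match r1=0 (entry) and r1=1 (exit) in the assembly above.
enum EntryType { ENTRY = 0, EXIT = 1 };
using Handler = void (*)(int32_t, EntryType);

// Stand-in for __xray::XRayPatchedFunction.
std::atomic<Handler> PatchedFunction{nullptr};

// Returns true if a handler was installed and has been called.
inline bool dispatch(int32_t FuncId, EntryType Type) {
  assert(Type == ENTRY || Type == EXIT);
  Handler H = PatchedFunction.load(std::memory_order_acquire);
  if (H == nullptr)
    return false; // corresponds to BEQ FunctionEntry_restore / FunctionExit_restore
  H(FuncId, Type); // corresponds to BLX r2 with r0=FuncId, r1=Type
  return true;
}
```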

compiler-rt/trunk/lib/xray/xray_x86_64.cc

				#include "xray_interface_internal.h"
				#include "sanitizer_common/sanitizer_common.h"
				#include <atomic>
				#include <cstdint>
				#include <limits>

				namespace __xray {

				static constexpr uint8_t CallOpCode = 0xe8;
				static constexpr uint16_t MovR10Seq = 0xba41;
				static constexpr uint16_t Jmp9Seq = 0x09eb;
				static constexpr uint8_t JmpOpCode = 0xe9;
				static constexpr uint8_t RetOpCode = 0xc3;

				static constexpr int64_t MinOffset{std::numeric_limits<int32_t>::min()};
				static constexpr int64_t MaxOffset{std::numeric_limits<int32_t>::max()};

				bool patchFunctionEntry(const bool Enable, const uint32_t FuncId, const XRaySledEntry& Sled)
				{
				// Here we do the dance of replacing the following sled:
				//
				// xray_sled_n:
				// jmp +9
				// <9 byte nop>
				//
				// With the following:
				//
				// mov r10d, <function id>
				// call <relative 32bit offset to entry trampoline>
				//
				// We need to do this in the following order:
				//
				// 1. Put the function id first, 2 bytes from the start of the sled (just
				// after the 2-byte jmp instruction).
				// 2. Put the call opcode 6 bytes from the start of the sled.
				// 3. Put the relative offset 7 bytes from the start of the sled.
				// 4. Do an atomic write over the jmp instruction for the "mov r10d"
				// opcode and first operand.
				//
				// Prerequisite is to compute the relative offset to the
				// __xray_FunctionEntry function's address.
				int64_t TrampolineOffset =
				reinterpret_cast<int64_t>(__xray_FunctionEntry) -
				(static_cast<int64_t>(Sled.Address) + 11);
				if (TrampolineOffset < MinOffset || TrampolineOffset > MaxOffset) {
				Report("XRay Entry trampoline (%p) too far from sled (%p); distance = "
				"%ld\n",
				__xray_FunctionEntry, reinterpret_cast<void *>(Sled.Address),
				TrampolineOffset);
				return false;
				}
				if (Enable) {
				*reinterpret_cast<uint32_t *>(Sled.Address + 2) = FuncId;
				*reinterpret_cast<uint8_t *>(Sled.Address + 6) = CallOpCode;
				*reinterpret_cast<uint32_t *>(Sled.Address + 7) = TrampolineOffset;
				std::atomic_store_explicit(
				reinterpret_cast<std::atomic<uint16_t> *>(Sled.Address), MovR10Seq,
				std::memory_order_release);
				} else {
				std::atomic_store_explicit(
				reinterpret_cast<std::atomic<uint16_t> *>(Sled.Address), Jmp9Seq,
				std::memory_order_release);
				// FIXME: Write out the nops still?
				}
				return true;
				}

				bool patchFunctionExit(const bool Enable, const uint32_t FuncId, const XRaySledEntry& Sled)
				{
				// Here we do the dance of replacing the following sled:
				//
				// xray_sled_n:
				// ret
				// <10 byte nop>
				//
				// With the following:
				//
				// mov r10d, <function id>
				// jmp <relative 32bit offset to exit trampoline>
				//
				// 1. Put the function id first, 2 bytes from the start of the sled (just
				// after the 1-byte ret instruction).
				// 2. Put the jmp opcode 6 bytes from the start of the sled.
				// 3. Put the relative offset 7 bytes from the start of the sled.
				// 4. Do an atomic write over the jmp instruction for the "mov r10d"
				// opcode and first operand.
				//
				// Prerequisite is to compute the relative offset to the
				// __xray_FunctionExit function's address.
				int64_t TrampolineOffset =
				reinterpret_cast<int64_t>(__xray_FunctionExit) -
				(static_cast<int64_t>(Sled.Address) + 11);
				if (TrampolineOffset < MinOffset || TrampolineOffset > MaxOffset) {
				Report("XRay Exit trampoline (%p) too far from sled (%p); distance = "
				"%ld\n",
				__xray_FunctionExit, reinterpret_cast<void *>(Sled.Address),
				TrampolineOffset);
				return false;
				}
				if (Enable) {
				*reinterpret_cast<uint32_t *>(Sled.Address + 2) = FuncId;
				*reinterpret_cast<uint8_t *>(Sled.Address + 6) = JmpOpCode;
				*reinterpret_cast<uint32_t *>(Sled.Address + 7) = TrampolineOffset;
				std::atomic_store_explicit(
				reinterpret_cast<std::atomic<uint16_t> *>(Sled.Address), MovR10Seq,
				std::memory_order_release);
				} else {
				std::atomic_store_explicit(
				reinterpret_cast<std::atomic<uint8_t> *>(Sled.Address), RetOpCode,
				std::memory_order_release);
				// FIXME: Write out the nops still?
				}
				return true;
				}

				} // namespace __xray
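The range check shared by both x86_64 patch functions can be written as one helper. The sketch below uses a hypothetical name, `computeRel32`, and plain integer addresses. The constant 11 is the offset of the byte following the patched sequence: a 6-byte `mov r10d, imm32`, a 1-byte call or jmp opcode, and a 4-byte relative offset.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Computes the 32-bit relative displacement from the end of the patched
// sequence (SledAddress + 11) to the trampoline. Returns false when the
// trampoline is outside the signed 32-bit range reachable by call/jmp rel32.
inline bool computeRel32(uint64_t Trampoline, uint64_t SledAddress,
                         int32_t *Out) {
  assert(Out != nullptr);
  const int64_t Offset = static_cast<int64_t>(Trampoline) -
                         (static_cast<int64_t>(SledAddress) + 11);
  if (Offset < std::numeric_limits<int32_t>::min() ||
      Offset > std::numeric_limits<int32_t>::max())
    return false;
  *Out = static_cast<int32_t>(Offset);
  return true;
}
```

A helper like this returns a failure to the caller, which is what the new per-platform functions do with `return false`, where the old inline code used `continue`.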


Diff 71933
