This is an archive of the discontinued LLVM Phabricator instance.

[XRay] ARM 32-bit no-Thumb support in compiler-rt
ClosedPublic

Authored by rSerge on Aug 26 2016, 10:21 AM.

Diff Detail

Event Timeline

rSerge updated this revision to Diff 69394.Aug 26 2016, 10:21 AM
rSerge retitled this revision from to [XRay] ARM 32-bit no-Thumb support in compiler-rt.
rSerge updated this object.
rSerge added reviewers: dberris, rengolin, asl, t.p.northover.
rSerge added a subscriber: llvm-commits.
rSerge updated this object.Aug 26 2016, 10:24 AM
rSerge updated this revision to Diff 69440.Aug 26 2016, 3:31 PM

Rebased after https://reviews.llvm.org/D21982 and ported logging to ARM: replaced RDTSC instruction with clock_gettime().
Fixed a bug where the length of x86_64 sled (11-12 bytes) was passed to mprotect() on ARM, while the sled size on ARM is 28 bytes. This was sometimes causing segmentation fault when patching at runtime.

dberris requested changes to this revision.Aug 28 2016, 6:13 PM
dberris edited edge metadata.
dberris added inline comments.
lib/sanitizer_common/scripts/gen_dynamic_list.py
54

Is this required to make this change work? Or should this really happen as an isolated change?

lib/xray/xray_arm.cc
29–32

The Coding Standards seem to require that variables be camel case starting with a capital letter.

http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

110

On ARM, does std::memory_order_release turn into writes that have fences after to ensure they're visible? Or am I confusing ARM for an architecture that only has relaxed memory order semantics?

lib/xray/xray_interface.cc
30

Good question. I may have miscounted. We can fix that later, once this lands (or if you can change and test to make sure it doesn't break, I'm fine with it).

This revision now requires changes to proceed.Aug 28 2016, 6:13 PM

Please, see my responses inline. I'll upload the updated patch in a few minutes.

lib/sanitizer_common/scripts/gen_dynamic_list.py
54

Without this change, XRay for ARM doesn't get cross-compiled from Windows to ARM-Linux .

lib/xray/xray_arm.cc
29–32

Changing to an enum. Isn't it better to leave the register parameters of instructions separated by underscore, rather than making a name like PO_PushR0Lr ?

110

std::memory_order_release should do what the standard requires. on any compiler and CPU, unless there is a bug in them. Indeed, x86_64 is strongly ordered, so for the CPU std::memory_order_relaxed is always sufficient (but not for the compiler: it may reorder). However, ARM is weakly ordered, so at least std::memory_order_release is required here. From http://en.cppreference.com/w/cpp/atomic/memory_order : "All writes in the current thread are visible in other threads that acquire the same atomic variable..." . There is a problem that we cannot force the CPU on the other cores to perform an acquire operation fetching instructions. However, ARM CPU always fetches the instruction at pc+8, decodes the instruction at pc+4 and executes the instruction at pc (program counter register). So as far as I know there is no reordering problem here. However, during unpatching we cannot fill the 6 tail instructions with NOPs (and I think, the same applies to x86/x86_64), because concurrent core may have already fetched the first instruction on the patch and therefore relies that the rest of instructions in the patch are correct.

lib/xray/xray_interface.cc
30

It may break, again, something to do with alignment, you may even need to increase mprotect length to 18 bytes. There may be a chance that when writing the last byte of 11-byte sled, the CPU may need to access a separate 64-bit word in memory. I don't know whether it will get permission denied in case only the first byte of this word is writeable and the other 7 bytes are write-protected.

rSerge updated this revision to Diff 69610.Aug 29 2016, 1:46 PM
rSerge edited edge metadata.

Changed the opcodes from constants to an enum.

dberris accepted this revision.Aug 29 2016, 4:41 PM
dberris edited edge metadata.

I'll defer to someone else who understands the ARM assembly parts.

You might also consider extending the file header type to indicate what platform the trace was generated from, so tools can determine what to do with a trace that comes from a specific CPU. I'm happy to do it later, but wanted to know your thoughts on how we might encode that information appropriately.

lib/xray/xray_arm.cc
30–33

I think the coding standards say they should look like types, so just CamelCase. I don't make up the rules, but I just try to follow them. :)

39

Shouldn't these functions be camelCase? As in getMoveMask(...) according to the guide? I understand you're just following the conventions of the files around, and those mistakes are mine -- but do you mind changing them before landing?

lib/xray/xray_inmemory_log.cc
30

static constexpr instead?

Also, please follow the naming conventions for this one too.

Another thing -- couldn't you just use std::chrono for the constant here?

http://en.cppreference.com/w/cpp/chrono/duration

This revision is now accepted and ready to land.Aug 29 2016, 4:41 PM
rengolin edited edge metadata.Aug 29 2016, 5:22 PM

Sorry, it was holiday in UK today...

I'll have a look at the patches tomorrow.

Cheers,
Renato

rengolin requested changes to this revision.Aug 30 2016, 6:38 AM
rengolin edited edge metadata.

Hi,

I have a number of comments and requests. One general remark, also, is to comment on the top of the assembly functions what's the function signature in C, so that I know how to review the function's code. Otherwise, it's very hard to understand all possibilities.

cheers,
--renato

lib/xray/xray_arm.cc
23

Can't you define this on a top-level header and implement on an arch-specific cpp file?

I don't think these things should be changing between arches.

51

Why haven't use used inline assembly, here? This is really unreadable and error prone.

111

All modern ARM CPUs are multi-issue, out-of-order, so you cannot guarantee ordering without a data/memory barrier. ARMv8 has better atomic support.

lib/xray/xray_inmemory_log.cc
30

Better still, get the proper value?

I know it can be a tad different from hardware to hardware, but not even trying isn't really helpful.

70

I'd use #defined x86_64 instead, and replicate to all arches it's supposed to work later.

We don't want broken fall-back logic, as it's really hard to find bugs later.

188

I don't understand this... Is this just hard-coding to 1GHz?

/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq

Also works on most ARM and AArch64 boards.

lib/xray/xray_trampoline_arm.S
2

You're going to need more than that.

Assemblers are very picky on what's valid, and Clang is specially so.

You'll need to put the minimum requirements on the header (cpu, fpu, arch, thumb-interop, etc).

You'll also need to put ".arch" to extend support on functions that use new instructions where the header flags don't support.

Example:

.syntax unifixed
.arch   armv6t2
.fpu    vfpv2

...

v7_only_func:
.fpu vfpv3
VMOV ...

This will mean you can use this on v6T2 onward, and that "v7_only_func" can only be used by arches with vfpv3, guaranteed by the dynamic dispatch. See libunwind and other compiler-RT ARM functions for this behaviour.

7

Why not? Vectorizers can and do use Q regs...

8

The comment character is "@" not "//". This will only work if compiled by a C++ compiler, not an assembler.

rengolin added inline comments.Aug 30 2016, 6:38 AM
lib/sanitizer_common/scripts/gen_dynamic_list.py
54

Have you tested this on Linux and Mac? To make sure it also work there?

lib/xray/xray_arm.cc
39

Yes, please. Let's not add different styles if we don't have to. Camel case, caps for variables, no caps for functions.

Enum values have format INI_Name, with "INI" the initials of the enum's name, all caps, and the Name a unique identifier within the enum.

This revision now requires changes to proceed.Aug 30 2016, 6:38 AM
rSerge marked an inline comment as done.Aug 30 2016, 1:18 PM

I've responded inline and will upload a new diff in a minute.

You might also consider extending the file header type to indicate what platform the trace was generated from, so tools can determine what to do with a trace that comes from a specific CPU. I'm happy to do it later, but wanted to know your thoughts on how we might encode that information appropriately.

I think we should look for some enumeration in compiler-rt or LLVM listing the CPU architectures, so to make XRay CPU codes consistent with that. But so far XRayFileHeader and XRayRecord don't seem to differ between CPUs, or do they?

lib/sanitizer_common/scripts/gen_dynamic_list.py
54

I can test on Ubuntu. I don't have access to a Mac.

lib/xray/xray_arm.cc
23

Moving to xray_interface_internal.h .

30–33

Ok, changing to CamelCase.

39

The nearby code of sanitizers (ASAN, sanitizer_common) mostly names the functions starting with a capital letter. Do you still think I should name functions starting with a lowercase letter?

51

Because it is not possible. This code patches the user program at runtime with different instructions depending on the data in the user program. There doesn't seem anything we can put as inline assembly in compiler-rt code. It may be possible to use assembly strings, but that would require to link an assembler to the user program.

111

I used memory_order_release here, on the writer side. According to C++11 standard, this should prevent reordering previous writes past this point (inserting fences if necessary, etc.).
There is little we can do on the reader side, as there is no data reader: it is the CPU fetching instructions on the other side. Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?

lib/xray/xray_inmemory_log.cc
30

What do you mean by getting the proper value?
I think that getting here anything other than just simple "1 billion" would be too unexpected, and we would need error checking for that. Furthermore, getting it from other compile units may result in initialization order issues. It is easier and more reliable to just have it as a 1 billion constant.

30

I'm renaming this to get rid of c prefix.
I think that pulling the whole chrono just for nanoseconds per second number may be a waste of compile time.

70

Changing.

188

No, this is not hard-coding to 1GHz. x86_64 uses RDTSCP instruction in the numerator, that is why the denominator is CPU frequency. There is nothing similar for ARM in user mode. So we fall back on clock_gettime() system call. It provides time in nanoseconds. That is why we use 1 billion as the denominator. Shall I rename the variable from CPUFrequency to something like TicksPerSecond so that it is more comprehensive on CPUs without instructions like RDTSC?

lib/xray/xray_trampoline_arm.S
2

I guess that the dynamic dispatch doesn't help us, because we are calling the function from machine code written at run-time into the code of user functions.
Adding .arch armv7 and .fpu vfpv3 .

7

That can happen. But are Q registers used for passing parameters and returning values?
Perhaps my assembly comment is misleading: here (in __xray_FunctionEntry) we need to push&pop every register which may be used for passing parameters. And in __xray_FunctionExit we need to push&pop every register which may be used for returning values from C/C++ functions.

8

Changing.

rSerge updated this revision to Diff 69744.Aug 30 2016, 1:19 PM
rSerge edited edge metadata.
rSerge marked an inline comment as done.
rSerge marked 5 inline comments as done.Aug 30 2016, 1:23 PM

Marked the done comments according to the diff just uploaded.

Thanks for the changes, some more comments...

lib/xray/xray_arm.cc
40

This is a new file, it should use LLVM's policy.

52

Of course. Ignore me.

Though, this is the same as the one below, and you could merge them both by passing the register name and ORRing [reg << 12] with the instruction, and making sure reg < 15.

112

Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?

I'm not sure what you mean. Many Cortex-AR cores are OOO. That's their design, you can't change that. Or maybe you mean "out of order amongst threads", which is not what I'm talking about.

Since this is in C++, so I'm guessing the compiler will "do the right thing" (tm) with regards to memory barriers, and the core being OOO makes no difference here.

Probably just a nomenclature clash around "OOO" between ourselves... :)

lib/xray/xray_inmemory_log.cc
188

I still find this confusing... Is this 10^9 just a normalising factor, to get compatible numbers? If anything, this line needs a serious comment explaining why this is what it is.

Also, clock_gettime() will return a system wide, sequential and consistent number, while RDTSCP will return a counter that is internal to each CPU (and will be different across CPUs), thus prone to problems while context-switching.

Regardless, if you want CPU frequency, you can do exactly what you've done to x86.

lib/xray/xray_trampoline_arm.S
8

Right, so it's not C/C++, it's AAPCS (the ARM Procedure Call Standard).

As long as you're not passing NEON vectors as arguments, Q registers are not used (see arm_neon.h), and d0-d7 should take care of all VFP registers.

35

A8.8.132 POP (ARM):

"ARM deprecates the use of this instruction with both the LR and the PC in the list."
41

Same again, if you're not using NEON vectors, this is fine.

rSerge updated this revision to Diff 69891.Aug 31 2016, 12:10 PM
rSerge edited edge metadata.
rSerge marked 2 inline comments as done.Aug 31 2016, 12:13 PM
rSerge added inline comments.
lib/xray/xray_arm.cc
40

Changing.

52

Ok, changing.

112

You are thinking about data: the CPU executes out of order the instructions which manipulate data. On the data side we only write, and memory_order_release should prevent reordering. But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes.
So I mean the scenario where the CPU is reading instructions themselves from the code segment ("fetching") in order to then decode the instructions and finally execute. Can it fetch instruction at pc+4 earlier than the instruction at pc?
As I understood from ARM specification, it can't: ARM CPU is always fetching the instruction at pc, decoding the instruction at pc-4 and executing the instruction at pc-8.

lib/xray/xray_inmemory_log.cc
188

10^9 is the number of nanoseconds per second. It can be viewed as a normalizing factor, to get measurements in seconds.
I would prefer something of higher resolution than clock_gettime() (they say on the internet that its resolution is only 1ms, while RDTSCP resolution is aroun 1ns), but I don't know how to do it on ARM. I searched on the internet and figured out that the cycle counter on ARM is 1) not available in user mode 2) changes frequency when CPU frequency changes. In contrast, RDTSCP on x86 is available in user mode and has a constant frequency, independent on CPU power-saving / turbo frequency adjustments.
I'm adding a comment.

lib/xray/xray_trampoline_arm.S
35

The list contains only pc, not both lr and pc.

rengolin added inline comments.Sep 1 2016, 7:24 AM
lib/xray/xray_arm.cc
112

Right, it's a bit more complicated than that...

A good quick source of all factors: https://community.arm.com/groups/processors/blog/2011/03/22/memory-access-ordering--an-introduction

But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes.

So, you need to tell the other cores to wait until you write, then you need to store-release, then they can fetch. Otherwise, they'll fetch NOPs.

Can it fetch instruction at pc+4 earlier than the instruction at pc?

In theory, no. In practice, maybe.

ARM has separate caches for code and data. If core0 reads 'pc' - 16, and the Icache line is, say, 32, then the NOPs are in core0's cache. Before core0 reaches the 'pc', core1 gets it, sets a load-acquire, and jumps to your thunk. At that time, you really want core0 to *stop* before reaching that specified 'pc', or it'll execute NOPs. Once core1 has written its shim, it then store-releases and core0 can continue, now executing your inserted code.

In summary, you *need* a barrier. Since this is about code fetching, you need an instruction barrier (ISB) not a data barrier (DMB).

ARM CPU is always fetching the instruction at pc, decoding the instruction at pc-4 and executing the instruction at pc-8.

On the same core, instructions are (again, in theory) fetched and decoded "in order", but they're stored in a queue, which gets dispatched at any convenient time. So there is no concept of 'pc+8' at all. The cores will also speculatively fetch, decode and even execute (ex. branch prediction, peephole, etc).

So, there is absolutely *no* guarantee that any instruction will be fetched, decoded or executed before another, unless they have a strict dependency relationship, either by data dependency, atomic instructions or barriers.

lib/xray/xray_inmemory_log.cc
188

Ah, I see. I didn't know RDTSCP had a fixed frequency. In that case, a comment explaining it would be most welcome.

lib/xray/xray_trampoline_arm.S
35

Sorry, ignore me.

rSerge updated this revision to Diff 70039.Sep 1 2016, 11:56 AM
rSerge marked 3 inline comments as done.

Fixed Ubuntu x86_64 build. Implemented the changes requested in code review comments.

rSerge marked an inline comment as done.Sep 1 2016, 11:57 AM
rSerge added inline comments.
lib/sanitizer_common/scripts/gen_dynamic_list.py
54

Tested on Ubuntu x86_64.

lib/xray/xray_inmemory_log.cc
188

Adding.

rengolin accepted this revision.Sep 1 2016, 1:55 PM
rengolin edited edge metadata.

This looks good to me, thanks for all the changes!

If @dberris is happy, I'm happy. :)

cheers,
--renato

PS: I may have missed a few things, but we can fix as we go, when support gets better.

This revision is now accepted and ready to land.Sep 1 2016, 1:55 PM

Still, LGTM -- thanks @rSerge!

rSerge updated this revision to Diff 70574.Sep 7 2016, 11:26 AM
rSerge edited edge metadata.
rSerge marked an inline comment as done.

Rebased to the latest revision. I don't have commit access rights. Could someone commit?

Landing this now.

This revision was automatically updated to reflect the committed changes.
dberris reopened this revision.Sep 8 2016, 9:31 PM
This revision is now accepted and ready to land.Sep 8 2016, 9:31 PM
dberris requested changes to this revision.Sep 8 2016, 9:31 PM
dberris edited edge metadata.

Reverted in rL280969, need to resolve comments in D23931 before trying to land again.

This revision now requires changes to proceed.Sep 8 2016, 9:31 PM
rSerge updated this revision to Diff 71630.Sep 16 2016, 6:25 AM
rSerge edited edge metadata.
rSerge removed rL LLVM as the repository for this revision.

Removed .arch armv7 directive

rSerge updated this revision to Diff 71636.Sep 16 2016, 6:48 AM
rSerge edited edge metadata.

Fixed patch file format.

So, you're forcing vfpv3, which is armv7-only. AFAICS, you're only using VPUSH and VPOP, which is available since vfpv2 (which is also available in v6), so maybe a better fix would be to use:

.arch armv6t2
.fpu vfpv2

which should work on armv7, too.

Since this is the restriction we have inside the code, it would be more clear this way. Can you do a quick test with those directives?

cheers,
--renato

dberris accepted this revision.Sep 18 2016, 5:44 PM
dberris edited edge metadata.

@rengolin -- do we need to wait for the test, or can we do that post-commit?

lib/xray/xray_arm.cc
2

nit: s/xray_arm.cpp/xray_arm.cc/

This revision is now accepted and ready to land.Sep 18 2016, 5:44 PM

The test is to make sure it won't break the bots again. Should be quick, as he had done it before.

rSerge added a comment.EditedSep 19 2016, 6:49 AM

So, you're forcing vfpv3, which is armv7-only. AFAICS, you're only using VPUSH and VPOP, which is available since vfpv2 (which is also available in v6), so maybe a better fix would be to use:

.arch armv6t2
.fpu vfpv2

which should work on armv7, too.

Since this is the restriction we have inside the code, it would be more clear this way. Can you do a quick test with those directives?

cheers,
--renato

armv6t2 shouldn't work because MOVW and MOVT instructions are available only since armv7 .
I can test with .fpu vfpv2, though this is not quick (compilation and moving between VMs takes substantial time).

lib/xray/xray_arm.cc
2

Sorry, I'm not that good with the lingo. What is the meaning of this comment?

armv6t2 shouldn't work because MOVW and MOVT instructions are available only since armv7.

Movw/Movt are Thumb2 instructions and were introduced in ARMv6T2.

I can test with .fpu vfpv2, though this is not quick (compilation and moving between VMs takes substantial time).

I'm not worried about the assembly code working on v6T2 or VFPv2, I'm worried about the toolchain coping with the options.

You just need to get the complete command line with a recent enough cross-toolchain (4.8+) and try on the resulting file.

cheers,
--renato

PS: You should really get the ARM ARMs: http://llvm.org/docs/CompilerWriterInfo.html

lib/xray/xray_arm.cc
2

It means the name on the comment is wrong and you have to *replace* (s///) with the right one.

You're calling it xray_arm.cc but has xray_arm.cpp in the header.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

lib/xray/xray_arm.cc
2

I just did it by example. Ok, I'm fixing the comment

rSerge updated this revision to Diff 71865.Sep 19 2016, 12:08 PM
rSerge edited edge metadata.

Implemented the changes requested in the code review comments.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

Sorry, I wasn't clear. These two lines should work on any toolchain, my point is if that makes it break with your gnu toolchain because of the same issue (minimal ISA support assumed) in *conjunction* with the rest of the code.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

Sorry, I wasn't clear. These two lines should work on any toolchain, my point is if that makes it break with your gnu toolchain because of the same issue (minimal ISA support assumed) in *conjunction* with the rest of the code.

That error doesn't seem to happen, at least with the toolchains I've tested.

That error doesn't seem to happen, at least with the toolchains I've tested.

Perfect, let's try again. :)

@dberris, you have already committed the other two patches again, right? Would you do the honours?

That error doesn't seem to happen, at least with the toolchains I've tested.

Perfect, let's try again. :)

@dberris, you have already committed the other two patches again, right? Would you do the honours?

Yep -- this one's the last piece. Happy to land now. :)

This revision was automatically updated to reflect the committed changes.