Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

[XRay] ARM 32-bit no-Thumb support in compiler-rt
ClosedPublic

Authored by rSerge on Aug 26 2016, 10:21 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
rengolin added inline comments.Aug 30 2016, 6:38 AM
lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗(On Diff #69610)

Have you tested this on Linux and Mac? To make sure it also work there?

lib/xray/xray_arm.cc
38 ↗(On Diff #69610)

Yes, please. Let's not add different styles if we don't have to. Camel case, caps for variables, no caps for functions.

Enum values have format INI_Name, with "INI" the initials of the enum's name, all caps, and the Name a unique identifier within the enum.

This revision now requires changes to proceed.Aug 30 2016, 6:38 AM
rSerge marked an inline comment as done.Aug 30 2016, 1:18 PM

I've responded inline and will upload a new diff in a minute.

You might also consider extending the file header type to indicate what platform the trace was generated from, so tools can determine what to do with a trace that comes from a specific CPU. I'm happy to do it later, but wanted to know your thoughts on how we might encode that information appropriately.

I think we should look for some enumeration in compiler-rt or LLVM listing the CPU architectures, so to make XRay CPU codes consistent with that. But so far XRayFileHeader and XRayRecord don't seem to differ between CPUs, or do they?

lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗(On Diff #69610)

I can test on Ubuntu. I don't have access to a Mac.

lib/xray/xray_arm.cc
22 ↗(On Diff #69610)

Moving to xray_interface_internal.h .

29–32 ↗(On Diff #69610)

Ok, changing to CamelCase.

38 ↗(On Diff #69610)

The nearby code of sanitizers (ASAN, sanitizer_common) mostly names the functions starting with a capital letter. Do you still think I should name functions starting with a lowercase letter?

50 ↗(On Diff #69610)

Because it is not possible. This code patches the user program at runtime with different instructions depending on the data in the user program. There doesn't seem anything we can put as inline assembly in compiler-rt code. It may be possible to use assembly strings, but that would require to link an assembler to the user program.

110 ↗(On Diff #69610)

I used memory_order_release here, on the writer side. According to C++11 standard, this should prevent reordering previous writes past this point (inserting fences if necessary, etc.).
There is little we can do on the reader side, as there is no data reader: it is the CPU fetching instructions on the other side. Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?

lib/xray/xray_inmemory_log.cc
30 ↗(On Diff #69610)

What do you mean by getting the proper value?
I think that getting here anything other than just simple "1 billion" would be too unexpected, and we would need error checking for that. Furthermore, getting it from other compile units may result in initialization order issues. It is easier and more reliable to just have it as a 1 billion constant.

30 ↗(On Diff #69610)

I'm renaming this to get rid of c prefix.
I think that pulling the whole chrono just for nanoseconds per second number may be a waste of compile time.

70 ↗(On Diff #69610)

Changing.

188 ↗(On Diff #69610)

No, this is not hard-coding to 1GHz. x86_64 uses RDTSCP instruction in the numerator, that is why the denominator is CPU frequency. There is nothing similar for ARM in user mode. So we fall back on clock_gettime() system call. It provides time in nanoseconds. That is why we use 1 billion as the denominator. Shall I rename the variable from CPUFrequency to something like TicksPerSecond so that it is more comprehensive on CPUs without instructions like RDTSC?

lib/xray/xray_trampoline_arm.S
1 ↗(On Diff #69610)

I guess that the dynamic dispatch doesn't help us, because we are calling the function from machine code written at run-time into the code of user functions.
Adding .arch armv7 and .fpu vfpv3 .

6 ↗(On Diff #69610)

That can happen. But are Q registers used for passing parameters and returning values?
Perhaps my assembly comment is misleading: here (in __xray_FunctionEntry) we need to push&pop every register which may be used for passing parameters. And in __xray_FunctionExit we need to push&pop every register which may be used for returning values from C/C++ functions.

7 ↗(On Diff #69610)

Changing.

rSerge updated this revision to Diff 69744.Aug 30 2016, 1:19 PM
rSerge edited edge metadata.
rSerge marked an inline comment as done.
rSerge marked 5 inline comments as done.Aug 30 2016, 1:23 PM

Marked the done comments according to the diff just uploaded.

Thanks for the changes, some more comments...

lib/xray/xray_arm.cc
39 ↗(On Diff #69744)

This is a new file, it should use LLVM's policy.

51 ↗(On Diff #69744)

Of course. Ignore me.

Though, this is the same as the one below, and you could merge them both by passing the register name and ORRing [reg << 12] with the instruction, and making sure reg < 15.

111 ↗(On Diff #69744)

Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?

I'm not sure what you mean. Many Cortex-AR cores are OOO. That's their design, you can't change that. Or maybe you mean "out of order amongst threads", which is not what I'm talking about.

Since this is in C++, so I'm guessing the compiler will "do the right thing" (tm) with regards to memory barriers, and the core being OOO makes no difference here.

Probably just a nomenclature clash around "OOO" between ourselves... :)

lib/xray/xray_inmemory_log.cc
188 ↗(On Diff #69744)

I still find this confusing... Is this 10^9 just a normalising factor, to get compatible numbers? If anything, this line needs a serious comment explaining why this is what it is.

Also, clock_gettime() will return a system wide, sequential and consistent number, while RDTSCP will return a counter that is internal to each CPU (and will be different across CPUs), thus prone to problems while context-switching.

Regardless, if you want CPU frequency, you can do exactly what you've done to x86.

lib/xray/xray_trampoline_arm.S
7 ↗(On Diff #69744)

Right, so it's not C/C++, it's AAPCS (the ARM Procedure Call Standard).

As long as you're not passing NEON vectors as arguments, Q registers are not used (see arm_neon.h), and d0-d7 should take care of all VFP registers.

34 ↗(On Diff #69744)

A8.8.132 POP (ARM):

"ARM deprecates the use of this instruction with both the LR and the PC in the list."
40 ↗(On Diff #69744)

Same again, if you're not using NEON vectors, this is fine.

rSerge updated this revision to Diff 69891.Aug 31 2016, 12:10 PM
rSerge edited edge metadata.
rSerge marked 2 inline comments as done.Aug 31 2016, 12:13 PM
rSerge added inline comments.
lib/xray/xray_arm.cc
39 ↗(On Diff #69744)

Changing.

51 ↗(On Diff #69744)

Ok, changing.

111 ↗(On Diff #69744)

You are thinking about data: the CPU executes out of order the instructions which manipulate data. On the data side we only write, and memory_order_release should prevent reordering. But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes.
So I mean the scenario where the CPU is reading instructions themselves from the code segment ("fetching") in order to then decode the instructions and finally execute. Can it fetch instruction at pc+4 earlier than the instruction at pc?
As I understood from ARM specification, it can't: ARM CPU is always fetching the instruction at pc, decoding the instruction at pc-4 and executing the instruction at pc-8.

lib/xray/xray_inmemory_log.cc
188 ↗(On Diff #69744)

10^9 is the number of nanoseconds per second. It can be viewed as a normalizing factor, to get measurements in seconds.
I would prefer something of higher resolution than clock_gettime() (they say on the internet that its resolution is only 1ms, while RDTSCP resolution is aroun 1ns), but I don't know how to do it on ARM. I searched on the internet and figured out that the cycle counter on ARM is 1) not available in user mode 2) changes frequency when CPU frequency changes. In contrast, RDTSCP on x86 is available in user mode and has a constant frequency, independent on CPU power-saving / turbo frequency adjustments.
I'm adding a comment.

lib/xray/xray_trampoline_arm.S
34 ↗(On Diff #69744)

The list contains only pc, not both lr and pc.

rengolin added inline comments.Sep 1 2016, 7:24 AM
lib/xray/xray_arm.cc
112 ↗(On Diff #69891)

Right, it's a bit more complicated than that...

A good quick source of all factors: https://community.arm.com/groups/processors/blog/2011/03/22/memory-access-ordering--an-introduction

But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes.

So, you need to tell the other cores to wait until you write, then you need to store-release, then they can fetch. Otherwise, they'll fetch NOPs.

Can it fetch instruction at pc+4 earlier than the instruction at pc?

In theory, no. In practice, maybe.

ARM has separate caches for code and data. If core0 reads 'pc' - 16, and the Icache line is, say, 32, then the NOPs are in core0's cache. Before core0 reaches the 'pc', core1 gets it, sets a load-acquire, and jumps to your thunk. At that time, you really want core0 to *stop* before reaching that specified 'pc', or it'll execute NOPs. Once core1 has written its shim, it then store-releases and core0 can continue, now executing your inserted code.

In summary, you *need* a barrier. Since this is about code fetching, you need an instruction barrier (ISB) not a data barrier (DMB).

ARM CPU is always fetching the instruction at pc, decoding the instruction at pc-4 and executing the instruction at pc-8.

On the same core, instructions are (again, in theory) fetched and decoded "in order", but they're stored in a queue, which gets dispatched at any convenient time. So there is no concept of 'pc+8' at all. The cores will also speculatively fetch, decode and even execute (ex. branch prediction, peephole, etc).

So, there is absolutely *no* guarantee that any instruction will be fetched, decoded or executed before another, unless they have a strict dependency relationship, either by data dependency, atomic instructions or barriers.

lib/xray/xray_inmemory_log.cc
188 ↗(On Diff #69891)

Ah, I see. I didn't know RDTSCP had a fixed frequency. In that case, a comment explaining it would be most welcome.

lib/xray/xray_trampoline_arm.S
35 ↗(On Diff #69891)

Sorry, ignore me.

rSerge updated this revision to Diff 70039.Sep 1 2016, 11:56 AM
rSerge marked 3 inline comments as done.

Fixed Ubuntu x86_64 build. Implemented the changes requested in code review comments.

rSerge marked an inline comment as done.Sep 1 2016, 11:57 AM
rSerge added inline comments.
lib/sanitizer_common/scripts/gen_dynamic_list.py
54 ↗(On Diff #69891)

Tested on Ubuntu x86_64.

lib/xray/xray_inmemory_log.cc
188 ↗(On Diff #69891)

Adding.

rengolin accepted this revision.Sep 1 2016, 1:55 PM
rengolin edited edge metadata.

This looks good to me, thanks for all the changes!

If @dberris is happy, I'm happy. :)

cheers,
--renato

PS: I may have missed a few things, but we can fix as we go, when support gets better.

This revision is now accepted and ready to land.Sep 1 2016, 1:55 PM

Still, LGTM -- thanks @rSerge!

rSerge updated this revision to Diff 70574.Sep 7 2016, 11:26 AM
rSerge edited edge metadata.
rSerge marked an inline comment as done.

Rebased to the latest revision. I don't have commit access rights. Could someone commit?

Landing this now.

This revision was automatically updated to reflect the committed changes.
dberris reopened this revision.Sep 8 2016, 9:31 PM
This revision is now accepted and ready to land.Sep 8 2016, 9:31 PM
dberris requested changes to this revision.Sep 8 2016, 9:31 PM
dberris edited edge metadata.

Reverted in rL280969, need to resolve comments in D23931 before trying to land again.

This revision now requires changes to proceed.Sep 8 2016, 9:31 PM
rSerge updated this revision to Diff 71630.Sep 16 2016, 6:25 AM
rSerge edited edge metadata.
rSerge removed rL LLVM as the repository for this revision.

Removed .arch armv7 directive

rSerge updated this revision to Diff 71636.Sep 16 2016, 6:48 AM
rSerge edited edge metadata.

Fixed patch file format.

So, you're forcing vfpv3, which is armv7-only. AFAICS, you're only using VPUSH and VPOP, which is available since vfpv2 (which is also available in v6), so maybe a better fix would be to use:

.arch armv6t2
.fpu vfpv2

which should work on armv7, too.

Since this is the restriction we have inside the code, it would be more clear this way. Can you do a quick test with those directives?

cheers,
--renato

dberris accepted this revision.Sep 18 2016, 5:44 PM
dberris edited edge metadata.

@rengolin -- do we need to wait for the test, or can we do that post-commit?

lib/xray/xray_arm.cc
1 ↗(On Diff #71636)

nit: s/xray_arm.cpp/xray_arm.cc/

This revision is now accepted and ready to land.Sep 18 2016, 5:44 PM

The test is to make sure it won't break the bots again. Should be quick, as he had done it before.

rSerge added a comment.EditedSep 19 2016, 6:49 AM

So, you're forcing vfpv3, which is armv7-only. AFAICS, you're only using VPUSH and VPOP, which is available since vfpv2 (which is also available in v6), so maybe a better fix would be to use:

.arch armv6t2
.fpu vfpv2

which should work on armv7, too.

Since this is the restriction we have inside the code, it would be more clear this way. Can you do a quick test with those directives?

cheers,
--renato

armv6t2 shouldn't work because MOVW and MOVT instructions are available only since armv7 .
I can test with .fpu vfpv2, though this is not quick (compilation and moving between VMs takes substantial time).

lib/xray/xray_arm.cc
1 ↗(On Diff #71636)

Sorry, I'm not that good with the lingo. What is the meaning of this comment?

armv6t2 shouldn't work because MOVW and MOVT instructions are available only since armv7.

Movw/Movt are Thumb2 instructions and were introduced in ARMv6T2.

I can test with .fpu vfpv2, though this is not quick (compilation and moving between VMs takes substantial time).

I'm not worried about the assembly code working on v6T2 or VFPv2, I'm worried about the toolchain coping with the options.

You just need to get the complete command line with a recent enough cross-toolchain (4.8+) and try on the resulting file.

cheers,
--renato

PS: You should really get the ARM ARMs: http://llvm.org/docs/CompilerWriterInfo.html

lib/xray/xray_arm.cc
1 ↗(On Diff #71636)

It means the name on the comment is wrong and you have to *replace* (s///) with the right one.

You're calling it xray_arm.cc but has xray_arm.cpp in the header.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

lib/xray/xray_arm.cc
1 ↗(On Diff #71636)

I just did it by example. Ok, I'm fixing the comment

rSerge updated this revision to Diff 71865.Sep 19 2016, 12:08 PM
rSerge edited edge metadata.

Implemented the changes requested in the code review comments.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

Sorry, I wasn't clear. These two lines should work on any toolchain, my point is if that makes it break with your gnu toolchain because of the same issue (minimal ISA support assumed) in *conjunction* with the rest of the code.

I've tested

.arch armv6t2
.fpu vfpv2

with Clang cross-compiling from x86_64-Windows to ARM-Linux and Thumb-Linux, and GCC cross-compiling from x86_64-Ubuntu to ARM-Linux and Thumb-Linux. No compile errors so far.

Sorry, I wasn't clear. These two lines should work on any toolchain, my point is if that makes it break with your gnu toolchain because of the same issue (minimal ISA support assumed) in *conjunction* with the rest of the code.

That error doesn't seem to happen, at least with the toolchains I've tested.

That error doesn't seem to happen, at least with the toolchains I've tested.

Perfect, let's try again. :)

@dberris, you have already committed the other two patches again, right? Would you do the honours?

That error doesn't seem to happen, at least with the toolchains I've tested.

Perfect, let's try again. :)

@dberris, you have already committed the other two patches again, right? Would you do the honours?

Yep -- this one's the last piece. Happy to land now. :)

This revision was automatically updated to reflect the committed changes.