This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Support -mexecute-only with -mlong-calls.
ClosedPublic

Authored by ZhiyaoMa98 on Oct 18 2022, 2:51 PM.

Details

Summary

Instead of using constant pools, use a movw/movt pair.
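As a rough illustration of what a movw/movt pair does, the C sketch below models how a 32-bit address is split into the halves selected by :lower16: and :upper16: and then rebuilt in a register. The function names (lower16, upper16, movw, movt, materialize) are made up for this sketch and are not part of the patch.

```c
#include <stdint.h>

/* Low half of a 32-bit address: the part :lower16: selects for movw. */
uint16_t lower16(uint32_t addr) { return (uint16_t)(addr & 0xFFFFu); }

/* High half of a 32-bit address: the part :upper16: selects for movt. */
uint16_t upper16(uint32_t addr) { return (uint16_t)(addr >> 16); }

/* movw writes a 16-bit immediate and zeroes the top half of the register. */
uint32_t movw(uint16_t imm) { return (uint32_t)imm; }

/* movt replaces the top 16 bits and leaves the low half untouched. */
uint32_t movt(uint32_t reg, uint16_t imm) {
    return (reg & 0xFFFFu) | ((uint32_t)imm << 16);
}

/* Rebuild the full address the way the instruction pair would (hypothetical helper). */
uint32_t materialize(uint32_t addr) {
    return movt(movw(lower16(addr)), upper16(addr));
}
```

Because both halves are encoded as immediates in the instructions themselves, no data load from the code region is needed, which is the property -mexecute-only relies on.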

Event Timeline

ZhiyaoMa98 created this revision. Oct 18 2022, 2:51 PM
ZhiyaoMa98 requested review of this revision. Oct 18 2022, 2:51 PM
efriedma added inline comments.
llvm/test/CodeGen/Thumb2/thumb2-execute-only-long-calls.ll
20

Is there some reason we can't just generate movw r0, :lower16:bar; movt r0, :upper16:bar?

@efriedma Thank you for your suggestion. I will remove the extra indirection.

I was wondering if you could also provide some insight into the RWPI case. I believe the same optimization also applies to RWPI. However, with RWPI I actually want to store the function address as a global variable, because I want the address to live in RAM instead of Flash so that I can redirect the function call at runtime, for dynamic linking purposes.

Should I create a new target feature to indicate that I want function addresses stored in RAM?

> that I can redirect the function call at runtime, for dynamic linking purpose.

Can you describe a little more what you're trying to do here?

If you want to replace the implementation of an existing function at runtime, you'd be better off implementing the indirection as a frontend feature; by the time you get to the backend, optimizations have destroyed the semantics you want.

ZhiyaoMa98 added a comment. Edited Oct 18 2022, 4:50 PM

> Can you describe a little more what you're trying to do here?

Sure. My eventual goal is to enable fine-grained live update on ARM-based microcontrollers, which requires the system to perform some relocation at runtime. Below I describe the challenge with a simple C example.

Consider the following C snippet:

extern void global_func(void); // A global function whose symbol is exported by the system at runtime.
static void local_func(void) { ... }

static void main_entry(void) {
    local_func();
    global_func();
}

I want to load and run the compiled object file at runtime, which requires two steps.

  1. Burn the object file into Flash storage.
  2. Perform a runtime symbol resolution and relocation so that global_func is set to the runtime address.
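Step 2 can be pictured with a small C sketch. It assumes a hypothetical table of symbols exported by the running system and a writable slot that holds the call target; struct exported_symbol, resolve_symbol, relocate_slot, and the global_func_impl stand-in are illustrative names only, not an existing loader API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*func_ptr)(void);

/* Hypothetical entry in the table of symbols the running system exports. */
struct exported_symbol {
    const char *name;
    func_ptr    addr;
};

/* Stand-in for a function provided by the system; counts its calls. */
int global_func_calls = 0;
void global_func_impl(void) { global_func_calls++; }

/* Look a symbol up by name; returns NULL if the system does not export it. */
func_ptr resolve_symbol(const struct exported_symbol *table, size_t count,
                        const char *name) {
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0) {
            return table[i].addr;
        }
    }
    return NULL;
}

/* "Relocate" one call target by storing the resolved address in its slot.
 * Returns 0 on success and -1 if the symbol is unresolved (slot unchanged). */
int relocate_slot(func_ptr *slot, const struct exported_symbol *table,
                  size_t count, const char *name) {
    assert(slot != NULL);
    func_ptr addr = resolve_symbol(table, count, name);
    if (addr == NULL) {
        return -1;
    }
    *slot = addr;
    return 0;
}
```

The sketch only works if the slot is writable after the code has been burned into Flash, which is why the location of the stored address matters in the discussion that follows.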

The reason I must store code in Flash is that the microcontroller I am using, like many other ARM-based microcontrollers, has 5x more Flash storage than RAM, and code typically runs directly from Flash.

local_func requires the compiler to emit position-independent code, which is already handled by -fropi. global_func, however, is the case I am trying to solve here.

Existing compiler options always store the address of global_func in Flash.

The default case:

main_entry:
    bl  local_func
    b.w global_func // Relative address is hardcoded in the instruction, in Flash.

If compiled with -mlong-calls:

main_entry:
    bl  local_func
    ldr r0, [pc, #4] // Load address from constant pool, still in Flash.
    bx  r0
.Lconst_pool:
    .word global_func

In the hypothetical case where the compiler chooses to use a movw/movt pair:

main_entry:
    bl   local_func
    movw r0, :lower16:global_func // Absolute address is hardcoded in the instruction, still in Flash.
    movt r0, :upper16:global_func
    bx   r0

I was expecting to use the "side effect" of -mexecute-only, which promotes constant pools to global variables, to achieve my goal of having the function address live in RAM.

main_entry:
    bl   local_func
    movw r0, :lower16:.const_pool(sbrel)
    movt r0, :upper16:.const_pool(sbrel) // Also using RWPI so that the jump table can be placed anywhere in RAM pointed by r9.
    ldr  r0, [r9, r0] // Absolute address is held in RAM now.
    bx   r0
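In C terms, the sequence above behaves roughly like a call through a writable table addressed from a base pointer. The sketch below is only a model of that idea: static_base stands in for r9, and the handler type, jump_table, impl_v1/impl_v2, call_slot, and redirect_slot are hypothetical names chosen for illustration (an int-returning signature is used so the redirection is observable).

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler)(int);

/* Two hypothetical implementations of the same entry point. */
int impl_v1(int x) { return x + 1; }
int impl_v2(int x) { return x * 2; }

/* Writable table of call targets; on the target this would live in RAM. */
handler jump_table[1] = { impl_v1 };

/* Stand-in for r9: the base of the table, set by whoever loads the code. */
handler *static_base = jump_table;

/* Call through base + index, mirroring `ldr r0, [r9, r0]` followed by `bx r0`. */
int call_slot(size_t index, int arg) {
    return static_base[index](arg);
}

/* Redirect a slot at runtime; only the table changes, the caller stays as is. */
void redirect_slot(size_t index, handler target) {
    assert(target != NULL);
    static_base[index] = target;
}
```

Calling call_slot(0, 20) first reaches impl_v1; after redirect_slot(0, impl_v2) the same call site reaches impl_v2 without any change to the code stored in Flash.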

As you have already pointed out, in the normal case when we do not need to put the address in RAM, the extra indirection is unnecessary and slows down the code.

But if I have a use case like the one above, where I need to store the address in RAM, could you advise on the best approach to achieve my goal?

The construct you want is pretty similar to a GOT. If you compile with -fPIE -fsemantic-interposition, you get basically the code you want, except that the compiler uses a PLT by default instead of a GOT. If we supported -fno-plt for ARM, it would be almost exactly what you want. That said, that won't work with -frwpi... maybe we need some new kind of relocation to represent that.

Unfortunately, -fPIE does not seem to generate a PLT with LLVM for embedded ARM.

C source file (test.c):

extern void bar(void);
void foo(void) {
    bar();
}

LLVM with clang -O2 -fPIE -fsemantic-interposition -mlong-calls --target=armv7em-none-eabi -c test.c:

00000000 <foo>:
   0:   4800            ldr     r0, [pc, #0]    ; (4 <foo+0x4>)
   2:   4700            bx      r0
   4:   00000000        .word   0x00000000

ARM GNU with arm-none-eabi-gcc -O2 -fPIE -mlong-calls -msingle-pic-base -mcpu=cortex-m4 -c test.c:

00000000 <foo>:
   0:   4b01            ldr     r3, [pc, #4]    ; (8 <foo+0x8>)
   2:   f859 3003       ldr.w   r3, [r9, r3]
   6:   4718            bx      r3
   8:   00000000        .word   0x00000000

One, -mlong-calls isn't currently compatible with PIE. Two, on ARM, there are no special PLT relocations; the linker just takes care of it. (You can see the differences if you try to take the address of a function without calling it.)

ZhiyaoMa98 edited the summary of this revision.
ZhiyaoMa98 marked an inline comment as done. Oct 19 2022, 8:33 AM

I have updated the diff to avoid the extra indirection. I am thinking about adding a new option, say -mgot-calls to allow code generation with the extra indirection. Is it sensible and shall I create another diff to discuss that?

> I am thinking about adding a new option, say -mgot-calls to allow code generation with the extra indirection. Is it sensible and shall I create another diff to discuss that?

That probably makes sense, yes.

llvm/lib/Target/ARM/ARMISelLowering.cpp
2655

Can we directly check that movw/movt is available? I think that's what we do in other places? (Then just assert we aren't execute-only in the non-movw path.)

> Then just assert we aren't execute-only in the non-movw path.

When we are not execute-only, existing code handles it by using constant pools and we are all good.

In the case where we are execute-only and long-calls at the same time, we assert that we have movt like in other places in the same source file.
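The decision described in the two paragraphs above can be modeled in a few lines of C. This is a simplified sketch of the logic as stated in this thread, not the actual code in ARMISelLowering.cpp; the enum and the function choose_long_call_lowering are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two ways a long-call target can be materialized. */
enum long_call_lowering {
    LOWER_CONSTANT_POOL, /* load the address from a literal pool */
    LOWER_MOVW_MOVT      /* build the address with a movw/movt pair */
};

/* Simplified model of the choice discussed in this review (not LLVM code). */
enum long_call_lowering choose_long_call_lowering(bool execute_only,
                                                  bool has_movt) {
    if (!execute_only) {
        /* Existing behavior: constant pools are fine when data reads
         * from the code region are allowed. */
        return LOWER_CONSTANT_POOL;
    }
    /* Execute-only code cannot read a literal pool, so movt must exist. */
    assert(has_movt && "execute-only long calls need movw/movt");
    return LOWER_MOVW_MOVT;
}
```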

efriedma accepted this revision. Oct 19 2022, 10:44 AM

LGTM with one small change.

clang/lib/Driver/ToolChains/Arch/ARM.cpp
779

Fix this comment?

This revision is now accepted and ready to land. Oct 19 2022, 10:44 AM
ZhiyaoMa98 marked an inline comment as done.

Updated the comment to reflect that now we allow using -mlong-calls with -mexecute-only.

efriedma accepted this revision. Oct 19 2022, 11:09 AM
ZhiyaoMa98 marked an inline comment as done.

Remove the unused GA variable.

In case you assumed that I have push permission: unfortunately, I do not. Could you help me merge the patch? Thanks.

This revision was automatically updated to reflect the committed changes.