This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Optimise load(adr address) to ldr address
ClosedPublic

Authored by dmgreen on Aug 21 2018, 2:11 AM.

Details

Summary

Providing that the load is known to be 4 byte aligned, we can optimise a ldr(adr address) to just ldr address.

Diff Detail

Repository
rL LLVM

Event Timeline

dmgreen created this revision.Aug 21 2018, 2:11 AM

Unfortunately I'm hitting a number of linker problems in testing this.

  • Armlink seems to do fine, but I am testing embedded code there.
  • ld.bfd links the test-suite fine with fpic (except for the expected errors where relocations are out of range). But without pic gives an error about relocating against stderr that I'm not sure of yet, but might point to this alignment check being wrong in some way. Unfortunately the errors are not clear what is out of range, it may be a legitimate error (although ld.gold and lld both are "successful" at linking the same program).
  • ld.gold doesn't like load literal relocations, producing invalid instructions.
  • lld, with some extra patches, does fine for anything that isn't a tls relocation I think.

Unfortunately I'm hitting a number of linker problems in testing this.

  • Armlink seems to do fine, but I am testing embedded code there.
  • ld.bfd links the test-suite fine with fpic (except for the expected errors where relocations are out of range). But without pic gives an error about relocating against stderr that I'm not sure of yet, but might point to this alignment check being wrong in some way. Unfortunately the errors are not clear what is out of range, it may be a legitimate error (although ld.gold and lld both are "successful" at linking the same program).

Some thoughts:

  • When compiling PIC I'm guessing that the reference to stderr will go via the GOT, this is likely to be closer to the code than the data.
  • I can't see anything obviously wrong with the BFD code for handling LO19
  • LLD has a different section layout to BFD, in particular it puts the ro-data before code so it may be just luck that puts stderr in range for LLD. You may be able to reproduce by feeding LLD the ld.bfd linker map file.
  • One way to find out whether the out of range error is correct is to use the small code model and see how big the distance between stderr and the equivalent place in the relocation is. Accounting for the tiny code model, does this look credible?
  • ld.gold doesn't like load literal relocations, producing invalid instructions.

I think it will be worth filing a binutils bug, a quick check of the entry for aarch64-reloc.def in gold shows us:

ARD(LD_PREL_LO19		 , STATIC ,  AARCH64    ,   Y,  -1,   20,20		  ,    2,20 , Symbol::RELATIVE_REF , 			     LDST  )

with the LDST being the key part. I think that should be LD as there isn't a store literal and the LDST category of relocations have a different set of encoding rules.

  • lld, with some extra patches, does fine for anything that isn't a tls relocation I think.

If there is anything I can do to help here let me know?

  • I can't see anything obviously wrong with the BFD code for handling LO19

Trying a newer ld.bfd (2.31 from today) this is working fine in both modes, so perhaps the issue is old. It was 2.27/2.28 that I was trying earlier, but the program being compiled (timeit-target) is small enough that I didn't expect it to go out of range.

I think it will be worth filing a binutils bug, a quick check of the entry for aarch64-reloc.def in gold shows us:

https://sourceware.org/bugzilla/show_bug.cgi?id=23557

If there is anything I can do to help here let me know?

I think the tls problems was me getting the code to add R_AARCH64_GOT_LD_PREL19 and R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 relocations incorrect. GOT seems easy enough now I've figured out what was wrong, TLSIE I'm not sure about yet but I'm guessing it should be possible.

  • I can't see anything obviously wrong with the BFD code for handling LO19

Trying a newer ld.bfd (2.31 from today) this is working fine in both modes, so perhaps the issue is old. It was 2.27/2.28 that I was trying earlier, but the program being compiled (timeit-target) is small enough that I didn't expect it to go out of range.

I can reproduce with the installed binutils on Ubuntu 16.04 (2.26.1). Taking a look at the object that causes the problem the specific error message I see is

get.dir/timeit.c.o  -o timeit-target 
/usr/bin/ld: CMakeFiles/timeit-target.dir/timeit.c.o(.text+0x5dc): unresolvable R_AARCH64_LD_PREL_LO19 relocation against symbol `stderr@@GLIBC_2.17'

I think that this is likely to be the only time in the test suite that there is a R_AARCH64_LD_PREL_LO19 to a data symbol in a shared library. In theory this should generate a R_AARCH64_COPY dynamic relocation. I think that this was fixed as part of the changes made in https://sourceware.org/bugzilla/show_bug.cgi?id=21532 "[AArch64] Allow COPY relocation elimination" as prior to that R_AARCH64_LD_PREL_LO19 wasn't in the switch statement to generate copy relocations. The change made it into binutils 2.29.

If there is anything I can do to help here let me know?

I think the tls problems was me getting the code to add R_AARCH64_GOT_LD_PREL19 and R_AARCH64_TLSIE_LD_GOTTPREL_PREL19 relocations incorrect. GOT seems easy enough now I've figured out what was wrong, TLSIE I'm not sure about yet but I'm guessing it should be possible.

I'm guessing that this is the relaxation from TLSIE to TLSLE? I think that will be possible as it should be possible to tell from the relocation at the place which code model is being used.

john.brawn added inline comments.Aug 22 2018, 8:50 AM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1098 ↗(On Diff #161666)

I'm pretty sure we can do the selection entirely in tablegen. I experimented with adjusting LoadLiteral AArch64InstrFormats.td to be kinda similar to Load32RO (take ValueType and SDPatternOperator as arguments, use these in a pattern but match AArch64adr for the address) and that seemed to work. The only problem there is making sure the alignment is correct, which I think can be done using a PatFrag or PatLeaf (though I tried something quick and ran into tablegen errors, so maybe it's more complicated than I expect).

I think that this is likely to be the only time in the test suite that there is a R_AARCH64_LD_PREL_LO19 to a data symbol in a shared library. In theory this should generate a R_AARCH64_COPY dynamic relocation. I think that this was fixed as part of the changes made in https://sourceware.org/bugzilla/show_bug.cgi?id=21532 "[AArch64] Allow COPY relocation elimination" as prior to that R_AARCH64_LD_PREL_LO19 wasn't in the switch statement to generate copy relocations. The change made it into binutils 2.29.

Thanks for looking. What is our usual response with issues like this? Is it OK for this to go ahead, as it can be used fine in some linkers. I could put it behind a flag I think, so that users have the option to turn it off (although that might not be very obvious..)

I'm guessing that this is the relaxation from TLSIE to TLSLE? I think that will be possible as it should be possible to tell from the relocation at the place which code model is being used.

Yep :)

lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1098 ↗(On Diff #161666)

Yep, my original version of this was just in tablegen. I wasn't sure about how to do the alignment though, and it seemed to need an addedcomplexity of at least 20, at least the way I did it.

I'll try and take another go at doing it there, so long as it can handle the alignment and all the LDRSWl/LDRXl/LDRWl's. I'll let you know how it looks.

Thanks for looking. What is our usual response with issues like this? Is it OK for this to go ahead, as it can be used fine in some linkers. I could put it behind a flag I think, so that users have the option to turn it off (although that might not be very obvious..)

I think it depends on the context. For example:

  • How many existing builds is it going to break?
  • How easy would it be to backport the fix in binutils if it were important enough to do so?
  • Are there alternatives to binutils available?
  • Does the same fault exist when compiling the file with GCC?

In this particular case I think:

  • There are no existing clients of the tiny code model, so no existing builds to break.
  • The fix is fairly recent, AArch64 specific so it could be backported if necessary.
  • In theory LLD is an option if we get support in for the same clang release.
  • If it doesn't work on GCC it probably means very few people are using the feature on linux.

In summary I think the impact would be low enough that it isn't worth worrying about it.

dmgreen updated this revision to Diff 162715.Aug 27 2018, 11:58 AM

Many tablegen crashes later.. :)

I wasn't sure if there was some better way to say "any address" than this.

efriedma added inline comments.Aug 27 2018, 2:55 PM
lib/Target/AArch64/AArch64InstrInfo.td
1896 ↗(On Diff #162715)

Checking the alignment of the load isn't going to do the right thing in general; you need to check that the global is aligned. (Normally, a misaligned load is UB, but this would turn it into a link error.)

dmgreen added inline comments.Aug 28 2018, 12:11 PM
lib/Target/AArch64/AArch64InstrInfo.td
1896 ↗(On Diff #162715)

OK thanks. Can you give more details on the difference? Is the load's alignment wrong or inefficient? Or do you mean that in the event that load's alignment is wrong in a way that is UB, it needs to not be link error?

I would expect if we had an aligned global, <something> (instsimplify maybe?) would set the alignment on the load to that of the global. And we are using the alignment for a ldr here, so it coming from a load at least makes sense in that way.

There are other examples of aligned load patterns like this, such as alignedload16 from arm or alignedload from x86. They may not be expecting to be acting on the same global data though.

dmgreen updated this revision to Diff 162926.Aug 28 2018, 12:30 PM

"alignedAArch64adr"

It may be better to make this match the alignment on the global directly, not looking through the ADR. I'm not sure how that's done yet.

efriedma added inline comments.Aug 28 2018, 12:34 PM
lib/Target/AArch64/AArch64InstrInfo.td
1896 ↗(On Diff #162715)

Or do you mean that in the event that load's alignment is wrong in a way that is UB, it needs to not be link error?

I mean this: we can't reliably prevent optimizations that clone code, like inlining, from introducing UB, so UB can't be a link error.

The requirement here is different from alignedload because normally aligned loads only fault (or load incorrect data) if you violate the alignment requirement; the code still links. For example, on x86, "movapd 1, %xmm0" is a valid instruction, even though it will always fault.

efriedma added inline comments.Aug 28 2018, 12:58 PM
lib/Target/AArch64/AArch64InstrInfo.td
1900 ↗(On Diff #162926)

I'd suggest just using unsigned Align = G->getGlobal()->getPointerAlignment() here, which should reliably do the right thing for all GlobalObjects.

Should be possible to match a global address node directly using something like the following:

def alignedglobaladdr : PatLeaf<(tglobaladdr), [{
  auto *G = cast<GlobalAddressSDNode>(N);
  // etc.
}]>;
dmgreen added inline comments.Aug 28 2018, 2:56 PM
lib/Target/AArch64/AArch64InstrInfo.td
1896 ↗(On Diff #162715)

I see.

We currently use other ldr literal relocations, but I guess those are mostly to gotslots so cant be unaligned in the same way. I'm not sure about the tls ones, I think they all go through a got too.

1900 ↗(On Diff #162926)

Thanks for the suggestions. Will do.

I was looking for a way to avoid these multiple Pat's, having a PatLeaf that matched either tglobaladdr or tconstpool. You can let me know if it's worth it over multiple patterns.

dmgreen updated this revision to Diff 162964.Aug 28 2018, 2:57 PM
This revision is now accepted and ready to land.Aug 28 2018, 3:32 PM
This revision was automatically updated to reflect the committed changes.