Page MenuHomePhabricator

[PPC64] toc-indirect to toc-relative relaxation
ClosedPublic

Authored by MaskRay on Apr 22 2019, 12:06 AM.

Details

Summary

This is based on D54720 by Sean Fertile.

When accessing a global symbol which is not defined in the translation unit,
compilers will generate instructions that load the address from the toc entry.

If the symbol is defined, non-preemptable, and addressable with a 32-bit
signed offset from the toc pointer, the address can be computed
directly. e.g.

addis 3, 2, .LC0@toc@ha  # R_PPC64_TOC16_HA
ld    3, .LC0@toc@l(3)   # R_PPC64_TOC16_LO_DS, load the address from a .toc entry
ld/lwa 3, 0(3)           # load the value from the address

.section .toc,"aw",@progbits
.LC0: .tc var[TC],var

can be relaxed to

addis 3,2,var@toc@ha     # this may be relaxed to a nop,
addi  3,3,var@toc@l      # then this becomes addi 3,2,var@toc
ld/lwa 3, 0(3)           # load the value from the address

We can delete the test ppc64-got-indirect.s as its purpose is covered by
newly added ppc64-toc-relax.s and ppc64-toc-relax-constants.s

Event Timeline

MaskRay created this revision.Apr 22 2019, 12:06 AM
Herald added a project: Restricted Project. · View Herald Transcript
MaskRay updated this revision to Diff 196043.Apr 22 2019, 3:08 AM
MaskRay edited the summary of this revision. (Show Details)

Revamp tests

MaskRay updated this revision to Diff 196044.Apr 22 2019, 3:09 AM

Delete ppc64-got-indirect.s

MaskRay updated this revision to Diff 196045.Apr 22 2019, 3:14 AM

Delete unused Inputs/ppc64-toc-relax-constants.s

ruiu added inline comments.Apr 22 2019, 3:15 AM
ELF/Arch/PPC64.cpp
114–115

How fast is this statement? Don't you have to do binary search?

122–125

I'm not familiar with the PPC64 instruction set, and I'm wondering why the relaxed instructions are faster than the original form. Can you add that to the comment?

ELF/Relocations.cpp
1251

nit: please add }.

MaskRay marked an inline comment as done.Apr 22 2019, 6:58 AM
MaskRay added inline comments.
ELF/Arch/PPC64.cpp
114–115

When resolving R_PPC16_TOC_* in a .rela.text section, TocRelIdx is monotonically increasing.
The loop is executed at most num(R_PPC16_TOC_*)+size(.rela.text) times.

The pessimistic case happens when you have many .rela.text* sections (-ffunction-sections or comdat .text*) that contain toc relocations.

When linking clang, the 4 object files with the most numbers of .rela.toc entries:
CheckerRegistry.o (.rela.toc has 951 entries) (8 .rela.text have toc relocations) (1172 toc relocations)
X86ISelLowering.o (.rela.toc has 723 entries) (208 .rela.text have toc relocations) (1108 toc relocations)
ScalarEvolution.o (.rela.toc has 492 entries) (153 .rela.text have toc relocations) (754 toc relocations)
X86FastIsel.o (.rela.toc has 462 entries) (242 .rela.text have toc relocations) (3901 toc relocations)
The 50th has 123 entries in its .rela.toc.

For CheckerRegistry.o, the TocRelIdx optimization decreases the execution times of the loop from at most (951+1172)*951 to at most (951+8)*951. Using linear probing here should be ok.

MaskRay updated this revision to Diff 196067.Apr 22 2019, 7:00 AM

Add {} around the EM_RISCV branch

MaskRay updated this revision to Diff 196174.Apr 22 2019, 7:45 PM

Rebase after the llvm::stable_sort change

Hi MaskRay.

Thanks for doing this, when you originally described this too me I didn't realize you intended to partition and sort the relocations in the sections other then .rela.toc. This clears up my question regarding the implementation. I'm still a little hesitant on this approach though. Did you profile link times between the 2 approaches? I understand the speed up in number of access (you have outlined that very well in another comment), however that fails to consider both how small the number of relocation in rela.toc is compared to all the relocations in the text section, and how infrequently we have an object file that actually enters the loop.

The main reason I wasn't so concerned over the n^2 look up in the original patch is because we would hit that loop so infrequently, and even though it is technically n^2, in practice the typical object files gcc produces when it does put a constant in the TOC would typically lead to 1 or 2 extra array access rather then n extra array accesses. I've compiled a couple of projects to show what I mean.

Protobuf: 628 objects, 27 where there is a constant in the TOC. (4.3%)
 Missing reloc count               Frequency
       1                             19
       2                              8

Postgres: 704 objects 12 with a constant in the TOC (1.7%)
Missing reloc count                Frequency
        1                            10
        2                             1
        9                             1

FMPEG: 681 objects,  28 with constants in the TOC (4.1%)
Missing reloc count                Frequency
      1                               22
      2                                4
      3                                1
      96                               1

LLVM: 3294 Objects, 355 with constant in the TOC.  (10.8%)
Missing reloc count                Frequency
     1                               180
     2                                61
     3                                48
     4                                22
     5                                13
     6                                 4
     7                                 5
     8                                 3
     9                                 2
    10                                 2
    11                                 3
    12                                 3
    13                                 2
    19                                 1
    22                                 1
    28                                 2
    32                                 1
    40                                 1
    80                                 1

Clearly there are a few objects where the number of missing relocation's does start to get worryingly large (30/40/80/96), but 90%-95% of the files never hit the loop, and 70% of those that do need the loop will have at most 1 or 2 extra array accesses per lookup. Note that this is all compiled with gcc, when compiling with clang as the build compiler we end up with *no* objects falling though to the loop.

FWIW, I think this implementation is clean and understandable enough that we can switch to it, but I would like to know how this affects the link time of say llvm when clang is the build compiler and when gcc is the build compiler before deciding this is the best approach.

MaskRay updated this revision to Diff 196380.Apr 23 2019, 8:27 PM

Switch to Sean Fertile's original approach and update tests.

Hi MaskRay.

Thanks for doing this, when you originally described this too me I didn't realize you intended to partition and sort the relocations in the sections other then .rela.toc. This clears up my question regarding the implementation. I'm still a little hesitant on this approach though. Did you profile link times between the 2 approaches? I understand the speed up in number of access (you have outlined that very well in another comment), however that fails to consider both how small the number of relocation in rela.toc is compared to all the relocations in the text section, and how infrequently we have an object file that actually enters the loop.

The main reason I wasn't so concerned over the n^2 look up in the original patch is because we would hit that loop so infrequently, and even though it is technically n^2, in practice the typical object files gcc produces when it does put a constant in the TOC would typically lead to 1 or 2 extra array access rather then n extra array accesses. I've compiled a couple of projects to show what I mean.

Protobuf: 628 objects, 27 where there is a constant in the TOC. (4.3%)
 Missing reloc count               Frequency
       1                             19
       2                              8

Postgres: 704 objects 12 with a constant in the TOC (1.7%)
Missing reloc count                Frequency
        1                            10
        2                             1
        9                             1

FMPEG: 681 objects,  28 with constants in the TOC (4.1%)
Missing reloc count                Frequency
      1                               22
      2                                4
      3                                1
      96                               1

LLVM: 3294 Objects, 355 with constant in the TOC.  (10.8%)
Missing reloc count                Frequency
     1                               180
     2                                61
     3                                48
     4                                22
     5                                13
     6                                 4
     7                                 5
     8                                 3
     9                                 2
    10                                 2
    11                                 3
    12                                 3
    13                                 2
    19                                 1
    22                                 1
    28                                 2
    32                                 1
    40                                 1
    80                                 1

Clearly there are a few objects where the number of missing relocation's does start to get worryingly large (30/40/80/96), but 90%-95% of the files never hit the loop, and 70% of those that do need the loop will have at most 1 or 2 extra array accesses per lookup. Note that this is all compiled with gcc, when compiling with clang as the build compiler we end up with *no* objects falling though to the loop.

FWIW, I think this implementation is clean and understandable enough that we can switch to it, but I would like to know how this affects the link time of say llvm when clang is the build compiler and when gcc is the build compiler before deciding this is the best approach.

Hi Sean,

Thank you for the statistics and explanation!

As I updated some PPC64 tests yesterday, now I think I have a better understanding of what these relaxations are... I didn't really understand the R_PPC64_ADDR64 relocations in .toc well until now. I think your approach in D54720 is superior. (Though I tested an earlier version of this revision on numerous internal targets and didn't see issues caused by it (there are many unrelated long-standing issues))

I now understand that .rela.toc consists of exclusively R_PPC64_ADDR64 relocations. What you mentioned before is that clang/llvm never emits constants into .toc so the linear loop can't be a problem. For gcc, as your statistics say, it isn't a problem either given the very small number of such cases.

I switched to your approach but tried inlining some functions (fixed an 32-bit compile error and an assertion error when Relas is empty, etc) and adjusted some code to make it (hopefully) easier to understand.

MaskRay updated this revision to Diff 196381.Apr 23 2019, 8:46 PM

Switch to sfertile's original approach.

MaskRay updated this revision to Diff 196383.Apr 23 2019, 9:52 PM
MaskRay edited the summary of this revision. (Show Details)

Enhance ppc64-toc-relax.s : add another symbol hidden2

MaskRay updated this revision to Diff 196384.Apr 23 2019, 11:05 PM
MaskRay marked 2 inline comments as done.

Elaborate what this relaxation does

MaskRay marked an inline comment as done.Apr 23 2019, 11:05 PM
MaskRay added inline comments.
ELF/Arch/PPC64.cpp
122–125

Elaborated the comments.

MaskRay updated this revision to Diff 196386.Apr 23 2019, 11:20 PM

Improve comments

MaskRay updated this revision to Diff 196387.Apr 23 2019, 11:33 PM

Better comments

Patch looks really good. I think you managed to explain the toc-indirection without getting to verbose which is something I've struggled with. I have a couple minor suggestions and I need to go over the test a bit more indepth but overall this looks good to me.

ELF/Arch/PPC64.cpp
122

I think if Relas.size() is not equal to zero, then we want to set Index to index the last element of the array, and perform the linear probing. For example trying to relax an access to last element of a .toc section that contains a constant as the first entry and a symbol as the second: Index will be 1, but Relas[0] is the relocation we want to find to perform the relaxation.

139

I think we need to make a small change to this comment. The compiler generates the indirection by adding a private label at the address of the toc entry for the symbol, and the @toc symbol modifier indicates to the assembler to create a toc-relative relocation for that private label. If we relax then we rewrite the instructions to access the symbol toc-relative:

// .toc
// ...
// .LC0:
//  .tc foo               # will allocate 8 bytes storage and a R_PPC64_ADDR64 relocation for 'foo'


//  addis 3, 2, LC0@toc@ha    # R_PPC64_TOC16_HA
//  ld 3, LC0@toc@l(3)        # R_PPC64_TOC16_LO_DS, load the address of 'foo' from the .toc entry
//  ld/lwa  3, 0(3)           # load the value of foo


// If foo is defined, non-preemptable and addressable with a 32-bit signed
// offset from the toc base, the address can be computed by adding an offset to
// the toc base, saving a load.

//  addis 3, 2, foo@toc@ha ...
`
359

Need to make similar adjustments to this comment and the next.

ELF/InputSection.cpp
73

Is this related or is it from a different patch?

MaskRay marked 7 inline comments as done.Apr 25 2019, 7:55 PM
MaskRay added inline comments.
ELF/Arch/PPC64.cpp
122

My bad. The test ppc64-toc-relax-constants.s had demonstrated the failure. Fixed.

139

Reworded.

ELF/InputSection.cpp
73

It is related to an assertion failure when relas() is called with zero relocations.

template <class ELFT> ArrayRef<typename ELFT::Rela> relas() const {
  assert(AreRelocsRela); // If relas is empty(), this is false
  return llvm::makeArrayRef(
      static_cast<const typename ELFT::Rela *>(FirstRelocation),
      NumRelocations);
}

I have switched to early return if TocReloc->NumRelocations == 0. The assertion will not fire.

MaskRay updated this revision to Diff 196787.Apr 25 2019, 7:55 PM
MaskRay marked 3 inline comments as done.
MaskRay edited the summary of this revision. (Show Details)

Update description based on sfertile's review

sfertile accepted this revision.Apr 30 2019, 8:04 AM

LGTM.

test/ELF/ppc64-toc-relax-constants.s
47

remove the : 8*2 - 0x8000 = -32752 part of the comment.

test/ELF/ppc64-toc-relax.s
44

minor nit: why not check for the relocation for shared 0x0 as well.

69

Maybe, "The definition of 'shared' used is determined at loadtime (by the dynamic linker) so the extra indirection cannot be relaxed."

This revision is now accepted and ready to land.Apr 30 2019, 8:04 AM
MaskRay updated this revision to Diff 197340.Apr 30 2019, 8:29 AM
MaskRay marked 2 inline comments as done.
MaskRay edited the summary of this revision. (Show Details)

Address sfertile's comments

MaskRay marked an inline comment as done.Apr 30 2019, 8:29 AM
MaskRay marked an inline comment as done.
MaskRay updated this revision to Diff 197342.Apr 30 2019, 8:32 AM

Fix a trailing comma in a comment

Harbormaster completed remote builds in B31160: Diff 197342.

@ruiu Do you have more suggestions? :)

ruiu accepted this revision.May 6 2019, 9:07 PM

LGTM

ELF/Arch/PPC64.cpp
113

Nice comment. Thank you for writing this.

This revision was automatically updated to reflect the committed changes.