This is an archive of the discontinued LLVM Phabricator instance.

With PIE on x86_64, keep hot local arrays on the stack
Needs ReviewPublic

Authored by tmsriram on Mar 8 2017, 2:39 PM.

Details

Reviewers
davidxl
Summary

With PIE on x86_64, accessing local arrays is cheaper than accessing global arrays.

This patch seeks to improve the performance of PIE code on x86_64 when accessing local const arrays.

A few of our key benchmarks slow down by as much as 3% because of this, and we would like a way to avoid the regression.

Example here:

int foo(int i) {
  int arr[] = { 0, 2, 7, 11 };
  return arr[i];
}

$ clang -O2 foo.cc -S -fPIE
foo.s:

_Z3fooi: # @_Z3fooi

...
movslq  %edi, %rax                 # Access i
leaq    _ZZ3fooiE3arr(%rip), %rcx  # Access arr promoted to global
movl    (%rcx,%rax,4), %eax        # Access arr[i]
retq

...
.section .rodata,"a",@progbits
...
.L_ZZ3fooiE3arr:
.long 0 # 0x0
.long 2 # 0x2
.long 7 # 0x7
.long 11 # 0xb

The array is made a global, and the access needs one extra instruction, the lea, to compute the address of the array _ZZ3fooiE3arr. This causes up to a 3% performance regression, larger on newer chips.

If this array were kept on the stack, the access needs only two instructions. It does require pushing the elements onto the stack, which needs to be done once at function entry. This is how the code would look:

_Z3fooi: # @_Z3fooi

...

movaps .L_ZZ3fooiE3arr(%rip), %xmm0 # Vectorized, array push onto the stack
movaps %xmm0, -24(%rsp) # Push the array on the stack
movslq %edi, %rax # Access i
movl -24(%rsp,%rax,4), %eax # Access arr[i] in one instruction
...
.section .rodata,"a",@progbits
...
.L_ZZ3fooiE3arr:
.long 0 # 0x0
.long 2 # 0x2
.long 7 # 0x7
.long 11 # 0xb

The actual array access is one instruction in this case without accounting for the extra instructions needed to push it onto the stack:
movl -24(%rsp,%rax,4), %eax # Access arr[i] in one instruction
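A source-level sketch of what the transformation amounts to (the real patch works on LLVM IR; the names fooStack and init are illustrative, not from the patch): copy the constants into a stack array once at entry, then index it directly off the stack pointer, avoiding the PIC lea on every access.

```cpp
#include <cassert>
#include <cstring>

// Sketch only: the static init table plays the role of the .rodata template,
// and the memcpy corresponds to the movaps copy at function entry.
int fooStack(int i) {
  static const int init[] = {0, 2, 7, 11}; // .rodata template
  int arr[4];
  std::memcpy(arr, init, sizeof(arr)); // one-time copy onto the stack
  return arr[i];                       // the one-instruction access
}
```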

If the array element accesses are hot relative to the one-time cost of pushing the elements onto the stack, it is better for performance to keep the array on the stack. This patch tries to do that based on profiles: it compares the count of the basic block where the element is accessed to the count of the basic block where the stack allocation is made. It also takes into account the size of the array, since more instructions are needed to push a larger array.
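A minimal sketch of the profitability check described above. The function name, the exact cost model, and the kStoreBytesPerInst constant are all illustrative, not the patch's actual code; the idea is just that the profile count of the accessing block, which measures leas saved, must outweigh the entry-block count scaled by the number of stores needed to materialize the array.

```cpp
#include <cassert>
#include <cstdint>

// One 16-byte movaps store per chunk of the array (illustrative assumption).
constexpr uint64_t kStoreBytesPerInst = 16;

bool shouldKeepOnStack(uint64_t accessBlockCount, uint64_t allocBlockCount,
                       uint64_t arraySizeInBytes) {
  // Instructions needed to copy the array onto the stack, paid once per entry.
  uint64_t copyInsts =
      (arraySizeInBytes + kStoreBytesPerInst - 1) / kStoreBytesPerInst;
  uint64_t copyCost = allocBlockCount * copyInsts;
  // One lea is avoided on every access.
  uint64_t savedCost = accessBlockCount;
  return savedCost > copyCost;
}
```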

A couple of notes about this:

  • Clang turns on -fmerge-all-constants by default and moves all const local arrays to .rodata even before any optimizations are applied. So, for this patch to be effective, -fno-merge-all-constants must be used.
  • The instruction combine pass analyzes allocas for const arrays and tries to delete them when possible. This patch checks the profile counts at that point.
  • This is applicable only to x86_64, and only for PIE builds.
  • GCC always keeps const stack arrays on the stack unless -fmerge-all-constants is specified. What GCC does here is overkill to avoid this problem:

https://gcc.gnu.org/ml/gcc/2016-04/msg00178.html
The standard requires that each activation of the function have its own array object, so if the function is entered recursively while an outer activation is live, the two arrays' addresses must be distinct. GCC honors this but does not distinguish the cases where the pointer never escapes or the function is never called recursively.
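A minimal sketch of the uniqueness requirement discussed above (the names are illustrative). Here arr is deliberately non-const, so no compiler merges it and each recursive activation gets a fresh stack object; if arr were a const array, promoting it to a single .rodata object under -fmerge-all-constants could make the two addresses compare equal, which is the conformance issue in question.

```cpp
// Address of the outer activation's array, captured before recursing.
static int *outerArr = nullptr;

bool distinctPerActivation(int depth) {
  int arr[] = {0, 2, 7, 11}; // non-const: always a fresh per-call object
  if (depth == 0) {
    outerArr = arr;
    bool distinct = distinctPerActivation(1);
    return distinct;
  }
  // Inner activation: while the outer arr is still live, this arr must be
  // a distinct object, so the addresses must differ.
  return arr != outerArr;
}
```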

  • Clang does not honor that part of the standard; I am not sure of the full story here. Since -fmerge-all-constants is the default, clang always moves the stack array to .rodata, and pointer uniqueness is violated if the function is entered recursively. This is a separate discussion, but I am pitching my solution here since it is within the scope of this work:

  1. Add a clang option -fmerge-constants and stop making -fmerge-all-constants the default, just like GCC.
  2. Then instruction combine can keep function-local arrays on the stack if the function is called recursively.
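For reference, the stack-allocated behavior can already be observed by disabling constant merging when compiling the foo.cc example above (output file names are illustrative):

```shell
# Default: -fmerge-all-constants is on, so arr is promoted to .rodata
# and accessed through a PIC lea.
clang -O2 -fPIE -S foo.cc -o foo-merged.s

# With merging disabled, the const array stays as an alloca and is
# materialized on the stack at function entry.
clang -O2 -fPIE -fno-merge-all-constants -S foo.cc -o foo-stack.s
```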

Please let me know what you think.

Diff Detail

Event Timeline

tmsriram created this revision. Mar 8 2017, 2:39 PM

In theory, the C/C++ standards require behavior equivalent to -fno-merge-all-constants. In practice, code doesn't actually depend on that, so we made the decision many years ago to turn on -fmerge-all-constants by default.

Anyway, you don't really want the behavior of the optimizer to depend on whether the constant array is marked "static"; it probably doesn't reflect the user's intent in any useful way.


I'm sort of surprised you didn't mention register pressure at all in your explanation.

In theory, the C/C++ standards require behavior equivalent to -fno-merge-all-constants. In practice, code doesn't actually depend on that, so we made the decision many years ago to turn on -fmerge-all-constants by default.

Understood. Does it seem reasonable/useful to fix this along the
lines of GCC, -fmerge-constants and -fmerge-all-constants where the
latter applies to const arrays and a warning that this is happening
when the latter option is used?

-fmerge-all-constants has exactly the same meaning in clang and gcc. And it's generally beneficial for both codesize and performance, so turning it off to pursue performance is a bad idea.

I would suggest finding some other approach to solve your issue later in the optimization pipeline, preferably in a manner which is sensitive to register pressure. Maybe put the code in ConstantHoisting? You don't lose any useful information by promoting the alloca to a global constant; you can easily recreate it later.

Digging a little more, here is what I found.

  • As Eli pointed out, this optimization is needed only when there is register pressure; otherwise the address computation can be hoisted out.
  • To recap, a global array access in PIE mode needs two instructions on x86_64: an address computation using lea and the actual element access. If the array access is inside a hot loop, we have seen performance drop by a few percent, due to the increased dynamic instruction count from the address computation compared to non-PIE code.
  • Machine LICM does hoist the address computation of the array outside the loop, but register allocation will sink it back near the use via rematerialization if register pressure is high.

I can think of two different ways to solve this problem:

a) Implement this in a late optimization pass, in LLVM IR, using a heuristic to estimate register pressure.

  • When compiling for PIE, the optimization would move a global array to the stack if the register pressure estimate of the function where it is used is high.
  • Use a heuristic to estimate register usage. This is already done, for instance, in loop vectorization: see LoopVectorizationCostModel::calculateRegisterUsage.
  • The heuristic would use the number of overlapping live ranges as the estimate of register usage.
  • If implemented in a late pass, just before code generation, the estimate would tend to be close to the actual usage.
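The live-range heuristic in (a) can be sketched as follows, in the spirit of LoopVectorizationCostModel::calculateRegisterUsage: treat each value as a [def, last-use] interval over a linearized instruction order and take the maximum number of simultaneously live intervals as the pressure estimate. The interval representation and function name are illustrative, not LLVM's actual data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A value is live from its defining instruction to its last use,
// both given as positions in a linearized instruction order.
struct LiveRange { unsigned Def, LastUse; };

unsigned estimatePressure(const std::vector<LiveRange> &Ranges) {
  // Sweep over +1/-1 events: a range becomes live at Def and dies
  // just after LastUse. Sorting puts a death before a birth at the
  // same position, so a dying value frees its register first.
  std::vector<std::pair<unsigned, int>> Events;
  for (const LiveRange &R : Ranges) {
    Events.push_back({R.Def, +1});
    Events.push_back({R.LastUse + 1, -1});
  }
  std::sort(Events.begin(), Events.end());
  int Live = 0;
  unsigned Max = 0;
  for (const auto &E : Events) {
    Live += E.second;
    Max = std::max(Max, static_cast<unsigned>(Live));
  }
  return Max; // peak number of overlapping live ranges
}
```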

b) Teach rematerialization, during greedy register allocation, to move global arrays to the stack instead of recomputing the address every time it is used.

  • This would be done in machine IR, during register allocation, when it is known that the address is about to be rematerialized.
  • However, this optimization seems heavy-weight to do at the machine IR level; I am not sure about the complexity/feasibility of doing it there.
  • The only reason to do it here is the absence of a register usage estimation function in LLVM IR.

It looks like a) would be the way to go here. The need for a generic register usage estimation function has already been discussed by Wei to drive other optimizations. What do you think?

There's also another possibility: you could make the register allocator prefer to spill some other register which isn't on the critical path. Not sure if that's practical for the loops you care about.

That said, your possibility (a) seems reasonable.