On x86_64, PIE code can access a local (stack) array more cheaply than a global array.
This patch seeks to improve the performance of PIE code on x86_64 when accessing local const arrays.
A few of our key benchmarks slow down by as much as 3% because local const arrays are promoted to globals, and we would like to find a way to avoid that.
Example:
int foo(int i) {
  int arr[] = { 0, 2, 7, 11 };
  return arr[i];
}
$ clang -O2 foo.cc -S -fPIE
foo.s:
_Z3fooi: # @_Z3fooi
...
movslq %edi, %rax # Access i
leaq _ZZ3fooiE3arr(%rip), %rcx # Access arr promoted to global
movl (%rcx,%rax,4), %eax # Access arr[i]
retq
...
.section .rodata,"a",@progbits
...
.L_ZZ3fooiE3arr:
.long 0 # 0x0
.long 2 # 0x2
.long 7 # 0x7
.long 11 # 0xb
The array is made a global, and the access needs one extra instruction, the lea, to get the address of the array _ZZ3fooiE3arr. This causes up to a 3% performance regression, larger on newer chips.
If this array were kept on the stack, the access would need only two instructions. The elements do have to be copied onto the stack, but that needs to be done only once, at function entry. This is what the code would look like:
_Z3fooi: # @_Z3fooi
...
movaps .L_ZZ3fooiE3arr(%rip), %xmm0 # Vectorized load of the whole array from .rodata
movaps %xmm0, -24(%rsp) # Store the array on the stack
movslq %edi, %rax # Access i
movl -24(%rsp,%rax,4), %eax # Access arr[i] in one instruction
...
.section .rodata,"a",@progbits
...
.L_ZZ3fooiE3arr:
.long 0 # 0x0
.long 2 # 0x2
.long 7 # 0x7
.long 11 # 0xb
The actual array access is one instruction in this case, not counting the extra instructions needed to copy the array onto the stack:
movl -24(%rsp,%rax,4), %eax # Access arr[i] in one instruction
If the use of the array element is hot and happens more frequently than the copy of the elements onto the stack, it is better for performance to keep this array on the stack. This patch tries to do that based on profiles. It compares the count of the basic block where the element is accessed to the count of the basic block where the stack allocation is made. It also takes into account the size of the array, as more instructions are needed to copy a larger array.
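To make the comparison concrete, here is a rough sketch of the kind of check described above. It is not the code in the patch: the struct, the function name keep_on_stack, and the cost constants (16 bytes per vector copy, two instructions per copy, one instruction saved per access) are all assumptions picked for illustration.

```cpp
#include <cstdint>

// Hypothetical summary of the profile data for one const local array.
struct ArrayProfile {
  std::uint64_t alloc_block_count;   // count of the block holding the stack allocation
  std::uint64_t access_block_count;  // count of the block holding the element access
  std::uint64_t array_bytes;         // size of the array in bytes
};

// Assumed cost model, not taken from the patch:
//  - copying costs one load plus one store per 16-byte chunk, paid every
//    time the allocating block runs;
//  - keeping the array on the stack saves one instruction (the lea) every
//    time the accessing block runs.
bool keep_on_stack(const ArrayProfile &p) {
  const std::uint64_t kBytesPerCopy = 16;
  const std::uint64_t kInstrsPerCopy = 2;
  const std::uint64_t chunks = (p.array_bytes + kBytesPerCopy - 1) / kBytesPerCopy;
  const std::uint64_t copy_cost = p.alloc_block_count * chunks * kInstrsPerCopy;
  const std::uint64_t saved = p.access_block_count;
  return saved > copy_cost;
}
```

Under these assumed constants, a 16-byte array that is allocated once and accessed 100 times stays on the stack, while the same array accessed only as often as it is allocated (as in foo above) does not.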
A couple of notes about this:
- Clang turns on -fmerge-all-constants by default and moves all const local arrays to .rodata even before any optimizations are applied. So, in order for this patch to be effective, -fno-merge-all-constants must be used.
- The instruction combine pass is what analyzes allocas and tries to delete them when possible, if the allocas are for const arrays. This patch checks the profile counts at that point.
- This is only applicable to x86_64, and only for PIE builds.
- GCC always keeps const stack arrays on stack unless -fmerge-all-constants is specified. What GCC does here is overkill in order to prevent this problem:
https://gcc.gnu.org/ml/gcc/2016-04/msg00178.html
The standard says that if the function is called recursively and a pointer to the stack array is accessed, the array in each activation must have a unique address. GCC tries to honor this but does not differentiate between cases where the pointer is not accessed or the function is not called recursively.
- Clang does not honor that part of the standard; I am not sure of the full story here. Since -fmerge-all-constants is the default, clang will always move the stack array to rodata, and pointer uniqueness is violated if this function is entered recursively. This is a separate discussion, but I am pitching my solution to this here since it is within the scope of this work:
- I will add a clang option -fmerge-constants and not make -fmerge-all-constants the default, just like GCC does.
- Then instruction combine can keep function-local arrays on the stack if the function is called recursively.
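For reference, the pointer uniqueness issue can be observed with a small test like the one below. This is only an illustrative sketch; the function name distinct_in_recursion is made up for this example. The function returns true when the array in the inner activation has a different address from the array whose address the caller passed in. What it returns for the recursive case depends on the compiler and on whether constants are being merged, which is exactly the point under discussion.

```cpp
// Returns true if the local array in this activation has a different address
// from the array pointed to by `outer`. With no argument, the function
// recurses once and compares the arrays of the two activations.
bool distinct_in_recursion(const int *outer = nullptr) {
  const int arr[] = {0, 2, 7, 11};
  if (outer == nullptr)
    return distinct_in_recursion(arr);  // pass the outer activation's array
  return outer != arr;                  // compare with this activation's array
}
```

Calling distinct_in_recursion() with no argument and printing the result under -fmerge-all-constants and under -fno-merge-all-constants should show whether the two activations share one copy of the array.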
Please let me know what you think.