Enable the "Remove Redundant LEAs" (RRL) part of the LEA optimization pass for -O2.
This gives a 6.4% performance improvement on Broadwell on the nnet benchmark from coremark-pro. There is no significant effect on other benchmarks.
Differential D19659
[X86] Enable RRL part of the LEA optimization pass for -O2
Authored by aturetsk on Apr 28 2016, 7:11 AM.
Comment Actions
Hi Andrey,
IIUC, this patch also makes -Os run removeRedundantLEAs, which it previously didn't. Did you check the effect on compile time at -Os after this change?

Comment Actions
The RRL part of the LEA pass takes a sane amount of compile time.

-Os, the LEA pass completely disabled:
real 0m57.797s
user 0m57.448s
sys 0m0.337s

-Os, only the RRL part of the LEA pass enabled:
real 1m3.238s
user 1m2.868s
sys 0m0.352s

-Os, the LEA pass fully enabled:
real 1m12.568s
user 1m12.193s
sys 0m0.354s

The test was generated by the script:

$ python gen.py 5000 > test.c
$ cat gen.py
import sys

def foo(n):
    print 'struct { int a, b, c; } arr[1000000];'
    print ''
    print 'int foo(int x) {'
    print '  int r = 0;'
    for i in range(n):
        print '  r += arr[x + %d].a + arr[x + %d].b + arr[x + %d].c;' % (i, i, i)
    print '  switch (r) {'
    print '  case 1:'
    for i in range(n):
        print '    arr[x + %d].b = 111;' % (i)
        print '    arr[x + %d].c = 111;' % (i)
    print '    break;'
    print '  case 2:'
    for i in range(n):
        print '    arr[x + %d].b = 222;' % (i)
        print '    arr[x + %d].c = 222;' % (i)
    print '    break;'
    print '  default:'
    for i in range(n):
        # Make the LEAs irreplaceable, so that no LEAs would be removed by the
        # LEA pass and thus there would be no compile-time improvement because
        # of the reduced number of instructions which need to be processed by
        # the compiler in other passes
        print '    arr[x + %d].b = (int) &arr[x + %d].b;' % (i, i)
        print '    arr[x + %d].c = (int) &arr[x + %d].c;' % (i, i)
    print '    break;'
    print '  }'
    print '  return r;'
    print '}'

if __name__ == '__main__':
    foo(int(sys.argv[1]))

The run command:

$ time ./bin/clang -Os -S test.c

Comment Actions
Note that the generated test is really LEA-specific; the majority of machine instructions get modified by the pass. That's why the LEA pass takes ~25% of total compile time in this test.
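As a quick sanity check on the timings above, the relative compile-time cost of each configuration can be derived from the quoted user times (an illustrative Python 3 snippet, not part of the original review; the variable names are my own):

```python
# User-time figures quoted in the comment above, in seconds.
baseline = 57.448   # -Os, LEA pass completely disabled
rrl_only = 62.868   # -Os, only the RRL part of the pass enabled
full_pass = 72.193  # -Os, LEA pass fully enabled

def overhead(t, base):
    """Extra compile time relative to the no-LEA-pass baseline, in percent."""
    return (t - base) / base * 100.0

print('RRL only:  +%.1f%%' % overhead(rrl_only, baseline))   # about +9.4%
print('Full pass: +%.1f%%' % overhead(full_pass, baseline))  # about +25.7%
```

So on this LEA-heavy stress test the RRL part alone costs roughly a third of what the full pass costs in added compile time.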
Comment Actions
Hi Andrey,
What changed in the algorithm such that we now think it is also beneficial for performance, whereas it was not previously? Is it "just" because we did not benchmark it so far?
Cheers,

Comment Actions
Hi Quentin,
Yes. When I was implementing the pass my primary target was code size, so I didn't check the performance impact extensively (I only checked that there was no significant degradation on a couple of benchmarks). The obvious concern was that the RRL part of the pass might increase register pressure and thereby hurt performance. The code size impact was really small, so it was easier and safer to enable it only for -Oz.
Could you check both with and without the optimization running?