This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable RRL part of the LEA optimization pass for -O2
Closed · Public

Authored by aturetsk on Apr 28 2016, 7:11 AM.

Details

Summary

Enable "Remove Redundant LEAs" part of the LEA optimization pass for -O2.

This gives a 6.4% performance improvement on Broadwell on the nnet benchmark from coremark-pro. There is no significant effect on other benchmarks.
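
For context, here is a minimal sketch of the kind of pattern the RRL ("Remove Redundant LEAs") step targets, using the same struct/array shape as the compile-time test further down. The assembly in the comment is a hypothetical illustration, not output from this patch or its tests:

struct S { int a, b, c; };

int example(struct S *arr, int x) {
  /* All three field accesses need the address arr + x*12 plus a small
   * constant.  Because a 12-byte stride does not fit a single x86
   * addressing mode, the backend may form these addresses with several
   * LEAs, e.g. (hypothetical AT&T-syntax output):
   *
   *     leaq (%rsi,%rsi,2), %rax        # x*3 (struct stride 12 = 3*4)
   *     leaq 4(%rdi,%rax,4), %rcx       # &arr[x].b
   *     leaq 8(%rdi,%rax,4), %rdx       # &arr[x].c   <-- redundant
   *
   * The third LEA has the same base/index/scale as the second and differs
   * only in the constant displacement, so the memory accesses that use it
   * can be rewritten against %rcx with their displacement adjusted by +4,
   * and the LEA itself deleted.
   */
  return arr[x].a + arr[x].b + arr[x].c;
}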

Diff Detail

Repository
rL LLVM

Event Timeline

aturetsk updated this revision to Diff 55411 · Apr 28 2016, 7:11 AM
aturetsk retitled this revision to [X86] Enable RRL part of the LEA optimization pass for -O2.
aturetsk updated this object.
aturetsk added reviewers: nadav, qcolombet, rnk.
aturetsk added subscribers: reames, zinovy.nis, llvm-commits.
aturetsk updated this object · Apr 28 2016, 7:17 AM
bruno added a subscriber: bruno · Apr 28 2016, 9:25 AM

Hi Andrey,

What are the other benchmarks you tested this on?

Hi Bruno,
I tried Geekbench, Coremark-Pro, Spec2000 and Spec2006.

bruno added a reviewer: bruno · May 2 2016, 9:24 AM

Hi Andrey,

IIUC, this patch also makes -Os run removeRedundantLEAs, which it previously didn't. Did you check the effect on compile time at -Os after this change?

The RRL part of the LEA pass takes a reasonable amount of compile time.
Here are the measurements.

-Os, the LEA pass is completely disabled:

real    0m57.797s
user    0m57.448s
sys     0m0.337s

-Os, only the RRL part of the LEA pass is enabled:

real    1m3.238s
user    1m2.868s
sys     0m0.352s

-Os, the LEA pass is fully enabled:

real    1m12.568s
user    1m12.193s
sys     0m0.354s
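
For reference, relative to the baseline with the pass completely disabled, these numbers work out to roughly a 9% compile-time increase with only the RRL part enabled (57.8s -> 63.2s real) and roughly a 26% increase with the pass fully enabled (57.8s -> 72.6s real), on the LEA-heavy stress test described below.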

The test was generated by the script:

$ python gen.py 5000 > test.c
$ cat gen.py

import sys

def foo(n):
  print 'struct { int a, b, c; } arr[1000000];'
  print ''
  print 'int foo(int x) {'
  print '  int r = 0;'
  for i in range(n):
    print '  r += arr[x + %d].a + arr[x + %d].b + arr[x + %d].c;' % (i, i, i)
  print '  switch (r) {'
  print '  case 1:'
  for i in range(n):
    print '    arr[x + %d].b = 111;' % (i)
    print '    arr[x + %d].c = 111;' % (i)
  print '    break;'
  print '  case 2:'
  for i in range(n):
    print '    arr[x + %d].b = 222;' % (i)
    print '    arr[x + %d].c = 222;' % (i)
  print '    break;'
  print '  default:'
  for i in range(n):
    # Make the LEAs irreplaceable so that the LEA pass removes none of them;
    # otherwise the removed instructions would reduce the work done by later
    # passes and skew the compile-time comparison.
    print '    arr[x + %d].b = (int) &arr[x + %d].b;' % (i, i)
    print '    arr[x + %d].c = (int) &arr[x + %d].c;' % (i, i)
  print '    break;'
  print '  }'
  print '  return r;'
  print '}'

if __name__ == '__main__':
  foo(int(sys.argv[1]))

The run command:

time ./bin/clang -Os -S test.c
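
For reference, this is what the script emits for a small argument, e.g. $ python gen.py 1 (the real test uses 5000):

struct { int a, b, c; } arr[1000000];

int foo(int x) {
  int r = 0;
  r += arr[x + 0].a + arr[x + 0].b + arr[x + 0].c;
  switch (r) {
  case 1:
    arr[x + 0].b = 111;
    arr[x + 0].c = 111;
    break;
  case 2:
    arr[x + 0].b = 222;
    arr[x + 0].c = 222;
    break;
  default:
    arr[x + 0].b = (int) &arr[x + 0].b;
    arr[x + 0].c = (int) &arr[x + 0].c;
    break;
  }
  return r;
}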

Note that the generated test is really LEA-specific: the majority of its machine instructions get modified by the pass. That's why the LEA pass takes ~25% of the total compile time in this test.

qcolombet edited edge metadata · May 13 2016, 9:33 AM

Hi Andrey,

What changed in the algorithm such that we now think it is also beneficial for performance, whereas previously it was not?

Is it “just” because we had not benchmarked it so far?

Cheers,
-Quentin

Hi Quentin,

Yes. When I was implementing the pass, my primary target was code size, so I didn't check the performance impact extensively (I only checked that there was no significant degradation on a couple of benchmarks). The obvious concern was that the RRL part of the pass might increase register pressure and thereby hurt performance. The code-size impact was really small, so it was easier and safer to enable it only for -Oz.

qcolombet added inline comments · May 16 2016, 6:39 PM
test/CodeGen/X86/lea-opt.ll
Line 1 (On Diff #55411)

Could you check both with and without the optimization running?

Line 112 (On Diff #55411)

I don’t get what the problem is with that specific part of the test.

If we want to test something else, writing a new test instead of fixing that one should be fine, shouldn’t it?

aturetsk updated this revision to Diff 57444 · May 17 2016, 1:25 AM
aturetsk edited edge metadata.

Add a test for the case when the LEA pass is disabled.

aturetsk updated this revision to Diff 57445 · May 17 2016, 1:53 AM

Add a test for the case when the LEA pass is disabled. (Take #2).

aturetsk added inline comments · May 17 2016, 2:05 AM
test/CodeGen/X86/lea-opt.ll
Lines 1–2 (On Diff #57445)

Done.

Line 137 (On Diff #57445)

There are two LEA instructions in this test.

Before this patch, the RRL part of the pass was disabled for 'optsize', so both LEAs were preserved and the RemoveRedundantAddressCalculation (RRAC) part of the pass had to choose a LEA for substitution in the load instructions. It should choose not the closest one, but the one that makes the resulting displacement fit in 1 byte (that's exactly what the test checks).

After this patch, the RRL part of the pass is enabled for 'optsize'. Without the changes I made to the test, it would remove one of the LEAs, so the RRAC part of the pass wouldn't need to choose a LEA for substitution (only one would remain), and the test would become useless.
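
To make the 1-byte point concrete with made-up numbers: an x86 memory operand's displacement is encoded in either 1 byte (-128..127) or 4 bytes. If one LEA computes base and another computes base + 192, a load from base + 200 can use either of them; relative to the first it needs a displacement of 200 (4-byte encoding), relative to the second only 8 (1-byte encoding), so the pass should prefer the second even if the first happens to be the closer one.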

qcolombet added inline comments · May 17 2016, 9:36 AM
test/CodeGen/X86/lea-opt.ll
Lines 1–2 (On Diff #57445)

Could you factor out the common pattern under a CHECK prefix and move the rest into two different prefixes, ENABLED and DISABLED?
That way it would be easier to see the effects of the pass.
(I also suspect that DISABLED will be a superset of CHECK and that ENABLED will be empty.)

(Note: you can use several check prefixes on the command line with additional --check-prefix options.)
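
For example (hypothetical RUN lines, not necessarily the exact ones that ended up in the test): one run would invoke FileCheck with --check-prefix=CHECK --check-prefix=ENABLED and the other with --check-prefix=CHECK --check-prefix=DISABLED; the shared patterns then go on CHECK lines and the run-specific ones on ENABLED/DISABLED lines.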

Line 137 (On Diff #57445)

Thanks for the clarification.

aturetsk updated this revision to Diff 57592 · May 18 2016, 5:13 AM

Improve the test.

test/CodeGen/X86/lea-opt.ll
Lines 1–3 (On Diff #57592)

Done.

qcolombet accepted this revision · May 18 2016, 11:07 AM
qcolombet edited edge metadata.
This revision is now accepted and ready to land · May 18 2016, 11:07 AM
This revision was automatically updated to reflect the committed changes.