This is an archive of the discontinued LLVM Phabricator instance.

[IndVarSimplify] Widen signed loop compare instructions to enable additional optimizations.
ClosedPublic

Authored by mcrosier on Sep 12 2014, 10:14 AM.

Details

Summary

This change removes a trunc from between an induction variable and the associated compare. This allows other optimizations (e.g., InstCombine, LSR) to take effect. A sext may be added to the compare's other operand, but this can often be hoisted outside of the loop.
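
Roughly, the transformation targets IR of the following shape once the induction variable has been widened (names here are illustrative only, not taken from the patch). Before, the compare still uses the narrow value through a trunc:

  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]   ; widened IV
  %iv.trunc = trunc i64 %iv to i32
  %cmp = icmp slt i32 %iv.trunc, %limit                   ; narrow compare

After, the compare is widened and the sext of the other operand is loop-invariant, so it can be hoisted into the preheader:

  %limit.wide = sext i32 %limit to i64                    ; hoistable out of the loop
  ...
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %cmp = icmp slt i64 %iv, %limit.wide                    ; wide compare, no trunc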

For example,
int *ptr;
int e, idx;
int foo() {
  int i;
  idx = -1;
  for (i = 0; i <= e; i++)
    if (!ptr[i]) {
      idx = i;
      break;
    }
  return idx;
}

Before this change, on AArch64 we generate the following loop:

.LBB0_2:                                // %for.body
  ldr   w11, [x10, x0, lsl #2]
  cbz   w11, .LBB0_5
  add   x0, x0, #1                      // ++i / pre-increment
  sub   w11, w0, #1                     // rematerialize i
  cmp   w11, w9                         // compare i
  b.lt  .LBB0_2

After:

.LBB0_2:                                // %for.body
  ldr   w11, [x10]                      // remove shift as we can now increment base by 4
  add   x0, x0, #1                      // i++ / post-increment i
  cbz   w11, .LBB0_5
  add   x10, x10, #4                    // base += 4
  cmp   x0, x9
  b.lt  .LBB0_2

With a little more work we should be able to form a post-increment load (my next patch), giving the following loop:

.LBB0_2:                                // %for.body
  ldr   w11, [x10], #4
  add   x0, x0, #1
  cbz   w11, .LBB0_5
  cmp   x0, x9
  b.lt  .LBB0_2

This change in isolation has a minimal effect on performance (i.e., nothing outside of noise). However, it enables better use of post-increment loads/stores, which is why I deemed it important. Please have a look.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 13639.Sep 12 2014, 10:14 AM
mcrosier retitled this revision from to [IndVarSimplify] Widen signed loop compare instructions to enable additional optimizations..
mcrosier updated this object.
mcrosier edited the test plan for this revision. (Show Details)
mcrosier added subscribers: Jiangning, jmolloy.

Just a few comments to save on review time.

Chad

lib/Transforms/Scalar/IndVarSimplify.cpp
935

Now that I think about it, I'm not sure the single-use check is necessary.

954

I reuse getExtend because it tries hard to hoist the extend as far out of the loop as possible.

test/Transforms/IndVarSimplify/elim-extend.ll
10 ↗(On Diff #13639)

This is the sext on limit, which is outside of the loop. This isn't an extend we need to CHECK-NOT.

test/Transforms/IndVarSimplify/verify-scev.ll
383

Benjamin,
Can you verify this is a safe change for this test case? My patch changes the SCEV output, so I modified the test case to match the new result.

test/Transforms/IndVarSimplify/widen-loop-comp.ll
20

This is the example from the commit-list message.

test/Transforms/LoopSimplify/merge-exits.ll
1 ↗(On Diff #13639)

FileCheckize.

14 ↗(On Diff #13639)

No sext in loop.

reames added a subscriber: reames.Sep 12 2014, 10:40 AM

Looks fine to me, though I'm not familiar with this code. You should get a LGTM from someone who knows the area. I'm mostly just reading through to build up my own background in the area.

lib/Transforms/Scalar/IndVarSimplify.cpp
935

Why does it matter that the compare has one use? We're not changing the result of the ICmp, are we?

test/Transforms/LoopSimplify/merge-exits.ll
1 ↗(On Diff #13639)

Separating this cleanup into its own change before making the change required for your patch would make the diff a lot more understandable.

atrick edited edge metadata.Sep 12 2014, 10:46 AM

I just glanced at this and still want to take a little closer look. In particular I want to understand why LSR behavior changes. Ping me if I don't get back to it.

I generally would not trade a trunc for a sext. But in this case, it probably works out better most of the time. So your change seems reasonable if it doesn't regress things. We definitely want canonical-looking loop compares, and that must have something to do with LSR behaving better.

Did you also see changes in InstCombine?

I wonder if you'll hit an assertion if both compare operands happen to use the same IV:

j = i*2 - k
cmp i, j
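
In IR terms, that would look something like the following (purely illustrative names, not from the patch), with both icmp operands derived from the narrow IV:

  %j.mul = mul i32 %i, 2
  %j = sub i32 %j.mul, %k
  %cmp = icmp slt i32 %i, %j              ; both operands depend on %i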

It would be good to just handle the unsigned case instead of leaving a FIXME, so behavior is consistent. It doesn't have to be in this commit, though.

Thanks for the feedback, Philip.

Andy,
I'll have to revisit the LSR code to better understand the change. At one point in time I understood what was going on, but it's since been paged out. Assuming this patch gets accepted, I'll investigate the unsigned implementation.

Chad

lib/Transforms/Scalar/IndVarSimplify.cpp
935

Exactly. I'll remove the check and rerun the analysis.

test/Transforms/LoopSimplify/merge-exits.ll
1 ↗(On Diff #13639)

r217698.

mcrosier updated this revision to Diff 13651.Sep 12 2014, 11:48 AM
mcrosier edited edge metadata.

Updated based on reviews. Removed the singleUse check on icmp. Also removed the FileCheckize change (committed in r217698).

dpeixott added inline comments.
lib/Transforms/Scalar/IndVarSimplify.cpp
931

Does your comment about single use still apply? I don't see a check for it.

atrick accepted this revision.Sep 15 2014, 7:59 PM
atrick edited edge metadata.

I think the reason removing the trunc works well in practice is that LSR can easily tell that an immediate offset can be folded into the compare. I'm sure it would be possible to make LSR handle the trunc, but fixing Indvars to generate a cleaner loop test is a fine approach.

I'm still curious if this triggers any InstCombines that we were missing before.

This revision is now accepted and ready to land.Sep 15 2014, 7:59 PM
In D5333#16, @atrick wrote:

I think the reason removing the trunc works well in practice is that LSR can easily tell that an immediate offset can be folded into the compare. I'm sure it would be possible to make LSR handle the trunc, but fixing Indvars to generate a cleaner loop test is a fine approach.

Yes, that's what I'm seeing as well.

LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i64

reg({0,+,1}<nuw><nsw><%for.body>)
reg({-1,+,1}<nw><%for.body>) + imm(1)

The latter case is only considered once the trunc is removed.

The specific instcombine I thought might be affected was committed by David here:
http://llvm.org/viewvc/llvm-project?view=revision&revision=179316

From my original code I was hoping to fold the extra subtract into the compare:

ldr w11, [x10, x0, lsl #2]
cbz w11, .LBB0_5
add x0, x0, #1
sub w11, w0, #1 <--- fold into compare
cmp w11, w9
b.lt .LBB0_2

However, r179316 couldn't handle this case because of the trunc. After a few pointers from David M., I realized that modifying that code to handle the trunc wasn't the right approach, and that's how we arrived at this patch. Based on that limited experience, I could imagine other compare instcombines not working due to intervening truncs. However, I haven't actually seen this in practice.

Thanks for the review, Andy. I'm going to do a little more analysis before committing. My performance runs were on devices that were rather unstable at the time, so I'd like to rerun everything so I can sleep easier at night. :)

Chad

Thanks, LGTM. Please follow up by handling the unsigned compare and removing the FIXME.

mcrosier updated this revision to Diff 13782.Sep 17 2014, 7:08 AM
mcrosier edited edge metadata.

Revised patch which now handles unsigned compares.

mcrosier closed this revision.Sep 17 2014, 7:22 AM

(Andy, I'm sure you can appreciate this..) I had a bit of variance in my performance runs (both A53 and A57), but overall I saw a few small improvements.

Committed in r217953. Post-commit reviews welcome, but hopefully this is in good shape.

njw45 added a subscriber: njw45.Sep 25 2014, 1:47 AM
This comment was removed by njw45.

Nick,
The promotion of unsigned comparisons was reverted for this very reason. I believe the step has to be one (a unit step) for the transform to be correct. There may be other cases, but this is the only one I can think of at the moment.

Regardless, that part of the code has been reverted.

Chad

I'm following up on my mail to llvmdev - I don't think this change is quite correct. The test case below is a slight modification of the one in the final version of this change; it's basically (in C):

uint32_t test5(uint32_t* a, uint8_t b) {
  uint32_t sum = 0;
  for(uint8_t i = 0; i <= b; i += 10) {
    sum += a[i];
  }
  return sum;
}

LLVM IR:

; CHECK-LABEL: @test5
; CHECK: zext i8 %b
; CHECK: for.cond:
; CHECK: phi i64
; CHECK: icmp ule i64

define i32 @test5(i32* %a, i8 %b) {
entry:
  br label %for.cond

for.cond:
  %sum.0 = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %i.0 = phi i8 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp ule i8 %i.0, %b
  br i1 %cmp, label %for.body, label %for.end

for.body:
  %idxprom = sext i8 %i.0 to i64
  %arrayidx = getelementptr inbounds i32* %a, i64 %idxprom
  %0 = load i32* %arrayidx, align 4
  %add = add nsw i32 %sum.0, %0
  %inc = add nsw i8 %i.0, 10
  br label %for.cond

for.end:
  ret i32 %sum.0
}

If b was (e.g.) 251, the loop would be infinite (as it's `for (uint8_t i = 0; i <= 251; i += 10)`, but when `i` gets to 250 + 10 it wraps to 4, and `i <= 251` is still true). If you promote `i` to a uint64_t then the loop terminates at this point, as 260 > 251. Is this analysis correct?
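
A minimal standalone C sketch of the wraparound in question (illustrative only, not part of the test case):

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
    uint8_t  narrow = 250;   /* the 8-bit induction variable */
    uint64_t wide   = 250;   /* the same value after widening */
    narrow += 10;            /* 260 wraps modulo 256 to 4 */
    wide   += 10;            /* stays 260 */
    printf("narrow = %u, wide = %llu\n", narrow, (unsigned long long)wide);
    return 0;
  }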

Thanks -

Nick

http://reviews.llvm.org/D5333

spatel added a subscriber: spatel.Sep 25 2014, 2:57 PM

Hi Chad -
I think this patch caused or exposed some other bug here:
http://llvm.org/bugs/show_bug.cgi?id=21030

Thanks, Sanjay. I'll investigate now.

jhowarth added a subscriber: jhowarth.Edited Feb 19 2015, 6:42 AM

This change caused a 24% performance regression in SciMark2's Sparse matmult benchmark on Bloomfield processors and 12% on Harpertown processors with "-O3 -march=native", which generates identical assembly in both cases. The issue is filed at http://llvm.org/bugs/show_bug.cgi?id=22589 and appears to be due to increased register pressure from the newly exposed optimizations, which causes spills. The issue is processor-specific, and the regression is minimal on other processors such as Haswell due to their additional registers (at least for this specific test case).