Page MenuHomePhabricator

[ScalarEvolution] createSCEV(): recognize `udiv`/`urem` disguised as an `sdiv`/`srem`
ClosedPublic

Authored by lebedev.ri on Jun 28 2020, 10:43 AM.

Details

Summary

While InstCombine trivially converts that srem into a urem,
it might happen later than wanted, in particular i'd like
for that to happen on https://godbolt.org/z/bwuEmJ test case
early in pipeline, before first instcombine run, just before -mem2reg.

SCEV should recognize this case natively.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jun 28 2020, 10:43 AM
lebedev.ri edited the summary of this revision. (Show Details)Jun 28 2020, 10:44 AM
lebedev.ri marked an inline comment as done.
lebedev.ri added inline comments.
llvm/lib/Analysis/ScalarEvolution.cpp
4343

This is a pretty verbatim copypasta from InstCombiner::visitSRem()

Not sure I understand what benefit you're expecting on the linked testcase.

llvm/lib/Analysis/ScalarEvolution.cpp
4344

Can we stick this code into ScalarEvolution::createSCEV directly? I don't think any of the other users of MatchBinaryOp care whether they get a urem or an srem.

Can we use the same code for sdiv?

4350

Can we not use isKnownNonNegative?

Not sure I understand what benefit you're expecting on the linked testcase.

Since in SCEV it now becomes some expression (as opposed to just SCEVUnknown),
it can reason about it more. In particular, SCEVExpander tries to hoist stuff
as much as possible, and now that it is no longer some opaque SCEVUnknown,
but an SCEVURem with non-zero divisor, it gets hoisted.

Ex 1.: i & 1:

*** IR Dump Before rewrite allocas if that makes them promotable ***
; Function Attrs: uwtable
define dso_local void @_Z4loopi(i32 %width) local_unnamed_addr #0 {
entry:
  %storage = alloca [2 x i32], align 4
  %0 = bitcast [2 x i32]* %storage to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %0) #3
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp ne i32 %i.0, %width
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %0) #3
  ret void

for.body:                                         ; preds = %for.cond
  %and = and i32 %i.0, 1
  %1 = zext i32 %and to i64
  %arrayidx = getelementptr inbounds [2 x i32], [2 x i32]* %storage, i64 0, i64 %1
  %2 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %call = call i32 @_Z3adji(i32 %2)
  %3 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %add = add nsw i32 %3, %call
  store i32 %add, i32* %arrayidx, align 4, !tbaa !2
  %inc = add nsw i32 %i.0, 1
  br label %for.cond
}
*** IR Dump After rewrite allocas if that makes them promotable ***
; Function Attrs: uwtable
define dso_local void @_Z4loopi(i32 %width) local_unnamed_addr #0 {
entry:
  %storage.apc.retyped = alloca <8 x i8>, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %0 = mul i32 %i.0, -1
  %1 = trunc i32 %0 to i1
  %2 = zext i1 %1 to i64
  %cmp = icmp ne i32 %i.0, %width
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  ret void

for.body:                                         ; preds = %for.cond
  %3 = load <8 x i8>, <8 x i8>* %storage.apc.retyped, align 4
  %4 = bitcast <8 x i8> %3 to <2 x i32>
  %5 = extractelement <2 x i32> %4, i64 %2
  %call = call i32 @_Z3adji(i32 %5)
  %6 = load <8 x i8>, <8 x i8>* %storage.apc.retyped, align 4
  %7 = bitcast <8 x i8> %6 to <2 x i32>
  %8 = extractelement <2 x i32> %7, i64 %2
  %add = add nsw i32 %8, %call
  %9 = load <8 x i8>, <8 x i8>* %storage.apc.retyped, align 4
  %10 = bitcast <8 x i8> %9 to <2 x i32>
  %11 = insertelement <2 x i32> %10, i32 %add, i64 %2
  %12 = bitcast <2 x i32> %11 to <8 x i8>
  store <8 x i8> %12, <8 x i8>* %storage.apc.retyped, align 4
  %inc = add nsw i32 %i.0, 1
  br label %for.cond
}

So it got hoisted. But now,

Ex. 2: i % 2:

*** IR Dump Before rewrite allocas if that makes them promotable ***
; Function Attrs: uwtable
define dso_local void @_Z4loopi(i32 %width) local_unnamed_addr #0 {
entry:
  %storage = alloca [2 x i32], align 4
  %0 = bitcast [2 x i32]* %storage to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %0) #3
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp ne i32 %i.0, %width
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %0) #3
  ret void

for.body:                                         ; preds = %for.cond
  %rem = srem i32 %i.0, 2
  %idxprom = sext i32 %rem to i64
  %arrayidx = getelementptr inbounds [2 x i32], [2 x i32]* %storage, i64 0, i64 %idxprom
  %1 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %call = call i32 @_Z3adji(i32 %1)
  %2 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %add = add nsw i32 %2, %call
  store i32 %add, i32* %arrayidx, align 4, !tbaa !2
  %inc = add nsw i32 %i.0, 1
  br label %for.cond
}
*** IR Dump After rewrite allocas if that makes them promotable ***
; Function Attrs: uwtable
define dso_local void @_Z4loopi(i32 %width) local_unnamed_addr #0 {
entry:
  %storage.apc.retyped = alloca <8 x i8>, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp ne i32 %i.0, %width
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  ret void

for.body:                                         ; preds = %for.cond
  %rem = srem i32 %i.0, 2
  %idxprom = sext i32 %rem to i64
  %0 = load <8 x i8>, <8 x i8>* %storage.apc.retyped, align 4
  %1 = bitcast <8 x i8> %0 to <2 x i32>
  %2 = extractelement <2 x i32> %1, i64 %idxprom
  %call = call i32 @_Z3adji(i32 %2)
  %3 = load <8 x i8>, <8 x i8>* %storage.apc.retyped, align 4
  %4 = bitcast <8 x i8> %3 to <2 x i32>
  %5 = extractelement <2 x i32> %4, i64 %idxprom
  %add = add nsw i32 %5, %call
  %6 = load <8 x i8>, <8 x i8>* %storage.apc.retyped, align 4
  %7 = bitcast <8 x i8> %6 to <2 x i32>
  %8 = insertelement <2 x i32> %7, i32 %add, i64 %idxprom
  %9 = bitcast <2 x i32> %8 to <8 x i8>
  store <8 x i8> %9, <8 x i8>* %storage.apc.retyped, align 4
  %inc = add nsw i32 %i.0, 1
  br label %for.cond
}

Nope, not hoisted without this patch.

That's an inconsistency. It leads to other differences, e.g.:

*** IR Dump Before Rotate Loops ***
; Preheader:
entry:
  br label %for.cond

; Loop:
for.cond:                                         ; preds = %for.body, %entry
  <badref> = phi <2 x i32> [ undef, %entry ], [ %4, %for.body ]
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  <badref> = and i32 %i.0, 1
  <badref> = zext i32 %1 to i64
  %cmp = icmp eq i32 %i.0, %width
  br i1 %cmp, label %for.cond.cleanup, label %for.body

for.body:                                         ; preds = %for.cond
  <badref> = extractelement <2 x i32> %0, i64 %2
  %call = tail call i32 @_Z3adji(i32 %3)
  %add = add nsw i32 %call, %3
  <badref> = insertelement <2 x i32> %0, i32 %add, i64 %2
  %inc = add nuw nsw i32 %i.0, 1
  br label %for.cond

; Exit blocks
for.cond.cleanup:                                 ; preds = %for.cond
  ret void
*** IR Dump After Rotate Loops ***
; Preheader:
for.body.lr.ph:                                   ; preds = %entry
  br label %for.body

; Loop:
for.body:                                         ; preds = %for.body.lr.ph, %for.body
  <badref> = phi i64 [ 0, %for.body.lr.ph ], [ %5, %for.body ]
  %i.07 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
  <badref> = phi <2 x i32> [ undef, %for.body.lr.ph ], [ %3, %for.body ]
  <badref> = extractelement <2 x i32> %1, i64 %0
  %call = tail call i32 @_Z3adji(i32 %2)
  %add = add nsw i32 %call, %2
  <badref> = insertelement <2 x i32> %1, i32 %add, i64 %0
  %inc = add nuw nsw i32 %i.07, 1
  <badref> = and i32 %inc, 1
  <badref> = zext i32 %4 to i64
  %cmp = icmp eq i32 %inc, %width
  br i1 %cmp, label %for.cond.for.cond.cleanup_crit_edge, label %for.body

; Exit blocks
for.cond.for.cond.cleanup_crit_edge:              ; preds = %for.body
  br label %for.cond.cleanup

vs.

*** IR Dump Before Rotate Loops ***
; Preheader:
entry:
  br label %for.cond

; Loop:
for.cond:                                         ; preds = %for.body, %entry
  <badref> = phi <2 x i32> [ undef, %entry ], [ %2, %for.body ]
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp eq i32 %i.0, %width
  br i1 %cmp, label %for.cond.cleanup, label %for.body

for.body:                                         ; preds = %for.cond
  %rem = and i32 %i.0, 1
  %idxprom = zext i32 %rem to i64
  <badref> = extractelement <2 x i32> %0, i64 %idxprom
  %call = tail call i32 @_Z3adji(i32 %1)
  %add = add nsw i32 %call, %1
  <badref> = insertelement <2 x i32> %0, i32 %add, i64 %idxprom
  %inc = add nuw nsw i32 %i.0, 1
  br label %for.cond

; Exit blocks
for.cond.cleanup:                                 ; preds = %for.cond
  ret void
*** IR Dump After Rotate Loops ***
; Preheader:
for.body.lr.ph:                                   ; preds = %entry
  br label %for.body

; Loop:
for.body:                                         ; preds = %for.body.lr.ph, %for.body
  %i.07 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
  <badref> = phi <2 x i32> [ undef, %for.body.lr.ph ], [ %2, %for.body ]
  %rem = and i32 %i.07, 1
  %idxprom = zext i32 %rem to i64
  <badref> = extractelement <2 x i32> %0, i64 %idxprom
  %call = tail call i32 @_Z3adji(i32 %1)
  %add = add nsw i32 %call, %1
  <badref> = insertelement <2 x i32> %0, i32 %add, i64 %idxprom
  %inc = add nuw nsw i32 %i.07, 1
  %cmp = icmp eq i32 %inc, %width
  br i1 %cmp, label %for.cond.for.cond.cleanup_crit_edge, label %for.body

; Exit blocks
for.cond.for.cond.cleanup_crit_edge:              ; preds = %for.body
  br label %for.cond.cleanup

I'm not going to argue which result is better, but *normally* there wouldn't be such an srem,
so the current outcome is an outlier, and i would think we would prefer to not have such odd outliers.

lebedev.ri marked 2 inline comments as done.

Addressing review notes:

  • Sink into createSCEV()
  • Use SCEV's isKnownNonNegative()
  • Also handle sdiv
lebedev.ri retitled this revision from [ScalarEvolution] createSCEV(): MatchBinaryOp(): recognize `urem` disguised as an `srem` to [ScalarEvolution] createSCEV(): recognize `udiv`/`urem` disguised as an `sdiv`/`srem`.Jul 1 2020, 4:56 AM
nikic added a comment.Jul 1 2020, 12:04 PM

Your transform log refers to the "rewrite allocas if that makes them promotable" pass, but I wasn't able to find it in the code base. Could you point me to where it is implemented?

Your transform log refers to the "rewrite allocas if that makes them promotable" pass,

I'm not sure it's material to the patch in question..

but I wasn't able to find it in the code base.

Correct. There is no such thing in llvm.

Could you point me to where it is implemented?

it's wip https://github.com/LebedevRI/llvm-project/tree/alloca-promotion

nikic added a comment.Jul 1 2020, 1:47 PM

Thanks for the context! That makes it clearer what the motivation here is. You are adding a new SCEV-based pass, that runs very early in the pipeline, before even the first InstCombine run, so it needs SCEV to deal with non-canonical IR. I see you mentioned this in the description, but I was missing the "why" of it.

I'm not seeing any negative compile-time impact from this change, so this change seems ok to me.

Thanks for the context! That makes it clearer what the motivation here is. You are adding a new SCEV-based pass, that runs very early in the pipeline, before even the first InstCombine run, so it needs SCEV to deal with non-canonical IR. I see you mentioned this in the description, but I was missing the "why" of it.

That's the gist of it, yes.

I'm not seeing any negative compile-time impact from this change, so this change seems ok to me.

Cool, thanks for checking.

efriedma added inline comments.Jul 1 2020, 3:20 PM
llvm/test/Analysis/ScalarEvolution/sdiv.ll
17

What are you diffing against here?

lebedev.ri marked 2 inline comments as done.Jul 1 2020, 3:45 PM
lebedev.ri added inline comments.
llvm/test/Analysis/ScalarEvolution/sdiv.ll
17

Didn't push the test commit originally.
Now done: rG51ff7642a33f73518d60909e3fe4e6348dcc7b27.

lebedev.ri marked an inline comment as done.Jul 1 2020, 3:46 PM
lebedev.ri planned changes to this revision.Jul 1 2020, 4:04 PM
clementval added inline comments.
llvm/test/Analysis/ScalarEvolution/sdiv.ll
17

This commit is making a bunch of builedbot failing. Can you revert it?

lebedev.ri updated this revision to Diff 274955.Jul 1 2020, 4:35 PM

Ok, now for real.

lebedev.ri marked an inline comment as done.Jul 1 2020, 4:36 PM
efriedma accepted this revision.Jul 1 2020, 5:37 PM

LGTM

llvm/test/Analysis/ScalarEvolution/sdiv.ll
17

Oh, that makes more sense.

This revision is now accepted and ready to land.Jul 1 2020, 5:37 PM
lebedev.ri marked 2 inline comments as done.Jul 2 2020, 2:22 AM

LGTM

Thank you for the review!

llvm/test/Analysis/ScalarEvolution/sdiv.ll
17

Yeah, somehow i ended up with wrong initial check lines, and didn't notice it, oops.
Sorry.

lebedev.ri marked an inline comment as done.Jul 2 2020, 2:41 AM
This revision was automatically updated to reflect the committed changes.