This is an archive of the discontinued LLVM Phabricator instance.

[LSR] Generate and use zero extends
ClosedPublic

Authored by sanjoy on Apr 21 2015, 6:44 PM.

Details

Summary

If a scale or a base register can be rewritten as "Zext({A,+,1})" then
LSR will now consider a formula of that form in its normal cost
computation.
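
For intuition, here is a hedged C++ sketch of the kind of source loop this
targets (the function and all names are invented for illustration; the patch
itself operates on SCEV formulae, not C source). A narrow 32-bit counter
indexes memory on a 64-bit target, so every address computation needs a
zext of the counter:

'''
#include <cstdint>

// Hypothetical example: 'i' is an unsigned 32-bit counter, but on a
// 64-bit target each a[i] and b[i] access needs zext(i) in its address
// arithmetic. If the i64 recurrence Zext({start,+,1}) provably never
// wraps, LSR can use a single wide IV instead of keeping a separate
// zero-extended base register around.
double sum(const double *a, const int32_t *b, uint32_t start, uint32_t n) {
  double s = 0.0;
  for (uint32_t i = start; i < n; ++i)
    s += a[i] + (double)b[i];
  return s;
}
'''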

Depends on D9180

Diff Detail

Repository
rL LLVM

Event Timeline

sanjoy updated this revision to Diff 24191. Apr 21 2015, 6:44 PM
sanjoy retitled this revision to [LSR] Generate and use zero extends.
sanjoy updated this object.
sanjoy edited the test plan for this revision.
sanjoy added a reviewer: atrick.
sanjoy added a subscriber: Unknown Object (MLST).
atrick requested changes to this revision. Apr 27 2015, 11:14 AM
atrick edited edge metadata.

This looks reasonable. Thanks!

Does your test fail on trunk? This is what the test outputs for me:

ok_161:                                           ; preds = %ok_158
  %lsr.iv.next = add nuw nsw i64 %lsr.iv, 1
  %4 = add i64 %0, %lsr.iv.next
  %tmp1 = trunc i64 %4 to i32
  %tmp188 = icmp slt i32 %tmp1, %tmp160
  br i1 %tmp188, label %ok_146, label %block_81

Would you be able to test both post-inc and pre-inc variants? We've had a lot of bugs because of TransformForPostIncUse.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
Line 3685 (On Diff #24191)

Please comment this at least as well as your commit comment, if not better.

This revision now requires changes to proceed. Apr 27 2015, 11:14 AM

> Does your test fail on trunk? This is what the test outputs for me:
>
>   ok_161:                                       ; preds = %ok_158
>     %lsr.iv.next = add nuw nsw i64 %lsr.iv, 1
>     %4 = add i64 %0, %lsr.iv.next
>     %tmp1 = trunc i64 %4 to i32
>     %tmp188 = icmp slt i32 %tmp1, %tmp160
>     br i1 %tmp188, label %ok_146, label %block_81

That's the behavior I'm trying to avoid -- I don't want two additions on
every iteration of the loop.
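
Concretely, here is a rough C++ rendering of the two per-iteration shapes
(names are invented; this mirrors the IR above rather than the actual test
case):

'''
#include <cstdint>

static inline void use(uint64_t) {} // stand-in for the real consumers

// Trunk's output above: a zero-based IV plus a loop-invariant base that
// is re-added on every trip -- two adds per iteration.
void two_adds_per_iteration(uint64_t base, uint64_t n) {
  for (uint64_t iv = 0; iv != n; ++iv) // add #1: the IV bump
    use(base + iv);                    // add #2: base + iv
}

// With the zext formula the base folds into the IV's start value, so
// the IV bump is the only add left in the loop body.
void one_add_per_iteration(uint64_t base, uint64_t n) {
  for (uint64_t iv = base, end = base + n; iv != end; ++iv)
    use(iv);                           // no extra add here
}
'''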

To elaborate a little bit, here is the -debug-only=loop-reduce output for
the test case before the change:

'''
After filtering out undesirable candidates:
LSR is examining the following uses:

LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
  reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
  reg((zext i32 %tmp156 to i64)) + 1*reg({2,+,1}<nw><%ok_146>) + imm(-2)
  reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>)
  reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>) + imm(1)
  reg((zext i32 %tmp156 to i64)) + 1*reg({3,+,1}<nw><%ok_146>) + imm(-3)
LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
  reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>) + imm(-1)
  reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>) + imm(-1)
  reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>)
  reg((-1 + (zext i32 %tmp156 to i64))) + 1*reg({0,+,1}<nuw><%ok_146>)
  reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-4)
  reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-3)
LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
  -16 + reg({(16 + (8 * (zext i32 %tmp156 to i64)) + %d),+,8}<nw><%ok_146>)
  -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({6,+,2}<%ok_146>)
  -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({8,+,4}<%ok_146>)
  -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({12,+,4}<%ok_146>)
  -16 + reg(%d) + 8*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -24 + reg(%d) + 8*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -8 + reg((16 + %d)<nsw>) + 8*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({4,+,2}<%ok_146>)
  -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({2,+,1}<nw><%ok_146>)
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({0,+,4}<%ok_146>)
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({0,+,2}<%ok_146>)
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({0,+,1}<nuw><%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 2*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 4*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 8*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
  -12 + reg({(12 + (4 * (zext i32 %tmp156 to i64)) + %b),+,4}<nw><%ok_146>)
  -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({4,+,2}<%ok_146>)
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({0,+,4}<%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 1*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
  -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({6,+,2}<%ok_146>)
  -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({8,+,4}<%ok_146>)
  -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({12,+,4}<%ok_146>)
  -8 + reg(%b) + 4*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -12 + reg(%b) + 4*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -8 + reg((12 + %b)<nsw>) + 4*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({3,+,1}<nw><%ok_146>)
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({0,+,2}<%ok_146>)
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({0,+,1}<nuw><%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 2*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 4*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
New best at 4 regs, with addrec cost 2, plus 2 scale cost, plus 9 imm cost, plus 3 setup cost.
Regs: {%tmp156,+,1}<nuw><nsw><%ok_146> {(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146> (16 + %d)<nsw> (12 + %b)<nsw>
New best at 4 regs, with addrec cost 1, plus 3 base adds, plus 2 scale cost, plus 3 setup cost.
Regs: {0,+,1}<nuw><%ok_146> (zext i32 %tmp156 to i64) (16 + (8 * (zext i32 %tmp156 to i64)) + %d) (12 + (4 * (zext i32 %tmp156 to i64)) + %b)

The chosen solution requires 4 regs, with addrec cost 1, plus 3 base adds, plus 2 scale cost, plus 3 setup cost:

LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
  reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>)
LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
  reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>) + imm(-1)
LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({0,+,1}<nuw><%ok_146>)
LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({0,+,1}<nuw><%ok_146>)
'''

And here is the debug output after the change. LSR now "sees" a few
more candidate formulae (marked with '<- new solution +++++'):

'''
After filtering out undesirable candidates:
LSR is examining the following uses:

LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
  reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
  reg(%tmp156) + 1*reg({0,+,1}<nuw><%ok_146>)
  reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>)
  reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>) + imm(1)
  reg((zext i32 %tmp156 to i64)) + 1*reg({3,+,1}<nw><%ok_146>) + imm(-3)
  reg((zext i32 %tmp156 to i64)) + 1*reg({2,+,1}<nw><%ok_146>) + imm(-2)
LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
  reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>) + imm(-1)
  reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>) + imm(-1)
  reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>)
  reg((-1 + (zext i32 %tmp156 to i64))) + 1*reg({0,+,1}<nuw><%ok_146>)
  reg({%tmp156,+,1}<nuw><nsw><%ok_146>) + imm(-1)   <- new solution +++++
  reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-4)
  reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-3)
LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
  -16 + reg({(16 + (8 * (zext i32 %tmp156 to i64)) + %d),+,8}<nw><%ok_146>)
  -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({6,+,2}<%ok_146>)
  -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({8,+,4}<%ok_146>)
  -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({12,+,4}<%ok_146>)
  -16 + reg(%d) + 8*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -24 + reg(%d) + 8*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -8 + reg((16 + %d)<nsw>) + 8*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 8*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)   <- new solution +++++
  -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({4,+,2}<%ok_146>)
  -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({2,+,1}<nw><%ok_146>)
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({0,+,4}<%ok_146>)
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({0,+,2}<%ok_146>)
  -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({0,+,1}<nuw><%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 2*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 4*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
  -16 + reg((16 + %d)<nsw>) + 8*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
  -12 + reg({(12 + (4 * (zext i32 %tmp156 to i64)) + %b),+,4}<nw><%ok_146>)
  -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({4,+,2}<%ok_146>)
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({0,+,4}<%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 1*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
  -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({6,+,2}<%ok_146>)
  -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({8,+,4}<%ok_146>)
  -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({12,+,4}<%ok_146>)
  -8 + reg(%b) + 4*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -12 + reg(%b) + 4*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -8 + reg((12 + %b)<nsw>) + 4*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 4*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)   <- new solution +++++
  -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({3,+,1}<nw><%ok_146>)
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({0,+,2}<%ok_146>)
  -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({0,+,1}<nuw><%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 2*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
  -12 + reg((12 + %b)<nsw>) + 4*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
New best at 3 regs, with addrec cost 1, plus 1 base add, plus 2 scale cost, plus 2 setup cost.
Regs: {%tmp156,+,1}<nuw><nsw><%ok_146> (16 + %d)<nsw> (12 + %b)<nsw>

The chosen solution requires 3 regs, with addrec cost 1, plus 1 base add, plus 2 scale cost, plus 2 setup cost:

LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
  reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
  reg({%tmp156,+,1}<nuw><nsw><%ok_146>) + imm(-1)
LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
  -16 + reg((16 + %d)<nsw>) + 8*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
  -12 + reg((12 + %b)<nsw>) + 4*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
'''

The new formulae let LSR choose a better solution.

Looking at the output tells me that I should definitely add something
to the textual representation of a formula that has an implicit zero
extend.
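
As a rough idea of what that could look like (a sketch with invented
names, not the actual change in the updated diff):

'''
#include <ostream>
#include <string>

// Sketch: a formula that is implicitly zero-extended prints its body
// wrapped in "zext(... to <type>)" so dumps like the ones above make
// the extension visible. 'FormulaSketch' and its fields are invented.
struct FormulaSketch {
  std::string Body;     // e.g. "reg({%tmp156,+,1}<nuw><nsw><%ok_146>)"
  std::string ZExtType; // e.g. "i64" if implicitly extended, else empty
};

void print(std::ostream &OS, const FormulaSketch &F) {
  if (!F.ZExtType.empty())
    OS << "zext(" << F.Body << " to " << F.ZExtType << ")";
  else
    OS << F.Body;
}
'''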

> Would you be able to test both post-inc and pre-inc variants? We've had
> a lot of bugs because of TransformForPostIncUse.

I think that is a good idea. I'm not familiar with LSR, so I suspect
it will take me some time to get there, but that is actually a good
reason to write these tests. :)
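
For reference, a hedged sketch of what "both variants" means here (the
shape and names are invented): the same IV can have a pre-inc use in the
loop body and a post-inc use in the exit test, and TransformForPostIncUse
is what maps the latter between the two forms.

'''
#include <cstdint>

int64_t walk(const int64_t *a, int64_t start, int64_t limit) {
  int64_t acc = 0;
  int64_t i = start;
  for (;;) {
    acc += a[i];    // pre-inc use: sees the IV before the bump
    i += 1;
    if (i >= limit) // post-inc use: sees the IV after the bump
      return acc;
  }
}
'''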

It also looks like (though I'm not 100% sure) there's an inconsistency
between how LSR treats overflow in post-inc expressions and how SCEV
treats it. For instance, in ScalarEvolution.cpp around line 1500 we say
that {S,+,X} is <nuw> if "S + MaxBECount * X" does not unsigned-overflow.
But that says nothing about the overflow behavior of the corresponding
post-inc SCEV expression (e.g. if BECount == 0 then all add recs are
trivially <nuw>).

The problem is that TransformForPostIncUse(zext({S,+,X})) =
zext({S+X,+,X}) -- see ScalarEvolutionNormalization.cpp, around line 100.

Now if {S,+,1} is nuw then {zext S,+,1} = zext({S,+,1}), but
{(zext S) + 1,+,1} is not necessarily equal to zext({S+1,+,1}) by SCEV's
definition of nuw. Yet TransformForPostIncUse({zext S,+,1}) =
TransformForPostIncUse(zext({S,+,1})) = zext({S+1,+,1}), when it should
really be {(zext S) + 1,+,1}. I don't have a concrete example where
this is a problem, but it definitely looks fishy to me.
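
To illustrate the suspected gap with concrete numbers (an invented i8
example, not taken from the patch): let S = 254, X = 1 and BECount = 1.
The pre-inc addrec {254,+,1} takes the values 254, 255 and is <nuw>, so
zext'ing it to i16 is fine. But its post-inc counterpart {255,+,1} wraps
to 0 on the second iteration, so zext({S+1,+,1}) and {(zext S) + 1,+,1}
disagree:

'''
#include <cstdint>
#include <cstdio>

int main() {
  const uint8_t S = 254;      // start of the <nuw> pre-inc addrec {S,+,1}
  const unsigned BECount = 1; // the loop body runs for i = 0 .. BECount

  for (unsigned i = 0; i <= BECount; ++i) {
    // zext({S+1,+,1}): evaluate the narrow post-inc addrec, then extend.
    uint8_t narrow = (uint8_t)(S + 1 + i);  // 255, then wraps to 0
    uint16_t transformed = narrow;          // 255, then 0
    // {(zext S) + 1,+,1}: extend first, then evaluate in the wide type.
    uint16_t correct = (uint16_t)((uint16_t)S + 1 + i); // 255, then 256
    printf("i=%u: zext({S+1,+,1}) = %u, {(zext S)+1,+,1} = %u\n",
           i, (unsigned)transformed, (unsigned)correct);
  }
  return 0; // at i = 1 the two expressions print 0 vs. 256
}
'''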

  • Sanjoy
sanjoy updated this revision to Diff 27362. Jun 8 2015, 11:30 PM
sanjoy edited edge metadata.

Address review comments:

  • change Formula::print to indicate that a specific formula is a zext formula
  • add a comment on LSRInstance::GenerateZExts
  • the existing test case exercises both pre-inc and post-inc uses (see the comments in the test). Please let me know if this is not what you meant.
qcolombet edited edge metadata. Jul 1 2015, 11:13 AM

FWIW, LGTM.

I'll let Andy comment on the latest changes.

Thanks,
Q.

atrick accepted this revision. Jul 26 2015, 5:01 PM
atrick edited edge metadata.

LGTM. Sorry for neglecting this one; it was clearly blocked on me.

This revision is now accepted and ready to land. Jul 26 2015, 5:01 PM
This revision was automatically updated to reflect the committed changes.