This is an archive of the discontinued LLVM Phabricator instance.

Reassociate GEP operands for loop invariant code motion
AbandonedPublic

Authored by jingyue on Apr 20 2015, 6:26 PM.

Details

Reviewers
meheff
hfinkel
Summary

Rename SeparateConstOffsetFromGEP pass to ReassociateGEPs and
expand to include reassociation of loop invariant terms in the
indices of GEPs. This reassociation enables splitting GEPs into
two separate GEPS: one loop invariant and the other loop variant
GEPs. The existing reassociation pass is unable to perform this
transformation because that pass does not reassociate values
across GEP operands.

The renamed pass now performs three related GEP transformations:

  1. Reassociate and gather constant offsets in GEP sequential index operands.
  2. Reassociate loop invariant terms in GEP sequential index operands.
  3. (Optional) Lower GEPs into simpler GEPs or their arithmetic equivalents.

This patch adds (2) and includes significant refactoring to share
code between the constant offset reassociation and the loop
invariant reassociation.

Diff Detail

Repository
rL LLVM

Event Timeline

meheff updated this revision to Diff 24071.Apr 20 2015, 6:26 PM
meheff retitled this revision from to Reassociate GEP operands for loop invariant code motion.
meheff updated this object.
meheff edited the test plan for this revision. (Show Details)
meheff added reviewers: jingyue, hfinkel.
meheff set the repository for this revision to rL LLVM.
meheff added a subscriber: Unknown Object (MLST).
sanjoy added a subscriber: sanjoy.Apr 20 2015, 7:08 PM

This looks like it is sharing intent with the loop strength reduction pass. I don't know if that is intentional, just pointing it out.

E.g. -loop-reduce does this to gep_with_const:

define void @gep_with_const(i32* %input, i32* %output, i64 %a) {
entry:
  %0 = add i64 %a, 1234
  %scevgep = getelementptr i32, i32* %input, i64 %0
  br label %loop

loop:                                             ; preds = %loop, %entry
  %lsr.iv1 = phi i32* [ %scevgep2, %loop ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %loop ], [ 1000000, %entry ]
  %1 = load i32, i32* %lsr.iv1
  store i32 %1, i32* %output
  %lsr.iv.next = add nsw i64 %lsr.iv, -1
  %scevgep2 = getelementptr i32, i32* %lsr.iv1, i64 1
  %exitcond = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond, label %loop.exit, label %loop

loop.exit:                                        ; preds = %loop
  ret void
}
jingyue edited edge metadata.Apr 20 2015, 9:11 PM

Hi Sanjoy,

One reason we couldn't simply leverage LSR is that LSR forgets nsw and would miss cases such as gep input, sext(a +nsw i) in simple_licm. Cases where indices are 32-bit and pointers are 64-bit are pretty common for GPU programs, because

  1. most NVIDIA and AMD GPUs natively support only i32, and
  2. the host side that drives GPU programs usually runs on 64-bit CPUs, and communicates (e.g. via unified memory) with GPU via 64-bit pointers.

Indvar widening alleviates this nsw issue for lots of architectures. However, it's not a good option for GPU programs again because most GPUs support only i32 natively. If LSR fails to simplify the loop, then indvar widening can negatively affect performance (https://llvm.org/bugs/show_bug.cgi?id=21148) because 64-bit arithmetic is much more expensive than 32-bit. P.S. maybe we can narrow an induction variable back to its original size on LSR failure?

Mark and Wei seem to observe significant speedup by applying this pass to X86 where I believe indvar widening is on. What's the story there? Why didn't indvar widening + LSR hoist loop invariants?

One reason we couldn't simply leverage LSR is that LSR forgets nsw and would miss cases such as gep input, sext(a +nsw i) in simple_licm.

I'm not sure this is correct. LSR uses represents "registers" using
normal SCEV expressions. If SCEV can see that an add recurrence does
not overflow, LSR should see that too. I've only recently started
looking at LSR so maybe I'm missing some insight here.

Indvar widening alleviates this nsw issue for lots of architectures. However, it's not a good option for GPU programs again because most GPUs support only i32 natively. If LSR fails to simplify the loop, then indvar widening can negatively affect performance (https://llvm.org/bugs/show_bug.cgi?id=21148) because 64-bit arithmetic is much more expensive than 32-bit. P.S. maybe we can narrow an induction variable back to its original size on LSR failure?

I ran into a similar (if not exactly the same) issue recently:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084465.html
(this is why I've been looking at LSR).

I concluded that the fundamental reason LSR is regressing performance
is that (as mentioned in the email I've linked to) it does not
consider formulae of the form (sext X) or (zext X). Another way of
saying "narrow an induction variable back to its original size on LSR
failure" is "consider a direct zext (or sext) of the narrowed
induction variable as one possible way to satisfy a use and then do
your normal profit / cost calculation as usual".

  • Sanjoy

Indvar widening alleviates this nsw issue for lots of architectures. However, it's not a good option for GPU programs again because most GPUs support only i32 natively. If LSR fails to simplify the loop, then indvar widening can negatively affect performance (https://llvm.org/bugs/show_bug.cgi?id=21148) because 64-bit arithmetic is much more expensive than 32-bit. P.S. maybe we can narrow an induction variable back to its original size on LSR failure?

Actually, widening the induction variable looks like an orthogonal
issue to re-associating GEPs. As mentioned in the bug, the right fix
to the widening issue is to not widen if wide integer ops are more
expensive for the target -- if LSR is widening indvars when it
shouldn't perhaps that part of LSR should be predicated on a target
specific hook also?

  • Sanjoy
wmi added a subscriber: wmi.Apr 20 2015, 10:11 PM

I ran opt -scalar-evolution -analyze on simple_licm and got

Printing analysis 'Scalar Evolution Analysis' for function 'simple_licm':
Classifying expressions for: @simple_licm
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  -->  {0,+,1}<nuw><nsw><%loop> U: [0,1000000) S: [0,1000000)           Exits: 999999
  %idx = add nsw i32 %a, %i
  -->  {%a,+,1}<nw><%loop> U: full-set S: full-set              Exits: (999999 + %a)
  %idx.sext = sext i32 %idx to i64
  -->  (sext i32 {%a,+,1}<nw><%loop> to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)            Exits: (sext i32 (999999 + 
%a) to i64)
  %arrayidx = getelementptr i32, i32* %input, i64 %idx.sext
  -->  ((4 * (sext i32 {%a,+,1}<nw><%loop> to i64)) + %input) U: full-set S: full-set           Exits: ((4 * (sext i32 (999999 + %a) to i64
)) + %input)
  %0 = load i32, i32* %arrayidx
  -->  %0 U: full-set S: full-set               Exits: <<Unknown>>
  %i.next = add nuw nsw i32 %i, 1
  -->  {1,+,1}<nuw><nsw><%loop> U: [1,1000001) S: [1,1000001)           Exits: 1000000
Determining loop execution counts for: @simple_licm
Loop %loop: backedge-taken count is 999999
Loop %loop: max backedge-taken count is 999999

As far as I can see,

%idx = add nsw i32 %a, %i
-->  {%a,+,1}<nw><%loop> U: full-set S: full-set              Exits: (999999 + %a)

%idx has only <nw> (self-wrap) flag but not <nsw>.

I noticed you recently worked on strengthening ScalarEvolution's handling of nsw, which is pretty awesome! Does it address this issue?

LSR doesn't widen indvars; IndVarSimplify which lives in a separate pass
does that. And then yes, I disabled induction variable widening (but not
LSR) for the NVPTX backend by looking at arithmetic costs (
http://reviews.llvm.org/D6196).

Speaking of indvar widening, all I meant was, if the wider bit-width were
not an issue and we could enable indvar widening in NVPTX, LSR would be
more effective on our cases because ScalarEvolution needn't worry about
sext/zext any more.

First round.

That's a *lot* of refactoring work. Appreciate it!

lib/Target/NVPTX/NVPTXTargetMachine.cpp
172

Should we run LICM after ReassociateGEPs?

lib/Target/PowerPC/PPCTargetMachine.cpp
266

extract loop invariants

lib/Transforms/Scalar/ReassociateGEPs.cpp
126

minor: (1) and (2) are used before; try other numbering.

332

I think LLVM generally prefers

bool visit(Value *V, bool &ContinueSearch) override;
616

Is Found necessary? Do we need to look at the second operand if already found what we want in the first operand?

629

What's the invariant between UserChain and Found? My understanding was, when Found is true, UserChain is non-empty containing the user chain from V to the found value, but the code here clears UserChain when Found.

634

Looks like we cannot do much with ConstantExprs either. Maybe just dyn_cast<Instruction> to pre-filter out more cases.

688

s/FindConstOffset/findConstOffset

760

Can you comment on what the rank computes? Is it the deepest loop-level V can recursively reach?

770

An instruction can use a value in a deeper loop if the loop is in a non-LCSSA form. Does MaxRank = L->getLoopDepth() still make sense?

784

s/FindLoopInvariantValue/findLoopInvariantValue/

or may be just findLoopInvariant?

786

I've seen UserChain passed to lots of methods of HoistableValueFinder. Can we make it global to HoistableValueFinder?

830

Is recursing into subtract a TODO or problematic?

1135

I was initially confused when I thought I saw

add %a, %b = gep %base, 42, %idx, %foo, %bar

:)

How about giving the gep a name instead of assigning it to blank? e.g.

%idx = add %a, %b
%p = gep %base, 42, %idx, %foo, %bar
1248

Can we split splitGEPForConst and splitGEPForLICM? The general logic seems:

if we can strip const offset from gep
  gep = GEP with const offset stripped
splitGEPForLICM()
1359

I'll come back to this function later. Looks like after we refactor splitGEP, splitGEP and splitAndLowerGEP can be merged?

test/Transforms/ReassociateGEPs/NVPTX/split-licm-gep.ll
122

Nice! :)

jingyue commandeered this revision.Mar 15 2016, 10:30 AM
jingyue edited reviewers, added: meheff; removed: jingyue.
jingyue added a subscriber: jingyue.
jingyue abandoned this revision.Mar 15 2016, 10:54 AM

Sorry, I pressed the wrong button. I meant to say "needs revision". Feel free to reclaim this patch.

test/Transforms/ReassociateGEPs/NVPTX/split-licm-gep.ll