This is an archive of the discontinued LLVM Phabricator instance.

[Loop Vectorize] Vectorize Loops with Backward Dependence
Needs ReviewPublic

Authored by DIVYA on Jul 31 2017, 2:26 PM.

Details

Summary

This is an LLVM pass intended to be run before the loop vectorizer. The pass checks whether there is a backward dependence between instructions, and if so, tries to reorder the instructions to convert a non-vectorizable loop into vectorizable form. It uses the LoopAccessAnalysis and MemorySSA analyses to check for dependences in order to reorder the instructions.

Worked in collaboration with Aditya Kumar

Diff Detail

Event Timeline

DIVYA created this revision.Jul 31 2017, 2:26 PM
fhahn added a subscriber: fhahn.Aug 1 2017, 1:31 AM
fhahn added a comment.Aug 1 2017, 11:57 AM

I just had a quick look and left a few comments. I'm just curious, what kind of testing did you do & did you run any benchmarks?

include/llvm/Transforms/Vectorize.h
128

Indent is off.

include/llvm/Transforms/Vectorize/LoopVectorizePred.h
3

This should go on line 1 I think, see first line of Vectorize.h for example.

lib/Transforms/IPO/PassManagerBuilder.cpp
831

I think people might be more comfortable with adding the new pass if there is an option to completely disable it, as in not adding it to the pass manager (like with LoopInterchange for example). This way, the pass can go in being disabled by default and makes it easier for people to test it/run benchmarks.

lib/Transforms/Vectorize/LoopVectorizePred.cpp
112

I think this function could do with slightly more comments, e.g. what the loops are doing and so on.

116

Is there a more meaningful name instead of temprit?

153

I think you could use Dep.isBackward(), which is shorter.

161

Could be a for loop?

DIVYA updated this revision to Diff 109176.EditedAug 1 2017, 12:26 PM
DIVYA updated this revision to Diff 109180.Aug 1 2017, 12:28 PM
DIVYA updated this revision to Diff 109231.Aug 1 2017, 2:59 PM
  • Added comments
  • Made option to disable pass
DIVYA marked 6 inline comments as done.Aug 1 2017, 3:02 PM
DIVYA added inline comments.
lib/Transforms/Vectorize/LoopVectorizePred.cpp
153

Dep.isBackward() would take the BackwardVectorizable, Backward, and BackwardVectorizableButPreventsForwarding cases into consideration. But here it is done only for Backward dependences.

hiraditya added inline comments.Aug 2 2017, 8:03 AM
include/llvm/Transforms/Vectorize/LoopVectorizePred.h
13

s/reorders/reorder

lib/Transforms/Vectorize/LoopVectorizePred.cpp
189

|= false is a noop.

hiraditya added inline comments.Aug 3 2017, 7:56 AM
lib/Transforms/Vectorize/LoopVectorizePred.cpp
115

We can make moved a bool

169

I think we do not need a dyn_cast here, we can just compare the instructions.

DIVYA updated this revision to Diff 109799.Aug 4 2017, 12:12 PM
DIVYA marked 4 inline comments as done.
jbhateja removed a subscriber: jbhateja.Aug 4 2017, 8:45 PM

@hfinkel, I'd appreciate your feedback and suggestions for improvement.
Thanks,

hfinkel added inline comments.Aug 9 2017, 10:07 PM
include/llvm/Transforms/IPO/PassManagerBuilder.h
151 ↗(On Diff #109799)

I don't think we need a special pass-manager option for this. We can put it behind a cl::opt flag, but once it is judged to be ready for the default pipeline, I imagine we'll always enable it whenever vectorization is enabled (at least at -O3).

include/llvm/Transforms/Vectorize/LoopVectorizePred.h
11

This first sentence seems inaccurate. It does not need to be run, nor does it need to run before the vectorizer. You can say that it is intended to be run before the vectorizer.

12

" instructions ,if so, " -> " instructions, and if so, "

14

"and generates target-independent LLVM-IR" - all passes do this unless otherwise noted. No need to say that here.

14

"The vectorizer uses" - Is this a statement about the vectorizer or this pass?

15

the LoopAccessAnalysis -> remove "the" and the extra spaces on this line.

lib/Transforms/Vectorize/LoopVectorizePred.cpp
13

See comments from the header file. They apply here too.

18

Please put some pseudocode/C examples here, before/after, showing what the transformation does.

58

space before and

59

remove comma before if.

112

Please add a comment explaining what this function does.

114

Local variable names need to start with a capital letter.

117

Is this algorithm quadratic?

119

Please be careful with the formatting of comments. Only once space in between words. A single space goes after a comma, none before.

131

Can you do this with std::find and op_begin/op_end?

176

It also makes the load loop invariant. Why does LICM not hoist it?

211

Please describe what this function does.

227

Please describe what this function does.

dberlin added inline comments.
lib/Transforms/Vectorize/LoopVectorizePred.cpp
112

This function looks like it tries to figure out whether the operands can be moved upward. You don't need to iterate through the instructions at all to do this.
You can just use the def-use chains and stop when your chain is above the other instruction.
If you need local-to-block ordering, use OrderedInstructions->dominates to tell when you are above the other instruction.

180

You are guaranteed this is a MemoryUseOrDef, since you asked about a load.
There is no need for a dyn_cast_or_null, a simple cast will suffice.

185

This check does not make much sense.
The right check to see if it's not dependent in the loop would be to see if the defining access is above the loop. As Hal says, that makes it loop invariant.
The right check to see if it's not dependent on the store would be to see if the defining access dominates the store (in the global sense, so if it's part of the same block, it must come before it), and is not part of an SCC of the MemorySSA use-def graph that contains the store's memory access.

(If it is not part of an SCC, it cannot be changing in the loop. If the store's memory access does not appear in the SCC, it cannot be part of whatever cycle affects this load.)

I.e., even in the case that the store is below the load:

a = memoryphi(foo, bar)
baz = memorydef(a)
memoryuse(baz)
bar = memorydef(baz)

This will give the right answer.

DIVYA updated this revision to Diff 110638.Aug 10 2017, 3:11 PM
DIVYA marked 16 inline comments as done.
DIVYA edited the summary of this revision. (Show Details)
  • Modified the checkDepAndReorder() function to use def-use chains and OrderedInstructions
  • Added extra checks using MemorySSA results and OrderedInstructions (in the reorderInstructions() function) to determine when checkDepAndReorder() should be invoked
  • Modified some comments and formatting
DIVYA marked 2 inline comments as done.Aug 10 2017, 3:17 PM
DIVYA added inline comments.
include/llvm/Transforms/IPO/PassManagerBuilder.h
151 ↗(On Diff #109799)

I had already added the cl::opt flag "EnableLoopVectorizePred" in PassManagerBuilder.cpp to provide an option to disable the pass. Do I need to add anything else?

DIVYA updated this revision to Diff 110726.Aug 11 2017, 7:48 AM
DIVYA marked an inline comment as done.
hfinkel added inline comments.Aug 14 2017, 12:11 PM
include/llvm/Transforms/IPO/PassManagerBuilder.h
151 ↗(On Diff #109799)

Remove this variable (make the cl::opt default to off for now). We don't want a proliferation of these for each pass.

include/llvm/Transforms/Vectorize/LoopVectorizePred.h
17

inorder -> in order

lib/Transforms/Vectorize/LoopVectorizePred.cpp
130

Do you mean "aliasing store instruction" or do you mean "backward-dependent store instruction"? I think the latter.

132

Is what you're doing here sufficient? What happens if you have memory-access instructions, function calls, etc. that need to be moved because of (potential) memory dependencies?

188

I'm somewhat concerned that a number of these cases are actually handling loop-invariant values, meaning that we're doing a suboptimal handling of something that LICM should handle more fully. The problem is that dealing with these after vectorization could be more difficult than before, and if we have a phase-ordering problem such that we're missing these cases, we might just end up with suboptimal code. Are you seeing these cases in practice?

201

Are you assuming here that the Phi must be the loop backedge? What if there's control flow in the loop (note that the vectorizer can do if conversion).

DIVYA updated this revision to Diff 111083.Aug 14 2017, 3:17 PM
DIVYA marked an inline comment as done.
  • Modified Comments
lib/Transforms/Vectorize/LoopVectorizePred.cpp
132

If there are any memory-access instructions or function calls on which the load instruction depends, then the MemorySSA analysis will find that out, and control will never reach this function.

188

I think loop-invariant code will already have been hoisted out of the loop before this pass. The defining access will be liveOnEntry for loads and stores from pointer arguments.
For example, in the code below the defining access for the load of a[i+1] is liveOnEntry; however, the load instruction is not loop invariant.

For the case:

void foo1(int n, int *restrict a, int *restrict b, int *restrict m) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i];
    m[i] = a[i + 1];
  }
}

201

Even if the Phi comes not from the loop backedge but from control flow, if the store instruction dominates the load, the load can still be moved above the store instruction, since the store instruction remains below the defining access (which is a MemoryPhi).

DIVYA marked an inline comment as done.Aug 14 2017, 3:19 PM
DIVYA updated this revision to Diff 111084.Aug 14 2017, 3:26 PM
dberlin added inline comments.Aug 14 2017, 3:31 PM
lib/Transforms/Vectorize/LoopVectorizePred.cpp
188

I'm surprised it can prove that it does not conflict with the store.
The load instruction actually is loop invariant in the sense of whether a given loaded address can change during the loop; it just takes on different loop-invariant values each iteration.

It is possible to hoist it out of this loop, it just requires making a different loop :)

In particular, this generates the same result:

void foo1(int n, int *restrict a, int *restrict b, int *restrict m) {
  for (int i = 0; i < n; i++)
    m[i] = a[i + 1];
  for (int i = 0; i < n; i++)
    a[i] = b[i];
}
mssimpso added inline comments.
lib/Transforms/Vectorize/LoopVectorizePred.cpp
188

I wonder if our loop distribution pass (disabled by default) can handle this case?

dberlin added a comment.EditedAug 15 2017, 8:05 AM
This comment has been deleted.
lib/Transforms/Vectorize/LoopVectorizePred.cpp
188

I also wonder if one of our passes could ever transform it to a memcpy, since it's just memcpy(m, a + 1, n * sizeof(*a)).

hiraditya added inline comments.Aug 15 2017, 10:16 AM
lib/Transforms/Vectorize/LoopVectorizePred.cpp
188

@hfinkel
Since LICM does not hoist the invariant load, we do not see any performance degradation with this patch. Also, this patch handles other cases where the memory access is not liveOnEntry. If LICM is improved to handle invariant loads as you stated, or if we run loop distribution before vectorization, then we can disable that part. Other cases, where the load is not loop invariant but can be reordered to enable vectorization, can still be useful.

@mssimpso
Maybe just improving LICM could achieve the distribution of this loop, and then running loop-idiom would convert it to a memcpy.
I think that since LICM does not use MemorySSA, it is unable to establish the invariance of the load.

hfinkel edited edge metadata.Aug 22 2017, 7:11 PM

Two high-level thoughts:

  1. Is this transformation always profitable even if we don't later vectorize? I worry that you're taking a stream of accesses that generally appear in order and making them appear out of order. That could have odd effects. Maybe this should all be a utility that the vectorizer uses more directly?
  2. I don't think a goal of this should be to provide backup handling of things that LICM misses but MemorySSA can prove are loop invariant. We should improve LICM by making it use MemorySSA (see work on this in D35741 - pushing this along might help?). We could run LICM right before we vectorize too.

Adding comments from @dberlin which got missed for some reason:

I'm definitely with Hal on the second point. We already know there are deficiencies there, and the right way is to improve LICM.
If the goal is "move loop invariants", we want one pass that does that, not six.

It's fine to have multiple passes when the job is so large that it can't sanely be done optimally by one thing, but I don't think this is one of those cases.

Two high-level thoughts:

  1. Is this transformation always profitable even if we don't later vectorize? I worry that you're taking a stream of accesses that generally appear in order and making them appear out of order. That could have odd effects. Maybe this should all be a utility that the vectorizer uses more directly?

Agreed, we will evaluate whether we can use this pass as a utility for the vectorizer. In the past we tried that but got stuck because we were unable to recompute loop-access-info once the instructions had been reordered. Maybe you can help us here.

  2. I don't think a goal of this should be to provide backup handling of things that LICM misses but MemorySSA can prove are loop invariant. We should improve LICM by making it use MemorySSA (see work on this in D35741 - pushing this along might help?). We could run LICM right before we vectorize too.

The goal is to vectorize instructions by reordering, it just happens to catch cases missed by LICM. Once D35741 is merged, it will still catch cases currently missed by vectorizer e.g., when loads are not loop-invariant.

@hfinkel
we tried adding this patch to the loop vectorizer, but after reordering the instructions, canVectorizeMemory() needs to be called again to check whether the loop can now be vectorized. However, we are unable to recompute loop-access-info once the instructions have been reordered, so it still holds the old loop-access-info (from before reordering) and does not vectorize. Is there any way to recompute loop-access-info after reordering the instructions?

Do you want to recompute the LAA, or just update it to remove the dependencies that you'll later remove when you generate the vectorized code? Ideally, the reordering would just become part of the vectorization plan (the VPlan infrastructure is now committed). Maybe you can update the instruction ordering in the MemDepChecker's ordering map, and then have it recompute dependencies with the new ordering. We could have a corresponding ordering map in the vectorization plan as well.