This is an archive of the discontinued LLVM Phabricator instance.

[LV] Introduce TTI::getMinimumVF
ClosedPublic

Authored by kparzysz on Apr 4 2018, 10:53 AM.

Details

Summary

The function getMinimumVF(ElemWidth) will return the minimum VF for a vector with elements of size ElemWidth bits. This value will only apply to targets for which TTI::shouldMaximizeVectorBandwidth returns true. The value of 0 indicates that there is no minimum VF.

This is a follow-up to D44735.

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz created this revision.Apr 4 2018, 10:53 AM

I think TTI part also looks reasonable, but I'll wait for Craig or others to comment on that.

lib/Transforms/Vectorize/LoopVectorize.cpp
6163 ↗(On Diff #140993)

This change in the vectorizer looks reasonable to me.

test/Transforms/LoopVectorize/Hexagon/minimum-vf.ll
1 ↗(On Diff #140993)

Sorry for being picky.

I happen to think this test is somehow unnecessarily large. Can't we have a much smaller/simpler test case? I think all we need is widest type guiding to less than 64 and/or constant trip guiding to less than 64, right? Or am I missing a point?

5 ↗(On Diff #140993)

Please add a comment saying that hard coded 9 comes from the constant trip count, in case someone has to maintain the test case and/or validation later.

kparzysz added inline comments.Apr 4 2018, 1:54 PM
test/Transforms/LoopVectorize/Hexagon/minimum-vf.ll
5 ↗(On Diff #140993)

I have reduced this testcase. The MaxVF of 9 actually came from register usage calculation. It was a coincidence that the iteration count was 9 as well.

kparzysz updated this revision to Diff 141052.Apr 4 2018, 1:55 PM

Reduced the testcase.

hsaito added inline comments.Apr 4 2018, 2:55 PM
test/Transforms/LoopVectorize/Hexagon/minimum-vf.ll
5 ↗(On Diff #140993)

Can't go something as simple as the equivalent of this? If that's the case why?
for(i=0;i<9;i++){

a[i]+=1; // int add
b[i]+=1; // char add

}

kparzysz added inline comments.Apr 5 2018, 8:09 AM
test/Transforms/LoopVectorize/Hexagon/minimum-vf.ll
5 ↗(On Diff #140993)

I've tried this testcase, let's call it sl.c:

int a[9];
char b[9];

void foo() {
  for (unsigned i = 0; i < 9; ++i) {
    a[i]++;
    b[i]++;
  }
}

Output from clang -target hexagon -O2 -mhvx -mllvm -hexagon-autohvx -S sl.c -fno-unroll-loops -mllvm -debug-only=loop-vectorize:

LV: Checking a loop in "foo" from sl.c
LV: Interleaving disabled by the pass manager
LV: Loop hints: force=? width=0 unroll=1
LV: Found a loop: for.body
LV: Found an induction variable.
LV: We can vectorize this loop!
LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
LV: Found trip count: 9
LV: The Smallest and Widest types: 8 / 32 bits.
LV: The Widest register safe to use is: 512 bits.
LV(REG): Calculating max register usage:
LV: Found uniform instruction:   %exitcond = icmp eq i32 %inc3, 9
LV: Found uniform instruction:   %arrayidx = getelementptr inbounds [9 x i32], [9 x i32]* @a, i32 0, i32 %i.08
LV: Found uniform instruction:   %arrayidx1 = getelementptr inbounds [9 x i8], [9 x i8]* @b, i32 0, i32 %i.08
LV: Found uniform instruction:   %i.08 = phi i32 [ 0, %entry ], [ %inc3, %for.body ]
LV: Found uniform instruction:   %inc3 = add nuw nsw i32 %i.08, 1
LV: Found scalar instruction:   %i.08 = phi i32 [ 0, %entry ], [ %inc3, %for.body ]
LV: Found scalar instruction:   %inc3 = add nuw nsw i32 %i.08, 1
LV: Found uniform instruction:   %exitcond = icmp eq i32 %inc3, 9
LV: Found uniform instruction:   %arrayidx = getelementptr inbounds [9 x i32], [9 x i32]* @a, i32 0, i32 %i.08
LV: Found uniform instruction:   %arrayidx1 = getelementptr inbounds [9 x i8], [9 x i8]* @b, i32 0, i32 %i.08
LV: Found uniform instruction:   %i.08 = phi i32 [ 0, %entry ], [ %inc3, %for.body ]
LV: Found uniform instruction:   %inc3 = add nuw nsw i32 %i.08, 1
LV: Found scalar instruction:   %i.08 = phi i32 [ 0, %entry ], [ %inc3, %for.body ]
LV: Found scalar instruction:   %inc3 = add nuw nsw i32 %i.08, 1
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #5 Interval # 1
LV(REG): At #6 Interval # 2
LV(REG): At #7 Interval # 3
LV(REG): At #9 Interval # 1
LV(REG): At #10 Interval # 1
LV(REG): VF = 32
LV(REG): Found max usage: 2
LV(REG): Found invariant usage: 0
LV(REG): VF = 64
LV(REG): Found max usage: 4
LV(REG): Found invariant usage: 0
LV: Aborting. A tail loop is required with -Os/-Oz.
LV: Vectorization is possible but not beneficial.
LV: Interleaving is not beneficial.

The VF calculated above is 64, which it what we would have wanted. On the other hand, with the testcase from this patch, the relevant output looks like this:

...
LV(REG): At #106 Interval # 2
LV(REG): At #108 Interval # 1
LV(REG): At #109 Interval # 1
LV(REG): VF = 18
LV(REG): Found max usage: 36
LV(REG): Found invariant usage: 16
LV: Overriding calculated MaxVF(9) with target's minimum: 64

Ping.

I have no more comments. You probably want to add a few more reviewers who are familiar with TTI.

Since you're ok with the vectorizer change, I'll commit this. If someone has concerns regarding the TTI interface, I'll address it post-commit.

This revision was not accepted when it landed; it landed in state Needs Review.Apr 13 2018, 1:19 PM
This revision was automatically updated to reflect the committed changes.