This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Adjust heuristics for a truncated load
AbandonedPublic

Authored by jonpa on Sep 29 2018, 5:58 AM.

Details

Summary

The max VF is based on the smallest and widest types in the loop, which is returned by getSmallestAndWidestTypes(). As an example, on SystemZ a vector register is 128 bits. So if a i64 or double is defined, the feasible Max VF is 2.

My observation is that if a load is in effect truncated, it would make sense to see if the VF derived from the truncated type is also evaluated with cost analysis. For instace, if a load of double is truncated to float, VF 4 could be more interesting than VF 2, so MaxVF should be 4. Note that this only means VF 2 is compared to VF 4.

In practice, this seems to affect mostly load -> store type of loops on SystemZ (SPEC) (but if there would be a loop with some more operations on the narrow type, the win would be bigger). Therefore I am not so sure of the worth of this, but it seems reasonable to me. ~80 loops have slightly less instructions per iteration of original loop on z13, it seems.

Also not sure if it should 'continue' instead of taking the truncated type (given that the truncate should be present in the loop and therefore also checked).

I saw one loop that got slightly worse. It was load -> store loop, where the i16 loads were truncated to i8. Since two loads were stored with interleaving on the store, the vector store was now in 2 vector registers. This gave worse code on SystemZ for some reason, even though it should have just scaled, I think. Generally, either the target should return a higher cost, or fix the backend to reflect the cost, so that the lower VF is selected in this case.

I could make a test if the reviewers think the concept is a good. There are no test failures caused by this.

Diff Detail

Event Timeline

jonpa created this revision.Sep 29 2018, 5:58 AM
jonpa updated this revision to Diff 168809.Oct 9 2018, 8:25 AM

Ping!

Added test case which serves as a good example where the loop body has an i64 load which is truncated to i32 and stored. With VF=2, which is the maxVF on trunk since 128/64 is 2, the vectorized loop becomes:

vl      %v0, 0(%r14)
vpkg    %v0, %v0, %v0
vsteg   %v0, 0(%r13), 0
la      %r3, 2(%r3)
la      %r13, 8(%r13)
la      %r14, 16(%r14)
cgrjlh  %r1, %r3, .LBB0_5

With this patch, VF 4 is also considered, which is in this case better and selected by the cost heuristics:

vl      %v0, 16(%r14)
vl      %v1, 0(%r14)
vpkg    %v0, %v1, %v0
vst     %v0, 0(%r13)
la      %r3, 4(%r3)
la      %r13, 16(%r13)
la      %r14, 32(%r14)
cgrjlh  %r1, %r3, .LBB0_5

I agree it does make sense to me to consider VF 4 in this test case, and the resulting code is better on Z.

Ayal added a subscriber: Ayal.Nov 2 2018, 2:37 AM
Ayal added a subscriber: dorit.Nov 2 2018, 3:23 AM

We were wondering why not simply consider the scalar type of all not-to-be-ignored instructions in the loop instead. We're traversing them all anyway here.

(PR39497 provides a vectorizable loop having no loads nor stores, but it's quite odd.)

lib/Transforms/Vectorize/LoopVectorize.cpp
4698

Call *I.user_begin(), or rather user_back(), once instead of thrice?
Checking isa<LoadInst> is somewhat redundant.
Taking the smaller T helps reduce MinWidth, but may also reduce MaxWidth.

test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll
5

Smallest type should indeed be 32 bits, but Widest should (remain) 64 bits; unless the type of the original load should be ignored?

jonpa added inline comments.Nov 2 2018, 4:53 AM
test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll
5

I am not sure I follow here - 'Smallest' was 32 even without this patch, but 'Widest' was 64, right?

hsaito added a subscriber: hsaito.Nov 2 2018, 4:39 PM

I prefer not to add this kind of "heuristics" on the functions that should just return "the fact". Even if the loaded value is truncated immediately, the load itself needs to be vectorized.

These kinds of issues should be handled by enabling TTI.shouldMaximizeVectorBandwidth() and then apply appropriate cost modeling fix on it. If we keep adding those little things
to places that doesn't really belong, we'll never fix the cost model properly to enable TTI.shouldMaximizeVectorBandwidth(). Please use this as an incentive to spend time on the cost
model. Please don't get confused by the name of the function. It just means vectorizer has more VF choices to lower the cost of executing the loop.

Thanks,
Hideki

Ayal added inline comments.Nov 3 2018, 9:39 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
4698

One way of affecting only MinWidth:

{
  Type *TruncT = (*I.user_begin())->getType();
  MinWidth = std::min(MinWidth,
                      (unsigned)DL.getTypeSizeInBits(TruncT->getScalarType()));
}
test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll
5

Ahh, right. There are two distinct issues here:

  1. how are Smallest and Widest types computed?
  2. which of Smallest or Widest type is used for setting MaxVF?

For the second issue, as @hsaito pointed out, see https://reviews.llvm.org/D44523.

The first issue, e.g., expanding "64 / 64 bits" into "32 / 64 bits" in the following revised example where both load and store are 64 bits but compute is 32 bits, would benefit from your patch:

define void @fun(i64 %n, i32 %v, i64* %ptr, i64* %dst) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %addr = getelementptr inbounds i64, i64* %ptr, i64 %i
  %l64 = load i64, i64* %addr
  %conv = trunc i64 %l64 to i32
  %add = add i32 %conv, %v
  %sext = sext i32 %add to i64
  %addr1 = getelementptr inbounds i64, i64* %dst, i64 %i
  store i64 %sext, i64* %addr1
  %iv.next = add nuw nsw i64 %i, 1
  %cmp = icmp slt i64 %iv.next, %n
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret void
}
jonpa abandoned this revision.Nov 6 2018, 4:38 AM

Thanks for feedback and explanations! I will abandon this, and instead aim at enabling TT.shouldMaximizeVectorBandwidth().

Thanks for feedback and explanations! I will abandon this, and instead aim at enabling TT.shouldMaximizeVectorBandwidth().

Thanks. That's great. In a long run, it would be much better.