This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Transforms/Vectorize/
-
Transforms/
-
Vectorize/
2
LoopVectorize.cpp
-
test/Transforms/LoopVectorize/SystemZ/
-
Transforms/
-
LoopVectorize/
-
SystemZ/
3
maxVF_truncload.ll

Differential D52685

[LoopVectorizer] Adjust heuristics for a truncated load
AbandonedPublic

Authored by jonpa on Sep 29 2018, 5:58 AM.

Download Raw Diff

Details

Reviewers

hfinkel
uweigand
MatzeB
fhahn

Summary

The max VF is based on the smallest and widest types in the loop, which is returned by getSmallestAndWidestTypes(). As an example, on SystemZ a vector register is 128 bits. So if a i64 or double is defined, the feasible Max VF is 2.

My observation is that if a load is in effect truncated, it would make sense to see if the VF derived from the truncated type is also evaluated with cost analysis. For instace, if a load of double is truncated to float, VF 4 could be more interesting than VF 2, so MaxVF should be 4. Note that this only means VF 2 is compared to VF 4.

In practice, this seems to affect mostly load -> store type of loops on SystemZ (SPEC) (but if there would be a loop with some more operations on the narrow type, the win would be bigger). Therefore I am not so sure of the worth of this, but it seems reasonable to me. ~80 loops have slightly less instructions per iteration of original loop on z13, it seems.

Also not sure if it should 'continue' instead of taking the truncated type (given that the truncate should be present in the loop and therefore also checked).

I saw one loop that got slightly worse. It was load -> store loop, where the i16 loads were truncated to i8. Since two loads were stored with interleaving on the store, the vector store was now in 2 vector registers. This gave worse code on SystemZ for some reason, even though it should have just scaled, I think. Generally, either the target should return a higher cost, or fix the backend to reflect the cost, so that the lower VF is selected in this case.

I could make a test if the reviewers think the concept is a good. There are no test failures caused by this.

Diff Detail

Event Timeline

jonpa created this revision.Sep 29 2018, 5:58 AM

Ping!

Added test case which serves as a good example where the loop body has an i64 load which is truncated to i32 and stored. With VF=2, which is the maxVF on trunk since 128/64 is 2, the vectorized loop becomes:

vl      %v0, 0(%r14)
vpkg    %v0, %v0, %v0
vsteg   %v0, 0(%r13), 0
la      %r3, 2(%r3)
la      %r13, 8(%r13)
la      %r14, 16(%r14)
cgrjlh  %r1, %r3, .LBB0_5

With this patch, VF 4 is also considered, which is in this case better and selected by the cost heuristics:

vl      %v0, 16(%r14)
vl      %v1, 0(%r14)
vpkg    %v0, %v1, %v0
vst     %v0, 0(%r13)
la      %r3, 4(%r3)
la      %r13, 16(%r13)
la      %r14, 32(%r14)
cgrjlh  %r1, %r3, .LBB0_5

I agree it does make sense to me to consider VF 4 in this test case, and the resulting code is better on Z.

Ping!

Ayal added a subscriber: Ayal.Nov 2 2018, 2:37 AM

We were wondering why not simply consider the scalar type of all not-to-be-ignored instructions in the loop instead. We're traversing them all anyway here.

(PR39497 provides a vectorizable loop having no loads nor stores, but it's quite odd.)

lib/Transforms/Vectorize/LoopVectorize.cpp
4698	Call `*I.user_begin()`, or rather `user_back()`, once instead of thrice? Checking `isa<LoadInst>` is somewhat redundant. Taking the smaller `T` helps reduce `MinWidth`, but may also reduce `MaxWidth`.
test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll
5	`Smallest` type should indeed be 32 bits, but `Widest` should (remain) 64 bits; unless the type of the original load should be ignored?

jonpa added inline comments.Nov 2 2018, 4:53 AM

test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll
5	I am not sure I follow here - 'Smallest' was 32 even without this patch, but 'Widest' was 64, right?

I prefer not to add this kind of "heuristics" on the functions that should just return "the fact". Even if the loaded value is truncated immediately, the load itself needs to be vectorized.

These kinds of issues should be handled by enabling TTI.shouldMaximizeVectorBandwidth() and then apply appropriate cost modeling fix on it. If we keep adding those little things
to places that doesn't really belong, we'll never fix the cost model properly to enable TTI.shouldMaximizeVectorBandwidth(). Please use this as an incentive to spend time on the cost
model. Please don't get confused by the name of the function. It just means vectorizer has more VF choices to lower the cost of executing the loop.

Thanks,
Hideki

Ayal added inline comments.Nov 3 2018, 9:39 AM

lib/Transforms/Vectorize/LoopVectorize.cpp

4698

One way of affecting only MinWidth:

{
  Type *TruncT = (*I.user_begin())->getType();
  MinWidth = std::min(MinWidth,
                      (unsigned)DL.getTypeSizeInBits(TruncT->getScalarType()));
}

test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll

Ahh, right. There are two distinct issues here:

how are Smallest and Widest types computed?
which of Smallest or Widest type is used for setting MaxVF?

For the second issue, as @hsaito pointed out, see https://reviews.llvm.org/D44523.

The first issue, e.g., expanding "64 / 64 bits" into "32 / 64 bits" in the following revised example where both load and store are 64 bits but compute is 32 bits, would benefit from your patch:

define void @fun(i64 %n, i32 %v, i64* %ptr, i64* %dst) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
  %addr = getelementptr inbounds i64, i64* %ptr, i64 %i
  %l64 = load i64, i64* %addr
  %conv = trunc i64 %l64 to i32
  %add = add i32 %conv, %v
  %sext = sext i32 %add to i64
  %addr1 = getelementptr inbounds i64, i64* %dst, i64 %i
  store i64 %sext, i64* %addr1
  %iv.next = add nuw nsw i64 %i, 1
  %cmp = icmp slt i64 %iv.next, %n
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret void
}

Thanks for feedback and explanations! I will abandon this, and instead aim at enabling TT.shouldMaximizeVectorBandwidth().

In D52685#1288573, @jonpa wrote:

Thanks for feedback and explanations! I will abandon this, and instead aim at enabling TT.shouldMaximizeVectorBandwidth().

Thanks. That's great. In a long run, it would be much better.

Revision Contents

Path

Size

lib/

Transforms/

Vectorize/

LoopVectorize.cpp

5 lines

test/

Transforms/

LoopVectorize/

SystemZ/

maxVF_truncload.ll

26 lines

Diff 168809

lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 4,686 Lines • ▼ Show 20 Lines	for (Instruction &I : *BB) {
// been selected. Here, we assume that if an access can be		// been selected. Here, we assume that if an access can be
// vectorized, it will be. We should also look at extending this		// vectorized, it will be. We should also look at extending this
// optimization to non-pointer types.		// optimization to non-pointer types.
//		//
if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I) &&		if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I) &&
!isAccessInterleaved(&I) && !isLegalGatherOrScatter(&I))		!isAccessInterleaved(&I) && !isLegalGatherOrScatter(&I))
continue;		continue;

		// If the loaded value is truncated, consider the truncated type.
		if (isa<LoadInst>(&I) && I.hasOneUse() &&
		(isa<TruncInst>(I.user_begin()) \|\| isa<FPTruncInst>(I.user_begin())))
		T = (*I.user_begin())->getType();
		AyalUnsubmitted Not Done Reply Inline Actions Call `I.user_begin()`, or rather `user_back()`, once instead of thrice? Checking `isa<LoadInst>` is somewhat redundant. Taking the smaller `T` helps reduce `MinWidth`, but may also reduce `MaxWidth`. Ayal:* Call `*I.user_begin()`, or rather `user_back()`, once instead of thrice? Checking…
		AyalUnsubmitted Not Done Reply Inline Actions One way of affecting only MinWidth: { Type TruncT = (I.user_begin())->getType(); MinWidth = std::min(MinWidth, (unsigned)DL.getTypeSizeInBits(TruncT->getScalarType())); } Ayal: One way of affecting only MinWidth: ``` { Type TruncT = (I.user_begin())…

MinWidth = std::min(MinWidth,		MinWidth = std::min(MinWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
MaxWidth = std::max(MaxWidth,		MaxWidth = std::max(MaxWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
}		}
}		}

return {MinWidth, MaxWidth};		return {MinWidth, MaxWidth};
▲ Show 20 Lines • Show All 2,582 Lines • Show Last 20 Lines

test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll

This file was added.

				; RUN: opt < %s -mtriple=s390x-unknown-linux -mcpu=z13 -o - -loop-vectorize \
				; RUN: -debug-only=loop-vectorize 2>&1 \| FileCheck %s
				; REQUIRES: asserts
				;
				; CHECK: LV: The Smallest and Widest types: 32 / 32 bits.
				AyalUnsubmitted Not Done Reply Inline Actions `Smallest` type should indeed be 32 bits, but `Widest` should (remain) 64 bits; unless the type of the original load should be ignored? Ayal: `Smallest` type should indeed be 32 bits, but `Widest` should (remain) 64 bits; unless the type…
				jonpaAuthorUnsubmitted Not Done Reply Inline Actions I am not sure I follow here - 'Smallest' was 32 even without this patch, but 'Widest' was 64, right? jonpa: I am not sure I follow here - 'Smallest' was 32 even without this patch, but 'Widest' was 64…
				AyalUnsubmitted Not Done Reply Inline Actions Ahh, right. There are two distinct issues here: how are Smallest and Widest types computed? which of Smallest or Widest type is used for setting MaxVF? For the second issue, as @hsaito pointed out, see https://reviews.llvm.org/D44523. The first issue, e.g., expanding "64 / 64 bits" into "32 / 64 bits" in the following revised example where both load and store are 64 bits but compute is 32 bits, would benefit from your patch: define void @fun(i64 %n, i32 %v, i64* %ptr, i64* %dst) { entry: br label %for.body for.body: %i = phi i64 [ 0, %entry ], [ %iv.next, %for.body ] %addr = getelementptr inbounds i64, i64* %ptr, i64 %i %l64 = load i64, i64* %addr %conv = trunc i64 %l64 to i32 %add = add i32 %conv, %v %sext = sext i32 %add to i64 %addr1 = getelementptr inbounds i64, i64* %dst, i64 %i store i64 %sext, i64* %addr1 %iv.next = add nuw nsw i64 %i, 1 %cmp = icmp slt i64 %iv.next, %n br i1 %cmp, label %for.body, label %for.end for.end: ret void } Ayal: Ahh, right. There are two distinct issues here: # how are Smallest and Widest types computed?
				; CHECK: LV: Selecting VF: 4.

				define void @fun(i64 %n, i32 %v, i64* %ptr, i32* %dst) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%addr = getelementptr inbounds i64, i64* %ptr, i64 %i
				%l64 = load i64, i64* %addr
				%conv = trunc i64 %l64 to i32
				%addr1 = getelementptr inbounds i32, i32* %dst, i64 %i
				store i32 %conv, i32* %addr1
				%iv.next = add nuw nsw i64 %i, 1
				%cmp = icmp slt i64 %iv.next, %n
				br i1 %cmp, label %for.body, label %for.end

				for.end:
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Adjust heuristics for a truncated loadAbandonedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 168809

lib/Transforms/Vectorize/LoopVectorize.cpp

test/Transforms/LoopVectorize/SystemZ/maxVF_truncload.ll

[LoopVectorizer] Adjust heuristics for a truncated load
AbandonedPublic