This is an archive of the discontinued LLVM Phabricator instance.

Differential D44523

Change calculation of MaxVectorSize
Abandoned (Public)

Authored by kparzysz on Mar 15 2018, 9:17 AM.

Details

Reviewers

craig.topper
dcaballe
hfinkel
hsaito
mkuper
rengolin
fhahn

Summary

Currently it's MaxVectorSize = WidestRegister / WidestType, but it seems like MaxVectorSize = WidestRegister / SmallestType would make a lot more sense. Is that a typo?

A bunch of lit tests for vectorizer fail with this change, because they look for very specific output. All CodeGen tests pass.

Diff Detail

Repository: rL LLVM

Event Timeline

kparzysz created this revision. Mar 15 2018, 9:17 AM

craig.topper added reviewers: dcaballe, hfinkel, hsaito. Mar 15 2018, 9:52 AM

Hi Krzysztof,

I'm afraid this is not a simple typo :). The current code is conservatively correct and just replacing WidestType with SmallestType would be problematic.
MaxVectorSize computes the maximum number of elements of any type that you can put in a physical vector (unsigned WidestRegister = TTI.getRegisterBitWidth(true);). For example, imagine that WidestRegister is 128 bits and we have double (64-bit) and char (8-bit) data types in the loop:

With the current code, MaxVectorSize = 128 / 64 = 2 elements/physical vector. This number of elements is OK for our loop because 2 doubles and 2 chars fit into a 128-bit vector.
With your proposed change, MaxVectorSize = 128 / 8 = 16 elements/physical vector. This number may be problematic for our loop because 16 chars fit into a 128-bit vector but 16 doubles don't. We would need to use 8 physical vector registers to pack 16 doubles!
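
The arithmetic in this example can be checked with a small standalone sketch. It is not LLVM code: `elementsPerRegister` and `registersNeeded` are hypothetical helpers introduced only for illustration, and the 128-bit register, 64-bit double, and 8-bit char are the values from the example.

```cpp
#include <cassert>

// Hypothetical helpers, not LLVM APIs. All sizes are in bits.

// How many elements of one type fit in a single physical register.
// Dividing by the widest type gives the current MaxVectorSize;
// dividing by the smallest type gives the proposed one.
unsigned elementsPerRegister(unsigned RegisterBits, unsigned TypeBits) {
  assert(TypeBits != 0 && "type width must be non-zero");
  return RegisterBits / TypeBits;
}

// How many physical registers are needed to hold VF elements of one type,
// rounded up to whole registers.
unsigned registersNeeded(unsigned VF, unsigned TypeBits, unsigned RegisterBits) {
  assert(RegisterBits != 0 && "register width must be non-zero");
  return (VF * TypeBits + RegisterBits - 1) / RegisterBits;
}
```

With these helpers, `elementsPerRegister(128, 64)` is 2 and `elementsPerRegister(128, 8)` is 16, and `registersNeeded(16, 64, 128)` is 8, which is the 8-register case described for 16 doubles.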

We use the term double/triple/... pumping vectorization when we have to use multiple (2/3/...) physical registers to pack some data types. Unfortunately, this approach is not always beneficial, since it may increase register pressure too much and lead to register spilling. If we wanted to enable something like this, we would need to add proper cost model support to evaluate whether double/triple/... pumping vectorization scenarios have a better cost than the standard approach.

I hope this is helpful.
Please let me know if you have any questions.

Thanks,
Diego

This is actually problematic for HVX because it creates short vectors. In a loop with vectorizable operations on i16 and i32, this calculates a VF of 16, and (using a cost model where everything is cheap) we get loads of <16 x i32> and loads of <16 x i16>. The former is good, because it matches the HVX register size. The latter is really bad, because the load is scalarized, which is highly expensive, to the point that it negates any benefit from vectorizing anything. In fact, with the cost reflected properly, the VF is calculated to be 2, which makes no use of HVX at all.
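
To make the short-vector case concrete, here is a minimal sketch of the widths involved. It assumes a 512-bit register, which is what "<16 x i32> matches the HVX register size" implies (16 * 32 = 512); `vectorWidthBits` and `fillsRegister` are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>

// Hypothetical helpers, not LLVM APIs. All sizes are in bits.

// Total width of a vector <VF x iN>, where N is ElementBits.
unsigned vectorWidthBits(unsigned VF, unsigned ElementBits) {
  return VF * ElementBits;
}

// True when <VF x iN> occupies exactly one physical register.
// A narrower vector is a "short vector" in the sense used above.
bool fillsRegister(unsigned VF, unsigned ElementBits, unsigned RegisterBits) {
  assert(RegisterBits != 0 && "register width must be non-zero");
  return vectorWidthBits(VF, ElementBits) == RegisterBits;
}
```

At VF = 16 (512 / 32), `fillsRegister(16, 32, 512)` holds, but `<16 x i16>` is only 256 bits wide, half a register. A VF of 32 would fill the register for i16 but needs two registers for i32.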

For us, the register pressure concern is pretty much a non-issue compared to the problems caused by unaligned short vectors.

Would it be possible to add a TTI callback that forces the minimum VF?

What would also work well for us would be to enable pumping for targets that want it, but leave it off by default. This wouldn't change any existing behavior, but would help HVX a great deal. In any case I'd be interested to hear your feedback on this, and any other suggestions you have on how to deal with the short vector issue.

In D44523#1039290, @kparzysz wrote:

Would it be possible to add a TTI callback that forces the minimum VF?

I think so. Something along the lines of TTI.getMinimumVF(ElementTy), I guess. For most arch/type combinations, that can return 2 (or 1).

In D44523#1039300, @kparzysz wrote:

What would also work well for us would be to enable pumping for targets that want it, but leave it off by default. This wouldn't change any existing behavior, but would help HVX a great deal. In any case I'd be interested to hear your feedback on this, and any other suggestions you have on how to deal with the short vector issue.

I think Michael Kuperstein was trying to enable double pumping (could be one year or more ago). I think it's best to dig it up. There may be something usable in there.

Would it be possible to add a TTI callback that forces the minimum VF?

This sounds reasonable to me as long as the cost for the minimum VF is still better than the scalar version.

I think Michael Kuperstein was trying to enable double pumping (could be one year or more ago). I think it's best to dig it up. There may be something usable in there.

Given Krzysztof's experiment, pumping doesn't seem to introduce stability issues in LV. calculateRegisterUsage seems to be handling some cost modeling for pumping (not sure if it's enough):

// A lambda that gets the register usage for the given type and VF.
auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) {
  if (Ty->isTokenTy())
    return 0U;
  unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType());
  return std::max<unsigned>(1, VF * TypeSize / WidestRegister);
};
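
For illustration, the formula in that lambda can be restated as a standalone function and evaluated on the 128-bit example from earlier in the thread. This is only a sketch: `regUsage` is a hypothetical name, and the `Type *` argument is replaced by a plain bit width so that it compiles without LLVM headers.

```cpp
#include <algorithm>
#include <cassert>

// Standalone restatement of the GetRegUsage formula (hypothetical name).
// TypeSize and WidestRegister are in bits; the result is the number of
// physical registers charged for VF elements of that type, at least 1.
unsigned regUsage(unsigned VF, unsigned TypeSize, unsigned WidestRegister) {
  assert(WidestRegister != 0 && "register width must be non-zero");
  return std::max<unsigned>(1, VF * TypeSize / WidestRegister);
}
```

With a 128-bit register, 16 doubles are charged 8 registers and 16 chars are charged 1, so pumped types do show up in the usage count. Note that the division truncates: an odd case such as `regUsage(3, 64, 128)` yields 1 rather than 2.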

Maybe it's just a question of changing MaxVectorSize as suggested, and refining the cost modeling, if necessary.
Let's wait for Michael.

dcaballe added a reviewer: mkuper. Mar 15 2018, 2:50 PM

a.elovikov added a subscriber: a.elovikov. Mar 16 2018, 4:36 AM

In D44523#1039543, @dcaballe wrote:

Maybe it's just a question of changing MaxVectorSize as suggested, and refining the cost modeling, if necessary.

The key here is moving to that one target at a time, i.e., when that target is ready --- and TTI is a good way of doing so.
So, if someone wants to make this "MaxVectorSize" determination TTI-based, I'd support the idea.

Adding a few more reviewers for discussion.

hsaito added reviewers: rengolin, fhahn. Mar 16 2018, 10:44 AM

In the meantime I'm working on a patch that introduces MinVF. It seems to be a bit complex, since the allowable VF range used to simply start at 1 and go up to MaxVF, but now it's {1} u [MinVF, MaxVF].
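
A rough sketch of what that candidate set looks like, assuming the vectorizer only considers power-of-two VFs. `candidateVFs` is a hypothetical helper for illustration and is not taken from the MinVF patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: enumerate {1} u [MinVF, MaxVF], restricted to
// powers of two. VF = 1 (scalar) always stays in the set.
std::vector<unsigned> candidateVFs(unsigned MinVF, unsigned MaxVF) {
  assert(MinVF >= 1 && "MinVF must be at least 1");
  assert(MaxVF <= (1u << 30) && "keeps VF *= 2 from overflowing");
  std::vector<unsigned> VFs = {1};
  for (unsigned VF = 2; VF <= MaxVF; VF *= 2)
    if (VF >= MinVF)
      VFs.push_back(VF);
  return VFs;
}
```

With MinVF = 1 and MaxVF = 8 this gives the old contiguous range 1, 2, 4, 8. With MinVF = 16 and MaxVF = 64 it gives 1, 16, 32, 64, and with MinVF above MaxVF only the scalar option remains.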

kparzysz mentioned this in D44574: [LV] Introduce TTI::getMinimumVF. Mar 16 2018, 12:02 PM

See MaximizeBandwidth, as in
llvm-dev's Enable vectorizer-maximize-bandwidth by default?
patch: Enable vectorizer-maximize-bandwidth by default.
which is still reverted afaik: r306936 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."

In D44523#1041288, @Ayal wrote:

See MaximizeBandwidth, as in
llvm-dev's Enable vectorizer-maximize-bandwidth by default?
patch: Enable vectorizer-maximize-bandwidth by default.
which is still reverted afaik: r306936 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."

Thanks, Ayal. I somehow thought that attempt was TTI-based, not just a flip of the default for all targets at once. I think we should enable it on a per-target basis.
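
As a sketch of what per-target enabling could look like: the hook name `shouldMaximizeVectorBandwidth` comes from the title of D44735, but its signature here, the `TargetHooks` and `WideVectorTarget` types, and `computeMaxVF` are all assumptions made for illustration. The sketch shows only the opt-in gating and leaves out any cost or register-pressure check.

```cpp
#include <cassert>

// Hypothetical stand-in for the target interface; not the real TTI class.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // Off by default, so existing targets keep their current behavior.
  virtual bool shouldMaximizeVectorBandwidth() const { return false; }
};

// A hypothetical target that opts in.
struct WideVectorTarget : TargetHooks {
  bool shouldMaximizeVectorBandwidth() const override { return true; }
};

// Pick the divisor for MaxVF based on the target's answer. Sizes in bits.
unsigned computeMaxVF(const TargetHooks &Target, unsigned WidestRegister,
                      unsigned SmallestType, unsigned WidestType) {
  assert(SmallestType != 0 && SmallestType <= WidestType);
  unsigned Divisor =
      Target.shouldMaximizeVectorBandwidth() ? SmallestType : WidestType;
  return WidestRegister / Divisor;
}
```

For the 128-bit example with 8-bit and 64-bit types, a default target still gets 2 while the opted-in target gets 16.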

kparzysz mentioned this in D44735: [LV] Add TTI::shouldMaximizeVectorBandwidth to allow enabling it per target. Mar 21 2018, 7:09 AM

kparzysz abandoned this revision. Apr 4 2018, 11:21 AM

Revision Contents

Path: lib/Transforms/Vectorize/LoopVectorize.cpp
Size: 2 lines

Diff 138575

lib/Transforms/Vectorize/LoopVectorize.cpp

LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,
   // Get the maximum safe dependence distance in bits computed by LAA.
   // It is computed by MaxVF * sizeOf(type) * 8, where type is taken from
   // the memory accesses that is most restrictive (involved in the smallest
   // dependence distance).
   unsigned MaxSafeRegisterWidth = Legal->getMaxSafeRegisterWidth();

   WidestRegister = std::min(WidestRegister, MaxSafeRegisterWidth);

-  unsigned MaxVectorSize = WidestRegister / WidestType;
+  unsigned MaxVectorSize = WidestRegister / SmallestType;

   DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType << " / "
                << WidestType << " bits.\n");
   DEBUG(dbgs() << "LV: The Widest register safe to use is: " << WidestRegister
                << " bits.\n");

   assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements"
                                 " into one vector!");