This is an archive of the discontinued LLVM Phabricator instance.

Change calculation of MaxVectorSize
AbandonedPublic

Authored by kparzysz on Mar 15 2018, 9:17 AM.

Details

Summary

Currently it's MaxVectorSize = WidestRegister / WidestType, but it seems like MaxVectorSize = WidestRegister / SmallestType would make a lot more sense. Is that a typo?

A bunch of lit tests for vectorizer fail with this change, because they look for very specific output. All CodeGen tests pass.

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz created this revision.Mar 15 2018, 9:17 AM

Hi Krzysztof,

I'm afraid this is not a simple typo :). The current code is conservatively correct and just replacing WidestType with SmallestType would be problematic.
MaxVectorSize is computing the maximum number of elements of any type that you can put in a physical vector (unsigned WidestRegister = TTI.getRegisterBitWidth(true);). For example, imaging that WidestRegister is 128-bit and we have double (64-bit) and char (8-bit) data types in the loop:

  • With the current code, MaxVectorSize = 128 / 64 = 2 elements/physical vector. This number of elements is OK for our loop because 2 doubles and 2 chars fit into a 128-bit vector.
  • With your proposed change, MaxVectorSize = 128 / 8 = 16 elements/physical vector. This number may "problematic" for our loop because 16 chars fit into a 128-bit vector but 16 doubles doesn't. We would need to use 8 physical vector registers to pack 16 doubles!

We use the term double/triple/... pumping vectorization when we have to use multiple 2/3/... physical register to pack some data types. Unfortunately, this approach is not always beneficial since it may increase too much the register pressure and lead to register spilling. If we wanted to enable something like this, we would need to add the proper cost model support to evaluate that double/triple/... pumping vectorization scenarios have better cost than the standard approach.

I hope this is helpful.
Please, let me know if you have any question.

Thanks,
Diego

This is actually problematic for HVX because it creates short vectors. In a loop with vectorizable operations on i16 and i32, this calculates the VF of 16, and (using cost model where everything is cheap) we have loads of <16 x i32> and loads of <16 x i16>. The former is good, because it matches the HVX register size. The second one is really bad, because the load is scalarized, which is highly expensive. Expensive to the point that it negates any benefit from vectorizing anything. In fact, with the cost reflected properly, the VF is calculated to be 2, which makes no use of HVX at all.

For us, the register pressure concern is pretty much a non-issue compared to the problems caused by unaligned short vectors.

Would it be possible to add a TTI callback that forces the minimum VF?

What would work well for us too would to enable pumping for targets that want it, but leave it off by default. This wouldn't change any existing behavior, but would help HVX a great deal. In any case I'd be interested to hear your feedback on this, if you have any other suggestions on how to deal with the short vector issue.

Would it be possible to add a TTI callback that forces the minimum VF?

I think so. Something along the lines of TTI.getMinimumVF(ElementTy), I guess. For most arch/type, that can return 2 (or 1).

What would work well for us too would to enable pumping for targets that want it, but leave it off by default. This wouldn't change any existing behavior, but would help HVX a great deal. In any case I'd be interested to hear your feedback on this, if you have any other suggestions on how to deal with the short vector issue.

I think Michael Kuperstein was trying to enable double pumping (could be one year or more ago). I think it's best to dig it up. There may be something usable in there.

Would it be possible to add a TTI callback that forces the minimum VF?

This sounds reasonable to me as long as the cost for the minimum VF is still better than the scalar version.

I think Michael Kuperstein was trying to enable double pumping (could be one year or more ago). I think it's best to dig it up. There may be something usable in there.

Given Krzysztof's experiment, pumping doesn't seem to introduce stability issues in LV. calculateRegisterUsage seems to be handling some cost modeling for pumping (not sure if it's enough):

// A lambda that gets the register usage for the given type and VF.
auto GetRegUsage = [&DL, WidestRegister](Type *Ty, unsigned VF) {
  if (Ty->isTokenTy())
    return 0U;
  unsigned TypeSize = DL.getTypeSizeInBits(Ty->getScalarType());
  return std::max<unsigned>(1, VF * TypeSize / WidestRegister);
};

Maybe it's just a question of changing MaxVectorSize as suggested, and refine the cost modeling, if necessary.
Let's wait for Michael.

hsaito added a comment.EditedMar 16 2018, 10:43 AM

Maybe it's just a question of changing MaxVectorSize as suggested, and refine the cost modeling, if necessary.

The key here is moving to that one target at a time, i.e., when that target is ready --- and TTI is a good way of doing so.
So, if someone wants to change this "MaxVectorSize" determination TTI based, I'd support the idea.

Adding a few more reviewers for discussion.

In the meantime I'm working on a patch that introduces MinVF. It seems to be a bit complex, since the allowable VF range used to simply start at 1 and go up to MaxVF, but now it's {1} u [MinVF, MaxVF].

Thanks, Ayal. I somehow thought that attempt was TTI based, not just the flip of the default for all targets at once. I think we should enable it per target basis.

kparzysz abandoned this revision.Apr 4 2018, 11:21 AM