SLP: honor requested max vector size merging PHIs
ClosedPublic

Authored by rampitec on Jun 19 2020, 12:19 PM.

Details

Summary

At the moment this place does not check the maximum vector size set
by TTI and just creates the widest vectors possible.

Diff Detail

Event Timeline

rampitec created this revision.Jun 19 2020, 12:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 19 2020, 12:19 PM
rampitec updated this revision to Diff 272159.Jun 19 2020, 12:30 PM

Cleanup the test.

rampitec updated this revision to Diff 272165.Jun 19 2020, 12:50 PM

Fixed test checks.

Why it is required?

The whole pass only creates vectors up to the size of a vector register provided by the target; not checking it in this one place is simply an omission.

In practice, the testcase is reduced from a real app where SLP created two <32 x float> vectors and the register allocator ran out of registers. Both vectors have to be live in the same loop, and we did not have a sufficient number of registers.

Actually, that was the goal in some cases: to create vectors of the maximum possible length so that later they could be split into several smaller vectors. I assume this change may increase compile time on regular targets, where we can create large vectors without problems.
I think you need to fix the cost of vectorizing large vectors for your target, so that SLP considers it unprofitable to vectorize them.

Isn't it inconsistent to use one width for arithmetic and another for PHIs? In particular, I am getting more instructions after SLP for the actual app behind this test than without it, because it forms <32 x float> PHIs and then lots of extractelement and insertelement instructions to get at the actual elements and perform arithmetic on them.

I don't think it is an inconsistency. Actually, the larger the vectors we are able to build, the better: it reduces compile time significantly, at least, and most probably leads to better vectorization.
Actually, it would be good if you could commit the test before the patch to see the difference in the transformation. But use the script to generate the checks; do not do it manually.

rampitec updated this revision to Diff 272435.Jun 22 2020, 7:51 AM

Pre-committed test and rebased.

rampitec updated this revision to Diff 272444.Jun 22 2020, 8:18 AM

Test update.

ABataev added inline comments.Jun 22 2020, 8:18 AM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

It really makes the vectorization worse in general. Most of these inserts/extracts will be transformed into simple shuffles by the instcombiner. And if there is really a problem with target-specific limitations, it is better to adapt the cost model rather than introduce changes that may affect all targets. Maybe we need to fix TTI::getTypeLegalizationCost?

Done; the test is pre-committed and the checks were generated with the script. The test goes down from 683 to 607 lines (using wc -l), which should compile faster as far as I understand.

rampitec marked an inline comment as done.Jun 22 2020, 1:20 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

We are certainly working on improving our cost model, although it will not help here. I have experimented, and we really need to lie a lot to avoid vectorization for a case like this. Note, though, that the 128-bit case was suppressed by the cost model of a generic processor, which believes it is not profitable, so I have updated the test to 256 bit.

The main issue is a PHI of a wide vector type; we do not need anything else to run into the problem, and BoUpSLP::getEntryCost() does not even check the cost of a PHI:

switch (ShuffleOrOp) {
  case Instruction::PHI:
    return 0;

That said, I also do not believe it can be correctly solved by a cost model. This is not a cost problem; it is a question of the ability to generate correct code. The cost model covers instruction size, throughput, and latency, but it does not cover register pressure.

If we believe there are targets out there which may benefit from arbitrarily wide vectorization, I can expose yet another TTI callback. We have TTI::getRegisterBitWidth() and the option -slp-max-reg-size; I can add TTI::getRealRegisterBitWidth() and -slp-real-max-reg-size. Alternatively, targets believing in unconditionally good wide vectors may return 1024 from getRegisterBitWidth(), right?

ABataev added inline comments.Jun 22 2020, 1:29 PM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

Yes, I thought about it; that may be a good alternative solution. Maybe just return 0 and, in this case, do not check for size at all?

I think part of the problem is that the cost model inherits a lot of the bad SelectionDAG mentality of what constitutes "legal". It would be good if we could move the IR heuristics away from the concept of legal types.

rampitec marked an inline comment as done.Jun 22 2020, 2:03 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

The easiest way is to return UINT_MAX, right? I think 0 logically means "we do not have it at all, no vectorization please".

ABataev added inline comments.Jun 22 2020, 2:33 PM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

I'm fine with it.

rampitec marked an inline comment as done.Jun 22 2020, 3:08 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

I believe that's how it works now, the same effect as -slp-max-reg-size=-1

I haven't seen this cause damage in x86 examples, but it seems similar to D68667 (although the recent comments there suggest we may want to refine the predicate).

It is indeed similar, although it adds a check in yet another place. These checks are just not used consistently.

spatel accepted this revision.Jul 7 2020, 10:33 AM

We can't expect the backend to efficiently lower arbitrary vector IR and/or unlimited register pressure, so there is always going to be a need to limit IR in ways like this. LGTM, but wait a bit to commit in case there are more comments.

This revision is now accepted and ready to land.Jul 7 2020, 10:33 AM
This revision was automatically updated to reflect the committed changes.