This is an archive of the discontinued LLVM Phabricator instance.

SLP: honor requested max vector size merging PHIs
ClosedPublic

Authored by rampitec on Jun 19 2020, 12:19 PM.

Details

Summary

At the moment this place does not check the maximum vector size set
by TTI and just creates the widest vectors possible.

Diff Detail

Event Timeline

rampitec created this revision.Jun 19 2020, 12:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 19 2020, 12:19 PM
rampitec updated this revision to Diff 272159.Jun 19 2020, 12:30 PM

Cleanup the test.

rampitec updated this revision to Diff 272165.Jun 19 2020, 12:50 PM

Fixed test checks.

Why it is required?

The whole pass only creates vectors up to the size of a vector register provided by the target; not checking it in this place is simply an omission.

In practice, the testcase is reduced from a real app where SLP created two <32 x float> vectors and RA ran out of registers. Both have to be live in the same loop, and we did not have a sufficient number of registers.

Actually, that was the goal in some cases: to create vectors of the maximum possible length so that later they could be split into several smaller vectors. I assume this change may increase compile time on regular targets, where we can create large vectors without problems.
I think you need to fix the cost of vectorizing large vectors for your target, so that SLP considers it unprofitable to vectorize them.

Isn't it inconsistent to use one width for arithmetic and another for phis? In particular, I am getting more instructions after SLP for the actual app behind this test than without it, because it forms <32 x float> phis and then lots of extractelement and insertelement instructions to get to the actual elements and perform arithmetic on them.

I don't think it is an inconsistency. Actually, the larger the vectors we are able to build, the better. It reduces compile time significantly, at least, and most probably leads to better vectorization.
Actually, it would be good if you could commit the test before the patch to see the difference in the transformation. But use the script to generate the checks; do not do it manually.

rampitec updated this revision to Diff 272435.Jun 22 2020, 7:51 AM

Pre-commited test and rebased.

rampitec updated this revision to Diff 272444.Jun 22 2020, 8:18 AM

Test update.

ABataev added inline comments.Jun 22 2020, 8:18 AM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

It really makes the vectorization worse, in general. Most of these inserts/extracts will be transformed into simple shuffles by the instcombiner. And if there really is a problem with target-specific limitations, it is better to adapt the cost model rather than introduce changes that may affect all targets. Maybe we need to fix TTI::getTypeLegalizationCost?

Actually, it would be good if you could commit the test before the patch to see the difference in the transformation. But use the script to generate the checks; do not do it manually.

Done. This test goes down from 683 to 607 lines (using wc -l), which should compile faster as far as I understand.

rampitec marked an inline comment as done.Jun 22 2020, 1:20 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

We are certainly working on improving our cost model, although it will not help here. I have experimented, and we really need to lie a lot to avoid vectorization for a case like this. Note, though, that the 128-bit case was suppressed by the cost model of a generic processor, which believes it is not profitable, so I have updated the test to 256 bit.

The main issue is a PHI of a wide vector type; we do not need anything else to run into the problem, and BoUpSLP::getEntryCost() does not even check the cost of a PHI:

switch (ShuffleOrOp) {
  case Instruction::PHI:
    return 0;

That said, I also do not believe it can be correctly solved by a cost model. This is not a cost problem; it is a question of the ability to generate code correctly. The cost model covers instruction size, throughput, and latency, but it does not cover register pressure.

If we believe there are targets out there which may benefit from arbitrarily wide vectorization, I can expose yet another TTI callback. We have TTI::getRegisterBitWidth() and the option -slp-max-reg-size; I could add TTI::getRealRegisterBitWidth() and -slp-real-max-reg-size. Alternatively, targets believing in unconditionally good wide vectors may return 1024 from getRegisterBitWidth(), right?

ABataev added inline comments.Jun 22 2020, 1:29 PM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

Yes, I thought about it; that may be a good alternative solution. Maybe just return 0 and in this case do not check the size at all?

I think part of the problem is that the cost model inherits a lot of the bad SelectionDAG mentality of what constitutes "legal". It would be good if we could move the IR heuristics away from the concept of legal types.

rampitec marked an inline comment as done.Jun 22 2020, 2:03 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

The easiest way is to return UINT_MAX, right? I think 0 logically means "we do not have it at all, no vectorization please".

ABataev added inline comments.Jun 22 2020, 2:33 PM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

I'm fine with it.

rampitec marked an inline comment as done.Jun 22 2020, 3:08 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

I believe that's how it works now; the same effect as -slp-max-reg-size=-1.

I haven't seen this cause damage in x86 examples, but it seems similar to D68667 (although the recent comments there suggest we may want to refine the predicate).

It is indeed similar, although it adds a check in yet another place. These checks are just not used consistently.

spatel accepted this revision.Jul 7 2020, 10:33 AM

We can't expect the backend to efficiently lower arbitrary vector IR and/or unlimited register pressure, so there's always going to be a need to limit IR in ways like this. LGTM, but wait a bit to commit in case there are more comments.

This revision is now accepted and ready to land.Jul 7 2020, 10:33 AM
This revision was automatically updated to reflect the committed changes.
jonpa added a subscriber: jonpa.Nov 11 2020, 10:05 AM

I am afraid that this patch actually has a bad impact on performance on SystemZ, and unfortunately this was not known until now. It would be much appreciated if we could rework this and get back the old behaviour on SystemZ somehow...

See https://bugs.llvm.org/show_bug.cgi?id=48155

I believe that if a target wants wider vectors, it needs to increase the size returned from its TTIImpl::getRegisterBitWidth(). Can you try increasing the value returned from SystemZTTIImpl::getRegisterBitWidth()?

I did try to override this with -slp-max-reg-size, and that works... However, getRegisterBitWidth() is also used by other passes, like the LoopVectorizer, so it seems wrong to change that value just for the purpose of tuning a particular optimization...

The other pass which calls getRegisterBitWidth(true) is LoopVectorize. Do you mean you want to have different heuristics for loop and straight-line vectorization?

Well, the definition of that hook per the comment is "The width of the largest scalar or vector register type", so I don't see how it could be a variable to play with. It should simply reflect the size of the vector register - 128 bits for SystemZ.

In the original discussion there was a suggestion to look into the TTI costs on your target for those very wide vector types, a <32 x ...> PHI instruction...? Why isn't it enough to use TTI?

Why would it make sense to only vectorize to <2 x double> and not <4 x double>? The latter is just 2 vector regs, and that is completely fine... In my case it is obvious that the final result of the vectorizer is greatly improved by allowing an over-wide vector type, even though in the most simple case 2 x <2 x double> should give the same output as a split <4 x double>.... I am not sure yet exactly why this makes for many more vector fp-add/fp-mul in the output... Note that with your patch those instructions are not vectorized at all anymore, but are left scalar! So there is some vectorization that is lost by always doing max <2 x double> and never wider...

I wonder why is it better to do 2 x <2 x double> rather than <4 x double>, they will both use two vector registers... (not just for PHIs, but generally)?

The other pass which calls getRegisterBitWidth(true) is LoopVectorize. Do you mean you want to have different heuristics for loop and straight-line vectorization?

Well, the definition of that hook per the comment is "The width of the largest scalar or vector register type", so I don't see how it could be a variable to play with. It should simply reflect the size of the vector register - 128 bits for SystemZ.

Well, probably the name of the callback does not really reflect its use. Its actual use is to set the width of the vectorization. When used with Vector = true it affects exactly two places: it sets the vectorization width for the loop and SLP vectorizers.
Earlier in the comments there seemed to be a consensus that a target which wants wider vectorization should really return a bigger number from getRegisterBitWidth().

In the original discussion there was a suggestion to look into the TTI costs on your target for those very wide vector types, a <32 x ...> PHI instruction...? Why isn't it enough to use TTI?

The problem with using costs returned from TTI is exactly this: it was ignored here and vectorization of PHI was trying to grab as much as it could.

Why would it make sense to only vectorize to <2 x double> and not <4 x double>? The latter is just 2 vector regs, and that is completely fine... In my case it is obvious that the final result of the vectorizer is greatly improved by allowing an over-wide vector type, even though in the most simple case 2 x <2 x double> should give the same output as a split <4 x double>.... I am not sure yet exactly why this makes for many more vector fp-add/fp-mul in the output... Note that with your patch those instructions are not vectorized at all anymore, but are left scalar! So there is some vectorization that is lost by always doing max <2 x double> and never wider...

I wonder why is it better to do 2 x <2 x double> rather than <4 x double>, they will both use two vector registers... (not just for PHIs, but generally)?

Making it wider than we can actually lower is bad in two ways:

  1. It eliminates the possibility to dead-code unused lanes.
  2. What was much more important, it requires the allocation of a wider register. In our case it was literally asking for registers 1024 bits wide (yes, we can have such tuples), and that leads to spilling and even an inability to allocate registers in some cases.

We also observed a regression on x86 for imagick after this patch. I'm not sure whether we observed the same case, but it was definitely related to this patch.
Here is what happened:
The SLP vectorizer started from 5 PHIs of i64 type, and it turned out, unfortunately, that only the last four of them were perfectly vectorizable with VF=4.
The max register size blindly cut that list, taking the first four of them, which were unprofitable to vectorize, and thus left the good group outside of the possible try-out window. SLP finally ended up vectorizing just two of them with VF=2.
Before the patch, the vectorizer, after a failed attempt to vectorize the first four scalars, took the next four (the last in the list) and succeeded.
Assuming the max vector size is a power of two, it probably makes sense to cut the list at 2*MaxRegSize-1 elements rather than at MaxRegSize.
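The windowing problem described in this comment can be sketched with a toy model. This is a hypothetical Python sketch, not the actual LLVM code: `find_group` stands in for SLP's scan over adjacent candidates, and `profitable` is a stand-in for the cost-model query.

```python
def find_group(phis, max_vf, profitable):
    """Scan a list of PHI candidates one element at a time and return
    the first group of max_vf adjacent candidates that the cost model
    accepts, or None. This mimics the pre-patch behaviour of retrying
    from the next instruction after a failed attempt."""
    for start in range(len(phis) - max_vf + 1):
        group = phis[start:start + max_vf]
        if profitable(group):
            return group
    return None

# Five PHIs where only the last four form a profitable VF=4 group
# (a stand-in for the imagick scenario described above).
phis = ["p0", "p1", "p2", "p3", "p4"]
profitable = lambda g: g == ["p1", "p2", "p3", "p4"]

# Scanning the full list finds the good group, but cutting the
# candidate list to the first max_vf elements hides it -- which is
# why a wider window (2*MaxVF-1) or retrying from the next
# instruction recovers the lost opportunity.
print(find_group(phis, 4, profitable))      # ['p1', 'p2', 'p3', 'p4']
print(find_group(phis[:4], 4, profitable))  # None: the good group was cut away
```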

jonpa added a comment.Nov 20 2020, 8:35 AM

I think I am seeing something very similar on SystemZ: the important BB has 5 PHIs of type double, where the first one has different operand opcodes (constant) than the others. Before this patch, that group of 5 was passed to tryToVectorizeList(), which first tried 0-3, which failed due to the different operands of PHI#0, but then succeeded with 1-4. This patch changed that so that instead groups of just 2 are passed to tryToVectorizeList(), where 0-1 fail, and then it seems some VF=2 vectorization takes place instead.

By using -slp-max-reg-size=320 I get exactly 5 elements, and the old behaviour for this example is restored, with the result of <4 x double> instructions. If I instead changed the loop in vectorizeChainsInBlock() to retry on the next instruction in the group ('IncIt = SameTypeIt' => 'IncIt++'), vectorization is now tried with groups [1,2] and [3,4], which I thought might be as good as the [1,2,3,4] group, which will be split later anyway. This however did not work: only [3,4] was actually vectorized, because [1,2] was considered too expensive. I tried adjusting with a negative value for -slp-threshold, but even though that enabled more vectorization, the performance was not improved at all.

The performance improvement is only present with VF=4, which also adds 3 shufflevectors so it seems that SLP can shuffle a few vectors and produce better code than at VF=2.

The problem with using costs returned from TTI is exactly this: it was ignored here and vectorization of PHI was trying to grab as much as it could.

It seems that might simply be missing then? tryToVectorizeList() calls getTreeCost(), which queries TTI for these costs, so it seems this is the place to increase the cost to avoid the vectorization, no? Could you provide a test case and explain why this approach is not working?

Just because some other place is using this heuristic doesn't mean that it is necessarily optimal - it could very well be the other way around so that it really shouldn't be limited in those places either...

It seems that might simply be missing then? tryToVectorizeList() calls getTreeCost(), which queries TTI for these costs, so it seems this is the place to increase the cost to avoid the vectorization, no? Could you provide a test case and explain why this approach is not working?

It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.

Just because some other place is using this heuristic doesn't mean that it is necessarily optimal - it could very well be the other way around so that it really shouldn't be limited in those places either...

I can understand the argument that controlling it with register size might not be the best approach. In this case we can just expose another target callback, specifically for vectorization purposes.

It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.

Yeah - but if you have it and post it here we can work together on it... Maybe an .ll/.bc file with a runline which gives in the output what you need to avoid. Maybe even an llc runline on that to show the spilling...

I can understand the argument that controlling it with register size might not be the best approach. In this case we can just expose another target callback, specifically for vectorization purposes.

Would it make sense to have SLP first try the full group, and then as long as it's not profitable reiterate with half of the previous group size? In other words first try 32 in your case, and then start over with a max of 16, then 8, all the way down to 2 unless TTI costs returned a profitable total cost? That is one idea.. An alternative might be to have SLP look at the tree it wants to convert and do a register pressure estimate and add that to the total cost... This is assuming that greater VFs are beneficial, which I at least think they are at the moment...
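The halving idea suggested above can be sketched as follows. This is a hypothetical Python sketch under stated assumptions: `best_vf` stands in for the proposed retry loop, and `profitable` is a stand-in for the TTI cost query, not an actual LLVM API.

```python
def best_vf(num_candidates, max_vf, profitable):
    """Try the largest power-of-two VF that fits the candidate list
    (capped at max_vf) first, halving on each failure, and return the
    first VF the cost model accepts, or None to stay scalar.
    Assumes num_candidates >= 2 and max_vf is a power of two."""
    vf = min(max_vf, 1 << (num_candidates.bit_length() - 1))
    while vf >= 2:
        if profitable(vf):
            return vf
        vf //= 2
    return None

# 32 candidates, but the (hypothetical) cost model only accepts VF <= 8:
print(best_vf(32, 32, lambda vf: vf <= 8))  # 8
# Nothing is profitable: fall back to scalar code.
print(best_vf(5, 32, lambda vf: False))     # None
```

The design trade-off debated in the thread shows up in the cap: a target that cannot lower wide vectors cheaply supplies a small max_vf, while a target that benefits from splitting large vectors later can leave it effectively unbounded.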

It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.

Yeah - but if you have it and post it here we can work together on it... Maybe an .ll/.bc file with a runline which gives in the output what you need to avoid. Maybe even an llc runline on that to show the spilling...

That was a really long time ago. I was trying to find the original bc from the failing app, but cannot find it anymore :(

I can understand the argument that controlling it with register size might not be the best approach. In this case we can just expose another target callback, specifically for vectorization purposes.

Would it make sense to have SLP first try the full group, and then as long as it's not profitable reiterate with half of the previous group size? In other words first try 32 in your case, and then start over with a max of 16, then 8, all the way down to 2 unless TTI costs returned a profitable total cost? That is one idea.. An alternative might be to have SLP look at the tree it wants to convert and do a register pressure estimate and add that to the total cost... This is assuming that greater VFs are beneficial, which I at least think they are at the moment...

That is, if you believe that wider vectorization is a bonus. It might be on some targets; it is definitely not for AMDGPU. We have very wide register tuples, but really only 2-element vector ALU instructions (and 4-element vector loads and stores). Nonetheless, since we have these wide registers, RA will use them, thus increasing register pressure. The generated code will use subregs of these wide tuples. In fact, just by returning a twice wider result for the vector register size, I see an increase in the number of consumed registers, which directly lowers the performance.

Then for a target which does not have such registers there is no such option and vectors will be split at lowering. And that's the real difference here.

If for some reason vector register width is not a good enough driver for the vectorization, I would rather create yet another target callback. It just happens that register width is currently used across llvm to control it, but we can change it.

jonpa added a comment.Nov 21 2020, 1:43 AM

It seems like we would want to retry vectorization from the next instruction in the group if the current group failed. It also seems like we want to try greater VFs first. This seems to be what tryToVectorizeList() is doing when it gets all of the group as input (like before this patch). This patch breaks both of these points: it retries by skipping the *whole* failed group, so for instance with a group of 5 PHIs where the first one is different, it will not restart at PHI#1 but at PHI#2, which is bad. It also forces a small VF, which seems to miss vectorization opportunities.

Instead of just updating this patch with a new hook would it perhaps be even better to put that hook inside tryToVectorizeList()? I see MinVF and MaxVF being set there and maybe that's a better place to cap VF? That way you would get the restart on the next instruction... Otherwise I agree it would be a good start to just use a new hook in the same place to get the old behavior back on SystemZ and other targets.

That makes sense to me, i.e. try to restart vectorization in tryToVectorizeList() but still make sure we do not produce vectors wider than requested by TTI.

jonpa added a comment.Nov 24 2020, 3:10 AM

Would you like to give that a try? I think this patch could basically be reverted, and then some new hook would be needed to control MinVF/MaxVF in tryToVectorizeList(). Personally, I think it could be nice with something like TTI->getMaximumVF() or even getMaximumSLPVF()... Probably also good with a test case for your target, if possible, that is supposed to not produce any spilling...

WRT the testcase, it is really the testcase in the patch with a single modification right before ret void:

store float %phi31, float* undef
ret void

That keeps the value alive, and then if I feed opt's output into llc I either get spilling with wide vectors (first case) or not (second case):

opt -slp-vectorizer -slp-max-reg-size=1024 -S < slp-max-phi-size.ll | ~/work/llvm/rel/bin/llc -march=amdgcn -mcpu=gfx900
opt -slp-vectorizer -slp-max-reg-size=256 -S < slp-max-phi-size.ll | ~/work/llvm/rel/bin/llc -march=amdgcn -mcpu=gfx900

You can check it here: D92047
Not sure how appropriate it is to commit a target test using llc into the Transforms directory.

I am experimenting with it now. I have reverted the patch and instead added this into tryToVectorizeList():

   unsigned MaxVF = std::max<unsigned>(PowerOf2Floor(VL.size()), MinVF);
+  MaxVF = std::min(R.getMaxVecRegSize() / Sz, MaxVF);

I.e. clamp the list to the MaxVecRegSize. It should still allow reordering.
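In concrete numbers, the clamp from the diff above behaves as sketched below. This is a standalone re-implementation for illustration; powerOf2Floor is reproduced here rather than taken from llvm/Support/MathExtras.h, and computeMaxVF only models the two lines shown in the diff:

```cpp
#include <algorithm>

// Equivalent of llvm::PowerOf2Floor for this sketch: largest power of two
// less than or equal to V (0 for V == 0).
unsigned powerOf2Floor(unsigned V) {
  unsigned P = 0;
  for (unsigned C = 1; C != 0 && C <= V; C <<= 1)
    P = C;
  return P;
}

// Models MaxVF as computed in tryToVectorizeList() with the extra clamp:
//   MaxVF = max(PowerOf2Floor(VL.size()), MinVF)
//   MaxVF = min(MaxVecRegSize / Sz, MaxVF)
unsigned computeMaxVF(unsigned NumValues, unsigned MinVF,
                      unsigned MaxVecRegSizeBits, unsigned ElemSizeBits) {
  unsigned MaxVF = std::max(powerOf2Floor(NumValues), MinVF);
  return std::min(MaxVecRegSizeBits / ElemSizeBits, MaxVF);
}
```

For the 32-float PHI group from the test case: with -slp-max-reg-size=256 and 32-bit elements this gives min(256/32, 32) = 8, while with -slp-max-reg-size=1024 the full width of 32 survives, matching the two llc invocations above.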

I'd say that solves the initial problem and is probably a better solution overall. For instance, I see less vectorization with AMDGPU in cases where it should not have been vectorized in the first place.
However, the impact of that change will be less overall vectorization across all targets, since tryToVectorizeList() is used not just for PHIs.
That can be fixed by the new target callback if its default value is UINT_MAX or just some large number. That default is somewhat subpar, as I would expect a better default to be MaxVecRegSize, but I guess it can be decided per target.

Please check D92059, I believe that is what we have discussed here.