This is an archive of the discontinued LLVM Phabricator instance.

SLP: honor requested max vector size merging PHIs
ClosedPublic

Authored by rampitec on Jun 19 2020, 12:19 PM.

Details

Summary

At the moment this place does not check the maximum vector size set
by TTI and just creates the widest vectors possible.

Diff Detail

Event Timeline

rampitec created this revision.Jun 19 2020, 12:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 19 2020, 12:19 PM
rampitec updated this revision to Diff 272159.Jun 19 2020, 12:30 PM

Cleanup the test.

rampitec updated this revision to Diff 272165.Jun 19 2020, 12:50 PM

Fixed test checks.

Why it is required?

The whole pass only creates vectors up to the size of a vector register provided by the target; not checking it in this place is simply an omission.

In practice, the testcase is reduced from a real app where SLP created two <32 x float> vectors and RA ran out of registers. Both have to be live in the same loop, and we did not have a sufficient number of registers.

Actually, that was the goal in some cases: to create vectors of the maximum possible length so that later they could be split into several smaller vectors. I assume this change may increase compile time on regular targets, where we can create large vectors without problems.
I think you need to fix the cost of vectorizing large vectors for your target, so that SLP considers it unprofitable to vectorize them.

Isn't it inconsistent to use one width for arithmetic and another for phis? In particular, I am getting more instructions after SLP for the actual app behind this test than without it, because it forms <32 x float> phis and then lots of extractelement and insertelement instructions to get to the actual elements and perform arithmetic on them.

I don't think it is an inconsistency. Actually, the larger the vectors we are able to build, the better. It reduces compile time significantly, at least, and most probably leads to better vectorization.
Actually, it would be good if you could commit the test before the patch to see the difference in the transformation. But use the script to generate the checks; do not do it manually.

rampitec updated this revision to Diff 272435.Jun 22 2020, 7:51 AM

Pre-commited test and rebased.

rampitec updated this revision to Diff 272444.Jun 22 2020, 8:18 AM

Test update.

ABataev added inline comments.Jun 22 2020, 8:18 AM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

It really makes the vectorization worse, in general. Most of these inserts/extracts will be transformed into simple shuffles by the instcombiner. And if there really is a problem with target-specific limitations, it is better to adapt the cost model rather than introduce changes that may affect all targets. Maybe we need to fix TTI::getTypeLegalizationCost?

Actually, it would be good if you could commit the test before the patch to see the difference in the transformation. But use the script to generate the checks; do not do it manually.

Done. This test goes down from 683 to 607 lines (using wc -l), which should compile faster as far as I understand.

rampitec marked an inline comment as done.Jun 22 2020, 1:20 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

We are certainly working on improving our cost model, although it will not help here. I have experimented, and we really need to lie a lot to avoid vectorization for a case like this. Note, though, that the 128-bit case was suppressed by the cost model of a generic processor, which believes it is not profitable, so I have updated the test to 256 bit.

The main issue is a PHI of a wide vector type; we do not need anything else to run into the problem, and BoUpSLP::getEntryCost() does not even check the cost of a PHI:

switch (ShuffleOrOp) {
  case Instruction::PHI:
    return 0;

That said, I also do not believe it can be correctly solved by a cost model. This is not a cost problem; it is a question of the ability to generate code correctly. The cost model covers instruction size, throughput, and latency, but it does not cover register pressure.

If we believe there are targets out there which may benefit from arbitrarily wide vectorization, I can expose yet another TTI callback. We have TTI::getRegisterBitWidth() and the option -slp-max-reg-size; I could add TTI::getRealRegisterBitWidth() and -slp-real-max-reg-size. Alternatively, targets believing in unconditionally good wide vectors may return 1024 from getRegisterBitWidth(), right?

ABataev added inline comments.Jun 22 2020, 1:29 PM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

Yes, I thought about it; that may be a good alternative solution. Maybe just return 0 and in this case do not check the size at all?

I think part of the problem is that the cost model inherits a lot of the bad SelectionDAG mentality of what constitutes "legal". It would be good if we could move the IR heuristics away from the concept of legal types.

rampitec marked an inline comment as done.Jun 22 2020, 2:03 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

The easiest way is to return UINT_MAX, right? I think 0 logically means "we do not have it at all, no vectorization please".

ABataev added inline comments.Jun 22 2020, 2:33 PM
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

I'm fine with it.

rampitec marked an inline comment as done.Jun 22 2020, 3:08 PM
rampitec added inline comments.
llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
162–163

I believe that's how it works now; the same effect as -slp-max-reg-size=-1.

I haven't seen this cause damage in x86 examples, but it seems similar to D68667 (although the recent comments there suggest we may want to refine the predicate).

It is indeed similar, although it adds a check in yet another place. These checks are just not used consistently.

spatel accepted this revision.Jul 7 2020, 10:33 AM

We can't expect the backend to efficiently lower arbitrary vector IR and/or unlimited register pressure, so there's always going to be a need to limit IR in ways like this. LGTM, but wait a bit to commit in case there are more comments.

This revision is now accepted and ready to land.Jul 7 2020, 10:33 AM
This revision was automatically updated to reflect the committed changes.
jonpa added a subscriber: jonpa.Nov 11 2020, 10:05 AM

I am afraid that this patch actually has a bad impact on performance on SystemZ, and unfortunately this was not known until now. It would be much appreciated if we could rework this and get back the old behaviour on SystemZ somehow...

See https://bugs.llvm.org/show_bug.cgi?id=48155

I believe that if a target wants wider vectors, it needs to increase the size returned from its TTIImpl::getRegisterBitWidth(). Can you try increasing the value returned from SystemZTTIImpl::getRegisterBitWidth()?

I did try to override this with -slp-max-reg-size, and that works... However, getRegisterBitWidth() is also used by other passes, like the LoopVectorizer, so it seems wrong to change that value just for the purpose of tuning a particular optimization...

The other pass which calls getRegisterBitWidth(true) is LoopVectorize. Do you mean you want to have different heuristics for loop and straight-line vectorization?

Well, the definition of that hook per the comment is "The width of the largest scalar or vector register type", so I don't see how it could be a variable to play with. It should simply reflect the size of the vector register - 128 bits for SystemZ.

In the original discussion there was a suggestion to look into the TTI costs on your target for those very wide vector types, a <32 x ...> PHI instruction...? Why isn't it enough to use TTI?

Why would it make sense to only vectorize to <2 x double> and not <4 x double>? The latter is just 2 vector regs, and that is completely fine... In my case it is obvious that the final result of the vectorizer is greatly improved by allowing an over-wide vector type, even though in the most simple case 2 x <2 x double> should give the same output as a split <4 x double>.... I am not sure yet exactly why this makes for many more vector fp-add/fp-mul in the output... Note that with your patch those instructions are not vectorized at all anymore, but are left scalar! So there is some vectorization that is lost by always doing max <2 x double> and never wider...

I wonder why is it better to do 2 x <2 x double> rather than <4 x double>, they will both use two vector registers... (not just for PHIs, but generally)?

The other pass which calls getRegisterBitWidth(true) is LoopVectorize. Do you mean you want to have different heuristics for loop and straight-line vectorization?

Well, the definition of that hook per the comment is "The width of the largest scalar or vector register type", so I don't see how it could be a variable to play with. It should simply reflect the size of the vector register - 128 bits for SystemZ.

Well, probably the name of the callback does not really reflect its use. Its actual use is to set the width of the vectorization. When used with Vector = true it affects exactly two places: it sets the vectorization width for the loop and SLP vectorizers.
Earlier in the comments there seemed to be a consensus that a target which wants wider vectorization should really return a bigger number from getRegisterBitWidth().

In the original discussion there was a suggestion to look into the TTI costs on your target for those very wide vector types, a <32 x ...> PHI instruction...? Why isn't it enough to use TTI?

The problem with using costs returned from TTI is exactly this: it was ignored here and vectorization of PHI was trying to grab as much as it could.

Why would it make sense to only vectorize to <2 x double> and not <4 x double>? The latter is just 2 vector regs, and that is completely fine... In my case it is obvious that the final result of the vectorizer is greatly improved by allowing an over-wide vector type, even though in the most simple case 2 x <2 x double> should give the same output as a split <4 x double>.... I am not sure yet exactly why this makes for many more vector fp-add/fp-mul in the output... Note that with your patch those instructions are not vectorized at all anymore, but are left scalar! So there is some vectorization that is lost by always doing max <2 x double> and never wider...

I wonder why is it better to do 2 x <2 x double> rather than <4 x double>, they will both use two vector registers... (not just for PHIs, but generally)?

Making it wider than we can actually lower is bad in two ways:

  1. It eliminates the possibility to dead-code unused lanes.
  2. What was much more important, it requires the allocation of a wider register. In our case it was literally asking for registers 1024 bits wide (yes, we can have such tuples), and that leads to spilling and even an inability to allocate registers in some cases.

We also observed a regression on x86 for imagick after this patch. I'm not sure whether we observed the same case, but it was definitely related to this patch.
Here is what happened:
The SLP vectorizer started from 5 PHIs of i64 type, and it turned out, unfortunately, that only the last four of them were perfectly vectorizable with VF=4.
The max register size blindly cut that list, taking the first four of them, which were unprofitable to vectorize, and thus left the good group outside of the possible try-out window. SLP finally ended up vectorizing just two of them with VF=2.
Before the patch, the vectorizer, after a failed attempt to vectorize the first four scalars, took the next four (the last in the list) and succeeded.
Assuming the max vector size is a power of two, it probably makes sense to cut the list at 2*MaxRegSize-1 elements rather than at MaxRegSize.
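The windowing problem described in this comment can be sketched with a toy model. This is a hypothetical Python sketch, not the actual LLVM code: `find_group` stands in for SLP's scan over adjacent candidates, and `profitable` is a stand-in for the cost-model query.

```python
def find_group(phis, max_vf, profitable):
    """Scan a list of PHI candidates one element at a time and return
    the first group of max_vf adjacent candidates that the cost model
    accepts, or None. This mimics the pre-patch behaviour of retrying
    from the next instruction after a failed attempt."""
    for start in range(len(phis) - max_vf + 1):
        group = phis[start:start + max_vf]
        if profitable(group):
            return group
    return None

# Five PHIs where only the last four form a profitable VF=4 group
# (a stand-in for the imagick scenario described above).
phis = ["p0", "p1", "p2", "p3", "p4"]
profitable = lambda g: g == ["p1", "p2", "p3", "p4"]

# Scanning the full list finds the good group, but cutting the
# candidate list to the first max_vf elements hides it -- which is
# why a wider window (2*MaxVF-1) or retrying from the next
# instruction recovers the lost opportunity.
print(find_group(phis, 4, profitable))      # ['p1', 'p2', 'p3', 'p4']
print(find_group(phis[:4], 4, profitable))  # None: the good group was cut away
```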

jonpa added a comment.Nov 20 2020, 8:35 AM

I think I am seeing something very similar on SystemZ: the important BB has 5 PHIs of type double, where the first one has different operand opcodes (constant) than the others. Before this patch, that group of 5 was passed to tryToVectorizeList(), which first tried 0-3, which failed due to the different operands of PHI#0, but then succeeded with 1-4. This patch changed that so that instead groups of just 2 are passed to tryToVectorizeList(), where 0-1 fail, and then it seems some VF=2 vectorization takes place instead.

By using -slp-max-reg-size=320 I get exactly 5 elements, and the old behaviour for this example is restored, with the result of <4 x double> instructions. If I instead changed the loop in vectorizeChainsInBlock() to retry on the next instruction in the group ('IncIt = SameTypeIt' => 'IncIt++'), vectorization is now tried with groups [1,2] and [3,4], which I thought might be as good as the [1,2,3,4] group, which will be split later anyway. This however did not work: only [3,4] was actually vectorized, because [1,2] was considered too expensive. I tried adjusting with a negative value for -slp-threshold, but even though that enabled more vectorization, the performance was not improved at all.

The performance improvement is only present with VF=4, which also adds 3 shufflevectors so it seems that SLP can shuffle a few vectors and produce better code than at VF=2.

The problem with using costs returned from TTI is exactly this: it was ignored here and vectorization of PHI was trying to grab as much as it could.

It seems that might simply be missing then? tryToVectorizeList() calls getTreeCost(), which queries TTI for these costs, so it seems this is the place to increase the cost to avoid the vectorization, no? Could you provide a test case and explain why this approach is not working?

Just because some other place is using this heuristic doesn't mean that it is necessarily optimal - it could very well be the other way around so that it really shouldn't be limited in those places either...

It seems that might simply be missing then? tryToVectorizeList() calls getTreeCost(), which queries TTI for these costs, so it seems this is the place to increase the cost to avoid the vectorization, no? Could you provide a test case and explain why this approach is not working?

It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.

Just because some other place is using this heuristic doesn't mean that it is necessarily optimal - it could very well be the other way around so that it really shouldn't be limited in those places either...

I can understand the argument that controlling it with register size might not be the best approach. In this case we can just expose another target callback, specifically for vectorization purposes.

It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.

Yeah - but if you have it and post it here we can work together on it... Maybe an .ll/.bc file with a runline which gives in the output what you need to avoid. Maybe even an llc runline on that to show the spilling...

I can understand the argument that controlling it with register size might not be the best approach. In this case we can just expose another target callback, specifically for vectorization purposes.

Would it make sense to have SLP first try the full group, and then as long as it's not profitable reiterate with half of the previous group size? In other words first try 32 in your case, and then start over with a max of 16, then 8, all the way down to 2 unless TTI costs returned a profitable total cost? That is one idea.. An alternative might be to have SLP look at the tree it wants to convert and do a register pressure estimate and add that to the total cost... This is assuming that greater VFs are beneficial, which I at least think they are at the moment...
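The halving idea suggested above can be sketched as follows. This is a hypothetical Python sketch under stated assumptions: `best_vf` stands in for the proposed retry loop, and `profitable` is a stand-in for the TTI cost query, not an actual LLVM API.

```python
def best_vf(num_candidates, max_vf, profitable):
    """Try the largest power-of-two VF that fits the candidate list
    (capped at max_vf) first, halving on each failure, and return the
    first VF the cost model accepts, or None to stay scalar.
    Assumes num_candidates >= 2 and max_vf is a power of two."""
    vf = min(max_vf, 1 << (num_candidates.bit_length() - 1))
    while vf >= 2:
        if profitable(vf):
            return vf
        vf //= 2
    return None

# 32 candidates, but the (hypothetical) cost model only accepts VF <= 8:
print(best_vf(32, 32, lambda vf: vf <= 8))  # 8
# Nothing is profitable: fall back to scalar code.
print(best_vf(5, 32, lambda vf: False))     # None
```

The design trade-off debated in the thread shows up in the cap: a target that cannot lower wide vectors cheaply supplies a small max_vf, while a target that benefits from splitting large vectors later can leave it effectively unbounded.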

It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.

Yeah - but if you have it and post it here we can work together on it... Maybe an .ll/.bc file with a runline which gives in the output what you need to avoid. Maybe even an llc runline on that to show the spilling...

That was a really long time ago. I was trying to find the original bc from the failing app, but cannot find it anymore :(

I can understand the argument that controlling it with register size might not be the best approach. In this case we can just expose another target callback, specifically for vectorization purposes.

Would it make sense to have SLP first try the full group, and then as long as it's not profitable reiterate with half of the previous group size? In other words first try 32 in your case, and then start over with a max of 16, then 8, all the way down to 2 unless TTI costs returned a profitable total cost? That is one idea.. An alternative might be to have SLP look at the tree it wants to convert and do a register pressure estimate and add that to the total cost... This is assuming that greater VFs are beneficial, which I at least think they are at the moment...

That is, if you believe that wider vectorization is a bonus. It might be on some targets; it is definitely not for AMDGPU. We have very wide register tuples, but really only 2-element vector ALU instructions (and 4-element vector loads and stores). Nonetheless, since we have these wide registers, RA will use them, thus increasing register pressure. The generated code will use subregs of these wide tuples. In fact, just by returning a twice wider result for the vector register size, I see an increase in the number of consumed registers, which directly lowers the performance.

Then for a target which does not have such registers there is no such option and vectors will be split at lowering. And that's the real difference here.

If for some reason vector register width is not a good enough driver for the vectorization, I would rather create yet another target callback. It just happens that register width is currently used across llvm to control it, but we can change it.

jonpa added a comment.Nov 21 2020, 1:43 AM

It seems like we would want to retry vectorization from the next instruction in the group if the current group failed. It also seems like we want to try greater VFs first. This seems to be what tryToVectorizeList() is doing when it gets all of the group as input (like before this patch). This patch breaks both of these points: it retries by skipping the *whole* failed group, so for instance with a group of 5 PHIs where the first one is different, it will not restart at PHI#1 but at PHI#2, which is bad. It also forces a small VF, which seems to miss vectorization opportunities.

Instead of just updating this patch with a new hook would it perhaps be even better to put that hook inside tryToVectorizeList()? I see MinVF and MaxVF being set there and maybe that's a better place to cap VF? That way you would get the restart on the next instruction... Otherwise I agree it would be a good start to just use a new hook in the same place to get the old behavior back on SystemZ and other targets.

That makes sense to me, i.e. try to restart vectorization in tryToVectorizeList() but still make sure we do not produce vectors wider than requested by TTI.

jonpa added a comment.Nov 24 2020, 3:10 AM

Would you like to give that a try? I think this patch could basically be reverted, and then some new hook would be needed to control MinVF/MaxVF in tryToVectorizeList(). Personally, I think it could be nice with something like TTI->getMaximumVF() or even getMaximumSLPVF()... Probably also good with a test case for your target, if possible, that is supposed to not produce any spilling...

WRT the testcase, it is really the testcase in the patch with a single modification right before ret void:

store float %phi31, float* undef
ret void

That keeps the value alive, and then if I feed opt's output into llc I either get spilling with wide vectors (first case) or not (second case):

opt -slp-vectorizer -slp-max-reg-size=1024 -S < slp-max-phi-size.ll | ~/work/llvm/rel/bin/llc -march=amdgcn -mcpu=gfx900
opt -slp-vectorizer -slp-max-reg-size=256 -S < slp-max-phi-size.ll | ~/work/llvm/rel/bin/llc -march=amdgcn -mcpu=gfx900

You can check it here: D92047
Not sure how appropriate it is to commit a target test using llc into the Transforms directory.

I am experimenting with it now. I have reverted the patch and instead added this into tryToVectorizeList():

   unsigned MaxVF = std::max<unsigned>(PowerOf2Floor(VL.size()), MinVF);
+  MaxVF = std::min(R.getMaxVecRegSize() / Sz, MaxVF);

I.e. clamp the list to the MaxVecRegSize. It should still allow reordering.
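In concrete numbers, the clamp from the diff above behaves as sketched below. This is a standalone re-implementation for illustration; powerOf2Floor is reproduced here rather than taken from llvm/Support/MathExtras.h, and computeMaxVF only models the two lines shown in the diff:

```cpp
#include <algorithm>

// Equivalent of llvm::PowerOf2Floor for this sketch: largest power of two
// less than or equal to V (0 for V == 0).
unsigned powerOf2Floor(unsigned V) {
  unsigned P = 0;
  for (unsigned C = 1; C != 0 && C <= V; C <<= 1)
    P = C;
  return P;
}

// Models MaxVF as computed in tryToVectorizeList() with the extra clamp:
//   MaxVF = max(PowerOf2Floor(VL.size()), MinVF)
//   MaxVF = min(MaxVecRegSize / Sz, MaxVF)
unsigned computeMaxVF(unsigned NumValues, unsigned MinVF,
                      unsigned MaxVecRegSizeBits, unsigned ElemSizeBits) {
  unsigned MaxVF = std::max(powerOf2Floor(NumValues), MinVF);
  return std::min(MaxVecRegSizeBits / ElemSizeBits, MaxVF);
}
```

For the 32-float PHI group from the test case: with -slp-max-reg-size=256 and 32-bit elements this gives min(256/32, 32) = 8, while with -slp-max-reg-size=1024 the full width of 32 survives, matching the two llc invocations above.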

I'd say that solves the initial problem and is probably a better solution overall. For instance, I see less vectorization with AMDGPU in cases where it should not have been vectorized in the first place.
However, the impact of that change will be less overall vectorization across all targets, since tryToVectorizeList() is used not just for PHIs.
That can be fixed by the new target callback if its default value is UINT_MAX or just some large number. That default is somewhat subpar, as I would expect a better default to be MaxVecRegSize, but I guess it can be decided per target.

Please check D92059, I believe that is what we have discussed here.