[SLPVectorizer] Limit the number of block chain instructions to max register size
Abandoned · Public

Authored by mkazantsev on Apr 26 2017, 4:50 AM.

Details

Summary

When aggregating instructions of the same type in the vectorizeChainsInBlock method, SLPVectorizer
does not take the maximum vector register size into account. As a result, it may generate vector
instructions that use vectors wider than allowed.

This patch limits the number of aggregated instructions so that their overall size does not exceed
the max allowed vector size.
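The limiting logic proposed here can be sketched as follows (a minimal Python sketch, not the actual patch; `limit_chain` and its parameters are hypothetical stand-ins for the values SLPVectorizer queries from the target):

```python
def limit_chain(insts, elem_size_bits, max_vec_reg_size_bits):
    """Truncate a chain of same-typed scalar instructions so that the
    combined vector width does not exceed the max vector register size."""
    max_elems = max(1, max_vec_reg_size_bits // elem_size_bits)
    return insts[:max_elems]

# E.g. a chain of 16 i64 instructions against a 512-bit register:
# only the first 8 (8 * 64 = 512 bits) would be kept.
chain = [f"inst{i}" for i in range(16)]
print(len(limit_chain(chain, 64, 512)))  # → 8
```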

Diff Detail

mkazantsev created this revision.Apr 26 2017, 4:50 AM
anna added a subscriber: mssimpso.Apr 26 2017, 6:56 AM

Hi Max,
I think we are allowed to have vector operations (other than vector stores) on vector sizes greater than getMaxVecRegSize in the IR. In the SLP vectorizer, we look for the max size when storing. Both GEP vectorization and phi-vectorization go by the number of scalar elements chosen for vectorization, rather than R.getMaxVecRegSize (for example, GEP chooses chunks of 16).

Also, looking at all the llvm tests, there are many cases where GEPs, shuffles, inserts and extracts operate on <16 x i64> vector types, i.e. 1024 bits. The maximum vector register size among all targets is 512 bits (zmm registers in AVX512). Running this through codegen, I can see we use the correct vector register size allowed at the target, even though the IR has <16 x i64>. Given this, I don't think this is a bug.
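The arithmetic behind this observation can be checked directly (a sketch using only the figures quoted above; the helper names are made up for illustration):

```python
def vector_bits(num_elems, elem_bits):
    """Total bit width of a <num_elems x iN> vector type."""
    return num_elems * elem_bits

def legal_parts(total_bits, max_reg_bits):
    """How many legal-width registers the backend would split a wide
    vector into (ceiling division, for the non-exact-multiple case)."""
    return -(-total_bits // max_reg_bits)

wide = vector_bits(16, 64)        # <16 x i64>
print(wide)                       # → 1024 bits in the IR
print(legal_parts(wide, 512))     # → 2 zmm-sized pieces on AVX512
```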

@mssimpso Is this correct? Also, I would like to know the design philosophy behind having the large vector sizes, and allowing LLC to find the correct vector register size.

In D32533#738049, @anna wrote:

@mssimpso Is this correct? Also, I would like to know the design philosophy behind having the large vector sizes, and allowing LLC to find the correct vector register size.

As far as I know, this is correct. The vectorizers are generally allowed to (and in some cases do) create vectors that are wider than the physical vector register size of the target. The backend is supposed to know how to split them up into legal sizes. The min/max vector register size options in SLP can use some work, though (see D31965 and the TODO on line 325). I think they are primarily used to narrow the search space for compile-time savings. But I think we currently only do this for store-rooted trees (vectorizeStores).

anna added a comment.Apr 26 2017, 9:15 AM

Okay, that explains why we limit to the target vector register size *only* for the store chain in SLP (compile-time benefit). However, if we look at the Loop Vectorizer, we consider the maximum vector register size when generating the code in the IR. This also gives a more accurate cost model for LV.

Not considering the physical vector register size is limiting the SLP cost model, right? For example, on the target, we would end up with 4 shuffles instead of a single shuffle.

In D32533#738239, @anna wrote:

Not considering the physical vector register size is limiting the SLP cost model, right? For example, on the target, we would end up with 4 shuffles instead of a single shuffle.

The physical vector register size is used when computing costs. In the loop vectorizer, we set a MaxVF based on the target register size, then compute the cost for all VFs up to this size and select the VF that is most profitable. But for SLP (for store-rooted trees), we have MinVecRegSize and MaxVecRegSize (where MaxVecRegSize is defined by the target). We try VFs based on these sizes from Max to Min and vectorize with the first one that is profitable (we don't try all of them).
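The two selection strategies described above can be contrasted in a sketch (the cost function and the candidate VF lists are toy stand-ins; real costs come from the target's cost model, not from these helpers):

```python
def lv_select(vfs, cost):
    """Loop Vectorizer style: evaluate every candidate VF up to MaxVF
    and keep the most profitable one (falling back to scalar, VF=1)."""
    profitable = [vf for vf in vfs if cost(vf) < cost(1)]
    return min(profitable, key=cost, default=1)

def slp_store_select(vfs, cost):
    """SLP store-chain style: walk from the largest VF down and take
    the FIRST profitable one, without trying the rest."""
    for vf in sorted(vfs, reverse=True):
        if cost(vf) < cost(1):
            return vf
    return 1

# Toy cost table: VF=4 is cheapest, but VF=8 is already profitable.
toy_cost = {1: 10, 2: 9, 4: 4, 8: 6}.get
print(lv_select([2, 4, 8], toy_cost))         # → 4
print(slp_store_select([2, 4, 8], toy_cost))  # → 8
```

The difference the comment describes falls out directly: LV's exhaustive search lands on the cheapest VF, while SLP's first-profitable walk from the top may settle for a wider but costlier one.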

But again, it looks to me like these limits are only imposed on the store-rooted trees. For vectorizeChainsInBlock and vectorizeGEPIndices, it looks like we basically let the vector be as wide as it can be, compute the cost of that, and if profitable, vectorize. The costs are still based on the target register size, though.

anna added a comment.Apr 26 2017, 10:32 AM

Thanks Matt. Looking at the SLP vectorizer code, I agree with what you've said.

Just to summarize, SLP vectorizer computes with the max possible vector length in the IR for GEPs and PHIs. However, the physical vector register size is used in the cost calculation, and vectorization is done only if it's profitable. I've verified that vectorizing stores, GEPs and PHIs uses this cost threshold to decide if vectorization is profitable. The downside to how the cost model is used in SLP vectorizer is that we may miss out on vectorizing GEPs and PHIs: because we choose the widest possible vector, if that turns out to be unprofitable (e.g. because the target's physical register size is small), we simply won't vectorize.

I think the right fix from a profitability standpoint (i.e. vectorizing more GEPs and PHIs) is to do the same thing we do for store-rooted trees: try from max VF = target's physical vector register width down to min VF, and stop when we find something profitable. The issue might be a higher compile time. From a correctness standpoint, the SLP code is doing the right thing.

In D32533#738239, @anna wrote:

Okay, that explains why we limit to the target vector register size *only* for the store chain in SLP (compile-time benefit). However, if we look at the Loop Vectorizer, we consider the maximum vector register size when generating the code in the IR. This also gives a more accurate cost model for LV.

Not considering the physical vector register size is limiting the SLP cost model, right? For example, on the target, we would end up with 4 shuffles instead of a single shuffle.

SLP and LV, unfortunately, have different approaches here.

SLP, except for the store-chain case, ignores register sizes. The assumption is that (a) the legalizer will do a good job, and (b) the cost model accurately reflects legalization costs. LV is more conservative, and will not create vectors wider than the register size (it has a flag that enables it to do so, but it's off by default).

The direction we want to move in is *not* limiting the vector size in either SLP or LV, that is, the opposite of what this patch does. This is important for vectorizing code that mixes types of different sizes. That hasn't happened yet, for a couple of reasons. One is that the assumptions SLP makes about the cost model and legalization don't necessarily hold. :-) The other is that doing it correctly also requires modeling register pressure. We already do this in LV for the interleaving (unrolling) factor, to an extent, but it needs to be integrated with the vectorization-factor heuristic as well.

So, overall, I'd say the right solution here is not to stop SLP from creating wide vectors, but to fix the backend/cost model issues.

mkazantsev abandoned this revision.Apr 26 2017, 9:20 PM

Thank you guys for the clarification. I'm abandoning this revision since the existing behavior is correct.