Make a separate routine for GEPs cost calculation and make
the approach uniform across load, store and GEP tree nodes.
Additional issue fixed is GEP cost savings were applied twice
for ScatterVectorize nodes (aka gather load) making them look
unrealistically profitable for vectorization.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll | ||
---|---|---|
2 | Havig to add a slp-threshold on an existing defaut test makes me a little nervous tbh |
llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll | ||
---|---|---|
2 | I believe that the purpose of this test is to show how vectorization goes without/with non-pow2 feature support (even though the feature is not ready yet). So we need to keep it vectorized. This is why I changed threshold instead of re-generating checks. We already have some tests that explicitly set the threshold for similar reason as reduced tests are not always run as desired with default threshold. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6911–6916 | If pointer has multiple uses, it still will be vectorized + added the cost of the external use. I think currently, we still may add the cost of the external use for such geps. Shall we drop Ptr->hasOneUse() for some nodes, like scattervectorize, but not for vector loads/stores? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6911–6916 | We shall treat GEP is not a regular instruction. For regular instruction we can perform vector operation and than extract a lane to get a value. We cannot do the same for a GEP. When scalar GEPs has just one use it means that their user will be removed as a result of vectorization. But this does not happen for GEPs as there is no vector version of GEP exists. Instead we just cast base pointer to required vector type. If some non-base pointer GEPs have more that one use that means they may still have uses after the tree was vectorized (i.e. GEP will not be removed). That is my understanding about what logic was put behind that hasOneUse check. I agree that it can be not quite satisfactory and I left the comment about "all uses inside vectorizable tree". In this regard test case geps-non-pow-2.ll probably represents an exception from the above as GEPs are arguments of PHIs (and we do not really know what is relationships between the GEPs) and an external use will produce an extract rather than leaves the original GEP instruction. Note that in this case all nodes in the tree are either "vectorize" or splats (See attached pdf). I.e. we probably may drop hasOneUse when an in-tree user of GEPs is PHI but I believe it would be incorrect to do that if GEP user is a load/store node (regardless of vectorize/scattervectorize kind). |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6911–6916 | I'm not saying about particular test, just a common question. %gep1 = getelementptr %gep2 = getelementptr %a = load %gep1 %b = load %gep2 %c = load %gep1 If 2 first loads gets vectorized, the third load will get extractelement from vector getelementptr: %gep1 = getelement <x> %vec_a = load < 2 x> %gep1 %gep = extractelement %gep1, 0 %c load %gep try the next code: define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) { %load.1 = load i32, ptr %in, align 4 %gep.1 = getelementptr inbounds i32, ptr %in, i64 3 %load.2 = load i32, ptr %gep.1, align 4 %gep.2 = getelementptr inbounds i32, ptr %in, i64 6 %load.3 = load i32, ptr %gep.2, align 4 %gep.3 = getelementptr inbounds i32, ptr %in, i64 9 %load.4 = load i32, ptr %gep.3, align 4 %load.5 = load i32, ptr %inn, align 4 %gep.4 = getelementptr inbounds i32, ptr %inn, i64 1 %load.6 = load i32, ptr %gep.4, align 4 %gep.5 = getelementptr inbounds i32, ptr %inn, i64 2 %load.7 = load i32, ptr %gep.5, align 4 %gep.6 = getelementptr inbounds i32, ptr %inn, i64 3 %load.8 = load i32, ptr %gep.6, align 4 %mul.1 = mul i32 %load.1, %load.5 %mul.2 = mul i32 %load.2, %load.6 %mul.3 = mul i32 %load.3, %load.7 %mul.4 = mul i32 %load.4, %load.8 %gep.8 = getelementptr inbounds i32, ptr %out, i64 1 %gep.9 = getelementptr inbounds i32, ptr %out, i64 2 %gep.10 = getelementptr inbounds i32, ptr %out, i64 3 store i32 %mul.1, ptr %gep.9, align 4 store i32 %mul.2, ptr %out, align 4 store i32 %mul.3, ptr %gep.10, align 4 store i32 %mul.4, ptr %gep.8, align 4 %ll = load i32, ptr %gep.1, align 4 ret i32 undef } opt -S -mtriple=x86_64-unknown -mattr=+avx512vl -passes=slp-vectorizer ./repro.ll define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) #0 { %1 = insertelement <4 x ptr> poison, ptr %in, i32 0 %2 = shufflevector <4 x ptr> %1, <4 x ptr> poison, <4 x i32> zeroinitializer %3 = getelementptr i32, <4 x ptr> %2, <4 x i64> <i64 3, i64 9, i64 0, i64 6> %4 = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %3, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison) %5 = load <4 x i32>, ptr %inn, align 4 %6 = shufflevector <4 x i32> %5, <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2> %7 = mul <4 x i32> %4, %6 store <4 x i32> %7, ptr %out, align 4 %8 = extractelement <4 x ptr> %3, i32 0 %ll = load i32, ptr %8, align 4 ret i32 undef } For vector loads/stores we maybe do not emit extractelement but we can account its cost. Need to exclude this extra cost for the geps with multiple uses |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6911–6916 | Ahh, I see. It was my misunderstanding. Thanks for clarification. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6919–6925 | Probably not related to this patch. Do we need to emit extractelement for GEP at all? Is not it better just to rebuild the scalar GEP? | |
6924–6925 | Shall we drop this check? We still vectorize GEPs with multiple uses and then emit extractelement for them. The cost of the extractelement is calculated separately. So, when we calculate the cost for GEPs with multiple uses, we exclude them from saving cost and then we add an extra cost for extractelement. If we're still going to emit extractelement, need to remove this check (the original пуз will be vectorized and removed and then the extractelement is generated). If it is better to keep scalar copy, need to remove the cost of the extractelement calculation and keep this check. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6919–6925 |
IMO, this is not a bad idea. We don't really need to rematerialize a scalar GEP. Instead we can leave the original scalar one and use it instead of generating an extract. This will also break undesired dependency on vector GEP. So generally looks like room for improvement. | |
6924–6925 | For plain vector loads and stores we do not vectorize GEPs and hence do not emit extract element instructions. Instead as scalar loads are removed and GEPs for which these loads (or stores) were single users are also removed. All the rest GEPs stay in the code. When we build vec tree we do not dive into loads or stores pointer arguments, these loads/or store nodes are terminal nodes. This is why I added check for stores/loads which will only return true for vector loads or stores. %arrayidxA0 = getelementptr inbounds double, ptr %src, i64 0 %A0 = load double, ptr %arrayidxA0, align 1 %arrayidxA1 = getelementptr inbounds double, ptr %src, i64 1 %A1 = load double, ptr %arrayidxA1, align 1 %arrayidx0 = getelementptr inbounds double, ptr %dst, i64 0 store double %A0, ptr %arrayidx0, align 16 %arrayidx1 = getelementptr inbounds double, ptr %dst, i64 1 store double %A1, ptr %arrayidx1, align 16 ret ptr %arrayidxA1 } We generate: %arrayidxA0 = getelementptr inbounds double, ptr %src, i64 0 %arrayidxA1 = getelementptr inbounds double, ptr %src, i64 1 %arrayidx0 = getelementptr inbounds double, ptr %dst, i64 0 %0 = load <2 x double>, ptr %arrayidxA0, align 1 store <2 x double> %0, ptr %arrayidx0, align 16 ret ptr %arrayidxA1 We do not do the same for gather loads (aka ScatterVectorize) as we indeed vectorize GEPs and then do extracts for external uses. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6924–6925 | Yes, for vector loads/store it is so. But what about masked gather? We avoid the cost compensation here and then add the extractelement cost. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6924–6925 | For gather loads this routine is not called when we process tree node with set of loads. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6924–6925 | I mean not gather but masked gather, ScatterVectorize. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6924–6925 |
Yes, I understand you. By "gather loads" I meant exactly masked gather load (aka ScatterVectorize). So what I said still holds.
This can be considered for a follow up changes. To be more accurate we do not count base address now (see the first check point in the loop) assuming that cost of vector GEP will be the same. So we don't add its cost for scalar calculation and hence do not subtract it then (it's kind a shortcut). For plain vector loads and stores that is correct estimation (where for regular pointers we only generate a pointer cast which I believe is free). For GEP nodes this estimation may not be quite correct. |
LG
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6924–6925 |
|
If pointer has multiple uses, it still will be vectorized + added the cost of the external use. I think currently, we still may add the cost of the external use for such geps. Shall we drop Ptr->hasOneUse() for some nodes, like scattervectorize, but not for vector loads/stores?