This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Unify GEP cost modeling for load, store and GEP nodes.
ClosedPublic

Authored by vdmitrie on Dec 30 2022, 2:47 PM.

Details

Summary

Make a separate routine for GEPs cost calculation and make
the approach uniform across load, store and GEP tree nodes.
Additional issue fixed is GEP cost savings were applied twice
for ScatterVectorize nodes (aka gather load) making them look
unrealistically profitable for vectorization.

Diff Detail

Event Timeline

vdmitrie created this revision.Dec 30 2022, 2:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 30 2022, 2:47 PM
vdmitrie requested review of this revision.Dec 30 2022, 2:47 PM
RKSimon added inline comments.Jan 1 2023, 4:10 AM
llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll
2

Havig to add a slp-threshold on an existing defaut test makes me a little nervous tbh

vdmitrie added inline comments.Jan 3 2023, 9:14 AM
llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll
2

I believe that the purpose of this test is to show how vectorization goes without/with non-pow2 feature support (even though the feature is not ready yet). So we need to keep it vectorized. This is why I changed threshold instead of re-generating checks. We already have some tests that explicitly set the threshold for similar reason as reduced tests are not always run as desired with default threshold.
In this particular case the behavior has changed because the unified GEP cost routine adds an ADD cost for only those GEPs that used once while previously it did that unconditionally.

ABataev added inline comments.Jan 3 2023, 9:22 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916

If pointer has multiple uses, it still will be vectorized + added the cost of the external use. I think currently, we still may add the cost of the external use for such geps. Shall we drop Ptr->hasOneUse() for some nodes, like scattervectorize, but not for vector loads/stores?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916

We shall treat GEP is not a regular instruction. For regular instruction we can perform vector operation and than extract a lane to get a value. We cannot do the same for a GEP. When scalar GEPs has just one use it means that their user will be removed as a result of vectorization. But this does not happen for GEPs as there is no vector version of GEP exists. Instead we just cast base pointer to required vector type. If some non-base pointer GEPs have more that one use that means they may still have uses after the tree was vectorized (i.e. GEP will not be removed). That is my understanding about what logic was put behind that hasOneUse check. I agree that it can be not quite satisfactory and I left the comment about "all uses inside vectorizable tree". In this regard test case geps-non-pow-2.ll probably represents an exception from the above as GEPs are arguments of PHIs (and we do not really know what is relationships between the GEPs) and an external use will produce an extract rather than leaves the original GEP instruction. Note that in this case all nodes in the tree are either "vectorize" or splats (See attached pdf). I.e. we probably may drop hasOneUse when an in-tree user of GEPs is PHI but I believe it would be incorrect to do that if GEP user is a load/store node (regardless of vectorize/scattervectorize kind).

ABataev added inline comments.Jan 3 2023, 11:19 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916

I'm not saying about particular test, just a common question.
Say, we have something like this:

%gep1 = getelementptr
%gep2 = getelementptr
%a = load %gep1
%b = load %gep2
%c = load %gep1

If 2 first loads gets vectorized, the third load will get extractelement from vector getelementptr:

%gep1 = getelement <x>
%vec_a = load < 2 x> %gep1
%gep = extractelement %gep1, 0
%c load %gep

try the next code:

define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) {
  %load.1 = load i32, ptr %in, align 4
  %gep.1 = getelementptr inbounds i32, ptr %in, i64 3
  %load.2 = load i32, ptr %gep.1, align 4
  %gep.2 = getelementptr inbounds i32, ptr %in, i64 6
  %load.3 = load i32, ptr %gep.2, align 4
  %gep.3 = getelementptr inbounds i32, ptr %in, i64 9
  %load.4 = load i32, ptr %gep.3, align 4
  %load.5 = load i32, ptr %inn, align 4
  %gep.4 = getelementptr inbounds i32, ptr %inn, i64 1
  %load.6 = load i32, ptr %gep.4, align 4
  %gep.5 = getelementptr inbounds i32, ptr %inn, i64 2
  %load.7 = load i32, ptr %gep.5, align 4
  %gep.6 = getelementptr inbounds i32, ptr %inn, i64 3
  %load.8 = load i32, ptr %gep.6, align 4
  %mul.1 = mul i32 %load.1, %load.5
  %mul.2 = mul i32 %load.2, %load.6
  %mul.3 = mul i32 %load.3, %load.7
  %mul.4 = mul i32 %load.4, %load.8
  %gep.8 = getelementptr inbounds i32, ptr %out, i64 1
  %gep.9 = getelementptr inbounds i32, ptr %out, i64 2
  %gep.10 = getelementptr inbounds i32, ptr %out, i64 3
  store i32 %mul.1, ptr %gep.9, align 4
  store i32 %mul.2, ptr %out, align 4
  store i32 %mul.3, ptr %gep.10, align 4
  store i32 %mul.4, ptr %gep.8, align 4

  %ll = load i32, ptr %gep.1, align 4

  ret i32 undef
}

opt -S -mtriple=x86_64-unknown -mattr=+avx512vl -passes=slp-vectorizer ./repro.ll

define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) #0 {
  %1 = insertelement <4 x ptr> poison, ptr %in, i32 0
  %2 = shufflevector <4 x ptr> %1, <4 x ptr> poison, <4 x i32> zeroinitializer
  %3 = getelementptr i32, <4 x ptr> %2, <4 x i64> <i64 3, i64 9, i64 0, i64 6>
  %4 = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %3, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison)
  %5 = load <4 x i32>, ptr %inn, align 4
  %6 = shufflevector <4 x i32> %5, <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
  %7 = mul <4 x i32> %4, %6
  store <4 x i32> %7, ptr %out, align 4
  %8 = extractelement <4 x ptr> %3, i32 0
  %ll = load i32, ptr %8, align 4
  ret i32 undef
}

For vector loads/stores we maybe do not emit extractelement but we can account its cost. Need to exclude this extra cost for the geps with multiple uses

vdmitrie added inline comments.Jan 3 2023, 11:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916

Ahh, I see. It was my misunderstanding. Thanks for clarification.
Basically we only need to account hasOneUse() for only regular loads and stores and not for the rest because these are terminators and don't really generate a vector GEP. I'll make the necessary changes and update the patch.

vdmitrie updated this revision to Diff 486091.Jan 3 2023, 2:34 PM

Address comment + rebase

ABataev added inline comments.Jan 4 2023, 7:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6919–6925

Probably not related to this patch. Do we need to emit extractelement for GEP at all? Is not it better just to rebuild the scalar GEP?

6924–6925

Shall we drop this check? We still vectorize GEPs with multiple uses and then emit extractelement for them. The cost of the extractelement is calculated separately. So, when we calculate the cost for GEPs with multiple uses, we exclude them from saving cost and then we add an extra cost for extractelement. If we're still going to emit extractelement, need to remove this check (the original пуз will be vectorized and removed and then the extractelement is generated). If it is better to keep scalar copy, need to remove the cost of the extractelement calculation and keep this check.

vdmitrie added inline comments.Jan 4 2023, 10:11 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6919–6925

Probably not related to this patch. Do we need to emit extractelement for GEP at all? Is not it better just to rebuild the scalar GEP?

IMO, this is not a bad idea. We don't really need to rematerialize a scalar GEP. Instead we can leave the original scalar one and use it instead of generating an extract. This will also break undesired dependency on vector GEP. So generally looks like room for improvement.

6924–6925

For plain vector loads and stores we do not vectorize GEPs and hence do not emit extract element instructions. Instead as scalar loads are removed and GEPs for which these loads (or stores) were single users are also removed. All the rest GEPs stay in the code. When we build vec tree we do not dive into loads or stores pointer arguments, these loads/or store nodes are terminal nodes. This is why I added check for stores/loads which will only return true for vector loads or stores.
define ptr @foo(ptr nocapture readonly %src, ptr nocapture %dst) local_unnamed_addr {
entry:

%arrayidxA0 = getelementptr inbounds double, ptr %src, i64 0
%A0 = load double, ptr %arrayidxA0, align 1
%arrayidxA1 = getelementptr inbounds double, ptr %src, i64 1
%A1 = load double, ptr %arrayidxA1, align 1

%arrayidx0 = getelementptr inbounds double, ptr %dst, i64 0
store double %A0, ptr %arrayidx0, align 16
%arrayidx1 = getelementptr inbounds double, ptr %dst, i64 1
store double %A1, ptr %arrayidx1, align 16

ret ptr %arrayidxA1

}

We generate:

%arrayidxA0 = getelementptr inbounds double, ptr %src, i64 0
%arrayidxA1 = getelementptr inbounds double, ptr %src, i64 1
%arrayidx0 = getelementptr inbounds double, ptr %dst, i64 0
%0 = load <2 x double>, ptr %arrayidxA0, align 1
store <2 x double> %0, ptr %arrayidx0, align 16
ret ptr %arrayidxA1

We do not do the same for gather loads (aka ScatterVectorize) as we indeed vectorize GEPs and then do extracts for external uses.

ABataev added inline comments.Jan 4 2023, 10:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925

Yes, for vector loads/store it is so. But what about masked gather? We avoid the cost compensation here and then add the extractelement cost.

vdmitrie added inline comments.Jan 4 2023, 10:29 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925

For gather loads this routine is not called when we process tree node with set of loads.
It is called when tree node with GEPs is processed. For GEPs the VL0 will not be a load.
It was a bug in cost modeling that we evaluated GEPs twice for gather loads which this patch particularly fixes.

ABataev added inline comments.Jan 4 2023, 10:35 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925

I mean not gather but masked gather, ScatterVectorize.
Another one thing. Shall we add a cost of vector GEP? Currently we just subtract the cost of scalar GEPs.

vdmitrie added inline comments.Jan 4 2023, 10:50 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925

I mean not gather but masked gather, ScatterVectorize.

Yes, I understand you. By "gather loads" I meant exactly masked gather load (aka ScatterVectorize). So what I said still holds.

Another one thing. Shall we add a cost of vector GEP? Currently we just subtract the cost of scalar GEPs.

This can be considered for a follow up changes. To be more accurate we do not count base address now (see the first check point in the loop) assuming that cost of vector GEP will be the same. So we don't add its cost for scalar calculation and hence do not subtract it then (it's kind a shortcut). For plain vector loads and stores that is correct estimation (where for regular pointers we only generate a pointer cast which I believe is free). For GEP nodes this estimation may not be quite correct.

ABataev accepted this revision.Jan 4 2023, 11:02 AM

LG

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925
  1. Yes, I overlooked it.
  2. Ok.
This revision is now accepted and ready to land.Jan 4 2023, 11:02 AM