This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Unify GEP cost modeling for load, store and GEP nodes.
ClosedPublic

Authored by vdmitrie on Dec 30 2022, 2:47 PM.


Details

Reviewers

ABataev
RKSimon

Commits

rG6d677c0b3d91: [SLP] Unify GEP cost modeling for load, store and GEP nodes.

Summary

Make a separate routine for GEP cost calculation and make
the approach uniform across load, store and GEP tree nodes.
An additional issue fixed is that GEP cost savings were applied twice
for ScatterVectorize nodes (aka gather loads), making them look
unrealistically profitable for vectorization.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vdmitrie created this revision. Dec 30 2022, 2:47 PM

Herald added a project: Restricted Project. Dec 30 2022, 2:47 PM

Herald added subscribers: vporpo, hiraditya.

vdmitrie requested review of this revision. Dec 30 2022, 2:47 PM

Herald added subscribers: llvm-commits, pcwang-thead. Dec 30 2022, 2:47 PM

Harbormaster completed remote builds in B205240: Diff 485727. Dec 30 2022, 3:36 PM

RKSimon added inline comments. Jan 1 2023, 4:10 AM

llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll
2	Having to add a slp-threshold on an existing default test makes me a little nervous tbh

vdmitrie added inline comments. Jan 3 2023, 9:14 AM

llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll
2	I believe that the purpose of this test is to show how vectorization goes without/with non-pow2 feature support (even though the feature is not ready yet). So we need to keep it vectorized. This is why I changed the threshold instead of regenerating the checks. We already have some tests that explicitly set the threshold for a similar reason, as reduced tests do not always behave as desired with the default threshold. In this particular case the behavior has changed because the unified GEP cost routine adds an ADD cost only for those GEPs that are used once, while previously it did that unconditionally.

ABataev added inline comments. Jan 3 2023, 9:22 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916	If a pointer has multiple uses, it will still be vectorized, with the cost of the external use added. I think currently we may still add the cost of the external use for such GEPs. Shall we drop Ptr->hasOneUse() for some nodes, like ScatterVectorize, but not for vector loads/stores?

vdmitrie added inline comments.

Attachment: 02.pdf (13 KB)

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916	We should treat a GEP as not a regular instruction. For a regular instruction we can perform the vector operation and then extract a lane to get a value. We cannot do the same for a GEP. When scalar GEPs have just one use, it means that their user will be removed as a result of vectorization. But this does not happen for GEPs, as there is no vector version of a GEP. Instead we just cast the base pointer to the required vector type. If some non-base pointer GEPs have more than one use, that means they may still have uses after the tree is vectorized (i.e. the GEP will not be removed). That is my understanding of the logic behind that hasOneUse check. I agree that it may not be quite satisfactory, and I left the comment about "all uses inside vectorizable tree". In this regard the test case geps-non-pow-2.ll probably represents an exception from the above, as the GEPs are arguments of PHIs (and we do not really know what the relationship between the GEPs is) and an external use will produce an extract rather than leaving the original GEP instruction in place. Note that in this case all nodes in the tree are either "vectorize" or splats (see the attached pdf). I.e. we probably may drop hasOneUse when an in-tree user of the GEPs is a PHI, but I believe it would be incorrect to do that if the GEP user is a load/store node (regardless of vectorize/scattervectorize kind).

ABataev added inline comments. Jan 3 2023, 11:19 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

6911–6916

I'm not talking about a particular test, this is just a general question.
Say, we have something like this:

%gep1 = getelementptr
%gep2 = getelementptr
%a = load %gep1
%b = load %gep2
%c = load %gep1

If the first two loads get vectorized, the third load will get an extractelement from the vector getelementptr:

%gep1 = getelementptr <x>
%vec_a = load < 2 x> %gep1
%gep = extractelement %gep1, 0
%c = load %gep

Try the following code:

define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) {
  %load.1 = load i32, ptr %in, align 4
  %gep.1 = getelementptr inbounds i32, ptr %in, i64 3
  %load.2 = load i32, ptr %gep.1, align 4
  %gep.2 = getelementptr inbounds i32, ptr %in, i64 6
  %load.3 = load i32, ptr %gep.2, align 4
  %gep.3 = getelementptr inbounds i32, ptr %in, i64 9
  %load.4 = load i32, ptr %gep.3, align 4
  %load.5 = load i32, ptr %inn, align 4
  %gep.4 = getelementptr inbounds i32, ptr %inn, i64 1
  %load.6 = load i32, ptr %gep.4, align 4
  %gep.5 = getelementptr inbounds i32, ptr %inn, i64 2
  %load.7 = load i32, ptr %gep.5, align 4
  %gep.6 = getelementptr inbounds i32, ptr %inn, i64 3
  %load.8 = load i32, ptr %gep.6, align 4
  %mul.1 = mul i32 %load.1, %load.5
  %mul.2 = mul i32 %load.2, %load.6
  %mul.3 = mul i32 %load.3, %load.7
  %mul.4 = mul i32 %load.4, %load.8
  %gep.8 = getelementptr inbounds i32, ptr %out, i64 1
  %gep.9 = getelementptr inbounds i32, ptr %out, i64 2
  %gep.10 = getelementptr inbounds i32, ptr %out, i64 3
  store i32 %mul.1, ptr %gep.9, align 4
  store i32 %mul.2, ptr %out, align 4
  store i32 %mul.3, ptr %gep.10, align 4
  store i32 %mul.4, ptr %gep.8, align 4

  %ll = load i32, ptr %gep.1, align 4

  ret i32 undef
}

opt -S -mtriple=x86_64-unknown -mattr=+avx512vl -passes=slp-vectorizer ./repro.ll

define i32 @jumbled-load(ptr noalias nocapture %in, ptr noalias nocapture %inn, ptr noalias nocapture %out) #0 {
  %1 = insertelement <4 x ptr> poison, ptr %in, i32 0
  %2 = shufflevector <4 x ptr> %1, <4 x ptr> poison, <4 x i32> zeroinitializer
  %3 = getelementptr i32, <4 x ptr> %2, <4 x i64> <i64 3, i64 9, i64 0, i64 6>
  %4 = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %3, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison)
  %5 = load <4 x i32>, ptr %inn, align 4
  %6 = shufflevector <4 x i32> %5, <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
  %7 = mul <4 x i32> %4, %6
  store <4 x i32> %7, ptr %out, align 4
  %8 = extractelement <4 x ptr> %3, i32 0
  %ll = load i32, ptr %8, align 4
  ret i32 undef
}

For vector loads/stores we may not emit an extractelement, but we may still account for its cost. We need to exclude this extra cost for the GEPs with multiple uses.

vdmitrie added inline comments. Jan 3 2023, 11:57 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6911–6916	Ahh, I see. It was my misunderstanding. Thanks for the clarification. Basically, we need to take hasOneUse() into account only for regular loads and stores and not for the rest, because these are terminators and don't really generate a vector GEP. I'll make the necessary changes and update the patch.
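The rule agreed on above can be sketched as a small standalone model. This is only an illustration under stated assumptions: `PtrInfo`, `NodeKind`, `gepCostDiff` and the integer `AddCost` are hypothetical names that do not exist in LLVM, and the real routine asks TTI for the ADD cost instead of taking it as a parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the properties of a pointer operand that the
// cost routine inspects; none of these names come from LLVM.
struct PtrInfo {
  bool IsGEP;              // false: a plain address, no instruction behind it
  bool AllConstantIndices; // such GEPs are treated as free
  unsigned NumUses;        // number of users of the scalar GEP
};

// Kind of tree node whose pointers are being evaluated (assumed names).
enum class NodeKind { LoadOrStore, GEP };

// Returns the modeled cost difference; a negative value means profitable.
// AddCost stands in for the target's cost of one scalar ADD.
int gepCostDiff(const std::vector<PtrInfo> &Ptrs, std::size_t BaseIdx,
                NodeKind Kind, int AddCost) {
  assert(BaseIdx < Ptrs.size() && "base pointer must be one of the pointers");
  int Savings = 0;
  for (std::size_t I = 0; I < Ptrs.size(); ++I) {
    if (I == BaseIdx)
      continue; // the base address is not counted as a saving
    const PtrInfo &P = Ptrs[I];
    if (!P.IsGEP || P.AllConstantIndices)
      continue; // considered free in the scalar code already
    // Plain vector loads/stores do not vectorize their GEPs, so a GEP with
    // other users stays in the code and saves nothing.
    if (P.NumUses != 1 && Kind == NodeKind::LoadOrStore)
      continue;
    Savings += AddCost;
  }
  return -Savings;
}
```

With the same set of pointers, a multi-use GEP contributes a saving only when the node is a GEP node, which is the distinction the hasOneUse() discussion settles on.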

Address comment + rebase

Harbormaster completed remote builds in B205539: Diff 486091. Jan 3 2023, 4:42 PM

ABataev added inline comments. Jan 4 2023, 7:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6919–6925	Probably not related to this patch. Do we need to emit extractelement for GEP at all? Is not it better just to rebuild the scalar GEP?
6924–6925	Shall we drop this check? We still vectorize GEPs with multiple uses and then emit extractelement for them. The cost of the extractelement is calculated separately. So, when we calculate the cost for GEPs with multiple uses, we exclude them from the cost savings and then we add an extra cost for the extractelement. If we're still going to emit extractelement, we need to remove this check (the original GEP will be vectorized and removed and then the extractelement is generated). If it is better to keep the scalar copy, we need to remove the extractelement cost calculation and keep this check.

vdmitrie added inline comments. Jan 4 2023, 10:11 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6919–6925	> Probably not related to this patch. Do we need to emit extractelement for GEP at all? Is not it better just to rebuild the scalar GEP?

IMO, this is not a bad idea. We don't really need to rematerialize a scalar GEP. Instead we can leave the original scalar one and use it instead of generating an extract. This will also break the undesired dependency on the vector GEP. So generally it looks like room for improvement.

6924–6925	For plain vector loads and stores we do not vectorize GEPs and hence do not emit extractelement instructions. Instead, the scalar loads are removed, and the GEPs for which these loads (or stores) were the only users are removed as well. All the remaining GEPs stay in the code. When we build the vectorization tree we do not dive into the pointer arguments of loads or stores; these load or store nodes are terminal nodes. This is why I added the check for stores/loads, which will only return true for vector loads or stores.

define ptr @foo(ptr nocapture readonly %src, ptr nocapture %dst) local_unnamed_addr {
entry:
  %arrayidxA0 = getelementptr inbounds double, ptr %src, i64 0
  %A0 = load double, ptr %arrayidxA0, align 1
  %arrayidxA1 = getelementptr inbounds double, ptr %src, i64 1
  %A1 = load double, ptr %arrayidxA1, align 1
  %arrayidx0 = getelementptr inbounds double, ptr %dst, i64 0
  store double %A0, ptr %arrayidx0, align 16
  %arrayidx1 = getelementptr inbounds double, ptr %dst, i64 1
  store double %A1, ptr %arrayidx1, align 16
  ret ptr %arrayidxA1
}

We generate:

  %arrayidxA0 = getelementptr inbounds double, ptr %src, i64 0
  %arrayidxA1 = getelementptr inbounds double, ptr %src, i64 1
  %arrayidx0 = getelementptr inbounds double, ptr %dst, i64 0
  %0 = load <2 x double>, ptr %arrayidxA0, align 1
  store <2 x double> %0, ptr %arrayidx0, align 16
  ret ptr %arrayidxA1

We do not do the same for gather loads (aka ScatterVectorize), as there we do vectorize the GEPs and then emit extracts for external uses.

ABataev added inline comments. Jan 4 2023, 10:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925	Yes, for vector loads/stores it is so. But what about masked gather? We avoid the cost compensation here and then add the extractelement cost.

vdmitrie added inline comments. Jan 4 2023, 10:29 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925	For gather loads this routine is not called when we process the tree node with the set of loads. It is called when the tree node with the GEPs is processed. For GEPs the VL0 will not be a load. It was a bug in the cost modeling that we evaluated GEPs twice for gather loads, which is, in particular, what this patch fixes.
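A minimal sketch of the "count GEPs once" rule described above, assuming a tree that consists of one load node plus, for gathers only, a separate GEP node. The names `EntryState`, `loadNodeCost`, `gepNodeCost` and `treeCost` are illustrative and are not the vectorizer's API; the cost differences are plain integers supplied by the caller.

```cpp
#include <cassert>

// Hypothetical mirror of the two entry states discussed in the review.
enum class EntryState { Vectorize, ScatterVectorize };

// Contribution of the load node itself. A masked gather is not a terminal
// node, so its address operands are costed by their own GEP node instead.
int loadNodeCost(EntryState State, int LoadDiff, int GEPDiff) {
  if (State == EntryState::ScatterVectorize)
    return LoadDiff;
  return LoadDiff + GEPDiff;
}

// Contribution of the separate GEP node; in this model it exists only when
// the loads are turned into a masked gather.
int gepNodeCost(EntryState State, int GEPDiff) {
  return State == EntryState::ScatterVectorize ? GEPDiff : 0;
}

// Total for the two-node model: the GEP difference is counted exactly once.
int treeCost(EntryState State, int LoadDiff, int GEPDiff) {
  int Total =
      loadNodeCost(State, LoadDiff, GEPDiff) + gepNodeCost(State, GEPDiff);
  assert(Total == LoadDiff + GEPDiff && "GEP savings must be applied once");
  return Total;
}
```

Before the fix, the equivalent of `loadNodeCost` also included the GEP savings in the ScatterVectorize case, so the total subtracted them twice.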

ABataev added inline comments. Jan 4 2023, 10:35 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925	I mean not gather but masked gather, ScatterVectorize. One more thing. Shall we add the cost of the vector GEP? Currently we just subtract the cost of the scalar GEPs.

vdmitrie added inline comments. Jan 4 2023, 10:50 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925	> I mean not gather but masked gather, ScatterVectorize.

Yes, I understand you. By "gather loads" I meant exactly masked gather load (aka ScatterVectorize). So what I said still holds.

> One more thing. Shall we add the cost of the vector GEP? Currently we just subtract the cost of the scalar GEPs.

This can be considered for a follow-up change. To be more accurate, we do not count the base address now (see the first check in the loop), assuming that the cost of the vector GEP will be the same. So we don't add its cost to the scalar calculation and hence do not subtract it afterwards (it's kind of a shortcut). For plain vector loads and stores that is a correct estimate (for regular pointers we only generate a pointer cast, which I believe is free). For GEP nodes this estimate may not be quite correct.

ABataev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6924–6925	Yes, I overlooked it. Ok.
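The base-address shortcut mentioned above can be checked with simple arithmetic. The sketch below is a hedged illustration, not the patch's code: `fullAddressDiff` and `shortcutDiff` are hypothetical helpers, and all costs are made-up integers passed in by the caller.

```cpp
#include <cassert>

// Full accounting: the vector address computation minus every scalar
// address computation, including the base one.
int fullAddressDiff(int ScalarBaseCost, int OtherScalarGEPCost,
                    int VectorAddressCost) {
  assert(ScalarBaseCost >= 0 && OtherScalarGEPCost >= 0 &&
         VectorAddressCost >= 0 && "costs are modeled as non-negative");
  return VectorAddressCost - (ScalarBaseCost + OtherScalarGEPCost);
}

// Shortcut: skip the base address on both sides and only subtract the
// remaining scalar GEPs.
int shortcutDiff(int OtherScalarGEPCost) { return -OtherScalarGEPCost; }
```

The two agree whenever the vector address computation costs the same as the scalar base address, which is the assumption stated for plain vector loads and stores; when a vector GEP is more expensive, the shortcut overstates the savings by the difference.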

This revision is now accepted and ready to land. Jan 4 2023, 11:02 AM

Closed by commit rG6d677c0b3d91: [SLP] Unify GEP cost modeling for load, store and GEP nodes. (authored by vdmitrie). Jan 5 2023, 10:11 AM

This revision was automatically updated to reflect the committed changes.

vdmitrie added a commit: rG6d677c0b3d91: [SLP] Unify GEP cost modeling for load, store and GEP nodes.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	111 lines
llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/remark_gather-load-redux-cost.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll	2 lines

Diff 486617

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp


[6,893 lines not shown]	auto GetCostDiff =
LLVM_DEBUG(		LLVM_DEBUG(
dumpTreeCosts(E, CommonCost, VecCost - CommonCost, ScalarCost));		dumpTreeCosts(E, CommonCost, VecCost - CommonCost, ScalarCost));
// Disable warnings for `this` and `E` are unused. Required for		// Disable warnings for `this` and `E` are unused. Required for
// `dumpTreeCosts`.		// `dumpTreeCosts`.
(void)this;		(void)this;
(void)E;		(void)E;
return VecCost - ScalarCost;		return VecCost - ScalarCost;
};		};
		// Calculate cost difference from vectorizing set of GEPs.
		// Negative value means vectorizing is profitable.
		auto GetGEPCostDiff = [=](ArrayRef<Value *> Ptrs, Value *BasePtr) {
		InstructionCost CostSavings = 0;
		for (Value *V : Ptrs) {
		if (V == BasePtr)
		continue;
		auto *Ptr = dyn_cast<GetElementPtrInst>(V);
		// GEPs may contain just addresses without instructions, considered free.
		// GEPs with all constant indices also considered to have zero cost.
		if (!Ptr || Ptr->hasAllConstantIndices())
		continue;
		continue;

		// Here we differentiate two cases: when GEPs represent a regular
		// vectorization tree node (and hence vectorized) and when the set is
		// arguments of a set of loads or stores being vectorized. In the former
		// case all the scalar GEPs will be removed as a result of vectorization.
		// For any external uses of some lanes extract element instructions will
		// be generated (which cost is estimated separately). For the latter case
		// since the set of GEPs itself is not vectorized those used more than
		// once will remain staying in vectorized code as well. So we should not
		// count them as savings.
		if (!Ptr->hasOneUse() && isa<LoadInst, StoreInst>(VL0))
		continue;

		// TODO: it is target dependent, so need to implement and then use a TTI
		// interface.
		CostSavings += TTI->getArithmeticInstrCost(Instruction::Add,
		Ptr->getType(), CostKind);
		}
		LLVM_DEBUG(dbgs() << "SLP: Calculated GEPs cost savings or Tree:\n";
		E->dump());
		LLVM_DEBUG(dbgs() << "SLP: GEP cost saving = " << CostSavings << "\n");
		return InstructionCost() - CostSavings;
		};

switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
// Count reused scalars.		// Count reused scalars.
InstructionCost ScalarCost = 0;		InstructionCost ScalarCost = 0;
SmallPtrSet<const TreeEntry *, 4> CountedOps;		SmallPtrSet<const TreeEntry *, 4> CountedOps;
for (Value *V : VL) {		for (Value *V : VL) {
auto *PHI = dyn_cast<PHINode>(V);		auto *PHI = dyn_cast<PHINode>(V);
if (!PHI)		if (!PHI)
[258 lines not shown]	InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
case Instruction::URem:		case Instruction::URem:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor:		case Instruction::Xor: {
case Instruction::GetElementPtr: {
unsigned Opcode = ShuffleOrOp == Instruction::GetElementPtr
? static_cast<unsigned>(Instruction::Add)
: ShuffleOrOp;
auto GetScalarCost = [=](unsigned Idx) {		auto GetScalarCost = [=](unsigned Idx) {
auto *VI = dyn_cast<Instruction>(VL[Idx]);		auto *VI = cast<Instruction>(VL[Idx]);
// GEPs may contain just addresses without instructions, consider
// their cost 0.
if (!VI)
return InstructionCost();
unsigned OpIdx = isa<UnaryOperator>(VI) ? 0 : 1;		unsigned OpIdx = isa<UnaryOperator>(VI) ? 0 : 1;
TTI::OperandValueInfo Op1Info = TTI::getOperandInfo(VI->getOperand(0));		TTI::OperandValueInfo Op1Info = TTI::getOperandInfo(VI->getOperand(0));
TTI::OperandValueInfo Op2Info =		TTI::OperandValueInfo Op2Info =
TTI::getOperandInfo(VI->getOperand(OpIdx));		TTI::getOperandInfo(VI->getOperand(OpIdx));
SmallVector<const Value *> Operands(VI->operand_values());		SmallVector<const Value *> Operands(VI->operand_values());
return TTI->getArithmeticInstrCost(Opcode, ScalarTy, CostKind, Op1Info,		return TTI->getArithmeticInstrCost(ShuffleOrOp, ScalarTy, CostKind,
Op2Info, Operands, VI);		Op1Info, Op2Info, Operands, VI);
};		};
auto GetVectorCost = [=](InstructionCost CommonCost) {		auto GetVectorCost = [=](InstructionCost CommonCost) {
unsigned OpIdx = isa<UnaryOperator>(VL0) ? 0 : 1;		unsigned OpIdx = isa<UnaryOperator>(VL0) ? 0 : 1;
TTI::OperandValueInfo Op1Info = getOperandInfo(VL, 0);		TTI::OperandValueInfo Op1Info = getOperandInfo(VL, 0);
TTI::OperandValueInfo Op2Info = getOperandInfo(VL, OpIdx);		TTI::OperandValueInfo Op2Info = getOperandInfo(VL, OpIdx);
return TTI->getArithmeticInstrCost(Opcode, VecTy, CostKind, Op1Info,		return TTI->getArithmeticInstrCost(ShuffleOrOp, VecTy, CostKind, Op1Info,
Op2Info) +		Op2Info) +
CommonCost;		CommonCost;
};		};
return GetCostDiff(GetScalarCost, GetVectorCost);		return GetCostDiff(GetScalarCost, GetVectorCost);
}		}
		case Instruction::GetElementPtr: {
		return CommonCost + GetGEPCostDiff(VL, VL0);
		}
case Instruction::Load: {		case Instruction::Load: {
auto GetScalarCost = [=](unsigned Idx) {		auto GetScalarCost = [=](unsigned Idx) {
auto *VI = cast<LoadInst>(VL[Idx]);		auto *VI = cast<LoadInst>(VL[Idx]);
InstructionCost GEPCost = 0;		return TTI->getMemoryOpCost(Instruction::Load, ScalarTy, VI->getAlign(),
if (VI != VL0) {
auto *Ptr = dyn_cast<GetElementPtrInst>(VI->getPointerOperand());
if (Ptr && Ptr->hasOneUse() && !Ptr->hasAllConstantIndices())
GEPCost = TTI->getArithmeticInstrCost(Instruction::Add,
Ptr->getType(), CostKind);
}
return GEPCost +
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, VI->getAlign(),
VI->getPointerAddressSpace(), CostKind,		VI->getPointerAddressSpace(), CostKind,
TTI::OperandValueInfo(), VI);		TTI::OperandValueInfo(), VI);
};		};
auto GetVectorCost = [=](InstructionCost CommonCost) {
auto *LI0 = cast<LoadInst>(VL0);		auto *LI0 = cast<LoadInst>(VL0);
		auto GetVectorCost = [=](InstructionCost CommonCost) {
InstructionCost VecLdCost;		InstructionCost VecLdCost;
if (E->State == TreeEntry::Vectorize) {		if (E->State == TreeEntry::Vectorize) {
VecLdCost = TTI->getMemoryOpCost(		VecLdCost = TTI->getMemoryOpCost(
Instruction::Load, VecTy, LI0->getAlign(),		Instruction::Load, VecTy, LI0->getAlign(),
LI0->getPointerAddressSpace(), CostKind, TTI::OperandValueInfo());		LI0->getPointerAddressSpace(), CostKind, TTI::OperandValueInfo());
} else {		} else {
assert(E->State == TreeEntry::ScatterVectorize && "Unknown EntryState");		assert(E->State == TreeEntry::ScatterVectorize && "Unknown EntryState");
Align CommonAlignment = LI0->getAlign();		Align CommonAlignment = LI0->getAlign();
for (Value *V : VL)		for (Value *V : VL)
CommonAlignment =		CommonAlignment =
std::min(CommonAlignment, cast<LoadInst>(V)->getAlign());		std::min(CommonAlignment, cast<LoadInst>(V)->getAlign());
VecLdCost = TTI->getGatherScatterOpCost(		VecLdCost = TTI->getGatherScatterOpCost(
Instruction::Load, VecTy, LI0->getPointerOperand(),		Instruction::Load, VecTy, LI0->getPointerOperand(),
/*VariableMask=*/false, CommonAlignment, CostKind);		/*VariableMask=*/false, CommonAlignment, CostKind);
}		}
return VecLdCost + CommonCost;		return VecLdCost + CommonCost;
};		};
return GetCostDiff(GetScalarCost, GetVectorCost);
		InstructionCost Cost = GetCostDiff(GetScalarCost, GetVectorCost);
		// If this node generates masked gather load then it is not a terminal node.
		// Hence address operand cost is estimated separately.
		if (E->State == TreeEntry::ScatterVectorize)
		return Cost;

		// Estimate cost of GEPs since this tree node is a terminator.
		SmallVector<Value *> PointerOps(VL.size());
		for (auto [I, V] : enumerate(VL))
		PointerOps[I] = cast<LoadInst>(V)->getPointerOperand();
		return Cost + GetGEPCostDiff(PointerOps, LI0->getPointerOperand());
}		}
case Instruction::Store: {		case Instruction::Store: {
bool IsReorder = !E->ReorderIndices.empty();		bool IsReorder = !E->ReorderIndices.empty();
auto *SI = cast<StoreInst>(IsReorder ? VL[E->ReorderIndices.front()] : VL0);
auto GetScalarCost = [=](unsigned Idx) {		auto GetScalarCost = [=](unsigned Idx) {
auto *VI = cast<StoreInst>(VL[Idx]);		auto *VI = cast<StoreInst>(VL[Idx]);
InstructionCost GEPCost = 0;
if (VI != SI) {
auto *Ptr = dyn_cast<GetElementPtrInst>(VI->getPointerOperand());
if (Ptr && Ptr->hasOneUse() && !Ptr->hasAllConstantIndices())
GEPCost = TTI->getArithmeticInstrCost(Instruction::Add,
Ptr->getType(), CostKind);
}
TTI::OperandValueInfo OpInfo = getOperandInfo(VI, 0);		TTI::OperandValueInfo OpInfo = getOperandInfo(VI, 0);
return GEPCost + TTI->getMemoryOpCost(		return TTI->getMemoryOpCost(Instruction::Store, ScalarTy, VI->getAlign(),
Instruction::Store, ScalarTy, VI->getAlign(),		VI->getPointerAddressSpace(), CostKind,
VI->getPointerAddressSpace(), CostKind, OpInfo, VI);		OpInfo, VI);
};		};
		auto *BaseSI =
		cast<StoreInst>(IsReorder ? VL[E->ReorderIndices.front()] : VL0);
auto GetVectorCost = [=](InstructionCost CommonCost) {		auto GetVectorCost = [=](InstructionCost CommonCost) {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
TTI::OperandValueInfo OpInfo = getOperandInfo(VL, 0);		TTI::OperandValueInfo OpInfo = getOperandInfo(VL, 0);
return TTI->getMemoryOpCost(Instruction::Store, VecTy, SI->getAlign(),		return TTI->getMemoryOpCost(Instruction::Store, VecTy, BaseSI->getAlign(),
SI->getPointerAddressSpace(), CostKind,		BaseSI->getPointerAddressSpace(), CostKind,
OpInfo) +		OpInfo) +
CommonCost;		CommonCost;
};		};
return GetCostDiff(GetScalarCost, GetVectorCost);		SmallVector<Value *> PointerOps(VL.size());
		for (auto [I, V] : enumerate(VL)) {
		unsigned Idx = IsReorder ? E->ReorderIndices[I] : I;
		PointerOps[Idx] = cast<StoreInst>(V)->getPointerOperand();
		}

		return GetCostDiff(GetScalarCost, GetVectorCost) +
		GetGEPCostDiff(PointerOps, BaseSI->getPointerOperand());
}		}
case Instruction::Call: {		case Instruction::Call: {
auto GetScalarCost = [=](unsigned Idx) {		auto GetScalarCost = [=](unsigned Idx) {
auto *CI = cast<CallInst>(VL[Idx]);		auto *CI = cast<CallInst>(VL[Idx]);
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (ID != Intrinsic::not_intrinsic) {		if (ID != Intrinsic::not_intrinsic) {
IntrinsicCostAttributes CostAttrs(ID, *CI, 1);		IntrinsicCostAttributes CostAttrs(ID, *CI, 1);
return TTI->getIntrinsicInstrCost(CostAttrs, CostKind);		return TTI->getIntrinsicInstrCost(CostAttrs, CostKind);

llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -passes=slp-vectorizer -S -o - -mtriple=x86_64-unknown-linux -mcpu=haswell < %s | FileCheck %s			; RUN: opt -passes=slp-vectorizer -S -o - -mtriple=x86_64-unknown-linux -mcpu=haswell -slp-threshold=-3 < %s | FileCheck %s
				RKSimon: Having to add an slp-threshold to an existing default test makes me a little nervous, tbh.
				vdmitrie (author): I believe the purpose of this test is to show how vectorization goes without/with non-pow2 feature support (even though the feature is not ready yet), so we need to keep it vectorized. This is why I changed the threshold instead of regenerating the checks. We already have some tests that explicitly set the threshold for a similar reason, as reduced tests do not always behave as desired with the default threshold. In this particular case the behavior has changed because the unified GEP cost routine adds an ADD cost only for those GEPs that are used once, while previously it did that unconditionally.
	@e = dso_local local_unnamed_addr global i32 0, align 4			@e = dso_local local_unnamed_addr global i32 0, align 4
	@f = dso_local local_unnamed_addr global i32 0, align 4			@f = dso_local local_unnamed_addr global i32 0, align 4

	; Function Attrs: nofree norecurse nounwind uwtable			; Function Attrs: nofree norecurse nounwind uwtable
	define dso_local i32 @g() local_unnamed_addr {			define dso_local i32 @g() local_unnamed_addr {
	; CHECK-LABEL: @g(			; CHECK-LABEL: @g(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr @e, align 4			; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr @e, align 4
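
As a rough illustration of the point made in the inline reply on the RUN line (an ADD is charged only for GEPs that have a single use), here is a toy C++ model. It is not the SLPVectorizer code: `ToyGep`, `toyScalarGepCost`, and the unit cost of 1 per ADD are hypothetical names and values invented for this sketch. The constant-index exemption is borrowed from the removed store-side check (`hasOneUse()` and `!hasAllConstantIndices()`) and may not match the unified routine exactly.

```cpp
#include <vector>

// Hypothetical stand-in for a scalar GEP that feeds a load or store.
struct ToyGep {
  unsigned NumUses;        // how many users the GEP has
  bool AllConstantIndices; // true if every index is a constant
};

// Toy scalar-side address cost: one unit (modeling one ADD) per counted GEP.
// OnlySingleUse = true models "charge only GEPs that are used once";
// OnlySingleUse = false models charging regardless of use count.
int toyScalarGepCost(const std::vector<ToyGep> &Geps, bool OnlySingleUse) {
  int Cost = 0;
  for (const ToyGep &G : Geps) {
    // Assumption: constant-index GEPs are treated as free in this toy model.
    if (G.AllConstantIndices)
      continue;
    // A GEP with other users would remain after vectorization, so removing
    // it cannot be counted as a saving under the single-use policy.
    if (OnlySingleUse && G.NumUses != 1)
      continue;
    Cost += 1;
  }
  return Cost;
}
```

For three GEPs where one has two uses and one has only constant indices, the single-use policy charges one ADD and the use-count-agnostic policy charges two. A smaller scalar-side estimate means a smaller modeled saving from vectorization, which is consistent with this test needing an explicit threshold to stay vectorized.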

llvm/test/Transforms/SLPVectorizer/X86/remark_gather-load-redux-cost.ll

	; CHECK-NEXT: ret i32 [[TMP8]]			; CHECK-NEXT: ret i32 [[TMP8]]
	;			;
	; YAML: --- !Passed			; YAML: --- !Passed
	; YAML-NEXT: Pass: slp-vectorizer			; YAML-NEXT: Pass: slp-vectorizer
	; YAML-NEXT: Name: VectorizedHorizontalReduction			; YAML-NEXT: Name: VectorizedHorizontalReduction
	; YAML-NEXT: Function: test			; YAML-NEXT: Function: test
	; YAML-NEXT: Args:			; YAML-NEXT: Args:
	; YAML-NEXT: - String: 'Vectorized horizontal reduction with cost '			; YAML-NEXT: - String: 'Vectorized horizontal reduction with cost '
	; YAML-NEXT: - Cost: '-17'			; YAML-NEXT: - Cost: '-3'
	; YAML-NEXT: - String: ' and with tree size '			; YAML-NEXT: - String: ' and with tree size '
	; YAML-NEXT: - TreeSize: '7'			; YAML-NEXT: - TreeSize: '7'
	entry:			entry:
	%off0.1 = getelementptr inbounds i32, ptr %addr, i32 1			%off0.1 = getelementptr inbounds i32, ptr %addr, i32 1
	%idx0 = load i32, ptr %off0.1, align 8			%idx0 = load i32, ptr %off0.1, align 8
	%gep0 = getelementptr inbounds i32, ptr %p, i32 %idx0			%gep0 = getelementptr inbounds i32, ptr %p, i32 %idx0
	%ld0 = load i32, ptr %gep0, align 4			%ld0 = load i32, ptr %gep0, align 4


llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -passes=slp-vectorizer < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-12 | FileCheck %s			; RUN: opt -S -passes=slp-vectorizer < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-13 | FileCheck %s

	define void @test(i1 %c, ptr %arg) {			define void @test(i1 %c, ptr %arg) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: br i1 [[C:%.*]], label [[IF:%.*]], label [[ELSE:%.*]]			; CHECK-NEXT: br i1 [[C:%.*]], label [[IF:%.*]], label [[ELSE:%.*]]
	; CHECK: if:			; CHECK: if:
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x ptr> poison, ptr [[ARG:%.*]], i32 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x ptr> poison, ptr [[ARG:%.*]], i32 0
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x ptr> [[TMP1]], <4 x ptr> poison, <4 x i32> zeroinitializer			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x ptr> [[TMP1]], <4 x ptr> poison, <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i8, <4 x ptr> [[SHUFFLE]], <4 x i64> <i64 32, i64 24, i64 8, i64 0>			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i8, <4 x ptr> [[SHUFFLE]], <4 x i64> <i64 32, i64 24, i64 8, i64 0>

Revision Contents

Diff 486617

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/X86/geps-non-pow-2.ll

llvm/test/Transforms/SLPVectorizer/X86/remark_gather-load-redux-cost.ll

llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll
