This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve costs of vectorized loads/stores by analyzing GEPs.
Closed · Public

Authored by ABataev on Oct 5 2022, 9:29 AM.

Details

Reviewers

RKSimon
vdmitrie

Commits

rGc787986cddce: [SLP]Improve costs of vectorized loads/stores by analyzing GEPs.

Summary

When generating masked gather nodes, the SLP vectorizer accounts for the cost
of the GEPs for loads as part of the scalar-to-vector transformation cost
estimation. It does not do so for vectorized loads/stores, even though
vectorization may remove some of the GEPs completely. Because of this, in
some cases a masked gather operation can look much more profitable than
regular vectorization (masked-gather cost + vector GEP - (scalar
loads + GEPs), compared to vectorized loads - scalar loads).
Added the analysis of the removed scalar GEPs for vectorized load/store nodes for better cost estimation.
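
The comparison in the summary can be sketched with a toy cost model. This is only an illustration: the struct, the function names, and all unit costs below are hypothetical and are not taken from SLPVectorizer.cpp or TargetTransformInfo. It also assumes every scalar GEP is counted on the scalar side of the gather comparison.

```cpp
#include <cassert>

// Hypothetical unit costs; in LLVM the real values come from the target's
// cost model, not from constants like these.
struct Costs {
  int ScalarLoad;   // one scalar load
  int ScalarGEP;    // one scalar GEP, modeled as an ADD
  int VectorLoad;   // one wide load of consecutive elements
  int MaskedGather; // one masked gather
  int VectorGEP;    // one vector GEP feeding the gather
};

// Made-up example values, chosen only to show the effect.
constexpr Costs Example{1, 1, 1, 3, 1};

// Gather node: vector cost minus scalar cost, with the scalar GEPs counted.
constexpr int gatherDelta(const Costs &C, int NumElts) {
  return (C.MaskedGather + C.VectorGEP) -
         NumElts * (C.ScalarLoad + C.ScalarGEP);
}

// Vectorized load node when the scalar GEPs are ignored.
constexpr int vectorLoadDeltaBefore(const Costs &C, int NumElts) {
  return C.VectorLoad - NumElts * C.ScalarLoad;
}

// Vectorized load node when the GEPs that disappear after vectorization
// are also counted on the scalar side.
constexpr int vectorLoadDeltaAfter(const Costs &C, int NumElts,
                                   int NumRemovedGEPs) {
  return C.VectorLoad -
         (NumElts * C.ScalarLoad + NumRemovedGEPs * C.ScalarGEP);
}
```

With these made-up numbers and four elements, the gather node scores -4 and the consecutive load scores -3 when GEPs are ignored, so the gather looks better. Counting three removed GEPs moves the consecutive load to -6.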

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. Oct 5 2022, 9:29 AM

Herald added a project: Restricted Project. Oct 5 2022, 9:29 AM

Herald added subscribers: vporpo, hiraditya.

ABataev requested review of this revision. Oct 5 2022, 9:29 AM

Herald added a project: Restricted Project. Oct 5 2022, 9:29 AM

Herald added a subscriber: pcwang-thead.

Harbormaster completed remote builds in B190516: Diff 465438. Oct 5 2022, 11:13 AM

RKSimon added inline comments. Oct 12 2022, 4:32 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6654	Ptr->hasAllConstantIndices() ?

RKSimon added inline comments. Oct 12 2022, 4:35 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6646	Better to use OperandValueInfo() instead of the initialization?

Address comments

RKSimon added inline comments. Oct 12 2022, 6:58 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6691	hasAllConstantIndices
11545	hasAllConstantIndices

Harbormaster completed remote builds in B191721: Diff 467122. Oct 12 2022, 7:39 AM

Address comments

Harbormaster completed remote builds in B191738: Diff 467147. Oct 12 2022, 8:40 AM

LGTM - in the medium term I think we should be trying to move more of this into getGEPCost - but that callback needs improvement first as it's barely used.

This revision is now accepted and ready to land. Oct 13 2022, 1:29 AM

Closed by commit rGc787986cddce: [SLP]Improve costs of vectorized loads/stores by analyzing GEPs. (authored by ABataev). Oct 13 2022, 7:24 AM

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGc787986cddce: [SLP]Improve costs of vectorized loads/stores by analyzing GEPs..

vdmitrie added inline comments. Nov 17 2022, 6:31 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	We see quite a significant performance regression related to this patch. It does not look like the right adjustment. For x86 specifically these GEPs cost nothing, as they end up merely as different displacement values in memory operands. So the bias towards vectorization isn't justified for plain loads and stores. It can be seen even for the test case test/Transforms/SLPVectorizer/X86/remark_not_all_parts.ll. Vectorization makes the code less profitable here. This already existed before the patch, but although the cost modeling tipped over to vectorization, it was close enough to say "not profitable". Now we have even more bias.

Vectorized code:

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        subq $136, %rsp
 1      0     0.17                        xorl %ecx, %ecx
 1      0     0.17                        xorl %eax, %eax
 1      5     0.50    *                   movq (%rdi,%rcx), %xmm0
 1      5     0.50    *                   movq 16(%rdi,%rcx), %xmm1
 1      1     0.33                        paddd %xmm0, %xmm1
 1      2     1.00                        movd %xmm1, %edx
 1      1     0.25                        addl %eax, %edx
 2      1     1.00           *            movq %xmm1, -128(%rsp,%rcx)
 1      1     1.00                        pshufd $85, %xmm1, %xmm0
 1      2     1.00                        movd %xmm0, %eax
 1      1     0.25                        addl %edx, %eax
 1      1     0.25                        addq $32, %rcx
 1      1     0.25                        cmpq $256, %rcx
 1      1     0.50                        jne .LBB0_1
 1      1     0.25                        addq $136, %rsp
 3      7     1.00                  U     retq

Original:

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        subq $136, %rsp
 1      0     0.17                        xorl %eax, %eax
 1      1     0.25                        movq $-256, %rcx
 1      5     0.50    *                   movl 272(%rdi,%rcx), %edx
 2      6     0.50    *                   addl 256(%rdi,%rcx), %edx
 1      1     1.00           *            movl %edx, 128(%rsp,%rcx)
 1      5     0.50    *                   movl 276(%rdi,%rcx), %esi
 2      6     0.50    *                   addl 260(%rdi,%rcx), %esi
 1      1     1.00           *            movl %esi, 132(%rsp,%rcx)
 1      1     0.25                        addl %esi, %eax
 1      1     0.25                        addl %edx, %eax
 1      1     0.25                        addq $32, %rcx
 1      1     0.50                        jne .LBB0_1
 1      1     0.25                        addq $136, %rsp
 3      7     1.00                  U     retq
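
The displacement argument in the comment above can be illustrated with a small model of an x86-64 memory operand of the form disp(base, index, scale). This is a simplified sketch, not compiler code: the MemOperand struct and effectiveAddress helper are hypothetical names introduced only for this example.

```cpp
#include <cstdint>

// Simplified model of an x86-64 memory operand: disp(base, index, scale).
// Hypothetical type, for illustration only.
struct MemOperand {
  std::uint64_t Base;
  std::uint64_t Index;
  std::uint64_t Scale;
  std::int32_t Disp;
};

// Address = base + index * scale + sign-extended displacement, computed with
// wrapping 64-bit arithmetic as the hardware does.
constexpr std::uint64_t effectiveAddress(const MemOperand &M) {
  return M.Base + M.Index * M.Scale +
         static_cast<std::uint64_t>(static_cast<std::int64_t>(M.Disp));
}
```

Two operands such as 272(%rdi,%rcx) and 276(%rdi,%rcx) from the listing share the base and index registers and differ only in the displacement field. In this model the neighbouring element is reached by changing Disp, with no separate add instruction.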

ABataev added inline comments. Nov 18 2022, 3:25 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	I would say that was expected. That's why there was a discussion about using getGEPCost instead of this. This change just syncs the cost estimation for masked gathers and vector loads. As you noted, we already had the issue with the GEP costs. We need to fix this. It would help if Intel would implement their part of getGEPCost, so we can start using it here for better cost estimation.

vdmitrie added inline comments. Nov 18 2022, 9:39 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	You got me wrong. I did not say we already had issues with GEPs. What I did say is: we already had issues with vectorizing sequences when we should not. Most issues with these wrongful vectorizations come from the shuffles and permutations generated. And the example, which is the test case in this patch (remark_not_all_parts.ll), merely shows that the vectorized code is worse than the original. And I don't think that you anticipated a 60% performance regression from this patch. But that is what we have now. I suggest you revert this change for these reasons:
- The huge regression it introduced. It makes bad things even worse. The cost of inserts and permutations on integer vectors seems underestimated. That is where most regressions come from. But adding unjustified bias to the cost towards vectorization makes the problem even worse.
- The CM heuristic added does not reflect real things: it goes into the displacement part of a memory operand, which costs nothing.
- The test cases which changed within this patch do not show where the patch would help. Moreover, they create the impression that nothing has changed. But that isn't the case.

Can you show any real test case which shows how this patch improves vectorization? Aligning with gather loads is not enough justification IMO. Maybe gathers have this same issue? Can you point at a test case for gather loads?

ABataev added inline comments. Nov 18 2022, 9:47 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	It is not meant to improve the vectorization but to fix the cost difference between vector loads and masked gathers. If we're going to revert it, we need to remove the GEP cost estimation for masked gathers. Otherwise there are cases where consecutive loads are less profitable than the masked gather, and that leads to perf regressions.

vdmitrie added inline comments. Nov 18 2022, 10:04 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	Yes, we had better focus on these cases and find out why gather loads look more profitable (instead of making unit-stride loads look less profitable). It can be because of the same issue with GEPs, or it can be that the gather load cost itself is too optimistic. For gather loads the indices are basically populated at run time, so this per-element ADD cost should go into the gather load rather than the scalar loads. Although it most likely will be a single load + ADD of displacements stored in the constant pool.

ABataev added inline comments. Nov 18 2022, 10:11 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	The reason is known: we need to fix the cost estimation for GEPs, and we need to fix the getGEPCost function. Without this we are overoptimistic about GEPs. We need to fix the cost in getGEPCost and use it in the cost estimation. This will fix the regression introduced in this patch and fix the general problem. Reverting the patch won't fix the issue, it will just hide it.

vdmitrie added inline comments. Nov 18 2022, 10:22 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	Could you point out the test case (where GEPs are incorrectly estimated), please? So far the general problem that I see is that the scalar load cost is overestimated, and that adds too much bias towards vectorization.

ABataev added inline comments. Nov 18 2022, 10:37 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	I don't remember exactly right now; I found it while working on strided loads vectorization. We ignored the cost of the GEPs for vector loads, and because of that vector loads became less profitable than the masked gather with vectorized GEPs (the strided load cost is higher than the vector load cost, but vectorized GEPs are currently more profitable than the scalar ones, since the cost of each GEP is currently calculated as the cost of an ADD).

> So far the general problem that I see is that the scalar load cost is overestimated, and that adds too much bias towards vectorization.

Yes, right. And we need to fix a couple of things about it: the cost of GEPs (which are free in many cases on X86) and the cost of scalar/vector loads (which are also free in many cases on X86). But I just don't have enough time to do it. I would appreciate it if you could try to implement the GEP cost for X86, so we could switch to getGEPCost instead of using the cost of a simple ADD for GEPs.

vdmitrie added inline comments. Nov 18 2022, 2:32 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	To be honest, I still do not understand what the problem with GEPs is. GEPs only have a cost when the stride is unknown. But if we end up with a "vectorize" state node here, we are already assured that the stride is known and that it is unit stride (i.e. we load or store adjacent elements in memory). That equally applies to loads and stores, but the thing is that here in SLP we don't issue scatter stores yet (if I've not missed something). So what problem you meant to solve by applying this GEP adjustment to stores is still unclear. It's unfortunate that you lost the test case that motivated this patch. Any chance to recover it somehow?

ABataev added inline comments. Nov 18 2022, 2:37 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	> To be honest, I still do not understand what the problem with GEPs is.

A different cost model for GEPs for masked loads and vector loads.

> GEPs only have a cost when the stride is unknown.

For X86, but there are other archs.

> But if we end up with a "vectorize" state node here, we are already assured that the stride is known and that it is unit stride (i.e. we load or store adjacent elements in memory). That equally applies to loads and stores, but the thing is that here in SLP we don't issue scatter stores yet (if I've not missed something). So what problem you meant to solve by applying this GEP adjustment to stores is still unclear. It's unfortunate that you lost the test case that motivated this patch. Any chance to recover it somehow?

I did not lose it, I just need some time to find it.

vdmitrie added inline comments. Nov 18 2022, 5:25 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	> A different cost model for GEPs for masked loads and vector loads.

Different does not mean incorrect.

> For X86, but there are other archs.

Thanks for admitting this was an inappropriate place to fix the problem. It should be a part of the target-dependent TTI implementation for the target that you meant to fix. Now it equally applies to all targets.

ABataev added inline comments. Nov 18 2022, 5:42 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	> > A different cost model for GEPs for masked loads and vector loads.
> Different does not mean incorrect.

In this case it is incorrect.

> > For X86, but there are other archs.
> Thanks for admitting this was an inappropriate place to fix the problem. It should be a part of the target-dependent TTI implementation for the target that you meant to fix. Now it equally applies to all targets.

That's why I asked you to help with the implementation of getGEPCost in TTI for X86, so we could use it here instead of the ADD cost.
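
The disagreement above is about where the GEP cost should be decided: in target-independent code as the cost of an ADD, or behind a target hook. A minimal sketch of the second option follows. The class and function names are hypothetical and this is NOT the LLVM TargetTransformInfo API; it only models the idea of a per-target override.

```cpp
// Hypothetical stand-in for a target cost interface.
struct TargetCostModel {
  virtual ~TargetCostModel() = default;
  // Assumed unit cost of an integer ADD.
  virtual int addCost() const { return 1; }
  // Default behaviour: every GEP is charged like an ADD.
  virtual int gepCost(bool FoldsIntoAddress) const {
    (void)FoldsIntoAddress;
    return addCost();
  }
};

// A target where a GEP that folds into the addressing mode is free.
struct X86LikeCostModel : TargetCostModel {
  int gepCost(bool FoldsIntoAddress) const override {
    return FoldsIntoAddress ? 0 : addCost();
  }
};

// Cost credited to the scalar side for GEPs that vectorization removes.
inline int removedGEPCost(const TargetCostModel &TCM, int NumRemovedGEPs,
                          bool FoldsIntoAddress) {
  return NumRemovedGEPs * TCM.gepCost(FoldsIntoAddress);
}
```

With the default model, three removed GEPs add 3 to the scalar cost regardless of how they lower. With the X86-like override they add 0 when they fold into the address, so they no longer bias the comparison towards vectorization.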

vdmitrie added inline comments. Nov 21 2022, 11:28 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	I'm willing to work on it. But I cannot begin earlier than a couple of weeks from now (not before Dec 6th, to be precise). And we need to meet somewhere to discuss, share ideas and sync on this issue (Phab is not the right place for that). It would be nice if you could find test cases that can be reduced (if not already) to showcase the issue to address.

ABataev added inline comments. Nov 22 2022, 6:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656	Sounds good. We can discuss it via e-mail or set up a meeting.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	53 lines
llvm/test/Transforms/SLPVectorizer/X86/remark_horcost.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/remark_not_all_parts.ll	2 lines

Diff 467464

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

...
public:
/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
const SmallDenseSet<Value *> &UserIgnoreLst);		const SmallDenseSet<Value *> &UserIgnoreLst);

/// Construct a vectorizable tree that starts at \p Roots.		/// Construct a vectorizable tree that starts at \p Roots.
void buildTree(ArrayRef<Value *> Roots);		void buildTree(ArrayRef<Value *> Roots);

		/// Checks if the very first tree node is going to be vectorized.
		bool isVectorizedFirstNode() const {
		return !VectorizableTree.empty() &&
		VectorizableTree.front()->State == TreeEntry::Vectorize;
		}

		/// Returns the main instruction for the very first node.
		Instruction *getFirstNodeMainOp() const {
		assert(!VectorizableTree.empty() && "No tree to get the first node from");
		return VectorizableTree.front()->getMainOp();
		}

/// Builds external uses of the vectorized scalars, i.e. the list of		/// Builds external uses of the vectorized scalars, i.e. the list of
/// vectorized scalars to be extracted, their lanes and their scalar users. \p		/// vectorized scalars to be extracted, their lanes and their scalar users. \p
/// ExternallyUsedValues contains additional list of external uses to handle		/// ExternallyUsedValues contains additional list of external uses to handle
/// vectorization of reductions.		/// vectorization of reductions.
void		void
buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});		buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
...
case Instruction::Load: {
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, Alignment, 0,		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, Alignment, 0,
CostKind, {TTI::OK_AnyValue, TTI::OP_None}, VL0);		CostKind, {TTI::OK_AnyValue, TTI::OP_None}, VL0);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
CommonCost -= (EntryVF - VL.size()) * ScalarEltCost;		CommonCost -= (EntryVF - VL.size()) * ScalarEltCost;
}		}
InstructionCost ScalarLdCost = VecTy->getNumElements() * ScalarEltCost;		InstructionCost ScalarLdCost = VecTy->getNumElements() * ScalarEltCost;
InstructionCost VecLdCost;		InstructionCost VecLdCost;
if (E->State == TreeEntry::Vectorize) {		if (E->State == TreeEntry::Vectorize) {
VecLdCost = TTI->getMemoryOpCost(Instruction::Load, VecTy, Alignment, 0,		VecLdCost =
CostKind, {TTI::OK_AnyValue, TTI::OP_None}, VL0);		TTI->getMemoryOpCost(Instruction::Load, VecTy, Alignment, 0,
		CostKind, TTI::OperandValueInfo(), VL0);
		for (Value *V : VL) {
		auto *VI = cast<LoadInst>(V);
		// Add the costs of scalar GEP pointers, to be removed from the code.
		if (VI == VL0)
		continue;
		auto *Ptr = dyn_cast<GetElementPtrInst>(VI->getPointerOperand());
		if (!Ptr || !Ptr->hasOneUse() || Ptr->hasAllConstantIndices())
		continue;
		ScalarLdCost += TTI->getArithmeticInstrCost(Instruction::Add,
		Ptr->getType(), CostKind);
		}
} else {		} else {
assert(E->State == TreeEntry::ScatterVectorize && "Unknown EntryState");		assert(E->State == TreeEntry::ScatterVectorize && "Unknown EntryState");
Align CommonAlignment = Alignment;		Align CommonAlignment = Alignment;
for (Value *V : VL)		for (Value *V : VL)
CommonAlignment =		CommonAlignment =
std::min(CommonAlignment, cast<LoadInst>(V)->getAlign());		std::min(CommonAlignment, cast<LoadInst>(V)->getAlign());
VecLdCost = TTI->getGatherScatterOpCost(		VecLdCost = TTI->getGatherScatterOpCost(
Instruction::Load, VecTy, cast<LoadInst>(VL0)->getPointerOperand(),		Instruction::Load, VecTy, cast<LoadInst>(VL0)->getPointerOperand(),
/*VariableMask=*/false, CommonAlignment, CostKind, VL0);		/*VariableMask=*/false, CommonAlignment, CostKind, VL0);
}		}
LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecLdCost, ScalarLdCost));		LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecLdCost, ScalarLdCost));
return CommonCost + VecLdCost - ScalarLdCost;		return CommonCost + VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
bool IsReorder = !E->ReorderIndices.empty();		bool IsReorder = !E->ReorderIndices.empty();
auto *SI =		auto *SI =
cast<StoreInst>(IsReorder ? VL[E->ReorderIndices.front()] : VL0);		cast<StoreInst>(IsReorder ? VL[E->ReorderIndices.front()] : VL0);
Align Alignment = SI->getAlign();		Align Alignment = SI->getAlign();
InstructionCost ScalarStCost = 0;		InstructionCost ScalarStCost = 0;
for (auto *V : VL) {		for (auto *V : VL) {
auto *VI = cast<Instruction>(V);		auto *VI = cast<StoreInst>(V);
TTI::OperandValueInfo OpInfo = TTI::getOperandInfo(VI->getOperand(0));		TTI::OperandValueInfo OpInfo = TTI::getOperandInfo(VI->getOperand(0));
ScalarStCost +=		ScalarStCost +=
TTI->getMemoryOpCost(Instruction::Store, ScalarTy, Alignment, 0,		TTI->getMemoryOpCost(Instruction::Store, ScalarTy, Alignment, 0,
CostKind, OpInfo, VI);		CostKind, OpInfo, VI);
		// Add the costs of scalar GEP pointers, to be removed from the code.
		if (VI == SI)
		continue;
		auto *Ptr = dyn_cast<GetElementPtrInst>(VI->getPointerOperand());
		if (!Ptr || !Ptr->hasOneUse() || Ptr->hasAllConstantIndices())
		continue;
		ScalarStCost += TTI->getArithmeticInstrCost(Instruction::Add,
		Ptr->getType(), CostKind);
}		}
TTI::OperandValueInfo OpInfo = getOperandInfo(VL, 0);		TTI::OperandValueInfo OpInfo = getOperandInfo(VL, 0);
InstructionCost VecStCost =		InstructionCost VecStCost =
TTI->getMemoryOpCost(Instruction::Store, VecTy, Alignment, 0, CostKind,		TTI->getMemoryOpCost(Instruction::Store, VecTy, Alignment, 0, CostKind,
OpInfo);		OpInfo);
LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecStCost, ScalarStCost));		LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecStCost, ScalarStCost));
return CommonCost + VecStCost - ScalarStCost;		return CommonCost + VecStCost - ScalarStCost;
}		}
...
for (unsigned I = 0, E = ReducedVals.size(); I < E; ++I) {
RdxFMF.set();		RdxFMF.set();
for (Value *U : IgnoreList)		for (Value *U : IgnoreList)
if (auto *FPMO = dyn_cast<FPMathOperator>(U))		if (auto *FPMO = dyn_cast<FPMathOperator>(U))
RdxFMF &= FPMO->getFastMathFlags();		RdxFMF &= FPMO->getFastMathFlags();
// Estimate cost.		// Estimate cost.
InstructionCost TreeCost = V.getTreeCost(VL);		InstructionCost TreeCost = V.getTreeCost(VL);
InstructionCost ReductionCost =		InstructionCost ReductionCost =
getReductionCost(TTI, VL, ReduxWidth, RdxFMF);		getReductionCost(TTI, VL, ReduxWidth, RdxFMF);
		if (V.isVectorizedFirstNode() && isa<LoadInst>(VL.front())) {
		Instruction *MainOp = V.getFirstNodeMainOp();
		for (Value *V : VL) {
		auto *VI = dyn_cast<LoadInst>(V);
		// Add the costs of scalar GEP pointers, to be removed from the
		// code.
		if (!VI || VI == MainOp)
		continue;
		auto *Ptr = dyn_cast<GetElementPtrInst>(VI->getPointerOperand());
		if (!Ptr || !Ptr->hasOneUse() || Ptr->hasAllConstantIndices())
		continue;
		TreeCost -= TTI->getArithmeticInstrCost(
		Instruction::Add, Ptr->getType(), TTI::TCK_RecipThroughput);
		}
		}
InstructionCost Cost = TreeCost + ReductionCost;		InstructionCost Cost = TreeCost + ReductionCost;
LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for reduction\n");		LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for reduction\n");
if (!Cost.isValid()) {		if (!Cost.isValid()) {
return nullptr;		return nullptr;
}		}
if (Cost >= -SLPCostThreshold) {		if (Cost >= -SLPCostThreshold) {
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
return OptimizationRemarkMissed(		return OptimizationRemarkMissed(
...

llvm/test/Transforms/SLPVectorizer/X86/remark_horcost.ll

...
for.body: ; preds = %for.body, %entry
%add52 = add nsw i32 %add38, %add45		%add52 = add nsw i32 %add38, %add45

; YAML: --- !Passed		; YAML: --- !Passed
; YAML-NEXT: Pass: slp-vectorizer		; YAML-NEXT: Pass: slp-vectorizer
; YAML-NEXT: Name: StoresVectorized		; YAML-NEXT: Name: StoresVectorized
; YAML-NEXT: Function: foo		; YAML-NEXT: Function: foo
; YAML-NEXT: Args:		; YAML-NEXT: Args:
; YAML-NEXT: - String: 'Stores SLP vectorized with cost '		; YAML-NEXT: - String: 'Stores SLP vectorized with cost '
; YAML-NEXT: - Cost: '-5'		; YAML-NEXT: - Cost: '-14'
; YAML-NEXT: - String: ' and with tree size '		; YAML-NEXT: - String: ' and with tree size '
; YAML-NEXT: - TreeSize: '4'		; YAML-NEXT: - TreeSize: '4'

; YAML: --- !Passed		; YAML: --- !Passed
; YAML-NEXT: Pass: slp-vectorizer		; YAML-NEXT: Pass: slp-vectorizer
; YAML-NEXT: Name: VectorizedHorizontalReduction		; YAML-NEXT: Name: VectorizedHorizontalReduction
; YAML-NEXT: Function: foo		; YAML-NEXT: Function: foo
; YAML-NEXT: Args:		; YAML-NEXT: Args:
...

llvm/test/Transforms/SLPVectorizer/X86/remark_not_all_parts.ll

...
for.body: ; preds = %for.body, %entry
store i32 %add17, i32* %arrayidx20, align 4		store i32 %add17, i32* %arrayidx20, align 4
%add24 = add nsw i32 %add10, %add17		%add24 = add nsw i32 %add10, %add17

; YAML: Pass: slp-vectorizer		; YAML: Pass: slp-vectorizer
; YAML-NEXT: Name: StoresVectorized		; YAML-NEXT: Name: StoresVectorized
; YAML-NEXT: Function: foo		; YAML-NEXT: Function: foo
; YAML-NEXT: Args:		; YAML-NEXT: Args:
; YAML-NEXT: - String: 'Stores SLP vectorized with cost '		; YAML-NEXT: - String: 'Stores SLP vectorized with cost '
; YAML-NEXT: - Cost: '-1'		; YAML-NEXT: - Cost: '-4'
; YAML-NEXT: - String: ' and with tree size '		; YAML-NEXT: - String: ' and with tree size '
; YAML-NEXT: - TreeSize: '4'		; YAML-NEXT: - TreeSize: '4'

%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 8		%exitcond = icmp eq i64 %indvars.iv.next, 8
br i1 %exitcond, label %for.end, label %for.body		br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body		for.end: ; preds = %for.body
%arraydecay = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* %m2, i64 0, i64 0		%arraydecay = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* %m2, i64 0, i64 0
ret i32 %add24		ret i32 %add24
}		}