
[CostModel] Add basic implementation of getGatherScatterOpCost.
ClosedPublic

Authored by fhahn on Nov 23 2020, 11:01 AM.

Details

Summary

Add a basic implementation of getGatherScatterOpCost to BasicTTIImpl.

The implementation estimates the cost of scalarizing the loads/stores,
the cost of packing/extracting the individual lanes, and the cost of
selecting only the enabled lanes.
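The estimate described above can be sketched roughly as follows. This is an illustrative sketch only: the function name, parameters, and unit costs are assumptions for exposition, not BasicTTIImpl's actual interface, which queries the target for each component cost.

```cpp
#include <cassert>

// Rough sketch of the scalarization-based gather/scatter estimate:
// per lane, extract the address and do a scalar memory op; for loads,
// insert each scalar result back into the result vector; finally,
// select only the enabled lanes. All unit costs here are placeholders.
unsigned estimateScalarizedGatherCost(unsigned NumLanes, unsigned MemOpCost,
                                      unsigned ExtractCost,
                                      unsigned InsertCost,
                                      unsigned SelectCost) {
  // One address extract plus one scalar load/store per lane.
  unsigned Cost = NumLanes * (ExtractCost + MemOpCost);
  // Packing the scalar results back into the result vector.
  Cost += NumLanes * InsertCost;
  // A vector select to keep only the enabled lanes.
  Cost += SelectCost;
  return Cost;
}
```

For example, a 4-lane gather with unit costs for every component would come out at 4*(1+1) + 4*1 + 1 = 13, rather than the flat cost of 1 returned before this patch.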

This more accurately reflects the current cost on targets like AArch64.

Diff Detail

Event Timeline

fhahn created this revision.Nov 23 2020, 11:01 AM
Herald added a project: Restricted Project. Nov 23 2020, 11:01 AM
fhahn requested review of this revision.Nov 23 2020, 11:01 AM

Hmm. I still don't think it's a great idea to try to SLP vectorize using gather operations if the target does not have any gathers! Having a better base cost does sound good though.

I think with a non-constant mask, the cost should be higher? I don't think you can do it with just a select; it will involve i1 extracts and branches over whether the lanes are on or off.

fhahn updated this revision to Diff 307294.Nov 24 2020, 2:58 AM

Updated cost computation for variable masks, also include the cost of extracting the addresses from the address vector.

Hmm. I still don't think it's a great idea to try to SLP vectorize using gather operations if the target does not have any gathers! Having a better base cost does sound good though.

I think how we treat it in SLP exactly is a separate issue. I think assigning a more accurate cost will go a long way towards avoiding generating gathers in practice, unless they are beneficial even with the bad lowering.

I think with a non-constant mask, the cost should be higher? I don't think you can do it with just a select; it will involve i1 extracts and branches over whether the lanes are on or off.

Yes, it probably should better account for the fact that the mem ops need to be executed conditionally. Not really sure how best to cost that. I updated the cost to include extracting the individual conditions, doing a branch, and a PHI to combine the result. With that, the cost is not really that much higher. Perhaps something else is missing, or it might be better to just return a sufficiently high number in that case?
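The per-lane overhead for a variable mask described here can be sketched as below. The helper name and unit costs are hypothetical, chosen only to illustrate the shape of the estimate; the actual patch pulls each component cost from the target's cost model.

```cpp
#include <cassert>

// Hypothetical sketch of the extra per-lane cost added for a non-constant
// mask: extract the lane's i1 condition from the mask vector, branch on it,
// and use a PHI to merge the (possibly skipped) result. Unit costs of 1
// are placeholder assumptions.
unsigned variableMaskOverhead(unsigned NumLanes) {
  const unsigned CondExtractCost = 1; // extract i1 condition per lane
  const unsigned BranchCost = 1;      // conditional branch per lane
  const unsigned PhiCost = 1;         // PHI merging the masked-off value
  return NumLanes * (CondExtractCost + BranchCost + PhiCost);
}
```

As the discussion below notes, branches and PHIs are often costed at 0 in practice, so whether unit costs like these are high enough to deter the vectorizers is exactly the open question in this thread.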

I think assigning a more accurate cost will go a long way towards avoiding generating gathers in practice, unless they are beneficial even with the bad lowering.

In the past, I believe the only thing to create gathers would be the loop vectorizer, and it would only do so if isLegalMaskedGather returned true, which guards us against a lot of bad costs. More accurate cost modelling now that it is being used more generally is certainly a good thing.

Yes, it probably should better account for the fact that the mem ops need to be executed conditionally. Not really sure how best to cost that. I updated the cost to include extracting the individual conditions, doing a branch, and a PHI to combine the result. With that, the cost is not really that much higher. Perhaps something else is missing, or it might be better to just return a sufficiently high number in that case?

The costs of branches and phis are often 0, unfortunately (due to branch predictors, phis being folded away, and whatnot) - they make a good default for the normal type of branch/phi in a loop or if. With so many branches together here that might not be true, and the real cost perhaps should be higher. Hmm. Not sure what to suggest. Do we think these scores are high enough? (At least for the moment; we can always adjust them later.) As far as I understand, the SLP vectorizer will use constant masks, which makes it the more immediate concern.

I think assigning a more accurate cost will go a long way towards avoiding generating gathers in practice, unless they are beneficial even with the bad lowering.

In the past, I believe the only thing to create gathers would be the loop vectorizer, and it would only do so if isLegalMaskedGather returned true, which guards us against a lot of bad costs. More accurate cost modelling now that it is being used more generally is certainly a good thing.

Once the cost model is more reasonable, we might want to also just let the loop vectorizer rely on the cost. Using the costs more widely should flush out any potential issues much quicker.

Yes, it probably should better account for the fact that the mem ops need to be executed conditionally. Not really sure how best to cost that. I updated the cost to include extracting the individual conditions, doing a branch, and a PHI to combine the result. With that, the cost is not really that much higher. Perhaps something else is missing, or it might be better to just return a sufficiently high number in that case?

The costs of branches and phis are often 0, unfortunately (due to branch predictors, phis being folded away, and whatnot) - they make a good default for the normal type of branch/phi in a loop or if. With so many branches together here that might not be true, and the real cost perhaps should be higher. Hmm. Not sure what to suggest. Do we think these scores are high enough? (At least for the moment; we can always adjust them later.) As far as I understand, the SLP vectorizer will use constant masks, which makes it the more immediate concern.

Yeah, the estimate is very rough. I think the costs should be sufficiently high so that it won't get picked in practice, but if there's a better way to model the fact that we need to introduce multiple blocks & branches, I'm happy to adjust the code. Alternatively we could also use a very high number for the VariableMask case. In any case, this should be a substantial improvement over returning 1 as the cost, as we do now.

dmgreen accepted this revision.Nov 25 2020, 6:32 AM

Yep. We can improve the costs in the future as we need, and this seems like an excellent start to me.

LGTM.

This revision is now accepted and ready to land.Nov 25 2020, 6:32 AM

Thanks! I'm planning on landing this tomorrow unless there are any other comments. I'll also add a comment that the estimate for the conditional execution is a bit rough.