This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/X86/
-
Target/
-
X86/
1/9
X86TargetTransformInfo.cpp
-
test/Analysis/CostModel/X86/
-
Analysis/
-
CostModel/
-
X86/
-
shuffle-load.ll

Differential D124616

[TTI][X86] Fix splat-load cost when load+broadcast cannot be combined.
Needs ReviewPublic

Authored by vporpo on Apr 28 2022, 7:31 AM.

Download Raw Diff

Details

Reviewers

vdmitrie
dmgreen
ABataev

Summary

This patch fixes the issue with the cost of a combined load+broadcast
when these cannot be combined into a single instruction.

For example, there may be other instructions in between:

%ld = ...
<other instructions>
%vec = broadcast %ld ...

Or they may be in different bbs:
bb1:

%ld = ...

bb2:

%vec = broadcast %ld ...

This should fix cost modeling for SLP of such cases.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	60,050 ms	x64 debian > libFuzzer.libFuzzer::large.test
	60,050 ms	x64 debian > libFuzzer.libFuzzer::value-profile-load.test

Event Timeline

vporpo created this revision.Apr 28 2022, 7:31 AM

Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2022, 7:31 AM

Herald added subscribers: pengfei, hiraditya. · View Herald Transcript

vporpo requested review of this revision.Apr 28 2022, 7:31 AM

Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2022, 7:31 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

vporpo mentioned this in rG5efa78985bf5: [SLP] Fix lookahead operand reordering for splat loads..Apr 28 2022, 7:34 AM

Harbormaster completed remote builds in B161791: Diff 425775.Apr 28 2022, 8:11 AM

We now only check if the load and its uses are in the same BB.

Fixed condition.

Harbormaster completed remote builds in B161844: Diff 425848.Apr 28 2022, 1:34 PM

I feel we would usually use a OneUse check to make sure that the load and the shuffle will be combined, not blocked by other uses of the load. Do you think that should be included too? Checking the load and the use are in the same block I've not seen before, but would make sense for how ISel works. You would hope that if the load did have one use (the shuffle) then the shuffle could be hoisted up to the load so that they do end up being combined.

Yes, I think it makes sense to add the hasOneUse check.
I am not really sure about the BBs though, it was reported as a potential issue in SLP, but it turned out that it was a different issue after all. So I guess I will drop this check and keep.

Added hasOneUse check and removed the BB check.

Harbormaster completed remote builds in B162255: Diff 426419.May 2 2022, 9:13 AM

dmgreen added inline comments.May 2 2022, 1:50 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1658	The user of an instruction should always be an instruction. Just to make sure - it isn't a problem for the SLP vectorizer is it? That the load might have multiple uses, which we are turning into a splat?

vporpo added inline comments.May 2 2022, 2:29 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1658	Oops, I will fix it. Good point. I did not see any test failures, but I am pretty sure we won't get the right shuffle cost because we are querying the cost model before generating the vector code (which btw is a good reason for testing the actual cost values with something like https://reviews.llvm.org/D124802). Hmm not sure what the best approach would be. Perhaps we could go through the users and count that we have at most one vector instruction. In this way SLP will still work as the users will be scalar and TTI still works with regular vector instructions. Any thoughts?

Updated condition for LoadCanBeCombined.

Harbormaster completed remote builds in B162338: Diff 426531.May 2 2022, 4:13 PM

vdmitrie added inline comments.May 2 2022, 4:29 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1658	This new part of TTI interface actually lacks description. For broadcast kind it sounds like we better to pass just one load instruction which we want to broadcast rather than the whole set that form the splat (otherwise we indeed applying assumption wrt intended use of the interface). Since we are broadcasting a scalar load its users are either scalar instructions or an insertelement. What did you mean by vector instruction user? During SLP cost modeling we have: %i1 = load double, double* %gep, align 8 %i1 = load double, double* %gep, align 8 fmul double %i1, .. fmul double %i1, .. ... . fadd double %i1, ... which we may transform into %i1 = load double, double* %gep, align 8 %0 = insertelement <2 x double> poison, double %i1, i32 0 %1 = insertelement <2 x double> %0, double %i1, i32 1 fmul <2 x double> %1 ... fadd double %i1, ... i.e. we have two uses to form the splat and one extra user. You probably intended to check one user per lane (i.e. each use will form a lane in a vector). But that is not that straightforward when we for example trying to vectorize with 2 or more users of the splat. If we end up vectorizing the tree CG can generate either code like this: vmovsd xmm1, mem64 ; load (throughput 0.5) vmovddup xmm2, xmm1 ; register form (throughput 1) vmulpd xmm2.. vaddsd xmm1 or like this vmovddup xmm1, mem64 ; load form (throughput 0.5) vmulpd xmm1.. vaddsd xmm1 ; scalar user Based on costs I tend to think CG should always prefer the second option (i.e. memory form) CG may consider other factors than cost when life range between load and its scalar users crosses block boundaries but since extract from vector register on x86 is essentially free it is again makes memory form of movddup look more profitable. We need somebody familiar with codegen to comment here.

vdmitrie added inline comments.May 2 2022, 4:55 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1663	I'm not following. The minimal vector is 2x. The minimal broadcast is 2x. As far as I understand if we are broadcasting <2 x double > {a,b} into <4 x double> we supposed to get {a,b,a,b}. Right? we don't have entry for v4f64 and vmovddup function is {x0,x1,x2,x3} => {x0,x0,x2,x2}

vdmitrie added inline comments.May 2 2022, 5:03 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1663	As far as I understand if we are broadcasting <2 x double > {a,b} into <4 x double> we supposed to get {a,b,a,b}. Right? Correcting myself: <2 x double > {a,b} into <2 x <2 x double>> => {{a,b},{a,b}}

vporpo added inline comments.May 2 2022, 5:43 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1658	I agree, the description is not very clear. My understanding is that we need `getShuffleCost()` to work on both scalar code or vector code: Case 1: If we query TTI from the SLP pass, the code is still scalar as we have not generated any vector instructions yet, so there are no `insertelement` instructions, the code would look like this: %ld = load double,... %mul1 = fmul double %ld, ... %mul2 = fmul double %ld, ... ... %add1 = fadd double %ld, ... %add2 = fadd double %ld, ... So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am not sure how we would guarantee that `%ld` won't have some other user that is not getting vectorized. Perhaps just check that `num_users % VL == 0` ? Case 2: If we have vector code, it could look like this: %ld = load double, ... %vec = insertelement undef, %ld, 0 %bcast1 = shufflevector %vec, ... In this case the operand of the TTI getShuffleCost function should still be the scalar load. In this case the load should have a single user and that user should only have a single user too. One way to check which case we are in is to check the type of the users. If the users are scalar, then we are in Case 1. If the users are vector then we are in Case 2.

vdmitrie added inline comments.May 2 2022, 7:18 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1658	So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am not sure how we would guarantee that `%ld` won't have some other user that is not getting vectorized. Perhaps just check that `num_users % VL == 0` ? The main issue here is that even if condition `num_users % VL == 0` is true there is no guarantee that all these uses will be vector eventually. Only caller supposed to know that. Case 2: If we have vector code, it could look like this: %ld = load double, ... %vec = insertelement undef, %ld, 0 %bcast1 = shufflevector %vec, ... In this case the operand of the TTI getShuffleCost function should still be the scalar load. Ah, that's an important note. Thanks. One way to check which case we are in is to check the type of the users. If the users are scalar, then we are in Case 1. If the users are vector then we are in Case 2. For me it is still not obvious why having extra (scalar) uses in case 2 would prevent combining load with broadcast shuffle. In my experiments I have not found any difference in generated code between %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = shufflevector <2 x double> %0, <2 x double> poison, <2 x i32> zeroinitializer versus %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = insertelement <2 x double> %0, double %ld, i32 1 Even having extra scalar use of the load did not prevent combining broadcast with load.

vporpo added inline comments.May 2 2022, 8:01 PM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1658	So the operand of the broadcast is %ld but the load has multiple users. In this case I am not sure how we would >>guarantee that %ld won't have some other user that is not getting vectorized. Perhaps just check that num_users % VL == 0 ? The main issue here is that even if condition num_users % VL == 0 is true there is no guarantee that all these uses >will be vector eventually. Only caller supposed to know that. Yes, there is no guarantee, not sure how we could improve this though. In case of SLP I think we are good because it will only call `getShuffleCost()` if there are no external uses, but in the general case it won't work. For me it is still not obvious why having extra (scalar) uses in case 2 would prevent combining load with broadcast shuffle. I think the concern is something like: %ld = double ... %ins1 = insertelement <2 x double> poison, double %ld, i32 0 %shuf1 = shufflevector <2 x double> %ins1, <2 x double> poison, <2 x i32> zeroinitializer ... %some_instr = ... %ld ... In this case code gen won't be able to combine the broadcast into a single instruction because it needs `%ld` for `%some_instr`. In my experiments I have not found any difference in generated code between %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = shufflevector <2 x double> %0, <2 x double> poison, <2 x i32> zeroinitializer versus %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = insertelement <2 x double> %0, double %ld, i32 1 Yes, there is no difference. We are just matching the canonicalized form of the broadcast which is the the form with the shuffle.
1663	Yeah this is broken. This is supposed to check for Case 1 or Case 2 and either check for `L->hasOneUse()` for Case 2, or just return true for Case 1. It should check the type of `L`'s uses, not `L` itself.

Added more comments and added checks for the count and type of the load's users.

Harbormaster completed remote builds in B162488: Diff 426744.May 3 2022, 11:17 AM

Revision Contents

Path

Size

llvm/

lib/

Target/

X86/

X86TargetTransformInfo.cpp

52 lines

test/

Analysis/

CostModel/

X86/

shuffle-load.ll

46 lines

Diff 426744

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 1,602 Lines • ▼ Show 20 Lines	static const CostTblEntry SSE2ShuffleTbl[] = {
{ TTI::SK_PermuteTwoSrc, MVT::v16i8, 13 }, // blend+permute		{ TTI::SK_PermuteTwoSrc, MVT::v16i8, 13 }, // blend+permute
};		};

static const CostTblEntry SSE3BroadcastLoadTbl[] = {		static const CostTblEntry SSE3BroadcastLoadTbl[] = {
{TTI::SK_Broadcast, MVT::v2f64, 0}, // broadcast handled by movddup		{TTI::SK_Broadcast, MVT::v2f64, 0}, // broadcast handled by movddup
};		};

if (ST->hasSSE2()) {		if (ST->hasSSE2()) {
bool IsLoad =		if (bool IsLoad = !Args.empty() && isa<LoadInst>(Args[0])) {
llvm::any_of(Args, [](const auto &V) { return isa<LoadInst>(V); });		const LoadInst *L = cast<LoadInst>(Args[0]);
if (ST->hasSSE3() && IsLoad)		// There are two use cases:
		// Case 1 (scalar):
		// %ld = double ...
		// %add1 = fadd double %ld
		// %add2 = fadd double %ld
		// This is typically used by SLP.
		//
		// Case 2 (vector):
		// %ld = double ...
		// %ins = insertelement <2 x double> undef, double %ld, i32 0
		// %shf = shufflevector <2 x double> %ins, ...
		// This is the canonicalized form of the broadcast, so we are only
		// matching this one and not multiple insertelements.
		//
		// In order to tell which case we are in we are checking the users of the
		// load. If the users are all scalar, then we are in Case 1. Else if there
		// is at least one vector user we are in Case 2.
		//
		// In Case 1 we cannot be sure whether all the users will be converted to
		// vectors, and that there won't be any other exteranl user to the load.
		// This is up to the caller. So we eagerly consider that codegen will be
		// able to combine load + broadcast.
		//
		// In Case 2 we need to make sure that the load has a single user.
		// Otherwise codegen won't be able to combine load + broadcast.

		// Limit the users we visit to save compilation time.
		if (!L->hasNUsesOrMore(16)) {
		bool IsVectorCase = llvm::any_of(L->users(), [](const User *U) {
		return cast<Instruction>(U)->getType()->isVectorTy();
		});
		bool LoadCanBeCombined = IsVectorCase ? L->hasOneUse() : true;
		if (ST->hasSSE3() && LoadCanBeCombined)
if (const auto *Entry =		if (const auto *Entry =
CostTableLookup(SSE3BroadcastLoadTbl, Kind, LT.second)) {		CostTableLookup(SSE3BroadcastLoadTbl, Kind, LT.second)) {
assert(isLegalBroadcastLoad(BaseTp->getElementType(),		assert(isLegalBroadcastLoad(BaseTp->getElementType(),
LT.second.getVectorElementCount()) &&		LT.second.getVectorElementCount()) &&
"Table entry missing from isLegalBroadcastLoad()");		"Table entry missing from isLegalBroadcastLoad()");
return LT.first * Entry->Cost;		return LT.first * Entry->Cost;
}		}
		}
		}

if (const auto *Entry = CostTableLookup(SSE2ShuffleTbl, Kind, LT.second))		if (const auto *Entry = CostTableLookup(SSE2ShuffleTbl, Kind, LT.second))
return LT.first * Entry->Cost;		return LT.first * Entry->Cost;
}		}
		dmgreenUnsubmitted Not Done Reply Inline Actions The user of an instruction should always be an instruction. Just to make sure - it isn't a problem for the SLP vectorizer is it? That the load might have multiple uses, which we are turning into a splat? dmgreen: The user of an instruction should always be an instruction. Just to make sure - it isn't a…
		vporpoAuthorUnsubmitted Not Done Reply Inline Actions Oops, I will fix it. Good point. I did not see any test failures, but I am pretty sure we won't get the right shuffle cost because we are querying the cost model before generating the vector code (which btw is a good reason for testing the actual cost values with something like https://reviews.llvm.org/D124802). Hmm not sure what the best approach would be. Perhaps we could go through the users and count that we have at most one vector instruction. In this way SLP will still work as the users will be scalar and TTI still works with regular vector instructions. Any thoughts? vporpo: Oops, I will fix it. Good point. I did not see any test failures, but I am pretty sure we…
		vdmitrieUnsubmitted Not Done Reply Inline Actions This new part of TTI interface actually lacks description. For broadcast kind it sounds like we better to pass just one load instruction which we want to broadcast rather than the whole set that form the splat (otherwise we indeed applying assumption wrt intended use of the interface). Since we are broadcasting a scalar load its users are either scalar instructions or an insertelement. What did you mean by vector instruction user? During SLP cost modeling we have: %i1 = load double, double* %gep, align 8 %i1 = load double, double* %gep, align 8 fmul double %i1, .. fmul double %i1, .. ... . fadd double %i1, ... which we may transform into %i1 = load double, double* %gep, align 8 %0 = insertelement <2 x double> poison, double %i1, i32 0 %1 = insertelement <2 x double> %0, double %i1, i32 1 fmul <2 x double> %1 ... fadd double %i1, ... i.e. we have two uses to form the splat and one extra user. You probably intended to check one user per lane (i.e. each use will form a lane in a vector). But that is not that straightforward when we for example trying to vectorize with 2 or more users of the splat. If we end up vectorizing the tree CG can generate either code like this: vmovsd xmm1, mem64 ; load (throughput 0.5) vmovddup xmm2, xmm1 ; register form (throughput 1) vmulpd xmm2.. vaddsd xmm1 or like this vmovddup xmm1, mem64 ; load form (throughput 0.5) vmulpd xmm1.. vaddsd xmm1 ; scalar user Based on costs I tend to think CG should always prefer the second option (i.e. memory form) CG may consider other factors than cost when life range between load and its scalar users crosses block boundaries but since extract from vector register on x86 is essentially free it is again makes memory form of movddup look more profitable. We need somebody familiar with codegen to comment here. vdmitrie: This new part of TTI interface actually lacks description. For broadcast kind it sounds like we…
		vporpoAuthorUnsubmitted Not Done Reply Inline Actions I agree, the description is not very clear. My understanding is that we need `getShuffleCost()` to work on both scalar code or vector code: Case 1: If we query TTI from the SLP pass, the code is still scalar as we have not generated any vector instructions yet, so there are no `insertelement` instructions, the code would look like this: %ld = load double,... %mul1 = fmul double %ld, ... %mul2 = fmul double %ld, ... ... %add1 = fadd double %ld, ... %add2 = fadd double %ld, ... So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am not sure how we would guarantee that `%ld` won't have some other user that is not getting vectorized. Perhaps just check that `num_users % VL == 0` ? Case 2: If we have vector code, it could look like this: %ld = load double, ... %vec = insertelement undef, %ld, 0 %bcast1 = shufflevector %vec, ... In this case the operand of the TTI getShuffleCost function should still be the scalar load. In this case the load should have a single user and that user should only have a single user too. One way to check which case we are in is to check the type of the users. If the users are scalar, then we are in Case 1. If the users are vector then we are in Case 2. vporpo: I agree, the description is not very clear. My understanding is that we need `getShuffleCost()`…
		vdmitrieUnsubmitted Not Done Reply Inline Actions So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am not sure how we would guarantee that `%ld` won't have some other user that is not getting vectorized. Perhaps just check that `num_users % VL == 0` ? The main issue here is that even if condition `num_users % VL == 0` is true there is no guarantee that all these uses will be vector eventually. Only caller supposed to know that. Case 2: If we have vector code, it could look like this: %ld = load double, ... %vec = insertelement undef, %ld, 0 %bcast1 = shufflevector %vec, ... In this case the operand of the TTI getShuffleCost function should still be the scalar load. Ah, that's an important note. Thanks. One way to check which case we are in is to check the type of the users. If the users are scalar, then we are in Case 1. If the users are vector then we are in Case 2. For me it is still not obvious why having extra (scalar) uses in case 2 would prevent combining load with broadcast shuffle. In my experiments I have not found any difference in generated code between %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = shufflevector <2 x double> %0, <2 x double> poison, <2 x i32> zeroinitializer versus %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = insertelement <2 x double> %0, double %ld, i32 1 Even having extra scalar use of the load did not prevent combining broadcast with load. vdmitrie: > So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am…
		vporpoAuthorUnsubmitted Not Done Reply Inline Actions So the operand of the broadcast is %ld but the load has multiple users. In this case I am not sure how we would >>guarantee that %ld won't have some other user that is not getting vectorized. Perhaps just check that num_users % VL == 0 ? The main issue here is that even if condition num_users % VL == 0 is true there is no guarantee that all these uses >will be vector eventually. Only caller supposed to know that. Yes, there is no guarantee, not sure how we could improve this though. In case of SLP I think we are good because it will only call `getShuffleCost()` if there are no external uses, but in the general case it won't work. For me it is still not obvious why having extra (scalar) uses in case 2 would prevent combining load with broadcast shuffle. I think the concern is something like: %ld = double ... %ins1 = insertelement <2 x double> poison, double %ld, i32 0 %shuf1 = shufflevector <2 x double> %ins1, <2 x double> poison, <2 x i32> zeroinitializer ... %some_instr = ... %ld ... In this case code gen won't be able to combine the broadcast into a single instruction because it needs `%ld` for `%some_instr`. In my experiments I have not found any difference in generated code between %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = shufflevector <2 x double> %0, <2 x double> poison, <2 x i32> zeroinitializer versus %0 = insertelement <2 x double> poison, double %ld, i32 0 %1 = insertelement <2 x double> %0, double %ld, i32 1 Yes, there is no difference. We are just matching the canonicalized form of the broadcast which is the the form with the shuffle. vporpo: >>So the operand of the broadcast is %ld but the load has multiple users. In this case I am not…

static const CostTblEntry SSE1ShuffleTbl[] = {		static const CostTblEntry SSE1ShuffleTbl[] = {
{ TTI::SK_Broadcast, MVT::v4f32, 1 }, // shufps		{ TTI::SK_Broadcast, MVT::v4f32, 1 }, // shufps
{ TTI::SK_Reverse, MVT::v4f32, 1 }, // shufps		{ TTI::SK_Reverse, MVT::v4f32, 1 }, // shufps
{ TTI::SK_Select, MVT::v4f32, 2 }, // 2*shufps		{ TTI::SK_Select, MVT::v4f32, 2 }, // 2*shufps
		vdmitrieUnsubmitted Not Done Reply Inline Actions I'm not following. The minimal vector is 2x. The minimal broadcast is 2x. As far as I understand if we are broadcasting <2 x double > {a,b} into <4 x double> we supposed to get {a,b,a,b}. Right? we don't have entry for v4f64 and vmovddup function is {x0,x1,x2,x3} => {x0,x0,x2,x2} vdmitrie: I'm not following. The minimal vector is 2x. The minimal broadcast is 2x. As far as I…
		vdmitrieUnsubmitted Not Done Reply Inline Actions As far as I understand if we are broadcasting <2 x double > {a,b} into <4 x double> we supposed to get {a,b,a,b}. Right? Correcting myself: <2 x double > {a,b} into <2 x <2 x double>> => {{a,b},{a,b}} vdmitrie: > As far as I understand if we are broadcasting <2 x double > {a,b} into <4 x double> > we…
		vporpoAuthorUnsubmitted Done Reply Inline Actions Yeah this is broken. This is supposed to check for Case 1 or Case 2 and either check for `L->hasOneUse()` for Case 2, or just return true for Case 1. It should check the type of `L`'s uses, not `L` itself. vporpo: Yeah this is broken. This is supposed to check for Case 1 or Case 2 and either check for `L…
{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 1 }, // shufps		{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 1 }, // shufps
{ TTI::SK_PermuteTwoSrc, MVT::v4f32, 2 }, // 2*shufps		{ TTI::SK_PermuteTwoSrc, MVT::v4f32, 2 }, // 2*shufps
};		};

if (ST->hasSSE1())		if (ST->hasSSE1())
if (const auto *Entry = CostTableLookup(SSE1ShuffleTbl, Kind, LT.second))		if (const auto *Entry = CostTableLookup(SSE1ShuffleTbl, Kind, LT.second))
return LT.first * Entry->Cost;		return LT.first * Entry->Cost;

▲ Show 20 Lines • Show All 4,246 Lines • Show Last 20 Lines

llvm/test/Analysis/CostModel/X86/shuffle-load.ll

Show First 20 Lines • Show All 465 Lines • ▼ Show 20 Lines	;
%sf_2xf64 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer		%sf_2xf64 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
%ld_4xf64 = load <4 x double>, ptr undef		%ld_4xf64 = load <4 x double>, ptr undef
%sf_4xf64 = shufflevector <4 x double> %ld_4xf64, <4 x double> undef, <4 x i32> zeroinitializer		%sf_4xf64 = shufflevector <4 x double> %ld_4xf64, <4 x double> undef, <4 x i32> zeroinitializer
%ld_8xf64 = load <8 x double>, ptr undef		%ld_8xf64 = load <8 x double>, ptr undef
%sf_8xf64 = shufflevector <8 x double> %ld_8xf64, <8 x double> undef, <8 x i32> zeroinitializer		%sf_8xf64 = shufflevector <8 x double> %ld_8xf64, <8 x double> undef, <8 x i32> zeroinitializer

ret void		ret void
}		}

		; Checks the cost of a load+broadcast that cannot be combined.
		define void @multiple_uses() {
		; SSE-LABEL: 'multiple_uses'
		; SSE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %ld_2xf64 = load <2 x double>, ptr undef, align 16
		; SSE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; SSE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; SSE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
		;
		; SSE2-LABEL: 'multiple_uses'
		; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %ld_2xf64 = load <2 x double>, ptr undef, align 16
		; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; SSE2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; SSE2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
		;
		; SSE3-LABEL: 'multiple_uses'
		; SSE3-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %ld_2xf64 = load <2 x double>, ptr undef, align 16
		; SSE3-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; SSE3-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; SSE3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
		;
		; AVX-LABEL: 'multiple_uses'
		; AVX-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %ld_2xf64 = load <2 x double>, ptr undef, align 16
		; AVX-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; AVX-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; AVX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
		;
		; AVX2-LABEL: 'multiple_uses'
		; AVX2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %ld_2xf64 = load <2 x double>, ptr undef, align 16
		; AVX2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; AVX2-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; AVX2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
		;
		; AVX512-LABEL: 'multiple_uses'
		; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %ld_2xf64 = load <2 x double>, ptr undef, align 16
		; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; AVX512-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		; AVX512-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
		;
		bb1:
		%ld_2xf64 = load <2 x double>, ptr undef
		; Load has multiple uses
		%sf_2xf64_1 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		%sf_2xf64_2 = shufflevector <2 x double> %ld_2xf64, <2 x double> undef, <2 x i32> zeroinitializer
		ret void
		}