This is an archive of the discontinued LLVM Phabricator instance.

Differential D99719

[SLP] Better estimate cost of no-op extracts on target vectors.
ClosedPublic

Authored by fhahn on Apr 1 2021, 5:17 AM.

Details

Reviewers

ABataev
RKSimon
spatel
vdmitrie
dtemirbulatov
anton-afanasyev

Commits

rG0f3230390b8b: [SLP] Better estimate cost of no-op extracts on target vectors.

Summary

The motivation for this patch is to better estimate the cost of
extractelement instructions in cases where they are going to be free,
because the source vector can be used directly.

A simple example is

%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
%v1.lane.1 = extractelement <2 x double> %v.1, i32 1

%a.lane.0 = fmul double %v1.lane.0, %x
%a.lane.1 = fmul double %v1.lane.1, %y

Currently we only consider the extracts free if there are no other
users.

In this particular case, on AArch64 which can fit <2 x double> in a
vector register, the extracts should be free, independently of other
users, because the source vector of the extracts will be in a vector
register directly, so it should be free to use the vector directly.
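
The difference between the two rules can be sketched with a small stand-alone model. This is an illustration only, not code from the patch: `ExtractModel`, `freeUnderOldRule` and `freeUnderNewRule` are hypothetical names, and the "new rule" is reduced to the single-register case from the example above (all lanes of one register-sized source vector, read in order).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified description of one extractelement instruction.
struct ExtractModel {
  unsigned lane;      // lane index read by the extract
  bool hasOtherUsers; // true if a user outside the vectorized tree remains
};

// Simplified old rule: extracts are free only if none has other users.
bool freeUnderOldRule(const std::vector<ExtractModel> &extracts) {
  for (const ExtractModel &e : extracts)
    if (e.hasOtherUsers)
      return false;
  return true;
}

// Simplified new rule: extracts are free if they read lanes 0..N-1 in order
// and N lanes fill exactly one target register, regardless of other users.
bool freeUnderNewRule(const std::vector<ExtractModel> &extracts,
                      unsigned lanesPerRegister) {
  assert(lanesPerRegister > 0 && "a register holds at least one lane");
  if (extracts.size() != lanesPerRegister)
    return false;
  for (std::size_t i = 0; i < extracts.size(); ++i)
    if (extracts[i].lane != i)
      return false;
  return true;
}
```

For the two extracts in the example, both with additional scalar users, the old rule in this model reports "not free" while the new rule reports "free" for a two-lane register.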

The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on
certain AArch64 CPUs.

It looks like this does not impact any code in
SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.

This originally regressed after D80773, so if there's a better
alternative to explore, I'd be more than happy to do that.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision.Apr 1 2021, 5:17 AM

Herald added subscribers: pengfei, hiraditya, kristof.beyls. Apr 1 2021, 5:17 AM

fhahn requested review of this revision.Apr 1 2021, 5:17 AM

Herald added a project: Restricted Project. Apr 1 2021, 5:17 AM

Harbormaster completed remote builds in B96686: Diff 334653.Apr 1 2021, 6:01 AM

RKSimon added a reviewer: anton-afanasyev.Apr 1 2021, 6:14 AM

ABataev added inline comments.Apr 1 2021, 6:48 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	I think it is better to use `TLI->getTypeLegalizationCost(DL, cast<ExtractElementInst>(V)->getVectorOperandType());` to get the real machine vector type and the number of splits.
3562	`->getVectorOperand()` instead of `getOperand(0)`
3566	I think you need to use the real extract indices here to be more correct, i.e.
	  for (unsigned I = Idx - EltsPerVector; I <= Idx; ++I)
	    Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement, cast<ExtractElementInst>(VL[I])->getVectorOperandType(), *getExtractIndex(cast<Instruction>(VL[I])));

Address comments, thanks!

fhahn marked an inline comment as done.Apr 1 2021, 7:16 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	I think using `TargetLoweringInfo` would indeed be better, but unfortunately I don't think we can access it here, as it is defined in CodeGen? I tried to see if there are any other such uses in `llvm/lib/Transforms` but couldn't. Perhaps there's a way to use it I am missing?
3566	Thanks, I think that's much better, updated.

ABataev added inline comments.Apr 1 2021, 7:29 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	You can use `TTI->getNumberOfParts()` to get the number of registers and then calculate EltsPerVector. Also, what if there are extracts from 2 different vectors with the different numbers of elements?

Harbormaster completed remote builds in B96709: Diff 334678.Apr 1 2021, 7:54 AM

fhahn added inline comments.Apr 1 2021, 8:01 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	That's convenient, thanks! I just gave it a try, but I stumbled over a problem. For example, on AArch64, `<2 x i32>` fits and can be used as the lower half of a vector register, so `EltsPerVector` would be 2 (and rightly so). But this has the unfortunate effect that in some cases we would vectorize some operations earlier with `<2 x i32>`, rather than vectorizing a larger expression with `<4 x i32>`. By using the larger vector register, we make sure to only do so to use the largest VF. Arguably using `getNumberOfParts` is the right thing to use here, but I really want to avoid introducing any regressions and I don't think there's a way at the moment to skip vectorizing eagerly if it would prevent optimizing with a wider VF later on. WDYT?
	Regarding "Also, what if there are extracts from 2 different vectors with the different numbers of elements?": At the moment all extracts in a block need to have the same vector register, so the types should also be the same. The `extracts_first_2_lanes_different_vectors` test should check for that case.

ABataev added inline comments.Apr 1 2021, 8:13 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	Could you give an example, please? Then maybe guard these extra checks with something like: if (*ShuffleKind == TargetTransformInfo::SK_PermuteSingleSrc) { ... } ?

Subtract shufflecost for vector part, rather than multiple extractelement costs, to be symmetric to the cost computed earlier.

fhahn added inline comments.Apr 1 2021, 8:49 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	Just had another look at the failure and it was caused by computing `Cost = TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask);` up-front, but then subtracting the cost of individual extracts. I think this may reduce the cost too aggressively. For example in the code below, the shuffle-cost on AArch64 is 1, but the cost to extract from lane `1` is 3. The new code should just cancel out the cost of the shuffle, but in this case it made the cost more profitable than it should be! I think it should be more correct to subtract the cost of a single shuffle for a vector with `EltsPerVector` elements. That should be more symmetric to the computed `Cost`.
	  %v0.0 = extractelement <4 x i32> %v0, i32 0
	  %v0.1 = extractelement <4 x i32> %v0, i32 1
	I think there's a similar problem in the `AllUsersVectorized && !ScalarToTreeEntry.count()` case, but it is not obvious to me how to improve that, given that we potentially need to subtract the cost for only a subset of extracts. Perhaps we should ensure `Cost` is at least 0, to avoid the problem?
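
The asymmetry described in this comment can be worked through with plain integers. The sketch below is a hypothetical model, not patch code: the function names are invented, the lane-0 extract cost of 0 is an assumption (the comment only states a shuffle cost of 1 and a lane-1 extract cost of 3), and the per-block shuffle cost is assumed equal to the up-front shuffle cost.

```cpp
#include <cassert>
#include <vector>

// Start from one up-front shuffle cost, then subtract a per-extract cost for
// every lane. This mirrors the intermediate approach that over-corrected.
int costSubtractingExtracts(int shuffleCost,
                            const std::vector<int> &extractCosts) {
  int cost = shuffleCost;
  for (int c : extractCosts)
    cost -= c;
  return cost;
}

// Start from the same shuffle cost, but subtract the cost of one shuffle for
// the register-sized block, which is symmetric to how the cost was added.
int costSubtractingShuffle(int shuffleCost, int blockShuffleCost) {
  return shuffleCost - blockShuffleCost;
}
```

With a shuffle cost of 1 and extract costs of 0 and 3, the first function yields 1 - 0 - 3 = -2, a net "profit" that no real instruction saving backs up, while the second yields 1 - 1 = 0.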

ABataev added inline comments.Apr 1 2021, 9:06 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551	I think `AllUsersVectorized && !ScalarToTreeEntry.count()` case is correct since we are removing the ExtractElement instruction from the code completely.
3566	Hmm, I think more correct would be to do something like this:
	  Cost += TTI->getShuffleCost(ShuffleKind.getValue(), FixedVectorType::get(VecTy->getElementType(), EltsPerVector), Mask);
	in case `AllConsecutive` is `false` and ignore the initial shuffle cost completely. Also, you need to calculate the correct `Mask` here or pass `llvm::None`.
	So, I think you need to split it into separate parts. The first one for `*ShuffleKind == TargetTransformInfo::SK_PermuteSingleSrc` with improved shuffle cost for non-consecutive extracts and the generic one with the old functionality.

Harbormaster completed remote builds in B96735: Diff 334712.Apr 1 2021, 9:37 AM

fhahn added inline comments.Apr 1 2021, 10:01 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3566	That sounds good! I put up D99745. Was that what you had in mind for preparation? (Still need to add some comments, but I wanted to make sure that's what you actually had in mind)

ABataev added inline comments.Apr 1 2021, 10:03 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3566	Sorry, I did not mean to split the patch into 2 parts :) I meant to split processing into 2 parts. Plus, the patch cannot be NFC since it changes the functionality.

Move new code to separate function.

fhahn added inline comments.Apr 1 2021, 12:41 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3566	Oh right! I think I see what you mean now. I moved the new logic to a separate function and changed it to only add the shuffle costs for blocks that are not consecutive. I think the code should be much clearer now, thanks!
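
The block-wise logic described in this comment can be modelled without any LLVM types. The sketch below follows the loop structure shown in the diff further down, but it is a stand-alone approximation under stated assumptions: `modelBlockwiseExtractCost` is a hypothetical name, lane indices are passed in directly instead of being read from instructions, and `blockShuffleCost` stands in for whatever the target would report for one single-source permute of a register-sized vector. The fallback paths of the real function are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// extractLanes[i] is the lane read by the i-th extract; eltsPerVector is the
// number of lanes per target register (must be non-zero in this model).
int modelBlockwiseExtractCost(const std::vector<unsigned> &extractLanes,
                              unsigned eltsPerVector, int blockShuffleCost) {
  assert(eltsPerVector > 0 && "caller must handle the zero case separately");
  int cost = 0;
  bool allConsecutive = true;
  for (std::size_t idx = 0; idx < extractLanes.size(); ++idx) {
    // First extract of a register-sized block: reset the tracking flag.
    if (idx % eltsPerVector == 0) {
      allConsecutive = true;
      continue;
    }
    // A block stays free only while its lanes are consecutive and each lane
    // sits at the same offset within a register as its position in the block.
    const unsigned current = extractLanes[idx];
    const unsigned previous = extractLanes[idx - 1];
    allConsecutive = allConsecutive && previous + 1 == current &&
                     current % eltsPerVector == idx % eltsPerVector;
    if (allConsecutive)
      continue;
    // Charge at most one shuffle per block, at the block's last extract.
    const bool lastInBlock = (idx + 1) % eltsPerVector == 0;
    const bool lastOverall = idx + 1 == extractLanes.size();
    if (!lastInBlock && !lastOverall)
      continue;
    cost += blockShuffleCost;
  }
  return cost;
}
```

In this model, lanes {0, 1} with two lanes per register cost nothing, while {1, 0} or {1, 2} each cost one block shuffle, because the block either is out of order or straddles a register boundary.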

ABataev added inline comments.Apr 1 2021, 12:54 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3453–3455	Can you make it static local?
3458–3460	Is this possible? Or better to make it an assert?
3477–3481	Use `*getExtractIndex(cast<Instruction>(V));` to get the indices
3504–3505	  Optional<TargetTransformInfo::ShuffleKind> ShuffleKind = isShuffle(VL.slice(StartIdx, std::min(EltsPerVector, VL.size() - StartIdx)), Mask);
	Can we skip this call? Just `TargetTransformInfo::SK_PermuteSingleSrc` and `None` as a `Mask`?

address comments: removed mask computation, made static.

fhahn marked an inline comment as done.Apr 1 2021, 1:14 PM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3453–3455	Yes, originally I was still using areAllUsesVectorized, but I removed that now, as it does not really fit logically any more. This impacted 2 X86 tests, but I think it is the patch working as expected.
3458–3460	Unfortunately that can happen when compiling without a specific target and there's a test that triggers an assert otherwise.
3504–3505	Done, much simpler now!

ABataev added inline comments.Apr 1 2021, 1:17 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3458–3460	Maybe just exit in this case? And rely on the conservative cost model?
3487	No need for this check anymore

Address comments: Add early exit if getNumberOfParts returns 0, remove vector operand checks

fhahn marked an inline comment as done.Apr 1 2021, 1:24 PM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3458–3460	yes, I updated it to just return the shuffle cost for the whole `VecTy`.

ABataev added inline comments.Apr 1 2021, 1:28 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3457	Move this after `if`
3546–3549	Since you have a call for `TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask)` in `computeExtractCostForPermuteSingleSrc` maybe just check for shuffle kind in this function rather than here? And go the conservative way if `*ShuffleKind != TargetTransformInfo::SK_PermuteSingleSrc`? In this case you would need to rename `computeExtractCostForPermuteSingleSrc` to something like `computeExtractCost`

Addressed comments: renamed to computeExtractCost and moved shuffle cost computation to function. Thanks again!

Harbormaster completed remote builds in B96789: Diff 334797.Apr 1 2021, 1:39 PM

This revision is now accepted and ready to land.Apr 1 2021, 1:39 PM

fhahn marked 2 inline comments as done.Apr 1 2021, 1:40 PM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3546–3549	Done, everything is now done in the function. Overall that should make the code simpler again, thanks.

Harbormaster completed remote builds in B96798: Diff 334810.Apr 1 2021, 2:15 PM

Harbormaster completed remote builds in B96796: Diff 334807.

Harbormaster completed remote builds in B96804: Diff 334818.Apr 1 2021, 2:45 PM

Closed by commit rG0f3230390b8b: [SLP] Better estimate cost of no-op extracts on target vectors. (authored by fhahn). Apr 2 2021, 2:52 AM

This revision was automatically updated to reflect the committed changes.

fhahn marked an inline comment as done.

fhahn added a commit: rG0f3230390b8b: [SLP] Better estimate cost of no-op extracts on target vectors..

MaskRay added a subscriber: MaskRay.Apr 2 2021, 10:44 AM

MaskRay added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476	I have seen a case where
	  VecTy->getNumElements() == 2
	  NumOfParts == 4
	  EltsPerVector == 0
	  Idx == 0
	So divide-by-zero SIGFPE. Trying to reduce to a test case, but how should I paper over the problem quickly?

fhahn added inline comments.Apr 2 2021, 10:52 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476	Oh right, that's interesting! Perhaps a vector with an element type that does not fit into a single register? For those cases, the logic below should not apply I think, so perhaps adding a bail out on `EltsPerVector == 0` after computing it?

MaskRay added inline comments.Apr 2 2021, 10:53 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476	`if (EltsPerVector == 0) return TTI.getShuffleCost(ShuffleKind, VecTy, Mask);` ?

fhahn added inline comments.Apr 2 2021, 10:55 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476	Yes that was what I was thinking! It would be great to add a test case as well.

MaskRay mentioned this in rG8e5f3d04f269: [SLPVectorizer] Fix divide-by-zero after D99719.Apr 2 2021, 11:14 AM
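
The failure mode in this thread comes from integer division: 2 elements split over 4 parts gives 0 lanes per register, and the later `Idx % EltsPerVector` then divides by zero. A minimal stand-alone sketch of the guard discussed above, using hypothetical helper names rather than the actual fix:

```cpp
#include <cassert>

// Lanes per register, or 0 when the block-wise logic must be skipped because
// the target reports no parts or more parts than elements.
unsigned eltsPerVectorOrZero(unsigned numElements, unsigned numParts) {
  assert(numElements > 0 && "fixed vector types have at least one element");
  if (numParts == 0)
    return 0;
  return numElements / numParts; // integer division: 2 / 4 == 0
}

// True when it is safe to use the lane count as a modulus in the block loop;
// otherwise the caller should fall back to the whole-vector shuffle cost.
bool canUseBlockwiseCost(unsigned numElements, unsigned numParts) {
  return eltsPerVectorOrZero(numElements, numParts) != 0;
}
```

For the reported case (2 elements, 4 parts) the check fails, so the conservative shuffle cost for the whole vector type would be returned before any modulo is evaluated.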

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	54 lines
llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll	242 lines
llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll	13 lines
llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll	13 lines

Diff 334928

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[3,444 lines not shown]	if (!CI->isNoBuiltin() && VecFunc) {
// Calculate the cost of the vector library call.		// Calculate the cost of the vector library call.
// If the corresponding vector call is cheaper, return its cost.		// If the corresponding vector call is cheaper, return its cost.
LibCost = TTI->getCallInstrCost(nullptr, VecTy, VecTys,		LibCost = TTI->getCallInstrCost(nullptr, VecTy, VecTys,
TTI::TCK_RecipThroughput);		TTI::TCK_RecipThroughput);
}		}
return {IntrinsicCost, LibCost};		return {IntrinsicCost, LibCost};
}		}

		/// Compute the cost of creating a vector of type \p VecTy containing the
		/// extracted values from \p VL.
		static InstructionCost
		computeExtractCost(ArrayRef<Value *> VL, FixedVectorType *VecTy,
		TargetTransformInfo::ShuffleKind ShuffleKind,
		ArrayRef<int> Mask, TargetTransformInfo &TTI) {
		unsigned NumOfParts = TTI.getNumberOfParts(VecTy);

		if (ShuffleKind != TargetTransformInfo::SK_PermuteSingleSrc || !NumOfParts)
		return TTI.getShuffleCost(ShuffleKind, VecTy, Mask);

		bool AllConsecutive = true;
		unsigned EltsPerVector = VecTy->getNumElements() / NumOfParts;
		unsigned Idx = -1;
		InstructionCost Cost = 0;

		// Process extracts in blocks of EltsPerVector to check if the source vector
		// operand can be re-used directly. If not, add the cost of creating a shuffle
		// to extract the values into a vector register.
		for (auto *V : VL) {
		++Idx;

		// Reached the start of a new vector registers.
		if (Idx % EltsPerVector == 0) {
		AllConsecutive = true;
		continue;
		}

		// Check all extracts for a vector register on the target directly
		// extract values in order.
		unsigned CurrentIdx = *getExtractIndex(cast<Instruction>(V));
		unsigned PrevIdx = *getExtractIndex(cast<Instruction>(VL[Idx - 1]));
		AllConsecutive &= PrevIdx + 1 == CurrentIdx &&
		CurrentIdx % EltsPerVector == Idx % EltsPerVector;

		if (AllConsecutive)
		continue;

		// Skip all indices, except for the last index per vector block.
		if ((Idx + 1) % EltsPerVector != 0 && Idx + 1 != VL.size())
		continue;

		// If we have a series of extracts which are not consecutive and hence
		// cannot re-use the source vector register directly, compute the shuffle
		// cost to extract the a vector with EltsPerVector elements.
		Cost += TTI.getShuffleCost(
		TargetTransformInfo::SK_PermuteSingleSrc,
		FixedVectorType::get(VecTy->getElementType(), EltsPerVector));
		}
		return Cost;
		}

InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {		InstructionCost BoUpSLP::getEntryCost(TreeEntry *E) {
ArrayRef<Value*> VL = E->Scalars;		ArrayRef<Value*> VL = E->Scalars;

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
else if (CmpInst *CI = dyn_cast<CmpInst>(VL[0]))		else if (CmpInst *CI = dyn_cast<CmpInst>(VL[0]))
ScalarTy = CI->getOperand(0)->getType();		ScalarTy = CI->getOperand(0)->getType();
auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());		auto *VecTy = FixedVectorType::get(ScalarTy, VL.size());
[23 lines not shown]	if (E->State == TreeEntry::NeedToGather) {
}		}
if (E->getOpcode() == Instruction::ExtractElement &&		if (E->getOpcode() == Instruction::ExtractElement &&
allSameType(VL) && allSameBlock(VL)) {		allSameType(VL) && allSameBlock(VL)) {
SmallVector<int> Mask;		SmallVector<int> Mask;
Optional<TargetTransformInfo::ShuffleKind> ShuffleKind =		Optional<TargetTransformInfo::ShuffleKind> ShuffleKind =
isShuffle(VL, Mask);		isShuffle(VL, Mask);
if (ShuffleKind.hasValue()) {		if (ShuffleKind.hasValue()) {
InstructionCost Cost =		InstructionCost Cost =
TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask);		computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI);
for (auto *V : VL) {		for (auto *V : VL) {
// If all users of instruction are going to be vectorized and this		// If all users of instruction are going to be vectorized and this
// instruction itself is not going to be vectorized, consider this		// instruction itself is not going to be vectorized, consider this
// instruction as dead and remove its cost from the final cost of the		// instruction as dead and remove its cost from the final cost of the
// vectorized tree.		// vectorized tree.
if (areAllUsersVectorized(cast<Instruction>(V)) &&		if (areAllUsersVectorized(cast<Instruction>(V)) &&
!ScalarToTreeEntry.count(V)) {		!ScalarToTreeEntry.count(V)) {
auto *IO = cast<ConstantInt>(		auto *IO = cast<ConstantInt>(
cast<ExtractElementInst>(V)->getIndexOperand());		cast<ExtractElementInst>(V)->getIndexOperand());
Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,		Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
IO->getZExtValue());		IO->getZExtValue());
}		}
}		}
return ReuseShuffleCost + Cost;		return ReuseShuffleCost + Cost;
}		}
}		}
return ReuseShuffleCost + getGatherCost(VL);		return ReuseShuffleCost + getGatherCost(VL);
}		}
assert((E->State == TreeEntry::Vectorize ||		assert((E->State == TreeEntry::Vectorize ||
E->State == TreeEntry::ScatterVectorize) &&		E->State == TreeEntry::ScatterVectorize) &&
"Unhandled state");		"Unhandled state");
assert(E->getOpcode() && allSameType(VL) && allSameBlock(VL) && "Invalid VL");		assert(E->getOpcode() && allSameType(VL) && allSameBlock(VL) && "Invalid VL");
Instruction *VL0 = E->getMainOp();		Instruction *VL0 = E->getMainOp();
unsigned ShuffleOrOp =		unsigned ShuffleOrOp =
E->isAltShuffle() ? (unsigned)Instruction::ShuffleVector : E->getOpcode();		E->isAltShuffle() ? (unsigned)Instruction::ShuffleVector : E->getOpcode();
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI:		case Instruction::PHI:
return 0;		return 0;

[4,338 lines not shown]

llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S %s | FileCheck %s			; RUN: opt -slp-vectorizer -S %s | FileCheck %s

	target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
	target triple = "arm64-apple-darwin"			target triple = "arm64-apple-darwin"

	declare void @use(double)			declare void @use(double)

	; The extracts %v1.lane.0 and %v1.lane.1 should be considered free during SLP,			; The extracts %v1.lane.0 and %v1.lane.1 should be considered free during SLP,
	; because they will be directly in a vector register on AArch64.			; because they will be directly in a vector register on AArch64.
	define void @noop_extracts_first_2_lanes(<2 x double>* %ptr.1, <4 x double>* %ptr.2) {			define void @noop_extracts_first_2_lanes(<2 x double>* %ptr.1, <4 x double>* %ptr.2) {
	; CHECK-LABEL: @noop_extracts_first_2_lanes(			; CHECK-LABEL: @noop_extracts_first_2_lanes(
	; CHECK-NEXT: bb:			; CHECK-NEXT: bb:
	; CHECK-NEXT: [[V_1:%.*]] = load <2 x double>, <2 x double>* [[PTR_1:%.*]], align 8			; CHECK-NEXT: [[V_1:%.*]] = load <2 x double>, <2 x double>* [[PTR_1:%.*]], align 8
				; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
				; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
				; CHECK-NEXT: [[V2_LANE_3:%.*]] = extractelement <4 x double> [[V_2]], i32 3
				; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_2]], i32 0
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V2_LANE_3]], i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> [[V_1]], [[TMP1]]
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 0
				; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <2 x double> undef, double [[TMP3]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP2]], i32 1
				; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <2 x double> [[A_INS_0]], double [[TMP4]], i32 1
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[V_1]], i32 0
				; CHECK-NEXT: call void @use(double [[TMP5]])
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[V_1]], i32 1
				; CHECK-NEXT: call void @use(double [[TMP6]])
				; CHECK-NEXT: store <2 x double> [[A_INS_1]], <2 x double>* [[PTR_1]], align 8
				; CHECK-NEXT: ret void
				;
				bb:
				%v.1 = load <2 x double>, <2 x double>* %ptr.1, align 8
				%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
				%v1.lane.1 = extractelement <2 x double> %v.1, i32 1

				%v.2 = load <4 x double>, <4 x double>* %ptr.2, align 16
				%v2.lane.2 = extractelement <4 x double> %v.2, i32 2
				%v2.lane.3 = extractelement <4 x double> %v.2, i32 3

				%a.lane.0 = fmul double %v1.lane.0, %v2.lane.2
				%a.lane.1 = fmul double %v1.lane.1, %v2.lane.3

				%a.ins.0 = insertelement <2 x double> undef, double %a.lane.0, i32 0
				%a.ins.1 = insertelement <2 x double> %a.ins.0, double %a.lane.1, i32 1

				call void @use(double %v1.lane.0)
				call void @use(double %v1.lane.1)

				store <2 x double> %a.ins.1, <2 x double>* %ptr.1, align 8
				ret void
				}

				; Extracts of consecutive indices, but different vector operand.
				define void @extracts_first_2_lanes_different_vectors(<2 x double>* %ptr.1, <4 x double>* %ptr.2, <2 x double>* %ptr.3) {
				; CHECK-LABEL: @extracts_first_2_lanes_different_vectors(
				; CHECK-NEXT: bb:
				; CHECK-NEXT: [[V_1:%.*]] = load <2 x double>, <2 x double>* [[PTR_1:%.*]], align 8
	; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <2 x double> [[V_1]], i32 0			; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <2 x double> [[V_1]], i32 0
	; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <2 x double> [[V_1]], i32 1			; CHECK-NEXT: [[V_3:%.*]] = load <2 x double>, <2 x double>* [[PTR_3:%.*]], align 8
				; CHECK-NEXT: [[V3_LANE_1:%.*]] = extractelement <2 x double> [[V_3]], i32 1
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]			; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]
	; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_2]]			; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V3_LANE_1]], [[V2_LANE_2]]
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <2 x double> undef, double [[A_LANE_0]], i32 0			; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <2 x double> undef, double [[A_LANE_0]], i32 0
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <2 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1			; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <2 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1
	; CHECK-NEXT: call void @use(double [[V1_LANE_0]])			; CHECK-NEXT: call void @use(double [[V1_LANE_0]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_1]])			; CHECK-NEXT: call void @use(double [[V3_LANE_1]])
	; CHECK-NEXT: store <2 x double> [[A_INS_1]], <2 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <2 x double> [[A_INS_1]], <2 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%v.1 = load <2 x double>, <2 x double>* %ptr.1, align 8			%v.1 = load <2 x double>, <2 x double>* %ptr.1, align 8
	%v1.lane.0 = extractelement <2 x double> %v.1, i32 0			%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
	%v1.lane.1 = extractelement <2 x double> %v.1, i32 1			%v.3 = load <2 x double>, <2 x double>* %ptr.3, align 8
				%v3.lane.1 = extractelement <2 x double> %v.3, i32 1

	%v.2 = load <4 x double>, <4 x double>* %ptr.2, align 16			%v.2 = load <4 x double>, <4 x double>* %ptr.2, align 16
	%v2.lane.2 = extractelement <4 x double> %v.2, i32 2			%v2.lane.2 = extractelement <4 x double> %v.2, i32 2

	%a.lane.0 = fmul double %v1.lane.0, %v2.lane.2			%a.lane.0 = fmul double %v1.lane.0, %v2.lane.2
	%a.lane.1 = fmul double %v1.lane.1, %v2.lane.2			%a.lane.1 = fmul double %v3.lane.1, %v2.lane.2

	%a.ins.0 = insertelement <2 x double> undef, double %a.lane.0, i32 0			%a.ins.0 = insertelement <2 x double> undef, double %a.lane.0, i32 0
	%a.ins.1 = insertelement <2 x double> %a.ins.0, double %a.lane.1, i32 1			%a.ins.1 = insertelement <2 x double> %a.ins.0, double %a.lane.1, i32 1

	call void @use(double %v1.lane.0)			call void @use(double %v1.lane.0)
	call void @use(double %v1.lane.1)			call void @use(double %v3.lane.1)

	store <2 x double> %a.ins.1, <2 x double>* %ptr.1, align 8			store <2 x double> %a.ins.1, <2 x double>* %ptr.1, align 8
	ret void			ret void
	}			}

	; The extracts %v1.lane.2 and %v1.lane.3 should be considered free during SLP,			; The extracts %v1.lane.2 and %v1.lane.3 should be considered free during SLP,
	; because they will be directly in a vector register on AArch64.			; because they will be directly in a vector register on AArch64.
	define void @noop_extract_second_2_lanes(<4 x double>* %ptr.1, <4 x double>* %ptr.2) {			define void @noop_extract_second_2_lanes(<4 x double>* %ptr.1, <4 x double>* %ptr.2) {
	; CHECK-LABEL: @noop_extract_second_2_lanes(			; CHECK-LABEL: @noop_extract_second_2_lanes(
	; CHECK-NEXT: bb:			; CHECK-NEXT: bb:
	; CHECK-NEXT: [[V_1:%.*]] = load <4 x double>, <4 x double>* [[PTR_1:%.*]], align 8			; CHECK-NEXT: [[V_1:%.*]] = load <4 x double>, <4 x double>* [[PTR_1:%.*]], align 8
	; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <4 x double> [[V_1]], i32 2			; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <4 x double> [[V_1]], i32 2
	; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <4 x double> [[V_1]], i32 3			; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <4 x double> [[V_1]], i32 3
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V1_LANE_2]], i32 0
	; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[V1_LANE_3]], i32 1
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <4 x double> undef, double [[A_LANE_0]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_2]], i32 0
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <4 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[V2_LANE_2]], i32 1
				; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[TMP3]]
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP4]], i32 0
				; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <4 x double> undef, double [[TMP5]], i32 0
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP4]], i32 1
				; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <4 x double> [[A_INS_0]], double [[TMP6]], i32 1
	; CHECK-NEXT: call void @use(double [[V1_LANE_2]])			; CHECK-NEXT: call void @use(double [[V1_LANE_2]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_3]])			; CHECK-NEXT: call void @use(double [[V1_LANE_3]])
	; CHECK-NEXT: store <4 x double> [[A_INS_1]], <4 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <4 x double> [[A_INS_1]], <4 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%v.1 = load <4 x double>, <4 x double>* %ptr.1, align 8			%v.1 = load <4 x double>, <4 x double>* %ptr.1, align 8
	%v1.lane.2 = extractelement <4 x double> %v.1, i32 2			%v1.lane.2 = extractelement <4 x double> %v.1, i32 2
	[... 103 lines not shown ...]
	; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <9 x double> [[V_1]], i32 0			; CHECK-NEXT: [[V1_LANE_0:%.*]] = extractelement <9 x double> [[V_1]], i32 0
	; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <9 x double> [[V_1]], i32 1			; CHECK-NEXT: [[V1_LANE_1:%.*]] = extractelement <9 x double> [[V_1]], i32 1
	; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <9 x double> [[V_1]], i32 2			; CHECK-NEXT: [[V1_LANE_2:%.*]] = extractelement <9 x double> [[V_1]], i32 2
	; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <9 x double> [[V_1]], i32 3			; CHECK-NEXT: [[V1_LANE_3:%.*]] = extractelement <9 x double> [[V_1]], i32 3
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0			; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0
	; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1			; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x double> poison, double [[V1_LANE_2]], i32 0
	; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> [[TMP0]], double [[V1_LANE_3]], i32 1
	; CHECK-NEXT: [[A_LANE_2:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x double> [[TMP1]], double [[V1_LANE_0]], i32 2
	; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP2]], double [[V1_LANE_1]], i32 3
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[A_LANE_0]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[V2_LANE_2]], i32 0
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[V2_LANE_0]], i32 1
	; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP5]], <2 x double> poison, <4 x i32> <i32 0, i32 0, i32 0, i32 1>
	; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3			; CHECK-NEXT: [[TMP6:%.*]] = fmul <4 x double> [[TMP3]], [[SHUFFLE]]
				; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x double> [[TMP6]], i32 0
				; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[TMP7]], i32 0
				; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x double> [[TMP6]], i32 1
				; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[TMP8]], i32 1
				; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x double> [[TMP6]], i32 2
				; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[TMP9]], i32 2
				; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x double> [[TMP6]], i32 3
				; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[TMP10]], i32 3
	; CHECK-NEXT: call void @use(double [[V1_LANE_0]])			; CHECK-NEXT: call void @use(double [[V1_LANE_0]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_1]])			; CHECK-NEXT: call void @use(double [[V1_LANE_1]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_2]])			; CHECK-NEXT: call void @use(double [[V1_LANE_2]])
	; CHECK-NEXT: call void @use(double [[V1_LANE_3]])			; CHECK-NEXT: call void @use(double [[V1_LANE_3]])
	; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	[... 98 lines not shown ...]
	; CHECK-NEXT: [[V1_LANE_5:%.*]] = extractelement <9 x double> [[V_1]], i32 5			; CHECK-NEXT: [[V1_LANE_5:%.*]] = extractelement <9 x double> [[V_1]], i32 5
	; CHECK-NEXT: [[V1_LANE_6:%.*]] = extractelement <9 x double> [[V_1]], i32 6			; CHECK-NEXT: [[V1_LANE_6:%.*]] = extractelement <9 x double> [[V_1]], i32 6
	; CHECK-NEXT: [[V1_LANE_7:%.*]] = extractelement <9 x double> [[V_1]], i32 7			; CHECK-NEXT: [[V1_LANE_7:%.*]] = extractelement <9 x double> [[V_1]], i32 7
	; CHECK-NEXT: [[V1_LANE_8:%.*]] = extractelement <9 x double> [[V_1]], i32 8			; CHECK-NEXT: [[V1_LANE_8:%.*]] = extractelement <9 x double> [[V_1]], i32 8
	; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16			; CHECK-NEXT: [[V_2:%.*]] = load <4 x double>, <4 x double>* [[PTR_2:%.*]], align 16
	; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0			; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0
	; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1			; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1
	; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2			; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2
	; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <8 x double> poison, double [[V1_LANE_3]], i32 0
	; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_4]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x double> [[TMP0]], double [[V1_LANE_4]], i32 1
	; CHECK-NEXT: [[A_LANE_2:%.*]] = fmul double [[V1_LANE_5]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x double> [[TMP1]], double [[V1_LANE_5]], i32 2
	; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_6]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x double> [[TMP2]], double [[V1_LANE_6]], i32 3
	; CHECK-NEXT: [[A_LANE_4:%.*]] = fmul double [[V1_LANE_7]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x double> [[TMP3]], double [[V1_LANE_7]], i32 4
	; CHECK-NEXT: [[A_LANE_5:%.*]] = fmul double [[V1_LANE_8]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x double> [[TMP4]], double [[V1_LANE_8]], i32 5
	; CHECK-NEXT: [[A_LANE_6:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x double> [[TMP5]], double [[V1_LANE_0]], i32 6
	; CHECK-NEXT: [[A_LANE_7:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x double> [[TMP6]], double [[V1_LANE_1]], i32 7
				; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x double> poison, double [[V2_LANE_0]], i32 0
				; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x double> [[TMP8]], double [[V2_LANE_2]], i32 1
				; CHECK-NEXT: [[TMP10:%.*]] = insertelement <8 x double> [[TMP9]], double [[V2_LANE_1]], i32 2
				; CHECK-NEXT: [[TMP11:%.*]] = insertelement <8 x double> [[TMP10]], double [[V2_LANE_0]], i32 3
				; CHECK-NEXT: [[TMP12:%.*]] = insertelement <8 x double> [[TMP11]], double [[V2_LANE_2]], i32 4
				; CHECK-NEXT: [[TMP13:%.*]] = insertelement <8 x double> [[TMP12]], double [[V2_LANE_0]], i32 5
				; CHECK-NEXT: [[TMP14:%.*]] = insertelement <8 x double> [[TMP13]], double [[V2_LANE_2]], i32 6
				; CHECK-NEXT: [[TMP15:%.*]] = insertelement <8 x double> [[TMP14]], double [[V2_LANE_1]], i32 7
				; CHECK-NEXT: [[TMP16:%.*]] = fmul <8 x double> [[TMP7]], [[TMP15]]
	; CHECK-NEXT: [[A_LANE_8:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_0]]			; CHECK-NEXT: [[A_LANE_8:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_0]]
	; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[A_LANE_0]], i32 0			; CHECK-NEXT: [[TMP17:%.*]] = extractelement <8 x double> [[TMP16]], i32 0
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1			; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[TMP17]], i32 0
	; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <8 x double> [[TMP16]], i32 1
	; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3			; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[TMP18]], i32 1
	; CHECK-NEXT: [[A_INS_4:%.*]] = insertelement <9 x double> [[A_INS_3]], double [[A_LANE_4]], i32 4			; CHECK-NEXT: [[TMP19:%.*]] = extractelement <8 x double> [[TMP16]], i32 2
	; CHECK-NEXT: [[A_INS_5:%.*]] = insertelement <9 x double> [[A_INS_4]], double [[A_LANE_5]], i32 5			; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[TMP19]], i32 2
	; CHECK-NEXT: [[A_INS_6:%.*]] = insertelement <9 x double> [[A_INS_5]], double [[A_LANE_6]], i32 6			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <8 x double> [[TMP16]], i32 3
	; CHECK-NEXT: [[A_INS_7:%.*]] = insertelement <9 x double> [[A_INS_6]], double [[A_LANE_7]], i32 7			; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[TMP20]], i32 3
				; CHECK-NEXT: [[TMP21:%.*]] = extractelement <8 x double> [[TMP16]], i32 4
				; CHECK-NEXT: [[A_INS_4:%.*]] = insertelement <9 x double> [[A_INS_3]], double [[TMP21]], i32 4
				; CHECK-NEXT: [[TMP22:%.*]] = extractelement <8 x double> [[TMP16]], i32 5
				; CHECK-NEXT: [[A_INS_5:%.*]] = insertelement <9 x double> [[A_INS_4]], double [[TMP22]], i32 5
				; CHECK-NEXT: [[TMP23:%.*]] = extractelement <8 x double> [[TMP16]], i32 6
				; CHECK-NEXT: [[A_INS_6:%.*]] = insertelement <9 x double> [[A_INS_5]], double [[TMP23]], i32 6
				; CHECK-NEXT: [[TMP24:%.*]] = extractelement <8 x double> [[TMP16]], i32 7
				; CHECK-NEXT: [[A_INS_7:%.*]] = insertelement <9 x double> [[A_INS_6]], double [[TMP24]], i32 7
	; CHECK-NEXT: [[A_INS_8:%.*]] = insertelement <9 x double> [[A_INS_7]], double [[A_LANE_8]], i32 8			; CHECK-NEXT: [[A_INS_8:%.*]] = insertelement <9 x double> [[A_INS_7]], double [[A_LANE_8]], i32 8
	; CHECK-NEXT: [[B_LANE_0:%.*]] = fmul double [[V1_LANE_6]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP25:%.*]] = insertelement <8 x double> poison, double [[V1_LANE_6]], i32 0
	; CHECK-NEXT: [[B_LANE_1:%.*]] = fmul double [[V1_LANE_7]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP26:%.*]] = insertelement <8 x double> [[TMP25]], double [[V1_LANE_7]], i32 1
	; CHECK-NEXT: [[B_LANE_2:%.*]] = fmul double [[V1_LANE_8]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP27:%.*]] = insertelement <8 x double> [[TMP26]], double [[V1_LANE_8]], i32 2
	; CHECK-NEXT: [[B_LANE_3:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP28:%.*]] = insertelement <8 x double> [[TMP27]], double [[V1_LANE_0]], i32 3
	; CHECK-NEXT: [[B_LANE_4:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP29:%.*]] = insertelement <8 x double> [[TMP28]], double [[V1_LANE_1]], i32 4
	; CHECK-NEXT: [[B_LANE_5:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP30:%.*]] = insertelement <8 x double> [[TMP29]], double [[V1_LANE_2]], i32 5
	; CHECK-NEXT: [[B_LANE_6:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP31:%.*]] = insertelement <8 x double> [[TMP30]], double [[V1_LANE_3]], i32 6
	; CHECK-NEXT: [[B_LANE_7:%.*]] = fmul double [[V1_LANE_4]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP32:%.*]] = insertelement <8 x double> [[TMP31]], double [[V1_LANE_4]], i32 7
				; CHECK-NEXT: [[TMP33:%.*]] = insertelement <8 x double> poison, double [[V2_LANE_2]], i32 0
				; CHECK-NEXT: [[TMP34:%.*]] = insertelement <8 x double> [[TMP33]], double [[V2_LANE_1]], i32 1
				; CHECK-NEXT: [[TMP35:%.*]] = insertelement <8 x double> [[TMP34]], double [[V2_LANE_0]], i32 2
				; CHECK-NEXT: [[TMP36:%.*]] = insertelement <8 x double> [[TMP35]], double [[V2_LANE_2]], i32 3
				; CHECK-NEXT: [[TMP37:%.*]] = insertelement <8 x double> [[TMP36]], double [[V2_LANE_1]], i32 4
				; CHECK-NEXT: [[TMP38:%.*]] = insertelement <8 x double> [[TMP37]], double [[V2_LANE_0]], i32 5
				; CHECK-NEXT: [[TMP39:%.*]] = insertelement <8 x double> [[TMP38]], double [[V2_LANE_2]], i32 6
				; CHECK-NEXT: [[TMP40:%.*]] = insertelement <8 x double> [[TMP39]], double [[V2_LANE_1]], i32 7
				; CHECK-NEXT: [[TMP41:%.*]] = fmul <8 x double> [[TMP32]], [[TMP40]]
	; CHECK-NEXT: [[B_LANE_8:%.*]] = fmul double [[V1_LANE_5]], [[V2_LANE_0]]			; CHECK-NEXT: [[B_LANE_8:%.*]] = fmul double [[V1_LANE_5]], [[V2_LANE_0]]
	; CHECK-NEXT: [[B_INS_0:%.*]] = insertelement <9 x double> undef, double [[B_LANE_0]], i32 0			; CHECK-NEXT: [[TMP42:%.*]] = extractelement <8 x double> [[TMP41]], i32 0
	; CHECK-NEXT: [[B_INS_1:%.*]] = insertelement <9 x double> [[B_INS_0]], double [[B_LANE_1]], i32 1			; CHECK-NEXT: [[B_INS_0:%.*]] = insertelement <9 x double> undef, double [[TMP42]], i32 0
	; CHECK-NEXT: [[B_INS_2:%.*]] = insertelement <9 x double> [[B_INS_1]], double [[B_LANE_2]], i32 2			; CHECK-NEXT: [[TMP43:%.*]] = extractelement <8 x double> [[TMP41]], i32 1
	; CHECK-NEXT: [[B_INS_3:%.*]] = insertelement <9 x double> [[B_INS_2]], double [[B_LANE_3]], i32 3			; CHECK-NEXT: [[B_INS_1:%.*]] = insertelement <9 x double> [[B_INS_0]], double [[TMP43]], i32 1
	; CHECK-NEXT: [[B_INS_4:%.*]] = insertelement <9 x double> [[B_INS_3]], double [[B_LANE_4]], i32 4			; CHECK-NEXT: [[TMP44:%.*]] = extractelement <8 x double> [[TMP41]], i32 2
	; CHECK-NEXT: [[B_INS_5:%.*]] = insertelement <9 x double> [[B_INS_4]], double [[B_LANE_5]], i32 5			; CHECK-NEXT: [[B_INS_2:%.*]] = insertelement <9 x double> [[B_INS_1]], double [[TMP44]], i32 2
	; CHECK-NEXT: [[B_INS_6:%.*]] = insertelement <9 x double> [[B_INS_5]], double [[B_LANE_6]], i32 6			; CHECK-NEXT: [[TMP45:%.*]] = extractelement <8 x double> [[TMP41]], i32 3
	; CHECK-NEXT: [[B_INS_7:%.*]] = insertelement <9 x double> [[B_INS_6]], double [[B_LANE_7]], i32 7			; CHECK-NEXT: [[B_INS_3:%.*]] = insertelement <9 x double> [[B_INS_2]], double [[TMP45]], i32 3
				; CHECK-NEXT: [[TMP46:%.*]] = extractelement <8 x double> [[TMP41]], i32 4
				; CHECK-NEXT: [[B_INS_4:%.*]] = insertelement <9 x double> [[B_INS_3]], double [[TMP46]], i32 4
				; CHECK-NEXT: [[TMP47:%.*]] = extractelement <8 x double> [[TMP41]], i32 5
				; CHECK-NEXT: [[B_INS_5:%.*]] = insertelement <9 x double> [[B_INS_4]], double [[TMP47]], i32 5
				; CHECK-NEXT: [[TMP48:%.*]] = extractelement <8 x double> [[TMP41]], i32 6
				; CHECK-NEXT: [[B_INS_6:%.*]] = insertelement <9 x double> [[B_INS_5]], double [[TMP48]], i32 6
				; CHECK-NEXT: [[TMP49:%.*]] = extractelement <8 x double> [[TMP41]], i32 7
				; CHECK-NEXT: [[B_INS_7:%.*]] = insertelement <9 x double> [[B_INS_6]], double [[TMP49]], i32 7
	; CHECK-NEXT: [[B_INS_8:%.*]] = insertelement <9 x double> [[B_INS_7]], double [[B_LANE_8]], i32 8			; CHECK-NEXT: [[B_INS_8:%.*]] = insertelement <9 x double> [[B_INS_7]], double [[B_LANE_8]], i32 8
	; CHECK-NEXT: [[RES:%.*]] = fsub <9 x double> [[A_INS_8]], [[B_INS_8]]			; CHECK-NEXT: [[RES:%.*]] = fsub <9 x double> [[A_INS_8]], [[B_INS_8]]
	; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%v.1 = load <9 x double>, <9 x double>* %ptr.1, align 8			%v.1 = load <9 x double>, <9 x double>* %ptr.1, align 8
	%v1.lane.0 = extractelement <9 x double> %v.1, i32 0			%v1.lane.0 = extractelement <9 x double> %v.1, i32 0
	[... 88 lines not shown ...]
	; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1			; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1
	; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2			; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2
	; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3			; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3
	; CHECK-NEXT: [[A_INS_4:%.*]] = insertelement <9 x double> [[A_INS_3]], double [[A_LANE_4]], i32 4			; CHECK-NEXT: [[A_INS_4:%.*]] = insertelement <9 x double> [[A_INS_3]], double [[A_LANE_4]], i32 4
	; CHECK-NEXT: [[A_INS_5:%.*]] = insertelement <9 x double> [[A_INS_4]], double [[A_LANE_5]], i32 5			; CHECK-NEXT: [[A_INS_5:%.*]] = insertelement <9 x double> [[A_INS_4]], double [[A_LANE_5]], i32 5
	; CHECK-NEXT: [[A_INS_6:%.*]] = insertelement <9 x double> [[A_INS_5]], double [[A_LANE_6]], i32 6			; CHECK-NEXT: [[A_INS_6:%.*]] = insertelement <9 x double> [[A_INS_5]], double [[A_LANE_6]], i32 6
	; CHECK-NEXT: [[A_INS_7:%.*]] = insertelement <9 x double> [[A_INS_6]], double [[A_LANE_7]], i32 7			; CHECK-NEXT: [[A_INS_7:%.*]] = insertelement <9 x double> [[A_INS_6]], double [[A_LANE_7]], i32 7
	; CHECK-NEXT: [[A_INS_8:%.*]] = insertelement <9 x double> [[A_INS_7]], double [[A_LANE_8]], i32 8			; CHECK-NEXT: [[A_INS_8:%.*]] = insertelement <9 x double> [[A_INS_7]], double [[A_LANE_8]], i32 8
	; CHECK-NEXT: [[B_LANE_0:%.*]] = fmul double [[V1_LANE_6]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <8 x double> poison, double [[V1_LANE_6]], i32 0
	; CHECK-NEXT: [[B_LANE_1:%.*]] = fmul double [[V1_LANE_7]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x double> [[TMP0]], double [[V1_LANE_7]], i32 1
	; CHECK-NEXT: [[B_LANE_2:%.*]] = fmul double [[V1_LANE_8]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x double> [[TMP1]], double [[V1_LANE_8]], i32 2
	; CHECK-NEXT: [[B_LANE_3:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x double> [[TMP2]], double [[V1_LANE_0]], i32 3
	; CHECK-NEXT: [[B_LANE_4:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x double> [[TMP3]], double [[V1_LANE_1]], i32 4
	; CHECK-NEXT: [[B_LANE_5:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_1]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x double> [[TMP4]], double [[V1_LANE_2]], i32 5
	; CHECK-NEXT: [[B_LANE_6:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_0]]			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x double> [[TMP5]], double [[V1_LANE_3]], i32 6
	; CHECK-NEXT: [[B_LANE_7:%.*]] = fmul double [[V1_LANE_4]], [[V2_LANE_2]]			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x double> [[TMP6]], double [[V1_LANE_4]], i32 7
				; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x double> poison, double [[V2_LANE_1]], i32 0
				; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x double> [[TMP8]], double [[V2_LANE_0]], i32 1
				; CHECK-NEXT: [[TMP10:%.*]] = insertelement <8 x double> [[TMP9]], double [[V2_LANE_2]], i32 2
				; CHECK-NEXT: [[TMP11:%.*]] = insertelement <8 x double> [[TMP10]], double [[V2_LANE_0]], i32 3
				; CHECK-NEXT: [[TMP12:%.*]] = insertelement <8 x double> [[TMP11]], double [[V2_LANE_2]], i32 4
				; CHECK-NEXT: [[TMP13:%.*]] = insertelement <8 x double> [[TMP12]], double [[V2_LANE_1]], i32 5
				; CHECK-NEXT: [[TMP14:%.*]] = insertelement <8 x double> [[TMP13]], double [[V2_LANE_0]], i32 6
				; CHECK-NEXT: [[TMP15:%.*]] = insertelement <8 x double> [[TMP14]], double [[V2_LANE_2]], i32 7
				; CHECK-NEXT: [[TMP16:%.*]] = fmul <8 x double> [[TMP7]], [[TMP15]]
	; CHECK-NEXT: [[B_LANE_8:%.*]] = fmul double [[V1_LANE_5]], [[V2_LANE_0]]			; CHECK-NEXT: [[B_LANE_8:%.*]] = fmul double [[V1_LANE_5]], [[V2_LANE_0]]
	; CHECK-NEXT: [[B_INS_0:%.*]] = insertelement <9 x double> undef, double [[B_LANE_0]], i32 0			; CHECK-NEXT: [[TMP17:%.*]] = extractelement <8 x double> [[TMP16]], i32 0
	; CHECK-NEXT: [[B_INS_1:%.*]] = insertelement <9 x double> [[B_INS_0]], double [[B_LANE_1]], i32 1			; CHECK-NEXT: [[B_INS_0:%.*]] = insertelement <9 x double> undef, double [[TMP17]], i32 0
	; CHECK-NEXT: [[B_INS_2:%.*]] = insertelement <9 x double> [[B_INS_1]], double [[B_LANE_2]], i32 2			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <8 x double> [[TMP16]], i32 1
	; CHECK-NEXT: [[B_INS_3:%.*]] = insertelement <9 x double> [[B_INS_2]], double [[B_LANE_3]], i32 3			; CHECK-NEXT: [[B_INS_1:%.*]] = insertelement <9 x double> [[B_INS_0]], double [[TMP18]], i32 1
	; CHECK-NEXT: [[B_INS_4:%.*]] = insertelement <9 x double> [[B_INS_3]], double [[B_LANE_4]], i32 4			; CHECK-NEXT: [[TMP19:%.*]] = extractelement <8 x double> [[TMP16]], i32 2
	; CHECK-NEXT: [[B_INS_5:%.*]] = insertelement <9 x double> [[B_INS_4]], double [[B_LANE_5]], i32 5			; CHECK-NEXT: [[B_INS_2:%.*]] = insertelement <9 x double> [[B_INS_1]], double [[TMP19]], i32 2
	; CHECK-NEXT: [[B_INS_6:%.*]] = insertelement <9 x double> [[B_INS_5]], double [[B_LANE_6]], i32 6			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <8 x double> [[TMP16]], i32 3
	; CHECK-NEXT: [[B_INS_7:%.*]] = insertelement <9 x double> [[B_INS_6]], double [[B_LANE_7]], i32 7			; CHECK-NEXT: [[B_INS_3:%.*]] = insertelement <9 x double> [[B_INS_2]], double [[TMP20]], i32 3
				; CHECK-NEXT: [[TMP21:%.*]] = extractelement <8 x double> [[TMP16]], i32 4
				; CHECK-NEXT: [[B_INS_4:%.*]] = insertelement <9 x double> [[B_INS_3]], double [[TMP21]], i32 4
				; CHECK-NEXT: [[TMP22:%.*]] = extractelement <8 x double> [[TMP16]], i32 5
				; CHECK-NEXT: [[B_INS_5:%.*]] = insertelement <9 x double> [[B_INS_4]], double [[TMP22]], i32 5
				; CHECK-NEXT: [[TMP23:%.*]] = extractelement <8 x double> [[TMP16]], i32 6
				; CHECK-NEXT: [[B_INS_6:%.*]] = insertelement <9 x double> [[B_INS_5]], double [[TMP23]], i32 6
				; CHECK-NEXT: [[TMP24:%.*]] = extractelement <8 x double> [[TMP16]], i32 7
				; CHECK-NEXT: [[B_INS_7:%.*]] = insertelement <9 x double> [[B_INS_6]], double [[TMP24]], i32 7
	; CHECK-NEXT: [[B_INS_8:%.*]] = insertelement <9 x double> [[B_INS_7]], double [[B_LANE_8]], i32 8			; CHECK-NEXT: [[B_INS_8:%.*]] = insertelement <9 x double> [[B_INS_7]], double [[B_LANE_8]], i32 8
	; CHECK-NEXT: [[RES:%.*]] = fsub <9 x double> [[A_INS_8]], [[B_INS_8]]			; CHECK-NEXT: [[RES:%.*]] = fsub <9 x double> [[A_INS_8]], [[B_INS_8]]
	; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[PTR_1]], align 8			; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[PTR_1]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	bb:			bb:
	%v.1 = load <9 x double>, <9 x double>* %ptr.1, align 8			%v.1 = load <9 x double>, <9 x double>* %ptr.1, align 8
	%v1.lane.0 = extractelement <9 x double> %v.1, i32 0			%v1.lane.0 = extractelement <9 x double> %v.1, i32 0
	[... 179 lines not shown ...]

llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll

	[... 137 lines not shown ...]
	}			}

	define <4 x float> @fmul_fdiv_v4f32_const(<4 x float> %a) {			define <4 x float> @fmul_fdiv_v4f32_const(<4 x float> %a) {
	; SSE-LABEL: @fmul_fdiv_v4f32_const(			; SSE-LABEL: @fmul_fdiv_v4f32_const(
	; SSE-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>			; SSE-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>
	; SSE-NEXT: ret <4 x float> [[TMP1]]			; SSE-NEXT: ret <4 x float> [[TMP1]]
	;			;
	; SLM-LABEL: @fmul_fdiv_v4f32_const(			; SLM-LABEL: @fmul_fdiv_v4f32_const(
	; SLM-NEXT: [[A0:%.*]] = extractelement <4 x float> [[A:%.*]], i32 0			; SLM-NEXT: [[A2:%.*]] = extractelement <4 x float> [[A:%.*]], i32 2
	; SLM-NEXT: [[A1:%.*]] = extractelement <4 x float> [[A]], i32 1
	; SLM-NEXT: [[A2:%.*]] = extractelement <4 x float> [[A]], i32 2
	; SLM-NEXT: [[A3:%.*]] = extractelement <4 x float> [[A]], i32 3			; SLM-NEXT: [[A3:%.*]] = extractelement <4 x float> [[A]], i32 3
	; SLM-NEXT: [[AB0:%.*]] = fmul float [[A0]], 2.000000e+00			; SLM-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A]], <4 x float> undef, <2 x i32> <i32 0, i32 1>
				; SLM-NEXT: [[TMP2:%.*]] = fmul <2 x float> [[TMP1]], <float 2.000000e+00, float 1.000000e+00>
	; SLM-NEXT: [[AB3:%.*]] = fmul float [[A3]], 2.000000e+00			; SLM-NEXT: [[AB3:%.*]] = fmul float [[A3]], 2.000000e+00
	; SLM-NEXT: [[R0:%.*]] = insertelement <4 x float> poison, float [[AB0]], i32 0			; SLM-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0
	; SLM-NEXT: [[R1:%.*]] = insertelement <4 x float> [[R0]], float [[A1]], i32 1			; SLM-NEXT: [[R0:%.*]] = insertelement <4 x float> poison, float [[TMP3]], i32 0
				; SLM-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1
				; SLM-NEXT: [[R1:%.*]] = insertelement <4 x float> [[R0]], float [[TMP4]], i32 1
	; SLM-NEXT: [[R2:%.*]] = insertelement <4 x float> [[R1]], float [[A2]], i32 2			; SLM-NEXT: [[R2:%.*]] = insertelement <4 x float> [[R1]], float [[A2]], i32 2
	; SLM-NEXT: [[R3:%.*]] = insertelement <4 x float> [[R2]], float [[AB3]], i32 3			; SLM-NEXT: [[R3:%.*]] = insertelement <4 x float> [[R2]], float [[AB3]], i32 3
	; SLM-NEXT: ret <4 x float> [[R3]]			; SLM-NEXT: ret <4 x float> [[R3]]
	;			;
	; AVX-LABEL: @fmul_fdiv_v4f32_const(			; AVX-LABEL: @fmul_fdiv_v4f32_const(
	; AVX-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>			; AVX-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>
	; AVX-NEXT: ret <4 x float> [[TMP1]]			; AVX-NEXT: ret <4 x float> [[TMP1]]
	;			;

llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll

	}			}

	define <4 x float> @fmul_fdiv_v4f32_const(<4 x float> %a) {			define <4 x float> @fmul_fdiv_v4f32_const(<4 x float> %a) {
	; SSE-LABEL: @fmul_fdiv_v4f32_const(			; SSE-LABEL: @fmul_fdiv_v4f32_const(
	; SSE-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>			; SSE-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>
	; SSE-NEXT: ret <4 x float> [[TMP1]]			; SSE-NEXT: ret <4 x float> [[TMP1]]
	;			;
	; SLM-LABEL: @fmul_fdiv_v4f32_const(			; SLM-LABEL: @fmul_fdiv_v4f32_const(
	; SLM-NEXT: [[A0:%.*]] = extractelement <4 x float> [[A:%.*]], i32 0			; SLM-NEXT: [[A2:%.*]] = extractelement <4 x float> [[A:%.*]], i32 2
	; SLM-NEXT: [[A1:%.*]] = extractelement <4 x float> [[A]], i32 1
	; SLM-NEXT: [[A2:%.*]] = extractelement <4 x float> [[A]], i32 2
	; SLM-NEXT: [[A3:%.*]] = extractelement <4 x float> [[A]], i32 3			; SLM-NEXT: [[A3:%.*]] = extractelement <4 x float> [[A]], i32 3
	; SLM-NEXT: [[AB0:%.*]] = fmul float [[A0]], 2.000000e+00			; SLM-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A]], <4 x float> undef, <2 x i32> <i32 0, i32 1>
				; SLM-NEXT: [[TMP2:%.*]] = fmul <2 x float> [[TMP1]], <float 2.000000e+00, float 1.000000e+00>
	; SLM-NEXT: [[AB3:%.*]] = fmul float [[A3]], 2.000000e+00			; SLM-NEXT: [[AB3:%.*]] = fmul float [[A3]], 2.000000e+00
	; SLM-NEXT: [[R0:%.*]] = insertelement <4 x float> undef, float [[AB0]], i32 0			; SLM-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0
	; SLM-NEXT: [[R1:%.*]] = insertelement <4 x float> [[R0]], float [[A1]], i32 1			; SLM-NEXT: [[R0:%.*]] = insertelement <4 x float> undef, float [[TMP3]], i32 0
				; SLM-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1
				; SLM-NEXT: [[R1:%.*]] = insertelement <4 x float> [[R0]], float [[TMP4]], i32 1
	; SLM-NEXT: [[R2:%.*]] = insertelement <4 x float> [[R1]], float [[A2]], i32 2			; SLM-NEXT: [[R2:%.*]] = insertelement <4 x float> [[R1]], float [[A2]], i32 2
	; SLM-NEXT: [[R3:%.*]] = insertelement <4 x float> [[R2]], float [[AB3]], i32 3			; SLM-NEXT: [[R3:%.*]] = insertelement <4 x float> [[R2]], float [[AB3]], i32 3
	; SLM-NEXT: ret <4 x float> [[R3]]			; SLM-NEXT: ret <4 x float> [[R3]]
	;			;
	; AVX-LABEL: @fmul_fdiv_v4f32_const(			; AVX-LABEL: @fmul_fdiv_v4f32_const(
	; AVX-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>			; AVX-NEXT: [[TMP1:%.*]] = fmul <4 x float> [[A:%.*]], <float 2.000000e+00, float 1.000000e+00, float 1.000000e+00, float 2.000000e+00>
	; AVX-NEXT: ret <4 x float> [[TMP1]]			; AVX-NEXT: ret <4 x float> [[TMP1]]
	;			;

Diff 334928