This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Do not vectorize non-profitable alternate nodes.
ClosedPublic

Authored by ABataev on May 13 2022, 10:47 AM.

Details

Reviewers

RKSimon
vdmitrie
vporpo

Commits

rG8b8281f35475: [SLP]Do not vectorize non-profitable alternate nodes.

Summary

If an alternate node has only 2 instructions and the tree is already big
enough, it is better to skip vectorizing such a node: it is not very
profitable (the resulting code contains 3 instructions instead of the
original 2 scalars). SLP can try to vectorize the buildvector sequence
in the next attempt, if it is profitable.
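
To make the instruction count in the summary concrete, here is a minimal sketch in plain C++ (not the LLVM API; `Vec2`, `scalarAltNode`, and `vectorAltNode` are hypothetical names) that models a 2-lane alternate node. The add/sub pairing is an assumption chosen for illustration; the removed FORCE_REDUCTION checks in PR39774.ll show the same shape with `and`/`add` followed by a `shufflevector`.

```cpp
#include <array>

using Vec2 = std::array<int, 2>;

// Scalar form: two instructions, one per lane (lane 0 adds, lane 1 subtracts).
Vec2 scalarAltNode(const Vec2 &a, const Vec2 &b) {
  return {a[0] + b[0], a[1] - b[1]};
}

// Vectorized form of the same 2-lane alternate node: three instructions.
Vec2 vectorAltNode(const Vec2 &a, const Vec2 &b) {
  Vec2 sum = {a[0] + b[0], a[1] + b[1]};   // 1: vector add on both lanes
  Vec2 diff = {a[0] - b[0], a[1] - b[1]};  // 2: vector sub on both lanes
  return {sum[0], diff[1]};                // 3: shuffle picks lane 0 of sum, lane 1 of diff
}
```

Both functions compute the same result, but the vector form spends three operations where the scalar form spends two, which is the cost the summary refers to.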

Metric: SLP.NumVectorInstructions

Program                                                                           results  results0    diff
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test           72.00     73.00    1.4%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test    1186.00   1198.00    1.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test          241.00    242.00    0.4%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test                      2131.00   2139.00    0.4%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test     6377.00   6384.00    0.1%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test    6377.00   6384.00    0.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test           12650.00  12658.00    0.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test         26169.00  26147.00   -0.1%
test-suite :: MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des.test                99.00     86.00  -13.1%

Gains:
526.blender_r - more vectorized trees.
enc-3des - same.

Others:
510.parest_r - no changes.
miniFE - same
623.xalancbmk_s - some (non-profitable) parts of the trees are not vectorized.
523.xalancbmk_r - same
lencod - same
timberwolfmc - same
miniAMR - same

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. May 13 2022, 10:47 AM

Herald added a project: Restricted Project. May 13 2022, 10:47 AM

Herald added subscribers: dmgreen, hiraditya.

ABataev requested review of this revision. May 13 2022, 10:47 AM

Herald added a project: Restricted Project. May 13 2022, 10:47 AM

LGTM

This revision is now accepted and ready to land. May 13 2022, 11:12 AM

vdmitrie accepted this revision. May 13 2022, 11:42 AM

Harbormaster completed remote builds in B164354: Diff 429293. May 13 2022, 12:19 PM

This revision was landed with ongoing or failed builds. May 13 2022, 2:48 PM

Closed by commit rG8b8281f35475: [SLP]Do not vectorize non-profitable alternate nodes. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rG8b8281f35475: [SLP]Do not vectorize non-profitable alternate nodes..

Just a heads up, we are seeing ~6-7% regressions on some benchmarks with this change. I'll try to narrow down a reproducer.

Revision Contents

Path                                                     Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp          64 lines
llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll        34 lines
llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll   18 lines

Diff 429361

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[2,010 lines not shown]
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   LLVM_DUMP_METHOD void dump() const { print(dbgs()); }
 #endif
   };

   /// Evaluate each pair in \p Candidates and return index into \p Candidates
   /// for a pair which have highest score deemed to have best chance to form
   /// root of profitable tree to vectorize. Return None if no candidate scored
   /// above the LookAheadHeuristics::ScoreFail.
+  /// \param Limit Lower limit of the cost, considered to be good enough score.
   Optional<int>
-  findBestRootPair(ArrayRef<std::pair<Value *, Value *>> Candidates) {
+  findBestRootPair(ArrayRef<std::pair<Value *, Value *>> Candidates,
+                   int Limit = LookAheadHeuristics::ScoreFail) {
     LookAheadHeuristics LookAhead(*DL, *SE, *this, /*NumLanes=*/2,
                                   RootLookAheadMaxDepth);
-    int BestScore = LookAheadHeuristics::ScoreFail;
+    int BestScore = Limit;
     Optional<int> Index = None;
     for (int I : seq<int>(0, Candidates.size())) {
       int Score = LookAhead.getScoreAtLevelRec(Candidates[I].first,
                                                Candidates[I].second,
                                                /*U1=*/nullptr, /*U2=*/nullptr,
                                                /*Level=*/1, None);
       if (Score > BestScore) {
         BestScore = Score;
[2,487 lines not shown]
       if (SI->getValueOperand()->getType()->isVectorTy()) {
         LLVM_DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");
         newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
         return;
       }

   // If all of the operands are identical or constant we have a simple solution.
   // If we deal with insert/extract instructions, they all must have constant
   // indices, otherwise we should gather them, not try to vectorize.
+  // If alternate op node with 2 elements with gathered operands - do not
+  // vectorize.
+  auto &&NotProfitableForVectorization = [&S, this,
+                                          Depth](ArrayRef<Value *> VL) {
+    if (!S.getOpcode() || !S.isAltShuffle() || VL.size() > 2)
+      return false;
+    if (VectorizableTree.size() < MinTreeSize)
+      return false;
+    if (Depth >= RecursionMaxDepth - 1)
+      return true;
+    // Check if all operands are extracts, part of vector node or can build a
+    // regular vectorize node.
+    SmallVector<unsigned, 2> InstsCount(VL.size(), 0);
+    for (Value *V : VL) {
+      auto *I = cast<Instruction>(V);
+      InstsCount.push_back(count_if(I->operand_values(), [](Value *Op) {
+        return isa<Instruction>(Op) || isVectorLikeInstWithConstOps(Op);
+      }));
+    }
+    bool IsCommutative = isCommutative(S.MainOp) || isCommutative(S.AltOp);
+    if ((IsCommutative &&
+         std::accumulate(InstsCount.begin(), InstsCount.end(), 0) < 2) ||
+        (!IsCommutative &&
+         all_of(InstsCount, [](unsigned ICnt) { return ICnt < 2; })))
+      return true;
+    assert(VL.size() == 2 && "Expected only 2 alternate op instructions.");
+    SmallVector<SmallVector<std::pair<Value *, Value *>>> Candidates;
+    auto *I1 = cast<Instruction>(VL.front());
+    auto *I2 = cast<Instruction>(VL.back());
+    for (int Op = 0, E = S.MainOp->getNumOperands(); Op < E; ++Op)
+      Candidates.emplace_back().emplace_back(I1->getOperand(Op),
+                                             I2->getOperand(Op));
+    if (count_if(
+            Candidates, [this](ArrayRef<std::pair<Value *, Value *>> Cand) {
+              return findBestRootPair(Cand, LookAheadHeuristics::ScoreSplat);
+            }) >= S.MainOp->getNumOperands() / 2)
+      return false;
+    if (S.MainOp->getNumOperands() > 2)
+      return true;
+    if (IsCommutative) {
+      // Check permuted operands.
+      Candidates.clear();
+      for (int Op = 0, E = S.MainOp->getNumOperands(); Op < E; ++Op)
+        Candidates.emplace_back().emplace_back(I1->getOperand(Op),
+                                               I2->getOperand((Op + 1) % E));
+      if (any_of(
+              Candidates, [this](ArrayRef<std::pair<Value *, Value *>> Cand) {
+                return findBestRootPair(Cand, LookAheadHeuristics::ScoreSplat);
+              }))
+        return false;
+    }
+    return true;
+  };
   if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !S.getOpcode() ||
       (isa<InsertElementInst, ExtractValueInst, ExtractElementInst>(S.MainOp) &&
-       !all_of(VL, isVectorLikeInstWithConstOps))) {
-    LLVM_DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");
+       !all_of(VL, isVectorLikeInstWithConstOps)) ||
+      NotProfitableForVectorization(VL)) {
+    LLVM_DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O, small shuffle. \n");
     if (TryToFindDuplicates(S))
       newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
                    ReuseShuffleIndicies);
     return;
   }

   // We now know that this is a vector of instructions of the same type from
   // the same block.
[7,198 lines not shown]
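
The interplay between the new `Limit` parameter and the `count_if(...) >= getNumOperands() / 2` test may be easier to follow in isolation. Below is a minimal standalone sketch (C++17, standard library only). `bestIndexAbove` and `enoughGoodOperands` are hypothetical names, plain `int` scores stand in for the look-ahead scores, and the assignment of the index inside the loop is assumed because that part of the function is not shown in the diff.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Index of the highest score strictly above `limit`, or nullopt if no score
// clears it. Stand-in for findBestRootPair with its Limit parameter.
std::optional<int> bestIndexAbove(const std::vector<int> &scores, int limit) {
  int best = limit; // starting from the limit rejects anything at or below it
  std::optional<int> index;
  for (int i = 0; i < static_cast<int>(scores.size()); ++i) {
    if (scores[i] > best) {
      best = scores[i];
      index = i; // assumed: the hidden part of the loop records the index
    }
  }
  return index;
}

// Each inner vector holds the candidate scores for one operand position.
// Returns true when at least half of the operand positions (integer division)
// have a candidate above `limit`.
bool enoughGoodOperands(const std::vector<std::vector<int>> &perOperand,
                        int limit) {
  std::size_t good = 0;
  for (const auto &scores : perOperand)
    if (bestIndexAbove(scores, limit).has_value())
      ++good;
  return good >= perOperand.size() / 2;
}
```

Note that the sketch tests `has_value()` rather than the index itself: in the patch each candidate list holds a single pair, so a successful result is index 0, and it is the presence of a value, not the index, that counts as success.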

llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-4 | FileCheck %s --check-prefix=CHECK
-; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-7 -slp-min-tree-size=6 | FileCheck %s --check-prefix=FORCE_REDUCTION
+; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-6 -slp-min-tree-size=5 | FileCheck %s --check-prefix=FORCE_REDUCTION

 define void @Test(i32) {
 ; CHECK-LABEL: @Test(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[TMP1:%.*]] = insertelement <8 x i32> poison, i32 [[TMP0:%.*]], i32 0
 ; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> [[TMP1]], i32 [[TMP0]], i32 1
 ; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[TMP0]], i32 2
 ; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[TMP0]], i32 3
[62 lines not shown]
 ; FORCE_REDUCTION-NEXT: [[TMP19:%.*]] = insertelement <16 x i32> [[TMP18]], i32 [[TMP0]], i32 10
 ; FORCE_REDUCTION-NEXT: [[TMP20:%.*]] = insertelement <16 x i32> [[TMP19]], i32 [[TMP0]], i32 11
 ; FORCE_REDUCTION-NEXT: [[TMP21:%.*]] = insertelement <16 x i32> [[TMP20]], i32 [[TMP0]], i32 12
 ; FORCE_REDUCTION-NEXT: [[TMP22:%.*]] = insertelement <16 x i32> [[TMP21]], i32 [[TMP0]], i32 13
 ; FORCE_REDUCTION-NEXT: [[TMP23:%.*]] = insertelement <16 x i32> [[TMP22]], i32 [[TMP0]], i32 14
 ; FORCE_REDUCTION-NEXT: [[TMP24:%.*]] = insertelement <16 x i32> [[TMP23]], i32 [[TMP0]], i32 15
 ; FORCE_REDUCTION-NEXT: br label [[LOOP:%.*]]
 ; FORCE_REDUCTION: loop:
-; FORCE_REDUCTION-NEXT: [[TMP25:%.*]] = phi <2 x i32> [ [[TMP36:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
+; FORCE_REDUCTION-NEXT: [[TMP25:%.*]] = phi <2 x i32> [ [[TMP32:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
 ; FORCE_REDUCTION-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP25]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
-; FORCE_REDUCTION-NEXT: [[TMP26:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
-; FORCE_REDUCTION-NEXT: [[TMP27:%.*]] = call i32 @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP24]])
-; FORCE_REDUCTION-NEXT: [[TMP28:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP8]])
-; FORCE_REDUCTION-NEXT: [[TMP29:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP26]])
-; FORCE_REDUCTION-NEXT: [[OP_RDX13:%.*]] = and i32 [[TMP29]], [[TMP0]]
-; FORCE_REDUCTION-NEXT: [[OP_RDX14:%.*]] = and i32 [[OP_RDX13]], [[TMP0]]
-; FORCE_REDUCTION-NEXT: [[OP_RDX15:%.*]] = and i32 [[OP_RDX14]], [[TMP0]]
-; FORCE_REDUCTION-NEXT: [[OP_RDX16:%.*]] = and i32 [[OP_RDX15]], [[TMP27]]
-; FORCE_REDUCTION-NEXT: [[OP_RDX17:%.*]] = and i32 [[OP_RDX16]], [[TMP28]]
-; FORCE_REDUCTION-NEXT: [[TMP30:%.*]] = insertelement <2 x i32> <i32 poison, i32 14910>, i32 [[OP_RDX17]], i32 0
-; FORCE_REDUCTION-NEXT: [[TMP31:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
-; FORCE_REDUCTION-NEXT: [[TMP32:%.*]] = insertelement <2 x i32> poison, i32 [[TMP31]], i32 0
-; FORCE_REDUCTION-NEXT: [[TMP33:%.*]] = insertelement <2 x i32> [[TMP32]], i32 [[TMP31]], i32 1
-; FORCE_REDUCTION-NEXT: [[TMP34:%.*]] = and <2 x i32> [[TMP30]], [[TMP33]]
-; FORCE_REDUCTION-NEXT: [[TMP35:%.*]] = add <2 x i32> [[TMP30]], [[TMP33]]
-; FORCE_REDUCTION-NEXT: [[TMP36]] = shufflevector <2 x i32> [[TMP34]], <2 x i32> [[TMP35]], <2 x i32> <i32 0, i32 3>
+; FORCE_REDUCTION-NEXT: [[TMP26:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
+; FORCE_REDUCTION-NEXT: [[TMP27:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
+; FORCE_REDUCTION-NEXT: [[TMP28:%.*]] = call i32 @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP24]])
+; FORCE_REDUCTION-NEXT: [[TMP29:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP8]])
+; FORCE_REDUCTION-NEXT: [[OP_RDX:%.*]] = and i32 [[TMP28]], [[TMP29]]
+; FORCE_REDUCTION-NEXT: [[TMP30:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP27]])
+; FORCE_REDUCTION-NEXT: [[OP_RDX1:%.*]] = and i32 [[OP_RDX]], [[TMP30]]
+; FORCE_REDUCTION-NEXT: [[OP_RDX2:%.*]] = and i32 [[OP_RDX1]], [[TMP0]]
+; FORCE_REDUCTION-NEXT: [[OP_RDX3:%.*]] = and i32 [[OP_RDX2]], [[TMP0]]
+; FORCE_REDUCTION-NEXT: [[OP_RDX4:%.*]] = and i32 [[OP_RDX3]], [[TMP0]]
+; FORCE_REDUCTION-NEXT: [[OP_RDX5:%.*]] = and i32 [[OP_RDX4]], [[TMP26]]
+; FORCE_REDUCTION-NEXT: [[VAL_43:%.*]] = add i32 [[TMP26]], 14910
+; FORCE_REDUCTION-NEXT: [[TMP31:%.*]] = insertelement <2 x i32> poison, i32 [[OP_RDX5]], i32 0
+; FORCE_REDUCTION-NEXT: [[TMP32]] = insertelement <2 x i32> [[TMP31]], i32 [[VAL_43]], i32 1
 ; FORCE_REDUCTION-NEXT: br label [[LOOP]]
 ;
 entry:
   br label %loop

 loop:
   %local_4_39.us = phi i32 [ %val_42, %loop ], [ 0, %entry ]
   %local_8_43.us = phi i32 [ %val_43, %loop ], [ 0, %entry ]
[46 lines not shown]

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s

 define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {
 ; CHECK-LABEL: @rftbsub(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2
-; CHECK-NEXT: [[TMP0:%.*]] = load double, double* [[ARRAYIDX6]], align 8
-; CHECK-NEXT: [[TMP1:%.*]] = or i64 2, 1
-; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP1]]
-; CHECK-NEXT: [[TMP2:%.*]] = load double, double* [[ARRAYIDX12]], align 8
+; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef
+; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
+; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x double> [[TMP1]], i32 1
 ; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP2]], undef
 ; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]
 ; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]
-; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef
-; CHECK-NEXT: [[SUB25:%.*]] = fsub double [[TMP0]], [[ADD19]]
-; CHECK-NEXT: store double [[SUB25]], double* [[ARRAYIDX6]], align 8
-; CHECK-NEXT: [[SUB29:%.*]] = fsub double [[TMP2]], [[SUB22]]
-; CHECK-NEXT: store double [[SUB29]], double* [[ARRAYIDX12]], align 8
+; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> poison, double [[ADD19]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[SUB22]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = fsub <2 x double> [[TMP1]], [[TMP4]]
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
+; CHECK-NEXT: store <2 x double> [[TMP5]], <2 x double>* [[TMP6]], align 8
 ; CHECK-NEXT: unreachable
 ;
 entry:
   %arrayidx6 = getelementptr inbounds double, double* %a, i64 2
   %0 = load double, double* %arrayidx6, align 8
   %1 = or i64 2, 1
   %arrayidx12 = getelementptr inbounds double, double* %a, i64 %1
   %2 = load double, double* %arrayidx12, align 8
[10 lines not shown]