This is an archive of the discontinued LLVM Phabricator instance.

[Reassociate] Add initial support for vector instructions.
ClosedPublic

Authored by rob.lougher on Feb 11 2015, 10:02 AM.


Details

Reviewers

spatel
hfinkel
mcrosier

Commits

rG1bad505c3c19: [Reassociate] Add initial support for vector instructions.
rL232190: [Reassociate] Add initial support for vector instructions.

Summary

This patch adds initial support for vector instructions to the reassociation pass. It enables most parts of the pass to work with vectors, but to keep the patch small, optimization of Xor trees, canonicalization of negative constants, conversion of shifts to muls, etc. have been left out. These will be handled in later patches.

The patch is based on an initial patch by Chad Rosier (see http://reviews.llvm.org/D5222).

Robert Lougher
SN Systems - Sony Computer Entertainment Group
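
For readers following the diff, most of the change swaps scalar-only type checks for checks that also accept vectors. The sketch below is a toy model, not LLVM code: `ToyType`, `pickAddOld` and `pickAddNew` are hypothetical names, and only the predicate names (`isIntegerTy`, `isIntOrIntVectorTy`, `isFPOrFPVectorTy`) mirror the ones used in the patch. It illustrates why a scalar-only check would select the floating-point opcode for an integer vector once vectors are allowed to reach these helpers.

```cpp
#include <cassert>

// Toy stand-in for a type descriptor. This is NOT llvm::Type; it only
// models the distinction the patch relies on.
enum class ElemKind { Integer, Float };

struct ToyType {
  ElemKind Elem;
  unsigned NumElts; // toy convention: 1 = scalar, >1 = vector

  constexpr bool isVector() const { return NumElts > 1; }
  // Scalar-only check: false for any vector, even a vector of integers.
  constexpr bool isIntegerTy() const {
    return !isVector() && Elem == ElemKind::Integer;
  }
  // Element-based checks: true for scalars and for vectors of that kind.
  constexpr bool isIntOrIntVectorTy() const { return Elem == ElemKind::Integer; }
  constexpr bool isFPOrFPVectorTy() const { return Elem == ElemKind::Float; }
};

enum class Opcode { Add, FAdd };

// Opcode selection with the scalar-only check (pre-patch style).
constexpr Opcode pickAddOld(ToyType T) {
  return T.isIntegerTy() ? Opcode::Add : Opcode::FAdd;
}

// Opcode selection with the vector-aware check (post-patch style).
constexpr Opcode pickAddNew(ToyType T) {
  return T.isIntOrIntVectorTy() ? Opcode::Add : Opcode::FAdd;
}

constexpr ToyType V2I32{ElemKind::Integer, 2}; // models <2 x i32>

static_assert(pickAddOld(V2I32) == Opcode::FAdd,
              "scalar-only check falls through to the FP branch");
static_assert(pickAddNew(V2I32) == Opcode::Add,
              "vector-aware check picks the integer opcode");
```

Before this patch the scalar-only checks were harmless because the pass returned early for every vector instruction; they only become a problem once that early return is narrowed.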

Diff Detail

Repository: rL LLVM

Event Timeline

rob.lougher updated this revision to Diff 19769.Feb 11 2015, 10:02 AM

rob.lougher retitled this revision from to [Reassocate] Add initial support for vector instructions..

rob.lougher updated this object.

rob.lougher edited the test plan for this revision.

rob.lougher added a reviewer: mcrosier.

rob.lougher added a subscriber: Unknown Object (MLST).

rob.lougher updated this object.Feb 11 2015, 10:06 AM

rob.lougher retitled this revision from [Reassocate] Add initial support for vector instructions. to [Reassociate] Add initial support for vector instructions..Feb 11 2015, 10:41 AM

rob.lougher added reviewers: hfinkel, spatel.Feb 12 2015, 4:06 AM

Rebased the patch and added tests for 'or' and 'and' (previously these just showed that the operands were commuted - the tests now check that the expressions are optimized).
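
As a sanity check on what the new 'and'/'or' tests expect, the sketch below models a two-lane integer vector with `std::array` and evaluates `(x & y) & (y & x)` lane by lane. The names `V2`, `vand` and `vor` are made up for illustration and are not part of the patch. Because `and` and `or` are commutative, associative and idempotent, the whole tree equals a single `x & y` (or `x | y`), which is the shape the updated CHECK lines look for.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of a <2 x i32> value.
using V2 = std::array<std::uint32_t, 2>;

// Lane-wise bitwise and.
V2 vand(const V2 &A, const V2 &B) {
  return V2{{A[0] & B[0], A[1] & B[1]}};
}

// Lane-wise bitwise or.
V2 vor(const V2 &A, const V2 &B) {
  return V2{{A[0] | B[0], A[1] | B[1]}};
}

// The expression tree from the tests: op(op(x, y), op(y, x)).
V2 andTree(const V2 &X, const V2 &Y) { return vand(vand(X, Y), vand(Y, X)); }
V2 orTree(const V2 &X, const V2 &Y) { return vor(vor(X, Y), vor(Y, X)); }
```

For any inputs, `andTree(x, y)` equals `vand(y, x)` and `orTree(x, y)` equals `vor(y, x)`, so collapsing three instructions into one does not change the result.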

These are the minimum changes necessary to add reassociation of vectors. The idea is to split the changes into smaller, reviewable chunks. The differences from Chad's original patch are:

Xor instructions are not optimized (the code in the original patch is incomplete). This will be added later.

Various constant handling is missing (e.g. removal of negative constant factors). Chad's original patch extended the checks to include constant splats. Again, these will be addressed in later patches.

CanonicalizeNegConstExpr() was added after Chad's patch. Again, this will be handled in later patches.

Tests that rely on other passes (gvn, instcombine) have either been modified or removed.

FIXME tests have been added for some of the missing optimizations.

Chad's original review has been dormant since October. Note, I emailed Chad privately to make sure he was happy with someone else continuing the work.

ping

spatel added inline comments.Feb 23 2015, 9:07 AM

lib/Transforms/Scalar/Reassociate.cpp
2091 ↗	(On Diff #19894)	Can you add a bit to the comment to explain? Is this because we want to distinguish zero'ing idioms?
test/Transforms/Reassociate/fast-ReassociateVector.ll
4 ↗	(On Diff #19894)	I assume all of the floating point transforms apply identically to doubles as well as floats. Can you change some of these tests to use doubles so we have some coverage for those?

Thanks for the review Sanjay.

lib/Transforms/Scalar/Reassociate.cpp
2091 ↗	(On Diff #19894)	OK, I'll add a comment here. The only reason is that OptimizeXor (and helpers) hasn't been vectorized yet. This will be done in a later patch.
test/Transforms/Reassociate/fast-ReassociateVector.ll
4 ↗	(On Diff #19894)	OK. I'll update the patch.
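
To make the effect of the guard discussed at line 2091 concrete, here is a small model of the early-return condition before and after the patch. It is only a sketch: `Op`, `skippedBefore` and `skippedAfter` are hypothetical names, and the real pass applies further checks (fast-math flags, i1 types and so on) that are not modelled here.

```cpp
#include <cassert>

// Hypothetical opcode list; only the distinction Xor / not-Xor matters here.
enum class Op { Add, Mul, And, Or, Xor, FAdd, FMul };

// Before the patch: every vector instruction was skipped.
constexpr bool skippedBefore(bool IsVector, Op) { return IsVector; }

// After the patch: only vector Xor is skipped, because OptimizeXor and its
// helpers have not been vectorized yet.
constexpr bool skippedAfter(bool IsVector, Op O) {
  return IsVector && O == Op::Xor;
}

static_assert(skippedBefore(true, Op::Add), "old guard rejects vector add");
static_assert(!skippedAfter(true, Op::Add), "new guard lets vector add through");
static_assert(skippedAfter(true, Op::Xor), "vector xor is still skipped");
static_assert(!skippedAfter(false, Op::Xor), "scalar xor is unaffected");
```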

New version of the patch that addresses Sanjay's comments. Tests 3, 7, 9 and 11 now use double rather than float.

Thanks for taking this on, Rob. I agree that this should be committed in incremental patches.

What kind of correctness/performance testing have you done?

IIRC, I didn't see a lot of change in SPEC2000/SPEC2006 with my original patch.

Hi Chad,

Sorry for the delay. I have run SPEC CPU 2006 and open-source Bullet. Unfortunately, I found no performance difference at all. Sanjay also tried the n-body sim from https://github.com/tycho/nbody and didn't see any perf difference either. Apart from micro-benchmarks, the only consistent speed-up I see is in an internal benchmark of ~1%. I didn't see any performance regressions during testing.

LGTM. My past experience is that this really made no difference for my workloads (i.e., spec2000/spec2006/eembc). I just wanted to make sure we didn't have any odd regressions.

This revision is now accepted and ready to land.Mar 12 2015, 12:19 PM

Closed by commit rL232190: [Reassociate] Add initial support for vector instructions. (authored by rlougher). · Mar 13 2015, 11:35 AM

This revision was automatically updated to reflect the committed changes.

I reverted the commit as it caused a couple of arm64 clang tests to fail. This was not expected to happen after a change to LLVM. Hopefully it will be easy to fix and I'll resubmit on Monday.

I have reapplied the patch at revision 232209. Hopefully it will be OK this time!

spatel mentioned this in D5222: [Reassocate] Add support for vector instructions..Mar 29 2015, 3:28 PM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Scalar/Reassociate.cpp	34 lines
llvm/trunk/test/Transforms/Reassociate/fast-ReassociateVector.ll	242 lines

Diff 21944

llvm/trunk/lib/Transforms/Scalar/Reassociate.cpp

[315 lines not shown]	unsigned Reassociate::getRank(Value *V) {
// cannot be loops in the value graph that do not go through PHI nodes.		// cannot be loops in the value graph that do not go through PHI nodes.
unsigned Rank = 0, MaxRank = RankMap[I->getParent()];		unsigned Rank = 0, MaxRank = RankMap[I->getParent()];
for (unsigned i = 0, e = I->getNumOperands();		for (unsigned i = 0, e = I->getNumOperands();
i != e && Rank != MaxRank; ++i)		i != e && Rank != MaxRank; ++i)
Rank = std::max(Rank, getRank(I->getOperand(i)));		Rank = std::max(Rank, getRank(I->getOperand(i)));

// If this is a not or neg instruction, do not count it for rank. This		// If this is a not or neg instruction, do not count it for rank. This
// assures us that X and ~X will have the same rank.		// assures us that X and ~X will have the same rank.
Type *Ty = V->getType();		if (!BinaryOperator::isNot(I) && !BinaryOperator::isNeg(I) &&
if ((!Ty->isIntegerTy() && !Ty->isFloatingPointTy()) ||		!BinaryOperator::isFNeg(I))
(!BinaryOperator::isNot(I) && !BinaryOperator::isNeg(I) &&
!BinaryOperator::isFNeg(I)))
++Rank;		++Rank;

DEBUG(dbgs() << "Calculated Rank[" << V->getName() << "] = " << Rank << "\n");		DEBUG(dbgs() << "Calculated Rank[" << V->getName() << "] = " << Rank << "\n");

return ValueRankMap[I] = Rank;		return ValueRankMap[I] = Rank;
}		}

// Canonicalize constants to RHS. Otherwise, sort the operands by rank.		// Canonicalize constants to RHS. Otherwise, sort the operands by rank.
[10 lines not shown]	if (isa<Constant>(RHS))
return;		return;

if (isa<Constant>(LHS) || RHSRank < LHSRank)		if (isa<Constant>(LHS) || RHSRank < LHSRank)
cast<BinaryOperator>(I)->swapOperands();		cast<BinaryOperator>(I)->swapOperands();
}		}

static BinaryOperator *CreateAdd(Value *S1, Value *S2, const Twine &Name,		static BinaryOperator *CreateAdd(Value *S1, Value *S2, const Twine &Name,
Instruction *InsertBefore, Value *FlagsOp) {		Instruction *InsertBefore, Value *FlagsOp) {
if (S1->getType()->isIntegerTy())		if (S1->getType()->isIntOrIntVectorTy())
return BinaryOperator::CreateAdd(S1, S2, Name, InsertBefore);		return BinaryOperator::CreateAdd(S1, S2, Name, InsertBefore);
else {		else {
BinaryOperator *Res =		BinaryOperator *Res =
BinaryOperator::CreateFAdd(S1, S2, Name, InsertBefore);		BinaryOperator::CreateFAdd(S1, S2, Name, InsertBefore);
Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());		Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());
return Res;		return Res;
}		}
}		}

static BinaryOperator *CreateMul(Value *S1, Value *S2, const Twine &Name,		static BinaryOperator *CreateMul(Value *S1, Value *S2, const Twine &Name,
Instruction *InsertBefore, Value *FlagsOp) {		Instruction *InsertBefore, Value *FlagsOp) {
if (S1->getType()->isIntegerTy())		if (S1->getType()->isIntOrIntVectorTy())
return BinaryOperator::CreateMul(S1, S2, Name, InsertBefore);		return BinaryOperator::CreateMul(S1, S2, Name, InsertBefore);
else {		else {
BinaryOperator *Res =		BinaryOperator *Res =
BinaryOperator::CreateFMul(S1, S2, Name, InsertBefore);		BinaryOperator::CreateFMul(S1, S2, Name, InsertBefore);
Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());		Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());
return Res;		return Res;
}		}
}		}

static BinaryOperator *CreateNeg(Value *S1, const Twine &Name,		static BinaryOperator *CreateNeg(Value *S1, const Twine &Name,
Instruction *InsertBefore, Value *FlagsOp) {		Instruction *InsertBefore, Value *FlagsOp) {
if (S1->getType()->isIntegerTy())		if (S1->getType()->isIntOrIntVectorTy())
return BinaryOperator::CreateNeg(S1, Name, InsertBefore);		return BinaryOperator::CreateNeg(S1, Name, InsertBefore);
else {		else {
BinaryOperator *Res = BinaryOperator::CreateFNeg(S1, Name, InsertBefore);		BinaryOperator *Res = BinaryOperator::CreateFNeg(S1, Name, InsertBefore);
Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());		Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());
return Res;		return Res;
}		}
}		}

/// LowerNegateToMultiply - Replace 0-X with X*-1.		/// LowerNegateToMultiply - Replace 0-X with X*-1.
///		///
static BinaryOperator *LowerNegateToMultiply(Instruction *Neg) {		static BinaryOperator *LowerNegateToMultiply(Instruction *Neg) {
Type *Ty = Neg->getType();		Type *Ty = Neg->getType();
Constant *NegOne = Ty->isIntegerTy() ? ConstantInt::getAllOnesValue(Ty)		Constant *NegOne = Ty->isIntOrIntVectorTy() ?
: ConstantFP::get(Ty, -1.0);		ConstantInt::getAllOnesValue(Ty) : ConstantFP::get(Ty, -1.0);

BinaryOperator *Res = CreateMul(Neg->getOperand(1), NegOne, "", Neg, Neg);		BinaryOperator *Res = CreateMul(Neg->getOperand(1), NegOne, "", Neg, Neg);
Neg->setOperand(1, Constant::getNullValue(Ty)); // Drop use of op.		Neg->setOperand(1, Constant::getNullValue(Ty)); // Drop use of op.
Res->takeName(Neg);		Res->takeName(Neg);
Neg->replaceAllUsesWith(Res);		Neg->replaceAllUsesWith(Res);
Res->setDebugLoc(Neg->getDebugLoc());		Res->setDebugLoc(Neg->getDebugLoc());
return Res;		return Res;
}		}
[466 lines not shown]	for (unsigned i = 0; ; ++i) {
// hard (finding the mimimal number of multiplications needed to realize a		// hard (finding the mimimal number of multiplications needed to realize a
// multiplication expression is NP-complete). Whatever the reason, smart or		// multiplication expression is NP-complete). Whatever the reason, smart or
// stupid, create a new node if there are none left.		// stupid, create a new node if there are none left.
BinaryOperator *NewOp;		BinaryOperator *NewOp;
if (NodesToRewrite.empty()) {		if (NodesToRewrite.empty()) {
Constant *Undef = UndefValue::get(I->getType());		Constant *Undef = UndefValue::get(I->getType());
NewOp = BinaryOperator::Create(Instruction::BinaryOps(Opcode),		NewOp = BinaryOperator::Create(Instruction::BinaryOps(Opcode),
Undef, Undef, "", I);		Undef, Undef, "", I);
if (NewOp->getType()->isFloatingPointTy())		if (NewOp->getType()->isFPOrFPVectorTy())
NewOp->setFastMathFlags(I->getFastMathFlags());		NewOp->setFastMathFlags(I->getFastMathFlags());
} else {		} else {
NewOp = NodesToRewrite.pop_back_val();		NewOp = NodesToRewrite.pop_back_val();
}		}

DEBUG(dbgs() << "RA: " << *Op << '\n');		DEBUG(dbgs() << "RA: " << *Op << '\n');
Op->setOperand(0, NewOp);		Op->setOperand(0, NewOp);
DEBUG(dbgs() << "TO: " << *Op << '\n');		DEBUG(dbgs() << "TO: " << *Op << '\n');
[631 lines not shown]	if (i+1 != Ops.size() && Ops[i+1].Op == TheOp) {
++NumFound;		++NumFound;
} while (i != Ops.size() && Ops[i].Op == TheOp);		} while (i != Ops.size() && Ops[i].Op == TheOp);

DEBUG(dbgs() << "\nFACTORING [" << NumFound << "]: " << *TheOp << '\n');		DEBUG(dbgs() << "\nFACTORING [" << NumFound << "]: " << *TheOp << '\n');
++NumFactor;		++NumFactor;

// Insert a new multiply.		// Insert a new multiply.
Type *Ty = TheOp->getType();		Type *Ty = TheOp->getType();
Constant *C = Ty->isIntegerTy() ? ConstantInt::get(Ty, NumFound)		Constant *C = Ty->isIntOrIntVectorTy() ?
: ConstantFP::get(Ty, NumFound);		ConstantInt::get(Ty, NumFound) : ConstantFP::get(Ty, NumFound);
Instruction *Mul = CreateMul(TheOp, C, "factor", I, I);		Instruction *Mul = CreateMul(TheOp, C, "factor", I, I);

// Now that we have inserted a multiply, optimize it. This allows us to		// Now that we have inserted a multiply, optimize it. This allows us to
// handle cases that require multiple factoring steps, such as this:		// handle cases that require multiple factoring steps, such as this:
// (X*2) + (X*2) + (X*2) -> (X*2)*3 -> X*6		// (X*2) + (X*2) + (X*2) -> (X*2)*3 -> X*6
RedoInsts.insert(Mul);		RedoInsts.insert(Mul);

// If every add operand was a duplicate, return the multiply.		// If every add operand was a duplicate, return the multiply.
[123 lines not shown]	if (MaxOcc > 1) {
DEBUG(dbgs() << "\nFACTORING [" << MaxOcc << "]: " << *MaxOccVal << '\n');		DEBUG(dbgs() << "\nFACTORING [" << MaxOcc << "]: " << *MaxOccVal << '\n');
++NumFactor;		++NumFactor;

// Create a new instruction that uses the MaxOccVal twice. If we don't do		// Create a new instruction that uses the MaxOccVal twice. If we don't do
// this, we could otherwise run into situations where removing a factor		// this, we could otherwise run into situations where removing a factor
// from an expression will drop a use of maxocc, and this can cause		// from an expression will drop a use of maxocc, and this can cause
// RemoveFactorFromExpression on successive values to behave differently.		// RemoveFactorFromExpression on successive values to behave differently.
Instruction *DummyInst =		Instruction *DummyInst =
I->getType()->isIntegerTy()		I->getType()->isIntOrIntVectorTy()
? BinaryOperator::CreateAdd(MaxOccVal, MaxOccVal)		? BinaryOperator::CreateAdd(MaxOccVal, MaxOccVal)
: BinaryOperator::CreateFAdd(MaxOccVal, MaxOccVal);		: BinaryOperator::CreateFAdd(MaxOccVal, MaxOccVal);

SmallVector<WeakVH, 4> NewMulOps;		SmallVector<WeakVH, 4> NewMulOps;
for (unsigned i = 0; i != Ops.size(); ++i) {		for (unsigned i = 0; i != Ops.size(); ++i) {
// Only try to remove factors from expressions we're allowed to.		// Only try to remove factors from expressions we're allowed to.
BinaryOperator *BOp =		BinaryOperator *BOp =
isReassociableOp(Ops[i].Op, Instruction::Mul, Instruction::FMul);		isReassociableOp(Ops[i].Op, Instruction::Mul, Instruction::FMul);
[114 lines not shown]
/// \brief Build a tree of multiplies, computing the product of Ops.		/// \brief Build a tree of multiplies, computing the product of Ops.
static Value *buildMultiplyTree(IRBuilder<> &Builder,		static Value *buildMultiplyTree(IRBuilder<> &Builder,
SmallVectorImpl<Value*> &Ops) {		SmallVectorImpl<Value*> &Ops) {
if (Ops.size() == 1)		if (Ops.size() == 1)
return Ops.back();		return Ops.back();

Value *LHS = Ops.pop_back_val();		Value *LHS = Ops.pop_back_val();
do {		do {
if (LHS->getType()->isIntegerTy())		if (LHS->getType()->isIntOrIntVectorTy())
LHS = Builder.CreateMul(LHS, Ops.pop_back_val());		LHS = Builder.CreateMul(LHS, Ops.pop_back_val());
else		else
LHS = Builder.CreateFMul(LHS, Ops.pop_back_val());		LHS = Builder.CreateFMul(LHS, Ops.pop_back_val());
} while (!Ops.empty());		} while (!Ops.empty());

return LHS;		return LHS;
}		}

[281 lines not shown]	if (Instruction *Res = canonicalizeNegConstExpr(I))
I = Res;		I = Res;

// Commute binary operators, to canonicalize the order of their operands.		// Commute binary operators, to canonicalize the order of their operands.
// This can potentially expose more CSE opportunities, and makes writing other		// This can potentially expose more CSE opportunities, and makes writing other
// transformations simpler.		// transformations simpler.
if (I->isCommutative())		if (I->isCommutative())
canonicalizeOperands(I);		canonicalizeOperands(I);

// Don't optimize vector instructions.		// TODO: We should optimize vector Xor instructions, but they are
if (I->getType()->isVectorTy())		// currently unsupported.
		if (I->getType()->isVectorTy() && I->getOpcode() == Instruction::Xor)
return;		return;

// Don't optimize floating point instructions that don't have unsafe algebra.		// Don't optimize floating point instructions that don't have unsafe algebra.
if (I->getType()->isFloatingPointTy() && !I->hasUnsafeAlgebra())		if (I->getType()->isFloatingPointTy() && !I->hasUnsafeAlgebra())
return;		return;

// Do not reassociate boolean (i1) expressions. We want to preserve the		// Do not reassociate boolean (i1) expressions. We want to preserve the
// original order of evaluation for short-circuited comparisons that		// original order of evaluation for short-circuited comparisons that
[62 lines not shown]	void Reassociate::OptimizeInst(Instruction *I) {
if (BO->hasOneUse() && BO->getOpcode() == Instruction::FAdd &&		if (BO->hasOneUse() && BO->getOpcode() == Instruction::FAdd &&
cast<Instruction>(BO->user_back())->getOpcode() == Instruction::FSub)		cast<Instruction>(BO->user_back())->getOpcode() == Instruction::FSub)
return;		return;

ReassociateExpression(BO);		ReassociateExpression(BO);
}		}

void Reassociate::ReassociateExpression(BinaryOperator *I) {		void Reassociate::ReassociateExpression(BinaryOperator *I) {
assert(!I->getType()->isVectorTy() &&
"Reassociation of vector instructions is not supported.");

// First, walk the expression tree, linearizing the tree, collecting the		// First, walk the expression tree, linearizing the tree, collecting the
// operand information.		// operand information.
SmallVector<RepeatedValue, 8> Tree;		SmallVector<RepeatedValue, 8> Tree;
MadeChange |= LinearizeExprTree(I, Tree);		MadeChange |= LinearizeExprTree(I, Tree);
SmallVector<ValueEntry, 8> Ops;		SmallVector<ValueEntry, 8> Ops;
Ops.reserve(Tree.size());		Ops.reserve(Tree.size());
for (unsigned i = 0, e = Tree.size(); i != e; ++i) {		for (unsigned i = 0, e = Tree.size(); i != e; ++i) {
RepeatedValue E = Tree[i];		RepeatedValue E = Tree[i];
[108 lines not shown]

llvm/trunk/test/Transforms/Reassociate/fast-ReassociateVector.ll

	; RUN: opt < %s -reassociate -S | FileCheck %s			; RUN: opt < %s -reassociate -S | FileCheck %s

	; Canonicalize operands, but don't optimize floating point vector operations.			; Check that a*c+b*c is turned into (a+b)*c
	define <4 x float> @test1() {			define <4 x float> @test1(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
	; CHECK-LABEL: test1			; CHECK-LABEL: @test1
	; CHECK-NEXT: %tmp1 = fsub fast <4 x float> zeroinitializer, zeroinitializer			; CHECK-NEXT: %tmp = fadd fast <4 x float> %b, %a
	; CHECK-NEXT: %tmp2 = fmul fast <4 x float> %tmp1, zeroinitializer			; CHECK-NEXT: %tmp1 = fmul fast <4 x float> %tmp, %c
				; CHECK-NEXT: ret <4 x float> %tmp1
	%tmp1 = fsub fast <4 x float> zeroinitializer, zeroinitializer
	%tmp2 = fmul fast <4 x float> zeroinitializer, %tmp1			%mul = fmul fast <4 x float> %a, %c
	ret <4 x float> %tmp2			%mul1 = fmul fast <4 x float> %b, %c
	}			%add = fadd fast <4 x float> %mul, %mul1
				ret <4 x float> %add
	; Commute integer vector operations.			}
	define <2 x i32> @test2(<2 x i32> %x, <2 x i32> %y) {
	; CHECK-LABEL: test2			; Check that a*a*b+a*a*c is turned into a*(a*(b+c)).
	; CHECK-NEXT: %tmp1 = add <2 x i32> %x, %y			define <2 x float> @test2(<2 x float> %a, <2 x float> %b, <2 x float> %c) {
	; CHECK-NEXT: %tmp2 = add <2 x i32> %x, %y			; CHECK-LABEL: @test2
	; CHECK-NEXT: %tmp3 = add <2 x i32> %tmp1, %tmp2			; CHECK-NEXT: fadd fast <2 x float> %c, %b
				; CHECK-NEXT: fmul fast <2 x float> %a, %tmp2
	%tmp1 = add <2 x i32> %x, %y			; CHECK-NEXT: fmul fast <2 x float> %tmp3, %a
	%tmp2 = add <2 x i32> %y, %x			; CHECK-NEXT: ret <2 x float>
	%tmp3 = add <2 x i32> %tmp1, %tmp2
	ret <2 x i32> %tmp3			%t0 = fmul fast <2 x float> %a, %b
	}			%t1 = fmul fast <2 x float> %a, %t0
				%t2 = fmul fast <2 x float> %a, %c
	define <2 x i32> @test3(<2 x i32> %x, <2 x i32> %y) {			%t3 = fmul fast <2 x float> %a, %t2
	; CHECK-LABEL: test3			%t4 = fadd fast <2 x float> %t1, %t3
	; CHECK-NEXT: %tmp1 = mul <2 x i32> %x, %y			ret <2 x float> %t4
	; CHECK-NEXT: %tmp2 = mul <2 x i32> %x, %y			}
	; CHECK-NEXT: %tmp3 = mul <2 x i32> %tmp1, %tmp2
				; Check that a*b+a*c+d is turned into a*(b+c)+d.
	%tmp1 = mul <2 x i32> %x, %y			define <2 x double> @test3(<2 x double> %a, <2 x double> %b, <2 x double> %c, <2 x double> %d) {
	%tmp2 = mul <2 x i32> %y, %x			; CHECK-LABEL: @test3
	%tmp3 = mul <2 x i32> %tmp1, %tmp2			; CHECK-NEXT: fadd fast <2 x double> %c, %b
	ret <2 x i32> %tmp3			; CHECK-NEXT: fmul fast <2 x double> %tmp, %a
	}			; CHECK-NEXT: fadd fast <2 x double> %tmp1, %d
				; CHECK-NEXT: ret <2 x double>
	define <2 x i32> @test4(<2 x i32> %x, <2 x i32> %y) {
	; CHECK-LABEL: test4			%t0 = fmul fast <2 x double> %a, %b
	; CHECK-NEXT: %tmp1 = and <2 x i32> %x, %y			%t1 = fmul fast <2 x double> %a, %c
	; CHECK-NEXT: %tmp2 = and <2 x i32> %x, %y			%t2 = fadd fast <2 x double> %t1, %d
	; CHECK-NEXT: %tmp3 = and <2 x i32> %tmp1, %tmp2			%t3 = fadd fast <2 x double> %t0, %t2
				ret <2 x double> %t3
				}

				; No fast-math.
				define <2 x float> @test4(<2 x float> %A) {
				; CHECK-LABEL: @test4
				; CHECK-NEXT: %X = fadd <2 x float> %A, <float 1.000000e+00, float 1.000000e+00>
				; CHECK-NEXT: %Y = fadd <2 x float> %A, <float 1.000000e+00, float 1.000000e+00>
				; CHECK-NEXT: %R = fsub <2 x float> %X, %Y
				; CHECK-NEXT: ret <2 x float> %R

				%X = fadd <2 x float> %A, < float 1.000000e+00, float 1.000000e+00 >
				%Y = fadd <2 x float> %A, < float 1.000000e+00, float 1.000000e+00 >
				%R = fsub <2 x float> %X, %Y
				ret <2 x float> %R
				}

				; Check 47*X + 47*X -> 94*X.
				define <2 x float> @test5(<2 x float> %X) {
				; CHECK-LABEL: @test5
				; CHECK-NEXT: fmul fast <2 x float> %X, <float 9.400000e+01, float 9.400000e+01>
				; CHECK-NEXT: ret <2 x float>

				%Y = fmul fast <2 x float> %X, <float 4.700000e+01, float 4.700000e+01>
				%Z = fadd fast <2 x float> %Y, %Y
				ret <2 x float> %Z
				}

				; Check X+X+X -> 3*X.
				define <2 x float> @test6(<2 x float> %X) {
				; CHECK-LABEL: @test6
				; CHECK-NEXT: fmul fast <2 x float> %X, <float 3.000000e+00, float 3.000000e+00>
				; CHECK-NEXT: ret <2 x float>

				%Y = fadd fast <2 x float> %X ,%X
				%Z = fadd fast <2 x float> %Y, %X
				ret <2 x float> %Z
				}

				; Check 127*W+50*W -> 177*W.
				define <2 x double> @test7(<2 x double> %W) {
				; CHECK-LABEL: @test7
				; CHECK-NEXT: fmul fast <2 x double> %W, <double 1.770000e+02, double 1.770000e+02>
				; CHECK-NEXT: ret <2 x double>

				%X = fmul fast <2 x double> %W, <double 127.0, double 127.0>
				%Y = fmul fast <2 x double> %W, <double 50.0, double 50.0>
				%Z = fadd fast <2 x double> %Y, %X
				ret <2 x double> %Z
				}

				; Check X*12*12 -> X*144.
				define <2 x float> @test8(<2 x float> %arg) {
				; CHECK-LABEL: @test8
				; CHECK: fmul fast <2 x float> %arg, <float 1.440000e+02, float 1.440000e+02>
				; CHECK-NEXT: ret <2 x float> %tmp2

				%tmp1 = fmul fast <2 x float> <float 1.200000e+01, float 1.200000e+01>, %arg
				%tmp2 = fmul fast <2 x float> %tmp1, <float 1.200000e+01, float 1.200000e+01>
				ret <2 x float> %tmp2
				}

				; Check (b+(a+1234))+-a -> b+1234.
				define <2 x double> @test9(<2 x double> %b, <2 x double> %a) {
				; CHECK-LABEL: @test9
				; CHECK: fadd fast <2 x double> %b, <double 1.234000e+03, double 1.234000e+03>
				; CHECK-NEXT: ret <2 x double>

				%1 = fadd fast <2 x double> %a, <double 1.234000e+03, double 1.234000e+03>
				%2 = fadd fast <2 x double> %b, %1
				%3 = fsub fast <2 x double> <double 0.000000e+00, double 0.000000e+00>, %a
				%4 = fadd fast <2 x double> %2, %3
				ret <2 x double> %4
				}

				; Check -(-(z*40)*a) -> a*40*z.
				define <2 x float> @test10(<2 x float> %a, <2 x float> %b, <2 x float> %z) {
				; CHECK-LABEL: @test10
				; CHECK: fmul fast <2 x float> %a, <float 4.000000e+01, float 4.000000e+01>
				; CHECK-NEXT: fmul fast <2 x float> %e, %z
				; CHECK-NEXT: ret <2 x float>

				%d = fmul fast <2 x float> %z, <float 4.000000e+01, float 4.000000e+01>
				%c = fsub fast <2 x float> <float 0.000000e+00, float 0.000000e+00>, %d
				%e = fmul fast <2 x float> %a, %c
				%f = fsub fast <2 x float> <float 0.000000e+00, float 0.000000e+00>, %e
				ret <2 x float> %f
				}

				; Check x*y+y*x -> x*y*2.
				define <2 x double> @test11(<2 x double> %x, <2 x double> %y) {
				; CHECK-LABEL: @test11
				; CHECK-NEXT: %factor = fmul fast <2 x double> %y, <double 2.000000e+00, double 2.000000e+00>
				; CHECK-NEXT: %tmp1 = fmul fast <2 x double> %factor, %x
				; CHECK-NEXT: ret <2 x double> %tmp1

				%1 = fmul fast <2 x double> %x, %y
				%2 = fmul fast <2 x double> %y, %x
				%3 = fadd fast <2 x double> %1, %2
				ret <2 x double> %3
				}

				; FIXME: shifts should be converted to mul to assist further reassociation.
				define <2 x i64> @test12(<2 x i64> %b, <2 x i64> %c) {
				; CHECK-LABEL: @test12
				; CHECK-NEXT: %mul = mul <2 x i64> %c, %b
				; CHECK-NEXT: %shl = shl <2 x i64> %mul, <i64 5, i64 5>
				; CHECK-NEXT: ret <2 x i64> %shl

				%mul = mul <2 x i64> %c, %b
				%shl = shl <2 x i64> %mul, <i64 5, i64 5>
				ret <2 x i64> %shl
				}

				; FIXME: expressions with a negative const should be canonicalized to assist
				; further reassociation.
				; We would expect (-5*b)+a -> a-(5*b) but only the constant operand is commuted.
				define <4 x float> @test13(<4 x float> %a, <4 x float> %b) {
				; CHECK-LABEL: @test13
				; CHECK-NEXT: %mul = fmul fast <4 x float> %b, <float -5.000000e+00, float -5.000000e+00, float -5.000000e+00, float -5.000000e+00>
				; CHECK-NEXT: %add = fadd fast <4 x float> %mul, %a
				; CHECK-NEXT: ret <4 x float> %add

				%mul = fmul fast <4 x float> <float -5.000000e+00, float -5.000000e+00, float -5.000000e+00, float -5.000000e+00>, %b
				%add = fadd fast <4 x float> %mul, %a
				ret <4 x float> %add
				}

				; Break up subtract to assist further reassociation.
				; Check a+b-c -> a+b+-c.
				define <2 x i64> @test14(<2 x i64> %a, <2 x i64> %b, <2 x i64> %c) {
				; CHECK-LABEL: @test14
				; CHECK-NEXT: %add = add <2 x i64> %b, %a
				; CHECK-NEXT: %c.neg = sub <2 x i64> zeroinitializer, %c
				; CHECK-NEXT: %sub = add <2 x i64> %add, %c.neg
				; CHECK-NEXT: ret <2 x i64> %sub

				%add = add <2 x i64> %b, %a
				%sub = sub <2 x i64> %add, %c
				ret <2 x i64> %sub
				}

				define <2 x i32> @test15(<2 x i32> %x, <2 x i32> %y) {
				; CHECK-LABEL: test15
				; CHECK-NEXT: %tmp3 = and <2 x i32> %y, %x
				; CHECK-NEXT: ret <2 x i32> %tmp3

	%tmp1 = and <2 x i32> %x, %y			%tmp1 = and <2 x i32> %x, %y
	%tmp2 = and <2 x i32> %y, %x			%tmp2 = and <2 x i32> %y, %x
	%tmp3 = and <2 x i32> %tmp1, %tmp2			%tmp3 = and <2 x i32> %tmp1, %tmp2
	ret <2 x i32> %tmp3			ret <2 x i32> %tmp3
	}			}

	define <2 x i32> @test5(<2 x i32> %x, <2 x i32> %y) {			define <2 x i32> @test16(<2 x i32> %x, <2 x i32> %y) {
	; CHECK-LABEL: test5			; CHECK-LABEL: test16
	; CHECK-NEXT: %tmp1 = or <2 x i32> %x, %y			; CHECK-NEXT: %tmp3 = or <2 x i32> %y, %x
	; CHECK-NEXT: %tmp2 = or <2 x i32> %x, %y			; CHECK-NEXT: ret <2 x i32> %tmp3
	; CHECK-NEXT: %tmp3 = or <2 x i32> %tmp1, %tmp2

	%tmp1 = or <2 x i32> %x, %y			%tmp1 = or <2 x i32> %x, %y
	%tmp2 = or <2 x i32> %y, %x			%tmp2 = or <2 x i32> %y, %x
	%tmp3 = or <2 x i32> %tmp1, %tmp2			%tmp3 = or <2 x i32> %tmp1, %tmp2
	ret <2 x i32> %tmp3			ret <2 x i32> %tmp3
	}			}

	define <2 x i32> @test6(<2 x i32> %x, <2 x i32> %y) {			; FIXME: Optimize vector xor. Currently only commute operands.
	; CHECK-LABEL: test6			define <2 x i32> @test17(<2 x i32> %x, <2 x i32> %y) {
				; CHECK-LABEL: test17
	; CHECK-NEXT: %tmp1 = xor <2 x i32> %x, %y			; CHECK-NEXT: %tmp1 = xor <2 x i32> %x, %y
	; CHECK-NEXT: %tmp2 = xor <2 x i32> %x, %y			; CHECK-NEXT: %tmp2 = xor <2 x i32> %x, %y
	; CHECK-NEXT: %tmp3 = xor <2 x i32> %tmp1, %tmp2			; CHECK-NEXT: %tmp3 = xor <2 x i32> %tmp1, %tmp2

	%tmp1 = xor <2 x i32> %x, %y			%tmp1 = xor <2 x i32> %x, %y
	%tmp2 = xor <2 x i32> %y, %x			%tmp2 = xor <2 x i32> %y, %x
	%tmp3 = xor <2 x i32> %tmp1, %tmp2			%tmp3 = xor <2 x i32> %tmp1, %tmp2
	ret <2 x i32> %tmp3			ret <2 x i32> %tmp3
	}			}
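
The factoring that test1 checks, a*c+b*c into (a+b)*c, can be illustrated lane by lane with a small model. This is a sketch under stated assumptions: `V4`, `vadd`, `vmul`, `unfactored` and `factored` are hypothetical helpers that stand in for `<4 x float>` operations and are not part of the patch or of LLVM. The two forms agree exactly for the small integer-valued inputs used here, but in general they can round differently, which is why the transform is only applied to `fast` operations and test4 (no fast-math) is left untouched.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of a <4 x float> value.
using V4 = std::array<float, 4>;

// Lane-wise add.
V4 vadd(const V4 &A, const V4 &B) {
  V4 R{};
  for (std::size_t I = 0; I != R.size(); ++I)
    R[I] = A[I] + B[I];
  return R;
}

// Lane-wise multiply.
V4 vmul(const V4 &A, const V4 &B) {
  V4 R{};
  for (std::size_t I = 0; I != R.size(); ++I)
    R[I] = A[I] * B[I];
  return R;
}

// Input form of test1: two multiplies and one add.
V4 unfactored(const V4 &A, const V4 &B, const V4 &C) {
  return vadd(vmul(A, C), vmul(B, C));
}

// Expected output form of test1: one add and one multiply.
V4 factored(const V4 &A, const V4 &B, const V4 &C) {
  return vmul(vadd(B, A), C);
}
```

The factored form saves one multiply per expression, which is the kind of saving the micro-benchmarks mentioned in the review would pick up.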