This is an archive of the discontinued LLVM Phabricator instance.

[Matrix] Add forward shape propagation and first shape aware lowerings.
ClosedPublic

Authored by fhahn on Dec 2 2019, 5:37 AM.

Details

Reviewers

anemet
Gerolf
hfinkel
andrew.w.kaylor
reames

Commits

rG109e4e3851e2: [Matrix] Add forward shape propagation and first shape aware lowerings.

Summary

This patch adds infrastructure for forward shape propagation to
LowerMatrixIntrinsics. It also updates the pass to make use of
the shape information to break up larger vector operations and to
eliminate unnecessary conversion operations between columnwise matrices
and flattened vectors: if shape information is available for an
instruction, lower the operation to a set of instructions operating on
columns. For example, a store of a matrix is broken down into separate
stores for each column. For users that do not have shape
information (e.g. because they do not yet support shape information
aware lowering), we pack the result columns into a flat vector and
update those users.
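
The handling of users with and without shape information can be illustrated with a small model. This is a simplified sketch in plain C++, not code from the patch: `User`, `updateUsers` and `embedInVector` are hypothetical names used only for illustration, and a "flat vector" here is simply the concatenation of the result columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Hypothetical model of an instruction that uses a lowered matrix value.
struct User {
  bool HasShape;               // true if the user is itself lowered column by column
  std::vector<double> FlatArg; // filled in only for users without shape information
};

// Concatenate the columns into one flat, column-major vector.
std::vector<double> embedInVector(const std::vector<Column> &Cols) {
  std::vector<double> Flat;
  for (const Column &C : Cols)
    Flat.insert(Flat.end(), C.begin(), C.end());
  return Flat;
}

// Update the users of a lowered value. Users with shape information are left
// alone (in this model they would pick up the columns when they are lowered);
// all other users receive the packed flat vector.
// Returns the number of users that needed the flat vector.
std::size_t updateUsers(const std::vector<Column> &ResultCols,
                        std::vector<User> &Users) {
  std::size_t NumEmbedded = 0;
  for (User &U : Users) {
    if (U.HasShape)
      continue;
    U.FlatArg = embedInVector(ResultCols);
    ++NumEmbedded;
  }
  return NumEmbedded;
}
```

In this model, only the users without shape information pay for the packing step.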

It also adds shape aware lowering for the first non-intrinsic
instruction: vector stores.

Example:

For

%c  = call <4 x double> @llvm.matrix.transpose(<4 x double> %a, i32 2, i32 2)
store <4 x double> %c, <4 x double>* %Ptr

We generate the code below without shape propagation. Note %9 which
combines the columns of the transposed matrix into a flat vector.

%split = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%split1 = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%1 = extractelement <2 x double> %split, i64 0
%2 = insertelement <2 x double> undef, double %1, i64 0
%3 = extractelement <2 x double> %split1, i64 0
%4 = insertelement <2 x double> %2, double %3, i64 1
%5 = extractelement <2 x double> %split, i64 1
%6 = insertelement <2 x double> undef, double %5, i64 0
%7 = extractelement <2 x double> %split1, i64 1
%8 = insertelement <2 x double> %6, double %7, i64 1
%9 = shufflevector <2 x double> %4, <2 x double> %8, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
store <4 x double> %9, <4 x double>* %Ptr

With this patch, we propagate the 2x2 shape information from the
transpose to the store and we generate the code below. Note that we
store the columns directly and do not need an extra shuffle.

%9 = bitcast <4 x double>* %Ptr to double*
%10 = bitcast double* %9 to <2 x double>*
store <2 x double> %4, <2 x double>* %10, align 8
%11 = getelementptr double, double* %9, i32 2
%12 = bitcast double* %11 to <2 x double>*
store <2 x double> %8, <2 x double>* %12, align 8
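
The equivalence of the two lowerings can be checked outside LLVM. The following is a minimal sketch in plain C++, not the pass's code: matrices are column-major, a column is a `std::vector<double>`, and the helper names (`splitColumns`, `transposeColumns`, `embedInVector`, `storeColumns`) are hypothetical stand-ins for the shuffles and stores in the IR above. Storing the columns directly should write the same memory as packing them into a flat vector first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;
using ColumnMatrix = std::vector<Column>;

// Split a flat column-major vector into NumCols columns of NumRows elements.
ColumnMatrix splitColumns(const std::vector<double> &Flat, std::size_t NumRows,
                          std::size_t NumCols) {
  assert(Flat.size() == NumRows * NumCols && "size must match the shape");
  ColumnMatrix M;
  for (std::size_t C = 0; C < NumCols; ++C)
    M.emplace_back(Flat.begin() + C * NumRows,
                   Flat.begin() + (C + 1) * NumRows);
  return M;
}

// Transpose: element (row R, column C) of the input becomes
// element (row C, column R) of the result.
ColumnMatrix transposeColumns(const ColumnMatrix &In) {
  std::size_t NumCols = In.size();
  std::size_t NumRows = In.empty() ? 0 : In[0].size();
  ColumnMatrix Out(NumRows, Column(NumCols));
  for (std::size_t C = 0; C < NumCols; ++C)
    for (std::size_t R = 0; R < NumRows; ++R)
      Out[R][C] = In[C][R];
  return Out;
}

// Pack the columns into one flat vector (the extra shuffle, %9 in the
// listing without shape propagation).
std::vector<double> embedInVector(const ColumnMatrix &M) {
  std::vector<double> Flat;
  for (const Column &Col : M)
    Flat.insert(Flat.end(), Col.begin(), Col.end());
  return Flat;
}

// Store each column separately, with Stride elements between column starts.
void storeColumns(const ColumnMatrix &M, double *Ptr, std::size_t Stride) {
  for (std::size_t C = 0; C < M.size(); ++C)
    for (std::size_t R = 0; R < M[C].size(); ++R)
      Ptr[C * Stride + R] = M[C][R];
}
```

For the 2x2 case above, calling `storeColumns` with a stride of 2 corresponds to the two `<2 x double>` stores at element offsets 0 and 2.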

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Dec 2 2019, 5:37 AM

Herald added a project: Restricted Project. Dec 2 2019, 5:37 AM

Herald added a subscriber: hiraditya.

strip unnecessary test changes

Build result: FAILURE - Could not check out parent git hash "30959a9a1249e0d3b2f18c6622847da457308e49". It was not found in the repository. Did you configure the "Parent Revision" in Phabricator properly? Trying to apply the patch to the master branch instead...

ERROR: arc patch failed with error code 1. Check build log for details.
Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B41705: Diff 231684! Dec 2 2019, 5:44 AM

Harbormaster failed remote builds in B41706: Diff 231685!

fhahn added a parent revision: D70456: [Matrix] Add first set of matrix intrinsics and initial lowering pass. Dec 2 2019, 5:44 AM

fhahn added a child revision: D70898: [Matrix] Propagate and use shape info for binary operators.

fhahn added a child revision: D70899: [Matrix] Implement back-propagation of shape information. Dec 2 2019, 5:49 AM

tschuett added a subscriber: tschuett. Dec 2 2019, 5:52 AM

reames resigned from this revision. Dec 2 2019, 4:49 PM

LuoYuanke added a subscriber: LuoYuanke. Dec 2 2019, 7:05 PM

Also the existing test diffs are hard to read, please explain what's going on there.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
109–130	Needs an update explaining the shape propagation and its use.
191	Is this supposed to indicate empty or undefined? I am not sure that the implicit conversion is self-explanatory here.
219–222	Update comment
230–241	The comment says you are returning the ColumnMatrix here, but you're not.
290	I may be missing something but do these need the lambda?
548	Please add a comment of what's happening here.

Address Adam's comments

In D70897#1768320, @anemet wrote:

Also the existing test diffs are hard to read, please explain what's going on there.

I've added comments to break up the check lines

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B41856: Diff 232136! Dec 4 2019, 8:17 AM

LuoYuanke added inline comments. Dec 8 2019, 2:35 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
362	It seems only the store instruction is propagated with the shape information. Why? Take the pseudo code below for example. Are v2 and v3 propagated with the shape information?
v1 = matrix_columnwise_load(..., m, n)
v2 = max(v1, 0)
v3 = v1 / v2
store v3, ptr

fhahn marked an inline comment as done. Dec 8 2019, 3:24 AM

fhahn added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
362	This patch mostly adds the infrastructure for the propagation and only uses it for store instructions. So in the example, the shape is not propagated by this patch. Additional support for loads (D70900), binary operators (D70898) and back-propagation (D70899) are added in follow-up commits, to make reviewing them more manageable. The whole patch series is linked in Phabricator (see the 'stack' section). Please note that we could propagate shape information to more instructions, e.g. phis or selects. That can be added as follow-up as well, it is just a matter of priorities (we found loads/stores/binary operators to be by far the most common operations in matrix expressions). Any help with covering more cases would be very welcome :)

LuoYuanke added inline comments. Dec 8 2019, 4:33 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
362	Thank you for the reply. Do you propagate the shape information recursively? If the matrix is stored to memory, does the propagation break? Can we get the shape information when reloading the matrix from memory?
v1 = matrix_multiply(..., m, n, k)
store v1, ptr*
v2 = load ptr*
How do we pass the shape information across functions and return a shape from a function? Do you plan to have matrix as a first-class type?
v1 = matrix_multiply(..., m, n, k)
v2 = call foo(v1)

fhahn marked an inline comment as done. Dec 8 2019, 7:00 AM

fhahn added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
362	> Thank you for reply. Do you propagate the shape information recursively?
It is propagated iteratively: once we have propagated shape information to an instruction, we add its users to the worklist. Later patches add back-propagation as well, and D70901 implements iteration until no new shape information can be discovered.
> If the matrix is stored to memory, does the propagation break? Can we get the shape information when reload the matrix from memory?
Currently yes, we do not propagate through memory instructions. Simple cases like the one above should not really show up, as such loads should be promoted to use the value directly. We could handle more involved cases by using MemorySSA or additional alias analysis. Currently that is not a high priority for us, but we would be happy to collaborate on that as well.
> How do we pass the shape information across function and return shape from function? Do you plan to have matrix as first class type?
Currently we do not propagate the shape information across function boundaries and we do not plan on proposing a dedicated matrix type. The original proposal was focused around a dedicated type, but it was decided to go with a more lightweight solution and potentially revisit the matrix type once there is a strong need. For propagating across function boundaries, one way to go about it would be to turn the lowering into a module pass.
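
The iterative scheme described in this reply can be sketched on a toy graph. This is a simplified model under stated assumptions, not the pass's implementation: `Node`, `Kind` and `propagateForward` are hypothetical names, values are indices into a vector, the only seed is a transpose-like node, and only stores pick up a shape from their operand, mirroring the scope of this patch.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

enum class Kind { Transpose, Store, Other };

// Hypothetical stand-in for an instruction.
struct Node {
  Kind K;
  int Operand;            // index of the matrix operand, -1 if none
  unsigned Rows, Cols;    // operand shape from the intrinsic arguments (Transpose only)
  std::vector<int> Users; // indices of the nodes that use this node's result
};

using Shape = std::pair<unsigned, unsigned>; // (rows, columns)

std::map<int, Shape> propagateForward(const std::vector<Node> &Nodes) {
  std::map<int, Shape> ShapeMap;
  std::vector<int> WorkList;

  // Seed the work list with nodes whose shape is known from their arguments.
  for (int I = 0; I < static_cast<int>(Nodes.size()); ++I)
    if (Nodes[I].K == Kind::Transpose)
      WorkList.push_back(I);

  while (!WorkList.empty()) {
    int I = WorkList.back();
    WorkList.pop_back();
    if (ShapeMap.count(I) != 0)
      continue; // never override an existing shape

    const Node &N = Nodes[I];
    bool Propagate = false;
    if (N.K == Kind::Transpose) {
      ShapeMap[I] = Shape(N.Cols, N.Rows); // the result has flipped dimensions
      Propagate = true;
    } else if (N.K == Kind::Store) {
      auto It = ShapeMap.find(N.Operand);
      if (It != ShapeMap.end()) {
        Shape OperandShape = It->second;
        ShapeMap[I] = OperandShape; // a store takes the shape of the stored value
      }
    }
    // Nodes of kind Other get no shape, so propagation stops at them.

    if (Propagate)
      for (int U : N.Users)
        if (ShapeMap.count(U) == 0)
          WorkList.push_back(U);
  }
  return ShapeMap;
}
```

A store that is reached only through an unsupported node never receives a shape in this model, which corresponds to the limitation discussed in this thread.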

Update after changing %stride.

clang-format: pass.

Build artifacts: console-log.txt, diff.json

Harbormaster failed remote builds in B42397: Diff 233628! Dec 12 2019, 8:51 AM

ping

In D70897#1769062, @fhahn wrote:

In D70897#1768320, @anemet wrote:

Also the existing test diffs are hard to read, please explain what's going on there.

I've added comments to break up the check lines

Thanks for that. Can you also explain the nature of the changes with one example? I am assuming we're removing embedVectors/extractVectors, i.e. a bunch of shuffles; pointing to one example would be useful.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
116	produced
121	all instruction that we have
130	Nice write-up!
560	nit: ShapeMap.count?
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
38	FileCheck is never executed with the SHAPE prefix.

anemet requested changes to this revision. Dec 19 2019, 3:14 PM

This revision now requires changes to proceed. Dec 19 2019, 3:14 PM

Address comments, thanks!

fhahn edited the summary of this revision. Dec 19 2019, 4:43 PM

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

In D70897#1791861, @anemet wrote:

In D70897#1769062, @fhahn wrote:

In D70897#1768320, @anemet wrote:

Also the existing test diffs are hard to read, please explain what's going on there.

I've added comments to break up the check lines

Thanks for that. Can you also explain the nature of the changes with one example? I am assuming we're removing embedVectors/extractVectors, i.e. a bunch of shuffles; pointing to one example would be useful.

I've updated the description of the patch to include an example.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
560	I usually prefer find(), but I guess for DenseMap it does not matter performance wise and I can change it before committing if you prefer.
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
38	I've dropped those, but added an explanation of what we are checking.
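
For a pure membership test, `count` and `find` give the same answer. The sketch below illustrates that with `std::unordered_map` as a stand-in for `llvm::DenseMap`, which is an assumption made to keep the example self-contained; `ShapeMapTy`, `hasShapeViaCount` and `hasShapeViaFind` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for the pass's ShapeMap: value name -> number of rows.
using ShapeMapTy = std::unordered_map<std::string, unsigned>;

// Membership test using count(), as suggested in the review.
bool hasShapeViaCount(const ShapeMapTy &M, const std::string &V) {
  return M.count(V) != 0;
}

// Membership test using find(), which also yields the entry if it is needed.
bool hasShapeViaFind(const ShapeMapTy &M, const std::string &V) {
  return M.find(V) != M.end();
}
```

`find` is the natural choice when the stored shape is read right after the lookup, while `count` reads more directly when only presence matters.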

Harbormaster failed remote builds in B42802: Diff 234807! Dec 19 2019, 4:49 PM

LGTM

This revision is now accepted and ready to land. Dec 20 2019, 9:20 AM

Closed by commit rG109e4e3851e2: [Matrix] Add forward shape propagation and first shape aware lowerings. (authored by fhahn). Dec 23 2019, 4:58 AM

This revision was automatically updated to reflect the committed changes.

fhahn removed a child revision: D70899: [Matrix] Implement back-propagation of shape information. Jan 9 2020, 1:31 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp	330 lines
llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll	559 lines
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll	44 lines
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll	53 lines

Diff 235128

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp

[23 lines not shown]
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
		#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"

using namespace llvm;		using namespace llvm;
		using namespace PatternMatch;

#define DEBUG_TYPE "lower-matrix-intrinsics"		#define DEBUG_TYPE "lower-matrix-intrinsics"

		static cl::opt<bool> EnableShapePropagation("matrix-propagate-shape",
		cl::init(true));

namespace {		namespace {

// Given an element pointer \p BasePtr to the start of a (sub) matrix, compute		// Given an element pointer \p BasePtr to the start of a (sub) matrix, compute
// the start address of column \p Col with type (\p EltType x \p NumRows)		// the start address of column \p Col with type (\p EltType x \p NumRows)
// assuming \p Stride elements between the starts of two consecutive columns.		// assuming \p Stride elements between the starts of two consecutive columns.
// \p Stride must be >= \p NumRows.		// \p Stride must be >= \p NumRows.
//		//
// Consider a 4x4 matrix like below		// Consider a 4x4 matrix like below
[47 lines not shown]	Value *computeColumnAddr(Value *BasePtr, Value *Col, Value *Stride,

// Cast elementwise column start pointer to a pointer to a column		// Cast elementwise column start pointer to a pointer to a column
// (EltType x NumRows)*.		// (EltType x NumRows)*.
Type *ColumnType = VectorType::get(EltType, NumRows);		Type *ColumnType = VectorType::get(EltType, NumRows);
Type *ColumnPtrType = PointerType::get(ColumnType, AS);		Type *ColumnPtrType = PointerType::get(ColumnType, AS);
return Builder.CreatePointerCast(ColumnStart, ColumnPtrType);		return Builder.CreatePointerCast(ColumnStart, ColumnPtrType);
}		}

/// LowerMatrixIntrinsics contains the methods used to lower matrix intrinsics.		/// LowerMatrixIntrinsics contains the methods used to lower matrix intrinsics.
///		///
/// Currently, the lowering for each matrix intrinsic is done as follows:		/// Currently, the lowering for each matrix intrinsic is done as follows:
/// 1. Split the operand vectors containing an embedded matrix into a set of		/// 1. Propagate the shape information from intrinsics to connected
/// column vectors, based on the shape information from the intrinsic.		/// instructions.
/// 2. Apply the transformation described by the intrinsic on the column		/// 2. Lower instructions with shape information.
/// vectors, which yields a set of column vectors containing result matrix.		/// 2.1. Get column vectors for each argument. If we already lowered the
/// 3. Embed the columns of the result matrix in a flat vector and replace all		/// definition of an argument, use the produced column vectors directly.
/// uses of the intrinsic result with it.		/// If not, split the operand vector containing an embedded matrix into
		/// a set of column vectors,
		/// 2.2. Lower the instruction in terms of columnwise operations, which yields
		/// a set of column vectors containing result matrix. Note that we lower
		/// all instructions that have shape information. Besides the intrinsics,
		/// this includes stores for example.
		/// 2.3. Update uses of the lowered instruction. If we have shape information
		/// for a user, there is nothing to do, as we will look up the result
		/// column matrix when lowering the user. For other uses, we embed the
		/// result matrix in a flat vector and update the use.
		/// 2.4. Cache the result column matrix for the instruction we lowered
		/// 3. After we lowered all instructions in a function, remove the now
		/// obsolete instructions.
		///
class LowerMatrixIntrinsics {		class LowerMatrixIntrinsics {
Function &Func;		Function &Func;
const DataLayout &DL;		const DataLayout &DL;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;

/// Wrapper class representing a matrix as a set of column vectors.		/// Wrapper class representing a matrix as a set of column vectors.
/// All column vectors must have the same vector type.		/// All column vectors must have the same vector type.
class ColumnMatrixTy {		class ColumnMatrixTy {
SmallVector<Value *, 16> Columns;		SmallVector<Value *, 16> Columns;

public:		public:
ColumnMatrixTy() : Columns() {}		ColumnMatrixTy() : Columns() {}
ColumnMatrixTy(ArrayRef<Value *> Cols)		ColumnMatrixTy(ArrayRef<Value *> Cols)
: Columns(Cols.begin(), Cols.end()) {}		: Columns(Cols.begin(), Cols.end()) {}

Value *getColumn(unsigned i) const { return Columns[i]; }		Value *getColumn(unsigned i) const { return Columns[i]; }

void setColumn(unsigned i, Value *V) { Columns[i] = V; }		void setColumn(unsigned i, Value *V) { Columns[i] = V; }

size_t getNumColumns() const { return Columns.size(); }		size_t getNumColumns() const { return Columns.size(); }
		size_t getNumRows() const {
		assert(Columns.size() > 0 && "Cannot call getNumRows without columns");
		return cast<VectorType>(Columns[0]->getType())->getNumElements();
		}

const SmallVectorImpl<Value *> &getColumnVectors() const { return Columns; }		const SmallVectorImpl<Value *> &getColumnVectors() const { return Columns; }

SmallVectorImpl<Value *> &getColumnVectors() { return Columns; }		SmallVectorImpl<Value *> &getColumnVectors() { return Columns; }

void addColumn(Value *V) { Columns.push_back(V); }		void addColumn(Value *V) { Columns.push_back(V); }

iterator_range<SmallVector<Value *, 8>::iterator> columns() {		iterator_range<SmallVector<Value *, 8>::iterator> columns() {
[10 lines not shown]	class LowerMatrixIntrinsics {

struct ShapeInfo {		struct ShapeInfo {
unsigned NumRows;		unsigned NumRows;
unsigned NumColumns;		unsigned NumColumns;

ShapeInfo(unsigned NumRows = 0, unsigned NumColumns = 0)		ShapeInfo(unsigned NumRows = 0, unsigned NumColumns = 0)
: NumRows(NumRows), NumColumns(NumColumns) {}		: NumRows(NumRows), NumColumns(NumColumns) {}

ShapeInfo(ConstantInt *NumRows, ConstantInt *NumColumns)		ShapeInfo(Value *NumRows, Value *NumColumns)
: NumRows(NumRows->getZExtValue()),		: NumRows(cast<ConstantInt>(NumRows)->getZExtValue()),
NumColumns(NumColumns->getZExtValue()) {}		NumColumns(cast<ConstantInt>(NumColumns)->getZExtValue()) {}

		bool operator==(const ShapeInfo &other) {
		return NumRows == other.NumRows && NumColumns == other.NumColumns;
		}
		bool operator!=(const ShapeInfo &other) { return !(*this == other); }

		/// Returns true if shape-information is defined, meaning both dimensions
		/// are != 0.
		operator bool() const {
		assert(NumRows == 0 || NumColumns != 0);
		return NumRows != 0;
		}
};		};

		/// Maps instructions to their shape information. The shape information
		/// describes the shape to be used while lowering. This matches the shape of
		/// the result value of the instruction, with the only exceptions being store
		/// instructions and the matrix_columnwise_store intrinsics. For those, the
		/// shape information indicates that those instructions should be lowered
		/// using shape information as well.
		DenseMap<Value *, ShapeInfo> ShapeMap;

		/// List of instructions to remove. While lowering, we are not replacing all
		/// users of a lowered instruction, if shape information is available and
		/// those need to be removed after we finished lowering.
		SmallVector<Instruction *, 16> ToRemove;

		/// Map from instructions to their produced column matrix.
		DenseMap<Value *, ColumnMatrixTy> Inst2ColumnMatrix;

public:		public:
LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI)		LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI)
: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI) {}		: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI) {}

/// Return the set of column vectors that a matrix value is lowered to.		/// Return the set of column vectors that a matrix value is lowered to.
///		///
/// We split the flat vector \p MatrixVal containing a matrix with shape \p SI		/// If we lowered \p MatrixVal, just return the cached result column matrix.
/// into column vectors.		/// Otherwise split the flat vector \p MatrixVal containing a matrix with
		/// shape \p SI into column vectors.
ColumnMatrixTy getMatrix(Value *MatrixVal, const ShapeInfo &SI,		ColumnMatrixTy getMatrix(Value *MatrixVal, const ShapeInfo &SI,
IRBuilder<> Builder) {		IRBuilder<> Builder) {
VectorType *VType = dyn_cast<VectorType>(MatrixVal->getType());		VectorType *VType = dyn_cast<VectorType>(MatrixVal->getType());
assert(VType && "MatrixVal must be a vector type");		assert(VType && "MatrixVal must be a vector type");
assert(VType->getNumElements() == SI.NumRows * SI.NumColumns &&		assert(VType->getNumElements() == SI.NumRows * SI.NumColumns &&
"The vector size must match the number of matrix elements");		"The vector size must match the number of matrix elements");

		// Check if we lowered MatrixVal using shape information. In that case,
		// return the existing column matrix, if it matches the requested shape
		// information. If there is a mis-match, embed the result in a flat
		// vector and split it later.
		auto Found = Inst2ColumnMatrix.find(MatrixVal);
		if (Found != Inst2ColumnMatrix.end()) {
		ColumnMatrixTy &M = Found->second;
		// Return the found matrix, if its shape matches the requested shape
		// information
		if (SI.NumRows == M.getNumRows() && SI.NumColumns == M.getNumColumns())
		return M;

		MatrixVal = M.embedInVector(Builder);
		}

		// Otherwise split MatrixVal.
SmallVector<Value *, 16> SplitVecs;		SmallVector<Value *, 16> SplitVecs;
Value *Undef = UndefValue::get(VType);		Value *Undef = UndefValue::get(VType);

for (unsigned MaskStart = 0; MaskStart < VType->getNumElements();		for (unsigned MaskStart = 0; MaskStart < VType->getNumElements();
MaskStart += SI.NumRows) {		MaskStart += SI.NumRows) {
Constant *Mask = createSequentialMask(Builder, MaskStart, SI.NumRows, 0);		Constant *Mask = createSequentialMask(Builder, MaskStart, SI.NumRows, 0);
Value *V = Builder.CreateShuffleVector(MatrixVal, Undef, Mask, "split");		Value *V = Builder.CreateShuffleVector(MatrixVal, Undef, Mask, "split");
SplitVecs.push_back(V);		SplitVecs.push_back(V);
}		}

return {SplitVecs};		return {SplitVecs};
}		}

// Replace intrinsic calls		/// If \p V already has a known shape return false. Otherwise set the shape
bool VisitCallInst(CallInst *Inst) {		/// for instructions that support it.
if (!Inst->getCalledFunction() || !Inst->getCalledFunction()->isIntrinsic())		bool setShapeInfo(Value *V, ShapeInfo Shape) {
		assert(Shape && "Shape not set");
		if (isa<UndefValue>(V) || !supportsShapeInfo(V))
return false;		return false;

switch (Inst->getCalledFunction()->getIntrinsicID()) {		auto SIter = ShapeMap.find(V);
		if (SIter != ShapeMap.end()) {
		LLVM_DEBUG(dbgs() << " not overriding existing shape: "
		<< SIter->second.NumRows << " "
		<< SIter->second.NumColumns << " for " << *V << "\n");
		return false;
		}

		ShapeMap.insert({V, Shape});
		LLVM_DEBUG(dbgs() << " " << Shape.NumRows << " x " << Shape.NumColumns
		<< " for " << *V << "\n");
		return true;
		}

		/// Returns true if shape information can be used for \p V. The supported
		/// instructions must match the instructions that can be lowered by this pass.
		bool supportsShapeInfo(Value *V) {
		Instruction *Inst = dyn_cast<Instruction>(V);
		if (!Inst)
		return false;

		IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst);
		if (II)
		switch (II->getIntrinsicID()) {
case Intrinsic::matrix_multiply:		case Intrinsic::matrix_multiply:
LowerMultiply(Inst);
break;
case Intrinsic::matrix_transpose:		case Intrinsic::matrix_transpose:
LowerTranspose(Inst);
break;
case Intrinsic::matrix_columnwise_load:		case Intrinsic::matrix_columnwise_load:
LowerColumnwiseLoad(Inst);
break;
case Intrinsic::matrix_columnwise_store:		case Intrinsic::matrix_columnwise_store:
LowerColumnwiseStore(Inst);		return true;
break;
default:		default:
return false;		return false;
}		}
Inst->eraseFromParent();		return isa<StoreInst>(Inst);
return true;		}

		/// Propagate the shape information of instructions to their users.
		void propagateShapeForward() {
		// The work list contains instructions for which we can compute the shape,
		// either based on the information provided by matrix intrinsics or known
		// shapes of operands.
		SmallVector<Instruction *, 8> WorkList;

		// Initialize the work list with ops carrying shape information. Initially
		// only the shape of matrix intrinsics is known.
		for (BasicBlock &BB : Func)
		for (Instruction &Inst : BB) {
		IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst);
		if (!II)
		continue;

		switch (II->getIntrinsicID()) {
		case Intrinsic::matrix_multiply:
		case Intrinsic::matrix_transpose:
		case Intrinsic::matrix_columnwise_load:
		case Intrinsic::matrix_columnwise_store:
		WorkList.push_back(&Inst);
		break;
		default:
		break;
		}
		}

		// Pop an element for which we are guaranteed to have at least one of the
		// operand shapes. Add the shape for this and then add users to the work
		// list.
		LLVM_DEBUG(dbgs() << "Forward-propagate shapes:\n");
		while (!WorkList.empty()) {
		Instruction *Inst = WorkList.back();
		WorkList.pop_back();

		// New entry, set the value and insert operands
		bool Propagate = false;

		Value *MatrixA;
		Value *MatrixB;
		Value *M;
		Value *N;
		Value *K;
		if (match(Inst, m_Intrinsic<Intrinsic::matrix_multiply>(
		m_Value(MatrixA), m_Value(MatrixB), m_Value(M),
		m_Value(N), m_Value(K)))) {
		Propagate = setShapeInfo(Inst, {M, K});
		} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_transpose>(
		m_Value(MatrixA), m_Value(M), m_Value(N)))) {
		// Flip dimensions.
		Propagate = setShapeInfo(Inst, {N, M});
		} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_columnwise_store>(
		m_Value(MatrixA), m_Value(), m_Value(),
		m_Value(M), m_Value(N)))) {
		Propagate = setShapeInfo(Inst, {N, M});
		} else if (match(Inst,
		m_Intrinsic<Intrinsic::matrix_columnwise_load>(
		m_Value(), m_Value(), m_Value(M), m_Value(N)))) {
		Propagate = setShapeInfo(Inst, {M, N});
		} else if (match(Inst, m_Store(m_Value(MatrixA), m_Value()))) {
		auto OpShape = ShapeMap.find(MatrixA);
		if (OpShape != ShapeMap.end())
		setShapeInfo(Inst, OpShape->second);
		continue;
		}

		if (Propagate)
		for (auto *User : Inst->users())
		if (ShapeMap.count(User) == 0)
		WorkList.push_back(cast<Instruction>(User));
		}
}		}

bool Visit() {		bool Visit() {
		if (EnableShapePropagation)
		propagateShapeForward();

ReversePostOrderTraversal<Function *> RPOT(&Func);		ReversePostOrderTraversal<Function *> RPOT(&Func);
bool Changed = false;		bool Changed = false;
for (auto *BB : RPOT) {		for (auto *BB : RPOT) {
for (Instruction &Inst : make_early_inc_range(*BB)) {		for (Instruction &Inst : make_early_inc_range(*BB)) {
		IRBuilder<> Builder(&Inst);

if (CallInst *CInst = dyn_cast<CallInst>(&Inst))		if (CallInst *CInst = dyn_cast<CallInst>(&Inst))
Changed |= VisitCallInst(CInst);		Changed |= VisitCallInst(CInst);

		Value *Op1;
		Value *Op2;
		if (match(&Inst, m_Store(m_Value(Op1), m_Value(Op2))))
		Changed |= VisitStore(&Inst, Op1, Op2, Builder);
}		}
}		}

		for (Instruction *Inst : reverse(ToRemove))
		Inst->eraseFromParent();

return Changed;		return Changed;
}		}

LoadInst *createColumnLoad(Value *ColumnPtr, Type *EltType,		LoadInst *createColumnLoad(Value *ColumnPtr, Type *EltType,
IRBuilder<> Builder) {		IRBuilder<> Builder) {
unsigned Align = DL.getABITypeAlignment(EltType);		unsigned Align = DL.getABITypeAlignment(EltType);
return Builder.CreateAlignedLoad(ColumnPtr, Align);		return Builder.CreateAlignedLoad(ColumnPtr, Align);
}		}

StoreInst *createColumnStore(Value *ColumnValue, Value *ColumnPtr,		StoreInst *createColumnStore(Value *ColumnValue, Value *ColumnPtr,
Type *EltType, IRBuilder<> Builder) {		Type *EltType, IRBuilder<> Builder) {
unsigned Align = DL.getABITypeAlignment(EltType);		unsigned Align = DL.getABITypeAlignment(EltType);
return Builder.CreateAlignedStore(ColumnValue, ColumnPtr, Align);		return Builder.CreateAlignedStore(ColumnValue, ColumnPtr, Align);
}		}


/// Turns \p BasePtr into an elementwise pointer to \p EltType.		/// Turns \p BasePtr into an elementwise pointer to \p EltType.
Value *createElementPtr(Value *BasePtr, Type *EltType, IRBuilder<> &Builder) {		Value *createElementPtr(Value *BasePtr, Type *EltType, IRBuilder<> &Builder) {
unsigned AS = cast<PointerType>(BasePtr->getType())->getAddressSpace();		unsigned AS = cast<PointerType>(BasePtr->getType())->getAddressSpace();
Type *EltPtrType = PointerType::get(EltType, AS);		Type *EltPtrType = PointerType::get(EltType, AS);
return Builder.CreatePointerCast(BasePtr, EltPtrType);		return Builder.CreatePointerCast(BasePtr, EltPtrType);
}		}

		/// Replace intrinsic calls
		bool VisitCallInst(CallInst *Inst) {
		if (!Inst->getCalledFunction() || !Inst->getCalledFunction()->isIntrinsic())
		return false;

		switch (Inst->getCalledFunction()->getIntrinsicID()) {
		case Intrinsic::matrix_multiply:
		LowerMultiply(Inst);
		break;
		case Intrinsic::matrix_transpose:
		LowerTranspose(Inst);
		break;
		case Intrinsic::matrix_columnwise_load:
		LowerColumnwiseLoad(Inst);
		break;
		case Intrinsic::matrix_columnwise_store:
		LowerColumnwiseStore(Inst);
		break;
		default:
		return false;
		}
		return true;
		}

/// Lowers llvm.matrix.columnwise.load.		/// Lowers llvm.matrix.columnwise.load.
///		///
/// The intrinsic loads a matrix from memory using a stride between columns.		/// The intrinsic loads a matrix from memory using a stride between columns.
void LowerColumnwiseLoad(CallInst *Inst) {		void LowerColumnwiseLoad(CallInst *Inst) {
IRBuilder<> Builder(Inst);		IRBuilder<> Builder(Inst);
Value *Ptr = Inst->getArgOperand(0);		Value *Ptr = Inst->getArgOperand(0);
Value *Stride = Inst->getArgOperand(1);		Value *Stride = Inst->getArgOperand(1);
auto VType = cast<VectorType>(Inst->getType());		auto VType = cast<VectorType>(Inst->getType());
ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(2)),
cast<ConstantInt>(Inst->getArgOperand(3)));
Value *EltPtr = createElementPtr(Ptr, VType->getElementType(), Builder);		Value *EltPtr = createElementPtr(Ptr, VType->getElementType(), Builder);
		ShapeInfo Shape(Inst->getArgOperand(2), Inst->getArgOperand(3));

ColumnMatrixTy Result;		ColumnMatrixTy Result;
// Distance between start of one column and the start of the next		// Distance between start of one column and the start of the next
for (unsigned C = 0, E = Shape.NumColumns; C < E; ++C) {		for (unsigned C = 0, E = Shape.NumColumns; C < E; ++C) {
Value *GEP =		Value *GEP =
computeColumnAddr(EltPtr, Builder.getInt32(C), Stride, Shape.NumRows,		computeColumnAddr(EltPtr, Builder.getInt32(C), Stride, Shape.NumRows,
VType->getElementType(), Builder);		VType->getElementType(), Builder);
Value *Column = createColumnLoad(GEP, VType->getElementType(), Builder);		Value *Column = createColumnLoad(GEP, VType->getElementType(), Builder);
Result.addColumn(Column);		Result.addColumn(Column);
}		}

Inst->replaceAllUsesWith(Result.embedInVector(Builder));		finalizeLowering(Inst, Result, Builder);
}		}

/// Lowers llvm.matrix.columnwise.store.		void LowerStore(Instruction *Inst, Value *Matrix, Value *Ptr, Value *Stride,
///		ShapeInfo Shape) {
/// The intrinsic stores a matrix back to memory using a stride between columns.
void LowerColumnwiseStore(CallInst *Inst) {
IRBuilder<> Builder(Inst);		IRBuilder<> Builder(Inst);
Value *Matrix = Inst->getArgOperand(0);
Value *Ptr = Inst->getArgOperand(1);
Value *Stride = Inst->getArgOperand(2);
ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(3)),
cast<ConstantInt>(Inst->getArgOperand(4)));
auto VType = cast<VectorType>(Matrix->getType());		auto VType = cast<VectorType>(Matrix->getType());
Value *EltPtr = createElementPtr(Ptr, VType->getElementType(), Builder);		Value *EltPtr = createElementPtr(Ptr, VType->getElementType(), Builder);

auto LM = getMatrix(Matrix, Shape, Builder);		auto LM = getMatrix(Matrix, Shape, Builder);
for (auto C : enumerate(LM.columns())) {		for (auto C : enumerate(LM.columns())) {
Value *GEP =		Value *GEP =
computeColumnAddr(EltPtr, Builder.getInt32(C.index()), Stride,		computeColumnAddr(EltPtr, Builder.getInt32(C.index()), Stride,
Shape.NumRows, VType->getElementType(), Builder);		Shape.NumRows, VType->getElementType(), Builder);
createColumnStore(C.value(), GEP, VType->getElementType(), Builder);		createColumnStore(C.value(), GEP, VType->getElementType(), Builder);
}		}

		ToRemove.push_back(Inst);
		}

		/// Lowers llvm.matrix.columnwise.store.
		///
		/// The intrinsic stores a matrix back to memory using a stride between columns.
		void LowerColumnwiseStore(CallInst *Inst) {
		Value *Matrix = Inst->getArgOperand(0);
		Value *Ptr = Inst->getArgOperand(1);
		Value *Stride = Inst->getArgOperand(2);
		LowerStore(Inst, Matrix, Ptr, Stride,
		{Inst->getArgOperand(3), Inst->getArgOperand(4)});
}		}

/// Extract a column vector of \p NumElts starting at index (\p I, \p J) from		/// Extract a column vector of \p NumElts starting at index (\p I, \p J) from
/// the matrix \p LM represented as a vector of column vectors.		/// the matrix \p LM represented as a vector of column vectors.
Value *extractVector(const ColumnMatrixTy &LM, unsigned I, unsigned J,		Value *extractVector(const ColumnMatrixTy &LM, unsigned I, unsigned J,
unsigned NumElts, IRBuilder<> Builder) {		unsigned NumElts, IRBuilder<> Builder) {
Value *Col = LM.getColumn(J);		Value *Col = LM.getColumn(J);
Value *Undef = UndefValue::get(Col->getType());		Value *Undef = UndefValue::get(Col->getType());
[... 39 lines not shown ...]
		Value *createMulAdd(Value *Sum, Value *A, Value *B, bool UseFPOp,
IRBuilder<> &Builder) {		IRBuilder<> &Builder) {
Value *Mul = UseFPOp ? Builder.CreateFMul(A, B) : Builder.CreateMul(A, B);		Value *Mul = UseFPOp ? Builder.CreateFMul(A, B) : Builder.CreateMul(A, B);
if (!Sum)		if (!Sum)
return Mul;		return Mul;

return UseFPOp ? Builder.CreateFAdd(Sum, Mul) : Builder.CreateAdd(Sum, Mul);		return UseFPOp ? Builder.CreateFAdd(Sum, Mul) : Builder.CreateAdd(Sum, Mul);
}		}

		/// Cache \p Matrix as result of \p Inst and update the uses of \p Inst. For
		/// users with shape information, there's nothing to do: they will use the
		anemet: Please add a comment of what's happening here.
		/// cached value when they are lowered. For other users, \p Matrix is
		/// flattened and the uses are updated to use it. Also marks \p Inst for
		/// deletion.
		void finalizeLowering(Instruction *Inst, ColumnMatrixTy Matrix,
		IRBuilder<> &Builder) {
		Inst2ColumnMatrix.insert(std::make_pair(Inst, Matrix));

		ToRemove.push_back(Inst);
		Value *Flattened = nullptr;
		for (auto I = Inst->use_begin(), E = Inst->use_end(); I != E;) {
		Use &U = *I++;
		if (ShapeMap.find(U.getUser()) == ShapeMap.end()) {
		anemet: nit: ShapeMap.count?
		fhahn (Author): I usually prefer find(), but I guess for DenseMap it does not matter performance-wise, and I can change it before committing if you prefer.
		if (!Flattened)
		Flattened = Matrix.embedInVector(Builder);
		U.set(Flattened);
		}
		}
		}
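
The lazy flattening described in the comment on `finalizeLowering` (flatten only if some user lacks shape information, and do it at most once) can be sketched on plain containers. This is an illustration under stated assumptions rather than the pass's code: `flattenColumns`, `FlattenResult`, and `flattenForUsersWithoutShape` are hypothetical names, users are modeled as integer ids, and columns are `std::vector<double>` instead of IR values.

```cpp
#include <cassert>
#include <set>
#include <vector>

using Column = std::vector<double>;

// Concatenates the columns into one flat, column-major vector, loosely
// mirroring what embedInVector does for IR values.
std::vector<double> flattenColumns(const std::vector<Column> &Columns) {
  std::vector<double> Flat;
  for (const Column &C : Columns) {
    assert(C.size() == Columns.front().size() && "columns must have equal length");
    Flat.insert(Flat.end(), C.begin(), C.end());
  }
  return Flat;
}

struct FlattenResult {
  bool Materialized = false;       // true if any user needed the flat form
  std::vector<double> Flat;        // the flat vector, if materialized
  std::vector<int> RewrittenUsers; // users redirected to the flat vector
};

// Users with shape information are skipped: they would pick up the cached
// column matrix when they are lowered. All other users share one flat vector,
// which is built on first need.
FlattenResult flattenForUsersWithoutShape(const std::vector<Column> &Matrix,
                                          const std::vector<int> &Users,
                                          const std::set<int> &UsersWithShape) {
  FlattenResult R;
  for (int User : Users) {
    if (UsersWithShape.count(User))
      continue;
    if (!R.Materialized) {
      R.Flat = flattenColumns(Matrix);
      R.Materialized = true;
    }
    R.RewrittenUsers.push_back(User);
  }
  return R;
}
```

When every user has shape information, nothing is materialized, which corresponds to the case where the conversion back to a flat vector is eliminated entirely.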

/// Lowers llvm.matrix.multiply.		/// Lowers llvm.matrix.multiply.
void LowerMultiply(CallInst *MatMul) {		void LowerMultiply(CallInst *MatMul) {
IRBuilder<> Builder(MatMul);		IRBuilder<> Builder(MatMul);
auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();		auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();
ShapeInfo LShape(cast<ConstantInt>(MatMul->getArgOperand(2)),		ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));
cast<ConstantInt>(MatMul->getArgOperand(3)));		ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));
ShapeInfo RShape(cast<ConstantInt>(MatMul->getArgOperand(3)),
cast<ConstantInt>(MatMul->getArgOperand(4)));

const ColumnMatrixTy &Lhs =		const ColumnMatrixTy &Lhs =
getMatrix(MatMul->getArgOperand(0), LShape, Builder);		getMatrix(MatMul->getArgOperand(0), LShape, Builder);
const ColumnMatrixTy &Rhs =		const ColumnMatrixTy &Rhs =
getMatrix(MatMul->getArgOperand(1), RShape, Builder);		getMatrix(MatMul->getArgOperand(1), RShape, Builder);

const unsigned R = LShape.NumRows;		const unsigned R = LShape.NumRows;
const unsigned M = LShape.NumColumns;		const unsigned M = LShape.NumColumns;
[... 25 lines not shown ...]
		for (unsigned J = 0; J < C; ++J) {
Value *RH = Builder.CreateExtractElement(Rhs.getColumn(J), K);		Value *RH = Builder.CreateExtractElement(Rhs.getColumn(J), K);
Value *Splat = Builder.CreateVectorSplat(BlockSize, RH, "splat");		Value *Splat = Builder.CreateVectorSplat(BlockSize, RH, "splat");
Sum = createMulAdd(Sum, L, Splat, EltType->isFloatingPointTy(),		Sum = createMulAdd(Sum, L, Splat, EltType->isFloatingPointTy(),
Builder);		Builder);
}		}
Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));		Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));
}		}
}		}
		finalizeLowering(MatMul, Result, Builder);
MatMul->replaceAllUsesWith(Result.embedInVector(Builder));
}		}

/// Lowers llvm.matrix.transpose.		/// Lowers llvm.matrix.transpose.
void LowerTranspose(CallInst *Inst) {		void LowerTranspose(CallInst *Inst) {
ColumnMatrixTy Result;		ColumnMatrixTy Result;
IRBuilder<> Builder(Inst);		IRBuilder<> Builder(Inst);
Value *InputVal = Inst->getArgOperand(0);		Value *InputVal = Inst->getArgOperand(0);
VectorType *VectorTy = cast<VectorType>(InputVal->getType());		VectorType *VectorTy = cast<VectorType>(InputVal->getType());
ShapeInfo ArgShape(cast<ConstantInt>(Inst->getArgOperand(1)),		ShapeInfo ArgShape(Inst->getArgOperand(1), Inst->getArgOperand(2));
cast<ConstantInt>(Inst->getArgOperand(2)));
ColumnMatrixTy InputMatrix = getMatrix(InputVal, ArgShape, Builder);		ColumnMatrixTy InputMatrix = getMatrix(InputVal, ArgShape, Builder);

for (unsigned Row = 0; Row < ArgShape.NumRows; ++Row) {		for (unsigned Row = 0; Row < ArgShape.NumRows; ++Row) {
// Build a single column vector for this row. First initialize it.		// Build a single column vector for this row. First initialize it.
Value *ResultColumn = UndefValue::get(		Value *ResultColumn = UndefValue::get(
VectorType::get(VectorTy->getElementType(), ArgShape.NumColumns));		VectorType::get(VectorTy->getElementType(), ArgShape.NumColumns));

// Go through the elements of this row and insert it into the resulting		// Go through the elements of this row and insert it into the resulting
// column vector.		// column vector.
for (auto C : enumerate(InputMatrix.columns())) {		for (auto C : enumerate(InputMatrix.columns())) {
Value *Elt = Builder.CreateExtractElement(C.value(), Row);		Value *Elt = Builder.CreateExtractElement(C.value(), Row);
// We insert at index Column since that is the row index after the		// We insert at index Column since that is the row index after the
// transpose.		// transpose.
ResultColumn =		ResultColumn =
Builder.CreateInsertElement(ResultColumn, Elt, C.index());		Builder.CreateInsertElement(ResultColumn, Elt, C.index());
}		}
Result.addColumn(ResultColumn);		Result.addColumn(ResultColumn);
}		}

Inst->replaceAllUsesWith(Result.embedInVector(Builder));		finalizeLowering(Inst, Result, Builder);
		}
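
The loop structure of `LowerTranspose` (one result column per input row, filled by reading that row's element from each input column) can be checked on plain data. The sketch below is an assumption-laden illustration: `transposeColumns` is a hypothetical helper, and columns are `std::vector<double>` rather than IR vector values built with extractelement/insertelement.

```cpp
#include <cassert>
#include <vector>

using Column = std::vector<double>;

// Transposes a matrix stored as a list of column vectors. Each result column
// is built from one input row: element C of the result column is the element
// at index Row of input column C.
std::vector<Column> transposeColumns(const std::vector<Column> &Input,
                                     unsigned NumRows) {
  std::vector<Column> Result;
  for (unsigned Row = 0; Row < NumRows; ++Row) {
    Column ResultColumn(Input.size());
    for (unsigned C = 0; C < Input.size(); ++C) {
      assert(Input[C].size() == NumRows && "column length must match NumRows");
      // The input column index becomes the row index after the transpose.
      ResultColumn[C] = Input[C][Row];
    }
    Result.push_back(ResultColumn);
  }
  return Result;
}
```

For a 2 x 3 input given as three columns of two elements, the result has two columns of three elements, matching the pattern of extract/insert pairs in the CHECK lines of the tests.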

		bool VisitStore(Instruction *Inst, Value *StoredVal, Value *Ptr,
		IRBuilder<> &Builder) {
		auto I = ShapeMap.find(StoredVal);
		if (I == ShapeMap.end())
		return false;

		LowerStore(Inst, StoredVal, Ptr, Builder.getInt32(I->second.NumRows), I->second);
		return true;
}		}
};		};
} // namespace		} // namespace

PreservedAnalyses LowerMatrixIntrinsicsPass::run(Function &F,		PreservedAnalyses LowerMatrixIntrinsicsPass::run(Function &F,
FunctionAnalysisManager &AM) {		FunctionAnalysisManager &AM) {
auto &TTI = AM.getResult<TargetIRAnalysis>(F);		auto &TTI = AM.getResult<TargetIRAnalysis>(F);
LowerMatrixIntrinsics LMT(F, TTI);		LowerMatrixIntrinsics LMT(F, TTI);
[... 43 lines not shown ...]

llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s			; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
	; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s			; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s


	define void @transpose_multiply(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x double>* %C.Ptr) {			define void @transpose_multiply(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x double>* %C.Ptr) {
	; CHECK-LABEL: @transpose_multiply(			; CHECK-LABEL: @transpose_multiply(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:

				; Load input matrixes %A and %B.

	; CHECK-NEXT: [[A:%.*]] = load <9 x double>, <9 x double>* [[A_PTR:%.*]]			; CHECK-NEXT: [[A:%.*]] = load <9 x double>, <9 x double>* [[A_PTR:%.*]]
	; CHECK-NEXT: [[B:%.*]] = load <9 x double>, <9 x double>* [[B_PTR:%.*]]			; CHECK-NEXT: [[B:%.*]] = load <9 x double>, <9 x double>* [[B_PTR:%.*]]

				; Extract columns from loaded value %A.

	; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>			; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>

				; Transpose %A.

	; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x double> [[SPLIT]], i64 0			; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x double> [[SPLIT]], i64 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x double> undef, double [[TMP0]], i64 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x double> undef, double [[TMP0]], i64 0
	; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 0			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <3 x double> [[TMP1]], double [[TMP2]], i64 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <3 x double> [[TMP1]], double [[TMP2]], i64 1
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 0			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <3 x double> [[TMP3]], double [[TMP4]], i64 2			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <3 x double> [[TMP3]], double [[TMP4]], i64 2
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <3 x double> [[SPLIT]], i64 1			; CHECK-NEXT: [[TMP6:%.*]] = extractelement <3 x double> [[SPLIT]], i64 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <3 x double> undef, double [[TMP6]], i64 0			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <3 x double> undef, double [[TMP6]], i64 0
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 1			; CHECK-NEXT: [[TMP8:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 1
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <3 x double> [[TMP7]], double [[TMP8]], i64 1			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <3 x double> [[TMP7]], double [[TMP8]], i64 1
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1			; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2			; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2			; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2
	; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> [[TMP11]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; Extract columns from %B.
	; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <6 x double> [[TMP18]], <6 x double> [[TMP19]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>			; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[SPLIT6:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT7:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; Lower multiply(transpose(%A), %B)
	; CHECK-NEXT: [[SPLIT8:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP21:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP21]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP18]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP22:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]			; CHECK-NEXT: [[TMP19:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
	; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[BLOCK6:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
				; CHECK-NEXT: [[SPLAT_SPLATINSERT7:%.*]] = insertelement <1 x double> undef, double [[TMP20]], i32 0
				; CHECK-NEXT: [[SPLAT_SPLAT8:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT7]], <1 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP21:%.*]] = fmul <1 x double> [[BLOCK6]], [[SPLAT_SPLAT8]]
				; CHECK-NEXT: [[TMP22:%.*]] = fadd <1 x double> [[TMP19]], [[TMP21]]
				; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]			; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]
	; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]			; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]
	; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <1 x double> [[TMP25]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP26:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP26]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP26]], i32 0			; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP28:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP28]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP27:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]			; CHECK-NEXT: [[TMP29:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
	; CHECK-NEXT: [[TMP28:%.*]] = fadd <1 x double> [[TMP25]], [[TMP27]]			; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <1 x double> [[TMP28]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP30:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP29]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP30]], i32 0
	; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP31:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP31]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP32:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]			; CHECK-NEXT: [[TMP31:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
	; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP32:%.*]] = fadd <1 x double> [[TMP29]], [[TMP31]]
	; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]			; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]
	; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]			; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]
	; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <1 x double> [[TMP35]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP36:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <3 x double> [[TMP27]], <3 x double> [[TMP36]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP36]], i32 0			; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP38:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP38]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP37:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]			; CHECK-NEXT: [[TMP39:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
	; CHECK-NEXT: [[TMP38:%.*]] = fadd <1 x double> [[TMP35]], [[TMP37]]			; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <1 x double> [[TMP38]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP40:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <3 x double> [[TMP30]], <3 x double> [[TMP39]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP40]], i32 0
	; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP41:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP41]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP42:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]			; CHECK-NEXT: [[TMP41:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
	; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP42:%.*]] = fadd <1 x double> [[TMP39]], [[TMP41]]
	; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]			; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]
	; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]			; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]
	; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <1 x double> [[TMP45]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP46:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <3 x double> [[TMP37]], <3 x double> [[TMP46]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP46]], i32 0			; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP48:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP48]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP47:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]			; CHECK-NEXT: [[TMP49:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
	; CHECK-NEXT: [[TMP48:%.*]] = fadd <1 x double> [[TMP45]], [[TMP47]]			; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP49:%.*]] = shufflevector <1 x double> [[TMP48]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP50:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <3 x double> [[TMP40]], <3 x double> [[TMP49]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP50]], i32 0
	; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP51:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP51]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP52:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]			; CHECK-NEXT: [[TMP51:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
	; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP52:%.*]] = fadd <1 x double> [[TMP49]], [[TMP51]]
	; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]			; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]
	; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]			; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]
	; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP56:%.*]] = shufflevector <1 x double> [[TMP55]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP56:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP57:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP56]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP56]], i32 0			; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP58:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP58]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP57:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]			; CHECK-NEXT: [[TMP59:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
	; CHECK-NEXT: [[TMP58:%.*]] = fadd <1 x double> [[TMP55]], [[TMP57]]			; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <1 x double> [[TMP58]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP60:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP59]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP60]], i32 0
	; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP61:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP61]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP62:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]			; CHECK-NEXT: [[TMP61:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
	; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP62:%.*]] = fadd <1 x double> [[TMP59]], [[TMP61]]
	; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]			; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]
	; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]			; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]
	; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP66:%.*]] = shufflevector <1 x double> [[TMP65]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP66:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <3 x double> [[TMP57]], <3 x double> [[TMP66]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP66]], i32 0			; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP68:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP68]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP67:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]			; CHECK-NEXT: [[TMP69:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
	; CHECK-NEXT: [[TMP68:%.*]] = fadd <1 x double> [[TMP65]], [[TMP67]]			; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <1 x double> [[TMP68]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP70:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP70:%.*]] = shufflevector <3 x double> [[TMP60]], <3 x double> [[TMP69]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP70]], i32 0
	; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP71:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP71]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP72:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]			; CHECK-NEXT: [[TMP71:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
	; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP72:%.*]] = fadd <1 x double> [[TMP69]], [[TMP71]]
	; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]			; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]
	; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]			; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]
	; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <1 x double> [[TMP75]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP76:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP77:%.*]] = shufflevector <3 x double> [[TMP67]], <3 x double> [[TMP76]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP76]], i32 0			; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP78:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP78]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP77:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]			; CHECK-NEXT: [[TMP79:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
	; CHECK-NEXT: [[TMP78:%.*]] = fadd <1 x double> [[TMP75]], [[TMP77]]			; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP79:%.*]] = shufflevector <1 x double> [[TMP78]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP80:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP80:%.*]] = shufflevector <3 x double> [[TMP70]], <3 x double> [[TMP79]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP80]], i32 0
	; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP81:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP81]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP82:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]			; CHECK-NEXT: [[TMP81:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
	; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP82:%.*]] = fadd <1 x double> [[TMP79]], [[TMP81]]
	; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]			; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]
	; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]			; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]
	; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP86:%.*]] = shufflevector <1 x double> [[TMP85]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP86:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP87:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP86]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP86]], i32 0			; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP88:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP88]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP87:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]			; CHECK-NEXT: [[TMP89:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
	; CHECK-NEXT: [[TMP88:%.*]] = fadd <1 x double> [[TMP85]], [[TMP87]]			; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP89:%.*]] = shufflevector <1 x double> [[TMP88]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP90:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP90:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP89]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP90]], i32 0
	; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP91:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP91]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP92:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]			; CHECK-NEXT: [[TMP91:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
	; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP92:%.*]] = fadd <1 x double> [[TMP89]], [[TMP91]]
	; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]			; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]
	; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]			; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]
	; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP96:%.*]] = shufflevector <1 x double> [[TMP95]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP96:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP97:%.*]] = shufflevector <3 x double> [[TMP87]], <3 x double> [[TMP96]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP96]], i32 0			; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP98:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP98]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP97:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]			; CHECK-NEXT: [[TMP99:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
	; CHECK-NEXT: [[TMP98:%.*]] = fadd <1 x double> [[TMP95]], [[TMP97]]			; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP99:%.*]] = shufflevector <1 x double> [[TMP98]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP100:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP100:%.*]] = shufflevector <3 x double> [[TMP90]], <3 x double> [[TMP99]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP100]], i32 0
	; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP101:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP101]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP102:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]			; CHECK-NEXT: [[TMP101:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
	; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP102:%.*]] = fadd <1 x double> [[TMP99]], [[TMP101]]
	; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]			; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]
	; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]			; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]
	; CHECK-NEXT: [[BLOCK84:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP106:%.*]] = shufflevector <1 x double> [[TMP105]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP106:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP107:%.*]] = shufflevector <3 x double> [[TMP97]], <3 x double> [[TMP106]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT85:%.*]] = insertelement <1 x double> undef, double [[TMP106]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT86:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT85]], <1 x double> undef, <1 x i32> zeroinitializer			; Store result columns.
	; CHECK-NEXT: [[TMP107:%.*]] = fmul <1 x double> [[BLOCK84]], [[SPLAT_SPLAT86]]
	; CHECK-NEXT: [[TMP108:%.*]] = fadd <1 x double> [[TMP105]], [[TMP107]]			; CHECK-NEXT: [[TMP108:%.*]] = bitcast <9 x double>* [[C_PTR:%.*]] to double*
	; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <1 x double> [[TMP108]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP109:%.*]] = bitcast double* [[TMP108]] to <3 x double>*
	; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <3 x double> [[TMP100]], <3 x double> [[TMP109]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: store <3 x double> [[TMP47]], <3 x double>* [[TMP109]], align 8
	; CHECK-NEXT: [[TMP111:%.*]] = shufflevector <3 x double> [[TMP50]], <3 x double> [[TMP80]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>			; CHECK-NEXT: [[TMP110:%.*]] = getelementptr double, double* [[TMP108]], i32 3
	; CHECK-NEXT: [[TMP112:%.*]] = shufflevector <3 x double> [[TMP110]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP111:%.*]] = bitcast double* [[TMP110]] to <3 x double>*
	; CHECK-NEXT: [[TMP113:%.*]] = shufflevector <6 x double> [[TMP111]], <6 x double> [[TMP112]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>			; CHECK-NEXT: store <3 x double> [[TMP77]], <3 x double>* [[TMP111]], align 8
	; CHECK-NEXT: store <9 x double> [[TMP113]], <9 x double>* [[C_PTR:%.*]]			; CHECK-NEXT: [[TMP112:%.*]] = getelementptr double, double* [[TMP108]], i32 6
				; CHECK-NEXT: [[TMP113:%.*]] = bitcast double* [[TMP112]] to <3 x double>*
				; CHECK-NEXT: store <3 x double> [[TMP107]], <3 x double>* [[TMP113]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;

	entry:			entry:
	%a = load <9 x double>, <9 x double>* %A.Ptr			%a = load <9 x double>, <9 x double>* %A.Ptr
	%b = load <9 x double>, <9 x double>* %B.Ptr			%b = load <9 x double>, <9 x double>* %B.Ptr
	%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)			%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)
	%c = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)			%c = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)
	store <9 x double> %c, <9 x double>* %C.Ptr			store <9 x double> %c, <9 x double>* %C.Ptr
	ret void			ret void
	}			}

	declare <9 x double> @llvm.matrix.transpose(<9 x double>, i32, i32)			declare <9 x double> @llvm.matrix.transpose(<9 x double>, i32, i32)
	declare <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double>, <9 x double>, i32, i32, i32)			declare <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double>, <9 x double>, i32, i32, i32)

	define void @transpose_multiply_add(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x double>* %C.Ptr) {			define void @transpose_multiply_add(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x double>* %C.Ptr) {
	; CHECK-LABEL: @transpose_multiply_add(			; CHECK-LABEL: @transpose_multiply_add(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:

				; Load input matrixes %A and %B.

	; CHECK-NEXT: [[A:%.*]] = load <9 x double>, <9 x double>* [[A_PTR:%.*]]			; CHECK-NEXT: [[A:%.*]] = load <9 x double>, <9 x double>* [[A_PTR:%.*]]
	; CHECK-NEXT: [[B:%.*]] = load <9 x double>, <9 x double>* [[B_PTR:%.*]]			; CHECK-NEXT: [[B:%.*]] = load <9 x double>, <9 x double>* [[B_PTR:%.*]]

				; Extract columns from loaded value %A.

	; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>			; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>

				; Transpose %A.

	; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x double> [[SPLIT]], i64 0			; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x double> [[SPLIT]], i64 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x double> undef, double [[TMP0]], i64 0			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x double> undef, double [[TMP0]], i64 0
	; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 0			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <3 x double> [[TMP1]], double [[TMP2]], i64 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <3 x double> [[TMP1]], double [[TMP2]], i64 1
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 0			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <3 x double> [[TMP3]], double [[TMP4]], i64 2			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <3 x double> [[TMP3]], double [[TMP4]], i64 2
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <3 x double> [[SPLIT]], i64 1			; CHECK-NEXT: [[TMP6:%.*]] = extractelement <3 x double> [[SPLIT]], i64 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <3 x double> undef, double [[TMP6]], i64 0			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <3 x double> undef, double [[TMP6]], i64 0
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 1			; CHECK-NEXT: [[TMP8:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 1
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <3 x double> [[TMP7]], double [[TMP8]], i64 1			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <3 x double> [[TMP7]], double [[TMP8]], i64 1
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1			; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2			; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2			; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2
	; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> [[TMP11]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; Extract columns from %B.
	; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <6 x double> [[TMP18]], <6 x double> [[TMP19]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>			; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[SPLIT6:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT7:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; Lower multiply(transpose(%A), %B)
	; CHECK-NEXT: [[SPLIT8:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP21:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP21]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP18]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP22:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]			; CHECK-NEXT: [[TMP19:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
	; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[BLOCK6:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
				; CHECK-NEXT: [[SPLAT_SPLATINSERT7:%.*]] = insertelement <1 x double> undef, double [[TMP20]], i32 0
				; CHECK-NEXT: [[SPLAT_SPLAT8:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT7]], <1 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP21:%.*]] = fmul <1 x double> [[BLOCK6]], [[SPLAT_SPLAT8]]
				; CHECK-NEXT: [[TMP22:%.*]] = fadd <1 x double> [[TMP19]], [[TMP21]]
				; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]			; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]
	; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]			; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]
	; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <1 x double> [[TMP25]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP26:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP26]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP26]], i32 0			; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP28:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP28]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP27:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]			; CHECK-NEXT: [[TMP29:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
	; CHECK-NEXT: [[TMP28:%.*]] = fadd <1 x double> [[TMP25]], [[TMP27]]			; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <1 x double> [[TMP28]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP30:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP29]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP30]], i32 0
	; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP31:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP31]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP32:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]			; CHECK-NEXT: [[TMP31:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
	; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP32:%.*]] = fadd <1 x double> [[TMP29]], [[TMP31]]
	; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]			; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]
	; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]			; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]
	; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <1 x double> [[TMP35]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP36:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <3 x double> [[TMP27]], <3 x double> [[TMP36]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP36]], i32 0			; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP38:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP38]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP37:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]			; CHECK-NEXT: [[TMP39:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
	; CHECK-NEXT: [[TMP38:%.*]] = fadd <1 x double> [[TMP35]], [[TMP37]]			; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <1 x double> [[TMP38]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP40:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <3 x double> [[TMP30]], <3 x double> [[TMP39]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP40]], i32 0
	; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP41:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP41]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP42:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]			; CHECK-NEXT: [[TMP41:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
	; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP42:%.*]] = fadd <1 x double> [[TMP39]], [[TMP41]]
	; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]			; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]
	; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]			; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]
	; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <1 x double> [[TMP45]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP46:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <3 x double> [[TMP37]], <3 x double> [[TMP46]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP46]], i32 0			; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP48:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP48]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP47:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]			; CHECK-NEXT: [[TMP49:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
	; CHECK-NEXT: [[TMP48:%.*]] = fadd <1 x double> [[TMP45]], [[TMP47]]			; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP49:%.*]] = shufflevector <1 x double> [[TMP48]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP50:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <3 x double> [[TMP40]], <3 x double> [[TMP49]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP50]], i32 0
	; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP51:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP51]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP52:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]			; CHECK-NEXT: [[TMP51:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
	; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP52:%.*]] = fadd <1 x double> [[TMP49]], [[TMP51]]
	; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]			; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]
	; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]			; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]
	; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP56:%.*]] = shufflevector <1 x double> [[TMP55]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP56:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP57:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP56]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP56]], i32 0			; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP58:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP58]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP57:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]			; CHECK-NEXT: [[TMP59:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
	; CHECK-NEXT: [[TMP58:%.*]] = fadd <1 x double> [[TMP55]], [[TMP57]]			; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <1 x double> [[TMP58]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP60:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP59]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP60]], i32 0
	; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP61:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP61]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP62:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]			; CHECK-NEXT: [[TMP61:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
	; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP62:%.*]] = fadd <1 x double> [[TMP59]], [[TMP61]]
	; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]			; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]
	; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]			; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]
	; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP66:%.*]] = shufflevector <1 x double> [[TMP65]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP66:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <3 x double> [[TMP57]], <3 x double> [[TMP66]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP66]], i32 0			; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP68:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP68]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP67:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]			; CHECK-NEXT: [[TMP69:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
	; CHECK-NEXT: [[TMP68:%.*]] = fadd <1 x double> [[TMP65]], [[TMP67]]			; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <1 x double> [[TMP68]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP70:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP70:%.*]] = shufflevector <3 x double> [[TMP60]], <3 x double> [[TMP69]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP70]], i32 0
	; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP71:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP71]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP72:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]			; CHECK-NEXT: [[TMP71:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
	; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP72:%.*]] = fadd <1 x double> [[TMP69]], [[TMP71]]
	; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]			; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]
	; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]			; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]
	; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <1 x double> [[TMP75]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP76:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP77:%.*]] = shufflevector <3 x double> [[TMP67]], <3 x double> [[TMP76]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP76]], i32 0			; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP78:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP78]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP77:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]			; CHECK-NEXT: [[TMP79:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
	; CHECK-NEXT: [[TMP78:%.*]] = fadd <1 x double> [[TMP75]], [[TMP77]]			; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP79:%.*]] = shufflevector <1 x double> [[TMP78]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP80:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP80:%.*]] = shufflevector <3 x double> [[TMP70]], <3 x double> [[TMP79]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP80]], i32 0
	; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP81:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP81]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP82:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]			; CHECK-NEXT: [[TMP81:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
	; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP82:%.*]] = fadd <1 x double> [[TMP79]], [[TMP81]]
	; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]			; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]
	; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]			; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]
	; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP86:%.*]] = shufflevector <1 x double> [[TMP85]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP86:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP87:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP86]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP86]], i32 0			; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP88:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP88]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP87:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]			; CHECK-NEXT: [[TMP89:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
	; CHECK-NEXT: [[TMP88:%.*]] = fadd <1 x double> [[TMP85]], [[TMP87]]			; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP89:%.*]] = shufflevector <1 x double> [[TMP88]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP90:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP90:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP89]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP90]], i32 0
	; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP91:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP91]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP92:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]			; CHECK-NEXT: [[TMP91:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
	; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP92:%.*]] = fadd <1 x double> [[TMP89]], [[TMP91]]
	; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]			; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]
	; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]			; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]
	; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP96:%.*]] = shufflevector <1 x double> [[TMP95]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP96:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP97:%.*]] = shufflevector <3 x double> [[TMP87]], <3 x double> [[TMP96]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP96]], i32 0			; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP98:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP98]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP97:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]			; CHECK-NEXT: [[TMP99:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
	; CHECK-NEXT: [[TMP98:%.*]] = fadd <1 x double> [[TMP95]], [[TMP97]]			; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP99:%.*]] = shufflevector <1 x double> [[TMP98]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP100:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP100:%.*]] = shufflevector <3 x double> [[TMP90]], <3 x double> [[TMP99]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP100]], i32 0
	; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP101:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP101]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP102:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]			; CHECK-NEXT: [[TMP101:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
	; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP102:%.*]] = fadd <1 x double> [[TMP99]], [[TMP101]]
	; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]			; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]
	; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]			; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]
	; CHECK-NEXT: [[BLOCK84:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP106:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; Embed result of multiply into flat vector.
	; CHECK-NEXT: [[SPLAT_SPLATINSERT85:%.*]] = insertelement <1 x double> undef, double [[TMP106]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT86:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT85]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP106:%.*]] = shufflevector <1 x double> [[TMP105]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP107:%.*]] = fmul <1 x double> [[BLOCK84]], [[SPLAT_SPLAT86]]			; CHECK-NEXT: [[TMP107:%.*]] = shufflevector <3 x double> [[TMP97]], <3 x double> [[TMP106]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[TMP108:%.*]] = fadd <1 x double> [[TMP105]], [[TMP107]]			; CHECK-NEXT: [[TMP108:%.*]] = shufflevector <3 x double> [[TMP47]], <3 x double> [[TMP77]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <1 x double> [[TMP108]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <3 x double> [[TMP107]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <3 x double> [[TMP100]], <3 x double> [[TMP109]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <6 x double> [[TMP108]], <6 x double> [[TMP109]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[TMP111:%.*]] = shufflevector <3 x double> [[TMP50]], <3 x double> [[TMP80]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP112:%.*]] = shufflevector <3 x double> [[TMP110]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; Load %C and add result of multiply.
	; CHECK-NEXT: [[TMP113:%.*]] = shufflevector <6 x double> [[TMP111]], <6 x double> [[TMP112]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[C:%.*]] = load <9 x double>, <9 x double>* [[C_PTR:%.*]]			; CHECK-NEXT: [[C:%.*]] = load <9 x double>, <9 x double>* [[C_PTR:%.*]]
	; CHECK-NEXT: [[RES:%.*]] = fadd <9 x double> [[C]], [[TMP113]]			; CHECK-NEXT: [[RES:%.*]] = fadd <9 x double> [[C]], [[TMP110]]
	; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[C_PTR]]			; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[C_PTR]]
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%a = load <9 x double>, <9 x double>* %A.Ptr			%a = load <9 x double>, <9 x double>* %A.Ptr
	%b = load <9 x double>, <9 x double>* %B.Ptr			%b = load <9 x double>, <9 x double>* %B.Ptr
	%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)			%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)
	%mult = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)			%mult = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)
	%c = load <9 x double>, <9 x double>* %C.Ptr			%c = load <9 x double>, <9 x double>* %C.Ptr
	%res = fadd <9 x double> %c, %mult			%res = fadd <9 x double> %c, %mult

	store <9 x double> %res, <9 x double>* %C.Ptr			store <9 x double> %res, <9 x double>* %C.Ptr
	ret void			ret void
	}			}

llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
				; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s

				; Check that we do not emit shufflevectors to flatten the result of the
				; transpose and store the columns directly.
				define void @transpose_store(<8 x double> %a, <8 x double>* %Ptr) {
				; CHECK-LABEL: @transpose_store(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[A:%.*]], <8 x double> undef, <2 x i32> <i32 0, i32 1>
				; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 2, i32 3>
				; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 4, i32 5>
				; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 6, i32 7>
				; CHECK-NEXT: [[TMP0:%.*]] = extractelement <2 x double> [[SPLIT]], i64 0
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> undef, double [[TMP0]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 0
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP1]], double [[TMP2]], i64 1
				; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 0
				; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> [[TMP3]], double [[TMP4]], i64 2
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP5]], double [[TMP6]], i64 3
				; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[SPLIT]], i64 1
				; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x double> undef, double [[TMP8]], i64 0
				; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 1
				; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x double> [[TMP9]], double [[TMP10]], i64 1
				; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 1
				; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x double> [[TMP11]], double [[TMP12]], i64 2
				; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 1
				; CHECK-NEXT: [[TMP15:%.*]] = insertelement <4 x double> [[TMP13]], double [[TMP14]], i64 3
				; CHECK-NEXT: [[TMP16:%.*]] = bitcast <8 x double>* [[PTR:%.*]] to double*
				; CHECK-NEXT: [[TMP17:%.*]] = bitcast double* [[TMP16]] to <4 x double>*
				; CHECK-NEXT: store <4 x double> [[TMP7]], <4 x double>* [[TMP17]], align 8
				; CHECK-NEXT: [[TMP18:%.*]] = getelementptr double, double* [[TMP16]], i32 4
				; CHECK-NEXT: [[TMP19:%.*]] = bitcast double* [[TMP18]] to <4 x double>*
				; CHECK-NEXT: store <4 x double> [[TMP15]], <4 x double>* [[TMP19]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				anemet (inline comment): FileCheck is never executed with the SHAPE prefix.
				fhahn (author, inline reply): I've dropped those, but added an explanation of what we are checking.
				%c = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
				store <8 x double> %c, <8 x double>* %Ptr
				ret void
				}

				declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
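
The lowering that `@transpose_store` checks can be modelled outside LLVM. The following is only an illustrative Python sketch, not code from the patch: the helper names (`split_columns`, `transpose_columns`, `store_columns`) are hypothetical, and Python lists stand in for IR vectors and memory. It splits a column-major 2x4 matrix into columns, builds the two result columns of the 4x2 transpose, and writes each column with its own store at offsets 0 and 4, without first packing them into a flat vector.

```python
def split_columns(flat, rows, cols):
    """Split a column-major flat vector into a list of columns."""
    assert len(flat) == rows * cols
    return [flat[c * rows:(c + 1) * rows] for c in range(cols)]


def transpose_columns(columns):
    """Transpose a matrix given as columns; returns the columns of the result."""
    rows = len(columns[0])
    return [[col[r] for col in columns] for r in range(rows)]


def store_columns(memory, columns, base=0):
    """Write each column with a separate contiguous store (one per column)."""
    stride = len(columns[0])
    for i, col in enumerate(columns):
        start = base + i * stride
        memory[start:start + stride] = col


# A 2x4 matrix in column-major order, like %a in the test.
a = [float(i) for i in range(8)]
cols = split_columns(a, rows=2, cols=4)   # four columns of two elements
result_cols = transpose_columns(cols)     # two columns of four elements

memory = [None] * 8                       # stands in for the memory behind %Ptr
store_columns(memory, result_cols)
```

In this model `result_cols[0]` and `result_cols[1]` play the roles of `[[TMP7]]` and `[[TMP15]]`, and the two writes in `store_columns` correspond to the two `store <4 x double>` checks.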

llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
				; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s

				; Currently we only lower stores with shape information, but need to embed the
				; matrix in a flat vector for function calls and returns.
				define <8 x double> @strided_load_4x4(<8 x double> %in, <8 x double>* %Ptr) {
				; CHECK-LABEL: @strided_load_4x4(
				; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[IN:%.*]], <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[IN]], <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x double> [[SPLIT]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[TMP1]], i64 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP3]], i64 1
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x double> [[SPLIT]], i64 1
				; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> undef, double [[TMP5]], i64 0
				; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 1
				; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP6]], double [[TMP7]], i64 1
				; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x double> [[SPLIT]], i64 2
				; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x double> undef, double [[TMP9]], i64 0
				; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 2
				; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x double> [[TMP10]], double [[TMP11]], i64 1
				; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x double> [[SPLIT]], i64 3
				; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> undef, double [[TMP13]], i64 0
				; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 3
				; CHECK-NEXT: [[TMP16:%.*]] = insertelement <2 x double> [[TMP14]], double [[TMP15]], i64 1
				; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <2 x double> [[TMP12]], <2 x double> [[TMP16]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <4 x double> [[TMP17]], <4 x double> [[TMP18]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[TMP20:%.*]] = bitcast <8 x double>* [[PTR:%.*]] to double*
				; CHECK-NEXT: [[TMP21:%.*]] = bitcast double* [[TMP20]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP21]], align 8
				; CHECK-NEXT: [[TMP22:%.*]] = getelementptr double, double* [[TMP20]], i32 2
				; CHECK-NEXT: [[TMP23:%.*]] = bitcast double* [[TMP22]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP8]], <2 x double>* [[TMP23]], align 8
				; CHECK-NEXT: [[TMP24:%.*]] = getelementptr double, double* [[TMP20]], i32 4
				; CHECK-NEXT: [[TMP25:%.*]] = bitcast double* [[TMP24]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP12]], <2 x double>* [[TMP25]], align 8
				; CHECK-NEXT: [[TMP26:%.*]] = getelementptr double, double* [[TMP20]], i32 6
				; CHECK-NEXT: [[TMP27:%.*]] = bitcast double* [[TMP26]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP16]], <2 x double>* [[TMP27]], align 8
				; CHECK-NEXT: call void @foo(<8 x double> [[TMP19]])
				; CHECK-NEXT: ret <8 x double> [[TMP19]]
				;
				%transposed = call <8 x double> @llvm.matrix.transpose(<8 x double> %in, i32 4, i32 2)
				store <8 x double> %transposed, <8 x double>* %Ptr
				call void @foo(<8 x double> %transposed)
				ret <8 x double> %transposed
				}

				declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)

				declare void @foo(<8 x double>)
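
The mixed-users case can be sketched the same way. This is a hypothetical Python model, not the pass itself, and every helper name in it is made up for illustration. The shape-aware store consumes the result columns directly, while the call and the return, which have no shape-aware lowering yet, receive a flat vector built by concatenating the columns pairwise, mirroring the `[[TMP17]]`, `[[TMP18]]`, `[[TMP19]]` shuffles in the checks above.

```python
def split_columns(flat, rows, cols):
    """Split a column-major flat vector into a list of columns."""
    return [flat[c * rows:(c + 1) * rows] for c in range(cols)]


def transpose_columns(columns):
    """Transpose a matrix given as columns; returns the columns of the result."""
    return [[col[r] for col in columns] for r in range(len(columns[0]))]


def concat_pairwise(vectors):
    """Concatenate adjacent vectors repeatedly until one flat vector remains."""
    vectors = list(vectors)
    while len(vectors) > 1:
        merged = []
        for i in range(0, len(vectors) - 1, 2):
            merged.append(vectors[i] + vectors[i + 1])
        if len(vectors) % 2 == 1:
            merged.append(vectors[-1])    # odd vector out is carried to the next round
        vectors = merged
    return vectors[0]


# A 4x2 matrix in column-major order, like %in in the test.
matrix_in = [float(i) for i in range(8)]
columns = transpose_columns(split_columns(matrix_in, rows=4, cols=2))

# Shape-aware user: one store per result column (offsets 0, 2, 4, 6).
memory = [None] * 8
for i, col in enumerate(columns):
    memory[i * len(col):(i + 1) * len(col)] = col

# Users without shape information get the columns packed into a flat vector.
flat = concat_pairwise(columns)
```

In this model the per-column stores and the flat vector hold the same eight values in the same order, which is why both kinds of users can share one set of result columns.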