This is an archive of the discontinued LLVM Phabricator instance.

[Matrix] Add forward shape propagation and first shape aware lowerings.
ClosedPublic

Authored by fhahn on Dec 2 2019, 5:37 AM.

Details

Reviewers

anemet
Gerolf
hfinkel
andrew.w.kaylor
reames

Commits

rG109e4e3851e2: [Matrix] Add forward shape propagation and first shape aware lowerings.

Summary

This patch adds infrastructure for forward shape propagation to
LowerMatrixIntrinsics. It also updates the pass to make use of
the shape information to break up larger vector operations and to
eliminate unnecessary conversion operations between columnwise matrixes
and flattened vectors: if shape information is available for an
instruction, lower the operation to a set of instructions operating on
columns. For example, a store of a matrix is broken down into separate
stores for each column. For users that do not have shape
information (e.g. because they do not yet support shape information
aware lowering), we pack the result columns into a flat vector and
update those users.

It also adds shape aware lowering for the first non-intrinsic
instruction: vector stores.

Example:

For

%c  = call <4 x double> @llvm.matrix.transpose(<4 x double> %a, i32 2, i32 2)
store <4 x double> %c, <4 x double>* %Ptr

We generate the code below without shape propagation. Note %9 which
combines the columns of the transposed matrix into a flat vector.

%split = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%split1 = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%1 = extractelement <2 x double> %split, i64 0
%2 = insertelement <2 x double> undef, double %1, i64 0
%3 = extractelement <2 x double> %split1, i64 0
%4 = insertelement <2 x double> %2, double %3, i64 1
%5 = extractelement <2 x double> %split, i64 1
%6 = insertelement <2 x double> undef, double %5, i64 0
%7 = extractelement <2 x double> %split1, i64 1
%8 = insertelement <2 x double> %6, double %7, i64 1
%9 = shufflevector <2 x double> %4, <2 x double> %8, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
store <4 x double> %9, <4 x double>* %Ptr

With this patch, we propagate the 2x2 shape information from the
transpose to the store and we generate the code below. Note that we
store the columns directly and do not need an extra shuffle.

%9 = bitcast <4 x double>* %Ptr to double*
%10 = bitcast double* %9 to <2 x double>*
store <2 x double> %4, <2 x double>* %10, align 8
%11 = getelementptr double, double* %9, i32 2
%12 = bitcast double* %11 to <2 x double>*
store <2 x double> %8, <2 x double>* %12, align 8
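The same idea can be sketched outside of IR. The following is a minimal, self-contained C++ model of the example above: a flat column-major vector is split into columns, transposed on columns, and each result column is stored directly at its own offset instead of first being re-packed into a flat vector. The helper names (splitIntoColumns, transposeColumns, storeColumns) and the use of std::vector are assumptions made for illustration only; the patch itself operates on LLVM IR values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Hypothetical helper: split a flat column-major vector into NumColumns
// columns of NumRows elements each (the "split" shuffles in the IR above).
std::vector<Column> splitIntoColumns(const std::vector<double> &Flat,
                                     std::size_t NumRows,
                                     std::size_t NumColumns) {
  assert(Flat.size() == NumRows * NumColumns && "size must match the shape");
  std::vector<Column> Columns;
  for (std::size_t C = 0; C < NumColumns; ++C) {
    auto Begin = Flat.begin() + static_cast<std::ptrdiff_t>(C * NumRows);
    Columns.emplace_back(Begin, Begin + static_cast<std::ptrdiff_t>(NumRows));
  }
  return Columns;
}

// Hypothetical helper: transpose a matrix held as column vectors. Element R
// of input column C becomes element C of output column R.
std::vector<Column> transposeColumns(const std::vector<Column> &In) {
  std::size_t InCols = In.size();
  std::size_t InRows = InCols == 0 ? 0 : In[0].size();
  std::vector<Column> Out(InRows, Column(InCols));
  for (std::size_t C = 0; C < InCols; ++C)
    for (std::size_t R = 0; R < InRows; ++R)
      Out[R][C] = In[C][R];
  return Out;
}

// Hypothetical helper: store each column directly, with Stride elements
// between the starts of consecutive columns. No flat vector is rebuilt.
void storeColumns(const std::vector<Column> &Columns, double *Ptr,
                  std::size_t Stride) {
  for (std::size_t C = 0; C < Columns.size(); ++C)
    for (std::size_t R = 0; R < Columns[C].size(); ++R)
      Ptr[C * Stride + R] = Columns[C][R];
}
```

For a 2x2 matrix the stride equals the number of rows (2), which corresponds to the getelementptr offset of 2 in the shape-aware lowering above.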

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. · Dec 2 2019, 5:37 AM

Herald added a project: Restricted Project. · Dec 2 2019, 5:37 AM

Herald added a subscriber: hiraditya.

strip unnecessary test changes

Build result: FAILURE - Could not check out parent git hash "30959a9a1249e0d3b2f18c6622847da457308e49". It was not found in the repository. Did you configure the "Parent Revision" in Phabricator properly? Trying to apply the patch to the master branch instead...

ERROR: arc patch failed with error code 1. Check build log for details.
Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B41705: Diff 231684! · Dec 2 2019, 5:44 AM

Harbormaster failed remote builds in B41706: Diff 231685!

fhahn added a parent revision: D70456: [Matrix] Add first set of matrix intrinsics and initial lowering pass. · Dec 2 2019, 5:44 AM

fhahn added a child revision: D70898: [Matrix] Propagate and use shape info for binary operators.

fhahn added a child revision: D70899: [Matrix] Implement back-propagation of shape information. · Dec 2 2019, 5:49 AM

tschuett added a subscriber: tschuett. · Dec 2 2019, 5:52 AM

reames resigned from this revision. · Dec 2 2019, 4:49 PM

LuoYuanke added a subscriber: LuoYuanke. · Dec 2 2019, 7:05 PM

Also the existing test diffs are hard to read, please explain what's going on there.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
96–104	Needs an update explaining the shape propagation and its use.
167	Is this supposed to indicate empty or undefined? I am not sure that the implicit conversion is self-explanatory here.
192–193	Update comment
201–212	The comment says you returning the ColumnMatrix here but you're not.
257	I may be missing something but do these need the lambda?
517	Please add a comment of what's happening here.

Address Adam's comments

In D70897#1768320, @anemet wrote:

Also the existing test diffs are hard to read, please explain what's going on there.

I've added comments to break up the check lines

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

Harbormaster failed remote builds in B41856: Diff 232136! · Dec 4 2019, 8:17 AM

LuoYuanke added inline comments. · Dec 8 2019, 2:35 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
329	It seems only store instruction is propagated with the shape information. Why? Take below pseudo code for example. Are v2 and v3 propagated with the shape information?

v1 = matrix_columnwise_load(..., m, n)
v2 = max(v1, 0)
v3 = v1 / v2
store v3, ptr

fhahn marked an inline comment as done. · Dec 8 2019, 3:24 AM

fhahn added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
329	This patch mostly adds the infrastructure for the propagation and only uses it for store instructions. So in the example, the shape is not propagated by this patch. Additional support for loads (D70900), binary operators (D70898) and back-propagation (D70899) is added in follow-up commits, to make reviewing them more manageable. The whole patch series is linked in Phabricator (see the 'stack' section).

Please note that we could propagate shape information to more instructions, e.g. phis or selects. That can be added as a follow-up as well; it is just a matter of priorities (we found loads/stores/binary operators to be by far the most common operations in matrix expressions). Any help with covering more cases would be very welcome :)

LuoYuanke added inline comments. · Dec 8 2019, 4:33 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
329	Thank you for the reply. Do you propagate the shape information recursively? If the matrix is stored to memory, does the propagation break? Can we get the shape information when reloading the matrix from memory?

v1 = matrix_multiply(..., m, n, k)
store v1, ptr*
v2 = load ptr*

How do we pass the shape information across functions and return the shape from a function? Do you plan to have matrix as a first-class type?

v1 = matrix_multiply(..., m, n, k)
v2 = call foo(v1)

fhahn marked an inline comment as done. · Dec 8 2019, 7:00 AM

fhahn added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
329	> Thank you for reply. Do you propagate the shape information recursively?

It is propagated iteratively: once we have propagated shape information to an instruction, we add its users to the worklist. Later patches add back-propagation as well, and D70901 implements iteration until no new shape information can be discovered.

> If the matrix is stored to memory, does the propagation break? Can we get the shape information when reload the matrix from memory?

Currently yes, we do not propagate through memory instructions. Simple cases like the one above should not really show up, as such loads should be promoted to use the value directly. We could handle more involved cases by using MemorySSA/additional alias analysis. Currently that is not a high priority for us, but we would be happy to collaborate on that as well.

> How do we pass the shape information across function and return shape from function? Do you plan to have matrix as first class type?

Currently we do not propagate the shape information across function boundaries and we do not plan on proposing a dedicated matrix type. The original proposal was focused on a dedicated type, but it was decided to go with a more lightweight solution and potentially revisit the matrix type once there is a strong need. For propagating across function boundaries, one way to go about it would be to turn the lowering into a module pass.
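The iterative propagation described in this reply can be modelled with a small worklist loop. The sketch below is a simplified stand-in, not the pass itself: the Node, Kind and Shape types and the propagateForward function are hypothetical, instructions are identified by their index in a vector, and only a transpose (which seeds a shape with flipped dimensions) and a store (which takes the shape of its stored operand) are modelled. Any other instruction kind receives no shape, so propagation stops there, mirroring the behaviour of this patch for users that are not yet supported.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct Shape {
  unsigned Rows, Cols;
};

enum class Kind { Transpose, Store, Other };

// Hypothetical stand-in for an instruction. Rows/Cols are the shape
// arguments of a transpose, Operand is the index of the value operand,
// Users lists the indices of the instructions that use this one.
struct Node {
  Kind K;
  unsigned Rows, Cols;
  int Operand;
  std::vector<int> Users;
};

std::map<int, Shape> propagateForward(const std::vector<Node> &Nodes) {
  std::map<int, Shape> ShapeMap;
  std::vector<int> WorkList;

  // Seed the work list with instructions that carry their own shape.
  for (int I = 0; I < static_cast<int>(Nodes.size()); ++I)
    if (Nodes[I].K == Kind::Transpose)
      WorkList.push_back(I);

  while (!WorkList.empty()) {
    int I = WorkList.back();
    WorkList.pop_back();
    if (ShapeMap.count(I))
      continue; // Do not override an existing shape.

    const Node &N = Nodes[I];
    bool Propagate = false;
    if (N.K == Kind::Transpose) {
      ShapeMap[I] = {N.Cols, N.Rows}; // Transpose flips the dimensions.
      Propagate = true;
    } else if (N.K == Kind::Store) {
      // A store is lowered with the shape of the value it stores, if known.
      auto It = ShapeMap.find(N.Operand);
      if (It != ShapeMap.end())
        ShapeMap[I] = It->second;
    }

    // Only visit users once a shape was set for this instruction.
    if (Propagate)
      for (int U : N.Users)
        if (!ShapeMap.count(U))
          WorkList.push_back(U);
  }
  return ShapeMap;
}
```

In this model, a store that is only reachable through an unsupported instruction never gets a shape, which is the situation the question about v2 and v3 describes.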

Update after changing %stride.

clang-format: pass.

Build artifacts: console-log.txt, diff.json

Harbormaster failed remote builds in B42397: Diff 233628! · Dec 12 2019, 8:51 AM

ping

In D70897#1769062, @fhahn wrote:

In D70897#1768320, @anemet wrote:

Also the existing test diffs are hard to read, please explain what's going on there.

I've added comments to break up the check lines

Thanks for that. Can you also explain the nature of the changes with one example. I am assuming we're removing embedVectors/extractVectors, i.e. bunch of shuffles, pointing to one example would be useful.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
108	produced
113	all instruction that we have
122	Nice write-up!
529	nit: ShapeMap.count?
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
38	FileCheck is never executed with the SHAPE prefix.

anemet requested changes to this revision. · Dec 19 2019, 3:14 PM

This revision now requires changes to proceed. · Dec 19 2019, 3:14 PM

Address comments, thanks!

fhahn edited the summary of this revision. · Dec 19 2019, 4:43 PM

Unit tests: unknown.

clang-tidy: unknown.

clang-format: unknown.

Build artifacts: diff.json, console-log.txt

In D70897#1791861, @anemet wrote:

In D70897#1769062, @fhahn wrote:

In D70897#1768320, @anemet wrote:

Also the existing test diffs are hard to read, please explain what's going on there.

I've added comments to break up the check lines

Thanks for that. Can you also explain the nature of the changes with one example. I am assuming we're removing embedVectors/extractVectors, i.e. bunch of shuffles, pointing to one example would be useful.

I've updated the description of the patch to include an example.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
529	I usually prefer find(), but I guess for DenseMap it does not matter performance wise and I can change it before committing if you prefer.
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
38	I've dropped those, but added an explanation of what we are checking.

Harbormaster failed remote builds in B42802: Diff 234807! · Dec 19 2019, 4:49 PM

LGTM

This revision is now accepted and ready to land. · Dec 20 2019, 9:20 AM

Closed by commit rG109e4e3851e2: [Matrix] Add forward shape propagation and first shape aware lowerings. (authored by fhahn). · Dec 23 2019, 4:58 AM

This revision was automatically updated to reflect the committed changes.

fhahn removed a child revision: D70899: [Matrix] Implement back-propagation of shape information. · Jan 9 2020, 1:31 AM

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Scalar/

LowerMatrixIntrinsics.cpp

273 lines

test/

Transforms/

LowerMatrixIntrinsics/

bigger-expressions-double.ll

519 lines

propagate-forward.ll

75 lines

propagate-mixed-users.ll

56 lines

Diff 231685

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp

Show All 23 Lines
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
		#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"

using namespace llvm;		using namespace llvm;
		using namespace PatternMatch;

#define DEBUG_TYPE "lower-matrix-intrinsics"		#define DEBUG_TYPE "lower-matrix-intrinsics"

		static cl::opt<bool> EnableShapePropagation("matrix-propagate-shape",
		cl::init(true));

namespace {		namespace {
// Given a \p MatrixPtr for the in-memory representation of a matrix,		// Given a \p MatrixPtr for the in-memory representation of a matrix,
// compute the address of the element at index \p Row, \p Col. \p Offset is		// compute the address of the element at index \p Row, \p Col. \p Offset is
// the number of elements between the start of two consecutive columns, so		// the number of elements between the start of two consecutive columns, so
// start address of column 1 = start address of column 0 + \p Offset.		// start address of column 1 = start address of column 0 + \p Offset.
// To load a matrix from contiguous memory, set \p Offset to the number of		// To load a matrix from contiguous memory, set \p Offset to the number of
// rows. To load a sub-matrix, set \p Offset to the number of rows plus the		// rows. To load a sub-matrix, set \p Offset to the number of rows plus the
// stride between the end of a column and the start of the next one.		// stride between the end of a column and the start of the next one.
Show All 34 Lines	Value *EltPtr =
EltType, Offset, Builder);		EltType, Offset, Builder);

Type *ColumnType = VectorType::get(EltType, NumRows);		Type *ColumnType = VectorType::get(EltType, NumRows);
Type *ColumnPtrType = PointerType::get(		Type *ColumnPtrType = PointerType::get(
ColumnType, cast<PointerType>(Base->getType())->getAddressSpace());		ColumnType, cast<PointerType>(Base->getType())->getAddressSpace());
return Builder.CreatePointerCast(EltPtr, ColumnPtrType);		return Builder.CreatePointerCast(EltPtr, ColumnPtrType);
}		}

/// LowerMatrixIntrinsics contains the methods used to lower matrix intrinsics.		/// LowerMatrixIntrinsics contains the methods used to lower matrix intrinsics.
///		///
/// Currently, the lowering for each matrix intrinsic is done as follows:		/// Currently, the lowering for each matrix intrinsic is done as follows:
/// 1. Split the operand vectors containing an embedded matrix into a set of		/// 1. Split the operand vectors containing an embedded matrix into a set of
/// column vectors, based on the shape information from the intrinsic.		/// column vectors, based on the shape information from the intrinsic.
/// 2. Apply the transformation described by the intrinsic on the column		/// 2. Apply the transformation described by the intrinsic on the column
/// vectors, which yields a set of column vectors containing result matrix.		/// vectors, which yields a set of column vectors containing result matrix.
/// 3. Embed the columns of the result matrix in a flat vector and replace all		/// 3. Embed the columns of the result matrix in a flat vector and replace all
/// uses of the intrinsic result with it.		/// uses of the intrinsic result with it.
		anemet: Needs an update explaining the shape propagation and its use.
class LowerMatrixIntrinsics {		class LowerMatrixIntrinsics {
Function &Func;		Function &Func;
const DataLayout &DL;		const DataLayout &DL;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
		anemet: produced

/// Wrapper class representing a matrix as a set of column vectors.		/// Wrapper class representing a matrix as a set of column vectors.
/// All column vectors must have the same vector type.		/// All column vectors must have the same vector type.
class ColumnMatrixTy {		class ColumnMatrixTy {
SmallVector<Value *, 16> Columns;		SmallVector<Value *, 16> Columns;
		anemet: all instruction that we have

public:		public:
ColumnMatrixTy() : Columns() {}		ColumnMatrixTy() : Columns() {}
ColumnMatrixTy(ArrayRef<Value *> Cols)		ColumnMatrixTy(ArrayRef<Value *> Cols)
: Columns(Cols.begin(), Cols.end()) {}		: Columns(Cols.begin(), Cols.end()) {}

Value *getColumn(unsigned i) const { return Columns[i]; }		Value *getColumn(unsigned i) const { return Columns[i]; }

void setColumn(unsigned i, Value *V) { Columns[i] = V; }		void setColumn(unsigned i, Value *V) { Columns[i] = V; }
		anemet: Nice write-up!

size_t getNumColumns() const { return Columns.size(); }		size_t getNumColumns() const { return Columns.size(); }
		size_t getNumRows() const {
		assert(Columns.size() > 0);
		return cast<VectorType>(Columns[0]->getType())->getNumElements();
		}

const SmallVectorImpl<Value *> &getColumnVectors() const { return Columns; }		const SmallVectorImpl<Value *> &getColumnVectors() const { return Columns; }

SmallVectorImpl<Value *> &getColumnVectors() { return Columns; }		SmallVectorImpl<Value *> &getColumnVectors() { return Columns; }

void addColumn(Value *V) { Columns.push_back(V); }		void addColumn(Value *V) { Columns.push_back(V); }

iterator_range<SmallVector<Value *, 8>::iterator> columns() {		iterator_range<SmallVector<Value *, 8>::iterator> columns() {
Show All 13 Lines	struct ShapeInfo {
unsigned NumColumns;		unsigned NumColumns;

ShapeInfo(unsigned NumRows = 0, unsigned NumColumns = 0)		ShapeInfo(unsigned NumRows = 0, unsigned NumColumns = 0)
: NumRows(NumRows), NumColumns(NumColumns) {}		: NumRows(NumRows), NumColumns(NumColumns) {}

ShapeInfo(ConstantInt *NumRows, ConstantInt *NumColumns)		ShapeInfo(ConstantInt *NumRows, ConstantInt *NumColumns)
: NumRows(NumRows->getZExtValue()),		: NumRows(NumRows->getZExtValue()),
NumColumns(NumColumns->getZExtValue()) {}		NumColumns(NumColumns->getZExtValue()) {}

		bool operator==(const ShapeInfo &other) {
		return NumRows == other.NumRows && NumColumns == other.NumColumns;
		}
		bool operator!=(const ShapeInfo &other) { return !(*this == other); }

		operator bool() const {
		assert(NumRows == 0 || NumColumns != 0);
		return NumRows != 0;
		}
		anemet: Is this supposed to indicate empty or undefined? I am not sure that the implicit conversion is self-explanatory here.
};		};

		/// Maps instructions to their shape information. The shape information
		/// describes the shape to be used while lowering. This matches the shape of
		/// the result value of the instruction, with the only exceptions being store
		/// instructions and the matrix_columnwise_store intrinsics. For those, the
		/// shape information indicates that those instructions should be lowered
		/// using shape information as well.
		DenseMap<Value *, ShapeInfo> ShapeMap;

		/// List of instructions to remove. While lowering, we are not replacing all
		/// users of a lowered instruction, if shape information is available and
		/// those need to be removed after we finished lowering.
		SmallVector<Instruction *, 16> ToRemove;

		/// Map from instructions to their produced column matrix.
		DenseMap<Value *, ColumnMatrixTy> Inst2ColumnMatrix;

public:		public:
LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI)		LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI)
: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI) {}		: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI) {}

/// Return the set of column vectors that a matrix value is lowered to.		/// Return the set of column vectors that a matrix value is lowered to.
///		///
/// We split the flat vector \p MatrixVal containing a matrix with shape \p SI		/// We split the flat vector \p MatrixVal containing a matrix with shape \p SI
/// into column vectors.		/// into column vectors.
		anemet: Update comment
ColumnMatrixTy getMatrix(Value *MatrixVal, const ShapeInfo &SI,		ColumnMatrixTy getMatrix(Value *MatrixVal, const ShapeInfo &SI,
IRBuilder<> Builder) {		IRBuilder<> Builder) {
VectorType *VType = dyn_cast<VectorType>(MatrixVal->getType());		VectorType *VType = dyn_cast<VectorType>(MatrixVal->getType());
assert(VType && "MatrixVal must be a vector type");		assert(VType && "MatrixVal must be a vector type");
assert(VType->getNumElements() == SI.NumRows * SI.NumColumns &&		assert(VType->getNumElements() == SI.NumRows * SI.NumColumns &&
"The vector size must match the number of matrix elements");		"The vector size must match the number of matrix elements");

		// Check if we lowered MatrixVal using shape information. In that case,
		// return the existing column matrix.
		auto Found = Inst2ColumnMatrix.find(MatrixVal);
		if (Found != Inst2ColumnMatrix.end()) {
		ColumnMatrixTy &M = Found->second;
		if (SI.NumRows == M.getNumRows() && SI.NumColumns == M.getNumColumns())
		return M;

		MatrixVal = M.embedInVector(Builder);
		}

		// Otherwise split MatrixVal.
		anemet: The comment says you returning the ColumnMatrix here but you're not.
SmallVector<Value *, 16> SplitVecs;		SmallVector<Value *, 16> SplitVecs;
Value *Undef = UndefValue::get(VType);		Value *Undef = UndefValue::get(VType);

for (unsigned MaskStart = 0; MaskStart < VType->getNumElements();		for (unsigned MaskStart = 0; MaskStart < VType->getNumElements();
MaskStart += SI.NumRows) {		MaskStart += SI.NumRows) {
Constant *Mask = createSequentialMask(Builder, MaskStart, SI.NumRows, 0);		Constant *Mask = createSequentialMask(Builder, MaskStart, SI.NumRows, 0);
Value *V = Builder.CreateShuffleVector(MatrixVal, Undef, Mask, "split");		Value *V = Builder.CreateShuffleVector(MatrixVal, Undef, Mask, "split");
SplitVecs.push_back(V);		SplitVecs.push_back(V);
}		}

return {SplitVecs};		return {SplitVecs};
}		}

// Replace intrinsic calls		/// If \p V already has a known shape return false. Otherwise set the shape
bool VisitCallInst(CallInst *Inst) {		/// for instructions that support it.
if (!Inst->getCalledFunction() || !Inst->getCalledFunction()->isIntrinsic())		bool setShapeInfo(Value *V, std::function<ShapeInfo()> CreateShape) {
		if (isa<UndefValue>(V) || !supportsShapeInfo(V))
return false;		return false;

switch (Inst->getCalledFunction()->getIntrinsicID()) {		auto SIter = ShapeMap.find(V);
		if (SIter != ShapeMap.end()) {
		LLVM_DEBUG(dbgs() << " not overriding existing shape: "
		<< SIter->second.NumRows << " "
		<< SIter->second.NumColumns << " for " << *V << "\n");
		return false;
		}

		ShapeInfo Shape = CreateShape();
		ShapeMap.insert({V, Shape});
		LLVM_DEBUG(dbgs() << " " << Shape.NumRows << " x " << Shape.NumColumns
		<< " for " << *V << "\n");
		return true;
		}

		bool setShapeInfo(Value *V, Value *Rows, Value *Cols) {
		return setShapeInfo(V, [&]() -> ShapeInfo {
		return {(unsigned)cast<ConstantInt>(Rows)->getZExtValue(),
		(unsigned)cast<ConstantInt>(Cols)->getZExtValue()};
		});
		}

		bool setShapeInfo(Value *V, ShapeInfo Shape) {
		assert(Shape && "Shape not set");
		return setShapeInfo(V, [&]() { return Shape; });
		}

		anemet: I may be missing something but do these need the lambda?
		/// Returns true if shape information can be used for \p V. The supported
		/// instructions must match the instructions that can be lowered by this pass.
		bool supportsShapeInfo(Value *V) {
		Instruction *Inst = dyn_cast<Instruction>(V);
		if (!Inst)
		return false;

		IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst);
		if (II)
		switch (II->getIntrinsicID()) {
case Intrinsic::matrix_multiply:		case Intrinsic::matrix_multiply:
LowerMultiply(Inst);
break;
case Intrinsic::matrix_transpose:		case Intrinsic::matrix_transpose:
LowerTranspose(Inst);
break;
case Intrinsic::matrix_columnwise_load:		case Intrinsic::matrix_columnwise_load:
LowerColumnwiseLoad(Inst);
break;
case Intrinsic::matrix_columnwise_store:		case Intrinsic::matrix_columnwise_store:
LowerColumnwiseStore(Inst);		return true;
break;
default:		default:
return false;		return false;
}		}
Inst->eraseFromParent();		return isa<StoreInst>(Inst);
return true;		}

		/// Propagate the shape information of instructions to their users.
		void propagateShapeForward() {
		// The work list contains instructions for which we can compute the shape,
		// either based on the information provided by matrix intrinsics or known
		// shapes of operands.
		SmallVector<Instruction *, 8> WorkList;

		// Initialize the work list with ops carrying shape information. Initially
		// only the shape of matrix intrinsics is known.
		for (BasicBlock &BB : Func)
		for (Instruction &Inst : BB) {
		IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst);
		if (!II)
		continue;

		switch (II->getIntrinsicID()) {
		case Intrinsic::matrix_multiply:
		case Intrinsic::matrix_transpose:
		case Intrinsic::matrix_columnwise_load:
		case Intrinsic::matrix_columnwise_store:
		WorkList.push_back(&Inst);
		break;
		default:
		break;
		}
		}

		// Pop an element for which we guaranteed to have at least one of the
		// operand shapes. Add the shape for this and then add users to the work
		// list.
		LLVM_DEBUG(dbgs() << "Forward-propagate shapes:\n");
		while (!WorkList.empty()) {
		Instruction *Inst = WorkList.back();
		WorkList.pop_back();

		// New entry, set the value and insert operands
		bool Propagate = false;

		Value *MatrixA;
		Value *MatrixB;
		Value *M;
		Value *N;
		Value *K;
		if (match(Inst, m_Intrinsic<Intrinsic::matrix_multiply>(
		m_Value(MatrixA), m_Value(MatrixB), m_Value(M),
		m_Value(N), m_Value(K)))) {
		Propagate = setShapeInfo(Inst, M, K);
		} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_transpose>(
		m_Value(MatrixA), m_Value(M), m_Value(N)))) {
		// Flip dimensions.
		Propagate = setShapeInfo(Inst, N, M);
		} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_columnwise_store>(
		m_Value(MatrixA), m_Value(), m_Value(),
		m_Value(M), m_Value(N)))) {
		Propagate = setShapeInfo(Inst, N, M);
		} else if (match(Inst,
		m_Intrinsic<Intrinsic::matrix_columnwise_load>(
		m_Value(), m_Value(), m_Value(M), m_Value(N)))) {
		Propagate = setShapeInfo(Inst, M, N);
		} else if (match(Inst, m_Store(m_Value(MatrixA), m_Value()))) {
		auto OpShape = ShapeMap.find(MatrixA);
		if (OpShape != ShapeMap.end())
		setShapeInfo(Inst, OpShape->second);
		continue;
		}

		if (Propagate)
		for (auto *User : Inst->users())
		if (ShapeMap.count(User) == 0)
		WorkList.push_back(cast<Instruction>(User));
		}
}		}

bool Visit() {		bool Visit() {
		if (EnableShapePropagation)
		propagateShapeForward();

ReversePostOrderTraversal<Function *> RPOT(&Func);		ReversePostOrderTraversal<Function *> RPOT(&Func);
bool Changed = false;		bool Changed = false;
for (auto *BB : RPOT) {		for (auto *BB : RPOT) {
for (Instruction &Inst : make_early_inc_range(*BB)) {		for (Instruction &Inst : make_early_inc_range(*BB)) {
		IRBuilder<> Builder(&Inst);

if (CallInst *CInst = dyn_cast<CallInst>(&Inst))		if (CallInst *CInst = dyn_cast<CallInst>(&Inst))
Changed |= VisitCallInst(CInst);		Changed |= VisitCallInst(CInst);

		Value *Op1;
		Value *Op2;
		if (match(&Inst, m_Store(m_Value(Op1), m_Value(Op2))))
		Changed |= VisitStore(&Inst, Op1, Op2, Builder);
}		}
}		}

		for (Instruction *Inst : reverse(ToRemove))
		Inst->eraseFromParent();

return Changed;		return Changed;
}		}

LoadInst *createColumnLoad(Value *ColumnPtr, Type *EltType,		LoadInst *createColumnLoad(Value *ColumnPtr, Type *EltType,
IRBuilder<> Builder) {		IRBuilder<> Builder) {
unsigned Align = DL.getABITypeAlignment(EltType);		unsigned Align = DL.getABITypeAlignment(EltType);
return Builder.CreateAlignedLoad(ColumnPtr, Align);		return Builder.CreateAlignedLoad(ColumnPtr, Align);
}		}

StoreInst *createColumnStore(Value *ColumnValue, Value *ColumnPtr,		StoreInst *createColumnStore(Value *ColumnValue, Value *ColumnPtr,
Type *EltType, IRBuilder<> Builder) {		Type *EltType, IRBuilder<> Builder) {
unsigned Align = DL.getABITypeAlignment(EltType);		unsigned Align = DL.getABITypeAlignment(EltType);
return Builder.CreateAlignedStore(ColumnValue, ColumnPtr, Align);		return Builder.CreateAlignedStore(ColumnValue, ColumnPtr, Align);
}		}

		// Replace intrinsic calls
		bool VisitCallInst(CallInst *Inst) {
		if (!Inst->getCalledFunction() || !Inst->getCalledFunction()->isIntrinsic())
		return false;

		switch (Inst->getCalledFunction()->getIntrinsicID()) {
		case Intrinsic::matrix_multiply:
		LowerMultiply(Inst);
		break;
		case Intrinsic::matrix_transpose:
		LowerTranspose(Inst);
		break;
		case Intrinsic::matrix_columnwise_load:
		LowerColumnwiseLoad(Inst);
		break;
		case Intrinsic::matrix_columnwise_store:
		LowerColumnwiseStore(Inst);
		break;
		default:
		return false;
		}
		return true;
		}

/// Lowers llvm.matrix.columnwise.load.		/// Lowers llvm.matrix.columnwise.load.
///		///
/// The intrinsic loads a matrix from memory using a stride between columns.		/// The intrinsic loads a matrix from memory using a stride between columns.
void LowerColumnwiseLoad(CallInst *Inst) {		void LowerColumnwiseLoad(CallInst *Inst) {
IRBuilder<> Builder(Inst);		IRBuilder<> Builder(Inst);
Value *Ptr = Inst->getArgOperand(0);		Value *Ptr = Inst->getArgOperand(0);
Value *Stride = Inst->getArgOperand(1);		Value *Stride = Inst->getArgOperand(1);
auto VType = cast<VectorType>(Inst->getType());		auto VType = cast<VectorType>(Inst->getType());

ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(2)),		ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(2)),
cast<ConstantInt>(Inst->getArgOperand(3)));		cast<ConstantInt>(Inst->getArgOperand(3)));

ColumnMatrixTy Result;		ColumnMatrixTy Result;

// Distance between start of one column and the start of the next		// Distance between start of one column and the start of the next
for (unsigned C = 0, E = Shape.NumColumns; C < E; ++C) {		for (unsigned C = 0, E = Shape.NumColumns; C < E; ++C) {
Value *GEP =		Value *GEP =
computeColumnAddr(Ptr, 0, C, Stride, VType, Shape.NumRows, Builder);		computeColumnAddr(Ptr, 0, C, Stride, VType, Shape.NumRows, Builder);
Value *Column = createColumnLoad(GEP, VType->getElementType(), Builder);		Value *Column = createColumnLoad(GEP, VType->getElementType(), Builder);
Result.addColumn(Column);		Result.addColumn(Column);
}		}

Inst->replaceAllUsesWith(Result.embedInVector(Builder));		finalizeLowering(Inst, Result, Builder);
}		}
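As a rough illustration of the addressing that LowerColumnwiseLoad relies on, the sketch below loads a column-major matrix from a flat buffer one column at a time. It assumes the stride is measured in elements and that column C starts at offset C * Stride; `loadColumns` and its parameters are hypothetical names, and plain C++ containers stand in for IR values. This is not the pass's actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the pass: reads NumColumns columns of
// NumRows elements each from Base. Assumes Stride is counted in elements and
// is the distance between the start of one column and the start of the next.
std::vector<std::vector<double>> loadColumns(const double *Base,
                                             std::size_t Stride,
                                             unsigned NumRows,
                                             unsigned NumColumns) {
  std::vector<std::vector<double>> Columns;
  Columns.reserve(NumColumns);
  for (unsigned C = 0; C < NumColumns; ++C) {
    // Start of column C, skipping any padding between columns.
    const double *ColumnStart = Base + C * Stride;
    Columns.emplace_back(ColumnStart, ColumnStart + NumRows);
  }
  return Columns;
}
```

With a stride larger than the row count, the elements between the end of one column and the start of the next are simply skipped, which is why the lowering emits one load per column rather than a single wide load.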

/// Lowers llvm.matrix.columnwise.store.		/// Lowers llvm.matrix.columnwise.store.
///		///
/// The intrinsic stores a matrix back to memory using a stride between columns.		/// The intrinsic stores a matrix back to memory using a stride between columns.
void LowerColumnwiseStore(CallInst *Inst) {		void LowerColumnwiseStore(CallInst *Inst) {
IRBuilder<> Builder(Inst);		IRBuilder<> Builder(Inst);
Value *Matrix = Inst->getArgOperand(0);		Value *Matrix = Inst->getArgOperand(0);
Value *Ptr = Inst->getArgOperand(1);		Value *Ptr = Inst->getArgOperand(1);
Value *Stride = Inst->getArgOperand(2);		Value *Stride = Inst->getArgOperand(2);
ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(3)),		ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(3)),
cast<ConstantInt>(Inst->getArgOperand(4)));		cast<ConstantInt>(Inst->getArgOperand(4)));

auto VType = cast<VectorType>(Matrix->getType());		auto VType = cast<VectorType>(Matrix->getType());

auto LM = getMatrix(Matrix, Shape, Builder);		auto LM = getMatrix(Matrix, Shape, Builder);

for (auto C : enumerate(LM.columns())) {		for (auto C : enumerate(LM.columns())) {
Value *GEP = computeColumnAddr(Ptr, 0, C.index(), Stride, VType,		Value *GEP = computeColumnAddr(Ptr, 0, C.index(), Stride, VType,
Shape.NumRows, Builder);		Shape.NumRows, Builder);
createColumnStore(C.value(), GEP, VType->getElementType(), Builder);		createColumnStore(C.value(), GEP, VType->getElementType(), Builder);
}		}

		ToRemove.push_back(Inst);
}		}

/// Extract a column vector of \p NumElts starting at index (\p I, \p J) from		/// Extract a column vector of \p NumElts starting at index (\p I, \p J) from
/// the matrix \p LM represented as a vector of column vectors.		/// the matrix \p LM represented as a vector of column vectors.
Value *extractVector(const ColumnMatrixTy &LM, unsigned I, unsigned J,		Value *extractVector(const ColumnMatrixTy &LM, unsigned I, unsigned J,
unsigned NumElts, IRBuilder<> Builder) {		unsigned NumElts, IRBuilder<> Builder) {
Value *Col = LM.getColumn(J);		Value *Col = LM.getColumn(J);
Value *Undef = UndefValue::get(Col->getType());		Value *Undef = UndefValue::get(Col->getType());
[39 lines not shown] context: Value *createMulAdd(Value *Sum, Value *A, Value *B, bool UseFPOp,
IRBuilder<> &Builder) {		IRBuilder<> &Builder) {
Value *Mul = UseFPOp ? Builder.CreateFMul(A, B) : Builder.CreateMul(A, B);		Value *Mul = UseFPOp ? Builder.CreateFMul(A, B) : Builder.CreateMul(A, B);
if (!Sum)		if (!Sum)
return Mul;		return Mul;

return UseFPOp ? Builder.CreateFAdd(Sum, Mul) : Builder.CreateAdd(Sum, Mul);		return UseFPOp ? Builder.CreateFAdd(Sum, Mul) : Builder.CreateAdd(Sum, Mul);
}		}

		void finalizeLowering(Instruction *Inst, ColumnMatrixTy Matrix,
		IRBuilder<> &Builder) {
		[Inline comment, marked done] anemet: Please add a comment of what's happening here.
		Inst2ColumnMatrix.insert(std::make_pair(Inst, Matrix));

		ToRemove.push_back(Inst);
		Value *Flattened = nullptr;
		for (auto I = Inst->use_begin(), E = Inst->use_end(); I != E;) {
		Use &U = *I++;
		if (ShapeMap.find(U.getUser()) == ShapeMap.end()) {
		if (!Flattened)
		Flattened = Matrix.embedInVector(Builder);
		U.set(Flattened);
		}
		}
		[Inline comment, not done] anemet: nit: ShapeMap.count?
		[Inline reply, marked done] fhahn (author): I usually prefer find(), but I guess for DenseMap it does not matter performance-wise, and I can change it before committing if you prefer.
		}
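For readers following the inline discussion, here is a minimal sketch of what finalizeLowering appears to do with the users of a lowered instruction. It is a model with assumptions: strings stand in for user instructions, `std::unordered_map` stands in for the pass's ShapeMap, and `FinalizeResult` and `finalizeLoweringSketch` are hypothetical names. Users without shape information are redirected to a flattened vector that is built lazily, at most once. The membership test uses `count()`, which for this purpose is equivalent to comparing `find()` against `end()`.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the pass's data structures; not LLVM types.
struct Shape {
  unsigned NumRows;
  unsigned NumColumns;
};
using Column = std::vector<double>;

// Models ColumnMatrixTy::embedInVector: concatenates columns into one vector.
std::vector<double> embedInVector(const std::vector<Column> &Columns) {
  std::vector<double> Flat;
  for (const Column &C : Columns)
    Flat.insert(Flat.end(), C.begin(), C.end());
  return Flat;
}

struct FinalizeResult {
  std::vector<std::string> FlattenedUsers; // users redirected to the flat vector
  bool Materialized = false;               // whether the flat vector was built
  std::vector<double> Flat;
};

FinalizeResult
finalizeLoweringSketch(const std::vector<Column> &Matrix,
                       const std::vector<std::string> &Users,
                       const std::unordered_map<std::string, Shape> &ShapeMap) {
  FinalizeResult R;
  for (const std::string &U : Users) {
    // Same condition as ShapeMap.find(U) == ShapeMap.end().
    if (ShapeMap.count(U) == 0) {
      if (!R.Materialized) {
        R.Flat = embedInVector(Matrix); // build the flat vector only on demand
        R.Materialized = true;
      }
      R.FlattenedUsers.push_back(U);
    }
  }
  return R;
}
```

If every user has shape information, no flattened vector is created at all, which is the saving the patch summary describes.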
/// Lowers llvm.matrix.multiply.		/// Lowers llvm.matrix.multiply.
void LowerMultiply(CallInst *MatMul) {		void LowerMultiply(CallInst *MatMul) {
IRBuilder<> Builder(MatMul);		IRBuilder<> Builder(MatMul);
auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();		auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();
ShapeInfo LShape(cast<ConstantInt>(MatMul->getArgOperand(2)),		ShapeInfo LShape(cast<ConstantInt>(MatMul->getArgOperand(2)),
cast<ConstantInt>(MatMul->getArgOperand(3)));		cast<ConstantInt>(MatMul->getArgOperand(3)));
ShapeInfo RShape(cast<ConstantInt>(MatMul->getArgOperand(3)),		ShapeInfo RShape(cast<ConstantInt>(MatMul->getArgOperand(3)),
cast<ConstantInt>(MatMul->getArgOperand(4)));		cast<ConstantInt>(MatMul->getArgOperand(4)));
[33 lines not shown] context: for (unsigned J = 0; J < C; ++J) {
Value *RH = Builder.CreateExtractElement(Rhs.getColumn(J), K);		Value *RH = Builder.CreateExtractElement(Rhs.getColumn(J), K);
Value *Splat = Builder.CreateVectorSplat(BlockSize, RH, "splat");		Value *Splat = Builder.CreateVectorSplat(BlockSize, RH, "splat");
Sum = createMulAdd(Sum, L, Splat, EltType->isFloatingPointTy(),		Sum = createMulAdd(Sum, L, Splat, EltType->isFloatingPointTy(),
Builder);		Builder);
}		}
Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));		Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));
}		}
}		}
		finalizeLowering(MatMul, Result, Builder);
MatMul->replaceAllUsesWith(Result.embedInVector(Builder));
}		}
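The visible part of LowerMultiply extracts element K of Rhs column J, splats it, and multiply-adds it with a block of Lhs column K. A scalar model of that scheme might look like the sketch below. It assumes Lhs is R x M and Rhs is M x C, both held as vectors of columns; `multiplyColumnMajor` is a hypothetical name, the block size is effectively the whole column, and the accumulator starts at zero instead of using the first product directly as createMulAdd does.

```cpp
#include <cstddef>
#include <vector>

using Column = std::vector<double>;
using ColumnMatrix = std::vector<Column>; // one entry per column

// Hypothetical scalar model of the lowering. Result column J is the sum over
// K of Lhs column K scaled by element K of Rhs column J.
ColumnMatrix multiplyColumnMajor(const ColumnMatrix &Lhs,
                                 const ColumnMatrix &Rhs, std::size_t R) {
  const std::size_t M = Lhs.size(); // columns of Lhs == rows of Rhs
  const std::size_t C = Rhs.size(); // columns of Rhs and of the result
  ColumnMatrix Result(C, Column(R, 0.0));
  for (std::size_t J = 0; J < C; ++J) {
    for (std::size_t K = 0; K < M; ++K) {
      const double Splat = Rhs[J][K]; // the value broadcast across the block
      for (std::size_t I = 0; I < R; ++I)
        Result[J][I] += Lhs[K][I] * Splat;
    }
  }
  return Result;
}
```

Because the result is produced column by column, it can be handed to shape-aware users directly, without first being packed into a flat vector.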

/// Lowers llvm.matrix.transpose.		/// Lowers llvm.matrix.transpose.
void LowerTranspose(CallInst *Inst) {		void LowerTranspose(CallInst *Inst) {
ColumnMatrixTy Result;		ColumnMatrixTy Result;
IRBuilder<> Builder(Inst);		IRBuilder<> Builder(Inst);
Value *InputVal = Inst->getArgOperand(0);		Value *InputVal = Inst->getArgOperand(0);
VectorType *VectorTy = cast<VectorType>(InputVal->getType());		VectorType *VectorTy = cast<VectorType>(InputVal->getType());
[13 lines not shown] context: for (unsigned Row = 0; Row < ArgShape.NumRows; ++Row) {
// We insert at index Column since that is the row index after the		// We insert at index Column since that is the row index after the
// transpose.		// transpose.
ResultColumn =		ResultColumn =
Builder.CreateInsertElement(ResultColumn, Elt, C.index());		Builder.CreateInsertElement(ResultColumn, Elt, C.index());
}		}
Result.addColumn(ResultColumn);		Result.addColumn(ResultColumn);
}		}

Inst->replaceAllUsesWith(Result.embedInVector(Builder));		finalizeLowering(Inst, Result, Builder);
		}
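The comment in LowerTranspose about inserting at the column index can be checked with a small model. In the sketch below, which uses plain vectors in place of IR values and a hypothetical `transposeColumnMajor` name, each input row becomes one result column, and the element taken from input column C lands at index C of that result column.

```cpp
#include <cstddef>
#include <vector>

using Column = std::vector<double>;
using ColumnMatrix = std::vector<Column>; // one entry per column

// Hypothetical model of the transpose lowering on a matrix stored as columns.
// NumRows is the number of rows of the input matrix.
ColumnMatrix transposeColumnMajor(const ColumnMatrix &Input,
                                  std::size_t NumRows) {
  ColumnMatrix Result;
  Result.reserve(NumRows);
  for (std::size_t Row = 0; Row < NumRows; ++Row) {
    Column ResultColumn(Input.size());
    for (std::size_t C = 0; C < Input.size(); ++C) {
      // The input column index becomes the row index after the transpose.
      ResultColumn[C] = Input[C][Row];
    }
    Result.push_back(ResultColumn);
  }
  return Result;
}
```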

		bool VisitStore(Value *Inst, Value *StoredVal, Value *Ptr,
		IRBuilder<> &Builder) {
		auto I = ShapeMap.find(StoredVal);
		if (I == ShapeMap.end())
		return false;
		ShapeInfo &SI = I->second;

		VectorType *MType = cast<VectorType>(StoredVal->getType());
		auto Matrix = getMatrix(StoredVal, SI, Builder);
		SmallVector<Value *, 8>
		OpMapping; // Map the store operation to the column stores
		OpMapping.reserve(SI.NumColumns);

		for (auto E : enumerate(Matrix.columns())) {
		Value *ColumnGEP = computeColumnAddr(
		Ptr, 0, E.index(), Builder.getInt32(0), MType, SI.NumRows, Builder);
		StoreInst *StoreInst = createColumnStore(
		E.value(), ColumnGEP, MType->getElementType(), Builder);
		OpMapping.push_back(StoreInst);
		}

		ToRemove.push_back(cast<Instruction>(Inst));
		return true;
}		}
};		};
} // namespace		} // namespace

PreservedAnalyses LowerMatrixIntrinsicsPass::run(Function &F,		PreservedAnalyses LowerMatrixIntrinsicsPass::run(Function &F,
FunctionAnalysisManager &AM) {		FunctionAnalysisManager &AM) {
auto &TTI = AM.getResult<TargetIRAnalysis>(F);		auto &TTI = AM.getResult<TargetIRAnalysis>(F);
LowerMatrixIntrinsics LMT(F, TTI);		LowerMatrixIntrinsics LMT(F, TTI);
[43 lines not shown]

llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll

	[23 lines not shown]
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1			; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2			; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2			; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2
	; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> [[TMP11]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>			; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <6 x double> [[TMP18]], <6 x double> [[TMP19]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>			; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
	; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP18]], i32 0
	; CHECK-NEXT: [[SPLIT6:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT7:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLIT8:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP21:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP21]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP22:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]			; CHECK-NEXT: [[TMP19:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
	; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[BLOCK6:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
				; CHECK-NEXT: [[SPLAT_SPLATINSERT7:%.*]] = insertelement <1 x double> undef, double [[TMP20]], i32 0
				; CHECK-NEXT: [[SPLAT_SPLAT8:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT7]], <1 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP21:%.*]] = fmul <1 x double> [[BLOCK6]], [[SPLAT_SPLAT8]]
				; CHECK-NEXT: [[TMP22:%.*]] = fadd <1 x double> [[TMP19]], [[TMP21]]
				; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]			; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]
	; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]			; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]
	; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <1 x double> [[TMP25]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP26:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP26]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP26]], i32 0			; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP28:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP28]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP27:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]			; CHECK-NEXT: [[TMP29:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
	; CHECK-NEXT: [[TMP28:%.*]] = fadd <1 x double> [[TMP25]], [[TMP27]]			; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <1 x double> [[TMP28]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP30:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP29]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP30]], i32 0
	; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP31:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP31]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP32:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]			; CHECK-NEXT: [[TMP31:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
	; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP32:%.*]] = fadd <1 x double> [[TMP29]], [[TMP31]]
	; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]			; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]
	; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]			; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]
	; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <1 x double> [[TMP35]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP36:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <3 x double> [[TMP27]], <3 x double> [[TMP36]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP36]], i32 0			; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP38:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP38]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP37:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]			; CHECK-NEXT: [[TMP39:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
	; CHECK-NEXT: [[TMP38:%.*]] = fadd <1 x double> [[TMP35]], [[TMP37]]			; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <1 x double> [[TMP38]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP40:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <3 x double> [[TMP30]], <3 x double> [[TMP39]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP40]], i32 0
	; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP41:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP41]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP42:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]			; CHECK-NEXT: [[TMP41:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
	; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP42:%.*]] = fadd <1 x double> [[TMP39]], [[TMP41]]
	; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]			; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]
	; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]			; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]
	; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <1 x double> [[TMP45]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP46:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <3 x double> [[TMP37]], <3 x double> [[TMP46]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP46]], i32 0			; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP48:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP48]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP47:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]			; CHECK-NEXT: [[TMP49:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
	; CHECK-NEXT: [[TMP48:%.*]] = fadd <1 x double> [[TMP45]], [[TMP47]]			; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP49:%.*]] = shufflevector <1 x double> [[TMP48]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP50:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <3 x double> [[TMP40]], <3 x double> [[TMP49]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP50]], i32 0
	; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP51:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP51]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP52:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]			; CHECK-NEXT: [[TMP51:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
	; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP52:%.*]] = fadd <1 x double> [[TMP49]], [[TMP51]]
	; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]			; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]
	; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]			; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]
	; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP56:%.*]] = shufflevector <1 x double> [[TMP55]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP56:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP57:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP56]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP56]], i32 0			; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP58:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP58]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP57:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]			; CHECK-NEXT: [[TMP59:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
	; CHECK-NEXT: [[TMP58:%.*]] = fadd <1 x double> [[TMP55]], [[TMP57]]			; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <1 x double> [[TMP58]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP60:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP59]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP60]], i32 0
	; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP61:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP61]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP62:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]			; CHECK-NEXT: [[TMP61:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
	; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP62:%.*]] = fadd <1 x double> [[TMP59]], [[TMP61]]
	; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]			; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]
	; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]			; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]
	; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP66:%.*]] = shufflevector <1 x double> [[TMP65]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP66:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <3 x double> [[TMP57]], <3 x double> [[TMP66]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP66]], i32 0			; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP68:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP68]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP67:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]			; CHECK-NEXT: [[TMP69:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
	; CHECK-NEXT: [[TMP68:%.*]] = fadd <1 x double> [[TMP65]], [[TMP67]]			; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <1 x double> [[TMP68]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP70:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP70:%.*]] = shufflevector <3 x double> [[TMP60]], <3 x double> [[TMP69]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP70]], i32 0
	; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP71:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP71]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP72:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]			; CHECK-NEXT: [[TMP71:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
	; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP72:%.*]] = fadd <1 x double> [[TMP69]], [[TMP71]]
	; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]			; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]
	; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]			; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]
	; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <1 x double> [[TMP75]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP76:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP77:%.*]] = shufflevector <3 x double> [[TMP67]], <3 x double> [[TMP76]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP76]], i32 0			; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP78:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP78]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP77:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]			; CHECK-NEXT: [[TMP79:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
	; CHECK-NEXT: [[TMP78:%.*]] = fadd <1 x double> [[TMP75]], [[TMP77]]			; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP79:%.*]] = shufflevector <1 x double> [[TMP78]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP80:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP80:%.*]] = shufflevector <3 x double> [[TMP70]], <3 x double> [[TMP79]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP80]], i32 0
	; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP81:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP81]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP82:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]			; CHECK-NEXT: [[TMP81:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
	; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP82:%.*]] = fadd <1 x double> [[TMP79]], [[TMP81]]
	; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]			; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]
	; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]			; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]
	; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP86:%.*]] = shufflevector <1 x double> [[TMP85]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP86:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP87:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP86]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP86]], i32 0			; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP88:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP88]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP87:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]			; CHECK-NEXT: [[TMP89:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
	; CHECK-NEXT: [[TMP88:%.*]] = fadd <1 x double> [[TMP85]], [[TMP87]]			; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP89:%.*]] = shufflevector <1 x double> [[TMP88]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP90:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP90:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP89]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP90]], i32 0
	; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP91:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP91]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP92:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]			; CHECK-NEXT: [[TMP91:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
	; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP92:%.*]] = fadd <1 x double> [[TMP89]], [[TMP91]]
	; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]			; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]
	; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]			; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]
	; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP96:%.*]] = shufflevector <1 x double> [[TMP95]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP96:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP97:%.*]] = shufflevector <3 x double> [[TMP87]], <3 x double> [[TMP96]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP96]], i32 0			; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP98:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP98]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP97:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]			; CHECK-NEXT: [[TMP99:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
	; CHECK-NEXT: [[TMP98:%.*]] = fadd <1 x double> [[TMP95]], [[TMP97]]			; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP99:%.*]] = shufflevector <1 x double> [[TMP98]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP100:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP100:%.*]] = shufflevector <3 x double> [[TMP90]], <3 x double> [[TMP99]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP100]], i32 0
	; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP101:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP101]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP102:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]			; CHECK-NEXT: [[TMP101:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
	; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP102:%.*]] = fadd <1 x double> [[TMP99]], [[TMP101]]
	; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]			; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]
	; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]			; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]
	; CHECK-NEXT: [[BLOCK84:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP106:%.*]] = shufflevector <1 x double> [[TMP105]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP106:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP107:%.*]] = shufflevector <3 x double> [[TMP97]], <3 x double> [[TMP106]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT85:%.*]] = insertelement <1 x double> undef, double [[TMP106]], i32 0			; CHECK-NEXT: [[TMP108:%.*]] = bitcast <9 x double>* [[C_PTR:%.*]] to double*
	; CHECK-NEXT: [[SPLAT_SPLAT86:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT85]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP109:%.*]] = bitcast double* [[TMP108]] to <3 x double>*
	; CHECK-NEXT: [[TMP107:%.*]] = fmul <1 x double> [[BLOCK84]], [[SPLAT_SPLAT86]]			; CHECK-NEXT: store <3 x double> [[TMP47]], <3 x double>* [[TMP109]], align 8
	; CHECK-NEXT: [[TMP108:%.*]] = fadd <1 x double> [[TMP105]], [[TMP107]]			; CHECK-NEXT: [[TMP110:%.*]] = bitcast <9 x double>* [[C_PTR]] to double*
	; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <1 x double> [[TMP108]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP111:%.*]] = getelementptr double, double* [[TMP110]], i32 3
	; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <3 x double> [[TMP100]], <3 x double> [[TMP109]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[TMP112:%.*]] = bitcast double* [[TMP111]] to <3 x double>*
	; CHECK-NEXT: [[TMP111:%.*]] = shufflevector <3 x double> [[TMP50]], <3 x double> [[TMP80]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>			; CHECK-NEXT: store <3 x double> [[TMP77]], <3 x double>* [[TMP112]], align 8
	; CHECK-NEXT: [[TMP112:%.*]] = shufflevector <3 x double> [[TMP110]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP113:%.*]] = bitcast <9 x double>* [[C_PTR]] to double*
	; CHECK-NEXT: [[TMP113:%.*]] = shufflevector <6 x double> [[TMP111]], <6 x double> [[TMP112]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>			; CHECK-NEXT: [[TMP114:%.*]] = getelementptr double, double* [[TMP113]], i32 6
	; CHECK-NEXT: store <9 x double> [[TMP113]], <9 x double>* [[C_PTR:%.*]]			; CHECK-NEXT: [[TMP115:%.*]] = bitcast double* [[TMP114]] to <3 x double>*
				; CHECK-NEXT: store <3 x double> [[TMP107]], <3 x double>* [[TMP115]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%a = load <9 x double>, <9 x double>* %A.Ptr			%a = load <9 x double>, <9 x double>* %A.Ptr
	%b = load <9 x double>, <9 x double>* %B.Ptr			%b = load <9 x double>, <9 x double>* %B.Ptr
	%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)			%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)
	%c = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)			%c = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)
	store <9 x double> %c, <9 x double>* %C.Ptr			store <9 x double> %c, <9 x double>* %C.Ptr
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <3 x double> [[TMP9]], double [[TMP10]], i64 2
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <3 x double> [[SPLIT]], i64 2
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <3 x double> undef, double [[TMP12]], i64 0
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 2
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1			; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2			; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2			; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2
	; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> [[TMP11]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>			; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <6 x double> [[TMP18]], <6 x double> [[TMP19]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>			; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>			; CHECK-NEXT: [[TMP18:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
	; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>			; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP18]], i32 0
	; CHECK-NEXT: [[SPLIT6:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
	; CHECK-NEXT: [[SPLIT7:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLIT8:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP21:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP21]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP22:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]			; CHECK-NEXT: [[TMP19:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
	; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[BLOCK6:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[TMP20:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
				; CHECK-NEXT: [[SPLAT_SPLATINSERT7:%.*]] = insertelement <1 x double> undef, double [[TMP20]], i32 0
				; CHECK-NEXT: [[SPLAT_SPLAT8:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT7]], <1 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP21:%.*]] = fmul <1 x double> [[BLOCK6]], [[SPLAT_SPLAT8]]
				; CHECK-NEXT: [[TMP22:%.*]] = fadd <1 x double> [[TMP19]], [[TMP21]]
				; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]			; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]
	; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]			; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]
	; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <1 x double> [[TMP25]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP26:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP26]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP26]], i32 0			; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP28:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP28]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP27:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]			; CHECK-NEXT: [[TMP29:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
	; CHECK-NEXT: [[TMP28:%.*]] = fadd <1 x double> [[TMP25]], [[TMP27]]			; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <1 x double> [[TMP28]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP30:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP29]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP30]], i32 0
	; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP31:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP31]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP32:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]			; CHECK-NEXT: [[TMP31:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
	; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP32:%.*]] = fadd <1 x double> [[TMP29]], [[TMP31]]
	; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]			; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]
	; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]			; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]
	; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <1 x double> [[TMP35]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP36:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <3 x double> [[TMP27]], <3 x double> [[TMP36]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP36]], i32 0			; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP38:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP38]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP37:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]			; CHECK-NEXT: [[TMP39:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
	; CHECK-NEXT: [[TMP38:%.*]] = fadd <1 x double> [[TMP35]], [[TMP37]]			; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <1 x double> [[TMP38]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP40:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
	; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <3 x double> [[TMP30]], <3 x double> [[TMP39]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP40]], i32 0
	; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP41:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP41]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP42:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]			; CHECK-NEXT: [[TMP41:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
	; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP42:%.*]] = fadd <1 x double> [[TMP39]], [[TMP41]]
	; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1			; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]			; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]
	; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]			; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]
	; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <1 x double> [[TMP45]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP46:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2			; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <3 x double> [[TMP37]], <3 x double> [[TMP46]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP46]], i32 0			; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP48:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP48]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP47:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]			; CHECK-NEXT: [[TMP49:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
	; CHECK-NEXT: [[TMP48:%.*]] = fadd <1 x double> [[TMP45]], [[TMP47]]			; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP49:%.*]] = shufflevector <1 x double> [[TMP48]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP50:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <3 x double> [[TMP40]], <3 x double> [[TMP49]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP50]], i32 0
	; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP51:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP51]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP52:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]			; CHECK-NEXT: [[TMP51:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
	; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP52:%.*]] = fadd <1 x double> [[TMP49]], [[TMP51]]
	; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]			; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]
	; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]			; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]
	; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP56:%.*]] = shufflevector <1 x double> [[TMP55]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP56:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP57:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP56]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP56]], i32 0			; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP58:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP58]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP57:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]			; CHECK-NEXT: [[TMP59:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
	; CHECK-NEXT: [[TMP58:%.*]] = fadd <1 x double> [[TMP55]], [[TMP57]]			; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <1 x double> [[TMP58]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP60:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP59]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP60]], i32 0
	; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP61:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP61]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP62:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]			; CHECK-NEXT: [[TMP61:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
	; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP62:%.*]] = fadd <1 x double> [[TMP59]], [[TMP61]]
	; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]			; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]
	; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]			; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]
	; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP66:%.*]] = shufflevector <1 x double> [[TMP65]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP66:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <3 x double> [[TMP57]], <3 x double> [[TMP66]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP66]], i32 0			; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP68:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP68]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP67:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]			; CHECK-NEXT: [[TMP69:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
	; CHECK-NEXT: [[TMP68:%.*]] = fadd <1 x double> [[TMP65]], [[TMP67]]			; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <1 x double> [[TMP68]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP70:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
	; CHECK-NEXT: [[TMP70:%.*]] = shufflevector <3 x double> [[TMP60]], <3 x double> [[TMP69]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP70]], i32 0
	; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP71:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP71]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP72:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]			; CHECK-NEXT: [[TMP71:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
	; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP72:%.*]] = fadd <1 x double> [[TMP69]], [[TMP71]]
	; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1			; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]			; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]
	; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]			; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]
	; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <1 x double> [[TMP75]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP76:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2			; CHECK-NEXT: [[TMP77:%.*]] = shufflevector <3 x double> [[TMP67]], <3 x double> [[TMP76]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP76]], i32 0			; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP78:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP78]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP77:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]			; CHECK-NEXT: [[TMP79:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
	; CHECK-NEXT: [[TMP78:%.*]] = fadd <1 x double> [[TMP75]], [[TMP77]]			; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP79:%.*]] = shufflevector <1 x double> [[TMP78]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP80:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP80:%.*]] = shufflevector <3 x double> [[TMP70]], <3 x double> [[TMP79]], <3 x i32> <i32 0, i32 1, i32 3>			; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP80]], i32 0
	; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP81:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP81]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP82:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]			; CHECK-NEXT: [[TMP81:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
	; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP82:%.*]] = fadd <1 x double> [[TMP79]], [[TMP81]]
	; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]			; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]
	; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]			; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]
	; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP86:%.*]] = shufflevector <1 x double> [[TMP85]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP86:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP87:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP86]], <3 x i32> <i32 3, i32 1, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP86]], i32 0			; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP88:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP88]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP87:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]			; CHECK-NEXT: [[TMP89:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
	; CHECK-NEXT: [[TMP88:%.*]] = fadd <1 x double> [[TMP85]], [[TMP87]]			; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP89:%.*]] = shufflevector <1 x double> [[TMP88]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP90:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP90:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP89]], <3 x i32> <i32 3, i32 1, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP90]], i32 0
	; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
	; CHECK-NEXT: [[TMP91:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP91]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP92:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]			; CHECK-NEXT: [[TMP91:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
	; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP92:%.*]] = fadd <1 x double> [[TMP89]], [[TMP91]]
	; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
				; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]			; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]
	; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]			; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]
	; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>			; CHECK-NEXT: [[TMP96:%.*]] = shufflevector <1 x double> [[TMP95]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP96:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP97:%.*]] = shufflevector <3 x double> [[TMP87]], <3 x double> [[TMP96]], <3 x i32> <i32 0, i32 3, i32 2>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP96]], i32 0			; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP98:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
				; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP98]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP97:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]			; CHECK-NEXT: [[TMP99:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
	; CHECK-NEXT: [[TMP98:%.*]] = fadd <1 x double> [[TMP95]], [[TMP97]]			; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP99:%.*]] = shufflevector <1 x double> [[TMP98]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP100:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
	; CHECK-NEXT: [[TMP100:%.*]] = shufflevector <3 x double> [[TMP90]], <3 x double> [[TMP99]], <3 x i32> <i32 0, i32 3, i32 2>			; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP100]], i32 0
	; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
	; CHECK-NEXT: [[TMP101:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
	; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP101]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP102:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]			; CHECK-NEXT: [[TMP101:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
	; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP102:%.*]] = fadd <1 x double> [[TMP99]], [[TMP101]]
	; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1			; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
				; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
	; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0			; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0
	; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]			; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]
	; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]			; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]
	; CHECK-NEXT: [[BLOCK84:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>			; CHECK-NEXT: [[TMP106:%.*]] = shufflevector <1 x double> [[TMP105]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP106:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2			; CHECK-NEXT: [[TMP107:%.*]] = shufflevector <3 x double> [[TMP97]], <3 x double> [[TMP106]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[SPLAT_SPLATINSERT85:%.*]] = insertelement <1 x double> undef, double [[TMP106]], i32 0			; CHECK-NEXT: [[TMP108:%.*]] = shufflevector <3 x double> [[TMP47]], <3 x double> [[TMP77]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[SPLAT_SPLAT86:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT85]], <1 x double> undef, <1 x i32> zeroinitializer			; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <3 x double> [[TMP107]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP107:%.*]] = fmul <1 x double> [[BLOCK84]], [[SPLAT_SPLAT86]]			; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <6 x double> [[TMP108]], <6 x double> [[TMP109]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[TMP108:%.*]] = fadd <1 x double> [[TMP105]], [[TMP107]]
	; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <1 x double> [[TMP108]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <3 x double> [[TMP100]], <3 x double> [[TMP109]], <3 x i32> <i32 0, i32 1, i32 3>
	; CHECK-NEXT: [[TMP111:%.*]] = shufflevector <3 x double> [[TMP50]], <3 x double> [[TMP80]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
	; CHECK-NEXT: [[TMP112:%.*]] = shufflevector <3 x double> [[TMP110]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[TMP113:%.*]] = shufflevector <6 x double> [[TMP111]], <6 x double> [[TMP112]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
	; CHECK-NEXT: [[C:%.*]] = load <9 x double>, <9 x double>* [[C_PTR:%.*]]			; CHECK-NEXT: [[C:%.*]] = load <9 x double>, <9 x double>* [[C_PTR:%.*]]
	; CHECK-NEXT: [[RES:%.*]] = fadd <9 x double> [[C]], [[TMP113]]			; CHECK-NEXT: [[RES:%.*]] = fadd <9 x double> [[C]], [[TMP110]]
	; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[C_PTR]]			; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[C_PTR]]
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%a = load <9 x double>, <9 x double>* %A.Ptr			%a = load <9 x double>, <9 x double>* %A.Ptr
	%b = load <9 x double>, <9 x double>* %B.Ptr			%b = load <9 x double>, <9 x double>* %B.Ptr
	%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)			%a.trans = call <9 x double> @llvm.matrix.transpose(<9 x double> %a, i32 3, i32 3)
	%mult = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)			%mult = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a.trans, <9 x double> %b, i32 3, i32 3, i32 3)
	%c = load <9 x double>, <9 x double>* %C.Ptr			%c = load <9 x double>, <9 x double>* %C.Ptr
	%res = fadd <9 x double> %c, %mult			%res = fadd <9 x double> %c, %mult

	store <9 x double> %res, <9 x double>* %C.Ptr			store <9 x double> %res, <9 x double>* %C.Ptr
	ret void			ret void
	}			}
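For orientation, the value this test computes can be modelled outside of LLVM. The following is a minimal Python sketch, not part of the patch: it assumes the column-major flat-vector layout used by the matrix intrinsics, and the helper names `transpose`, `multiply`, and `transpose_multiply_add` are hypothetical. It only describes the expected numeric result; it says nothing about the instruction sequence the pass emits.

```python
def transpose(flat, rows, cols):
    """Transpose a rows x cols column-major matrix; the result is cols x rows."""
    assert len(flat) == rows * cols
    # Result column r holds row r of the input, read across the input columns.
    return [flat[c * rows + r] for r in range(rows) for c in range(cols)]


def multiply(a, b, m, k, n):
    """Multiply an m x k matrix by a k x n matrix, both column-major."""
    assert len(a) == m * k and len(b) == k * n
    result = []
    for j in range(n):        # result column
        for i in range(m):    # result row
            result.append(sum(a[p * m + i] * b[j * k + p] for p in range(k)))
    return result


def transpose_multiply_add(a, b, c):
    """Model of the test body for 3 x 3 operands: c + transpose(a) * b."""
    mult = multiply(transpose(a, 3, 3), b, 3, 3, 3)
    return [x + y for x, y in zip(c, mult)]


if __name__ == "__main__":
    identity = [1, 0, 0, 0, 1, 0, 0, 0, 1]
    print(transpose_multiply_add(list(range(1, 10)), identity, [0] * 9))
```

In the updated CHECK lines on the right, the three result columns of the multiply are concatenated into one `<9 x double>` value before the `fadd`, which matches the flat list returned by `multiply` in this model.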

llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
				; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s


				define void @transpose_store(<8 x double> %a, <8 x double>* %Ptr) {
				; CHECK-LABEL: @transpose_store(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[A:%.*]], <8 x double> undef, <2 x i32> <i32 0, i32 1>
				; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 2, i32 3>
				; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 4, i32 5>
				; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 6, i32 7>
				; CHECK-NEXT: [[TMP0:%.*]] = extractelement <2 x double> [[SPLIT]], i64 0
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> undef, double [[TMP0]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 0
				; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP1]], double [[TMP2]], i64 1
				; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 0
				; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> [[TMP3]], double [[TMP4]], i64 2
				; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 0
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP5]], double [[TMP6]], i64 3
				; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[SPLIT]], i64 1
				; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x double> undef, double [[TMP8]], i64 0
				; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 1
				; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x double> [[TMP9]], double [[TMP10]], i64 1
				; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 1
				; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x double> [[TMP11]], double [[TMP12]], i64 2
				; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 1
				; CHECK-NEXT: [[TMP15:%.*]] = insertelement <4 x double> [[TMP13]], double [[TMP14]], i64 3
				; CHECK-NEXT: [[TMP16:%.*]] = bitcast <8 x double>* [[PTR:%.*]] to double*
				; CHECK-NEXT: [[TMP17:%.*]] = bitcast double* [[TMP16]] to <4 x double>*
				; CHECK-NEXT: store <4 x double> [[TMP7]], <4 x double>* [[TMP17]], align 8
				; CHECK-NEXT: [[TMP18:%.*]] = bitcast <8 x double>* [[PTR]] to double*
				; CHECK-NEXT: [[TMP19:%.*]] = getelementptr double, double* [[TMP18]], i32 4
				; CHECK-NEXT: [[TMP20:%.*]] = bitcast double* [[TMP19]] to <4 x double>*
				; CHECK-NEXT: store <4 x double> [[TMP15]], <4 x double>* [[TMP20]], align 8
				; CHECK-NEXT: ret void
				;
				; SHAPE-LABEL: @transpose(
				anemet: FileCheck is never executed with the SHAPE prefix.
				fhahn (author): I've dropped those, but added an explanation of what we are checking.
				; SHAPE-NEXT: entry:
				; SHAPE-NEXT: [[A:%.*]] = load <8 x double>, <8 x double>* [[PTR_A:%.*]], align 16
				; SHAPE-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 0, i32 1>
				; SHAPE-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 2, i32 3>
				; SHAPE-NEXT: [[SPLIT2:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 4, i32 5>
				; SHAPE-NEXT: [[SPLIT3:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 6, i32 7>
				; SHAPE-NEXT: [[TMP0:%.*]] = extractelement <2 x double> [[SPLIT]], i64 0
				; SHAPE-NEXT: [[TMP1:%.*]] = insertelement <4 x double> undef, double [[TMP0]], i64 0
				; SHAPE-NEXT: [[TMP2:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 0
				; SHAPE-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP1]], double [[TMP2]], i64 1
				; SHAPE-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 0
				; SHAPE-NEXT: [[TMP5:%.*]] = insertelement <4 x double> [[TMP3]], double [[TMP4]], i64 2
				; SHAPE-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 0
				; SHAPE-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP5]], double [[TMP6]], i64 3
				; SHAPE-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[SPLIT]], i64 1
				; SHAPE-NEXT: [[TMP9:%.*]] = insertelement <4 x double> undef, double [[TMP8]], i64 0
				; SHAPE-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 1
				; SHAPE-NEXT: [[TMP11:%.*]] = insertelement <4 x double> [[TMP9]], double [[TMP10]], i64 1
				; SHAPE-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 1
				; SHAPE-NEXT: [[TMP13:%.*]] = insertelement <4 x double> [[TMP11]], double [[TMP12]], i64 2
				; SHAPE-NEXT: [[TMP14:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 1
				; SHAPE-NEXT: [[TMP15:%.*]] = insertelement <4 x double> [[TMP13]], double [[TMP14]], i64 3
				; SHAPE-NEXT: [[TMP16:%.*]] = bitcast <8 x double>* [[PTR_B:%.*]] to double*
				; SHAPE-NEXT: [[TMP17:%.*]] = bitcast double* [[TMP16]] to <4 x double>*
				; SHAPE-NEXT: store <4 x double> [[TMP7]], <4 x double>* [[TMP17]], align 8
				; SHAPE-NEXT: [[TMP18:%.*]] = bitcast <8 x double>* [[PTR_B]] to double*
				; SHAPE-NEXT: [[TMP19:%.*]] = getelementptr double, double* [[TMP18]], i32 4
				; SHAPE-NEXT: [[TMP20:%.*]] = bitcast double* [[TMP19]] to <4 x double>*
				; SHAPE-NEXT: store <4 x double> [[TMP15]], <4 x double>* [[TMP20]], align 8
				; SHAPE-NEXT: ret void
				entry:
				%c = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
				store <8 x double> %c, <8 x double>* %Ptr
				ret void
				}

				declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
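The CHECK lines for `@transpose_store` can be read as three steps: split the flat input into four columns of two elements, build the two columns of the transposed 4 x 2 matrix, and store each column separately (the second at an offset of 4 doubles). The following is a minimal Python sketch of those steps under the assumption of a column-major layout. It is an illustration, not the pass's implementation, and the helper names `split_columns`, `transpose_columns`, and `store_columns` are hypothetical.

```python
def split_columns(flat, rows, cols):
    """Split a column-major flat vector into a list of columns."""
    assert len(flat) == rows * cols
    return [flat[c * rows:(c + 1) * rows] for c in range(cols)]


def transpose_columns(columns):
    """Return the columns of the transposed matrix.

    Column r of the result collects element r of every input column.
    """
    rows = len(columns[0])
    return [[col[r] for col in columns] for r in range(rows)]


def store_columns(memory, base, columns):
    """Write each column with its own store, one column length apart."""
    stride = len(columns[0])
    for i, col in enumerate(columns):
        start = base + i * stride
        memory[start:start + stride] = col


if __name__ == "__main__":
    a = [float(i) for i in range(8)]      # stands in for %a
    cols = split_columns(a, 2, 4)         # four <2 x double> splits
    transposed = transpose_columns(cols)  # two <4 x double> columns
    memory = [None] * 8                   # stands in for the storage behind %Ptr
    store_columns(memory, 0, transposed)
    print(memory)
```

Because the store is lowered per column here, the transposed columns never need to be concatenated back into a single `<8 x double>` value in this model.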

llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
				; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s

				; Currently we only lower stores with shape information, but need to embed the
				; matrix in a flat vector for function calls and returns.
				define <8 x double> @strided_load_4x4(<8 x double> %in, <8 x double>* %Ptr) {
				; CHECK-LABEL: @strided_load_4x4(
				; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[IN:%.*]], <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[IN]], <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x double> [[SPLIT]], i64 0
				; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[TMP1]], i64 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP3]], i64 1
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x double> [[SPLIT]], i64 1
				; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> undef, double [[TMP5]], i64 0
				; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 1
				; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP6]], double [[TMP7]], i64 1
				; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x double> [[SPLIT]], i64 2
				; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x double> undef, double [[TMP9]], i64 0
				; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 2
				; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x double> [[TMP10]], double [[TMP11]], i64 1
				; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x double> [[SPLIT]], i64 3
				; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> undef, double [[TMP13]], i64 0
				; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 3
				; CHECK-NEXT: [[TMP16:%.*]] = insertelement <2 x double> [[TMP14]], double [[TMP15]], i64 1
				; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <2 x double> [[TMP12]], <2 x double> [[TMP16]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <4 x double> [[TMP17]], <4 x double> [[TMP18]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK-NEXT: [[TMP20:%.*]] = bitcast <8 x double>* [[PTR:%.*]] to double*
				; CHECK-NEXT: [[TMP21:%.*]] = bitcast double* [[TMP20]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP21]], align 8
				; CHECK-NEXT: [[TMP22:%.*]] = bitcast <8 x double>* [[PTR]] to double*
				; CHECK-NEXT: [[TMP23:%.*]] = getelementptr double, double* [[TMP22]], i32 2
				; CHECK-NEXT: [[TMP24:%.*]] = bitcast double* [[TMP23]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP8]], <2 x double>* [[TMP24]], align 8
				; CHECK-NEXT: [[TMP25:%.*]] = bitcast <8 x double>* [[PTR]] to double*
				; CHECK-NEXT: [[TMP26:%.*]] = getelementptr double, double* [[TMP25]], i32 4
				; CHECK-NEXT: [[TMP27:%.*]] = bitcast double* [[TMP26]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP12]], <2 x double>* [[TMP27]], align 8
				; CHECK-NEXT: [[TMP28:%.*]] = bitcast <8 x double>* [[PTR]] to double*
				; CHECK-NEXT: [[TMP29:%.*]] = getelementptr double, double* [[TMP28]], i32 6
				; CHECK-NEXT: [[TMP30:%.*]] = bitcast double* [[TMP29]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP16]], <2 x double>* [[TMP30]], align 8
				; CHECK-NEXT: call void @foo(<8 x double> [[TMP19]])
				; CHECK-NEXT: ret <8 x double> [[TMP19]]
				;
				%transposed = call <8 x double> @llvm.matrix.transpose(<8 x double> %in, i32 4, i32 2)
				store <8 x double> %transposed, <8 x double>* %Ptr
				call void @foo(<8 x double> %transposed)
				ret <8 x double> %transposed
				}

				declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)

				declare void @foo(<8 x double>)