This is an archive of the discontinued LLVM Phabricator instance.

[Matrix] Add initial tiling for load/multiply/store chains.
ClosedPublic

Authored by fhahn on Mar 3 2020, 1:43 PM.


Details

Reviewers

anemet
Gerolf
hfinkel
andrew.w.kaylor
LuoYuanke

Commits

rGd1fed7081d80: [Matrix] Add initial tiling for load/multiply/store chains.

Summary

This patch adds initial fusion for load/multiply/store chains of matrix
operations.

The patch contains roughly two parts:

1. Code generation for a fused load/multiply/store chain (LowerMatrixMultiplyFused).

First, we ensure that both loads of the multiply operands do not alias the store. If they do, we create new non-aliasing copies of the operands. Note that this may introduce new basic blocks. Finally, we process TileSize x TileSize blocks. That is: load tiles from the input operands, multiply and store them.

2. Identify fusion candidates & matrix instructions.

As a first step, collect all instructions with shape info and fusion candidates (currently @llvm.matrix.multiply calls). Next, try to fuse candidates and collect instructions eliminated by fusion. Finally, iterate over all matrix instructions, skip the ones eliminated by fusion and lower the rest as usual.
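The tiling step described in the summary can be illustrated outside of LLVM. The following is a minimal sketch, not the patch's code: it works on plain column-major arrays instead of emitting IR, and the `Matrix` struct and `multiplyTiled` function are hypothetical names introduced only for this illustration. The loop nest mirrors the structure in the diff: iterate over result tiles, accumulate over the shared dimension tile by tile, then store the finished tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major matrix: element (Row, Col) lives at
// Data[Col * Rows + Row].
struct Matrix {
  unsigned Rows, Cols;
  std::vector<double> Data;
  Matrix(unsigned R, unsigned C)
      : Rows(R), Cols(C), Data(std::size_t(R) * C, 0.0) {}
  double &at(unsigned R, unsigned C) { return Data[std::size_t(C) * Rows + R]; }
  double at(unsigned R, unsigned C) const {
    return Data[std::size_t(C) * Rows + R];
  }
};

// Computes A * B one (at most) TileSize x TileSize result tile at a time.
Matrix multiplyTiled(const Matrix &A, const Matrix &B, unsigned TileSize) {
  assert(A.Cols == B.Rows && "inner dimensions must match");
  assert(TileSize > 0 && "tile size must be positive");
  const unsigned R = A.Rows, M = A.Cols, C = B.Cols;
  Matrix Result(R, C);

  for (unsigned J = 0; J < C; J += TileSize)
    for (unsigned I = 0; I < R; I += TileSize) {
      // Tiles at the matrix edge may be smaller than TileSize.
      const unsigned TileR = std::min(R - I, TileSize);
      const unsigned TileC = std::min(C - J, TileSize);
      Matrix Acc(TileR, TileC); // zero-initialized accumulator tile

      for (unsigned K = 0; K < M; K += TileSize) {
        const unsigned TileM = std::min(M - K, TileSize);
        // "Load" a TileR x TileM tile of A and a TileM x TileC tile of B
        // and accumulate their product into Acc.
        for (unsigned Col = 0; Col < TileC; ++Col)
          for (unsigned Inner = 0; Inner < TileM; ++Inner)
            for (unsigned Row = 0; Row < TileR; ++Row)
              Acc.at(Row, Col) +=
                  A.at(I + Row, K + Inner) * B.at(K + Inner, J + Col);
      }

      // "Store" the finished tile into the result.
      for (unsigned Col = 0; Col < TileC; ++Col)
        for (unsigned Row = 0; Row < TileR; ++Row)
          Result.at(I + Row, J + Col) = Acc.at(Row, Col);
    }
  return Result;
}
```

Because each result tile is fully accumulated before it is written, only one tile of each operand and one accumulator tile need to be live at a time, which is the motivation for fusing the loads, the multiply and the store.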

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. · Mar 3 2020, 1:43 PM

Herald added a project: Restricted Project. · Mar 3 2020, 1:43 PM

Herald added subscribers: llvm-commits, tschuett, hiraditya.

fhahn edited the summary of this revision. · Mar 3 2020, 1:46 PM

fhahn added reviewers: anemet, Gerolf, hfinkel, andrew.w.kaylor, LuoYuanke.

fhahn retitled this revision from "[Matrix] Add initial tiling for multiplies." to "[Matrix] Add initial tiling for load/multiply/store chains." · Mar 3 2020, 1:48 PM

fhahn added parent revisions: D75565: [Matrix] Move multiply-add code generation into separate function (NFC)., D75564: [Matrix] Hoist load/store generation logic, add helpers for tiled access..

Strip outdated comment.

Harbormaster failed remote builds in B47964: Diff 248015! · Mar 3 2020, 2:24 PM

Harbormaster completed remote builds in B47968: Diff 248022. · Mar 3 2020, 3:03 PM

Use correct strides for loads/stores and ensure we always move the instructions after the multiply/store to a new BB.

Harbormaster completed remote builds in B48115: Diff 248324. · Mar 4 2020, 4:20 PM

ping

Can you please describe the approach in the description/in a comment?

Rebased and fixed a small merge error.

fhahn mentioned this in D76325: [Matrix] Add option to use row-major matrix layout as default. · Mar 20 2020, 4:37 AM

fhahn added a child revision: D76325: [Matrix] Add option to use row-major matrix layout as default. · Mar 20 2020, 4:37 AM

Harbormaster failed remote builds in B49862: Diff 251599! · Mar 20 2020, 4:51 AM

fhahn edited the summary of this revision. · Mar 20 2020, 8:20 AM

A couple of small code simplifications: use range based iterator over BB, use pattern match for @llvm.matrix.multiply.

Harbormaster failed remote builds in B49896: Diff 251654! · Mar 20 2020, 9:11 AM

I have a few specific comments below but overall it would be great if we could simplify VisitBBFusion to avoid recursion and invalidating the iterator...

This patch adds initial fusion for load/multiply/store chains of matrix
operations.

The patch contains roughly two parts:

Code generation for a fused load/multiply/store chain (LowerMatrixMultiplyFused).
First, we ensure that both loads of the multiply operands do not alias.

You mean any of the loads and the store?

If they do, we create new non-aliasing copies of the operands. Note that this may introduce new basic block. Then we split the block containing the multiply at the multiply,

At the multiply or at the store?

to simplify processing by returning the remainder of the original block to continue analysis (see 2.). Finally we process TileSize x TileSize blocks, that is, load tiles from the input operands, multiply and store them.

Identify fusion candidates.
To identify candidates for fusion, we look for @llvm.matrix.multiply with operands that are loads and a single use of the result in a store. To avoid generating unnecessary code for loads that later on get fused, we do a first pass over the function and only try fusing instructions, while keeping track of all other instructions with shape information in the function. We continue with the regular code generation for the remaining instructions with shape information after finishing fusion.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
544	Document MatrixInsts
551	The name NextBB is strange here. It suggests that that is next one we are going to iterate to. Sounds like this is more like a NewBB?
561–563	Why is this necessary now? We weren't doing this before during traversal. When Touched is true we always return so I think Touched is always false here, no?
607	Needs comment on the logic here, what is returned in MatrixInsts, etc.
1100	updated LoweredMatrix
1102	Document the BB returned

In D75566#1937436, @anemet wrote:

I have a few specific comments below but overall it would be great if we could simplify VisitBBFusion to avoid recursion and invalidating the iterator...

Thanks Adam!

I've split the code into 3 parts:

1. Collect all instructions with shape information and fusion candidates.
2. Iterate over fusion candidates and try to fuse. Also collect the set of instructions completely eliminated by fusion.
3. Iterate over all instructions with shape info, skip the ones eliminated by fusion and lower the rest.
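The three-part structure listed above can be sketched without any LLVM types. This is an illustrative model only, assuming a toy `Inst` record and a `lowerFunction` driver that are both hypothetical names invented here; the real pass works on `Instruction *` values and emits IR rather than log strings.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR instruction.
struct Inst {
  std::string Name;
  bool HasShape;   // carries matrix shape information
  bool IsMultiply; // fusion candidate
  // Indices of the loads and the store forming a fusable chain with this
  // multiply; empty if the multiply cannot be fused.
  std::vector<std::size_t> Chain;
};

// Returns a log of the actions taken, in order.
std::vector<std::string> lowerFunction(const std::vector<Inst> &Insts) {
  // Part 1: collect instructions with shape info and fusion candidates.
  std::vector<std::size_t> MatrixInsts, Candidates;
  for (std::size_t I = 0; I < Insts.size(); ++I) {
    if (!Insts[I].HasShape)
      continue;
    MatrixInsts.push_back(I);
    if (Insts[I].IsMultiply)
      Candidates.push_back(I);
  }

  // Part 2: try to fuse candidates and remember everything that fusion
  // eliminated, so it is not lowered a second time.
  std::vector<std::string> Log;
  std::set<std::size_t> Fused;
  for (std::size_t Idx : Candidates) {
    if (Insts[Idx].Chain.empty())
      continue; // not part of a load/multiply/store chain
    Log.push_back("fuse " + Insts[Idx].Name);
    Fused.insert(Idx);
    Fused.insert(Insts[Idx].Chain.begin(), Insts[Idx].Chain.end());
  }

  // Part 3: lower the remaining matrix instructions as usual.
  for (std::size_t Idx : MatrixInsts) {
    if (Fused.count(Idx))
      continue;
    Log.push_back("lower " + Insts[Idx].Name);
  }
  return Log;
}
```

Separating collection from fusion means no container is mutated while it is being traversed, which is the property the reviewer asked for when suggesting to avoid recursion and iterator invalidation.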

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
544	Code is gone.
551	Code is gone.
561–563	Code is gone.
607	Code is gone.
1102	The return value is gone.

fhahn edited the summary of this revision. · Mar 24 2020, 8:32 AM

Harbormaster failed remote builds in B50262: Diff 252334! · Mar 24 2020, 9:07 AM

ping

Very nice!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
52–54	Add cl::desc
952	Please add a comment explaining the code that is generated below this point

anemet requested changes to this revision. · Mar 29 2020, 12:32 PM

This revision now requires changes to proceed. · Mar 29 2020, 12:32 PM

Thanks Adam.

Added cl::desc to new options and also cl::Hidden. Added comments to getNonAliasingPointer.

Adjust bb numbers in tests after change to creating BBs first.

Harbormaster failed remote builds in B50879: Diff 253448! · Mar 29 2020, 1:23 PM

Harbormaster failed remote builds in B50882: Diff 253453! · Mar 29 2020, 2:28 PM

LuoYuanke added inline comments. · Mar 29 2020, 8:25 PM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
1000	It seems the cost of Copy is not added to the cost model.
1010	I'm confused about line 1034 and 1035. Should it be this? There is no edge from Fusion to Copy and from Copy to Check1. DTUpdates.push_back({DT.Insert, Copy, Fusion}); DTUpdates.push_back({DT.Insert, Check1, Copy});
1082	It seems the tile size should be 2 dimension. 1 for row, and 1 for column.

Remove unnecessary updates, use DT::dominates instead of OrderedInstructions, now that DT::dominates uses the recently committed BB-local numbering.

fhahn added inline comments. · Mar 30 2020, 3:44 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
1000	Yes, currently the cost model is kept simple, because we are initially focusing on bringing up the code generation and are trying to make sure the matrix intrinsics are applicable to a wide range of use cases. My current priority is progressing the clang patches, adding support for row-major matrices and running the lowering pass by default on IR containing matrix intrinsics. I've added a few additional TODOs to improve the cost modeling. Currently we only automatically fuse operations if we would run out of registers without fusion, so it only kicks in for matrix sizes where the cost of copying should be negligible. But this should definitely be improved in the future. Not sure when we will get to it though, and any contributions in that direction would be very welcome.
1010	Right, those updates are not needed, which is much clearer after grouping all the update code together. It seems like the verification does not catch those unnecessary updates.
1082	Agreed, but as mentioned above I think using square tiles is a good compromise to bring up the infrastructure. After the initial commit, it should also be easier for other people to work on improving the tiling.
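The register-pressure heuristic mentioned in the reply above ("only fuse if we would run out of registers without fusion") can be modelled as a small standalone function. This is a sketch that follows the arithmetic of `isFusionProfitable` in the diff, but the free-standing signature, the `vectorFactor` helper and the explicit `NumVectorRegs` parameter are assumptions made for illustration; the pass itself queries `TargetTransformInfo` for these values.

```cpp
#include <algorithm>
#include <cassert>

// Number of elements that fit into one vector register (at least 1).
unsigned vectorFactor(unsigned RegisterBits, unsigned ElementBits) {
  assert(ElementBits > 0 && "element size must be non-zero");
  return std::max(RegisterBits / ElementBits, 1u);
}

// R x M times M x C multiply. Returns true if holding both operands in
// vector registers at once would need more registers than are available.
bool isFusionProfitable(unsigned R, unsigned M, unsigned C, unsigned VF,
                        unsigned NumVectorRegs) {
  assert(VF > 0 && "vector factor must be positive");
  // Without reuse along the R or C axis, tiling cannot help.
  if (R <= VF && C == 1)
    return false;
  // Registers needed for each operand, with columns split into
  // VF-element chunks.
  const unsigned Op0Regs = (R + VF - 1) / VF * M;
  const unsigned Op1Regs = (M + VF - 1) / VF * C;
  return Op0Regs + Op1Regs > NumVectorRegs;
}
```

With 128-bit registers and 32-bit elements (VF = 4) and 16 vector registers, a 4x4 by 4x4 multiply needs 8 registers for its operands and is left unfused, while an 8x8 by 8x8 multiply needs 32 and is fused. As noted in the thread, the cost of the optional operand copy is not part of this model.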

Harbormaster failed remote builds in B50932: Diff 253543! · Mar 30 2020, 4:17 AM

fhahn mentioned this in D70456: [Matrix] Add first set of matrix intrinsics and initial lowering pass. · Apr 3 2020, 5:40 AM

anemet requested changes to this revision. · Apr 5 2020, 1:41 PM

anemet added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
57	Say it's square-shaped.
985–987	Are you being overly conservative here? If the end of the store is before the beginning of the load they still don't alias.

This revision now requires changes to proceed. · Apr 5 2020, 1:41 PM

Clarify that the tile size option is for a square shaped tile and add TODO at option to allow non-square tiles.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
985–987	Yes, I think but the previous condition (line 1008) ensures that the load begins before the end of the store if we check the condition here.

LGTM!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
985–987	Ah ok!
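The two runtime checks discussed in this exchange can be written as a plain predicate. The sketch below is an illustration under stated assumptions: addresses are modelled as integers, and `needsCopy` is a hypothetical name, not a function from the patch. It mirrors the branch structure in `getNonAliasingPointer`: the second comparison is only reached when the first one holds, which is why the pair is not overly conservative.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the loaded range [LdBegin, LdBegin + LdSize) overlaps the
// stored range [StBegin, StBegin + StSize), i.e. if a copy is required.
bool needsCopy(std::uint64_t LdBegin, std::uint64_t LdSize,
               std::uint64_t StBegin, std::uint64_t StSize) {
  assert(LdSize > 0 && StSize > 0 && "ranges must be non-empty");
  // First check: does the load begin before the end of the store?
  // If not, the load lies entirely after the store: no alias.
  const std::uint64_t StEnd = StBegin + StSize;
  if (!(LdBegin < StEnd))
    return false;
  // Second check (only reached if the first holds): does the store begin
  // before the end of the load? If not, the load ends before the store
  // begins: no alias.
  const std::uint64_t LdEnd = LdBegin + LdSize;
  return StBegin < LdEnd;
}
```

Taken together, the two comparisons are the usual interval-overlap test, so a copy is made only when the ranges really intersect.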

This revision is now accepted and ready to land. · Apr 5 2020, 2:12 PM

Harbormaster failed remote builds in B51863: Diff 255198! · Apr 5 2020, 2:27 PM

Closed by commit rGd1fed7081d80: [Matrix] Add initial tiling for load/multiply/store chains. (authored by fhahn). · Apr 6 2020, 1:36 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp	287 lines
llvm/test/Transforms/LowerMatrixIntrinsics/multiply-fused.ll	259 lines

Diff 248022

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp

[13 lines not shown]
// (WIP).		// (WIP).
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/Transforms/Scalar/LowerMatrixIntrinsics.h"		#include "llvm/Transforms/Scalar/LowerMatrixIntrinsics.h"
#include "llvm/ADT/GraphTraits.h"		#include "llvm/ADT/GraphTraits.h"
#include "llvm/ADT/PostOrderIterator.h"		#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
		#include "llvm/Analysis/AliasAnalysis.h"
		#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"		#include "llvm/Analysis/OptimizationRemarkEmitter.h"
		#include "llvm/Analysis/OrderedInstructions.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/ValueTracking.h"		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"
		#include "llvm/Transforms/Utils/BasicBlockUtils.h"

using namespace llvm;		using namespace llvm;
using namespace PatternMatch;		using namespace PatternMatch;

#define DEBUG_TYPE "lower-matrix-intrinsics"		#define DEBUG_TYPE "lower-matrix-intrinsics"

static cl::opt<bool> EnableShapePropagation(		static cl::opt<bool> EnableShapePropagation(
"matrix-propagate-shape", cl::init(true), cl::Hidden,		"matrix-propagate-shape", cl::init(true), cl::Hidden,
cl::desc("Enable/disable shape propagation from matrix intrinsics to other "		cl::desc("Enable/disable shape propagation from matrix intrinsics to other "
"instructions."));		"instructions."));

		static cl::opt<bool> FuseMatrix("fuse-matrix", cl::init(true));
		static cl::opt<unsigned> TileSize("fuse-matrix-tile-size", cl::init(4));
		static cl::opt<bool> ForceFusion("force-fuse-matrix", cl::init(false));
static cl::opt<bool> AllowContractEnabled(		static cl::opt<bool> AllowContractEnabled(
"matrix-allow-contract", cl::init(false), cl::Hidden,		"matrix-allow-contract", cl::init(false), cl::Hidden,
cl::desc("Allow the use of FMAs if available and profitable. This may "		cl::desc("Allow the use of FMAs if available and profitable. This may "
"result in different results, due to less rounding error."));		"result in different results, due to less rounding error."));

namespace {		namespace {

// Given an element poitner \p BasePtr to the start of a (sub) matrix, compute		// Given an element poitner \p BasePtr to the start of a (sub) matrix, compute
// the start address of column \p Col with type (\p EltType x \p NumRows)		// the start address of column \p Col with type (\p EltType x \p NumRows)
// assuming \p Stride elements between start two consecutive columns.		// assuming \p Stride elements between start two consecutive columns.
// \p Stride must be >= \p NumRows.		// \p Stride must be >= \p NumRows.
[75 lines not shown]
/// 2.4. Cache the result column matrix for the instruction we lowered		/// 2.4. Cache the result column matrix for the instruction we lowered
/// 3. After we lowered all instructions in a function, remove the now		/// 3. After we lowered all instructions in a function, remove the now
/// obsolete instructions.		/// obsolete instructions.
///		///
class LowerMatrixIntrinsics {		class LowerMatrixIntrinsics {
Function &Func;		Function &Func;
const DataLayout &DL;		const DataLayout &DL;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
		AliasAnalysis &AA;
		DominatorTree &DT;
		LoopInfo &LI;
		OrderedInstructions OI;
OptimizationRemarkEmitter &ORE;		OptimizationRemarkEmitter &ORE;

/// Contains estimates of the number of operations (loads, stores, compute) required to lower a matrix operation.		/// Contains estimates of the number of operations (loads, stores, compute) required to lower a matrix operation.
struct OpInfoTy {		struct OpInfoTy {
/// Number of stores emitted to generate this matrix.		/// Number of stores emitted to generate this matrix.
unsigned NumStores = 0;		unsigned NumStores = 0;
/// Number of loads emitted to generate this matrix.		/// Number of loads emitted to generate this matrix.
unsigned NumLoads = 0;		unsigned NumLoads = 0;
[116 lines not shown]	class LowerMatrixIntrinsics {
/// those need to be removed after we finished lowering.		/// those need to be removed after we finished lowering.
SmallVector<Instruction *, 16> ToRemove;		SmallVector<Instruction *, 16> ToRemove;

/// Map from instructions to their produced column matrix.		/// Map from instructions to their produced column matrix.
MapVector<Value *, ColumnMatrixTy> Inst2ColumnMatrix;		MapVector<Value *, ColumnMatrixTy> Inst2ColumnMatrix;

public:		public:
LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI,		LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI,
		AliasAnalysis &AA, DominatorTree &DT, LoopInfo &LI,
OptimizationRemarkEmitter &ORE)		OptimizationRemarkEmitter &ORE)
: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI), ORE(ORE) {}		: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI), AA(AA), DT(DT),
		LI(LI), OI(&DT), ORE(ORE) {}

unsigned getNumOps(Type *VT) {		unsigned getNumOps(Type *VT) {
assert(isa<VectorType>(VT) && "Expected vector type");		assert(isa<VectorType>(VT) && "Expected vector type");
return getNumOps(VT->getScalarType(),		return getNumOps(VT->getScalarType(),
cast<VectorType>(VT)->getNumElements());		cast<VectorType>(VT)->getNumElements());
}		}

//		//
[239 lines not shown]	while (!WorkList.empty()) {
for (size_t I = BeforeProcessingV; I != WorkList.size(); I++)		for (size_t I = BeforeProcessingV; I != WorkList.size(); I++)
for (User *U : WorkList[I]->users())		for (User *U : WorkList[I]->users())
if (isa<Instruction>(U) && V != U)		if (isa<Instruction>(U) && V != U)
NewWorkList.push_back(cast<Instruction>(U));		NewWorkList.push_back(cast<Instruction>(U));
}		}
return NewWorkList;		return NewWorkList;
}		}

		/// Visit \b BB and try to fuse matrix instructions.
		bool visitBBFusion(BasicBlock *BB,
		SmallVectorImpl<Instruction *> &MatrixInsts) {
		bool Changed = false;
		for (auto I = BB->begin(); I != BB->end(); ++I) {
		Instruction &Inst = *I;
		bool Touched = false;
		if (IntrinsicInst *IInst = dyn_cast<IntrinsicInst>(&Inst)) {
		if (IInst->getIntrinsicID() == Intrinsic::matrix_multiply) {

		if (BasicBlock *NextBB = LowerMatrixMultiplyFused(IInst)) {
		Touched = true;
		// We create new basic blocks when fusing multiplies. Those will not
		// be part of the RPO, so we visit the BB containing the remainder
		// of the original instructions.
		visitBBFusion(NextBB, MatrixInsts);
		return true;
		}
		}
		}
		// Collect instructions producing matrix values, stores or bitcasts.
		if (!Touched && ShapeMap.find(&Inst) != ShapeMap.end())
		MatrixInsts.push_back(&Inst);

		Changed |= Touched;
		}
		return Changed;
		}

bool Visit() {		bool Visit() {
if (EnableShapePropagation) {		if (EnableShapePropagation) {
SmallVector<Instruction *, 32> WorkList;		SmallVector<Instruction *, 32> WorkList;

// Initially only the shape of matrix intrinsics is known.		// Initially only the shape of matrix intrinsics is known.
// Initialize the work list with ops carrying shape information.		// Initialize the work list with ops carrying shape information.
for (BasicBlock &BB : Func)		for (BasicBlock &BB : Func)
for (Instruction &Inst : BB) {		for (Instruction &Inst : BB) {
[13 lines not shown]	if (EnableShapePropagation) {
}		}
}		}
// Propagate shapes until nothing changes any longer.		// Propagate shapes until nothing changes any longer.
while (!WorkList.empty()) {		while (!WorkList.empty()) {
WorkList = propagateShapeForward(WorkList);		WorkList = propagateShapeForward(WorkList);
WorkList = propagateShapeBackward(WorkList);		WorkList = propagateShapeBackward(WorkList);
}		}
}		}
		bool Changed = false;

		SmallVector<Instruction *, 16> MatrixInsts;
ReversePostOrderTraversal<Function *> RPOT(&Func);		ReversePostOrderTraversal<Function *> RPOT(&Func);
bool Changed = false;		for (auto *BB : RPOT)
for (auto *BB : RPOT) {		Changed |= visitBBFusion(BB, MatrixInsts);
for (Instruction &Inst : make_early_inc_range(*BB)) {
IRBuilder<> Builder(&Inst);

if (CallInst *CInst = dyn_cast<CallInst>(&Inst))		for (Instruction *Inst : MatrixInsts) {
		IRBuilder<> Builder(Inst);

		if (CallInst *CInst = dyn_cast<CallInst>(Inst))
Changed |= VisitCallInst(CInst);		Changed |= VisitCallInst(CInst);

Value *Op1;		Value *Op1;
Value *Op2;		Value *Op2;
if (auto *BinOp = dyn_cast<BinaryOperator>(&Inst))		if (auto *BinOp = dyn_cast<BinaryOperator>(Inst))
Changed |= VisitBinaryOperator(BinOp);		Changed |= VisitBinaryOperator(BinOp);
if (match(&Inst, m_Load(m_Value(Op1))))		if (match(Inst, m_Load(m_Value(Op1))))
Changed |= VisitLoad(&Inst, Op1, Builder);		Changed |= VisitLoad(Inst, Op1, Builder);
else if (match(&Inst, m_Store(m_Value(Op1), m_Value(Op2))))		else if (match(Inst, m_Store(m_Value(Op1), m_Value(Op2))))
Changed |= VisitStore(&Inst, Op1, Op2, Builder);		Changed |= VisitStore(Inst, Op1, Op2, Builder);
}
}		}

RemarkGenerator RemarkGen(Inst2ColumnMatrix, ORE, DL);		RemarkGenerator RemarkGen(Inst2ColumnMatrix, ORE, DL);
RemarkGen.emitRemarks();		RemarkGen.emitRemarks();

for (Instruction *Inst : reverse(ToRemove))		for (Instruction *Inst : reverse(ToRemove))
Inst->eraseFromParent();		Inst->eraseFromParent();

[296 lines not shown]	for (unsigned J = 0; J < C; ++J) {
}		}
Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));		Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));
}		}

Result.addNumComputeOps(NumOps);		Result.addNumComputeOps(NumOps);
}		}
}		}

		/// Ensure that the memory in \p Load does not alias \p Store by potentially
		/// copying it to a new location. This new or otherwise the original location
		/// is returned.
		Value *getNonAliasingPointer(LoadInst *Load, StoreInst *Store,
		CallInst *MatMul) {
		MemoryLocation St = MemoryLocation::get(Store);
		MemoryLocation Ld = MemoryLocation::get(Load);

		AliasResult LdAliased = AA.alias(Ld, St);

		// If we can statically determine noalias we're good.
		if (!LdAliased)
		return Load->getPointerOperand();

		IRBuilder<> Builder(MatMul);
		Type *IntPtrTy = Builder.getIntPtrTy(Load->getModule()->getDataLayout());

		Value *St_b =
		Builder.CreatePtrToInt(const_cast<Value *>(St.Ptr), IntPtrTy, "st_b");
		Value *St_e =
		Builder.CreateAdd(St_b, ConstantInt::get(IntPtrTy, St.Size.getValue()),
		"st_e", true, true);

		BasicBlock *Check0 = MatMul->getParent();

		// FIXME: Use lazy DTU and update SplitBlock to accept a DTU instead of a
		// DT. Manually collect dominator tree updates, to avoid unnecessary work,
		// as we adjust Check0 and Check1's branches.
		SmallVector<DominatorTree::UpdateType, 4> DTUpdates;
		for (BasicBlock *Succ : successors(Check0))
		DTUpdates.push_back({DT.Delete, Check0, Succ});

		BasicBlock *Check1 = SplitBlock(MatMul->getParent(), MatMul, nullptr, &LI,
		nullptr, "alias_cont");
		BasicBlock *Copy =
		SplitBlock(MatMul->getParent(), MatMul, nullptr, &LI, nullptr, "copy");
		BasicBlock *Fusion = SplitBlock(MatMul->getParent(), MatMul, nullptr, &LI,
		nullptr, "no_alias");
		DTUpdates.push_back({DT.Insert, Fusion, Copy});
		DTUpdates.push_back({DT.Insert, Copy, Check1});

		Check0->getTerminator()->eraseFromParent();
		Builder.SetInsertPoint(Check0);
		Value *Ld_b =
		Builder.CreatePtrToInt(const_cast<Value *>(Ld.Ptr), IntPtrTy, "ld_b");
		Builder.CreateCondBr(Builder.CreateICmpULT(Ld_b, St_e), Check1, Fusion);
		DTUpdates.push_back({DT.Insert, Check0, Check1});
		DTUpdates.push_back({DT.Insert, Check0, Fusion});

		Check1->getTerminator()->eraseFromParent();
		Builder.SetInsertPoint(Check1, Check1->begin());
		Value *Ld_e =
		Builder.CreateAdd(Ld_b, ConstantInt::get(IntPtrTy, Ld.Size.getValue()),
		"ld_e", true, true);
		Builder.CreateCondBr(Builder.CreateICmpULT(St_b, Ld_e), Copy, Fusion);
		DTUpdates.push_back({DT.Insert, Check1, Copy});
		DTUpdates.push_back({DT.Insert, Check1, Fusion});
		DT.applyUpdates(DTUpdates);

		Builder.SetInsertPoint(Copy, Copy->begin());
		AllocaInst *NewLd =
		Builder.CreateAlloca(Load->getType(), Load->getPointerAddressSpace());
		Builder.CreateMemCpy(NewLd, MaybeAlign(NewLd->getAlignment()),
		Load->getPointerOperand(), Load->getAlign(),
		Ld.Size.getValue());

		Builder.SetInsertPoint(Fusion, Fusion->begin());
		PHINode *PHI = Builder.CreatePHI(Load->getPointerOperandType(), 3);
		PHI->addIncoming(Load->getPointerOperand(), Check0);
		PHI->addIncoming(Load->getPointerOperand(), Check1);
		PHI->addIncoming(NewLd, Copy);

		return PHI;
		}

		bool isFusionProfitable(CallInst *MatMul) {
		if (ForceFusion)
		return true;

		ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));
		ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));

		const unsigned R = LShape.NumRows;
		const unsigned C = RShape.NumColumns;
		const unsigned M = LShape.NumColumns;
		auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();

		const unsigned VF =
		std::max<unsigned>(TTI.getRegisterBitWidth(true) /
		EltType->getPrimitiveSizeInBits().getFixedSize(),
		1U);

		// Cost model for tiling
		//
		// For tiling to be beneficial, we need reuse either along the R or
		// the C axis. We vectorize along the R axis so that means at least
		// 3 elements.
		if (R <= VF && C == 1)
		return false;
		// Then we need enough elements to exceed the number of vector
		// registers we have. Note that this is an oversimplification since
		// fusing also takes some extra loads which may exceed the number of
		// reloads necessary.
		unsigned Op0Regs = (R + VF - 1) / VF * M;
		unsigned Op1Regs = (M + VF - 1) / VF * C;
		return Op0Regs + Op1Regs > TTI.getNumberOfRegisters(true);
		}

		ColumnMatrixTy getZeroMatrix(Type *EltType, unsigned R, unsigned C) {
		ColumnMatrixTy Res;
		Type *ColumType = VectorType::get(EltType, R);
		for (unsigned I = 0; I < C; ++I)
		Res.addColumn(ConstantAggregateZero::get(ColumType));
		return Res;
		}

		BasicBlock *emitSIMDTiling(CallInst *MatMul, LoadInst *LoadOp0,
		LoadInst *LoadOp1, StoreInst *Store) {
		if (!isFusionProfitable(MatMul))
		return nullptr;

		ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));
		ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));

		const unsigned R = LShape.NumRows;
		const unsigned C = RShape.NumColumns;
		const unsigned M = LShape.NumColumns;
		auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();

		Value *APtr = getNonAliasingPointer(LoadOp0, Store, MatMul);
		Value *BPtr = getNonAliasingPointer(LoadOp1, Store, MatMul);
		Value *CPtr = Store->getPointerOperand();

		bool AllowContract = AllowContractEnabled || (isa<FPMathOperator>(MatMul) &&
		MatMul->hasAllowContract());

		IRBuilder<> Builder(Store);
		for (unsigned J = 0; J < C; J += TileSize)
		for (unsigned I = 0; I < R; I += TileSize) {
		const unsigned TileR = std::min(R - I, unsigned(TileSize));
		const unsigned TileC = std::min(C - J, unsigned(TileSize));
		ColumnMatrixTy Res = getZeroMatrix(EltType, TileR, TileC);

		for (unsigned K = 0; K < M; K += TileSize) {
		const unsigned TileM = std::min(M - K, unsigned(TileSize));
		ColumnMatrixTy A =
		loadMatrix(APtr, LShape, I, K, {TileR, TileM}, EltType, Builder);
		ColumnMatrixTy B =
		loadMatrix(BPtr, RShape, K, J, {TileM, TileC}, EltType, Builder);
		emitChainedMatrixMultiply(Res, A, B, AllowContract, Builder, true);
		}
		storeMatrix(Res, CPtr, {R, M}, I, J, EltType, Builder);
		}

		Store->eraseFromParent();
		BasicBlock *Cont = MatMul->getParent();
		MatMul->eraseFromParent();
		return Cont;
		}

		/// Try to lower matrix multiply chains by fusing operations.
		///
		/// Currently we only lower {ld, ld} -> matmul -> st chains.
		//
		/// No need to return LoweredMatrix since the single store user will be
		/// lowered as part of this.
		BasicBlock *LowerMatrixMultiplyFused(CallInst *MatMul) {
		if (!FuseMatrix)
		return nullptr;

		if (auto *LoadOp0 = dyn_cast<LoadInst>(MatMul->getOperand(0)))
		if (auto *LoadOp1 = dyn_cast<LoadInst>(MatMul->getOperand(1)))
		if (MatMul->hasOneUse())
		if (auto *Store = dyn_cast<StoreInst>(*MatMul->user_begin())) {
		// The store address must dominate the MatMul instruction, otherwise
		// we create invalid IR.
		// FIXME: See if we can hoist the store address computation.
		auto *AddrI = dyn_cast<Instruction>(Store->getOperand(1));
		if (AddrI && (!DT.dominates(AddrI, MatMul)))
		return nullptr;

		return emitSIMDTiling(MatMul, LoadOp0, LoadOp1, Store);
		}

		return nullptr;
		}

/// Lowers llvm.matrix.multiply.		/// Lowers llvm.matrix.multiply.
void LowerMultiply(CallInst *MatMul) {		void LowerMultiply(CallInst *MatMul) {
IRBuilder<> Builder(MatMul);		IRBuilder<> Builder(MatMul);
auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();		auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();
ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));		ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));
ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));		ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));

const ColumnMatrixTy &Lhs =		const ColumnMatrixTy &Lhs =
[... 507 lines not shown ...]
};		};
};		};
} // namespace		} // namespace

PreservedAnalyses LowerMatrixIntrinsicsPass::run(Function &F,		PreservedAnalyses LowerMatrixIntrinsicsPass::run(Function &F,
FunctionAnalysisManager &AM) {		FunctionAnalysisManager &AM) {
auto &TTI = AM.getResult<TargetIRAnalysis>(F);		auto &TTI = AM.getResult<TargetIRAnalysis>(F);
auto &ORE = AM.getResult<OptimizationRemarkEmitterAnalysis>(F);		auto &ORE = AM.getResult<OptimizationRemarkEmitterAnalysis>(F);
LowerMatrixIntrinsics LMT(F, TTI, ORE);		auto &AA = AM.getResult<AAManager>(F);
		auto &DT = AM.getResult<DominatorTreeAnalysis>(F);
		auto &LI = AM.getResult<LoopAnalysis>(F);

		LowerMatrixIntrinsics LMT(F, TTI, AA, DT, LI, ORE);
if (LMT.Visit()) {		if (LMT.Visit()) {
PreservedAnalyses PA;		PreservedAnalyses PA;
PA.preserveSet<CFGAnalyses>();		PA.preserveSet<CFGAnalyses>();
return PA;		return PA;
}		}
return PreservedAnalyses::all();		return PreservedAnalyses::all();
}		}

namespace {		namespace {

class LowerMatrixIntrinsicsLegacyPass : public FunctionPass {		class LowerMatrixIntrinsicsLegacyPass : public FunctionPass {
public:		public:
static char ID;		static char ID;

LowerMatrixIntrinsicsLegacyPass() : FunctionPass(ID) {		LowerMatrixIntrinsicsLegacyPass() : FunctionPass(ID) {
initializeLowerMatrixIntrinsicsLegacyPassPass(		initializeLowerMatrixIntrinsicsLegacyPassPass(
*PassRegistry::getPassRegistry());		*PassRegistry::getPassRegistry());
}		}

bool runOnFunction(Function &F) override {		bool runOnFunction(Function &F) override {
auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
auto &ORE = getAnalysis<OptimizationRemarkEmitterWrapperPass>().getORE();		auto &ORE = getAnalysis<OptimizationRemarkEmitterWrapperPass>().getORE();
LowerMatrixIntrinsics LMT(F, TTI, ORE);		auto &AA = getAnalysis<AAResultsWrapperPass>().getAAResults();
		auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
		auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
		LowerMatrixIntrinsics LMT(F, TTI, AA, DT, LI, ORE);
bool C = LMT.Visit();		bool C = LMT.Visit();
return C;		return C;
}		}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
AU.addRequired<OptimizationRemarkEmitterWrapperPass>();		AU.addRequired<OptimizationRemarkEmitterWrapperPass>();
AU.setPreservesCFG();		AU.addRequired<AAResultsWrapperPass>();
		AU.addRequired<DominatorTreeWrapperPass>();
		AU.addPreserved<DominatorTreeWrapperPass>();
		AU.addRequired<LoopInfoWrapperPass>();
		AU.addPreserved<LoopInfoWrapperPass>();
}		}
};		};
} // namespace		} // namespace

static const char pass_name[] = "Lower the matrix intrinsics";		static const char pass_name[] = "Lower the matrix intrinsics";
char LowerMatrixIntrinsicsLegacyPass::ID = 0;		char LowerMatrixIntrinsicsLegacyPass::ID = 0;
INITIALIZE_PASS_BEGIN(LowerMatrixIntrinsicsLegacyPass, DEBUG_TYPE, pass_name,		INITIALIZE_PASS_BEGIN(LowerMatrixIntrinsicsLegacyPass, DEBUG_TYPE, pass_name,
false, false)		false, false)
INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass)		INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass)
		INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)
		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_END(LowerMatrixIntrinsicsLegacyPass, DEBUG_TYPE, pass_name,		INITIALIZE_PASS_END(LowerMatrixIntrinsicsLegacyPass, DEBUG_TYPE, pass_name,
false, false)		false, false)

Pass *llvm::createLowerMatrixIntrinsicsPass() {		Pass *llvm::createLowerMatrixIntrinsicsPass() {
return new LowerMatrixIntrinsicsLegacyPass();		return new LowerMatrixIntrinsicsLegacyPass();
}		}
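
For readers following the tile loop in the fused lowering above, here is a minimal standalone sketch of the same blocking scheme on plain column-major arrays. It is not the pass's code: `idx`, `naiveMultiply` and `tiledMultiply` are hypothetical names, the order of the two outer loops is an assumption, and the sketch works on `std::vector<double>` rather than emitting IR. It only aims to show how the `TileR`/`TileC`/`TileM` clamping and the per-tile accumulator fit together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major index of element (Row, Col) in a matrix with NumRows rows.
static std::size_t idx(unsigned Row, unsigned Col, unsigned NumRows) {
  return std::size_t(Col) * NumRows + Row;
}

// Reference implementation: Out (R x C) = A (R x M) * B (M x C).
std::vector<double> naiveMultiply(const std::vector<double> &A,
                                  const std::vector<double> &B, unsigned R,
                                  unsigned M, unsigned C) {
  std::vector<double> Out(std::size_t(R) * C, 0.0);
  for (unsigned J = 0; J < C; ++J)
    for (unsigned K = 0; K < M; ++K)
      for (unsigned I = 0; I < R; ++I)
        Out[idx(I, J, R)] += A[idx(I, K, R)] * B[idx(K, J, M)];
  return Out;
}

// Tiled variant (hypothetical helper): processes square TileSize x TileSize
// blocks and clamps the last block in each dimension, like TileR/TileC/TileM.
std::vector<double> tiledMultiply(const std::vector<double> &A,
                                  const std::vector<double> &B, unsigned R,
                                  unsigned M, unsigned C, unsigned TileSize) {
  assert(TileSize > 0 && "tile size must be positive");
  assert(A.size() == std::size_t(R) * M && B.size() == std::size_t(M) * C);
  std::vector<double> Out(std::size_t(R) * C, 0.0);
  for (unsigned J = 0; J < C; J += TileSize)
    for (unsigned I = 0; I < R; I += TileSize) {
      const unsigned TileR = std::min(R - I, TileSize);
      const unsigned TileC = std::min(C - J, TileSize);
      // Zero-initialized accumulator for one result tile.
      std::vector<double> Res(std::size_t(TileR) * TileC, 0.0);
      for (unsigned K = 0; K < M; K += TileSize) {
        const unsigned TileM = std::min(M - K, TileSize);
        // Multiply the (I, K) tile of A by the (K, J) tile of B and
        // accumulate into Res.
        for (unsigned j = 0; j < TileC; ++j)
          for (unsigned k = 0; k < TileM; ++k)
            for (unsigned i = 0; i < TileR; ++i)
              Res[idx(i, j, TileR)] +=
                  A[idx(I + i, K + k, R)] * B[idx(K + k, J + j, M)];
      }
      // Write the finished tile back into the R x C result.
      for (unsigned j = 0; j < TileC; ++j)
        for (unsigned i = 0; i < TileR; ++i)
          Out[idx(I + i, J + j, R)] = Res[idx(i, j, TileR)];
    }
  return Out;
}
```

With a 3x5 by 5x4 product and a tile size of 2, every dimension has a clamped edge tile, which makes it a convenient case for comparing the tiled result against the reference.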

llvm/test/Transforms/LowerMatrixIntrinsics/multiply-fused.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-matrix-intrinsics -fuse-matrix-tile-size=2 -force-fuse-matrix -instcombine %s -S | FileCheck %s

				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "aarch64-apple-ios"

				define void @multiply(<16 x double> * %A, <16 x double> * %B, <16 x double>* %C) {
				; CHECK-LABEL: @multiply(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ST_B:%.*]] = ptrtoint <16 x double>* [[C:%.*]] to i64
				; CHECK-NEXT: [[ST_E:%.*]] = add nuw nsw i64 [[ST_B]], 128
				; CHECK-NEXT: [[LD_B:%.*]] = ptrtoint <16 x double>* [[A:%.*]] to i64
				; CHECK-NEXT: [[TMP0:%.*]] = icmp ugt i64 [[ST_E]], [[LD_B]]
				; CHECK-NEXT: br i1 [[TMP0]], label [[ALIAS_CONT:%.*]], label [[NO_ALIAS:%.*]]
				; CHECK: alias_cont:
				; CHECK-NEXT: [[LD_E:%.*]] = add nuw nsw i64 [[LD_B]], 128
				; CHECK-NEXT: [[TMP1:%.*]] = icmp ugt i64 [[LD_E]], [[ST_B]]
				; CHECK-NEXT: br i1 [[TMP1]], label [[COPY:%.*]], label [[NO_ALIAS]]
				; CHECK: copy:
				; CHECK-NEXT: [[TMP2:%.*]] = alloca <16 x double>, align 128
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast <16 x double>* [[TMP2]] to i8*
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast <16 x double>* [[A]] to i8*
				; CHECK-NEXT: call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 128 dereferenceable(128) [[TMP3]], i8* nonnull align 16 dereferenceable(128) [[TMP4]], i64 128, i1 false)
				; CHECK-NEXT: br label [[NO_ALIAS]]
				; CHECK: no_alias:
				; CHECK-NEXT: [[TMP5:%.*]] = phi <16 x double>* [ [[A]], [[ENTRY:%.*]] ], [ [[A]], [[ALIAS_CONT]] ], [ [[TMP2]], [[COPY]] ]
				; CHECK-NEXT: [[ST_B1:%.*]] = ptrtoint <16 x double>* [[C]] to i64
				; CHECK-NEXT: [[ST_E2:%.*]] = add nuw nsw i64 [[ST_B1]], 128
				; CHECK-NEXT: [[LD_B6:%.*]] = ptrtoint <16 x double>* [[B:%.*]] to i64
				; CHECK-NEXT: [[TMP6:%.*]] = icmp ugt i64 [[ST_E2]], [[LD_B6]]
				; CHECK-NEXT: br i1 [[TMP6]], label [[ALIAS_CONT3:%.*]], label [[NO_ALIAS5:%.*]]
				; CHECK: alias_cont3:
				; CHECK-NEXT: [[LD_E7:%.*]] = add nuw nsw i64 [[LD_B6]], 128
				; CHECK-NEXT: [[TMP7:%.*]] = icmp ugt i64 [[LD_E7]], [[ST_B1]]
				; CHECK-NEXT: br i1 [[TMP7]], label [[COPY4:%.*]], label [[NO_ALIAS5]]
				; CHECK: copy4:
				; CHECK-NEXT: [[TMP8:%.*]] = alloca <16 x double>, align 128
				; CHECK-NEXT: [[TMP9:%.*]] = bitcast <16 x double>* [[TMP8]] to i8*
				; CHECK-NEXT: [[TMP10:%.*]] = bitcast <16 x double>* [[B]] to i8*
				; CHECK-NEXT: call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 128 dereferenceable(128) [[TMP9]], i8* nonnull align 16 dereferenceable(128) [[TMP10]], i64 128, i1 false)
				; CHECK-NEXT: br label [[NO_ALIAS5]]
				; CHECK: no_alias5:
				; CHECK-NEXT: [[TMP11:%.*]] = phi <16 x double>* [ [[B]], [[NO_ALIAS]] ], [ [[B]], [[ALIAS_CONT3]] ], [ [[TMP8]], [[COPY4]] ]
				; CHECK-NEXT: [[COL_CAST8:%.*]] = bitcast <16 x double>* [[TMP5]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST8]], align 8
				; CHECK-NEXT: [[COL_GEP:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST9:%.*]] = bitcast double* [[COL_GEP]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD10:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST9]], align 8
				; CHECK-NEXT: [[COL_CAST12:%.*]] = bitcast <16 x double>* [[TMP11]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD13:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST12]], align 8
				; CHECK-NEXT: [[COL_GEP14:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST15:%.*]] = bitcast double* [[COL_GEP14]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD16:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST15]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <2 x double> [[COL_LOAD13]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP12:%.*]] = fmul <2 x double> [[COL_LOAD]], [[SPLAT_SPLAT]]
				; CHECK-NEXT: [[SPLAT_SPLAT19:%.*]] = shufflevector <2 x double> [[COL_LOAD13]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP13:%.*]] = fmul <2 x double> [[COL_LOAD10]], [[SPLAT_SPLAT19]]
				; CHECK-NEXT: [[TMP14:%.*]] = fadd <2 x double> [[TMP12]], [[TMP13]]
				; CHECK-NEXT: [[SPLAT_SPLAT22:%.*]] = shufflevector <2 x double> [[COL_LOAD16]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP15:%.*]] = fmul <2 x double> [[COL_LOAD]], [[SPLAT_SPLAT22]]
				; CHECK-NEXT: [[SPLAT_SPLAT25:%.*]] = shufflevector <2 x double> [[COL_LOAD16]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP16:%.*]] = fmul <2 x double> [[COL_LOAD10]], [[SPLAT_SPLAT25]]
				; CHECK-NEXT: [[TMP17:%.*]] = fadd <2 x double> [[TMP15]], [[TMP16]]
				; CHECK-NEXT: [[TMP18:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 8
				; CHECK-NEXT: [[COL_CAST27:%.*]] = bitcast double* [[TMP18]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD28:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST27]], align 8
				; CHECK-NEXT: [[COL_GEP29:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST30:%.*]] = bitcast double* [[COL_GEP29]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD31:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST30]], align 8
				; CHECK-NEXT: [[TMP19:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST33:%.*]] = bitcast double* [[TMP19]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD34:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST33]], align 8
				; CHECK-NEXT: [[COL_GEP35:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 4
				; CHECK-NEXT: [[COL_CAST36:%.*]] = bitcast double* [[COL_GEP35]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD37:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST36]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <2 x double> [[COL_LOAD34]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP20:%.*]] = fmul <2 x double> [[COL_LOAD28]], [[SPLAT_SPLAT41]]
				; CHECK-NEXT: [[TMP21:%.*]] = fadd <2 x double> [[TMP14]], [[TMP20]]
				; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <2 x double> [[COL_LOAD34]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP22:%.*]] = fmul <2 x double> [[COL_LOAD31]], [[SPLAT_SPLAT44]]
				; CHECK-NEXT: [[TMP23:%.*]] = fadd <2 x double> [[TMP21]], [[TMP22]]
				; CHECK-NEXT: [[SPLAT_SPLAT48:%.*]] = shufflevector <2 x double> [[COL_LOAD37]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP24:%.*]] = fmul <2 x double> [[COL_LOAD28]], [[SPLAT_SPLAT48]]
				; CHECK-NEXT: [[TMP25:%.*]] = fadd <2 x double> [[TMP17]], [[TMP24]]
				; CHECK-NEXT: [[SPLAT_SPLAT51:%.*]] = shufflevector <2 x double> [[COL_LOAD37]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP26:%.*]] = fmul <2 x double> [[COL_LOAD31]], [[SPLAT_SPLAT51]]
				; CHECK-NEXT: [[TMP27:%.*]] = fadd <2 x double> [[TMP25]], [[TMP26]]
				; CHECK-NEXT: [[COL_CAST53:%.*]] = bitcast <16 x double>* [[C]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP23]], <2 x double>* [[COL_CAST53]], align 8
				; CHECK-NEXT: [[COL_GEP54:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST55:%.*]] = bitcast double* [[COL_GEP54]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP27]], <2 x double>* [[COL_CAST55]], align 8
				; CHECK-NEXT: [[TMP28:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST57:%.*]] = bitcast double* [[TMP28]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD58:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST57]], align 8
				; CHECK-NEXT: [[COL_GEP59:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 4
				; CHECK-NEXT: [[COL_CAST60:%.*]] = bitcast double* [[COL_GEP59]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD61:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST60]], align 8
				; CHECK-NEXT: [[COL_CAST63:%.*]] = bitcast <16 x double>* [[TMP11]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD64:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST63]], align 8
				; CHECK-NEXT: [[COL_GEP65:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST66:%.*]] = bitcast double* [[COL_GEP65]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD67:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST66]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT70:%.*]] = shufflevector <2 x double> [[COL_LOAD64]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP29:%.*]] = fmul <2 x double> [[COL_LOAD58]], [[SPLAT_SPLAT70]]
				; CHECK-NEXT: [[SPLAT_SPLAT73:%.*]] = shufflevector <2 x double> [[COL_LOAD64]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP30:%.*]] = fmul <2 x double> [[COL_LOAD61]], [[SPLAT_SPLAT73]]
				; CHECK-NEXT: [[TMP31:%.*]] = fadd <2 x double> [[TMP29]], [[TMP30]]
				; CHECK-NEXT: [[SPLAT_SPLAT76:%.*]] = shufflevector <2 x double> [[COL_LOAD67]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP32:%.*]] = fmul <2 x double> [[COL_LOAD58]], [[SPLAT_SPLAT76]]
				; CHECK-NEXT: [[SPLAT_SPLAT79:%.*]] = shufflevector <2 x double> [[COL_LOAD67]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP33:%.*]] = fmul <2 x double> [[COL_LOAD61]], [[SPLAT_SPLAT79]]
				; CHECK-NEXT: [[TMP34:%.*]] = fadd <2 x double> [[TMP32]], [[TMP33]]
				; CHECK-NEXT: [[TMP35:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST81:%.*]] = bitcast double* [[TMP35]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD82:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST81]], align 8
				; CHECK-NEXT: [[COL_GEP83:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 12
				; CHECK-NEXT: [[COL_CAST84:%.*]] = bitcast double* [[COL_GEP83]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD85:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST84]], align 8
				; CHECK-NEXT: [[TMP36:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST87:%.*]] = bitcast double* [[TMP36]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD88:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST87]], align 8
				; CHECK-NEXT: [[COL_GEP89:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 4
				; CHECK-NEXT: [[COL_CAST90:%.*]] = bitcast double* [[COL_GEP89]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD91:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST90]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT95:%.*]] = shufflevector <2 x double> [[COL_LOAD88]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP37:%.*]] = fmul <2 x double> [[COL_LOAD82]], [[SPLAT_SPLAT95]]
				; CHECK-NEXT: [[TMP38:%.*]] = fadd <2 x double> [[TMP31]], [[TMP37]]
				; CHECK-NEXT: [[SPLAT_SPLAT98:%.*]] = shufflevector <2 x double> [[COL_LOAD88]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP39:%.*]] = fmul <2 x double> [[COL_LOAD85]], [[SPLAT_SPLAT98]]
				; CHECK-NEXT: [[TMP40:%.*]] = fadd <2 x double> [[TMP38]], [[TMP39]]
				; CHECK-NEXT: [[SPLAT_SPLAT102:%.*]] = shufflevector <2 x double> [[COL_LOAD91]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP41:%.*]] = fmul <2 x double> [[COL_LOAD82]], [[SPLAT_SPLAT102]]
				; CHECK-NEXT: [[TMP42:%.*]] = fadd <2 x double> [[TMP34]], [[TMP41]]
				; CHECK-NEXT: [[SPLAT_SPLAT105:%.*]] = shufflevector <2 x double> [[COL_LOAD91]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP43:%.*]] = fmul <2 x double> [[COL_LOAD85]], [[SPLAT_SPLAT105]]
				; CHECK-NEXT: [[TMP44:%.*]] = fadd <2 x double> [[TMP42]], [[TMP43]]
				; CHECK-NEXT: [[TMP45:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST107:%.*]] = bitcast double* [[TMP45]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP40]], <2 x double>* [[COL_CAST107]], align 8
				; CHECK-NEXT: [[COL_GEP108:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 4
				; CHECK-NEXT: [[COL_CAST109:%.*]] = bitcast double* [[COL_GEP108]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP44]], <2 x double>* [[COL_CAST109]], align 8
				; CHECK-NEXT: [[COL_CAST111:%.*]] = bitcast <16 x double>* [[TMP5]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD112:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST111]], align 8
				; CHECK-NEXT: [[COL_GEP113:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST114:%.*]] = bitcast double* [[COL_GEP113]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD115:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST114]], align 8
				; CHECK-NEXT: [[TMP46:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 8
				; CHECK-NEXT: [[COL_CAST117:%.*]] = bitcast double* [[TMP46]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD118:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST117]], align 8
				; CHECK-NEXT: [[COL_GEP119:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST120:%.*]] = bitcast double* [[COL_GEP119]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD121:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST120]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT124:%.*]] = shufflevector <2 x double> [[COL_LOAD118]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP47:%.*]] = fmul <2 x double> [[COL_LOAD112]], [[SPLAT_SPLAT124]]
				; CHECK-NEXT: [[SPLAT_SPLAT127:%.*]] = shufflevector <2 x double> [[COL_LOAD118]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP48:%.*]] = fmul <2 x double> [[COL_LOAD115]], [[SPLAT_SPLAT127]]
				; CHECK-NEXT: [[TMP49:%.*]] = fadd <2 x double> [[TMP47]], [[TMP48]]
				; CHECK-NEXT: [[SPLAT_SPLAT130:%.*]] = shufflevector <2 x double> [[COL_LOAD121]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP50:%.*]] = fmul <2 x double> [[COL_LOAD112]], [[SPLAT_SPLAT130]]
				; CHECK-NEXT: [[SPLAT_SPLAT133:%.*]] = shufflevector <2 x double> [[COL_LOAD121]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP51:%.*]] = fmul <2 x double> [[COL_LOAD115]], [[SPLAT_SPLAT133]]
				; CHECK-NEXT: [[TMP52:%.*]] = fadd <2 x double> [[TMP50]], [[TMP51]]
				; CHECK-NEXT: [[TMP53:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 8
				; CHECK-NEXT: [[COL_CAST135:%.*]] = bitcast double* [[TMP53]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD136:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST135]], align 8
				; CHECK-NEXT: [[COL_GEP137:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST138:%.*]] = bitcast double* [[COL_GEP137]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD139:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST138]], align 8
				; CHECK-NEXT: [[TMP54:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST141:%.*]] = bitcast double* [[TMP54]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD142:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST141]], align 8
				; CHECK-NEXT: [[COL_GEP143:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 12
				; CHECK-NEXT: [[COL_CAST144:%.*]] = bitcast double* [[COL_GEP143]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD145:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST144]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT149:%.*]] = shufflevector <2 x double> [[COL_LOAD142]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP55:%.*]] = fmul <2 x double> [[COL_LOAD136]], [[SPLAT_SPLAT149]]
				; CHECK-NEXT: [[TMP56:%.*]] = fadd <2 x double> [[TMP49]], [[TMP55]]
				; CHECK-NEXT: [[SPLAT_SPLAT152:%.*]] = shufflevector <2 x double> [[COL_LOAD142]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP57:%.*]] = fmul <2 x double> [[COL_LOAD139]], [[SPLAT_SPLAT152]]
				; CHECK-NEXT: [[TMP58:%.*]] = fadd <2 x double> [[TMP56]], [[TMP57]]
				; CHECK-NEXT: [[SPLAT_SPLAT156:%.*]] = shufflevector <2 x double> [[COL_LOAD145]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP59:%.*]] = fmul <2 x double> [[COL_LOAD136]], [[SPLAT_SPLAT156]]
				; CHECK-NEXT: [[TMP60:%.*]] = fadd <2 x double> [[TMP52]], [[TMP59]]
				; CHECK-NEXT: [[SPLAT_SPLAT159:%.*]] = shufflevector <2 x double> [[COL_LOAD145]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP61:%.*]] = fmul <2 x double> [[COL_LOAD139]], [[SPLAT_SPLAT159]]
				; CHECK-NEXT: [[TMP62:%.*]] = fadd <2 x double> [[TMP60]], [[TMP61]]
				; CHECK-NEXT: [[TMP63:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 8
				; CHECK-NEXT: [[COL_CAST161:%.*]] = bitcast double* [[TMP63]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP58]], <2 x double>* [[COL_CAST161]], align 8
				; CHECK-NEXT: [[COL_GEP162:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST163:%.*]] = bitcast double* [[COL_GEP162]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP62]], <2 x double>* [[COL_CAST163]], align 8
				; CHECK-NEXT: [[TMP64:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 2
				; CHECK-NEXT: [[COL_CAST165:%.*]] = bitcast double* [[TMP64]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD166:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST165]], align 8
				; CHECK-NEXT: [[COL_GEP167:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 4
				; CHECK-NEXT: [[COL_CAST168:%.*]] = bitcast double* [[COL_GEP167]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD169:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST168]], align 8
				; CHECK-NEXT: [[TMP65:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 8
				; CHECK-NEXT: [[COL_CAST171:%.*]] = bitcast double* [[TMP65]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD172:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST171]], align 8
				; CHECK-NEXT: [[COL_GEP173:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST174:%.*]] = bitcast double* [[COL_GEP173]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD175:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST174]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT178:%.*]] = shufflevector <2 x double> [[COL_LOAD172]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP66:%.*]] = fmul <2 x double> [[COL_LOAD166]], [[SPLAT_SPLAT178]]
				; CHECK-NEXT: [[SPLAT_SPLAT181:%.*]] = shufflevector <2 x double> [[COL_LOAD172]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP67:%.*]] = fmul <2 x double> [[COL_LOAD169]], [[SPLAT_SPLAT181]]
				; CHECK-NEXT: [[TMP68:%.*]] = fadd <2 x double> [[TMP66]], [[TMP67]]
				; CHECK-NEXT: [[SPLAT_SPLAT184:%.*]] = shufflevector <2 x double> [[COL_LOAD175]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP69:%.*]] = fmul <2 x double> [[COL_LOAD166]], [[SPLAT_SPLAT184]]
				; CHECK-NEXT: [[SPLAT_SPLAT187:%.*]] = shufflevector <2 x double> [[COL_LOAD175]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP70:%.*]] = fmul <2 x double> [[COL_LOAD169]], [[SPLAT_SPLAT187]]
				; CHECK-NEXT: [[TMP71:%.*]] = fadd <2 x double> [[TMP69]], [[TMP70]]
				; CHECK-NEXT: [[TMP72:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST189:%.*]] = bitcast double* [[TMP72]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD190:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST189]], align 8
				; CHECK-NEXT: [[COL_GEP191:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP5]], i64 0, i64 12
				; CHECK-NEXT: [[COL_CAST192:%.*]] = bitcast double* [[COL_GEP191]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD193:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST192]], align 8
				; CHECK-NEXT: [[TMP73:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST195:%.*]] = bitcast double* [[TMP73]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD196:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST195]], align 8
				; CHECK-NEXT: [[COL_GEP197:%.*]] = getelementptr <16 x double>, <16 x double>* [[TMP11]], i64 0, i64 12
				; CHECK-NEXT: [[COL_CAST198:%.*]] = bitcast double* [[COL_GEP197]] to <2 x double>*
				; CHECK-NEXT: [[COL_LOAD199:%.*]] = load <2 x double>, <2 x double>* [[COL_CAST198]], align 8
				; CHECK-NEXT: [[SPLAT_SPLAT203:%.*]] = shufflevector <2 x double> [[COL_LOAD196]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP74:%.*]] = fmul <2 x double> [[COL_LOAD190]], [[SPLAT_SPLAT203]]
				; CHECK-NEXT: [[TMP75:%.*]] = fadd <2 x double> [[TMP68]], [[TMP74]]
				; CHECK-NEXT: [[SPLAT_SPLAT206:%.*]] = shufflevector <2 x double> [[COL_LOAD196]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP76:%.*]] = fmul <2 x double> [[COL_LOAD193]], [[SPLAT_SPLAT206]]
				; CHECK-NEXT: [[TMP77:%.*]] = fadd <2 x double> [[TMP75]], [[TMP76]]
				; CHECK-NEXT: [[SPLAT_SPLAT210:%.*]] = shufflevector <2 x double> [[COL_LOAD199]], <2 x double> undef, <2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP78:%.*]] = fmul <2 x double> [[COL_LOAD190]], [[SPLAT_SPLAT210]]
				; CHECK-NEXT: [[TMP79:%.*]] = fadd <2 x double> [[TMP71]], [[TMP78]]
				; CHECK-NEXT: [[SPLAT_SPLAT213:%.*]] = shufflevector <2 x double> [[COL_LOAD199]], <2 x double> undef, <2 x i32> <i32 1, i32 1>
				; CHECK-NEXT: [[TMP80:%.*]] = fmul <2 x double> [[COL_LOAD193]], [[SPLAT_SPLAT213]]
				; CHECK-NEXT: [[TMP81:%.*]] = fadd <2 x double> [[TMP79]], [[TMP80]]
				; CHECK-NEXT: [[TMP82:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 10
				; CHECK-NEXT: [[COL_CAST215:%.*]] = bitcast double* [[TMP82]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP77]], <2 x double>* [[COL_CAST215]], align 8
				; CHECK-NEXT: [[COL_GEP216:%.*]] = getelementptr <16 x double>, <16 x double>* [[C]], i64 0, i64 12
				; CHECK-NEXT: [[COL_CAST217:%.*]] = bitcast double* [[COL_GEP216]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP81]], <2 x double>* [[COL_CAST217]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				%a = load <16 x double>, <16 x double>* %A, align 16
				%b = load <16 x double>, <16 x double>* %B, align 16

				%c = call <16 x double> @llvm.matrix.multiply(<16 x double> %a, <16 x double> %b, i32 4, i32 4, i32 4)

				store <16 x double> %c, <16 x double>* %C, align 16
				ret void
				}

				declare <16 x double> @llvm.matrix.multiply(<16 x double>, <16 x double>, i32, i32, i32)
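
The entry, alias_cont, copy and no_alias blocks checked in @multiply implement a runtime overlap test between the 128-byte store range and each 128-byte load range. A rough C++ rendering of that control flow might look as follows. The names `mayOverlap` and `selectLoadPointer` are hypothetical, and the caller-provided scratch buffer stands in for the `alloca` in the generated IR; this is a sketch of the checked behaviour, not code from the pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true if [LoadBegin, LoadBegin + Size) and
// [StoreBegin, StoreBegin + Size) share at least one byte.
bool mayOverlap(std::uintptr_t StoreBegin, std::uintptr_t LoadBegin,
                std::uintptr_t Size) {
  const std::uintptr_t StoreEnd = StoreBegin + Size;
  // entry: if the store range ends at or before the load begins, there is
  // no overlap (branch to no_alias).
  if (!(StoreEnd > LoadBegin))
    return false;
  // alias_cont: overlap only if the load range also ends after the store
  // begins (branch to copy), otherwise no_alias.
  const std::uintptr_t LoadEnd = LoadBegin + Size;
  return LoadEnd > StoreBegin;
}

// Picks the pointer the tiled loads should read from: Src itself when it
// cannot overlap Dst, otherwise a copy of Src placed in Scratch. This
// mirrors the phi in the no_alias block.
const double *selectLoadPointer(const double *Src, const double *Dst,
                                double *Scratch, std::size_t NumElts) {
  const std::size_t Bytes = NumElts * sizeof(double);
  if (!mayOverlap(reinterpret_cast<std::uintptr_t>(Dst),
                  reinterpret_cast<std::uintptr_t>(Src), Bytes))
    return Src;
  std::memcpy(Scratch, Src, Bytes); // copy: snapshot the operand
  return Scratch;
}
```

Adjacent ranges (one ending exactly where the other begins) count as non-overlapping here, matching the strict `icmp ugt` comparisons in the CHECK lines.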