This is an archive of the discontinued LLVM Phabricator instance.

Differential D102733

[Matrix] Factor and distribute transposes across multiplies
ClosedPublic

Authored by anemet on May 18 2021, 4:22 PM.

Details

Reviewers

fhahn

Commits

rGdfd1bbd00ac0: [Matrix] Factor and distribute transposes across multiplies

Summary

Now that we can fold some transposes into multiplies (CM: A * B^t and RM:
A^t * B), we want to move them around to create the optimal expressions:

* fold away double transposes while still using them to assert the shape
* sink transposes hoping they cancel out
* lift transposes when both operands are transposed
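
The three rewrites listed above rest on two identities: (A^t)^t = A and (A * B)^t = B^t * A^t, the latter read left to right when sinking and right to left when lifting. The following is a minimal sketch that checks them on small non-square integer matrices. The `Mat` type and the `transpose`, `multiply`, `sampleA` and `sampleB` helpers are hypothetical illustrations and are not part of LLVM or of this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major matrix used only for illustration.
struct Mat {
  std::size_t Rows, Cols;
  std::vector<int> Data; // Element (r, c) lives at Data[r * Cols + c].
  int at(std::size_t R, std::size_t C) const { return Data[R * Cols + C]; }
};

bool operator==(const Mat &L, const Mat &R) {
  return L.Rows == R.Rows && L.Cols == R.Cols && L.Data == R.Data;
}

// Returns M^t: an M.Cols x M.Rows matrix.
Mat transpose(const Mat &M) {
  Mat T{M.Cols, M.Rows, std::vector<int>(M.Data.size())};
  for (std::size_t R = 0; R < M.Rows; ++R)
    for (std::size_t C = 0; C < M.Cols; ++C)
      T.Data[C * T.Cols + R] = M.at(R, C);
  return T;
}

// Returns A * B for an RxK matrix A and a KxC matrix B.
Mat multiply(const Mat &A, const Mat &B) {
  assert(A.Cols == B.Rows && "inner dimensions must agree");
  Mat P{A.Rows, B.Cols, std::vector<int>(A.Rows * B.Cols, 0)};
  for (std::size_t R = 0; R < A.Rows; ++R)
    for (std::size_t C = 0; C < B.Cols; ++C)
      for (std::size_t K = 0; K < A.Cols; ++K)
        P.Data[R * P.Cols + C] += A.at(R, K) * B.at(K, C);
  return P;
}

// Sample non-square operands: A is 2x3 and B is 3x4.
Mat sampleA() { return {2, 3, {1, 2, 3, 4, 5, 6}}; }
Mat sampleB() { return {3, 4, {1, 0, 2, -1, 3, 1, 0, 2, -2, 4, 1, 0}}; }
```

With these helpers, `transpose(transpose(sampleA()))` compares equal to `sampleA()`, and `transpose(multiply(sampleA(), sampleB()))` compares equal to `multiply(transpose(sampleB()), transpose(sampleA()))`. Note how the shapes swap: the product is 2x4, so its transpose is 4x2.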

This also modifies the matrix remarks to include the number of exposed
transposes (i.e. transposes that we couldn't fold into a multiply).
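
As a rough illustration of how such a counter can ride along with the existing ones, here is a simplified stand-in for the per-expression counters. The field names follow the diff, but `OpInfo`, `makeCounts` and `formatRemark` are hypothetical helpers written for this sketch; the pass itself emits its remark through the optimization remark machinery rather than building a string.

```cpp
#include <string>

// Simplified stand-in for the pass's per-expression operation counters.
struct OpInfo {
  unsigned NumStores = 0;
  unsigned NumLoads = 0;
  unsigned NumComputeOps = 0;
  // Transposes that could not be folded away or fused into a multiply.
  unsigned NumExposedTransposes = 0;

  // Accumulate the counts of a sub-expression into this expression.
  OpInfo &operator+=(const OpInfo &RHS) {
    NumStores += RHS.NumStores;
    NumLoads += RHS.NumLoads;
    NumComputeOps += RHS.NumComputeOps;
    NumExposedTransposes += RHS.NumExposedTransposes;
    return *this;
  }
};

// Hypothetical convenience constructor for the sketch.
OpInfo makeCounts(unsigned Stores, unsigned Loads, unsigned ComputeOps,
                  unsigned ExposedTransposes) {
  OpInfo Info;
  Info.NumStores = Stores;
  Info.NumLoads = Loads;
  Info.NumComputeOps = ComputeOps;
  Info.NumExposedTransposes = ExposedTransposes;
  return Info;
}

// Mimics the wording of the remark text checked by the updated tests.
std::string formatRemark(const OpInfo &Counts) {
  return "Lowered with " + std::to_string(Counts.NumStores) + " stores, " +
         std::to_string(Counts.NumLoads) + " loads, " +
         std::to_string(Counts.NumComputeOps) + " compute ops, " +
         std::to_string(Counts.NumExposedTransposes) + " exposed transposes";
}
```

Summing a load (1 load), a single transpose (4 compute ops, 1 exposed transpose) and a store (2 stores) reproduces the counts that the updated remarks-inlining.ll test expects for the store at toplevel.c:510.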

The adjustment to the test remarks-inlining is a bit subtle: I am changing the
double transpose to a single transpose so that we don't remove it completely.
More importantly, this changes some of the total instruction counts, most
notably stores, because we can no longer use a vector store.

Diff Detail

Unit Tests: Failed

	Time	Test
	8,300 ms	x64 debian > libarcher.races::lock-unrelated.c

Event Timeline

anemet created this revision. May 18 2021, 4:22 PM

Herald added subscribers: tschuett, hiraditya. May 18 2021, 4:22 PM

anemet requested review of this revision. May 18 2021, 4:22 PM

Herald added a project: Restricted Project. May 18 2021, 4:22 PM

Herald added a subscriber: llvm-commits.

anemet edited the summary of this revision. May 18 2021, 4:23 PM

Harbormaster completed remote builds in B105126: Diff 346295. May 18 2021, 5:58 PM

fhahn added inline comments. May 20 2021, 6:28 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
679	For the later transforms, we collect a worklist once which contains all matrix instructions. Could we use the same here to avoid having to iterate over each function again?
748	> If we have a TT matmul, lift the transpose until we have a non-TT situation.
	Is this comment accurate? IIUC we only convert TT multiplies to versions where we can fold one transpose into the multiply?
llvm/test/Transforms/LowerMatrixIntrinsics/transpose-and-multiply-fold.ll
11	I think it would also be good to have tests that check the generated IR, together with some combinations with non-square matrices.

fhahn added inline comments. May 20 2021, 6:32 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
688	It should be enough to capture `&II`
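
The loop under discussion advances its iterator before touching the current instruction, and the erase helper advances it once more if the instruction being erased is the one the iterator now points at. The sketch below shows that pattern on a `std::list<int>` with an explicit capture list. It is an analogy only: `dropPairs` is a hypothetical function, and unlike an LLVM instruction a list element cannot erase itself, so the container has to be captured alongside the iterator.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Walks the list once and removes every element equal to Target together
// with the element that follows it.
std::list<int> dropPairs(std::list<int> Items, int Target) {
  for (auto It = Items.begin(); It != Items.end();) {
    auto Cur = It;
    // We may erase Cur, so step past it before doing anything else.
    ++It;

    // Explicit captures: only the loop iterator and the container are used.
    // If the loop iterator points at the victim, move it on before erasing.
    auto EraseNode = [&Items, &It](std::list<int>::iterator Victim) {
      if (Victim == It)
        ++It;
      Items.erase(Victim);
    };

    if (*Cur != Target)
      continue;
    auto Next = std::next(Cur);
    if (Next != Items.end())
      EraseNode(Next);
    EraseNode(Cur);
  }
  return Items;
}
```

Because `std::list::erase` only invalidates iterators to the erased element, bumping the loop iterator past the victim first keeps the traversal valid.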

anemet marked 3 inline comments as done. May 21 2021, 8:29 AM

anemet added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
679	Unless we really think this is a performance issue, I'd like to avoid the extra bookkeeping and just represent everything in the IR, with no on-the-side data structure that needs updating. As I was saying offline, I think we already have too much bookkeeping going on (e.g. for the remarks), so it's hard to know what to update at times. Having a backward and a forward matrix algebraic simplification pass (which is what optimizeTransposes is) that is logically separated from the lowering pass makes good sense, I think, in terms of "separation of concerns". What do you think?
748	Rephrased the comment.

anemet marked an inline comment as done. May 21 2021, 8:59 AM

anemet added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
679	(The only state that is live across optimizeTransposes is the shape-info, so that we gather as much shape info as possible before removing shape-carrying operations like a double-transpose. I should probably add a comment about this.)

We also have a stripped-down version of the pass to run in the backend pipelines (LowerMatrixIntrinsicsMinimalLegacyPass). We should probably not perform the optimizations there.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
679	> Having a backward and a forward matrix algebraic simplification pass (which is what optimizeTransposes is) that is logically separated from the lowering pass I think makes a good sense in terms of "separation of concerns". What do you think?
	I think the worklist for matrix instructions is slightly different, as it would not directly impact any code related to transformations, just the way we visit them. It's not needed right now, but I think we will need it sooner or later as we add more simplifications that may depend on each other. Compile-time-wise it should be fine, given that it is only run when the pass is explicitly enabled. But we should keep an eye on it, to avoid increasing compile time for our adopters too much.

fhahn added inline comments. May 21 2021, 9:22 AM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
679	While looking at this, I realized that we can do an easy early exit if there are no matrix intrinsics at all D102931. That should alleviate the impact on functions without matrix code :)

Address Florian's comments.

Harbormaster completed remote builds in B106042: Diff 347587. May 24 2021, 11:21 PM

LGTM, thanks!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
207	nit: `///` for doc-comment as the other members?

This revision is now accepted and ready to land. May 25 2021, 8:35 AM

This revision was landed with ongoing or failed builds. May 25 2021, 11:16 AM

Closed by commit rGdfd1bbd00ac0: [Matrix] Factor and distribute transposes across multiplies (authored by anemet).

This revision was automatically updated to reflect the committed changes.

anemet added a commit: rGdfd1bbd00ac0: [Matrix] Factor and distribute transposes across multiplies.

int3 added a subscriber: int3. May 25 2021, 12:20 PM

int3 added inline comments.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
815	I am getting a link-time failure when doing a local `ninja lld`:
	    Undefined symbols for architecture x86_64:
	      "llvm::Value::dump() const", referenced from:
	          (anonymous namespace)::LowerMatrixIntrinsics::Visit() in libLLVMScalarOpts.a(LowerMatrixIntrinsics.cpp.o)
	    ld: symbol(s) not found for architecture x86_64
	Not sure if/how it passes the buildbots, but it does indeed look like Value.cpp doesn't implement `dump()`.
815	(and commenting out this line does indeed fix my local build)

int3 added inline comments. May 25 2021, 12:23 PM

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
815	nevermind, I see a fix has just been pushed :)

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp	136 lines
llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll	29 lines
llvm/test/Transforms/LowerMatrixIntrinsics/remarks-shared-subtrees.ll	8 lines
llvm/test/Transforms/LowerMatrixIntrinsics/transpose-and-multiply-fold.ll	168 lines

Diff 346295

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp

Show All 28 Lines
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DebugInfoMetadata.h"		#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
		#include "llvm/IR/MatrixBuilder.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/Alignment.h"		#include "llvm/Support/Alignment.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
Show All 34 Lines
static cl::opt<MatrixLayoutTy> MatrixLayout(		static cl::opt<MatrixLayoutTy> MatrixLayout(
"matrix-default-layout", cl::init(MatrixLayoutTy::ColumnMajor),		"matrix-default-layout", cl::init(MatrixLayoutTy::ColumnMajor),
cl::desc("Sets the default matrix layout"),		cl::desc("Sets the default matrix layout"),
cl::values(clEnumValN(MatrixLayoutTy::ColumnMajor, "column-major",		cl::values(clEnumValN(MatrixLayoutTy::ColumnMajor, "column-major",
"Use column-major layout"),		"Use column-major layout"),
clEnumValN(MatrixLayoutTy::RowMajor, "row-major",		clEnumValN(MatrixLayoutTy::RowMajor, "row-major",
"Use row-major layout")));		"Use row-major layout")));

		static cl::opt<bool> PrintAfterTransposeOpt("matrix-print-after-transpose-opt",
		cl::init(false));

/// Helper function to either return Scope, if it is a subprogram or the		/// Helper function to either return Scope, if it is a subprogram or the
/// attached subprogram for a local scope.		/// attached subprogram for a local scope.
static DISubprogram *getSubprogram(DIScope *Scope) {		static DISubprogram *getSubprogram(DIScope *Scope) {
if (auto *Subprogram = dyn_cast<DISubprogram>(Scope))		if (auto *Subprogram = dyn_cast<DISubprogram>(Scope))
return Subprogram;		return Subprogram;
return cast<DILocalScope>(Scope)->getSubprogram();		return cast<DILocalScope>(Scope)->getSubprogram();
}		}

▲ Show 20 Lines • Show All 100 Lines • ▼ Show 20 Lines	class LowerMatrixIntrinsics {
/// Contains estimates of the number of operations (loads, stores, compute) required to lower a matrix operation.		/// Contains estimates of the number of operations (loads, stores, compute) required to lower a matrix operation.
struct OpInfoTy {		struct OpInfoTy {
/// Number of stores emitted to generate this matrix.		/// Number of stores emitted to generate this matrix.
unsigned NumStores = 0;		unsigned NumStores = 0;
/// Number of loads emitted to generate this matrix.		/// Number of loads emitted to generate this matrix.
unsigned NumLoads = 0;		unsigned NumLoads = 0;
/// Number of compute operations emitted to generate this matrix.		/// Number of compute operations emitted to generate this matrix.
unsigned NumComputeOps = 0;		unsigned NumComputeOps = 0;
		// Most of the time transposes can be fused with matrix multiplies or can be
		// folded away via algebraic simplifications. This is the number of
		// transposes that we failed to make "free" via such optimizations.
		unsigned NumExposedTransposes = 0;

OpInfoTy &operator+=(const OpInfoTy &RHS) {		OpInfoTy &operator+=(const OpInfoTy &RHS) {
NumStores += RHS.NumStores;		NumStores += RHS.NumStores;
NumLoads += RHS.NumLoads;		NumLoads += RHS.NumLoads;
NumComputeOps += RHS.NumComputeOps;		NumComputeOps += RHS.NumComputeOps;
		NumExposedTransposes += RHS.NumExposedTransposes;
return *this;		return *this;
}		}
};		};

/// Wrapper class representing a matrix as a set of vectors, either in row or		/// Wrapper class representing a matrix as a set of vectors, either in row or
/// column major layout. All vectors must have the same vector type.		/// column major layout. All vectors must have the same vector type.
class MatrixTy {		class MatrixTy {
SmallVector<Value *, 16> Vectors;		SmallVector<Value *, 16> Vectors;
▲ Show 20 Lines • Show All 88 Lines • ▼ Show 20 Lines	public:

void setNumLoads(unsigned N) { OpInfo.NumLoads = N; }		void setNumLoads(unsigned N) { OpInfo.NumLoads = N; }

MatrixTy &addNumStores(unsigned N) {		MatrixTy &addNumStores(unsigned N) {
OpInfo.NumStores += N;		OpInfo.NumStores += N;
return *this;		return *this;
}		}

		MatrixTy &addNumExposedTransposes(unsigned N) {
		OpInfo.NumExposedTransposes += N;
		return *this;
		}

MatrixTy &addNumComputeOps(unsigned N) {		MatrixTy &addNumComputeOps(unsigned N) {
OpInfo.NumComputeOps += N;		OpInfo.NumComputeOps += N;
return *this;		return *this;
}		}

unsigned getNumStores() const { return OpInfo.NumStores; }		unsigned getNumStores() const { return OpInfo.NumStores; }
unsigned getNumLoads() const { return OpInfo.NumLoads; }		unsigned getNumLoads() const { return OpInfo.NumLoads; }
unsigned getNumComputeOps() const { return OpInfo.NumComputeOps; }		unsigned getNumComputeOps() const { return OpInfo.NumComputeOps; }
▲ Show 20 Lines • Show All 59 Lines • ▼ Show 20 Lines	struct ShapeInfo {
}		}
};		};

/// Maps instructions to their shape information. The shape information		/// Maps instructions to their shape information. The shape information
/// describes the shape to be used while lowering. This matches the shape of		/// describes the shape to be used while lowering. This matches the shape of
/// the result value of the instruction, with the only exceptions being store		/// the result value of the instruction, with the only exceptions being store
/// instructions and the matrix_column_major_store intrinsics. For those, the		/// instructions and the matrix_column_major_store intrinsics. For those, the
/// shape information indicates that those instructions should be lowered		/// shape information indicates that those instructions should be lowered
/// using shape information as well.		/// using shape information as well. A ValueMap is used so that when
DenseMap<Value *, ShapeInfo> ShapeMap;		/// sub-passes like optimizeTransposes performs RAUW the map stays
		/// up-to-date.
		ValueMap<Value *, ShapeInfo> ShapeMap;

/// List of instructions to remove. While lowering, we are not replacing all		/// List of instructions to remove. While lowering, we are not replacing all
/// users of a lowered instruction, if shape information is available and		/// users of a lowered instruction, if shape information is available and
/// those need to be removed after we finished lowering.		/// those need to be removed after we finished lowering.
SmallVector<Instruction *, 16> ToRemove;		SmallVector<Instruction *, 16> ToRemove;

/// Map from instructions to their produced column matrix.		/// Map from instructions to their produced column matrix.
MapVector<Value *, MatrixTy> Inst2ColumnMatrix;		MapVector<Value *, MatrixTy> Inst2ColumnMatrix;
▲ Show 20 Lines • Show All 257 Lines • ▼ Show 20 Lines	while (!WorkList.empty()) {
for (size_t I = BeforeProcessingV; I != WorkList.size(); I++)		for (size_t I = BeforeProcessingV; I != WorkList.size(); I++)
for (User *U : WorkList[I]->users())		for (User *U : WorkList[I]->users())
if (isa<Instruction>(U) && V != U)		if (isa<Instruction>(U) && V != U)
NewWorkList.push_back(cast<Instruction>(U));		NewWorkList.push_back(cast<Instruction>(U));
}		}
return NewWorkList;		return NewWorkList;
}		}

		/// Try moving transposes in order to fold them away or into multiplies.
		void optimizeTransposes() {
		// First sink all transposes inside matmuls, hoping that we end up with NN,
		// NT or TN variants.
		for (BasicBlock &BB: reverse(Func)) {
		for (auto II = BB.rbegin(); II != BB.rend();) {
		Instruction &I = *II;
		// We may remove II. By default continue on the next/prev instruction.
		++II;
		// If we were to erase II, move again.
		auto eraseFromParent = [&](Value *V) {
		auto *Inst = cast<Instruction>(V);
		if (Inst->use_empty()) {
		if (Inst == &*II) {
		++II;
		}
		Inst->eraseFromParent();
		}
		};

		// If we're creating a new instruction, continue from there.
		Instruction *NewInst = nullptr;

		IRBuilder <> IB(&I);
		MatrixBuilder<IRBuilder<>> Builder(IB);

		Value *TA, *TAMA, *TAMB;
		ConstantInt *R, *K, *C;
		if (match(&I, m_Intrinsic<Intrinsic::matrix_transpose>(m_Value(TA)))) {

		// Transpose of a transpose is a nop
		Value *TATA;
		if (match(TA,
		m_Intrinsic<Intrinsic::matrix_transpose>(m_Value(TATA)))) {
		I.replaceAllUsesWith(TATA);
		eraseFromParent(&I);
		eraseFromParent(TA);
		}

		// (A * B)^t -> B^t * A^t
		// RxK KxC CxK KxR
		else if (match(TA, m_Intrinsic<Intrinsic::matrix_multiply>(
		m_Value(TAMA), m_Value(TAMB), m_ConstantInt(R),
		m_ConstantInt(K), m_ConstantInt(C)))) {
		Value *T0 = Builder.CreateMatrixTranspose(TAMB, K->getZExtValue(),
		C->getZExtValue(),
		TAMB->getName() + "_t");
		// We are being run after shape prop, add shape for newly created
		// instructions so that we lower them later.
		setShapeInfo(T0, {C, K});
		Value *T1 = Builder.CreateMatrixTranspose(TAMA, R->getZExtValue(),
		K->getZExtValue(),
		TAMA->getName() + "_t");
		setShapeInfo(T1, {K, R});
		NewInst = Builder.CreateMatrixMultiply(T0, T1, C->getZExtValue(),
		K->getZExtValue(),
		R->getZExtValue(), "mmul");
		setShapeInfo(NewInst, {C, R});
		I.replaceAllUsesWith(NewInst);
		eraseFromParent(&I);
		eraseFromParent(TA);
		}
		}

		// If we replaced I with a new instruction, continue from there.
		if (NewInst)
		II = std::next(BasicBlock::reverse_iterator(NewInst));
		}
		}

		// If we have a TT matmul, lift the transpose until we have a non-TT situation.
		for (BasicBlock &BB: Func) {
		for (BasicBlock::iterator II = BB.begin(); II != BB.end();) {
		Instruction *I = &*II;
		// We may remove I.
		++II;
		Value *A, *B, *AT, *BT;
		ConstantInt *R, *K, *C;
		if (match(&*I, m_Intrinsic<Intrinsic::matrix_multiply>(
		m_Value(A), m_Value(B), m_ConstantInt(R),
		m_ConstantInt(K), m_ConstantInt(C))) &&
		match(A, m_Intrinsic<Intrinsic::matrix_transpose>(m_Value(AT))) &&
		match(B, m_Intrinsic<Intrinsic::matrix_transpose>(m_Value((BT))))) {
		IRBuilder<> IB(&*I);
		MatrixBuilder<IRBuilder<>> Builder(IB);
		Value *M = Builder.CreateMatrixMultiply(
		BT, AT, C->getZExtValue(), K->getZExtValue(), R->getZExtValue());
		setShapeInfo(M, {C, R});
		Value *NewInst = Builder.CreateMatrixTranspose(M, R->getZExtValue(),
		C->getZExtValue());
		setShapeInfo(NewInst, {C, R});
		I->replaceAllUsesWith(NewInst);
		if (I->use_empty())
		I->eraseFromParent();
		if (A->use_empty())
		cast<Instruction>(A)->eraseFromParent();
		if (B->use_empty())
		cast<Instruction>(B)->eraseFromParent();
		}
		}
		}
		}

bool Visit() {		bool Visit() {
if (EnableShapePropagation) {		if (EnableShapePropagation) {
SmallVector<Instruction *, 32> WorkList;		SmallVector<Instruction *, 32> WorkList;

// Initially only the shape of matrix intrinsics is known.		// Initially only the shape of matrix intrinsics is known.
// Initialize the work list with ops carrying shape information.		// Initialize the work list with ops carrying shape information.
for (BasicBlock &BB : Func)		for (BasicBlock &BB : Func)
for (Instruction &Inst : BB) {		for (Instruction &Inst : BB) {
Show All 14 Lines	if (EnableShapePropagation) {
}		}
// Propagate shapes until nothing changes any longer.		// Propagate shapes until nothing changes any longer.
while (!WorkList.empty()) {		while (!WorkList.empty()) {
WorkList = propagateShapeForward(WorkList);		WorkList = propagateShapeForward(WorkList);
WorkList = propagateShapeBackward(WorkList);		WorkList = propagateShapeBackward(WorkList);
}		}
}		}

		optimizeTransposes();
		if (PrintAfterTransposeOpt) {
		dbgs() << "Dump after matrix transpose optimization:\n";
		Func.dump();
		}

bool Changed = false;		bool Changed = false;
SmallVector<CallInst *, 16> MaybeFusableInsts;		SmallVector<CallInst *, 16> MaybeFusableInsts;
SmallVector<Instruction *, 16> MatrixInsts;		SmallVector<Instruction *, 16> MatrixInsts;

// First, collect all instructions with shape information and candidates for		// First, collect all instructions with shape information and candidates for
// fusion (currently only matrix multiplies).		// fusion (currently only matrix multiplies).
ReversePostOrderTraversal<Function *> RPOT(&Func);		ReversePostOrderTraversal<Function *> RPOT(&Func);
for (auto *BB : RPOT)		for (auto *BB : RPOT)
▲ Show 20 Lines • Show All 785 Lines • ▼ Show 20 Lines	for (unsigned I = 0; I < NewNumVecs; ++I) {
Result.addVector(ResultVector);		Result.addVector(ResultVector);
}		}

// TODO: Improve estimate of operations needed for transposes. Currently we		// TODO: Improve estimate of operations needed for transposes. Currently we
// just count the insertelement/extractelement instructions, but do not		// just count the insertelement/extractelement instructions, but do not
// account for later simplifications/combines.		// account for later simplifications/combines.
finalizeLowering(		finalizeLowering(
Inst,		Inst,
Result.addNumComputeOps(2 * ArgShape.NumRows * ArgShape.NumColumns),		Result.addNumComputeOps(2 * ArgShape.NumRows * ArgShape.NumColumns)
		.addNumExposedTransposes(1),
Builder);		Builder);
}		}

/// Lower load instructions, if shape information is available.		/// Lower load instructions, if shape information is available.
bool VisitLoad(LoadInst *Inst, Value *Ptr, IRBuilder<> &Builder) {		bool VisitLoad(LoadInst *Inst, Value *Ptr, IRBuilder<> &Builder) {
auto I = ShapeMap.find(Inst);		auto I = ShapeMap.find(Inst);
if (I == ShapeMap.end())		if (I == ShapeMap.end())
return false;		return false;
▲ Show 20 Lines • Show All 498 Lines • ▼ Show 20 Lines	void emitRemarks() {

OptimizationRemark Rem(DEBUG_TYPE, "matrix-lowered", Loc,		OptimizationRemark Rem(DEBUG_TYPE, "matrix-lowered", Loc,
cast<Instruction>(L)->getParent());		cast<Instruction>(L)->getParent());

Rem << "Lowered with ";		Rem << "Lowered with ";
Rem << ore::NV("NumStores", Counts.NumStores) << " stores, "		Rem << ore::NV("NumStores", Counts.NumStores) << " stores, "
<< ore::NV("NumLoads", Counts.NumLoads) << " loads, "		<< ore::NV("NumLoads", Counts.NumLoads) << " loads, "
<< ore::NV("NumComputeOps", Counts.NumComputeOps)		<< ore::NV("NumComputeOps", Counts.NumComputeOps)
<< " compute ops";		<< " compute ops, "
		<< ore::NV("NumExposedTransposes", Counts.NumExposedTransposes)
		<< " exposed transposes";

if (SharedCounts.NumStores > 0 || SharedCounts.NumLoads > 0 ||		if (SharedCounts.NumStores > 0 || SharedCounts.NumLoads > 0 ||
SharedCounts.NumComputeOps > 0) {		SharedCounts.NumComputeOps > 0) {
Rem << ",\nadditionally "		Rem << ",\nadditionally "
<< ore::NV("NumStores", SharedCounts.NumStores) << " stores, "		<< ore::NV("NumStores", SharedCounts.NumStores) << " stores, "
<< ore::NV("NumLoads", SharedCounts.NumLoads) << " loads, "		<< ore::NV("NumLoads", SharedCounts.NumLoads) << " loads, "
<< ore::NV("NumFPOps", SharedCounts.NumComputeOps)		<< ore::NV("NumFPOps", SharedCounts.NumComputeOps)
<< " compute ops"		<< " compute ops"
▲ Show 20 Lines • Show All 139 Lines • Show Last 20 Lines

llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll

	Show First 20 Lines • Show All 41 Lines • ▼ Show 20 Lines
	; void test(double *A, double *B, double *C) {			; void test(double *A, double *B, double *C) {
	; store(add(load<double, 3, 5>(A), load<double, 3, 5>(B)), C);			; store(add(load<double, 3, 5>(A), load<double, 3, 5>(B)), C);
	; }			; }
	;			;

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "aarch64-apple-ios"			target triple = "aarch64-apple-ios"

	; CHECK-LABEL: remark: load.h:41:43: Lowered with 0 stores, 10 loads, 0 compute ops			; CHECK-LABEL: remark: load.h:41:43: Lowered with 0 stores, 10 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: load(addr %A)			; CHECK-NEXT: load(addr %A)

	; CHECK-LABEL: remark: load.h:41:43: Lowered with 0 stores, 10 loads, 0 compute ops			; CHECK-LABEL: remark: load.h:41:43: Lowered with 0 stores, 10 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: column.major.load.3x5.double(addr %B, 5)			; CHECK-NEXT: column.major.load.3x5.double(addr %B, 5)

	; CHECK-LABEL: remark: load.h:41:11: Lowered with 0 stores, 1 loads, 0 compute ops			; CHECK-LABEL: remark: load.h:41:11: Lowered with 0 stores, 1 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: load(addr %D)			; CHECK-NEXT: load(addr %D)

	; CHECK-LABEL: remark: assign.h:32:43: Lowered with 0 stores, 10 loads, 0 compute ops			; CHECK-LABEL: remark: assign.h:32:43: Lowered with 0 stores, 10 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: load(addr %A)			; CHECK-NEXT: load(addr %A)

	; CHECK-LABEL: remark: assign.h:32:43: Lowered with 0 stores, 10 loads, 0 compute ops			; CHECK-LABEL: remark: assign.h:32:43: Lowered with 0 stores, 10 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: column.major.load.3x5.double(addr %B, 5)			; CHECK-NEXT: column.major.load.3x5.double(addr %B, 5)

	; CHECK-LABEL: remark: toplevel.c:410:0: Lowered with 10 stores, 20 loads, 10 compute ops			; CHECK-LABEL: remark: toplevel.c:410:0: Lowered with 10 stores, 20 loads, 10 compute ops, 0 exposed transposes
	; CHECK-NEXT: store(			; CHECK-NEXT: store(
	; CHECK-NEXT: fadd(			; CHECK-NEXT: fadd(
	; CHECK-NEXT: load(addr %A),			; CHECK-NEXT: load(addr %A),
	; CHECK-NEXT: column.major.load.3x5.double(addr %B, 5)),			; CHECK-NEXT: column.major.load.3x5.double(addr %B, 5)),
	; CHECK-NEXT: addr %C)			; CHECK-NEXT: addr %C)

	; CHECK-LABEL: remark: toplevel.c:510:0: Lowered with 1 stores, 1 loads, 8 compute ops			; CHECK-LABEL: remark: toplevel.c:510:0: Lowered with 2 stores, 1 loads, 4 compute ops, 1 exposed transposes
	; CHECK-NEXT: store(			; CHECK-NEXT: store(
	; CHECK-NEXT: transpose.1x2.float(transpose.2x1.float(load(addr %D))),			; CHECK-NEXT: transpose.2x1.float(load(addr %D)),
	; CHECK-NEXT: addr %D)			; CHECK-NEXT: addr %D)

	; CHECK-LABEL: remark: add.h:66:11: Lowered with 0 stores, 0 loads, 10 compute ops			; CHECK-LABEL: remark: add.h:66:11: Lowered with 0 stores, 0 loads, 10 compute ops, 0 exposed transposes
	; CHECK-NEXT: fadd(			; CHECK-NEXT: fadd(
	; CHECK-NEXT: addr %A,			; CHECK-NEXT: addr %A,
	; CHECK-NEXT: scalar)			; CHECK-NEXT: scalar)

	; CHECK-LABEL: remark: store.h:10:11: Lowered with 10 stores, 0 loads, 0 compute ops			; CHECK-LABEL: remark: store.h:10:11: Lowered with 10 stores, 0 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: store(			; CHECK-NEXT: store(
	; CHECK-NEXT: scalar,			; CHECK-NEXT: scalar,
	; CHECK-NEXT: addr %C)			; CHECK-NEXT: addr %C)

	; CHECK-LABEL: remark: store.h:66:11: Lowered with 1 stores, 0 loads, 0 compute ops			; CHECK-LABEL: remark: store.h:66:11: Lowered with 2 stores, 0 loads, 0 compute ops, 0 exposed transposes
	; CHECK-NEXT: store(			; CHECK-NEXT: store(
	; CHECK-NEXT: scalar,			; CHECK-NEXT: scalar,
	; CHECK-NEXT: addr %D)			; CHECK-NEXT: addr %D)

	; CHECK-LABEL: remark: transpose.h:13:11: Lowered with 0 stores, 0 loads, 8 compute ops			; CHECK-LABEL: remark: transpose.h:13:11: Lowered with 0 stores, 0 loads, 4 compute ops, 1 exposed transposes
	; CHECK-NEXT: transpose.1x2.float(transpose.2x1.float(addr %D))			; CHECK-NEXT: transpose.2x1.float(addr %D)

	define void @toplevel(<15 x double>* %A, double* %B, <15 x double>* %C, <2 x float>* %D) !dbg !16 {			define void @toplevel(<15 x double>* %A, double* %B, <15 x double>* %C, <2 x float>* %D) !dbg !16 {
	entry:			entry:
	%a = load <15 x double>, <15 x double> *%A, align 16, !dbg !3791			%a = load <15 x double>, <15 x double> *%A, align 16, !dbg !3791
	%b = call <15 x double> @llvm.matrix.column.major.load(double* %B, i64 5, i1 false, i32 3, i32 5), !dbg !3793			%b = call <15 x double> @llvm.matrix.column.major.load(double* %B, i64 5, i1 false, i32 3, i32 5), !dbg !3793
	%c = fadd <15 x double> %a, %b, !dbg !100			%c = fadd <15 x double> %a, %b, !dbg !100
	store <15 x double> %c, <15 x double> *%C, align 16, !dbg !102			store <15 x double> %c, <15 x double> *%C, align 16, !dbg !102

	%load = load <2 x float>, <2 x float>* %D, !dbg !104			%load = load <2 x float>, <2 x float>* %D, !dbg !104
	%t1 = call <2 x float> @llvm.matrix.transpose(<2 x float> %load, i32 2, i32 1), !dbg !106			%t1 = call <2 x float> @llvm.matrix.transpose(<2 x float> %load, i32 2, i32 1), !dbg !106
	%t2 = call <2 x float> @llvm.matrix.transpose(<2 x float> %t1, i32 1, i32 2), !dbg !106			store <2 x float> %t1, <2 x float>* %D, !dbg !108
	store <2 x float> %t2, <2 x float>* %D, !dbg !108
	ret void			ret void
	}			}

	declare <15 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)			declare <15 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
	declare <2 x float> @llvm.matrix.transpose(<2 x float>, i32, i32)			declare <2 x float> @llvm.matrix.transpose(<2 x float>, i32, i32)

	!llvm.dbg.cu = !{!0}			!llvm.dbg.cu = !{!0}
	!llvm.module.flags = !{!3, !4}			!llvm.module.flags = !{!3, !4}

llvm/test/Transforms/LowerMatrixIntrinsics/remarks-shared-subtrees.ll

	; YAML-NEXT: Function: test_2leafs			; YAML-NEXT: Function: test_2leafs
	; YAML-NEXT: Args:			; YAML-NEXT: Args:
	; YAML-NEXT: - String: 'Lowered with '			; YAML-NEXT: - String: 'Lowered with '
	; YAML-NEXT: - NumStores: '4'			; YAML-NEXT: - NumStores: '4'
	; YAML-NEXT: - String: ' stores, '			; YAML-NEXT: - String: ' stores, '
	; YAML-NEXT: - NumLoads: '0'			; YAML-NEXT: - NumLoads: '0'
	; YAML-NEXT: - String: ' loads, '			; YAML-NEXT: - String: ' loads, '
	; YAML-NEXT: - NumComputeOps: '0'			; YAML-NEXT: - NumComputeOps: '0'
	; YAML-NEXT: - String: ' compute ops'			; YAML-NEXT: - String: ' compute ops, '
				; YAML-NEXT: - NumExposedTransposes: '0'
				; YAML-NEXT: - String: ' exposed transposes'
	; YAML-NEXT: - String: ",\nadditionally "			; YAML-NEXT: - String: ",\nadditionally "
	; YAML-NEXT: - NumStores: '0'			; YAML-NEXT: - NumStores: '0'
	; YAML-NEXT: - String: ' stores, '			; YAML-NEXT: - String: ' stores, '
	; YAML-NEXT: - NumLoads: '4'			; YAML-NEXT: - NumLoads: '4'
	; YAML-NEXT: - String: ' loads, '			; YAML-NEXT: - String: ' loads, '
	; YAML-NEXT: - NumFPOps: '16'			; YAML-NEXT: - NumFPOps: '16'
	; YAML-NEXT: - String: ' compute ops'			; YAML-NEXT: - String: ' compute ops'
	; YAML-NEXT: - String: ' are shared with other expressions'			; YAML-NEXT: - String: ' are shared with other expressions'
	; YAML-NEXT: Function: test_2leafs			; YAML-NEXT: Function: test_2leafs
	; YAML-NEXT: Args:			; YAML-NEXT: Args:
	; YAML-NEXT: - String: 'Lowered with '			; YAML-NEXT: - String: 'Lowered with '
	; YAML-NEXT: - NumStores: '30'			; YAML-NEXT: - NumStores: '30'
	; YAML-NEXT: - String: ' stores, '			; YAML-NEXT: - String: ' stores, '
	; YAML-NEXT: - NumLoads: '45'			; YAML-NEXT: - NumLoads: '45'
	; YAML-NEXT: - String: ' loads, '			; YAML-NEXT: - String: ' loads, '
	; YAML-NEXT: - NumComputeOps: '120'			; YAML-NEXT: - NumComputeOps: '120'
	; YAML-NEXT: - String: ' compute ops'			; YAML-NEXT: - String: ' compute ops, '
				; YAML-NEXT: - NumExposedTransposes: '0'
				; YAML-NEXT: - String: ' exposed transposes'
	; YAML-NEXT: - String: ",\nadditionally "			; YAML-NEXT: - String: ",\nadditionally "
	; YAML-NEXT: - NumStores: '0'			; YAML-NEXT: - NumStores: '0'
	; YAML-NEXT: - String: ' stores, '			; YAML-NEXT: - String: ' stores, '
	; YAML-NEXT: - NumLoads: '4'			; YAML-NEXT: - NumLoads: '4'
	; YAML-NEXT: - String: ' loads, '			; YAML-NEXT: - String: ' loads, '
	; YAML-NEXT: - NumFPOps: '16'			; YAML-NEXT: - NumFPOps: '16'
	; YAML-NEXT: - String: ' compute ops'			; YAML-NEXT: - String: ' compute ops'
	; YAML-NEXT: - String: ' are shared with other expressions'			; YAML-NEXT: - String: ' are shared with other expressions'

llvm/test/Transforms/LowerMatrixIntrinsics/transpose-and-multiply-fold.ll

This file was added.

				; REQUIRES: aarch64-registered-target

				; This test needs to be target specific due to the cost estimate in the output.

				; RUN: opt -lower-matrix-intrinsics -S -o /dev/null -pass-remarks-output=%t < %s && FileCheck --input-file %t %s
				; RUN: opt -passes='lower-matrix-intrinsics' -S -o /dev/null -pass-remarks-output=%t < %s && FileCheck --input-file %t %s

				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "aarch64-apple-ios"

				define void @double_transpose(<9 x double>* %A, <9 x double>* %B) {
				fhahn: I think it would also be good to have tests that check the generated IR, together with some combinations with non-square matrices.
				; CHECK: Pass: lower-matrix-intrinsics
				; CHECK-NEXT: Name: matrix-lowered
				; CHECK-NEXT: Function: double_transpose
				; CHECK-NEXT: Args:
				; CHECK-NEXT: - String: 'Lowered with '
				; CHECK-NEXT: - NumStores: '6'
				; CHECK-NEXT: - String: ' stores, '
				; CHECK-NEXT: - NumLoads: '6'
				; CHECK-NEXT: - String: ' loads, '
				; CHECK-NEXT: - NumComputeOps: '0'
				; CHECK-NEXT: - String: ' compute ops, '
				; CHECK-NEXT: - NumExposedTransposes: '0'
				; CHECK-NEXT: - String: ' exposed transposes'
				; CHECK-NEXT: - String: \|
				; CHECK: store(
				; CHECK-NEXT: load(addr %A),
				; CHECK-NEXT: addr %B)
				entry:
				%a = load <9 x double>, <9 x double>* %A, align 16
				%at = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %a, i32 3, i32 3)
				%att = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %at, i32 3, i32 3)
				store <9 x double> %att, <9 x double>* %B, align 16
				ret void
				}

				define void @multiply_3x3x3_ntt(<9 x double>* %A, <9 x double>* %B, <9 x double>* %C, <9 x double>* %R) {
				; CHECK: Pass: lower-matrix-intrinsics
				; CHECK-NEXT: Name: matrix-lowered
				; CHECK-NEXT: Function: multiply_3x3x3_ntt
				; CHECK-NEXT: Args:
				; CHECK-NEXT: - String: 'Lowered with '
				; CHECK-NEXT: - NumStores: '6'
				; CHECK-NEXT: - String: ' stores, '
				; CHECK-NEXT: - NumLoads: '18'
				; CHECK-NEXT: - String: ' loads, '
				; CHECK-NEXT: - NumComputeOps: '60'
				; CHECK-NEXT: - String: ' compute ops, '
				; CHECK-NEXT: - NumExposedTransposes: '0'
				; CHECK-NEXT: - String: ' exposed transposes'
				; CHECK-NEXT: - String: \|
				; CHECK: store(
				; CHECK-NEXT: multiply.3x3.3x3.double(
				; CHECK-NEXT: load(addr %A),
				; CHECK-NEXT: transpose.3x3.double(multiply.3x3.3x3.double(
				; CHECK-NEXT: load(addr %C),
				; CHECK-NEXT: load(addr %B)))),
				; CHECK-NEXT: addr %R)
				entry:
				%a = load <9 x double>, <9 x double>* %A, align 16
				%b = load <9 x double>, <9 x double>* %B, align 16
				%c = load <9 x double>, <9 x double>* %C, align 16
				%b_t = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %b, i32 3, i32 3)
				%c_t = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %c, i32 3, i32 3)
				%m1 = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %b_t, <9 x double> %c_t, i32 3, i32 3, i32 3)
				%m2 = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a, <9 x double> %m1, i32 3, i32 3, i32 3)
				store <9 x double> %m2, <9 x double>* %R, align 16
				ret void
				}

				define void @multiply_3x3x3_tt_t(<9 x double>* %A, <9 x double>* %B, <9 x double>* %C) {
				; CHECK: Pass: lower-matrix-intrinsics
				; CHECK-NEXT: Name: matrix-lowered
				; CHECK-NEXT: Function: multiply_3x3x3_tt_t
				; CHECK-NEXT: Args:
				; CHECK-NEXT: - String: 'Lowered with '
				; CHECK-NEXT: - NumStores: '6'
				; CHECK-NEXT: - String: ' stores, '
				; CHECK-NEXT: - NumLoads: '12'
				; CHECK-NEXT: - String: ' loads, '
				; CHECK-NEXT: - NumComputeOps: '30'
				; CHECK-NEXT: - String: ' compute ops, '
				; CHECK-NEXT: - NumExposedTransposes: '0'
				; CHECK-NEXT: - String: ' exposed transposes'
				; CHECK-NEXT: - String: \|
				; CHECK: store(
				; CHECK-NEXT: multiply.3x3.3x3.double(
				; CHECK-NEXT: load(addr %B),
				; CHECK-NEXT: load(addr %A)),
				; CHECK-NEXT: addr %C)
				entry:
				%a = load <9 x double>, <9 x double>* %A, align 16
				%at = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %a, i32 3, i32 3)
				%b = load <9 x double>, <9 x double>* %B, align 16
				%bt = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %b, i32 3, i32 3)
				%c = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %at, <9 x double> %bt, i32 3, i32 3, i32 3)
				%ct = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %c, i32 3, i32 3)
				store <9 x double> %ct, <9 x double>* %C, align 16
				ret void
				}

				define void @multiply_3x3x3_nt_t(<9 x double>* %A, <9 x double>* %B, <9 x double>* %C) {
				; CHECK: Pass: lower-matrix-intrinsics
				; CHECK-NEXT: Name: matrix-lowered
				; CHECK-NEXT: Function: multiply_3x3x3_nt_t
				; CHECK-NEXT: Args:
				; CHECK-NEXT: - String: 'Lowered with '
				; CHECK-NEXT: - NumStores: '6'
				; CHECK-NEXT: - String: ' stores, '
				; CHECK-NEXT: - NumLoads: '12'
				; CHECK-NEXT: - String: ' loads, '
				; CHECK-NEXT: - NumComputeOps: '30'
				; CHECK-NEXT: - String: ' compute ops, '
				; CHECK-NEXT: - NumExposedTransposes: '0'
				; CHECK-NEXT: - String: ' exposed transposes'
				; CHECK-NEXT: - String: \|
				; CHECK: store(
				; CHECK-NEXT: multiply.3x3.3x3.double(
				; CHECK-NEXT: load(addr %B),
				; CHECK-NEXT: transpose.3x3.double(load(addr %A))),
				; CHECK-NEXT: addr %C)
				entry:
				%a = load <9 x double>, <9 x double>* %A, align 16
				%b = load <9 x double>, <9 x double>* %B, align 16
				%bt = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %b, i32 3, i32 3)
				%c = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a, <9 x double> %bt, i32 3, i32 3, i32 3)
				%ct = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %c, i32 3, i32 3)
				store <9 x double> %ct, <9 x double>* %C, align 16
				ret void
				}

				define void @multiply_ntt_t(<9 x double>* %A, <9 x double>* %B, <9 x double>* %C, <9 x double>* %R) {
				; CHECK: Pass: lower-matrix-intrinsics
				; CHECK-NEXT: Name: matrix-lowered
				; CHECK-NEXT: Function: multiply_ntt_t
				; CHECK-NEXT: Args:
				; CHECK-NEXT: - String: 'Lowered with '
				; CHECK-NEXT: - NumStores: '6'
				; CHECK-NEXT: - String: ' stores, '
				; CHECK-NEXT: - NumLoads: '18'
				; CHECK-NEXT: - String: ' loads, '
				; CHECK-NEXT: - NumComputeOps: '60'
				; CHECK-NEXT: - String: ' compute ops, '
				; CHECK-NEXT: - NumExposedTransposes: '0'
				; CHECK-NEXT: - String: ' exposed transposes'
				; CHECK-NEXT: - String: \|
				; CHECK: store(
				; CHECK-NEXT: multiply.3x3.3x3.double(
				; CHECK-NEXT: multiply.3x3.3x3.double(
				; CHECK-NEXT: load(addr %C),
				; CHECK-NEXT: load(addr %B)),
				; CHECK-NEXT: transpose.3x3.double(load(addr %A))),
				; CHECK-NEXT: addr %R)
				entry:
				%a = load <9 x double>, <9 x double>* %A, align 16
				%b = load <9 x double>, <9 x double>* %B, align 16
				%bt = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %b, i32 3, i32 3)
				%c = load <9 x double>, <9 x double>* %C, align 16
				%ct = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %c, i32 3, i32 3)
				%btct = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %bt, <9 x double> %ct, i32 3, i32 3, i32 3)
				%abtct= call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %a, <9 x double> %btct, i32 3, i32 3, i32 3)
				%abtct_t = call <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double> %abtct, i32 3, i32 3)
				store <9 x double> %abtct_t, <9 x double>* %R, align 16
				ret void
				}

				declare <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double>, <9 x double>, i32 immarg, i32 immarg, i32 immarg)
				declare <9 x double> @llvm.matrix.transpose.v9f64.v9f64(<9 x double>, i32 immarg, i32 immarg)
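
The expression trees expected by the remarks in transpose-and-multiply-fold.ll correspond to standard transpose identities: T(T(A)) = A, T(T(A) * T(B)) = B * A, T(A * T(B)) = B * T(A), and A * (T(B) * T(C)) = A * T(C * B). The following is a minimal sketch that checks these identities numerically on non-square shapes. It assumes matrices stored as plain Python lists of rows; the helpers `make`, `transpose`, `matmul` and `check_identities` are hypothetical names used only for illustration and are not part of the patch or of LLVM.

```python
# Illustration only: these helpers are hypothetical and not part of the patch.

def make(rows, cols, start):
    """Build a rows x cols integer matrix filled with consecutive values."""
    return [[start + r * cols + c for c in range(cols)] for r in range(rows)]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def check_identities():
    """Check the transpose identities on non-square operands."""
    # double_transpose: T(T(A)) == A
    a = make(2, 3, 1)
    assert transpose(transpose(a)) == a

    # multiply_3x3x3_tt_t pattern: T(T(A) * T(B)) == B * A
    a, b = make(2, 3, 1), make(4, 2, 10)
    assert transpose(matmul(transpose(a), transpose(b))) == matmul(b, a)

    # multiply_3x3x3_nt_t pattern: T(A * T(B)) == B * T(A)
    a, b = make(2, 3, 1), make(4, 3, 10)
    assert transpose(matmul(a, transpose(b))) == matmul(b, transpose(a))

    # multiply_3x3x3_ntt pattern: A * (T(B) * T(C)) == A * T(C * B)
    a, b, c = make(2, 3, 1), make(4, 3, 10), make(5, 4, 30)
    lhs = matmul(a, matmul(transpose(b), transpose(c)))
    assert lhs == matmul(a, transpose(matmul(c, b)))

    # multiply_ntt_t pattern: T(A * (T(B) * T(C))) == (C * B) * T(A)
    assert transpose(lhs) == matmul(matmul(c, b), transpose(a))
    return True


if __name__ == "__main__":
    print("all identities hold:", check_identities())
```

Because the operands are non-square, swapping the operand order by mistake trips the dimension assertion in `matmul` instead of silently producing a wrong result. That is one reason non-square cases are useful alongside the 3x3 tests above.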