When transforming an x86_amx load into AMX load intrinsics, the compiler
needs to know the shape of the matrix. The shape can be deduced from the
user of the load. However, the shape may be defined after the load
instruction, so we need to sink the load to avoid this issue. This patch
only supports sinking within a basic block.
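To make the idea concrete, here is a minimal C++ sketch of a conservative within-block sink. The helper name trySinkLoadToUser and its exact checks are assumptions for illustration, not the patch's actual code; the LLVM APIs used (mayThrow, mayWriteToMemory, AAResults::getModRefInfo, moveBefore) are real:

```cpp
#include <iterator>

#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical helper (illustration only): sink Load to just before User
// in the same basic block so that the tile shape, defined between the
// load and its user, dominates the lowered AMX load intrinsic.
static bool trySinkLoadToUser(LoadInst *Load, Instruction *User,
                              AliasAnalysis &AA) {
  // This patch only supports sinking within a basic block.
  if (Load->getParent() != User->getParent() || User->comesBefore(Load))
    return false;
  // Don't touch volatile/atomic loads.
  if (!Load->isSimple())
    return false;

  MemoryLocation Loc = MemoryLocation::get(Load);
  for (auto It = std::next(Load->getIterator()); &*It != User; ++It) {
    // Bail on anything that may throw.
    if (It->mayThrow())
      return false;
    // Bail on stores, calls, fences, and atomics that may modify the
    // loaded location (mayWriteToMemory() is true for all of these; AA
    // refines the answer for this particular location).
    if (It->mayWriteToMemory() && isModSet(AA.getModRefInfo(&*It, Loc)))
      return false;
  }
  Load->moveBefore(User);
  return true;
}
```

Note that the loop refuses to cross anything that may throw or may write the loaded location, which is what keeps the sink from moving past a problematic call, store, fence, or atomic, as the comments below discuss.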
Details
- Reviewers
craig.topper pengfei xiangzhangllvm akashk4
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/X86/X86LowerAMXType.cpp:100
> Does this prevent sinking across a call that may change the memory being loaded?
llvm/lib/Target/X86/X86LowerAMXType.cpp:100
> Or atomics, or anything with side effects.
llvm/lib/Target/X86/X86LowerAMXType.cpp:100
> Thank you for the review. Yes, I want to prevent any scenario that may change the memory being loaded. I'll check which scenarios we need to prevent from sinking across.
llvm/lib/Target/X86/X86LowerAMXType.cpp:100
> @craig.topper, I did some study, but I still don't understand why a load can't sink across an atomic instruction or an instruction with side effects. I notice that MergedLoadStoreMotion::isStoreSinkBarrierInRange() only checks mayThrow() and aliasing.
llvm/lib/Target/X86/X86LowerAMXType.cpp:100
> I have only noticed that we chain memory operations with these side-effect instructions when building the DAG. Do we have such rules for middle-end passes?
llvm/lib/Target/X86/X86LowerAMXType.cpp:100
> Doesn't isStoreSinkBarrierInRange call canInstructionRangeModRef, not just isNoAlias, with a range that includes every instruction between the store and where it wants to be moved? "Release" ordering on an atomic means that no load/store can be moved below it, for example.
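For reference, a hedged sketch of how such a range query could be phrased with AAResults::canInstructionRangeModRef; the wrapper name mayAnyInstModifyLoadedLoc is illustrative, while the AA entry points are real LLVM API. Both instructions must be in the same basic block, and the range is inclusive:

```cpp
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Illustration only: ask AA whether any instruction in [From, To] may
// modify the memory the load reads. AA handles atomics conservatively
// (e.g., a release-ordered store is reported as ModRef even when it does
// not alias the location), so release operations act as barriers here.
static bool mayAnyInstModifyLoadedLoc(Instruction &From, Instruction &To,
                                      LoadInst *Load, AliasAnalysis &AA) {
  return AA.canInstructionRangeModRef(From, To, MemoryLocation::get(Load),
                                      ModRefInfo::Mod);
}
```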