This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX] Lower tile copy instruction.
Closed, Public

Authored by LuoYuanke on Feb 19 2021, 11:05 PM.

Details

Reviewers

pengfei
xiangzhangllvm
yubing
craig.topper
LiuChen3

Commits

rG8f48ddd19358: [X86][AMX] Lower tile copy instruction.

Summary

AMX has no tile-to-tile copy instruction, so a tile COPY is lowered by
storing the source tile register to the stack and loading it back into
the destination tile register. This needs an extra general-purpose
register (GR) to hold the stride and a stack slot to hold the tile data.
The pass runs after copy propagation, so that copy optimizations are not
missed, and before prolog/epilog insertion, so that the stack slot can
still be allocated.
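
The store-then-reload idea in the summary can be modeled outside the compiler. The sketch below is a toy simulation, not LLVM code: `Tile`, `tileStore`, `tileLoad`, and `copyTileViaStack` are hypothetical names introduced for illustration. It assumes a tile of 16 rows by 64 bytes, which agrees with the 1024-byte spill slots and the stride of 64 that appear later in this diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of one tile register: 16 rows of 64 bytes (1024 bytes total).
constexpr std::size_t kRows = 16;
constexpr std::size_t kRowBytes = 64;
using Tile = std::array<std::array<std::uint8_t, kRowBytes>, kRows>;

// Models a tile store: row r is written at base + r * stride.
void tileStore(const Tile &src, std::uint8_t *base, std::size_t stride) {
  for (std::size_t r = 0; r < kRows; ++r)
    for (std::size_t c = 0; c < kRowBytes; ++c)
      base[r * stride + c] = src[r][c];
}

// Models a tile load: row r is read from base + r * stride.
void tileLoad(Tile &dst, const std::uint8_t *base, std::size_t stride) {
  for (std::size_t r = 0; r < kRows; ++r)
    for (std::size_t c = 0; c < kRowBytes; ++c)
      dst[r][c] = base[r * stride + c];
}

// Copies src to dst by round-tripping through a temporary "stack slot".
// The stride lives in a separate value, just as the pass needs a GR for it.
void copyTileViaStack(const Tile &src, Tile &dst, std::size_t stride = 64) {
  assert(stride >= kRowBytes && "rows would overlap in the slot");
  std::vector<std::uint8_t> slot(kRows * stride); // 1024 bytes for stride 64
  tileStore(src, slot.data(), stride);
  tileLoad(dst, slot.data(), stride);
}
```

With a stride equal to the row width, the slot is densely packed, which is why a single 1024-byte stack object is enough for one copy.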

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	360 ms	x64 debian > libarcher.races::task-dependency.c
	420 ms	x64 debian > libarcher.races::task-taskgroup-unrelated.c
	320 ms	x64 debian > libarcher.races::task-taskwait-nested.c
	280 ms	x64 debian > libarcher.races::task-two.c
	320 ms	x64 debian > libarcher.task::task-barrier.c
		View Full Test Results (13 Failed)

Event Timeline

LuoYuanke created this revision. · Feb 19 2021, 11:05 PM

Herald added subscribers: nikic, pengfei, hiraditya, mgorny. · Feb 19 2021, 11:05 PM

LuoYuanke requested review of this revision. · Feb 19 2021, 11:05 PM

Herald added a project: Restricted Project. · Feb 19 2021, 11:05 PM

Herald added a subscriber: llvm-commits.

LuoYuanke added reviewers: pengfei, xiangzhangllvm, yubing. · Feb 19 2021, 11:07 PM

LuoYuanke added a subscriber: annita.zhang.

Remove useless code.

Harbormaster completed remote builds in B90030: Diff 325162. · Feb 19 2021, 11:46 PM

Harbormaster completed remote builds in B90031: Diff 325163. · Feb 20 2021, 12:19 AM

pengfei added inline comments. · Feb 20 2021, 1:15 AM

llvm/lib/Target/X86/X86LowerTileCopy.cpp
2	Comment is wrong.
10	instructions
llvm/lib/Target/X86/X86RegisterInfo.cpp
878	Is it possible to define a special COPY for AMX which can implicitly define a register for stride?
llvm/lib/Target/X86/X86TargetMachine.cpp
584	This is much like the handling of X87 register copies in the "X86 FP Stackifier" pass, so I think we can add the pass to addPostRegAlloc as well.
llvm/test/CodeGen/X86/AMX/amx-lower-tile-copy.ll
38	As we had discussed, tilezero should be rematerialized instead of spilled. For non-tilezero cases, we still need to treat the spill as loop-invariant and hoist it out of the loop. Anyway, these are optimization thoughts which don't affect the functionality here.

LuoYuanke added inline comments. · Feb 20 2021, 2:11 AM

llvm/lib/Target/X86/X86RegisterInfo.cpp
878	Not sure. The COPY instruction is common to all targets.
llvm/lib/Target/X86/X86TargetMachine.cpp
584	Sounds good to me.
llvm/test/CodeGen/X86/AMX/amx-lower-tile-copy.ll
38	I would do the optimization in another patch.

Address Pengfei's comments.

LuoYuanke added inline comments. · Feb 20 2021, 2:16 AM

llvm/lib/Target/X86/X86RegisterInfo.cpp
878	And the COPY instruction is auto-generated by some passes.
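
The line under discussion (878) is the new `COPY` case in `getTileShape`, which finds a tile's shape by following the copy back to its source and recording the result. A minimal sketch of that idea follows. It is not the LLVM implementation: `Shape`, `Def`, and `ShapeResolver` are hypothetical stand-ins for `ShapeT`, the defining instruction, and the `VirtRegMap` cache. The early return on a cache hit is an assumption of this sketch, and copy chains are assumed to be acyclic.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical stand-in for LLVM's ShapeT: just a row and column count.
struct Shape {
  int rows;
  int cols;
  bool operator==(const Shape &o) const {
    return rows == o.rows && cols == o.cols;
  }
};

// Hypothetical stand-in for the instruction that defines a register.
struct Def {
  bool isCopy;  // true: the register is a COPY of `src`
  unsigned src; // meaningful only when isCopy is true
  Shape shape;  // meaningful only when isCopy is false
};

class ShapeResolver {
public:
  void addShapeDef(unsigned reg, Shape s) { defs_[reg] = Def{false, 0, s}; }
  void addCopy(unsigned dst, unsigned src) {
    defs_[dst] = Def{true, src, Shape{0, 0}};
  }

  // Follows COPY definitions back to a shape-defining instruction and
  // records the shape for every register visited on the way.
  Shape getShape(unsigned reg) {
    auto cached = cache_.find(reg);
    if (cached != cache_.end())
      return cached->second;
    auto it = defs_.find(reg);
    assert(it != defs_.end() && "register has no recorded definition");
    const Def d = it->second;
    Shape s = d.isCopy ? getShape(d.src) : d.shape;
    cache_[reg] = s;
    return s;
  }

  bool isCached(unsigned reg) const { return cache_.count(reg) != 0; }

private:
  std::unordered_map<unsigned, Def> defs_;
  std::unordered_map<unsigned, Shape> cache_;
};
```

Because any pass may introduce a `COPY`, resolving the shape through the copy keeps the lookup independent of which pass created it.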

Harbormaster completed remote builds in B90038: Diff 325175. · Feb 20 2021, 2:58 AM

LuoYuanke added reviewers: craig.topper, LiuChen3. · Feb 20 2021, 4:20 AM

LGTM.

This revision is now accepted and ready to land. · Feb 21 2021, 5:21 PM

This revision was landed with ongoing or failed builds. · Feb 22 2021, 3:50 PM

Closed by commit rG8f48ddd19358: [X86][AMX] Lower tile copy instruction. (authored by LuoYuanke).

This revision was automatically updated to reflect the committed changes.

LuoYuanke added a commit: rG8f48ddd19358: [X86][AMX] Lower tile copy instruction.

Do we need to force opt to build a legacypassmanager for this pass?

Revision Contents

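Path	Size
llvm/lib/Target/X86/CMakeLists.txt	1 line
llvm/lib/Target/X86/X86.h	6 lines
llvm/lib/Target/X86/X86LowerTileCopy.cpp	132 lines
llvm/lib/Target/X86/X86RegisterInfo.cpp	6 lines
llvm/lib/Target/X86/X86TargetMachine.cpp	2 lines
llvm/test/CodeGen/X86/AMX/amx-lower-tile-copy.ll	181 lines
llvm/test/CodeGen/X86/O0-pipeline.ll	1 line
llvm/test/CodeGen/X86/opt-pipeline.ll	1 line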

Diff 325175

llvm/lib/Target/X86/CMakeLists.txt

[... 26 lines not shown ...]	set(sources
X86AsmPrinter.cpp		X86AsmPrinter.cpp
X86AvoidTrailingCall.cpp		X86AvoidTrailingCall.cpp
X86CallFrameOptimization.cpp		X86CallFrameOptimization.cpp
X86CallingConv.cpp		X86CallingConv.cpp
X86CallLowering.cpp		X86CallLowering.cpp
X86CmovConversion.cpp		X86CmovConversion.cpp
X86DomainReassignment.cpp		X86DomainReassignment.cpp
X86DiscriminateMemOps.cpp		X86DiscriminateMemOps.cpp
		X86LowerTileCopy.cpp
X86LowerAMXType.cpp		X86LowerAMXType.cpp
X86TileConfig.cpp		X86TileConfig.cpp
X86PreTileConfig.cpp		X86PreTileConfig.cpp
X86ExpandPseudo.cpp		X86ExpandPseudo.cpp
X86FastISel.cpp		X86FastISel.cpp
X86FixupBWInsts.cpp		X86FixupBWInsts.cpp
X86FixupLEAs.cpp		X86FixupLEAs.cpp
X86AvoidStoreForwardingBlocks.cpp		X86AvoidStoreForwardingBlocks.cpp
[... 65 lines not shown ...]

llvm/lib/Target/X86/X86.h

	[... 70 lines not shown ...]
	FunctionPass *createX86AvoidStoreForwardingBlocks();			FunctionPass *createX86AvoidStoreForwardingBlocks();

	/// Return a pass that lowers EFLAGS copy pseudo instructions.			/// Return a pass that lowers EFLAGS copy pseudo instructions.
	FunctionPass *createX86FlagsCopyLoweringPass();			FunctionPass *createX86FlagsCopyLoweringPass();

	/// Return a pass that expands WinAlloca pseudo-instructions.			/// Return a pass that expands WinAlloca pseudo-instructions.
	FunctionPass *createX86WinAllocaExpander();			FunctionPass *createX86WinAllocaExpander();

				/// Return a pass that config the tile registers.
	FunctionPass *createX86TileConfigPass();			FunctionPass *createX86TileConfigPass();

				/// Return a pass that insert pseudo tile config instruction.
	FunctionPass *createX86PreTileConfigPass();			FunctionPass *createX86PreTileConfigPass();

				/// Return a pass that lower the tile copy instruction.
				FunctionPass *createX86LowerTileCopyPass();

	/// Return a pass that inserts int3 at the end of the function if it ends with a			/// Return a pass that inserts int3 at the end of the function if it ends with a
	/// CALL instruction. The pass does the same for each funclet as well. This			/// CALL instruction. The pass does the same for each funclet as well. This
	/// ensures that the open interval of function start and end PCs contains all			/// ensures that the open interval of function start and end PCs contains all
	/// return addresses for the benefit of the Windows x64 unwinder.			/// return addresses for the benefit of the Windows x64 unwinder.
	FunctionPass *createX86AvoidTrailingCallPass();			FunctionPass *createX86AvoidTrailingCallPass();

	/// Return a pass that optimizes the code-size of x86 call sequences. This is			/// Return a pass that optimizes the code-size of x86 call sequences. This is
	/// done by replacing esp-relative movs with pushes.			/// done by replacing esp-relative movs with pushes.
	[... 73 lines not shown ...]
	void initializeX86LoadValueInjectionRetHardeningPassPass(PassRegistry &);			void initializeX86LoadValueInjectionRetHardeningPassPass(PassRegistry &);
	void initializeX86OptimizeLEAPassPass(PassRegistry &);			void initializeX86OptimizeLEAPassPass(PassRegistry &);
	void initializeX86PartialReductionPass(PassRegistry &);			void initializeX86PartialReductionPass(PassRegistry &);
	void initializeX86SpeculativeLoadHardeningPassPass(PassRegistry &);			void initializeX86SpeculativeLoadHardeningPassPass(PassRegistry &);
	void initializeX86SpeculativeExecutionSideEffectSuppressionPass(PassRegistry &);			void initializeX86SpeculativeExecutionSideEffectSuppressionPass(PassRegistry &);
	void initializeX86PreTileConfigPass(PassRegistry &);			void initializeX86PreTileConfigPass(PassRegistry &);
	void initializeX86TileConfigPass(PassRegistry &);			void initializeX86TileConfigPass(PassRegistry &);
	void initializeX86LowerAMXTypeLegacyPassPass(PassRegistry &);			void initializeX86LowerAMXTypeLegacyPassPass(PassRegistry &);
				void initializeX86LowerTileCopyPass(PassRegistry &);

	namespace X86AS {			namespace X86AS {
	enum : unsigned {			enum : unsigned {
	GS = 256,			GS = 256,
	FS = 257,			FS = 257,
	SS = 258,			SS = 258,
	PTR32_SPTR = 270,			PTR32_SPTR = 270,
	PTR32_UPTR = 271,			PTR32_UPTR = 271,
	PTR64 = 272			PTR64 = 272
	};			};
	} // End X86AS namespace			} // End X86AS namespace

	} // End llvm namespace			} // End llvm namespace

	#endif			#endif

llvm/lib/Target/X86/X86LowerTileCopy.cpp

This file was added.

				//===-- X86LowerTileCopy.cpp - Expand Tile Copy Instructions---------------===//
				//
				[Inline comment] pengfei: Comment is wrong.
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file defines the pass which lower AMX tile copy instructions. Since
				// there is no tile copy instruction, we need store tile register to stack
				[Inline comment] pengfei: instructions
				// and load from stack to another tile register. We need extra GR to hold
				// the stride, and we need stack slot to hold the tile data register.
				// We would run this pass after copy propagation, so that we don't miss copy
				// optimization. And we would run this pass before prolog/epilog insertion,
				// so that we can allocate stack slot.
				//
				//===----------------------------------------------------------------------===//

				#include "X86.h"
				#include "X86InstrBuilder.h"
				#include "X86InstrInfo.h"
				#include "X86Subtarget.h"
				#include "llvm/CodeGen/MachineBasicBlock.h"
				#include "llvm/CodeGen/MachineFrameInfo.h"
				#include "llvm/CodeGen/MachineFunction.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstr.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineOperand.h"
				#include "llvm/CodeGen/Passes.h"
				#include "llvm/IR/DebugLoc.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Support/Debug.h"

				using namespace llvm;

				#define DEBUG_TYPE "x86-lower-tile-copy"

				namespace {

				class X86LowerTileCopy : public MachineFunctionPass {
				public:
				  static char ID;

				  X86LowerTileCopy() : MachineFunctionPass(ID) {}

				  void getAnalysisUsage(AnalysisUsage &AU) const override;

				  bool runOnMachineFunction(MachineFunction &MF) override;

				  StringRef getPassName() const override { return "X86 Lower Tile Copy"; }
				};

				} // namespace

				char X86LowerTileCopy::ID = 0;

				INITIALIZE_PASS_BEGIN(X86LowerTileCopy, "lowertilecopy", "Tile Copy Lowering",
				                      false, false)
				INITIALIZE_PASS_END(X86LowerTileCopy, "lowertilecopy", "Tile Copy Lowering",
				                    false, false)

				void X86LowerTileCopy::getAnalysisUsage(AnalysisUsage &AU) const {
				  AU.setPreservesAll();
				  MachineFunctionPass::getAnalysisUsage(AU);
				}

				FunctionPass *llvm::createX86LowerTileCopyPass() {
				  return new X86LowerTileCopy();
				}

				bool X86LowerTileCopy::runOnMachineFunction(MachineFunction &MF) {
				  const X86Subtarget &ST = MF.getSubtarget<X86Subtarget>();
				  const X86InstrInfo *TII = ST.getInstrInfo();
				  bool Changed = false;

				  for (MachineBasicBlock &MBB : MF) {
				    for (MachineBasicBlock::iterator MII = MBB.begin(), MIE = MBB.end();
				         MII != MIE;) {
				      MachineInstr &MI = *MII++;
				      if (!MI.isCopy())
				        continue;
				      MachineOperand &DstMO = MI.getOperand(0);
				      MachineOperand &SrcMO = MI.getOperand(1);
				      Register SrcReg = SrcMO.getReg();
				      Register DstReg = DstMO.getReg();
				      if (!X86::TILERegClass.contains(DstReg, SrcReg))
				        continue;

				      const TargetRegisterInfo *TRI = ST.getRegisterInfo();
				      // Allocate stack slot for tile register
				      unsigned Size = TRI->getSpillSize(X86::TILERegClass);
				      Align Alignment = TRI->getSpillAlign(X86::TILERegClass);
				      int TileSS = MF.getFrameInfo().CreateSpillStackObject(Size, Alignment);
				      // Allocate stack slot for stride register
				      Size = TRI->getSpillSize(X86::GR64RegClass);
				      Alignment = TRI->getSpillAlign(X86::GR64RegClass);
				      int StrideSS = MF.getFrameInfo().CreateSpillStackObject(Size, Alignment);

				      // TODO: Pick a killed register to avoid save/reload. There is problem
				      // to get live interval in this stage.
				      Register GR64Cand = X86::RAX;

				      const DebugLoc &DL = MI.getDebugLoc();
				      // mov %rax (%sp)
				      BuildMI(MBB, MI, DL, TII->get(X86::IMPLICIT_DEF), GR64Cand);
				      addFrameReference(BuildMI(MBB, MI, DL, TII->get(X86::MOV64mr)), StrideSS)
				          .addReg(GR64Cand);
				      // mov 64 %rax
				      BuildMI(MBB, MI, DL, TII->get(X86::MOV64ri), GR64Cand).addImm(64);
				      // tilestored %tmm, (%sp, %idx)
				      unsigned Opc = X86::TILESTORED;
				      MachineInstr *NewMI =
				          addFrameReference(BuildMI(MBB, MI, DL, TII->get(Opc)), TileSS)
				              .addReg(SrcReg, getKillRegState(SrcMO.isKill()));
				      MachineOperand &MO = NewMI->getOperand(2);
				      MO.setReg(GR64Cand);
				      MO.setIsKill(true);
				      // tileloadd (%sp, %idx), %tmm
				      Opc = X86::TILELOADD;
				      NewMI = addFrameReference(BuildMI(MBB, MI, DL, TII->get(Opc), DstReg),
				                                TileSS);
				      // restore %rax
				      // mov (%sp) %rax
				      addFrameReference(BuildMI(MBB, MI, DL, TII->get(X86::MOV64rm), GR64Cand),
				                        StrideSS);
				      MI.eraseFromParent();
				      Changed = true;
				    }
				  }
				  return Changed;
				}
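
The order of operations in `runOnMachineFunction` can be summarized as a five-step sequence: save the scratch register, load the stride, store the source tile, load the destination tile, restore the scratch register. The sketch below only builds that sequence as text, following the intent stated in the pass's own comments rather than its exact output. `lowerTileCopy` and its offset parameters are hypothetical; the real pass works with frame indices, not byte offsets, and also emits an `IMPLICIT_DEF` of the scratch register that is omitted here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns pseudo-assembly for "dst = COPY src" between two tile registers.
// tileSlot and strideSlot are hypothetical stack offsets chosen by the caller.
std::vector<std::string> lowerTileCopy(const std::string &src,
                                       const std::string &dst,
                                       int tileSlot, int strideSlot) {
  assert(src != dst && "a self-copy needs no lowering");
  const std::string tile = std::to_string(tileSlot) + "(%rsp,%rax)";
  const std::string save = std::to_string(strideSlot) + "(%rsp)";
  return {
      "movq %rax, " + save,               // save the scratch register
      "movq $64, %rax",                   // the stride used by both tile ops
      "tilestored %" + src + ", " + tile, // spill the source tile
      "tileloadd " + tile + ", %" + dst,  // reload into the destination tile
      "movq " + save + ", %rax",          // restore the scratch register
  };
}
```

The save and restore around the sequence exist because, as the TODO in the pass notes, liveness information is not available at this stage, so the scratch register has to be treated as live.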

llvm/lib/Target/X86/X86RegisterInfo.cpp

[... 869 lines not shown ...]	static ShapeT getTileShape(Register VirtReg, VirtRegMap *VRM,

const MachineOperand &Def = *MRI->def_begin(VirtReg);		const MachineOperand &Def = *MRI->def_begin(VirtReg);
MachineInstr *MI = const_cast<MachineInstr *>(Def.getParent());		MachineInstr *MI = const_cast<MachineInstr *>(Def.getParent());
unsigned OpCode = MI->getOpcode();		unsigned OpCode = MI->getOpcode();
switch (OpCode) {		switch (OpCode) {
default:		default:
llvm_unreachable("Unexpected machine instruction on tile register!");		llvm_unreachable("Unexpected machine instruction on tile register!");
break;		break;
		case X86::COPY: {
		[Inline comment] pengfei: Is it possible to define a special COPY for AMX which can implicitly define a register for stride?
		[Inline comment] LuoYuanke (author): Not sure. The COPY instruction is common to all targets.
		[Inline comment] LuoYuanke (author): And the COPY instruction is auto-generated by some passes.
		Register SrcReg = MI->getOperand(1).getReg();
		ShapeT Shape = getTileShape(SrcReg, VRM, MRI);
		VRM->assignVirt2Shape(VirtReg, Shape);
		return Shape;
		}
// We only collect the tile shape that is defined.		// We only collect the tile shape that is defined.
case X86::PTILELOADDV:		case X86::PTILELOADDV:
case X86::PTDPBSSDV:		case X86::PTDPBSSDV:
case X86::PTILEZEROV:		case X86::PTILEZEROV:
MachineOperand &MO1 = MI->getOperand(1);		MachineOperand &MO1 = MI->getOperand(1);
MachineOperand &MO2 = MI->getOperand(2);		MachineOperand &MO2 = MI->getOperand(2);
ShapeT Shape(&MO1, &MO2, MRI);		ShapeT Shape(&MO1, &MO2, MRI);
VRM->assignVirt2Shape(VirtReg, Shape);		VRM->assignVirt2Shape(VirtReg, Shape);
[... 55 lines not shown ...]

llvm/lib/Target/X86/X86TargetMachine.cpp

[... 67 lines not shown ...]	extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeX86Target() {
initializeFixupBWInstPassPass(PR);		initializeFixupBWInstPassPass(PR);
initializeEvexToVexInstPassPass(PR);		initializeEvexToVexInstPassPass(PR);
initializeFixupLEAPassPass(PR);		initializeFixupLEAPassPass(PR);
initializeFPSPass(PR);		initializeFPSPass(PR);
initializeX86FixupSetCCPassPass(PR);		initializeX86FixupSetCCPassPass(PR);
initializeX86CallFrameOptimizationPass(PR);		initializeX86CallFrameOptimizationPass(PR);
initializeX86CmovConverterPassPass(PR);		initializeX86CmovConverterPassPass(PR);
initializeX86TileConfigPass(PR);		initializeX86TileConfigPass(PR);
		initializeX86LowerTileCopyPass(PR);
initializeX86ExpandPseudoPass(PR);		initializeX86ExpandPseudoPass(PR);
initializeX86ExecutionDomainFixPass(PR);		initializeX86ExecutionDomainFixPass(PR);
initializeX86DomainReassignmentPass(PR);		initializeX86DomainReassignmentPass(PR);
initializeX86AvoidSFBPassPass(PR);		initializeX86AvoidSFBPassPass(PR);
initializeX86AvoidTrailingCallPassPass(PR);		initializeX86AvoidTrailingCallPassPass(PR);
initializeX86SpeculativeLoadHardeningPassPass(PR);		initializeX86SpeculativeLoadHardeningPassPass(PR);
initializeX86SpeculativeExecutionSideEffectSuppressionPass(PR);		initializeX86SpeculativeExecutionSideEffectSuppressionPass(PR);
initializeX86FlagsCopyLoweringPassPass(PR);		initializeX86FlagsCopyLoweringPassPass(PR);
[... 419 lines not shown ...]
}		}

void X86PassConfig::addMachineSSAOptimization() {		void X86PassConfig::addMachineSSAOptimization() {
addPass(createX86DomainReassignmentPass());		addPass(createX86DomainReassignmentPass());
TargetPassConfig::addMachineSSAOptimization();		TargetPassConfig::addMachineSSAOptimization();
}		}

void X86PassConfig::addPostRegAlloc() {		void X86PassConfig::addPostRegAlloc() {
		addPass(createX86LowerTileCopyPass());
addPass(createX86FloatingPointStackifierPass());		addPass(createX86FloatingPointStackifierPass());
// When -O0 is enabled, the Load Value Injection Hardening pass will fall back		// When -O0 is enabled, the Load Value Injection Hardening pass will fall back
// to using the Speculative Execution Side Effect Suppression pass for		// to using the Speculative Execution Side Effect Suppression pass for
// mitigation. This is to prevent slow downs due to		// mitigation. This is to prevent slow downs due to
// analyses needed by the LVIHardening pass when compiling at -O0.		// analyses needed by the LVIHardening pass when compiling at -O0.
if (getOptLevel() != CodeGenOpt::None)		if (getOptLevel() != CodeGenOpt::None)
addPass(createX86LoadValueInjectionLoadHardeningPass());		addPass(createX86LoadValueInjectionLoadHardeningPass());
}		}
[... 55 lines not shown ...]	void X86PassConfig::addPreEmitPass2() {
addPass(createX86LoadValueInjectionRetHardeningPass());		addPass(createX86LoadValueInjectionRetHardeningPass());
}		}

bool X86PassConfig::addPreRewrite() {		bool X86PassConfig::addPreRewrite() {
addPass(createX86TileConfigPass());		addPass(createX86TileConfigPass());
return true;		return true;
}		}

std::unique_ptr<CSEConfigBase> X86PassConfig::getCSEConfig() const {		std::unique_ptr<CSEConfigBase> X86PassConfig::getCSEConfig() const {
		[Inline comment] pengfei: This is much like the handling of X87 register copies in the "X86 FP Stackifier" pass, so I think we can add the pass to addPostRegAlloc as well.
		[Inline comment] LuoYuanke (author): Sounds good to me.
return getStandardCSEConfigForOpt(TM->getOptLevel());		return getStandardCSEConfigForOpt(TM->getOptLevel());
}		}

llvm/test/CodeGen/X86/AMX/amx-lower-tile-copy.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+amx-int8 -mattr=+avx512f -verify-machineinstrs | FileCheck %s

				define dso_local void @test1(i8 *%buf) nounwind {
				; CHECK-LABEL: test1:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: pushq %rbp
				; CHECK-NEXT: pushq %r15
				; CHECK-NEXT: pushq %r14
				; CHECK-NEXT: pushq %rbx
				; CHECK-NEXT: subq $4056, %rsp # imm = 0xFD8
				; CHECK-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; CHECK-NEXT: vmovdqu64 %zmm0, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $1, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: ldtilecfg {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movl $64, %eax
				; CHECK-NEXT: movw $8, %r14w
				; CHECK-NEXT: tileloadd (%rdi,%rax), %tmm3
				; CHECK-NEXT: xorl %eax, %eax
				; CHECK-NEXT: testb %al, %al
				; CHECK-NEXT: jne .LBB0_3
				; CHECK-NEXT: # %bb.1: # %loop.header.preheader
				; CHECK-NEXT: movq %rdi, %rbx
				; CHECK-NEXT: xorl %ebp, %ebp
				; CHECK-NEXT: movl $32, %r15d
				; CHECK-NEXT: .p2align 4, 0x90
				; CHECK-NEXT: .LBB0_2: # %loop.header
				; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: movabsq $64, %rax
				; CHECK-NEXT: tilestored %tmm3, 2048(%rsp,%rax) # 1024-byte Folded Spill
				[Inline comment] pengfei: As we had discussed, tilezero should be rematerialized instead of spilled. For non-tilezero cases, we still need to treat the spill as loop-invariant and hoist it out of the loop. Anyway, these are optimization thoughts which don't affect the functionality here.
				[Inline comment] LuoYuanke (author): I would do the optimization in another patch.
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: callq foo
				; CHECK-NEXT: ldtilecfg {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movabsq $64, %rax
				; CHECK-NEXT: tileloadd 2048(%rsp,%rax), %tmm3 # 1024-byte Folded Reload
				; CHECK-NEXT: tileloadd (%rbx,%r15), %tmm0
				; CHECK-NEXT: tileloadd (%rbx,%r15), %tmm1
				; CHECK-NEXT: # implicit-def: $rax
				; CHECK-NEXT: movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
				; CHECK-NEXT: movabsq $64, %rax
				; CHECK-NEXT: tilestored %tmm3, 1024(%rsp,%rax) # 1024-byte Folded Spill
				; CHECK-NEXT: tileloadd {{[-0-9]+}}(%r{{[sb]}}p), %tmm2 # 1024-byte Folded Reload
				; CHECK-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
				; CHECK-NEXT: tdpbssd %tmm1, %tmm0, %tmm2
				; CHECK-NEXT: tilestored %tmm2, (%rbx,%r15)
				; CHECK-NEXT: incl %ebp
				; CHECK-NEXT: cmpw $100, %bp
				; CHECK-NEXT: jl .LBB0_2
				; CHECK-NEXT: .LBB0_3: # %exit
				; CHECK-NEXT: addq $4056, %rsp # imm = 0xFD8
				; CHECK-NEXT: popq %rbx
				; CHECK-NEXT: popq %r14
				; CHECK-NEXT: popq %r15
				; CHECK-NEXT: popq %rbp
				; CHECK-NEXT: tilerelease
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				%t1 = tail call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 8, i8* %buf, i64 64)
				br i1 undef, label %loop.header, label %exit

				loop.header:
				%ivphi = phi i16 [0, %entry], [%iv, %loop.latch]
				call void @foo()
				br label %loop.body

				loop.body:
				%t2 = tail call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 8, i8* %buf, i64 32)
				%t3 = tail call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 8, i8* %buf, i64 32)
				%t4 = tail call x86_amx @llvm.x86.tdpbssd.internal(i16 8, i16 8, i16 8, x86_amx %t1, x86_amx %t2, x86_amx %t3)
				tail call void @llvm.x86.tilestored64.internal(i16 8, i16 8, i8* %buf, i64 32, x86_amx %t4)
				br label %loop.latch

				loop.latch:
				%iv = add i16 %ivphi, 1
				%c = icmp slt i16 %iv, 100
				br i1 %c, label %loop.header, label %exit

				exit:
				ret void
				}

				define dso_local void @test2(i8 *%buf) nounwind {
				; CHECK-LABEL: test2:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: pushq %rbp
				; CHECK-NEXT: pushq %r15
				; CHECK-NEXT: pushq %r14
				; CHECK-NEXT: pushq %rbx
				; CHECK-NEXT: subq $4056, %rsp # imm = 0xFD8
				; CHECK-NEXT: vpxord %zmm0, %zmm0, %zmm0
				; CHECK-NEXT: vmovdqu64 %zmm0, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $1, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movb $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: ldtilecfg {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movw $8, %r14w
				; CHECK-NEXT: tilezero %tmm3
				; CHECK-NEXT: xorl %eax, %eax
				; CHECK-NEXT: testb %al, %al
				; CHECK-NEXT: jne .LBB1_3
				; CHECK-NEXT: # %bb.1: # %loop.header.preheader
				; CHECK-NEXT: movq %rdi, %rbx
				; CHECK-NEXT: xorl %ebp, %ebp
				; CHECK-NEXT: movl $32, %r15d
				; CHECK-NEXT: .p2align 4, 0x90
				; CHECK-NEXT: .LBB1_2: # %loop.header
				; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: movabsq $64, %rax
				; CHECK-NEXT: tilestored %tmm3, 2048(%rsp,%rax) # 1024-byte Folded Spill
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: callq foo
				; CHECK-NEXT: ldtilecfg {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movabsq $64, %rax
				; CHECK-NEXT: tileloadd 2048(%rsp,%rax), %tmm3 # 1024-byte Folded Reload
				; CHECK-NEXT: tileloadd (%rbx,%r15), %tmm0
				; CHECK-NEXT: tileloadd (%rbx,%r15), %tmm1
				; CHECK-NEXT: # implicit-def: $rax
				; CHECK-NEXT: movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
				; CHECK-NEXT: movabsq $64, %rax
				; CHECK-NEXT: tilestored %tmm3, 1024(%rsp,%rax) # 1024-byte Folded Spill
				; CHECK-NEXT: tileloadd {{[-0-9]+}}(%r{{[sb]}}p), %tmm2 # 1024-byte Folded Reload
				; CHECK-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
				; CHECK-NEXT: tdpbssd %tmm1, %tmm0, %tmm2
				; CHECK-NEXT: tilestored %tmm2, (%rbx,%r15)
				; CHECK-NEXT: incl %ebp
				; CHECK-NEXT: cmpw $100, %bp
				; CHECK-NEXT: jl .LBB1_2
				; CHECK-NEXT: .LBB1_3: # %exit
				; CHECK-NEXT: addq $4056, %rsp # imm = 0xFD8
				; CHECK-NEXT: popq %rbx
				; CHECK-NEXT: popq %r14
				; CHECK-NEXT: popq %r15
				; CHECK-NEXT: popq %rbp
				; CHECK-NEXT: tilerelease
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				%t1 = tail call x86_amx @llvm.x86.tilezero.internal(i16 8, i16 8)
				br i1 undef, label %loop.header, label %exit

				loop.header:
				%ivphi = phi i16 [0, %entry], [%iv, %loop.latch]
				call void @foo()
				br label %loop.body

				loop.body:
				%t2 = tail call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 8, i8* %buf, i64 32)
				%t3 = tail call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 8, i8* %buf, i64 32)
				%t4 = tail call x86_amx @llvm.x86.tdpbssd.internal(i16 8, i16 8, i16 8, x86_amx %t1, x86_amx %t2, x86_amx %t3)
				tail call void @llvm.x86.tilestored64.internal(i16 8, i16 8, i8* %buf, i64 32, x86_amx %t4)
				br label %loop.latch

				loop.latch:
				%iv = add i16 %ivphi, 1
				%c = icmp slt i16 %iv, 100
				br i1 %c, label %loop.header, label %exit

				exit:
				ret void
				}

				declare dso_local void @foo()
				declare x86_amx @llvm.x86.tilezero.internal(i16, i16)
				declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)
				declare x86_amx @llvm.x86.tdpbssd.internal(i16, i16, i16, x86_amx, x86_amx, x86_amx)
				declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)

llvm/test/CodeGen/X86/O0-pipeline.ll

	[... 39 lines not shown ...]
	; CHECK-NEXT: Local Stack Slot Allocation			; CHECK-NEXT: Local Stack Slot Allocation
	; CHECK-NEXT: X86 speculative load hardening			; CHECK-NEXT: X86 speculative load hardening
	; CHECK-NEXT: MachineDominator Tree Construction			; CHECK-NEXT: MachineDominator Tree Construction
	; CHECK-NEXT: X86 EFLAGS copy lowering			; CHECK-NEXT: X86 EFLAGS copy lowering
	; CHECK-NEXT: X86 WinAlloca Expander			; CHECK-NEXT: X86 WinAlloca Expander
	; CHECK-NEXT: Eliminate PHI nodes for register allocation			; CHECK-NEXT: Eliminate PHI nodes for register allocation
	; CHECK-NEXT: Two-Address instruction pass			; CHECK-NEXT: Two-Address instruction pass
	; CHECK-NEXT: Fast Register Allocator			; CHECK-NEXT: Fast Register Allocator
				; CHECK-NEXT: X86 Lower Tile Copy
	; CHECK-NEXT: Bundle Machine CFG Edges			; CHECK-NEXT: Bundle Machine CFG Edges
	; CHECK-NEXT: X86 FP Stackifier			; CHECK-NEXT: X86 FP Stackifier
	; CHECK-NEXT: Fixup Statepoint Caller Saved			; CHECK-NEXT: Fixup Statepoint Caller Saved
	; CHECK-NEXT: Lazy Machine Block Frequency Analysis			; CHECK-NEXT: Lazy Machine Block Frequency Analysis
	; CHECK-NEXT: Machine Optimization Remark Emitter			; CHECK-NEXT: Machine Optimization Remark Emitter
	; CHECK-NEXT: Prologue/Epilogue Insertion & Frame Finalization			; CHECK-NEXT: Prologue/Epilogue Insertion & Frame Finalization
	; CHECK-NEXT: Post-RA pseudo instruction expansion pass			; CHECK-NEXT: Post-RA pseudo instruction expansion pass
	; CHECK-NEXT: X86 pseudo instruction expansion pass			; CHECK-NEXT: X86 pseudo instruction expansion pass
	[... 25 lines not shown ...]

llvm/test/CodeGen/X86/opt-pipeline.ll

	[... 138 lines not shown ...]
	; CHECK-NEXT: Lazy Machine Block Frequency Analysis			; CHECK-NEXT: Lazy Machine Block Frequency Analysis
	; CHECK-NEXT: Machine Optimization Remark Emitter			; CHECK-NEXT: Machine Optimization Remark Emitter
	; CHECK-NEXT: Greedy Register Allocator			; CHECK-NEXT: Greedy Register Allocator
	; CHECK-NEXT: Tile Register Configure			; CHECK-NEXT: Tile Register Configure
	; CHECK-NEXT: Virtual Register Rewriter			; CHECK-NEXT: Virtual Register Rewriter
	; CHECK-NEXT: Stack Slot Coloring			; CHECK-NEXT: Stack Slot Coloring
	; CHECK-NEXT: Machine Copy Propagation Pass			; CHECK-NEXT: Machine Copy Propagation Pass
	; CHECK-NEXT: Machine Loop Invariant Code Motion			; CHECK-NEXT: Machine Loop Invariant Code Motion
				; CHECK-NEXT: X86 Lower Tile Copy
	; CHECK-NEXT: Bundle Machine CFG Edges			; CHECK-NEXT: Bundle Machine CFG Edges
	; CHECK-NEXT: X86 FP Stackifier			; CHECK-NEXT: X86 FP Stackifier
	; CHECK-NEXT: MachineDominator Tree Construction			; CHECK-NEXT: MachineDominator Tree Construction
	; CHECK-NEXT: Machine Dominance Frontier Construction			; CHECK-NEXT: Machine Dominance Frontier Construction
	; CHECK-NEXT: X86 Load Value Injection (LVI) Load Hardening			; CHECK-NEXT: X86 Load Value Injection (LVI) Load Hardening
	; CHECK-NEXT: Fixup Statepoint Caller Saved			; CHECK-NEXT: Fixup Statepoint Caller Saved
	; CHECK-NEXT: PostRA Machine Sink			; CHECK-NEXT: PostRA Machine Sink
	; CHECK-NEXT: Machine Block Frequency Analysis			; CHECK-NEXT: Machine Block Frequency Analysis
	[... 52 lines not shown ...]
