This is an archive of the discontinued LLVM Phabricator instance.


Differential D95748

AMDGPU: Fix dbg_value handling when forming soft clause bundles
Closed, Public

Authored by arsenm on Jan 30 2021, 2:16 PM.

Details

Reviewers

rampitec
scott.linder
RamNalamothu
foad
dfukalov

Summary

DBG_VALUES placed between memory instructions would change
codegen. Skip over these and re-insert them after the bundle instead
of giving up on bundling.
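
The approach described in the summary can be sketched outside LLVM. What follows is a toy model under stated assumptions, not the code from this revision: `Inst`, `Block`, `formClause`, and `names` are hypothetical names, a `std::list` stands in for the instruction list of a basic block, and a second list plays the role of the buffer that holds debug instructions until the clause is complete.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Toy stand-in for a machine instruction (hypothetical, not an LLVM type).
struct Inst {
  std::string Name;
  bool IsDebug;
  bool IsLoad;
};

using Block = std::list<Inst>;

// Starting at the load First, extend a clause over the following loads while
// ignoring debug instructions, then move the skipped debug instructions to
// just after the clause. Returns the number of loads in the clause; a result
// below 2 means nothing was grouped and the block is unchanged.
unsigned formClause(Block &MBB, Block::iterator First, unsigned MaxClause) {
  assert(First != MBB.end() && First->IsLoad && "clause must start at a load");

  unsigned Length = 1;
  auto Next = std::next(First);
  for (; Next != MBB.end() && Length < MaxClause; ++Next) {
    if (Next->IsDebug)
      continue; // a debug instruction must not end the clause
    if (!Next->IsLoad)
      break;
    ++Length;
  }
  if (Length < 2)
    return Length;

  // Unlink the debug instructions found in [First, Next) into a buffer.
  Block DbgInsts;
  for (auto BI = First; BI != Next;) {
    auto Cur = BI++; // step first, because Cur may leave the list
    if (Cur->IsDebug)
      DbgInsts.splice(DbgInsts.end(), MBB, Cur);
  }

  // Re-insert them, in their original order, directly before Next.
  MBB.splice(Next, DbgInsts);
  return Length;
}

// Helper for inspecting the resulting order.
std::vector<std::string> names(const Block &MBB) {
  std::vector<std::string> Out;
  for (const Inst &I : MBB)
    Out.push_back(I.Name);
  return Out;
}
```

In this model, a block holding load1, dbg1, load2, dbg2, nop yields a clause of length 2 and the order load1, load2, dbg1, dbg2, nop, which mirrors the shape of the first bundle in the test added by this revision.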

Event Timeline

arsenm created this revision. Jan 30 2021, 2:16 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others. Jan 30 2021, 2:16 PM

arsenm requested review of this revision. Jan 30 2021, 2:16 PM

Herald added a project: Restricted Project. Jan 30 2021, 2:16 PM

Herald added a subscriber: wdng.

Forgot to git add test

dfukalov added inline comments. Jan 31 2021, 3:52 PM

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
388	Why not just create DbgInstrs here?
llvm/test/CodeGen/AMDGPU/soft-clause-dbg-value.mir
3	Am I right that we need `phi-node-elimination` to run first, removing the `IsSSA` property, to avoid a verifier failure? Alternatively, a dummy copy of the first instruction, `%0:sreg_64 = COPY $sgpr4_sgpr5 ; defeat IsSSA detection`, could be added.

scott.linder added inline comments. Feb 1 2021, 10:29 AM

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
326	Since https://reviews.llvm.org/D92522, if you don't have a strong reason for choosing 8
390–392	When can `Next != BundleNext` hold after this `for` terminates? It seems like you should be able to just use `Next` below when re-inserting the DBG instructions?

rampitec added inline comments. Feb 1 2021, 10:33 AM

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
388	Right, it can be local to this block and cleared/destroyed after re-insertion. It will take less memory.

Address comments

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
388	This avoids reallocating on subsequent bundles if it's necessary
390–392	It doesn't make a difference
llvm/test/CodeGen/AMDGPU/soft-clause-dbg-value.mir
3	No. The pass's cleared properties now include IsSSA.

rampitec accepted this revision. Feb 1 2021, 3:21 PM

This revision is now accepted and ready to land. Feb 1 2021, 3:21 PM

41877b82f07224041a2a994f9032332fe01e4d1b

scott.linder added inline comments. Feb 2 2021, 8:39 AM

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
390–392	I disagree that it doesn't matter. After reading to this point I have to assume either (a) I don't understand the code, or (b) there is a missing `assert(BundleNext == Next);`. Can you either delete `BundleNext` or add an assert?

arsenm added inline comments. Feb 2 2021, 8:47 AM

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
390–392	BundleNext is needed for incrementing the inner loop up to Next

scott.linder added inline comments. Feb 2 2021, 2:17 PM

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
388	I didn't notice this when I first read, but it doesn't really seem necessary to buffer up the debug instrs at all. Can't we just `MBB.insert(Next, BI->removeFromParent())` below? Did you do the buffering and extra loop because it makes tracking when to end iteration simpler, and because the largest the vector can get is `amdgpu-max-memory-clause` anyway?
390–392	Thank you for changing this in the committed version!
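
The question above about moving each debug instruction directly, without a buffer, can be explored in a toy model. This is a sketch under stated assumptions, not a statement about how the LLVM pass behaves: `Inst`, `Block`, `sinkDebugInPlace`, and `names` are hypothetical, and `std::list::splice` stands in for removing an instruction and inserting it before `Next`. In this model the moved elements land inside the range that is still being walked, so the loop needs an extra `Stop` iterator to avoid visiting them again, which is the kind of end-of-iteration tracking the comment asks about.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Toy stand-in for a machine instruction (hypothetical, not an LLVM type).
struct Inst {
  std::string Name;
  bool IsDebug;
};

using Block = std::list<Inst>;

// Move every debug instruction in [First, Next) to just before Next without
// a side buffer, keeping the debug instructions in their original order.
void sinkDebugInPlace(Block &MBB, Block::iterator First, Block::iterator Next) {
  auto Stop = Next; // end of the part of the range not yet visited
  for (auto BI = First; BI != Stop;) {
    auto Cur = BI++; // step first, because Cur may be relinked below
    if (!Cur->IsDebug)
      continue;
    MBB.splice(Next, MBB, Cur); // relink Cur directly before Next
    if (Stop == Next) {
      // First move: from Cur onward the list now holds already-moved debug
      // instructions, so the walk must end at Cur instead of Next.
      if (BI == Next)
        return; // Cur was the last element; nothing is left to visit
      Stop = Cur;
    }
  }
}

// Helper for inspecting the resulting order.
std::vector<std::string> names(const Block &MBB) {
  std::vector<std::string> Out;
  for (const Inst &I : MBB)
    Out.push_back(I.Name);
  return Out;
}
```

Without `Stop`, a range such as load1, dbg1, dbg2, load2 would keep re-splicing dbg1 and dbg2 in this model and never reach `Next`. Collecting the debug instructions first and re-inserting them afterwards avoids that bookkeeping.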

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp	1 line
llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp	28 lines
llvm/test/CodeGen/AMDGPU/soft-clause-dbg-value.mir	49 lines
Diff 320605

llvm/lib/Target/AMDGPU/GCNRegPressure.cpp

(378 lines not shown; context: bool GCNDownwardRPTracker::advanceBeforeNext() {)

    MaxPressure = max(MaxPressure, CurPressure);

    return true;
  }

  void GCNDownwardRPTracker::advanceToNext() {
    LastTrackedMI = &*NextMI++;
+   NextMI = skipDebugInstructionsForward(NextMI, MBBEnd);

    // Add new registers or mask bits.
    for (const auto &MO : LastTrackedMI->operands()) {
      if (!MO.isReg() || !MO.isDef())
        continue;
      Register Reg = MO.getReg();
      if (!Reg.isVirtual())
        continue;

(98 lines not shown)
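
The one-line change to `advanceToNext()` above makes the tracker step over debug instructions after consuming an instruction. A minimal sketch of that idea, assuming a toy instruction list rather than LLVM's types: `Inst`, `Block`, `skipDebugForward`, and `DownwardTracker` are hypothetical stand-ins and carry none of the real register pressure logic.

```cpp
#include <list>
#include <string>

// Toy stand-in for a machine instruction (hypothetical, not an LLVM type).
struct Inst {
  std::string Name;
  bool IsDebug;
};

using Block = std::list<Inst>;

// Return the first position in [I, End) that is not a debug instruction,
// or End if there is none.
Block::const_iterator skipDebugForward(Block::const_iterator I,
                                       Block::const_iterator End) {
  while (I != End && I->IsDebug)
    ++I;
  return I;
}

// Minimal model of a downward tracker that walks a block front to back.
struct DownwardTracker {
  Block::const_iterator NextMI;
  Block::const_iterator End;
  const Inst *LastTracked;

  DownwardTracker(Block::const_iterator Start, Block::const_iterator E)
      : NextMI(Start), End(E), LastTracked(nullptr) {}

  // Consume the instruction at NextMI, then step over any debug
  // instructions so NextMI names a non-debug instruction or End.
  void advanceToNext() {
    LastTracked = &*NextMI++;
    NextMI = skipDebugForward(NextMI, End);
  }
};
```

With this shape, a caller that advances the tracker to a given non-debug instruction does not stall on debug instructions that sit between two loads.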

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp

(101 lines not shown)

  static bool isSMEMClauseInst(const MachineInstr &MI) {
    return SIInstrInfo::isSMRD(MI);
  }

  // There no sense to create store clauses, they do not define anything,
  // thus there is nothing to set early-clobber.
  static bool isValidClauseInst(const MachineInstr &MI, bool IsVMEMClause) {
-   if (MI.isDebugValue() || MI.isBundled())
+   assert(!MI.isDebugInstr() && "debug instructions should not reach here");
+   if (MI.isBundled())
      return false;
    if (!MI.mayLoad() || MI.mayStore())
      return false;
    if (AMDGPU::getAtomicNoRetOp(MI.getOpcode()) != -1 ||
        AMDGPU::getAtomicRetOp(MI.getOpcode()) != -1)
      return false;
    if (IsVMEMClause && !isVMEMClauseInst(MI))
      return false;

(198 lines not shown; context: bool SIFormMemoryClauses::runOnMachineFunction(MachineFunction &MF) {)

    SlotIndexes *Ind = LIS->getSlotIndexes();
    bool Changed = false;

    MaxVGPRs = TRI->getAllocatableSet(MF, &AMDGPU::VGPR_32RegClass).count();
    MaxSGPRs = TRI->getAllocatableSet(MF, &AMDGPU::SGPR_32RegClass).count();
    unsigned FuncMaxClause = AMDGPU::getIntegerAttribute(
        MF.getFunction(), "amdgpu-max-memory-clause", MaxClause);

+   SmallVector<MachineInstr *> DbgInstrs;

scott.linder (Not Done), suggested edit:

        MF.getFunction(), "amdgpu-max-memory-clause", MaxClause);
-   SmallVector<MachineInstr *, 8> DbgInstrs;
+   SmallVector<MachineInstr *> DbgInstrs;
    for (MachineBasicBlock &MBB : MF) {

Since https://reviews.llvm.org/D92522, if you don't have a strong reason for choosing 8

    for (MachineBasicBlock &MBB : MF) {
      GCNDownwardRPTracker RPT(*LIS);
      MachineBasicBlock::instr_iterator Next;
      for (auto I = MBB.instr_begin(), E = MBB.instr_end(); I != E; I = Next) {
        MachineInstr &MI = *I;
        Next = std::next(I);

+       if (MI.isDebugInstr())
+         continue;
+
        bool IsVMEM = isVMEMClauseInst(MI);

        if (!isValidClauseInst(MI, IsVMEM))
          continue;

        if (!RPT.getNext().isValid())
          RPT.reset(MI);
        else { // Advance the state to the current MI.
          RPT.advance(MachineBasicBlock::const_iterator(MI));
          RPT.advanceBeforeNext();
        }

        const GCNRPTracker::LiveRegSet LiveRegsCopy(RPT.getLiveRegs());
        RegUse Defs, Uses;
        if (!processRegUses(MI, Defs, Uses, RPT)) {
          RPT.reset(MI, &LiveRegsCopy);
          continue;
        }

        unsigned Length = 1;
        for ( ; Next != E && Length < FuncMaxClause; ++Next) {
+         // Debug instructions should not change the bundling. We need to move
+         // these after the bundle
+         if (Next->isDebugInstr())
+           continue;
+
          if (!isValidClauseInst(*Next, IsVMEM))
            break;

          // A load from pointer which was loaded inside the same bundle is an
          // impossible clause because we will need to write and read the same
          // register inside. In this case processRegUses will return false.
          if (!processRegUses(*Next, Defs, Uses, RPT))
            break;

          ++Length;
        }
        if (Length < 2) {
          RPT.reset(MI, &LiveRegsCopy);
          continue;
        }

        Changed = true;
        MFI->limitOccupancy(LastRecordedOccupancy);

        auto B = BuildMI(MBB, I, DebugLoc(), TII->get(TargetOpcode::BUNDLE));
        Ind->insertMachineInstrInMaps(*B);

        // Restore the state after processing the bundle.
        RPT.reset(*B, &LiveRegsCopy);

+       DbgInstrs.clear();

dfukalov (Not Done): Why not just create DbgInstrs here?

rampitec (Not Done): Right, it can be local to this block and cleared/destroyed after re-insertion. It will take less memory.

arsenm, author (Done): This avoids reallocating on subsequent bundles if it's necessary

scott.linder (Not Done): I didn't notice this when I first read, but it doesn't really seem necessary to buffer up the debug instrs at all. Can't we just `MBB.insert(Next, BI->removeFromParent())` below? Did you do the buffering and extra loop because it makes tracking when to end iteration simpler, and because the largest the vector can get is `amdgpu-max-memory-clause` anyway?

+       auto BundleNext = I;
+       for (auto BI = I; BI != Next; BI = BundleNext) {
+         BundleNext = std::next(BI);

scott.linder (Not Done): When can `Next != BundleNext` hold after this `for` terminates? It seems like you should be able to just use `Next` below when re-inserting the DBG instructions?

arsenm, author (Done): It doesn't make a difference

scott.linder (Not Done): I disagree that it doesn't matter. After reading to this point I have to assume either (a) I don't understand the code, or (b) there is a missing `assert(BundleNext == Next);`. Can you either delete `BundleNext` or add an assert?

arsenm, author (Done): BundleNext is needed for incrementing the inner loop up to Next

scott.linder (Not Done): Thank you for changing this in the committed version!

+
+         if (BI->isDebugValue()) {
+           DbgInstrs.push_back(BI->removeFromParent());
+           continue;
+         }
+
-       for (auto BI = I; BI != Next; ++BI) {
          BI->bundleWithPred();
          Ind->removeSingleMachineInstrFromMaps(*BI);

          for (MachineOperand &MO : BI->defs())
            if (MO.readsReg())
              MO.setIsInternalRead(true);
        }

+       // Replace any debug instructions after the new bundle.
+       for (MachineInstr *DbgInst : DbgInstrs)
+         MBB.insert(Next, DbgInst);
+
        for (auto &&R : Defs) {
          forAllLanes(R.first, R.second.second, [&R, &B](unsigned SubReg) {
            unsigned S = R.second.first | RegState::EarlyClobber;
            if (!SubReg)
              S &= ~(RegState::Undef | RegState::Dead);
            B.addDef(R.first, S, SubReg);
          });
        }

(28 lines not shown)

llvm/test/CodeGen/AMDGPU/soft-clause-dbg-value.mir

This file was added.

# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -mattr=+xnack -run-pass=si-form-memory-clauses -verify-machineinstrs -o - %s | FileCheck %s

dfukalov (Not Done): Am I right that we need `phi-node-elimination` to run first, removing the `IsSSA` property, to avoid a verifier failure? Alternatively, a dummy copy of the first instruction, `%0:sreg_64 = COPY $sgpr4_sgpr5 ; defeat IsSSA detection`, could be added.

arsenm, author (Done): No. The pass's cleared properties now include IsSSA.

# Make sure that debug instructions do not change the bundling, and
# the dbg_values which break the clause are inserted after the new
# bundles.

---
name: sgpr_clause_dbg_value
tracksRegLiveness: true
body: |
  bb.0:
    liveins: $sgpr4_sgpr5, $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9, $vgpr10, $vgpr11, $vgpr12, $vgpr13, $vgpr14, $vgpr15, $vgpr16, $vgpr17, $vgpr18, $vgpr19, $vgpr20, $vgpr21, $vgpr22, $vgpr23, $vgpr24
    ; CHECK-LABEL: name: sgpr_clause_dbg_value
    ; CHECK: liveins: $sgpr4_sgpr5, $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9, $vgpr10, $vgpr11, $vgpr12, $vgpr13, $vgpr14, $vgpr15, $vgpr16, $vgpr17, $vgpr18, $vgpr19, $vgpr20, $vgpr21, $vgpr22, $vgpr23, $vgpr24
    ; CHECK: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr4_sgpr5
    ; CHECK: early-clobber %2:sreg_32_xm0_xexec, early-clobber %1:sreg_32_xm0_xexec = BUNDLE [[COPY]] {
    ; CHECK: [[S_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 0, 0, 0 :: (load 4, addrspace 4)
    ; CHECK: [[S_LOAD_DWORD_IMM1:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 8, 0, 0 :: (load 4, addrspace 4)
    ; CHECK: }
    ; CHECK: DBG_VALUE [[S_LOAD_DWORD_IMM]], 0, 0
    ; CHECK: DBG_VALUE [[S_LOAD_DWORD_IMM1]], 0, 0
    ; CHECK: S_NOP 0
    ; CHECK: S_NOP 0
    ; CHECK: S_NOP 0
    ; CHECK: early-clobber %4:sreg_32_xm0_xexec, early-clobber %3:sreg_32_xm0_xexec, early-clobber %5:sreg_32_xm0_xexec = BUNDLE [[COPY]] {
    ; CHECK: [[S_LOAD_DWORD_IMM2:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 16, 0, 0 :: (load 4, addrspace 4)
    ; CHECK: [[S_LOAD_DWORD_IMM3:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 32, 0, 0 :: (load 4, addrspace 4)
    ; CHECK: [[S_LOAD_DWORD_IMM4:%[0-9]+]]:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM [[COPY]], 64, 0, 0 :: (load 4, addrspace 4)
    ; CHECK: }
    ; CHECK: DBG_VALUE [[S_LOAD_DWORD_IMM2]], 0, 0
    ; CHECK: DBG_VALUE [[S_LOAD_DWORD_IMM3]], 0, 0
    ; CHECK: S_ENDPGM 0, implicit [[S_LOAD_DWORD_IMM]], implicit [[S_LOAD_DWORD_IMM1]], implicit [[S_LOAD_DWORD_IMM2]], implicit [[S_LOAD_DWORD_IMM3]], implicit [[S_LOAD_DWORD_IMM4]]
    %0:sreg_64 = COPY $sgpr4_sgpr5
    %1:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %0, 0, 0, 0 :: (load 4, align 4, addrspace 4)
    DBG_VALUE %1, 0, 0
    %2:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %0, 8, 0, 0 :: (load 4, align 4, addrspace 4)
    DBG_VALUE %2, 0, 0
    S_NOP 0
    S_NOP 0
    S_NOP 0
    %3:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %0, 16, 0, 0 :: (load 4, align 4, addrspace 4)
    %4:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %0, 32, 0, 0 :: (load 4, align 4, addrspace 4)
    DBG_VALUE %3, 0, 0
    DBG_VALUE %4, 0, 0
    %5:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %0, 64, 0, 0 :: (load 4, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1, implicit %2, implicit %3, implicit %4, implicit %5

...
