Details

Reviewers

rampitec
arsenm

Summary

A 64-bit add implemented with two 32-bit add instructions uses the
$SCC register for carry. The S_LSHL_B32 instruction also writes the $SCC
register. The pass was incorrectly transforming
  S_LSHL_B32
  S_ADD_U32
  S_ADDC_U32
into
  S_ADD_U32
  S_LSHL_B32
  S_ADDC_U32

SILoadStoreOptimizer constructs a list of instructions to move and did
not take into account the dependent S_ADD_U32 instruction when it chose
to move the S_ADDC_U32 instruction.
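
To make the hazard concrete, here is a minimal sketch that models only the SCC side effects of the three instructions. It is a toy under stated assumptions, not LLVM code: `ScalarState`, `add64ShiftFirst`, and `add64ShiftBetween` are hypothetical names, and the shift is assumed to set SCC when its result is nonzero.

```cpp
#include <cassert>
#include <cstdint>

// Toy scalar state: only the SCC status bit is modeled (hypothetical type).
struct ScalarState {
  bool SCC = false;
};

// S_ADD_U32: 32-bit add, SCC receives the unsigned carry out.
uint32_t s_add_u32(ScalarState &S, uint32_t A, uint32_t B) {
  uint64_t Wide = uint64_t(A) + uint64_t(B);
  S.SCC = (Wide >> 32) != 0;
  return uint32_t(Wide);
}

// S_ADDC_U32: 32-bit add with carry in from SCC, SCC receives the carry out.
uint32_t s_addc_u32(ScalarState &S, uint32_t A, uint32_t B) {
  uint64_t Wide = uint64_t(A) + uint64_t(B) + (S.SCC ? 1u : 0u);
  S.SCC = (Wide >> 32) != 0;
  return uint32_t(Wide);
}

// S_LSHL_B32: logical shift left; assumed here to set SCC = (result != 0).
uint32_t s_lshl_b32(ScalarState &S, uint32_t A, uint32_t Shift) {
  uint32_t R = A << (Shift & 31u);
  S.SCC = R != 0;
  return R;
}

// Original order: shift, then the add/addc pair. The carry chain is intact.
uint64_t add64ShiftFirst(uint64_t A, uint64_t B, uint32_t X, uint32_t Sh) {
  ScalarState S;
  (void)s_lshl_b32(S, X, Sh);
  uint32_t Lo = s_add_u32(S, uint32_t(A), uint32_t(B));
  uint32_t Hi = s_addc_u32(S, uint32_t(A >> 32), uint32_t(B >> 32));
  return (uint64_t(Hi) << 32) | Lo;
}

// Bad order: the shift lands between add and addc and overwrites the carry.
uint64_t add64ShiftBetween(uint64_t A, uint64_t B, uint32_t X, uint32_t Sh) {
  ScalarState S;
  uint32_t Lo = s_add_u32(S, uint32_t(A), uint32_t(B));
  (void)s_lshl_b32(S, X, Sh);
  uint32_t Hi = s_addc_u32(S, uint32_t(A >> 32), uint32_t(B >> 32));
  return (uint64_t(Hi) << 32) | Lo;
}
```

In this model the bad order can both invent a carry (when the shift result is nonzero) and drop a real one (when the shift result is zero), which is why the add/addc pair has to be treated as a unit when instructions are moved.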

Event Timeline

ronlieb created this revision. Apr 9 2019, 6:15 AM

Herald added a project: Restricted Project. Apr 9 2019, 6:15 AM

Herald added subscribers: llvm-commits, jdoerfert, jfb and 3 others.

You should use computeRegisterLiveness instead of adding a more naive search for a def, which may not exist.

The testcase can also be a lot simpler (and should probably be MIR)

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll
80	I doubt you need any of this metadata

This revision now requires changes to proceed. Apr 9 2019, 6:42 AM

I will convert the test to MIR form.
I will change the patch to use computeRegisterLiveness. I will have to use a pretty large neighborhood, as in the original code where this error occurred (before running bugpoint) the s_add_u32 instruction was separated from the s_addc_u32 instruction by over 400 instructions. We will assert if we cannot find the s_add_u32 instruction, which will alert us to increase the neighborhood size. This patch will also handle the corresponding sub instructions.

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll
80	it should vanish once i convert test to MIR form

Correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions. The resulting output showed the large separation.
The default neighborhood of 10 is probably more than enough.

After looking at the suggestion of using computeRegisterLiveness, I noticed that it does not return the MI where the register in question was most recently defined.
Rather, it reports liveness within a range. I don't really see how I would use this method effectively.
The problem I am trying to solve requires identifying a specific instruction that is needed by a subsequent instruction and then adding the identified instruction to a list constructed by SILoadStoreOptimizer.

Regarding the lit test, I cleaned it up quite a bit and will post a new version shortly.
I will create a .mir test from this to test the SILoadStoreOptimizer pass directly.
Any objection to keeping the .ll test after the MIR test is added?

ronlieb updated this revision to Diff 194355. Apr 9 2019, 9:38 AM

In D60459#1459784, @ronlieb wrote:

Correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions. The resulting output showed the large separation.
The default neighborhood of 10 is probably more than enough.

This is a correctness issue, not a performance one. Any kind of threshold is an error. If you do not find the def, you probably need to bail.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
276	It can be defined in another block. It can be also undef.

ronlieb marked an inline comment as done. Apr 9 2019, 1:25 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
276	Splitting a pair of instructions across basic block boundaries in this situation seems really crazy. These instruction pairs implement a 64-bit add or a 64-bit subtract. I understand that, generally speaking, we could see both situations (split or undef). If this were to occur in this pass, I would want to assert (which is what this patch will do) so we can go look into it, rather than having broken code generated. Splitting them would mean that $SCC is live-in to the block.

rampitec added inline comments. Apr 9 2019, 1:33 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
276	Why not? What if half of that pair was hoisted out of the block into parent?

i agree it could happen. Not sure what to do about it here.

The current problem I am trying to resolve is somewhat analogous to hoisting half of the 64-bit add instruction pair, although in this particular situation we are actually sinking half of the instruction pair into a later position within the same block. And yes, I can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a predecessor BB. I realize I can write additional code to scan a previous block. However, I think it's better that passes not hoist part of an instruction pair, especially ones such as these. To that end I would rather see my patch assert, so that we are forced to deal with such a situation should it arise.
Your example, btw, is a good one for why we should have an IR test for the current problem rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the effects of a new pass, whereas the IR test attached to this patch stands a better chance of detecting the issue.

In D60459#1460563, @ronlieb wrote:

The current problem I am trying to resolve is somewhat analogous to hoisting half of the 64-bit add instruction pair, although in this particular situation we are actually sinking half of the instruction pair into a later position within the same block. And yes, I can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a predecessor BB. I realize I can write additional code to scan a previous block. However, I think it's better that passes not hoist part of an instruction pair, especially ones such as these. To that end I would rather see my patch assert, so that we are forced to deal with such a situation should it arise.
Your example, btw, is a good one for why we should have an IR test for the current problem rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the effects of a new pass, whereas the IR test attached to this patch stands a better chance of detecting the issue.

Why not just bail out of the optimization if you didn't find a def reasonably close?

i think bailing the optimization if not found within some reasonable distance (10 seems to be popular), is a good suggestion. Much better than aborting. thx
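
The "find the def, otherwise bail" idea can be sketched on a toy instruction list. This is only an illustration under assumptions: `ToyInstr`, `findSCCDefIndex`, `planMove`, and the two example blocks are hypothetical and do not correspond to anything in SILoadStoreOptimizer. The search stays inside one block and reports failure instead of asserting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a machine instruction; only SCC effects are kept.
struct ToyInstr {
  std::string Name;
  bool ReadsSCC;
  bool WritesSCC;
};

// Index of the closest earlier SCC def in the same block, or -1 if the block
// has none (for example, SCC is live-in from a predecessor or undef).
int findSCCDefIndex(const std::vector<ToyInstr> &Block, std::size_t UseIdx) {
  assert(UseIdx < Block.size() && "use index out of range");
  for (std::size_t I = UseIdx; I-- > 0;)
    if (Block[I].WritesSCC)
      return static_cast<int>(I);
  return -1;
}

// Indices that must move together with the instruction at UseIdx, def first.
// An empty result means "bail": the needed def was not found in this block.
std::vector<std::size_t> planMove(const std::vector<ToyInstr> &Block,
                                  std::size_t UseIdx) {
  std::vector<std::size_t> ToMove;
  if (Block[UseIdx].ReadsSCC) {
    int Def = findSCCDefIndex(Block, UseIdx);
    if (Def < 0)
      return {}; // do not guess; give up on the optimization
    ToMove.push_back(static_cast<std::size_t>(Def));
  }
  ToMove.push_back(UseIdx);
  return ToMove;
}

// Example: the add/addc pair sits in one block.
std::vector<ToyInstr> pairedBlock() {
  return {ToyInstr{"s_lshl_b32", false, true},
          ToyInstr{"s_add_u32", false, true},
          ToyInstr{"s_addc_u32", true, true}};
}

// Example: the s_add_u32 half was hoisted away, so SCC is live-in here.
std::vector<ToyInstr> liveInBlock() {
  return {ToyInstr{"s_addc_u32", true, true},
          ToyInstr{"global_load_dword", false, false}};
}
```

For `pairedBlock()` the plan for the s_addc_u32 contains the s_add_u32 followed by the s_addc_u32 itself; for `liveInBlock()` the plan is empty and the caller would skip the merge.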

Added check for instr match missing, and bail on optimization if so.
I prefer the .ll test we have for the patch now over that of creating an MIR test for this issue.

rampitec added inline comments. Apr 9 2019, 5:12 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
305	Does that really mean to bail? Check the uses. You also need a test where you did not find the pair, a mir test.

ronlieb marked an inline comment as done. Apr 9 2019, 6:01 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
305	i see what you mean about the bail and uses , thx. good suggestion for the mir test. will add

Added two MIR tests, and refined the logic to properly bail.

rampitec added inline comments. Apr 10 2019, 9:40 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
271	Having i and I variables in the same statement is quite misleading.
test/CodeGen/AMDGPU/scc-missing-add.mir
1 ↗	(On Diff #194514)	You can combine both mir tests into a single file and significantly reduce them. For example you do not need all of the IR.

ronlieb marked an inline comment as done. Apr 11 2019, 12:50 PM

ronlieb added inline comments.

test/CodeGen/AMDGPU/scc-missing-add.mir
1 ↗	(On Diff #194514)	after playing with trying to reduce the IR, I don't really think I can. This particular pass seems sensitive to PC relative references within the MIR that are defined within the IR, and other symbol references as well. It sort of falls into this category as described in the MIR documentation: MIR code contains a whole IR module. This is necessary because there are no equivalents in MIR for global variables, references to external functions, function attributes, metadata, debug info. Instead some MIR data references the IR constructs. You can often remove them if the test doesn’t depend on them. And the above really complicates trying to merge the two tests into one.

arsenm added inline comments. Apr 11 2019, 12:53 PM

test/CodeGen/AMDGPU/scc-has-add.mir
10–12 ↗	(On Diff #194514)	You can drop the block names, and IR references in the MMOs to drop the IR section
113–115 ↗	(On Diff #194514)	You can strip out a lot of instructions too. Usually I just create a smaller, totally artificial test case from scratch

ronlieb marked 2 inline comments as done. Apr 12 2019, 2:37 PM

ronlieb added inline comments.

test/CodeGen/AMDGPU/scc-has-add.mir
10–12 ↗	(On Diff #194514)	The SILoadStoreOptimizer pass depends on alias analysis of memory references as part of its decision to collect a group of instructions. So when I attempt to remove this information from the MIR, the test does not reproduce the issue. Similarly, removing the IR section also prevents the test from reproducing the issue, as there are PC-relative definitions needed in the MIR test. So I think I need to keep the MIR test as is, with a bit of cleanup. Also, since the scc-has-add.mir test closely replicates the scc-add-lshl-addc.ll test, I plan to keep the .ll test, drop the scc-has-add.mir test, and keep the scc-missing-add.mir test.

ronlieb updated this revision to Diff 194964. Apr 12 2019, 2:37 PM

Slightly generalized to some physical reg. Only look at the previous instruction.
The definition is either there, and we're all good, or we will bail.

Added use of LivePhysRegs, happily lifted some code Krzy wrote for Hexagon to compute getLiveRegsAt.

ronlieb retitled this revision from SILoadStoreOptimizer pass mischedules s_add,s_addc with interfering s_lshl to SILoadStoreOptimizer pass schedules s_add,s_addc with interfering s_lshl. Apr 18 2019, 8:30 AM

arsenm added inline comments. Apr 29 2019, 2:30 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
196	Capitalize
200–201	Indentation
270–271	This is going to do a ~full scan of the block for every analyzed instruction, so this ends up being O(N^2). I was thinking more of a single LivePhysRegs instance for the entire block visit, which is lazily moved to the current point as necessary
293	This may not be broad enough. It only covers full defs. Usually modifiesRegister is what you want
297–301	The idea with using LivePhysRegs is to stop using this custom PhysRegUses set
300	The implicitness doesn't matter
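
As a rough illustration of the "single liveness walk per block" suggestion, the sketch below computes SCC liveness for every position in one backward pass, so later queries are constant time instead of a rescan per instruction. It is a toy model, not the LivePhysRegs API: `ToyInstr`, `sccLiveAfter`, `canInsertSCCClobberAfter`, and `exampleBlock` are hypothetical names, and only one register is tracked.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a machine instruction; only SCC effects are kept.
struct ToyInstr {
  std::string Name;
  bool ReadsSCC;
  bool WritesSCC;
};

// One backward walk over the block. Result[I] is true when SCC is live
// immediately after instruction I, i.e. a later instruction reads SCC before
// any instruction overwrites it. LiveOut says whether SCC is live at block end.
std::vector<bool> sccLiveAfter(const std::vector<ToyInstr> &Block,
                               bool LiveOut) {
  std::vector<bool> Result(Block.size(), false);
  bool Live = LiveOut;
  for (std::size_t I = Block.size(); I-- > 0;) {
    Result[I] = Live;
    // Stepping backward over I: a read makes SCC live, a pure write kills it.
    Live = Block[I].ReadsSCC || (Live && !Block[I].WritesSCC);
  }
  return Result;
}

// An SCC-writing instruction may be placed right after Pos only if SCC is
// dead at that point.
bool canInsertSCCClobberAfter(const std::vector<bool> &LiveAfter,
                              std::size_t Pos) {
  assert(Pos < LiveAfter.size() && "position out of range");
  return !LiveAfter[Pos];
}

// Example block in the original, correct order.
std::vector<ToyInstr> exampleBlock() {
  return {ToyInstr{"s_lshl_b32", false, true},
          ToyInstr{"s_add_u32", false, true},
          ToyInstr{"s_addc_u32", true, true},
          ToyInstr{"global_load_dword", false, false}};
}
```

With this block, SCC is live only between s_add_u32 and s_addc_u32, so that is the one slot where an SCC-writing instruction such as s_lshl_b32 must not be placed. Treating any write as a kill also lines up with the remark that a full-def check may be too narrow.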

I wonder if this is related to D61313?

nhaehnle mentioned this in D61553: AMDGPU: Fix ds_{read,write}2_b64 on SI/gfx6. May 7 2019, 4:27 AM

Superseded by D61313

Diff 194355

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

(187 lines not shown)
 private:
   void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr);
   /// Promotes constant offset to the immediate by adjusting the base. It
   /// tries to use a base from the nearby instructions that allows it to have
   /// a 13bit constant offset which gets promoted to the immediate.
   bool promoteConstantOffsetToImm(MachineInstr &CI,
                                   MemInfoMap &Visited,
                                   SmallPtrSet<MachineInstr *, 4> &Promoted);

 public:
   static char ID;

   SILoadStoreOptimizer() : MachineFunctionPass(ID) {
     initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());
   }

   bool optimizeBlock(MachineBasicBlock &MBB);

   bool runOnMachineFunction(MachineFunction &MF) override;

   StringRef getPassName() const override { return "SI Load Store Optimizer"; }

   void getAnalysisUsage(AnalysisUsage &AU) const override {
(48 lines not shown)
 static bool memAccessesCanBeReordered(MachineBasicBlock::iterator A,
                                       MachineBasicBlock::iterator B,
                                       AliasAnalysis *AA) {
   // RAW or WAR - cannot reorder
   // WAW - cannot reorder
   // RAR - safe to reorder
   return !(A->mayStore() || B->mayStore()) || !A->mayAlias(AA, *B, true);
 }

+// Find the associated instruction which sets SCC for an MI.
+static MachineInstr *addSCCDependInstr(MachineInstr &MI) {
+  if (!MI.hasRegisterImplicitUseOperand(AMDGPU::SCC))
+    return nullptr;
+
+  MachineBasicBlock::reverse_iterator I = MI, E = MI.getParent()->rend();
+  I++;
+  for (; I != E; ++I)
+    if (I->definesRegister(AMDGPU::SCC))
+      return &*I;
+  assert(0 && "Failed to find carry instr");
+  return nullptr;
+}
+
 // Add MI and its defs to the lists if MI reads one of the defs that are
 // already in the list. Returns true in that case.
 static bool addToListsIfDependent(MachineInstr &MI, DenseSet<unsigned> &RegDefs,
                                   DenseSet<unsigned> &PhysRegUses,
                                   SmallVectorImpl<MachineInstr *> &Insts) {
   for (MachineOperand &Use : MI.operands()) {
     // If one of the defs is read, then there is a use of Def between I and the
     // instruction that I will potentially be merged with. We will need to move
     // this instruction after the merged instructions.
     //
     // Similarly, if there is a def which is read by an instruction that is to
     // be moved for merging, then we need to move the def-instruction as well.
     // This can only happen for physical registers such as M0; virtual
     // registers are in SSA form.
     if (Use.isReg() &&
         ((Use.readsReg() && RegDefs.count(Use.getReg())) ||
          (Use.isDef() && TargetRegisterInfo::isPhysicalRegister(Use.getReg()) &&
           PhysRegUses.count(Use.getReg())))) {
+      // If this MI depends on SCC, find and add defining instr.
+      MachineInstr *Prev = addSCCDependInstr(MI);
+      if (Prev)
+        Insts.push_back(&*Prev);
       Insts.push_back(&MI);
       addDefsUsesToList(MI, RegDefs, PhysRegUses);
       return true;
     }
   }

   return false;
 }

 static bool canMoveInstsAcrossMemOp(MachineInstr &MemOp,
                                     ArrayRef<MachineInstr *> InstsToMove,
                                     AliasAnalysis *AA) {
(1,244 lines not shown)

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll

This file was added.

				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 %s -o - | FileCheck -check-prefix=CHECK %s

				; CHECK: s_add_u32
				; CHECK: s_addc_u32
				; CHECK: s_add_u32
				; CHECK: s_addc_u32
				; CHECK: s_add_u32
				; CHECK-NOT: s_lshl_b32
				; CHECK: s_addc_u32
				; CHECK: global_load_dword

				%0 = type { [32 x %1], [32 x %1*], i32, [32 x i32], i32, [8 x i8] }
				%1 = type { %2, [1024 x %3], [1024 x %3*], %10, [1024 x i32], [1024 x i64], [1024 x i64], [1024 x i64], [1024 x i64] }
				%2 = type { %3, %6, i64, [8 x i8], [64 x %7], [1 x %9] }
				%3 = type { %4, %5, %3* }
				%4 = type { i64, i64, i64, i64, i32 }
				%5 = type { i8, i8, i16, i16, i16, i16, i64 }
				%6 = type { %3 }
				%7 = type { %8, %8, i8, i8, [16384 x i8] }
				%8 = type { %8, %8, i8, i8, [0 x i8] }
				%9 = type { %8, %8, i8, i8, [256 x i8] }
				%10 = type { [1024 x i16] }
				%11 = type <{ [20 x i8], i8*, i32, [4 x i8] }>

				@omptarget_nvptx_device_State = external addrspace(1) externally_initialized global [64 x %0], align 16
				@usedSlotIdx = external local_unnamed_addr addrspace(3) externally_initialized global i32, align 4
				@execution_param = external local_unnamed_addr addrspace(3) externally_initialized global i32, align 4
				@omptarget_nvptx_globalArgs = external addrspace(3) externally_initialized global %11, align 8

				define amdgpu_kernel void @__omp_offloading_802_d9e513_main_l28([992 x i32] addrspace(1)* %arg) local_unnamed_addr {
				bb:
				%tmp = tail call i64 @__ockl_get_local_size()
				%tmp1 = trunc i64 %tmp to i32
				br i1 undef, label %bb2, label %bb3

				bb2: ; preds = %bb
				ret void

				bb3: ; preds = %bb
				%tmp4 = load i32, i32 addrspace(3)* @execution_param, align 4
				%tmp5 = and i32 %tmp4, 1
				%tmp6 = icmp eq i32 %tmp5, 0
				%tmp7 = select i1 %tmp6, i32 0, i32 %tmp1
				%tmp8 = trunc i32 %tmp7 to i16
				store i16 %tmp8, i16* undef, align 2
				%tmp9 = getelementptr inbounds %1, %1* null, i64 0, i32 0, i32 4, i64 0, i32 3
				store i8* undef, i8** %tmp9, align 8
				store i8** getelementptr (%11, %11* addrspacecast (%11 addrspace(3)* @omptarget_nvptx_globalArgs to %11), i64 0, i32 0, i64 0), i8* addrspace(3)* getelementptr inbounds (%11, %11 addrspace(3)* @omptarget_nvptx_globalArgs, i32 0, i32 1), align 8
				%tmp10 = tail call i32 @llvm.amdgcn.workgroup.id.x()
				%tmp11 = sext i32 %tmp10 to i64
				%tmp12 = getelementptr inbounds [992 x i32], [992 x i32] addrspace(1)* %arg, i64 0, i64 %tmp11
				%tmp13 = load i32, i32 addrspace(1)* %tmp12, align 4
				%tmp14 = add nsw i32 %tmp13, %tmp10
				store i32 %tmp14, i32 addrspace(1)* %tmp12, align 4
				%tmp15 = load i32, i32 addrspace(3)* @usedSlotIdx, align 4
				%tmp16 = sext i32 %tmp15 to i64
				%tmp17 = getelementptr inbounds [64 x %0], [64 x %0] addrspace(1)* @omptarget_nvptx_device_State, i64 0, i64 %tmp16, i32 3, i64 undef
				%tmp18 = addrspacecast i32 addrspace(1)* %tmp17 to i32*
				%tmp19 = atomicrmw volatile add i32* %tmp18, i32 0 seq_cst
				unreachable
				}

				declare i64 @__ockl_get_local_size() local_unnamed_addr
				declare i32 @llvm.amdgcn.workgroup.id.x()

This is an archive of the discontinued LLVM Phabricator instance.

SILoadStoreOptimizer pass schedules s_add,s_addc with interfering s_lshl
Abandoned · Public