Details

Reviewers

rampitec
arsenm

Summary

A 64-bit add implemented with two 32-bit add instructions uses the
$SCC register for the carry. The S_LSHL_B32 instruction also writes the $SCC
register. The pass was incorrectly transforming
  S_LSHL_B32
  S_ADD_U32
  S_ADDC_U32
into
  S_ADD_U32
  S_LSHL_B32
  S_ADDC_U32

SILoadStoreOptimizer constructs a list of instructions to move and did
not take into account the dependent S_ADD_U32 instruction when it chose
to move the S_ADDC_U32 instruction.
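
The hazard described in the summary can be reproduced with a small standalone model. This is only a sketch: `ScalarState`, the three helper functions and the two `add64...` wrappers are hypothetical names invented for illustration, and the assumption that S_LSHL_B32 sets SCC when its result is non-zero reflects a reading of the ISA rather than anything stated in this review.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the scalar carry flag. Hypothetical helper, not an LLVM type.
struct ScalarState {
  bool SCC = false;
};

// S_ADD_U32: 32-bit add; SCC receives the carry-out.
inline std::uint32_t sAddU32(ScalarState &St, std::uint32_t A, std::uint32_t B) {
  std::uint64_t Wide = std::uint64_t(A) + B;
  St.SCC = (Wide >> 32) != 0;
  return static_cast<std::uint32_t>(Wide);
}

// S_ADDC_U32: 32-bit add with carry-in from SCC; SCC receives the carry-out.
inline std::uint32_t sAddcU32(ScalarState &St, std::uint32_t A, std::uint32_t B) {
  std::uint64_t Wide = std::uint64_t(A) + B + (St.SCC ? 1u : 0u);
  St.SCC = (Wide >> 32) != 0;
  return static_cast<std::uint32_t>(Wide);
}

// S_LSHL_B32: shift left. Assumption: SCC is set when the result is non-zero.
inline std::uint32_t sLshlB32(ScalarState &St, std::uint32_t A, std::uint32_t Shift) {
  std::uint32_t R = A << (Shift & 31u);
  St.SCC = R != 0;
  return R;
}

// Original order: S_LSHL_B32, S_ADD_U32, S_ADDC_U32.
inline std::uint64_t add64ShiftFirst(std::uint64_t X, std::uint64_t Y,
                                     std::uint32_t V, std::uint32_t Sh) {
  ScalarState St;
  (void)sLshlB32(St, V, Sh);
  std::uint32_t Lo = sAddU32(St, static_cast<std::uint32_t>(X),
                             static_cast<std::uint32_t>(Y));
  std::uint32_t Hi = sAddcU32(St, static_cast<std::uint32_t>(X >> 32),
                              static_cast<std::uint32_t>(Y >> 32));
  return (std::uint64_t(Hi) << 32) | Lo;
}

// Broken order: S_ADD_U32, S_LSHL_B32, S_ADDC_U32. The shift overwrites the
// carry produced by the low add before the high add reads it.
inline std::uint64_t add64ShiftBetween(std::uint64_t X, std::uint64_t Y,
                                       std::uint32_t V, std::uint32_t Sh) {
  ScalarState St;
  std::uint32_t Lo = sAddU32(St, static_cast<std::uint32_t>(X),
                             static_cast<std::uint32_t>(Y));
  (void)sLshlB32(St, V, Sh);
  std::uint32_t Hi = sAddcU32(St, static_cast<std::uint32_t>(X >> 32),
                              static_cast<std::uint32_t>(Y >> 32));
  return (std::uint64_t(Hi) << 32) | Lo;
}
```

In this model the broken order can both drop a real carry (when the shift result is zero) and inject a spurious one (when the shift result is non-zero).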

Diff Detail

Event Timeline

ronlieb created this revision. Apr 9 2019, 6:15 AM

Herald added a project: Restricted Project. Apr 9 2019, 6:15 AM

Herald added subscribers: llvm-commits, jdoerfert, jfb and 3 others.

You should use computeRegisterLiveness instead of adding a more naive search for a def, which may not exist.

The testcase can also be a lot simpler (and should probably be MIR).

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll
80	I doubt you need any of this metadata

This revision now requires changes to proceed. Apr 9 2019, 6:42 AM

I will convert the test to MIR form.
I will change the patch to use computeRegisterLiveness. I will have to use a pretty large neighborhood, as the original code this error occurred in (before running bugpoint) had the s_add_u32 instruction separated from the s_addc_u32 instruction by over 400 instructions. We will assert-fail if we cannot find the s_add_u32 instruction, which will alert us to increase the neighborhood size. This patch will also handle the corresponding sub instructions.

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll
80	it should vanish once i convert test to MIR form

correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions The resultant output showed the large separation.
The default neighborhood of 10 is probably more than enough.

After looking at the suggestion to use computeRegisterLiveness, I noticed that it does not return the MI where the register in question was most recently defined.
Rather, it reports liveness within a range. I don't really see how I would use this method effectively.
The problem I am trying to solve requires identifying a specific instruction that is needed by a subsequent instruction and then adding the identified instruction to a list constructed by SILoadStoreOptimizer.

Regarding the lit test, I cleaned it up quite a bit and will post a new version shortly.
I will create a .mir test from this to test the SILoadStoreOptimizer pass directly.
Any objection to keeping the .ll test after the MIR test is added?

ronlieb updated this revision to Diff 194355. Apr 9 2019, 9:38 AM

In D60459#1459784, @ronlieb wrote:

correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions The resultant output showed the large separation.
The default neighborhood of 10 is probably more than enough.

This is a correctness issue, not a performance one. Any kind of threshold is an error. If you did not find the def, you probably need to bail.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
284	It can be defined in another block. It can be also undef.

ronlieb marked an inline comment as done. Apr 9 2019, 1:25 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
284	Splitting a pair of instructions across basic block boundaries in this situation seems really crazy. These instruction pairs implement a 64-bit add or 64-bit subtract. I understand that, generally speaking, we could see both situations (split or undef). If this were to occur in this pass, I would want to assert (which is what this patch will do) so we can go look into it, rather than have broken code generated. To split them would mean that $SCC is live-in to the block.

rampitec added inline comments. Apr 9 2019, 1:33 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
284	Why not? What if half of that pair was hoisted out of the block into parent?

I agree it could happen. Not sure what to do about it here.

The current problem I am trying to resolve is somewhat analogous to hoisting one half of the 64-bit add instruction pair, although in this particular situation we are actually sinking one half of the pair into a later position within the same block. And yes, I can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a predecessor block. I realize I can write additional code to scan a previous block. However, I think it's better that passes not hoist part of an instruction pair, especially ones such as these. To that end I would rather see my patch assert so that we are forced to deal with such a situation should it arise.
Your example, by the way, is a good one for why we should have an IR test for the current problem rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the effects of a new pass, whereas the IR test attached to this patch stands a better chance of detecting the issue.

In D60459#1460563, @ronlieb wrote:

The current problem I am trying to resolve is somewhat analogous to hoisting one half of the 64-bit add instruction pair, although in this particular situation we are actually sinking one half of the pair into a later position within the same block. And yes, I can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a predecessor block. I realize I can write additional code to scan a previous block. However, I think it's better that passes not hoist part of an instruction pair, especially ones such as these. To that end I would rather see my patch assert so that we are forced to deal with such a situation should it arise.
Your example, by the way, is a good one for why we should have an IR test for the current problem rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the effects of a new pass, whereas the IR test attached to this patch stands a better chance of detecting the issue.

Why not just bail out of the optimization if you didn't find a def reasonably close?

i think bailing the optimization if not found within some reasonable distance (10 seems to be popular), is a good suggestion. Much better than aborting. thx

Added check for instr match missing, and bail on optimization if so.
I prefer the .ll test we have for the patch now over that of creating an MIR test for this issue.

rampitec added inline comments. Apr 9 2019, 5:12 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
333	Does that really mean to bail? Check the uses. You also need a test where you did not find the pair, a mir test.

ronlieb marked an inline comment as done. Apr 9 2019, 6:01 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
333	i see what you mean about the bail and uses , thx. good suggestion for the mir test. will add

Added two MIR tests, and refined the logic to bail properly.

rampitec added inline comments. Apr 10 2019, 9:40 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
279	Having i and I variables in the same statement is quite misleading.
test/CodeGen/AMDGPU/scc-missing-add.mir
2	You can combine both mir tests into a single file and significantly reduce them. For example you do not need all of the IR.

ronlieb marked an inline comment as done. Apr 11 2019, 12:50 PM

ronlieb added inline comments.

test/CodeGen/AMDGPU/scc-missing-add.mir
2	after playing with trying to reduce the IR, I don't really think I can. This particular pass seems sensitive to PC relative references within the MIR that are defined within the IR, and other symbol references as well. It sort of falls into this category as described in the MIR documentation: MIR code contains a whole IR module. This is necessary because there are no equivalents in MIR for global variables, references to external functions, function attributes, metadata, debug info. Instead some MIR data references the IR constructs. You can often remove them if the test doesn’t depend on them. And the above really complicates trying to merge the two tests into one.

arsenm added inline comments. Apr 11 2019, 12:53 PM

test/CodeGen/AMDGPU/scc-has-add.mir
10–12 ↗	(On Diff #194514)	You can drop the block names, and IR references in the MMOs to drop the IR section
113–115 ↗	(On Diff #194514)	You can strip out a lot of instructions too. Usually I just create a smaller, totally artificial test case from scratch

ronlieb marked 2 inline comments as done. Apr 12 2019, 2:37 PM

ronlieb added inline comments.

test/CodeGen/AMDGPU/scc-has-add.mir
10–12 ↗	(On Diff #194514)	The SILoadStoreOptimizer pass depends on alias analysis of memory references as part of its decision to collect a group of instructions. So when I attempt to remove this information from the MIR, the test does not reproduce the issue. Similarly, trying to remove the IR section also prevents the issue from reproducing, as there are PC-relative definitions needed in the MIR test. So I think I need to keep the MIR test as is, with a bit of cleanup. Also, since the scc-has-add.mir test closely replicates the scc-add-lshl-addc.ll test, I plan to keep the .ll test, drop the scc-has-add.mir test, and keep the scc-missing-add.mir test.

ronlieb updated this revision to Diff 194964. Apr 12 2019, 2:37 PM

Slightly generalized to some physical reg. Only look at the previous instruction.
The definition is either there, and we're all good, or we will bail.
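
The three-way outcome discussed above (dependent and added, not dependent, or bail) can be sketched on a toy block. This is a deliberately conservative illustration, not the code in the diff: `ToyInstr`, `AddStatus`, `contains`, `addToListIfDependent` and the convention that negative ids stand for physical registers are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy types; these are not LLVM classes.
constexpr int SCC = -1; // negative ids stand in for physical registers

struct ToyInstr {
  std::vector<int> Defs; // registers written
  std::vector<int> Uses; // registers read
};

enum class AddStatus { Added, NotDependent, Bail };

static bool contains(const std::vector<int> &V, int Reg) {
  return std::find(V.begin(), V.end(), Reg) != V.end();
}

// If Block[Idx] reads a register defined by an instruction already scheduled
// to move, add it to ToMove. When it also reads a physical register, the
// immediately preceding instruction must define that register and is moved
// too; otherwise report Bail so the caller abandons the merge (and discards
// ToMove and RegDefs).
AddStatus addToListIfDependent(const std::vector<ToyInstr> &Block,
                               std::size_t Idx, std::set<int> &RegDefs,
                               std::vector<std::size_t> &ToMove) {
  const ToyInstr &MI = Block[Idx];
  bool Dependent = false;
  for (int Reg : MI.Uses)
    if (RegDefs.count(Reg))
      Dependent = true;
  if (!Dependent)
    return AddStatus::NotDependent;

  for (int Reg : MI.Uses) {
    if (Reg >= 0)
      continue; // virtual register, nothing extra to check in this sketch
    if (Idx == 0 || !contains(Block[Idx - 1].Defs, Reg))
      return AddStatus::Bail;
    if (std::find(ToMove.begin(), ToMove.end(), Idx - 1) == ToMove.end()) {
      ToMove.push_back(Idx - 1);
      RegDefs.insert(Block[Idx - 1].Defs.begin(), Block[Idx - 1].Defs.end());
    }
  }
  ToMove.push_back(Idx);
  RegDefs.insert(MI.Defs.begin(), MI.Defs.end());
  return AddStatus::Added;
}
```

With a block modelling S_LSHL_B32, S_ADD_U32, S_ADDC_U32, the add is pulled into the move list together with the addc, so the pair stays adjacent. If an unrelated instruction sits directly before the addc, the sketch reports `Bail`.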

Added use of LivePhysRegs, happily lifted some code Krzy wrote for Hexagon to compute getLiveRegsAt.

ronlieb retitled this revision from SILoadStoreOptimizer pass mischedules s_add,s_addc with interfering s_lshl to SILoadStoreOptimizer pass schedules s_add,s_addc with interfering s_lshl. Apr 18 2019, 8:30 AM

arsenm added inline comments. Apr 29 2019, 2:30 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
197	Capitalize
201–202	Indentation
278–279	This is going to do a ~full scan of the block for every analyzed instruction, so this ends up being O(N^2). I was thinking more a single LivePhysReg instance for the entire block visit, which is lazily moved to the current point as necessary
301	This may not be broad enough. It only covers full defs. Usually modifiesRegister is what you want
325	The idea with using LivePhysRegs is to stop using this custom PhysRegUses set
328	The implicitness doesn't matter
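
The complexity concern raised at lines 278–279 can be illustrated with a toy liveness tracker that is created once per block and only stepped as far as each query needs. This is a sketch under stated assumptions: `SimpleInstr` and `LazyLiveness` are hypothetical and do not reproduce the LivePhysRegs interface, and this version only supports queries that move backward through the block.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy instruction; not an LLVM class.
struct SimpleInstr {
  std::vector<int> Defs; // registers written
  std::vector<int> Uses; // registers read
};

// One tracker per block. Live always describes the point just before
// Block[Cursor]; Cursor == Block.size() means the block's live-outs.
class LazyLiveness {
  const std::vector<SimpleInstr> &Block;
  std::set<int> Live;
  std::size_t Cursor;
  std::size_t Steps = 0;

public:
  LazyLiveness(const std::vector<SimpleInstr> &B, std::set<int> LiveOuts)
      : Block(B), Live(std::move(LiveOuts)), Cursor(B.size()) {}

  // Registers live immediately before Block[Idx]. Queries must not move
  // forward again, so each instruction is stepped over at most once.
  const std::set<int> &liveBefore(std::size_t Idx) {
    assert(Idx <= Cursor && "this sketch only supports backward queries");
    while (Cursor > Idx) {
      --Cursor;
      for (int D : Block[Cursor].Defs)
        Live.erase(D); // a def ends the live range when walking backward
      for (int U : Block[Cursor].Uses)
        Live.insert(U); // a use makes the register live above this point
      ++Steps;
    }
    return Live;
  }

  std::size_t steps() const { return Steps; }
};
```

Across all queries on a block of N instructions the tracker performs at most N steps, whereas rebuilding the live set from the block's end for every query costs a full walk each time.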

I wonder if this is related to D61313?

nhaehnle mentioned this in D61553: AMDGPU: Fix ds_{read,write}2_b64 on SI/gfx6. May 7 2019, 4:27 AM

Superseded by D61313

Diff 195750

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

#include "MCTargetDesc/AMDGPUMCTargetDesc.h"		#include "MCTargetDesc/AMDGPUMCTargetDesc.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "SIRegisterInfo.h"		#include "SIRegisterInfo.h"
#include "Utils/AMDGPUBaseInfo.h"		#include "Utils/AMDGPUBaseInfo.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
		#include "llvm/CodeGen/LivePhysRegs.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"		#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"		#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineOperand.h"		#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/IR/DebugLoc.h"		#include "llvm/IR/DebugLoc.h"
▲ Show 20 Lines • Show All 107 Lines • ▼ Show 20 Lines	private:
void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr);		void processBaseWithConstOffset(const MachineOperand &Base, MemAddress &Addr);
/// Promotes constant offset to the immediate by adjusting the base. It		/// Promotes constant offset to the immediate by adjusting the base. It
/// tries to use a base from the nearby instructions that allows it to have		/// tries to use a base from the nearby instructions that allows it to have
/// a 13bit constant offset which gets promoted to the immediate.		/// a 13bit constant offset which gets promoted to the immediate.
bool promoteConstantOffsetToImm(MachineInstr &CI,		bool promoteConstantOffsetToImm(MachineInstr &CI,
MemInfoMap &Visited,		MemInfoMap &Visited,
SmallPtrSet<MachineInstr *, 4> &Promoted);		SmallPtrSet<MachineInstr *, 4> &Promoted);

		// used to extend addToListsIfDependent to express Bailing.
		enum AddToStat {AddToTrue, AddToFalse, AddToBail };
		AddToStat addToListsIfDependent(MachineInstr &MI,
		DenseSet<unsigned> &RegDefs,
		DenseSet<unsigned> &PhysRegUses,
		SmallVectorImpl<MachineInstr *> &Insts);

public:		public:
static char ID;		static char ID;

SILoadStoreOptimizer() : MachineFunctionPass(ID) {		SILoadStoreOptimizer() : MachineFunctionPass(ID) {
initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());		initializeSILoadStoreOptimizerPass(*PassRegistry::getPassRegistry());
}		}

bool optimizeBlock(MachineBasicBlock &MBB);		bool optimizeBlock(MachineBasicBlock &MBB);
▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	static bool memAccessesCanBeReordered(MachineBasicBlock::iterator A,
MachineBasicBlock::iterator B,		MachineBasicBlock::iterator B,
AliasAnalysis *AA) {		AliasAnalysis *AA) {
// RAW or WAR - cannot reorder		// RAW or WAR - cannot reorder
// WAW - cannot reorder		// WAW - cannot reorder
// RAR - safe to reorder		// RAR - safe to reorder
return !(A->mayStore() \|\| B->mayStore()) \|\| !A->mayAlias(AA, *B, true);		return !(A->mayStore() \|\| B->mayStore()) \|\| !A->mayAlias(AA, *B, true);
}		}

		static void getLiveRegsAt(LivePhysRegs &Regs, const MachineInstr &MI) {
		const MachineBasicBlock &B = *MI.getParent();
		Regs.addLiveOuts(B);
		auto E = ++MachineBasicBlock::const_iterator(MI.getIterator()).getReverse();
		for (auto I = B.rbegin(); I != E; ++I)
		Regs.stepBackward(*I);
		}

		// Get the adjacent instruction which defines physical Reg used by this MI.
		static MachineInstr *getPhysRegAdjacentInstr(MachineInstr &MI, unsigned Reg,
		const SIRegisterInfo *TRI,
		MachineRegisterInfo *MRI) {
		if (!TargetRegisterInfo::isPhysicalRegister(Reg))
		return nullptr;
		// if Reg available at MI, then reg is not live.
		LivePhysRegs LiveAtMI(*TRI);
		getLiveRegsAt(LiveAtMI, MI);
		if (LiveAtMI.available(*MRI, Reg))
		return nullptr;
		// Only look at previous instruction for the defining instr.
		MachineBasicBlock::reverse_iterator I = MI;
		I++;
		// If Reg is not available at I, then reg is not live.
		getLiveRegsAt(LiveAtMI, *I);
		if (!LiveAtMI.available(*MRI, Reg))
		return nullptr;
		// Reg is live, does this instr define it?
		if (I->definesRegister(Reg))
		return &*I;
		return nullptr;
		}

// Add MI and its defs to the lists if MI reads one of the defs that are		// Add MI and its defs to the lists if MI reads one of the defs that are
// already in the list. Returns true in that case.		// already in the list. Returns true in that case.
static bool addToListsIfDependent(MachineInstr &MI, DenseSet<unsigned> &RegDefs,		SILoadStoreOptimizer::AddToStat SILoadStoreOptimizer::addToListsIfDependent(
		MachineInstr &MI, DenseSet<unsigned> &RegDefs,
DenseSet<unsigned> &PhysRegUses,		DenseSet<unsigned> &PhysRegUses,
SmallVectorImpl<MachineInstr *> &Insts) {		SmallVectorImpl<MachineInstr *> &Insts) {

for (MachineOperand &Use : MI.operands()) {		for (MachineOperand &Use : MI.operands()) {
// If one of the defs is read, then there is a use of Def between I and the		// If one of the defs is read, then there is a use of Def between I and the
// instruction that I will potentially be merged with. We will need to move		// instruction that I will potentially be merged with. We will need to move
// this instruction after the merged instructions.		// this instruction after the merged instructions.
//		//
// Similarly, if there is a def which is read by an instruction that is to		// Similarly, if there is a def which is read by an instruction that is to
// be moved for merging, then we need to move the def-instruction as well.		// be moved for merging, then we need to move the def-instruction as well.
// This can only happen for physical registers such as M0; virtual		// This can only happen for physical registers such as M0; virtual
// registers are in SSA form.		// registers are in SSA form.
if (Use.isReg() &&		if (Use.isReg() &&
((Use.readsReg() && RegDefs.count(Use.getReg())) \|\|		((Use.readsReg() && RegDefs.count(Use.getReg())) \|\|
(Use.isDef() && TargetRegisterInfo::isPhysicalRegister(Use.getReg()) &&		(Use.isDef() && TargetRegisterInfo::isPhysicalRegister(Use.getReg()) &&
PhysRegUses.count(Use.getReg())))) {		PhysRegUses.count(Use.getReg())))) {
		// If this MI depends on a physReg such as SCC, find and add defining
		// instr. If not found, bail on this optimization.
		if (Use.isImplicit() &&
		TargetRegisterInfo::isPhysicalRegister(Use.getReg())) {
		MachineInstr *Prev = getPhysRegAdjacentInstr(MI, Use.getReg(),
		TRI, MRI);
		if (Prev)
		Insts.push_back(&*Prev);
		else
		return AddToBail;
		}
Insts.push_back(&MI);		Insts.push_back(&MI);
addDefsUsesToList(MI, RegDefs, PhysRegUses);		addDefsUsesToList(MI, RegDefs, PhysRegUses);
return true;		return AddToTrue;
}		}
}		}

return false;		return AddToFalse;
}		}

static bool canMoveInstsAcrossMemOp(MachineInstr &MemOp,		static bool canMoveInstsAcrossMemOp(MachineInstr &MemOp,
ArrayRef<MachineInstr *> InstsToMove,		ArrayRef<MachineInstr *> InstsToMove,
AliasAnalysis *AA) {		AliasAnalysis *AA) {
assert(MemOp.mayLoadOrStore());		assert(MemOp.mayLoadOrStore());

for (MachineInstr *InstToMove : InstsToMove) {		for (MachineInstr *InstToMove : InstsToMove) {
▲ Show 20 Lines • Show All 273 Lines • ▼ Show 20 Lines	if ((getInstClass(MBBI->getOpcode()) != InstClass) \|\|
CI.InstsToMove.push_back(&*MBBI);		CI.InstsToMove.push_back(&*MBBI);
addDefsUsesToList(*MBBI, RegDefsToMove, PhysRegUsesToMove);		addDefsUsesToList(*MBBI, RegDefsToMove, PhysRegUsesToMove);
continue;		continue;
}		}

// When we match I with another DS instruction we will be moving I down		// When we match I with another DS instruction we will be moving I down
// to the location of the matched instruction any uses of I will need to		// to the location of the matched instruction any uses of I will need to
// be moved down as well.		// be moved down as well.
addToListsIfDependent(*MBBI, RegDefsToMove, PhysRegUsesToMove,		AddToStat AStat = addToListsIfDependent(*MBBI, RegDefsToMove,
		PhysRegUsesToMove,
CI.InstsToMove);		CI.InstsToMove);
		if (AStat == AddToBail)
		return false;
continue;		continue;
}		}

// Don't merge volatiles.		// Don't merge volatiles.
if (MBBI->hasOrderedMemoryRef())		if (MBBI->hasOrderedMemoryRef())
return false;		return false;

// Handle a case like		// Handle a case like
// DS_WRITE_B32 addr, v, idx0		// DS_WRITE_B32 addr, v, idx0
// w = DS_READ_B32 addr, idx0		// w = DS_READ_B32 addr, idx0
// DS_WRITE_B32 addr, f(w), idx1		// DS_WRITE_B32 addr, f(w), idx1
// where the DS_READ_B32 ends up in InstsToMove and therefore prevents		// where the DS_READ_B32 ends up in InstsToMove and therefore prevents
// merging of the two writes.		// merging of the two writes.
if (addToListsIfDependent(*MBBI, RegDefsToMove, PhysRegUsesToMove,		AddToStat AStat = addToListsIfDependent(*MBBI, RegDefsToMove,
CI.InstsToMove))		PhysRegUsesToMove,
		CI.InstsToMove);
		if (AStat == AddToTrue)
continue;		continue;
		if (AStat == AddToBail)
		return false;

bool Match = true;		bool Match = true;
for (unsigned i = 0; i < NumAddresses; i++) {		for (unsigned i = 0; i < NumAddresses; i++) {
const MachineOperand &AddrRegNext = MBBI->getOperand(AddrIdx[i]);		const MachineOperand &AddrRegNext = MBBI->getOperand(AddrIdx[i]);

if (AddrReg[i]->isImm() \|\| AddrRegNext.isImm()) {		if (AddrReg[i]->isImm() \|\| AddrRegNext.isImm()) {
if (AddrReg[i]->isImm() != AddrRegNext.isImm() \|\|		if (AddrReg[i]->isImm() != AddrRegNext.isImm() \|\|
AddrReg[i]->getImm() != AddrRegNext.getImm()) {		AddrReg[i]->getImm() != AddrRegNext.getImm()) {

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll

This file was added.

				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 %s -o - \| FileCheck -check-prefix=CHECK %s

				; CHECK: s_add_u32
				; CHECK: s_addc_u32
				; CHECK: s_add_u32
				; CHECK: s_addc_u32
				; CHECK: s_add_u32
				; CHECK-NOT: s_lshl_b32
				; CHECK: s_addc_u32
				; CHECK: global_load_dword

				%0 = type { [32 x %1], [32 x %1*], i32, [32 x i32], i32, [8 x i8] }
				%1 = type { %2, [1024 x %3], [1024 x %3*], %10, [1024 x i32], [1024 x i64], [1024 x i64], [1024 x i64], [1024 x i64] }
				%2 = type { %3, %6, i64, [8 x i8], [64 x %7], [1 x %9] }
				%3 = type { %4, %5, %3* }
				%4 = type { i64, i64, i64, i64, i32 }
				%5 = type { i8, i8, i16, i16, i16, i16, i64 }
				%6 = type { %3 }
				%7 = type { %8, %8, i8, i8, [16384 x i8] }
				%8 = type { %8, %8, i8, i8, [0 x i8] }
				%9 = type { %8, %8, i8, i8, [256 x i8] }
				%10 = type { [1024 x i16] }
				%11 = type <{ [20 x i8], i8*, i32, [4 x i8] }>

				@omptarget_nvptx_device_State = external addrspace(1) externally_initialized global [64 x %0], align 16
				@usedSlotIdx = external local_unnamed_addr addrspace(3) externally_initialized global i32, align 4
				@execution_param = external local_unnamed_addr addrspace(3) externally_initialized global i32, align 4
				@omptarget_nvptx_globalArgs = external addrspace(3) externally_initialized global %11, align 8

				define amdgpu_kernel void @__omp_offloading_802_d9e513_main_l28([992 x i32] addrspace(1)* %arg) local_unnamed_addr {
				bb:
				%tmp = tail call i64 @__ockl_get_local_size()
				%tmp1 = trunc i64 %tmp to i32
				br i1 undef, label %bb2, label %bb3

				bb2: ; preds = %bb
				ret void

				bb3: ; preds = %bb
				%tmp4 = load i32, i32 addrspace(3)* @execution_param, align 4
				%tmp5 = and i32 %tmp4, 1
				%tmp6 = icmp eq i32 %tmp5, 0
				%tmp7 = select i1 %tmp6, i32 0, i32 %tmp1
				%tmp8 = trunc i32 %tmp7 to i16
				store i16 %tmp8, i16* undef, align 2
				%tmp9 = getelementptr inbounds %1, %1* null, i64 0, i32 0, i32 4, i64 0, i32 3
				store i8* undef, i8** %tmp9, align 8
				store i8** getelementptr (%11, %11* addrspacecast (%11 addrspace(3)* @omptarget_nvptx_globalArgs to %11), i64 0, i32 0, i64 0), i8* addrspace(3)* getelementptr inbounds (%11, %11 addrspace(3)* @omptarget_nvptx_globalArgs, i32 0, i32 1), align 8
				%tmp10 = tail call i32 @llvm.amdgcn.workgroup.id.x()
				%tmp11 = sext i32 %tmp10 to i64
				%tmp12 = getelementptr inbounds [992 x i32], [992 x i32] addrspace(1)* %arg, i64 0, i64 %tmp11
				%tmp13 = load i32, i32 addrspace(1)* %tmp12, align 4
				%tmp14 = add nsw i32 %tmp13, %tmp10
				store i32 %tmp14, i32 addrspace(1)* %tmp12, align 4
				%tmp15 = load i32, i32 addrspace(3)* @usedSlotIdx, align 4
				%tmp16 = sext i32 %tmp15 to i64
				%tmp17 = getelementptr inbounds [64 x %0], [64 x %0] addrspace(1)* @omptarget_nvptx_device_State, i64 0, i64 %tmp16, i32 3, i64 undef
				%tmp18 = addrspacecast i32 addrspace(1)* %tmp17 to i32*
				%tmp19 = atomicrmw volatile add i32* %tmp18, i32 0 seq_cst
				unreachable
				}

				declare i64 @__ockl_get_local_size() local_unnamed_addr
				declare i32 @llvm.amdgcn.workgroup.id.x()

test/CodeGen/AMDGPU/scc-missing-add.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -run-pass=si-load-store-opt -o - %s \| FileCheck -check-prefix=GFX9 %s
				# RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs -run-pass=si-load-store-opt -o - %s \| FileCheck -check-prefix=GFX9 %s

				# This test presents a sequence of DS_READ instructions that could be combined
				# into a single DS_READ provided all the dependent instructions are correctly
				# identified and moved. In this situation an S_ADDC depends on an S_ADD,
				# however the S_ADD is further away than 10 instructions and will not be found.
				# The SILoadStoreOptimizer pass needs to detect the S_ADD was not found and
				# abandon the transformation.

				# GFX9-LABEL: name: __omp_offloading_802_d9e513_main_l28
				# GFX9: DS_READ
				# GFX9: DS_WRITE
				# GFX9: S_ADD
				# GFX9: S_ADDC
				# GFX9: GLOBAL_LOAD_DWORD
				# GFX9: GLOBAL_STORE_DWORD
				# GFX9: DS_READ

				--- \|

				%0 = type { [32 x %1], [32 x %1*], i32, [32 x i32], i32, [8 x i8] }
				%1 = type { %2, [1024 x %3], [1024 x %3*], %10, [1024 x i32], [1024 x i64], [1024 x i64], [1024 x i64], [1024 x i64] }
				%2 = type { %3, %6, i64, [8 x i8], [64 x %7], [1 x %9] }
				%3 = type { %4, %5, %3* }
				%4 = type { i64, i64, i64, i64, i32 }
				%5 = type { i8, i8, i16, i16, i16, i16, i64 }
				%6 = type { %3 }
				%7 = type { %8, %8, i8, i8, [16384 x i8] }
				%8 = type { %8, %8, i8, i8, [0 x i8] }
				%9 = type { %8, %8, i8, i8, [256 x i8] }
				%10 = type { [1024 x i16] }
				%11 = type <{ [20 x i8], i8*, i32, [4 x i8] }>

				@omptarget_nvptx_device_State = external addrspace(1) externally_initialized global [64 x %0], align 16
				@usedSlotIdx = external local_unnamed_addr addrspace(3) externally_initialized global i32, align 4
				@execution_param = external local_unnamed_addr addrspace(3) externally_initialized global i32, align 4
				@omptarget_nvptx_globalArgs = external addrspace(3) externally_initialized global %11, align 8

				define amdgpu_kernel void @__omp_offloading_802_d9e513_main_l28([992 x i32] addrspace(1)* %arg) local_unnamed_addr #0 {
				bb:
				%tmp = tail call i64 @__ockl_get_local_size()
				br i1 undef, label %bb2, label %bb3, !amdgpu.uniform !0

				bb2: ; preds = %bb
				ret void

				bb3: ; preds = %bb
				%__omp_offloading_802_d9e513_main_l28.kernarg.segment = call nonnull align 16 dereferenceable(44) i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()
				%arg.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %__omp_offloading_802_d9e513_main_l28.kernarg.segment, i64 36
				%arg.kernarg.offset.cast = bitcast i8 addrspace(4)* %arg.kernarg.offset to [992 x i32] addrspace(1)* addrspace(4)*, !amdgpu.uniform !0, !amdgpu.noclobber !0
				%arg.load = load [992 x i32] addrspace(1), [992 x i32] addrspace(1) addrspace(4)* %arg.kernarg.offset.cast, align 4, !invariant.load !0
				%tmp1 = trunc i64 %tmp to i32
				%tmp4 = load i32, i32 addrspace(3)* @execution_param, align 4
				%tmp5 = and i32 %tmp4, 1
				%tmp6 = icmp eq i32 %tmp5, 0
				%tmp7 = select i1 %tmp6, i32 0, i32 %tmp1
				%tmp8 = trunc i32 %tmp7 to i16
				store i16 %tmp8, i16* undef, align 2
				store i8* undef, i8 inttoptr (i64 184 to i8), align 8
				store i8** getelementptr (%11, %11* addrspacecast (%11 addrspace(3)* @omptarget_nvptx_globalArgs to %11), i64 0, i32 0, i64 0), i8* addrspace(3)* getelementptr inbounds (%11, %11 addrspace(3)* @omptarget_nvptx_globalArgs, i32 0, i32 1), align 8
				%tmp10 = tail call i32 @llvm.amdgcn.workgroup.id.x()
				%tmp11 = sext i32 %tmp10 to i64
				%tmp12 = getelementptr inbounds [992 x i32], [992 x i32] addrspace(1)* %arg.load, i64 0, i64 %tmp11, !amdgpu.uniform !0
				%tmp13 = load i32, i32 addrspace(1)* %tmp12, align 4
				%tmp14 = add nsw i32 %tmp13, %tmp10
				store i32 %tmp14, i32 addrspace(1)* %tmp12, align 4
				%tmp15 = load i32, i32 addrspace(3)* @usedSlotIdx, align 4
				%tmp16 = sext i32 %tmp15 to i64
				%tmp17 = getelementptr inbounds [64 x %0], [64 x %0] addrspace(1)* @omptarget_nvptx_device_State, i64 0, i64 %tmp16, i32 3, i64 undef
				%0 = addrspacecast i32 addrspace(1)* %tmp17 to i32*
				%tmp19 = atomicrmw volatile add i32* %0, i32 0 seq_cst
				unreachable
				}

				declare i64 @__ockl_get_local_size() local_unnamed_addr
				declare i32 @llvm.amdgcn.workgroup.id.x()
				declare i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()

				!0 = !{}

				...
				---
				name: __omp_offloading_802_d9e513_main_l28
				body: |
				bb.0.bb:
				successors: %bb.1(0x7fffffff), %bb.2(0x00000001)
				liveins: $sgpr0_sgpr1, $sgpr2

				%3:sreg_32_xm0 = COPY $sgpr2
				%2:sgpr_64 = COPY $sgpr0_sgpr1
				ADJCALLSTACKUP 0, 0, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr101
				%5:sreg_64 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @__ockl_get_local_size + 4, target-flags(amdgpu-gotprel32-hi) @__ockl_get_local_size + 4, implicit-def dead $scc
				%6:sreg_64_xexec = S_LOAD_DWORDX2_IMM killed %5, 0, 0 :: (dereferenceable invariant load 8 from got, addrspace 4)
				%7:sreg_128 = COPY $sgpr96_sgpr97_sgpr98_sgpr99
				%8:sreg_32_xm0 = COPY $sgpr101
				$sgpr0_sgpr1_sgpr2_sgpr3 = COPY %7
				$sgpr4 = COPY %8
				$sgpr30_sgpr31 = SI_CALL killed %6, @__ockl_get_local_size, csr_amdgpu_highregs, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4, implicit-def $vgpr0_vgpr1
				ADJCALLSTACKDOWN 0, 4, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr101
				%53:vreg_64 = COPY $vgpr0_vgpr1
				S_CBRANCH_SCC1 %bb.2, implicit undef $scc
				S_BRANCH %bb.1

				bb.1.bb2:
				S_ENDPGM 0

				bb.2.bb3:
				%10:sreg_64_xexec = S_LOAD_DWORDX2_IMM %2, 36, 0 :: (dereferenceable invariant load 8 from %ir.arg.kernarg.offset.cast, align 4, addrspace 4)
				%12:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				%13:vgpr_32 = DS_READ_B32_gfx9 %12, 184, 0, implicit $exec :: (dereferenceable load 4 from @execution_param, addrspace 3)
				%55:vgpr_32 = V_BFE_I32 %13, 0, 1, implicit $exec
				%16:vgpr_32 = V_AND_B32_e32 killed %55, %53.sub0, implicit $exec
				%18:sreg_64 = IMPLICIT_DEF
				%19:vreg_64 = COPY %18
				FLAT_STORE_SHORT killed %19, killed %16, 0, 0, 0, implicit $exec, implicit $flat_scr :: (store 2 into `i16* undef`)
				%20:sreg_32_xm0 = S_GETREG_B32 31759
				%21:sreg_32_xm0 = S_LSHL_B32 killed %20, 16, implicit-def dead $scc
				%56:vgpr_32 = V_MOV_B32_e32 8, implicit $exec
				%57:vgpr_32 = COPY killed %21
				%24:vreg_64 = REG_SEQUENCE killed %56, %subreg.sub0, killed %57, %subreg.sub1
				DS_WRITE_B64_gfx9 %12, killed %24, 168, 0, implicit $exec :: (store 8 into `i8** addrspace(3)* getelementptr inbounds (%11, %11 addrspace(3)* @omptarget_nvptx_globalArgs, i32 0, i32 1)`, addrspace 3)
				%25:sreg_32_xm0 = S_ASHR_I32 %3, 31, implicit-def dead $scc
				%27:sreg_64 = REG_SEQUENCE %3, %subreg.sub0, %25, %subreg.sub1
				%29:sreg_64 = S_LSHL_B64 killed %27, 2, implicit-def dead $scc
				%69:sreg_32_xm0 = S_ADD_U32 %10.sub0, %29.sub0, implicit-def $scc
				%150:vgpr_32 = COPY killed %21
				%151:vgpr_32 = COPY killed %21
				%152:vgpr_32 = COPY killed %21
				%153:vgpr_32 = COPY killed %21
				%154:vgpr_32 = COPY killed %21
				%155:vgpr_32 = COPY killed %21
				%156:vgpr_32 = COPY killed %21
				%157:vgpr_32 = COPY killed %21
				%158:vgpr_32 = COPY killed %21
				%159:vgpr_32 = COPY killed %21
				%160:vgpr_32 = COPY killed %21
				%70:sreg_32_xm0 = S_ADDC_U32 %10.sub1, %29.sub1, implicit-def $scc, implicit $scc
				%30:sreg_64 = REG_SEQUENCE %69, %subreg.sub0, %70, %subreg.sub1
				%130:sreg_64 = REG_SEQUENCE %160, %157
				%131:sreg_64 = REG_SEQUENCE %158, %159
				%32:vreg_64 = COPY %30
				%31:vgpr_32 = GLOBAL_LOAD_DWORD %32, 0, 0, 0, implicit $exec :: (load 4 from %ir.tmp12, addrspace 1)
				%58:vgpr_32 = nsw V_ADD_U32_e64 %31, %3, 0, implicit $exec
				GLOBAL_STORE_DWORD %32, %58, 0, 0, 0, implicit $exec :: (store 4 into %ir.tmp12, addrspace 1)
				%37:vgpr_32 = DS_READ_B32_gfx9 %12, 0, 0, implicit $exec :: (dereferenceable load 4 from @usedSlotIdx, addrspace 3)
				%38:sreg_64 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @omptarget_nvptx_device_State + 4, target-flags(amdgpu-gotprel32-hi) @omptarget_nvptx_device_State + 4, implicit-def dead $scc
				%39:sreg_64_xexec = S_LOAD_DWORDX2_IMM killed %38, 0, 0 :: (dereferenceable invariant load 8 from got, addrspace 4)
				%40:sreg_32_xm0 = S_MOV_B32 37501328
				%43:vreg_64 = COPY killed %39
				%41:vreg_64, %42:sreg_64 = V_MAD_I64_I32 killed %37, killed %40, %43, 0, implicit $exec
				%65:sgpr_32 = S_MOV_B32 37501188
				%60:vgpr_32 = V_ADD_I32_e32 %65, %41.sub0, implicit-def $vcc, implicit $exec
				%62:sreg_64_xexec = COPY killed $vcc
				%61:vgpr_32, dead %63:sreg_64_xexec = V_ADDC_U32_e64 %41.sub1, 0, killed %62, 0, implicit $exec
				%59:vreg_64 = REG_SEQUENCE %60, %subreg.sub0, %61, %subreg.sub1
				%52:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
				FLAT_ATOMIC_ADD %59, %52, 0, 0, implicit $exec, implicit $flat_scr :: (volatile load store seq_cst 4 on %ir.0)
				...
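
The hazard this test exercises can be sketched outside the compiler. The following is a toy model only, not code from SILoadStoreOptimizer: `ScalarState`, `add64_in_order`, and `add64_reordered` are hypothetical names, and the SCC behaviour of S_LSHL_B32 (SCC set when the result is non-zero) is stated from my reading of the ISA documentation. It shows why placing the shift between S_ADD_U32 and S_ADDC_U32 can corrupt the high half of a 64-bit add.

```cpp
#include <cassert>
#include <cstdint>

// Toy model: one SCC bit shared by every scalar ALU instruction.
struct ScalarState {
  bool scc = false;
};

// S_ADD_U32: 32-bit add; SCC receives the unsigned carry out.
uint32_t s_add_u32(ScalarState &st, uint32_t a, uint32_t b) {
  uint64_t wide = uint64_t(a) + uint64_t(b);
  st.scc = (wide >> 32) != 0;
  return uint32_t(wide);
}

// S_ADDC_U32: 32-bit add with carry in from SCC; SCC receives the carry out.
uint32_t s_addc_u32(ScalarState &st, uint32_t a, uint32_t b) {
  uint64_t wide = uint64_t(a) + uint64_t(b) + (st.scc ? 1u : 0u);
  st.scc = (wide >> 32) != 0;
  return uint32_t(wide);
}

// S_LSHL_B32: shift left; assumed here to set SCC when the result is non-zero.
uint32_t s_lshl_b32(ScalarState &st, uint32_t a, uint32_t shift) {
  uint32_t result = a << (shift & 31u);
  st.scc = result != 0;
  return result;
}

// Original order: shift first, then the add/addc pair back to back.
uint64_t add64_in_order(uint64_t x, uint64_t y, uint32_t shifted) {
  ScalarState st;
  (void)s_lshl_b32(st, shifted, 16);
  uint32_t lo = s_add_u32(st, uint32_t(x), uint32_t(y));
  uint32_t hi = s_addc_u32(st, uint32_t(x >> 32), uint32_t(y >> 32));
  return (uint64_t(hi) << 32) | lo;
}

// Order after the bad transform: the shift clobbers SCC between add and addc.
uint64_t add64_reordered(uint64_t x, uint64_t y, uint32_t shifted) {
  ScalarState st;
  uint32_t lo = s_add_u32(st, uint32_t(x), uint32_t(y));
  (void)s_lshl_b32(st, shifted, 16);
  uint32_t hi = s_addc_u32(st, uint32_t(x >> 32), uint32_t(y >> 32));
  return (uint64_t(hi) << 32) | lo;
}
```

With `x = 0xFFFFFFFF`, `y = 1`, and a shift input of 0, the in-order version carries into the high word while the reordered version loses the carry. With a non-zero shift input, the reordered version can instead gain a carry that the low add never produced.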

This is an archive of the discontinued LLVM Phabricator instance.

SILoadStoreOptimizer pass schedules s_add,s_addc with interfering s_lshl
AbandonedPublic

Diff 195750

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll

test/CodeGen/AMDGPU/scc-missing-add.mir