
Details

Reviewers

rampitec
• tstellarAMD
arsenm

Commits

rGf867a40bf60a: [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.
rL285919: [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.

Summary

This change exploits the fact that LDS reads may be reordered even if they access the same location.

Prior to this change, the algorithm stops as soon as it encounters any memory access between the loads that are expected to be merged, even though a Read-After-Read conflict cannot affect execution correctness.

Improves the performance of manually loop-unrolled hcBLAS CGEMM kernels by 44%. An improvement is also expected for any long sequence of reads from LDS.
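
The reordering rule described in the summary can be sketched as a small self-contained model. This is an illustration only, not the patch itself: `MemAccess`, `triviallyDisjoint` and `canReorder` are hypothetical names, and the byte-range overlap test merely stands in for the target's real disjointness query.

```cpp
#include <cassert>

// Hypothetical stand-in for the memory behaviour of one instruction.
struct MemAccess {
  bool MayStore;  // true if the instruction may write memory
  unsigned Addr;  // first byte accessed (model only)
  unsigned Size;  // number of bytes accessed
};

// True when the two byte ranges provably do not overlap.
// Assumption: this plays the role of areMemAccessesTriviallyDisjoint.
bool triviallyDisjoint(const MemAccess &A, const MemAccess &B) {
  assert(A.Size > 0 && B.Size > 0 && "accesses must touch at least one byte");
  return A.Addr + A.Size <= B.Addr || B.Addr + B.Size <= A.Addr;
}

// Two accesses may be swapped if they cannot overlap, or if neither writes.
//   RAW / WAR / WAW on possibly overlapping memory: not reorderable.
//   RAR: always reorderable, even for the same address.
bool canReorder(const MemAccess &A, const MemAccess &B) {
  return triviallyDisjoint(A, B) || !(A.MayStore || B.MayStore);
}
```

In this model two loads of the same four bytes are reorderable, while a load and a store to those bytes are not; two stores become reorderable only once their ranges are provably disjoint.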

Diff Detail

Repository: rL LLVM

Event Timeline

alex-t updated this revision to Diff 75706. · Oct 25 2016, 8:27 AM

alex-t retitled this revision from to [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads..

alex-t updated this object.

alex-t added reviewers: rampitec, arsenm, • tstellarAMD.

alex-t set the repository for this revision to rL LLVM.

alex-t added a subscriber: Restricted Project.

Herald edited edge metadata. · Oct 25 2016, 8:27 AM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 2 others.

whchung added a subscriber: whchung. · Oct 25 2016, 8:35 AM

cgemm_loopunroll_nofix.isa (13 KB)

cgemm_loopunroll_fix.isa (14 KB)

These 2 files are here to illustrate the ISA before and after the fix. See BB10 to view the effect.

• tstellarAMD added inline comments. · Oct 25 2016, 9:02 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
155–156 ↗	(On Diff #75706)	I see this logic repeated in a few places. Can we turn it into a helper function?
218–222 ↗	(On Diff #75706)	Need to use C++-style comments.
281–282 ↗	(On Diff #75706)	This doesn't seem correct. The comment mentions that we need to check that it was safe to move all the instructions in InstsToMove down past this instruction. Is this condition being met?
test/CodeGen/AMDGPU/cgemm_loopunroll_ds_combine.ll
1 ↗	(On Diff #75706)	This test case is too big; can you try to reduce it? Also, you should run opt -metarenamer to simplify the names. If you can't get a reduced IR test case, I would recommend writing an MIR test case.

This is good as a fast workaround. In the long run, direct selection of a vector operation from the DAG seems more desirable.

alex-t added inline comments. · Oct 25 2016, 12:46 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
281–282 ↗	(On Diff #75706)	Yes. If these two memory accesses may alias but neither is a write, it is safe to reorder them. In this condition we check whether the two memory accesses may alias, and if at least one of them may write, we break out. Otherwise we are free to keep searching down until the BB ends or we hit an instruction that cannot be moved.

alex-t added inline comments. · Oct 26 2016, 6:39 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
281–282 ↗	(On Diff #75706)	And yes, you are right. I need to check whether the users of I collected so far can be moved across MBBI before I go ahead looking for the merge point.

Fixed according to the reviewers' comments.
Test case reduction is in progress. It is not trivial because smaller cases use fewer registers.
We need a large unroll to expose the problem.

Herald edited edge metadata. · Oct 26 2016, 10:02 AM

Changed source code. New test is upcoming.

In D25944#579909, @alex-t wrote:

Fixed according to the reviewers' comments.
Test case reduction is in progress. It is not trivial because smaller cases use fewer registers.

If it's not trivial, you may want to look at writing an MIR test case. This will be easier and will also be a better test. For examples, look in test/CodeGen/MIR/.

We need a large unroll to expose the problem.

arsenm added inline comments. · Oct 26 2016, 4:18 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗	(On Diff #75914)	We don't actually have AA enabled in the backend. We also need to add an address space alias analysis pass. If these were done, would that avoid the need to have this looser check?

Small test case.

alex-t added inline comments. · Oct 27 2016, 7:34 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗	(On Diff #75914)	No. Even two reads from the same segment (address space) and exactly the same address are legal to reorder, as neither has side effects. This is a Read-After-Read (RAR) conflict. The check above is necessary to allow reordering Read-After-Write and Write-After-Write pairs when we can prove that the two memory operations access different locations. As for the address space layer in AA, I added one in the HSAIL backend. It is trivial and I can port it to LC quickly if you feel it is necessary.

• tstellarAMD added inline comments. · Oct 27 2016, 7:43 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗	(On Diff #75914)	The alias analysis passes are always run, so the information is available. There is a usesAA() subtarget query, but that's only used in a few places, mostly in the DAGCombiner.

alex-t marked 2 inline comments as done. · Oct 27 2016, 8:18 AM

alex-t added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗	(On Diff #75914)	As far as I understand, Matt meant that we'd like to have a custom AA layer that intercepts alias queries involving two different address spaces, given that:
*flat* can alias anything
*constant* can alias nothing
*global, group and private* can only alias a pointer in the same address space
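
The address-space rules listed in this comment can be sketched as a tiny query function. This is only a model of the stated rules, not an existing LLVM interface: `AddrSpace` and `mayAliasByAddrSpace` are hypothetical names. The comment does not say which rule wins when a flat pointer meets a constant one, so the sketch assumes the conservative answer (may alias).

```cpp
// Hypothetical address-space tags; names follow the comment above.
enum class AddrSpace { Flat, Global, Group, Private, Constant };

// Returns true if pointers in the two address spaces may alias.
// Written as a single expression so it stays a valid C++11 constexpr function.
constexpr bool mayAliasByAddrSpace(AddrSpace A, AddrSpace B) {
  return (A == AddrSpace::Flat || B == AddrSpace::Flat)
             ? true   // flat can alias anything (assumed to include constant)
             : (A == AddrSpace::Constant || B == AddrSpace::Constant)
                   ? false   // constant aliases nothing
                   : A == B; // global/group/private: same space only
}

// Compile-time spot checks of the three rules.
static_assert(mayAliasByAddrSpace(AddrSpace::Flat, AddrSpace::Group),
              "flat may alias group");
static_assert(!mayAliasByAddrSpace(AddrSpace::Constant, AddrSpace::Global),
              "constant aliases nothing");
static_assert(!mayAliasByAddrSpace(AddrSpace::Global, AddrSpace::Group),
              "distinct non-flat spaces do not alias");
```

A layer like this would only answer queries it can decide from the address spaces alone; anything it reports as may-alias would still need the ordinary alias checks.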

rampitec added inline comments. · Oct 27 2016, 1:50 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153 ↗	(On Diff #76023)	As far as I understand, it is only legal to reorder two instructions if there are no stores in between them that might write to the memory accessed by the instruction being moved. Also, you need to make sure there are no barriers or fences in between. This function only checks the two instructions passed in, not anything in between, i.e. it is only legal if these two instructions are adjacent. I cannot immediately see that it is only called for adjacent instructions, nor do I see a comment on such a limitation.

tony-tye added a subscriber: jlebar. · Oct 27 2016, 3:28 PM

tony-tye added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗	(On Diff #75914)	Would be good to consider @jlebar patches to support target-specific AA (D24441 and D12414) which add an NVPTX-specific AA.

alex-t added inline comments. · Oct 28 2016, 5:11 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153 ↗	(On Diff #76023)	In fact, all the checks are done in a loop within findMatchingDSInst:

    if (MBBI->hasUnmodeledSideEffects())
      // We can't re-order this instruction with respect to other memory
      // opeations, so we fail both conditions mentioned above.
      return E;

    if (MBBI->mayLoadOrStore() &&
        !memAccessesCanBeReordered(I, MBBI, TII, AA)) {
      // We fail condition #1, but we may still be able to satisfy condition
      // #2. Add this instruction to the move list and then we will check
      // if condition #2 holds once we have selected the matching instruction.
      InstsToMove.push_back(&*MBBI);
      addDefsToList(*MBBI, DefsToMove);
      continue;
    }

So any writes in between are in the InstsToMove list and will be checked in canMoveInstsAcrossMemOp. Any instruction in between with side effects, such as a barrier, stops the search. I had no intention of making a neat fix, just a temporary workaround, so I preserved the basic logic. Indeed, there is no need to put potential stores in the move list for a later check; we can break out as soon as we hit a may-alias store in between. If you insist, I can change it to do this.
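
The search described in this reply can be modelled in a few dozen lines. The sketch below is a simplification under stated assumptions, not the pass itself: `Inst`, `findMatch` and the helpers are hypothetical, every access is 4 bytes wide, register dependencies (the DefsToMove bookkeeping) are ignored, and "same base register" is the only merge criterion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified instruction record.
struct Inst {
  enum Kind { DSRead, Load, Store, Barrier, Alu } K;
  int Base;         // base address register (memory kinds only)
  unsigned Offset;  // byte offset from Base (memory kinds only)
};

// Provably disjoint only for the same base and non-overlapping 4-byte ranges.
bool triviallyDisjoint(const Inst &A, const Inst &B) {
  return A.Base == B.Base &&
         (A.Offset + 4 <= B.Offset || B.Offset + 4 <= A.Offset);
}

// Read-after-read is always reorderable; anything involving a store must be
// provably disjoint.
bool canReorder(const Inst &A, const Inst &B) {
  return triviallyDisjoint(A, B) ||
         !(A.K == Inst::Store || B.K == Inst::Store);
}

// Returns the index of a DSRead that Block[Start] can be merged with, or
// Block.size() if the search has to give up.
std::size_t findMatch(const std::vector<Inst> &Block, std::size_t Start) {
  const std::size_t E = Block.size();
  assert(Start < E && Block[Start].K == Inst::DSRead);
  const Inst &I = Block[Start];
  std::vector<const Inst *> InstsToMove;  // memory ops I cannot be moved past

  auto canMoveAcross = [&](const Inst &MemOp) {
    for (const Inst *M : InstsToMove)
      if (!canReorder(MemOp, *M))
        return false;
    return true;
  };

  for (std::size_t Idx = Start + 1; Idx != E; ++Idx) {
    const Inst &Cur = Block[Idx];
    if (Cur.K != Inst::DSRead) {
      if (Cur.K == Inst::Barrier)
        return E;  // side effects: give up immediately
      if ((Cur.K == Inst::Load || Cur.K == Inst::Store) && !canReorder(I, Cur))
        InstsToMove.push_back(&Cur);  // re-checked against the merge point
      continue;
    }
    if (Cur.Base == I.Base && canMoveAcross(Cur))
      return Idx;  // merge candidate found
    // Unmergeable DS read: keep scanning only if I and everything collected
    // so far can legally move past it.
    if (!canReorder(I, Cur) || !canMoveAcross(Cur))
      break;
  }
  return E;
}
```

In this model a DS read from a different base sitting between two mergeable reads no longer ends the search, while an intervening store that cannot be proven disjoint still blocks the merge when the candidate is checked.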

rampitec added inline comments. · Oct 28 2016, 2:25 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153 ↗	(On Diff #76023)	Thanks. LGTM as a workaround, provided the tests pass.

arsenm added inline comments. · Oct 28 2016, 3:16 PM

test/CodeGen/AMDGPU/ds_read2.ll
496–498 ↗	(On Diff #76023)	You should run instnamer on the test. Also positive checks are more useful

Test case passed through opt -instnamer.

alex-t marked an inline comment as done. · Oct 31 2016, 8:00 AM

alex-t added inline comments.

test/CodeGen/AMDGPU/ds_read2.ll
496–498 ↗	(On Diff #76023)	The goal of the test is to check that *all* LDS reads are combined, so any "ds_read_b32" in the output is a regression. In the original CGEMM inner loop there were 64 ds_read_b32 instructions, and all of them are combined with the patch. The test case is the same code but a shorter excerpt. That's why it is negative.
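
The reasoning behind the negative check can be illustrated with a plain substring search. This is a sketch only: `countOccurrences` and `allLdsReadsCombined` are hypothetical helpers, and it assumes the combined instruction is printed as `ds_read2_b32`, which does not contain the substring `ds_read_b32`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of Needle in Text.
std::size_t countOccurrences(const std::string &Text,
                             const std::string &Needle) {
  assert(!Needle.empty() && "searching for an empty string is meaningless");
  std::size_t Count = 0;
  for (std::size_t Pos = Text.find(Needle); Pos != std::string::npos;
       Pos = Text.find(Needle, Pos + Needle.size()))
    ++Count;
  return Count;
}

// Mirrors the intent of the negative check: the output passes only if no
// uncombined single-dword LDS read is left anywhere in it.
bool allLdsReadsCombined(const std::string &Asm) {
  return countOccurrences(Asm, "ds_read_b32") == 0;
}
```

Because a single leftover read is enough to fail, a check of this shape catches a partial regression that a positive check for one combined read would miss.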

Closed by commit rL285919: [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads. (authored by alex-t). · Nov 3 2016, 7:46 AM

This revision was automatically updated to reflect the committed changes.

alex-t marked an inline comment as done.

Diff 76857

llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

[135 lines hidden]

 static void addDefsToList(const MachineInstr &MI,
                           SmallVectorImpl<const MachineOperand *> &Defs) {
   for (const MachineOperand &Def : MI.defs()) {
     Defs.push_back(&Def);
   }
 }

+static bool memAccessesCanBeReordered(
+  MachineBasicBlock::iterator A,
+  MachineBasicBlock::iterator B,
+  const SIInstrInfo *TII,
+  llvm::AliasAnalysis * AA) {
+  return (TII->areMemAccessesTriviallyDisjoint(A, B, AA) ||
+    // RAW or WAR - cannot reorder
+    // WAW - cannot reorder
+    // RAR - safe to reorder
+    !(A->mayStore() || B->mayStore()));
+}
+
 // Add MI and its defs to the lists if MI reads one of the defs that are
 // already in the list. Returns true in that case.
 static bool
 addToListsIfDependent(MachineInstr &MI,
                       SmallVectorImpl<const MachineOperand *> &Defs,
                       SmallVectorImpl<MachineInstr*> &Insts) {
   for (const MachineOperand *Def : Defs) {
     bool ReadDef = MI.readsVirtualRegister(Def->getReg());

[16 lines hidden; context: canMoveInstsAcrossMemOp(MachineInstr &MemOp,]

                         const SIInstrInfo *TII,
                         AliasAnalysis *AA) {

   assert(MemOp.mayLoadOrStore());

   for (MachineInstr *InstToMove : InstsToMove) {
     if (!InstToMove->mayLoadOrStore())
       continue;
-    if (!TII->areMemAccessesTriviallyDisjoint(MemOp, *InstToMove, AA))
+    if (!memAccessesCanBeReordered(MemOp, *InstToMove, TII, AA))
       return false;
   }
   return true;
 }

 bool SILoadStoreOptimizer::offsetsCanBeCombined(unsigned Offset0,
                                                 unsigned Offset1,
                                                 unsigned Size) {
   // XXX - Would the same offset be OK? Is there any reason this would happen or

[42 lines hidden; context: if (MBBI->getOpcode() != I->getOpcode()) {]

       // be merged into.

       if (MBBI->hasUnmodeledSideEffects())
         // We can't re-order this instruction with respect to other memory
         // opeations, so we fail both conditions mentioned above.
         return E;

       if (MBBI->mayLoadOrStore() &&
-          !TII->areMemAccessesTriviallyDisjoint(I, MBBI, AA)) {
+          !memAccessesCanBeReordered(I, MBBI, TII, AA)) {
         // We fail condition #1, but we may still be able to satisfy condition
         // #2. Add this instruction to the move list and then we will check
         // if condition #2 holds once we have selected the matching instruction.
         InstsToMove.push_back(&*MBBI);
         addDefsToList(*MBBI, DefsToMove);
         continue;
       }

[38 lines hidden; context: if (AddrReg0.getReg() == AddrReg1.getReg() &&]

           canMoveInstsAcrossMemOp(*MBBI, InstsToMove, TII, AA))
         return MBBI;
     }

     // We've found a load/store that we couldn't merge for some reason.
     // We could potentially keep looking, but we'd need to make sure that
     // it was safe to move I and also all the instruction in InstsToMove
     // down past this instruction.
-    // FIXME: This is too conservative.
+    if (!memAccessesCanBeReordered(I, MBBI, TII, AA) || // check if we can move I across MBBI
+        !canMoveInstsAcrossMemOp(*MBBI, InstsToMove, TII, AA) // check if we can move all I's users
+      )
       break;
   }
   return E;
 }

 MachineBasicBlock::iterator SILoadStoreOptimizer::mergeRead2Pair(
   MachineBasicBlock::iterator I,
   MachineBasicBlock::iterator Paired,
   unsigned EltSize,

[217 lines hidden]

llvm/trunk/test/CodeGen/AMDGPU/ds_read2.ll

[487 lines hidden]

 }

 define void @misaligned_read2_i64(i64 addrspace(1)* %out, i64 addrspace(3)* %in) #0 {
   %load = load i64, i64 addrspace(3)* %in, align 4
   store i64 %load, i64 addrspace(1)* %out, align 8
   ret void
 }

+; SI-LABEL: ds_read_diff_base_interleaving
+; SI-NOT: ds_read_b32
+define amdgpu_kernel void @ds_read_diff_base_interleaving(
+  float addrspace(1)* nocapture %arg,
+  [4 x [4 x float]] addrspace(3)* %arg1,
+  [4 x [4 x float]] addrspace(3)* %arg2,
+  [4 x [4 x float]] addrspace(3)* %arg3,
+  [4 x [4 x float]] addrspace(3)* %arg4) #1 {
+bb:
+  %tmp = getelementptr float, float addrspace(1)* %arg, i64 10
+  %tmp5 = tail call i32 @llvm.amdgcn.workitem.id.x() #2
+  %tmp6 = tail call i32 @llvm.amdgcn.workitem.id.y() #2
+  %tmp7 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg1, i32 0, i32 %tmp6, i32 0
+  %tmp8 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg2, i32 0, i32 0, i32 %tmp5
+  %tmp9 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg3, i32 0, i32 %tmp6, i32 0
+  %tmp10 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg4, i32 0, i32 0, i32 %tmp5
+  %tmp11 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg1, i32 0, i32 %tmp6, i32 1
+  %tmp12 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg2, i32 0, i32 1, i32 %tmp5
+  %tmp13 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg3, i32 0, i32 %tmp6, i32 1
+  %tmp14 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %arg4, i32 0, i32 1, i32 %tmp5
+  %tmp15 = load float, float addrspace(3)* %tmp7
+  %tmp16 = load float, float addrspace(3)* %tmp8
+  %tmp17 = fmul float %tmp15, %tmp16
+  %tmp18 = fadd float 2.000000e+00, %tmp17
+  %tmp19 = load float, float addrspace(3)* %tmp9
+  %tmp20 = load float, float addrspace(3)* %tmp10
+  %tmp21 = fmul float %tmp19, %tmp20
+  %tmp22 = fsub float %tmp18, %tmp21
+  %tmp23 = load float, float addrspace(3)* %tmp11
+  %tmp24 = load float, float addrspace(3)* %tmp12
+  %tmp25 = fmul float %tmp23, %tmp24
+  %tmp26 = fsub float %tmp22, %tmp25
+  %tmp27 = load float, float addrspace(3)* %tmp13
+  %tmp28 = load float, float addrspace(3)* %tmp14
+  %tmp29 = fmul float %tmp27, %tmp28
+  %tmp30 = fsub float %tmp26, %tmp29
+  store float %tmp30, float addrspace(1)* %tmp
+  ret void
+}
+
 ; Function Attrs: nounwind readnone
 declare i32 @llvm.amdgcn.workgroup.id.x() #1

 ; Function Attrs: nounwind readnone
 declare i32 @llvm.amdgcn.workgroup.id.y() #1

 ; Function Attrs: nounwind readnone
 declare i32 @llvm.amdgcn.workitem.id.x() #1

[10 lines hidden]

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.
ClosedPublic
