
Details

Reviewers

rampitec
• tstellarAMD
arsenm

Commits

rGf867a40bf60a: [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.
rL285919: [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.

Summary

This change exploits the fact that LDS reads may be reordered even if they access the same location.

Before this change, the algorithm stopped as soon as it encountered any memory access between the loads that are expected to be merged together, even though a Read-After-Read conflict cannot affect execution correctness.

Improves the performance of the manually loop-unrolled hcBLAS CGEMM kernels by 44%. An improvement is also expected on any long sequence of reads from LDS.
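
The reordering rule described in the summary can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the patch itself: `Access`, `mayOverlap` and `canReorder` are hypothetical names, and a byte-range overlap test stands in for a real alias query.

```cpp
#include <cassert>

// Toy model of a memory access; this is not an LLVM type.
struct Access {
  bool MayStore;  // true if the instruction may write memory
  unsigned Addr;  // first byte touched (hypothetical LDS offset)
  unsigned Size;  // number of bytes touched
};

// Conservative overlap test standing in for a real alias query.
inline bool mayOverlap(const Access &A, const Access &B) {
  assert(A.Size > 0 && B.Size > 0 && "accesses must touch at least one byte");
  return A.Addr < B.Addr + B.Size && B.Addr < A.Addr + A.Size;
}

// Two accesses may swap places if they are provably disjoint, or if
// neither of them writes (the Read-After-Read case).
inline bool canReorder(const Access &A, const Access &B) {
  return !mayOverlap(A, B) || !(A.MayStore || B.MayStore);
}
```

Under this model, two loads of the same four bytes can be swapped, while a store and a load of the same bytes cannot; a store and a load of disjoint ranges can.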

Diff Detail

Event Timeline

alex-t updated this revision to Diff 75706. · Oct 25 2016, 8:27 AM

alex-t retitled this revision to [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.

alex-t updated this object.

alex-t added reviewers: rampitec, arsenm, • tstellarAMD.

alex-t set the repository for this revision to rL LLVM.

alex-t added a subscriber: Restricted Project.

Herald edited edge metadata. · Oct 25 2016, 8:27 AM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 2 others.

whchung added a subscriber: whchung. · Oct 25 2016, 8:35 AM

cgemm_loopunroll_nofix.isa (13 KB)

cgemm_loopunroll_fix.isa (14 KB)

These 2 files are here to illustrate the ISA before and after the fix. See BB10 to view the effect.

• tstellarAMD added inline comments. · Oct 25 2016, 9:02 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
167–169	I see this logic repeated in a few places, can we turn it into a helper function?
233–237	Need to use c++ style comments.
283–287	This doesn't seem correct. The comment mentions that we need to check that it was safe to move all the instruction in InstsToMove down past this instruction. Is this condition being met?
test/CodeGen/AMDGPU/cgemm_loopunroll_ds_combine.ll
1 ↗	(On Diff #75706)	This test case is too big; can you try to reduce it? You should also run opt -metarenamer to simplify the names. If you can't get a reduced IR test case, I would recommend writing an MIR test case.

This is good as a fast workaround. In the long run, direct selection of a vector operation from the DAG seems more desirable.

alex-t added inline comments. · Oct 25 2016, 12:46 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
283–287	Yes. If these two memory accesses may alias but neither is a write, it is safe to reorder them. This condition checks whether the two accesses may alias and, if at least one of them may write, we break out. Otherwise we are free to keep searching until the basic block ends or we reach an instruction that cannot be moved.

alex-t added inline comments. · Oct 26 2016, 6:39 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
283–287	And yes, you are right. I need to check whether the users of I collected so far can be moved across MBBI before I go on looking for the merge point.

Fixed according to the reviewers' comments.
Test case reduction is in progress. It is not trivial because smaller cases use fewer registers.
We need a large unroll to expose the problem.

Herald edited edge metadata. · Oct 26 2016, 10:02 AM

Changed source code. New test is upcoming.

In D25944#579909, @alex-t wrote:

Fixed according to the reviewers' comments.
Test case reduction is in progress. It is not trivial because smaller cases use fewer registers.

If it's not trivial you may want to look at writing a MIR testcase. This will be easier and also be a better test. For examples look in test/CodeGen/MIR/

We need a large unroll to expose the problem.

arsenm added inline comments. · Oct 26 2016, 4:18 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149	We don't actually have AA enabled in the backend. We also need to add an address space alias analysis pass. If these were done, would that avoid the need to have this looser check?

Small test case.

alex-t added inline comments. · Oct 27 2016, 7:34 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149	No. Even two reads from the same segment (address space) and exactly the same address are legal to reorder, as neither has side effects. This is a Read-After-Read (RAR) conflict. The check above is necessary to allow reordering Read-After-Write and Write-After-Write pairs in cases where we can prove that the two memory operations access different locations. As for the address space layer in AA, I added one in the HSAIL backend. It is trivial and I can port it to LC quickly if you feel it is necessary.

• tstellarAMD added inline comments. · Oct 27 2016, 7:43 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149	The alias analysis passes are always run, so the information is available. There is a usesAA() subtarget query, but that's only used in a few places, mostly in the DAGCombiner.

alex-t marked 2 inline comments as done. · Oct 27 2016, 8:18 AM

alex-t added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149	As far as I understand, Matt meant that we'd like to have a custom AA layer that intercepts alias queries involving two different address spaces, given that:
*flat* can alias anything
*constant* can alias nothing
*global*, *group* and *private* can only alias a pointer in the same address space
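
The address-space rules listed in this comment could be encoded roughly as follows. This is a sketch only: `AddrSpace` and `addressSpacesMayAlias` are hypothetical names, not LLVM APIs. The comment does not say which rule wins when a flat pointer meets a constant one, so the sketch assumes the conservative answer (may alias), and it follows the comment literally in treating constant as aliasing nothing else, including another constant pointer.

```cpp
#include <cassert>

// Hypothetical enum; the names follow the comment, not LLVM's numbering.
enum class AddrSpace { Flat, Constant, Global, Group, Private };

// Returns false only when the address spaces alone rule out aliasing.
inline bool addressSpacesMayAlias(AddrSpace A, AddrSpace B) {
  // Assumption: flat takes priority over constant (conservative reading).
  if (A == AddrSpace::Flat || B == AddrSpace::Flat)
    return true;
  // Constant aliases nothing else, taken literally from the comment.
  if (A == AddrSpace::Constant || B == AddrSpace::Constant)
    return false;
  // Global, group and private alias only pointers in the same space.
  return A == B;
}

// Usage: a few spot checks of the rules.
inline void spotCheckAddressSpaces() {
  assert(addressSpacesMayAlias(AddrSpace::Flat, AddrSpace::Group));
  assert(!addressSpacesMayAlias(AddrSpace::Group, AddrSpace::Private));
}
```

A layer like this would only answer queries it can decide from the address spaces; anything it reports as possibly aliasing would still need a regular alias query.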

rampitec added inline comments. · Oct 27 2016, 1:50 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153	As far as I understand, it is only legal to reorder two instructions if there are no stores between them that might write to the memory accessed by the instruction being moved. You also need to make sure there are no barriers or fences in between. This function only checks the two instructions passed to it, not anything in between, i.e. it is only legal if the two instructions are adjacent. I cannot immediately see that it is only called for adjacent instructions, nor do I see a comment about such a limitation.

tony-tye added a subscriber: jlebar. · Oct 27 2016, 3:28 PM

tony-tye added inline comments.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149	Would be good to consider @jlebar patches to support target-specific AA (D24441 and D12414) which add an NVPTX-specific AA.

alex-t added inline comments. · Oct 28 2016, 5:11 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153	In fact all the checks are done in a loop within findMatchingDSInst:

    if (MBBI->hasUnmodeledSideEffects())
      // We can't re-order this instruction with respect to other memory
      // opeations, so we fail both conditions mentioned above.
      return E;

    if (MBBI->mayLoadOrStore() &&
        !memAccessesCanBeReordered(I, MBBI, TII, AA)) {
      // We fail condition #1, but we may still be able to satisfy condition
      // #2. Add this instruction to the move list and then we will check
      // if condition #2 holds once we have selected the matching instruction.
      InstsToMove.push_back(&*MBBI);
      addDefsToList(*MBBI, DefsToMove);
      continue;
    }

So any writes in between are in the InstsToMove list and will be checked in canMoveInstsAcrossMemOp. Any instruction in between that has side effects, such as a barrier, stops the search. I had no intention of making a neat fix, just a temporary workaround, so I preserved the basic logic. Indeed there is no need to put potential stores in the move list for a later check: we could break out immediately on encountering a may-alias store in between. If you insist I can change it to do this.
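
The scan described in this reply can be modelled with a toy instruction list. The sketch below is an assumption-laden simplification, not the LLVM code: `ToyInst`, `findMatch` and the `AllowRAR` switch are hypothetical, register def/use tracking is omitted, and "same base register, different offset" stands in for both the disjointness proof and the offset-combining check. With `AllowRAR` set to false it behaves like the pre-patch search, which gives up at the first load it cannot merge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction model; none of these names are LLVM types.
enum class Kind { Load, Store, Barrier };

struct ToyInst {
  Kind K;
  int Base;         // id of the address register
  unsigned Offset;  // element offset from that register
};

constexpr std::size_t NoMatch = static_cast<std::size_t>(-1);

// Provably different locations: same base register, different offset.
inline bool triviallyDisjoint(const ToyInst &A, const ToyInst &B) {
  return A.Base == B.Base && A.Offset != B.Offset;
}

// Disjoint accesses may always swap; two loads may swap when AllowRAR is set.
inline bool canReorder(const ToyInst &A, const ToyInst &B, bool AllowRAR) {
  bool BothLoads = A.K == Kind::Load && B.K == Kind::Load;
  return triviallyDisjoint(A, B) || (AllowRAR && BothLoads);
}

// Can every deferred instruction be moved down past MemOp?
inline bool canMoveAcross(const std::vector<ToyInst> &Insts,
                          const std::vector<std::size_t> &ToMove,
                          const ToyInst &MemOp, bool AllowRAR) {
  for (std::size_t Idx : ToMove)
    if (!canReorder(MemOp, Insts[Idx], AllowRAR))
      return false;
  return true;
}

// Index of the load that Insts[I] can be paired with, or NoMatch.
inline std::size_t findMatch(const std::vector<ToyInst> &Insts, std::size_t I,
                             bool AllowRAR) {
  assert(I < Insts.size() && Insts[I].K == Kind::Load);
  const ToyInst &First = Insts[I];
  std::vector<std::size_t> ToMove;  // instructions that must sink with First
  for (std::size_t J = I + 1; J < Insts.size(); ++J) {
    const ToyInst &Cur = Insts[J];
    if (Cur.K == Kind::Barrier)
      return NoMatch;  // side effects: stop searching
    if (Cur.K != First.K) {  // a store between the loads
      if (!canReorder(First, Cur, AllowRAR))
        ToMove.push_back(J);  // defer the decision until a match is found
      continue;
    }
    bool Movable = canMoveAcross(Insts, ToMove, Cur, AllowRAR);
    if (Cur.Base == First.Base && Cur.Offset != First.Offset && Movable)
      return J;  // mergeable pair
    // Not mergeable: keep looking only if First and the deferred
    // instructions can all be moved past Cur.
    if (!canReorder(First, Cur, AllowRAR) || !Movable)
      break;
  }
  return NoMatch;
}
```

For three loads with bases 1, 2, 1, the search from the first load finds the third one when `AllowRAR` is true and fails when it is false. A store through an unrelated base between the two loads, or a barrier, makes the search fail in both modes.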

rampitec added inline comments. · Oct 28 2016, 2:25 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153	Thanks. LGTM as a workaround, provided the tests pass.

arsenm added inline comments. · Oct 28 2016, 3:16 PM

test/CodeGen/AMDGPU/ds_read2.ll
496–498	You should run instnamer on the test. Also positive checks are more useful

Test case passed through opt -instnamer.

alex-t marked an inline comment as done. · Oct 31 2016, 8:00 AM

alex-t added inline comments.

test/CodeGen/AMDGPU/ds_read2.ll
496–498	The goal of the test is to check that *all* LDS reads are combined, so any "ds_read_b32" in the output is a regression. The original CGEMM inner loop had 64 ds_read_b32 instructions, and all of them are combined with the patch. The test case is the same, just a shorter excerpt. That's why the check is negative.
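
The idea behind the negative check in the reply above can be illustrated with a small substring scan. This is a rough approximation of what a "-NOT" check asserts over its region, not FileCheck itself; `noneContain` is a hypothetical helper and the assembly strings used with it are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if no line of Asm contains Forbidden as a substring.
inline bool noneContain(const std::vector<std::string> &Asm,
                        const std::string &Forbidden) {
  assert(!Forbidden.empty() && "an empty pattern would match every line");
  for (const std::string &Line : Asm)
    if (Line.find(Forbidden) != std::string::npos)
      return false;
  return true;
}
```

Note that "ds_read2_b32" does not contain "ds_read_b32" as a substring, so output consisting only of combined reads passes the scan, while a single leftover uncombined read fails it.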

Closed by commit rL285919: [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads. (authored by alex-t). · Nov 3 2016, 7:46 AM

This revision was automatically updated to reflect the committed changes.

alex-t marked an inline comment as done.

Diff 76023

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

[135 lines not shown]

static void addDefsToList(const MachineInstr &MI,		static void addDefsToList(const MachineInstr &MI,
SmallVectorImpl<const MachineOperand *> &Defs) {		SmallVectorImpl<const MachineOperand *> &Defs) {
for (const MachineOperand &Def : MI.defs()) {		for (const MachineOperand &Def : MI.defs()) {
Defs.push_back(&Def);		Defs.push_back(&Def);
}		}
}		}

		static bool memAccessesCanBeReordered(
		MachineBasicBlock::iterator A,
		MachineBasicBlock::iterator B,
		const SIInstrInfo *TII,
		llvm::AliasAnalysis * AA) {
		return (TII->areMemAccessesTriviallyDisjoint(A, B, AA) ||
		// RAW or WAR - cannot reorder
		// WAW - cannot reorder
		// RAR - safe to reorder
		!(A->mayStore() || B->mayStore()));
		}

static bool		static bool
canMoveInstsAcrossMemOp(MachineInstr &MemOp,		canMoveInstsAcrossMemOp(MachineInstr &MemOp,
ArrayRef<MachineInstr*> InstsToMove,		ArrayRef<MachineInstr*> InstsToMove,
const SIInstrInfo *TII,		const SIInstrInfo *TII,
AliasAnalysis *AA) {		AliasAnalysis *AA) {

assert(MemOp.mayLoadOrStore());		assert(MemOp.mayLoadOrStore());

for (MachineInstr *InstToMove : InstsToMove) {		for (MachineInstr *InstToMove : InstsToMove) {
if (!InstToMove->mayLoadOrStore())		if (!InstToMove->mayLoadOrStore())
continue;		continue;
if (!TII->areMemAccessesTriviallyDisjoint(MemOp, *InstToMove, AA))		if (!memAccessesCanBeReordered(MemOp, *InstToMove, TII, AA))
return false;		return false;
}		}
return true;		return true;
}		}

bool SILoadStoreOptimizer::offsetsCanBeCombined(unsigned Offset0,		bool SILoadStoreOptimizer::offsetsCanBeCombined(unsigned Offset0,
unsigned Offset1,		unsigned Offset1,
unsigned Size) {		unsigned Size) {
// XXX - Would the same offset be OK? Is there any reason this would happen or		// XXX - Would the same offset be OK? Is there any reason this would happen or
// be useful?		// be useful?
[41 lines not shown]	if (MBBI->getOpcode() != I->getOpcode()) {
// be merged into.		// be merged into.

if (MBBI->hasUnmodeledSideEffects())		if (MBBI->hasUnmodeledSideEffects())
// We can't re-order this instruction with respect to other memory		// We can't re-order this instruction with respect to other memory
// opeations, so we fail both conditions mentioned above.		// opeations, so we fail both conditions mentioned above.
return E;		return E;

if (MBBI->mayLoadOrStore() &&		if (MBBI->mayLoadOrStore() &&
!TII->areMemAccessesTriviallyDisjoint(I, MBBI, AA)) {		!memAccessesCanBeReordered(I, MBBI, TII, AA)) {
// We fail condition #1, but we may still be able to satisfy condition		// We fail condition #1, but we may still be able to satisfy condition
// #2. Add this instruction to the move list and then we will check		// #2. Add this instruction to the move list and then we will check
// if condition #2 holds once we have selected the matching instruction.		// if condition #2 holds once we have selected the matching instruction.
InstsToMove.push_back(&*MBBI);		InstsToMove.push_back(&*MBBI);
addDefsToList(*MBBI, DefsToMove);		addDefsToList(*MBBI, DefsToMove);
continue;		continue;
}		}

// When we match I with another DS instruction we will be moving I down		// When we match I with another DS instruction we will be moving I down
// to the location of the matched instruction any uses of I will need to		// to the location of the matched instruction any uses of I will need to
// be moved down as well.		// be moved down as well.
for (const MachineOperand *Def : DefsToMove) {		for (const MachineOperand *Def : DefsToMove) {
bool ReadDef = MBBI->readsVirtualRegister(Def->getReg());		bool ReadDef = MBBI->readsVirtualRegister(Def->getReg());
// If ReadDef is true, then there is a use of Def between I		// If ReadDef is true, then there is a use of Def between I
// and the instruction that I will potentially be merged with. We		// and the instruction that I will potentially be merged with. We
// will need to move this instruction after the merged instructions.		// will need to move this instruction after the merged instructions.
if (ReadDef) {		if (ReadDef) {
InstsToMove.push_back(&*MBBI);		InstsToMove.push_back(&*MBBI);
[29 lines not shown]	if (AddrReg0.getReg() == AddrReg1.getReg() &&
canMoveInstsAcrossMemOp(*MBBI, InstsToMove, TII, AA))		canMoveInstsAcrossMemOp(*MBBI, InstsToMove, TII, AA))
return MBBI;		return MBBI;
}		}

// We've found a load/store that we couldn't merge for some reason.		// We've found a load/store that we couldn't merge for some reason.
// We could potentially keep looking, but we'd need to make sure that		// We could potentially keep looking, but we'd need to make sure that
// it was safe to move I and also all the instruction in InstsToMove		// it was safe to move I and also all the instruction in InstsToMove
// down past this instruction.		// down past this instruction.
// FIXME: This is too conservative.		if (!memAccessesCanBeReordered(I, MBBI, TII, AA) || // check if we can move I across MBBI
		!canMoveInstsAcrossMemOp(*MBBI, InstsToMove, TII, AA) // check if we can move all I's users
		)
break;		break;
}		}
return E;		return E;
}		}

MachineBasicBlock::iterator SILoadStoreOptimizer::mergeRead2Pair(		MachineBasicBlock::iterator SILoadStoreOptimizer::mergeRead2Pair(
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
MachineBasicBlock::iterator Paired,		MachineBasicBlock::iterator Paired,
unsigned EltSize,		unsigned EltSize,
ArrayRef<MachineInstr*> InstsToMove) {		ArrayRef<MachineInstr*> InstsToMove) {
[216 lines not shown]

test/CodeGen/AMDGPU/ds_read2.ll

	[487 lines not shown]
	}			}

	define void @misaligned_read2_i64(i64 addrspace(1)* %out, i64 addrspace(3)* %in) #0 {			define void @misaligned_read2_i64(i64 addrspace(1)* %out, i64 addrspace(3)* %in) #0 {
	%load = load i64, i64 addrspace(3)* %in, align 4			%load = load i64, i64 addrspace(3)* %in, align 4
	store i64 %load, i64 addrspace(1)* %out, align 8			store i64 %load, i64 addrspace(1)* %out, align 8
	ret void			ret void
	}			}

				; SI-LABEL: ds_read_diff_base_interleaving
				; SI-NOT: ds_read_b32
				define void @ds_read_diff_base_interleaving(float addrspace(1)* nocapture,
				[4 x [4 x float]] addrspace(3) *,
				[4 x [4 x float]] addrspace(3) *,
				[4 x [4 x float]] addrspace(3) *,
				[4 x [4 x float]] addrspace(3) *
				) {

				%st_addr = getelementptr float, float addrspace(1)* %0, i64 10
				%id_x = tail call i32 @llvm.amdgcn.workitem.id.x() #4
				%id_y = tail call i32 @llvm.amdgcn.workitem.id.y() #4

				%6 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %1, i32 0, i32 %id_y, i32 0
				%7 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %2, i32 0, i32 0, i32 %id_x
				%8 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %3, i32 0, i32 %id_y, i32 0
				%9 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %4, i32 0, i32 0, i32 %id_x
				%10 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %1, i32 0, i32 %id_y, i32 1
				%11 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %2, i32 0, i32 1, i32 %id_x
				%12 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %3, i32 0, i32 %id_y, i32 1
				%13 = getelementptr [4 x [4 x float]], [4 x [4 x float]] addrspace(3)* %4, i32 0, i32 1, i32 %id_x


				%14 = load float, float addrspace(3)* %6
				%15 = load float, float addrspace(3)* %7
				%mul3 = fmul float %14, %15
				%add1 = fadd float 2.0, %mul3
				%16 = load float, float addrspace(3)* %8
				%17 = load float, float addrspace(3)* %9
				%mul4 = fmul float %16, %17
				%sub2 = fsub float %add1, %mul4
				%18 = load float, float addrspace(3)* %10
				%19 = load float, float addrspace(3)* %11
				%mul5 = fmul float %18, %19
				%sub3 = fsub float %sub2, %mul5
				%20 = load float, float addrspace(3)* %12
				%21 = load float, float addrspace(3)* %13
				%mul6 = fmul float %20, %21
				%sub4 = fsub float %sub3, %mul6
				store float %sub4, float addrspace(1)* %st_addr
				ret void
				}



	; Function Attrs: nounwind readnone			; Function Attrs: nounwind readnone
	declare i32 @llvm.amdgcn.workgroup.id.x() #1			declare i32 @llvm.amdgcn.workgroup.id.x() #1

	; Function Attrs: nounwind readnone			; Function Attrs: nounwind readnone
	declare i32 @llvm.amdgcn.workgroup.id.y() #1			declare i32 @llvm.amdgcn.workgroup.id.y() #1

	; Function Attrs: nounwind readnone			; Function Attrs: nounwind readnone
	declare i32 @llvm.amdgcn.workitem.id.x() #1			declare i32 @llvm.amdgcn.workitem.id.x() #1
	[10 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.
Closed, Public
