This is an archive of the discontinued LLVM Phabricator instance.

Differential D123569

[AMDGPU] Try to avoid inserting duplicate s_inst_prefetch
Closed, Public

Authored by critson on Apr 11 2022, 9:27 PM.

Details

Reviewers

rampitec
foad

Commits

rG35ea326047ef: [AMDGPU] Try to avoid inserting duplicate s_inst_prefetch

Summary

Check for existing s_inst_prefetch instructions when
configuring prefetches during loop alignment.
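
As a rough illustration of the check described in the summary, the sketch below models a basic block as a `std::list` of toy opcodes and inserts a prefetch only when an adjacent one is not already there. The `Opcode` enum, the `Block` alias, and the helpers `firstTerminator`, `addPreheaderPrefetch`, and `addExitPrefetch` are hypothetical stand-ins, not LLVM APIs. The actual change works on `MachineBasicBlock` iterators in `SIISelLowering.cpp`, and debug instructions are not modelled here.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Toy stand-ins for machine opcodes (illustrative only, not LLVM types).
enum class Opcode { Other, Branch, InstPrefetch };
using Block = std::list<Opcode>;

// Hypothetical helper: the first terminator is modelled as the first Branch.
Block::iterator firstTerminator(Block &B) {
  auto It = B.begin();
  while (It != B.end() && *It != Opcode::Branch)
    ++It;
  return It;
}

// Insert a prefetch before the terminator unless the instruction directly
// before it is already a prefetch. Returns true if one was inserted.
bool addPreheaderPrefetch(Block &Pre) {
  auto Term = firstTerminator(Pre);
  bool Inserted = false;
  if (Term == Pre.begin() || *std::prev(Term) != Opcode::InstPrefetch) {
    Pre.insert(Term, Opcode::InstPrefetch);
    Inserted = true;
  }
  // Either way, a prefetch now sits directly before the terminator position.
  assert(Term != Pre.begin() && *std::prev(Term) == Opcode::InstPrefetch);
  return Inserted;
}

// Insert a prefetch at the head of the exit block unless the head is
// already a prefetch. Returns true if one was inserted.
bool addExitPrefetch(Block &Exit) {
  auto Head = Exit.begin();
  if (Head == Exit.end() || *Head != Opcode::InstPrefetch) {
    Exit.insert(Head, Opcode::InstPrefetch);
    return true;
  }
  return false;
}
```

Calling either helper twice on the same block inserts only once, which is the behaviour the patch is aiming for.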

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. Apr 11 2022, 9:27 PM

Herald added a project: Restricted Project. Apr 11 2022, 9:27 PM

Herald added subscribers: hsmhsm, kerbowa, hiraditya and 8 others.

critson requested review of this revision. Apr 11 2022, 9:27 PM

Herald added a project: Restricted Project. Apr 11 2022, 9:27 PM

Herald added subscribers: llvm-commits, wdng.

critson mentioned this in rG4c59fc53299d: [AMDGPU] Pre-commit test for D123569. NFC. Apr 11 2022, 9:47 PM

Rebase onto pre-committed test.

Harbormaster completed remote builds in B159155: Diff 422120. Apr 11 2022, 10:39 PM

Thank you! I had noticed this before but never got around to investigating properly.

foad added inline comments. Apr 12 2022, 2:42 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12316	Surely the condition should be `PreTerm == Pre->begin() || ...`?
12322	Likewise.

critson marked 2 inline comments as done. Apr 12 2022, 3:35 AM

critson added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12316	A fair point, I had not considered that we still want to insert when this is the beginning of the block.
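
To make the edge case in this exchange concrete, here is a minimal sketch comparing two ways of writing the guard. It uses a plain `std::vector` and an insertion index instead of `MachineBasicBlock` iterators. The names `Op`, `shouldInsertOr`, and `shouldInsertAnd` are hypothetical, and the `&&` variant is shown only for contrast; it is an assumption, not taken from an earlier revision of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy opcode type (illustrative only).
enum class Op { Other, Prefetch };

// Form suggested in the review: insert when the insertion point is at the
// start of the block, or when the preceding instruction is not a prefetch.
// The short-circuit also keeps Block[Pos - 1] from being evaluated at Pos == 0.
bool shouldInsertOr(const std::vector<Op> &Block, std::size_t Pos) {
  assert(Pos <= Block.size());
  return Pos == 0 || Block[Pos - 1] != Op::Prefetch;
}

// Hypothetical alternative, for contrast only: it is equally safe to
// evaluate, but it never inserts at the start of a block.
bool shouldInsertAnd(const std::vector<Op> &Block, std::size_t Pos) {
  assert(Pos <= Block.size());
  return Pos != 0 && Block[Pos - 1] != Op::Prefetch;
}
```

At `Pos == 0` only the `||` form reports that an insertion is needed, which matches the point made above: a block whose first instruction is the terminator still needs its prefetch.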

Incorporate reviewer comments

foad accepted this revision. Apr 12 2022, 3:37 AM

This revision is now accepted and ready to land. Apr 12 2022, 3:37 AM

Harbormaster completed remote builds in B159186: Diff 422162. Apr 12 2022, 4:22 AM

Thanks! Did we ever run any benchmarking on this? I wrote this before actual HW was available.

In D123569#3445987, @rampitec wrote:

Thanks! Did we ever run any benchmarking on this? I wrote this before actual HW was available.

I was curious so did some quick investigation on GFX10.1 (Navi10).

For graphics at the macro scale, I cannot see any performance impact from entirely disabling generation of s_inst_prefetch instructions on our test suite.

Setting up a micro benchmark, I can see a >20% performance uplift from setting an appropriate mode, and a >20% performance drop from setting an inappropriate mode via s_inst_prefetch.
So these instructions definitely matter, but it's an open question whether we are using them effectively; at least they don't seem to be hurting performance.
Additionally, the cost of back-to-back s_inst_prefetch is the same as s_nop, so I would not expect to see a change in performance from this patch, just a saving of a few redundant scalar instructions.

Thanks Carl! I would suggest this should really matter if we have a loop in a certain size range: large enough that it doesn't fit into I$ entirely, but not so large that it gets evicted anyway. It can be somewhat tricky to measure the impact. Certainly nested loops may expose an impact. Maybe compute shaders have a better chance of falling into that range.

This revision was landed with ongoing or failed builds. Apr 14 2022, 12:21 AM

Closed by commit rG35ea326047ef: [AMDGPU] Try to avoid inserting duplicate s_inst_prefetch (authored by critson).

This revision was automatically updated to reflect the committed changes.

critson added a commit: rG35ea326047ef: [AMDGPU] Try to avoid inserting duplicate s_inst_prefetch.

Revision Contents

Path                                                Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp           18 lines
llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll    4 lines

Diff 422756

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

@@ if (MachineBasicBlock *Exit = P->getExitBlock()) {
         return CacheLineAlign;
     }
   }

   MachineBasicBlock *Pre = ML->getLoopPreheader();
   MachineBasicBlock *Exit = ML->getExitBlock();

   if (Pre && Exit) {
-    BuildMI(*Pre, Pre->getFirstTerminator(), DebugLoc(),
-            TII->get(AMDGPU::S_INST_PREFETCH))
-      .addImm(1); // prefetch 2 lines behind PC
+    auto PreTerm = Pre->getFirstTerminator();
+    if (PreTerm == Pre->begin() ||
+        std::prev(PreTerm)->getOpcode() != AMDGPU::S_INST_PREFETCH)
+      BuildMI(*Pre, PreTerm, DebugLoc(), TII->get(AMDGPU::S_INST_PREFETCH))
+          .addImm(1); // prefetch 2 lines behind PC

-    BuildMI(*Exit, Exit->getFirstNonDebugInstr(), DebugLoc(),
-            TII->get(AMDGPU::S_INST_PREFETCH))
-      .addImm(2); // prefetch 1 line behind PC
+    auto ExitHead = Exit->getFirstNonDebugInstr();
+    if (ExitHead == Exit->end() ||
+        ExitHead->getOpcode() != AMDGPU::S_INST_PREFETCH)
+      BuildMI(*Exit, ExitHead, DebugLoc(), TII->get(AMDGPU::S_INST_PREFETCH))
+          .addImm(2); // prefetch 1 line behind PC
   }

   return CacheLineAlign;
 }

 LLVM_ATTRIBUTE_UNUSED
 static bool isCopyFromRegOfInlineAsm(const SDNode *N) {
   assert(N->getOpcode() == ISD::CopyFromReg);

llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -march=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck --check-prefix=GFX10 %s

 define amdgpu_cs void @_amdgpu_cs_main(float %0, i32 %1) {
 ; GFX10-LABEL: _amdgpu_cs_main:
 ; GFX10: ; %bb.0: ; %branch1_true
 ; GFX10-NEXT: v_mov_b32_e32 v2, 0
 ; GFX10-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v1
 ; GFX10-NEXT: v_mov_b32_e32 v1, 0
 ; GFX10-NEXT: s_mov_b32 s4, 0
 ; GFX10-NEXT: s_mov_b32 s1, 0
 ; GFX10-NEXT: ; implicit-def: $sgpr2
 ; GFX10-NEXT: s_inst_prefetch 0x1
-; GFX10-NEXT: s_inst_prefetch 0x1
-; GFX10-NEXT: s_inst_prefetch 0x1
 ; GFX10-NEXT: s_branch .LBB0_2
 ; GFX10-NEXT: .p2align 6
 ; GFX10-NEXT: .LBB0_1: ; %Flow
 ; GFX10-NEXT: ; in Loop: Header=BB0_2 Depth=1
 ; GFX10-NEXT: s_or_b32 exec_lo, exec_lo, s3
 ; GFX10-NEXT: v_mov_b32_e32 v1, v0
 ; GFX10-NEXT: s_and_b32 s0, exec_lo, s2
 ; GFX10-NEXT: s_or_b32 s1, s0, s1
 [... 21 lines not shown ...]
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: v_fma_f32 v1, v1, v0, 0
 ; GFX10-NEXT: v_cmp_le_f32_e64 s0, 0, v1
 ; GFX10-NEXT: s_and_b32 s0, s0, exec_lo
 ; GFX10-NEXT: s_or_b32 s2, s2, s0
 ; GFX10-NEXT: s_branch .LBB0_1
 ; GFX10-NEXT: .LBB0_4: ; %loop0_merge
 ; GFX10-NEXT: s_inst_prefetch 0x2
-; GFX10-NEXT: s_inst_prefetch 0x2
-; GFX10-NEXT: s_inst_prefetch 0x2
 ; GFX10-NEXT: s_endpgm
 branch1_true:
   br label %2

 2: ; preds = %branch2_merge, %branch1_true
   %r1.8.vec.insert14.i1 = phi float [ 0.000000e+00, %branch1_true ], [ %0, %branch2_merge ]
   %3 = call float @llvm.amdgcn.image.sample.lz.3d.f32.f32(i32 1, float 0.000000e+00, float 0.000000e+00, float %r1.8.vec.insert14.i1, <8 x i32> zeroinitializer, <4 x i32> zeroinitializer, i1 false, i32 0, i32 0)
   %4 = icmp eq i32 %1, 0
 [... 19 lines not shown ...]
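
The duplicates this test guards against are back-to-back `s_inst_prefetch` lines in the generated assembly. As a rough sketch, the helper below scans assembly text, where the mnemonic is the first token on each line, and counts such duplicates. The function `countBackToBackPrefetch` is a hypothetical name and not part of LLVM or FileCheck. The test itself relies on the `GFX10-NEXT` checks.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Count lines whose mnemonic is s_inst_prefetch and whose immediately
// preceding line was also an s_inst_prefetch.
std::size_t countBackToBackPrefetch(const std::string &Asm) {
  std::istringstream In(Asm);
  std::string Line;
  bool PrevWasPrefetch = false;
  std::size_t Total = 0; // all s_inst_prefetch lines seen
  std::size_t Dups = 0;  // those directly following another one
  while (std::getline(In, Line)) {
    std::istringstream Tokens(Line);
    std::string Mnemonic;
    Tokens >> Mnemonic; // first whitespace-delimited token; empty if blank
    const bool IsPrefetch = (Mnemonic == "s_inst_prefetch");
    if (IsPrefetch) {
      ++Total;
      if (PrevWasPrefetch)
        ++Dups;
    }
    PrevWasPrefetch = IsPrefetch;
  }
  // The first prefetch of any run is never a duplicate.
  assert(Dups == 0 || Dups < Total);
  return Dups;
}
```

Under these assumptions, three consecutive prefetch lines, as in the pre-committed checks, count as two duplicates, and a single line counts as none.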