This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable load clustering in the post-RA scheduler
Closed, Public

Authored by foad on Oct 12 2021, 7:49 AM.

Details

Reviewers

arsenm
rampitec
Joe_Nash
kerbowa
critson
piotr

Commits

rGc885857e9d03: [AMDGPU] Enable load clustering in the post-RA scheduler

Summary

This has a couple of benefits:

It can sometimes fix clusters that got broken apart when the register allocator inserted a copy.
Post-RA scheduling does not have to worry about increasing register pressure, which in some cases gives it more freedom to reorder instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010 showed:

The average length of each run of one or more load instructions increased by about 1%.
The number of runs of two or more load instructions increased by about 4%.
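
For readers who want to reproduce this kind of static measurement, the following is a minimal sketch of how the two statistics could be computed over an instruction stream. It is not the script used for the numbers above: `LoadRunStats` and `computeLoadRunStats` are hypothetical names, and each instruction is reduced to a single is-load flag.

```cpp
#include <cstddef>
#include <vector>

// Summary of maximal runs of consecutive load instructions in one
// instruction stream. All names here are hypothetical.
struct LoadRunStats {
  std::size_t NumRuns = 0;          // runs of one or more loads
  std::size_t NumRunsOfTwoPlus = 0; // runs of two or more loads
  std::size_t NumLoads = 0;         // total loads across all runs

  // Average run length; 0.0 when the stream contains no loads.
  double averageRunLength() const {
    return NumRuns == 0 ? 0.0 : static_cast<double>(NumLoads) / NumRuns;
  }
};

// IsLoad[i] is true when the i-th instruction is a load.
LoadRunStats computeLoadRunStats(const std::vector<bool> &IsLoad) {
  LoadRunStats Stats;
  std::size_t CurrentRun = 0;

  // Record the run that just ended, if any, and reset the counter.
  auto FlushRun = [&] {
    if (CurrentRun == 0)
      return;
    ++Stats.NumRuns;
    if (CurrentRun >= 2)
      ++Stats.NumRunsOfTwoPlus;
    Stats.NumLoads += CurrentRun;
    CurrentRun = 0;
  };

  for (bool L : IsLoad) {
    if (L)
      ++CurrentRun;
    else
      FlushRun();
  }
  FlushRun(); // the stream may end in the middle of a run
  return Stats;
}
```

For the sequence load, other, load, other this reports two runs of length one. After clustering into load, load, other, other it reports a single run of length two, so both the average run length and the count of runs of two or more go up.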

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

foad created this revision. Oct 12 2021, 7:49 AM

Herald added subscribers: kerbowa, asbirlea, hiraditya and 7 others. Oct 12 2021, 7:49 AM

foad requested review of this revision. Oct 12 2021, 7:49 AM

Herald added a project: Restricted Project. Oct 12 2021, 7:49 AM

Herald added subscribers: llvm-commits, wdng.

foad added reviewers: kerbowa, critson, piotr. Oct 12 2021, 7:49 AM

Interesting to see that quite a few of the test changes are clustering the loads, but not ending up with a clause (correctly).
Nice result otherwise though.

In D111646#3058586, @dstuttard wrote:

Interesting to see that quite a few of the test changes are clustering the loads, but not ending up with a clause (correctly).
Nice result otherwise though.

Oh no - they're just pre-gfx10, so no s_clause.

Any data on compile time? Other than potential compile time/ runtime tradeoff LGTM

Harbormaster completed remote builds in B128372: Diff 379036. Oct 12 2021, 10:21 AM

LGTM

This revision is now accepted and ready to land. Oct 12 2021, 10:27 AM

In D111646#3058632, @Joe_Nash wrote:

Any data on compile time? Other than potential compile time/ runtime tradeoff LGTM

We see an average 0.11% slow down in internal compile time testing on a bunch of graphics pipelines. Does that seem acceptable?

In D111646#3060926, @foad wrote:

In D111646#3058632, @Joe_Nash wrote:

Any data on compile time? Other than potential compile time/ runtime tradeoff LGTM

We see an average 0.11% slow down in internal compile time testing on a bunch of graphics pipelines. Does that seem acceptable?

Seems acceptable to me.

Closed by commit rGc885857e9d03: [AMDGPU] Enable load clustering in the post-RA scheduler (authored by foad). Oct 13 2021, 9:12 AM

This revision was automatically updated to reflect the committed changes.

foad added a commit: rGc885857e9d03: [AMDGPU] Enable load clustering in the post-RA scheduler.

Sounds good to me. The runtime improvement from clustering is notoriously difficult to assess, but your static data shows some potential benefit.

Having said that, it probably makes sense to guard the mutation by the optlevel check, so we only enable it with -O2 or higher.

OptLevel > CodeGenOpt::Less
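
A minimal, self-contained sketch of what such a guard could look like. It uses a stand-in enum in place of llvm::CodeGenOpt::Level so that it compiles without LLVM headers. `OptLevel` and `postRAMutations` are hypothetical names, and the mutations are represented by plain strings. The committed diff below adds the mutation unconditionally, so this only illustrates the suggestion.

```cpp
#include <string>
#include <vector>

// Stand-in for llvm::CodeGenOpt::Level, redeclared here (assumption) so the
// sketch builds on its own. -O1 maps to Less and -O2 maps to Default.
enum class OptLevel { None, Less, Default, Aggressive };

// Hypothetical helper: returns the names of the mutations that the post-RA
// scheduler would receive at the given optimization level.
std::vector<std::string> postRAMutations(OptLevel Level) {
  std::vector<std::string> Mutations;
  // The suggested guard: only cluster loads at -O2 or higher.
  if (Level > OptLevel::Less)
    Mutations.push_back("LoadCluster");
  // The existing MFMA shadow mutation stays unconditional in this sketch.
  Mutations.push_back("FillMFMAShadow");
  return Mutations;
}
```

With this guard, `postRAMutations(OptLevel::Less)` yields only the MFMA shadow mutation, while `OptLevel::Default` and above also get load clustering.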

In D111646#3063346, @piotr wrote:

Sounds good to me. The runtime improvement from clustering is notoriously difficult to assess, but your static data shows some potential benefit.

Having said that, it probably makes sense to guard the mutation by the optlevel check, so we only enable it with -O2 or higher.

Interesting. I wonder if we should do that for all the mutations, or even all the (post-ra?) scheduling?

Revision Contents

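Path                                                          Size
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp                1 line
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll    5 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll                5 lines
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll        4 lines
llvm/test/CodeGen/AMDGPU/idiv-licm.ll                         2 lines
llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll        6 lines
llvm/test/CodeGen/AMDGPU/sdiv64.ll                            2 lines
llvm/test/CodeGen/AMDGPU/srem64.ll                            2 lines
llvm/test/CodeGen/AMDGPU/udiv64.ll                            2 lines
llvm/test/CodeGen/AMDGPU/urem64.ll                            2 lines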

Diff 379433

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[... 845 lines not shown ...] public:
   ScheduleDAGInstrs *
   createMachineScheduler(MachineSchedContext *C) const override;

   ScheduleDAGInstrs *
   createPostMachineScheduler(MachineSchedContext *C) const override {
     ScheduleDAGMI *DAG = createGenericSchedPostRA(C);
     const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>();
+    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
     DAG->addMutation(ST.createFillMFMAShadowMutation(DAG->TII));
     return DAG;
   }

   bool addPreISel() override;
   void addMachineSSAOptimization() override;
   bool addILPOpts() override;
   bool addInstSelector() override;
[... 662 lines not shown ...]

llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll

[... 179 lines not shown ...]
 ; GFX9-NEXT: v_cndmask_b32_e64 v3, v3, v15, s[4:5]
 ; GFX9-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: extractelement_vgpr_v4i128_vgpr_idx:
 ; GFX8: ; %bb.0:
 ; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-NEXT: v_add_u32_e32 v3, vcc, 16, v0
 ; GFX8-NEXT: v_addc_u32_e32 v4, vcc, 0, v1, vcc
-; GFX8-NEXT: flat_load_dwordx4 v[8:11], v[0:1]
 ; GFX8-NEXT: flat_load_dwordx4 v[4:7], v[3:4]
+; GFX8-NEXT: flat_load_dwordx4 v[8:11], v[0:1]
 ; GFX8-NEXT: v_lshlrev_b32_e32 v16, 1, v2
 ; GFX8-NEXT: v_add_u32_e32 v17, vcc, 1, v16
 ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 1, v17
 ; GFX8-NEXT: v_cmp_eq_u32_e64 s[4:5], 1, v16
 ; GFX8-NEXT: v_cmp_eq_u32_e64 s[6:7], 6, v16
 ; GFX8-NEXT: v_cmp_eq_u32_e64 s[8:9], 7, v16
-; GFX8-NEXT: s_waitcnt vmcnt(1)
+; GFX8-NEXT: s_waitcnt vmcnt(0)
 ; GFX8-NEXT: v_cndmask_b32_e64 v2, v8, v10, s[4:5]
 ; GFX8-NEXT: v_cndmask_b32_e64 v3, v9, v11, s[4:5]
 ; GFX8-NEXT: v_cndmask_b32_e32 v8, v8, v10, vcc
 ; GFX8-NEXT: v_cndmask_b32_e32 v9, v9, v11, vcc
 ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 2, v16
-; GFX8-NEXT: s_waitcnt vmcnt(0)
 ; GFX8-NEXT: v_cndmask_b32_e32 v2, v2, v4, vcc
 ; GFX8-NEXT: v_cndmask_b32_e32 v3, v3, v5, vcc
 ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 2, v17
 ; GFX8-NEXT: v_cndmask_b32_e32 v4, v8, v4, vcc
 ; GFX8-NEXT: v_cndmask_b32_e32 v5, v9, v5, vcc
 ; GFX8-NEXT: v_cmp_eq_u32_e32 vcc, 3, v16
 ; GFX8-NEXT: v_cndmask_b32_e32 v18, v2, v6, vcc
 ; GFX8-NEXT: v_cndmask_b32_e32 v19, v3, v7, vcc
[... 722 lines not shown ...]

llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll

[... 824 lines not shown ...]
 ; GFX8-NEXT: v_mov_b32_e32 v0, s14
 ; GFX8-NEXT: v_mov_b32_e32 v1, s15
 ; GFX8-NEXT: flat_store_dwordx4 v[0:1], v[4:7]
 ; GFX8-NEXT: s_endpgm
 ;
 ; GFX9-LABEL: udivrem_v4i32:
 ; GFX9: ; %bb.0:
 ; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x20
-; GFX9-NEXT: v_mov_b32_e32 v2, 0x4f7ffffe
 ; GFX9-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x10
+; GFX9-NEXT: v_mov_b32_e32 v2, 0x4f7ffffe
 ; GFX9-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT: v_cvt_f32_u32_e32 v0, s0
 ; GFX9-NEXT: v_cvt_f32_u32_e32 v1, s1
 ; GFX9-NEXT: s_sub_i32 s6, 0, s0
 ; GFX9-NEXT: s_sub_i32 s7, 0, s1
 ; GFX9-NEXT: v_rcp_iflag_f32_e32 v0, v0
 ; GFX9-NEXT: v_rcp_iflag_f32_e32 v1, v1
 ; GFX9-NEXT: v_cvt_f32_u32_e32 v5, s2
[... 78 lines not shown ...]
 ; GFX9-NEXT: v_mov_b32_e32 v8, 0
 ; GFX9-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT: global_store_dwordx4 v8, v[0:3], s[4:5]
 ; GFX9-NEXT: global_store_dwordx4 v8, v[4:7], s[6:7]
 ; GFX9-NEXT: s_endpgm
 ;
 ; GFX10-LABEL: udivrem_v4i32:
 ; GFX10: ; %bb.0:
+; GFX10-NEXT: s_clause 0x1
 ; GFX10-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x20
-; GFX10-NEXT: v_mov_b32_e32 v4, 0x4f7ffffe
 ; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x10
+; GFX10-NEXT: v_mov_b32_e32 v4, 0x4f7ffffe
 ; GFX10-NEXT: v_mov_b32_e32 v8, 0
 ; GFX10-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT: v_cvt_f32_u32_e32 v0, s8
 ; GFX10-NEXT: v_cvt_f32_u32_e32 v1, s9
 ; GFX10-NEXT: v_cvt_f32_u32_e32 v2, s10
 ; GFX10-NEXT: v_cvt_f32_u32_e32 v3, s11
 ; GFX10-NEXT: s_sub_i32 s6, 0, s8
 ; GFX10-NEXT: v_rcp_iflag_f32_e32 v0, v0
[... 1,820 lines not shown ...]
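
The GFX9 change in udivrem_v4i32 (the second s_load_dwordx4 moving up past an unrelated v_mov_b32) can be mimicked with a toy reordering pass. This is only an illustration and not how createLoadClusterDAGMutation works: the real mutation adds cluster edges to the scheduling DAG and leaves the final order to the scheduler. `Instr`, `independent` and `clusterLoads` are hypothetical names, registers are compared as opaque strings (overlapping register tuples are not modeled), and memory dependencies are ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction model; all names are hypothetical.
struct Instr {
  std::string Text;
  bool IsLoad;
  std::vector<std::string> Defs; // registers written
  std::vector<std::string> Uses; // registers read
};

// True when the two register lists share at least one name.
static bool overlaps(const std::vector<std::string> &A,
                     const std::vector<std::string> &B) {
  for (const std::string &R : A)
    if (std::find(B.begin(), B.end(), R) != B.end())
      return true;
  return false;
}

// True when A and B may be swapped: no RAW, WAR or WAW register dependence.
static bool independent(const Instr &A, const Instr &B) {
  return !overlaps(A.Defs, B.Uses) && !overlaps(B.Defs, A.Uses) &&
         !overlaps(A.Defs, B.Defs);
}

// Hoist each load over preceding independent non-loads, but only when that
// places it directly after another load.
void clusterLoads(std::vector<Instr> &Seq) {
  for (std::size_t I = 1; I < Seq.size(); ++I) {
    if (!Seq[I].IsLoad)
      continue;
    std::size_t J = I;
    while (J > 0 && !Seq[J - 1].IsLoad && independent(Seq[J - 1], Seq[I]))
      --J;
    if (J < I && J > 0 && Seq[J - 1].IsLoad)
      std::rotate(Seq.begin() + J, Seq.begin() + I, Seq.begin() + I + 1);
  }
}
```

Applied to the three-instruction sequence s_load_dwordx4, v_mov_b32_e32, s_load_dwordx4 from the GFX9 checks, the sketch leaves the two loads adjacent and the v_mov_b32_e32 after them, matching the updated check lines.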

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll

[... 11,230 lines not shown ...]
 ; CHECK-NEXT: [[SHL_Y:%.*]] = shl i64 4096, [[Y:%.*]]
 ; CHECK-NEXT: [[R:%.*]] = sdiv i64 [[X:%.*]], [[SHL_Y]]
 ; CHECK-NEXT: store i64 [[R]], i64 addrspace(1)* [[OUT:%.*]], align 4
 ; CHECK-NEXT: ret void
 ;
 ; GFX6-LABEL: sdiv_i64_pow2_shl_denom:
 ; GFX6: ; %bb.0:
 ; GFX6-NEXT: s_load_dword s4, s[0:1], 0xd
-; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000
 ; GFX6-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000
 ; GFX6-NEXT: s_mov_b32 s7, 0xf000
 ; GFX6-NEXT: s_mov_b32 s6, -1
 ; GFX6-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX6-NEXT: s_lshl_b64 s[4:5], s[2:3], s4
 ; GFX6-NEXT: s_ashr_i32 s2, s5, 31
 ; GFX6-NEXT: s_add_u32 s4, s4, s2
 ; GFX6-NEXT: s_mov_b32 s3, s2
 ; GFX6-NEXT: s_addc_u32 s5, s5, s2
[... 2,104 lines not shown ...]
 ; CHECK-NEXT: [[SHL_Y:%.*]] = shl i64 4096, [[Y:%.*]]
 ; CHECK-NEXT: [[R:%.*]] = srem i64 [[X:%.*]], [[SHL_Y]]
 ; CHECK-NEXT: store i64 [[R]], i64 addrspace(1)* [[OUT:%.*]], align 4
 ; CHECK-NEXT: ret void
 ;
 ; GFX6-LABEL: srem_i64_pow2_shl_denom:
 ; GFX6: ; %bb.0:
 ; GFX6-NEXT: s_load_dword s4, s[0:1], 0xd
-; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000
 ; GFX6-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GFX6-NEXT: s_mov_b64 s[2:3], 0x1000
 ; GFX6-NEXT: s_mov_b32 s7, 0xf000
 ; GFX6-NEXT: s_mov_b32 s6, -1
 ; GFX6-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX6-NEXT: s_lshl_b64 s[2:3], s[2:3], s4
 ; GFX6-NEXT: s_ashr_i32 s4, s3, 31
 ; GFX6-NEXT: s_add_u32 s2, s2, s4
 ; GFX6-NEXT: s_mov_b32 s5, s4
 ; GFX6-NEXT: s_addc_u32 s3, s3, s4
[... 1,310 lines not shown ...]

llvm/test/CodeGen/AMDGPU/idiv-licm.ll

[... 485 lines not shown ...] bb3: ; preds = %bb3, %bb
   %tmp8 = icmp eq i16 %tmp7, 1024
   br i1 %tmp8, label %bb2, label %bb3
 }

 define amdgpu_kernel void @urem16_invariant_denom(i16 addrspace(1)* nocapture %arg, i16 %arg1) {
 ; GFX9-LABEL: urem16_invariant_denom:
 ; GFX9: ; %bb.0: ; %bb
 ; GFX9-NEXT: s_load_dword s2, s[0:1], 0x2c
-; GFX9-NEXT: s_mov_b32 s6, 0xffff
 ; GFX9-NEXT: s_load_dwordx2 s[4:5], s[0:1], 0x24
+; GFX9-NEXT: s_mov_b32 s6, 0xffff
 ; GFX9-NEXT: v_mov_b32_e32 v1, 0
 ; GFX9-NEXT: s_movk_i32 s8, 0x400
 ; GFX9-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT: s_and_b32 s7, s6, s2
 ; GFX9-NEXT: v_cvt_f32_u32_e32 v2, s7
 ; GFX9-NEXT: v_mov_b32_e32 v4, 0
 ; GFX9-NEXT: v_rcp_iflag_f32_e32 v3, v2
 ; GFX9-NEXT: BB5_1: ; %bb3
[... 262 lines not shown ...]

llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll

[... 94 lines not shown ...]
 ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-4096
 ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX900: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ;
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
-; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
-; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
-; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
+; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
+; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
+; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX10: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}

 ; GFX90A: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-4096
 ; GFX90A: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX90A: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
 ; GFX90A: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off offset:-2048
 ; GFX90A: global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], off{{$}}
[... 444 lines not shown ...]

llvm/test/CodeGen/AMDGPU/sdiv64.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -amdgpu-codegenprepare-expand-div64 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN-IR %s

 define amdgpu_kernel void @s_test_sdiv(i64 addrspace(1)* %out, i64 %x, i64 %y) {
 ; GCN-LABEL: s_test_sdiv:
 ; GCN: ; %bb.0:
 ; GCN-NEXT: s_load_dwordx2 s[4:5], s[0:1], 0xd
-; GCN-NEXT: v_mov_b32_e32 v7, 0
 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT: v_mov_b32_e32 v7, 0
 ; GCN-NEXT: s_mov_b32 s7, 0xf000
 ; GCN-NEXT: s_mov_b32 s6, -1
 ; GCN-NEXT: s_waitcnt lgkmcnt(0)
 ; GCN-NEXT: s_ashr_i32 s2, s5, 31
 ; GCN-NEXT: s_add_u32 s4, s4, s2
 ; GCN-NEXT: s_mov_b32 s3, s2
 ; GCN-NEXT: s_addc_u32 s5, s5, s2
 ; GCN-NEXT: s_xor_b64 s[12:13], s[4:5], s[2:3]
[... 2,056 lines not shown ...]

llvm/test/CodeGen/AMDGPU/srem64.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -amdgpu-codegenprepare-expand-div64 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN-IR %s

 define amdgpu_kernel void @s_test_srem(i64 addrspace(1)* %out, i64 %x, i64 %y) {
 ; GCN-LABEL: s_test_srem:
 ; GCN: ; %bb.0:
 ; GCN-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0xd
-; GCN-NEXT: v_mov_b32_e32 v2, 0
 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT: v_mov_b32_e32 v2, 0
 ; GCN-NEXT: s_mov_b32 s7, 0xf000
 ; GCN-NEXT: s_mov_b32 s6, -1
 ; GCN-NEXT: s_waitcnt lgkmcnt(0)
 ; GCN-NEXT: v_cvt_f32_u32_e32 v0, s12
 ; GCN-NEXT: v_cvt_f32_u32_e32 v1, s13
 ; GCN-NEXT: s_sub_u32 s2, 0, s12
 ; GCN-NEXT: s_subb_u32 s3, 0, s13
 ; GCN-NEXT: s_mov_b32 s4, s8
[... 2,252 lines not shown ...]

llvm/test/CodeGen/AMDGPU/udiv64.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -amdgpu-codegenprepare-expand-div64 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN-IR %s

 define amdgpu_kernel void @s_test_udiv_i64(i64 addrspace(1)* %out, i64 %x, i64 %y) {
 ; GCN-LABEL: s_test_udiv_i64:
 ; GCN: ; %bb.0:
 ; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0xd
-; GCN-NEXT: v_mov_b32_e32 v2, 0
 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT: v_mov_b32_e32 v2, 0
 ; GCN-NEXT: s_mov_b32 s7, 0xf000
 ; GCN-NEXT: s_mov_b32 s6, -1
 ; GCN-NEXT: s_waitcnt lgkmcnt(0)
 ; GCN-NEXT: v_cvt_f32_u32_e32 v0, s2
 ; GCN-NEXT: v_cvt_f32_u32_e32 v1, s3
 ; GCN-NEXT: s_sub_u32 s4, 0, s2
 ; GCN-NEXT: s_subb_u32 s5, 0, s3
 ; GCN-NEXT: v_mac_f32_e32 v0, 0x4f800000, v1
[... 1,939 lines not shown ...]

llvm/test/CodeGen/AMDGPU/urem64.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
 ; RUN: llc -march=amdgcn -mcpu=gfx600 -amdgpu-bypass-slow-div=0 -amdgpu-codegenprepare-expand-div64 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN-IR %s

 define amdgpu_kernel void @s_test_urem_i64(i64 addrspace(1)* %out, i64 %x, i64 %y) {
 ; GCN-LABEL: s_test_urem_i64:
 ; GCN: ; %bb.0:
 ; GCN-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0xd
-; GCN-NEXT: v_mov_b32_e32 v2, 0
 ; GCN-NEXT: s_load_dwordx4 s[8:11], s[0:1], 0x9
+; GCN-NEXT: v_mov_b32_e32 v2, 0
 ; GCN-NEXT: s_mov_b32 s7, 0xf000
 ; GCN-NEXT: s_mov_b32 s6, -1
 ; GCN-NEXT: s_waitcnt lgkmcnt(0)
 ; GCN-NEXT: v_cvt_f32_u32_e32 v0, s12
 ; GCN-NEXT: v_cvt_f32_u32_e32 v1, s13
 ; GCN-NEXT: s_sub_u32 s2, 0, s12
 ; GCN-NEXT: s_subb_u32 s3, 0, s13
 ; GCN-NEXT: s_mov_b32 s4, s8
[... 1,622 lines not shown ...]
