Details

Reviewers

arsenm
foad

Commits

rG3870b3602552: [AMDGPU] Split unaligned 3 DWORD DS operations

Summary

I have written a minitest to check the performance. Overall,
the benefit of b96 operations on data that is not known to
be aligned but happens to be aligned is small, while the
performance hit of using b96 operations on memory that
really is unaligned is high.

The only exception is data that is not aligned even to 4
bytes; in that case it is better to use b96.

Here is the test output on Vega and Navi:

Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b96                                  aligned: 3.4 sec
ds_write_b32 + ds_write_b64                   aligned: 4.5 sec
ds_write_b32 * 3                              aligned: 4.8 sec
ds_write_b96                          misaligned by 1: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 7.2 sec
ds_write_b32 * 3                      misaligned by 1: 10.0 sec
ds_write_b96                          misaligned by 2: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 7.2 sec
ds_write_b32 * 3                      misaligned by 2: 10.1 sec
ds_write_b96                          misaligned by 4: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.2 sec
ds_write_b32 * 3                      misaligned by 4: 4.9 sec
ds_write_b96                          misaligned by 8: 4.8 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 4.6 sec
ds_write_b32 * 3                      misaligned by 8: 4.9 sec
ds_read_b96                                   aligned: 3.3 sec
ds_read_b32 + ds_read_b64                     aligned: 4.9 sec
ds_read_b32 * 3                               aligned: 2.6 sec
ds_read_b96                           misaligned by 1: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 7.2 sec
ds_read_b32 * 3                       misaligned by 1: 10.1 sec
ds_read_b96                           misaligned by 2: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 7.2 sec
ds_read_b32 * 3                       misaligned by 2: 10.1 sec
ds_read_b96                           misaligned by 4: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 2.6 sec
ds_read_b32 * 3                       misaligned by 4: 2.6 sec
ds_read_b96                           misaligned by 8: 4.1 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 4.9 sec
ds_read_b32 * 3                       misaligned by 8: 2.6 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b96                                  aligned: 4.1 sec
ds_write_b32 + ds_write_b64                   aligned: 13.0 sec
ds_write_b32 * 3                              aligned: 4.5 sec
ds_write_b96                          misaligned by 1: 12.5 sec
ds_write_b32 + ds_write_b64           misaligned by 1: 22.0 sec
ds_write_b32 * 3                      misaligned by 1: 31.5 sec
ds_write_b96                          misaligned by 2: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 2: 22.0 sec
ds_write_b32 * 3                      misaligned by 2: 31.5 sec
ds_write_b96                          misaligned by 4: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 4: 4.0 sec
ds_write_b32 * 3                      misaligned by 4: 4.5 sec
ds_write_b96                          misaligned by 8: 12.4 sec
ds_write_b32 + ds_write_b64           misaligned by 8: 13.0 sec
ds_write_b32 * 3                      misaligned by 8: 4.5 sec
ds_read_b96                                   aligned: 3.8 sec
ds_read_b32 + ds_read_b64                     aligned: 12.8 sec
ds_read_b32 * 3                               aligned: 4.4 sec
ds_read_b96                           misaligned by 1: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 1: 21.8 sec
ds_read_b32 * 3                       misaligned by 1: 31.5 sec
ds_read_b96                           misaligned by 2: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 2: 21.9 sec
ds_read_b32 * 3                       misaligned by 2: 31.5 sec
ds_read_b96                           misaligned by 4: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 4: 3.8 sec
ds_read_b32 * 3                       misaligned by 4: 4.5 sec
ds_read_b96                           misaligned by 8: 10.9 sec
ds_read_b32 + ds_read_b64             misaligned by 8: 12.8 sec
ds_read_b32 * 3                       misaligned by 8: 4.5 sec

Fixes: SWDEV-330802

Diff Detail

Event Timeline

rampitec created this revision. Apr 11 2022, 10:43 AM

Herald added a project: Restricted Project. Apr 11 2022, 10:43 AM

Herald added subscribers: hsmhsm, kerbowa, hiraditya and 2 others.

rampitec requested review of this revision. Apr 11 2022, 10:43 AM

Herald added a subscriber: wdng.

Harbormaster completed remote builds in B159038: Diff 421963. Apr 11 2022, 10:43 AM

rampitec added a parent revision: D123343: [AMDGPU] Refactor LDS alignment checks.. Apr 11 2022, 10:48 AM

rampitec updated this revision to Diff 421978. Apr 11 2022, 11:15 AM

Harbormaster completed remote builds in B159052: Diff 421978. Apr 11 2022, 11:56 AM

Fixed comment.

Harbormaster completed remote builds in B159110: Diff 422054. Apr 11 2022, 4:45 PM

Looks OK to me. But there will always be benchmarks that go faster and slower with any change like this, because the compiler does not have perfect knowledge about the (mis)alignment of all data.

llvm/lib/Target/AMDGPU/DSInstructions.td
880	Typo "accesses", "performance"
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1564	Note that `Alignment < Align(4)` does not prove that the address is not dword aligned, just that the compiler does not know it's dword aligned. But I guess this is the best we can do for now.
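The distinction in the inline comment above, between the alignment the compiler knows and the alignment an address actually has, can be illustrated with a small sketch. `actualAlignment` is a hypothetical helper written for this note and is not an LLVM API; the addresses are made-up examples.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API): the largest power of two, capped at
// 16, that divides a byte address, i.e. the alignment the address really has.
constexpr unsigned actualAlignment(std::uint64_t Addr) {
  unsigned A = 16;
  while (A > 1 && Addr % A != 0)
    A /= 2;
  return A;
}

// The compiler only tracks a lower bound (the "known" alignment). All three
// example addresses below are consistent with a known alignment of 1, so
// `Alignment < Align(4)` holds for each, yet only the first is really
// misaligned with respect to a dword.
static_assert(actualAlignment(0x1001) == 1, "really misaligned by 1");
static_assert(actualAlignment(0x1004) == 4, "happens to be dword aligned");
static_assert(actualAlignment(0x1000) == 16, "happens to be 16-byte aligned");
```

The checks are evaluated at compile time, so the snippet only needs to compile.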

This revision is now accepted and ready to land. Apr 12 2022, 1:25 AM

arsenm accepted this revision. Apr 12 2022, 6:03 AM

Fixed typos.

Herald added subscribers: t-tye, tpr, dstuttard and 2 others. Apr 12 2022, 7:53 AM

In D123524#3444855, @foad wrote:

Looks OK to me. But there will always be benchmarks that go faster and slower with any change like this, because the compiler does not have perfect knowledge about the (mis)alignment of all data.

Yes, sure. The point of the patch is to minimize the cost of a wrong guess. If the data really is aligned, it will be slower now; but if it really is misaligned, the single instruction used before the patch is far slower.
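The "cost of the mistake" argument can be made concrete with the gfx1030 read timings from the summary. The sketch below is only illustrative: it uses the `ds_read_b32 * 3` rows as a stand-in for the split lowering (an assumption, since the compiler may emit a different split sequence), and the names `Timing`, `regret` and `worstRegret` are hypothetical.

```cpp
// Seconds taken from the gfx1030 ds_read rows of the summary table.
struct Timing {
  double Aligned;        // "aligned" row
  double MisalignedBy4;  // "misaligned by 4" row
};

constexpr Timing B96{3.8, 10.9};   // ds_read_b96
constexpr Timing B32x3{4.4, 4.5};  // ds_read_b32 * 3, used here as the split

// Extra time paid for picking `Chosen` instead of `Other` in one scenario.
constexpr double regret(Timing Chosen, Timing Other, bool ReallyAligned) {
  double C = ReallyAligned ? Chosen.Aligned : Chosen.MisalignedBy4;
  double O = ReallyAligned ? Other.Aligned : Other.MisalignedBy4;
  return C > O ? C - O : 0.0;
}

// Worst case over both scenarios, since the compiler cannot tell them apart.
constexpr double worstRegret(Timing Chosen, Timing Other) {
  double A = regret(Chosen, Other, true);
  double M = regret(Chosen, Other, false);
  return A > M ? A : M;
}

// Splitting loses about 0.6 s when the data is really aligned; keeping b96
// loses about 6.4 s when the data is really misaligned by 4.
static_assert(worstRegret(B32x3, B96) < worstRegret(B96, B32x3),
              "splitting has the smaller worst-case penalty");
```

With these numbers the split costs little when the guess is wrong, while the single b96 instruction costs a lot, which is the trade-off the reply describes.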

rampitec marked an inline comment as done. Apr 12 2022, 8:02 AM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1564	Right. In that case, if the address is misaligned by 1 or 2, a single instruction is faster. If it is misaligned by 4 or 8, splitting into 32-bit instructions would be slightly faster, but that is what we do not know. If it really is aligned, the single instruction is the fastest. Without the Align(4) check, however, we would have to split it into b8 instructions, and that would be really slow in every scenario.
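A minimal sketch of the resulting decision for a 96-bit LDS access in unaligned access mode, mirroring the `Alignment >= RequiredAlignment || Alignment < Align(4)` condition from the diff. Plain byte counts stand in for `llvm::Align`, the names `Lowering` and `chooseB96Lowering` are hypothetical, and it is assumed here that a not-fast result leads to the access being split into dword operations.

```cpp
// Hypothetical model of the choice; not the actual SIISelLowering code.
enum class Lowering { SingleB96, SplitDwords };

constexpr Lowering chooseB96Lowering(unsigned KnownAlignBytes) {
  const unsigned RequiredAlignBytes = 16;
  // Fast when the access is known to be naturally aligned, or when it is not
  // even known to be dword aligned (a split would then need narrower pieces).
  const bool Fast =
      KnownAlignBytes >= RequiredAlignBytes || KnownAlignBytes < 4;
  return Fast ? Lowering::SingleB96 : Lowering::SplitDwords;
}

static_assert(chooseB96Lowering(1) == Lowering::SingleB96, "align 1");
static_assert(chooseB96Lowering(2) == Lowering::SingleB96, "align 2");
static_assert(chooseB96Lowering(4) == Lowering::SplitDwords, "align 4");
static_assert(chooseB96Lowering(8) == Lowering::SplitDwords, "align 8");
static_assert(chooseB96Lowering(16) == Lowering::SingleB96, "align 16");
```

Only known alignments of 4 and 8 end up split, which matches the cases where the benchmark shows the single b96 instruction losing badly.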

This revision was landed with ongoing or failed builds. Apr 12 2022, 8:05 AM

Closed by commit rG3870b3602552: [AMDGPU] Split unaligned 3 DWORD DS operations (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec marked an inline comment as done.

rampitec added a commit: rG3870b3602552: [AMDGPU] Split unaligned 3 DWORD DS operations.

Harbormaster completed remote builds in B159242: Diff 422233. Apr 12 2022, 8:27 AM

Diff 422054

llvm/lib/Target/AMDGPU/DSInstructions.td

	foreach vt = VReg_128.RegTypes in {			foreach vt = VReg_128.RegTypes in {
	defm : DSReadPat_mc <DS_READ_B128, vt, "load_align16_local">;			defm : DSReadPat_mc <DS_READ_B128, vt, "load_align16_local">;
	defm : DSWritePat_mc <DS_WRITE_B128, vt, "store_align16_local">;			defm : DSWritePat_mc <DS_WRITE_B128, vt, "store_align16_local">;
	}			}

	let SubtargetPredicate = HasUnalignedAccessMode in {			let SubtargetPredicate = HasUnalignedAccessMode in {

	// FIXME: From performance point of view, is ds_read_b96/ds_write_b96 better choice			// Selection will split most of the unaligned 3 dword acceses due to performace
	// for unaligned accesses?			// reasons when beneficial. Keep these two patterns for the rest of the cases.
	foreach vt = VReg_96.RegTypes in {			foreach vt = VReg_96.RegTypes in {
	defm : DSReadPat_mc <DS_READ_B96, vt, "load_local">;			defm : DSReadPat_mc <DS_READ_B96, vt, "load_local">;
	defm : DSWritePat_mc <DS_WRITE_B96, vt, "store_local">;			defm : DSWritePat_mc <DS_WRITE_B96, vt, "store_local">;
	}			}

	// For performance reasons, do not select ds_read_b128/ds_write_b128 for unaligned			// For performance reasons, do not select ds_read_b128/ds_write_b128 for unaligned
	// accesses.			// accesses.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

case 64:
break;		break;
case 96:		case 96:
if (!Subtarget->hasDS96AndDS128())		if (!Subtarget->hasDS96AndDS128())
return false;		return false;

// 12 byte accessing via ds_read/write_b96 require 16-byte alignment on		// 12 byte accessing via ds_read/write_b96 require 16-byte alignment on
// gfx8 and older.		// gfx8 and older.
RequiredAlignment = Align(16);		RequiredAlignment = Align(16);

		if (Subtarget->hasUnalignedDSAccessEnabled()) {
		// Naturally aligned access is fastest. However, also report it is Fast
		// if memory is aligned less than DWORD. A narrow load or store will be
		// be equally slow as a single ds_read_b96/ds_write_b96, but there will
		// be more of them, so overall we will pay less penalty issuing a single
		// instruction.
		if (IsFast)
		*IsFast = Alignment >= RequiredAlignment || Alignment < Align(4);
		return true;
		}

break;		break;
case 128:		case 128:
if (!Subtarget->hasDS96AndDS128() || !Subtarget->useDS128())		if (!Subtarget->hasDS96AndDS128() || !Subtarget->useDS128())
return false;		return false;

// 16 byte accessing via ds_read/write_b128 require 16-byte alignment on		// 16 byte accessing via ds_read/write_b128 require 16-byte alignment on
// gfx8 and older, but we can do a 8 byte aligned, 16 byte access in a		// gfx8 and older, but we can do a 8 byte aligned, 16 byte access in a
// single operation using ds_read2/write2_b64.		// single operation using ds_read2/write2_b64.

llvm/test/CodeGen/AMDGPU/ds-alignment.ll

	; ALIGNED-NEXT: ds_read2_b32 v[0:1], v2 offset1:1			; ALIGNED-NEXT: ds_read2_b32 v[0:1], v2 offset1:1
	; ALIGNED-NEXT: ds_read_b32 v2, v2 offset:8			; ALIGNED-NEXT: ds_read_b32 v2, v2 offset:8
	; ALIGNED-NEXT: v_mov_b32_e32 v3, s1			; ALIGNED-NEXT: v_mov_b32_e32 v3, s1
	; ALIGNED-NEXT: s_waitcnt lgkmcnt(1)			; ALIGNED-NEXT: s_waitcnt lgkmcnt(1)
	; ALIGNED-NEXT: ds_write2_b32 v3, v0, v1 offset1:1			; ALIGNED-NEXT: ds_write2_b32 v3, v0, v1 offset1:1
	; ALIGNED-NEXT: s_waitcnt lgkmcnt(1)			; ALIGNED-NEXT: s_waitcnt lgkmcnt(1)
	; ALIGNED-NEXT: ds_write_b32 v3, v2 offset:8			; ALIGNED-NEXT: ds_write_b32 v3, v2 offset:8
	; ALIGNED-NEXT: s_endpgm			; ALIGNED-NEXT: s_endpgm
	;
	; UNALIGNED-LABEL: ds12align4:
	; UNALIGNED: ; %bb.0:
	; UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
	; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; UNALIGNED-NEXT: v_mov_b32_e32 v0, s0
	; UNALIGNED-NEXT: ds_read_b96 v[0:2], v0
	; UNALIGNED-NEXT: v_mov_b32_e32 v3, s1
	; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; UNALIGNED-NEXT: ds_write_b96 v3, v[0:2]
	; UNALIGNED-NEXT: s_endpgm
	%val = load <3 x i32>, <3 x i32> addrspace(3)* %in, align 4			%val = load <3 x i32>, <3 x i32> addrspace(3)* %in, align 4
	store <3 x i32> %val, <3 x i32> addrspace(3)* %out, align 4			store <3 x i32> %val, <3 x i32> addrspace(3)* %out, align 4
	ret void			ret void
	}			}

	; TODO: Why does the ALIGNED-SDAG code use ds_write_b64 but not ds_read_b64?
	define amdgpu_kernel void @ds12align8(<3 x i32> addrspace(3)* %in, <3 x i32> addrspace(3)* %out) {			define amdgpu_kernel void @ds12align8(<3 x i32> addrspace(3)* %in, <3 x i32> addrspace(3)* %out) {
	; ALIGNED-SDAG-LABEL: ds12align8:			; ALIGNED-SDAG-LABEL: ds12align8:
	; ALIGNED-SDAG: ; %bb.0:			; ALIGNED-SDAG: ; %bb.0:
	; ALIGNED-SDAG-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24			; ALIGNED-SDAG-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
	; ALIGNED-SDAG-NEXT: s_waitcnt lgkmcnt(0)			; ALIGNED-SDAG-NEXT: s_waitcnt lgkmcnt(0)
	; ALIGNED-SDAG-NEXT: v_mov_b32_e32 v2, s0			; ALIGNED-SDAG-NEXT: v_mov_b32_e32 v2, s0
	; ALIGNED-SDAG-NEXT: ds_read_b64 v[0:1], v2			; ALIGNED-SDAG-NEXT: ds_read_b64 v[0:1], v2
	; ALIGNED-SDAG-NEXT: ds_read_b32 v2, v2 offset:8			; ALIGNED-SDAG-NEXT: ds_read_b32 v2, v2 offset:8
	...
	; ALIGNED-GISEL-NEXT: ds_read2_b32 v[0:1], v2 offset1:1			; ALIGNED-GISEL-NEXT: ds_read2_b32 v[0:1], v2 offset1:1
	; ALIGNED-GISEL-NEXT: ds_read_b32 v2, v2 offset:8			; ALIGNED-GISEL-NEXT: ds_read_b32 v2, v2 offset:8
	; ALIGNED-GISEL-NEXT: v_mov_b32_e32 v3, s1			; ALIGNED-GISEL-NEXT: v_mov_b32_e32 v3, s1
	; ALIGNED-GISEL-NEXT: s_waitcnt lgkmcnt(1)			; ALIGNED-GISEL-NEXT: s_waitcnt lgkmcnt(1)
	; ALIGNED-GISEL-NEXT: ds_write2_b32 v3, v0, v1 offset1:1			; ALIGNED-GISEL-NEXT: ds_write2_b32 v3, v0, v1 offset1:1
	; ALIGNED-GISEL-NEXT: s_waitcnt lgkmcnt(1)			; ALIGNED-GISEL-NEXT: s_waitcnt lgkmcnt(1)
	; ALIGNED-GISEL-NEXT: ds_write_b32 v3, v2 offset:8			; ALIGNED-GISEL-NEXT: ds_write_b32 v3, v2 offset:8
	; ALIGNED-GISEL-NEXT: s_endpgm			; ALIGNED-GISEL-NEXT: s_endpgm
	;
	; UNALIGNED-LABEL: ds12align8:
	; UNALIGNED: ; %bb.0:
	; UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
	; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; UNALIGNED-NEXT: v_mov_b32_e32 v0, s0
	; UNALIGNED-NEXT: ds_read_b96 v[0:2], v0
	; UNALIGNED-NEXT: v_mov_b32_e32 v3, s1
	; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; UNALIGNED-NEXT: ds_write_b96 v3, v[0:2]
	; UNALIGNED-NEXT: s_endpgm
	%val = load <3 x i32>, <3 x i32> addrspace(3)* %in, align 8			%val = load <3 x i32>, <3 x i32> addrspace(3)* %in, align 8
	store <3 x i32> %val, <3 x i32> addrspace(3)* %out, align 8			store <3 x i32> %val, <3 x i32> addrspace(3)* %out, align 8
	ret void			ret void
	}			}

	define amdgpu_kernel void @ds12align16(<3 x i32> addrspace(3)* %in, <3 x i32> addrspace(3)* %out) {			define amdgpu_kernel void @ds12align16(<3 x i32> addrspace(3)* %in, <3 x i32> addrspace(3)* %out) {
	; GCN-LABEL: ds12align16:			; GCN-LABEL: ds12align16:
	; GCN: ; %bb.0:			; GCN: ; %bb.0:

llvm/test/CodeGen/AMDGPU/lds-misaligned-bug.ll

bb:
%v6 = insertelement <4 x i32> %v5, i32 %v3, i32 1		%v6 = insertelement <4 x i32> %v5, i32 %v3, i32 1
%v7 = insertelement <4 x i32> %v6, i32 %v2, i32 2		%v7 = insertelement <4 x i32> %v6, i32 %v2, i32 2
%v8 = insertelement <4 x i32> %v7, i32 %v1, i32 3		%v8 = insertelement <4 x i32> %v7, i32 %v1, i32 3
store <4 x i32> %v8, <4 x i32> addrspace(3)* %ptr, align 4		store <4 x i32> %v8, <4 x i32> addrspace(3)* %ptr, align 4
ret void		ret void
}		}

; GCN-LABEL: test_local_misaligned_v3:		; GCN-LABEL: test_local_misaligned_v3:
; ALIGNED-DAG: ds_read2_b32		; GCN-DAG: ds_read2_b32
; ALIGNED-DAG: ds_read_b32		; GCN-DAG: ds_read_b32
; ALIGNED-DAG: ds_write2_b32		; GCN-DAG: ds_write2_b32
; ALIGNED-DAG: ds_write_b32		; GCN-DAG: ds_write_b32
; UNALIGNED-DAG: ds_read_b96
; UNALIGNED-DAG: ds_write_b96
define amdgpu_kernel void @test_local_misaligned_v3(i32 addrspace(3)* %arg) {		define amdgpu_kernel void @test_local_misaligned_v3(i32 addrspace(3)* %arg) {
bb:		bb:
%lid = tail call i32 @llvm.amdgcn.workitem.id.x()		%lid = tail call i32 @llvm.amdgcn.workitem.id.x()
%gep = getelementptr inbounds i32, i32 addrspace(3)* %arg, i32 %lid		%gep = getelementptr inbounds i32, i32 addrspace(3)* %arg, i32 %lid
%ptr = bitcast i32 addrspace(3)* %gep to <3 x i32> addrspace(3)*		%ptr = bitcast i32 addrspace(3)* %gep to <3 x i32> addrspace(3)*
%load = load <3 x i32>, <3 x i32> addrspace(3)* %ptr, align 4		%load = load <3 x i32>, <3 x i32> addrspace(3)* %ptr, align 4
%v1 = extractelement <3 x i32> %load, i32 0		%v1 = extractelement <3 x i32> %load, i32 0
%v2 = extractelement <3 x i32> %load, i32 1		%v2 = extractelement <3 x i32> %load, i32 1

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Split unaligned 3 DWORD DS operations
ClosedPublic
