This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Refine 64 bit misaligned LDS ops selection
Closed, Public

Authored by rampitec on Apr 18 2022, 12:35 PM.

Details

Reviewers

arsenm
foad

Commits

rGac94073daa18: [AMDGPU] Refine 64 bit misaligned LDS ops selection

Summary

Here is the performance data:

Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64                       aligned by  8:  3.2 sec
ds_write2_b32                      aligned by  8:  3.2 sec
ds_write_b16 * 4                   aligned by  8:  7.0 sec
ds_write_b8 * 8                    aligned by  8: 13.2 sec
ds_write_b64                       aligned by  1:  7.3 sec
ds_write2_b32                      aligned by  1:  7.5 sec
ds_write_b16 * 4                   aligned by  1: 14.0 sec
ds_write_b8 * 8                    aligned by  1: 13.2 sec
ds_write_b64                       aligned by  2:  7.3 sec
ds_write2_b32                      aligned by  2:  7.5 sec
ds_write_b16 * 4                   aligned by  2:  7.1 sec
ds_write_b8 * 8                    aligned by  2: 13.3 sec
ds_write_b64                       aligned by  4:  4.6 sec
ds_write2_b32                      aligned by  4:  3.2 sec
ds_write_b16 * 4                   aligned by  4:  7.1 sec
ds_write_b8 * 8                    aligned by  4: 13.3 sec
ds_read_b64                        aligned by  8:  2.3 sec
ds_read2_b32                       aligned by  8:  2.2 sec
ds_read_u16 * 4                    aligned by  8:  4.8 sec
ds_read_u8 * 8                     aligned by  8:  8.6 sec
ds_read_b64                        aligned by  1:  4.4 sec
ds_read2_b32                       aligned by  1:  7.3 sec
ds_read_u16 * 4                    aligned by  1: 14.0 sec
ds_read_u8 * 8                     aligned by  1:  8.7 sec
ds_read_b64                        aligned by  2:  4.4 sec
ds_read2_b32                       aligned by  2:  7.3 sec
ds_read_u16 * 4                    aligned by  2:  4.8 sec
ds_read_u8 * 8                     aligned by  2:  8.7 sec
ds_read_b64                        aligned by  4:  4.4 sec
ds_read2_b32                       aligned by  4:  2.3 sec
ds_read_u16 * 4                    aligned by  4:  4.8 sec
ds_read_u8 * 8                     aligned by  4:  8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64                       aligned by  8:  4.4 sec
ds_write2_b32                      aligned by  8:  4.3 sec
ds_write_b16 * 4                   aligned by  8:  7.9 sec
ds_write_b8 * 8                    aligned by  8: 13.0 sec
ds_write_b64                       aligned by  1: 23.2 sec
ds_write2_b32                      aligned by  1: 23.1 sec
ds_write_b16 * 4                   aligned by  1: 44.0 sec
ds_write_b8 * 8                    aligned by  1: 13.0 sec
ds_write_b64                       aligned by  2: 23.2 sec
ds_write2_b32                      aligned by  2: 23.1 sec
ds_write_b16 * 4                   aligned by  2:  7.9 sec
ds_write_b8 * 8                    aligned by  2: 13.1 sec
ds_write_b64                       aligned by  4: 13.5 sec
ds_write2_b32                      aligned by  4:  4.3 sec
ds_write_b16 * 4                   aligned by  4:  7.9 sec
ds_write_b8 * 8                    aligned by  4: 13.1 sec
ds_read_b64                        aligned by  8:  3.5 sec
ds_read2_b32                       aligned by  8:  3.4 sec
ds_read_u16 * 4                    aligned by  8:  5.3 sec
ds_read_u8 * 8                     aligned by  8:  8.5 sec
ds_read_b64                        aligned by  1: 13.1 sec
ds_read2_b32                       aligned by  1: 22.7 sec
ds_read_u16 * 4                    aligned by  1: 43.9 sec
ds_read_u8 * 8                     aligned by  1:  7.9 sec
ds_read_b64                        aligned by  2: 13.1 sec
ds_read2_b32                       aligned by  2: 22.7 sec
ds_read_u16 * 4                    aligned by  2:  5.6 sec
ds_read_u8 * 8                     aligned by  2:  7.9 sec
ds_read_b64                        aligned by  4: 13.1 sec
ds_read2_b32                       aligned by  4:  3.4 sec
ds_read_u16 * 4                    aligned by  4:  5.6 sec
ds_read_u8 * 8                     aligned by  4:  7.9 sec

GFX10 shows a different pattern for sub-dword load/store performance
than GFX9. On GFX9 a single unaligned load or store is faster than a
fully split b8 access, whereas on GFX10 even a full split is faster.
However, this gain is only theoretical because splitting an access to
the sub-dword level requires more registers and packing/unpacking
logic. Ignoring that option, it is better to use a single 64 bit
instruction on misaligned data, with the exception of 4 byte aligned
data, where ds_read2_b32/ds_write2_b32 is better.
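
The selection rule described above can be sketched in a few lines of C++. This is only an illustration under simplifying assumptions: `Ds64Op` and `pickDs64Op` are hypothetical names rather than LLVM identifiers, subtarget checks such as usable DS offsets are ignored, and the split case stands for whatever sub-dword sequence is emitted when unaligned access mode is off.

```cpp
// Hypothetical names for illustration only; these are not LLVM identifiers.
enum class Ds64Op {
  B64,    // single ds_read_b64 / ds_write_b64
  TwoB32, // ds_read2_b32 / ds_write2_b32 with adjacent offsets
  Split   // sequence of sub-dword (b16 or b8) accesses
};

// Pick the instruction form for a 64 bit LDS access from its alignment in
// bytes. Assumption: everything except alignment and unaligned access mode
// is ignored.
constexpr Ds64Op pickDs64Op(unsigned AlignBytes, bool UnalignedAccessMode) {
  return AlignBytes >= 8       ? Ds64Op::B64    // naturally aligned
         : AlignBytes >= 4     ? Ds64Op::TwoB32 // each dword is aligned
         : UnalignedAccessMode ? Ds64Op::B64    // one misaligned access
                               : Ds64Op::Split; // hardware cannot misalign
}

// Compile-time checks of the cases discussed in the summary.
static_assert(pickDs64Op(4, true) == Ds64Op::TwoB32, "align 4 uses two b32");
static_assert(pickDs64Op(1, true) == Ds64Op::B64, "align 1 uses one b64");
```

With unaligned access mode enabled, alignments 1 and 2 map to the single b64 form, which matches the updated UNALIGNED check lines in ds-alignment.ll.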

Diff Detail

Event Timeline

rampitec created this revision. Apr 18 2022, 12:35 PM

Herald added a project: Restricted Project. Apr 18 2022, 12:35 PM

Herald added subscribers: hsmhsm, kerbowa, hiraditya and 7 others.

rampitec requested review of this revision. Apr 18 2022, 12:35 PM

Herald added a subscriber: wdng.

Harbormaster completed remote builds in B160099: Diff 423452. Apr 18 2022, 1:52 PM

I think this is it for LDS. 32 byte access is already fine: we do not want to split it, because although splitting is faster on Navi, it increases register pressure.

One remaining issue which may be addressed is LoadStoreVectorization, which would need a change to the definition of "fast" itself.

Another one is global isel: it would be better for it to handle "fast", which is technically easy, but the only place I can see for it is in the legalization, which is a layering violation. So it needs a handler in the lowering instead.

Then there should be the same handling and experiments for global. Global isel should likely come after that, as I don't think it is generally a good thing to distinguish between address spaces rather than relying on "allowed" and "fast" alone.

Another potential area is ignoring "fast" when optimizing for size.

LGTM. Is this benchmark getting permanently added to a suite somewhere?

This revision is now accepted and ready to land. Apr 21 2022, 6:49 AM

In D123956#3464624, @arsenm wrote:

LGTM. Is this benchmark getting permanently added to a suite somewhere?

I wish to; I do not want to write it again when we need to bring up a new target. We can discuss where the right place is at the next meeting.

JFYI, I have written the same tests for global, mubuf and flat scratch since then. We seem to do the right thing for global, but it is better to always prefer a single dword access on swizzled memory regardless of alignment.

This revision was landed with ongoing or failed builds. Apr 21 2022, 9:37 AM

Closed by commit rGac94073daa18: [AMDGPU] Refine 64 bit misaligned LDS ops selection (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGac94073daa18: [AMDGPU] Refine 64 bit misaligned LDS ops selection.

JonChesterfield added a subscriber: JonChesterfield. Aug 21 2023, 7:15 AM

Herald added subscribers: wangpc, StephenFan. Aug 21 2023, 7:15 AM

Revision Contents

Path                                        Size
llvm/lib/Target/AMDGPU/DSInstructions.td    9 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp   20 lines
llvm/test/CodeGen/AMDGPU/ds-alignment.ll    8 lines
llvm/test/CodeGen/AMDGPU/ds_read2.ll        9 lines
llvm/test/CodeGen/AMDGPU/ds_write2.ll       14 lines

Diff 423452

llvm/lib/Target/AMDGPU/DSInstructions.td

[871 lines not shown]

 foreach vt = VReg_128.RegTypes in {
   defm : DSReadPat_mc <DS_READ_B128, vt, "load_align16_local">;
   defm : DSWritePat_mc <DS_WRITE_B128, vt, "store_align16_local">;
 }

 let SubtargetPredicate = HasUnalignedAccessMode in {

+// Select 64 bit loads and stores aligned less than 4 as a single ds_read_b64/
+// ds_write_b64 instruction as this is faster than ds_read2_b32/ds_write2_b32
+// which would be used otherwise. In this case a b32 access would still be
+// misaligned, but we will have 2 of them.
+foreach vt = VReg_64.RegTypes in {
+  defm : DSReadPat_mc <DS_READ_B64, vt, "load_align_less_than_4_local">;
+  defm : DSWritePat_mc <DS_WRITE_B64, vt, "store_align_less_than_4_local">;
+}
+
 // Selection will split most of the unaligned 3 dword accesses due to performance
 // reasons when beneficial. Keep these two patterns for the rest of the cases.
 foreach vt = VReg_96.RegTypes in {
   defm : DSReadPat_mc <DS_READ_B96, vt, "load_local">;
   defm : DSWritePat_mc <DS_WRITE_B96, vt, "store_local">;
 }

 // Select 128 bit loads and stores aligned less than 4 as a single ds_read_b128/

[553 lines not shown]

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

[1,539 lines not shown; hunk context: case 64:]

   // load later in the SILoadStoreOptimizer.
   if (!Subtarget->hasUsableDSOffset() && Alignment < Align(8))
     return false;

   // 8 byte accessing via ds_read/write_b64 require 8-byte alignment, but we
   // can do a 4 byte aligned, 8 byte access in a single operation using
   // ds_read2/write2_b32 with adjacent offsets.
   RequiredAlignment = Align(4);
+
+  if (Subtarget->hasUnalignedDSAccessEnabled()) {
+    // We will either select ds_read_b64/ds_write_b64 or ds_read2_b32/
+    // ds_write2_b32 depending on the alignment. In either case with either
+    // alignment there is no faster way of doing this.
+    if (IsFast)
+      *IsFast = true;
+    return true;
+  }

   break;
 case 96:
   if (!Subtarget->hasDS96AndDS128())
     return false;

   // 12 byte accessing via ds_read/write_b96 require 16-byte alignment on
   // gfx8 and older.

[32 lines not shown; hunk context: case 128:]

   break;
 default:
   if (Size > 32)
     return false;

   break;
 }

-if (IsFast) {
-  // FIXME: Lie it is fast if +unaligned-access-mode is passed so that
-  // DS accesses get vectorized. Do this only for sizes below 96 as
-  // b96 and b128 cases already properly handled.
-  // Remove Subtarget check once all sizes properly handled.
-  *IsFast = Alignment >= RequiredAlignment ||
-            (Subtarget->hasUnalignedDSAccessEnabled() && Size < 96);
-}
+if (IsFast)
+  *IsFast = Alignment >= RequiredAlignment;

 return Alignment >= RequiredAlignment ||
        Subtarget->hasUnalignedDSAccessEnabled();
 }

 if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
   bool AlignedBy4 = Alignment >= Align(4);
   if (IsFast)

[11,108 lines not shown]
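
As a rough model of the SIISelLowering.cpp change, the following sketch compares the old and new "allowed" and "fast" answers for LDS accesses below 96 bits. The names `LdsQuery`, `Result`, `oldLogic` and `newLogic` are hypothetical, the required alignment is passed in rather than computed, and the `hasUsableDSOffset` bail-out and the 96/128 bit cases are left out, so this is a simplification and not the actual LLVM code.

```cpp
// Hypothetical model types; not LLVM identifiers.
struct LdsQuery {
  unsigned SizeBits;           // access size in bits (assumed below 96)
  unsigned AlignBytes;         // known alignment of the access
  unsigned RequiredAlignBytes; // alignment the switch decided is required
  bool UnalignedDSAccess;      // stands in for hasUnalignedDSAccessEnabled()
};

struct Result {
  bool Allowed; // the function's return value
  bool Fast;    // the value written to *IsFast
};

// Before the patch: "fast" was reported for any size below 96 bits whenever
// unaligned DS access was enabled (the FIXME in the removed lines).
constexpr Result oldLogic(LdsQuery Q) {
  return Result{Q.AlignBytes >= Q.RequiredAlignBytes || Q.UnalignedDSAccess,
                Q.AlignBytes >= Q.RequiredAlignBytes ||
                    (Q.UnalignedDSAccess && Q.SizeBits < 96)};
}

// After the patch: the 64 bit case returns early as allowed and fast, and
// every other size that reaches the tail is fast only when sufficiently
// aligned.
constexpr Result newLogic(LdsQuery Q) {
  return (Q.SizeBits == 64 && Q.UnalignedDSAccess)
             ? Result{true, true}
             : Result{Q.AlignBytes >= Q.RequiredAlignBytes ||
                          Q.UnalignedDSAccess,
                      Q.AlignBytes >= Q.RequiredAlignBytes};
}

// A misaligned 64 bit access stays fast; a misaligned 32 bit one does not.
static_assert(newLogic({64, 1, 4, true}).Fast, "64 bit stays fast");
static_assert(!newLogic({32, 1, 4, true}).Fast, "32 bit no longer fast");
```

In this model a 64 bit access keeps reporting "fast" under unaligned access mode, now through an explicit early return, while a misaligned 32 bit access is still allowed but no longer reported as fast.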

llvm/test/CodeGen/AMDGPU/ds-alignment.ll

[272 lines not shown]

 ; ALIGNED-GISEL-NEXT: ds_write_b8 v3, v1 offset:7
 ; ALIGNED-GISEL-NEXT: s_endpgm
 ;
 ; UNALIGNED-LABEL: ds8align1:
 ; UNALIGNED: ; %bb.0:
 ; UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
 ; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
 ; UNALIGNED-NEXT: v_mov_b32_e32 v0, s0
-; UNALIGNED-NEXT: ds_read2_b32 v[0:1], v0 offset1:1
+; UNALIGNED-NEXT: ds_read_b64 v[0:1], v0
 ; UNALIGNED-NEXT: v_mov_b32_e32 v2, s1
 ; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
-; UNALIGNED-NEXT: ds_write2_b32 v2, v0, v1 offset1:1
+; UNALIGNED-NEXT: ds_write_b64 v2, v[0:1]
 ; UNALIGNED-NEXT: s_endpgm
   %val = load <2 x i32>, <2 x i32> addrspace(3)* %in, align 1
   store <2 x i32> %val, <2 x i32> addrspace(3)* %out, align 1
   ret void
 }

 define amdgpu_kernel void @ds8align2(<2 x i32> addrspace(3)* %in, <2 x i32> addrspace(3)* %out) {
 ; ALIGNED-SDAG-LABEL: ds8align2:

[36 lines not shown]

 ; ALIGNED-GISEL-NEXT: ds_write_b16_d16_hi v4, v0 offset:6
 ; ALIGNED-GISEL-NEXT: s_endpgm
 ;
 ; UNALIGNED-LABEL: ds8align2:
 ; UNALIGNED: ; %bb.0:
 ; UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
 ; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
 ; UNALIGNED-NEXT: v_mov_b32_e32 v0, s0
-; UNALIGNED-NEXT: ds_read2_b32 v[0:1], v0 offset1:1
+; UNALIGNED-NEXT: ds_read_b64 v[0:1], v0
 ; UNALIGNED-NEXT: v_mov_b32_e32 v2, s1
 ; UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
-; UNALIGNED-NEXT: ds_write2_b32 v2, v0, v1 offset1:1
+; UNALIGNED-NEXT: ds_write_b64 v2, v[0:1]
 ; UNALIGNED-NEXT: s_endpgm
   %val = load <2 x i32>, <2 x i32> addrspace(3)* %in, align 2
   store <2 x i32> %val, <2 x i32> addrspace(3)* %out, align 2
   ret void
 }

 define amdgpu_kernel void @ds8align4(<2 x i32> addrspace(3)* %in, <2 x i32> addrspace(3)* %out) {
 ; GCN-LABEL: ds8align4:

[613 lines not shown]

llvm/test/CodeGen/AMDGPU/ds_read2.ll

[685 lines not shown]

 ; GFX9-ALIGNED-NEXT: s_endpgm
 ;
 ; GFX9-UNALIGNED-LABEL: unaligned_offset_read2_f32:
 ; GFX9-UNALIGNED: ; %bb.0:
 ; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8
 ; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0
 ; GFX9-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
 ; GFX9-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
-; GFX9-UNALIGNED-NEXT: v_add3_u32 v0, s2, v2, 5
-; GFX9-UNALIGNED-NEXT: ds_read2_b32 v[0:1], v0 offset1:1
+; GFX9-UNALIGNED-NEXT: v_add_u32_e32 v0, s2, v2
+; GFX9-UNALIGNED-NEXT: ds_read_b64 v[0:1], v0 offset:5
 ; GFX9-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-UNALIGNED-NEXT: v_add_f32_e32 v0, v0, v1
 ; GFX9-UNALIGNED-NEXT: global_store_dword v2, v0, s[0:1]
 ; GFX9-UNALIGNED-NEXT: s_endpgm
   %x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1
   %base = getelementptr inbounds float, float addrspace(3)* %lds, i32 %x.i
   %base.i8 = bitcast float addrspace(3)* %base to i8 addrspace(3)*
   %addr0.i8 = getelementptr inbounds i8, i8 addrspace(3)* %base.i8, i32 5

[821 lines not shown]

 ; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v3, 8, v6
 ; GFX9-ALIGNED-NEXT: v_or_b32_sdwa v3, v3, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
 ; GFX9-ALIGNED-NEXT: v_or_b32_e32 v0, v3, v0
 ; GFX9-ALIGNED-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX9-ALIGNED-NEXT: s_endpgm
 ;
 ; GFX9-UNALIGNED-LABEL: read2_v2i32_align1_odd_offset:
 ; GFX9-UNALIGNED: ; %bb.0: ; %entry
-; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v0, 0x41
-; GFX9-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
-; GFX9-UNALIGNED-NEXT: ds_read2_b32 v[0:1], v0 offset1:1
 ; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v2, 0
+; GFX9-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
+; GFX9-UNALIGNED-NEXT: ds_read_b64 v[0:1], v2 offset:65
 ; GFX9-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-UNALIGNED-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX9-UNALIGNED-NEXT: s_endpgm
 entry:
   %load = load <2 x i32>, <2 x i32> addrspace(3)* bitcast (i8 addrspace(3)* getelementptr (i8, i8 addrspace(3)* bitcast ([100 x <2 x i32>] addrspace(3)* @v2i32_align1 to i8 addrspace(3)*), i32 65) to <2 x i32> addrspace(3)*), align 1
   store <2 x i32> %load, <2 x i32> addrspace(1)* %out
   ret void
 }

[16 lines not shown]

llvm/test/CodeGen/AMDGPU/ds_write2.ll

[702 lines not shown]

 ; GFX9-UNALIGNED-LABEL: unaligned_offset_simple_write2_one_val_f64:
 ; GFX9-UNALIGNED: ; %bb.0:
 ; GFX9-UNALIGNED-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x8
 ; GFX9-UNALIGNED-NEXT: s_load_dword s4, s[0:1], 0x10
 ; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 3, v0
 ; GFX9-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-UNALIGNED-NEXT: global_load_dwordx2 v[0:1], v2, s[2:3]
 ; GFX9-UNALIGNED-NEXT: v_add_u32_e32 v2, s4, v2
-; GFX9-UNALIGNED-NEXT: v_add_u32_e32 v3, 5, v2
-; GFX9-UNALIGNED-NEXT: v_add_u32_e32 v2, 9, v2
 ; GFX9-UNALIGNED-NEXT: s_waitcnt vmcnt(0)
-; GFX9-UNALIGNED-NEXT: ds_write2_b32 v3, v0, v1 offset1:1
-; GFX9-UNALIGNED-NEXT: ds_write2_b32 v2, v0, v1 offset1:1
+; GFX9-UNALIGNED-NEXT: ds_write_b64 v2, v[0:1] offset:5
+; GFX9-UNALIGNED-NEXT: ds_write_b64 v2, v[0:1] offset:9
 ; GFX9-UNALIGNED-NEXT: s_endpgm
   %x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1
   %in.gep = getelementptr double, double addrspace(1)* %in, i32 %x.i
   %val = load double, double addrspace(1)* %in.gep, align 8
   %base = getelementptr inbounds double, double addrspace(3)* %lds, i32 %x.i
   %base.i8 = bitcast double addrspace(3)* %base to i8 addrspace(3)*
   %addr0.i8 = getelementptr inbounds i8, i8 addrspace(3)* %base.i8, i32 5
   %addr0 = bitcast i8 addrspace(3)* %addr0.i8 to double addrspace(3)*

[314 lines not shown]

 ; GFX9-ALIGNED-NEXT: ds_write_b8 v1, v1 offset:67
 ; GFX9-ALIGNED-NEXT: ds_write_b8 v1, v1 offset:66
 ; GFX9-ALIGNED-NEXT: ds_write_b8 v1, v1 offset:72
 ; GFX9-ALIGNED-NEXT: ds_write_b8 v1, v1 offset:71
 ; GFX9-ALIGNED-NEXT: s_endpgm
 ;
 ; GFX9-UNALIGNED-LABEL: write2_v2i32_align1_odd_offset:
 ; GFX9-UNALIGNED: ; %bb.0: ; %entry
-; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v0, 0x41
-; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v1, 0x7b
-; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v2, 0x1c8
-; GFX9-UNALIGNED-NEXT: ds_write2_b32 v0, v1, v2 offset1:1
+; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v0, 0x7b
+; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v1, 0x1c8
+; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v2, 0
+; GFX9-UNALIGNED-NEXT: ds_write_b64 v2, v[0:1] offset:65
 ; GFX9-UNALIGNED-NEXT: s_endpgm
 entry:
   store <2 x i32> <i32 123, i32 456>, <2 x i32> addrspace(3)* bitcast (i8 addrspace(3)* getelementptr (i8, i8 addrspace(3)* bitcast ([100 x <2 x i32>] addrspace(3)* @v2i32_align1 to i8 addrspace(3)*), i32 65) to <2 x i32> addrspace(3)*), align 1
   ret void
 }

 declare i32 @llvm.amdgcn.workgroup.id.x() #1
 declare i32 @llvm.amdgcn.workgroup.id.y() #1
 declare i32 @llvm.amdgcn.workitem.id.x() #1
 declare i32 @llvm.amdgcn.workitem.id.y() #1

 attributes #0 = { nounwind }
 attributes #1 = { nounwind readnone speculatable }
 attributes #2 = { convergent nounwind }