For performance reasons, ds_read_*/ds_write_* operations require
strict alignment. Avoid selecting them in under-aligned situations,
irrespective of whether "unaligned access mode" is enabled or not.
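For illustration, here is a minimal C++ sketch of that policy. The function name, parameters, and structure below are assumptions made for this sketch; the real change is to SITargetLowering::allowsMisalignedMemoryAccessesImpl() in SIISelLowering.cpp, which has a different signature. Whether the check belongs here at all is debated in the comments below.

```cpp
#include <cstdio>

namespace {

constexpr unsigned LocalAddrSpace = 3; // AMDGPU LDS address space.

// Hypothetical stand-in for the real target hook. It models the policy
// described in the summary: treat an LDS (ds_read_*/ds_write_*) access as
// unsupported unless it is naturally aligned, regardless of whether the
// "unaligned access mode" is enabled.
bool allowsMisalignedLDSAccess(unsigned SizeInBits, unsigned AddrSpace,
                               unsigned AlignInBytes,
                               bool UnalignedAccessModeEnabled,
                               bool *IsFast) {
  if (IsFast)
    *IsFast = false;

  if (AddrSpace != LocalAddrSpace)
    return true; // Other address spaces are not modeled in this sketch.

  // Strict-alignment policy: require natural alignment for DS accesses
  // even when the unaligned access mode is on.
  (void)UnalignedAccessModeEnabled;
  const unsigned NaturalAlignInBytes = SizeInBits / 8;
  if (AlignInBytes >= NaturalAlignInBytes) {
    if (IsFast)
      *IsFast = true;
    return true;
  }
  return false; // Make the legalizer split into narrower accesses.
}

} // namespace

int main() {
  bool Fast = false;
  bool Legal = allowsMisalignedLDSAccess(128, LocalAddrSpace, 4,
                                         /*UnalignedAccessModeEnabled=*/true,
                                         &Fast);
  std::printf("b128 at align 4: legal=%d fast=%d\n", Legal, Fast);
  return 0;
}
```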
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1425–1428
I don't see the reason for this change. Everything the old comment said is still true.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1425–1428
We have updated the .td file to be strictly aligned irrespective of whether "unaligned access mode" is enabled or not. But here we are relaxing it; without the change above, both GlobalISel and SDAG ISel break down in certain cases. Please try the test "lds-misaligned-bug.ll": both GlobalISel and SDAG ISel assert for this test with only the .td file changes and without the change above.
It is actually hard to see the impact without the tests being updated, at least some specific tests.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
I don't like changing this function to lie and say it doesn't work when it does.
After addressing the review comments from Jay, the tests below assert. This is yet to be investigated.
LLVM :: CodeGen/AMDGPU/GlobalISel/lds-misaligned-bug.ll
LLVM :: CodeGen/AMDGPU/GlobalISel/load-unaligned.ll
LLVM :: CodeGen/AMDGPU/lds-misaligned-bug.ll
Apart from the above, the two tests below needed updates, which have been taken care of:
CodeGen/AMDGPU/ds_read2.ll
CodeGen/AMDGPU/ds_write2.ll
Do we have tests where pointers are actually 16-byte aligned?
llvm/test/CodeGen/AMDGPU/ds_read2.ll:1076
I would probably expect ds_read2_b32 for align 4.
Updated the fix: the function allowsMisalignedMemoryAccessesImpl() now
checks for the strict alignment requirement for ds_read/ds_write operations.
Please check the lit test store-local.128.ll, which is now updated within this patch.
llvm/test/CodeGen/AMDGPU/ds_read2.ll:1076
The fix is further updated to take care of this: allowsMisalignedMemoryAccessesImpl() now checks for the strict alignment requirement for ds_read/ds_write operations.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
My understanding is that the function allowsMisalignedMemoryAccessesImpl() should check for the strict alignment requirement for ds_read/ds_write operations. I have now updated this function accordingly.
@foad, can we run it through the gfx perf suite? With the exception of one ds_write_b64 store it seems reasonable to me. Of course these splits to b8 would be unfortunate, but I assume that should not happen too often in real life, and the data is usually better aligned.
llvm/test/CodeGen/AMDGPU/ds_write2.ll:849
It seems it misses the same split to b32.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
You are making this stricter than the instructions really allow.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
That is more or less the point. I think there is some balance here between the perf drop when a wide access is used for misaligned data and the perf loss when a narrow access is used on data for which we cannot prove alignment but which just happens to be aligned. This seems to warrant perf testing on this patch. @cfang, does this solve the perf regression you were working on?
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
The function reports whether it's legal and works. IsFast is separate and can change, but it should still report this as a legal unaligned access.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
Verified with SHOC SGEMM, this patch could resolve the regression.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1426
Ugh! Do you know if we can use ds_read2_b64 for align 4? The main question here is whether we need to split b128 only, or everything misaligned.
llvm/test/CodeGen/AMDGPU/ds_read2.ll:603
I did, but then we got information from @cfang about a case where dword-aligned loads were causing regressions, although according to the latest comment those were still b128 reads. In essence the problem is that we do not always know the alignment, and the data may be better aligned than declared. In that case a wider load will work better than a narrower one; i.e., when unaligned access is off it becomes a probability question. Note that the chances of hitting the unaligned 128-bit case are higher than for the unaligned 64-bit case. That is why I have requested measurements. We know for sure that b128 is mostly a bad choice in an under-aligned case. I suspect the b8 and b16 splits are overkill, but I am not really sure about splitting 64-bit into 32-bit.
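To make the trade-off concrete, here is a small hypothetical sketch (the helper name and the mapping are assumptions for illustration, not taken from the patch) of picking the widest DS read whose components are provably aligned for a 128-bit LDS load. Whether the hardware would also tolerate a wider, under-aligned access depends on the subtarget and on the unaligned access mode, which is exactly the probability question above.

```cpp
#include <cstdio>

// Hypothetical helper: map the alignment we can *prove* to the widest
// DS read whose components are naturally aligned, for a 128-bit load.
static const char *pickDSReadForB128(unsigned ProvenAlignInBytes) {
  if (ProvenAlignInBytes >= 16)
    return "ds_read_b128";
  if (ProvenAlignInBytes >= 8)
    return "ds_read2_b64";
  if (ProvenAlignInBytes >= 4)
    return "2 x ds_read2_b32";
  if (ProvenAlignInBytes >= 2)
    return "8 x ds_read_u16";
  return "16 x ds_read_u8";
}

int main() {
  for (unsigned Align : {16u, 8u, 4u, 2u, 1u})
    std::printf("proven align %2u -> %s\n", Align, pickDSReadForB128(Align));
  return 0;
}
```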
llvm/lib/Target/AMDGPU/SIISelLowering.cpp:1407–1408
They do not *require* the alignment for performance reasons. This should report whether it works, and IsFast for whether we want to prefer it.
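A minimal sketch of the contract this comment describes, with an assumed simplified signature (not the real allowsMisalignedMemoryAccessesImpl() signature): legality reflects what the hardware can execute, and IsFast is only a preference signal.

```cpp
// Illustrative sketch only: report the access as legal whenever the
// hardware supports it; use *IsFast to steer the legalizer towards
// narrower accesses without declaring the wide access illegal.
bool allowsMisalignedLDSAccessLegality(unsigned SizeInBits,
                                       unsigned AlignInBytes,
                                       bool HasUnalignedDSAccess,
                                       bool *IsFast) {
  const unsigned NaturalAlignInBytes = SizeInBits / 8;
  const bool NaturallyAligned = AlignInBytes >= NaturalAlignInBytes;

  if (IsFast)
    *IsFast = NaturallyAligned; // Under-aligned DS access works, just slower.

  // Do not claim "doesn't work" merely to steer instruction selection away.
  return NaturallyAligned || HasUnalignedDSAccess;
}
```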
Vulkan graphics rarely uses DS instructions. Vulkan compute uses them, but we don't have any standard set of benchmarks, and in any case the workloads would be very similar to OpenCL compute. So I don't think we need to do any Vulkan-specific benchmarking.
I am abandoning this patch since an alternative patch, https://reviews.llvm.org/D100008, has landed upstream. The next step is to do some useful performance-related experimentation on unaligned accesses with ds instructions before coming to a common conclusion on a more generic fix.