This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Refactor LDS alignment checks.
Closed · Public

Authored by rampitec on Apr 7 2022, 3:50 PM.


Details

Reviewers

arsenm
foad

Commits

rGb8e09f15539a: [AMDGPU] Refactor LDS alignment checks.

Summary

Move the feature and bug checks into a single place,
allowsMisalignedMemoryAccessesImpl.

This is mostly NFCI, except for the order of selection in a couple of places.
A separate change may be needed to stop lying about Fast.
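
As a rough illustration of what the consolidated LDS path decides after this change, the following is a self-contained model of the new LOCAL/REGION branch from the diff. `ToySubtarget`, `ToyResult`, `powerOf2Ceil`, and `allowsLDSAccess` are hypothetical names introduced here; the struct fields stand in for the corresponding `Subtarget` queries, and alignment is a plain byte count rather than `llvm::Align`. It is a sketch of the logic, not the code that landed.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the Subtarget queries used in the diff.
struct ToySubtarget {
  bool UnalignedDSAccessEnabled = false; // hasUnalignedDSAccessEnabled()
  bool LDSMisalignedBug = false;         // hasLDSMisalignedBug()
  bool UsableDSOffset = true;            // hasUsableDSOffset()
  bool DS96AndDS128 = true;              // hasDS96AndDS128()
  bool UseDS128 = true;                  // useDS128()
};

struct ToyResult {
  bool Allowed;
  bool Fast;
};

// Stand-in for llvm::PowerOf2Ceil: smallest power of two >= V (0 for 0).
inline uint64_t powerOf2Ceil(uint64_t V) {
  if (V == 0)
    return 0;
  uint64_t Result = 1;
  while (Result < V)
    Result <<= 1;
  return Result;
}

inline ToyResult allowsLDSAccess(const ToySubtarget &ST, unsigned SizeInBits,
                                 uint64_t AlignBytes) {
  // Without unaligned DS access, anything below dword alignment is rejected.
  if (!ST.UnalignedDSAccessEnabled && AlignBytes < 4)
    return {false, false};

  uint64_t Required = powerOf2Ceil(SizeInBits / 8); // Natural alignment.
  if (ST.LDSMisalignedBug && SizeInBits > 32 && AlignBytes < Required)
    return {false, false};

  switch (SizeInBits) {
  case 64:
    if (!ST.UsableDSOffset && AlignBytes < 8)
      return {false, false};
    Required = 4; // ds_read2/write2_b32 with adjacent offsets.
    break;
  case 96:
    if (!ST.DS96AndDS128)
      return {false, false};
    Required = 16;
    break;
  case 128:
    if (!ST.DS96AndDS128 || !ST.UseDS128)
      return {false, false};
    Required = 8; // ds_read2/write2_b64.
    break;
  default:
    if (SizeInBits > 32)
      return {false, false};
    break;
  }

  // Mirrors the FIXME in the diff: Fast is reported as true whenever
  // unaligned access mode is enabled, even for misaligned accesses.
  bool Ok = AlignBytes >= Required || ST.UnalignedDSAccessEnabled;
  return {Ok, Ok};
}
```

With the default toy settings, this model accepts a 4-byte-aligned 64-bit access and rejects an 8-byte-aligned 96-bit one, which follows from the required alignments chosen in the switch.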

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Apr 7 2022, 3:50 PM

Herald added a project: Restricted Project. Apr 7 2022, 3:50 PM

Herald added subscribers: hsmhsm, kerbowa, hiraditya and 7 others.

rampitec requested review of this revision. Apr 7 2022, 3:50 PM

Herald added a subscriber: wdng.

I am not sure we really want to tell the truth about 'Fast' here. If we report that a DS read misaligned by 1 byte is slow, the vectorizer will not combine two of them, and we will get two separate ds_read_b32 instead of a single ds_read2_b32. It is slow, but ds_read2_b32 is still faster than two separate, equally misaligned instructions. This is what happens in that case: https://reviews.llvm.org/differential/diff/421361/

Harbormaster completed remote builds in B158586: Diff 421357. Apr 7 2022, 4:41 PM

rampitec added a child revision: D123524: [AMDGPU] Split unaligned 3 DWORD DS operations. Apr 11 2022, 10:48 AM

Fixed comment indentation.

Harbormaster completed remote builds in B159108: Diff 422051. Apr 11 2022, 4:41 PM

arsenm accepted this revision. Apr 11 2022, 5:43 PM

This revision is now accepted and ready to land. Apr 11 2022, 5:43 PM

This revision was landed with ongoing or failed builds. Apr 12 2022, 7:49 AM

Closed by commit rGb8e09f15539a: [AMDGPU] Refactor LDS alignment checks. (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGb8e09f15539a: [AMDGPU] Refactor LDS alignment checks.

In D123343#3437583, @rampitec wrote:

I am not sure we really want to tell the truth about 'Fast' here. If we report that a DS read misaligned by 1 byte is slow, the vectorizer will not combine two of them, and we will get two separate ds_read_b32 instead of a single ds_read2_b32. It is slow, but ds_read2_b32 is still faster than two separate, equally misaligned instructions. This is what happens in that case: https://reviews.llvm.org/differential/diff/421361/

Maybe LoadStoreVectorizer should be changed so that it can create slow instructions if the instructions being combined were already slow.
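
One way to read this suggestion is as a change to the combine condition. The sketch below is a hypothetical policy function, not the LoadStoreVectorizer's actual code; `shouldCombine` and its parameters are names made up for illustration. It assumes the current behavior described in the thread (combine only when the wide access is reported as fast) and adds the proposed exception.

```cpp
// Hypothetical policy, not taken from LoadStoreVectorizer.
// NarrowIsFast: the target's answer for the accesses being combined.
// WideIsFast:   the target's answer for the combined access.
constexpr bool shouldCombine(bool NarrowIsFast, bool WideIsFast) {
  // Behavior described in the thread: combine when the result is fast.
  // Suggested addition: also combine when the inputs were already slow,
  // since the combined access cannot make an already slow pair "slow".
  return WideIsFast || !NarrowIsFast;
}
```

Under this policy two misaligned ds_read_b32 could still become a ds_read2_b32 even if the target reported both forms as slow, while a fast narrow access would never be widened into a slow one.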

foad added inline comments. Apr 13 2022, 6:56 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1555	You don't need this - it is already handled by using PowerOf2Ceil to initialize RequiredAlignment.

In D123343#3448319, @foad wrote:

In D123343#3437583, @rampitec wrote:

I am not sure we really want to tell the truth about 'Fast' here. If we report that a DS read misaligned by 1 byte is slow, the vectorizer will not combine two of them, and we will get two separate ds_read_b32 instead of a single ds_read2_b32. It is slow, but ds_read2_b32 is still faster than two separate, equally misaligned instructions. This is what happens in that case: https://reviews.llvm.org/differential/diff/421361/

Maybe LoadStoreVectorizer should be changed so that it can create slow instructions if the instructions being combined were already slow.

I have a workaround for this in D123634, but in general I do not think 'fast' or 'slow' is the right measure. Something is not fast or slow in itself, only faster or slower than something else. This would be a big change though.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1555	Indeed.

rampitec marked an inline comment as done. Apr 13 2022, 11:20 AM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1555	D123699
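
The arithmetic behind this inline comment can be checked in isolation. The sketch below uses `powerOf2Ceil` as a local stand-in for `llvm::PowerOf2Ceil` (assumed here to return the smallest power of two not less than its argument, and 0 for 0); `naturalAlignment` is a hypothetical helper mirroring the `PowerOf2Ceil(Size/8)` initializer in the diff.

```cpp
#include <cstdint>

// Local stand-in for llvm::PowerOf2Ceil.
constexpr uint64_t powerOf2Ceil(uint64_t Value) {
  if (Value == 0)
    return 0;
  uint64_t Result = 1;
  while (Result < Value)
    Result <<= 1;
  return Result;
}

// Mirrors "Align RequiredAlignment(PowerOf2Ceil(Size/8))" with Size in bits.
constexpr uint64_t naturalAlignment(unsigned SizeInBits) {
  return powerOf2Ceil(SizeInBits / 8);
}

// 96 bits is 12 bytes, which rounds up to 16, so assigning Align(16) again
// in the 96-bit case repeats the initial value.
static_assert(naturalAlignment(96) == 16, "96-bit access rounds up to 16");

// The 64-bit and 128-bit cases start at 8 and 16, so their assignments of
// Align(4) and Align(8) do change the value.
static_assert(naturalAlignment(64) == 8, "64-bit access starts at 8");
static_assert(naturalAlignment(128) == 16, "128-bit access starts at 16");
```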

In D123343#3448518, @rampitec wrote:

I have a workaround for this in D123634, but in general I do not think 'fast' or 'slow' is the right measure. Something is not fast or slow in itself, only faster or slower than something else. This would be a big change though.

Speaking of fast and slow, this might be modeled with an unsigned speed rank. The direct translation of the current bool IsFast would be 0 and 1. A target could then use more ranks: the higher the number, the faster the access.
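
A minimal sketch of what such a speed rank could look like follows. Everything in it is hypothetical: `SpeedRank`, `rankFromIsFast`, `wideIsNoSlower`, and the `ToyDSRank` values are names and numbers invented for illustration, not an existing LLVM interface and not measured data.

```cpp
// Hypothetical rank scale: a larger value means a faster access.
using SpeedRank = unsigned;

// Direct translation of the current bool IsFast, as described above.
constexpr SpeedRank rankFromIsFast(bool IsFast) { return IsFast ? 1u : 0u; }

// Hypothetical consumer-side check: widening is acceptable when the wide
// access ranks at least as high as the narrow accesses it replaces.
constexpr bool wideIsNoSlower(SpeedRank Narrow, SpeedRank Wide) {
  return Wide >= Narrow;
}

// Made-up ranks a target might report once more than two levels exist.
enum ToyDSRank : SpeedRank {
  MisalignedB32 = 0,      // two separate misaligned ds_read_b32
  MisalignedRead2B32 = 1, // one misaligned ds_read2_b32
  Aligned = 2             // naturally aligned access
};
```

With relative ranks, a consumer can express "faster than what it replaces" directly, which a single fast/slow bit cannot encode.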

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	119 lines
llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll	51 lines
llvm/test/CodeGen/AMDGPU/store-local.128.ll	10 lines

Diff 422232

llvm/lib/Target/AMDGPU/SIISelLowering.cpp


(1,510 lines not shown)
bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(		bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(
unsigned Size, unsigned AddrSpace, Align Alignment,		unsigned Size, unsigned AddrSpace, Align Alignment,
MachineMemOperand::Flags Flags, bool *IsFast) const {		MachineMemOperand::Flags Flags, bool *IsFast) const {
if (IsFast)		if (IsFast)
*IsFast = false;		*IsFast = false;

if (AddrSpace == AMDGPUAS::LOCAL_ADDRESS ||		if (AddrSpace == AMDGPUAS::LOCAL_ADDRESS ||
AddrSpace == AMDGPUAS::REGION_ADDRESS) {		AddrSpace == AMDGPUAS::REGION_ADDRESS) {
		// Check if alignment requirements for ds_read/write instructions are
		// disabled.
		if (!Subtarget->hasUnalignedDSAccessEnabled() && Alignment < Align(4))
		return false;

Align RequiredAlignment(PowerOf2Ceil(Size/8)); // Natural alignment.		Align RequiredAlignment(PowerOf2Ceil(Size/8)); // Natural alignment.
if (Subtarget->hasLDSMisalignedBug() && Size > 32 &&		if (Subtarget->hasLDSMisalignedBug() && Size > 32 &&
Alignment < RequiredAlignment)		Alignment < RequiredAlignment)
return false;		return false;

// Check if alignment requirements for ds_read/write instructions are
// disabled.
if (Subtarget->hasUnalignedDSAccessEnabled()) {
if (IsFast)
*IsFast = Alignment != Align(2);
return true;
}

// Either, the alignment requirements are "enabled", or there is an		// Either, the alignment requirements are "enabled", or there is an
// unaligned LDS access related hardware bug though alignment requirements		// unaligned LDS access related hardware bug though alignment requirements
// are "disabled". In either case, we need to check for proper alignment		// are "disabled". In either case, we need to check for proper alignment
// requirements.		// requirements.
//		//
if (Size == 64) {		switch (Size) {
		case 64:
// SI has a hardware bug in the LDS / GDS bounds checking: if the base		// SI has a hardware bug in the LDS / GDS bounds checking: if the base
// address is negative, then the instruction is incorrectly treated as		// address is negative, then the instruction is incorrectly treated as
// out-of-bounds even if base + offsets is in bounds. Split vectorized		// out-of-bounds even if base + offsets is in bounds. Split vectorized
// loads here to avoid emitting ds_read2_b32. We may re-combine the		// loads here to avoid emitting ds_read2_b32. We may re-combine the
// load later in the SILoadStoreOptimizer.		// load later in the SILoadStoreOptimizer.
if (!Subtarget->hasUsableDSOffset() && Alignment < Align(8))		if (!Subtarget->hasUsableDSOffset() && Alignment < Align(8))
return false;		return false;

// 8 byte accessing via ds_read/write_b64 require 8-byte alignment, but we		// 8 byte accessing via ds_read/write_b64 require 8-byte alignment, but we
// can do a 4 byte aligned, 8 byte access in a single operation using		// can do a 4 byte aligned, 8 byte access in a single operation using
// ds_read2/write2_b32 with adjacent offsets.		// ds_read2/write2_b32 with adjacent offsets.
bool AlignedBy4 = Alignment >= Align(4);		RequiredAlignment = Align(4);
if (IsFast)		break;
*IsFast = AlignedBy4;		case 96:
		if (!Subtarget->hasDS96AndDS128())
		return false;

return AlignedBy4;
}
if (Size == 96) {
// 12 byte accessing via ds_read/write_b96 require 16-byte alignment on		// 12 byte accessing via ds_read/write_b96 require 16-byte alignment on
// gfx8 and older.		// gfx8 and older.
bool AlignedBy16 = Alignment >= Align(16);		RequiredAlignment = Align(16);
		foad: You don't need this - it is already handled by using PowerOf2Ceil to initialize RequiredAlignment.
		rampitec: Indeed.
		rampitec: D123699
if (IsFast)		break;
*IsFast = AlignedBy16;		case 128:
		if (!Subtarget->hasDS96AndDS128() || !Subtarget->useDS128())
		return false;

return AlignedBy16;
}
if (Size == 128) {
// 16 byte accessing via ds_read/write_b128 require 16-byte alignment on		// 16 byte accessing via ds_read/write_b128 require 16-byte alignment on
// gfx8 and older, but we can do a 8 byte aligned, 16 byte access in a		// gfx8 and older, but we can do a 8 byte aligned, 16 byte access in a
// single operation using ds_read2/write2_b64.		// single operation using ds_read2/write2_b64.
bool AlignedBy8 = Alignment >= Align(8);		RequiredAlignment = Align(8);
if (IsFast)		break;
*IsFast = AlignedBy8;		default:
		if (Size > 32)
		return false;

		break;
		}

return AlignedBy8;		if (IsFast) {
		// FIXME: Lie it is fast if +unaligned-access-mode is passed so that
		// DS accesses get vectorized.
		*IsFast = Alignment >= RequiredAlignment ||
		Subtarget->hasUnalignedDSAccessEnabled();
}		}

		return Alignment >= RequiredAlignment ||
		Subtarget->hasUnalignedDSAccessEnabled();
}		}

if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {		if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
bool AlignedBy4 = Alignment >= Align(4);		bool AlignedBy4 = Alignment >= Align(4);
if (IsFast)		if (IsFast)
*IsFast = AlignedBy4;		*IsFast = AlignedBy4;

return AlignedBy4 ||		return AlignedBy4 ||
Subtarget->enableFlatScratch() ||		Subtarget->enableFlatScratch() ||
Subtarget->hasUnalignedScratchAccess();		Subtarget->hasUnalignedScratchAccess();
}		}

// FIXME: We have to be conservative here and assume that flat operations		// FIXME: We have to be conservative here and assume that flat operations
// will access scratch. If we had access to the IR function, then we		// will access scratch. If we had access to the IR function, then we
// could determine if any private memory was used in the function.		// could determine if any private memory was used in the function.
if (AddrSpace == AMDGPUAS::FLAT_ADDRESS &&		if (AddrSpace == AMDGPUAS::FLAT_ADDRESS &&
!Subtarget->hasUnalignedScratchAccess()) {		!Subtarget->hasUnalignedScratchAccess()) {
bool AlignedBy4 = Alignment >= Align(4);		bool AlignedBy4 = Alignment >= Align(4);
if (IsFast)		if (IsFast)
*IsFast = AlignedBy4;		*IsFast = AlignedBy4;

return AlignedBy4;		return AlignedBy4;
}		}

if (Subtarget->hasUnalignedBufferAccessEnabled() &&		if (Subtarget->hasUnalignedBufferAccessEnabled()) {
!(AddrSpace == AMDGPUAS::LOCAL_ADDRESS ||
AddrSpace == AMDGPUAS::REGION_ADDRESS)) {
// If we have a uniform constant load, it still requires using a slow		// If we have a uniform constant load, it still requires using a slow
// buffer instruction if unaligned.		// buffer instruction if unaligned.
if (IsFast) {		if (IsFast) {
// Accesses can really be issued as 1-byte aligned or 4-byte aligned, so		// Accesses can really be issued as 1-byte aligned or 4-byte aligned, so
// 2-byte alignment is worse than 1 unless doing a 2-byte access.		// 2-byte alignment is worse than 1 unless doing a 2-byte access.
*IsFast = (AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||		*IsFast = (AddrSpace == AMDGPUAS::CONSTANT_ADDRESS ||
AddrSpace == AMDGPUAS::CONSTANT_ADDRESS_32BIT) ?		AddrSpace == AMDGPUAS::CONSTANT_ADDRESS_32BIT) ?
Alignment >= Align(4) : Alignment != Align(2);		Alignment >= Align(4) : Alignment != Align(2);
(6,905 lines not shown; enclosing context: case 16:)
if (NumElements == 3 && !Subtarget->hasDwordx3LoadStores())		if (NumElements == 3 && !Subtarget->hasDwordx3LoadStores())
return WidenOrSplitVectorLoad(Op, DAG);		return WidenOrSplitVectorLoad(Op, DAG);

return SDValue();		return SDValue();
default:		default:
llvm_unreachable("unsupported private_element_size");		llvm_unreachable("unsupported private_element_size");
}		}
} else if (AS == AMDGPUAS::LOCAL_ADDRESS || AS == AMDGPUAS::REGION_ADDRESS) {		} else if (AS == AMDGPUAS::LOCAL_ADDRESS || AS == AMDGPUAS::REGION_ADDRESS) {
// Use ds_read_b128 or ds_read_b96 when possible.		bool Fast = false;
if (Subtarget->hasDS96AndDS128() &&		auto Flags = Load->getMemOperand()->getFlags();
((Subtarget->useDS128() && MemVT.getStoreSize() == 16) ||		if (allowsMisalignedMemoryAccessesImpl(MemVT.getSizeInBits(), AS,
MemVT.getStoreSize() == 12) &&		Load->getAlign(), Flags, &Fast) &&
allowsMisalignedMemoryAccessesImpl(MemVT.getSizeInBits(), AS,		Fast)
Load->getAlign()))
return SDValue();		return SDValue();

if (NumElements > 2)		if (MemVT.isVector())
return SplitVectorLoad(Op, DAG);

// SI has a hardware bug in the LDS / GDS bounds checking: if the base
// address is negative, then the instruction is incorrectly treated as
// out-of-bounds even if base + offsets is in bounds. Split vectorized
// loads here to avoid emitting ds_read2_b32. We may re-combine the
// load later in the SILoadStoreOptimizer.
if (Subtarget->getGeneration() == AMDGPUSubtarget::SOUTHERN_ISLANDS &&
NumElements == 2 && MemVT.getStoreSize() == 8 &&
Load->getAlignment() < 8) {
return SplitVectorLoad(Op, DAG);		return SplitVectorLoad(Op, DAG);
}		}
}

if (!allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),		if (!allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
MemVT, *Load->getMemOperand())) {		MemVT, *Load->getMemOperand())) {
SDValue Ops[2];		SDValue Ops[2];
std::tie(Ops[0], Ops[1]) = expandUnalignedLoad(Load, DAG);		std::tie(Ops[0], Ops[1]) = expandUnalignedLoad(Load, DAG);
return DAG.getMergeValues(Ops, DL);		return DAG.getMergeValues(Ops, DL);
}		}

(473 lines not shown; enclosing context: case 16:)
if (NumElements > 4 ||		if (NumElements > 4 ||
(NumElements == 3 && !Subtarget->enableFlatScratch()))		(NumElements == 3 && !Subtarget->enableFlatScratch()))
return SplitVectorStore(Op, DAG);		return SplitVectorStore(Op, DAG);
return SDValue();		return SDValue();
default:		default:
llvm_unreachable("unsupported private_element_size");		llvm_unreachable("unsupported private_element_size");
}		}
} else if (AS == AMDGPUAS::LOCAL_ADDRESS || AS == AMDGPUAS::REGION_ADDRESS) {		} else if (AS == AMDGPUAS::LOCAL_ADDRESS || AS == AMDGPUAS::REGION_ADDRESS) {
// Use ds_write_b128 or ds_write_b96 when possible.		bool Fast = false;
if (Subtarget->hasDS96AndDS128() &&		auto Flags = Store->getMemOperand()->getFlags();
((Subtarget->useDS128() && VT.getStoreSize() == 16) ||		if (allowsMisalignedMemoryAccessesImpl(VT.getSizeInBits(), AS,
(VT.getStoreSize() == 12)) &&		Store->getAlign(), Flags, &Fast) &&
allowsMisalignedMemoryAccessesImpl(VT.getSizeInBits(), AS,		Fast)
Store->getAlign()))
return SDValue();		return SDValue();

if (NumElements > 2)
return SplitVectorStore(Op, DAG);

// SI has a hardware bug in the LDS / GDS bounds checking: if the base
// address is negative, then the instruction is incorrectly treated as
// out-of-bounds even if base + offsets is in bounds. Split vectorized
// stores here to avoid emitting ds_write2_b32. We may re-combine the
// store later in the SILoadStoreOptimizer.
if (!Subtarget->hasUsableDSOffset() &&
NumElements == 2 && VT.getStoreSize() == 8 &&
Store->getAlignment() < 8) {
return SplitVectorStore(Op, DAG);
}

if (!allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
VT, *Store->getMemOperand())) {
if (VT.isVector())		if (VT.isVector())
return SplitVectorStore(Op, DAG);		return SplitVectorStore(Op, DAG);
return expandUnalignedStore(Store, DAG);
}

return SDValue();		return expandUnalignedStore(Store, DAG);
} else {		} else {
llvm_unreachable("unhandled address space");		llvm_unreachable("unhandled address space");
}		}
}		}

SDValue SITargetLowering::LowerTrig(SDValue Op, SelectionDAG &DAG) const {		SDValue SITargetLowering::LowerTrig(SDValue Op, SelectionDAG &DAG) const {
SDLoc DL(Op);		SDLoc DL(Op);
EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
(3,630 lines not shown)

llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll

(60 lines not shown)

	define amdgpu_vs void @test_3(i32 inreg %arg1, i32 inreg %arg2, <4 x i32> inreg %arg3, i32 %arg4, <6 x float> addrspace(3)* %arg5, <6 x float> addrspace(3)* %arg6) {			define amdgpu_vs void @test_3(i32 inreg %arg1, i32 inreg %arg2, <4 x i32> inreg %arg3, i32 %arg4, <6 x float> addrspace(3)* %arg5, <6 x float> addrspace(3)* %arg6) {
	; CHECK-LABEL: test_3:			; CHECK-LABEL: test_3:
	; CHECK: ; %bb.0:			; CHECK: ; %bb.0:
	; CHECK-NEXT: s_mov_b32 s7, s5			; CHECK-NEXT: s_mov_b32 s7, s5
	; CHECK-NEXT: s_mov_b32 s6, s4			; CHECK-NEXT: s_mov_b32 s6, s4
	; CHECK-NEXT: s_mov_b32 s5, s3			; CHECK-NEXT: s_mov_b32 s5, s3
	; CHECK-NEXT: s_mov_b32 s4, s2			; CHECK-NEXT: s_mov_b32 s4, s2
	; CHECK-NEXT: v_add_i32_e32 v0, vcc, 4, v1			; CHECK-NEXT: v_add_i32_e32 v0, vcc, 20, v1
				; CHECK-NEXT: v_add_i32_e32 v3, vcc, 16, v1
				; CHECK-NEXT: v_add_i32_e32 v4, vcc, 12, v1
	; CHECK-NEXT: v_add_i32_e32 v5, vcc, 8, v1			; CHECK-NEXT: v_add_i32_e32 v5, vcc, 8, v1
	; CHECK-NEXT: v_add_i32_e32 v6, vcc, 12, v1			; CHECK-NEXT: v_add_i32_e32 v8, vcc, 4, v1
	; CHECK-NEXT: v_add_i32_e32 v7, vcc, 16, v1
	; CHECK-NEXT: v_add_i32_e32 v8, vcc, 20, v1
	; CHECK-NEXT: v_mov_b32_e32 v9, s0			; CHECK-NEXT: v_mov_b32_e32 v9, s0
	; CHECK-NEXT: v_add_i32_e32 v10, vcc, 4, v2			; CHECK-NEXT: v_add_i32_e32 v10, vcc, 20, v2
	; CHECK-NEXT: v_add_i32_e32 v11, vcc, 8, v2			; CHECK-NEXT: v_add_i32_e32 v11, vcc, 16, v2
	; CHECK-NEXT: v_add_i32_e32 v12, vcc, 12, v2
	; CHECK-NEXT: s_mov_b32 m0, -1			; CHECK-NEXT: s_mov_b32 m0, -1
	; CHECK-NEXT: ds_read_b32 v3, v1			; CHECK-NEXT: ds_read_b32 v7, v3
	; CHECK-NEXT: ds_read_b32 v4, v0			; CHECK-NEXT: ds_read_b32 v6, v4
	; CHECK-NEXT: ds_read_b32 v5, v5			; CHECK-NEXT: ds_read_b32 v5, v5
	; CHECK-NEXT: ds_read_b32 v6, v6			; CHECK-NEXT: ds_read_b32 v4, v8
	; CHECK-NEXT: ds_read_b32 v0, v7			; CHECK-NEXT: ds_read_b32 v8, v0
	; CHECK-NEXT: ds_read_b32 v1, v8			; CHECK-NEXT: ds_read_b32 v3, v1
	; CHECK-NEXT: v_add_i32_e32 v7, vcc, 16, v2			; CHECK-NEXT: v_add_i32_e32 v1, vcc, 12, v2
	; CHECK-NEXT: v_add_i32_e32 v8, vcc, 20, v2			; CHECK-NEXT: v_add_i32_e32 v12, vcc, 8, v2
	; CHECK-NEXT: s_waitcnt lgkmcnt(2)			; CHECK-NEXT: v_add_i32_e32 v13, vcc, 4, v2
	; CHECK-NEXT: tbuffer_store_format_xyzw v[3:6], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:264 glc slc
	; CHECK-NEXT: s_waitcnt lgkmcnt(0)			; CHECK-NEXT: s_waitcnt lgkmcnt(0)
	; CHECK-NEXT: tbuffer_store_format_xy v[0:1], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:280 glc slc			; CHECK-NEXT: tbuffer_store_format_xyzw v[3:6], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:264 glc slc
	; CHECK-NEXT: s_waitcnt expcnt(0)			; CHECK-NEXT: tbuffer_store_format_xy v[7:8], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:280 glc slc
	; CHECK-NEXT: ds_read_b32 v0, v2			; CHECK-NEXT: ds_read_b32 v0, v11
				; CHECK-NEXT: s_waitcnt expcnt(1)
				; CHECK-NEXT: ds_read_b32 v5, v1
				; CHECK-NEXT: ds_read_b32 v4, v12
				; CHECK-NEXT: ds_read_b32 v3, v13
				; CHECK-NEXT: ds_read_b32 v2, v2
	; CHECK-NEXT: ds_read_b32 v1, v10			; CHECK-NEXT: ds_read_b32 v1, v10
	; CHECK-NEXT: ds_read_b32 v2, v11
	; CHECK-NEXT: ds_read_b32 v3, v12
	; CHECK-NEXT: ds_read_b32 v4, v7
	; CHECK-NEXT: ds_read_b32 v5, v8
	; CHECK-NEXT: s_waitcnt lgkmcnt(5)			; CHECK-NEXT: s_waitcnt lgkmcnt(5)
	; CHECK-NEXT: exp mrt0 off, off, off, off			; CHECK-NEXT: exp mrt0 off, off, off, off
	; CHECK-NEXT: s_waitcnt lgkmcnt(2)			; CHECK-NEXT: s_waitcnt lgkmcnt(1)
	; CHECK-NEXT: tbuffer_store_format_xyzw v[0:3], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:240 glc slc			; CHECK-NEXT: tbuffer_store_format_xyzw v[2:5], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_32_32_32,BUF_NUM_FORMAT_UINT] idxen offset:240 glc slc
	; CHECK-NEXT: s_waitcnt lgkmcnt(0)			; CHECK-NEXT: s_waitcnt lgkmcnt(0)
	; CHECK-NEXT: tbuffer_store_format_xy v[4:5], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:256 glc slc			; CHECK-NEXT: tbuffer_store_format_xy v[0:1], v9, s[4:7], s1 format:[BUF_DATA_FORMAT_INVALID,BUF_NUM_FORMAT_UINT] idxen offset:256 glc slc
	; CHECK-NEXT: s_endpgm			; CHECK-NEXT: s_endpgm
	%load1 = load <6 x float>, <6 x float> addrspace(3)* %arg5, align 4			%load1 = load <6 x float>, <6 x float> addrspace(3)* %arg5, align 4
	%vec11 = shufflevector <6 x float> %load1, <6 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%vec11 = shufflevector <6 x float> %load1, <6 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	call void @llvm.amdgcn.struct.tbuffer.store.v4f32(<4 x float> %vec11, <4 x i32> %arg3, i32 %arg1, i32 264, i32 %arg2, i32 immarg 77, i32 immarg 3)			call void @llvm.amdgcn.struct.tbuffer.store.v4f32(<4 x float> %vec11, <4 x i32> %arg3, i32 %arg1, i32 264, i32 %arg2, i32 immarg 77, i32 immarg 3)
	%vec12 = shufflevector <6 x float> %load1, <6 x float> undef, <2 x i32> <i32 4, i32 5>			%vec12 = shufflevector <6 x float> %load1, <6 x float> undef, <2 x i32> <i32 4, i32 5>
	call void @llvm.amdgcn.struct.tbuffer.store.v2f32(<2 x float> %vec12, <4 x i32> %arg3, i32 %arg1, i32 280, i32 %arg2, i32 immarg 64, i32 immarg 3)			call void @llvm.amdgcn.struct.tbuffer.store.v2f32(<2 x float> %vec12, <4 x i32> %arg3, i32 %arg1, i32 280, i32 %arg2, i32 immarg 64, i32 immarg 3)

	call void @llvm.amdgcn.exp.f32(i32 immarg 0, i32 immarg 0, float undef, float undef, float undef, float undef, i1 immarg false, i1 immarg false)			call void @llvm.amdgcn.exp.f32(i32 immarg 0, i32 immarg 0, float undef, float undef, float undef, float undef, i1 immarg false, i1 immarg false)
(13 lines not shown)

llvm/test/CodeGen/AMDGPU/store-local.128.ll

(456 lines not shown)
	; GFX7-NEXT: s_endpgm			; GFX7-NEXT: s_endpgm
	;			;
	; GFX6-LABEL: store_lds_v4i32_align8:			; GFX6-LABEL: store_lds_v4i32_align8:
	; GFX6: ; %bb.0:			; GFX6: ; %bb.0:
	; GFX6-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x4			; GFX6-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x4
	; GFX6-NEXT: s_load_dword s0, s[0:1], 0x0			; GFX6-NEXT: s_load_dword s0, s[0:1], 0x0
	; GFX6-NEXT: s_mov_b32 m0, -1			; GFX6-NEXT: s_mov_b32 m0, -1
	; GFX6-NEXT: s_waitcnt lgkmcnt(0)			; GFX6-NEXT: s_waitcnt lgkmcnt(0)
	; GFX6-NEXT: v_mov_b32_e32 v0, s6			; GFX6-NEXT: v_mov_b32_e32 v0, s4
	; GFX6-NEXT: v_mov_b32_e32 v1, s7			; GFX6-NEXT: v_mov_b32_e32 v1, s5
	; GFX6-NEXT: v_mov_b32_e32 v4, s0			; GFX6-NEXT: v_mov_b32_e32 v4, s0
	; GFX6-NEXT: v_mov_b32_e32 v2, s4			; GFX6-NEXT: v_mov_b32_e32 v2, s6
	; GFX6-NEXT: v_mov_b32_e32 v3, s5			; GFX6-NEXT: v_mov_b32_e32 v3, s7
	; GFX6-NEXT: ds_write2_b64 v4, v[2:3], v[0:1] offset1:1			; GFX6-NEXT: ds_write2_b64 v4, v[0:1], v[2:3] offset1:1
	; GFX6-NEXT: s_endpgm			; GFX6-NEXT: s_endpgm
	;			;
	; GFX10-LABEL: store_lds_v4i32_align8:			; GFX10-LABEL: store_lds_v4i32_align8:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_clause 0x1			; GFX10-NEXT: s_clause 0x1
	; GFX10-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x10			; GFX10-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x10
	; GFX10-NEXT: s_load_dword s2, s[0:1], 0x0			; GFX10-NEXT: s_load_dword s2, s[0:1], 0x0
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
(69 lines not shown)