If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which is generally bad for performance) and one dpp
operation.
Repository: rG LLVM Github Monorepo
Event Timeline
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 299 | This requires all lanes to be active. Are we guaranteed that the work group size will be an integer multiple of the wave size? |
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 299 | But suppose the launched grid has size 66. That means one wave has only 2 active lanes, and I'm not aware that WWM can actually activate the rest of them. |
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 299 | That's exactly what WWM does: unconditionally activates all lanes. You can see that in the tests in this patch (both before and after my changes): `s_or_saveexec_b64 s[0:1], -1` sets all bits in the exec mask. |
LGTM - this seems like a good use of GFX10 `row_xmask`.
Please see minor comments.
| llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | |
|---|---|
| 358 | Should this V_PERMLANEX16 be guarded as well? |

| llvm/lib/Target/AMDGPU/SIDefines.h | |
|---|---|
| 675 | Is it worth silencing linting of this enum? |
| 711 | ROW_SHARE0 is defined, but not used? |
| llvm/lib/Target/AMDGPU/SIDefines.h | |
|---|---|
| 711 | Yeah, I just added it for consistency before I had worked out which ones I would actually need. |