
Details

Reviewers

nhaehnle
tpr
dstuttard

Commits

rG8c10fa1a903f: [AMDGPU] Fix DPP sequence in atomic optimizer.
rL353703: [AMDGPU] Fix DPP sequence in atomic optimizer.

Summary

This commit fixes the DPP sequence in the atomic optimizer (which was previously missing the row_shr:3 step).

Diff Detail

Event Timeline

sheredom created this revision. Feb 5 2019, 1:42 AM

Herald added a project: Restricted Project. Feb 5 2019, 1:42 AM

Herald added subscribers: llvm-commits, jfb, t-tye and 5 others.

Updated to bring in an additional fix to remove read_register exec and replace it with a ballot.

(removed a comment here that was wrong)

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
319–320	So I hadn't noticed this before, but I think the wwm intrinsic should be applied after the readlane below. With wwm before readlane, there's a theoretical possibility that register allocation splits the live range of the value and inserts a V_MOV in between which ends up executed with bit 63 disabled, leading to an incorrect result from the readlane.

sheredom marked an inline comment as done. Feb 7 2019, 5:00 AM

sheredom added inline comments.

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
319–320	Oh yeah - so I was assuming because readlane ignores the exec mask it should be fine, but I can see why that might be an issue. I'll change the code.

Make the final readlane be in the WWM section too as per the review comments.

sheredom marked an inline comment as done. Feb 8 2019, 2:06 AM

Not really an area I'm 100% sure about - but looks ok to me. One of the other reviewers will have to sign off too.
Minor niggle on the comment (if my understanding is correct).

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
306	Is this comment still true?

This revision is now accepted and ready to land. Feb 11 2019, 2:36 AM

Fix comment that should now say exclusive scan instead of inclusive scan.

sheredom marked an inline comment as done. Feb 11 2019, 2:51 AM

I don't understand this fix. Surely a reduction is done with just power of two shifts. Why do we need the shift by 3 as well? What is the extra wf_sr1 dpp at the start for?

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
265	Do you need the setConvergent? It's already marked setConvergent in the .td file. This might also apply to the setConvergent calls lower down, but I haven't checked.

Oh, it's because you've switched to an exclusive scan.

sheredom marked 2 inline comments as done. Feb 11 2019, 3:25 AM

sheredom added inline comments.

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
265	Probably not, but it shouldn't do any harm to add it (I use setConvergent on every other convergent thing in the file, so best to be consistent I think!).

But I still don't understand it:

1. Why do you want an exclusive scan? Surely what you're trying to do is just "sum" up all lanes into lane 63, which is an inclusive scan.
2. Can't you do an exclusive scan with powers of 2 shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1)).
3. Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)

In D57737#1392667, @tpr wrote:

> But I still don't understand it:
>
> Why do you want an exclusive scan? Surely what you're trying to do is just "sum" up all lanes into lane 63, which is an inclusive scan.

No we need both - you need an exclusive scan to know the individual lane's position in the single atomic operation, but you also need the reduction (sum) to know how much we are adding in the atomic operation in the first place. So I do an exclusive scan to get our position, then add on our own lane's value to get the inclusive scan, and readlane 63 to get the total reduction across the wavefront.
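As a rough illustration of the reply above (not the optimizer's actual code), the relationship between the exclusive scan, the inclusive scan and the reduction can be modelled on the host for a 64-lane wavefront. The names `WaveScan`, `scanWave` and `laneResult` are made up for this sketch, and inactive lanes are assumed to have already been set to the identity (0 for add).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWaveSize = 64;
using Wave = std::array<std::uint32_t, kWaveSize>;

// Hypothetical host-side model; none of these names come from the patch.
struct WaveScan {
  Wave exclusive{};        // sum of the values held by all lower lanes
  Wave inclusive{};        // exclusive[lane] + value[lane]
  std::uint32_t total = 0; // what reading lane 63 of the inclusive scan yields
};

WaveScan scanWave(const Wave &values) {
  WaveScan result;
  std::uint32_t running = 0;
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    result.exclusive[lane] = running;
    // Adding the lane's own value turns the exclusive scan into the inclusive one.
    result.inclusive[lane] = result.exclusive[lane] + values[lane];
    running = result.inclusive[lane];
  }
  // The last lane of the inclusive scan holds the reduction over the wavefront.
  result.total = result.inclusive[kWaveSize - 1];
  return result;
}

// Value a lane would see if the single combined atomic add returned oldValue.
std::uint32_t laneResult(std::uint32_t oldValue, const WaveScan &scan,
                         std::size_t lane) {
  return oldValue + scan.exclusive[lane];
}
```

Here `total` stands in for the operand of the one atomic operation, and `laneResult` for how each lane could rebuild its own return value from the exclusive scan.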

> Can't you do an exclusive scan with powers of 2 shifts like an inclusive scan, but just with the wf_sr1 on the front? (Although I think that gives the wrong answer due to (1)).

No, you need to do the shifts by 1, 2 and 3. You can see the reasoning here: https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
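To make the sequence easier to follow, here is a small host-side model of the DPP steps used in the right-hand side of the diff (wave_shr:1, then row_shr:1/2/3 from that first value, then row_shr:4, row_shr:8, row_bcast:15 and row_bcast:31 with the listed masks). It is only a sketch: the DPP semantics are simplified, `updateDpp` and `dppExclusiveScan` are invented names, and the "without row_shr:3" variant just drops that one step from this model rather than reproducing the exact pre-fix code.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kWaveSize = 64; // wave64: four rows of 16 lanes
using Wave = std::array<std::uint32_t, kWaveSize>;

enum class DppCtrl { RowShr, WaveShr1, RowBcast15, RowBcast31 };

struct DppStep {
  DppCtrl ctrl;
  int shift;         // shift amount, only meaningful for RowShr
  unsigned rowMask;  // one bit per row of 16 lanes
  unsigned bankMask; // one bit per bank of 4 lanes within a row
};

// Simplified model of update_dpp(old = identity, src, ...) with bound_ctrl
// off: a lane that is masked out or has no valid source keeps the identity.
Wave updateDpp(const Wave &src, const DppStep &step, std::uint32_t identity) {
  Wave dst{};
  for (int lane = 0; lane < kWaveSize; ++lane) {
    const int row = lane / 16, pos = lane % 16, bank = pos / 4;
    int from = -1; // -1 means "no valid source lane"
    switch (step.ctrl) {
    case DppCtrl::RowShr:     if (pos >= step.shift) from = lane - step.shift; break;
    case DppCtrl::WaveShr1:   if (lane >= 1) from = lane - 1; break;
    case DppCtrl::RowBcast15: if (row >= 1) from = row * 16 - 1; break;
    case DppCtrl::RowBcast31: if (row >= 2) from = 31; break;
    }
    const bool enabled = ((step.rowMask >> row) & 1u) != 0 &&
                         ((step.bankMask >> bank) & 1u) != 0;
    dst[lane] = (enabled && from >= 0) ? src[from] : identity;
  }
  return dst;
}

void accumulate(Wave &acc, const Wave &moved) {
  for (int lane = 0; lane < kWaveSize; ++lane) acc[lane] += moved[lane];
}

// Mirrors the loop in the diff for an add: the first three row shifts read the
// wave_shr:1 value, the remaining steps read the running accumulator.
Wave dppExclusiveScan(const Wave &input, bool includeRowShr3) {
  const Wave first = updateDpp(input, {DppCtrl::WaveShr1, 1, 0xf, 0xf}, 0);
  Wave acc = first;
  std::vector<DppStep> head = {{DppCtrl::RowShr, 1, 0xf, 0xf},
                               {DppCtrl::RowShr, 2, 0xf, 0xf}};
  if (includeRowShr3) head.push_back({DppCtrl::RowShr, 3, 0xf, 0xf});
  for (const DppStep &step : head) accumulate(acc, updateDpp(first, step, 0));
  const std::vector<DppStep> tail = {{DppCtrl::RowShr, 4, 0xf, 0xe},
                                     {DppCtrl::RowShr, 8, 0xf, 0xc},
                                     {DppCtrl::RowBcast15, 0, 0xa, 0xf},
                                     {DppCtrl::RowBcast31, 0, 0xc, 0xf}};
  for (const DppStep &step : tail) accumulate(acc, updateDpp(acc, step, 0));
  return acc;
}

// Plain sequential exclusive scan, used as the reference result.
Wave referenceExclusiveScan(const Wave &input) {
  Wave out{};
  std::uint32_t running = 0;
  for (int lane = 0; lane < kWaveSize; ++lane) {
    out[lane] = running;
    running += input[lane];
  }
  return out;
}
```

In this model, `dppExclusiveScan(x, true)` agrees with `referenceExclusiveScan(x)`, while leaving out the row_shr:3 step makes each lane's window one element short before the row_shr:4 step, so the two no longer agree for general inputs.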

> Isn't the only thing wrong with this code before this fix that you forgot to put the bank masks on steps 2, 3 and 4? (Although you're correct to remove the unnecessary intermediate wwm intrinsic calls.)

There were two bugs:

1. The exclusive scan component returned the wrong result in some cases because the shift-by-3 wasn't there.
2. We need to do a ballot instead of read_register because read_register was sometimes being misidentified as requiring WWM and thus giving us garbage results for the atomic load.
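For context on what the active-lane mask is used for, the following is a host-side sketch of the mbcnt step that appears in the diff, assuming the ballot result is fed into the same lo/hi split as the old `exec` read. The function names are illustrative, and unlike the real intrinsics (which use the executing lane implicitly) the lane index is passed in explicitly.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// Model of a ballot over an always-true condition: one bit per active lane.
std::uint64_t ballot(const std::array<bool, 64> &active) {
  std::uint64_t mask = 0;
  for (unsigned lane = 0; lane < 64; ++lane)
    if (active[lane]) mask |= std::uint64_t{1} << lane;
  return mask;
}

// Counts set bits of the low 32 mask bits that sit below `lane`, plus `base`.
std::uint32_t mbcntLo(std::uint32_t maskLo, std::uint32_t base, unsigned lane) {
  const std::uint32_t below = lane >= 32 ? 0xffffffffu : ((1u << lane) - 1u);
  return base + static_cast<std::uint32_t>(std::bitset<32>(maskLo & below).count());
}

// Counts set bits of the high 32 mask bits that sit below `lane`, plus `base`.
std::uint32_t mbcntHi(std::uint32_t maskHi, std::uint32_t base, unsigned lane) {
  const std::uint32_t below = lane <= 32 ? 0u : ((1u << (lane - 32)) - 1u);
  return base + static_cast<std::uint32_t>(std::bitset<32>(maskHi & below).count());
}

// Number of active lanes with an index lower than `lane`.
std::uint32_t activeLanesBelow(std::uint64_t mask, unsigned lane) {
  const std::uint32_t lo = static_cast<std::uint32_t>(mask);
  const std::uint32_t hi = static_cast<std::uint32_t>(mask >> 32);
  return mbcntHi(hi, mbcntLo(lo, 0, lane), lane);
}
```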

Ah, right, I see about the need for the exclusive and inclusive scan results.

I checked https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ and I didn't see any reasoning about why you need the shift by 3. In the diagram there, I reckon you could change instruction 2 to have the result of instruction 1 as both operands, and omit instruction 3 (the shift by 3), and get the same result, saving one instruction. However that might actually take one wait state longer, assuming you can't schedule other stuff into the middle of it, which you probably can't.
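The equivalence suggested here can be checked with a small host-side model of one 16-lane row, assuming out-of-row reads contribute 0. `threeShifts` follows the shift-by-1/2/3 shape of the first three steps, and `twoShifts` feeds the result of the first step into a single shift by 2; both names are made up for this sketch, and it says nothing about wait states or scheduling.

```cpp
#include <array>
#include <cstdint>

constexpr int kRowSize = 16;
using Row = std::array<std::uint32_t, kRowSize>;

// Shift right within the row; positions with no source lane read as 0.
Row rowShr(const Row &src, int shift) {
  Row out{};
  for (int pos = shift; pos < kRowSize; ++pos) out[pos] = src[pos - shift];
  return out;
}

Row add(const Row &a, const Row &b) {
  Row out{};
  for (int pos = 0; pos < kRowSize; ++pos) out[pos] = a[pos] + b[pos];
  return out;
}

// v + shr1(v) + shr2(v) + shr3(v): every shift reads the original value.
Row threeShifts(const Row &v) {
  return add(add(add(v, rowShr(v, 1)), rowShr(v, 2)), rowShr(v, 3));
}

// t = v + shr1(v); t + shr2(t): the second step reuses the first result.
Row twoShifts(const Row &v) {
  const Row t = add(v, rowShr(v, 1));
  return add(t, rowShr(t, 2));
}
```

Both functions produce the sum over a 4-lane window ending at each position, which is the input the later row_shr:4 step relies on.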

So I'm happy now.

Closed by commit rL353703: [AMDGPU] Fix DPP sequence in atomic optimizer. (authored by sheredom). Feb 11 2019, 6:43 AM

This revision was automatically updated to reflect the committed changes.

dnovillo added a subscriber: dnovillo. Mar 5 2019, 5:20 AM

dnovillo added inline comments.

llvm/trunk/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
256 ↗	(On Diff #186244)	This is leaving the declaration for |Context| unused in line 214. I'm getting build errors with -Wunused-variable.

dstuttard added a subscriber: dstuttard. Mar 5 2019, 5:34 AM

dstuttard added inline comments. Mar 5 2019, 5:37 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
256 ↗	(On Diff #186244)	There's a later change that removes it. See https://llvm.org/svn/llvm-project/llvm/trunk@353704 by Benny Kramer.

Diff 185253

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp

[...]

using namespace llvm;		using namespace llvm;

namespace {		namespace {

enum DPP_CTRL {		enum DPP_CTRL {
DPP_ROW_SR1 = 0x111,		DPP_ROW_SR1 = 0x111,
DPP_ROW_SR2 = 0x112,		DPP_ROW_SR2 = 0x112,
		DPP_ROW_SR3 = 0x113,
DPP_ROW_SR4 = 0x114,		DPP_ROW_SR4 = 0x114,
DPP_ROW_SR8 = 0x118,		DPP_ROW_SR8 = 0x118,
DPP_WF_SR1 = 0x138,		DPP_WF_SR1 = 0x138,
DPP_ROW_BCAST15 = 0x142,		DPP_ROW_BCAST15 = 0x142,
DPP_ROW_BCAST31 = 0x143		DPP_ROW_BCAST31 = 0x143
};		};

struct ReplacementInfo {		struct ReplacementInfo {
[...]
MDNode *const RegName =		MDNode *const RegName =
llvm::MDNode::get(Context, llvm::MDString::get(Context, "exec"));		llvm::MDNode::get(Context, llvm::MDString::get(Context, "exec"));
Value *const Metadata = llvm::MetadataAsValue::get(Context, RegName);		Value *const Metadata = llvm::MetadataAsValue::get(Context, RegName);
CallInst *const Exec =		CallInst *const Exec =
B.CreateIntrinsic(Intrinsic::read_register, {B.getInt64Ty()}, {Metadata});		B.CreateIntrinsic(Intrinsic::read_register, {B.getInt64Ty()}, {Metadata});
setConvergent(Exec);		setConvergent(Exec);

// We need to know how many lanes are active within the wavefront that are		// We need to know how many lanes are active within the wavefront that are
// below us. If we counted each lane linearly starting from 0, a lane is		// below us. If we counted each lane linearly starting from 0, a lane is
// below us only if its associated index was less than ours. We do this by		// below us only if its associated index was less than ours. We do this by
// using the mbcnt intrinsic.		// using the mbcnt intrinsic.
Value *const BitCast = B.CreateBitCast(Exec, VecTy);		Value *const BitCast = B.CreateBitCast(Exec, VecTy);
Value *const ExtractLo = B.CreateExtractElement(BitCast, B.getInt32(0));		Value *const ExtractLo = B.CreateExtractElement(BitCast, B.getInt32(0));
Value *const ExtractHi = B.CreateExtractElement(BitCast, B.getInt32(1));		Value *const ExtractHi = B.CreateExtractElement(BitCast, B.getInt32(1));
CallInst *const PartialMbcnt = B.CreateIntrinsic(		CallInst *const PartialMbcnt = B.CreateIntrinsic(
Intrinsic::amdgcn_mbcnt_lo, {}, {ExtractLo, B.getInt32(0)});		Intrinsic::amdgcn_mbcnt_lo, {}, {ExtractLo, B.getInt32(0)});
CallInst *const Mbcnt = B.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_hi, {},		CallInst *const Mbcnt = B.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_hi, {},
{ExtractHi, PartialMbcnt});		{ExtractHi, PartialMbcnt});

Value *const MbcntCast = B.CreateIntCast(Mbcnt, Ty, false);		Value *const MbcntCast = B.CreateIntCast(Mbcnt, Ty, false);

Value *LaneOffset = nullptr;		Value *LaneOffset = nullptr;
Value *NewV = nullptr;		Value *NewV = nullptr;

// If we have a divergent value in each lane, we need to combine the value		// If we have a divergent value in each lane, we need to combine the value
// using DPP.		// using DPP.
if (ValDivergent) {		if (ValDivergent) {
		Value *const Identity = B.getIntN(TyBitWidth, 0);

// First we need to set all inactive invocations to 0, so that they can		// First we need to set all inactive invocations to 0, so that they can
// correctly contribute to the final result.		// correctly contribute to the final result.
CallInst *const SetInactive = B.CreateIntrinsic(		CallInst *const SetInactive =
Intrinsic::amdgcn_set_inactive, Ty, {V, B.getIntN(TyBitWidth, 0)});		B.CreateIntrinsic(Intrinsic::amdgcn_set_inactive, Ty, {V, Identity});
setConvergent(SetInactive);		setConvergent(SetInactive);
NewV = SetInactive;

const unsigned Iters = 6;		CallInst *const FirstDPP =
const unsigned DPPCtrl[Iters] = {DPP_ROW_SR1, DPP_ROW_SR2,		B.CreateIntrinsic(Intrinsic::amdgcn_update_dpp, Ty,
DPP_ROW_SR4, DPP_ROW_SR8,		{Identity, SetInactive, B.getInt32(DPP_WF_SR1),
DPP_ROW_BCAST15, DPP_ROW_BCAST31};		B.getInt32(0xf), B.getInt32(0xf), B.getFalse()});
const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xa, 0xc};		setConvergent(FirstDPP);
		NewV = FirstDPP;

		const unsigned Iters = 7;
		const unsigned DPPCtrl[Iters] = {
		DPP_ROW_SR1, DPP_ROW_SR2, DPP_ROW_SR3, DPP_ROW_SR4,
		DPP_ROW_SR8, DPP_ROW_BCAST15, DPP_ROW_BCAST31};
		const unsigned RowMask[Iters] = {0xf, 0xf, 0xf, 0xf, 0xf, 0xa, 0xc};
		const unsigned BankMask[Iters] = {0xf, 0xf, 0xf, 0xe, 0xc, 0xf, 0xf};

// This loop performs an inclusive scan across the wavefront, with all lanes		// This loop performs an inclusive scan across the wavefront, with all lanes
// active (by using the WWM intrinsic).		// active (by using the WWM intrinsic).
for (unsigned Idx = 0; Idx < Iters; Idx++) {		for (unsigned Idx = 0; Idx < Iters; Idx++) {
CallInst *const DPP = B.CreateIntrinsic(Intrinsic::amdgcn_mov_dpp, Ty,		Value *const UpdateValue = Idx < 3 ? FirstDPP : NewV;
{NewV, B.getInt32(DPPCtrl[Idx]),		CallInst *const DPP = B.CreateIntrinsic(
B.getInt32(RowMask[Idx]),		Intrinsic::amdgcn_update_dpp, Ty,
B.getInt32(0xf), B.getFalse()});		{Identity, UpdateValue, B.getInt32(DPPCtrl[Idx]),
		B.getInt32(RowMask[Idx]), B.getInt32(BankMask[Idx]), B.getFalse()});
setConvergent(DPP);		setConvergent(DPP);
Value *const WWM = B.CreateIntrinsic(Intrinsic::amdgcn_wwm, Ty, DPP);

NewV = B.CreateBinOp(Op, NewV, WWM);		NewV = B.CreateBinOp(Op, NewV, DPP);
NewV = B.CreateIntrinsic(Intrinsic::amdgcn_wwm, Ty, NewV);
}		}

// NewV has returned the inclusive scan of V, but for the lane offset we		LaneOffset = B.CreateIntrinsic(Intrinsic::amdgcn_wwm, Ty, NewV);
// require an exclusive scan. We do this by shifting the values from the		NewV = B.CreateIntrinsic(Intrinsic::amdgcn_wwm, Ty,
// entire wavefront right by 1, and by setting the bound_ctrl (last argument		B.CreateBinOp(Op, NewV, SetInactive));
// to the intrinsic below) to true, we can guarantee that 0 will be shifted
// into the 0'th invocation.
CallInst *const DPP =
B.CreateIntrinsic(Intrinsic::amdgcn_mov_dpp, {Ty},
{NewV, B.getInt32(DPP_WF_SR1), B.getInt32(0xf),
B.getInt32(0xf), B.getTrue()});
setConvergent(DPP);
LaneOffset = B.CreateIntrinsic(Intrinsic::amdgcn_wwm, Ty, DPP);

// Read the value from the last lane, which has accumlated the values of		// Read the value from the last lane, which has accumlated the values of
// each active lane in the wavefront. This will be our new value with which		// each active lane in the wavefront. This will be our new value with which
// we will provide to the atomic operation.		// we will provide to the atomic operation.
if (TyBitWidth == 64) {		if (TyBitWidth == 64) {
Value *const ExtractLo = B.CreateTrunc(NewV, B.getInt32Ty());		Value *const ExtractLo = B.CreateTrunc(NewV, B.getInt32Ty());
Value *const ExtractHi =		Value *const ExtractHi =
B.CreateTrunc(B.CreateLShr(NewV, B.getInt64(32)), B.getInt32Ty());		B.CreateTrunc(B.CreateLShr(NewV, B.getInt64(32)), B.getInt32Ty());
[...]

test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll

[...]
entry:		entry:
ret void		ret void
}		}

; GCN-LABEL: add_i32_varying_vdata:		; GCN-LABEL: add_i32_varying_vdata:
; GFX7LESS-NOT: v_mbcnt_lo_u32_b32		; GFX7LESS-NOT: v_mbcnt_lo_u32_b32
; GFX7LESS-NOT: v_mbcnt_hi_u32_b32		; GFX7LESS-NOT: v_mbcnt_hi_u32_b32
; GFX7LESS-NOT: s_bcnt1_i32_b64		; GFX7LESS-NOT: s_bcnt1_i32_b64
; GFX7LESS: buffer_atomic_add v{{[0-9]+}}		; GFX7LESS: buffer_atomic_add v{{[0-9]+}}
		; GFX8MORE: v_mov_b32_dpp v[[wave_shr1:[0-9]+]], v{{[0-9]+}} wave_shr:1 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:1 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:2 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:3 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:4 row_mask:0xf bank_mask:0xe
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:8 row_mask:0xf bank_mask:0xc
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:15 row_mask:0xa bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:31 row_mask:0xc bank_mask:0xf
; GFX8MORE: v_readlane_b32 s[[scalar_value:[0-9]+]], v{{[0-9]+}}, 63		; GFX8MORE: v_readlane_b32 s[[scalar_value:[0-9]+]], v{{[0-9]+}}, 63
; GFX8MORE: v_mov_b32{{(_e[0-9]+)?}} v[[value:[0-9]+]], s[[scalar_value]]		; GFX8MORE: v_mov_b32{{(_e[0-9]+)?}} v[[value:[0-9]+]], s[[scalar_value]]
; GFX8MORE: buffer_atomic_add v[[value]]		; GFX8MORE: buffer_atomic_add v[[value]]
define amdgpu_kernel void @add_i32_varying_vdata(i32 addrspace(1)* %out, <4 x i32> %inout) {		define amdgpu_kernel void @add_i32_varying_vdata(i32 addrspace(1)* %out, <4 x i32> %inout) {
entry:		entry:
%lane = call i32 @llvm.amdgcn.workitem.id.x()		%lane = call i32 @llvm.amdgcn.workitem.id.x()
%old = call i32 @llvm.amdgcn.buffer.atomic.add(i32 %lane, <4 x i32> %inout, i32 0, i32 0, i1 0)		%old = call i32 @llvm.amdgcn.buffer.atomic.add(i32 %lane, <4 x i32> %inout, i32 0, i32 0, i1 0)
store i32 %old, i32 addrspace(1)* %out		store i32 %old, i32 addrspace(1)* %out
[...]
entry:		entry:
ret void		ret void
}		}

; GCN-LABEL: sub_i32_varying_vdata:		; GCN-LABEL: sub_i32_varying_vdata:
; GFX7LESS-NOT: v_mbcnt_lo_u32_b32		; GFX7LESS-NOT: v_mbcnt_lo_u32_b32
; GFX7LESS-NOT: v_mbcnt_hi_u32_b32		; GFX7LESS-NOT: v_mbcnt_hi_u32_b32
; GFX7LESS-NOT: s_bcnt1_i32_b64		; GFX7LESS-NOT: s_bcnt1_i32_b64
; GFX7LESS: buffer_atomic_sub v{{[0-9]+}}		; GFX7LESS: buffer_atomic_sub v{{[0-9]+}}
		; GFX8MORE: v_mov_b32_dpp v[[wave_shr1:[0-9]+]], v{{[0-9]+}} wave_shr:1 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:1 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:2 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v[[wave_shr1]] row_shr:3 row_mask:0xf bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:4 row_mask:0xf bank_mask:0xe
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_shr:8 row_mask:0xf bank_mask:0xc
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:15 row_mask:0xa bank_mask:0xf
		; GFX8MORE: v_mov_b32_dpp v{{[0-9]+}}, v{{[0-9]+}} row_bcast:31 row_mask:0xc bank_mask:0xf
; GFX8MORE: v_readlane_b32 s[[scalar_value:[0-9]+]], v{{[0-9]+}}, 63		; GFX8MORE: v_readlane_b32 s[[scalar_value:[0-9]+]], v{{[0-9]+}}, 63
; GFX8MORE: v_mov_b32{{(_e[0-9]+)?}} v[[value:[0-9]+]], s[[scalar_value]]		; GFX8MORE: v_mov_b32{{(_e[0-9]+)?}} v[[value:[0-9]+]], s[[scalar_value]]
; GFX8MORE: buffer_atomic_sub v[[value]]		; GFX8MORE: buffer_atomic_sub v[[value]]
define amdgpu_kernel void @sub_i32_varying_vdata(i32 addrspace(1)* %out, <4 x i32> %inout) {		define amdgpu_kernel void @sub_i32_varying_vdata(i32 addrspace(1)* %out, <4 x i32> %inout) {
entry:		entry:
%lane = call i32 @llvm.amdgcn.workitem.id.x()		%lane = call i32 @llvm.amdgcn.workitem.id.x()
%old = call i32 @llvm.amdgcn.buffer.atomic.sub(i32 %lane, <4 x i32> %inout, i32 0, i32 0, i1 0)		%old = call i32 @llvm.amdgcn.buffer.atomic.sub(i32 %lane, <4 x i32> %inout, i32 0, i32 0, i1 0)
store i32 %old, i32 addrspace(1)* %out		store i32 %old, i32 addrspace(1)* %out
[...]

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix DPP sequence in atomic optimizer.
ClosedPublic
