This is an archive of the discontinued LLVM Phabricator instance.

Differential D57748

AMDGPU: Add inverse ballot intrinsic
Needs Revision · Public

Authored by cwabbott on Feb 5 2019, 5:57 AM.

Details

Reviewers

arsenm
nhaehnle

Summary

This takes a uniform 64-bit bitmask and returns a boolean value that is
true for each thread whose corresponding bit is 1. The implementation of
subgroupInverseBallot() in radv (and presumably AMDVLK) currently does this
by shifting by the thread ID, but that is a complicated way of doing what is
basically a no-op, thanks to how booleans are stored in SGPRs.
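
The equivalence described here can be sketched as a small C++ model of a single 64-lane wave. This is only an illustration under stated assumptions: `kWaveSize`, `inverseBallotByShift`, and `inverseBallotAsLaneMask` are hypothetical names, not radv, AMDVLK, or LLVM code, and all 64 lanes are assumed to be active.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one 64-lane wave; not LLVM or driver code.
constexpr unsigned kWaveSize = 64;

// What the shift-based lowering computes in each lane:
// test bit `lane` of the wave-uniform mask.
bool inverseBallotByShift(uint64_t mask, unsigned lane) {
  assert(lane < kWaveSize && "lane index out of range");
  return ((mask >> lane) & 1u) != 0;
}

// A per-lane boolean is modelled as one bit per lane in a 64-bit scalar
// value. Collecting the per-lane results rebuilds the input mask.
uint64_t inverseBallotAsLaneMask(uint64_t mask) {
  uint64_t laneBits = 0;
  for (unsigned lane = 0; lane < kWaveSize; ++lane)
    if (inverseBallotByShift(mask, lane))
      laneBits |= uint64_t{1} << lane;
  return laneBits;
}
```

In this model `inverseBallotAsLaneMask(m)` returns `m` for every input, which is why the shift sequence amounts to a plain copy.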

Although the user guarantees that the value is uniform, it may still
wind up in VGPRs thanks to deficiencies in the backend, in which case
we have to emit two readfirstlanes.
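
Why two readfirstlanes are needed can be sketched with a toy model: a 64-bit value held in vector registers occupies two 32-bit registers, and each readfirstlane moves one 32-bit half to a scalar. The types and functions below (`Vgpr32`, `readFirstLane`, `readFirstLane64`) are hypothetical stand-ins, not LLVM APIs, and the fallback to lane 0 when no lane is active is an assumption of this sketch.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model: one 32-bit vector register holds a value per lane.
constexpr unsigned kWaveSize = 64;
using Vgpr32 = std::array<uint32_t, kWaveSize>;

// Return the value held by the lowest active lane in `exec`.
// Assumption of this sketch: with no active lanes, lane 0 is read.
uint32_t readFirstLane(const Vgpr32 &v, uint64_t exec) {
  unsigned lane = 0;
  if (exec != 0)
    while (((exec >> lane) & 1u) == 0)
      ++lane;
  return v[lane];
}

// A 64-bit mask living in a pair of vector registers needs one read per
// 32-bit half before it can be used as a scalar lane mask.
uint64_t readFirstLane64(const Vgpr32 &lo, const Vgpr32 &hi, uint64_t exec) {
  uint64_t high = readFirstLane(hi, exec);
  uint64_t low = readFirstLane(lo, exec);
  return (high << 32) | low;
}
```

If the value really is uniform, every active lane holds the same halves, so reading the first active lane loses nothing; if it is not uniform, the result silently reflects only that one lane.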

Diff Detail

Repository

rL LLVM

Build Status

Buildable 27876
Build 27875: arc lint + arc unit

Event Timeline

cwabbott created this revision. Feb 5 2019, 5:57 AM

Herald added a project: Restricted Project. Feb 5 2019, 5:57 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 5 others.

Harbormaster completed remote builds in B27735: Diff 185290. Feb 5 2019, 5:57 AM

The user can't actually guarantee the argument is uniform, since transforms are allowed to touch the argument. We're accumulating quite a few places that assume this though

arsenm added inline comments. Feb 5 2019, 6:13 AM

test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
19–21	You can just use an i64 argument here instead of this build vector and cast thing
27	Need to add a few more tests with more complex situations (particularly one with a uniform phi, and one with a divergent phi)

In D57748#1385046, @arsenm wrote:

The user can't actually guarantee the argument is uniform, since transforms are allowed to touch the argument. We're accumulating quite a few places that assume this though

Yeah, that's true. We're already disabling some problematic transforms in Mesa because of that. We can probably get a proper solution when we add thread group semantics to the LangRef (cf. the llvm-dev thread), but we shouldn't need to wait on that to get this in.

cwabbott marked an inline comment as done. Feb 5 2019, 6:19 AM

cwabbott added inline comments.

test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
19–21	No, that crashes:
	Formal argument #0 has unhandled type i64
	UNREACHABLE executed at ../lib/CodeGen/CallingConvLower.cpp:98!
	I just copied the amdgpu_ps calling convention from somewhere else, is there a better one to use?

arsenm added inline comments. Feb 5 2019, 6:23 AM

test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
19–21	I know I've written the patch to fix this before, but I guess I never committed it. You can just use a default calling convention function for this just as well

cwabbott marked an inline comment as done. Feb 5 2019, 6:25 AM

cwabbott added inline comments.

test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
27	Why would having a phi node change anything? As far as I can see these tests cover the only two cases that really matter for lowering this (input in SGPR and input in VGPR).

arsenm added inline comments. Feb 5 2019, 6:33 AM

test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll
27	I'm specifically worried about the change to isVGPRToSGPRCopy. Phi and i1 handling has been a problematic area in SIFixSGPRCopies. I would probably be more comfortable using a pseudo instruction for this until this point rather than making a regular COPY to VReg_1 legal.

Added a pseudoinstruction lowered in SILowerI1Copies, and legalized it more similarly to how other instructions are legalized.

Added tests where the source is a uniform and a non-uniform phi node. I had to use the amdgpu_ps calling convention here to get the arguments into SGPRs.

Harbormaster completed remote builds in B27746: Diff 185317. Feb 5 2019, 8:30 AM

arsenm added inline comments. Feb 6 2019, 5:40 AM

lib/Target/AMDGPU/SIInstructions.td
597	extra whitespace change
lib/Target/AMDGPU/SILowerI1Copies.cpp
651–652 ↗	(On Diff #185317)	Can you handle this in EmitInstrWithCustomInserter instead? You'll need to set usesCustomInserter on the instruction

Remove spurious whitespace change.
Lower S_INV_BALLOT with EmitInstrWithCustomInserter.

Harbormaster completed remote builds in B27876: Diff 185757. Feb 7 2019, 6:48 AM

Why can't we recognize this as a pattern? Basically, it's just (src & (1 << thread_idx)), and thread_idx can be matched as a sequence of mbcnt intrinsics.

Hmm, except the SelectionDAG is only per-basic block. Ugh.
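
The pattern suggested in this comment can be sketched in plain C++ as follows. This is a toy model, not SelectionDAG code: `mbcntLo` and `mbcntHi` are hypothetical functions that imitate what the mbcnt intrinsics compute for a given lane (the count of mask bits belonging to lower-numbered lanes, plus a base value), and the lane number is passed explicitly because ordinary C++ has no implicit lane. Note that the shift must be done in 64 bits for lanes 32 and above.

```cpp
#include <cstdint>

// Count the set bits of a 32-bit value.
unsigned popCount32(uint32_t x) {
  unsigned n = 0;
  for (; x != 0; x &= x - 1)
    ++n;
  return n;
}

// Model of mbcnt.lo for `lane`: bits of `mask` (lanes 0..31) that belong to
// lanes below `lane`, added to `base`.
uint32_t mbcntLo(uint32_t mask, uint32_t base, unsigned lane) {
  uint32_t below = lane >= 32 ? 0xFFFFFFFFu : ((uint32_t{1} << lane) - 1);
  return base + popCount32(mask & below);
}

// Model of mbcnt.hi for `lane`: the same count for lanes 32..63.
uint32_t mbcntHi(uint32_t mask, uint32_t base, unsigned lane) {
  uint32_t below = lane <= 32 ? 0u : ((uint32_t{1} << (lane - 32)) - 1);
  return base + popCount32(mask & below);
}

// With an all-ones mask the two counts add up to the lane's own index.
unsigned threadIdx(unsigned lane) {
  return mbcntHi(~0u, mbcntLo(~0u, 0, lane), lane);
}

// The expression from the comment: src & (1 << thread_idx).
bool inverseBallotPattern(uint64_t src, unsigned lane) {
  return (src & (uint64_t{1} << threadIdx(lane))) != 0;
}
```

For any lane in 0..63, `threadIdx(lane)` equals `lane` in this model, so a matcher that recognizes the mbcnt sequence would know the shift amount is the lane index.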

In D57748#1389105, @nhaehnle wrote:

Why can't we recognize this as a pattern? Basically, it's just (src & (1 << thread_idx)), and thread_idx can be matched as a sequence of mbcnt intrinsics.

Hmm, except the SelectionDAG is only per-basic block. Ugh.

We could always make CodeGenPrepare always sink these

In D57748#1389109, @arsenm wrote:

In D57748#1389105, @nhaehnle wrote:

Why can't we recognize this as a pattern? Basically, it's just (src & (1 << thread_idx)), and thread_idx can be matched as a sequence of mbcnt intrinsics.

Hmm, except the SelectionDAG is only per-basic block. Ugh.

We could always make CodeGenPrepare always sink these

I think this would be cleaner since then we wouldn't have to worry about the conceptually disturbing readfirstlane case

Adding a pattern for this wouldn't work for what I wanted to do, which was a ballot/inverseballot pair to operate directly on the bitmask representation of a boolean, since there's a bug where SelectionDAG forgets that ballot removes divergence, and it needs to be non-divergent for the pattern to fire. That being said, inserting two readlanes isn't that much better, so maybe I should just fix that instead...
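
The ballot/inverse-ballot pairing mentioned in this comment can be illustrated with a small C++ model. Everything here is hypothetical (`LaneBools`, `ballot`, `inverseBallot`, `keepFirstTrueLane`) and assumes all 64 lanes are active; it only shows the idea of converting a per-lane boolean to its bitmask, operating on the mask with scalar bit tricks, and converting back.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model: one boolean per lane of a 64-lane wave.
constexpr unsigned kWaveSize = 64;
using LaneBools = std::array<bool, kWaveSize>;

// ballot: pack the per-lane booleans into a wave-uniform 64-bit mask.
uint64_t ballot(const LaneBools &p) {
  uint64_t mask = 0;
  for (unsigned lane = 0; lane < kWaveSize; ++lane)
    if (p[lane])
      mask |= uint64_t{1} << lane;
  return mask;
}

// inverse ballot: unpack a uniform mask into a per-lane boolean.
LaneBools inverseBallot(uint64_t mask) {
  LaneBools p{};
  for (unsigned lane = 0; lane < kWaveSize; ++lane)
    p[lane] = ((mask >> lane) & 1u) != 0;
  return p;
}

// Example of working on the mask directly: keep only the lowest true lane.
LaneBools keepFirstTrueLane(const LaneBools &p) {
  uint64_t mask = ballot(p);
  uint64_t lowest = mask & (~mask + 1); // isolate the lowest set bit
  return inverseBallot(lowest);
}
```

Because the two conversions are inverses in this model, the only real work is the scalar bit operation in between.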

In D57748#1390251, @cwabbott wrote:

Adding a pattern for this wouldn't work for what I wanted to do, which was a ballot/inverseballot pair to operate directly on the bitmask representation of a boolean, since there's a bug where SelectionDAG forgets that ballot removes divergence, and it needs to be non-divergent for the pattern to fire. That being said, inserting two readlanes isn't that much better, so maybe I should just fix that instead...

Fixing DAG bugs is good. This would also be good if you could use mbcnt here instead of hoping readfirstlane works out OK.

Please rebase if still relevant

This revision now requires changes to proceed. Dec 22 2022, 3:52 PM

Herald added a project: Restricted Project. Dec 22 2022, 3:52 PM

Herald added subscribers: kosarev, kerbowa.

arsenm mentioned this in D146287: [AMDGPU][GISel] Add inverse ballot intrinsic. Mar 17 2023, 6:28 AM

Revision Contents

Path                                                 Size
include/llvm/IR/IntrinsicsAMDGPU.td                  3 lines
lib/Target/AMDGPU/AMDGPUSearchableTables.td          1 line
lib/Target/AMDGPU/SIISelLowering.cpp                 3 lines
lib/Target/AMDGPU/SIInstrInfo.cpp                    8 lines
lib/Target/AMDGPU/SIInstructions.td                  7 lines
test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll    83 lines

Diff 185757

include/llvm/IR/IntrinsicsAMDGPU.td

[...]
 def int_amdgcn_icmp :
   Intrinsic<[llvm_i64_ty], [llvm_anyint_ty, LLVMMatchType<0>, llvm_i32_ty],
             [IntrNoMem, IntrConvergent]>;

 def int_amdgcn_fcmp :
   Intrinsic<[llvm_i64_ty], [llvm_anyfloat_ty, LLVMMatchType<0>, llvm_i32_ty],
             [IntrNoMem, IntrConvergent]>;

+def int_amdgcn_inverse_ballot :
+  Intrinsic<[llvm_i1_ty], [llvm_i64_ty], [IntrNoMem, IntrSpeculatable]>;
+
 def int_amdgcn_readfirstlane :
   GCCBuiltin<"__builtin_amdgcn_readfirstlane">,
   Intrinsic<[llvm_i32_ty], [llvm_i32_ty], [IntrNoMem, IntrConvergent]>;

 // The lane argument must be uniform across the currently active threads of the
 // current wave. Otherwise, the result is undefined.
 def int_amdgcn_readlane :
   GCCBuiltin<"__builtin_amdgcn_readlane">,
[...]

lib/Target/AMDGPU/AMDGPUSearchableTables.td

[...]
 def : SourceOfDivergence<int_amdgcn_buffer_atomic_and>;
 def : SourceOfDivergence<int_amdgcn_buffer_atomic_or>;
 def : SourceOfDivergence<int_amdgcn_buffer_atomic_xor>;
 def : SourceOfDivergence<int_amdgcn_buffer_atomic_cmpswap>;
 def : SourceOfDivergence<int_amdgcn_ps_live>;
 def : SourceOfDivergence<int_amdgcn_ds_swizzle>;
 def : SourceOfDivergence<int_amdgcn_ds_ordered_add>;
 def : SourceOfDivergence<int_amdgcn_ds_ordered_swap>;
+def : SourceOfDivergence<int_amdgcn_inverse_ballot>;

 foreach intr = AMDGPUImageDimAtomicIntrinsics in
 def : SourceOfDivergence<intr>;

lib/Target/AMDGPU/SIISelLowering.cpp

[...] case AMDGPU::SI_TCRETURN_ISEL: {

     for (unsigned I = 1, E = MI.getNumOperands(); I != E; ++I)
       MIB.add(MI.getOperand(I));

     MIB.cloneMemRefs(MI);
     MI.eraseFromParent();
     return BB;
   }
+  case AMDGPU::S_INV_BALLOT:
+    MI.setDesc(TII->get(AMDGPU::COPY));
+    return BB;
   default:
     return AMDGPUTargetLowering::EmitInstrWithCustomInserter(MI, BB);
   }
 }

 bool SITargetLowering::hasBitPreservingFPLogic(EVT VT) const {
   return isTypeLegal(VT.getScalarType());
 }
[...]

lib/Target/AMDGPU/SIInstrInfo.cpp

[...] void SIInstrInfo::legalizeOperands(MachineInstr &MI,
   // Legalize SI_INIT_M0
   if (MI.getOpcode() == AMDGPU::SI_INIT_M0) {
     MachineOperand &Src = MI.getOperand(0);
     if (Src.isReg() && RI.hasVGPRs(MRI.getRegClass(Src.getReg())))
       Src.setReg(readlaneVGPRToSGPR(Src.getReg(), MI, MRI));
     return;
   }

+  // Legalize S_INV_BALLOT
+  if (MI.getOpcode() == AMDGPU::S_INV_BALLOT) {
+    MachineOperand &Src = MI.getOperand(1);
+    if (Src.isReg() && RI.hasVGPRs(MRI.getRegClass(Src.getReg())))
+      Src.setReg(readlaneVGPRToSGPR(Src.getReg(), MI, MRI));
+    return;
+  }
+
   // Legalize MIMG and MUBUF/MTBUF for shaders.
   //
   // Shaders only generate MUBUF/MTBUF instructions via intrinsics or via
   // scratch memory access. In both cases, the legalization never involves
   // conversion to the addr64 form.
   if (isMIMG(MI) ||
       (AMDGPU::isShader(MF.getFunction().getCallingConv()) &&
        (isMUBUF(MI) || isMTBUF(MI)))) {
[...]

lib/Target/AMDGPU/SIInstructions.td

[...]
 }

 def V_SET_INACTIVE_B64 : VPseudoInstSI <(outs VReg_64:$vdst),
   (ins VReg_64: $src, VSrc_b64:$inactive),
   [(set i64:$vdst, (int_amdgcn_set_inactive i64:$src, i64:$inactive))]> {
   let Constraints = "$src = $vdst";
 }

+// Pseudoinstruction for @llvm.amdgcn.inverse.ballot. It is turned into
+// potentially a readlane and then a copy.
+def S_INV_BALLOT : SPseudoInstSI <(outs VReg_1:$dst), (ins SReg_64:$src),
+  [(set i1:$dst, (int_amdgcn_inverse_ballot i64:$src))]> {
+  let usesCustomInserter = 1;
+}
+

 let usesCustomInserter = 1, Defs = [SCC] in {
 def S_ADD_U64_PSEUDO : SPseudoInstSI <
   (outs SReg_64:$vdst), (ins SSrc_b64:$src0, SSrc_b64:$src1),
   [(set SReg_64:$vdst, (add i64:$src0, i64:$src1))]
 >;

 def S_SUB_U64_PSEUDO : SPseudoInstSI <
[...]
 >;

 // TODO: we could add more variants for other types of conditionals

 def : Pat <
   (int_amdgcn_icmp i1:$src, (i1 0), (i32 33)),
   (COPY $src) // Return the SGPRs representing i1 src
 >;

 //===----------------------------------------------------------------------===//
 // VOP1 Patterns
 //===----------------------------------------------------------------------===//

 let SubtargetPredicate = isGCN, OtherPredicates = [UnsafeFPMath] in {

 //def : RcpPat<V_RCP_F64_e32, f64>;
 //defm : RsqPat<V_RSQ_F64_e32, f64>;
 //defm : RsqPat<V_RSQ_F32_e32, f32>;
[...]

test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=CHECK %s

				; CHECK-LABEL: {{^}}ret:
				; CHECK: s_mov_b64 s[[MASK:\[[0-9]+:[0-9]+\]]], 1
				; CHECK-NEXT: v_cndmask_b32_e64 v0, 0, 1.0, s[[MASK]]
				define float @ret() #1 {
				main_body:
				%w = call i1 @llvm.amdgcn.inverse.ballot(i64 1)
				%r = select i1 %w, float 1.0, float 0.0
				ret float %r
				}

				; make sure it works for things that wind up in VGPR's
				; CHECK-LABEL: {{^}}vgpr:
				; CHECK: v_readfirstlane_b32 s[[MASKLO:[0-9]+]], v0
				; CHECK-NEXT: v_readfirstlane_b32 s[[MASKHI:[0-9]+]], v1
				; CHECK-NEXT: v_cndmask_b32_e64 v0, 0, 1.0, s{{\[}}[[MASKLO]]:[[MASKHI]]{{\]}}
				define float @vgpr(i64 %v0_1) {
				%inv = call i1 @llvm.amdgcn.inverse.ballot(i64 %v0_1)
				%r = select i1 %inv, float 1.0, float 0.0
				ret float %r
				}

				; CHECK-LABEL: {{^}}phi_uniform:
				; CHECK: s_cmp_lg_u32 s2, 0
				; CHECK: s_cbranch_scc0
				; CHECK: v_cndmask_b32_e64 v0, 0, 1.0, s[0:1]
				; CHECK: s_branch
				; CHECK: s_add_u32 s[[MASKLO:[0-9]+]], s0, 1
				; CHECK: s_addc_u32 s[[MASKHI:[0-9]+]], s1, 0
				; CHECK: v_cndmask_b32_e64 v0, 0, 1.0, s{{\[}}[[MASKLO]]:[[MASKHI]]{{\]}}
				define amdgpu_ps float @phi_uniform(i64 inreg %s0_1, i32 inreg %s2) {
				main_body:
				%cc = icmp ne i32 %s2, 0

				br i1 %cc, label %endif, label %if

				if:
				%tmp = add i64 %s0_1, 1
				br label %endif

				endif:
				%sel = phi i64 [ %s0_1, %main_body], [ %tmp, %if ]

				%inv = call i1 @llvm.amdgcn.inverse.ballot(i64 %sel)
				%r = select i1 %inv, float 1.0, float 0.0
				ret float %r
				}

				; CHECK-LABEL: {{^}}phi_divergent:
				; CHECK: v_cmp_eq_u32_e32 vcc, 0, v0
				; CHECK: v_mov_b32_e32 v[[VMASKLO:[0-9]+]], s0
				; CHECK: v_mov_b32_e32 v[[VMASKHI:[0-9]+]], s1
				; CHECK: s_and_saveexec
				; CHECK: s_add_u32
				; CHECK: s_addc_u32
				; CHECK: v_mov_b32_e32 v[[VMASKLO]],
				; CHECK: v_mov_b32_e32 v[[VMASKHI]],
				; CHECK: s_or_b64 exec, exec,
				; CHECK: v_readfirstlane_b32 s[[SMASKLO:[0-9]+]], v[[VMASKLO]]
				; CHECK: v_readfirstlane_b32 s[[SMASKHI:[0-9]+]], v[[VMASKHI]]
				; CHECK: v_cndmask_b32_e64 v0, 0, 1.0, s{{\[}}[[SMASKLO]]:[[SMASKHI]]{{\]}}
				define amdgpu_ps float @phi_divergent(i32 %v0, i64 inreg %s0_1) {
				main_body:
				%cc = icmp ne i32 %v0, 0

				br i1 %cc, label %endif, label %if

				if:
				%tmp = add i64 %s0_1, 1
				br label %endif

				endif:
				%sel = phi i64 [ %s0_1, %main_body], [ %tmp, %if ]

				%inv = call i1 @llvm.amdgcn.inverse.ballot(i64 %sel)
				%r = select i1 %inv, float 1.0, float 0.0
				ret float %r
				}

				declare i1 @llvm.amdgcn.inverse.ballot(i64)
				declare i64 @llvm.amdgcn.icmp.i32(i32, i32, i32)