This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Broadcast scalar boolean to vector boolean explicitly
ClosedPublic

Authored by ruiling on Sep 16 2021, 8:13 AM.

Details

Reviewers

arsenm
foad
critson
piotr
alex-t
rampitec

Commits

rG52785989e95d: AMDGPU: Broadcast scalar boolean to vector boolean explicitly

Summary

This fixes incorrect code generation for s_add_co_select_user in
test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll:

s_addc_u32 s4, s6, 0
s_cselect_b64 vcc, 1, 0    <-- vcc set as 0x1 if SCC==1
v_mov_b32_e32 v1, s4
s_cmp_gt_u32 s6, 31
v_cndmask_b32_e32 v1, 0, v1, vcc

If s_addc_u32 sets SCC, VCC gets the value 0x1.
v_cndmask does a per-lane selection with VCC as the condition
register. Because only bit 0 of VCC is set, only the first
lane of the destination register can get the correct result, and only
if that lane is active. Instead, the value should be broadcast to all
active lanes of the destination register.

The idea here is to do this broadcast to a vector boolean explicitly
instead of lowering it to a COPY from SCC, which would be interpreted as
selecting between 0 and 1.
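
The failure described above can be reproduced with a small lane-level model. This is only an illustrative sketch, not LLVM code: `cselect`, `cndmask`, `countLanes` and `kWaveSize` are hypothetical names, and a 64-lane wave is assumed to match the GFX7/GFX9 output.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a 64-lane wave, matching the GFX7/GFX9 output in this review.
constexpr std::size_t kWaveSize = 64;
using Lanes = std::array<std::uint32_t, kWaveSize>;

// Hypothetical model of `s_cselect_b64 dst, ifTrue, ifFalse`: picks by SCC.
std::uint64_t cselect(bool scc, std::uint64_t ifTrue, std::uint64_t ifFalse) {
  return scc ? ifTrue : ifFalse;
}

// Hypothetical model of `v_cndmask_b32 dst, src0, src1, mask` with a constant
// src0: each active lane takes src1 when its mask bit is set, otherwise src0.
// Lanes that are inactive in `exec` keep their previous value.
Lanes cndmask(Lanes dst, std::uint32_t src0, const Lanes &src1,
              std::uint64_t mask, std::uint64_t exec) {
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    const bool active = (exec >> lane) & 1u;
    if (!active)
      continue;
    dst[lane] = ((mask >> lane) & 1u) ? src1[lane] : src0;
  }
  return dst;
}

// Number of lanes holding `value`.
std::size_t countLanes(const Lanes &v, std::uint32_t value) {
  std::size_t n = 0;
  for (std::uint32_t x : v)
    n += (x == value);
  return n;
}
```

With SCC set and every lane active, `cndmask(..., cselect(true, 1, 0), exec)` delivers the selected value to a single lane, while `cselect(true, ~0ull, 0)` delivers it to all 64.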

This replaces D109754.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ruiling created this revision. Sep 16 2021, 8:13 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others. Sep 16 2021, 8:13 AM

ruiling requested review of this revision. Sep 16 2021, 8:13 AM

Herald added a project: Restricted Project. Sep 16 2021, 8:13 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B124200: Diff 372952. Sep 16 2021, 8:14 AM

ruiling mentioned this in D109754: AMDGPU: Use -1/0 when copying from SCC to SGPR. Sep 16 2021, 8:14 AM

Hi @arsenm, what do you think of this idea?

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

In D109889#3012452, @foad wrote:

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

In D109889#3012754, @arsenm wrote:

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

My understanding is that the TargetLowering::setBooleanContents stuff only affects instruction selection, but at instruction selection time we keep all booleans as i1, so the point is moot. It's only after ISel in SILowerI1Copies that we expand booleans to full sgpr width, and yes we can decide to do whatever we want, but I think 0/-1 makes more sense than 0/1 because it matches how we represent divergent i1 values with a bit per lane.

In D109889#3012862, @foad wrote:

My understanding is that the TargetLowering::setBooleanContents stuff only affects instruction selection, but at instruction selection time we keep all booleans as i1, so the point is moot. It's only after ISel in SILowerI1Copies that we expand booleans to full sgpr width, and yes we can decide to do whatever we want, but I think 0/-1 makes more sense than 0/1 because it matches how we represent divergent i1 values with a bit per lane.

It's more visible in GlobalISel because it doesn't have the DAG scheduler to avoid intermediate values for copies between physical registers. In GlobalISel we can end up with an SReg_32 virtual register that is expected to be 0/1.

Using -1 is also misleading, since a true boolean value is also anded with exec

In D109889#3012879, @arsenm wrote:

It's more visible in GlobalISel because it doesn't have the DAG scheduler to avoid intermediate values for copies between physical registers. In GlobalISel we can end up with an SReg_32 virtual register that is expected to be 0/1.

Using -1 is also misleading, since a true boolean value is also anded with exec

Overall I think we should not have contexts where copy from SCC is being used as a broadcast to a vector boolean. I think these only arise as a side effect of the hacky way SIFixSGPRCopies rewrites the function one instruction at a time.

Using -1 is also misleading, since a true boolean value is also anded with exec

It depends on how you look at the problem.
We can formalize the boolean values stored in SGPRs like this: for uniform booleans, the bits corresponding to active lanes hold the effective value and the other bits are undefined; for divergent booleans, the bits corresponding to active lanes hold the effective value and the other bits are zero.
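
The two rules in this formalization can be written as predicates over a lane mask and the exec mask. This is a sketch under that reading only; `UniformRead`, `isWellFormedDivergent` and `readUniform` are hypothetical names, and treating an empty exec mask as "no defined value" is an assumption the thread does not address.

```cpp
#include <cassert>
#include <cstdint>

enum class UniformRead { False, True, NotUniform };

// Divergent boolean: the bits of inactive lanes must be zero.
bool isWellFormedDivergent(std::uint64_t bits, std::uint64_t exec) {
  return (bits & ~exec) == 0;
}

// Uniform boolean: only the bits of active lanes matter, and they must all
// agree. Bits of inactive lanes are ignored because they are undefined.
UniformRead readUniform(std::uint64_t bits, std::uint64_t exec) {
  if (exec == 0)
    return UniformRead::NotUniform; // assumption: no active lane, no value
  const std::uint64_t active = bits & exec;
  if (active == exec)
    return UniformRead::True;
  if (active == 0)
    return UniformRead::False;
  return UniformRead::NotUniform;
}
```

Under this model a 0/-1 value reads as a valid uniform boolean for any exec mask, whereas the value 0x1 reads as uniform true only when lane 0 is the sole active lane. An all-ones value is not a well-formed divergent boolean unless every lane is active.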

Overall I think we should not have contexts where copy from SCC is being used as a broadcast to a vector boolean. I think these only arise as a side effect of the hacky way SIFixSGPRCopies rewrites the function one instruction at a time.

I would like to keep using -1 for SelectionDAG, because it lets us generate efficient machine code for logical operations that mix uniform and divergent booleans.
With -1/0 we can use a single S_AND/OR/XOR_B64 to implement a logical operation between a uniform boolean and a divergent boolean. With the 0/1 representation currently used in GlobalISel, we need more instructions to achieve the same operation.
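
A rough illustration of the instruction-count argument, using plain 64-bit masks. The function names are hypothetical, and the widening step shown for the 0/1 case is an assumed stand-in for whatever extra instructions would actually be emitted; the thread does not spell that sequence out.

```cpp
#include <cassert>
#include <cstdint>

// 0/-1 encoding: the uniform boolean is already a full lane mask, so the
// logical AND is a single 64-bit scalar AND (an s_and_b64 in this model).
std::uint64_t andWithUniformMask(std::uint64_t divergent,
                                 std::uint64_t uniformMask) {
  return divergent & uniformMask;
}

// 0/1 encoding: the uniform boolean occupies only bit 0, so it has to be
// widened to a lane mask first (assumed extra step) before the AND.
std::uint64_t andWithUniformBit(std::uint64_t divergent,
                                std::uint32_t uniformBit) {
  const std::uint64_t widened = (uniformBit != 0) ? ~0ull : 0ull;
  return divergent & widened;
}
```

Skipping the widening step and computing `divergent & 1` directly would keep only lane 0 of the divergent operand, which is the same class of bug this revision fixes.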

mloud added a subscriber: mloud. Sep 25 2021, 11:42 AM

ruiling retitled this revision from "AMDGPU: Lower one copy from SCC early for SelectionDAG" to "AMDGPU: Broadcast scalar boolean to vector boolean explicitly". Sep 27 2021, 7:26 PM

ruiling edited the summary of this revision.

ruiling added a reviewer: rampitec.

I think this is OK given that it fixes a bug, and it moves us in the direction of generating explicit bit-broadcasting code instead of relying on the behaviour of i1 COPY instructions.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
Line 4176: Maybe remember the reg size calculated on line 4141, and reuse it here?

This revision is now accepted and ready to land. Sep 28 2021, 3:16 AM

Overall I think we should not have contexts where copy from SCC is being used as a broadcast to a vector boolean. I think these only arise as a side effect of the hacky way SIFixSGPRCopies rewrites the function one instruction at a time.

In the case given as an example, yes. As soon as I finish divergence-driven isel for ISD::SELECT, this will turn into the following:

s_addc_u32 s4, s6, 0
s_cselect_b32 s4, s4, 0

But we still have other cases that will persist regardless of whether the SIFixSGPRCopies hackery is present.

Ruiling gave me a good example in our last discussion on this topic:
What about using a floating-point compare result in a bitwise operation?

	s_cmp_ge_u32 s0, s1
	s_cselect_b64 s[0:1], -1, 0
	v_cmp_nlt_f32_e32 vcc, v1, v2
	s_and_b64 vcc, vcc, s[0:1]
	v_cndmask_b32_e32 v1, v2, v1, vcc

Both comparisons are uniform but we still have to broadcast SCC.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

Could you please explain why we chose ZeroOrOneBooleanContent over ZeroOrNegativeOneBooleanContent?
Is there any strong reason? For a divergent target the latter looks more natural. I remember we already had the same discussion, but I was not persuaded :)

Now we have a set of hacks intended to fix the consequences of that decision.

The change LGTM.
It follows the currently accepted way of copying SCC,
and it does nothing except replace the one incorrect SCC copy with the correct one.

remember wave-size and reuse

Harbormaster completed remote builds in B126327: Diff 375882. Sep 29 2021, 7:28 AM

I will submit this version tomorrow.

This revision was landed with ongoing or failed builds. Sep 29 2021, 7:19 PM

Closed by commit rG52785989e95d: AMDGPU: Broadcast scalar boolean to vector boolean explicitly (authored by ruiling).

This revision was automatically updated to reflect the committed changes.

ruiling added a commit: rG52785989e95d: AMDGPU: Broadcast scalar boolean to vector boolean explicitly.

Revision Contents

Path                                                               Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp                          14 lines
llvm/test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll    6 lines

Diff 376090

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

(4,132 lines not shown; enclosing context: case AMDGPU::S_SUB_CO_PSEUDO: {)
     Register RegOp2 = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
     if (TRI->isVectorRegister(MRI, Src2.getReg())) {
       BuildMI(*BB, MII, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32), RegOp2)
           .addReg(Src2.getReg());
       Src2.setReg(RegOp2);
     }

     const TargetRegisterClass *Src2RC = MRI.getRegClass(Src2.getReg());
-    if (TRI->getRegSizeInBits(*Src2RC) == 64) {
+    unsigned WaveSize = TRI->getRegSizeInBits(*Src2RC);
+    assert(WaveSize == 64 || WaveSize == 32);
+
+    if (WaveSize == 64) {
       if (ST.hasScalarCompareEq64()) {
         BuildMI(*BB, MII, DL, TII->get(AMDGPU::S_CMP_LG_U64))
             .addReg(Src2.getReg())
             .addImm(0);
       } else {
         const TargetRegisterClass *SubRC =
             TRI->getSubRegClass(Src2RC, AMDGPU::sub0);
         MachineOperand Src2Sub0 = TII->buildExtractSubRegOrImm(
(13 lines not shown; enclosing context: case AMDGPU::S_SUB_CO_PSEUDO: {)
     } else {
       BuildMI(*BB, MII, DL, TII->get(AMDGPU::S_CMPK_LG_U32))
           .addReg(Src2.getReg())
           .addImm(0);
     }

     BuildMI(*BB, MII, DL, TII->get(Opc), Dest.getReg()).add(Src0).add(Src1);

-    BuildMI(*BB, MII, DL, TII->get(AMDGPU::COPY), CarryDest.getReg())
-        .addReg(AMDGPU::SCC);
+    unsigned SelOpc =
+        (WaveSize == 64) ? AMDGPU::S_CSELECT_B64 : AMDGPU::S_CSELECT_B32;
+
+    BuildMI(*BB, MII, DL, TII->get(SelOpc), CarryDest.getReg())
+        .addImm(-1)
+        .addImm(0);

     MI.eraseFromParent();
     return BB;
   }
   case AMDGPU::SI_INIT_M0: {
     BuildMI(*BB, MI.getIterator(), MI.getDebugLoc(),
             TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)
         .add(MI.getOperand(0));
     MI.eraseFromParent();
(8,221 lines not shown)
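
The effect of the new lowering on the carry-out register can be modelled outside LLVM as follows. This is a standalone sketch: `SelectOpcode`, `pickSelectOpcode` and `carryOutMask` are hypothetical stand-ins and are not the AMDGPU:: opcode constants or any LLVM API.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for the two opcodes the patch chooses between.
enum class SelectOpcode { CSelectB32, CSelectB64 };

// Mirrors the wave-size check in the patch: a 64-bit select for wave64,
// a 32-bit select for wave32.
SelectOpcode pickSelectOpcode(unsigned waveSize) {
  assert(waveSize == 32 || waveSize == 64);
  return waveSize == 64 ? SelectOpcode::CSelectB64 : SelectOpcode::CSelectB32;
}

// Value the carry-out register holds after `s_cselect_b{32,64} dst, -1, 0`:
// all lane bits set when SCC is 1, all clear when SCC is 0.
std::uint64_t carryOutMask(bool scc, unsigned waveSize) {
  const std::uint64_t allOnes =
      pickSelectOpcode(waveSize) == SelectOpcode::CSelectB64 ? ~0ull
                                                             : 0xFFFFFFFFull;
  return scc ? allOnes : 0ull;
}
```

For example, `carryOutMask(true, 32)` yields `0xFFFFFFFF`, which corresponds to the `s_cselect_b32 s6, -1, 0` line in the updated GFX10 test output.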

llvm/test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll

(9 lines not shown)
 ; GFX7-NEXT: s_mov_b64 s[4:5], 0
 ; GFX7-NEXT: s_load_dword s6, s[4:5], 0x0
 ; GFX7-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX7-NEXT: v_add_i32_e64 v0, s[4:5], s6, s6
 ; GFX7-NEXT: s_or_b32 s4, s4, s5
 ; GFX7-NEXT: s_cmp_lg_u32 s4, 0
 ; GFX7-NEXT: s_addc_u32 s4, s6, 0
 ; GFX7-NEXT: v_mov_b32_e32 v1, s4
-; GFX7-NEXT: s_cselect_b64 vcc, 1, 0
+; GFX7-NEXT: s_cselect_b64 vcc, -1, 0
 ; GFX7-NEXT: s_cmp_gt_u32 s6, 31
 ; GFX7-NEXT: v_cndmask_b32_e32 v1, 0, v1, vcc
 ; GFX7-NEXT: s_cselect_b64 vcc, -1, 0
 ; GFX7-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc
 ; GFX7-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: s_add_co_select_user:
 ; GFX9: ; %bb.0: ; %bb
 ; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX9-NEXT: s_mov_b64 s[4:5], 0
 ; GFX9-NEXT: s_load_dword s6, s[4:5], 0x0
 ; GFX9-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX9-NEXT: v_add_co_u32_e64 v0, s[4:5], s6, s6
 ; GFX9-NEXT: s_cmp_lg_u64 s[4:5], 0
 ; GFX9-NEXT: s_addc_u32 s4, s6, 0
 ; GFX9-NEXT: v_mov_b32_e32 v1, s4
-; GFX9-NEXT: s_cselect_b64 vcc, 1, 0
+; GFX9-NEXT: s_cselect_b64 vcc, -1, 0
 ; GFX9-NEXT: s_cmp_gt_u32 s6, 31
 ; GFX9-NEXT: v_cndmask_b32_e32 v1, 0, v1, vcc
 ; GFX9-NEXT: s_cselect_b64 vcc, -1, 0
 ; GFX9-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc
 ; GFX9-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: s_add_co_select_user:
 ; GFX10: ; %bb.0: ; %bb
 ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: s_mov_b64 s[4:5], 0
 ; GFX10-NEXT: s_load_dword s4, s[4:5], 0x0
 ; GFX10-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT: v_add_co_u32 v0, s5, s4, s4
 ; GFX10-NEXT: s_cmpk_lg_u32 s5, 0x0
 ; GFX10-NEXT: s_addc_u32 s5, s4, 0
-; GFX10-NEXT: s_cselect_b32 s6, 1, 0
+; GFX10-NEXT: s_cselect_b32 s6, -1, 0
 ; GFX10-NEXT: s_cmp_gt_u32 s4, 31
 ; GFX10-NEXT: v_cndmask_b32_e64 v1, 0, s5, s6
 ; GFX10-NEXT: s_cselect_b32 vcc_lo, -1, 0
 ; GFX10-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc_lo
 ; GFX10-NEXT: s_setpc_b64 s[30:31]
 bb:
 %i = load volatile i32, i32 addrspace(4)* null, align 8
 %i1 = add i32 %i, %i
(111 lines not shown)
