This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Broadcast scalar boolean to vector boolean explicitly
ClosedPublic

Authored by ruiling on Sep 16 2021, 8:13 AM.

Details

Summary

This is used to fix wrong code generation of s_add_co_select_user in
test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll

s_addc_u32 s4, s6, 0
s_cselect_b64 vcc, 1, 0    <-- vcc set as 0x1 if SCC==1
v_mov_b32_e32 v1, s4
s_cmp_gt_u32 s6, 31
v_cndmask_b32_e32 v1, 0, v1, vcc

If the s_addc_u32 set SCC, then we will get value 0x1 in VCC.
The v_cndmask will do per thread selection with VCC as condition
register. As VCC only gets the first bit being set, only the first
thread/lane in destination register can get correct result if the
very first lane is active. In fact, we should broadcast the value to all
active lanes of the final register.

The idea here is doing this broadcast to vector boolean explicitly
instead of lowering it into a COPY from SCC which would be interpreted as
selecting between 0/1.

This is used to replace D109754.

Diff Detail

Event Timeline

ruiling created this revision.Sep 16 2021, 8:13 AM
ruiling requested review of this revision.Sep 16 2021, 8:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 16 2021, 8:13 AM

Hi @arsenm, what do you think of this idea?

foad added a comment.Sep 21 2021, 7:04 AM

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

foad added a comment.Sep 21 2021, 9:08 AM

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

My understanding is that the TargetLowering::setBooleanContents stuff only affects instruction selection, but at instruction selection time we keep all booleans as i1, so the point is moot. It's only after ISel in SILowerI1Copies that we expand booleans to full sgpr width, and yes we can decide to do whatever we want, but I think 0/-1 makes more sense than 0/1 because it matches how we represent divergent i1 values with a bit per lane.

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

My understanding is that the TargetLowering::setBooleanContents stuff only affects instruction selection, but at instruction selection time we keep all booleans as i1, so the point is moot. It's only after ISel in SILowerI1Copies that we expand booleans to full sgpr width, and yes we can decide to do whatever we want, but I think 0/-1 makes more sense than 0/1 because it matches how we represent divergent i1 values with a bit per lane.

It's more visible in GlobalISel because it doesn't have the DAG scheduler to avoid intermediate values for copies between physical registers. In globalisel we can end up with an SReg_32 virtual register that is expected to be 0/1

Using -1 is also misleading, since a true boolean value is also anded with exec

In SelectionDAG path, we are currently using -1/0 when copying from SCC.
But in GlobalISel path, we are requesting 1/0 when copying from SCC.

Is there a good reason for that difference? Is it related to TargetLowering::setBooleanContents? We always set ZeroOrOneBooleanContent for GCN subtargets.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

My understanding is that the TargetLowering::setBooleanContents stuff only affects instruction selection, but at instruction selection time we keep all booleans as i1, so the point is moot. It's only after ISel in SILowerI1Copies that we expand booleans to full sgpr width, and yes we can decide to do whatever we want, but I think 0/-1 makes more sense than 0/1 because it matches how we represent divergent i1 values with a bit per lane.

It's more visible in GlobalISel because it doesn't have the DAG scheduler to avoid intermediate values for copies between physical registers. In globalisel we can end up with an SReg_32 virtual register that is expected to be 0/1

Using -1 is also misleading, since a true boolean value is also anded with exec

Overall I think we should not have contexts where copy from SCC is being used as a broadcast to a vector boolean. I think these only arise as a side effect of the hacky way SIFixSGPRCopies rewrites the function instruction at a time

Using -1 is also misleading, since a true boolean value is also anded with exec

It depends on what do you think of the problem.
We can formalize the boolean values stored in SGPR like this: for uniform booleans, the bits corresponding to active lanes holding the effective value, other bits are undefined. for divergent booleans, the active lanes holding the effective value, other bits are zero.

Overall I think we should not have contexts where copy from SCC is being used as a broadcast to a vector boolean. I think these only arise as a side effect of the hacky way SIFixSGPRCopies rewrites the function instruction at a time

I would like to keep using -1 for SelectionDAG, the reason is that this is good for generating efficient machine code when there are logical operation with uniform booleans and divergent booleans.
In this way, we are free to just use an S_AND/OR/XOR_B64 to implement logical operations between uniform booleans and divergent booleans. But using 0/1 representation currently in GlobalISel, we need more instructions to achieve a logical operation between uniform booleans and divergent booleans.

mloud added a subscriber: mloud.Sep 25 2021, 11:42 AM
ruiling retitled this revision from AMDGPU: Lower one copy from SCC early for SelectionDAG to AMDGPU: Broadcast scalar boolean to vector boolean explicitly.Sep 27 2021, 7:26 PM
ruiling edited the summary of this revision. (Show Details)
ruiling added a reviewer: rampitec.
foad accepted this revision.Sep 28 2021, 3:16 AM

I think this is OK given that it fixes a bug, and it moves us in the direction of generating explicit bit-broadcasting code instead of relying on the behaviour of i1 COPY instructions.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4176

Maybe remember the reg size calculated on line 4141, and reuse it here?

This revision is now accepted and ready to land.Sep 28 2021, 3:16 AM

Overall I think we should not have contexts where copy from SCC is being used as a broadcast to a vector boolean. I think these only arise as a side effect of the hacky way SIFixSGPRCopies rewrites the function instruction at a time

In the case given as an example - yes. As soon as I finish divergence-driven isel for the ISD::SELECT, this will turn to the following:

s_addc_u32 s4, s6, 0
s_cselect_b32 s4, s4, 0

But we still have other cases that are going to persist irrelative to the SIFixSGPRCopies hackery presence.

Ruiling gave me a good example in our last discussion on this topic:
What is about using the floating-point compare result in the bitwise operation?

	s_cmp_ge_u32 s0, s1
	s_cselect_b64 s[0:1], -1, 0
	v_cmp_nlt_f32_e32 vcc, v1, v2
	s_and_b64 vcc, vcc, s[0:1]
	v_cndmask_b32_e32 v1, v2, v1, vcc

Both comparisons are uniform but we still have to broadcast SCC.

The problem is "COPY from SCC" by itself is not a semantically meaningful concept. We can make up whatever we want. I think ZeroOrOneBooleanContent is a better choice, since it's not fighting an uphill battle for optimization priority, and there really is only one bit. Places that semantically need to use -1 can emit the select directly.

Could you please explain, why we have chosen the ZeroOrOneBooleanContent over the ZeroOrNegativeOneBooleanContent?
Is there any strong reason? For the divergent target the latter looks more natural. I remember we already had the same discussion but I did not get persuaded :)

Now we have a set of hacks intended to fix the consequences of that decision.

alex-t accepted this revision.Sep 28 2021, 9:45 AM

The change LGTM
It follows the currently accepted way of copying SCC.
And it does nothing except replacing the one incorrect SCC copying with the correct one.

ruiling updated this revision to Diff 375882.Sep 29 2021, 7:28 AM

remember wave-size and reuse

ruiling marked an inline comment as done.Sep 29 2021, 7:30 AM

I will submit this version tomorrow.

This revision was landed with ongoing or failed builds.Sep 29 2021, 7:19 PM
This revision was automatically updated to reflect the committed changes.