This seems incorrect, doesn't it? The truncation disappeared... (e.g., what if $sgpr0 is 0x10?)
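To spell out that hypothetical value: trunc i32 0x10 to i1 keeps only the low bit, giving i1 0, and zext i1 0 to i32 gives 0; a raw 32-bit copy of $sgpr0, on the other hand, keeps 0x10, which is nonzero.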
My current understanding of G_TRUNC is that it's a no-op and is supposed to always be legal. This is supposed to be the legalized MIR, so theoretically it was generated by something that knew the original argument was zero-extended from i1.
Where do you get this from? And if this is the case, what is the real trunc instruction lowered to? This smells extremely fishy to me.
Case in point: take this IR:
  define amdgpu_ps float @foo(i32 %a) {
    %cc = trunc i32 %a to i1
    %r = zext i1 %cc to i32
    %r.f = bitcast i32 %r to float
    ret float %r.f
  }
which becomes, before legalization:
  bb.1 (%ir-block.0):
    liveins: $vgpr0
    %0:_(s32) = COPY $vgpr0
    %1:_(s1) = G_TRUNC %0:_(s32)
    %2:_(s32) = G_ZEXT %1:_(s1)
    $vgpr0 = COPY %2:_(s32)
    SI_RETURN_TO_EPILOG $vgpr0
So it is very plain here that we cannot assume the high bits of the input to be zero.
Now maybe an exception can be made here due to something that happens earlier in legalization, but it does seem highly suspicious and it would be good to have a test all the way from IR to understand why this is happening -- having different semantics for the same opcode at different stages of the compilation is not cool.
After legalization, you get:
  %0:vgpr(s32) = COPY $vgpr0
  %5:sgpr(s32) = G_CONSTANT i32 1
  %6:vgpr(s32) = COPY %0(s32)
  %7:vgpr(s32) = COPY %5(s32)
  %4:vgpr(s32) = G_AND %6, %7
  $vgpr0 = COPY %4(s32)
  SI_RETURN_TO_EPILOG $vgpr0
The G_ZEXT here is what has the semantics of producing 0 in the high bits, not the G_TRUNC. G_TRUNC is a legalization artifact to get the sizes to match and keep the MIR valid at all points during legalization. This is the point that the section about G_ANYEXT and G_TRUNC in D62423 is getting at.
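In other words, the zero-high-bits guarantee is materialized during legalization as an explicit mask, roughly along these lines (schematic, hand-written MIR):

  %w:_(s32) = G_ANYEXT %c(s1)     ; high bits undefined
  %k:_(s32) = G_CONSTANT i32 1
  %z:_(s32) = G_AND %w, %k        ; this AND is what gives the zext its zero high bits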
But then following this logic, I still think that by analogy with G_ZEXT, the operation of COPY from s1 into vcc should have the semantics of ignoring the high bits of the "s1 which is really an s32". Since there's nothing in the MIR test which guarantees that the incoming high bits of $sgpr0 are 0, the resulting code needs to have some form of masking.
(As an aside, there's an argument to be made that the resulting code should really be an S_AND_B32 with 1 followed by an S_CSELECT_B64 writing to vcc instead of using the VALU, but we should probably clear up the semantics/correctness question first.)
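For reference, that SALU variant might look roughly like this (hand-written sketch, not compiler output):

  s_and_b32 s0, s0, 1        ; mask to the boolean bit; sets SCC = (result != 0)
  s_cselect_b64 vcc, -1, 0   ; materialize the lane mask in vcc from SCC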
For the purpose of the test, it doesn't matter what is really correct. It just needs to get an s1 value in a non-condition register bank. Theoretically there could have been an AssertZext equivalent, which was correctly deleted along with the G_AND the legalizer produced, turning it into a plain argument copy. The selector shouldn't need to second-guess whether what the legalizer/combiner did was semantically correct.
An alternative strategy I've been considering is to disallow SGPR/VGPR s1 values entirely, and to always require these to use SCC/VCC producers for them (which would be a lot simpler). Any s1 use would then get a COPY which will turn into S_CMP/V_CMP. I think I ran into some problems when I tried this about 6 months ago, but I don't remember all the details.
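Roughly, under that scheme (hypothetical MIR; the bank assignments and the exact compare are my guesses):

  %a:sgpr(s32) = COPY $sgpr0
  %b:vgpr(s32) = COPY $vgpr0
  %c:vgpr(s32) = COPY $vgpr1
  %cc:vcc(s1)  = G_TRUNC %a(s32)   ; s1 lives only in a condition bank,
                                   ; to be selected as S_CMP/V_CMP (plus masking)
  %r:vgpr(s32) = G_SELECT %cc(s1), %b, %c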
Okay, the possibility of an AssertZext is an interesting point. So let me try the other way around: What would the MIR at this stage look like to enforce an and?
In the sample IR I had it was the G_ZEXT which became an and, but a G_ZEXT to copy from sgpr(s1) to vcc(s1) doesn't really make sense since both are s1...
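For instance, maybe something like this (a hand-written guess at post-legalization MIR, not something the legalizer produces today):

  %0:sgpr(s32) = COPY $sgpr0
  %1:sgpr(s32) = G_CONSTANT i32 1
  %2:sgpr(s32) = G_AND %0, %1                    ; explicit masking of the high bits
  %3:sgpr(s32) = G_CONSTANT i32 0
  %4:vcc(s1)   = G_ICMP intpred(ne), %2(s32), %3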
To make this more concrete, the GlobalISel path currently generates incorrect code for:
  define amdgpu_ps float @foo(i32 inreg %a, i32 %b, i32 %c) {
    %cc = trunc i32 %a to i1
    %r = select i1 %cc, i32 %b, i32 %c
    %r.f = bitcast i32 %r to float
    ret float %r.f
  }
The non-GlobalISel path does the right thing (though I wonder whether it might be able to use scc instead of vcc, but that can be a separate discussion).
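For comparison, the correct lowering has roughly this shape (hand-written sketch, not verbatim compiler output):

  s_and_b32 s0, s0, 1                ; mask the incoming boolean
  v_cmp_ne_u32_e64 vcc, s0, 0        ; turn it into a lane mask
  v_cndmask_b32_e32 v0, v1, v0, vcc  ; vcc ? %b : %c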
One more thought:
> An alternative strategy I've been considering is to disallow SGPR/VGPR s1 values entirely, and to always require these to use SCC/VCC producers for them (which would be a lot simpler). Any s1 use would then get a COPY which will turn into S_CMP/V_CMP.
We really want to keep divergent booleans as lane masks, the operations for which are quite tricky to select correctly in the normal instruction selection flow when they cross basic block boundaries. Without s1, we'd have to first lower everything as vgpr s32 and then run an analysis that pattern-matches to essentially recover the s1 values in order to optimize them to lane masks, which seems a pretty roundabout and complex approach. So I do think there's a legitimate role for s1 values early in the backend. With this particular issue, we've unfortunately run into some tough questions about what s1 means in terms of its representation in an s32 sgpr.