This seems incorrect, doesn't it? The truncation disappeared... (e.g., what if $sgpr0 is 0x10?)
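To spell out that hypothetical value: trunc i32 0x10 to i1 keeps only the low bit, giving i1 0, and zext i1 0 to i32 gives 0; a raw 32-bit copy of $sgpr0, on the other hand, keeps 0x10, which is nonzero.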
My current understanding of G_TRUNC is that it's a no-op and is supposed to always be legal. This is supposed to be the legalized MIR, so theoretically it was generated by something that knew the original argument was zero-extended from i1.
Where do you get this from? And if this is the case, what is the real trunc instruction lowered to? This smells extremely fishy to me.
Case in point: take this IR:
  define amdgpu_ps float @foo(i32 %a) {
    %cc = trunc i32 %a to i1
    %r = zext i1 %cc to i32
    %r.f = bitcast i32 %r to float
    ret float %r.f
  }
which becomes, before legalization:
  bb.1 (%ir-block.0):
    liveins: $vgpr0
    %0:_(s32) = COPY $vgpr0
    %1:_(s1) = G_TRUNC %0:_(s32)
    %2:_(s32) = G_ZEXT %1:_(s1)
    $vgpr0 = COPY %2:_(s32)
    SI_RETURN_TO_EPILOG $vgpr0
So it is very plain here that we cannot assume the high bits of the input to be zero.
Now maybe an exception can be made here due to something that happens earlier in legalization, but it does seem highly suspicious and it would be good to have a test all the way from IR to understand why this is happening -- having different semantics for the same opcode at different stages of the compilation is not cool.
After legalization, you get:
  %0:vgpr(s32) = COPY $vgpr0
  %5:sgpr(s32) = G_CONSTANT i32 1
  %6:vgpr(s32) = COPY %0(s32)
  %7:vgpr(s32) = COPY %5(s32)
  %4:vgpr(s32) = G_AND %6, %7
  $vgpr0 = COPY %4(s32)
  SI_RETURN_TO_EPILOG $vgpr0
The G_ZEXT here is what has the semantics of producing 0 in the high bits, not the G_TRUNC. G_TRUNC is a legalization artifact to get the sizes to match and keep the MIR valid at all points during legalization. This is the point that the section about G_ANYEXT and G_TRUNC in D62423 is getting at.
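In other words, the zero-high-bits guarantee is materialized during legalization as an explicit mask, roughly along these lines (schematic, hand-written MIR):

  %w:_(s32) = G_ANYEXT %c(s1)     ; high bits undefined
  %k:_(s32) = G_CONSTANT i32 1
  %z:_(s32) = G_AND %w, %k        ; this AND is what gives the zext its zero high bits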
But then following this logic, I still think that by analogy with G_ZEXT, the operation of COPY from s1 into vcc should have the semantics of ignoring the high bits of the "s1 which is really an s32". Since there's nothing in the MIR test which guarantees that the incoming high bits of $sgpr0 are 0, the resulting code needs to have some form of masking.
(As an aside, there's an argument to be made that the resulting code should really be an S_AND_B32 with 1 followed by an S_CSELECT_B64 writing to vcc instead of using the VALU, but we should probably clear up the semantics/correctness question first.)
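For reference, that SALU variant might look roughly like this (hand-written sketch, not compiler output):

  s_and_b32 s0, s0, 1        ; mask to the boolean bit; sets SCC = (result != 0)
  s_cselect_b64 vcc, -1, 0   ; materialize the lane mask in vcc from SCC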
For the purpose of the test, it doesn't matter what is really correct. It just needs to get an s1 value in a non-condition register bank. Theoretically there could have been an AssertZext equivalent, which was correctly deleted along with the G_AND the legalizer produced, turning it into a plain argument copy. The selector shouldn't need to second-guess whether what the legalizer/combiner did was semantically correct.
An alternative strategy I've been considering is to disallow SGPR/VGPR s1 values entirely, and to always require these to use SCC/VCC producers for them (which would be a lot simpler). Any s1 use would then get a COPY which will turn into S_CMP/V_CMP. I think I ran into some problems when I tried this about 6 months ago, but I don't remember all the details.
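Roughly, under that scheme (hypothetical MIR; the bank assignments and the exact compare are my guesses):

  %a:sgpr(s32) = COPY $sgpr0
  %b:vgpr(s32) = COPY $vgpr0
  %c:vgpr(s32) = COPY $vgpr1
  %cc:vcc(s1)  = G_TRUNC %a(s32)   ; s1 lives only in a condition bank,
                                   ; to be selected as S_CMP/V_CMP (plus masking)
  %r:vgpr(s32) = G_SELECT %cc(s1), %b, %c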
Okay, the possibility of an AssertZext is an interesting point. So let me try the other way around: What would the MIR at this stage look like to enforce an and?
In the sample IR I had it was the G_ZEXT which became an and, but a G_ZEXT to copy from sgpr(s1) to vcc(s1) doesn't really make sense since both are s1...
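For instance, maybe something like this (a hand-written guess at post-legalization MIR, not something the legalizer produces today):

  %0:sgpr(s32) = COPY $sgpr0
  %1:sgpr(s32) = G_CONSTANT i32 1
  %2:sgpr(s32) = G_AND %0, %1                    ; explicit masking of the high bits
  %3:sgpr(s32) = G_CONSTANT i32 0
  %4:vcc(s1)   = G_ICMP intpred(ne), %2(s32), %3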
To make this more concrete, the GlobalISel path currently generates incorrect code for:
  define amdgpu_ps float @foo(i32 inreg %a, i32 %b, i32 %c) {
    %cc = trunc i32 %a to i1
    %r = select i1 %cc, i32 %b, i32 %c
    %r.f = bitcast i32 %r to float
    ret float %r.f
  }
The non-GlobalISel path does the right thing (though I wonder whether it might be able to use scc instead of vcc, but that can be a separate discussion).
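For comparison, the correct lowering has roughly this shape (hand-written sketch, not verbatim compiler output):

  s_and_b32 s0, s0, 1                ; mask the incoming boolean
  v_cmp_ne_u32_e64 vcc, s0, 0        ; turn it into a lane mask
  v_cndmask_b32_e32 v0, v1, v0, vcc  ; vcc ? %b : %c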
One more thought:
> An alternative strategy I've been considering is to disallow SGPR/VGPR s1 values entirely, and to always require these to use SCC/VCC producers for them (which would be a lot simpler). Any s1 use would then get a COPY which will turn into S_CMP/V_CMP.
We really want to keep divergent booleans as lane masks, the operations for which are quite tricky to select correctly in the normal instruction selection flow when they cross basic block boundaries. Without s1, we'd have to first lower everything as vgpr s32 and then run an analysis that pattern-matches to essentially recover the s1 values in order to optimize them to lane masks, which seems a pretty roundabout and complex approach. So I do think there's a legitimate role for s1 values early in the backend. With this particular issue, we've unfortunately run into some tough questions about what s1 means in terms of its representation in an s32 sgpr.