This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Eliminate the need to create a zero vector by reusing the mask.
ClosedPublic

Authored by jonpa on May 14 2020, 1:04 AM.

Details

Summary

Try to avoid creating VGBMs by reusing the permutation mask if it contains a zero. If the first byte was into (any byte of) a zero vector, then the first byte of the mask can become zero and reused by putting the mask also as the first operand. If there was a first-byte use of the other source operand, then that zero index can be reused if the mask is placed as the second operand. The first case will be important probably with the upcoming patch for zero_extend_vector_inreg, while the second case affects just a few cases on SPEC'17. The patch could of course be simplified if the second case was skipped...

There is also the case of Intrinsic::s390_vperm, but it seems more complicated to wait until DAGCombine to handle this, since then we have to extract the mask back from a constant-pool or maybe even a SystemZVectorConstant... Besides, the user should be able to do that on his own also, right?

This doesn't seem to have that big of an effect on SPEC'17:

vperm          :                21136                21110      -26
vgbm           :                11385                11368      -17
vl             :               109414               109411       -3
larl           :               371345               371347       +2

...

, but it probably becomes more noticeable with the zero_extend_vector_inreg patch... Should it be committed separately first, or rather incorporated into that other patch?

It seems that now MachineCSE can remove a few VPERMs also. At least one case was because instead of using an undef source operand, the mask was used (it doesn't help any to replace an undef Ops[1] with Op2 on the last line of getGeneralPermuteNode(), which I first thought). The test case I have reduced (not included) shows that MCSE on trunk fails to remove the second vperm if even though the instructions are near identical. The only difference to the success with this patch is that instead of the reused mask, there is an IMPLICIT_DEF. It looks to me that this is a minor deficiency of MCSE (Two different vregs defined by IMPLICIT_DEF should not stop CSE, I would think).

isZeroOrUndefVector(): The check for the undef vector used to be beneficial for the zero_extend_vector_inreg patch, but the way it looks now (just using a single unpack), it is not needed for anymore (NFC). I think it could be removed or we can keep it here to get the few less vperms...

Two tests - one for each case handled. Not quite sure what happened with vector-constrained-fp-intrinsics.ll - I don't understand why v0 is no longer used, but it seems that this patch begins to change things when the second operand of the VPERM is undefined.

Diff Detail

Event Timeline

jonpa created this revision.May 14 2020, 1:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2020, 1:04 AM

It seems that now MachineCSE can remove a few VPERMs also. At least one case was because instead of using an undef source operand, the mask was used (it doesn't help any to replace an undef Ops[1] with Op2 on the last line of getGeneralPermuteNode(), which I first thought). The test case I have reduced (not included) shows that MCSE on trunk fails to remove the second vperm if even though the instructions are near identical. The only difference to the success with this patch is that instead of the reused mask, there is an IMPLICIT_DEF. It looks to me that this is a minor deficiency of MCSE (Two different vregs defined by IMPLICIT_DEF should not stop CSE, I would think).

isZeroOrUndefVector(): The check for the undef vector used to be beneficial for the zero_extend_vector_inreg patch, but the way it looks now (just using a single unpack), it is not needed for anymore (NFC). I think it could be removed or we can keep it here to get the few less vperms...

Two tests - one for each case handled. Not quite sure what happened with vector-constrained-fp-intrinsics.ll - I don't understand why v0 is no longer used, but it seems that this patch begins to change things when the second operand of the VPERM is undefined.

It seems to me these are related. Currently, for a VPERM that only needs a single input, the second input is set to UNDEF, which gets lowered to an IMPLICIT_DEF at the MI level, which the register allocator just replaced by some (random?) register. In the test case, it chooses %v0, which has nothing to do with the VPERM, and is in fact already used for an unrelated operation. This is not really a good idea, because that creates a register dependency on the instruction that is really unnecessary ...

I think for VPERMs that only need a single input, we should simply use the same register for both input slots. That way, the contents are well-defined, and there are no spurious dependencies. Also, it should fix the CSE problem, I think. (Using the other input seems preferable to using the index vector ...)

If we do that, it follows that a completely undefined input should *not* be treated as a zero vector for this optimization. (However, a not fully undefined vector that is a mixture of zero elements and undefined elements can be treated as zero vector. Not sure if this makes much of a difference in practice.)

Note that fixing the handling of an UNDEF second input can (and probably should) be done as a separate patch, before this optimization is merged.

jonpa updated this revision to Diff 264691.May 18 2020, 11:46 AM

Note that fixing the handling of an UNDEF second input can (and probably should) be done as a separate patch, before this optimization is merged.

Committed as 31ecef7.

If we do that, it follows that a completely undefined input should *not* be treated as a zero vector for this optimization. (However, a not fully undefined vector that is a mixture of zero elements and undefined elements can be treated as zero vector. Not sure if this makes much of a difference in practice.)

Removed the check for the undef vector. It also turns out that ISD::isBuildVectorAllZeros() is already accepting some undef elements, as long as they are not all undef, which is exactly what we want.

Patch rebased. The second case (reusing the index into the first byte of the non-zero vector), is eliminating now 13 VGBM.s on SPEC'17, but is not needed for the zero extend inreg patch...

Tests merged into vec-perm-14.ll, which was just committed...

uweigand accepted this revision.May 18 2020, 12:16 PM

LGTM, thanks!

This revision is now accepted and ready to land.May 18 2020, 12:16 PM
This revision was automatically updated to reflect the committed changes.