This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Expand vector zero-extend into a shuffle.
ClosedPublic

Authored by jonpa on Apr 20 2020, 5:44 AM.

Details

Summary

This is a preparatory patch for generating more VLLEZ instructions, although it also has some benefits on its own.

lowerExtendVectorInreg() no longer unpacks ZERO_EXTEND_VECTOR_INREG always, but instead expands to a shuffle with a zero vector.

GeneralShuffle::getNode() now detects and handles cases which involves a zero vector. An advantage to this is that it is now possible to tune when a vperm or unpacking should be used, depending on how many unpacks are needed.
It seemed to work better to use the new UnpackInfo struct to work on the byte level directly, rather than trying to detect byte sequences and from that deduce the unpacks needed.

It is also necessary to handle the cases of a shuffled input source, which must be done cleverly to avoid generating worse code.

An alternative to this - for the purpose of VLLEZ - might be to only expand those potential candidates to shuffles. Expanding all of them reduces the number of vperms generally since more cases can now be handled with unpacks, but there is also need to find the zero vector and make sure to handle it last in getNode().

When realizing that the number of permutes increased at one point I attempted an algorithm to find different orders of shuffling in getNode(). This was however not a simple thing to do since the DAG nodes could then not be created when evaluating an order of the shuffles. I then however found out that putting the new zero vector last handled this problem and now vperms are same or less on all files with this.

On Imagick (this time without ffp-contract=fast), this gave a good improvement *without* VLLEZ. Using the other vllez-patch directly (https://reviews.llvm.org/D76275) gave a slightly bigger improvement. Using this patch and also generating VLLEZs gave the exact same improvement as the first patch. So looking at just this benchmark, I see that this patch is beneficial on its own, but if generating VLLEZ it doesn't really play a role.

Apart from Imagick, I see a 2.5% improvement on x264 if using -max-unpackops=2, which is interesting.

Other notes:

lowerExtendVectorInreg(): All extended bytes are defined to 0, even if the original element was -1. Is this needed, or could it be assumed that a zero-extended undefined element have all bytes undefined as well? Not sure if it matters.

unpacking with a permute: There are a few (140) cases with only two defined elements, where the VPERM of Best.Mask can be done with a VREPI. Most of those cases however required 3 unpacks after the VPERM, and just a dozen could be done with 2 unpacks after. I removed that check since it also did not seem to improve any benchmarks.

New tests: fun0: This particular set of values (i8 compares -> i32) seem to not work on trunk while others do.

Use -debug-only=systemz-lower to see what is going on.

Diff Detail

Event Timeline

jonpa created this revision.Apr 20 2020, 5:44 AM
jonpa edited the summary of this revision. (Show Details)Apr 20 2020, 5:48 AM

Hi Jonas, I haven't looked into everything in detail, but first one fundmental question: My understanding was that the current GeneralShuffle code would detect the shuffle you generate for a zero-extend as a case of a MERGE. Later, combineMERGE would detect that one input of the MERGE is a zero vector, and replace the merge by an UNPACKL.

Is there some reason why this doesn't work for you? Why do you have to create the UNPACKL already in GeneralShuffle?

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
5247

Now that this routine shares no code at all between the sign- and zero-extend case, I think it would make more sense to just have two different routines lowerSIGN_EXTEND_VECTOR_INREG and lowerZERO_EXTEND_VECTOR_INREG.

jonpa updated this revision to Diff 260296.Apr 27 2020, 6:06 AM
jonpa marked an inline comment as done.

Hi Jonas, I haven't looked into everything in detail, but first one fundmental question: My understanding was that the current GeneralShuffle code would detect the shuffle you generate for a zero-extend as a case of a MERGE. Later, combineMERGE would detect that one input of the MERGE is a zero vector, and replace the merge by an UNPACKL.

Is there some reason why this doesn't work for you? Why do you have to create the UNPACKL already in GeneralShuffle?

That will only work in cases where only one unpack is needed, and only if the input does not need any permutation of any kind.

Patch changed to use increasing indexes into the zero vector so that the simple case can still be matched with a MERGE_HIGH.

I am not that sure about the actual best tuning yet, but so far I have aimed to reduce the number of VPERMs. I thought originally that using isByteZero() to check if a single byte was known zero would make for detecting more cases where only one element or byte of was zero. This way, any of the operands could have a zero byte which would be detected, like a high 0 byte in a wider immediate operand, for instance. I see now however that this is no improvement compared to simply finding a zero or undef vector input, so I have changed the patch to to this instead.

lowerExtendVectorInreg() split into lowerSIGN_EXTEND_VECTOR_INREG() and lowerZERO_EXTEND_VECTOR_INREG().

Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2020, 6:06 AM
jonpa updated this revision to Diff 261236.Apr 30 2020, 8:14 AM

It certainly seems to be an improvement on two benchmarks to do just either one unpack or else a vperm (and not multiple unpacks). In fact, i505.mcf actually regressed (2.5%) when doing two unpacks instead of a vperm (with -ffp-contract=fast). So the initial idea of reducing the number of vperms seems to have been proven wrong - it is better to have a single vperm on the critical path rather than multiple unpacks.

Patch simplified to do just one unpack whenever possible.

Tests updated.

It certainly seems to be an improvement on two benchmarks to do just either one unpack or else a vperm (and not multiple unpacks). In fact, i505.mcf actually regressed (2.5%) when doing two unpacks instead of a vperm (with -ffp-contract=fast). So the initial idea of reducing the number of vperms seems to have been proven wrong - it is better to have a single vperm on the critical path rather than multiple unpacks.

Huh, interesting. That's good to know, and certainly makes the code simpler as well.

I'm wondering: unless I'm missing something, there's still one specific case where you generate a vperm followed by an unpack (the case where you already had a permute as source). Wouldn't it be preferable to just use a single vperm there as well?

jonpa added a comment.May 1 2020, 12:01 AM

I'm wondering: unless I'm missing something, there's still one specific case where you generate a vperm followed by an unpack (the case where you already had a permute as source). Wouldn't it be preferable to just use a single vperm there as well?

This is the case where the input is a permute that has no other users, which means that it's being replaced. So instead of vperm + vperm, we get a vperm + unpack, which follows the same reasoning of replacing a vperm with a single unpack.

For example, in the case with three source ops, where A and B first are combined with a permute, and now AB and C are to be combined: instead of permuting AB and C, the AB permute is changed so that AB and C can be unpacked.

This now happens 100 times on SPEC'17:

                                 patch  <>  "No permute replacement"
vperm          :                23803                23903     +100
vl             :               109720               109797      +77
vuplhb         :                  151                  101      -50
vuplhh         :                  177                  127      -50
vst            :                85032                85078      +46
larl           :               371586               371616      +30
vgbm           :                11637                11644       +7
mvi            :                23380                23381       +1

I'm wondering: unless I'm missing something, there's still one specific case where you generate a vperm followed by an unpack (the case where you already had a permute as source). Wouldn't it be preferable to just use a single vperm there as well?

This is the case where the input is a permute that has no other users, which means that it's being replaced. So instead of vperm + vperm, we get a vperm + unpack, which follows the same reasoning of replacing a vperm with a single unpack.

For example, in the case with three source ops, where A and B first are combined with a permute, and now AB and C are to be combined: instead of permuting AB and C, the AB permute is changed so that AB and C can be unpacked.

Ah, that's what I missed: the case where the first permute has already two different inputs!

Now I'm wondering whether this couldn't also apply in other cases, beyond just perm + unpack. Would it be possible to handle all cases by recognizing the case where the source of a permute is itself a (single-use) permute early, e.g. in GeneralShuffle::add ? Hmm, reading that just now, it seems there is already such code:

else if (Op.getOpcode() == ISD::VECTOR_SHUFFLE && Op.hasOneUse()) {
  // See whether the bytes we need come from a contiguous part of one
  // operand.

So now I'm wondering instead: why does your new code ever trigger in the first place; why isn't this already handled in GeneralShuffle::add?

jonpa added a comment.May 5 2020, 4:51 AM

I ran SPEC'2006 as well once with this new patch and found to my surprise that on z14 I saw a big (7%) regression with this on perlbench which I have never seen before, while on z15 everything was as expected. Even though I was not sure if this was a trustable result (given that perl'17 looked ok), I looked into it a bit. I found that things had changed as expected only - multiple unpacks had become vperms instead. Two things were obvious while looking at the generated code:

  1. I see a larl+vl+vgbm inside a single block loop prior to the vperm. Those three instructions are loop-invariant and should (and could) be hoisted to before the loop. It seems that the target instructions look ok (the load is tagged with a 'constant-pool' MO for instance), but MachineLICM (Post-RA) is being way too conservative as regards to when a physreg def cannot be moved. Post-RA, it is keeping track of physreg clobbers for the entire outer loop and then checks if the inner loop clobbers any of those regs, it seems. That means that our LARL in a single-block loop inside a bigger loop cannot be moved out of the inner loop even though that would make sense. MachineLICM is run also pre-RA, but then only outermost loops are visited.

However, this was easy to do in the assembly file, but hoisting those instructions out of the loop proved to be entirely non-effective (no improvement).

  1. one more vperm mask in the constant pool adds 16 bytes in memory, which may possibly lead to some bad side-effects, which however seems unlikely to me since inserting that constant in the assembly does not change the function as such - it seems that the constant pool area is separate from the function using the constant. I also tried inserting a dummy use along with the 16-byte constant but that did not slow down the faster version using unpacks.

This is still strange to me and it seems unpredictable to sort out so it is probably not worth further attention for the moment...

jonpa added a comment.May 5 2020, 5:00 AM

So now I'm wondering instead: why does your new code ever trigger in the first place; why isn't this already handled in GeneralShuffle::add?

I think that be two different cases, where my handling is addressing a *new* permute resulting from combining two operands, when trying to combine (unpack) a third source operand with that first (new) permute.

So now I'm wondering instead: why does your new code ever trigger in the first place; why isn't this already handled in GeneralShuffle::add?

I think that be two different cases, where my handling is addressing a *new* permute resulting from combining two operands, when trying to combine (unpack) a third source operand with that first (new) permute.

Ah, I see. So this permute is one that had just been generated via getGeneralPermuteNode earlier in the loop? It's a bit unfortunate to first create that ISD node and then right away delete it again.

Maybe one option to try would be to do the check whether this permutation looks like an unpack (possibly requiring a permutation) at the top level *before* the loop. If this is true, then apply the permutation (in reverse) to the Bytes array, remove the zerovec from the Ops list, and run through the loop. At the end, remember to actually generate the unpack node. This would also remove the need to special-case handing the zerovec through the loop.

The whole operation would be worthwhile only if by performing the unpack we can remove the zerovec (i.e. it isn't still used elsewhere), and if by removing the zerovec the depth of the resulting tree (i.e. the critical path length) is also reduced. The depth is the log2 of the number of operands (rounded up), so if reducing the zerovec takes us from 3 to 2 ops, it is worthwhile, but if it takes us from 2 to 1 ops (or from 4 to 3 ops), then it probably is not, since adding the unpack would then just add one more instruction to the critical path.

Does this make sense?

jonpa updated this revision to Diff 263936.May 14 2020, 1:36 AM

Patch updated per suggestions.

Ah, I see. So this permute is one that had just been generated via getGeneralPermuteNode earlier in the loop? It's a bit unfortunate to first create that ISD node and then right away delete it again.

Yes, that's the permute that was replaced.

Maybe one option to try would be to do the check whether this permutation looks like an unpack (possibly requiring a permutation) at the top level *before* the loop. If this is true, then apply the permutation (in reverse) to the Bytes array, remove the zerovec from the Ops list, and run through the loop. At the end, remember to actually generate the unpack node. This would also remove the need to special-case handing the zerovec through the loop.

Ah, yes... :-) Now that it seems clear that we want to use just a single unpack, it is simpler to do it in this more integrated way in getNode().

The whole operation would be worthwhile only if by performing the unpack we can remove the zerovec (i.e. it isn't still used elsewhere), and if by removing the zerovec the depth of the resulting tree (i.e. the critical path length) is also reduced. The depth is the log2 of the number of operands (rounded up), so if reducing the zerovec takes us from 3 to 2 ops, it is worthwhile, but if it takes us from 2 to 1 ops (or from 4 to 3 ops), then it probably is not, since adding the unpack would then just add one more instruction to the critical path.

That's a very good point - the idea behind building the tree of operations in the loop must be to avoid serialization so we shouldn't increase the depth with a trailing unpack. I added a check for this in case of >2 operands. For the case of a single source node shuffled with a zero vector (2 ops), there is instead a check that the unpack is enough by itself.

In rewriting this now I saw a (previous patch version) case where moving the zero-vector to last in Ops resulted in a vpkg (PACK) instead of a vperm. I am not sure if that was just a rare case, or if there are other operations than unpack that could also replace vperm if zero vector is put last. That is a separate issue from this, though, I think. Going to this version of the patch there were some minor changes which I think are the variations that arise from rearranging the Bytes vector when preparing for the unpack.

I still see two nice improvements over night: imagick:13% and x264: 4%. :-)

jonpa updated this revision to Diff 264860.May 19 2020, 4:56 AM

Patch rebased.

I updated the tests and noticed that not all VGBMs where going away. In order to achieve that I added look-through of BITCAST in isZeroVector(). This changed a few files on SPEC'17:

In total 11 less VGBMs on SPEC, over 10 files, but also some unexpected changes:

  • morphology.s: One *more* vgbm than before. Regalloc seems to now spill the big VGBM interval now and as a result the VGBM is rematerialized in two places which makes for actually one more VGBM with this. Here the VGBM was used also by a VFMAXDB. The VGBM was available over many blocks.
  • fx.s: four more new and different VPERM masks (LCPI constants) in the text section. This seems to be because four masks where reused before, while now they are not. This is because only one of of the users of that mask had a zero vector operand. With the zero vector this mask is used:

cp#8: <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 16, i8 17, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 18, i8 19>

Without the zero vector, this mask instead:
cp#2: <i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 16, i8 17, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 18, i8 19>

This file also has a few more spills/reloads for some reason, probably because of the increased presence of mask registers. These side-effects are just noticed in this particular file...

It may be possible to treat this as a global problem where the goal is to minimize the number of LCPI vector constants needed, considering that removing the use of the zero vector can instead cause a new mask to be needed where as before an already existing mask could be reused. Not sure how needed or easy that would be...

In addition to this, some NFC changes:

  • Removed the check for the undef vector also in this patch.
  • Minor fixes such as variable renaming.
  • Tests no longer need VGBMs :-)
                               master      <>        patch
vuplhh         :                 3323                  241    -3082
vuplhb         :                 3137                  220    -2917
vperm          :                20996                23777    +2781
vpkg           :                 1535                  750     -785
vl             :               109474               109943     +469
vuplhf         :                  778                  329     -449
vmrlg          :                 4481                 4088     -393
vsldb          :                  566                  191     -375
larl           :               371292               371627     +335
cghsi          :                34550                34393     -157
lg             :              1092297              1092153     -144

...

                               master      <>        patch
Spill|Reload   :               683254               683105     -149
Copies         :              1036372              1036441      +69
grep "LCPI.*:" :                17540                17741     +203

About the LCPI vector constants: It seems that those 200 extra constants I noticed were not really due to the optimization of eliminating the zero-vector for PERMUTE. That optimization eliminates 239 VGBMs on SPEC, while adding only 5 more LCPIs. All the other new LCPIs caused by this patch seem to be regular VPERM masks for zero extension that are duplicated once for each function. In other words, I see the same mask in the constant pool with one copy for each function. I looked at this at file scope only (.s / .o), and I am hoping that either that is not a problem, or maybe that the linker optimizes this?

Over night results look good both with and w/out the zero-vector optimization (not much difference).

jonpa marked an inline comment as done.May 21 2020, 1:39 AM
jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
5278

I rewrote this loop from the version on Apr 20 2020, 14:06, so that the indexes into the zero vector would be not just the first byte, but increasing. Back then that lead to some matches of MERGE_HIGH, but I see now that it's a no-op, given the optimization of the zero vector we have now added. But i suppose, it doesn't matter much either way...

uweigand accepted this revision.Jun 5 2020, 8:33 AM

This looks all good to me now, we should verify that performance results are still good, and then it can go in.

This revision is now accepted and ready to land.Jun 5 2020, 8:33 AM
This revision was automatically updated to reflect the committed changes.