This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Use VREP for storing replicated regs/immediates.
AbandonedPublic

Authored by jonpa on Feb 24 2022, 6:55 PM.

Details

Reviewers
uweigand
Summary

When a replicated register / immediate is immediately stored, it is better to use a Vector Replicate rather than a scalar multipliation (by e.g. 0x0101), or two immediate loads into a GPR.

This patch transforms such stores in SystemZ::combineSTORE() before type legalization.

I also tried doing this after legalization, but that seems to be more work without any benefit (actually the i128 case is better handled after splitting). If the types have to be legal, the right vector type have to be produced, with an extracted element. If that element is i16 (not uncommon), an i32 first needs to be extracted and the the store must truncate itself. This is all taken care of by the type legalizer if it is transformed before it.

What's more, the zero-extend node which the patch depends on in order to be sure that the multiply produces a replicated word is easy to detect on the initial DAG, but it is removed later by DAGCombine. And computeKnownBits() do not necessarily work (at least not on the i16 it seems).

I also don't think the DAGCombiner / legalizer will produce these patterns so if there is no other argument, I think it is probably simplest to do this with the inital DAG..?

  • Maybe detect the immediate splat with SystemZVectorConstantInfo, and perhaps also see if there are other immediates to be built/stored instead of via GPRs?
  • I tried avoiding the extra LAY:s but it did not seem to be better on benchmarks - the scalar multiply is slower than the LAY so it should always be better anyway, right?
  • OK to do before legalize only, or continue work for Combine2?

SPEC:

vsteg          :                 5918                 6372     +454
stg            :               370142               369696     -446
vrepif         :                 7729                 8077     +348
llihl          :                 7265                 6940     -325
oill           :                18574                18254     -320
vsteh          :                 2557                 2875     +318
vstef          :                 5796                 6105     +309
vlrepb         :                  207                  509     +302
llc            :                39072                38771     -301
sth            :                25741                25463     -278
mhi            :                 6070                 5802     -268
lay            :                55017                55271     +254
st             :               127451               127280     -171
msfi           :                 7106                 6974     -132
sty            :                 3627                 3514     -113
iilf           :                 6305                 6212      -93
vrepih         :                 1133                 1211      +78
vreph          :                  191                  259      +68
vlvgp          :                 8511                 8576      +65
...
OPCDIFFS: -251
...
Spill|Reload   :               607848               607780      -68
Copies         :               995492               995481      -11

Some improvements on benchmarks on z14, but more neutral on z15.

Diff Detail

Event Timeline

jonpa created this revision.Feb 24 2022, 6:55 PM
jonpa requested review of this revision.Feb 24 2022, 6:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2022, 6:55 PM

I agree that this looks like the right thing to do. However, this really should use the SystemZVectorConstantInfo mechanism. (You'll probably have to add an APInt constructor there in addition to the APFloat one - currently this is only used for floating-point and vector constants, but now we also need it for integer constants. But that should be straightforward.)

Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 5:06 AM
jonpa updated this revision to Diff 413483.Mar 7 2022, 7:58 AM

Patch updated to use SystemZVectorConstantInfo to detect the constant splats in combineSTORE().

If we were to only handle replicated constants / registers, it seems quite straightforward to simply build a splatVector in combineSTORE() and be done.

There are however also other constants that SystemZVectorConstantInfo can handle (with VGBM / VGM), which I have experimented with. The handling of stores of constants is in this version done instead in SystemZISelDAGToDAG.cpp, where Select() detects the case and loadVectorConstant() builds the vector with a vector element extraction. The experimental option '-replicate-only' skips generation of the non-REPLICATE opcodes, which should then show the value of being able to generate any vector constant.

trunk <> patch "replicate only":

This is the important part of the patch I think, as it eliminates the scalar multiplies (common case seems to be load-replicate-store):

vlrepb         :                  187                  489     +302
llc            :                39073                38772     -301
mhi            :                 6119                 5851     -268
msfi           :                 7066                 6934     -132
msgrkc         :                 7116                 7071      -45
msgc           :                  407                  403       -4

Generally it looks like a good change:

stef           :                 5750                 6386     +636
vsteg          :                 5912                 6475     +563
stg            :               373052               372504     -548
st             :               123399               122875     -524
vrepif         :                 7708                 8127     +419
llihl          :                 7344                 7022     -322
oill           :                18487                18169     -318
vsteh          :                 2557                 2875     +318
iilf           :                 5388                 5108     -280
sth            :                25810                25532     -278
lay            :                54928                55179     +251
vrepih         :                 1131                 1374     +243
sty            :                 3642                 3527     -115
oilf           :                 9138                 9048      -90
llihf          :                11328                11238      -90
OPCDIFFS: -534

Spill|Reload   :               613545               613472      -73
Copies         :              1005714              1005692      -22

patch "replicate only" <> patch "with vgbm/vgm":

There seems to be many cases where this could be enabled:

vsteg          :                 6475                 8912    +2437
stg            :               372504               370085    -2419
vstef          :                 6386                 8551    +2165
st             :               122875               121122    -1753
vgmf           :                13118                14454    +1336
vgmg           :                 8171                 9496    +1325
llilh          :                 6887                 5763    -1124
llihh          :                 2516                 1662     -854
llihl          :                 7022                 6564     -458
lg             :               986019               986363     +344
l              :               179157               179407     +250
stfh           :                21086                20837     -249
vgbm           :                10985                11189     +204
oilh           :                 1533                 1341     -192
llihf          :                11238                11063     -175
oilf           :                 9048                 8909     -139
iilf           :                 5108                 4974     -134
iihf           :                 4940                 4837     -103
cg             :                32046                32139      +93
OPCDIFFS: 554

Spill|Reload   :               613472               614178     +706
Copies         :              1005692              1005667      -25

The increase in spill/reload instructions seems to be due to one file with some +816 of them. I checked and have so far learned that it was not more vector register spills due to the constants being loaded into them, and it also seems that there are many more reloads, but not spills. This file is quite big, so it may be that the change is not that much relatively greater than in other smaller files.

One advantage with handling this in Select() (or rather after legalize-types), is that i128 constants now actually get handled - see @fun_128i. Not sure if that is of any real interest...

I agree that this looks like the right thing to do. However, this really should use the SystemZVectorConstantInfo mechanism. (You'll probably have to add an APInt constructor there in addition to the APFloat one - currently this is only used for floating-point and vector constants, but now we also need it for integer constants. But that should be straightforward.)

Per above we could either just handle the replication of regs/constants in combineSTORE(), or if we would like to be able to use vgbm/vgm also for storing integer constants we can extend also Select(), but perhaps also try again to figure out why one file gets more reloading (f510.parest_r/build/vectors.s)...

jonpa added a comment.Apr 4 2022, 3:18 AM

This patch includes some experimental parts, which have been removed here: https://reviews.llvm.org/D122105.

Please look there during further review, instead of at this post...

jonpa abandoned this revision.May 13 2022, 6:51 AM

Part of eaa7803.