When a replicated register / immediate is immediately stored, it is better to use a Vector Replicate than a scalar multiplication (e.g. by 0x0101) or two immediate loads into a GPR.
This patch transforms such stores in SystemZ::combineSTORE() before type legalization.
I also tried doing this after legalization, but that seems to be more work without any benefit (although the i128 case is actually better handled after splitting). If the types have to be legal, the right vector type has to be produced, with an extracted element. If that element is i16 (not uncommon), an i32 first needs to be extracted and then the store itself must truncate. The type legalizer takes care of all this if the transformation is done before it runs.
What's more, the zero-extend node, which the patch depends on to be sure that the multiply produces a replicated word, is easy to detect on the initial DAG but is removed later by DAGCombine. And computeKnownBits() does not necessarily work (at least not on the i16, it seems).
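To illustrate why the zero-extend matters, here is a minimal standalone sketch of the arithmetic only (not the patch's DAG code; the helper names are made up for this example). Multiplying by 0x01010101 replicates a byte only when the upper bits of the operand are known to be zero:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: replicate a byte into all four bytes of a 32-bit word.
// The cast models the zero-extend that the combine looks for.
constexpr uint32_t replicateByte(uint8_t B) {
  return static_cast<uint32_t>(B) * 0x01010101u;
}

// The same multiply on an operand that is not known to be zero-extended.
constexpr uint32_t multiplyUnchecked(uint32_t V) {
  return V * 0x01010101u;
}

// True if all four bytes of W are equal.
constexpr bool isByteSplat(uint32_t W) {
  return W == (W & 0xFFu) * 0x01010101u;
}

// With the zero-extend, the product is a splat of the byte.
static_assert(replicateByte(0xAB) == 0xABABABABu, "zero-extended byte splats");
// Without it, a stray bit above the byte carries into the other lanes.
static_assert(!isByteSplat(multiplyUnchecked(0x1AB)), "not a splat");
```

Here 0x1AB * 0x01010101 truncates to 0xACACACAB, which is no longer a replicated byte, so the multiply alone is not enough evidence for the transformation.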
I also don't think the DAGCombiner / legalizer will produce these patterns, so if there is no other argument, I think it is probably simplest to do this on the initial DAG..?
- Maybe detect the immediate splat with SystemZVectorConstantInfo, and perhaps also see if there are other immediates to be built/stored instead of via GPRs?
- I tried avoiding the extra LAYs, but it did not seem to be better on benchmarks - the scalar multiply is slower than the LAY so it should always be better anyway, right?
- OK to do before legalize only, or continue work for Combine2?
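For the immediate-splat question above, a rough standalone sketch of the kind of check involved, assuming a 64-bit stored immediate. The function name is hypothetical and this is not the SystemZVectorConstantInfo interface; it only shows how the smallest replicated element width could be found:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: return the smallest element width in bits (8, 16 or 32)
// such that the 64-bit immediate is that element replicated, or 0 if none.
inline unsigned smallestSplatBits(uint64_t Imm) {
  for (unsigned Bits = 8; Bits < 64; Bits *= 2) {
    uint64_t Mask = (uint64_t(1) << Bits) - 1;
    uint64_t Elt = Imm & Mask;
    // Rebuild the 64-bit value from copies of the low element.
    uint64_t Rebuilt = 0;
    for (unsigned Shift = 0; Shift < 64; Shift += Bits)
      Rebuilt |= Elt << Shift;
    if (Rebuilt == Imm)
      return Bits;
  }
  return 0;
}
```

A result of 8, 16 or 32 would suggest a byte, halfword or word replicate respectively; 0 means the immediate is not a splat of any narrower element.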
SPEC:
vsteg          :     5918     6372   +454
stg            :   370142   369696   -446
vrepif         :     7729     8077   +348
llihl          :     7265     6940   -325
oill           :    18574    18254   -320
vsteh          :     2557     2875   +318
vstef          :     5796     6105   +309
vlrepb         :      207      509   +302
llc            :    39072    38771   -301
sth            :    25741    25463   -278
mhi            :     6070     5802   -268
lay            :    55017    55271   +254
st             :   127451   127280   -171
msfi           :     7106     6974   -132
sty            :     3627     3514   -113
iilf           :     6305     6212    -93
vrepih         :     1133     1211    +78
vreph          :      191      259    +68
vlvgp          :     8511     8576    +65
...
OPCDIFFS: -251
...
Spill|Reload   :   607848   607780    -68
Copies         :   995492   995481    -11
Some improvements on benchmarks on z14, but more neutral on z15.