For sizes over 16 bytes, MVC is not always efficient, so up to a certain limit it would be better to use a load/store sequence.
This is still experimental with two different approaches included for now, to show what they look like.
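The general direction, roughly (a simplified sketch, not the exact code of the patch), is to have getOptimalMemOpType() return a 16-byte vector type so that the generic expansion emits VL/VST pairs, with the usual MaxStoresPerMemcpy / MaxStoresPerMemset limits deciding above which size MVC is still used:

```c++
// Simplified sketch only; the real patch has more conditions.
EVT SystemZTargetLowering::getOptimalMemOpType(
    const MemOp &Op, const AttributeList &FuncAttributes) const {
  if (Subtarget.hasVector())
    return MVT::v16i8;   // expand with 16-byte vector loads/stores
  return MVT::Other;     // otherwise keep the default (MVC-based) lowering
}
```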
Approach 1:
New TLI hook prefersVectorSplatForMemset(), to avoid the mandatory scalar multiplication as a means of replication and instead directly generate a splat vector. I think this makes sense, and it is probably even better to just do this whenever the target returns a vector type from getOptimalMemOpType().
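Roughly, this looks like the following (simplified sketch; the hook name is the one added here, but the exact signature and use site may differ):

```c++
// TargetLowering hook, overridden by SystemZ to return true when the
// vector facility is available:
virtual bool prefersVectorSplatForMemset() const { return false; }

// In getMemsetStores(): when the chosen type is a vector and the target
// prefers a splat, replicate the i8 memset value directly as a splat
// BUILD_VECTOR instead of widening it with the 0x0101... multiplication.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDValue Value;
if (VT.isVector() && TLI.prefersVectorSplatForMemset()) {
  SDValue Byte = DAG.getZExtOrTrunc(Src, dl, MVT::i8);
  Value = DAG.getSplatBuildVector(VT, dl, Byte);
} else {
  Value = getMemsetValue(Src, VT, DAG, dl);  // existing path
}
```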
Approach 2:
Detect replicated bytes in SystemZTargetLowering::combineSTORE().
If getMemsetStores() generates the multiplies (as it does now, unaltered), they need to be combined away in combineSTORE(). This is used like '-memset-splat=false -byterepl-fix'. This also seems to work, although it is more LOC. It does seem like it would be worthwhile on its own as well; these are the changes I see when using it *without* expanding any memcpy/memset (a rough sketch of the combine follows the numbers):
```
vsteh        :     2557     2875    +318
vlrepb       :      187      475    +288
llc          :    39057    38771    -286
sth          :    25792    25515    -277
mhi          :     6009     5741    -268    // multiply
stg          :   371885   371803     -82
vstef        :     5779     5859     +80
st           :   122692   122620     -72
lay          :    54734    54797     +63
vsteg        :     6106     6159     +53
lg           :   987456   987405     -51
sthy         :     1054     1014     -40
vlvgp        :     8300     8339     +39
vrepb        :       95      134     +39
vrepib       :      283      320     +37
msgrkc       :     6741     6707     -34    // multiply
iilf         :     6397     6364     -33
msfi         :     7106     7082     -24
vl           :   107362   107381     +19
...
Spill|Reload :   611703   611679     -24
Copies       :  1002825  1002832      +7
```

Example:

```
-       llc     %r0, 0(%r4)
-       msrkc   %r0, %r0, %r0
-       st      %r0, 0(%r1)
+       vlrepb  %v0, 0(%r4)
+       vstef   %v0, 0(%r1), 0
```
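In rough terms, the combine looks for a stored multiply by a replicated-0x01 constant and rebuilds it as a byte splat plus an element store (simplified sketch; the actual code also checks profitability and the extension of the multiplied byte):

```c++
// Inside SystemZTargetLowering::combineSTORE():
auto *SN = cast<StoreSDNode>(N);
SDValue Val = SN->getValue();
if (Val.getOpcode() == ISD::MUL)
  if (auto *C = dyn_cast<ConstantSDNode>(Val.getOperand(1)))
    if (C->getAPIntValue().isSplat(8) &&
        C->getAPIntValue().extractBitsAsZExtValue(8, 0) == 0x01) {
      // Val.getOperand(0) holds the byte being replicated: splat it into a
      // v16i8 (VLREPB/VREPB) and store an element of the matching width
      // (VSTEH/VSTEF/VSTEG) instead of multiplying and using ST/STG,
      // as in the example above.
    }
```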
Maybe this could even be done in the common DAGCombiner...
LegalAMVecTy and GEPOffsSplit are experimental options I have played with to see how to best fix the problem where a memcpy address is out of U12 range: instead of one MVC we now get multiple VL/VSTs, which are all out of range. I think this is a problem that needs to be fixed before this patch can be used. For instance, I see:
```
stg     %r2, 8696(%r15          stg     %r2, 8696(%r15
lg      %r2, 8816(%r15          lg      %r2, 8816(%r15
                              > vl      %v0, 0(%r2), 3
lay     %r1, 8712(%r15          lay     %r1, 8712(%r15
mvc     0(44,%r1), 0(%        | vst     %v0, 0(%r1), 3
                              > vl      %v0, 16(%r2),
                              > lay     %r1, 8728(%r15
                              > vst     %v0, 0(%r1), 3
                              > vl      %v0, 28(%r2)
                              > lay     %r1, 8740(%r15
                              > vst     %v0, 0(%r1)
```
The problem right now is that if the vector type is set to reject long displacements in isLegalAddressingMode(), LSR generates worse code for some loops :-/
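For reference, the restriction I have been experimenting with is roughly the following (a simplified sketch, not the exact option implementation):

```c++
// Reject long displacements for vector accesses, since VL/VST only have a
// U12 displacement field (no 20-bit form).
bool SystemZTargetLowering::isLegalAddressingMode(const DataLayout &DL,
                                                  const AddrMode &AM,
                                                  Type *Ty, unsigned AS,
                                                  Instruction *I) const {
  if (Ty->isVectorTy() && !isUInt<12>(AM.BaseOffs))
    return false;
  // ... the existing SystemZ checks follow ...
}
```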
The MVI_TYPEFIX code is probably not needed above 16 bytes, but in some cases there were immediates of type i64 that were first loaded into a register instead of being used directly with MVI.
The test cases for now include full ranges of interesting sizes of memcpy/memset.