This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Expand some memcpys/memsets into Load/Store sequences.
AbandonedPublic

Authored by jonpa on Feb 21 2022, 12:33 PM.

Details

Reviewers
uweigand
Summary

For sizes over 16 bytes MVC is not always efficient, so up to a certain limit it would be better to use a Load/Store sequence.

This is still experimental with two different approaches included for now, to show what they look like.

Approach 1:

New TLI hook prefersVectorSplatForMemset(), to avoid the mandatory scalar multiplication as a means of replication, but instead directly generate a splat vector. I think this makes sense, and it is probably even better to just do this whenever target returns a vector type from getOptimalMemOpType().

Approach 2:

Detect replicated bytes in SystemZTargetLowering::combineSTORE().

If getMemsetStores() generates the multiplies (as it does now unaltered), they need to be combined in combineSTORE() . This is used like '-memset-splat=false -byterepl-fix'. This seems to work also, although it is more LOCs. It does seem though as it would be good also on its own - I see if using this *without* expanding any memcpy/memsets:

vsteh          :                 2557                 2875     +318
vlrepb         :                  187                  475     +288
llc            :                39057                38771     -286
sth            :                25792                25515     -277
mhi            :                 6009                 5741     -268   // multiply
stg            :               371885               371803      -82
vstef          :                 5779                 5859      +80
st             :               122692               122620      -72
lay            :                54734                54797      +63
vsteg          :                 6106                 6159      +53
lg             :               987456               987405      -51
sthy           :                 1054                 1014      -40
vlvgp          :                 8300                 8339      +39
vrepb          :                   95                  134      +39
vrepib         :                  283                  320      +37
msgrkc         :                 6741                 6707      -34   // multiply
iilf           :                 6397                 6364      -33
msfi           :                 7106                 7082      -24
vl             :               107362               107381      +19
...
Spill|Reload   :               611703               611679      -24
Copies         :              1002825              1002832       +7

Example:

-       llc     %r0, 0(%r4)
-       msrkc   %r0, %r0, %r0
-       st      %r0, 0(%r1)
+       vlrepb  %v0, 0(%r4)
+       vstef   %v0, 0(%r1), 0

Maybe this could be done in the common DAGCombiner even...

LegalAMVecTy and GEPOffsSplit are experimental options I have played with to see how to best fix the problem where a memcpy address is >U12 range, and now we instead get multiple VL/VSTs which all are out of range. This is a problem that needs to be fixed before using this patch, I think. I see for instance:

stg     %r2, 8696(%r15                  stg     %r2, 8696(%r15
lg      %r2, 8816(%r15                  lg      %r2, 8816(%r15
                          >             vl      %v0, 0(%r2), 3
lay     %r1, 8712(%r15                  lay     %r1, 8712(%r15
mvc     0(44,%r1), 0(%    |             vst     %v0, 0(%r1), 3
                          >             vl      %v0, 16(%r2), 
                          >             lay     %r1, 8728(%r15
                          >             vst     %v0, 0(%r1), 3
                          >             vl      %v0, 28(%r2)
                          >             lay     %r1, 8740(%r15
                          >             vst     %v0, 0(%r1)

The problem right now is that if the vector type is set to reject long displacements in isLegalAddressingMode(), LSR generates worse code for some loops :-/

The MVI_TYPEFIX code is probably not needed above 16 bytes, but in some cases there were immediates of type i64 that first got loaded into a register instead of used directly with MVI.

The test cases for now include full ranges of interesting sizes of memcpy/memset.

Diff Detail

Unit TestsFailed

Event Timeline

jonpa created this revision.Feb 21 2022, 12:33 PM
jonpa requested review of this revision.Feb 21 2022, 12:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 21 2022, 12:33 PM
jonpa added a comment.Apr 4 2022, 3:17 AM

This patch includes some experimental parts, which have been removed here: https://reviews.llvm.org/D122105.

Please look there during further review, instead of at this post...

Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2022, 3:17 AM
jonpa abandoned this revision.May 13 2022, 6:52 AM

Part of eaa7803.