This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Implement memcpy with variable length with MVC
Closed · Public

Authored by jonpa on Jul 27 2021, 6:59 AM.

Details

Summary

Instead of making a libcall to memcpy when the length is variable, emit an MVC loop followed by an EXRL instruction, the same way as is already done for memset 0.
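As a rough illustration of what the MVC loop plus EXRL sequence computes, here is a small C model. It is only a sketch under my reading of the instructions: the names `mvc` and `memcpy_mvc_model` are made up for this example and are not part of LLVM, and it assumes that an MVC length code of L copies L + 1 bytes and that EXRL supplies the low 8 bits of the register as that length code.

```c
#include <assert.h>
#include <stddef.h>

/* Models one MVC: copies n bytes (1..256), left to right, one byte at a time.
   Hypothetical helper for illustration only. */
static void mvc(unsigned char *dst, const unsigned char *src, size_t n) {
    assert(n >= 1 && n <= 256);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Hypothetical C model of the emitted sequence; not LLVM code. */
void *memcpy_mvc_model(void *dst, const void *src, size_t len) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (len == 0)                       /* early return when len - 1 == -1 */
        return dst;

    size_t len_minus_1 = len - 1;
    size_t blocks = len_minus_1 >> 8;   /* number of full 256-byte MVCs */

    for (; blocks != 0; --blocks) {     /* the MVC loop */
        mvc(d, s, 256);
        d += 256;
        s += 256;
    }

    /* EXRL step: the low 8 bits of len - 1 act as the MVC length code,
       and a length code of L copies L + 1 bytes. */
    mvc(d, s, (len_minus_1 & 0xff) + 1);
    return dst;
}
```

Working with len - 1 rather than len is what lets a single trailing MVC handle every remainder from 1 to 256 bytes, so no separate check for an exact multiple of 256 is needed.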

In preliminary measurements, this seemed to be a slight overall improvement.

I also tried some different prefetch settings on a (quick) SPEC run for both Write and Read (compared to master, which has only Write 768):

Overall results (by average over benchmarks):

z14:
2017_B_Memcpy_pfd_w_0_pfd_r_0                                             99.856 %
2017_E_Memcpy_pfd_w_2048_pfd_r_0                                          99.920 %
2017_C_Memcpy_pfd_w_768_pfd_r_768                                         99.978 %
2017_D_Memcpy_pfd_w_2048_pfd_r_2048                                       99.986 %
2017_F_Memcpy_pfd_w_524287_pfd_r_524287                                   100.426 %

z15:
2017_E_Memcpy_pfd_w_2048_pfd_r_0                                          99.941 %
2017_B_Memcpy_pfd_w_0_pfd_r_0                                             99.941 %
2017_D_Memcpy_pfd_w_2048_pfd_r_2048                                       100.043 %
2017_C_Memcpy_pfd_w_768_pfd_r_768                                         100.053 %
2017_F_Memcpy_pfd_w_524287_pfd_r_524287                                   100.313 %

I also tried adding a runtime check for a big size, like this:

f17:                                    # @f17
        .cfi_startproc
# %bb.0:
        aghi    %r4, -1
        cgibe   %r4, -1, 0(%r14)
.LBB16_1:
        srlg    %r0, %r4, 8
        cgije   %r0, 0, .LBB16_4
# %bb.2:
        lghi    %r1, 0
        cgfi    %r4, 2000000
        locghihe        %r1, 1
        sllg    %r1, %r1, 22
.LBB16_3:                               # =>This Inner Loop Header: Depth=1
        pfd     2, 0(%r1,%r2)
        mvc     0(256,%r2), 0(%r3)
        la      %r2, 256(%r2)
        la      %r3, 256(%r3)
        brctg   %r0, .LBB16_3
.LBB16_4:
        exrl    %r4, .Ltmp0
        br      %r14

The idea was to prefetch for the L2 cache (4M) if the size was bigger than 2M, as an experiment to see whether this could give anything. However, it did not seem to improve any benchmark, whether with W, R, or W+R prefetching per this pattern.
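To spell out the size check in the listing above, the prefetch index register ends up holding either 0 or 4M, depending on whether len - 1 is at least 2000000. A minimal C sketch of that computation follows; the function name `prefetch_distance` is hypothetical, and the mapping of each line to an instruction is my reading of the listing rather than anything taken from the patch itself.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the lghi / cgfi / locghihe / sllg sequence. */
int64_t prefetch_distance(int64_t len_minus_1) {
    int64_t flag = 0;                /* lghi %r1, 0 */
    if (len_minus_1 >= 2000000)      /* cgfi + locghihe: signed "high or equal" */
        flag = 1;
    return flag << 22;               /* sllg: yields 0 or 4194304 (4M) */
}
```

With a distance of 0, the pfd in the loop simply targets the block that the MVC is about to write; with 4M it targets an address that far ahead of the current destination pointer.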

Keeping the prefetching as it is with this patch was slightly better overall on z15, while running without it was slightly better on z14, so there do not seem to be any major gains to be had from changing the MVC prefetching.

Event Timeline

jonpa created this revision. Jul 27 2021, 6:59 AM
jonpa requested review of this revision. Jul 27 2021, 6:59 AM
Herald added a project: Restricted Project. Jul 27 2021, 6:59 AM

This looks good to me, but we should verify the performance using a full run.

This revision was not accepted when it landed; it landed in state Needs Review. Oct 5 2021, 8:15 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.