Instead of making a libcall to memcpy, emit an MVC loop along with an EXRL instruction, the same way as is already done for memset 0.
On preliminary measurements this seemed to be a slight overall improvement.
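For reference, a minimal C-level sketch of the kind of function this affects (the function name and signature are assumptions chosen to match the `f17` assembly shown further down): a variable-length memcpy that previously became a libcall and is now expanded inline.

```c
#include <string.h>

/* With this patch, a variable-length copy like this is lowered to an
   inline loop of 256-byte MVC instructions, with the remaining 0-255
   bytes handled by an EXRL that executes one final MVC, instead of a
   call to the memcpy libcall. */
void f17(char *dst, const char *src, unsigned long len) {
  memcpy(dst, src, len);
}
```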
I also tried some different prefetch settings on (quick) SPEC, for both write and read prefetching (compared to master, which has only a write prefetch at offset 768):
Overall results (by average over benchmarks):
```
z14:
  2017_B_Memcpy_pfd_w_0_pfd_r_0             99.856 %
  2017_E_Memcpy_pfd_w_2048_pfd_r_0          99.920 %
  2017_C_Memcpy_pfd_w_768_pfd_r_768         99.978 %
  2017_D_Memcpy_pfd_w_2048_pfd_r_2048       99.986 %
  2017_F_Memcpy_pfd_w_524287_pfd_r_524287  100.426 %

z15:
  2017_E_Memcpy_pfd_w_2048_pfd_r_0          99.941 %
  2017_B_Memcpy_pfd_w_0_pfd_r_0             99.941 %
  2017_D_Memcpy_pfd_w_2048_pfd_r_2048      100.043 %
  2017_C_Memcpy_pfd_w_768_pfd_r_768        100.053 %
  2017_F_Memcpy_pfd_w_524287_pfd_r_524287  100.313 %
```
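To make the pfd_w/pfd_r variants concrete, here is a rough C-level sketch (illustrative only, not the actual lowering; the function name and the use of __builtin_prefetch are my assumptions) of what a write/read prefetch distance of 768 means for the copy loop: the destination and source are prefetched that many bytes ahead of the current MVC.

```c
#include <string.h>

/* Illustrative only: roughly what e.g. pfd_w_768 / pfd_r_768 mean for the
   generated copy loop. __builtin_prefetch(addr, rw, locality) maps to the
   SystemZ pfd instruction (rw = 1 -> pfd 2, a store prefetch; rw = 0 ->
   pfd 1, a load prefetch). */
void copy_blocks(char *dst, const char *src, unsigned long blocks) {
  while (blocks--) {
    __builtin_prefetch(dst + 768, 1, 3); /* write prefetch, 768 bytes ahead */
    __builtin_prefetch(src + 768, 0, 3); /* read prefetch, 768 bytes ahead */
    memcpy(dst, src, 256);               /* one MVC-sized chunk */
    dst += 256;
    src += 256;
  }
}
```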
I also tried doing a runtime check for a big size, like:
```
f17:                                    # @f17
	.cfi_startproc
# %bb.0:
	aghi	%r4, -1
	cgibe	%r4, -1, 0(%r14)
.LBB16_1:
	srlg	%r0, %r4, 8
	cgije	%r0, 0, .LBB16_4
# %bb.2:
	lghi	%r1, 0
	cgfi	%r4, 2000000
	locghihe	%r1, 1
	sllg	%r1, %r1, 22
.LBB16_3:                               # =>This Inner Loop Header: Depth=1
	pfd	2, 0(%r1,%r2)
	mvc	0(256,%r2), 0(%r3)
	la	%r2, 256(%r2)
	la	%r3, 256(%r3)
	brctg	%r0, .LBB16_3
.LBB16_4:
	exrl	%r4, .Ltmp0
	br	%r14
```
The idea was to prefetch for the L2 cache by prefetching 4M ahead if the size was bigger than 2M, as a check to see if this could give anything. However, it did not seem to improve any benchmark, whether with W, R, or W+R prefetching per this pattern.
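In C-like terms, a hedged sketch of what that emitted sequence does (names are made up; the actual code computes the block count from len-1 so the final EXRL copies 1-256 bytes):

```c
#include <string.h>

/* Sketch of the runtime check tried above: for copies larger than ~2 MB,
   prefetch the destination 4 MiB ahead of the current MVC; for smaller
   copies the offset is 0, i.e. effectively no lookahead. This mirrors
   the cgfi / locghihe / sllg sequence in the assembly. */
void copy_with_runtime_prefetch(char *dst, const char *src,
                                unsigned long len) {
  unsigned long pfd_off = (len >= 2000000) ? (1ul << 22) : 0;
  unsigned long blocks = len / 256;
  while (blocks--) {
    __builtin_prefetch(dst + pfd_off, 1, 2); /* pfd 2: store prefetch */
    memcpy(dst, src, 256);                   /* one MVC */
    dst += 256;
    src += 256;
  }
  if (len % 256)
    memcpy(dst, src, len % 256); /* remainder: the EXRL-executed MVC */
}
```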
Keeping the prefetching as it is with this patch was slightly better overall on z15, while removing it was slightly better on z14, so there do not seem to be any major gains to be had from changing the MVC prefetching...