This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add optimized memcpy for RISCV
ClosedPublic

Authored by gchatelet on May 9 2023, 7:34 AM.

Details

Summary

This patch adds two versions of memcpy optimized for architectures where unaligned accesses are either illegal or extremely slow.
It is currently enabled for RISCV 64 and RISCV 32 but it could be used for ARM 32 architectures as well.

Here is the before / after output of libc.benchmarks.memory_functions.opt_host --benchmark_filter=BM_Memcpy on a quad core Linux starfive RISCV 64 board running at 1.5GHz.

Before:

Run on (4 X 1500 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x4)
  L1 Data 32 KiB (x4)
  L2 Unified 2048 KiB (x1)
------------------------------------------------------------------------
Benchmark              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------
BM_Memcpy/0/0        474 ns          474 ns      1483776 bytes_per_cycle=0.243492/s bytes_per_second=348.318M/s items_per_second=2.11097M/s __llvm_libc::memcpy,memcpy Google A
BM_Memcpy/1/0        210 ns          209 ns      3649536 bytes_per_cycle=0.233819/s bytes_per_second=334.481M/s items_per_second=4.77519M/s __llvm_libc::memcpy,memcpy Google B
BM_Memcpy/2/0       1814 ns         1814 ns       396288 bytes_per_cycle=0.247899/s bytes_per_second=354.622M/s items_per_second=551.402k/s __llvm_libc::memcpy,memcpy Google D
BM_Memcpy/3/0       89.3 ns         89.2 ns      7459840 bytes_per_cycle=0.217415/s bytes_per_second=311.014M/s items_per_second=11.2071M/s __llvm_libc::memcpy,memcpy Google L
BM_Memcpy/4/0        134 ns          134 ns      3815424 bytes_per_cycle=0.226584/s bytes_per_second=324.131M/s items_per_second=7.44567M/s __llvm_libc::memcpy,memcpy Google M
BM_Memcpy/5/0       52.8 ns         52.6 ns     11001856 bytes_per_cycle=0.194893/s bytes_per_second=278.797M/s items_per_second=19.0284M/s __llvm_libc::memcpy,memcpy Google Q
BM_Memcpy/6/0        180 ns          180 ns      4101120 bytes_per_cycle=0.231884/s bytes_per_second=331.713M/s items_per_second=5.55957M/s __llvm_libc::memcpy,memcpy Google S
BM_Memcpy/7/0        195 ns          195 ns      3906560 bytes_per_cycle=0.232951/s bytes_per_second=333.239M/s items_per_second=5.1217M/s __llvm_libc::memcpy,memcpy Google U
BM_Memcpy/8/0        152 ns          152 ns      4789248 bytes_per_cycle=0.227507/s bytes_per_second=325.452M/s items_per_second=6.58187M/s __llvm_libc::memcpy,memcpy Google W
BM_Memcpy/9/0       6036 ns         6033 ns       118784 bytes_per_cycle=0.249158/s bytes_per_second=356.423M/s items_per_second=165.75k/s __llvm_libc::memcpy,uniform 384 to 4096

After:

BM_Memcpy/0/0        126 ns          126 ns      5770240 bytes_per_cycle=1.04707/s bytes_per_second=1.46273G/s items_per_second=7.9385M/s __llvm_libc::memcpy,memcpy Google A
BM_Memcpy/1/0       75.1 ns         75.0 ns     10204160 bytes_per_cycle=0.691143/s bytes_per_second=988.687M/s items_per_second=13.3289M/s __llvm_libc::memcpy,memcpy Google B
BM_Memcpy/2/0        333 ns          333 ns      2174976 bytes_per_cycle=1.39297/s bytes_per_second=1.94596G/s items_per_second=3.00002M/s __llvm_libc::memcpy,memcpy Google D
BM_Memcpy/3/0       49.6 ns         49.5 ns     16092160 bytes_per_cycle=0.710161/s bytes_per_second=1015.89M/s items_per_second=20.1844M/s __llvm_libc::memcpy,memcpy Google L
BM_Memcpy/4/0       57.7 ns         57.7 ns     11213824 bytes_per_cycle=0.561557/s bytes_per_second=803.314M/s items_per_second=17.3228M/s __llvm_libc::memcpy,memcpy Google M
BM_Memcpy/5/0       48.0 ns         47.9 ns     16437248 bytes_per_cycle=0.346708/s bytes_per_second=495.97M/s items_per_second=20.8571M/s __llvm_libc::memcpy,memcpy Google Q
BM_Memcpy/6/0       67.5 ns         67.5 ns     10616832 bytes_per_cycle=0.614173/s bytes_per_second=878.582M/s items_per_second=14.8142M/s __llvm_libc::memcpy,memcpy Google S
BM_Memcpy/7/0       84.7 ns         84.6 ns     10480640 bytes_per_cycle=0.819077/s bytes_per_second=1.14424G/s items_per_second=11.8174M/s __llvm_libc::memcpy,memcpy Google U
BM_Memcpy/8/0       61.7 ns         61.6 ns     11191296 bytes_per_cycle=0.550078/s bytes_per_second=786.893M/s items_per_second=16.2279M/s __llvm_libc::memcpy,memcpy Google W
BM_Memcpy/9/0        981 ns          981 ns       703488 bytes_per_cycle=1.52333/s bytes_per_second=2.12807G/s items_per_second=1019.81k/s __llvm_libc::memcpy,uniform 384 to 4096

It is not as good as glibc for now so there's room for improvement. I suspect a path pumping 16 bytes at once given the doubled numbers for large copies.

BM_Memcpy/0/1        146 ns         82.5 ns      8576000 bytes_per_cycle=1.35236/s bytes_per_second=1.88922G/s items_per_second=12.1169M/s glibc memcpy,memcpy Google A
BM_Memcpy/1/1        112 ns         63.7 ns     10634240 bytes_per_cycle=0.628018/s bytes_per_second=898.387M/s items_per_second=15.702M/s glibc memcpy,memcpy Google B
BM_Memcpy/2/1        315 ns          180 ns      4079616 bytes_per_cycle=2.65229/s bytes_per_second=3.7052G/s items_per_second=5.54764M/s glibc memcpy,memcpy Google D
BM_Memcpy/3/1       85.3 ns         43.1 ns     15854592 bytes_per_cycle=0.774164/s bytes_per_second=1107.45M/s items_per_second=23.2249M/s glibc memcpy,memcpy Google L
BM_Memcpy/4/1        105 ns         54.3 ns     13427712 bytes_per_cycle=0.7793/s bytes_per_second=1114.8M/s items_per_second=18.4109M/s glibc memcpy,memcpy Google M
BM_Memcpy/5/1       77.1 ns         43.2 ns     16476160 bytes_per_cycle=0.279808/s bytes_per_second=400.269M/s items_per_second=23.1428M/s glibc memcpy,memcpy Google Q
BM_Memcpy/6/1        112 ns         62.7 ns     11236352 bytes_per_cycle=0.676078/s bytes_per_second=967.137M/s items_per_second=15.9387M/s glibc memcpy,memcpy Google S
BM_Memcpy/7/1        131 ns         65.5 ns     11751424 bytes_per_cycle=0.965616/s bytes_per_second=1.34895G/s items_per_second=15.2762M/s glibc memcpy,memcpy Google U
BM_Memcpy/8/1        104 ns         55.0 ns     12314624 bytes_per_cycle=0.583336/s bytes_per_second=834.468M/s items_per_second=18.1937M/s glibc memcpy,memcpy Google W
BM_Memcpy/9/1        932 ns          466 ns      1480704 bytes_per_cycle=3.17342/s bytes_per_second=4.43321G/s items_per_second=2.14679M/s glibc memcpy,uniform 384 to 4096

Diff Detail

Event Timeline

gchatelet created this revision.May 9 2023, 7:34 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMay 9 2023, 7:34 AM
gchatelet requested review of this revision.May 9 2023, 7:34 AM
gchatelet updated this revision to Diff 520698.May 9 2023, 7:41 AM
  • Fix typo
  • Add missing dependency
gchatelet updated this revision to Diff 520700.May 9 2023, 7:45 AM
  • Make sure the loaded type sizes sum up to the value type size.
sivachandra accepted this revision.May 9 2023, 11:17 PM
sivachandra added inline comments.
libc/src/string/memory_utils/memcpy_implementations.h
36

Looks like this 32-bit flavor is unused? Would it make sense to add #elif defined(LIBC_TARGET_ARCH_IS_RISCV32) and add it there?

libc/src/string/memory_utils/utils.h
101

Can you add a comment as to why this is required?

This revision is now accepted and ready to land.May 9 2023, 11:17 PM
gchatelet updated this revision to Diff 520931.May 10 2023, 1:34 AM
gchatelet marked 2 inline comments as done.
  • rebase
  • Fix typo
  • Add missing dependency
  • Make sure the loaded type sizes sum up to the value type size.
  • Document why we ignore 'array-bounds' warning
  • Support RISCV 32.
gchatelet retitled this revision from [libc] Add optimized memcpy for RISCV 64 to [libc] Add optimized memcpy for RISCV.May 10 2023, 1:41 AM
gchatelet edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.May 10 2023, 1:42 AM
This revision was automatically updated to reflect the committed changes.