- Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided __attribute__((always_inline)) because gcc and clang will inline without it, and avoiding it improves compatibility.
- Replace the byte-by-byte copy in update() with endian::readbe32(), since perf reports that half of the time was spent copying into the buffer before this patch.
- Add a hash-benchmark to measure the performance improvement.
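To illustrate the first two points, here is a minimal sketch of the general idea, not the code in this patch. The names readBE32, rol, loadBlockBytewise and loadBlockDirect are hypothetical; readBE32 only mimics what a big-endian 32-bit read does. The bytewise version stages every input byte in a buffer before assembling words, while the direct version reads each word straight from the input.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a big-endian 32-bit read (assumption: same
// semantics as reading four bytes, most significant byte first).
static inline uint32_t readBE32(const uint8_t *P) {
  return uint32_t(P[0]) << 24 | uint32_t(P[1]) << 16 |
         uint32_t(P[2]) << 8 | uint32_t(P[3]);
}

// Small helper marked 'inline' as a hint; Bits must be in the range 1..31.
static inline uint32_t rol(uint32_t V, unsigned Bits) {
  return (V << Bits) | (V >> (32 - Bits));
}

// Slow shape: copy one byte at a time into a staging buffer, then
// assemble the sixteen 32-bit words from that buffer.
void loadBlockBytewise(const uint8_t *Data, uint32_t W[16]) {
  uint8_t Buf[64];
  for (int I = 0; I < 64; ++I)
    Buf[I] = Data[I];
  for (int I = 0; I < 16; ++I)
    W[I] = readBE32(Buf + 4 * I);
}

// Faster shape: read each word directly from the input, no staging copy.
void loadBlockDirect(const uint8_t *Data, uint32_t W[16]) {
  for (int I = 0; I < 16; ++I)
    W[I] = readBE32(Data + 4 * I);
}
```

Both functions fill W with the same values for a full 64-byte block; only the amount of copying differs.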
When lld uses --build-id=sha1 it spends 30-45% of its CPU time in SHA1, depending on the binary (this is CPU time, not wall-time, since the hashing is parallel). This patch speeds up SHA1 by a factor of 2 on clang-8 and 3 on gcc-6. In the lld-speed-test results below, this leads to an 8-28% improvement in overall linking time.
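As a rough sanity check on these figures, an Amdahl-style estimate relates the share of CPU time spent hashing to the expected saving in total CPU time. This is only a back-of-the-envelope sketch: cpuTimeSaved is a hypothetical helper, and it says nothing about wall-time, which also depends on how the work is parallelized.

```cpp
#include <cassert>

// Estimated fraction of total CPU time saved when a component that takes
// HashFraction of the time becomes Speedup times faster.
double cpuTimeSaved(double HashFraction, double Speedup) {
  assert(HashFraction >= 0.0 && HashFraction <= 1.0 && Speedup > 0.0);
  return HashFraction * (1.0 - 1.0 / Speedup);
}
```

With a 2x speedup, a 30% share gives an estimated 15% CPU saving and a 45% share gives 22.5%.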
Unit tests
ninja check-llvm
LLD speed
lld-speed-test benchmarks run on an Intel i9-9900k with Turbo disabled on CPU 0 compiled with clang-9. Stats recorded with perf stat -r 5. All inputs are using --build-id=sha1.
Input | Before (seconds) | After (seconds) |
---|---|---|
chrome | 2.14 | 1.82 (-15%) |
chrome-icf | 2.56 | 2.29 (-10%) |
clang | 0.65 | 0.53 (-18%) |
clang-fsds | 0.69 | 0.58 (-16%) |
clang-gdb-index | 21.71 | 19.3 (-11%) |
gold | 0.42 | 0.34 (-19%) |
gold-fsds | 0.431 | 0.355 (-17%) |
linux-kernel | 0.625 | 0.575 (-8%) |
llvm-as | 0.045 | 0.039 (-14%) |
llvm-as-fsds | 0.035 | 0.031 (-11%) |
mozilla | 11.3 | 9.8 (-13%) |
mozilla-gc | 11.84 | 10.36 (-12%) |
mozilla-O0 | 8.2 | 5.84 (-28%) |
scylla | 5.59 | 4.52 (-19%) |
Microbenchmarks
Compiled with clang-8:
Before:
    2019-10-16 11:33:41
    Running ./benchmarks/hash-benchmark/hash-benchmark
    Run on (24 X 2394.48 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x24)
      L1 Instruction 32K (x24)
      L2 Unified 4096K (x24)
      L3 Unified 16384K (x24)
    -----------------------------------------------------------
    Benchmark                 Time             CPU   Iterations
    -----------------------------------------------------------
    BM_SHA1/1024           5146 ns         5145 ns       137203
    BM_SHA1/4096          20043 ns        20040 ns        32644
    BM_SHA1/32768        154810 ns       154803 ns         4401
    BM_SHA1/262144      1281332 ns      1281244 ns          555
    BM_SHA1/1048576     5154688 ns      5154100 ns          137
After:
    2019-10-16 11:34:20
    Running ./benchmarks/hash-benchmark/hash-benchmark
    Run on (24 X 2394.48 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x24)
      L1 Instruction 32K (x24)
      L2 Unified 4096K (x24)
      L3 Unified 16384K (x24)
    -----------------------------------------------------------
    Benchmark                 Time             CPU   Iterations
    -----------------------------------------------------------
    BM_SHA1/1024           3071 ns         3070 ns       241890
    BM_SHA1/4096          10491 ns        10491 ns        64873
    BM_SHA1/32768         82802 ns        82791 ns         8533
    BM_SHA1/262144       685598 ns       685595 ns         1069
    BM_SHA1/1048576     2593819 ns      2593495 ns          265
Compiled with gcc-6:
Before:
    2019-10-16 11:36:05
    Running ./benchmarks/hash-benchmark/hash-benchmark
    Run on (24 X 2394.48 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x24)
      L1 Instruction 32K (x24)
      L2 Unified 4096K (x24)
      L3 Unified 16384K (x24)
    -----------------------------------------------------------
    Benchmark                 Time             CPU   Iterations
    -----------------------------------------------------------
    BM_SHA1/1024           8770 ns         8769 ns        80651
    BM_SHA1/4096          34161 ns        34159 ns        20583
    BM_SHA1/32768        271183 ns       271154 ns         2565
    BM_SHA1/262144      2140979 ns      2140434 ns          332
    BM_SHA1/1048576     8376018 ns      8374622 ns           83
After:
    2019-10-16 11:34:58
    Running ./benchmarks/hash-benchmark/hash-benchmark
    Run on (24 X 2394.48 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x24)
      L1 Instruction 32K (x24)
      L2 Unified 4096K (x24)
      L3 Unified 16384K (x24)
    -----------------------------------------------------------
    Benchmark                 Time             CPU   Iterations
    -----------------------------------------------------------
    BM_SHA1/1024           2892 ns         2892 ns       254677
    BM_SHA1/4096          10300 ns        10299 ns        72058
    BM_SHA1/32768         82527 ns        82527 ns         8880
    BM_SHA1/262144       629433 ns       629358 ns         1080
    BM_SHA1/1048576     2669301 ns      2669137 ns          272
Do we have a test that calls update() with a chunk of data larger than BLOCK_LENGTH?
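One possible shape for such a test is sketched below. It is self-contained and uses a hypothetical MiniSHA1 stand-in rather than the class under review, so the names (MiniSHA1, hashInChunks) and the buffering strategy are assumptions. The idea is to hash the same input in a single update() call larger than BLOCK_LENGTH, in odd-sized chunks that straddle block boundaries, and byte by byte, and to check that all digests agree.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical minimal SHA-1 used only to make the test sketch runnable.
struct MiniSHA1 {
  static constexpr size_t BLOCK_LENGTH = 64;
  uint32_t H[5] = {0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0};
  uint8_t Buf[BLOCK_LENGTH];
  size_t BufLen = 0;
  uint64_t Total = 0;

  static inline uint32_t rol(uint32_t V, int N) {
    return (V << N) | (V >> (32 - N));
  }

  // Process one 64-byte block.
  void block(const uint8_t *P) {
    uint32_t W[80];
    for (int I = 0; I < 16; ++I)
      W[I] = uint32_t(P[4 * I]) << 24 | uint32_t(P[4 * I + 1]) << 16 |
             uint32_t(P[4 * I + 2]) << 8 | uint32_t(P[4 * I + 3]);
    for (int I = 16; I < 80; ++I)
      W[I] = rol(W[I - 3] ^ W[I - 8] ^ W[I - 14] ^ W[I - 16], 1);
    uint32_t A = H[0], B = H[1], C = H[2], D = H[3], E = H[4];
    for (int I = 0; I < 80; ++I) {
      uint32_t F, K;
      if (I < 20)      { F = (B & C) | (~B & D);          K = 0x5A827999; }
      else if (I < 40) { F = B ^ C ^ D;                   K = 0x6ED9EBA1; }
      else if (I < 60) { F = (B & C) | (B & D) | (C & D); K = 0x8F1BBCDC; }
      else             { F = B ^ C ^ D;                   K = 0xCA62C1D6; }
      uint32_t T = rol(A, 5) + F + E + K + W[I];
      E = D; D = C; C = rol(B, 30); B = A; A = T;
    }
    H[0] += A; H[1] += B; H[2] += C; H[3] += D; H[4] += E;
  }

  void update(const uint8_t *Data, size_t Len) {
    Total += Len;
    if (BufLen) { // Top up a partially filled buffer first.
      size_t N = std::min(Len, BLOCK_LENGTH - BufLen);
      std::memcpy(Buf + BufLen, Data, N);
      BufLen += N; Data += N; Len -= N;
      if (BufLen < BLOCK_LENGTH)
        return;
      block(Buf);
      BufLen = 0;
    }
    // Full blocks are hashed straight from the input, without copying.
    for (; Len >= BLOCK_LENGTH; Data += BLOCK_LENGTH, Len -= BLOCK_LENGTH)
      block(Data);
    if (Len)
      std::memcpy(Buf, Data, Len); // Keep the tail for the next call.
    BufLen = Len;
  }

  std::array<uint8_t, 20> final() {
    uint64_t Bits = Total * 8; // Message length in bits, before padding.
    uint8_t Pad = 0x80, Zero = 0, LenBytes[8];
    update(&Pad, 1);
    while (BufLen != 56)
      update(&Zero, 1);
    for (int I = 0; I < 8; ++I)
      LenBytes[I] = uint8_t(Bits >> (56 - 8 * I));
    update(LenBytes, 8);
    std::array<uint8_t, 20> Out;
    for (int I = 0; I < 20; ++I)
      Out[I] = uint8_t(H[I / 4] >> (24 - 8 * (I % 4)));
    return Out;
  }
};

// Hash D by feeding update() with pieces of at most Chunk bytes.
std::array<uint8_t, 20> hashInChunks(const std::vector<uint8_t> &D,
                                     size_t Chunk) {
  MiniSHA1 S;
  for (size_t Off = 0; Off < D.size(); Off += Chunk)
    S.update(D.data() + Off, std::min(Chunk, D.size() - Off));
  return S.final();
}
```

A test would then compare, for an input of several blocks plus a partial one, hashInChunks(D, D.size()) against hashInChunks(D, 100) and hashInChunks(D, 1). The single large call exercises the path where update() receives more than BLOCK_LENGTH bytes at once.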