This is an archive of the discontinued LLVM Phabricator instance.

[Support] Add llvm::xxh3_64bits
ClosedPublic

Authored by MaskRay on Jul 9 2023, 8:04 PM.

Details

Summary

ld.lld SHF_MERGE|SHF_STRINGS duplicate elimination is computation heavy
and utilitizes llvm::xxHash64, a simplified version of XXH64.
Externally many sources confirm that a new variant XXH3 is much faster.

I have picked a few hash implementations and computed the
proportion of time spent on hashing in the overall link time (a debug
build of clang 16 on a machine using AMD Zen 2 architecture):

  • llvm::xxHash64: 3.63%
  • official XXH64 (#define XXH_VECTOR XXH_SCALAR): 3.53%
  • official XXH3_64bits (#define XXH_VECTOR XXH_SCALAR): 1.21%
  • official XXH3_64bits (default, essentially XXH_SSE2): 1.22%
  • this patch llvm::xxh3_64bits: 1.19%

The remaining part of lld remains unchanged. Consequently, a lower ratio
indicates that hashing is faster. Therefore, it is evident that XXH3 from xxhash
is significantly faster than both the official version and our llvm::xxHash64.

(
string length: count
1-3: 393434
4-8: 2084056
9-16: 2846249
17-128: 5598928
129-240: 1317989
241-: 328058
)

This patch adds heavily simplified https://github.com/Cyan4973/xxHash,
taking account of many simplification ideas from Devin Hussey's xxhash-clean.

Important x86-64 optimization ideas:

  • Make XXH3_len_129to240_64b and XXH3_hashLong_64b noinline
  • Unroll XXH3_len_17to128_64b
  • __restrict does not affect Clang code generation

Beside SHF_MERGE|SHF_STRINGS duplicate elimination, llvm/ADT/StringMap.h
StringMapImpl::LookupBucketFor and a few places in lld can potentially be
accelerated by switching to llvm::xxh3_64bits.

Link: https://github.com/llvm/llvm-project/issues/63750

Diff Detail

Event Timeline

MaskRay created this revision.Jul 9 2023, 8:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2023, 8:04 PM
MaskRay requested review of this revision.Jul 9 2023, 8:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2023, 8:04 PM
MaskRay edited the summary of this revision. (Show Details)Jul 9 2023, 8:05 PM

Looks like a useful optimisation, I'm not in a great position to review the code. Will try and find some time to run a benchmark on AArch64 to see if I can replicate the results on a non-X64 machine.

LGTM with the minor nits mentioned above.

llvm/include/llvm/Support/xxhash.h
48

Why exposing this version to the user?

llvm/lib/Support/xxhash.cpp
3

you should probably update the copyright notice to 2021 to reflect upstream https://github.com/Cyan4973/xxHash/blob/dev/LICENSE#L2

38

Mention the commit number here?

151

could be constexpr size_t

198

same here.

This revision is now accepted and ready to land.Jul 18 2023, 4:39 AM

I haven't looked at the code but I have built ld.lld with this patch and the patch from D154813 on Windows with both VS 2022 cl 19.36.32537 and clang-cl 16.0.6. I have run the usual benchmarks (clang, chrome, mozilla, scylla) and also a UE5 link and I don't see any difference in performance on my Windows 10 PC, i.e. any delta is in the measurement "noise".

I haven't looked at the code but I have built ld.lld with this patch and the patch from D154813 on Windows with both VS 2022 cl 19.36.32537 and clang-cl 16.0.6. I have run the usual benchmarks (clang, chrome, mozilla, scylla) and also a UE5 link and I don't see any difference in performance on my Windows 10 PC, i.e. any delta is in the measurement "noise".

Thank you for benchmarking this on Windows. Do these programs have large .debug_str sections? (Windows builds normally use CodeView/PDB, not DWARF).
I think only any effect is likely only observed with large .debug_str sections. To observe the effect of replacing the hash function, I need to compute the time ratio of hashing

The remaining part of lld remains unchanged. Consequently, a lower ratio indicates that hashing is faster.

I have observed ~2% speedup on an x86-64 machine, but the result on an aarch64 machine (Cavium ThunderX2) is so-so.
In PiotrZSL's setup the speedup seems larger?

I think xxh3 is still worth it and we can probably drop the xxh64 implementation once internal implementations have migrated.

% hyperfine --warmup 1 --min-runs 16 "numactl -C 16-23 "{/tmp/lld0,/tmp/lld1}" -flavor gnu @response.txt --threads=8"
Benchmark 1: numactl -C 16-23 /tmp/lld0 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      7.393 s ±  0.089 s    [User: 17.077 s, System: 4.274 s]
  Range (min … max):    7.208 s …  7.561 s    16 runs

Benchmark 2: numactl -C 16-23 /tmp/lld1 -flavor gnu @response.txt --threads=8
  Time (mean ± σ):      7.351 s ±  0.093 s    [User: 16.757 s, System: 4.236 s]
  Range (min … max):    7.165 s …  7.519 s    16 runs
MaskRay updated this revision to Diff 541697.Jul 18 2023, 1:15 PM
MaskRay marked 5 inline comments as done.

address comment

llvm/lib/Support/xxhash.cpp
3

Yes, I'll do that. Changing this in the differential would cause clang-format to format all the text below, which is I'd want to avoid. I'll change this when landing...

151

Sounds good.

I was to retain the original code style, but since the original code has been mostly rewritten, changing the macros should be fine as well...

MaskRay updated this revision to Diff 541706.Jul 18 2023, 1:30 PM
MaskRay edited the summary of this revision. (Show Details)

update description

"In PiotrZSL's setup the speedup seems larger?"

I was running this on Ryzen 7 3800X under Ubuntu on single thread with 1.3GB clang binary linking.
When running this with more threads, gain looks smaller because that heavy chunk is split into multiple threads, and gain on entire application seems smaller.

This revision was landed with ongoing or failed builds.Jul 18 2023, 1:36 PM
This revision was automatically updated to reflect the committed changes.

I haven't looked at the code but I have built ld.lld with this patch and the patch from D154813 on Windows with both VS 2022 cl 19.36.32537 and clang-cl 16.0.6. I have run the usual benchmarks (clang, chrome, mozilla, scylla) and also a UE5 link and I don't see any difference in performance on my Windows 10 PC, i.e. any delta is in the measurement "noise".

Thank you for benchmarking this on Windows. Do these programs have large .debug_str sections? (Windows builds normally use CodeView/PDB, not DWARF).
I think only any effect is likely only observed with large .debug_str sections. To observe the effect of replacing the hash function, I need to compute the time ratio of hashing

The tests that I ran were all ELF links and one of the clang links included DWARF debug.

Out of interest, which compiler did you use to build your Linux lld?

I haven't looked at the code but I have built ld.lld with this patch and the patch from D154813 on Windows with both VS 2022 cl 19.36.32537 and clang-cl 16.0.6. I have run the usual benchmarks (clang, chrome, mozilla, scylla) and also a UE5 link and I don't see any difference in performance on my Windows 10 PC, i.e. any delta is in the measurement "noise".

Thank you for benchmarking this on Windows. Do these programs have large .debug_str sections? (Windows builds normally use CodeView/PDB, not DWARF).
I think only any effect is likely only observed with large .debug_str sections. To observe the effect of replacing the hash function, I need to compute the time ratio of hashing

The tests that I ran were all ELF links and one of the clang links included DWARF debug.

Out of interest, which compiler did you use to build your Linux lld?

I use a close-to-trunk prebuilt Clang from Chromium: curl -s https://raw.githubusercontent.com/chromium/chromium/main/tools/clang/scripts/update.py | python3 - --output-dir=~/Stable