This is an archive of the discontinued LLVM Phabricator instance.

Optimize SHA1 implementation
ClosedPublic

Authored by terrelln on Oct 21 2019, 8:46 PM.

Details

Summary
  • Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided __attribute__((always_inline)) because gcc and clang will inline without it, and improves compatibility.
  • Replace the byte-by-byte copy in update() with endian::readbe32() since perf reports that 1/2 of the time is spent copying into the buffer before this patch.
  • Add a hash-benchmark to measure the performance improvement.

When lld uses --build-id=sha1 it spends 30-45% of CPU in SHA1 depending on the binary (not wall-time since it is parallel). This patch speeds up SHA1 by a factor of 2 on clang-8 and 3 on gcc-6. This leads to a >10% improvement in overall linking time.

Unit tests

ninja check-llvm

LLD speed

lld-speed-test benchmarks run on an Intel i9-9900k with Turbo disabled on CPU 0 compiled with clang-9. Stats recorded with perf stat -r 5. All inputs are using --build-id=sha1.

InputBefore (seconds)After (seconds)
chrome2.141.82 (-15%)
chrome-icf2.562.29 (-10%)
clang0.650.53 (-18%)
clang-fsds0.690.58 (-16%)
clang-gdb-index21.7119.3 (-11%)
gold0.420.34 (-19%)
gold-fsds0.4310.355 (-17%)
linux-kernel0.6250.575 (-8%)
llvm-as0.0450.039 (-14%)
llvm-as-fsds0.0350.039 (-11%)
mozilla11.39.8 (-13%)
mozilla-gc11.8410.36 (-12%)
mozilla-O08.25.84 (-28%)
scylla5.594.52 (-19%)

Microbenchmarks

Compiled with clang-8:

Before:

2019-10-16 11:33:41
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              5146 ns       5145 ns     137203
BM_SHA1/4096             20043 ns      20040 ns      32644
BM_SHA1/32768           154810 ns     154803 ns       4401
BM_SHA1/262144         1281332 ns    1281244 ns        555
BM_SHA1/1048576        5154688 ns    5154100 ns        137

After:

2019-10-16 11:34:20
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              3071 ns       3070 ns     241890
BM_SHA1/4096             10491 ns      10491 ns      64873
BM_SHA1/32768            82802 ns      82791 ns       8533
BM_SHA1/262144          685598 ns     685595 ns       1069
BM_SHA1/1048576        2593819 ns    2593495 ns        265

Compiled with gcc-6:

Before:

2019-10-16 11:36:05
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              8770 ns       8769 ns      80651
BM_SHA1/4096             34161 ns      34159 ns      20583
BM_SHA1/32768           271183 ns     271154 ns       2565
BM_SHA1/262144         2140979 ns    2140434 ns        332
BM_SHA1/1048576        8376018 ns    8374622 ns         83

After:

2019-10-16 11:34:58
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              2892 ns       2892 ns     254677
BM_SHA1/4096             10300 ns      10299 ns      72058
BM_SHA1/32768            82527 ns      82527 ns       8880
BM_SHA1/262144          629433 ns     629358 ns       1080
BM_SHA1/1048576        2669301 ns    2669137 ns        272

Diff Detail

Event Timeline

terrelln created this revision.Oct 21 2019, 8:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 21 2019, 8:46 PM
terrelln edited the summary of this revision. (Show Details)Oct 21 2019, 8:47 PM
terrelln edited the summary of this revision. (Show Details)
terrelln edited the summary of this revision. (Show Details)
ruiu added a comment.Oct 22 2019, 11:39 AM

This looks good, though I'm not an expert in this area.

If you really want to make this fast, I think Intel processors have special-purpose instructions to accelerate SHA1 computation. Have you consider using them?

llvm/unittests/Support/raw_sha1_ostream_test.cpp
46

Do we have a test that calls update() with a chunk of data larger than BLOCK_LENGTH?

terrelln marked an inline comment as done.Oct 22 2019, 1:01 PM

If you really want to make this fast, I think Intel processors have special-purpose instructions to accelerate SHA1 computation. Have you consider using them?

Yeah, I don't think this is the fastest it could be. Switching to XXH or MD5 speeds up LLD by another ~3%, and each of those is IO bound. So there is a little bit left to gain in the SHA1 implementation for LLD before it is completely IO bound. And in use cases that are operating on hot memory (in L3 at least) there is a lot more to gain. But... this gets most of the benefit for LLD for a little work.

Someone could certainly take this further and add an implementation using SIMD or hardware intrinsics.

llvm/unittests/Support/raw_sha1_ostream_test.cpp
55

Yup, right here

ruiu accepted this revision.Oct 22 2019, 1:31 PM

LGTM

This revision is now accepted and ready to land.Oct 22 2019, 1:31 PM
MaskRay added a comment.EditedOct 22 2019, 10:51 PM

Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided attribute((always_inline)) because gcc and clang will inline without it, and improves compatibility.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

The code change looks good, except that the llvm/benchmarks/ directory is underused and you are adding more files. We probably need more consensus (e.g. send an email to the llvm-dev mailing list, referencing http://lists.llvm.org/pipermail/llvm-dev/2018-August/125173.html @kbobyrev) on how benchmark files should be organized.

llvm/unittests/Support/raw_sha1_ostream_test.cpp
48

This can be a const char [] or StringRef, can't it?

Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided attribute((always_inline)) because gcc and clang will inline without it, and improves compatibility.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

The code change looks good, except that the llvm/benchmarks/ directory is underused and you are adding more files.

Yes, I think putting files into llvm/benchmarks would be sensible especially given the lack of other benchmarks right now.

We probably need more consensus (e.g. send an email to the llvm-dev mailing list, referencing http://lists.llvm.org/pipermail/llvm-dev/2018-August/125173.html @kbobyrev) on how benchmark files should be organized.

I also agree with this. I think llvm/benchmarks would be fine for now either way, but with more benchmarks (which hopefully would be there at some point, I had plans of adding some for a while now but I'm not sure when I'll get to that) subdirectories make sense.

Thanks for CCing me!

I also agree with this. I think llvm/benchmarks would be fine for now either way, but with more benchmarks (which hopefully would be there at some point, I had plans of adding some for a while now but I'm not sure when I'll get to that) subdirectories make sense.

I agree it would be better in llvm/benchmarks, and that is what I tried that first, but had problems with getting the CMakeFile.txt right. I set up the file like this:

set(LLVM_LINK_COMPONENTS
  Support)

add_benchmark(DummyYAML DummyYAML.cpp)
add_benchmark(hash-benchmark hash-benchmark.cpp)

But the first benchmark complained that hash-benchmark.cpp is unused, and the second complained that DummyYAML.cpp is unused in LLVMProcessSources.cmake:112. If someone can help me with the CMake, I'd be glad to switch it back.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

Yeah, clang -O3 inlines all of them, but gcc -O3 inlines most, but not all of them before this patch. gcc-6 doesn't inline at least some calls of r1, blk, and r3.

terrelln marked 2 inline comments as done.Oct 24 2019, 6:00 PM
terrelln added inline comments.
llvm/unittests/Support/raw_sha1_ostream_test.cpp
48

I add it 4 times on line 55 to get an input that is larger than BLOCK_LENGTH. For that I need a string.

terrelln marked an inline comment as done.Oct 24 2019, 6:01 PM

I also agree with this. I think llvm/benchmarks would be fine for now either way, but with more benchmarks (which hopefully would be there at some point, I had plans of adding some for a while now but I'm not sure when I'll get to that) subdirectories make sense.

I agree it would be better in llvm/benchmarks, and that is what I tried that first, but had problems with getting the CMakeFile.txt right. I set up the file like this:

set(LLVM_LINK_COMPONENTS
  Support)

add_benchmark(DummyYAML DummyYAML.cpp)
add_benchmark(hash-benchmark hash-benchmark.cpp)

But the first benchmark complained that hash-benchmark.cpp is unused, and the second complained that DummyYAML.cpp is unused in LLVMProcessSources.cmake:112. If someone can help me with the CMake, I'd be glad to switch it back.

Ah, good point! I'll look into that, thank you for bringing it up.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

Yeah, clang -O3 inlines all of them, but gcc -O3 inlines most, but not all of them before this patch. gcc-6 doesn't inline at least some calls of r1, blk, and r3.

@terrelln No one has responded to http://lists.llvm.org/pipermail/llvm-dev/2019-October/136337.html yet. I think we can just commit the code change for now. Do you need help for committing the patch?

terrelln updated this revision to Diff 228793.Nov 11 2019, 6:28 PM

Remove the benchmark

@terrelln No one has responded to http://lists.llvm.org/pipermail/llvm-dev/2019-October/136337.html yet. I think we can just commit the code change for now. Do you need help for committing the patch?

@MaskRay yeah, I don't have commit access, so that would be great!

MaskRay updated this revision to Diff 228810.Nov 11 2019, 10:15 PM

Small adjustment

MaskRay accepted this revision.Nov 11 2019, 10:15 PM
This revision was automatically updated to reflect the committed changes.