This is an archive of the discontinued LLVM Phabricator instance.

Differential D69295

Optimize SHA1 implementation
Closed · Public

Authored by terrelln on Oct 21 2019, 8:46 PM.

Details

Reviewers

ruiu
MaskRay

Commits

rG43ff63477256: [Support] Optimize SHA1 implementation

Summary

Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided __attribute__((always_inline)) because gcc and clang will inline without it, and leaving it out improves compatibility.
Replace the byte-by-byte copy in update() with endian::read32be() since perf reports that half of the time is spent copying into the buffer before this patch.
Add a hash-benchmark to measure the performance improvement.

When lld uses --build-id=sha1 it spends 30-45% of CPU in SHA1 depending on the binary (not wall-time since it is parallel). This patch speeds up SHA1 by a factor of 2 on clang-8 and 3 on gcc-6. This leads to a >10% improvement in overall linking time.
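To illustrate the buffer-filling change described above, here is a minimal sketch that contrasts a byte-at-a-time fill of a 16-word block with a word-at-a-time big-endian load. The names `loadBE32`, `fillBytewise`, and `fillWordwise` are hypothetical; `loadBE32` is a portable stand-in for `support::endian::read32be`, and neither fill function is the code from SHA1.cpp.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for support::endian::read32be: assemble a 32-bit
// value from four bytes stored most-significant byte first.
static inline uint32_t loadBE32(const uint8_t *P) {
  return (uint32_t(P[0]) << 24) | (uint32_t(P[1]) << 16) |
         (uint32_t(P[2]) << 8) | uint32_t(P[3]);
}

constexpr size_t BlockLength = 64; // SHA-1 block size in bytes

// Byte-at-a-time fill: one shift-and-or per input byte.
std::array<uint32_t, 16> fillBytewise(const uint8_t *Data) {
  std::array<uint32_t, 16> Words{};
  for (size_t I = 0; I < BlockLength; ++I)
    Words[I / 4] = (Words[I / 4] << 8) | Data[I];
  return Words;
}

// Word-at-a-time fill: one big-endian load per 32-bit word.
std::array<uint32_t, 16> fillWordwise(const uint8_t *Data) {
  std::array<uint32_t, 16> Words{};
  for (size_t I = 0; I < BlockLength / 4; ++I)
    Words[I] = loadBE32(Data + I * 4);
  return Words;
}
```

Both functions produce the same 16 words for a 64-byte input; the second does a quarter as many loop iterations, and compilers can typically lower each load to a single load plus byte swap on little-endian targets.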

Unit tests

ninja check-llvm

LLD speed

lld-speed-test benchmarks run on an Intel i9-9900k with Turbo disabled on CPU 0 compiled with clang-9. Stats recorded with perf stat -r 5. All inputs are using --build-id=sha1.

Input	Before (seconds)	After (seconds)
chrome	2.14	1.82 (-15%)
chrome-icf	2.56	2.29 (-10%)
clang	0.65	0.53 (-18%)
clang-fsds	0.69	0.58 (-16%)
clang-gdb-index	21.71	19.3 (-11%)
gold	0.42	0.34 (-19%)
gold-fsds	0.431	0.355 (-17%)
linux-kernel	0.625	0.575 (-8%)
llvm-as	0.045	0.039 (-14%)
llvm-as-fsds	0.035	0.031 (-11%)
mozilla	11.3	9.8 (-13%)
mozilla-gc	11.84	10.36 (-12%)
mozilla-O0	8.2	5.84 (-28%)
scylla	5.59	4.52 (-19%)

Microbenchmarks

Compiled with clang-8:

Before:

2019-10-16 11:33:41
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              5146 ns       5145 ns     137203
BM_SHA1/4096             20043 ns      20040 ns      32644
BM_SHA1/32768           154810 ns     154803 ns       4401
BM_SHA1/262144         1281332 ns    1281244 ns        555
BM_SHA1/1048576        5154688 ns    5154100 ns        137

After:

2019-10-16 11:34:20
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              3071 ns       3070 ns     241890
BM_SHA1/4096             10491 ns      10491 ns      64873
BM_SHA1/32768            82802 ns      82791 ns       8533
BM_SHA1/262144          685598 ns     685595 ns       1069
BM_SHA1/1048576        2593819 ns    2593495 ns        265

Compiled with gcc-6:

Before:

2019-10-16 11:36:05
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              8770 ns       8769 ns      80651
BM_SHA1/4096             34161 ns      34159 ns      20583
BM_SHA1/32768           271183 ns     271154 ns       2565
BM_SHA1/262144         2140979 ns    2140434 ns        332
BM_SHA1/1048576        8376018 ns    8374622 ns         83

After:

2019-10-16 11:34:58
Running ./benchmarks/hash-benchmark/hash-benchmark
Run on (24 X 2394.48 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
-----------------------------------------------------------
Benchmark                    Time           CPU Iterations
-----------------------------------------------------------
BM_SHA1/1024              2892 ns       2892 ns     254677
BM_SHA1/4096             10300 ns      10299 ns      72058
BM_SHA1/32768            82527 ns      82527 ns       8880
BM_SHA1/262144          629433 ns     629358 ns       1080
BM_SHA1/1048576        2669301 ns    2669137 ns        272
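
To relate the microbenchmark tables to the factors quoted in the summary, the following sketch computes speedup and throughput from the reported CPU times. The helper names `speedup` and `throughputMBps` are made up for this example and are not part of the patch or of the benchmark.

```cpp
// Speedup factor: CPU time before the patch divided by CPU time after.
constexpr double speedup(double BeforeNs, double AfterNs) {
  return BeforeNs / AfterNs;
}

// Throughput in MB/s (1 MB = 10^6 bytes) for Bytes hashed in Ns nanoseconds.
// Bytes per nanosecond equals GB/s, so multiply by 1000 to get MB/s.
constexpr double throughputMBps(double Bytes, double Ns) {
  return Bytes / Ns * 1e3;
}

// CPU times for BM_SHA1/1048576, taken from the tables above.
constexpr double ClangBeforeNs = 5154100, ClangAfterNs = 2593495;
constexpr double GccBeforeNs = 8374622, GccAfterNs = 2669137;
```

For the 1 MiB rows, `speedup(ClangBeforeNs, ClangAfterNs)` is roughly 1.99 and `speedup(GccBeforeNs, GccAfterNs)` is roughly 3.14, consistent with the factors of 2 and 3 in the summary; the clang-8 build goes from about 203 MB/s to about 404 MB/s.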

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

terrelln created this revision. Oct 21 2019, 8:46 PM

Herald added a project: Restricted Project. Oct 21 2019, 8:46 PM

Herald added subscribers: hiraditya, mgorny.

terrelln edited the summary of this revision. Oct 21 2019, 8:47 PM

terrelln edited the summary of this revision.

Harbormaster completed remote builds in B39897: Diff 225991. Oct 21 2019, 8:50 PM

This looks good, though I'm not an expert in this area.

If you really want to make this fast, I think Intel processors have special-purpose instructions to accelerate SHA1 computation. Have you considered using them?

llvm/unittests/Support/raw_sha1_ostream_test.cpp
46	Do we have a test that calls update() with a chunk of data larger than BLOCK_LENGTH?

If you really want to make this fast, I think Intel processors have special-purpose instructions to accelerate SHA1 computation. Have you considered using them?

Yeah, I don't think this is the fastest it could be. Switching to XXH or MD5 speeds up LLD by another ~3%, and each of those is IO bound. So there is a little bit left to gain in the SHA1 implementation for LLD before it is completely IO bound. And in use cases that operate on hot memory (in L3 at least) there is a lot more to gain. But this gets most of the benefit for LLD for little work.

Someone could certainly take this further and add an implementation using SIMD or hardware intrinsics.

llvm/unittests/Support/raw_sha1_ostream_test.cpp
55	Yup, right here

LGTM

This revision is now accepted and ready to land. Oct 22 2019, 1:31 PM

Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided __attribute__((always_inline)) because gcc and clang will inline without it, and leaving it out improves compatibility.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

The code change looks good, except that the llvm/benchmarks/ directory is underused and you are adding more files. We probably need more consensus (e.g. send an email to the llvm-dev mailing list, referencing http://lists.llvm.org/pipermail/llvm-dev/2018-August/125173.html @kbobyrev) on how benchmark files should be organized.

llvm/unittests/Support/raw_sha1_ostream_test.cpp
48	This can be a `const char []` or StringRef, can't it?

MaskRay added a subscriber: kbobyrev. Oct 22 2019, 10:54 PM

In D69295#1718448, @MaskRay wrote:

Add inline to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided __attribute__((always_inline)) because gcc and clang will inline without it, and leaving it out improves compatibility.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

The code change looks good, except that the llvm/benchmarks/ directory is underused and you are adding more files.

Yes, I think putting files into llvm/benchmarks would be sensible especially given the lack of other benchmarks right now.

We probably need more consensus (e.g. send an email to the llvm-dev mailing list, referencing http://lists.llvm.org/pipermail/llvm-dev/2018-August/125173.html @kbobyrev) on how benchmark files should be organized.

I also agree with this. I think llvm/benchmarks would be fine for now either way, but with more benchmarks (which hopefully would be there at some point, I had plans of adding some for a while now but I'm not sure when I'll get to that) subdirectories make sense.

Thanks for CCing me!

I also agree with this. I think llvm/benchmarks would be fine for now either way, but with more benchmarks (which hopefully would be there at some point, I had plans of adding some for a while now but I'm not sure when I'll get to that) subdirectories make sense.

I agree it would be better in llvm/benchmarks, and that is what I tried first, but I had problems getting the CMakeLists.txt right. I set up the file like this:

set(LLVM_LINK_COMPONENTS
  Support)

add_benchmark(DummyYAML DummyYAML.cpp)
add_benchmark(hash-benchmark hash-benchmark.cpp)

But the first benchmark complained that hash-benchmark.cpp is unused, and the second complained that DummyYAML.cpp is unused in LLVMProcessSources.cmake:112. If someone can help me with the CMake, I'd be glad to switch it back.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

Yeah, clang -O3 inlines all of them, but before this patch gcc -O3 inlines most, not all, of them. gcc-6 doesn't inline at least some calls of r1, blk, and r3.

terrelln marked 2 inline comments as done. Oct 24 2019, 6:00 PM

terrelln added inline comments.

llvm/unittests/Support/raw_sha1_ostream_test.cpp
48	I add it 4 times on line 55 to get an input that is larger than `BLOCK_LENGTH`. For that I need a string.
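
The byte accounting behind this test can be checked with a small model of the three phases in the new update(): finish the partial block, take whole blocks through the fast path, buffer the tail. This is only a sketch; `BufferModel` and its fields are hypothetical names that track sizes, not data, and `BlockLength` mirrors the 64-byte SHA-1 block.

```cpp
#include <algorithm>
#include <cstddef>

constexpr size_t BlockLength = 64; // SHA-1 block size in bytes

// Hypothetical model of the buffer state: tracks only how many bytes are
// buffered and how many blocks have been hashed, not the data itself.
struct BufferModel {
  size_t BufferOffset = 0;   // bytes waiting in the partial block
  size_t BlocksHashed = 0;   // blocks hashed in total
  size_t FastPathBlocks = 0; // blocks taken by the word-at-a-time loop

  void update(size_t Size) {
    // Phase 1: finish the partially filled block, if any.
    if (BufferOffset > 0) {
      size_t Head = std::min(Size, BlockLength - BufferOffset);
      BufferOffset += Head;
      Size -= Head;
      if (BufferOffset == BlockLength) {
        ++BlocksHashed;
        BufferOffset = 0;
      }
    }
    // Phase 2: whole blocks go through the fast path.
    while (Size >= BlockLength) {
      ++BlocksHashed;
      ++FastPathBlocks;
      Size -= BlockLength;
    }
    // Phase 3: buffer the tail (zero if phase 1 consumed everything).
    BufferOffset += Size;
  }
};
```

Feeding it the sizes from the test (30, 30, 30, then 120) leaves 26 bytes buffered after the short updates; the long update spends 38 bytes finishing that block, sends one full block through the fast path, and leaves 18 bytes buffered, which matches the "18 bytes buffered now" comment in the test.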

terrelln marked an inline comment as done. Oct 24 2019, 6:01 PM

In D69295#1719069, @terrelln wrote:
I also agree with this. I think llvm/benchmarks would be fine for now either way, but with more benchmarks (which hopefully would be there at some point, I had plans of adding some for a while now but I'm not sure when I'll get to that) subdirectories make sense.

I agree it would be better in llvm/benchmarks, and that is what I tried first, but I had problems getting the CMakeLists.txt right. I set up the file like this:

set(LLVM_LINK_COMPONENTS
  Support)

add_benchmark(DummyYAML DummyYAML.cpp)
add_benchmark(hash-benchmark hash-benchmark.cpp)

But the first benchmark complained that hash-benchmark.cpp is unused, and the second complained that DummyYAML.cpp is unused in LLVMProcessSources.cmake:112. If someone can help me with the CMake, I'd be glad to switch it back.

Ah, good point! I'll look into that, thank you for bringing it up.

Is it a -DCMAKE_BUILD_TYPE=Release build? I will be pretty surprised if gcc/clang -O3 does not inline these functions.

Yeah, clang -O3 inlines all of them, but before this patch gcc -O3 inlines most, not all, of them. gcc-6 doesn't inline at least some calls of r1, blk, and r3.

@terrelln No one has responded to http://lists.llvm.org/pipermail/llvm-dev/2019-October/136337.html yet. I think we can just commit the code change for now. Do you need help for committing the patch?

Remove the benchmark

@terrelln No one has responded to http://lists.llvm.org/pipermail/llvm-dev/2019-October/136337.html yet. I think we can just commit the code change for now. Do you need help for committing the patch?

@MaskRay yeah, I don't have commit access, so that would be great!

Harbormaster completed remote builds in B40787: Diff 228793. Nov 11 2019, 6:36 PM

Small adjustment

MaskRay accepted this revision. Nov 11 2019, 10:15 PM

Harbormaster completed remote builds in B40790: Diff 228810. Nov 11 2019, 10:19 PM

Closed by commit rG43ff63477256: [Support] Optimize SHA1 implementation (authored by terrelln, committed by MaskRay). Nov 11 2019, 10:19 PM

This revision was automatically updated to reflect the committed changes.

MaskRay mentioned this in D113991: Support using sha256 as --build-id kind. Nov 17 2021, 7:45 PM

Revision Contents

Path	Size
llvm/lib/Support/SHA1.cpp	54 lines
llvm/unittests/Support/raw_sha1_ostream_test.cpp	16 lines

Diff 228811

llvm/lib/Support/SHA1.cpp

[10 lines collapsed]
 // http://cvsweb.netbsd.org/bsdweb.cgi/src/common/lib/libc/hash/sha1/sha1.c?rev=1.6)
 // and modified by wrapping it in a C++ interface for LLVM,
 // and removing unnecessary code.
 //
 //===----------------------------------------------------------------------===//

 #include "llvm/Support/SHA1.h"
 #include "llvm/ADT/ArrayRef.h"
+#include "llvm/Support/Endian.h"
 #include "llvm/Support/Host.h"
 using namespace llvm;

 #include <stdint.h>
 #include <string.h>

 #if defined(BYTE_ORDER) && defined(BIG_ENDIAN) && BYTE_ORDER == BIG_ENDIAN
 #define SHA_BIG_ENDIAN
 #endif

-static uint32_t rol(uint32_t Number, int Bits) {
+static inline uint32_t rol(uint32_t Number, int Bits) {
   return (Number << Bits) | (Number >> (32 - Bits));
 }

-static uint32_t blk0(uint32_t *Buf, int I) { return Buf[I]; }
+static inline uint32_t blk0(uint32_t *Buf, int I) { return Buf[I]; }

-static uint32_t blk(uint32_t *Buf, int I) {
+static inline uint32_t blk(uint32_t *Buf, int I) {
   Buf[I & 15] = rol(Buf[(I + 13) & 15] ^ Buf[(I + 8) & 15] ^ Buf[(I + 2) & 15] ^
                         Buf[I & 15],
                     1);
   return Buf[I & 15];
 }

-static void r0(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
-               int I, uint32_t *Buf) {
+static inline void r0(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D,
+                      uint32_t &E, int I, uint32_t *Buf) {
   E += ((B & (C ^ D)) ^ D) + blk0(Buf, I) + 0x5A827999 + rol(A, 5);
   B = rol(B, 30);
 }

-static void r1(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
-               int I, uint32_t *Buf) {
+static inline void r1(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D,
+                      uint32_t &E, int I, uint32_t *Buf) {
   E += ((B & (C ^ D)) ^ D) + blk(Buf, I) + 0x5A827999 + rol(A, 5);
   B = rol(B, 30);
 }

-static void r2(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
-               int I, uint32_t *Buf) {
+static inline void r2(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D,
+                      uint32_t &E, int I, uint32_t *Buf) {
   E += (B ^ C ^ D) + blk(Buf, I) + 0x6ED9EBA1 + rol(A, 5);
   B = rol(B, 30);
 }

-static void r3(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
-               int I, uint32_t *Buf) {
+static inline void r3(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D,
+                      uint32_t &E, int I, uint32_t *Buf) {
   E += (((B | C) & D) | (B & C)) + blk(Buf, I) + 0x8F1BBCDC + rol(A, 5);
   B = rol(B, 30);
 }

-static void r4(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
-               int I, uint32_t *Buf) {
+static inline void r4(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D,
+                      uint32_t &E, int I, uint32_t *Buf) {
   E += (B ^ C ^ D) + blk(Buf, I) + 0xCA62C1D6 + rol(A, 5);
   B = rol(B, 30);
 }

 /* code */
 #define SHA1_K0 0x5a827999
 #define SHA1_K20 0x6ed9eba1
 #define SHA1_K40 0x8f1bbcdc
[129 lines collapsed]
 }

 void SHA1::writebyte(uint8_t Data) {
   ++InternalState.ByteCount;
   addUncounted(Data);
 }

 void SHA1::update(ArrayRef<uint8_t> Data) {
-  for (auto &C : Data)
-    writebyte(C);
+  InternalState.ByteCount += Data.size();
+
+  // Finish the current block.
+  if (InternalState.BufferOffset > 0) {
+    const size_t Remainder = std::min<size_t>(
+        Data.size(), BLOCK_LENGTH - InternalState.BufferOffset);
+    for (size_t I = 0; I < Remainder; ++I)
+      addUncounted(Data[I]);
+    Data = Data.drop_front(Remainder);
+  }
+
+  // Fast buffer filling for large inputs.
+  while (Data.size() >= BLOCK_LENGTH) {
+    assert(InternalState.BufferOffset == 0);
+    assert(BLOCK_LENGTH % 4 == 0);
+    constexpr size_t BLOCK_LENGTH_32 = BLOCK_LENGTH / 4;
+    for (size_t I = 0; I < BLOCK_LENGTH_32; ++I)
+      InternalState.Buffer.L[I] = support::endian::read32be(&Data[I * 4]);
+    hashBlock();
+    Data = Data.drop_front(BLOCK_LENGTH);
+  }
+
+  // Finish the remainder.
+  for (uint8_t C : Data)
+    addUncounted(C);
 }

 void SHA1::pad() {
   // Implement SHA-1 padding (fips180-2 5.1.1)

   // Pad with 0x80 followed by 0x00 until the end of the block
   addUncounted(0x80);
   while (InternalState.BufferOffset != 56)
[58 lines collapsed]

llvm/unittests/Support/raw_sha1_ostream_test.cpp

[37 lines collapsed]

 TEST(sha1_hash_test, Basic) {
   ArrayRef<uint8_t> Input((const uint8_t *)"Hello World!", 12);
   std::array<uint8_t, 20> Vec = SHA1::hash(Input);
   std::string Hash = toHex({(const char *)Vec.data(), 20});
   ASSERT_EQ("2EF7BDE608CE5404E97D5F042F95F89F1C232871", Hash);
 }

+TEST(sha1_hash_test, Update) {
+  SHA1 sha1;
+  std::string Input = "123456789012345678901234567890";
+  ASSERT_EQ(Input.size(), 30UL);
+  // 3 short updates.
+  sha1.update(Input);
+  sha1.update(Input);
+  sha1.update(Input);
+  // Long update that gets into the optimized loop with prefix/suffix.
+  sha1.update(Input + Input + Input + Input);
+  // 18 bytes buffered now.
+
+  std::string Hash = toHex(sha1.final());
+  ASSERT_EQ("3E4A614101AD84985AB0FE54DC12A6D71551E5AE", Hash);
+}
+
 // Check that getting the intermediate hash in the middle of the stream does
 // not invalidate the final result.
 TEST(raw_sha1_ostreamTest, Intermediate) {
   llvm::raw_sha1_ostream Sha1Stream;
   Sha1Stream << "Hello";
   auto Hash = toHex(Sha1Stream.sha1());

   ASSERT_EQ("F7FF9E8B7BB2E09B70935A5D785E0CC5D9D0ABF0", Hash);
[24 lines collapsed]

Inline comments on this file (all marked done):
46	ruiu: Do we have a test that calls update() with a chunk of data larger than BLOCK_LENGTH?
48	MaskRay: This can be a `const char []` or StringRef, can't it?
48	terrelln: I add it 4 times on line 55 to get an input that is larger than `BLOCK_LENGTH`. For that I need a string.
55	terrelln: Yup, right here