This is an archive of the discontinued LLVM Phabricator instance.

Differential D26890

SHA1: unroll loop in hashBlock.
Closed, Public

Authored by ruiu on Nov 19 2016, 3:12 PM.

Details

Reviewers

joerg
mehdi_amini

Commits

rGfe33661ab093: SHA1: unroll loop in hashBlock.
rL287473: SHA1: unroll loop in hashBlock.

Summary

This code is taken from the public domain:
https://github.com/jsonn/src/blob/trunk/common/lib/libc/hash/sha1/sha1.c

I wrote a sha1 command and ran it on my Xeon E5-2680 v2 2.80GHz machine.
Here is the result. The new hash function is 37% faster than before (6.63 s vs. 4.16 s elapsed).

Performance counter stats for './llvm-sha1-old /ssd/build/bin/lld' (10 runs):

    6640.503687 task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.03% )
             54 context-switches          #    0.008 K/sec                    ( +-  5.03% )
              5 cpu-migrations            #    0.001 K/sec                    ( +- 31.73% )
        183,803 page-faults               #    0.028 M/sec                    ( +-  0.00% )
 18,527,954,113 cycles                    #    2.790 GHz                      ( +-  0.03% )
  4,993,237,485 stalled-cycles-frontend   #   26.95% frontend cycles idle     ( +-  0.11% )
<not supported> stalled-cycles-backend
 50,217,149,423 instructions              #    2.71  insns per cycle
                                          #    0.10  stalled cycles per insn  ( +-  0.00% )
  6,094,322,337 branches                  #  917.750 M/sec                    ( +-  0.00% )
     11,778,239 branch-misses             #    0.19% of all branches          ( +-  0.01% )

    6.634017401 seconds time elapsed                                          ( +-  0.03% )

Performance counter stats for './llvm-sha1-new /ssd/build/bin/lld' (10 runs):

    4167.062720 task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.02% )
             52 context-switches          #    0.012 K/sec                    ( +- 16.45% )
              7 cpu-migrations            #    0.002 K/sec                    ( +- 32.20% )
        183,804 page-faults               #    0.044 M/sec                    ( +-  0.00% )
 11,626,611,958 cycles                    #    2.790 GHz                      ( +-  0.02% )
  4,491,897,976 stalled-cycles-frontend   #   38.63% frontend cycles idle     ( +-  0.05% )
<not supported> stalled-cycles-backend
 24,320,180,617 instructions              #    2.09  insns per cycle
                                          #    0.18  stalled cycles per insn  ( +-  0.00% )
  1,574,674,576 branches                  #  377.886 M/sec                    ( +-  0.00% )
     11,769,693 branch-misses             #    0.75% of all branches          ( +-  0.00% )

    4.163251552 seconds time elapsed                                          ( +-  0.02% )

Diff Detail

Repository: rL LLVM

Event Timeline

ruiu updated this revision to Diff 78642. Nov 19 2016, 3:12 PM

ruiu retitled this revision to "SHA1: unroll loop in hashBlock."

ruiu updated this object.

ruiu added reviewers: mehdi_amini, joerg.

ruiu added a subscriber: llvm-commits.

davide added subscribers: chandlerc, davide. Nov 19 2016, 3:36 PM

davide added inline comments.

lib/Support/SHA1.cpp
12 ↗	(On Diff #78642)	I'd rather link the original NetBSD repo rather than Joerg's mirror (anoncvs.netbsd.org)
95–174 ↗	(On Diff #78642)	I don't think this is terrible per se (though not very readable), but @chandlerc pointed out during the DevSummit (at the libc++ performance BoF, IIRC) that we should try to avoid unrolling loops by hand in our algorithms and instead make sure the compiler does that on our behalf. For this case, I'm not sure whether LLVM knows how to unroll this loop (and if it doesn't, I'm not sure how profitable it is to teach it), but it is something to keep in mind in general.

mehdi_amini added inline comments. Nov 19 2016, 3:45 PM

lib/Support/SHA1.cpp
30 ↗	(On Diff #78642)	Why did you turn it into a macro?
51 ↗	(On Diff #78642)	Why a macro for all the `RX(..)`?
95–174 ↗	(On Diff #78642)	Yes, I wonder how the codegen is impacted if you turn the above into 4 different loops.
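
For reference, a four-loop variant of the kind asked about here might look roughly like the sketch below. It is not code from this revision: the function name `hashBlockFourLoops`, the `Step` lambda, and the plain-array interface are assumptions made for illustration. Each loop covers one 20-round group, so the round function and constant no longer depend on a per-iteration branch.

```cpp
#include <cstdint>

// Rotate left; Bits must be in [1, 31].
static uint32_t rol(uint32_t X, int Bits) {
  return (X << Bits) | (X >> (32 - Bits));
}

// Hypothetical four-loop compression function (illustration only).
// State holds the five chaining words. W holds the sixteen message words,
// already decoded as big-endian values; it is overwritten as a ring buffer.
static void hashBlockFourLoops(uint32_t State[5], uint32_t W[16]) {
  uint32_t A = State[0], B = State[1], C = State[2], D = State[3],
           E = State[4];

  // One round. F is the value of the group's boolean function, computed by
  // the caller from the current B, C, D; K is the group's constant.
  auto Step = [&](int I, uint32_t F, uint32_t K) {
    if (I >= 16) // Rounds 16..79 first extend the message schedule in place.
      W[I & 15] = rol(W[(I + 13) & 15] ^ W[(I + 8) & 15] ^
                      W[(I + 2) & 15] ^ W[I & 15], 1);
    uint32_t T = rol(A, 5) + F + E + K + W[I & 15];
    E = D;
    D = C;
    C = rol(B, 30);
    B = A;
    A = T;
  };

  for (int I = 0; I < 20; ++I)
    Step(I, (B & (C ^ D)) ^ D, 0x5A827999);
  for (int I = 20; I < 40; ++I)
    Step(I, B ^ C ^ D, 0x6ED9EBA1);
  for (int I = 40; I < 60; ++I)
    Step(I, ((B | C) & D) | (B & C), 0x8F1BBCDC);
  for (int I = 60; I < 80; ++I)
    Step(I, B ^ C ^ D, 0xCA62C1D6);

  State[0] += A;
  State[1] += B;
  State[2] += C;
  State[3] += D;
  State[4] += E;
}
```

Feeding it the single padded block for the message "abc" (W[0] = 0x61626380, W[15] = 24, all other words zero) with the standard initial state should reproduce the well-known digest a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d. Whether this form is as fast as the hand-unrolled one is exactly the open codegen question; the sketch makes no claim about that.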

davide added inline comments. Nov 19 2016, 3:51 PM

lib/Support/SHA1.cpp
30 ↗	(On Diff #78642)	Agree, not particularly fond of macros in new software unless really needed.

joerg added inline comments. Nov 19 2016, 4:19 PM

lib/Support/SHA1.cpp
12 ↗	(On Diff #78642)	Agreed, it was primarily meant as starting point. http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/common/lib/libc/hash/sha1/sha1.c?rev=1.6 would be a better link.
30 ↗	(On Diff #78642)	That's actually adapted from the NetBSD implementation. There is no huge advantage of an inline function over a macro; the macro just keeps the diff a bit smaller.
51 ↗	(On Diff #78642)	The macros reflect the building blocks of the main loop, e.g. the different constants and blocks used. Again, this could be an inline function taking references, in the hope that the compiler optimises it all away, but using a macro keeps the diff down.
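
To illustrate the "inline function with references" alternative, the sketch below compares five rounds written with the usual variable shuffle against five calls to a by-reference round function whose argument order is rotated instead. The names `roundByRef`, `fiveRoundsShuffled` and `fiveRoundsRotated` are hypothetical and only the first round group (constant 0x5A827999) is shown; the message words are passed in directly rather than read from a buffer.

```cpp
#include <cstdint>

static uint32_t rol(uint32_t X, int Bits) {
  return (X << Bits) | (X >> (32 - Bits));
}

// One round of the first group. Only E and B are written; A, C and D are
// taken by reference purely so that the caller can rotate the roles.
static void roundByRef(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D,
                       uint32_t &E, uint32_t W) {
  E += ((B & (C ^ D)) ^ D) + W + 0x5A827999 + rol(A, 5);
  B = rol(B, 30);
}

// Five rounds with an explicit shuffle after each round.
static void fiveRoundsShuffled(uint32_t S[5], const uint32_t W[5]) {
  uint32_t A = S[0], B = S[1], C = S[2], D = S[3], E = S[4];
  for (int I = 0; I < 5; ++I) {
    uint32_t T = (D ^ (B & (C ^ D))) + 0x5A827999 + rol(A, 5) + E + W[I];
    E = D;
    D = C;
    C = rol(B, 30);
    B = A;
    A = T;
  }
  S[0] = A; S[1] = B; S[2] = C; S[3] = D; S[4] = E;
}

// The same five rounds with no shuffle: each call shifts the roles of the
// five variables by one position, and after five calls they line up again.
static void fiveRoundsRotated(uint32_t S[5], const uint32_t W[5]) {
  uint32_t A = S[0], B = S[1], C = S[2], D = S[3], E = S[4];
  roundByRef(A, B, C, D, E, W[0]);
  roundByRef(E, A, B, C, D, W[1]);
  roundByRef(D, E, A, B, C, W[2]);
  roundByRef(C, D, E, A, B, W[3]);
  roundByRef(B, C, D, E, A, W[4]);
  S[0] = A; S[1] = B; S[2] = C; S[3] = D; S[4] = E;
}
```

Both functions should leave identical values in `S` for any input. Because the rotation has period five and SHA-1 has 80 rounds, the fully unrolled form ends with the variables back in their original roles.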

Fix the original source URL
Convert macros into functions

If I do not unroll the loop, computing a hash on a file worsened from 4.16 seconds to 5.77, so there's definitely an impact.

LGTM, Thanks!

In D26890#600802, @ruiu wrote:

If I do not unroll the loop, computing a hash on a file worsened from 4.16 seconds to 5.77, so there's definitely an impact.

Please file a PR on llvm.org with the non-unrolled version of this code, and CC me. The optimization folks should look at this!

lib/Support/SHA1.cpp
82 ↗	(On Diff #78643)	Please use static free function if there is no need to access the state of the object.

This revision is now accepted and ready to land. Nov 19 2016, 4:50 PM

mehdi_amini added inline comments. Nov 19 2016, 4:52 PM

lib/Support/SHA1.cpp
51 ↗	(On Diff #78642)	"Keeping the diff down" is not a metric I value, especially if it is achieved by using macros instead of functions.

ruiu added inline comments. Nov 19 2016, 4:53 PM

lib/Support/SHA1.cpp
82 ↗	(On Diff #78643)	blk and blk0 access InternalState, so these functions need to be members.

mehdi_amini added inline comments. Nov 19 2016, 4:57 PM

lib/Support/SHA1.cpp
82 ↗	(On Diff #78643)	Sure, but you could have an extra arg to access the buffer, like:
	uint32_t SHA1::blk(int I, uint32_t* Buffer) {
	  Buffer[I & 15] = rol(Buffer[(I + 13) & 15] ^ Buffer[(I + 8) & 15] ^
	                       Buffer[(I + 2) & 15] ^ Buffer[I & 15], 1);
	  return Buffer[I & 15];
	}
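
The index arithmetic in `blk` treats the 16-word buffer as a ring: modulo 16, I + 13, I + 8, I + 2 and I correspond to I - 3, I - 8, I - 14 and I - 16 in the textbook 80-word message schedule. The following sketch checks that correspondence. `blk` follows the free-function shape that ended up in the committed diff; `expandSchedule` and `schedulesAgree` are hypothetical helpers added for the comparison.

```cpp
#include <cstdint>

static uint32_t rol(uint32_t X, int Bits) {
  return (X << Bits) | (X >> (32 - Bits));
}

// In-place schedule step over a 16-word ring buffer.
static uint32_t blk(uint32_t *Buf, int I) {
  Buf[I & 15] = rol(Buf[(I + 13) & 15] ^ Buf[(I + 8) & 15] ^
                    Buf[(I + 2) & 15] ^ Buf[I & 15], 1);
  return Buf[I & 15];
}

// Reference: the textbook schedule that materialises all 80 words.
static void expandSchedule(const uint32_t Block[16], uint32_t W[80]) {
  for (int T = 0; T < 16; ++T)
    W[T] = Block[T];
  for (int T = 16; T < 80; ++T)
    W[T] = rol(W[T - 3] ^ W[T - 8] ^ W[T - 14] ^ W[T - 16], 1);
}

// Returns true if blk() produces the same words as the 80-entry schedule
// for rounds 16..79 of the given block.
static bool schedulesAgree(const uint32_t Block[16]) {
  uint32_t Ring[16];
  uint32_t W[80];
  for (int T = 0; T < 16; ++T)
    Ring[T] = Block[T];
  expandSchedule(Block, W);
  for (int I = 16; I < 80; ++I)
    if (blk(Ring, I) != W[I])
      return false;
  return true;
}
```

The ring buffer works because round I only ever reads words from at most 16 rounds back, so the slot being overwritten is the one that is no longer needed.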

Use static free functions instead of class member functions.

Closed by commit rL287473: SHA1: unroll loop in hashBlock. (authored by ruiu). Nov 19 2016, 5:13 PM

This revision was automatically updated to reflect the committed changes.

FYI, filed https://llvm.org/bugs/show_bug.cgi?id=31075 to report the unrolling issue.

Revision Contents

Path	Size
llvm/trunk/include/llvm/Support/SHA1.h	5 lines
llvm/trunk/lib/Support/SHA1.cpp	196 lines

Diff 78645

llvm/trunk/include/llvm/Support/SHA1.h

	(55 lines not shown)
	private:			private:
	/// Define some constants.			/// Define some constants.
	/// "static constexpr" would be cleaner but MSVC does not support it yet.			/// "static constexpr" would be cleaner but MSVC does not support it yet.
	enum { BLOCK_LENGTH = 64 };			enum { BLOCK_LENGTH = 64 };
	enum { HASH_LENGTH = 20 };			enum { HASH_LENGTH = 20 };

	// Internal State			// Internal State
	struct {			struct {
	uint32_t Buffer[BLOCK_LENGTH / 4];			union {
				uint8_t C[BLOCK_LENGTH];
				uint32_t L[BLOCK_LENGTH / 4];
				} Buffer;
	uint32_t State[HASH_LENGTH / 4];			uint32_t State[HASH_LENGTH / 4];
	uint32_t ByteCount;			uint32_t ByteCount;
	uint8_t BufferOffset;			uint8_t BufferOffset;
	} InternalState;			} InternalState;

	// Internal copy of the hash, populated and accessed on calls to result()			// Internal copy of the hash, populated and accessed on calls to result()
	uint32_t HashResult[HASH_LENGTH / 4];			uint32_t HashResult[HASH_LENGTH / 4];

	(10 lines not shown)

llvm/trunk/lib/Support/SHA1.cpp

	//======- SHA1.h - Private copy of the SHA1 implementation ---- C++ - ======//			//======- SHA1.h - Private copy of the SHA1 implementation ---- C++ - ======//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
				//
	// This code is taken from public domain			// This code is taken from public domain
	// (http://oauth.googlecode.com/svn/code/c/liboauth/src/sha1.c)			// (http://oauth.googlecode.com/svn/code/c/liboauth/src/sha1.c and
				// http://cvsweb.netbsd.org/bsdweb.cgi/src/common/lib/libc/hash/sha1/sha1.c?rev=1.6)
	// and modified by wrapping it in a C++ interface for LLVM,			// and modified by wrapping it in a C++ interface for LLVM,
	// and removing unnecessary code.			// and removing unnecessary code.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "llvm/Support/Host.h"			#include "llvm/Support/Host.h"
	#include "llvm/Support/SHA1.h"			#include "llvm/Support/SHA1.h"
	#include "llvm/ADT/ArrayRef.h"			#include "llvm/ADT/ArrayRef.h"
	using namespace llvm;			using namespace llvm;

	#include <stdint.h>			#include <stdint.h>
	#include <string.h>			#include <string.h>

	#if defined(BYTE_ORDER) && defined(BIG_ENDIAN) && BYTE_ORDER == BIG_ENDIAN			#if defined(BYTE_ORDER) && defined(BIG_ENDIAN) && BYTE_ORDER == BIG_ENDIAN
	#define SHA_BIG_ENDIAN			#define SHA_BIG_ENDIAN
	#endif			#endif

				static uint32_t rol(uint32_t number, int bits) {
				return (number << bits) | (number >> (32 - bits));
				};

				#if SHA_BIG_ENDIAN
				static uint32_t blk0(uint32_t *Buf, int I) {
				Buf[I] = (rol(Buf[I], 24) & 0xFF00FF00) | (rol(Buf[I], 8) & 0x00FF00FF);
				return Buf[I];
				}
				#else
				static uint32_t blk0(uint32_t *Buf, int I) { return Buf[I]; }
				#endif

				static uint32_t blk(uint32_t *Buf, int I) {
				Buf[I & 15] = rol(Buf[(I + 13) & 15] ^ Buf[(I + 8) & 15] ^ Buf[(I + 2) & 15] ^
				Buf[I & 15],
				1);
				return Buf[I & 15];
				}

				static void r0(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
				int I, uint32_t *Buf) {
				E += ((B & (C ^ D)) ^ D) + blk0(Buf, I) + 0x5A827999 + rol(A, 5);
				B = rol(B, 30);
				}

				static void r1(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
				int I, uint32_t *Buf) {
				E += ((B & (C ^ D)) ^ D) + blk(Buf, I) + 0x5A827999 + rol(A, 5);
				B = rol(B, 30);
				}

				static void r2(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
				int I, uint32_t *Buf) {
				E += (B ^ C ^ D) + blk(Buf, I) + 0x6ED9EBA1 + rol(A, 5);
				B = rol(B, 30);
				}

				static void r3(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
				int I, uint32_t *Buf) {
				E += (((B | C) & D) | (B & C)) + blk(Buf, I) + 0x8F1BBCDC + rol(A, 5);
				B = rol(B, 30);
				}

				static void r4(uint32_t &A, uint32_t &B, uint32_t &C, uint32_t &D, uint32_t &E,
				int I, uint32_t *Buf) {
				E += (B ^ C ^ D) + blk(Buf, I) + 0xCA62C1D6 + rol(A, 5);
				B = rol(B, 30);
				}

	/* code */			/* code */
	#define SHA1_K0 0x5a827999			#define SHA1_K0 0x5a827999
	#define SHA1_K20 0x6ed9eba1			#define SHA1_K20 0x6ed9eba1
	#define SHA1_K40 0x8f1bbcdc			#define SHA1_K40 0x8f1bbcdc
	#define SHA1_K60 0xca62c1d6			#define SHA1_K60 0xca62c1d6

	#define SEED_0 0x67452301			#define SEED_0 0x67452301
	#define SEED_1 0xefcdab89			#define SEED_1 0xefcdab89
	#define SEED_2 0x98badcfe			#define SEED_2 0x98badcfe
	#define SEED_3 0x10325476			#define SEED_3 0x10325476
	#define SEED_4 0xc3d2e1f0			#define SEED_4 0xc3d2e1f0

	void SHA1::init() {			void SHA1::init() {
	InternalState.State[0] = SEED_0;			InternalState.State[0] = SEED_0;
	InternalState.State[1] = SEED_1;			InternalState.State[1] = SEED_1;
	InternalState.State[2] = SEED_2;			InternalState.State[2] = SEED_2;
	InternalState.State[3] = SEED_3;			InternalState.State[3] = SEED_3;
	InternalState.State[4] = SEED_4;			InternalState.State[4] = SEED_4;
	InternalState.ByteCount = 0;			InternalState.ByteCount = 0;
	InternalState.BufferOffset = 0;			InternalState.BufferOffset = 0;
	}			}

	static uint32_t rol32(uint32_t number, uint8_t bits) {
	return ((number << bits) | (number >> (32 - bits)));
	}

	void SHA1::hashBlock() {			void SHA1::hashBlock() {
	uint8_t i;			uint32_t A = InternalState.State[0];
	uint32_t a, b, c, d, e, t;			uint32_t B = InternalState.State[1];
				uint32_t C = InternalState.State[2];
	a = InternalState.State[0];			uint32_t D = InternalState.State[3];
	b = InternalState.State[1];			uint32_t E = InternalState.State[4];
	c = InternalState.State[2];
	d = InternalState.State[3];			// 4 rounds of 20 operations each. Loop unrolled.
	e = InternalState.State[4];			r0(A, B, C, D, E, 0, InternalState.Buffer.L);
	for (i = 0; i < 80; i++) {			r0(E, A, B, C, D, 1, InternalState.Buffer.L);
	if (i >= 16) {			r0(D, E, A, B, C, 2, InternalState.Buffer.L);
	t = InternalState.Buffer[(i + 13) & 15] ^			r0(C, D, E, A, B, 3, InternalState.Buffer.L);
	InternalState.Buffer[(i + 8) & 15] ^			r0(B, C, D, E, A, 4, InternalState.Buffer.L);
	InternalState.Buffer[(i + 2) & 15] ^ InternalState.Buffer[i & 15];			r0(A, B, C, D, E, 5, InternalState.Buffer.L);
	InternalState.Buffer[i & 15] = rol32(t, 1);			r0(E, A, B, C, D, 6, InternalState.Buffer.L);
	}			r0(D, E, A, B, C, 7, InternalState.Buffer.L);
	if (i < 20) {			r0(C, D, E, A, B, 8, InternalState.Buffer.L);
	t = (d ^ (b & (c ^ d))) + SHA1_K0;			r0(B, C, D, E, A, 9, InternalState.Buffer.L);
	} else if (i < 40) {			r0(A, B, C, D, E, 10, InternalState.Buffer.L);
	t = (b ^ c ^ d) + SHA1_K20;			r0(E, A, B, C, D, 11, InternalState.Buffer.L);
	} else if (i < 60) {			r0(D, E, A, B, C, 12, InternalState.Buffer.L);
	t = ((b & c) | (d & (b | c))) + SHA1_K40;			r0(C, D, E, A, B, 13, InternalState.Buffer.L);
	} else {			r0(B, C, D, E, A, 14, InternalState.Buffer.L);
	t = (b ^ c ^ d) + SHA1_K60;			r0(A, B, C, D, E, 15, InternalState.Buffer.L);
	}			r1(E, A, B, C, D, 16, InternalState.Buffer.L);
	t += rol32(a, 5) + e + InternalState.Buffer[i & 15];			r1(D, E, A, B, C, 17, InternalState.Buffer.L);
	e = d;			r1(C, D, E, A, B, 18, InternalState.Buffer.L);
	d = c;			r1(B, C, D, E, A, 19, InternalState.Buffer.L);
	c = rol32(b, 30);
	b = a;			r2(A, B, C, D, E, 20, InternalState.Buffer.L);
	a = t;			r2(E, A, B, C, D, 21, InternalState.Buffer.L);
	}			r2(D, E, A, B, C, 22, InternalState.Buffer.L);
	InternalState.State[0] += a;			r2(C, D, E, A, B, 23, InternalState.Buffer.L);
	InternalState.State[1] += b;			r2(B, C, D, E, A, 24, InternalState.Buffer.L);
	InternalState.State[2] += c;			r2(A, B, C, D, E, 25, InternalState.Buffer.L);
	InternalState.State[3] += d;			r2(E, A, B, C, D, 26, InternalState.Buffer.L);
	InternalState.State[4] += e;			r2(D, E, A, B, C, 27, InternalState.Buffer.L);
				r2(C, D, E, A, B, 28, InternalState.Buffer.L);
				r2(B, C, D, E, A, 29, InternalState.Buffer.L);
				r2(A, B, C, D, E, 30, InternalState.Buffer.L);
				r2(E, A, B, C, D, 31, InternalState.Buffer.L);
				r2(D, E, A, B, C, 32, InternalState.Buffer.L);
				r2(C, D, E, A, B, 33, InternalState.Buffer.L);
				r2(B, C, D, E, A, 34, InternalState.Buffer.L);
				r2(A, B, C, D, E, 35, InternalState.Buffer.L);
				r2(E, A, B, C, D, 36, InternalState.Buffer.L);
				r2(D, E, A, B, C, 37, InternalState.Buffer.L);
				r2(C, D, E, A, B, 38, InternalState.Buffer.L);
				r2(B, C, D, E, A, 39, InternalState.Buffer.L);

				r3(A, B, C, D, E, 40, InternalState.Buffer.L);
				r3(E, A, B, C, D, 41, InternalState.Buffer.L);
				r3(D, E, A, B, C, 42, InternalState.Buffer.L);
				r3(C, D, E, A, B, 43, InternalState.Buffer.L);
				r3(B, C, D, E, A, 44, InternalState.Buffer.L);
				r3(A, B, C, D, E, 45, InternalState.Buffer.L);
				r3(E, A, B, C, D, 46, InternalState.Buffer.L);
				r3(D, E, A, B, C, 47, InternalState.Buffer.L);
				r3(C, D, E, A, B, 48, InternalState.Buffer.L);
				r3(B, C, D, E, A, 49, InternalState.Buffer.L);
				r3(A, B, C, D, E, 50, InternalState.Buffer.L);
				r3(E, A, B, C, D, 51, InternalState.Buffer.L);
				r3(D, E, A, B, C, 52, InternalState.Buffer.L);
				r3(C, D, E, A, B, 53, InternalState.Buffer.L);
				r3(B, C, D, E, A, 54, InternalState.Buffer.L);
				r3(A, B, C, D, E, 55, InternalState.Buffer.L);
				r3(E, A, B, C, D, 56, InternalState.Buffer.L);
				r3(D, E, A, B, C, 57, InternalState.Buffer.L);
				r3(C, D, E, A, B, 58, InternalState.Buffer.L);
				r3(B, C, D, E, A, 59, InternalState.Buffer.L);

				r4(A, B, C, D, E, 60, InternalState.Buffer.L);
				r4(E, A, B, C, D, 61, InternalState.Buffer.L);
				r4(D, E, A, B, C, 62, InternalState.Buffer.L);
				r4(C, D, E, A, B, 63, InternalState.Buffer.L);
				r4(B, C, D, E, A, 64, InternalState.Buffer.L);
				r4(A, B, C, D, E, 65, InternalState.Buffer.L);
				r4(E, A, B, C, D, 66, InternalState.Buffer.L);
				r4(D, E, A, B, C, 67, InternalState.Buffer.L);
				r4(C, D, E, A, B, 68, InternalState.Buffer.L);
				r4(B, C, D, E, A, 69, InternalState.Buffer.L);
				r4(A, B, C, D, E, 70, InternalState.Buffer.L);
				r4(E, A, B, C, D, 71, InternalState.Buffer.L);
				r4(D, E, A, B, C, 72, InternalState.Buffer.L);
				r4(C, D, E, A, B, 73, InternalState.Buffer.L);
				r4(B, C, D, E, A, 74, InternalState.Buffer.L);
				r4(A, B, C, D, E, 75, InternalState.Buffer.L);
				r4(E, A, B, C, D, 76, InternalState.Buffer.L);
				r4(D, E, A, B, C, 77, InternalState.Buffer.L);
				r4(C, D, E, A, B, 78, InternalState.Buffer.L);
				r4(B, C, D, E, A, 79, InternalState.Buffer.L);

				InternalState.State[0] += A;
				InternalState.State[1] += B;
				InternalState.State[2] += C;
				InternalState.State[3] += D;
				InternalState.State[4] += E;
	}			}

	void SHA1::addUncounted(uint8_t data) {			void SHA1::addUncounted(uint8_t data) {
	uint8_t *const b = (uint8_t *)InternalState.Buffer;
	#ifdef SHA_BIG_ENDIAN			#ifdef SHA_BIG_ENDIAN
	b[InternalState.BufferOffset] = data;			InternalState.Buffer.C[InternalState.BufferOffset] = data;
	#else			#else
	b[InternalState.BufferOffset ^ 3] = data;			InternalState.Buffer.C[InternalState.BufferOffset ^ 3] = data;
	#endif			#endif

	InternalState.BufferOffset++;			InternalState.BufferOffset++;
	if (InternalState.BufferOffset == BLOCK_LENGTH) {			if (InternalState.BufferOffset == BLOCK_LENGTH) {
	hashBlock();			hashBlock();
	InternalState.BufferOffset = 0;			InternalState.BufferOffset = 0;
	}			}
	}			}

	void SHA1::writebyte(uint8_t data) {			void SHA1::writebyte(uint8_t data) {
	(63 lines not shown)