This is an archive of the discontinued LLVM Phabricator instance.

SHA1: unroll loop in hashBlock.
ClosedPublic

Authored by ruiu on Nov 19 2016, 3:12 PM.

Details

Summary

This code is taken from public domain.
https://github.com/jsonn/src/blob/trunk/common/lib/libc/hash/sha1/sha1.c

I wrote a sha1 command and run it on my Xeon E5-2680 v2 2.80GHz machine.
Here is a result. The new hash function is 37% faster than before.

Performance counter stats for './llvm-sha1-old /ssd/build/bin/lld' (10 runs):

    6640.503687 task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.03% )
             54 context-switches          #    0.008 K/sec                    ( +-  5.03% )
              5 cpu-migrations            #    0.001 K/sec                    ( +- 31.73% )
        183,803 page-faults               #    0.028 M/sec                    ( +-  0.00% )
 18,527,954,113 cycles                    #    2.790 GHz                      ( +-  0.03% )
  4,993,237,485 stalled-cycles-frontend   #   26.95% frontend cycles idle     ( +-  0.11% )
<not supported> stalled-cycles-backend
 50,217,149,423 instructions              #    2.71  insns per cycle
                                          #    0.10  stalled cycles per insn  ( +-  0.00% )
  6,094,322,337 branches                  #  917.750 M/sec                    ( +-  0.00% )
     11,778,239 branch-misses             #    0.19% of all branches          ( +-  0.01% )

    6.634017401 seconds time elapsed                                          ( +-  0.03% )

Performance counter stats for './llvm-sha1-new /ssd/build/bin/lld' (10 runs):

    4167.062720 task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.02% )
             52 context-switches          #    0.012 K/sec                    ( +- 16.45% )
              7 cpu-migrations            #    0.002 K/sec                    ( +- 32.20% )
        183,804 page-faults               #    0.044 M/sec                    ( +-  0.00% )
 11,626,611,958 cycles                    #    2.790 GHz                      ( +-  0.02% )
  4,491,897,976 stalled-cycles-frontend   #   38.63% frontend cycles idle     ( +-  0.05% )
<not supported> stalled-cycles-backend
 24,320,180,617 instructions              #    2.09  insns per cycle
                                          #    0.18  stalled cycles per insn  ( +-  0.00% )
  1,574,674,576 branches                  #  377.886 M/sec                    ( +-  0.00% )
     11,769,693 branch-misses             #    0.75% of all branches          ( +-  0.00% )

    4.163251552 seconds time elapsed                                          ( +-  0.02% )

Diff Detail

Repository
rL LLVM

Event Timeline

ruiu updated this revision to Diff 78642.Nov 19 2016, 3:12 PM
ruiu retitled this revision from to SHA1: unroll loop in hashBlock..
ruiu updated this object.
ruiu added reviewers: mehdi_amini, joerg.
ruiu added a subscriber: llvm-commits.
davide added inline comments.
lib/Support/SHA1.cpp
12 ↗(On Diff #78642)

I'd rather link the original NetBSD repo rather than Joerg's mirror (anoncvs.netbsd.org)

95–174 ↗(On Diff #78642)

I don't think this is terrible per-se (not quite readable), but @chandlerc pointed out during the DevSummit (at the libc++ performance BOF IIRC) that we should try to avoid unrolling loops by hand in our algorithms (and make sure the compiler does that on our behalf).
Now, for this case, I'm not sure if LLVM knows how to unroll this loop (and if it doesn't, I'm not sure how profitable it is to teach how to do it), but something to keep in mind in general.

mehdi_amini added inline comments.Nov 19 2016, 3:45 PM
lib/Support/SHA1.cpp
30 ↗(On Diff #78642)

Why did you turn it into a macro?

51 ↗(On Diff #78642)

Why a macro for all the RX(..)?

95–174 ↗(On Diff #78642)

Yes, I wonder how is the codegen impacted if you turn the above into 4 different loops?

davide added inline comments.Nov 19 2016, 3:51 PM
lib/Support/SHA1.cpp
30 ↗(On Diff #78642)

Agree, not particularly fond of macros in new software unless really needed.

joerg added inline comments.Nov 19 2016, 4:19 PM
lib/Support/SHA1.cpp
12 ↗(On Diff #78642)

Agreed, it was primarily meant as starting point.

http://cvsweb.netbsd.org/bsdweb.cgi/~checkout~/src/common/lib/libc/hash/sha1/sha1.c?rev=1.6

would be a better link.

30 ↗(On Diff #78642)

That's actually adopted from the NetBSD implementation. There is no huge advantage for inline function vs macro; the macro just keeps the diff a bit smaller.

51 ↗(On Diff #78642)

The macros reflect the building blocks of the main loop, e.g. the different constants and blocks used.
Again, this could be an inline function with references and hoping the compiler optimised all away, but using a macro keeps the diff down.

ruiu updated this revision to Diff 78643.Nov 19 2016, 4:44 PM
  • Fix the original source URL
  • Convert macros into functions
ruiu added a comment.Nov 19 2016, 4:48 PM

If I do not unroll the loop, computing a hash on a file worsened from 4.16 seconds to 5.77, so there's definitely an impact.

mehdi_amini accepted this revision.Nov 19 2016, 4:50 PM
mehdi_amini edited edge metadata.

LGTM, Thanks!

In D26890#600802, @ruiu wrote:

If I do not unroll the loop, computing a hash on a file worsened from 4.16 seconds to 5.77, so there's definitely an impact.

Please file a PR on llvm.org with the non-unrolled version of this code, and CC me. The optimization folks should look at this!

lib/Support/SHA1.cpp
82 ↗(On Diff #78643)

Please use static free function if there is no need to access the state of the object.

This revision is now accepted and ready to land.Nov 19 2016, 4:50 PM
mehdi_amini added inline comments.Nov 19 2016, 4:52 PM
lib/Support/SHA1.cpp
51 ↗(On Diff #78642)

Keeping the diff down is not a metric I value, especially if it is by using macros instead of functions.

ruiu added inline comments.Nov 19 2016, 4:53 PM
lib/Support/SHA1.cpp
82 ↗(On Diff #78643)

blk and blk0 access InternalState, so these functions need to be members.

mehdi_amini added inline comments.Nov 19 2016, 4:57 PM
lib/Support/SHA1.cpp
82 ↗(On Diff #78643)

Sure, but you could have an extra arg to access the buffer, like:

uint32_t SHA1::blk(int I, uint32_t* Buffer) {
  Buffer[I & 15] = rol(
      Buffer[(I + 13) & 15] ^
          Buffer[(I + 8) & 15] ^
          Buffer[(I + 2) & 15] ^ Buffer[I & 15],
      1);
  return Buffer[I & 15];
}
ruiu updated this revision to Diff 78644.Nov 19 2016, 5:09 PM
ruiu edited edge metadata.
  • Use static free functions instead of class member functions.
This revision was automatically updated to reflect the committed changes.
ruiu added a comment.Nov 19 2016, 5:36 PM

FYI, filed https://llvm.org/bugs/show_bug.cgi?id=31075 to report the unrolling issue.