This is an archive of the discontinued LLVM Phabricator instance.

[scudo] CRC32 optimizations
ClosedPublic

Authored by cryptoad on May 8 2017, 10:15 AM.

Details

Summary

This change optimizes several aspects of the checksum used for chunk headers.

First, there is no point in checking the weak symbol computeHardwareCRC32
everytime, it will either be there or not when we start, so check it once
during initialization and set the checksum type accordingly.

Then, the loading of HashAlgorithm for SSE versions (and ARM equivalent) was
not optimized out, while not necessary. So I reshuffled that part of the code,
which duplicates a tiny bit of code, but ends up in a much cleaner assembly
(and faster as we avoid an extraneous load and some calls).

The following code is the checksum at the end of scudoMalloc for x86_64 with
full SSE 4.2, before:

mov     rax, 0FFFFFFFFFFFFFFh
shl     r10, 38h
mov     edi, dword ptr cs:_ZN7__scudoL6CookieE ; __scudo::Cookie
and     r14, rax
lea     rsi, [r13-10h]
movzx   eax, cs:_ZN7__scudoL13HashAlgorithmE ; __scudo::HashAlgorithm
or      r14, r10
mov     rbx, r14
xor     bx, bx
call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
mov     rsi, rbx
mov     edi, eax
call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
mov     r14w, ax
mov     rax, r13
mov     [r13-10h], r14

After:

mov     rax, cs:_ZN7__scudoL6CookieE ; __scudo::Cookie
lea     rcx, [rbx-10h]
mov     rdx, 0FFFFFFFFFFFFFFh
and     r14, rdx
shl     r9, 38h
or      r14, r9
crc32   eax, rcx
mov     rdx, r14
xor     dx, dx
mov     eax, eax
crc32   eax, rdx
mov     r14w, ax
mov     rax, rbx
mov     [rbx-10h], r14

Event Timeline

cryptoad created this revision.May 8 2017, 10:15 AM
dvyukov added inline comments.May 8 2017, 12:04 PM
lib/scudo/scudo_allocator.cpp
39

It looked nice to have a separate function that wraps all of this logic. Why do you inline it?
I would just move the load of HashAlgorithm into this function so that it does not happen when not needed.

cryptoad added inline comments.May 8 2017, 12:30 PM
lib/scudo/scudo_allocator.cpp
39

Indeed, I considered this as well.
The other advantage of doing it the new way was to have the crc32 instructions inlined as opposed to be calls (which is another part of the performance gain), which required me to use the intrinsic within the loop. After duplicating the loop with intrinsic, I didn't see a need to keep that function.

dvyukov added inline comments.May 8 2017, 12:47 PM
lib/scudo/scudo_allocator.cpp
39

Do you can have intrinsics in the inlinable computeCRC32 function. It should lead to the same assembly.
With intrinsics it becomes more complex, so there is even more reason to have a separate function for it. Consider that we want to calculate crc in a another place.

cryptoad added inline comments.May 8 2017, 12:51 PM
lib/scudo/scudo_allocator.cpp
39

So something like:

INLINE u32 computeCRC32(u32 Crc, uptr Data, u8 HashType) {
#if defined(__SSE4_2__) || defined(__ARM_FEATURE_CRC32)
  return CRC32_INTRINSIC(Crc, Data);
#else
  if (atomic_load_relaxed(&HashAlgorithm) == CRC32Hardware)
    return computeHardwareCRC32(Crc, Data);
  return computeSoftwareCRC32(Crc, Data);
#endif
}
39

(without the 3rd param)

dvyukov added inline comments.May 8 2017, 12:54 PM
lib/scudo/scudo_allocator.cpp
39

yup

alekseyshl added inline comments.May 8 2017, 1:11 PM
lib/scudo/scudo_allocator.cpp
89–90

Would it make sense to check algorithm once and duplicate the code here, instead of comparing on every loop iteration? More code, but since we're after the performance...

if (HashType == CRC32Hardware) {
  Crc = computeHardwareCRC32(Cookie, reinterpret_cast<uptr>(this));
  for (uptr i = 0; i < ARRAY_SIZE(HeaderHolder); i++)
    Crc = computeHardwareCRC32(Crc, HeaderHolder[i]);
} else {
  Crc = computeSoftwareCRC32(Cookie, reinterpret_cast<uptr>(this));
  for (uptr i = 0; i < ARRAY_SIZE(HeaderHolder); i++)
    Crc = computeSoftwareCRC32(Crc, HeaderHolder[i]);
}
lib/scudo/scudo_crc32.cpp
23

Please remind me, does it mean that computeHardwareCRC32 != 0 only when defined(SSE4_2) || defined(__ARM_FEATURE_CRC32)? Or is it expected to come from other libraries too?

cryptoad added inline comments.May 8 2017, 1:11 PM
lib/scudo/scudo_allocator.cpp
39

Just gave it a try, for the non-full-SSE the loop is unrolled but with a load & check of the algorithm at each iteration (2 for 64-bit, 3 for 32-bit):

                mov     rax, cs:_ZN7__scudoL6CookieE ; __scudo::Cookie
                mov     dl, cs:_ZN7__scudoL13HashAlgorithmE ; unsigned __int64
                cmp     dl, 1
                jnz     short loc_884B
                mov     edi, eax        ; this
                mov     rsi, rcx        ; unsigned int
                call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
                jmp     loc_88DB
loc_884B:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+1Aj

... software crc32 ...

loc_88DB:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+26j
                mov     cl, cs:_ZN7__scudoL13HashAlgorithmE ; __scudo::HashAlgorithm
                cmp     cl, 1
                jnz     short loc_88FC
                and     r14, 0FFFFFFFFFFFF0000h
                mov     edi, eax        ; this
                mov     rsi, r14        ; unsigned int
                call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
                jmp     loc_8982
loc_88FC:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+C4j

... software crc32 ...

It seems somewhat messier. What do you think?

cryptoad added inline comments.May 8 2017, 1:19 PM
lib/scudo/scudo_allocator.cpp
89–90

I gave this a try, with -O3, the generated code for computeChecksum ends up being identical with the initial version of the patch.

lib/scudo/scudo_crc32.cpp
23

computeHardwareCRC32 != 0 if defined(__SSE4_2__) || defined(__ARM_FEATURE_CRC32), but it could also be defined by an external library, even though this is not the expectation. With full RELRO (-Wl,-z,relro,-z,now), if not defined at load, it can't be defined later.

alekseyshl added inline comments.May 8 2017, 1:34 PM
lib/scudo/scudo_crc32.cpp
23

I see. I'm asking because in the current version of the code the same condition switches the top level code to CRC32_INTRINSIC, completely ignoring this function. I was curious if it is even necessary now.

It just does not add up. Even for the possible future other uses of computeCRC32, for the performance reasons we have to ifdef to INTRINSIC calls at the call site, which renders computeHardwareCRC32 useless, right?

cryptoad added inline comments.May 8 2017, 1:49 PM
lib/scudo/scudo_crc32.cpp
23

I am going to try and clarify the situation. Let me know if something still doesn't make sense after!

Compilation wise, there are 3 cases:

  1. -msse4_2 (or ARM equivalent) is enabled for the whole project (eg: google3), instructions will be emitted accordingly, and we will only use hardware crc32 (no weak function looked, only intrinsics);
  2. -msse4_2 (or ARM equivalent) is enabled for scudo_crc32.cpp only (eg: current cmake config if supported): SSE instructions are only emitted in this file, computeHardwareCRC32 is defined, and can be used at runtime; no SSE instructions will be emitted anywhere else; this allows runtime detection and to have a HW enabled version that also runs on old hardware;
  3. -msse4_2 (or ARM equivalent) is not enabled (eg: current cmake config if not supported): computeHardwareCRC32 is not defined.

Runtime wise, if the processor supports HW CRC32, then #1 or #2 will leverage it. If it doesn't, #1 will crash on illegal instruction, #2 will still work. #3 works in any case.

This patch will make #1 a lot cleaner ASM wise due to the call inline and the removal of the unnecessary load of HashType, #2 & #3 somewhat better due to the check removal for computeHardwareCRC32 at each iteration.

So I think your point about the top level code completely ignoring it refers to #1, but we still need it for #2.

dvyukov added inline comments.May 8 2017, 1:50 PM
lib/scudo/scudo_allocator.cpp
39

I missed that this function consumes a single word rather than an array.
I basically mean a function which you give initial crc value and an array of data and it gives you new crc value using the most efficient method available whatever it is.

cryptoad updated this revision to Diff 98207.May 8 2017, 2:10 PM

This new solution should fit both Dmitry's and Aleksey's suggestions.

Wrap up the checksum logic in a function that takes as parameters all the
information needed to compute the checksum, and returns the checksum using
the fastest way available.

Regarding the assembly, the full SSE one looks like (no extra HashAlgorithm
load and inlined crc32 calls):

mov     rax, 0FFFFFFFFFFFFFFh
lea     rcx, [r13-10h]
shl     r10, 38h
and     r14, rax
mov     rax, cs:_ZN7__scudoL6CookieE ; __scudo::Cookie
or      r14, r10
mov     rdx, r14
crc32   eax, rcx
xor     dx, dx
crc32   eax, rdx
mov     r14w, ax
mov     rax, r13
mov     [r13-10h], r14

And the partial SSE one (only one HashAlgorithm load):

                mov     rax, cs:_ZN7__scudoL6CookieE ; __scudo::Cookie
                mov     dl, cs:_ZN7__scudoL13HashAlgorithmE ; unsigned __int64
                cmp     dl, 1
                jnz     short loc_8859
                and     rbx, 0FFFFFFFFFFFF0000h
                mov     edi, eax        ; this
                mov     rsi, rcx        ; unsigned int
                call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
                mov     edi, eax        ; this
                mov     rsi, rbx        ; unsigned int
                call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
                jmp     loc_8974
; ---------------------------------------------------------------------------

loc_8859:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+17j

... software CRC32 ...

So this seems pretty good.

alekseyshl added inline comments.May 8 2017, 2:29 PM
lib/scudo/scudo_allocator.cpp
60

Now it feels logical to move this entire function along with HashAlgorithm var into scudo_crs32.cpp, but, I guess, not inlining it affects performance, right?

lib/scudo/scudo_crc32.cpp
23

Ah, now I remember the original reasoning behind this separation. Mentioning that in comments helps a lot. Thanks!

cryptoad added inline comments.May 8 2017, 2:34 PM
lib/scudo/scudo_allocator.cpp
60

Yes.
And also I would lose the distinction of whether __SSE4_2__ is defined everywhere (eg: in scudo_allocator.cpp) or only in scudo_crc32.cpp.

lib/scudo/scudo_crc32.cpp
23

Anytime!

cryptoad updated this revision to Diff 98217.May 8 2017, 2:56 PM

Getting rid of the temporary assignment to HashType which is no longer needed.

alekseyshl accepted this revision.May 8 2017, 4:00 PM
This revision is now accepted and ready to land.May 8 2017, 4:00 PM
cryptoad closed this revision.May 9 2017, 8:25 AM