
[SelectionDAG][RFC] Allow the user to specify a memeq function (v5).
Closed, Public

Authored by courbet on Jan 11 2019, 4:45 AM.

Details

Summary

Right now, when we encounter a string equality check,
e.g. if (memcmp(a, b, s) == 0), we try to expand it to an inline comparison if s is a
small compile-time constant, and otherwise fall back on calling memcmp().

This is sub-optimal because memcmp has to compute much more than
equality.

This patch replaces memcmp(a, b, s) == 0 by bcmp(a, b, s) == 0 on platforms
that support bcmp.

bcmp can be made much more efficient than memcmp because an equality
comparison is trivially parallel, while lexicographic ordering has a chain
dependency.
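
For intuition, here is a minimal sketch of an equality-only comparison (purely illustrative; this is neither the patch nor any libc implementation). Each word's XOR is independent and the ORs form a simple reduction, so there is no serial chain like the one lexicographic ordering needs to find the first differing byte:

#include <stddef.h>
#include <string.h>

/* Hypothetical bcmp-style routine: returns 0 iff the buffers are equal. */
static int equality_only_cmp(const void *a, const void *b, size_t n) {
  const unsigned char *pa = a, *pb = b;
  unsigned long acc = 0;
  size_t i = 0;
  for (; i + sizeof(unsigned long) <= n; i += sizeof(unsigned long)) {
    unsigned long wa, wb;
    memcpy(&wa, pa + i, sizeof wa);   /* unaligned-safe word loads */
    memcpy(&wb, pb + i, sizeof wb);
    acc |= wa ^ wb;                   /* independent across iterations */
  }
  for (; i < n; ++i)                  /* leftover tail bytes */
    acc |= (unsigned long)(pa[i] ^ pb[i]);
  return acc != 0;
}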


Event Timeline

courbet created this revision. Jan 11 2019, 4:45 AM
xbolva00 added inline comments.
lib/Transforms/Utils/SimplifyLibCalls.cpp
917

Move code and create function emitBcmp in BuildLibCalls

gchatelet added inline comments. Feb 5 2019, 1:50 AM
lib/Analysis/TargetLibraryInfo.cpp
54

typo 'the the'

courbet updated this revision to Diff 185288. Feb 5 2019, 5:31 AM
courbet marked 2 inline comments as done.

Refactor emitBCmp() to BuildLibCalls.

Thanks for the comments.

It'd be great to see this publicized more widely, beyond just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, that would be a lot better than having an optimization that doesn't really optimize anything unless users provide their own bcmp implementation.

Given the potential for gains reported, I'd hate to see this as a change that people can't actually take advantage of.

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this.

We should document this change for such folk, as they will need to either add the flag, or provide their own bcmp implementation.

Some other transforms in SimplifyLibCalls transform strcmp and strncmp into memcmp. I'm not sure if these optimizations will iterate or not -- will this properly transform strcmp -> memcmp -> bcmp, where appropriate?

Finally, I note that we don't optimize user code which calls bcmp the way we do user code which calls memcmp: neither ExpandMemCmp nor SimplifyLibCalls handles bcmp. While that's not something that needs to be done simultaneously with this change, we probably should do it.

It'd be great to see this publicized more widely, beyond just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, that would be a lot better than having an optimization that doesn't really optimize anything unless users provide their own bcmp implementation.

I agree; it would be great to see this potential gain converted into an actual gain for everybody.
I'm working on the optimized internal bcmp; we're still discussing how to contribute it to glibc (the implementation relies heavily on C++ features).

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this.

Unfortunately, LLVM's -ffreestanding doesn't seem to conform to the standard, so it's already broken in a way.
Clang considers mem* to be a required part of freestanding: using -nostdlib and -ffreestanding doesn't mean that Clang and LLVM won't use those functions, it means that you're responsible for providing them.
For example, passing a big struct by value with -ffreestanding will emit a call to memcpy in both C and C++ (see the sketch below).
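
A small illustration of that point (struct size and names are made up; exact lowering depends on the target):

/* Even with -ffreestanding, the by-value copy of a large aggregate is
 * typically lowered to a call to memcpy, which the environment must provide. */
struct Big { char bytes[4096]; };

void consume(struct Big b);            /* hypothetical callee */

void pass_by_value(const struct Big *p) {
  consume(*p);                         /* the argument copy likely becomes a memcpy call */
}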

That said, I agree it's better to disable bcmp for -ffreestanding.

courbet updated this revision to Diff 186426. Feb 12 2019, 2:21 AM

Add a test to show that the SimplifyLibCalls transforms compose (strcmp -> memcmp -> bcmp).

It'd be great to see this publicized more widely, beyond just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, that would be a lot better than having an optimization that doesn't really optimize anything unless users provide their own bcmp implementation.

Given the potential for gains reported, I'd hate to see this as a change that people can't actually take advantage of.

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this. We should document this change for such folk, as they will need to either add the flag, or provide their own bcmp implementation.

When compiling the following code:

return memcmp(a, b, i) == 0;

1 - -ffreestanding will emit memcmp() instead of bcmp(), because freestanding disables bcmp() as a library function.
2 - -fno-builtin-memcmp will emit memcmp() instead of bcmp(), because this one instructs LLVM to not treat calls to memcmp() specially.
3 - -fno-builtin-bcmp will still emit bcmp(). AFAICT this is reasonable, because the flag is supposed to tell clang/llvm not to treat bcmp specially, not to avoid using it (that's what freestanding is for). That being said, I can see why the behaviour you're suggesting is good for users, so we probably want to teach clang to treat bcmp as a builtin: https://reviews.llvm.org/D58120. (The three cases are summarized in the sketch below.)
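
To make the three cases concrete, here is the same snippet as a standalone file, with the lowering described above noted in comments (file and function names are made up, and the exact output depends on the target and optimization level):

/* equals.c -- e.g. clang -O2 -S equals.c
 *   default              -> the comparison is emitted as bcmp(a, b, i) == 0
 *   -ffreestanding       -> memcmp is kept (bcmp is not available as a libfunc)
 *   -fno-builtin-memcmp  -> memcmp is kept (the call is not treated specially)
 *   -fno-builtin-bcmp    -> bcmp is still emitted (see D58120 for the clang side)
 */
#include <string.h>

int equals(const void *a, const void *b, size_t i) {
  return memcmp(a, b, i) == 0;   /* size is not a compile-time constant */
}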

Some other transforms in SimplifyLibCalls transform strcmp and strncmp into memcmp. I'm not sure if these optimizations will iterate or not -- will this properly transform strcmp -> memcmp -> bcmp, where appropriate?

They do compose; I've added a test to show that.
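
As an illustration of the kind of pattern involved (a hypothetical function, not the test added here): with a constant string argument, SimplifyLibCalls can rewrite the strcmp into a fixed-size memcmp, and the equality-only use then makes the result eligible for the memcmp handling above.

#include <string.h>

int is_abc(const char *s) {
  /* strcmp(s, "abc") == 0  ->  memcmp(s, "abc", 4) == 0  ->  bcmp / inline expansion */
  return strcmp(s, "abc") == 0;
}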

Finally, I note that we don't optimize user code which calls bcmp the way we do user code which calls memcmp: neither ExpandMemCmp nor SimplifyLibCalls handles bcmp. While that's not something that needs to be done simultaneously with this change, we probably should do it.

That's a very good point. I've filed PR40699 for this.

OK, I'm happy with all of that.

One more issue I just thought of: I believe this will reduce sanitizer coverage, since they intercept calls to memcmp, but not (yet) bcmp.

OK, I'm happy with all of that.

One more issue I just thought of: I believe this will reduce sanitizer coverage, since they intercept calls to memcmp, but not (yet) bcmp.

Thanks, I've created D58379 for this.

D58379 is now submitted.

jyknight accepted this revision. Feb 26 2019, 6:37 AM

LGTM.

Probably worth adding a note to the release notes, something like "The optimizer will now convert calls to memcmp into calls to bcmp in some circumstances. Users who are building freestanding code (not depending on the platform's libc) without specifying -ffreestanding may need to either pass -fno-builtin-bcmp, or provide a bcmp function."

This revision is now accepted and ready to land. Feb 26 2019, 6:37 AM
courbet updated this revision to Diff 188369. Feb 26 2019, 6:57 AM
  • Update release notes

Done, thanks for the review!

This revision was automatically updated to reflect the committed changes.

This made pdfium use 6.8% more cpu, https://bugs.chromium.org/p/chromium/issues/detail?id=947611

Since the intent here was to make things faster, any ideas how this could happen?

This made pdfium use 6.8% more cpu, https://bugs.chromium.org/p/chromium/issues/detail?id=947611

Since the intent here was to make things faster, any ideas how this could happen?

Hi Nico, thanks for reporting.

This commit introduced a regression (https://bugs.llvm.org/show_bug.cgi?id=40699): memcmp with small constant sizes, e.g. memcmp(a, b, 16) == 0, was expanded as bcmp(a, b, 16) == 0 instead of being lowered to two loads + 1 compare.
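
For reference, the shape of code affected by the regression looks like this (hypothetical function, using the size from the example above):

#include <string.h>

/* Small constant size, equality-only use: this should be expanded inline
 * (a couple of wide loads plus compares) rather than calling bcmp. */
int same16(const void *a, const void *b) {
  return memcmp(a, b, 16) == 0;
}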

This was fixed in https://reviews.llvm.org/rL356550

Could that be the reason for the regression? I've tried reproducing with the commands in https://bugs.chromium.org/p/chromium/issues/detail?id=947611, but my valgrind will not handle AVX instructions:

valgrind: Unrecognised instruction at address 0x4015cc9
   at 0x4015CC9: _dl_runtime_resolve_avx_slow
...

Could that be the reason for the regression?

Though I cannot run callgrind, I had a look at the generated code with and without 356550. Before the fix, I see a lot of small constant-size calls to bcmp (~20 of them), e.g.:

leaq -0xa0(%rbp), %rbx
...
movl $0x20, %edx           # size argument: 0x20 = 32 bytes
movq %rbx, %rdi            # first buffer
leaq -0xe0(%rbp), %rsi     # second buffer
callq 0x1954c7             # call to bcmp

After the fix, these calls are inlined:

movdqa -0x80(%rbp), %xmm0  # load 16 bytes of the first buffer
movdqa -0x70(%rbp), %xmm1  # load the next 16 bytes
pcmpeqb -0xd0(%rbp), %xmm1 # byte-wise compare against the second buffer
pcmpeqb -0xe0(%rbp), %xmm0
pand %xmm1, %xmm0          # combine both 16-byte equality masks
pmovmskb %xmm0, %ecx       # one bit per byte
cmpl $0xffff, %ecx         # all 32 bytes equal iff all 16 bits are set

OK, I succeeded in profiling this using pprof instead of callgrind.

Again, before 356550 I see:

210764079 32.23% 32.23%  210764079 32.23%  CStretchEngine::ContinueStretchHorz
  98749133 15.10% 47.33%   98749133 15.10%  (anonymous namespace)::FaxG4GetRow
  76895485 11.76% 59.09%   76895485 11.76%  <unknown>
  72290505 11.06% 70.15%   72290505 11.06%  (anonymous namespace)::FindBit
  24300088  3.72% 73.86%   24300088  3.72%  CFX_ScanlineCompositor::CompositeByteMaskLine
  21546869  3.30% 77.16%   21546869  3.30%  CStretchEngine::StretchVert
  17315093  2.65% 79.81%   17315093  2.65%  [pdfium_test]
  16493189  2.52% 82.33%   16493189  2.52%  __memcmp_sse4_1
   8353611  1.28% 83.61%    8353611  1.28%  CStretchEngine::CWeightTable::Calc
   7791739  1.19% 84.80%    7791739  1.19%  CPDF_DIBBase::TranslateScanline24bppDefaultDecode

and after the change:

195287006 32.16% 32.16%  195287006 32.16%  CStretchEngine::ContinueStretchHorz
 85942976 14.15% 46.31%   85942976 14.15%  (anonymous namespace)::FaxG4GetRow
 75448209 12.42% 58.74%   75448209 12.42%  <unknown>
 64249874 10.58% 69.32%   64249874 10.58%  (anonymous namespace)::FindBit
 32990212  5.43% 74.75%   32990212  5.43%  CStretchEngine::StretchVert
 14766318  2.43% 77.18%   14766318  2.43%  CFX_ScanlineCompositor::CompositeByteMaskLine
 13827062  2.28% 79.46%   13827062  2.28%  [pdfium_test]
  9502572  1.56% 81.02%    9502572  1.56%  CPDF_DIBBase::TranslateScanline24bppDefaultDecode
  6078561  1.00% 82.02%    6078561  1.00%  CStretchEngine::CWeightTable::Calc
  5193562  0.86% 82.88%    5193562  0.86%  CPDF_TextPage::ProcessTextObject

So I'm now quite confident that de-inlining was the cause of the regression and that it was fixed by 356550.