Page MenuHomePhabricator

[SelectionDAG][RFC] Allow the user to specify a memeq function (v5).
ClosedPublic

Authored by courbet on Jan 11 2019, 4:45 AM.

Details

Summary

Right now, when we encounter a string equality check,
e.g. if (memcmp(a, b, s) == 0), we try to expand to a comparison if s is a
small compile-time constant, and fall back on calling memcmp() else.

This is sub-optimal because memcmp has to compute much more than
equality.

This patch replaces memcmp(a, b, s) == 0 by bcmp(a, b, s) == 0 on platforms
that support bcmp.

bcmp can be made much more efficient than memcmp because equality
compare is trivially parallel while lexicographic ordering has a chain
dependency.

Diff Detail

Repository
rL LLVM

Event Timeline

courbet created this revision.Jan 11 2019, 4:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2019, 11:26 PM
xbolva00 added inline comments.
lib/Transforms/Utils/SimplifyLibCalls.cpp
917 ↗(On Diff #181249)

Move code and create function emitBcmp in BuildLibCalls

gchatelet added inline comments.Feb 5 2019, 1:50 AM
lib/Analysis/TargetLibraryInfo.cpp
54 ↗(On Diff #181249)

typo 'the the'

courbet updated this revision to Diff 185288.Feb 5 2019, 5:31 AM
courbet marked 2 inline comments as done.

Refactor emitBCmp() to BuildLibCalls.

Thanks for the comments.

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

Given the potential for gains reported, I'd hate to see this as a change that people can't actually take advantage of.

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this.

We should document this change for such folk, as they will need to either add the flag, or provide their own bcmp implementation.

Some other transforms in SimplifyLibCalls transform strcmp and strncmp into memcmp. I'm not sure if these optimizations will iterate or not -- will this properly transform strcmp -> memcmp -> bcmp, where appropriate?

Finally, I note that we don't optimize user code which calls bcmp the way we do user code which calls memcmp. Neither in ExpandMemCmp, or SimplifyLibCalls do we handle bcmp. While that's not something that needs to be simultaneously with this change, probably we should be doing so.

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

I agree. That would be great to see this potential gain converted into an actual real gain for everybody.
I'm working on the optimized internal bcmp, we're still discussing how to contribute this to glibc (since implementation relies heavily on C++ features)

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this.

Unfortunately using LLVM's -ffreestanding doesn't seems to conform to the standard, so it's already broken in a way.
Clang considers mem* to be a required part of freestanding. Using -nostdlib and -ffreestanding doesn't mean that Clang and LLVM won't use those functions, it means that you're responsible for providing them.
e.g. passing a big struct by value with -ffreestanding will call memcpy in both C and C++

That said, I agree it's better to disable bcmp for -ffreestanding.

courbet updated this revision to Diff 186426.Feb 12 2019, 2:21 AM

Add test to show that simplifylibcall composes (strcmp -> memcmp -> bcmp).

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

Given the potential for gains reported, I'd hate to see this as a change that people can't actually take advantage of.

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this.

We should document this change for such folk, as they will need to either add the flag, or provide their own bcmp implementation.

While compiling the following code:

return memcmp(a, b, i) == 0;

1 - -ffreestanding will emit memcmp() instead of bcmp(), because freestanding disables bcmp() as a library function.
2 - -fno-builtin-memcmp will emit memcmp() instead of bcmp(), because this one instructs LLVM to not treat calls to memcmp() specially.
3 - -fno-builtin-bcmp will emit bcmp(). AFAICT, this is reasonable because this flag is supposed to tell clang/llvm to not treat bcmp specially, not to avoid using it (which is what freestanding is). That being said, I can see why the behaviour you're suggesting is good for users, so we probably want to teach clang to treat bcmp as a builtin: https://reviews.llvm.org/D58120

Some other transforms in SimplifyLibCalls transform strcmp and strncmp into memcmp. I'm not sure if these optimizations will iterate or not -- will this properly transform strcmp -> memcmp -> bcmp, where appropriate?

They do compose, I've added a test to show that.

Finally, I note that we don't optimize user code which calls bcmp the way we do user code which calls memcmp. Neither in ExpandMemCmp, or SimplifyLibCalls do we handle bcmp. While that's not something that needs to be simultaneously with this change, probably we should be doing so.

That's a very good point. I've created PR40699 with this.

OK, I'm happy with all of that.

One more issue I just thought of: I believe this will reduce sanitizer coverage, since they intercept calls to memcmp, but not (yet) bcmp.

OK, I'm happy with all of that.

One more issue I just thought of: I believe this will reduce sanitizer coverage, since they intercept calls to memcmp, but not (yet) bcmp.

Thanks , I've created D58379 for this.

D58379 Is now submitted.

jyknight accepted this revision.Feb 26 2019, 6:37 AM

LGTM.

Probably worth adding a note to the release notes, something like "The optimizer will now convert calls to memcmp into a calls to bcmp in some circumstances. Users who are building freestanding code (not depending on the platform's libc) without specifying -ffreestanding may need to either pass -fno-builtin-bcmp, or provide a bcmp function."

This revision is now accepted and ready to land.Feb 26 2019, 6:37 AM
courbet updated this revision to Diff 188369.Feb 26 2019, 6:57 AM
  • update realease notes

Done, thanks for the review !

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Mar 29 2019, 1:11 PM

This made pdfium use 6.8% more cpu, https://bugs.chromium.org/p/chromium/issues/detail?id=947611

Since the intent here was to make things faster, any ideas how this could happen?

This made pdfium use 6.8% more cpu, https://bugs.chromium.org/p/chromium/issues/detail?id=947611

Since the intent here was to make things faster, any ideas how this could happen?

Hi Nico, thanks for reporting.

This commit introduced a regression (https://bugs.llvm.org/show_bug.cgi?id=40699): memcmp with small constant sizes, e.g. memcmp(a, b, 16) == 0, was expanded as bcmp(a, b, 16) == 0 instead of being lowered to two loads + 1 compare.

This was fixed in https://reviews.llvm.org/rL356550

Could that be the reason for the regression ? I've tried reproducing with the commands in https://bugs.chromium.org/p/chromium/issues/detail?id=947611, but my valgrind will not handle AVX instructions:

valgrind: Unrecognised instruction at address 0x4015cc9
   at 0x4015CC9: _dl_runtime_resolve_avx_slow
...

Could that be the reason for the regression ?

Though I cannot run callgrind, I had a look at the code with/without 356550. Before the change, I see a lot of small constant-size calls to bcmp (~20 of them), e.g.:

leaq -0xa0(%rbp), %rbx
...
movl $0x20, %edx          
movq %rbx, %rdi           
leaq -0xe0(%rbp), %rsi    
callq 0x1954c7

After the change, these are inlined:

movdqa -0x80(%rbp), %xmm0 
movdqa -0x70(%rbp), %xmm1 
pcmpeqb -0xd0(%rbp), %xmm1
pcmpeqb -0xe0(%rbp), %xmm0
pand %xmm1, %xmm0         
pmovmskb %xmm0, %ecx      
cmpl $0xffff, %ecx

OK, I succeeded in profiling this using pprof instead of callgrind.

Again, before 356550 I see:

210764079 32.23% 32.23%  210764079 32.23%  CStretchEngine::ContinueStretchHorz
  98749133 15.10% 47.33%   98749133 15.10%  (anonymous namespace)::FaxG4GetRow
  76895485 11.76% 59.09%   76895485 11.76%  <unknown>
  72290505 11.06% 70.15%   72290505 11.06%  (anonymous namespace)::FindBit
  24300088  3.72% 73.86%   24300088  3.72%  CFX_ScanlineCompositor::CompositeByteMaskLine
  21546869  3.30% 77.16%   21546869  3.30%  CStretchEngine::StretchVert
  17315093  2.65% 79.81%   17315093  2.65%  [pdfium_test]
  16493189  2.52% 82.33%   16493189  2.52%  __memcmp_sse4_1
   8353611  1.28% 83.61%    8353611  1.28%  CStretchEngine::CWeightTable::Calc
   7791739  1.19% 84.80%    7791739  1.19%  CPDF_DIBBase::TranslateScanline24bppDefaultDecode

and after the change:

195287006 32.16% 32.16%  195287006 32.16%  CStretchEngine::ContinueStretchHorz
 85942976 14.15% 46.31%   85942976 14.15%  (anonymous namespace)::FaxG4GetRow
 75448209 12.42% 58.74%   75448209 12.42%  <unknown>
 64249874 10.58% 69.32%   64249874 10.58%  (anonymous namespace)::FindBit
 32990212  5.43% 74.75%   32990212  5.43%  CStretchEngine::StretchVert
 14766318  2.43% 77.18%   14766318  2.43%  CFX_ScanlineCompositor::CompositeByteMaskLine
 13827062  2.28% 79.46%   13827062  2.28%  [pdfium_test]
  9502572  1.56% 81.02%    9502572  1.56%  CPDF_DIBBase::TranslateScanline24bppDefaultDecode
  6078561  1.00% 82.02%    6078561  1.00%  CStretchEngine::CWeightTable::Calc
  5193562  0.86% 82.88%    5193562  0.86%  CPDF_TextPage::ProcessTextObject

So I'm now quite confident that de-inlining was the cause of the regression and that it was fixed by 356550.

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

Tried to add an optimized bcmp for GLIBC here.

It was not well received because bcmp is not a standard function. It seems GLIBC supports bcmp out of reluctant necessity rather than any desire for it to be fast.

There was agreement that the functionality was useful so GLIBC landed on __memcmpeq'to get the functionality. The patches made it to HEAD
and will be available starting with the 2.35 release. It declared in "string.h" or can be queried with GLIBC version >= 2.35.

Currently only x86_64 has an optimized version, the rest of the targets still just redirect to memcmp.

Working on a patch to add support in LLVM.

Given the potential for gains reported, I'd hate to see this as a change that people can't actually take advantage of.

A couple things I'd worry about, which I think this change is doing properly, but just to double check:

  • I assume that with -ffreestanding, this will be disabled.
  • Some folks avoid -ffreestanding, even though they have a freestanding implementation (sigh). For them, I assume -fno-builtin=bcmp will also disable this.

We should document this change for such folk, as they will need to either add the flag, or provide their own bcmp implementation.

Some other transforms in SimplifyLibCalls transform strcmp and strncmp into memcmp. I'm not sure if these optimizations will iterate or not -- will this properly transform strcmp -> memcmp -> bcmp, where appropriate?

Finally, I note that we don't optimize user code which calls bcmp the way we do user code which calls memcmp. Neither in ExpandMemCmp, or SimplifyLibCalls do we handle bcmp. While that's not something that needs to be simultaneously with this change, probably we should be doing so.

MaskRay added a comment.EditedNov 6 2021, 12:45 PM

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

Tried to add an optimized bcmp for GLIBC here.

It was not well received because bcmp is not a standard function. It seems GLIBC supports bcmp out of reluctant necessity rather than any desire for it to be fast.

There was agreement that the functionality was useful so GLIBC landed on __memcmpeq'to get the functionality. The patches made it to HEAD
and will be available starting with the 2.35 release. It declared in "string.h" or can be queried with GLIBC version >= 2.35.

Currently only x86_64 has an optimized version, the rest of the targets still just redirect to memcmp.

Working on a patch to add support in LLVM.

I understand and accept the viewpoint that: for glibc, bumping the symbol version for bcmp may lead to difficuly-to-debug issues when users try to upgrade glibc.
An ABI-only symbol in the reserved namespace looks good to me.

AFAIK while LLVM has Triple::isOSLinux and Triple::isGNUEnvironment and there are optimizations dispatching on linux-gnu there is no way detecting the version. (well, Apple platforms and some other OSes have such checks)
The optimizations just assume that a very old glibc version may be used (perhaps 2.10+ or early 2.20+).
In addition, there are many cross compiling users where no version is specified.

The glibc 2.35 availability does not make this optimization exploitable, at least for few (5+?) years :)


The original bcmp transformation does make me wonder whether it is a suitable enabled-by-default optimization if its existence is due to reluctance and neither glibc nor musl actually makes it faster than memcmp.
There are also concerns switching from 3-way to 2-way for bcmp even if its deprecated specification only promises 2-way.
I understand that libc/src/string/bcmp.cpp leverages 2-way and some users may override glibc functions with that implementation, but whether this use case is sufficient to change the more broad Linux default is debatable.

One option is to hide such transformation under a cl::opt option.

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

Tried to add an optimized bcmp for GLIBC here.

It was not well received because bcmp is not a standard function. It seems GLIBC supports bcmp out of reluctant necessity rather than any desire for it to be fast.

There was agreement that the functionality was useful so GLIBC landed on __memcmpeq'to get the functionality. The patches made it to HEAD
and will be available starting with the 2.35 release. It declared in "string.h" or can be queried with GLIBC version >= 2.35.

Currently only x86_64 has an optimized version, the rest of the targets still just redirect to memcmp.

Working on a patch to add support in LLVM.

I understand and accept the viewpoint that: for glibc, bumping the symbol version for bcmp may lead to difficuly-to-debug issues when users try to upgrade glibc.
An ABI-only symbol in the reserved namespace looks good to me.

AFAIK while LLVM has Triple::isOSLinux and Triple::isGNUEnvironment and there are optimizations dispatching on linux-gnu there is no way detecting the version. (well, Apple platforms and some other OSes have such checks)
The optimizations just assume that a very old glibc version may be used (perhaps 2.10+ or early 2.20+).
In addition, there are many cross compiling users where no version is specified.

I was planing to add Triple::isGNUVersionLT and use it in a function hasMemcmpeq in a similar vein to hasBcmp. Will there be an issue with that?

The glibc 2.35 availability does not make this optimization exploitable, at least for few (5+?) years :)

Indeed :/

One option is to hide such transformation under a cl::opt option.

Sorry not sure I understand, can you explain?

It'd be great to see this somewhat more widely publicized, outside of just the clang community. If libc implementors are aware of the gains and are willing to provide an actually-faster bcmp implementation, it'd be a lot better, than having this optimization that doesn't really optimize anything without users providing their own bcmp implementation.

Tried to add an optimized bcmp for GLIBC here.

It was not well received because bcmp is not a standard function. It seems GLIBC supports bcmp out of reluctant necessity rather than any desire for it to be fast.

There was agreement that the functionality was useful so GLIBC landed on __memcmpeq'to get the functionality. The patches made it to HEAD
and will be available starting with the 2.35 release. It declared in "string.h" or can be queried with GLIBC version >= 2.35.

Currently only x86_64 has an optimized version, the rest of the targets still just redirect to memcmp.

Working on a patch to add support in LLVM.

I understand and accept the viewpoint that: for glibc, bumping the symbol version for bcmp may lead to difficuly-to-debug issues when users try to upgrade glibc.
An ABI-only symbol in the reserved namespace looks good to me.

AFAIK while LLVM has Triple::isOSLinux and Triple::isGNUEnvironment and there are optimizations dispatching on linux-gnu there is no way detecting the version. (well, Apple platforms and some other OSes have such checks)
The optimizations just assume that a very old glibc version may be used (perhaps 2.10+ or early 2.20+).
In addition, there are many cross compiling users where no version is specified.

I was planing to add Triple::isGNUVersionLT and use it in a function hasMemcmpeq in a similar vein to hasBcmp. Will there be an issue with that?

So the version parsing doesn't appear to work out of the box for GLIBC and your right that it will miss cases (cross compilation as you pointed out) as well as free standing compilation or libraries like cpu-rt.

Do you know if there is any way to check for the __memcmpeq declaration? We put it in string.h particularly because the issue of how to know when this optimization is valid came up when making the proposal.

As well, does having the optimization attached to the declaration make sense? It may still need to be attached to isGNUEnvironment in case another libc implementation has __memcmpeq with slightly different semantics.

The glibc 2.35 availability does not make this optimization exploitable, at least for few (5+?) years :)

Indeed :/

One option is to hide such transformation under a cl::opt option.

Sorry not sure I understand, can you explain?

See what you mean and why it might be necessary.