This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Implement support for __builtin_memcmp_inline
AbandonedPublic

Authored by avieira on Jul 5 2021, 10:27 AM.

Details

Reviewers
gchatelet
Summary

Hi all,

This is my initial attempt at implementing a __builtin_memcmp_inline function in LLVM. The motivation is to use it in an optimized LLVM libc implementation of memcmp whilst still using -ffreestanding and -fno-builtin-memcmp.

I haven't added any tests and I know for a fact this goes completely wrong if your memcmp size is big enough to hit the maximum number of loads, or if you compile for size. I just wanted to bring up this RFC before I start going down the wrong path.

My current thinking is the following:

  1. Inline even when compiling for size: the use of this builtin should force inlining where possible, regardless of global compilation options. My reasoning here is that if people really don't want to inline due to size considerations, they shouldn't use the inline variant of this builtin.
  2. I'm split on what to do when the memcmp size is large enough to hit the maximum number of loads. I think there are three options:
     a) Ignore the maximum number of loads and just generate a lot of code (roughly the same strategy as memcpy).
     b) Fall back to a call to memcmp (and issue a diagnostic saying we did so?).
     c) Error out.
     I'm not keen on (c); I suspect (a) would probably make the most sense, and I hope that users of the builtin are sensible.
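For a sense of what option (a) means in practice, here is a rough sketch in plain C of the straight-line code inline expansion could produce for a fixed 16-byte compare. The helper names are illustrative, not part of the proposal:

```c
#include <stdint.h>

/* Illustrative only: the kind of straight-line code option (a) would
 * expand to for a fixed 16-byte compare -- wide loads, no loop, no
 * libc call. Helper names here are made up for the sketch. */
static uint32_t load_be32(const unsigned char *p) {
    /* Assemble the word in big-endian order so an integer comparison
     * matches memcmp's lexicographic byte order on any host. */
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
         | (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

static int memcmp16_inline(const void *a, const void *b) {
    const unsigned char *pa = a, *pb = b;
    uint32_t va, vb;
    va = load_be32(pa);      vb = load_be32(pb);
    if (va != vb) return va < vb ? -1 : 1;
    va = load_be32(pa + 4);  vb = load_be32(pb + 4);
    if (va != vb) return va < vb ? -1 : 1;
    va = load_be32(pa + 8);  vb = load_be32(pb + 8);
    if (va != vb) return va < vb ? -1 : 1;
    va = load_be32(pa + 12); vb = load_be32(pb + 12);
    if (va != vb) return va < vb ? -1 : 1;
    return 0;
}
```

The sign of the result matches memcmp (which compares as unsigned char), and for a large constant size this pattern simply repeats, which is where the code-size concern comes from.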

Let me know if there are any strong opinions on this. Please add other reviewers if you know they would have an interest in this.

Kind regards,
Andre

Diff Detail

Event Timeline

avieira created this revision.Jul 5 2021, 10:27 AM
avieira requested review of this revision.Jul 5 2021, 10:27 AM
Herald added a project: Restricted Project. Jul 5 2021, 10:27 AM

The motivation here seems weaker than it was for memcpy. memcmp is much less fundamental to LLVM optimizations. It looks like the primary motivation here is to avoid duplicating the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp? But really, it isn't very much target-independent logic, and the only target-specific bit is the heuristic for load widths.

> I'm split on what to do when the memcmp size is large enough to hit the maximum number of loads. I think there are three options: a) ignore the maximum number of loads and just generate a lot of code (roughly the same strategy as memcpy); b) fall back to a call to memcmp (and issue a diagnostic saying we did so?); c) error out. I'm not keen on (c); I suspect (a) would probably make the most sense, and I hope that users of the builtin are sensible.

Or (d) generate a loop. But it would be a bunch of work to implement that, and it's not clear the users of the function would want that. (a) seems reasonable.

Inline comment on llvm/include/llvm/IR/Intrinsics.td, line 625:

memcmp

> The motivation here seems weaker than it was for memcpy. memcmp is much less fundamental to LLVM optimizations. It looks like the primary motivation here is to avoid duplicating the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp? But really, it isn't very much target-independent logic, and the only target-specific bit is the heuristic for load widths.

I'd rather not duplicate the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp; the two copies may get out of sync.

> I'm split on what to do when the memcmp size is large enough to hit the maximum number of loads. I think there are three options: a) ignore the maximum number of loads and just generate a lot of code (roughly the same strategy as memcpy); b) fall back to a call to memcmp (and issue a diagnostic saying we did so?); c) error out. I'm not keen on (c); I suspect (a) would probably make the most sense, and I hope that users of the builtin are sensible.

> Or (d) generate a loop. But it would be a bunch of work to implement that, and it's not clear the users of the function would want that. (a) seems reasonable.

I concur, (a) is the way to go.
Loop generation is an option, but it's much more work and involves spreading the logic across layers; I would prefer not to go in that direction if possible.
The goal here is to provide building blocks for creating libc memory functions. It's unlikely that the size is going to be big, and it is a requirement that they never call into the libc (or they will overflow the stack, or loop infinitely if the tail call is optimized).
Also, the size has to be statically known, so I'd be surprised if someone wanted to compare 1KiB of memory this way... and a loop is easy to implement on top of the builtin.
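To illustrate that last point, a looping memcmp is indeed easy to layer on top of a fixed-size compare. In this self-contained C sketch, cmp16 is a plain stand-in for the proposed fixed-size builtin (written as a byte loop only so the sketch compiles on its own today):

```c
#include <stddef.h>

/* cmp16 stands in for the proposed fixed-size inline compare of 16
 * bytes; a real libc would expand it to wide loads instead of this
 * byte loop. */
static int cmp16(const unsigned char *a, const unsigned char *b) {
    for (int i = 0; i < 16; ++i)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

/* A looping memcmp built on top of the fixed-size compare: process
 * 16-byte blocks first, then finish the tail one byte at a time.
 * Nothing here calls into the libc, which is the requirement above. */
static int my_memcmp(const void *pa, const void *pb, size_t n) {
    const unsigned char *a = pa, *b = pb;
    while (n >= 16) {
        int r = cmp16(a, b);
        if (r != 0)
            return r;
        a += 16; b += 16; n -= 16;
    }
    for (; n != 0; --n, ++a, ++b)
        if (*a != *b)
            return *a < *b ? -1 : 1;
    return 0;
}
```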

> The motivation here seems weaker than it was for memcpy. memcmp is much less fundamental to LLVM optimizations. It looks like the primary motivation here is to avoid duplicating the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp? But really, it isn't very much target-independent logic, and the only target-specific bit is the heuristic for load widths.
>
> I'd rather not duplicate the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp; the two copies may get out of sync.

The problem, from my perspective, is that you'll have to duplicate the logic to some extent anyway. Even if __builtin_memcmp_inline is technically available, that doesn't let you access the underlying heuristics, so you'll need special-case logic for every target anyway.
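To make that concrete, even with the builtin a libc would still carry per-target selection of its own. A minimal sketch of what that special-casing looks like (the macro name and width choices are illustrative, not taken from any real implementation):

```c
/* Illustrative per-target special-casing: the profitable compare width
 * differs by target, so a libc has to pick it at compile time itself --
 * the builtin alone would not expose the backend's heuristic. The
 * widths below are examples, not tuned values. */
#if defined(__AVX2__)
#define MEMCMP_BLOCK_SIZE 32   /* 256-bit vector compares */
#elif defined(__SSE2__) || defined(__aarch64__)
#define MEMCMP_BLOCK_SIZE 16   /* 128-bit vector compares */
#else
#define MEMCMP_BLOCK_SIZE 8    /* fall back to word-sized loads */
#endif

static int memcmp_block_size(void) { return MEMCMP_BLOCK_SIZE; }
```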

> The motivation here seems weaker than it was for memcpy. memcmp is much less fundamental to LLVM optimizations. It looks like the primary motivation here is to avoid duplicating the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp? But really, it isn't very much target-independent logic, and the only target-specific bit is the heuristic for load widths.
>
> I'd rather not duplicate the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp; the two copies may get out of sync.
>
> The problem, from my perspective, is that you'll have to duplicate the logic to some extent anyway.

I see, we'll probably have to extract the shared logic into one or more libraries then.

> Even if __builtin_memcmp_inline is technically available, that doesn't let you access the underlying heuristics, so you'll need special-case logic for every target anyway.

Yes indeed, the availability of vector extensions could lead to different algorithms, so we need access to target-specific logic.

> The motivation here seems weaker than it was for memcpy. memcmp is much less fundamental to LLVM optimizations. It looks like the primary motivation here is to avoid duplicating the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp? But really, it isn't very much target-independent logic, and the only target-specific bit is the heuristic for load widths.
>
> I'd rather not duplicate the logic in llvm/lib/CodeGen/ExpandMemCmp.cpp; the two copies may get out of sync.
>
> The problem, from my perspective, is that you'll have to duplicate the logic to some extent anyway.
>
> I see, we'll probably have to extract the shared logic into one or more libraries then.

I initially benchmarked memcmp using the scalar building blocks and I was seeing better codegen with the builtins. I couldn't quite remember what the differences were, so I benchmarked the scalar implementations again, and they now seem to perform fairly similarly. There is a small code change around size 3, but the codegen actually seems better to me with the scalar implementation. I may have been using an older LLVM or an earlier version of the building blocks.

> Even if __builtin_memcmp_inline is technically available, that doesn't let you access the underlying heuristics, so you'll need special-case logic for every target anyway.
>
> Yes indeed, the availability of vector extensions could lead to different algorithms, so we need access to target-specific logic.

I am happy to drop this RFC if there are no objections.

> I initially benchmarked memcmp using the scalar building blocks and I was seeing better codegen with the builtins. I couldn't quite remember what the differences were, so I benchmarked the scalar implementations again, and they now seem to perform fairly similarly. There is a small code change around size 3, but the codegen actually seems better to me with the scalar implementation. I may have been using an older LLVM or an earlier version of the building blocks.
>
> I am happy to drop this RFC if there are no objections.

Ok then, let's drop it.
I think we still need to improve the state of memcmp in LLVM, but let's decouple the two issues.

avieira abandoned this revision.Jul 7 2021, 8:00 AM