This is an archive of the discontinued LLVM Phabricator instance.

[PPC] Inline expansion of memcmp
ClosedPublic

Authored by syzaara on Jan 12 2017, 7:57 PM.

Details

Summary

Currently on PowerPC, llvm does not expand calls to memcmp as it does for memcpy, memmove. When memcmp size and alignment info of the two sources being compared is known at compile time, we can do an inline expansion rather than calling the library routine.
The expansion is implemented at the IR level in CodeGenPrepare.cpp where other intrinsic calls are also expanded in bool CodeGenPrepare::optimizeCallInst(CallInst *CI, bool& ModifiedDT). It is only enabled by default for PowerPC, and other targets may enable it by returning true for expandMemCmp().

This patch expands memcmp by using a MaxLoadSize set by the target. For PowerPC, this is set to loading 8 bytes at a time. The class MemCmpExpansion sets up the basic block structure required for the expansion and does the inline expansion through the member function getMemCmpExpansion which returns the final value to replace the memcmp call instruction with. The expansion works by loading the number of bytes specified by LoadSize from each source of the memcmp parameters. It then subtracts and compares to zero to see if the values differ. If a difference is found, it branches with an early exit to the ResultBlock for calculating which source was larger and returning the correct result (Src1 > Src2 ? 1 : -1). Otherwise, it falls through to the either the next LoadCmpBlock or the EndBlock if this is the last LoadCmpBlock. It also adds a special case for when the memcmp result is used in an equality with 0 by skipping the result calculation and returning either 0 or 1 depending on if a difference is found. In this special case, there is also a command line option ("memcmp-num-loads-per-block") that can be used to place more loads per block. For example with 2 loads per block, this builds the basic block with an IR sequence like:

load a, load b, load c, load d
xor x1 = a^b, xor x2 = c^d
or o1 = x1 | x2

Diff Detail

Repository
rL LLVM

Event Timeline

syzaara updated this revision to Diff 84217.Jan 12 2017, 7:57 PM
syzaara retitled this revision from to [PPC] Inline expansion of memcmp.
syzaara updated this object.
syzaara added reviewers: kbarton, nemanjai, lei, sfertile, jtony.
syzaara added a subscriber: llvm-commits.
jtony added inline comments.Jan 13 2017, 8:01 AM
include/llvm/Target/TargetLowering.h
1002 ↗(On Diff #84217)

This variable name sounds like an integer rather than a bool. But since this is NOT newly introduced (historical legacy actually), I am OK to leave it as it is.

How does this compare to the existing SelectionDAGBuilder::visitMemCmpCall/EmitTargetCodeForMemcmp/etc.?

lib/CodeGen/CodeGenPrepare.cpp
1992 ↗(On Diff #84217)

You need to explicitly set the alignment to 1 here.

nemanjai requested changes to this revision.Jan 16 2017, 3:56 AM
nemanjai edited edge metadata.Jan 16 2017, 3:56 AM

A couple of general comments:

  • Make sure you add Hal Finkel, Craig Topper, Renato Golin, Tim Northover, Ulrich Weigand, etc. to the reviewers list to cover other targets
  • You may also want to add those that responded to the original RFC
  • I think the special treatment of 8-bytes is unnecessary in a target-independent pass (for things like statistics, decisions, etc.)
  • Add a bit more testing (multiple blocks, multiple types, negative tests for various reasons such as non-const comparison, max exceeded, etc.)
include/llvm/Analysis/TargetTransformInfoImpl.h
255 ↗(On Diff #84217)

It would probably be better to leave this optimization much more flexible. As such, you should probably pass some information about a specific call to memcmp that would allow the target to decide not only if ALL memcmp calls should be expanded but if THIS memcmp call should be expanded.
I'm not sure if the best option is to pass the actual CallInst or the important details (perhaps in a MemcmpInfo struct - if people would like such a data structure created). I personally prefer the former.
Finally, I think the type for the load should be an output parameter of this function as not every target would necessarily like an i64* load on every memcmp call.

lib/CodeGen/CodeGenPrepare.cpp
85 ↗(On Diff #84217)

This is one of those uses of 8-bytes that don't necessarily mean anything for some targets.
I think I'd opt for a statistic such as NumMemCmpNotMultipleOfLDSize.

1875 ↗(On Diff #84217)

Unrelated formatting change. Please refrain from this.

1914 ↗(On Diff #84217)

I would re-factor this function into things that are common and things that may change depending on endianness, etc.
For example, deciding whether you'll inline the call, calculating the number of blocks and creating the basic blocks is common code once you know the load types. However, the code to load/compare/calculate the return value will be specific. That can be done either in one separate function with the endianness as a parameter or in two separate functions for LE/BE.

1917 ↗(On Diff #84217)

There isn't anything special about Little Endian targets. We bail out here because the Big Endian code gen was not implemented here. Please add a FIXME comment here to complete the implementation for BE systems as well.

The check for 64-bit target is also an artifact of the current implementation and not a fundamental requirement here.

1924 ↗(On Diff #84217)

These should come from the TTI.

1943 ↗(On Diff #84217)

Is this how we want to use getMaxLoadsPerMemcmp()? The name seems to suggest to me that this is a threshold for how many actual load instructions can be emitted. But this appears to be used as the maximum number of bytes that can be loaded. Or am I misreading what SizeVal is?

1951 ↗(On Diff #84217)

Any reason to use std::vector's here rather than SmallVector?

1970 ↗(On Diff #84217)

Full sentences please. And the line is too long anyway, please clang-format this patch.

1981 ↗(On Diff #84217)

It seems kind of weird to me to have this "iteration zero" check here. I think it's more natural to do this outside the loop and put the loads ahead of the increment GEP's.

1992 ↗(On Diff #84217)

Do we want to set it to 1 or to whatever the alignment the arguments to the CallInst had?
But I agree we need to set it explicitly.

2230 ↗(On Diff #84217)

Two comments here:

  1. I think a complex conditional expression like this is better suited for an early exit (i.e. if any one of the required conditions aren't met, return false).
  2. The condition needs a comment explaining why all of these are required.
2238 ↗(On Diff #84217)

Uppercase variable names.

This revision now requires changes to proceed.Jan 16 2017, 3:56 AM

On SystemZ, we use the existing callback EmitTargetCodeForMemcmp to expand memcmp calls. Can you explain how the new hook is different and why we'd need two hooks?

On SystemZ, we use the existing callback EmitTargetCodeForMemcmp to expand memcmp calls. Can you explain how the new hook is different and why we'd need two hooks?

I think the primary difference is that SystemZ has instructions that can perform these comparisons directly (i.e. with an instruction that compares memory from two addresses). On Power, we have to do the loads, comparison and branch out of the load sequence as soon as a mismatch is found. And the SDAG isn't well suited for adding new control flow.
So I'd say the old hook is useful for targets that have instructions that can terminate the comparison on a mismatch without an explicit branch, whereas the new hook is useful for targets that don't.
But I'll let Zaara chime in for the full answer.

syzaara added a subscriber: rengolin.
efriedma added inline comments.Jan 16 2017, 10:15 AM
lib/CodeGen/CodeGenPrepare.cpp
1992 ↗(On Diff #84217)

Well, if you can prove the pointer passed to the load has higher alignment, you can refine it... I guess you might want to do that here since instcombine won't run after CodeGenPrepare.

2036 ↗(On Diff #84217)

You might want to special-case "memcmp(...) == 0", like we do in SelectionDAGBuilder.

On SystemZ, we use the existing callback EmitTargetCodeForMemcmp to expand memcmp calls. Can you explain how the new hook is different and why we'd need two hooks?

I think the primary difference is that SystemZ has instructions that can perform these comparisons directly (i.e. with an instruction that compares memory from two addresses). On Power, we have to do the loads, comparison and branch out of the load sequence as soon as a mismatch is found. And the SDAG isn't well suited for adding new control flow.
So I'd say the old hook is useful for targets that have instructions that can terminate the comparison on a mismatch without an explicit branch, whereas the new hook is useful for targets that don't.

Well, but on SystemZ we will also emit branches; memcmp will generally expand to a loop over CLC instructions. It's just that this is hidden behind a CLC_LOOP pseudo at the DAG level, which then gets expanded by a custom inserter.

Well, but on SystemZ we will also emit branches; memcmp will generally expand to a loop over CLC instructions. It's just that this is hidden behind a CLC_LOOP pseudo at the DAG level, which then gets expanded by a custom inserter.

The original implementation was very similar but the community felt that this isn't the best approach for valid reasons (i.e. no DAG combines, etc.). Please refer to: http://lists.llvm.org/pipermail/llvm-dev/2016-December/108589.html

On SystemZ, we use the existing callback EmitTargetCodeForMemcmp to expand memcmp calls. Can you explain how the new hook is different and why we'd need two hooks?

I think the primary difference is that SystemZ has instructions that can perform these comparisons directly (i.e. with an instruction that compares memory from two addresses). On Power, we have to do the loads, comparison and branch out of the load sequence as soon as a mismatch is found. And the SDAG isn't well suited for adding new control flow.
So I'd say the old hook is useful for targets that have instructions that can terminate the comparison on a mismatch without an explicit branch, whereas the new hook is useful for targets that don't.
But I'll let Zaara chime in for the full answer.

How does this compare to the existing SelectionDAGBuilder::visitMemCmpCall/EmitTargetCodeForMemcmp/etc.?

SelectionDAGBuilder::visitMemCmpCall optimizes specific cases of memcmp such as when size=0 or when we are using the return value of memcmp for comparing !=0 for sizes 2,4,8. These sizes wouldn't need multiple basic blocks since we can just use one load and a branch. This expansion is for the more general case when the size is a constant less than a specific threshold and expands into multiple basic blocks with early exits. Both visitMemCmpCall and EmitTargetCodeForMemcmp work on SelectionDAG where introducing control flow and working with multiple basic blocks isn't preferred.

lib/CodeGen/CodeGenPrepare.cpp
1943 ↗(On Diff #84217)

Ya, this was the same way it was used for memcpy/memset etc. It seems like number of stores but actually returns number of bytes. Just followed it for consistency.

efriedma added inline comments.Jan 16 2017, 12:22 PM
lib/CodeGen/CodeGenPrepare.cpp
1943 ↗(On Diff #84217)

No, getMaxStoresPerMemcpy is definitely the number of memory operations, not the number of bytes.

syzaara updated this revision to Diff 86107.Jan 27 2017, 1:01 PM
syzaara edited edge metadata.
syzaara marked an inline comment as done.
syzaara edited the summary of this revision. (Show Details)
syzaara marked 10 inline comments as done.Jan 27 2017, 1:05 PM

I find the logic in this patch quite difficult to follow. For example, I had trouble figuring out exactly what happens when we have a 1-byte load (i.e. branching past the block that calculates the return value, etc.). I also find that a rather peculiar special case.
Furthermore, it is not clear to me that it would be easy to extend this setup to [say] do this expansion into vector code.

lib/CodeGen/CodeGenPrepare.cpp
85 ↗(On Diff #86107)

Line too long?

1878 ↗(On Diff #86107)

The name is far too general.

1879 ↗(On Diff #86107)

It appears that LoadType is poorly named. From the body of the function, it appears that this is not a pointer type but the type of the [presumably] loaded value.

1882 ↗(On Diff #86107)

I don't see a reason to declare this here and initialize it later. Same with Res. In fact, I don't see a real need to declare either at all.

1893 ↗(On Diff #86107)

It is not at all obvious why we're masking out the bottom 3 bits here. Please explain (in a comment).

And (this is a personal preference) it seems clearer what you're doing if you write this as something like ~ 7LU or ~ 7U.

1896 ↗(On Diff #86107)

This appears to make an implicit assumption that the type of PhiXor is exactly LoadType and that furthermore, they can be either i32 or i64.
I think an assert to make sure of that is in order.

1909 ↗(On Diff #86107)

Why? Why is this not the case for i16, i32, etc...

1912 ↗(On Diff #86107)

It appears likely that a function with this many parameters might need refactoring. I imagine it is even hard to keep track of whether all of these are input parameters or if some of them are output or input/output. At the very least, I'd like to see a Doxygen comment that makes this explicit.

1923 ↗(On Diff #86107)

What if Source1 or Source2 are already of type LoadPtrTy?

1946 ↗(On Diff #86107)

I don't believe this comment is accurate. It does not at all appear that you "subtract and return". In fact, you'll still emit a compare against zero and branch if this not the last load block.

1968 ↗(On Diff #86107)

I imagine this isn't needed. We have IntegerType::get(LLVMContext &C, unsigned NumBits).

1991 ↗(On Diff #86107)

This seems entirely unnecessary since there is Type::getPointerTo().

1993 ↗(On Diff #86107)

default:?
I would imagine an llvm_unreachable is in order.

2089 ↗(On Diff #86107)

Please, use false for values of type bool. Besides, don't we want to actually pass the correct argument here? If we *are* compiling with -Os or -Oz, this should be true I imagine. Or even more likely, we should exit early from this function when compiling with -Oz.

2136 ↗(On Diff #86107)

There is absolutely no reason to separate the initialization of this from its declaration.

2397 ↗(On Diff #86107)

The formatting in this entire block is messed up. Please run it through clang-format.

rengolin removed a subscriber: rengolin.Jan 30 2017, 6:47 AM
jtony added inline comments.Jan 30 2017, 7:36 AM
lib/CodeGen/CodeGenPrepare.cpp
1913 ↗(On Diff #86107)

I agree with Nemanja here. This function has 11 parameters which is far too many (generally speaking function parameters exceed 7 is not recommended). Maybe you can try splitting the function into smaller ones to reduce the number of parameters need to pass using the Single Responsibility Principle. Or make some of the parameters class members so that you don't need to pass them around (be very careful only make these necessary parameters class members because doing that you are increasing their live scope).

lei added inline comments.Feb 7 2017, 6:48 AM
lib/CodeGen/CodeGenPrepare.cpp
2145 ↗(On Diff #86107)

It would be more meaningful to put something like .... find the max load size that can be used

2150 ↗(On Diff #86107)

We should find a better name for this "Remainder" variable. I find it confusing. It should be named for what it actually is, which see to be the number of bytes to be processed. Maybe something like NumBytesToBeProcessed.

syzaara updated this revision to Diff 89705.Feb 24 2017, 11:52 AM
syzaara marked an inline comment as done.
syzaara edited the summary of this revision. (Show Details)
efriedma added inline comments.Mar 2 2017, 10:51 AM
include/llvm/IR/Instruction.h
483 ↗(On Diff #89705)

Better to put this in ValueTracking.h, or something like that.

include/llvm/Target/TargetLowering.h
1016 ↗(On Diff #89705)

The maximum size in bytes isn't really a useful metric; what actually matters for performance/codesize is the number of loads. (It takes roughly the same number of instructions to memcmp 15 bytes vs. 32 bytes.)

lib/CodeGen/CodeGenPrepare.cpp
2122 ↗(On Diff #89705)

CreateZExt.

2233 ↗(On Diff #89705)

Hmm... in isOnlyUsedInZeroEqualityComparison mode, it probably makes sense to perform more loads per block, to reduce the number of branches.

2269 ↗(On Diff #89705)

Would bswap(Src1) > bswap(Src2) ? 1 : -1 be shorter?

2639 ↗(On Diff #89705)

Better to use the overload of getLibFunc that takes a Function&.

syzaara added inline comments.Mar 9 2017, 8:25 AM
lib/CodeGen/CodeGenPrepare.cpp
2233 ↗(On Diff #89705)

I'm not sure what you mean here. We have the branches for doing an early exit so we don't need to continue loading more bytes when we find a difference.

efriedma added inline comments.Mar 9 2017, 9:56 AM
lib/CodeGen/CodeGenPrepare.cpp
2233 ↗(On Diff #89705)

A couple loads hitting the cache are cheaper than a branch in cases where the branch isn't predictable... but maybe not a big deal either way.

syzaara added inline comments.Mar 9 2017, 11:06 AM
lib/CodeGen/CodeGenPrepare.cpp
2233 ↗(On Diff #89705)

I'm still not sure how this reduces the number of branches. We would still need to compare each of the loads to see which one had the difference.

2269 ↗(On Diff #89705)

On power, having a load followed by a bswap will get converted to a load byte reversed hardware instruction. But I'm not sure if other targets will have a good way to handle bswap within gprs. It could introduce a longer sequence. Also this way, we are adding more branches with the comparison.

syzaara updated this revision to Diff 92952.Mar 24 2017, 8:46 AM
syzaara edited the summary of this revision. (Show Details)
This revision now requires changes to proceed.May 3 2017, 11:47 PM

@courbet Hi, you've marked this as needs revision, but I don't see any comments.

syzaara requested review of this revision.May 4 2017, 10:52 AM
nemanjai accepted this revision.May 13 2017, 2:07 PM

This seems OK to me (with the minor nits addressed on the commit). Since I've requested changes to a previous review, I believe that I have to approve the review to allow the change to go ahead. However, @efriedma has raised valid concerns previously and I think you should ensure that you get his approval prior to committing this.

lib/CodeGen/CodeGenPrepare.cpp
89 ↗(On Diff #92952)

This line is too long. I won't add further comments to this end, but please be sure to run clang-format-diff on this to fix up such issues.

2134 ↗(On Diff #92952)

Capitals for variable names.

2169 ↗(On Diff #92952)

Please initialize these in a member initializer list.

2289 ↗(On Diff #92952)

nit: formatting seems off here.

2305 ↗(On Diff #92952)

Just use compound assignment (i.e. +=, -=) on this line and below.

This revision is now accepted and ready to land.May 13 2017, 2:07 PM
syzaara added inline comments.May 16 2017, 1:10 PM
lib/CodeGen/CodeGenPrepare.cpp
2289 ↗(On Diff #92952)

This is from clang-format.

syzaara updated this revision to Diff 99183.May 16 2017, 1:19 PM
vchin added a subscriber: vchin.May 17 2017, 7:51 AM
efriedma accepted this revision.May 17 2017, 12:07 PM

Looks good! A few minor comments. Please test carefully before merging.

lib/CodeGen/CodeGenPrepare.cpp
2311 ↗(On Diff #99183)

I think this function is just return (Size / MaxLoadSize) + countPopulation(Size % MaxLoadSize).

2319 ↗(On Diff #99183)

return MinAlign(Size, MaxLoadSize)?

2549 ↗(On Diff #99183)

Generally better to avoid floating-point for this sort of thing, especially since converting an 32-bit integer to a 32-bit float isn't exact. I guess we don't have a helper in MathExtras for this; you can write something like like "NumBlocks / NumLoadsPerBlock + (NumBlocks % NumLoadsPerBlock != 0 ? 1 : 0)".

syzaara added inline comments.May 25 2017, 6:40 AM
lib/CodeGen/CodeGenPrepare.cpp
2319 ↗(On Diff #99183)

MinAlign doesn't work here. MinAlign returns the largest power of 2 that divides both Size and MaxLoadSize. We want the largest power of 2 which is <= Size.

efriedma added inline comments.May 26 2017, 12:27 PM
lib/CodeGen/CodeGenPrepare.cpp
2319 ↗(On Diff #99183)

Oh, so MinAlign(PowerOf2Floor(Size), MaxLoadSize)?

This revision was automatically updated to reflect the committed changes.