This is an archive of the discontinued LLVM Phabricator instance.

[LibCallSimplifier] try harder to fold memcmp with constant arguments
ClosedPublic

Authored by spatel on Aug 19 2017, 8:20 AM.

Details

Summary

Try to fold:
memcmp(X, C, ConstantLength) == 0 --> load X == *C

Without this change, we're unnecessarily checking the alignment of the constant data, so we miss the transform in the first 2 tests in the patch.

I noted this shortcoming of LibCallSimplifier in one of the recent CGP memcmp expansion patches. This doesn't help the example in:
https://bugs.llvm.org/show_bug.cgi?id=34032#c13
...directly, but I think it's worth short-circuiting more of these simple cases since we're already trying to do that.
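For illustration, here is a source-level sketch of what the folded form computes (a hypothetical example, not the patch's actual IR): the libcall plus compare becomes one integer load of X compared against *C as a constant.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Source-level model of the transform in the summary (illustrative only,
// not the LibCallSimplifier code itself): for a constant length of 4,
// memcmp(X, C, 4) == 0 is equivalent to loading X as a 32-bit integer and
// comparing it against the constant *C loaded the same way. memcpy models
// the load without alignment UB; endianness cancels out because both
// sides are read identically.
bool memcmp_eq4(const char *x, const char *c) {
  return std::memcmp(x, c, 4) == 0;  // libcall + compare
}

bool load_eq4(const char *x, const char *c) {
  uint32_t xi, ci;
  std::memcpy(&xi, x, 4);  // load X
  std::memcpy(&ci, c, 4);  // *C, foldable to an integer constant
  return xi == ci;         // single integer compare
}
```

Both functions agree for any 4-byte inputs; the point of the transform is that the second form is a load plus icmp that later passes can actually reason about.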

Diff Detail

Event Timeline

spatel created this revision.Aug 19 2017, 8:20 AM
spatel edited the summary of this revision. (Show Details)Aug 19 2017, 8:27 AM
davide edited edge metadata.Aug 19 2017, 8:37 AM

What does GCC do in this case? (or icc?)

> What does GCC do in this case? (or icc?)

https://godbolt.org/g/oKf3s8

So gcc gets it; icc did something terrible; clang also gets this, but in the x86 backend. This is moving the optimization up in the pipeline to hopefully combine with a common prefix optimization suggested by Joerg in D35035.

> This is moving the optimization up in the pipeline to hopefully combine with a common prefix optimization suggested by Joerg in D35035.

Do you have an example of a common prefix opt that fires if we perform this lowering as part of the instruction combiner?

> Do you have an example of a common prefix opt that fires if we perform this lowering as part of the instruction combiner?

No. AFAIK, no such common prefix optimization exists in LLVM.

> No. AFAIK, no such common prefix optimization exists in LLVM.

So maybe we should consider postponing this change until a real use case shows up?
It doesn't seem like a lot of code to add when/if a real opportunity appears.

> So maybe we should consider postponing this change until a real use case shows up?
> It doesn't seem like a lot of code to add when/if a real opportunity appears.

I'm not sure why that's better. If you're advocating that we should remove early memcmp transforms entirely, then the burden of proof for that change is much higher as I understand it. But this patch is just trying to fix a logic hole in the existing transform.

The motivation for fixing this now is that there are cases where earlier analysis/expansion of small memcmp would be a perf win, so I'd like to remove this roadblock from affecting work on that going forward.

A potential example of that is the attachment in PR34032 (that's the optimized IR from an llvm source file - lib/IR/Function.cpp). The IR has ~5500 memcmp calls in it, and even in its current limited form for x86, CGP memcmp expansion will transform over 4K of those calls to inline IR (i.e., the majority of the memcmps have a constant length of less than 16 bytes). If you then run that IR through the normal opt -O2 pipeline, it gets significantly smaller, so I'm taking that as early evidence that there's room for improvement.

> The motivation for fixing this now is that there are cases where earlier analysis/expansion of small memcmp would be a perf win, so I'd like to remove this roadblock from affecting work on that going forward.

While I understand your motivation, I don't see the immediate benefits. I'm happy to have this patch in InstCombine if you can show a case we can't catch already.

> While I understand your motivation, I don't see the immediate benefits. I'm happy to have this patch in InstCombine if you can show a case we can't catch already.

I don't understand your motivation in delaying/blocking this patch. Are you opposed to the existing transform?

The immediate benefit is shown in the test cases in this patch: we're replacing a libcall + cmp with a load + cmp. The transform exposes further IR-level optimizations because we can and do reason more effectively about a load + cmp than a memcmp libcall. If it's not clear, I can check in the test cases with their current behavior as the baseline, so we just have the test diffs here?

majnemer edited edge metadata.Aug 19 2017, 3:59 PM

Canonicalizing away from memcmp seems pretty good to me. An easy way (I think) to show benefit is that multiple memcmps involving the same non-constant operand will CSE the loads which results in a considerably more analyzable result.
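A minimal sketch of that CSE benefit (hypothetical code, not from the patch): two memcmps of the same buffer against different 4-byte constants lower to two integer compares of a single shared load.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two memcmps with the same non-constant operand (p)...
int classify(const char *p) {
  if (std::memcmp(p, "true", 4) == 0) return 1;
  if (std::memcmp(p, "fals", 4) == 0) return 0;  // first 4 bytes of "false"
  return -1;
}

// ...conceptually lower to compares of ONE load, which CSE can share; the
// resulting integer compares are far easier for later passes to analyze.
int classify_lowered(const char *p) {
  uint32_t v, t, f;
  std::memcpy(&v, p, 4);       // single shared load of p
  std::memcpy(&t, "true", 4);  // constants fold to integers
  std::memcpy(&f, "fals", 4);
  if (v == t) return 1;
  if (v == f) return 0;
  return -1;
}
```

After the lowering, passes can also prove facts the libcall form hides, e.g. that the two branches are mutually exclusive because v cannot equal two different constants.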

> Canonicalizing away from memcmp seems pretty good to me. An easy way (I think) to show benefit is that multiple memcmps involving the same non-constant operand will CSE the loads which results in a considerably more analyzable result.

That's correct, and sorry if it wasn't clear, but let me link directly to https://bugs.llvm.org/show_bug.cgi?id=34032#c13 which has this manufactured example:
https://godbolt.org/g/QjVXvS

Does that provide the required motivation/benefit?

> Does that provide the required motivation/benefit?

In my eyes, yes.

davide accepted this revision.Aug 21 2017, 5:27 AM

Oh, I missed the CSE case. LGTM.

This revision is now accepted and ready to land.Aug 21 2017, 5:27 AM
This revision was automatically updated to reflect the committed changes.
efriedma added inline comments.Aug 23 2017, 1:03 PM
llvm/trunk/lib/Transforms/Utils/SimplifyLibCalls.cpp
Line 778 (On Diff #111971)

Are you sure this is right? It looks like you'll create a dead load if the LHS is aligned but the RHS isn't (and therefore drive instcombine into an infinite loop).

spatel added inline comments.Aug 23 2017, 1:10 PM
llvm/trunk/lib/Transforms/Utils/SimplifyLibCalls.cpp
Line 778 (On Diff #111971)

No - I'm sure this is wrong in exactly the way you've noted. :)
Sorry the review didn't get updated here in Phab, but I reverted this version of the patch at rL311340 and recommitted with a fix and extra test case at rL311366. Please let me know if you see any problems there.
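The hazard efriedma describes, and the shape of the fix, can be sketched abstractly (hypothetical stand-in code, not the actual SimplifyLibCalls logic): run every legality check before materializing any instruction, so a failed fold never leaves a dead load behind for InstCombine to delete and revisit forever.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model: "instructions" created before a bail-out become dead code.
struct FoldResult {
  std::vector<std::string> created;  // instructions we materialized
  bool folded = false;
};

// Buggy ordering: create the LHS load, THEN discover the RHS isn't
// loadable and bail, leaving a dead load behind. InstCombine erases the
// dead load, revisits the call, and the cycle repeats: an infinite loop.
FoldResult foldBuggy(bool lhsLoadable, bool rhsLoadable) {
  FoldResult r;
  if (!lhsLoadable) return r;
  r.created.push_back("load_lhs");
  if (!rhsLoadable) return r;  // bail-out AFTER creating an instruction!
  r.created.push_back("load_rhs");
  r.folded = true;
  return r;
}

// Fixed ordering: check both operands up front; only then create loads,
// so a rejected fold creates nothing.
FoldResult foldFixed(bool lhsLoadable, bool rhsLoadable) {
  FoldResult r;
  if (!lhsLoadable || !rhsLoadable) return r;  // all checks first
  r.created.push_back("load_lhs");
  r.created.push_back("load_rhs");
  r.folded = true;
  return r;
}
```

This is a general InstCombine contract: a simplification that declines to fire must leave the function exactly as it found it.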

efriedma edited edge metadata.Aug 23 2017, 1:15 PM

Oh, sorry, I'm way behind on my email and didn't see it. Yes, that looks fine.