This is an archive of the discontinued LLVM Phabricator instance.

[CGP, PowerPC] try to constant fold before creating loads for memcmp expansion
Closed, Public

Authored by spatel on Jun 9 2017, 4:40 PM.

Details

Summary

I think this is the last step needed to avoid regressions for x86 before we flip the switch to allow expansion of the smallest set of memcmp() sizes via CGP. The DAG version of the expansion checks for constant strings, so we need to do that here too.

FWIW, the test with 2 constant operands is not handled by LibCallSimplifier::optimizeMemCmp() because that code is limited to arrays of 8-bit constants. LibCallSimplifier will also fail to optimize some tests with 1 constant operand because its alignment requirements are too strict (it shouldn't require alignment for a constant operand).
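
To make that concrete, the shape of the fix is roughly the following. This is a minimal sketch, not the committed patch: ConstantFoldLoadFromConstPtr() is the real API from llvm/Analysis/ConstantFolding.h, but Source1, LoadSizeType, DL, and Builder are assumed names for the surrounding expansion state.

  // Try to fold the load away when a memcmp operand is a constant
  // (e.g., a string literal) before creating an IR load for it.
  Value *LoadSrc1 = nullptr;
  if (auto *C = dyn_cast<Constant>(Source1))
    LoadSrc1 = ConstantFoldLoadFromConstPtr(C, LoadSizeType, DL);
  // Folding can fail even when the pointer is a Constant, so fall
  // back to creating a real load in that case.
  if (!LoadSrc1)
    LoadSrc1 = Builder.CreateLoad(LoadSizeType, Source1);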

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Jun 9 2017, 4:40 PM
nemanjai added inline comments. Jun 11 2017, 9:13 PM
lib/CodeGen/CodeGenPrepare.cpp
1855 (On Diff #102094)

I think this would read nicer if we just had the dyn_cast<Constant> and then initialized LoadSrc1 with a ternary operator.
But that's just a thought.
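
Presumably the suggestion is something like this hypothetical rewrite, using the same assumed names as the sketch in the summary (the cast is needed because the two arms of the ternary have unrelated pointer types, Constant* vs. LoadInst*):

  Constant *C1 = dyn_cast<Constant>(Source1);
  Value *LoadSrc1 =
      C1 ? ConstantFoldLoadFromConstPtr(C1, LoadSizeType, DL)
         : static_cast<Value *>(Builder.CreateLoad(LoadSizeType, Source1));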

spatel added inline comments. Jun 11 2017, 9:22 PM
lib/CodeGen/CodeGenPrepare.cpp
1855 (On Diff #102094)

That would mean that if the pointer is a constant, then ConstantFoldLoadFromConstPtr always succeeds. But I don't think we can make that assumption.

nemanjai added inline comments. Jun 11 2017, 9:51 PM
lib/CodeGen/CodeGenPrepare.cpp
1855 (On Diff #102094)

Ah, OK, fair enough. I didn't realize that.

echristo accepted this revision. Jun 19 2017, 11:55 AM

Looks OK to me.

This revision is now accepted and ready to land. Jun 19 2017, 11:55 AM
This revision was automatically updated to reflect the committed changes.

Thanks, Eric!

For those following the progress for x86, I enabled the smallest sizes with:
https://reviews.llvm.org/rL305802

But now I see 2 more missed IR optimizations, and I'm again wondering: is there a reason for this expansion to occur in CGP rather than in its own pass that could run before the final simplifycfg/instcombine in a normal opt pipeline?

Looking at the general 4-byte memcmp() expansion as an example:

define i32 @cmp4(i8* %x, i8* %y) {
loadbb:
  %0 = bitcast i8* %x to i32*
  %1 = bitcast i8* %y to i32*
  %2 = load i32, i32* %0
  %3 = load i32, i32* %1
  %4 = call i32 @llvm.bswap.i32(i32 %2)
  %5 = call i32 @llvm.bswap.i32(i32 %3)
  %6 = zext i32 %4 to i64   ; <--- the extends are unnecessary
  %7 = zext i32 %5 to i64
  %8 = sub i64 %6, %7       ; <--- causing a too-wide sub
  %9 = icmp ne i64 %8, 0    ; <--- and a too-wide cmp
  br i1 %9, label %res_block, label %endblock

res_block:                     
  %10 = icmp ult i64 %6, %7
  %11 = select i1 %10, i32 -1, i32 1
  br label %endblock

endblock:                 ; <--- this could have been simplified to a select
  %phi.res = phi i32 [ 0, %loadbb ], [ %11, %res_block ]
  ret i32 %phi.res
}

Sure enough, if we run -simplifycfg and -instcombine, we get:

%0 = bitcast i8* %x to i32*
%1 = bitcast i8* %y to i32*
%2 = load i32, i32* %0, align 4
%3 = load i32, i32* %1, align 4
%4 = call i32 @llvm.bswap.i32(i32 %2)
%5 = call i32 @llvm.bswap.i32(i32 %3)
%6 = icmp ne i32 %4, %5
%7 = icmp ult i32 %4, %5
%8 = select i1 %7, i32 -1, i32 1
%phi.res = select i1 %6, i32 %8, i32 0
ret i32 %phi.res
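
For reference, IR like the cmp4 example above can be reproduced from a source function such as this (a hypothetical test case; it produces the expansion only when the 4-byte size is enabled for the target):

  #include <cstring>
  // Compiled at -O2 for x86, the memcmp call should now expand to the
  // load/bswap/compare sequence shown above instead of a libc call.
  int cmp4(const char *x, const char *y) { return memcmp(x, y, 4); }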