This is an archive of the discontinued LLVM Phabricator instance.

[InstSimplify] Remove known bits constant folding
ClosedPublic

Authored by nikic on May 2 2020, 12:19 PM.

Details

Summary

If SimplifyInstruction() does not succeed in simplifying the instruction, it will compute the known bits of the instruction in the hope that all bits are known and the instruction can be folded to a constant. I have removed a similar optimization from InstCombine in D75801, and would like to drop this one as well.
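For context, a hedged paraphrase of the fallback this patch removes (not the verbatim deleted code; the SimplifyQuery members follow the surrounding SimplifyInstruction() code):

#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Called after the per-opcode dispatch in SimplifyInstruction() failed to
// simplify I: compute the known bits of the whole instruction and fold it
// to a constant if every bit is known.
static Value *foldViaKnownBits(Instruction *I, const SimplifyQuery &Q) {
  if (!I->getType()->isIntOrIntVectorTy())
    return nullptr;
  KnownBits Known = computeKnownBits(I, Q.DL, /*Depth=*/0, Q.AC, I, Q.DT);
  if (!Known.isConstant())
    return nullptr;
  return ConstantInt::get(I->getType(), Known.getConstant());
}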

On average, we spend 1% of total compile-time performing this known bits calculation. However, if we introduce some additional statistics counting how many known bits computations are performed and how many of them succeed in simplifying the instruction, we get (on test-suite):

instsimplify.NumKnownBits: 216
instsimplify.NumKnownBitsComputed: 13828375
valuetracking.NumKnownBitsComputed: 45860806

Out of ~14M known bits calculations (accounting for approximately one third of all known bits calculations), only 0.0015% succeed in producing a constant. Those cases where we do succeed in computing all known bits will get folded by other passes like InstCombine later. On test-suite, only lencod.test and GCC-C-execute-pr44858.test show a hash difference after this change.
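For reference, counters like these are declared with LLVM's STATISTIC macro and printed by -stats under a "<DEBUG_TYPE>." prefix; the exact counters used for this measurement were local, out-of-tree additions, so the following is only a sketch:

#include "llvm/ADT/Statistic.h"
#define DEBUG_TYPE "instsimplify"

STATISTIC(NumKnownBitsComputed, "Number of known bits computations");
STATISTIC(NumKnownBits, "Number of known bits computations folding to a constant");
// ...with ++NumKnownBitsComputed before each computeKnownBits() call and
// ++NumKnownBits whenever the result turns out to be a fully known constant.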

There are of course "regressions" in InstSimplify tests, because some things that were previously handled by InstSimplify are now only handled by InstCombine. I will comment inline.

One final thing to note here is that all this affects only the SimplifyInstruction() API, not the individual per-instruction-kind Simplify APIs, which never try to use KnownBits in this way.
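To make the distinction concrete, a rough sketch (illustrative only; signatures are approximate for the LLVM version under review):

#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Generic entry point: dispatches on the opcode and, before this patch, fell
// back to a computeKnownBits() query when nothing else folded.
Value *viaGenericAPI(Instruction *I, const SimplifyQuery &Q) {
  return SimplifyInstruction(I, Q);
}

// Per-opcode entry points such as SimplifyAddInst() never had that fallback.
Value *viaPerOpcodeAPI(Value *LHS, Value *RHS, const SimplifyQuery &Q) {
  return SimplifyAddInst(LHS, RHS, /*IsNSW=*/false, /*IsNUW=*/false, Q);
}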

Diff Detail

Event Timeline

nikic created this revision.May 2 2020, 12:19 PM
Herald added a project: Restricted Project. May 2 2020, 12:19 PM
nikic marked 3 inline comments as done.May 2 2020, 12:46 PM

Okay, after going through the tests, there isn't really anything to comment on beyond "everything gets folded by InstCombine instead, in the same way".

If the general change looks fine, I'd move some of the tests that no longer fold with InstSimplify into the InstCombine tests instead.

On test-suite, only lencod.test and GCC-C-execute-pr44858.test show a hash difference after this change.

Did you analyze these changes?

nikic added a comment (edited).May 2 2020, 3:40 PM

On test-suite, only lencod.test and GCC-C-execute-pr44858.test show a hash difference after this change.

Did you analyze these changes?

For lencod:

in function writeMBLayer:
  in block %for.cond138.preheader.i:
    <   %mb_y.0328.i = phi i32 [ 0, %for.cond134.preheader.i ], [ 2, %for.inc219.i ]
  in block %for.inc219.i:
    >   %cmp135.i = icmp eq i64 %indvars.iv383.i, 0
    >   br i1 %cmp135.i, label %for.cond138.preheader.i, label %writeCBPandLumaCoeff.exit
    <   %cmp135.i = icmp eq i32 %mb_y.0328.i, 0
    <   %indvars.iv.next339.i = add nuw nsw i64 %indvars.iv338.i, 2
    <   br i1 %cmp135.i, label %for.cond138.preheader.i, label %writeCBPandLumaCoeff.exit

This is an improvement: we optimize away a loop phi. Don't ask me how.

For the torture test:

in function main:
  in block %entry:
    >   %cmp = icmp ne i32 %call, 0
        %0 = load i32, i32* @b, align 4
    >   %cmp1 = icmp ne i32 %0, 1
    >   %or.cond = or i1 %cmp, %cmp1
    >   br i1 %or.cond, label %if.then, label %if.end
    <   %cmp1 = icmp eq i32 %0, 1
    <   br i1 %cmp1, label %if.end, label %if.then

This is a regression: we fail to optimize away a comparison. The test is small enough to trace: we now determine that a call always returns zero a bit later in the pipeline, not in time for IPSCCP. As this is a noinline-based test case, IPSCCP is the only chance to propagate the constant return value interprocedurally.

nikic updated this revision to Diff 261690.May 3 2020, 3:57 AM

I have now duplicated all the relevant tests into InstCombine and added comments that these no longer fold as part of InstSimplify. The only test I've dropped outright is the assume.ll one, as I don't think it makes sense if we don't perform top-level KnownBits folding. I've left the remaining ones alone, but could drop them if desired.

spatel accepted this revision.May 3 2020, 9:29 AM

LGTM - if we can remove redundant/expensive logic with few visible side-effects, that's a nice win.

test/Transforms/GVN/PRE/volatile.ll
204 ↗(On Diff #261690)

There's no dedicated fold for this in InstCombiner::visitLoadInst(). But we call computeKnownBits on the ret arg in instcombine anyway, so we get all basic patterns. That might be worth looking at as another candidate for removal for efficiency.

I.e., we may want to add a dedicated fold (add a simplifyLoad?), since it's cheap to do directly and a big win if it works (no idea if this happens in the real world, but somebody added this GVN test case...).
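For illustration, a hedged sketch of what such a dedicated fold in InstCombiner::visitLoadInst() could look like (this is not existing code, just the shape of the idea):

// Fold a load whose !range metadata admits exactly one value to that
// constant; the side-effect-free load itself is then cleaned up as dead code.
if (!LI.isVolatile())
  if (MDNode *Range = LI.getMetadata(LLVMContext::MD_range))
    if (Range->getNumOperands() == 2) {
      auto *Lo = mdconst::extract<ConstantInt>(Range->getOperand(0));
      auto *Hi = mdconst::extract<ConstantInt>(Range->getOperand(1));
      // !range {Lo, Hi} means Lo <= value < Hi, so a width-1 range is a constant.
      if (Lo->getValue() + 1 == Hi->getValue())
        return replaceInstUsesWith(LI, Lo);
    }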

A test that would subvert the current known-bits call from InstCombiner::visitReturnInst(), given the current MaxDepth recursion limit:

define i32 @known0(i32* %V, i32 %y) {
  %load = load i32, i32* %V, !range !0
  %m1 = mul i32 %load, %y
  %m2 = mul i32 %m1, %y
  %m3 = mul i32 %m2, %y
  %m4 = mul i32 %m3, %y
  %m5 = mul i32 %m4, %y
  %m6 = mul i32 %m5, %y
  ret i32 %m6
}

!0 = !{ i32 0, i32 1 }
This revision is now accepted and ready to land.May 3 2020, 9:29 AM
nikic marked an inline comment as done.May 3 2020, 9:51 AM
nikic added inline comments.
test/Transforms/GVN/PRE/volatile.ll
204 ↗(On Diff #261690)

Going through the blame, this test was added in https://github.com/llvm/llvm-project/commit/b8da3a2bb2b840db6ab7c473190ee6d65dcf3a1e. I think the intention was to make sure that instructions with side-effects don't get optimized away just because they simplify. Given that, it might make sense to replace the load volatile with something else here (not sure what though -- do we have any standard pattern that simplifies without being trivially dead?)

I don't think we need to explicitly handle this case in InstCombine, as single-element range annotations seem unlikely to be common in the wild. Additionally, I think this is best left to passes that specialize in range propagation. For example, CVP will also handle your more complex example successfully. Given ongoing range work, I expect that SCCP will handle it in the future as well.
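To illustrate why range-propagating passes cope with this naturally, a hedged sketch using ConstantRange directly (not CVP's actual code): the !range !{i32 0, i32 1} on the load pins the value to the single-element range {0}, and multiplying that range by anything keeps it at {0}, no matter how deep the mul chain gets.

#include "llvm/ADT/APInt.h"
#include "llvm/IR/ConstantRange.h"
using namespace llvm;

bool rangeStaysConstant() {
  ConstantRange LoadCR(APInt(32, 0), APInt(32, 1));   // !range !{i32 0, i32 1}
  ConstantRange Mul = LoadCR.multiply(ConstantRange::getFull(32));
  const APInt *Single = Mul.getSingleElement();
  return Single && *Single == 0;                      // still exactly {0}
}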

nikic marked an inline comment as done.May 3 2020, 11:30 AM
nikic added inline comments.
test/Transforms/GVN/PRE/volatile.ll
204 ↗(On Diff #261690)

Given that, it might make sense to replace the load volatile with something else here (not sure what though -- do we have any standard pattern that simplifies without being trivially dead?)

Looks like call i32 undef() works for that purpose. I've replaced the load volatile with that.

This revision was automatically updated to reflect the committed changes.
nikic reopened this revision.May 3 2020, 12:43 PM
nikic added a subscriber: arsenm.

Had to revert this due to AMDGPU test failures. What happens there is that AMDGPU expands divisions in a custom CodeGenPrepare (CGP) pass, and the resulting instructions then get simplified as part of an EarlyCSE run that it schedules. The simplification in question is an ashr x, 31 sign-bit extraction, which gets folded to zero.

Not quite sure what to do about this. I guess it's one of: a) just accept the change, b) add a known-sign check in the AMDGPU div expansion code, or c) make SimplifyAShr more aggressive (i.e. use known bits) when we're extracting the sign bit. I'm leaning towards b). cc @arsenm

This revision is now accepted and ready to land.May 3 2020, 12:43 PM
nikic updated this revision to Diff 261720.May 3 2020, 1:00 PM

Try to determine sign of div operands in AMDGPU expansion code. This avoids any test changes on the AMDGPU side.

If this approach looks reasonable, I'll commit it separately.
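A hedged sketch of what such a known-sign check could look like (names and placement are an assumption; not necessarily what the updated diff actually does):

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
using namespace llvm;

// If both operands of an sdiv/srem are already known non-negative, the
// expansion can take the unsigned path and never emits the "ashr X, 31"
// sign mask, so there is nothing left for the removed known-bits fold to do.
static bool divOperandsKnownNonNegative(Value *Num, Value *Den,
                                        const DataLayout &DL,
                                        AssumptionCache *AC,
                                        const Instruction *CxtI) {
  return isKnownNonNegative(Num, DL, /*Depth=*/0, AC, CxtI) &&
         isKnownNonNegative(Den, DL, /*Depth=*/0, AC, CxtI);
}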

This revision was automatically updated to reflect the committed changes.