This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Remove known bits constant folding
ClosedPublic

Authored by nikic on Mar 7 2020, 2:09 AM.

Details

Summary

If ExpensiveCombines is enabled (which is the case with -O3 on the legacy PM and always on the new PM), InstCombine tries to compute the known bits of all instructions in the hope that all bits end up being known. This is the most expensive individual part of InstCombine.

How effective is it? If we add some statistics on how often the constant folding succeeds and how many KnownBits calculations are performed and run test-suite we get:

"instcombine.NumConstPropKnownBits": 642,
"instcombine.NumConstPropKnownBitsComputed": 18744965,

In other words, we get one fold for every 30000 KnownBits calculations. However, the truth is actually much worse: Currently, known bits are computed before performing other folds, so there is a high chance that cases that get folded by known bits would also have been handled by other folds.

What happens if we compute known bits after all other folds (hacky implementation: https://gist.github.com/nikic/751f25b3b9d9e0860db5dde934f70f46)?

"instcombine.NumConstPropKnownBits": 0,
"instcombine.NumConstPropKnownBitsComputed": 18105547,

So it turns out despite doing 18 million known bits calculations, the known bits fold does not do anything useful on test-suite. I was originally planning to move this into AggressiveInstCombine so it only runs once in the pipeline, but seeing this, I think we're better off removing it entirely.

As this is the only use of the "expensive combines" mechanism, it may be removed afterwards, but I'll leave that to a separate patch.

Diff Detail

Event Timeline

nikic created this revision.Mar 7 2020, 2:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2020, 2:09 AM
nikic marked 5 inline comments as done.Mar 7 2020, 2:15 AM
nikic added inline comments.
test/Transforms/InstCombine/all-bits-shift.ll
36 ↗(On Diff #248914)

I believe this test case was the original motivation for having this fold.

However, I thinks this should be handled by InstCombineSimplifyDemanded, which we invoke in cases where we have a reasonable expectation of either demanded bits or known bits simplifications to occur (such as an "and" root, as is the case here). SimplifyDemanded currently doesn't handle this case due to what looks like an implementation bug to me: While normally SimplifyDemanded computes known bits for instructions it doesn't handle itself, it does not do so for some instructions it only partially handles (e.g. it handles a constant shift amount, but does not compute known bits if the shift amount is not constant).

test/Transforms/InstCombine/assume.ll
340 ↗(On Diff #248914)

I'm not sure we really need to do anything about this, I think it's only important that we have an assume(false) here, and SimplifyCFG will deal with the rest.

If we do want to improve on this, we could convert assume(false) into store undef (the InstCombine UB pattern) and then remove all instructions after store undef.

test/Transforms/InstCombine/expensive-combines.ll
14 ↗(On Diff #248914)

We're missing a fold to replace call with returned attribute to the returned argument. Known bits calculation handles the particular case where the operand is constant, but non-constant operands are not handled at all right now.

test/Transforms/InstCombine/out-of-bounds-indexes.ll
10 ↗(On Diff #248914)

This is an improvement.

test/Transforms/InstCombine/phi-shifts.ll
13 ↗(On Diff #248914)

This is an improvement.

jdoerfert added inline comments.
test/Transforms/InstCombine/assume.ll
340 ↗(On Diff #248914)

Can't we replace the UB instruction with unreachable?

nikic marked an inline comment as done.Mar 8 2020, 12:44 AM
nikic added inline comments.
test/Transforms/InstCombine/assume.ll
340 ↗(On Diff #248914)

InstCombine preserves CFG, so it's not possible to use unreachable directly. Instead we need to use UB patterns understood by SimplifyCFG (assume false is one of them), which will convert them to proper unreachable.

Rebase this after fixes (if any) have landed?

This is the most expensive individual part of InstCombine.

That is an interesting statement.
It would be good to have some perf numbers here.

Do we do this fold somewhere else in the pipeline?

nikic updated this revision to Diff 249166.Mar 9 2020, 11:01 AM

Rebase over D75804 and D75815.

@lebedev.ri Finally got around to implementing some compile-time testing infrastructure today, here's some numbers (in terms of instructions retired) for this change:

Basically this is a 1-2% end-to-end improvement for -O3 compile-times.

@lebedev.ri Finally got around to implementing some compile-time testing infrastructure today, here's some numbers (in terms of instructions retired) for this change:

Basically this is a 1-2% end-to-end improvement for -O3 compile-times.

Thank you, so this does indeed have some measurable effect.
Sounds justified then.

Do we do this fold somewhere else in the pipeline?

?
Do we believe CVP/SCCP/??? will cover all the possible cases?

@lebedev.ri Finally got around to implementing some compile-time testing infrastructure today, here's some numbers (in terms of instructions retired) for this change:

Basically this is a 1-2% end-to-end improvement for -O3 compile-times.

http://llvm-compile-time-tracker.com/?config=O3&stat=wall-time
That looks great!
Is that end-to-end time for the entire compile or just 'opt -O3'?

nikic marked an inline comment as done.Mar 20 2020, 3:29 AM

Do we do this fold somewhere else in the pipeline?

?
Do we believe CVP/SCCP/??? will cover all the possible cases?

I believe this should be covered by two vectors: First, by known bits calculation as part of InstCombineSimplifyDemanded. We don't call this for literally everything, but it handles all the cases that are likely to be fully known via known bits (bitwise ops and shifts as root instructions). Second, we have the same known bits optimization as here also as part of SimplifyInstruction. That means we already perform this in all places that use that high-level interface (e.g. it happens during inlining). I do want to get rid of that known bits call as well, but I think it is the reason why the InstCombine fold triggers exactly zero times on test-suite.

http://llvm-compile-time-tracker.com/?config=O3&stat=wall-time
That looks great!
Is that end-to-end time for the entire compile or just 'opt -O3'?

Those are end-to-end times. Wall time is a bit too noisy to be really useful, so I tend to look at instructions, which is a reasonable proxy, and accurate to within 0.1% for most benchmarks.

test/Analysis/ValueTracking/known-signbit-shift.ll
46 ↗(On Diff #249166)

I missed these Analysis test changes before. The shl here is poison because we know it wraps based on known bits. This optimization gets lost now.

I could add it back explicitly (and better, by returning undef rather than zero) like this: https://gist.github.com/nikic/29135f304f7cf9de6d18dff7ca12659a

I'm not sure whether that's worthwhile though, it seems that these tests are more about not crashing due to conflicting known bits than anything else.

SGTM, but i'll defer to @spatel.

test/Analysis/ValueTracking/known-signbit-shift.ll
46 ↗(On Diff #249166)

This still seems like worthwhile fold.

nikic retitled this revision from [InstCombine] Remove known bits constant folding (WIP) to [InstCombine] Remove known bits constant folding.Mar 20 2020, 4:27 AM
nikic edited the summary of this revision. (Show Details)
nikic set the repository for this revision to rG LLVM Github Monorepo.
nikic marked an inline comment as done.Mar 20 2020, 5:05 AM
nikic added inline comments.
test/Analysis/ValueTracking/known-signbit-shift.ll
46 ↗(On Diff #249166)

I've opened D76489 to address the reason why this difference exists in the first place (SimplifyDemanded produces a less good known bits result). We could still do a better overflow detection on top of that, but at least this removes the gap to what we currently do.

Please update after D76489.
If we're going to defer the removal of ExpensiveCombines to a follow-on NFC patch, add a FIXME note in this patch near its definition to indicate that it is deprecated/useless (just in case the follow-on doesn't happen immediately).

nikic updated this revision to Diff 251694.Mar 20 2020, 10:21 AM

Rebase and add FIXME.

spatel accepted this revision.Mar 20 2020, 10:51 AM

LGTM

This revision is now accepted and ready to land.Mar 20 2020, 10:51 AM
This revision was automatically updated to reflect the committed changes.