This is an archive of the discontinued LLVM Phabricator instance.

[LoopIdiomRecognize] Teach CTLZ/CTTZ idiom recognition to handle not being able to find a pre-loop check for the input being 0.
Changes PlannedPublic

Authored by craig.topper on Mar 26 2021, 10:29 AM.

Details

Reviewers
spatel
lebedev.ri
Summary

To find the zero check we look for a branch whose condition is
a zero compare with the same Value* passed to the phi for the loop.
InstCombine knows some tricks that can break this. The one I observed
is when the input to the loop is a llvm.abs intrinsic. InstCombine
rewrites icmp eq (llvm.abs X), 0 to icmp eq X, 0 since the absolute
value doesn't change 0 or produce 0 for any other inputs. I believe
the original motivating case for this tranform featured an absolute
value before the loop so we should make this work.

I didn't want to teach this pass to also look through llvm.abs as
it seemed like we'd inevitably need to teach it more things to look
through. So instead I've inserted a new zero check for this case to
select the count value the loop should produce if there was no
zero check before the loop. This new zero check will hopefully be
optimized by InstCombine and eventually determined to be unneeded
and removed. If not, it's probably still cheaper than the loop.

Diff Detail

Unit TestsFailed

Event Timeline

craig.topper created this revision.Mar 26 2021, 10:29 AM
craig.topper requested review of this revision.Mar 26 2021, 10:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 26 2021, 10:29 AM

This makes sense to me, but I'm not as familiar with LIR as others, so might want to get a 2nd look.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1781–1784

It's not clear to me what cost model metrics we're using. Does this change the equation?

llvm/test/Transforms/LoopIdiom/X86/ctlz.ll
259–260

Can you add a -O1 (or -O2?) test in a pre-commit to PhaseOrdering that starts from the IR for the actual C code we're trying to model? (could run it through -mem2reg if that makes it less obnoxious).

That way, we'll have coverage to make sure we can get through a full instcombine+LIR+instcombine sequence with the expected output.

712–715

But we do transform this loop now - update comment.

Add phase ordering test.

Update test comment.

craig.topper added inline comments.Mar 28 2021, 12:24 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1781–1784

I think the cost model is if the loop just counts bits and doesn't do anything else, always expand to ctlz since the loop will get deleted. If the loop has other instructions, check if ctlz cost is TCC_Basic (for X86 I think this is is lzcnt or tzcnt intructions enabled). If is then do the transform, the loop will stay but it's trip count will be determine by the ctlz.

If the loop is going to get removed, then the extra select I'm inserting probably doesn't matter. If the loop won't get removed, I'm not sure if the extra select and icmp should be costed. In many cases there probably is a real zero check and other optimizations will remove the extra instructions. I think the original case just counted bits, but it's been a few years since I looked at it. So if we're concerned I can probably handle just the case that the loop will be removed.

spatel accepted this revision.Mar 29 2021, 5:16 AM

LGTM

This revision is now accepted and ready to land.Mar 29 2021, 5:16 AM
craig.topper planned changes to this revision.Mar 29 2021, 9:32 AM
craig.topper added inline comments.
llvm/test/Transforms/LoopIdiom/X86/ctlz.ll
294

I think this needs to be TMP3 not TMP1.

Hm, i've just completed mine's recognizeShiftUntilZero(), and i thought it would resolve this,
but there the shifee is stationary, while here we have shift recurrence (always shift by one).
So more idioms to support...

aaronpuchert added inline comments.
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1811

Going with NewCount here shouldn't have unintended side effects, though we'd need to use a different Ty I think.

llvm/test/Transforms/LoopIdiom/X86/ctlz.ll
294

Agreed. They differ for ABS_N = 0 where TMP1 = 0 but TMP3 = 1. With TCPHI = 0 we get TCDEC = -1 hence TOBOOL = false in the first iteration, so we don't exit as we should. In fact we'll eventually violate the nsw I think.

Herald added a project: Restricted Project. · View Herald TranscriptJul 25 2022, 2:58 PM
Herald added a subscriber: StephenFan. · View Herald Transcript
lebedev.ri resigned from this revision.Jan 12 2023, 4:58 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.