Page MenuHomePhabricator

[LoopIdiomRecognize] Teach CTLZ/CTTZ idiom recognition to handle not being able to find a pre-loop check for the input being 0.
Changes PlannedPublic

Authored by craig.topper on Mar 26 2021, 10:29 AM.



To find the zero check we look for a branch whose condition is
a zero compare with the same Value* passed to the phi for the loop.
InstCombine knows some tricks that can break this. The one I observed
is when the input to the loop is a llvm.abs intrinsic. InstCombine
rewrites icmp eq (llvm.abs X), 0 to icmp eq X, 0 since the absolute
value doesn't change 0 or produce 0 for any other inputs. I believe
the original motivating case for this tranform featured an absolute
value before the loop so we should make this work.

I didn't want to teach this pass to also look through llvm.abs as
it seemed like we'd inevitably need to teach it more things to look
through. So instead I've inserted a new zero check for this case to
select the count value the loop should produce if there was no
zero check before the loop. This new zero check will hopefully be
optimized by InstCombine and eventually determined to be unneeded
and removed. If not, it's probably still cheaper than the loop.

Diff Detail

Unit TestsFailed

25,030 msx64 debian > LLVM.Bindings/Go::go.test
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/llvm-go go=/usr/bin/go test

Event Timeline

craig.topper created this revision.Mar 26 2021, 10:29 AM
craig.topper requested review of this revision.Mar 26 2021, 10:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 26 2021, 10:29 AM

This makes sense to me, but I'm not as familiar with LIR as others, so might want to get a 2nd look.


It's not clear to me what cost model metrics we're using. Does this change the equation?


Can you add a -O1 (or -O2?) test in a pre-commit to PhaseOrdering that starts from the IR for the actual C code we're trying to model? (could run it through -mem2reg if that makes it less obnoxious).

That way, we'll have coverage to make sure we can get through a full instcombine+LIR+instcombine sequence with the expected output.


But we do transform this loop now - update comment.

Add phase ordering test.

Update test comment.

craig.topper added inline comments.Mar 28 2021, 12:24 PM

I think the cost model is if the loop just counts bits and doesn't do anything else, always expand to ctlz since the loop will get deleted. If the loop has other instructions, check if ctlz cost is TCC_Basic (for X86 I think this is is lzcnt or tzcnt intructions enabled). If is then do the transform, the loop will stay but it's trip count will be determine by the ctlz.

If the loop is going to get removed, then the extra select I'm inserting probably doesn't matter. If the loop won't get removed, I'm not sure if the extra select and icmp should be costed. In many cases there probably is a real zero check and other optimizations will remove the extra instructions. I think the original case just counted bits, but it's been a few years since I looked at it. So if we're concerned I can probably handle just the case that the loop will be removed.

spatel accepted this revision.Mar 29 2021, 5:16 AM


This revision is now accepted and ready to land.Mar 29 2021, 5:16 AM
craig.topper planned changes to this revision.Mar 29 2021, 9:32 AM
craig.topper added inline comments.

I think this needs to be TMP3 not TMP1.

Hm, i've just completed mine's recognizeShiftUntilZero(), and i thought it would resolve this,
but there the shifee is stationary, while here we have shift recurrence (always shift by one).
So more idioms to support...