This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Teach when it is profitable to speculate calls to @llvm.cttz/ctlz.
AbandonedPublic

Authored by andreadb on Dec 16 2014, 4:05 AM.

Details

Summary

Hi David, Hal, Quentin (and all),

If we know that the control flow is modelling an if-statement where the only instruction in 'then' basic block (excluding the terminator) is a call to cttz/ctlz, it may be beneficial to speculate the cttz/ctlz call and let SimplifyCFG convert the associated phi instruction in the 'end' basic block into a select.

Example:
;;
entry:

%cmp = icmp eq i64 %Val, 0
br i1 %cmp, label %end.bb, label %then.bb

then.bb:

%c = tail call i64 @llvm.cttz.i64(i64 %val, i1 true)
br label %EndBB

end.bb:

%cond = phi i64 [ %c, %then.bb ], [ 64, %entry]

;;

The call to @llvm.cttz.i64 could be speculated. This would allow to fold the entire code sequence into:

%cmp = icmp eq i64 %Val, 0
%c = tail call i64 @llvm.cttz.i64(i64 %val, i1 false)
%cond = select i1 %cmp, i64 64, i64 %c

The constraints are:
a) The 'then' basic block is taken only if the input operand to the cttz/ctlz is different than zero;
b) The phi node propagates the size-of (in bits) of the value %val in input to the cttz/ctlz if %val is zero.

If all these constraints are met, the optimizer can hoist the call to cttz/ctlz from the 'then' basic block into the 'entry' basic block. The phi instruction would then be replaced by a select statement.
The new cttz/ctlz instruction will also have the 'undef on zero' flag set to 'false'. This would allow the instruction combiner to further simplify the code by getting rid of the redundant select.

The IR from the example can be obtained from the following code:
///
unsigned long long foo(unsigned long long A) {

  return A ? __builtin_ctzll(A) : 64;
}

///

On X86-64 targets with feature TZCNT, this patch would allow the backend to generate optimal assembly code for function 'foo':

tzcntq %rdi, %rax
retq

On X86-64 targets with no TZCNT, the call to cttz with the 'undef on zero' flag cleared would be expanded into a longer sequence involving a conditional move instruction:

bsfq  %rdi, %rcx
movl  $64, %eax
cmovneq  %rc, %rax
popq  %rbp
retq

On X86 (not x86-64) targets with no LZCNT/TZCNT and no CMOV instructions, the backend would futher expand the conditional moves introducing machine basic blocks. Basically it would revert this optimization re-introducing the if-else structure.

My only questions are: is SimplifyCFG the correct place where to put this logic? If not, then where do you think I should put it?

Please, let me know what you think.

Thanks,
Andrea

Diff Detail

Event Timeline

andreadb updated this revision to Diff 17329.Dec 16 2014, 4:05 AM
andreadb retitled this revision from to [SimplifyCFG] Teach when it is profitable to speculate calls to @llvm.cttz/ctlz..
andreadb updated this object.
andreadb edited the test plan for this revision. (Show Details)
andreadb added reviewers: majnemer, hfinkel, qcolombet.
andreadb added a subscriber: Unknown Object (MLST).
qcolombet edited edge metadata.Dec 16 2014, 11:56 AM

Hi Andrea,

On X86 (not x86-64) targets with no LZCNT/TZCNT and no CMOV instructions, the backend would futher expand the conditional moves introducing machine basic blocks. Basically it would revert this optimization re-introducing the if-else structure.

Have you checked that the output assembly is the same (or equivalent) performance-wise on such targets?

My only questions are: is SimplifyCFG the correct place where to put this logic? If not, then where do you think I should put it?

Assuming the answer of my previous question is yes, I think it makes sense to have that in SimplifyCFG.

Thanks,
-Quentin

hfinkel edited edge metadata.Dec 16 2014, 12:36 PM

Hi Andrea,

...

Assuming the answer of my previous question is yes, I think it makes sense to have that in SimplifyCFG.

I agree; the transformation seems reasonable. If it turns out not be be good for targets that can't completely fold the result, you could put the transformation in CodeGenPrep where you can directly query the target.

majnemer added inline comments.Dec 16 2014, 1:13 PM
lib/Transforms/Utils/SimplifyCFG.cpp
1624–1629

This could be:

if (match(ThenV, m_Intrinsic<Intrinsic::cttz>(Op0) ||
    match(ThenV, m_Intrinsic<Intrinsic::ctlz>(Op0))
1631–1633

If you use m_APInt, this will also work with vector types.

1635–1640

Please use dyn_cast.

1643–1646

Why not just use match(Cmp->getOperand(1), m_Zero())

Hi Quentin, Hal, David,

thanks a lot for the useful feedback!.

Hi Andrea,

On X86 (not x86-64) targets with no LZCNT/TZCNT and no CMOV instructions, the backend would futher expand the conditional moves introducing machine basic blocks. Basically it would revert this optimization re-introducing the if-else structure.

Have you checked that the output assembly is the same (or equivalent) performance-wise on such targets?

So, I checked the output assembly for those targets. At first I thought that the codegen was equivalent performance-wise. But I was wrong, since there is unfortunately an important difference: the count leading/trailing zeros instruction is now always dominating the control flow. Therefore, it would always be speculatively executed.
While this is ok for the case where we the input value is known not to be zero (since BSF/BSR would be executed anyway), this is sub-optimal for the case where the value is zero.

Before this patch, with a value of zero in input to cttz/ctlz, the instructions dynamically executed would have been:

"TEST+conditional branch+propagation of constant"

With this patch, we would execute instead:

"BSF/BSR+conditional branch+propagation of constant".

bsf/bsr would be able to set the rFLAGS, so the backend avoids inserting an extra TEST. However, BSF/BSR is much slower (it may be microcoded on old x86 targets...). So, I am afraid that my patch would slow down the code for x86 with no CMOV.

I'll see if I can move this logic in CodeGenPrepare as suggested by Hal.

My idea is to do the following (if you agree):

  • move this logic into CodeGenPrepare;
  • guard the code against a check on the subtarget (something like 'isCheapToSpeculateCttzCtlz').
    • On X86 that method would return true if we have TZCNT/LZCNT or if we have feature CMOV.

Thanks again for your time.
I'll prepare a new patch.

Cheers,
Andrea

Hi Andrea,

My idea is to do the following (if you agree):

  • move this logic into CodeGenPrepare;
  • guard the code against a check on the subtarget (something like 'isCheapToSpeculateCttzCtlz').
  • On X86 that method would return true if we have TZCNT/LZCNT or if we have feature CMOV.

Sounds good to me.

Thanks,
-Quentin

andreadb abandoned this revision.Dec 18 2014, 12:44 PM

Uploaded a new patch here: http://reviews.llvm.org/D6728