The branch/select is removed in InstCombine and then recreated in CodeGenPrepare, effectively making it invisible to most of the optimization pipeline, which leads to poor optimization of cttz/ctlz when the target doesn't have a zero-defined version of them.
Event Timeline
The branch/select is removed in InstCombine and then recreated in CodeGenPrepare, effectively making it invisible to most of the optimization pipeline, which leads to poor optimization of cttz/ctlz when the target doesn't have a zero-defined version of them.
That canonicalization transform looks perfectly valid to me. It is undone in CodeGenPrepare if cttz/ctlz (with is_zero_undef == 0) is not supported by the target.
I don't see how it can lead to bad codegen. Could you please provide an example that shows the issue?
@andreadb it hides the branch from many of the subsequent optimization passes, resulting in bad codegen. I stumbled on bad codegen from cttz/ctlz several times recently. Things that I noticed are: doing the 0-case check several times, failure to constant fold the 0 case when there is one, etc.
It doesn't make sense to re-implement all of this in CodeGenPrepare, nor does it make sense to special-case it throughout the whole pipeline. Canonicalization should help subsequent passes, not hide information from them, which is what this is doing.
There is obviously a tradeoff here. Canonicalizing toward the intrinsics provides later passes with more information about what the code is doing. There are a number of transformations and analyses that have special logic for ctlz (etc.). However, when we then expand the intrinsic, we need to make sure that we can optimize away redundancies in those expansions.
I really think we need some examples here. Many backends (although perhaps not x86) run EarlyCSE and other IR-level cleanups in CodeGen. Maybe that's a better solution here?
To me, bad codegen has a different meaning, which is why I was concerned by your initial post and the lack of specific tests for the problematic cases. I now understand that you meant what I would call poor codegen (as a synonym of suboptimal).
It doesn't make sense to re-implement all of this in CodeGenPrepare, nor does it make sense to special-case it throughout the whole pipeline. Canonicalization should help subsequent passes, not hide information from them, which is what this is doing.
As far as I remember, this combine rule was originally added to specifically target [cttz|ctlz]+select pairs introduced by SimplifyCFG as the result of aggressively flattening simple if-then CFGs. Most (if not all) of those simple if-then constructs originated from conditional statements in x86 bmi/lzcnt intrinsic definitions. SimplifyCFG was originally modified to enable a more aggressive flattening of simple if-then branches. However, there was a concern with the performance of cttz/ctlz+select for targets with no tzcnt/bmi. So Sanjay implemented a "despeculation" logic in CodeGenPrepare (http://llvm.org/viewvc/llvm-project?view=revision&revision=253573) to revert the CFG flattening done by SimplifyCFG (only for targets which don't provide a cheap cttz/ctlz - see also: llvm.org/viewvc/llvm-project?rev=255660&view=rev).
If there is a problem, then it is very likely to be caused by a bad interaction between these three entities:
- SimplifyCFG (which aggressively flattens simple if-then blocks even in the presence of one expensive cttz/ctlz).
- InstCombine (which canonicalizes a cttz/ctlz+select into a single cttz/ctlz [is_zero_undef=0]).
- CodeGenPrepare (which is able to despeculate a cttz/ctlz [is_zero_undef=0] by undoing the flattening of the CFG done by SimplifyCFG).
I am not suggesting that the current design is optimal. However, as Hal pointed out, there is a trade-off here. I am not convinced that InstCombine is "hiding information". FWIW, the "branch" is already hidden by the speculation/CFG flattening performed by SimplifyCFG.
I think that we really need to see some code samples to get a better understanding of what is going wrong. There may be a better place to fix your issue.
Cheers,
Andrea
Let's call it poor codegen if that is clearer. Indeed it isn't bad in the sense that it does something invalid; it is bad in the sense that it could be better. It doesn't seem to be a good idea to remove the branch and then reintroduce it later on when the target needs a branch anyway.
The reason is that the check for 0 is now implicit and, even if some optimizations are done, it is very easy to defeat them. For instance, it is able to set the zero_undef flag when doing:
if (n == 0) { // ... } else { ctlz(n); }
But not in trivial variations such as:
if (n == 0xff) { // ... } else { ctlz(~n); }
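To make the difference concrete, here is a hand-written IR sketch of the first shape (not actual compiler output; @elided_zero_case and the function name are made up, standing in for the elided "// ..." body). The dominating compare is directly against %n, so isKnownNonZero can prove the ctlz operand is non-zero and the flag can be set:

declare i32 @llvm.ctlz.i32(i32, i1)
declare i32 @elided_zero_case()   ; placeholder for the elided "// ..." body

define i32 @first_case(i32 %n) {
entry:
  %cmp = icmp eq i32 %n, 0
  br i1 %cmp, label %zero, label %nonzero

zero:
  %r = call i32 @elided_zero_case()
  ret i32 %r

nonzero:
  ; %n is known non-zero here thanks to the dominating branch,
  ; so InstCombine can use is_zero_undef=true.
  %c = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
  ret i32 %c
}

In the second shape the compare is against 0xff and the ctlz operand is ~n, so the same non-zero fact is hidden behind the xor.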
This second case is a real one I faced when working on a serialization/deserialization library. While tweaking the code to get the optimization is possible, this would be brittle, and considering I faced variations of this problem several times, I thought there is probably something to do here.
It doesn't seem that teaching all passes about this special case is going to fly. There are just too many cases and variations to match. All the code required to do these optimizations already exists, so I think it is smarter to leverage it.
Grepping quickly in there, it doesn't look like SimplifyCFG creates the select. I think we are good here. It may be profitable to explode the select into a branch/phi in that case, but that isn't suitable for InstCombine anyway as it preserves the control flow, and it may not be necessary at all in the end. Improving the ability of the compiler to handle select seems like the right path anyway if that's the case, as it'll be profitable for other structures as well.
This patch goes against my understanding that InstCombine is only for target-independent transforms.
Like Andrea and Hal have requested, I'd like to see an example for how we get to the problem state.
I tried to put together an IR example from what you described here:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109398.html
int goo(int n) {
  int a = (n == 0x0) ? 32 : __builtin_clz(n);
  int b = ((a * 36) + 35) >> 8;
  return b;
}
$ ./clang -O2 ctlz.c -S -o - -emit-llvm
source_filename = "ctlz.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.12.0"

define i32 @goo(i32 %n) local_unnamed_addr #0 {
entry:
  %cmp = icmp eq i32 %n, 0
  br i1 %cmp, label %cond.end, label %cond.false

cond.false:
  %0 = tail call i32 @llvm.ctlz.i32(i32 %n, i1 true)
  %phitmp = mul nuw nsw i32 %0, 36
  %phitmp4 = add nuw nsw i32 %phitmp, 35
  %phitmp56 = lshr i32 %phitmp4, 8
  br label %cond.end

cond.end:
  %cond = phi i32 [ %phitmp56, %cond.false ], [ 4, %entry ]   ; <--- constant folded as expected?
  ret i32 %cond
}
I have one case that boils down to this for instance:
define i64 @_D2gc1d4util6decodeFMKAxhZm({ i64, i8* }* nocapture readonly %data) local_unnamed_addr #2 {
entry:
  %0 = getelementptr inbounds { i64, i8* }, { i64, i8* }* %data, i64 0, i32 1
  %1 = load i8*, i8** %0, align 8
  %2 = load i8, i8* %1, align 1
  %3 = icmp eq i8 %2, -1
  br i1 %3, label %then, label %endif

then:                                             ; preds = %entry
  ret i64 23

endif:                                            ; preds = %entry
  %4 = xor i8 %2, -1
  %5 = tail call i8 @llvm.ctlz.i8(i8 %4, i1 false)
  %6 = zext i8 %5 to i64
  ret i64 %6
}
This doesn't optimize further.
And the codegen:
_D2gc1d4util6decodeFMKAxhZm:
        movq    8(%rdi), %rax
        movb    (%rax), %al
        cmpb    $-1, %al
        je      .LBB2_5
        je      .LBB2_2
        notb    %al
        movzbl  %al, %eax
        bsrl    %eax, %eax
        xorl    $7, %eax
        movzbl  %al, %eax
        retq
.LBB2_5:
        movl    $23, %eax
        retq
.LBB2_2:
        movb    $8, %al
        movzbl  %al, %eax
        retq
Note the double branch.
Thanks Amaury,
Sorry in advance for this long post.
Your code example gave me some hints about what is going wrong.
I think I have been able to find a small reproducer for your particular scenario (see below).
static int my_clz(int n) {
  return (n) ? __builtin_clz(n) : 32;
}

int foo(int n) {
  unsigned val;
  if (n == -1)
    return 32;
  return my_clz(~n);
}
'my_clz' matches the definition of __lzcnt32 from 'lzcntintrin.h'. Before we run SimplifyCFG, that function would look like this:
define internal i32 @my_clz(i32 %n) {
entry:
  %tobool = icmp ne i32 %n, 0
  br i1 %tobool, label %cond.true, label %cond.end

cond.true:                                        ; preds = %entry
  %0 = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
  br label %cond.end

cond.end:                                         ; preds = %entry, %cond.true
  %cond = phi i32 [ %0, %cond.true ], [ 32, %entry ]
  ret i32 %cond
}
SimplifyCFG would first speculate the (potentially expensive) call to llvm.ctlz, and then flatten the CFG by inserting a select:
define internal i32 @my_clz(i32 %n) {
entry:
  %tobool = icmp eq i32 %n, 0
  %0 = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
  %cond = select i1 %tobool, i32 32, i32 %0
  ret i32 %cond
}
At this point, our "problematic" instcombine kicks in, and that entire code sequence is simplified into @llvm.ctlz.i32(i32 %n, i1 false).
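For completeness, a hand-written sketch of what the my_clz body reduces to after that fold (the expected result, not pasted compiler output):

declare i32 @llvm.ctlz.i32(i32, i1)

define internal i32 @my_clz(i32 %n) {
entry:
  %0 = call i32 @llvm.ctlz.i32(i32 %n, i1 false)
  ret i32 %0
}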
I can see how this is going to be sub-optimal if we are in the following scenario:
- We are building for a non-LZCNT target (example: SandyBridge), and
- Our instcombine is triggered before my_clz is inlined into foo.
Before CodeGenPrepare, we would end up with code like this:
define i32 @_Z3fooi(i32 %n) local_unnamed_addr #0 {
entry:
  %cmp = icmp eq i32 %n, -1
  br i1 %cmp, label %cleanup, label %if.end

if.end:                                           ; preds = %entry
  %neg = xor i32 %n, -1
  %0 = tail call i32 @llvm.ctlz.i32(i32 %neg, i1 false) #2   ;; <--- suboptimal flag!
  br label %cleanup

cleanup:                                          ; preds = %entry, %if.end
  %retval.0 = phi i32 [ %0, %if.end ], [ 32, %entry ]
  ret i32 %retval.0
}
Basically, I can see how triggering that instcombine prematurely might lead to poor codegen for your non-lzcnt target.
Ideally, we would want that canonicalization to be performed directly before (or during) codegen to help instruction selection on targets that have a fast ctz/clz defined on zero.
What if instead we move that transform into CodeGenPrepare::optimizeSelectInst()? I think that would fix the issue in a cleaner way, since we would not need to introduce new target hooks to conditionalize its execution.
-Andrea
Thanks for posting the example. That made it clear, and the proposal to move the functionality of foldSelectCttzCtlz() later sounds good to me.
Before CodeGenPrepare, we would end up with code like this:
define i32 @_Z3fooi(i32 %n) local_unnamed_addr #0 {
entry:
  %cmp = icmp eq i32 %n, -1
  br i1 %cmp, label %cleanup, label %if.end

if.end:                                           ; preds = %entry
  %neg = xor i32 %n, -1
  %0 = tail call i32 @llvm.ctlz.i32(i32 %neg, i1 false) #2   ;; <--- suboptimal flag!
  br label %cleanup

cleanup:                                          ; preds = %entry, %if.end
  %retval.0 = phi i32 [ %0, %if.end ], [ 32, %entry ]
  ret i32 %retval.0
}
Can you clarify why the flag is suboptimal by itself? The intrinsic carries the same semantics as the unfolded sequence, doesn't it?
This seems to me like just a missing optimization to recover that at this point: can't we just figure out that %neg can't be zero and turn the flag to true?
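Concretely (hand-edited from the IR quoted above, not compiler output), that would mean rewriting if.end as:

if.end:                                           ; preds = %entry
  %neg = xor i32 %n, -1
  ; %n != -1 on this path (the dominating branch), so %neg != 0 and the
  ; flag can safely be true.
  %0 = tail call i32 @llvm.ctlz.i32(i32 %neg, i1 true) #2
  br label %cleanup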
include/llvm/Analysis/TargetTransformInfo.h
  Line 198: s/defnied/defined.
test/Transforms/InstCombine/select-cmp-cttz-ctlz.ll
  Line 3: A comment before the two new lines would be nice to have. Also, I suspect this won't work without the X86 backend configured in.
Can you clarify why the flag is suboptimal by itself? The intrinsic carries the same semantics as the unfolded sequence, doesn't it?
Yes. That is exactly what I originally pointed out in this thread.
This seems to me like just a missing optimization to recover that at this point: can't we just figure out that %neg can't be zero and turn the flag to true?
I agree that we are currently missing an optimization.
That said, (if I remember correctly) the only place where we form cttz/ctlz with is_zero_undef=false is in foldSelectCttzCtlz() and the only goal of that transform is to canonicalize cttz/ctlz in preparation for codegen. That's why I suggested considering the possibility of moving that transform into CGP. If we do this, then we no longer need to add extra optimization rules to "fix" the fact that we prematurely canonicalized.
We are also doing it in InstCombine (see foldCttzCtlz) using isKnownNonZero, but it is unable to figure that one out. Looking at the implementation, it looks pretty ad hoc.
Just to clarify, foldSelectCttzCtlz() is the only place where we introduce cttz/ctlz with is_zero_undef=false.
We could potentially extend the logic in isKnownNonZero to add more cases. That said, I am not suggesting that we do that, since we can solve this entire issue by just moving the canonicalization rule from InstCombine to CGP.
Out of curiosity, I had a quick look at what extra rules would be needed to fix our reproducer. We would need to teach isKnownNonZero how to look through an xor (see below).
// Sketch of the extra case in isKnownNonZero() (ValueTracking.cpp).
Value *X;
ConstantInt *C;
if (match(V, m_Xor(m_Value(X), m_ConstantInt(C))))
  return isKnownNonEqual(X, C, Q);
However, function isKnownNonEqual would not know how to analyze users of X. So, we would need to add extra logic in ValueTracking.cpp to loop over the users of X in search of a dominating ICmpInst([ICMP_EQ|ICMP_NE], X, C) (similarly to what function isKnownNonNullFromDominatingCondition() does).
So, although it is possible, it may not be the best way to fix this.