Improve isKnownNonZero for integers in order to improve cttz
optimizations.
Details
- Reviewers
nlopes spatel nikic mkazantsev - Commits
- rZORG89d1bdfe2187: [ValueTracking] Improve isKnowNonZero for Ints
rZORGaf50709cf9d3: [ValueTracking] Improve isKnowNonZero for Ints
rG89d1bdfe2187: [ValueTracking] Improve isKnowNonZero for Ints
rGaf50709cf9d3: [ValueTracking] Improve isKnowNonZero for Ints
rG3b137a495686: [ValueTracking] Improve isKnowNonZero for Ints
rL360222: [ValueTracking] Improve isKnowNonZero for Ints
Diff Detail
Event Timeline
Here are all the uses of isKnownNonZero with a (potentially) non-pointer operand I have found, so these might be affected by / may benefit from this change:
4 uses in simplifyICmpWithZero: https://github.com/llvm-mirror/llvm/blob/a2c8107c80101593ce06445cd3dcb3628dc95819/lib/Analysis/InstructionSimplify.cpp#L2414
isKnownPositive: https://github.com/llvm-mirror/llvm/blob/a2c8107c80101593ce06445cd3dcb3628dc95819/lib/Analysis/ValueTracking.cpp#L264 which has only 2 uses in https://github.com/llvm-mirror/llvm/blob/a2c8107c80101593ce06445cd3dcb3628dc95819/lib/Transforms/InstCombine/InstCombineCompares.cpp#L1318.
2 uses in computeKnownBitsMul: https://github.com/llvm-mirror/llvm/blob/a2c8107c80101593ce06445cd3dcb3628dc95819/lib/Analysis/ValueTracking.cpp#L348
2 uses in computeKnownBitsFromShiftOperator: https://github.com/llvm-mirror/llvm/blob/a2c8107c80101593ce06445cd3dcb3628dc95819/lib/Analysis/ValueTracking.cpp#L936
phi combining: https://github.com/llvm-mirror/llvm/blob/a2c8107c80101593ce06445cd3dcb3628dc95819/lib/Transforms/InstCombine/InstCombinePHI.cpp#L1183
llvm/test/Transforms/InstCombine/cttz.ll | ||
---|---|---|
12 ↗ | (On Diff #195657) | Hm, it looks like the optimization we want doesn't actually happen here? The i1 false should have changed to an i1 true. |
llvm/test/Transforms/InstCombine/cttz.ll | ||
---|---|---|
12 ↗ | (On Diff #195657) | Good catch. Just a cast of not updating the patch I uploaded. |
Looks good to me technically, I think it's only reasonable that integers and pointers are treated the same way here. Would like to have a second opinion though, maybe from @spatel?
It should be noted that the instsimplify cases here would also be simplified by either instcombine (presumably the dominating condition code there) or CVP. Is that possibly why this was not done before?
Can you push the baseline tests to trunk/head, so this diff is against an updated public baseline?
There's also a TODO comment/test about this enhancement in test/Transforms/LICM/hoist-mustexec.ll -- that should be updated as part of this patch.
About the relation to instcombine/CVP: there's an independent question (and unresolved in terms of implementation in LLVM) about which pass(es) have responsibility for these kinds of patterns. This patch makes valuetracking better with no obvious increase in complexity, so I think it's the right step. And if I'm seeing correctly, neither of those passes would get the last test with the extra blocks and phi without this patch.
test/Transforms/LICM/hoist-mustexec.ll still needs to be updated. (Should probably generate full checks as a preliminary NFC commit first.)
Oops - I forgot that even though I mentioned it in the earlier comment. Yes, that must be included in this patch (and any other existing tests that are known to be affected).
Sorry, I forgot all about this as well. I did run the update_test_checks script on this file after applying the patch and there was not improvement to the generated code.
Are you sure? I didn't check what exactly changed, but this test was failing for me locally after applying the patch.
I'm probably doing something wrong. I'll run some tests and tinker with this later today.
I can't tell what's actually happening in the LICM tests, so:
rL359881
Please rebase and update the patch here. Don't forget to update that TODO comment in the test file too; we should be set after that.
Hi folks,
This introduces a regression with RADV (the open source Vulkan driver in mesa) and some CTS tests, here's the list of failures:
- dEQP-VK.glsl.functions.control_flow.mixed_return_break_continue_fragment
- dEQP-VK.glsl.functions.control_flow.mixed_return_break_continue_vertex
Here's the LLVM IR generated by Mesa (before any optimizations passes):
; ModuleID = 'mesa-shader' source_filename = "mesa-shader" target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7" target triple = "amdgcn-mesa-mesa3d" define amdgpu_ps void @main([0 x i8] addrspace(6)* inreg noalias dereferenceable(18446744073709551615), i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, i32, i32, i32, i32) #0 { main_body: %temp7 = alloca float, addrspace(5) %temp6 = alloca float, addrspace(5) %temp5 = alloca float, addrspace(5) %temp4 = alloca float, addrspace(5) %temp3 = alloca float, addrspace(5) %temp2 = alloca float, addrspace(5) %temp1 = alloca float, addrspace(5) %temp = alloca float, addrspace(5) %18 = alloca float, addrspace(5) %19 = alloca float, addrspace(5) %20 = alloca float, addrspace(5) %21 = alloca float, addrspace(5) %22 = call i8 addrspace(4)* @llvm.amdgcn.implicit.buffer.ptr() #2 %23 = bitcast i8 addrspace(4)* %22 to [0 x <4 x i32>] addrspace(4)* %24 = call float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %1) #2 %25 = bitcast float %24 to i32 %26 = call float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %1) #2 %27 = bitcast float %26 to i32 %28 = call float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %1) #2 %29 = bitcast float %28 to i32 %30 = call float @llvm.amdgcn.interp.mov(i32 2, i32 3, i32 0, i32 %1) #2 %31 = bitcast float %30 to i32 br label %loop1 loop1: ; preds = %endif5, %main_body %32 = phi i32 [ %25, %main_body ], [ %49, %endif5 ] %33 = phi i32 [ 0, %main_body ], [ %50, %endif5 ] %34 = icmp sge i32 %33, 6 %35 = select i1 %34, i32 -1, i32 0 %36 = icmp ne i32 %35, 0 br i1 %36, label %if2, label %else3 if2: ; preds = %loop1 br label %endloop1 else3: ; preds = %loop1 br label %endif2 endif2: ; preds = %else3 %37 = icmp ne i32 %33, 0 %38 = select i1 %37, i32 -1, i32 0 %39 = icmp ne i32 %38, 0 br i1 %39, label %if5, label %else12 if5: ; preds = %endif2 %40 = icmp ne i32 %33, 1 %41 = select i1 %40, i32 -1, i32 0 %42 = icmp ne i32 %41, 0 br i1 %42, label %if6, label %else10 if6: ; preds = %if5 %43 = icmp eq i32 %33, 3 %44 = select i1 %43, i32 -1, i32 0 %45 = icmp ne i32 %44, 0 br i1 %45, label %if7, label %else8 if7: ; preds = %if6 br label %endloop1 else8: ; preds = %if6 br label %endloop1 endif7: ; No predecessors! br label %endif6 else10: ; preds = %if5 br label %endif6 endif6: ; preds = %else10, %endif7 %46 = bitcast i32 %32 to float %47 = fsub float -0.000000e+00, %46 %48 = bitcast float %47 to i32 br label %endif5 else12: ; preds = %endif2 br label %endif5 endif5: ; preds = %else12, %endif6 %49 = phi i32 [ %32, %else12 ], [ %48, %endif6 ] %50 = add i32 %33, 1 br label %loop1 endloop1: ; preds = %else8, %if7, %if2 %51 = phi i32 [ undef, %if2 ], [ undef, %if7 ], [ %32, %else8 ] %52 = phi i32 [ 0, %if2 ], [ 0, %if7 ], [ -1, %else8 ] %53 = icmp ne i32 %52, 0 %54 = select i1 %53, i32 %51, i32 1065353216 %55 = getelementptr [0 x i8], [0 x i8] addrspace(6)* %0, i32 0, i32 0 %56 = bitcast i8 addrspace(6)* %55 to <4 x i32> addrspace(6)*, !amdgpu.uniform !0 %57 = load <4 x i32>, <4 x i32> addrspace(6)* %56, !invariant.load !0 %58 = call float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %57, i32 0, i32 0) #2 %59 = bitcast float %58 to i32 %60 = bitcast i32 %59 to float %61 = fsub float -0.000000e+00, %60 %62 = bitcast float %61 to i32 %63 = bitcast i32 %54 to float %64 = bitcast i32 %62 to float %65 = fadd float %63, %64 %66 = bitcast float %65 to i32 %67 = bitcast i32 %66 to float %68 = call float @llvm.fabs.f32(float %67) #2 %69 = bitcast float %68 to i32 %70 = bitcast i32 %59 to float %71 = call float @llvm.fabs.f32(float %70) #2 %72 = bitcast float %71 to i32 %73 = bitcast i32 %72 to float %74 = fmul float 0x3FA99999A0000000, %73 %75 = bitcast float %74 to i32 %76 = bitcast i32 %75 to float %77 = fadd float %76, 0x3FA99999A0000000 %78 = bitcast float %77 to i32 %79 = bitcast i32 %78 to float %80 = bitcast i32 %69 to float %81 = fcmp oge float %79, %80 %82 = select i1 %81, i32 -1, i32 0 %83 = and i32 %82, 1065353216 %84 = bitcast i32 %83 to float %85 = bitcast float %84 to i32 %86 = insertelement <4 x i32> undef, i32 %85, i32 0 %87 = insertelement <4 x i32> %86, i32 %85, i32 1 %88 = insertelement <4 x i32> %87, i32 %85, i32 2 %89 = insertelement <4 x i32> %88, i32 1065353216, i32 3 %90 = bitcast <4 x i32> %89 to <4 x float> %91 = extractelement <4 x float> %90, i32 0 store float %91, float addrspace(5)* %21 %92 = extractelement <4 x float> %90, i32 1 store float %92, float addrspace(5)* %20 %93 = extractelement <4 x float> %90, i32 2 store float %93, float addrspace(5)* %19 %94 = extractelement <4 x float> %90, i32 3 store float %94, float addrspace(5)* %18 %95 = load float, float addrspace(5)* %21 %96 = load float, float addrspace(5)* %20 %97 = load float, float addrspace(5)* %19 %98 = load float, float addrspace(5)* %18 %99 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %95, float %96) #2 %100 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %97, float %98) #2 %101 = bitcast <2 x half> %99 to <2 x i16> %102 = bitcast <2 x half> %100 to <2 x i16> call void @llvm.amdgcn.exp.compr.v2i16(i32 0, i32 5, <2 x i16> %101, <2 x i16> %102, i1 true, i1 true) #3 ret void } ; Function Attrs: nounwind readnone speculatable declare i8 addrspace(4)* @llvm.amdgcn.implicit.buffer.ptr() #1 ; Function Attrs: nounwind readnone speculatable declare float @llvm.amdgcn.interp.mov(i32, i32, i32, i32) #1 ; Function Attrs: nounwind readnone declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #2 ; Function Attrs: nounwind readnone speculatable declare float @llvm.fabs.f32(float) #1 ; Function Attrs: nounwind readnone speculatable declare <2 x half> @llvm.amdgcn.cvt.pkrtz(float, float) #1 ; Function Attrs: nounwind declare void @llvm.amdgcn.exp.compr.v2i16(i32 immarg, i32 immarg, <2 x i16>, <2 x i16>, i1 immarg, i1 immarg) #3 attributes #0 = { "amdgpu-32bit-address-high-bits"="0xffff8000" } attributes #1 = { nounwind readnone speculatable } attributes #2 = { nounwind readnone } attributes #3 = { nounwind } !0 = !{}
Can you have a look?
Thanks,
Samuel.
Is the regression a compiler-crash, program crash, miscompile, or performance loss?
Probably related to addrspace?
Hi,
I'm experiencing problems with this patch as well. I'm not 100% sure what the
problem is yet but I have a suspicion:
The comment for isKnownNonZero says:
/ Return true if the given value is known to be non-zero when defined.
[...]
/ For pointers, if the context instruction and dominator tree are
/ specified, perform context-sensitive analysis and return true if the
/ pointer couldn't possibly be null at the specified instruction.
The context-sensitive analysis is done by isKnownNonZeroFromDominatingCondition()
but since we now allow that to return true for non-pointers as well, we suddenly
can return true from isKnownNonZero() based on a context-sensitive analysis. Or?
And perhaps the users of isKnownNonZero() then assumes that
"Return true if the given value is known to be non-zero when defined."
still holds for non-pointers?
The problem I'm seeing is a miscompile for my out-of-tree target so it's quite nasty.
I'll try to reduce and come up with an x86 example.
This is as far as I've come:
With
opt -S -o - red.ll -simplifycfg -loop-unroll
I get
; ModuleID = 'red.ll' source_filename = "red.ll" declare void @foo(i16) define i16 @main() { br label %bb1 bb1: ; preds = %0 call void @foo(i16 0) call void @foo(i16 1000) ret i16 0 }
which I think is correct, and with
opt -S -o - red.ll -instcombine -simplifycfg -loop-unroll
I get
; ModuleID = 'red.ll' source_filename = "red.ll" declare void @foo(i16) define i16 @main() { br label %bb1 bb1: ; preds = %0 call void @foo(i16 0) call void @foo(i16 1) ret i16 0 }
which I think is incorrect.
And in both cases red.ll is
declare void @foo(i16) define i16 @main() { br label %bb1 bb1: ; preds = %bb9, %0 %i.3.0 = phi i16 [ 0, %0 ], [ %_tmp19, %bb9 ] %_tmp8 = icmp uge i16 %i.3.0, 1 br i1 %_tmp8, label %bb3, label %bb5 bb3: ; preds = %bb1 %_tmp11 = add i16 1, 1 %_tmp12 = icmp ult i16 %i.3.0, %_tmp11 br i1 %_tmp12, label %bb4, label %bb5 bb4: ; preds = %bb3 br label %bb6 bb5: ; preds = %bb3, %bb1 br label %bb6 bb6: ; preds = %bb5, %bb4 %tmp0.5.0 = phi i16 [ 1, %bb4 ], [ 0, %bb5 ] %_tmp14 = icmp ne i16 %tmp0.5.0, 0 br i1 %_tmp14, label %bb7, label %bb8 bb7: ; preds = %bb6 br label %bb9 bb8: ; preds = %bb6 br label %bb9 bb9: ; preds = %bb8, %bb7 %tmp1.6.0 = phi i16 [ 1000, %bb7 ], [ %i.3.0, %bb8 ] call void @foo(i16 %tmp1.6.0) %_tmp19 = add i16 %i.3.0, 1 %_tmp21 = icmp ult i16 %_tmp19, 2 br i1 %_tmp21, label %bb1, label %bb10 bb10: ; preds = %bb9 ret i16 0 }
If I revert this commit I get
call void @foo(i16 0) call void @foo(i16 1000)
in both cases.
It's a test failure, it doesn't crash, the mentioned tests just fail to pass with that change.
@uabelho Thanks for the reproducer. I think the problems is that when we thread a comparison over a phi in instsimplify, we retain the original context instruction, while (I think) we should be using the terminator of the incoming value as context.
I found this when running our own test suite for our out-of-tree target. Actually it was a testcase testing memset, but for some reason
it managed to trigger this problem.
Here's a slightly cleaned up version of the reproducer:
define void @test(i1* %p) { entry: br label %loop loop: ; preds = %common, %entry %i = phi i16 [ 0, %entry ], [ %i.inc, %common ] %is_zero = icmp eq i16 %i, 0 br i1 %is_zero, label %common, label %non_zero non_zero: ; preds = %loop %is_one = icmp eq i16 %i, 1 store i1 %is_one, i1* %p br label %common common: ; preds = %non_zero, %loop %i.inc = add i16 %i, 1 %loop_cond = icmp ult i16 %i.inc, 2 br i1 %loop_cond, label %loop, label %exit exit: ; preds = %common ret void }
The %is_one is incorrectly optimized to false. I've been trying to come up with some variation that makes it fail without this patch, but no luck so far (something assume() based should be possible in theory, but it's very hard to actually construct something.)
I accidentally did the same thing you did in D69571 while looking into https://bugs.llvm.org/show_bug.cgi?id=43833. Anyway, I took your reproducer, fixed the issue our patches exposed (I think), and put it up there :)