Page MenuHomePhabricator

[ValueTracking] Improve isKnowNonZero for Ints
Needs ReviewPublic

Authored by dlrobertson on Apr 17 2019, 6:02 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

dlrobertson created this revision.Apr 17 2019, 6:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 17 2019, 6:02 PM

Would an NFS baseline tests patch be useful here?

nikic added inline comments.Apr 18 2019, 12:47 AM
llvm/test/Transforms/InstCombine/cttz.ll
12 ↗(On Diff #195657)

Hm, it looks like the optimization we want doesn't actually happen here? The i1 false should have changed to an i1 true.

dlrobertson marked 2 inline comments as done.Apr 18 2019, 6:38 AM
dlrobertson added inline comments.
llvm/test/Transforms/InstCombine/cttz.ll
12 ↗(On Diff #195657)

Good catch. Just a cast of not updating the patch I uploaded.

dlrobertson marked an inline comment as done.

Update after added tests

nikic added a comment.Apr 22 2019, 3:33 AM

Looks good to me technically, I think it's only reasonable that integers and pointers are treated the same way here. Would like to have a second opinion though, maybe from @spatel?

It should be noted that the instsimplify cases here would also be simplified by either instcombine (presumably the dominating condition code there) or CVP. Is that possibly why this was not done before?

Can you push the baseline tests to trunk/head, so this diff is against an updated public baseline?
There's also a TODO comment/test about this enhancement in test/Transforms/LICM/hoist-mustexec.ll -- that should be updated as part of this patch.

About the relation to instcombine/CVP: there's an independent question (and unresolved in terms of implementation in LLVM) about which pass(es) have responsibility for these kinds of patterns. This patch makes valuetracking better with no obvious increase in complexity, so I think it's the right step. And if I'm seeing correctly, neither of those passes would get the last test with the extra blocks and phi without this patch.

spatel accepted this revision.Apr 28 2019, 6:55 AM

LGTM

This revision is now accepted and ready to land.Apr 28 2019, 6:55 AM
nikic added a comment.EditedApr 28 2019, 8:49 AM

test/Transforms/LICM/hoist-mustexec.ll still needs to be updated. (Should probably generate full checks as a preliminary NFC commit first.)

spatel requested changes to this revision.Apr 28 2019, 9:15 AM

test/Transforms/LICM/hoist-mustexec.ll still needs to be updated. (Should probably generate full checks as a preliminary NFC commit first.)

Oops - I forgot that even though I mentioned it in the earlier comment. Yes, that must be included in this patch (and any other existing tests that are known to be affected).

This revision now requires changes to proceed.Apr 28 2019, 9:15 AM

test/Transforms/LICM/hoist-mustexec.ll still needs to be updated. (Should probably generate full checks as a preliminary NFC commit first.)

Oops - I forgot that even though I mentioned it in the earlier comment. Yes, that must be included in this patch (and any other existing tests that are known to be affected).

Sorry, I forgot all about this as well. I did run the update_test_checks script on this file after applying the patch and there was not improvement to the generated code.

test/Transforms/LICM/hoist-mustexec.ll still needs to be updated. (Should probably generate full checks as a preliminary NFC commit first.)

Oops - I forgot that even though I mentioned it in the earlier comment. Yes, that must be included in this patch (and any other existing tests that are known to be affected).

Sorry, I forgot all about this as well. I did run the update_test_checks script on this file after applying the patch and there was not improvement to the generated code.

Are you sure? I didn't check what exactly changed, but this test was failing for me locally after applying the patch.

test/Transforms/LICM/hoist-mustexec.ll still needs to be updated. (Should probably generate full checks as a preliminary NFC commit first.)

Oops - I forgot that even though I mentioned it in the earlier comment. Yes, that must be included in this patch (and any other existing tests that are known to be affected).

Sorry, I forgot all about this as well. I did run the update_test_checks script on this file after applying the patch and there was not improvement to the generated code.

Are you sure? I didn't check what exactly changed, but this test was failing for me locally after applying the patch.

I'm probably doing something wrong. I'll run some tests and tinker with this later today.

Update LICM tests

spatel added a comment.May 3 2019, 7:17 AM

I can't tell what's actually happening in the LICM tests, so:
rL359881

Please rebase and update the patch here. Don't forget to update that TODO comment in the test file too; we should be set after that.

Updated the LICM test.

nikic accepted this revision.May 4 2019, 4:43 AM

LGTM

This revision was not accepted when it landed; it landed in state Needs Review.May 7 2019, 7:23 PM
Closed by commit rL360222: [ValueTracking] Improve isKnowNonZero for Ints (authored by dlrobertson, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.
hakzsam added a subscriber: hakzsam.May 8 2019, 1:41 AM

Hi folks,

This introduces a regression with RADV (the open source Vulkan driver in mesa) and some CTS tests, here's the list of failures:

  • dEQP-VK.glsl.functions.control_flow.mixed_return_break_continue_fragment
  • dEQP-VK.glsl.functions.control_flow.mixed_return_break_continue_vertex

Here's the LLVM IR generated by Mesa (before any optimizations passes):

; ModuleID = 'mesa-shader'
source_filename = "mesa-shader"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
target triple = "amdgcn-mesa-mesa3d"

define amdgpu_ps void @main([0 x i8] addrspace(6)* inreg noalias dereferenceable(18446744073709551615), i32 inreg, <2 x i32>, <2 x i32>, <2 x i32>, <3 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, float, float, float, float, float, i32, i32, i32, i32) #0 {
main_body:
  %temp7 = alloca float, addrspace(5)
  %temp6 = alloca float, addrspace(5)
  %temp5 = alloca float, addrspace(5)
  %temp4 = alloca float, addrspace(5)
  %temp3 = alloca float, addrspace(5)
  %temp2 = alloca float, addrspace(5)
  %temp1 = alloca float, addrspace(5)
  %temp = alloca float, addrspace(5)
  %18 = alloca float, addrspace(5)
  %19 = alloca float, addrspace(5)
  %20 = alloca float, addrspace(5)
  %21 = alloca float, addrspace(5)
  %22 = call i8 addrspace(4)* @llvm.amdgcn.implicit.buffer.ptr() #2
  %23 = bitcast i8 addrspace(4)* %22 to [0 x <4 x i32>] addrspace(4)*
  %24 = call float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %1) #2
  %25 = bitcast float %24 to i32
  %26 = call float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %1) #2
  %27 = bitcast float %26 to i32
  %28 = call float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %1) #2
  %29 = bitcast float %28 to i32
  %30 = call float @llvm.amdgcn.interp.mov(i32 2, i32 3, i32 0, i32 %1) #2
  %31 = bitcast float %30 to i32
  br label %loop1

loop1:                                            ; preds = %endif5, %main_body
  %32 = phi i32 [ %25, %main_body ], [ %49, %endif5 ]
  %33 = phi i32 [ 0, %main_body ], [ %50, %endif5 ]
  %34 = icmp sge i32 %33, 6
  %35 = select i1 %34, i32 -1, i32 0
  %36 = icmp ne i32 %35, 0
  br i1 %36, label %if2, label %else3

if2:                                              ; preds = %loop1
  br label %endloop1

else3:                                            ; preds = %loop1
  br label %endif2

endif2:                                           ; preds = %else3
  %37 = icmp ne i32 %33, 0
  %38 = select i1 %37, i32 -1, i32 0
  %39 = icmp ne i32 %38, 0
  br i1 %39, label %if5, label %else12

if5:                                              ; preds = %endif2
  %40 = icmp ne i32 %33, 1
  %41 = select i1 %40, i32 -1, i32 0
  %42 = icmp ne i32 %41, 0
  br i1 %42, label %if6, label %else10

if6:                                              ; preds = %if5
  %43 = icmp eq i32 %33, 3
  %44 = select i1 %43, i32 -1, i32 0
  %45 = icmp ne i32 %44, 0
  br i1 %45, label %if7, label %else8

if7:                                              ; preds = %if6
  br label %endloop1

else8:                                            ; preds = %if6
  br label %endloop1

endif7:                                           ; No predecessors!
  br label %endif6

else10:                                           ; preds = %if5
  br label %endif6

endif6:                                           ; preds = %else10, %endif7
  %46 = bitcast i32 %32 to float
  %47 = fsub float -0.000000e+00, %46
  %48 = bitcast float %47 to i32
  br label %endif5

else12:                                           ; preds = %endif2
  br label %endif5

endif5:                                           ; preds = %else12, %endif6
  %49 = phi i32 [ %32, %else12 ], [ %48, %endif6 ]
  %50 = add i32 %33, 1
  br label %loop1

endloop1:                                         ; preds = %else8, %if7, %if2
  %51 = phi i32 [ undef, %if2 ], [ undef, %if7 ], [ %32, %else8 ]
  %52 = phi i32 [ 0, %if2 ], [ 0, %if7 ], [ -1, %else8 ]
  %53 = icmp ne i32 %52, 0
  %54 = select i1 %53, i32 %51, i32 1065353216
  %55 = getelementptr [0 x i8], [0 x i8] addrspace(6)* %0, i32 0, i32 0
  %56 = bitcast i8 addrspace(6)* %55 to <4 x i32> addrspace(6)*, !amdgpu.uniform !0
  %57 = load <4 x i32>, <4 x i32> addrspace(6)* %56, !invariant.load !0
  %58 = call float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %57, i32 0, i32 0) #2
  %59 = bitcast float %58 to i32
  %60 = bitcast i32 %59 to float
  %61 = fsub float -0.000000e+00, %60
  %62 = bitcast float %61 to i32
  %63 = bitcast i32 %54 to float
  %64 = bitcast i32 %62 to float
  %65 = fadd float %63, %64
  %66 = bitcast float %65 to i32
  %67 = bitcast i32 %66 to float
  %68 = call float @llvm.fabs.f32(float %67) #2
  %69 = bitcast float %68 to i32
  %70 = bitcast i32 %59 to float
  %71 = call float @llvm.fabs.f32(float %70) #2
  %72 = bitcast float %71 to i32
  %73 = bitcast i32 %72 to float
  %74 = fmul float 0x3FA99999A0000000, %73
  %75 = bitcast float %74 to i32
  %76 = bitcast i32 %75 to float
  %77 = fadd float %76, 0x3FA99999A0000000
  %78 = bitcast float %77 to i32
  %79 = bitcast i32 %78 to float
  %80 = bitcast i32 %69 to float
  %81 = fcmp oge float %79, %80
  %82 = select i1 %81, i32 -1, i32 0
  %83 = and i32 %82, 1065353216
  %84 = bitcast i32 %83 to float
  %85 = bitcast float %84 to i32
  %86 = insertelement <4 x i32> undef, i32 %85, i32 0
  %87 = insertelement <4 x i32> %86, i32 %85, i32 1
  %88 = insertelement <4 x i32> %87, i32 %85, i32 2
  %89 = insertelement <4 x i32> %88, i32 1065353216, i32 3
  %90 = bitcast <4 x i32> %89 to <4 x float>
  %91 = extractelement <4 x float> %90, i32 0
  store float %91, float addrspace(5)* %21
  %92 = extractelement <4 x float> %90, i32 1
  store float %92, float addrspace(5)* %20
  %93 = extractelement <4 x float> %90, i32 2
  store float %93, float addrspace(5)* %19
  %94 = extractelement <4 x float> %90, i32 3
  store float %94, float addrspace(5)* %18
  %95 = load float, float addrspace(5)* %21
  %96 = load float, float addrspace(5)* %20
  %97 = load float, float addrspace(5)* %19
  %98 = load float, float addrspace(5)* %18
  %99 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %95, float %96) #2
  %100 = call <2 x half> @llvm.amdgcn.cvt.pkrtz(float %97, float %98) #2
  %101 = bitcast <2 x half> %99 to <2 x i16>
  %102 = bitcast <2 x half> %100 to <2 x i16>
  call void @llvm.amdgcn.exp.compr.v2i16(i32 0, i32 5, <2 x i16> %101, <2 x i16> %102, i1 true, i1 true) #3
  ret void
}

; Function Attrs: nounwind readnone speculatable
declare i8 addrspace(4)* @llvm.amdgcn.implicit.buffer.ptr() #1

; Function Attrs: nounwind readnone speculatable
declare float @llvm.amdgcn.interp.mov(i32, i32, i32, i32) #1

; Function Attrs: nounwind readnone
declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #2

; Function Attrs: nounwind readnone speculatable
declare float @llvm.fabs.f32(float) #1

; Function Attrs: nounwind readnone speculatable
declare <2 x half> @llvm.amdgcn.cvt.pkrtz(float, float) #1

; Function Attrs: nounwind
declare void @llvm.amdgcn.exp.compr.v2i16(i32 immarg, i32 immarg, <2 x i16>, <2 x i16>, i1 immarg, i1 immarg) #3

attributes #0 = { "amdgpu-32bit-address-high-bits"="0xffff8000" }
attributes #1 = { nounwind readnone speculatable }
attributes #2 = { nounwind readnone }
attributes #3 = { nounwind }

!0 = !{}

Can you have a look?
Thanks,
Samuel.

spatel added a comment.May 8 2019, 5:36 AM

Hi folks,

This introduces a regression with RADV (the open source Vulkan driver in mesa) and some CTS tests, here's the list of failures:

Is the regression a compiler-crash, program crash, miscompile, or performance loss?
Probably related to addrspace?

uabelho added a subscriber: uabelho.May 8 2019, 5:57 AM

Hi,

I'm experiencing problems with this patch as well. I'm not 100% sure what the
problem is yet but I have a suspicion:

The comment for isKnownNonZero says:

/ Return true if the given value is known to be non-zero when defined.
[...]
/ For pointers, if the context instruction and dominator tree are
/ specified, perform context-sensitive analysis and return true if the
/ pointer couldn't possibly be null at the specified instruction.

The context-sensitive analysis is done by isKnownNonZeroFromDominatingCondition()
but since we now allow that to return true for non-pointers as well, we suddenly
can return true from isKnownNonZero() based on a context-sensitive analysis. Or?

And perhaps the users of isKnownNonZero() then assumes that
"Return true if the given value is known to be non-zero when defined."
still holds for non-pointers?

The problem I'm seeing is a miscompile for my out-of-tree target so it's quite nasty.
I'll try to reduce and come up with an x86 example.

uabelho added a comment.EditedMay 8 2019, 6:49 AM

This is as far as I've come:

With

opt -S -o - red.ll -simplifycfg -loop-unroll

I get

; ModuleID = 'red.ll'
source_filename = "red.ll"

declare void @foo(i16)

define i16 @main() {
  br label %bb1

bb1:                                              ; preds = %0
  call void @foo(i16 0)
  call void @foo(i16 1000)
  ret i16 0
}

which I think is correct, and with

opt -S -o - red.ll -instcombine -simplifycfg -loop-unroll

I get

; ModuleID = 'red.ll'
source_filename = "red.ll"

declare void @foo(i16)

define i16 @main() {
  br label %bb1

bb1:                                              ; preds = %0
  call void @foo(i16 0)
  call void @foo(i16 1)
  ret i16 0
}

which I think is incorrect.

And in both cases red.ll is

declare void @foo(i16)

define i16 @main() {
  br label %bb1

bb1:                                              ; preds = %bb9, %0
  %i.3.0 = phi i16 [ 0, %0 ], [ %_tmp19, %bb9 ]
  %_tmp8 = icmp uge i16 %i.3.0, 1
  br i1 %_tmp8, label %bb3, label %bb5

bb3:                                              ; preds = %bb1
  %_tmp11 = add i16 1, 1
  %_tmp12 = icmp ult i16 %i.3.0, %_tmp11
  br i1 %_tmp12, label %bb4, label %bb5

bb4:                                              ; preds = %bb3
  br label %bb6

bb5:                                              ; preds = %bb3, %bb1
  br label %bb6

bb6:                                              ; preds = %bb5, %bb4
  %tmp0.5.0 = phi i16 [ 1, %bb4 ], [ 0, %bb5 ]
  %_tmp14 = icmp ne i16 %tmp0.5.0, 0
  br i1 %_tmp14, label %bb7, label %bb8

bb7:                                              ; preds = %bb6
  br label %bb9

bb8:                                              ; preds = %bb6
  br label %bb9

bb9:                                              ; preds = %bb8, %bb7
  %tmp1.6.0 = phi i16 [ 1000, %bb7 ], [ %i.3.0, %bb8 ]
  call void @foo(i16 %tmp1.6.0)
  %_tmp19 = add i16 %i.3.0, 1
  %_tmp21 = icmp ult i16 %_tmp19, 2
  br i1 %_tmp21, label %bb1, label %bb10

bb10:                                             ; preds = %bb9
  ret i16 0
}

If I revert this commit I get

call void @foo(i16 0)
call void @foo(i16 1000)

in both cases.

Hi folks,

This introduces a regression with RADV (the open source Vulkan driver in mesa) and some CTS tests, here's the list of failures:

Is the regression a compiler-crash, program crash, miscompile, or performance loss?
Probably related to addrspace?

It's a test failure, it doesn't crash, the mentioned tests just fail to pass with that change.

nikic added a comment.May 8 2019, 7:50 AM

I've reverted this change for now (rL360260).

nikic reopened this revision.May 8 2019, 7:51 AM
nikic added a comment.May 8 2019, 1:50 PM

@uabelho Thanks for the reproducer. I think the problems is that when we thread a comparison over a phi in instsimplify, we retain the original context instruction, while (I think) we should be using the terminator of the incoming value as context.

@uabelho great job coming up with a reproducer and thanks @nikic for reverting! I'll start working on a better solution.

I ran ninja check, but I did not hit any failures. Are there more architectures and/or tools I should enable and check before a commit?

@uabelho great job coming up with a reproducer and thanks @nikic for reverting! I'll start working on a better solution.

I ran ninja check, but I did not hit any failures. Are there more architectures and/or tools I should enable and check before a commit?

I found this when running our own test suite for our out-of-tree target. Actually it was a testcase testing memset, but for some reason
it managed to trigger this problem.

nikic added a comment.May 9 2019, 1:52 PM

Here's a slightly cleaned up version of the reproducer:

define void @test(i1* %p) {
entry:
  br label %loop

loop:                                             ; preds = %common, %entry
  %i = phi i16 [ 0, %entry ], [ %i.inc, %common ]
  %is_zero = icmp eq i16 %i, 0
  br i1 %is_zero, label %common, label %non_zero

non_zero:                                         ; preds = %loop
  %is_one = icmp eq i16 %i, 1
  store i1 %is_one, i1* %p
  br label %common

common:                                           ; preds = %non_zero, %loop
  %i.inc = add i16 %i, 1
  %loop_cond = icmp ult i16 %i.inc, 2
  br i1 %loop_cond, label %loop, label %exit

exit:                                             ; preds = %common
  ret void
}

The %is_one is incorrectly optimized to false. I've been trying to come up with some variation that makes it fail without this patch, but no luck so far (something assume() based should be possible in theory, but it's very hard to actually construct something.)

Here's a slightly cleaned up version of the reproducer

Thanks!