This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll][NFC] Remove redundant canPeel check
ClosedPublic

Authored by mkazantsev on Mar 27 2018, 12:23 AM.

Details

Summary

We check canPeel twice: when evaluating the number of iterations to be peeled
and within the method peelLoop that performs peeling. This method is only
executed if the calculated peel count is positive. Thus, the check in peelLoop can
never fail. This patch replaces this check with an assert.

Diff Detail

Repository
rL LLVM

Event Timeline

mkazantsev created this revision.Mar 27 2018, 12:23 AM
fhahn accepted this revision.Mar 27 2018, 1:29 AM

LGTM. I think it is very reasonable to expect the caller to only pass in a peel count > 0 if we can acutally peel the loop. There is no need to check twice. Also, currently computePeelCount overrides the target specific settings, so I don't think this change would effect any downstream backends forcing a peel count.

This revision is now accepted and ready to land.Mar 27 2018, 1:29 AM

My belief is that all forcing of peel count should happen in computation of peel count and pay respect to peelability check it does.

This revision was automatically updated to reflect the committed changes.
iajbar reopened this revision.Mar 30 2018, 12:03 PM

Hello Max, This assert(canPeel(L)) causes several tests to fail when I set PeelCount in Hexagon target. Apparently there is another path that is taken that calls peelLoop() and does not go through computePeelCount(). So it looks like the second check for canPeel in needed. Here is a test case: opt -march=hexagon -O3 -S < test.ll

  • test.ll ---

target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"
target triple = "hexagon"

@g0 = external dso_local global i32, align 4

declare dso_local i64 @f0(i64, i32) #0

declare dso_local i32 @f1(i32, i32) #0

define dso_local void @f2(i16* %a0, i16* %a1, i16* %a2) #0 {
b0:

%v0 = alloca i16, align 2
%v1 = alloca [5 x i32], align 8
%v2 = alloca i32, align 4
store i16 0, i16* %v0, align 2, !tbaa !0
br label %b1

b1: ; preds = %b3, %b0

%v3 = load i16, i16* %v0, align 2, !tbaa !0
%v4 = sext i16 %v3 to i32
%v5 = icmp slt i32 %v4, 5
br i1 %v5, label %b2, label %b4

b2: ; preds = %b1

%v6 = call i32 @f1(i32 undef, i32 undef)
%v7 = load i32, i32* @g0, align 4, !tbaa !4
%v8 = icmp eq i32 %v7, 1
br i1 %v8, label %b4, label %b3

b3: ; preds = %b2

%v9 = getelementptr inbounds [5 x i32], [5 x i32]* %v1, i32 0, i32 undef
store i32 0, i32* %v9, align 4, !tbaa !6
%v10 = load i16, i16* %v0, align 2, !tbaa !0
%v11 = add i16 %v10, 1
store i16 %v11, i16* %v0, align 2, !tbaa !0
br label %b1

b4: ; preds = %b2, %b1

%v12 = getelementptr inbounds [5 x i32], [5 x i32]* %v1, i32 0, i32 undef
%v13 = load i32, i32* %v12, align 4, !tbaa !6
%v14 = call i64 @f0(i64 undef, i32 %v13)
unreachable

}

attributes #0 = { "target-cpu"="hexagonv60" "target-features"="+hvx-length64b,+hvxv60" }

!0 = !{!1, !1, i64 0}
!1 = !{!"short", !2, i64 0}
!2 = !{!"omnipotent char", !3, i64 0}
!3 = !{!"Simple C++ TBAA"}
!4 = !{!5, !5, i64 0}
!5 = !{!"int", !2, i64 0}
!6 = !{!7, !7, i64 0}

!7 = !{!"long", !2, i64 0}

This revision is now accepted and ready to land.Mar 30 2018, 12:03 PM

Hi @iajbar , I've run opt -march=hexagon -O3 -S < test.ll on your example and it worked fine for me. Here is the output I have:

; ModuleID = '<stdin>'
source_filename = "<stdin>"
target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"
target triple = "hexagon"

@g0 = external dso_local local_unnamed_addr global i32, align 4

declare dso_local i64 @f0(i64, i32) local_unnamed_addr #0

declare dso_local i32 @f1(i32, i32) local_unnamed_addr #0

; Function Attrs: noreturn
define dso_local void @f2(i16* nocapture readnone %a0, i16* nocapture readnone %a1, i16* nocapture readnone %a2) local_unnamed_addr #1 {
b0:
  %v6 = tail call i32 @f1(i32 undef, i32 undef)
  %v7 = load i32, i32* @g0, align 4, !tbaa !0
  %v8 = icmp eq i32 %v7, 1
  br i1 %v8, label %b4, label %b3

b3:                                               ; preds = %b0
  %v6.1 = tail call i32 @f1(i32 undef, i32 undef)
  %v7.1 = load i32, i32* @g0, align 4, !tbaa !0
  %v8.1 = icmp eq i32 %v7.1, 1
  br i1 %v8.1, label %b4, label %b3.1

b4:                                               ; preds = %b3.3, %b3.2, %b3.1, %b3, %b0
  %v14 = tail call i64 @f0(i64 undef, i32 0)
  unreachable

b3.1:                                             ; preds = %b3
  %v6.2 = tail call i32 @f1(i32 undef, i32 undef)
  %v7.2 = load i32, i32* @g0, align 4, !tbaa !0
  %v8.2 = icmp eq i32 %v7.2, 1
  br i1 %v8.2, label %b4, label %b3.2

b3.2:                                             ; preds = %b3.1
  %v6.3 = tail call i32 @f1(i32 undef, i32 undef)
  %v7.3 = load i32, i32* @g0, align 4, !tbaa !0
  %v8.3 = icmp eq i32 %v7.3, 1
  br i1 %v8.3, label %b4, label %b3.3

b3.3:                                             ; preds = %b3.2
  %v6.4 = tail call i32 @f1(i32 undef, i32 undef)
  br label %b4
}

attributes #0 = { "target-cpu"="hexagonv60" "target-features"="+hvx-length64b,+hvxv60" }
attributes #1 = { noreturn "target-cpu"="hexagonv60" "target-features"="+hvx-length64b,+hvxv60" }

!0 = !{!1, !1, i64 0}
!1 = !{!"int", !2, i64 0}
!2 = !{!"omnipotent char", !3, i64 0}
!3 = !{!"Simple C++ TBAA"}

Could you please take look into the stack trace of the failure that you are observing? Maybe peelLoop was called from some downstream code, in this case you need to just add a canPeel check right before it since the behavior has changed. If it is in upstream code, please double-check that you've given me the right run command and provide stack traces.

Thanks,
Max

iajbar added a comment.Apr 2 2018, 5:02 AM

Maybe it passes for you because my patch was reverted. When I compile opt with your patch and my patch, I see the error
https://reviews.llvm.org/D44919
https://reviews.llvm.org/D44880

Could you please try with both patches

opt -march=hexagon -O3 -S < test.ll

opt: /w/src/llvm.org/lib/Transforms/Utils/LoopUnrollPeel.cpp:482: bool llvm::peelLoop(llvm::Loop *, unsigned int, llvm::LoopInfo *, llvm::ScalarEvolution *, llvm::DominatorTree *, llvm::AssumptionCache *, bool): Assertion `canPeel(L) && "Attempt to peel a loop which is not peelable?"' failed.
#0 0x0000000001ed3814 PrintStackTraceSignalHandler(void*) (/w/bld/org/bin/opt+0x1ed3814)
#1 0x0000000001ed3a96 SignalHandler(int) (/w/bld/org/bin/opt+0x1ed3a96)
#2 0x00007f373ace8330 restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10330)
#3 0x00007f3739bdbc37 gsignal /build/eglibc-ripdx6/eglibc-2.19/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:56:0
#4 0x00007f3739bdf028 abort /build/eglibc-ripdx6/eglibc-2.19/stdlib/abort.c:91:0
#5 0x00007f3739bd4bf6
assert_fail_base /build/eglibc-ripdx6/eglibc-2.19/assert/assert.c:92:0
#6 0x00007f3739bd4ca2 (/lib/x86_64-linux-gnu/libc.so.6+0x2fca2)
#7 0x0000000001f50cec llvm::peelLoop(llvm::Loop*, unsigned int, llvm::LoopInfo*, llvm::ScalarEvolution*, llvm::DominatorTree*, llvm::AssumptionCache*, bool) (/w/bld/org/bin/opt+0x1f50cec)
#8 0x0000000001f45d2e llvm::UnrollLoop(llvm::Loop*, unsigned int, unsigned int, bool, bool, bool, bool, bool, unsigned int, unsigned int, bool, llvm::LoopInfo*, llvm::ScalarEvolution*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::OptimizationRemarkEmitter*, bool) (/w/bld/org/bin/opt+0x1f45d2e)
#9 0x0000000001d4779d tryToUnrollLoop(llvm::Loop*, llvm::DominatorTree&, llvm::LoopInfo*, llvm::ScalarEvolution&, llvm::TargetTransformInfo const&, llvm::AssumptionCache&, llvm::OptimizationRemarkEmitter&, bool, int, llvm::Optional<unsigned int>, llvm::Optional<unsigned int>, llvm::Optional<bool>, llvm::Optional<bool>, llvm::Optional<bool>, llvm::Optional<bool>) (/w/bld/org/bin/opt+0x1d4779d)
#10 0x0000000001d48cc3 (anonymous namespace)::LoopUnroll::runOnLoop(llvm::Loop*, llvm::LPPassManager&) (/w/bld/org/bin/opt+0x1d48cc3)
#11 0x00000000014126dc llvm::LPPassManager::runOnFunction(llvm::Function&) (/w/bld/org/bin/opt+0x14126dc)
#12 0x00000000019723af llvm::FPPassManager::runOnFunction(llvm::Function&) (/w/bld/org/bin/opt+0x19723af)
#13 0x000000000136f5ea (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (/w/bld/org/bin/opt+0x136f5ea)
#14 0x0000000001972bb2 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/w/bld/org/bin/opt+0x1972bb2)
#15 0x000000000077d18a main (/w/bld/org/bin/opt+0x77d18a)
#16 0x00007f3739bc6f45 __libc_start_main /build/eglibc-ripdx6/eglibc-2.19/csu/libc-start.c:321:0
#17 0x0000000000769b6f _start (/w/bld/org/bin/opt+0x769b6f)
Stack dump:
0. Program arguments: /w/bld/org/bin/opt -march=hexagon -O3 -S

  1. Running pass 'CallGraph Pass Manager' on module '<stdin>'.
  2. Running pass 'Loop Pass Manager' on function '@f2'
  3. Running pass 'Unroll loops' on basic block '%b2'

Aborted (core dumped)

Thank you.

Thanks for clarification @iajbar ! I see what happens. We enter computeUnrollCount but do not reach computePeelCount because we exit earlier (for example, deciding to make full unrolling). The further logic assumes that if UP.PeelCount is set, then we will do peeling and nothing else. In sense of that, your patch D44880 contains a bug that I haven't noticed on review.

Specifically, even without this patch, your patch will lead to following:

  1. UP.Count will be forcefully set to some value;
  2. Unroller may decide to make full unrolling;

2a) If the loop can be peeled, then it will first peel off 2 iterations and THEN make full unrolling of what remains, which makes no sense (yet it seems functionally correct, but it will still be waste of compile time).
2b) If the loop cannot be peeled, then it will be rejected by peelLoop and then passed for unrolling, or, with this patch applied, lead to assertion failure.

I don't think that peeling 2 iterations first and making full unrolling after is what you expect to do (yet I don't see any correctness problems about it).

I suggest to add canPeel check to your patch and re-land it to avoid this situation. In this case you will at least not crash on unpeelable loops and will not set peel count for it. I think it has some sense by itself, because setting peel count to unpeelable loops may be misleading.

But it will not solve the potential compile time issue with making peeling first and unrolling after, which exist in your patch independently of my patch.

If this unroll-after-peeling situation seems unimportant for you, please add canPeel check and re-land the patch (consider this version approved by me).
If it is important, then we should find another way to inform the peel pass that you want to peel off 2 iterations ONLY WHEN it actually decides to do peeling and not full/user-forced/pragma-forced unrolling.

I hope my evaluation was useful to you.

Good luck,
Max

Note that we might not reach computePeelCount not because of full unrolling, but also because of user-forced unroll count. For example, if I write a loop

for (int i = 0; i < 400; i++)

and then force its unrolling by 4, with your patch we will first peel off 2 iterations, then unroll 396 iterations by 4 and 2 in tail computation. I don't know what motivation stands behind D44880 in sense of performance situations, but at least in some cases your patch will cause harm. If you agree that it's a problem, then it should be reworked.

One more way to fix it is to zero UP.PeelCount before every return from computeUnrollCount that stands before computePeelCount. It will fix both compile time/unefficient code issue with your patch and the assertion failire.

iajbar added a comment.Apr 2 2018, 8:07 PM

Thank you very much Max for your review. It is very much appreciated.

mkazantsev closed this revision.Apr 2 2018, 9:56 PM

I'm closing it since the problem with D44880 is root caused and some ways to solve it are proposed.