This is an archive of the discontinued LLVM Phabricator instance.

[LV] Fix recording of BranchTakenCount for FoldTail
ClosedPublic

Authored by Ayal on Apr 24 2020, 5:59 PM.

Details

Summary

When folding tail, branch taken count is computed during initial VPlan execution
and recorded to be used by the compare computing the loop's mask. This recording
should directly set the State, instead of reusing Value2VPValue mapping which
serves original Values present prior to vectorization.
The branch taken count may be a constant Value, which may be used elsewhere in
the loop; trying to employ Value2VPValue for both leads to the issue reported in
https://reviews.llvm.org/D76992#inline-721028

Diff Detail

Event Timeline

Ayal created this revision.Apr 24 2020, 5:59 PM
Herald added a project: Restricted Project. · View Herald Transcript
fhahn accepted this revision.Apr 25 2020, 6:51 AM

LGTM, thanks!

I'll add a reduced version of the crashing case from D78847 when I recommit it.

This revision is now accepted and ready to land.Apr 25 2020, 6:51 AM
Ayal updated this revision to Diff 260168.Apr 26 2020, 8:42 AM

Added test. Rebased.

Ayal added a comment.Apr 26 2020, 8:47 AM

LGTM, thanks!

I'll add a reduced version of the crashing case from D78847 when I recommit it.

Thanks! Managed to devise a test independent of D76992 (proving it ain't the latter's fault ;-)

fhahn added a comment.Apr 26 2020, 9:10 AM

LGTM, thanks!

I'll add a reduced version of the crashing case from D78847 when I recommit it.

Thanks! Managed to devise a test independent of D76992 (proving it ain't the latter's fault ;-)

Great, that's even better!

This revision was automatically updated to reflect the committed changes.

Hello,

The new fix delivered in this patch has caused an assert failure with a testcase having a loop with a very small trip count going through fold tail by masking.
The reduced testcase is as follows. The options are: -loop-vectorize -force-vector-interleave=4

For your reference, at the time of this writing, my HEAD is at commit 2d3f5a62de8e5d2cc25aaa49d0a00d31ed32544a

$ cat simple.ll
define void @foo() {
entry:
  br label %for.body

for.cond.cleanup:
  ret void

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 15
  br i1 %exitcond, label %for.cond.cleanup, label %for.body
}
$ opt -loop-vectorize -force-vector-interleave=4 -S ./simple.ll

opt: llvm-project/llvm/include/llvm/IR/Instructions.h:1140: void llvm::ICmpInst::AssertOK(): Assertion `getOperand(0)->getType() == getOperand(1)->getType() && "Both operands to ICmp instruction are not of the same type!"' failed.

Stack dump:
0.      Program arguments: build/bin/opt -loop-vectorize -force-vector-interleave=4 -S -debug-only=loop-vectorize simple.ll
1.      Running pass 'Function Pass Manager' on module 'simple.ll'.
2.      Running pass 'Loop Vectorization' on function '@foo'
 #0 0x00007c2509ec9bf4 PrintStackTraceSignalHandler(void*) (build/bin/../lib/libLLVMSupport.so.11git+0x1e9bf4)
 #1 0x00007c2509ec6b98 llvm::sys::RunSignalHandlers() (build/bin/../lib/libLLVMSupport.so.11git+0x1e6b98)
 #2 0x00007c2509ec9f04 SignalHandler(int) (build/bin/../lib/libLLVMSupport.so.11git+0x1e9f04)
 #3 0x00007c250ff004d8 (linux-vdso64.so.1+0x4d8)
 #4 0x00007c25090de98c __libc_signal_restore_set /build/glibc-uvws04/glibc-2.27/signal/../sysdeps/unix/sysv/linux/nptl-signals.h:80:0
 #5 0x00007c25090de98c raise /build/glibc-uvws04/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:48:0
 #6 0x00007c25090e0be0 abort /build/glibc-uvws04/glibc-2.27/stdlib/abort.c:79:0
 #7 0x00007c25090cbb38 __assert_fail_base /build/glibc-uvws04/glibc-2.27/assert/assert.c:92:0
 #8 0x00007c25090cbbe4 __assert_fail /build/glibc-uvws04/glibc-2.27/assert/assert.c:101:0
 #9 0x00007c25098fe518 llvm::ICmpInst::AssertOK() (build/bin/../lib/libLLVMVectorize.so.11git+0x7e518)
#10 0x00007c25098cc31c llvm::IRBuilderBase::CreateICmp(llvm::CmpInst::Predicate, llvm::Value*, llvm::Value*, llvm::Twine const&) (build/bin/../lib/libLLVMVectorize.so.11git+0x4c31c)
#11 0x00007c250996130c llvm::VPInstruction::generateInstruction(llvm::VPTransformState&, unsigned int) (build/bin/../lib/libLLVMVectorize.so.11git+0xe130c)
#12 0x00007c2509961810 llvm::VPInstruction::execute(llvm::VPTransformState&) (build/bin/../lib/libLLVMVectorize.so.11git+0xe1810)
#13 0x00007c25099604bc llvm::VPBasicBlock::execute(llvm::VPTransformState*) (build/bin/../lib/libLLVMVectorize.so.11git+0xe04bc)
#14 0x00007c25099620a8 llvm::VPlan::execute(llvm::VPTransformState*) (build/bin/../lib/libLLVMVectorize.so.11git+0xe20a8)
#15 0x00007c25098eb368 llvm::LoopVectorizationPlanner::executePlan(llvm::InnerLoopVectorizer&, llvm::DominatorTree*) (build/bin/../lib/libLLVMVectorize.so.11git+0x6b368)
#16 0x00007c25098f7380 llvm::LoopVectorizePass::processLoop(llvm::Loop*) (build/bin/../lib/libLLVMVectorize.so.11git+0x77380)
#17 0x00007c25098f8b74 llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo&, llvm::TargetLibraryInfo*, llvm::DemandedBits&, llvm::AAResults&, llvm::AssumptionCache&, std::function<llvm::LoopAccessInfo const& (llvm::Loop&)>&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo*) (build/bin/../lib/libLLVMVectorize.so.11git+0x78b74)
#18 0x00007c2509903320 (anonymous namespace)::LoopVectorize::runOnFunction(llvm::Function&) (build/bin/../lib/libLLVMVectorize.so.11git+0x83320)
#19 0x00007c250af0069c llvm::FPPassManager::runOnFunction(llvm::Function&) (build/bin/../lib/libLLVMCore.so.11git+0x27069c)
#20 0x00007c250af00b00 llvm::FPPassManager::runOnModule(llvm::Module&) (build/bin/../lib/libLLVMCore.so.11git+0x270b00)
#21 0x00007c250af012b4 llvm::legacy::PassManagerImpl::run(llvm::Module&) (build/bin/../lib/libLLVMCore.so.11git+0x2712b4)
#22 0x00007c250af0195c llvm::legacy::PassManager::run(llvm::Module&) (build/bin/../lib/libLLVMCore.so.11git+0x27195c)
#23 0x0000000010031aec main (build/bin/opt+0x10031aec)
#24 0x00007c25090b441c generic_start_main /build/glibc-uvws04/glibc-2.27/csu/../csu/libc-start.c:310:0
#25 0x00007c25090b4618 __libc_start_main /build/glibc-uvws04/glibc-2.27/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116:0

It looks like setting a vector as BackedgeTakenCount has caused the type mismatch when generating the icmp instructions

...
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %induction = add i64 %index, 0
  %induction1 = add i64 %index, 1
  %induction2 = add i64 %index, 2
  %induction3 = add i64 %index, 3
  %0 = icmp ule i64 %induction, 14   <======== assert when generating this line

In this example, the operand[0] (%induction) correctly has type i64, but the loop bound (14) is of vector type <1 x i64>

There might be multiple ways to address this assert failure. I list below a few simple ones for your reference: they might or might not be a good solution at all.

  1. Option 1: Not to generate the icmp instructions for %induction. In the particular case of this testcase, these instructions seem to be redundant.
  2. Option 2: If we are to generate the icmp instructions above, can we set the BackedgeTakenCount to the State depending on the type of the first operand? In cases like this one when the first operand is not a vector type, using Value *TCMO instead of Value *VTCMO might be an option.

I will open a Bugzzila and copy its link to this page when my password reset goes through.

Thanks,
Anh

Ayal added a comment.May 10 2020, 4:37 PM

Hello,

[snip]

In this example, the operand[0] (%induction) correctly has type i64, but the loop bound (14) is of vector type <1 x i64>

There might be multiple ways to address this assert failure. I list below a few simple ones for your reference: they might or might not be a good solution at all.

  1. Option 1: Not to generate the icmp instructions for %induction. In the particular case of this testcase, these instructions seem to be redundant.
  2. Option 2: If we are to generate the icmp instructions above, can we set the BackedgeTakenCount to the State depending on the type of the first operand? In cases like this one when the first operand is not a vector type, using Value *TCMO instead of Value *VTCMO might be an option.

I will open a Bugzzila and copy its link to this page when my password reset goes through.

Thanks,
Anh

Yes, thanks for catching this!
One quick fix is indeed to set VTCMO to TCMO when State->VF == 1, instead of "splatting" it into a vector of a single element.
Thinking if fold-tail-by-masking should be restricted to work for VF>1 only, given that only vectors (loads/stores) get masked.

Hello,

[snip]

In this example, the operand[0] (%induction) correctly has type i64, but the loop bound (14) is of vector type <1 x i64>

There might be multiple ways to address this assert failure. I list below a few simple ones for your reference: they might or might not be a good solution at all.

  1. Option 1: Not to generate the icmp instructions for %induction. In the particular case of this testcase, these instructions seem to be redundant.
  2. Option 2: If we are to generate the icmp instructions above, can we set the BackedgeTakenCount to the State depending on the type of the first operand? In cases like this one when the first operand is not a vector type, using Value *TCMO instead of Value *VTCMO might be an option.

I will open a Bugzzila and copy its link to this page when my password reset goes through.

Thanks,
Anh

Yes, thanks for catching this!
One quick fix is indeed to set VTCMO to TCMO when State->VF == 1, instead of "splatting" it into a vector of a single element.
Thinking if fold-tail-by-masking should be restricted to work for VF>1 only, given that only vectors (loads/stores) get masked.

I also came up (and gave up) that fix last week, because it would not work for a loop whose VF is 1, but the loop bound is a vector. I will come up with an example shortly to demonstrate my thought.

Hello,

[snip]

In this example, the operand[0] (%induction) correctly has type i64, but the loop bound (14) is of vector type <1 x i64>

There might be multiple ways to address this assert failure. I list below a few simple ones for your reference: they might or might not be a good solution at all.

  1. Option 1: Not to generate the icmp instructions for %induction. In the particular case of this testcase, these instructions seem to be redundant.
  2. Option 2: If we are to generate the icmp instructions above, can we set the BackedgeTakenCount to the State depending on the type of the first operand? In cases like this one when the first operand is not a vector type, using Value *TCMO instead of Value *VTCMO might be an option.

I will open a Bugzzila and copy its link to this page when my password reset goes through.

Thanks,
Anh

Yes, thanks for catching this!
One quick fix is indeed to set VTCMO to TCMO when State->VF == 1, instead of "splatting" it into a vector of a single element.
Thinking if fold-tail-by-masking should be restricted to work for VF>1 only, given that only vectors (loads/stores) get masked.

I also came up (and gave up) that fix last week, because it would not work for a loop whose VF is 1, but the loop bound is a vector. I will come up with an example shortly to demonstrate my thought.

Below is an example to demonstrate that setting VTCMO to TCMO when State->VF == 1 will not help in the case of a loop of VF 1 having a vector loop-bound.
In my example below, I add some profile meta data to guide the optimization selection process.
As you can tell it from the meta data, the numeric values were pretty much randomly selected as 10, 20, 30, and so on.
Basically, almost any arbitrary data will work as long as it is meaningful.

$ cat ./simple2.ll

define void @foo() !prof !12 {
entry:
  br label %for.body

for.cond.cleanup:
  ret void

for.body:
  %addr = phi double* [ %ptr, %for.body ], [ undef, %entry ]
  %ptr = getelementptr inbounds double, double* %addr, i64 1
  %cond = icmp eq double* %ptr, undef
  br i1 %cond, label %for.cond.cleanup, label %for.body
}

!llvm.module.flags = !{!0}

!0 = !{i32 1, !"ProfileSummary", !1}
!1 = !{!2, !3, !4, !5, !6, !7, !8, !9}
!2 = !{!"ProfileFormat", !"InstrProf"}
!3 = !{!"TotalCount", i64 10}
!4 = !{!"MaxCount", i64 20}
!5 = !{!"MaxInternalCount", i64 30}
!6 = !{!"MaxFunctionCount", i64 40}
!7 = !{!"NumCounts", i64 50}
!8 = !{!"NumFunctions", i64 60}
!9 = !{!"DetailedSummary", !10}
!10 = !{!11}
!11 = !{i32 999999, i64 70, i32 80}
!12 = !{!"function_entry_count", i64 0}

Again, I will use the same option -loop-vectorize -force-vector-interleave=4

$ opt -loop-vectorize -force-vector-interleave=4 -S ./simple2.ll

opt: llvm-project/llvm/include/llvm/IR/Instructions.h:1144: void llvm::ICmpInst::AssertOK(): Assertion `getOperand(0)->getType() == getOperand(1)->getType() && "Both operands to ICmp instruction are not of the same type!"' failed.

Stack dump:
0.      Program arguments: build/bin/opt -loop-vectorize -force-vector-interleave=4 -S -debug-only=loop-vectorize simple2.ll
1.      Running pass 'Function Pass Manager' on module 'simple2.ll'.
2.      Running pass 'Loop Vectorization' on function '@foo'
 #0 0x000070cd938b9bf4 PrintStackTraceSignalHandler(void*) (build/bin/../lib/libLLVMSupport.so.11git+0x1e9bf4)
 #1 0x000070cd938b6b98 llvm::sys::RunSignalHandlers() (build/bin/../lib/libLLVMSupport.so.11git+0x1e6b98)
 #2 0x000070cd938b9f04 SignalHandler(int) (build/bin/../lib/libLLVMSupport.so.11git+0x1e9f04)
 #3 0x000070cd998f04d8 (linux-vdso64.so.1+0x4d8)
 #4 0x000070cd92ace98c __libc_signal_restore_set /build/glibc-uvws04/glibc-2.27/signal/../sysdeps/unix/sysv/linux/nptl-signals.h:80:0
 #5 0x000070cd92ace98c raise /build/glibc-uvws04/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:48:0
 #6 0x000070cd92ad0be0 abort /build/glibc-uvws04/glibc-2.27/stdlib/abort.c:79:0
 #7 0x000070cd92abbb38 __assert_fail_base /build/glibc-uvws04/glibc-2.27/assert/assert.c:92:0
 #8 0x000070cd92abbbe4 __assert_fail /build/glibc-uvws04/glibc-2.27/assert/assert.c:101:0
 #9 0x000070cd932ee5bc llvm::ICmpInst::AssertOK() (build/bin/../lib/libLLVMVectorize.so.11git+0x7e5bc)
#10 0x000070cd932bc37c llvm::IRBuilderBase::CreateICmp(llvm::CmpInst::Predicate, llvm::Value*, llvm::Value*, llvm::Twine const&) (build/bin/../lib/libLLVMVectorize.so.11git+0x4c37c)
#11 0x000070cd933513cc llvm::VPInstruction::generateInstruction(llvm::VPTransformState&, unsigned int) (build/bin/../lib/libLLVMVectorize.so.11git+0xe13cc)
#12 0x000070cd933518d0 llvm::VPInstruction::execute(llvm::VPTransformState&) (build/bin/../lib/libLLVMVectorize.so.11git+0xe18d0)
#13 0x000070cd9335057c llvm::VPBasicBlock::execute(llvm::VPTransformState*) (build/bin/../lib/libLLVMVectorize.so.11git+0xe057c)
#14 0x000070cd93352188 llvm::VPlan::execute(llvm::VPTransformState*) (build/bin/../lib/libLLVMVectorize.so.11git+0xe2188)
#15 0x000070cd932db3c8 llvm::LoopVectorizationPlanner::executePlan(llvm::InnerLoopVectorizer&, llvm::DominatorTree*) (build/bin/../lib/libLLVMVectorize.so.11git+0x6b3c8)
#16 0x000070cd932e73e0 llvm::LoopVectorizePass::processLoop(llvm::Loop*) (build/bin/../lib/libLLVMVectorize.so.11git+0x773e0)
#17 0x000070cd932e8bd4 llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo&, llvm::TargetLibraryInfo*, llvm::DemandedBits&, llvm::AAResults&, llvm::AssumptionCache&, std::function<llvm::LoopAccessInfo const& (llvm::Loop&)>&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo*) (build/bin/../lib/libLLVMVectorize.so.11git+0x78bd4)
#18 0x000070cd932f33e0 (anonymous namespace)::LoopVectorize::runOnFunction(llvm::Function&) (build/bin/../lib/libLLVMVectorize.so.11git+0x833e0)
#19 0x000070cd948f073c llvm::FPPassManager::runOnFunction(llvm::Function&) (build/bin/../lib/libLLVMCore.so.11git+0x27073c)
#20 0x000070cd948f0ba0 llvm::FPPassManager::runOnModule(llvm::Module&) (build/bin/../lib/libLLVMCore.so.11git+0x270ba0)
#21 0x000070cd948f1354 llvm::legacy::PassManagerImpl::run(llvm::Module&) (build/bin/../lib/libLLVMCore.so.11git+0x271354)
#22 0x000070cd948f19fc llvm::legacy::PassManager::run(llvm::Module&) (build/bin/../lib/libLLVMCore.so.11git+0x2719fc)
#23 0x0000000010031aec main (build/bin/opt+0x10031aec)
#24 0x000070cd92aa441c generic_start_main /build/glibc-uvws04/glibc-2.27/csu/../csu/libc-start.c:310:0
#25 0x000070cd92aa4618 __libc_start_main /build/glibc-uvws04/glibc-2.27/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:116:0

When setting BackedgeTakenCount with TCMO if State->VF == 1, the mismatch will occur

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = add i64 %index, 0
  %next.gep = getelementptr double, double* undef, i64 %0
  %1 = add i64 %index, 1
  %next.gep1 = getelementptr double, double* undef, i64 %1
  %2 = add i64 %index, 2
  %next.gep2 = getelementptr double, double* undef, i64 %2
  %3 = add i64 %index, 3
  %next.gep3 = getelementptr double, double* undef, i64 %3
  %broadcast.splatinsert = insertelement <1 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <1 x i64> %broadcast.splatinsert, <1 x i64> undef, <1 x i32> zeroinitializer
  %vec.iv = add <1 x i64> %broadcast.splat, zeroinitializer
  %vec.iv4 = add <1 x i64> %broadcast.splat, <i64 1>
  %vec.iv5 = add <1 x i64> %broadcast.splat, <i64 2>
  %vec.iv6 = add <1 x i64> %broadcast.splat, <i64 3>
  %4 = icmp ule <1 x i64> %vec.iv, <i64 2305843009213693951>  <======== assert when generating this line

In this case, the operand[0] (which is %vec.iv) has type <1 x i64>. The loop-bound, however, will get the type as i64 instead of the expected <i64 2305843009213693951> .

Ayal added a comment.May 11 2020, 12:49 AM

[snip]

Below is an example to demonstrate that setting VTCMO to TCMO when State->VF == 1 will not help in the case of a loop of VF 1 having a vector loop-bound.

[snip]

In this case, the operand[0] (which is %vec.iv) has type <1 x i64>. The loop-bound, however, will get the type as **i64** instead of the expected **<i64 2305843009213693951>** .

Right, VPWidenCanonicalIVRecipe::execute() also needs to treat VF==1 differently.

[snip]

Below is an example to demonstrate that setting VTCMO to TCMO when State->VF == 1 will not help in the case of a loop of VF 1 having a vector loop-bound.

[snip]

In this case, the operand[0] (which is %vec.iv) has type <1 x i64>. The loop-bound, however, will get the type as **i64** instead of the expected **<i64 2305843009213693951>** .

Right, VPWidenCanonicalIVRecipe::execute() also needs to treat VF==1 differently.

I looked at that, too. It still gives us the assert at a different location. We will need a little more work to do.

opt -loop-vectorize -force-vector-interleave=4 -S simple2.ll

opt: llvm-project/llvm/lib/IR/Instructions.cpp:2446: static llvm::BinaryOperator *llvm::BinaryOperator::Create(llvm::Instruction::BinaryOps, llvm::Value *, llvm::Value *, const llvm::Twine &, llvm::Instruction *): Assertion `S1->getType() == S2->getType() && "Cannot create binary operator with two operands of differing type!"' failed.
Ayal added a comment.May 12 2020, 4:27 AM

[snip]

Right, VPWidenCanonicalIVRecipe::execute() also needs to treat VF==1 differently.

I looked at that, too. It still gives us the assert at a different location. We will need a little more work to do.

The above implies setting both VStart to CanonicalIV instead of splatting, and VStep to ConstantInt::set(STy, Part) instead of ConstantVector::get(Indices), when VF==1. Would doing so pass all your tests?

Some of this issue stems from not using the overloaded getBroadcastInstrs().

The more general issues raised are whether to apply foldTail when VF==1 in the absence of masked scalar loads/stores, and/or whether to internally turn foldTail on for small loops (due to cost considerations) when the VF and/or UF are provided externally (bypassing their cost-based selection process).

[snip]

Right, VPWidenCanonicalIVRecipe::execute() also needs to treat VF==1 differently.

I looked at that, too. It still gives us the assert at a different location. We will need a little more work to do.

The above implies setting both VStart to CanonicalIV instead of splatting, and VStep to ConstantInt::set(STy, Part) instead of ConstantVector::get(Indices), when VF==1. Would doing so pass all your tests?

Some of this issue stems from not using the overloaded getBroadcastInstrs().

The more general issues raised are whether to apply foldTail when VF==1 in the absence of masked scalar loads/stores, and/or whether to internally turn foldTail on for small loops (due to cost considerations) when the VF and/or UF are provided externally (bypassing their cost-based selection process).

Ah, yes! That should work for me. Thanks!
My personal preference, but I think fold-tail-by-masking should be restricted for VF>1 only.

Based on discussion with Ayal, I created a quick patch to the problem of type-mismatch issue I reported above.
https://reviews.llvm.org/D79976