
[LoopVectorize] Take non-instruction's size into consideration when computing leader's demanded bits
Needs Review · Public

Authored by guopeilin on Mon, Dec 21, 12:10 AM.

Details

Summary

There are some cases where Trunc's live bitwidth gets erroneously propagated to its operands, which are linked to Arguments or ConstantInts.
To be conservative, we need to take the size of the Argument or ConstantInt into consideration when computing the leader's demanded bits.

Diff Detail

Event Timeline

guopeilin created this revision.Mon, Dec 21, 12:10 AM
guopeilin requested review of this revision.Mon, Dec 21, 12:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Dec 21, 12:10 AM

Please upload patches with full context.
What problem is this fixing?

nikic added a subscriber: nikic.Mon, Dec 21, 12:26 AM

Is this supposed to be a correctness fix, or a profitability heuristic?

guopeilin added a comment.EditedMon, Dec 21, 12:31 AM

Firstly, let's take avoid-truncate-icmp-operands.ll as an example to explain this patch.
Before running the loop vectorize pass, we can see that in the basic block if.then:

%cmp.i = icmp ult i64 %e, %d
%.sroa.speculated = select i1 %cmp.i, i64 %d, i64 %e
%conv32 = trunc i64 %.sroa.speculated to i16

we compare %d and %e, select one of them based on the comparison, and finally truncate the result to 16 bits.
However, after the loop vectorize pass, the icmp instruction's operands are wrongly truncated first:

%3 = trunc <2 x i64> %broadcast.splat to <2 x i16>
%4 = trunc <2 x i64> %broadcast.splat4 to <2 x i16>
%5 = icmp ult <2 x i16> %3, %4
%6 = trunc <2 x i64> %broadcast.splat4 to <2 x i16>
%7 = trunc <2 x i64> %broadcast.splat to <2 x i16>
%8 = select <2 x i1> %5, <2 x i16> %6, <2 x i16> %7
%9 = zext <2 x i16> %8 to <2 x i64>
%10 = trunc <2 x i64> %9 to <2 x i16>

That is, we only compare the lower 16 bits of %d and %e, which produces the wrong answer, because the low bits cannot stand in for a comparison of the full-width values.
The mechanism by which LLVM truncates and zero-extends an instruction's operands was introduced in https://reviews.llvm.org/rL250032: it first computes each instruction's minimal bitwidth (MinBWs) via computeMinimumValueSizes() and then applies truncateToMinimalBitwidths(). MinBWs comes from llvm::computeMinimumValueSizes(), which uses the DemandedBits analysis as one of its information sources.
I guess the root cause is that, during the demanded-bits computation, because the final result is truncated to 16 bits (%conv32 = trunc i64 %.sroa.speculated to i16), LLVM blindly assumes that the operands of the trunc instruction have the same live bits as the trunc itself. So the demanded bits of the original instruction %.sroa.speculated = select i1 %cmp.i, i64 %d, i64 %e are reported as 16 bits too.
And during truncateToMinimalBitwidths(), LLVM truncates the operands first.

Is this supposed to be a correctness fix, or a profitability heuristic?

I guess it is a profitability heuristic.

As for avoid-truncate-shift-operands.ll: before loop vectorization, we do a shift operation and then truncate its result to 8 bits. That is:

%0 = lshr i32 %f, 18
%conv7 = trunc i32 %0 to i8

And the demanded bits say that we only need 8 bits of %0, so truncateToMinimalBitwidths() first truncates lshr's operands to 8 bits and then shifts, which produces a wrong result:

%20 = trunc <2 x i32> %broadcast.splat7 to <2 x i8>
%21 = lshr <2 x i8> %20, <i8 18, i8 18>
%22 = zext <2 x i8> %21 to <2 x i32>
%23 = trunc <2 x i32> %22 to <2 x i8>

To avoid wrongly truncating operands, I added a whitelist for computing the demanded bits of a Trunc instruction's operands: we only reduce the demanded bits for instructions that are always truncation friendly, such as ADD or SUB.

lebedev.ri requested changes to this revision.Mon, Dec 21, 1:36 AM

Is this supposed to be a correctness fix, or a profitability heuristic?

I guess it is a profitability heuristic.

This revision now requires changes to proceed.Mon, Dec 21, 1:36 AM
guopeilin updated this revision to Diff 313049.Mon, Dec 21, 2:11 AM

If this is not a correctness (or precision) fix, but a profitability one,
I would suggest addressing this elsewhere.

nikic requested changes to this revision.Mon, Dec 21, 1:53 PM

Okay, based on your description this *is* a correctness fix, but it's not applied to the right place. The result of the DemandedBits analysis is correct -- demanded bits of the trunc operand are the same as for the trunc result.

As far as I can tell, the actual bug is in computeMinimumValueSizes(). It "successfully" terminates walks at non-Instructions, but that's not right, as the demanded bits for the non-Instruction can be wider. In your avoid-truncate-shift-operands.ll example, note that in lshr i32 %f, 18, %f has demanded bits outside the low 16 bits, but this fact is lost, because %f is an Argument. I imagine the same thing happens with constants as well. The fact that Arguments are incorrectly truncated seems to be the commonality between all your tests.

As such, I believe a fix here needs to be in computeMinimumValueSizes() -- though possibly DemandedBits also needs to be extended to provide additional information.

A possible way to go about this is to extend the existing DemandedBits code for DeadUses, which currently stores whether certain Uses are completely dead. This could be changed to actually store the demanded bits per-use, not just per-instruction. This would allow you to inspect the demanded bits of arguments and constants as used in a specific instruction. We would have to evaluate the additional cost this would have, though.

This revision now requires changes to proceed.Mon, Dec 21, 1:53 PM


Thanks for reviewing.
I am still confused in a couple of places. One is that DemandedBits.cpp seems to compute the demanded bits per-instruction; that is, it also "successfully" terminates its walks at non-Instructions.
I guess computing demanded bits per-instruction is not compatible with the way truncateToMinimalBitwidths() truncates the operands of every instruction in MinBWs.
So if we want to keep truncateToMinimalBitwidths(), we may need to modify computeMinimumValueSizes() to take Arguments into consideration. Or, if we still want to compute demanded bits per-instruction, would it be a possible solution to be more conservative in truncateToMinimalBitwidths(), i.e. add some patterns and avoid truncating when they match?

Can you please start with filing a bugreport, explaining the miscompile(?), with tests?


Yes, and I have taken avoid-truncate-icmp-operands.ll as an example to explain how the bug is produced.
The bug in https://bugs.llvm.org/show_bug.cgi?id=48583 is that when we compute 361804 >> 18 in a loop,
after loop vectorization we first truncate 361804 from 32 bits to 16 bits and then lshr by 18, which produces a wrong answer.

guopeilin retitled this revision from [DemandedBits] Add a whitelist when computing demanded bits of Trunc Instruction to [LoopVectorize] Take argument's size into consideration when computing leader's demanded bits.
guopeilin edited the summary of this revision. (Show Details)


Hi, I updated a new patch. The solution now is to take every non-instruction operand within the chain into consideration when computing the leader's demanded bits, so that we do not drop the information of the Arguments. This is a conservative approach, and it works on all my related issues.

Please upload patches with full context (-U9999)

guopeilin updated this revision to Diff 313720.Fri, Dec 25, 1:45 AM


Okay, I updated this patch with full context, please review.

nikic added a comment.Fri, Dec 25, 1:58 AM

This causes a number of test failures:

Failed Tests (4):
  LLVM :: Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
  LLVM :: Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
  LLVM :: Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll
  LLVM :: Transforms/LoopVectorize/demanded-bits-of-pointer-instruction.ll

Last one is an assertion failure because you assume non-pointer types, the others are presumably optimization regressions. What you are effectively doing here is to disable the truncation optimization completely for any case where the instruction chain contains an argument or a constant somewhere. Especially the latter case is probably very common. So while this change does fix the miscompile, it effectively disables the optimization entirely, which is probably not desired.

guopeilin updated this revision to Diff 313952.Tue, Dec 29, 2:48 AM


A new patch has been uploaded. It adds the constraint that we only visit Arguments when traversing non-instruction values, and it handles the case where BitSize == 0. This patch passes all my cases and the cases you mentioned.


A new patch has been uploaded that takes care of the cases where an Argument is incorrectly truncated during computeMinimumValueSizes(). Could you please review this patch?

nikic added inline comments.Thu, Dec 31, 8:00 AM
llvm/lib/Analysis/VectorUtils.cpp
539

I think we can't entirely ignore non-arguments (aka constants) here. Take a look at this modified form of llvm/test/Transforms/LoopVectorize/avoid-truncate-icmp-operands.ll:

; RUN: opt -loop-vectorize -S < %s | FileCheck %s
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

@a = dso_local local_unnamed_addr global i64 0, align 8
@b = dso_local local_unnamed_addr global i16 0, align 4

define dso_local void @myFunc(i32 %d, i32 %e) {
; CHECK:        pred.store.continue2:
; CHECK-NEXT:      %{{[0-9]+}} = icmp ult <2 x i64> %broadcast.splat{{[0-9]*}}, %broadcast.splat{{[0-9]*}}
for.body29.lr.ph:
  br label %for.body29

for.cond25.for.cond.cleanup28_crit_edge:          ; preds = %for.inc
  ret void

for.body29:                                       ; preds = %for.inc, %for.body29.lr.ph
  %n.078 = phi i16 [ undef, %for.body29.lr.ph ], [ %add34, %for.inc ]
  br i1 undef, label %for.inc, label %if.then

if.then:                                          ; preds = %for.body29
  %conv31 = zext i8 undef to i64
  store i64 %conv31, i64* @a, align 8
  %cmp.i = icmp ult i32 %e, %d
  %.sroa.speculated = select i1 %cmp.i, i64 10000000000, i64 20000000000
  %conv32 = trunc i64 %.sroa.speculated to i16
  store i16 %conv32, i16* @b, align 4
  br label %for.inc

for.inc:                                          ; preds = %if.then, %for.body29
  %add34 = add nsw i16 %n.078, 2
  %cmp27 = icmp slt i16 %add34, 16
  br i1 %cmp27, label %for.body29, label %for.cond25.for.cond.cleanup28_crit_edge, !llvm.loop !6
}

!6 = distinct !{!6, !7}
!7 = !{!"llvm.loop.vectorize.enable", i1 true}

This gets vectorized with i32 width, but the two select constants are larger than 32-bit and get truncated.

I think the logic here should be: For ConstantInt, take the getActiveBits() of the constant value, for everything else (including Arguments) take the getTypeSizeInBits (what you're effectively doing now).

541

You should use DL.getTypeSizeInBits(Val->getType()) here, which will remove the need for the == 0 check.

guopeilin updated this revision to Diff 314332.Mon, Jan 4, 1:18 AM
guopeilin retitled this revision from [LoopVectorize] Take argument's size into consideration when computing leader's demanded bits to [LoopVectorize] Take non-instruction's size into consideration when computing leader's demanded bits.
guopeilin edited the summary of this revision. (Show Details)
guopeilin added inline comments.Mon, Jan 4, 1:31 AM
llvm/lib/Analysis/VectorUtils.cpp
539

Thanks a lot for reviewing this patch. A new patch has been uploaded, which takes the ConstantInt case into consideration, too.
Something still confuses me. Firstly, I don't understand why the two select constants are larger than 32-bit: as far as I can see, both %d and %e are defined as i32 arguments, so what is the purpose of taking ConstantInt into consideration here?
Secondly, besides ConstantInt there are ConstantFP and other kinds of Constant; should we take them into consideration too?
I would greatly appreciate it if you could explain a bit more. Again, thanks a lot for reviewing.

guopeilin added inline comments.Mon, Jan 4, 1:32 AM
llvm/lib/Analysis/VectorUtils.cpp
541

Ok, and I use Argument->getParent()->getParent()->getDataLayout() to get the DataLayout.

nikic added inline comments.Tue, Jan 5, 1:50 PM
llvm/lib/Analysis/VectorUtils.cpp
539

And there still exists something that confuses me. Firstly, I don't know why the two select constants are larger than 32-bit? As far as I can see, both %d and %e are defined as an i32 argument, what is the purpose that we take ConstantInt into consideration here?

While the %d and %e arguments used in the icmp are 32-bit wide, the two constants that the select chooses between are larger than 32-bit. Vectorizing with 32-bit elements will also truncate the select constant and lose the top bits. Is this clearer?

By the way, it would be good if you could also add the above test case with a check that it uses i64 vectorization, not i32.

Secondly, for Constant, there are ConstantFP, ConstantInt and some other kind of constant, so shall we take them into consideration, too?
It would be a great appreciation if you could explain it a bit more. Again, thanks a lot for reviewing.

I'm not sure to what degree these are really relevant in this code (as demanded bits only works on integer instructions), but generally yes, they should be taken into account. I would suggest writing the code something like this, so that only ConstantInt has special treatment (to avoid regressions) while everything else is handled as "else":

unsigned BitSize;
if (const auto *CI = dyn_cast<ConstantInt>(Val)) {
  BitSize = CI->getValue().getActiveBits();
} else {
  DataLayout DL = ...;
  BitSize = DL.getTypeSizeInBits(Val->getType());
}

// Bail out for same reason below.
if (BitSize > 64)
  return MapVector<Instruction *, uint64_t>();
uint64_t V = APInt::getAllOnesValue(BitSize).getZExtValue();
DBits[Leader] |= V;
LLVM_DEBUG(dbgs() << "\t Values's dbits " << Twine::utohexstr(V)
                          << '\n');

There's just one problem there: Getting the DataLayout. I don't see any obvious way to get it, so it may be necessary to pass it as an argument to computeMinimumValueSizes().

guopeilin added inline comments.Wed, Jan 6, 6:08 PM
llvm/lib/Analysis/VectorUtils.cpp
539

Hi, thanks a lot for your review and suggestion.
As for the DataLayout: I found that a ConstantInt can be 0, in which case getActiveBits() returns 0. So we still need to check BitSize before initializing an APInt with it, which means the DL may not be so necessary here.
Also, I found that case

Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll

fails if we handle everything else as 'else'. So maybe we should only take Argument and ConstantInt into consideration, as a conservative approach.

guopeilin added inline comments.Wed, Jan 6, 6:24 PM
llvm/lib/Analysis/VectorUtils.cpp
539

Sorry, it seems that the failure of the test case Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll is not caused by handling everything else as 'else'; I am figuring it out. Please wait.

guopeilin updated this revision to Diff 315298.Thu, Jan 7, 10:18 PM
guopeilin added inline comments.Thu, Jan 7, 10:32 PM
llvm/lib/Analysis/VectorUtils.cpp
539

Hi, nikic! I have updated a new patch.
This patch fixes three things:
Firstly, a ConstantInt can be zero, which means getActiveBits() can return 0, so we still need to check BitSize == 0 even though we have DL.getTypeSizeInBits(Val->getType()). So I guess we don't need the DataLayout here.
Secondly, a ConstantInt can be negative. If we use getActiveBits(), i.e. BitWidth - countLeadingZeros(), we always get the most conservative bit count for a negative value, because its sign bit is 1. That is why I failed on the following test case when using getActiveBits() for all ConstantInts:

Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll

This case contains the negative constant int -28, so I use getMinSignedBits() for negative values.
Thirdly, for non-negative constant ints I still use getActiveBits() as before, and handle everything else as 'else' by using Val->getType()->getScalarSizeInBits().