There are cases where Trunc's live bitwidth gets erroneously propagated to its operands that are Arguments or ConstantInts.
To be conservative, we need to take the size of the Argument or ConstantInt into consideration when computing the leader's demanded bits.
Event Timeline
First, let's take avoid-truncate-icmp-operands.ll as an example to explain this patch.
Before running the loop vectorize pass, the basic block if.then contains:

  %cmp.i = icmp ult i64 %e, %d
  %.sroa.speculated = select i1 %cmp.i, i64 %d, i64 %e
  %conv32 = trunc i64 %.sroa.speculated to i16

We compare %d and %e, select one of them based on the comparison, and finally truncate the result to 16 bits.
However, after the loop vectorize pass, the icmp's operands are wrongly truncated first:

  %3 = trunc <2 x i64> %broadcast.splat to <2 x i16>
  %4 = trunc <2 x i64> %broadcast.splat4 to <2 x i16>
  %5 = icmp ult <2 x i16> %3, %4
  %6 = trunc <2 x i64> %broadcast.splat4 to <2 x i16>
  %7 = trunc <2 x i64> %broadcast.splat to <2 x i16>
  %8 = select <2 x i1> %5, <2 x i16> %6, <2 x i16> %7
  %9 = zext <2 x i16> %8 to <2 x i64>
  %10 = trunc <2 x i64> %9 to <2 x i16>

That is, we only compare the lower 16 bits of %d and %e, which can give the wrong answer, because comparing the low bits does not produce the same result as comparing the full-width values.
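To make the miscompile concrete, here is a small standalone C++ illustration (the values are made up for illustration, not taken from the test): comparing only the low 16 bits of two 64-bit values can give the opposite answer from comparing the full values, so the icmp operands must not be truncated.

  #include <cstdint>
  #include <cstdio>

  int main() {
    uint64_t d = 0x1FFFF; // low 16 bits are 0xFFFF
    uint64_t e = 0x20000; // low 16 bits are 0x0000
    bool full = d < e;                          // true: 0x1FFFF < 0x20000
    bool truncated = (uint16_t)d < (uint16_t)e; // false: 0xFFFF < 0x0000
    std::printf("full: %d, truncated: %d\n", full, truncated);
    return 0;
  }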
The mechanism by which LLVM truncates and zexts an instruction's operands was introduced in https://reviews.llvm.org/rL250032. It first computes each instruction's minimal bitwidth (MinBWs) via computeMinimumValueSizes() and then runs truncateToMinimalBitwidths(). MinBWs comes from llvm::computeMinimumValueSizes(), which uses the DemandedBits analysis as one of its information sources.
I guess the root cause is that, during the demanded-bits computation, because the final result will be truncated to 16 bits (that is, %conv32 = trunc i64 %.sroa.speculated to i16), LLVM blindly says that the operands of the trunc instruction have the same live bits as the trunc itself. So the demanded bits of the original instruction %.sroa.speculated = select i1 %cmp.i, i64 %d, i64 %e are 16 bits too.
During truncateToMinimalBitwidths(), LLVM then truncates the operands first.
Similarly, in avoid-truncate-shift-operands.ll, before loop vectorization we do a shift operation and then truncate its result to 8 bits:

  %0 = lshr i32 %f, 18
  %conv7 = trunc i32 %0 to i8

The demanded bits say that we only need 8 bits of %0, so truncateToMinimalBitwidths() first truncates lshr's operands to 8 bits and then shifts, which produces a wrong result:

  %20 = trunc <2 x i32> %broadcast.splat7 to <2 x i8>
  %21 = lshr <2 x i8> %20, <i8 18, i8 18>
  %22 = zext <2 x i8> %21 to <2 x i32>
  %23 = trunc <2 x i32> %22 to <2 x i8>
To avoid wrongly truncating operands, I add a whitelist when computing the demanded bits of Instruction::Trunc's operands: we only reduce the demanded bits for instructions that are always truncation friendly, such as add or sub.
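For clarity, here is a rough sketch of what such a whitelist could look like (an assumption about the approach described above, not the actual patch; the helper name isTruncationFriendly is made up). The idea is to only propagate the narrow demanded bits through a trunc when the operand is an instruction whose low result bits depend only on the low bits of its inputs.

  #include "llvm/IR/Instruction.h"

  using namespace llvm;

  // Hypothetical helper: opcodes whose low result bits depend only on the
  // low bits of their operands, so truncating the operands is safe.
  static bool isTruncationFriendly(const Value *V) {
    const auto *I = dyn_cast<Instruction>(V);
    if (!I)
      return false; // Be conservative for Arguments and Constants.
    switch (I->getOpcode()) {
    case Instruction::Add:
    case Instruction::Sub:
    case Instruction::Mul:
    case Instruction::And:
    case Instruction::Or:
    case Instruction::Xor:
      return true;
    default:
      return false;
    }
  }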
If this is not a correctness (or precision) fix, but a profitability one, I would suggest addressing this elsewhere.
Okay, based on your description this *is* a correctness fix, but it's not applied to the right place. The result of the DemandedBits analysis is correct -- demanded bits of the trunc operand are the same as for the trunc result.
As far as I can tell, the actual bug is in computeMinimumValueSizes(). It "successfully" terminates walks at non-Instructions, but that's not right, as the demanded bits for the non-Instruction can be wider. In your avoid-truncate-shift-operands.ll example, note that in lshr i32 %f, 18, %f has demanded bits outside the low 16 bits, but this fact is lost, because %f is an Argument. I imagine the same thing happens with constants as well. The fact that Arguments are incorrectly truncated seems to be the commonality between all your tests.
As such, I believe a fix here needs to be in computeMinimumValueSizes() -- though possibly DemandedBits also needs to be extended to provide additional information.
A possible way to go about this is to extend the existing DemandedBits code for DeadUses, which currently stores whether certain Uses are completely dead. This could be changed to actually store the demanded bits per use, not just per instruction. This would allow you to inspect the demanded bits of arguments and constants as used in a specific instruction. We would have to evaluate the additional cost this would have, though.
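As an illustration of that per-use idea (the names PerUseDemandedBits and getDemandedBitsForUse are made up here, not existing DemandedBits API), the analysis could keep a map keyed on Use rather than on Instruction, defaulting to "all bits demanded" for uses it never narrowed:

  #include "llvm/ADT/APInt.h"
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/IR/Use.h"

  namespace {

  // Hypothetical per-use storage: unlike the current per-instruction DBits,
  // this can also answer queries for operands that are Arguments or Constants.
  struct PerUseDemandedBits {
    llvm::DenseMap<llvm::Use *, llvm::APInt> DemandedBitsPerUse;

    llvm::APInt getDemandedBitsForUse(llvm::Use &U, unsigned BitWidth) const {
      auto It = DemandedBitsPerUse.find(&U);
      return It != DemandedBitsPerUse.end()
                 ? It->second
                 : llvm::APInt::getAllOnesValue(BitWidth);
    }
  };

  } // namespace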
Thanks for reviewing.
I am still confused in some places. One is that DemandedBits.cpp seems to compute the demanded bits per instruction, that is, it also "successfully" terminates walks at non-Instructions.
I guess computing demanded bits per instruction is not compatible with the way truncateToMinimalBitwidths() truncates the operands of every instruction in MinBWs.
So if we want to keep truncateToMinimalBitwidths(), we may need to modify computeMinimumValueSizes() to take Arguments into consideration. Or, if we want to keep computing demanded bits per instruction, would it be a possible solution to be more conservative in truncateToMinimalBitwidths(), i.e. add some patterns and avoid truncating when they match?
Can you please start with filing a bug report, explaining the miscompile(?), with tests?
Yes, and I have taken avoid-truncate-icmp-operands.ll as an example to explain how the bug is generated.
The bug in https://bugs.llvm.org/show_bug.cgi?id=48583 is that when we compute 361804 >> 18 in a loop,
after loop vectorization we first truncate 361804 from 32 bits to 16 bits and then lshr by 18, which produces a wrong answer.
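A minimal C++ reproduction of that arithmetic (a sketch of the effect described in the bug report, not code from the patch): shifting right by 18 must be done at the original 32-bit width; truncating the operand to 16 bits first discards exactly the bits the shift needs.

  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t x = 361804;
    uint32_t correct = x >> 18;                   // 1
    uint32_t wrong = (uint32_t)(uint16_t)x >> 18; // 0: high bits already lost
    std::printf("correct: %u, wrong: %u\n", correct, wrong);
    return 0;
  }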
Hi, I updated a new patch. The current solution is that we take every non-instruction operand within the chain into consideration when computing the leader's demanded bits, so that we do not drop the information about Arguments. This is a conservative approach, and it works on all my related issues.
This causes a number of test failures:
Failed Tests (4):
  LLVM :: Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
  LLVM :: Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
  LLVM :: Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll
  LLVM :: Transforms/LoopVectorize/demanded-bits-of-pointer-instruction.ll
The last one is an assertion failure because you assume non-pointer types; the others are presumably optimization regressions. What you are effectively doing here is disabling the truncation optimization completely for any case where the instruction chain contains an argument or a constant somewhere. Especially the latter case is probably very common. So while this change does fix the miscompile, it effectively disables the optimization entirely, which is probably not desired.
A new patch has been updated. It adds a constraint that we only visit Arguments when traversing non-instruction values, and it takes care of the case where BitSize == 0. This patch passes all my cases and the cases you mentioned.
Another new patch has been updated that takes care of the cases where an Argument is incorrectly truncated by computeMinimumValueSizes(). Could you please review this patch?
llvm/lib/Analysis/VectorUtils.cpp:539
I think we can't entirely ignore non-arguments (aka constants) here. Take a look at this modified form of llvm/test/Transforms/LoopVectorize/avoid-truncate-icmp-operands.ll:

  ; RUN: opt -loop-vectorize -S < %s | FileCheck %s
  target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
  target triple = "aarch64-unknown-linux-gnu"

  @a = dso_local local_unnamed_addr global i64 0, align 8
  @b = dso_local local_unnamed_addr global i16 0, align 4

  define dso_local void @myFunc(i32 %d, i32 %e) {
  ; CHECK: pred.store.continue2:
  ; CHECK-NEXT: %{{[0-9]+}} = icmp ult <2 x i64> %broadcast.splat{{[0-9]*}}, %broadcast.splat{{[0-9]*}}
  for.body29.lr.ph:
    br label %for.body29

  for.cond25.for.cond.cleanup28_crit_edge:          ; preds = %for.inc
    ret void

  for.body29:                                       ; preds = %for.inc, %for.body29.lr.ph
    %n.078 = phi i16 [ undef, %for.body29.lr.ph ], [ %add34, %for.inc ]
    br i1 undef, label %for.inc, label %if.then

  if.then:                                          ; preds = %for.body29
    %conv31 = zext i8 undef to i64
    store i64 %conv31, i64* @a, align 8
    %cmp.i = icmp ult i32 %e, %d
    %.sroa.speculated = select i1 %cmp.i, i64 10000000000, i64 20000000000
    %conv32 = trunc i64 %.sroa.speculated to i16
    store i16 %conv32, i16* @b, align 4
    br label %for.inc

  for.inc:                                          ; preds = %if.then, %for.body29
    %add34 = add nsw i16 %n.078, 2
    %cmp27 = icmp slt i16 %add34, 16
    br i1 %cmp27, label %for.body29, label %for.cond25.for.cond.cleanup28_crit_edge, !llvm.loop !6
  }

  !6 = distinct !{!6, !7}
  !7 = !{!"llvm.loop.vectorize.enable", i1 true}

This gets vectorized with i32 width, but the two select constants are larger than 32-bit and get truncated. I think the logic here should be: For ConstantInt, take the getActiveBits() of the constant value, for everything else (including Arguments) take the getTypeSizeInBits (what you're effectively doing now).
llvm/lib/Analysis/VectorUtils.cpp:541
You should use DL.getTypeSizeInBits(Val->getType()) here, which will remove the need for the == 0 check.
llvm/lib/Analysis/VectorUtils.cpp:539
Thanks a lot for reviewing this patch. A new patch has been updated, which takes the ConstantInt case into consideration, too.
llvm/lib/Analysis/VectorUtils.cpp:541
Ok, and I use Argument->getParent()->getParent()->getDataLayout() to get the DataLayout.
llvm/lib/Analysis/VectorUtils.cpp:539
While the %d and %e arguments used in the icmp are 32-bit wide, the two constants that the select chooses between are larger than 32-bit. Vectorizing with 32-bit elements will also truncate the select constant and lose the top bits. Is this clearer? By the way, it would be good if you could also add the above test case with a check that it uses i64 vectorization, not i32.
I'm not sure to what degree these are really relevant in this code (as demanded bits only works on integer instructions), but generally yes, they should be taken into account. I would suggest to write the code something like this, so only ConstantInt has special treatment (to avoid regressions) while everything else is handled as "else":

  unsigned BitSize;
  if (const auto *CI = dyn_cast<ConstantInt>(Val)) {
    BitSize = CI->getValue().getActiveBits();
  } else {
    DataLayout DL = ...;
    BitSize = DL.getTypeSizeInBits(Val->getType());
  }
  // Bail out for same reason below.
  if (BitSize > 64)
    return MapVector<Instruction *, uint64_t>();
  uint64_t V = APInt::getAllOnesValue(BitSize).getZExtValue();
  DBits[Leader] |= V;
  LLVM_DEBUG(dbgs() << "\t Values's dbits " << Twine::utohexstr(V) << '\n');

There's just one problem there: Getting the DataLayout. I don't see any obvious way to get it, so it may be necessary to pass it as an argument to computeMinimumValueSizes().
llvm/lib/Analysis/VectorUtils.cpp:539
Hi, thanks a lot for your review and suggestion. Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll fails if we handle everything else as "else". So maybe we can only take Argument and ConstantInt into consideration, as a conservative approach.
llvm/lib/Analysis/VectorUtils.cpp:539
Sorry, it seems the failure of Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll is not caused by handling everything else as "else". I am figuring it out; please wait.
llvm/lib/Analysis/VectorUtils.cpp:539
Hi, nikic! I updated a new patch. For Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll, there is a negative constant integer (-28), so I use getMinSignedBits() for negative constants.
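For reference, a small sketch of the ConstantInt sizing described above (an assumption about the approach, not the verbatim patch; the helper name getConstantIntBitSize is made up): getActiveBits() works for non-negative constants, but for a negative constant such as -28 it reports the full type width, so getMinSignedBits() gives the tighter bound.

  #include "llvm/ADT/APInt.h"
  #include "llvm/IR/Constants.h"

  // Hypothetical helper computing the minimal number of bits needed to
  // represent a ConstantInt, handling negative values via getMinSignedBits().
  static unsigned getConstantIntBitSize(const llvm::ConstantInt *CI) {
    const llvm::APInt &C = CI->getValue();
    return C.isNegative() ? C.getMinSignedBits() : C.getActiveBits();
  }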