
[LoopVectorize] Take non-instruction's size into consideration when computing leader's demanded bits
Needs Review · Public

Authored by guopeilin on Mon, Dec 21, 12:10 AM.

Details

Summary

There are some cases where Trunc's live bitwidth gets erroneously propagated to its operands, which are linked to Arguments or ConstantInts.
To be conservative, we need to take the size of the Argument or ConstantInt into consideration when computing the leader's demanded bits.

Diff Detail

Event Timeline

guopeilin created this revision.Mon, Dec 21, 12:10 AM
guopeilin requested review of this revision.Mon, Dec 21, 12:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Dec 21, 12:10 AM

Please upload patches with full context.
What problem is this fixing?

nikic added a subscriber: nikic.Mon, Dec 21, 12:26 AM

Is this supposed to be a correctness fix, or a profitability heuristic?

guopeilin added a comment.EditedMon, Dec 21, 12:31 AM

Firstly, let's take avoid-truncate-icmp-operands.ll as an example to explain this patch.
Before running the loop vectorize pass, we can see that in the basic block if.then:

%cmp.i = icmp ult i64 %e, %d
%.sroa.speculated = select i1 %cmp.i, i64 %d, i64 %e
%conv32 = trunc i64 %.sroa.speculated to i16

we compare %d and %e, select one of them based on the comparison, and finally truncate the result to 16 bits.
However, after the loop vectorize pass, the icmp instruction's operands are wrongly truncated first:

%3 = trunc <2 x i64> %broadcast.splat to <2 x i16>
%4 = trunc <2 x i64> %broadcast.splat4 to <2 x i16>
%5 = icmp ult <2 x i16> %3, %4
%6 = trunc <2 x i64> %broadcast.splat4 to <2 x i16>
%7 = trunc <2 x i64> %broadcast.splat to <2 x i16>
%8 = select <2 x i1> %5, <2 x i16> %6, <2 x i16> %7
%9 = zext <2 x i16> %8 to <2 x i64>
%10 = trunc <2 x i64> %9 to <2 x i16>

That is, we only compare the lower 16 bits of %d and %e, which produces the wrong answer, because the low bits cannot stand in for a comparison of the full-width values.
The mechanism by which LLVM truncates and zero-extends an instruction's operands was introduced in https://reviews.llvm.org/rL250032: it first computes each instruction's minimal bitwidth (MinBWs) via computeMinimumValueSizes() and then applies truncateToMinimalBitwidths(). MinBWs comes from llvm::computeMinimumValueSizes(), which uses the DemandedBits analysis as one of its information sources.
I guess the root cause is that, during the demanded-bits computation, because the final result is truncated to 16 bits (%conv32 = trunc i64 %.sroa.speculated to i16), LLVM blindly assumes that the operands of the trunc instruction have the same live bits as the trunc itself. So the demanded bits of the original instruction %.sroa.speculated = select i1 %cmp.i, i64 %d, i64 %e are reported as 16 bits too.
And during truncateToMinimalBitwidths(), LLVM truncates the operands first.

Is this supposed to be a correctness fix, or a profitability heuristic?

I guess it is a profitability heuristic.

As for avoid-truncate-shift-operands.ll: before loop vectorization, we do a shift operation and then truncate its result to 8 bits. That is:

%0 = lshr i32 %f, 18
%conv7 = trunc i32 %0 to i8

And the demanded bits say that we only need 8 bits of %0, so truncateToMinimalBitwidths() first truncates lshr's operands to 8 bits and then shifts, which produces a wrong result:

%20 = trunc <2 x i32> %broadcast.splat7 to <2 x i8>
%21 = lshr <2 x i8> %20, <i8 18, i8 18>
%22 = zext <2 x i8> %21 to <2 x i32>
%23 = trunc <2 x i32> %22 to <2 x i8>

To avoid wrongly truncating operands, I added a whitelist for computing the demanded bits of a Trunc instruction's operands: we only reduce the demanded bits for instructions that are always truncation friendly, such as ADD or SUB.

lebedev.ri requested changes to this revision.Mon, Dec 21, 1:36 AM

Is this supposed to be a correctness fix, or a profitability heuristic?

I guess it is a profitability heuristic.

This revision now requires changes to proceed.Mon, Dec 21, 1:36 AM
guopeilin updated this revision to Diff 313049.Mon, Dec 21, 2:11 AM

If this is not a correctness (or precision) fix, but a profitability one,
I would suggest addressing this elsewhere.

nikic requested changes to this revision.Mon, Dec 21, 1:53 PM

Okay, based on your description this *is* a correctness fix, but it's not applied to the right place. The result of the DemandedBits analysis is correct -- demanded bits of the trunc operand are the same as for the trunc result.

As far as I can tell, the actual bug is in computeMinimumValueSizes(). It "successfully" terminates walks at non-Instructions, but that's not right, as the demanded bits for the non-Instruction can be wider. In your avoid-truncate-shift-operands.ll example, note that in lshr i32 %f, 18, %f has demanded bits outside the low 16 bits, but this fact is lost, because %f is an Argument. I imagine the same thing happens with constants as well. The fact that Arguments are incorrectly truncated seems to be the commonality between all your tests.

As such, I believe a fix here needs to be in computeMinimumValueSizes() -- though possibly DemandedBits also needs to be extended to provide additional information.

A possible way to go about this is to extend the existing DemandedBits code for DeadUses, which currently stores whether certain Uses are completely dead. This could be changed to actually store the demanded bits per-use, not just per-instruction. This would allow you to inspect the demanded bits of arguments and constants as used in a specific instruction. We would have to evaluate the additional cost this would have, though.

This revision now requires changes to proceed.Mon, Dec 21, 1:53 PM


Thanks for reviewing.
I am still confused in a couple of places. One is that DemandedBits.cpp seems to compute the demanded bits per-instruction; that is, it also "successfully" terminates its walks at non-Instructions.
I guess computing demanded bits per-instruction is not compatible with the way truncateToMinimalBitwidths() truncates the operands of every instruction in MinBWs.
So if we want to keep truncateToMinimalBitwidths(), we may need to modify computeMinimumValueSizes() to take Arguments into consideration. Or, if we still want to compute demanded bits per-instruction, would it be a possible solution to be more conservative in truncateToMinimalBitwidths(), i.e. add some patterns and avoid truncating when they match?

Can you please start with filing a bugreport, explaining the miscompile(?), with tests?


Yes, and I have taken avoid-truncate-icmp-operands.ll as an example to explain how the bug is produced.
The bug in https://bugs.llvm.org/show_bug.cgi?id=48583 is that when we compute 361804 >> 18 in a loop,
after loop vectorization we first truncate 361804 from 32 bits to 16 bits and then lshr by 18, which produces a wrong answer.

guopeilin retitled this revision from [DemandedBits] Add a whitelist when computing demanded bits of Trunc Instruction to [LoopVectorize] Take argument's size into consideration when computing leader's demanded bits.
guopeilin edited the summary of this revision. (Show Details)


Hi, I updated a new patch. The solution now is to take every non-instruction operand within the chain into consideration when computing the leader's demanded bits, so that we do not drop the information of the Arguments. This is a conservative approach, and it works on all my related issues.

Please upload patches with full context (-U9999)

guopeilin updated this revision to Diff 313720.Fri, Dec 25, 1:45 AM


Okay, I updated this patch with full context, please review.

nikic added a comment.Fri, Dec 25, 1:58 AM

This causes a number of test failures:

Failed Tests (4):
  LLVM :: Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll
  LLVM :: Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
  LLVM :: Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll
  LLVM :: Transforms/LoopVectorize/demanded-bits-of-pointer-instruction.ll

Last one is an assertion failure because you assume non-pointer types, the others are presumably optimization regressions. What you are effectively doing here is to disable the truncation optimization completely for any case where the instruction chain contains an argument or a constant somewhere. Especially the latter case is probably very common. So while this change does fix the miscompile, it effectively disables the optimization entirely, which is probably not desired.

guopeilin updated this revision to Diff 313952.Tue, Dec 29, 2:48 AM


A new patch has been uploaded. It adds the constraint that we only visit Arguments when traversing non-instruction values, and it handles the case where BitSize == 0. This patch passes all my cases and the cases you mentioned.


A new patch has been uploaded that takes care of the cases where an Argument is incorrectly truncated during computeMinimumValueSizes(). Could you please review this patch?

nikic added inline comments.Thu, Dec 31, 8:00 AM
llvm/lib/Analysis/VectorUtils.cpp
539

I think we can't entirely ignore non-arguments (aka constants) here. Take a look at this modified form of llvm/test/Transforms/LoopVectorize/avoid-truncate-icmp-operands.ll:

; RUN: opt -loop-vectorize -S < %s | FileCheck %s
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

@a = dso_local local_unnamed_addr global i64 0, align 8
@b = dso_local local_unnamed_addr global i16 0, align 4

define dso_local void @myFunc(i32 %d, i32 %e) {
; CHECK:        pred.store.continue2:
; CHECK-NEXT:      %{{[0-9]+}} = icmp ult <2 x i64> %broadcast.splat{{[0-9]*}}, %broadcast.splat{{[0-9]*}}
for.body29.lr.ph:
  br label %for.body29

for.cond25.for.cond.cleanup28_crit_edge:          ; preds = %for.inc
  ret void

for.body29:                                       ; preds = %for.inc, %for.body29.lr.ph
  %n.078 = phi i16 [ undef, %for.body29.lr.ph ], [ %add34, %for.inc ]
  br i1 undef, label %for.inc, label %if.then

if.then:                                          ; preds = %for.body29
  %conv31 = zext i8 undef to i64
  store i64 %conv31, i64* @a, align 8
  %cmp.i = icmp ult i32 %e, %d
  %.sroa.speculated = select i1 %cmp.i, i64 10000000000, i64 20000000000
  %conv32 = trunc i64 %.sroa.speculated to i16
  store i16 %conv32, i16* @b, align 4
  br label %for.inc

for.inc:                                          ; preds = %if.then, %for.body29
  %add34 = add nsw i16 %n.078, 2
  %cmp27 = icmp slt i16 %add34, 16
  br i1 %cmp27, label %for.body29, label %for.cond25.for.cond.cleanup28_crit_edge, !llvm.loop !6
}

!6 = distinct !{!6, !7}
!7 = !{!"llvm.loop.vectorize.enable", i1 true}

This gets vectorized with i32 width, but the two select constants are larger than 32-bit and get truncated.

I think the logic here should be: For ConstantInt, take the getActiveBits() of the constant value, for everything else (including Arguments) take the getTypeSizeInBits (what you're effectively doing now).

541

You should use DL.getTypeSizeInBits(Val->getType()) here, which will remove the need for the == 0 check.

guopeilin updated this revision to Diff 314332.Mon, Jan 4, 1:18 AM
guopeilin retitled this revision from [LoopVectorize] Take argument's size into consideration when computing leader's demanded bits to [LoopVectorize] Take non-instruction's size into consideration when computing leader's demanded bits.
guopeilin edited the summary of this revision. (Show Details)
guopeilin added inline comments.Mon, Jan 4, 1:31 AM
llvm/lib/Analysis/VectorUtils.cpp
539

Thanks a lot for reviewing this patch. A new patch has been uploaded, which takes the ConstantInt case into consideration, too.
Something still confuses me. Firstly, I don't understand why the two select constants are larger than 32-bit: as far as I can see, both %d and %e are defined as i32 arguments, so what is the purpose of taking ConstantInt into consideration here?
Secondly, besides ConstantInt there are ConstantFP and other kinds of Constant; should we take them into consideration too?
I would greatly appreciate it if you could explain a bit more. Again, thanks a lot for reviewing.

guopeilin added inline comments.Mon, Jan 4, 1:32 AM
llvm/lib/Analysis/VectorUtils.cpp
541

Ok, and I use Argument->getParent()->getParent()->getDataLayout() to get the DataLayout.

nikic added inline comments.Tue, Jan 5, 1:50 PM
llvm/lib/Analysis/VectorUtils.cpp
539

And there still exists something that confuses me. Firstly, I don't know why the two select constants are larger than 32-bit? As far as I can see, both %d and %e are defined as an i32 argument, what is the purpose that we take ConstantInt into consideration here?

While the %d and %e arguments used in the icmp are 32-bit wide, the two constants that the select chooses between are larger than 32-bit. Vectorizing with 32-bit elements will also truncate the select constant and lose the top bits. Is this clearer?

By the way, it would be good if you could also add the above test case with a check that it uses i64 vectorization, not i32.

Secondly, for Constant, there are ConstantFP, ConstantInt and some other kind of constant, so shall we take them into consideration, too?
It would be a great appreciation if you could explain it a bit more. Again, thanks a lot for reviewing.

I'm not sure to what degree these are really relevant in this code (as demanded bits only works on integer instructions), but generally yes, they should be taken into account. I would suggest writing the code something like this, so that only ConstantInt has special treatment (to avoid regressions) while everything else is handled as "else":

unsigned BitSize;
if (const auto *CI = dyn_cast<ConstantInt>(Val)) {
  BitSize = CI->getValue().getActiveBits();
} else {
  DataLayout DL = ...;
  BitSize = DL.getTypeSizeInBits(Val->getType());
}

// Bail out for same reason below.
if (BitSize > 64)
  return MapVector<Instruction *, uint64_t>();
uint64_t V = APInt::getAllOnesValue(BitSize).getZExtValue();
DBits[Leader] |= V;
LLVM_DEBUG(dbgs() << "\t Values's dbits " << Twine::utohexstr(V)
                          << '\n');

There's just one problem there: Getting the DataLayout. I don't see any obvious way to get it, so it may be necessary to pass it as an argument to computeMinimumValueSizes().

guopeilin added inline comments.Wed, Jan 6, 6:08 PM
llvm/lib/Analysis/VectorUtils.cpp
539

Hi, thanks a lot for your review and suggestion.
As for the DataLayout: I found that a ConstantInt can be 0, in which case getActiveBits() returns 0. So we still need to check BitSize before initializing an APInt with it, which means the DL may not be so necessary here.
Also, I found that case

Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll

fails if we handle everything else as 'else'. So maybe we should only take Argument and ConstantInt into consideration, as a conservative approach.

guopeilin added inline comments.Wed, Jan 6, 6:24 PM
llvm/lib/Analysis/VectorUtils.cpp
539

Sorry, it seems that the failure of the test case Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll is not caused by handling everything else as 'else'; I am figuring it out. Please wait.

guopeilin updated this revision to Diff 315298.Thu, Jan 7, 10:18 PM
guopeilin added inline comments.Thu, Jan 7, 10:32 PM
llvm/lib/Analysis/VectorUtils.cpp
539

Hi, nikic! I have updated a new patch.
This patch fixes three things:
Firstly, a ConstantInt can be zero, which means getActiveBits() can return 0, so we still need to check BitSize == 0 even though we have DL.getTypeSizeInBits(Val->getType()). So I guess we don't need the DataLayout here.
Secondly, a ConstantInt can be negative. If we use getActiveBits(), i.e. BitWidth - countLeadingZeros(), we always get the most conservative bit count for a negative value, because its sign bit is 1. That is why I failed on the following test case when using getActiveBits() for all ConstantInts:

Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll

This case contains the negative constant int -28, so I use getMinSignedBits() for negative values.
Thirdly, for non-negative constant ints I still use getActiveBits() as before, and handle everything else as 'else' by using Val->getType()->getScalarSizeInBits().