This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Analysis/
-
Analysis/
-
IVDescriptors.cpp
-
test/Transforms/LoopVectorize/
-
Transforms/
-
LoopVectorize/
2
reduction-inloop-pred.ll
1
reduction-inloop.ll
5/11
trunc-reductions.ll

Differential D105632

[LV] Use lookThroughAnd with logical reductions
ClosedPublic

Authored by kmclaughlin on Jul 8 2021, 7:11 AM.

Download Raw Diff

Details

Reviewers

sdesmalen
david-arm
dmgreen
spatel
CarolineConcatto

Commits

rG9d3559499370: Reland "[LV] Use lookThroughAnd with logical reductions"
rGe22a59967251: [LV] Use lookThroughAnd with logical reductions

Summary

If a reduction Phi has a single user which ANDs the Phi with a type mask,
lookThroughAnd will return the user of the Phi and the narrower type represented
by the mask. Currently this is only used for arithmetic reductions, whereas loops
containing logical reductions will create a reduction intrinsic using the widened
type, for example:

for.body:
  %phi = phi i32 [ %and, %for.body ], [ 255, %entry ]
  %mask = and i32 %phi, 255
  %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
  %load = load i8, i8* %gep
  %ext = zext i8 %load to i32
  %and = and i32 %mask, %ext
  ...

^ this will generate an and reduction intrinsic such as the following:

call i32 @llvm.vector.reduce.and.v8i32(<8 x i32>...)

The same example for an add instruction would create an intrinsic of type i8:

call i8 @llvm.vector.reduce.add.v8i8(<8 x i8>...)

This patch changes AddReductionVar to call lookThroughAnd for other integer
reductions, allowing loops similar to the example above with reductions such
as and, or & xor to vectorize.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kmclaughlin created this revision.Jul 8 2021, 7:11 AM

Herald added a subscriber: hiraditya. · View Herald TranscriptJul 8 2021, 7:11 AM

kmclaughlin requested review of this revision.Jul 8 2021, 7:11 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2021, 7:11 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

kmclaughlin added a reviewer: CarolineConcatto.Jul 8 2021, 7:18 AM

Harbormaster completed remote builds in B112984: Diff 357213.Jul 8 2021, 8:31 AM

A nice patch! Good to see more vectorisation capability and code improvements here. Before I accept the patch it might be worth seeing what others think about the inloop tests?

llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
1598	Given the reduction intrinsic is no longer in the loop is it worth moving this to a different file? Any thoughts @dmgreen @spatel?
llvm/test/Transforms/LoopVectorize/reduction-inloop.ll
1065	Same comment as in `reduction-inloop-pred.ll`.
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
120	This phi instruction hasn't been narrowed like it has for the cases above, but at least we do now vectorise this loop! Perhaps this is something for a future investigation?
198	nit: Is it worth removing the loop hints and just specifying `-force-vector-interleave=1 -force-vector-width=8` on the RUN line?

dmgreen added inline comments.Jul 8 2021, 10:58 AM

llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll
1598	I'm pretty sure these tests were added to make sure that the combination of inloop reductions and lookThroughAnd worked together correctly. If they still appear to be working, they should be OK to keep here to make sure that combo keeps working in the future.
llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187	Are we sure that smin/smax work correctly with this? We go from cutting off the top bits (potentially turning a signed number into an unsigned number) to not doing that. Or maybe going from a signed min/max to a unsigned one, from the look of this test? I may be wrong, but I think something like this transform needs to be valid for the op: https://alive2.llvm.org/ce/z/5XJ4ZU. umin/umax may still be OK, if the initial value from the phi is within range. But it doesn't reduce the vector width.

Added new tests for umax & smin
Changed the initial value of the reduction phi in @reduction_smax_trunc to be greater than the mask value

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187	I think the smax/smin tests are working correctly, as we are still performing an `and` on the phi before the load & compare instructions to mask the top bits. In this test for smax I was previously using an inital value for the phi which was smaller than the mask, which meant the `and` was removed by instcombine. I've changed the initial value to make sure we can check for the `and` in this test which hopefully makes it a bit clearer.

Harbormaster completed remote builds in B113204: Diff 357509.Jul 9 2021, 8:49 AM

LGTM! Thanks for making the changes @kmclaughlin and I'm happy with your explanation for the smin/smax/umin/umax loops.

This revision is now accepted and ready to land.Jul 12 2021, 8:50 AM

I still don't think this code is making correct assumption about smin at least.

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187	I'm still not sure this is correct. It's about the trunc(reduce.smin.v8i8(..)) performed on an i8, not an i32 like the original. A value with the '7th' bit set will change from being treated like a large positive value to a negative value. It seems that the code is assuming that the reduction can happen in the smaller type, even with signed min/max. As for the initial value, %sum.02p = phi i32 [ %min, %for.body ], [ 256, %entry ] %sum.02 = and i32 %sum.02p, 255 with only one use on the phi is the same as using an input value of 0 (`= 0ff & 0x100`) from entry. I don't think any optimization will do that at the moment, but in principle it seems like it could.

This revision now requires changes to proceed.Jul 12 2021, 9:51 AM

david-arm added inline comments.Jul 13 2021, 1:02 AM

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
187	Hi @dmgreen, ok I think I understand your concerns now with specific reference to the intrinsic after the loop. It wasn't obvious from your initial comment that's all. :) You're specifically worried not about the vector body (which I'm sure is correct), but the ordering of trunc and reduce.smin.v8i8. So basically for this to be correct it needs a minor tweak so that we do: trunc(reduce.smin.v8i32) -> i8 instead, right? I think that would ensure we get the right answer.

Changed AddReductionVar so that for min & max reductions where lookThroughAnd has set the Start instruction to the user of the Phi, we do not change the return type. This ensures that we create the reduction with the wider type (e.g. i32) and truncate the result back to the smaller type (e.g. i8).

Harbormaster completed remote builds in B114020: Diff 358651.Jul 14 2021, 11:30 AM

The test here might be OK. But what about when the type from the And mask and the reduction do not match?

Something like this, where the mask of the And is smaller than the bitwidth from the extend. The loads might also just not be extended, and the final result might not be truncated.

define i8 @reduction_umax_trunc(i8* noalias nocapture %ptr) {
; CHECK-LABEL: @reduction_umax_trunc(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK:       vector.ph:
; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
; CHECK:       vector.body:
; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP6:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT:    [[TMP0:%.*]] = and <8 x i32> [[VEC_PHI]], <i32 127, i32 127, i32 127, i32 127, i32 127, i32 127, i32 127, i32 127>
; CHECK-NEXT:    [[TMP1:%.*]] = sext i32 [[INDEX]] to i64
; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[PTR:%.*]], i64 [[TMP1]]
; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <8 x i8>*
; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <8 x i8>, <8 x i8>* [[TMP3]], align 1
; CHECK-NEXT:    [[TMP4:%.*]] = zext <8 x i8> [[WIDE_LOAD]] to <8 x i32>
; CHECK-NEXT:    [[TMP5:%.*]] = icmp ugt <8 x i32> [[TMP0]], [[TMP4]]
; CHECK-NEXT:    [[TMP6]] = select <8 x i1> [[TMP5]], <8 x i32> [[TMP0]], <8 x i32> [[TMP4]]
; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
; CHECK-NEXT:    br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]
; CHECK:       middle.block:
; CHECK-NEXT:    [[TMP8:%.*]] = call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> [[TMP6]])
; CHECK-NEXT:    [[EXTRACT_T:%.*]] = trunc i32 [[TMP8]] to i8
; CHECK-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
; CHECK:       scalar.ph:
; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
; CHECK:       for.body:
; CHECK-NEXT:    br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP15:![0-9]+]]
; CHECK:       for.end:
; CHECK-NEXT:    [[MIN_LCSSA_OFF0:%.*]] = phi i8 [ undef, [[FOR_BODY]] ], [ [[EXTRACT_T]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT:    ret i8 [[MIN_LCSSA_OFF0]]
;
entry:
  br label %for.body

for.body:
  %iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
  %sum.02p = phi i32 [ %max, %for.body ], [ 0, %entry ]
  %sum.02 = and i32 %sum.02p, 127
  %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
  %load = load i8, i8* %gep
  %ext = zext i8 %load to i32
  %icmp = icmp ugt i32 %sum.02, %ext
  %max = select i1 %icmp, i32 %sum.02, i32 %ext
  %iv.next = add i32 %iv, 1
  %exitcond = icmp eq i32 %iv.next, 256
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  %ret = trunc i32 %max to i8
  ret i8 %ret
}

The order of max's is no longer the same, and the And will cut off bits from the value. So for example with values ptr[254] = 0x90 and ptr[255] = 0x70. In the scalar code it would have %max=0x90 on the penultimate iteration and on the final iteration %sum.02 = 0x90&0x7f = 0x10, %max=umax(0x10, 0x70) = 0x70, which is the final result. The vector version will compute its final iteration with exiting %max values <.., .., 0x90, 0x70>. This then goes into the vector.reduce.umax, producing 0x90.

It feel conceptually odd to me to treat max(and(x, mask), y) the same as max(x, y). They make sense for add and mul, and I think for and/or/xor too providing it's the bottom bits that are demanded. It may be possible to add an extra And with a mask in to get the same results. But I'm not sure there isn't anything else that would go wrong, considering how subtle the issues are here! And the various assumption other code would end up making.

Do the min/max cases come up in practice from C code?

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
86	Can you change the start value to 0. Otherwise the loop doesn't really do much, other than return 65535.

Hi @dmgreen, I think the particular problem you're worried about may not happen. This is because @kmclaughlin's tests also include the flags -dce -instcombine, which I think may have eliminated a redundant and instruction after the select that deals with this. It might be worth @kmclaughlin just creating a test with the mask 0x7f (127) in your example and seeing what the output looks like? The additional flags do make the IR tidier, but sometimes hide the work done by the vectoriser to ensure correctness.

Changed the condition in IVDescriptiors before lookThroughAnd from isArithmeticRecurrenceKind to !isMinMaxRecurrenceKind
Updated the tests for min/max to ensure these are not vectorized

Hi @dmgreen, thank you for taking another look at this & for explaining the problem with min & max here again. I think I understand now :)
I tried your example with a mask of 127 without -instcombine as @david-arm suggested, and there is no additional and after the select which would ensure we get the correct result on the final iteration.

The min/max cases haven't come up in practice from C code. I've changed the condition in IVDescriptors so that the only other instructions which we call lookThroughAnd for are and, or & xor.

david-arm added inline comments.Jul 16 2021, 8:39 AM

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
151	nit: Hi @kmclaughlin, sorry to be a pain, but perhaps it's worth having a single CHECK line for all the min/max tests, i.e. CHECK-NOT: vector.body CHECK-NOT: <8 x instead of checking for the same scalar loop?

Thanks. That makes sense to me. The And/Or/Xor do sound like they should be OK.

LGTM. It sounds like David has some test suggestions, but consider my objections lifted.

This revision is now accepted and ready to land.Jul 16 2021, 8:46 AM

Harbormaster completed remote builds in B114506: Diff 359330.Jul 16 2021, 9:03 AM

Removed CHECK lines for the scalar loop from the min/max tests in trunc-reductions.ll and replaced with the following:

CHECK-NOT: vector.body
CHECK-NOT: <8 x
CHECK: ret

LGTM!

Harbormaster completed remote builds in B114815: Diff 359724.Jul 19 2021, 4:06 AM

This revision was landed with ongoing or failed builds.Jul 21 2021, 1:56 AM

Closed by commit rGe22a59967251: [LV] Use lookThroughAnd with logical reductions (authored by kmclaughlin). · Explain Why

This revision was automatically updated to reflect the committed changes.

kmclaughlin added a commit: rGe22a59967251: [LV] Use lookThroughAnd with logical reductions.

This patch appears to have broken the sanitizer-windows bot: https://lab.llvm.org/buildbot/#/builders/127/builds/14307

******************** TEST 'AddressSanitizer-x86_64-windows :: TestCases/Windows/recalloc_sanity.cpp' FAILED ********************
Exit Code: 1318
# command output:
   Creating library C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.lib and object C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.exp
LINK : fatal error LNK1318: Unexpected PDB error; FORMAT (11) ''
# command stderr:
clang-cl: error: linker command failed with exit code 1318 (use -v to see invocation)
error: command failed with exit status: 1318

Could you please take a look?

kmclaughlin added a reverting change: rGbe753b207f59: Revert "[LV] Use lookThroughAnd with logical reductions".Jul 21 2021, 7:18 AM

In D105632#2893018, @morehouse wrote:

This patch appears to have broken the sanitizer-windows bot: https://lab.llvm.org/buildbot/#/builders/127/builds/14307

******************** TEST 'AddressSanitizer-x86_64-windows :: TestCases/Windows/recalloc_sanity.cpp' FAILED ********************
Exit Code: 1318
# command output:
   Creating library C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.lib and object C:\b\slave\sanitizer-windows\build\stage1\projects\compiler-rt\test\asan\X86_64WindowsConfig\TestCases\Windows\Output\recalloc_sanity.cpp.tmp.exp
LINK : fatal error LNK1318: Unexpected PDB error; FORMAT (11) ''
# command stderr:
clang-cl: error: linker command failed with exit code 1318 (use -v to see invocation)
error: command failed with exit status: 1318

Could you please take a look?

Hi @morehouse,
I have reverted this patch for now in case this is causing the failures, although I believe the sanitizer-windows bot is still failing with the reverted changes: https://lab.llvm.org/buildbot/#/builders/127/builds/14324
Looking at the LINK errors here, I think this can happen due to stale files which weren't deleted from a previous run and it's possible this may not be related to my patch?

Thanks for reverting. Indeed this patch was not the culprit. I clobbered the build directory on the bot and we're passing again.

Sorry for the false alarm!

spatel added inline comments.Jul 21 2021, 8:59 AM

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
238	It's independent of this patch, but I noticed that all of the affected tests have this kind of mask op on the phi. I think -instcombine will eliminate that op, so -loop-vectorize will not typically see this pattern, and we still get the wide reduction op. Is there a plan to account for that?

kmclaughlin reopened this revision.Jul 26 2021, 7:51 AM

kmclaughlin added inline comments.

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
238	Hi @spatel, thanks for taking a look at this! I think it is possible for -loop-vectorize to see this pattern - I've tried using this patch with some small examples such as the following: short foo(short *values, long long N) { int sum = 0; for(long long i=0; i<N; ++i) { sum &= 65535; sum \|= (int)values[i]; } return (short)sum; } I don't believe -instcombine is eliminating the `and` before -loop-vectorize with the above example, and we do get the narrower reduction op after -loop-vectorize with these changes.

This revision is now accepted and ready to land.Jul 26 2021, 7:51 AM

spatel added inline comments.Jul 28 2021, 5:19 AM

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll
238	Thanks for checking. I was seeing the mask op disappear in some of the reduced IR examples, but that may be due to over-simplification in tests.

Closed by commit rG9d3559499370: Reland "[LV] Use lookThroughAnd with logical reductions" (authored by kmclaughlin). · Explain WhyJul 30 2021, 10:04 AM

This revision was automatically updated to reflect the committed changes.

kmclaughlin added a commit: rG9d3559499370: Reland "[LV] Use lookThroughAnd with logical reductions".

Revision Contents

Path

Size

llvm/

lib/

Analysis/

IVDescriptors.cpp

2 lines

test/

Transforms/

LoopVectorize/

reduction-inloop-pred.ll

85 lines

reduction-inloop.ll

25 lines

trunc-reductions.ll

250 lines

Diff 363131

llvm/lib/Analysis/IVDescriptors.cpp

Show First 20 Lines • Show All 266 Lines • ▼ Show 20 Lines	bool RecurrenceDescriptor::AddReductionVar(PHINode *Phi, RecurKind Kind,
// operations are not limited to the legal integer widths, so we may be able		// operations are not limited to the legal integer widths, so we may be able
// to evaluate the reduction in the narrower width.		// to evaluate the reduction in the narrower width.
if (RecurrenceType->isFloatingPointTy()) {		if (RecurrenceType->isFloatingPointTy()) {
if (!isFloatingPointRecurrenceKind(Kind))		if (!isFloatingPointRecurrenceKind(Kind))
return false;		return false;
} else if (RecurrenceType->isIntegerTy()) {		} else if (RecurrenceType->isIntegerTy()) {
if (!isIntegerRecurrenceKind(Kind))		if (!isIntegerRecurrenceKind(Kind))
return false;		return false;
if (isArithmeticRecurrenceKind(Kind))		if (!isMinMaxRecurrenceKind(Kind))
Start = lookThroughAnd(Phi, RecurrenceType, VisitedInsts, CastInsts);		Start = lookThroughAnd(Phi, RecurrenceType, VisitedInsts, CastInsts);
} else {		} else {
// Pointer min/max may exist, but it is not supported as a reduction op.		// Pointer min/max may exist, but it is not supported as a reduction op.
return false;		return false;
}		}

Worklist.push_back(Start);		Worklist.push_back(Start);
VisitedInsts.insert(Start);		VisitedInsts.insert(Start);
▲ Show 20 Lines • Show All 999 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/reduction-inloop-pred.ll

	Show First 20 Lines • Show All 1,535 Lines • ▼ Show 20 Lines
	; CHECK-LABEL: @reduction_and_trunc(			; CHECK-LABEL: @reduction_and_trunc(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]			; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[PRED_LOAD_CONTINUE6:%.*]] ]			; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[PRED_LOAD_CONTINUE6:%.*]] ]
	; CHECK-NEXT: [[VEC_IND:%.]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.]], [[PRED_LOAD_CONTINUE6]] ]			; CHECK-NEXT: [[VEC_IND:%.]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.]], [[PRED_LOAD_CONTINUE6]] ]
	; CHECK-NEXT: [[VEC_PHI:%.]] = phi i32 [ 255, [[VECTOR_PH]] ], [ [[TMP34:%.]], [[PRED_LOAD_CONTINUE6]] ]			; CHECK-NEXT: [[VEC_PHI:%.]] = phi <4 x i32> [ <i32 255, i32 -1, i32 -1, i32 -1>, [[VECTOR_PH]] ], [ [[TMP29:%.]], [[PRED_LOAD_CONTINUE6]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = icmp ult <4 x i32> [[VEC_IND]], <i32 257, i32 257, i32 257, i32 257>			; CHECK-NEXT: [[TMP0:%.*]] = icmp ult <4 x i32> [[VEC_IND]], <i32 257, i32 257, i32 257, i32 257>
	; CHECK-NEXT: [[TMP1:%.*]] = select <4 x i1> [[TMP0]], <4 x i32> <i32 255, i32 255, i32 255, i32 255>, <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i1> [[TMP0]], i32 0
	; CHECK-NEXT: [[TMP2:%.*]] = call i32 @llvm.vector.reduce.and.v4i32(<4 x i32> [[TMP1]])			; CHECK-NEXT: br i1 [[TMP1]], label [[PRED_LOAD_IF:%.]], label [[PRED_LOAD_CONTINUE:%.]]
	; CHECK-NEXT: [[TMP3:%.*]] = and i32 [[TMP2]], [[VEC_PHI]]
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i1> [[TMP0]], i32 0
	; CHECK-NEXT: br i1 [[TMP4]], label [[PRED_LOAD_IF:%.]], label [[PRED_LOAD_CONTINUE:%.]]
	; CHECK: pred.load.if:			; CHECK: pred.load.if:
	; CHECK-NEXT: [[TMP5:%.*]] = sext i32 [[INDEX]] to i64			; CHECK-NEXT: [[TMP2:%.*]] = sext i32 [[INDEX]] to i64
	; CHECK-NEXT: [[TMP6:%.]] = getelementptr inbounds i8, i8 [[A:%.*]], i64 [[TMP5]]			; CHECK-NEXT: [[TMP3:%.]] = getelementptr inbounds i8, i8 [[A:%.*]], i64 [[TMP2]]
	; CHECK-NEXT: [[TMP7:%.]] = load i8, i8 [[TMP6]], align 4			; CHECK-NEXT: [[TMP4:%.]] = load i8, i8 [[TMP3]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i8> poison, i8 [[TMP7]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i8> poison, i8 [[TMP4]], i32 0
	; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE]]			; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE]]
	; CHECK: pred.load.continue:			; CHECK: pred.load.continue:
	; CHECK-NEXT: [[TMP9:%.*]] = phi <4 x i8> [ poison, [[VECTOR_BODY]] ], [ [[TMP8]], [[PRED_LOAD_IF]] ]			; CHECK-NEXT: [[TMP6:%.*]] = phi <4 x i8> [ poison, [[VECTOR_BODY]] ], [ [[TMP5]], [[PRED_LOAD_IF]] ]
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x i1> [[TMP0]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i1> [[TMP0]], i32 1
	; CHECK-NEXT: br i1 [[TMP10]], label [[PRED_LOAD_IF1:%.]], label [[PRED_LOAD_CONTINUE2:%.]]			; CHECK-NEXT: br i1 [[TMP7]], label [[PRED_LOAD_IF1:%.]], label [[PRED_LOAD_CONTINUE2:%.]]
	; CHECK: pred.load.if1:			; CHECK: pred.load.if1:
	; CHECK-NEXT: [[TMP11:%.*]] = or i32 [[INDEX]], 1			; CHECK-NEXT: [[TMP8:%.*]] = or i32 [[INDEX]], 1
	; CHECK-NEXT: [[TMP12:%.*]] = sext i32 [[TMP11]] to i64			; CHECK-NEXT: [[TMP9:%.*]] = sext i32 [[TMP8]] to i64
	; CHECK-NEXT: [[TMP13:%.]] = getelementptr inbounds i8, i8 [[A]], i64 [[TMP12]]			; CHECK-NEXT: [[TMP10:%.]] = getelementptr inbounds i8, i8 [[A]], i64 [[TMP9]]
	; CHECK-NEXT: [[TMP14:%.]] = load i8, i8 [[TMP13]], align 4			; CHECK-NEXT: [[TMP11:%.]] = load i8, i8 [[TMP10]], align 4
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <4 x i8> [[TMP9]], i8 [[TMP14]], i32 1			; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i8> [[TMP6]], i8 [[TMP11]], i32 1
	; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE2]]			; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE2]]
	; CHECK: pred.load.continue2:			; CHECK: pred.load.continue2:
	; CHECK-NEXT: [[TMP16:%.*]] = phi <4 x i8> [ [[TMP9]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP15]], [[PRED_LOAD_IF1]] ]			; CHECK-NEXT: [[TMP13:%.*]] = phi <4 x i8> [ [[TMP6]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP12]], [[PRED_LOAD_IF1]] ]
	; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i1> [[TMP0]], i32 2			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x i1> [[TMP0]], i32 2
	; CHECK-NEXT: br i1 [[TMP17]], label [[PRED_LOAD_IF3:%.]], label [[PRED_LOAD_CONTINUE4:%.]]			; CHECK-NEXT: br i1 [[TMP14]], label [[PRED_LOAD_IF3:%.]], label [[PRED_LOAD_CONTINUE4:%.]]
	; CHECK: pred.load.if3:			; CHECK: pred.load.if3:
	; CHECK-NEXT: [[TMP18:%.*]] = or i32 [[INDEX]], 2			; CHECK-NEXT: [[TMP15:%.*]] = or i32 [[INDEX]], 2
	; CHECK-NEXT: [[TMP19:%.*]] = sext i32 [[TMP18]] to i64			; CHECK-NEXT: [[TMP16:%.*]] = sext i32 [[TMP15]] to i64
	; CHECK-NEXT: [[TMP20:%.]] = getelementptr inbounds i8, i8 [[A]], i64 [[TMP19]]			; CHECK-NEXT: [[TMP17:%.]] = getelementptr inbounds i8, i8 [[A]], i64 [[TMP16]]
	; CHECK-NEXT: [[TMP21:%.]] = load i8, i8 [[TMP20]], align 4			; CHECK-NEXT: [[TMP18:%.]] = load i8, i8 [[TMP17]], align 4
	; CHECK-NEXT: [[TMP22:%.*]] = insertelement <4 x i8> [[TMP16]], i8 [[TMP21]], i32 2			; CHECK-NEXT: [[TMP19:%.*]] = insertelement <4 x i8> [[TMP13]], i8 [[TMP18]], i32 2
	; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE4]]			; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE4]]
	; CHECK: pred.load.continue4:			; CHECK: pred.load.continue4:
	; CHECK-NEXT: [[TMP23:%.*]] = phi <4 x i8> [ [[TMP16]], [[PRED_LOAD_CONTINUE2]] ], [ [[TMP22]], [[PRED_LOAD_IF3]] ]			; CHECK-NEXT: [[TMP20:%.*]] = phi <4 x i8> [ [[TMP13]], [[PRED_LOAD_CONTINUE2]] ], [ [[TMP19]], [[PRED_LOAD_IF3]] ]
	; CHECK-NEXT: [[TMP24:%.*]] = extractelement <4 x i1> [[TMP0]], i32 3			; CHECK-NEXT: [[TMP21:%.*]] = extractelement <4 x i1> [[TMP0]], i32 3
	; CHECK-NEXT: br i1 [[TMP24]], label [[PRED_LOAD_IF5:%.*]], label [[PRED_LOAD_CONTINUE6]]			; CHECK-NEXT: br i1 [[TMP21]], label [[PRED_LOAD_IF5:%.*]], label [[PRED_LOAD_CONTINUE6]]
	; CHECK: pred.load.if5:			; CHECK: pred.load.if5:
	; CHECK-NEXT: [[TMP25:%.*]] = or i32 [[INDEX]], 3			; CHECK-NEXT: [[TMP22:%.*]] = or i32 [[INDEX]], 3
	; CHECK-NEXT: [[TMP26:%.*]] = sext i32 [[TMP25]] to i64			; CHECK-NEXT: [[TMP23:%.*]] = sext i32 [[TMP22]] to i64
	; CHECK-NEXT: [[TMP27:%.]] = getelementptr inbounds i8, i8 [[A]], i64 [[TMP26]]			; CHECK-NEXT: [[TMP24:%.]] = getelementptr inbounds i8, i8 [[A]], i64 [[TMP23]]
	; CHECK-NEXT: [[TMP28:%.]] = load i8, i8 [[TMP27]], align 4			; CHECK-NEXT: [[TMP25:%.]] = load i8, i8 [[TMP24]], align 4
	; CHECK-NEXT: [[TMP29:%.*]] = insertelement <4 x i8> [[TMP23]], i8 [[TMP28]], i32 3			; CHECK-NEXT: [[TMP26:%.*]] = insertelement <4 x i8> [[TMP20]], i8 [[TMP25]], i32 3
	; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE6]]			; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE6]]
	; CHECK: pred.load.continue6:			; CHECK: pred.load.continue6:
	; CHECK-NEXT: [[TMP30:%.*]] = phi <4 x i8> [ [[TMP23]], [[PRED_LOAD_CONTINUE4]] ], [ [[TMP29]], [[PRED_LOAD_IF5]] ]			; CHECK-NEXT: [[TMP27:%.*]] = phi <4 x i8> [ [[TMP20]], [[PRED_LOAD_CONTINUE4]] ], [ [[TMP26]], [[PRED_LOAD_IF5]] ]
	; CHECK-NEXT: [[TMP31:%.*]] = zext <4 x i8> [[TMP30]] to <4 x i32>			; CHECK-NEXT: [[TMP28:%.*]] = zext <4 x i8> [[TMP27]] to <4 x i32>
	; CHECK-NEXT: [[TMP32:%.*]] = select <4 x i1> [[TMP0]], <4 x i32> [[TMP31]], <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>			; CHECK-NEXT: [[TMP29]] = and <4 x i32> [[VEC_PHI]], [[TMP28]]
	; CHECK-NEXT: [[TMP33:%.*]] = call i32 @llvm.vector.reduce.and.v4i32(<4 x i32> [[TMP32]])
	; CHECK-NEXT: [[TMP34]] = and i32 [[TMP33]], [[TMP3]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
	; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], <i32 4, i32 4, i32 4, i32 4>			; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], <i32 4, i32 4, i32 4, i32 4>
	; CHECK-NEXT: [[TMP35:%.*]] = icmp eq i32 [[INDEX_NEXT]], 260			; CHECK-NEXT: [[TMP30:%.*]] = icmp eq i32 [[INDEX_NEXT]], 260
	; CHECK-NEXT: br i1 [[TMP35]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP32:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP30]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP32:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
				; CHECK-NEXT: [[TMP31:%.*]] = select <4 x i1> [[TMP0]], <4 x i32> [[TMP29]], <4 x i32> [[VEC_PHI]]
				; CHECK-NEXT: [[TMP32:%.*]] = trunc <4 x i32> [[TMP31]] to <4 x i8>
				; CHECK-NEXT: [[TMP33:%.*]] = call i8 @llvm.vector.reduce.and.v4i8(<4 x i8> [[TMP32]])
				david-armUnsubmitted Not Done Reply Inline Actions Given the reduction intrinsic is no longer in the loop is it worth moving this to a different file? Any thoughts @dmgreen @spatel? david-arm: Given the reduction intrinsic is no longer in the loop is it worth moving this to a different…
				dmgreenUnsubmitted Not Done Reply Inline Actions I'm pretty sure these tests were added to make sure that the combination of inloop reductions and lookThroughAnd worked together correctly. If they still appear to be working, they should be OK to keep here to make sure that combo keeps working in the future. dmgreen: I'm pretty sure these tests were added to make sure that the combination of inloop reductions…
	; CHECK-NEXT: br i1 true, label [[DOT_CRIT_EDGE:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[DOT_CRIT_EDGE:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: br label [[DOTLR_PH:%.*]]			; CHECK-NEXT: br label [[DOTLR_PH:%.*]]
	; CHECK: .lr.ph:			; CHECK: .lr.ph:
	; CHECK-NEXT: br i1 undef, label [[DOT_CRIT_EDGE]], label [[DOTLR_PH]], !llvm.loop [[LOOP33:![0-9]+]]			; CHECK-NEXT: br i1 undef, label [[DOT_CRIT_EDGE]], label [[DOTLR_PH]], !llvm.loop [[LOOP33:![0-9]+]]
	; CHECK: ._crit_edge:			; CHECK: ._crit_edge:
	; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi i32 [ undef, [[DOTLR_PH]] ], [ [[TMP34]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi i8 [ undef, [[DOTLR_PH]] ], [ [[TMP33]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: [[RET:%.*]] = trunc i32 [[SUM_0_LCSSA]] to i8			; CHECK-NEXT: ret i8 [[SUM_0_LCSSA]]
	; CHECK-NEXT: ret i8 [[RET]]
	;			;
	entry:			entry:
	br label %.lr.ph			br label %.lr.ph

	.lr.ph: ; preds = %entry, %.lr.ph			.lr.ph: ; preds = %entry, %.lr.ph
	%indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]			%indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]
	%sum.02p = phi i32 [ %l9, %.lr.ph ], [ 255, %entry ]			%sum.02p = phi i32 [ %l9, %.lr.ph ], [ 255, %entry ]
	%sum.02 = and i32 %sum.02p, 255			%sum.02 = and i32 %sum.02p, 255
	Show All 13 Lines

llvm/test/Transforms/LoopVectorize/reduction-inloop.ll

	Show First 20 Lines • Show All 1,046 Lines • ▼ Show 20 Lines
	define i8 @reduction_and_trunc(i8* noalias nocapture %A) {			define i8 @reduction_and_trunc(i8* noalias nocapture %A) {
	; CHECK-LABEL: @reduction_and_trunc(			; CHECK-LABEL: @reduction_and_trunc(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]			; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.]] = phi i32 [ 255, [[VECTOR_PH]] ], [ [[TMP6:%.]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.]] = phi <4 x i8> [ <i8 -1, i8 -1, i8 -1, i8 -1>, [[VECTOR_PH]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = and i32 [[VEC_PHI]], 255			; CHECK-NEXT: [[TMP0:%.*]] = sext i32 [[INDEX]] to i64
	; CHECK-NEXT: [[TMP1:%.*]] = sext i32 [[INDEX]] to i64			; CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds i8, i8 [[A:%.*]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP2:%.]] = getelementptr inbounds i8, i8 [[A:%.*]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.]] = bitcast i8 [[TMP1]] to <4 x i8>*
	; CHECK-NEXT: [[TMP3:%.]] = bitcast i8 [[TMP2]] to <4 x i8>*			; CHECK-NEXT: [[WIDE_LOAD:%.]] = load <4 x i8>, <4 x i8> [[TMP2]], align 4
	; CHECK-NEXT: [[WIDE_LOAD:%.]] = load <4 x i8>, <4 x i8> [[TMP3]], align 4			; CHECK-NEXT: [[TMP3]] = and <4 x i8> [[VEC_PHI]], [[WIDE_LOAD]]
	; CHECK-NEXT: [[TMP4:%.*]] = zext <4 x i8> [[WIDE_LOAD]] to <4 x i32>
	; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.and.v4i32(<4 x i32> [[TMP4]])
	; CHECK-NEXT: [[TMP6]] = and i32 [[TMP5]], [[TMP0]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
	; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256			; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
	; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP38:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP38:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
				; CHECK-NEXT: [[TMP5:%.*]] = call i8 @llvm.vector.reduce.and.v4i8(<4 x i8> [[TMP3]])
				david-armUnsubmitted Not Done Reply Inline Actions Same comment as in `reduction-inloop-pred.ll`. david-arm: Same comment as in `reduction-inloop-pred.ll`.
	; CHECK-NEXT: br i1 true, label [[DOT_CRIT_EDGE:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[DOT_CRIT_EDGE:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: br label [[DOTLR_PH:%.*]]			; CHECK-NEXT: br label [[DOTLR_PH:%.*]]
	; CHECK: .lr.ph:			; CHECK: .lr.ph:
	; CHECK-NEXT: br i1 undef, label [[DOT_CRIT_EDGE]], label [[DOTLR_PH]], !llvm.loop [[LOOP39:![0-9]+]]			; CHECK-NEXT: br i1 undef, label [[DOT_CRIT_EDGE]], label [[DOTLR_PH]], !llvm.loop [[LOOP39:![0-9]+]]
	; CHECK: ._crit_edge:			; CHECK: ._crit_edge:
	; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi i32 [ undef, [[DOTLR_PH]] ], [ [[TMP6]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[SUM_0_LCSSA:%.*]] = phi i8 [ undef, [[DOTLR_PH]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: [[RET:%.*]] = trunc i32 [[SUM_0_LCSSA]] to i8			; CHECK-NEXT: ret i8 [[SUM_0_LCSSA]]
	; CHECK-NEXT: ret i8 [[RET]]
	;			;
	entry:			entry:
	br label %.lr.ph			br label %.lr.ph

	.lr.ph: ; preds = %entry, %.lr.ph			.lr.ph: ; preds = %entry, %.lr.ph
	%indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]			%indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]
	%sum.02p = phi i32 [ %l9, %.lr.ph ], [ 255, %entry ]			%sum.02p = phi i32 [ %l9, %.lr.ph ], [ 255, %entry ]
	%sum.02 = and i32 %sum.02p, 255			%sum.02 = and i32 %sum.02p, 255
	Show All 17 Lines

llvm/test/Transforms/LoopVectorize/trunc-reductions.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -loop-vectorize -dce -instcombine -force-vector-interleave=1 -force-vector-width=8 -S < %s \| FileCheck %s

				define i8 @reduction_and_trunc(i8* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_and_trunc(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.]] = phi <8 x i8> [ <i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, [[VECTOR_PH]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = sext i32 [[INDEX]] to i64
				; CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds i8, i8 [[PTR:%.*]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.]] = bitcast i8 [[TMP1]] to <8 x i8>*
				; CHECK-NEXT: [[WIDE_LOAD:%.]] = load <8 x i8>, <8 x i8> [[TMP2]], align 1
				; CHECK-NEXT: [[TMP3]] = and <8 x i8> [[VEC_PHI]], [[WIDE_LOAD]]
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP5:%.*]] = call i8 @llvm.vector.reduce.and.v8i8(<8 x i8> [[TMP3]])
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: [[AND_LCSSA_OFF0:%.*]] = phi i8 [ undef, [[FOR_BODY]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: ret i8 [[AND_LCSSA_OFF0]]
				;
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %and, %for.body ], [ 0, %entry ]
				%sum.02 = and i32 %sum.02p, 255
				%gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
				%load = load i8, i8* %gep
				%ext = zext i8 %load to i32
				%and = and i32 %sum.02, %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %and to i8
				ret i8 %ret
				}

				define i16 @reduction_or_trunc(i16* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_or_trunc(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.]] = phi <8 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = sext i32 [[INDEX]] to i64
				; CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds i16, i16 [[PTR:%.*]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.]] = bitcast i16 [[TMP1]] to <8 x i16>*
				; CHECK-NEXT: [[WIDE_LOAD:%.]] = load <8 x i16>, <8 x i16> [[TMP2]], align 2
				; CHECK-NEXT: [[TMP3]] = or <8 x i16> [[VEC_PHI]], [[WIDE_LOAD]]
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP5:%.*]] = call i16 @llvm.vector.reduce.or.v8i16(<8 x i16> [[TMP3]])
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: [[XOR_LCSSA_OFF0:%.*]] = phi i16 [ undef, [[FOR_BODY]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: ret i16 [[XOR_LCSSA_OFF0]]
				;
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %xor, %for.body ], [ 0, %entry ]
				%sum.02 = and i32 %sum.02p, 65535
				dmgreenUnsubmitted Done Reply Inline Actions Can you change the start value to 0. Otherwise the loop doesn't really do much, other than return 65535. dmgreen: Can you change the start value to 0. Otherwise the loop doesn't really do much, other than…
				%gep = getelementptr inbounds i16, i16* %ptr, i32 %iv
				%load = load i16, i16* %gep
				%ext = zext i16 %load to i32
				%xor = or i32 %sum.02, %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %xor to i16
				ret i16 %ret
				}

				define i16 @reduction_xor_trunc(i16* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_xor_trunc(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.]] = phi <8 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP3:%.]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = sext i32 [[INDEX]] to i64
				; CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds i16, i16 [[PTR:%.*]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.]] = bitcast i16 [[TMP1]] to <8 x i16>*
				; CHECK-NEXT: [[WIDE_LOAD:%.]] = load <8 x i16>, <8 x i16> [[TMP2]], align 2
				; CHECK-NEXT: [[TMP3]] = xor <8 x i16> [[VEC_PHI]], [[WIDE_LOAD]]
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
				; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP5:%.*]] = call i16 @llvm.vector.reduce.xor.v8i16(<8 x i16> [[TMP3]])
				; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				david-armUnsubmitted Not Done Reply Inline Actions This phi instruction hasn't been narrowed like it has for the cases above, but at least we do now vectorise this loop! Perhaps this is something for a future investigation? david-arm: This phi instruction hasn't been narrowed like it has for the cases above, but at least we do…
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: [[XOR_LCSSA_OFF0:%.*]] = phi i16 [ undef, [[FOR_BODY]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: ret i16 [[XOR_LCSSA_OFF0]]
				;
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %xor, %for.body ], [ 0, %entry ]
				%sum.02 = and i32 %sum.02p, 65535
				%gep = getelementptr inbounds i16, i16* %ptr, i32 %iv
				%load = load i16, i16* %gep
				%ext = zext i16 %load to i32
				%xor = xor i32 %sum.02, %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %xor to i16
				ret i16 %ret
				}

				define i8 @reduction_smin_trunc(i8* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_smin_trunc(
				; CHECK-NOT: vector.body
				; CHECK-NOT: <8 x
				david-armUnsubmitted Done Reply Inline Actions nit: Hi @kmclaughlin, sorry to be a pain, but perhaps it's worth having a single CHECK line for all the min/max tests, i.e. CHECK-NOT: vector.body CHECK-NOT: <8 x instead of checking for the same scalar loop? david-arm: nit: Hi @kmclaughlin, sorry to be a pain, but perhaps it's worth having a single CHECK line for…
				; CHECK: ret
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %min, %for.body ], [ 256, %entry ]
				%sum.02 = and i32 %sum.02p, 255
				%gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
				%load = load i8, i8* %gep
				%ext = sext i8 %load to i32
				%icmp = icmp slt i32 %sum.02, %ext
				%min = select i1 %icmp, i32 %sum.02, i32 %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %min to i8
				ret i8 %ret
				}

				define i8 @reduction_umin_trunc(i8* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_umin_trunc(
				; CHECK-NOT: vector.body
				; CHECK-NOT: <8 x
				; CHECK: ret
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %min, %for.body ], [ 0, %entry ]
				%sum.02 = and i32 %sum.02p, 255
				%gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
				%load = load i8, i8* %gep
				dmgreenUnsubmitted Not Done Reply Inline Actions Are we sure that smin/smax work correctly with this? We go from cutting off the top bits (potentially turning a signed number into an unsigned number) to not doing that. Or maybe going from a signed min/max to a unsigned one, from the look of this test? I may be wrong, but I think something like this transform needs to be valid for the op: https://alive2.llvm.org/ce/z/5XJ4ZU. umin/umax may still be OK, if the initial value from the phi is within range. But it doesn't reduce the vector width. dmgreen: Are we sure that smin/smax work correctly with this? We go from cutting off the top bits…
				kmclaughlinAuthorUnsubmitted Done Reply Inline Actions I think the smax/smin tests are working correctly, as we are still performing an `and` on the phi before the load & compare instructions to mask the top bits. In this test for smax I was previously using an inital value for the phi which was smaller than the mask, which meant the `and` was removed by instcombine. I've changed the initial value to make sure we can check for the `and` in this test which hopefully makes it a bit clearer. kmclaughlin: I think the smax/smin tests are working correctly, as we are still performing an `and` on the…
				dmgreenUnsubmitted Not Done Reply Inline Actions I'm still not sure this is correct. It's about the trunc(reduce.smin.v8i8(..)) performed on an i8, not an i32 like the original. A value with the '7th' bit set will change from being treated like a large positive value to a negative value. It seems that the code is assuming that the reduction can happen in the smaller type, even with signed min/max. As for the initial value, %sum.02p = phi i32 [ %min, %for.body ], [ 256, %entry ] %sum.02 = and i32 %sum.02p, 255 with only one use on the phi is the same as using an input value of 0 (`= 0ff & 0x100`) from entry. I don't think any optimization will do that at the moment, but in principle it seems like it could. dmgreen: I'm still not sure this is correct. It's about the trunc(reduce.smin.v8i8(..)) performed on an…
				david-armUnsubmitted Not Done Reply Inline Actions Hi @dmgreen, ok I think I understand your concerns now with specific reference to the intrinsic after the loop. It wasn't obvious from your initial comment that's all. :) You're specifically worried not about the vector body (which I'm sure is correct), but the ordering of trunc and reduce.smin.v8i8. So basically for this to be correct it needs a minor tweak so that we do: trunc(reduce.smin.v8i32) -> i8 instead, right? I think that would ensure we get the right answer. david-arm: Hi @dmgreen, ok I think I understand your concerns now with specific reference to the intrinsic…
				%ext = zext i8 %load to i32
				%icmp = icmp ult i32 %sum.02, %ext
				%min = select i1 %icmp, i32 %sum.02, i32 %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %min to i8
				ret i8 %ret
				}
				david-armUnsubmitted Done Reply Inline Actions nit: Is it worth removing the loop hints and just specifying `-force-vector-interleave=1 -force-vector-width=8` on the RUN line? david-arm: nit: Is it worth removing the loop hints and just specifying `-force-vector-interleave=1 -force…

				define i16 @reduction_smax_trunc(i16* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_smax_trunc(
				; CHECK-NOT: vector.body
				; CHECK-NOT: <8 x
				; CHECK: ret
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %min, %for.body ], [ 0, %entry ]
				%sum.02 = and i32 %sum.02p, 65535
				%gep = getelementptr inbounds i16, i16* %ptr, i32 %iv
				%load = load i16, i16* %gep
				%ext = sext i16 %load to i32
				%icmp = icmp sgt i32 %sum.02, %ext
				%min = select i1 %icmp, i32 %sum.02, i32 %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %min to i16
				ret i16 %ret
				}

				define i16 @reduction_umax_trunc(i16* noalias nocapture %ptr) {
				; CHECK-LABEL: @reduction_umax_trunc(
				; CHECK-NOT: vector.body
				; CHECK-NOT: <8 x
				; CHECK: ret
				entry:
				br label %for.body

				for.body:
				%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
				%sum.02p = phi i32 [ %min, %for.body ], [ 0, %entry ]
				%sum.02 = and i32 %sum.02p, 65535
				%gep = getelementptr inbounds i16, i16* %ptr, i32 %iv
				spatelUnsubmitted Not Done Reply Inline Actions It's independent of this patch, but I noticed that all of the affected tests have this kind of mask op on the phi. I think -instcombine will eliminate that op, so -loop-vectorize will not typically see this pattern, and we still get the wide reduction op. Is there a plan to account for that? spatel: It's independent of this patch, but I noticed that all of the affected tests have this kind of…
				kmclaughlinAuthorUnsubmitted Done Reply Inline Actions Hi @spatel, thanks for taking a look at this! I think it is possible for -loop-vectorize to see this pattern - I've tried using this patch with some small examples such as the following: short foo(short values, long long N) { int sum = 0; for(long long i=0; i<N; ++i) { sum &= 65535; sum \|= (int)values[i]; } return (short)sum; } I don't believe -instcombine is eliminating the `and` before -loop-vectorize with the above example, and we do get the narrower reduction op after -loop-vectorize with these changes. kmclaughlin:* Hi @spatel, thanks for taking a look at this! I think it is possible for -loop-vectorize to see…
				spatelUnsubmitted Not Done Reply Inline Actions Thanks for checking. I was seeing the mask op disappear in some of the reduced IR examples, but that may be due to over-simplification in tests. spatel: Thanks for checking. I was seeing the mask op disappear in some of the reduced IR examples, but…
				%load = load i16, i16* %gep
				%ext = zext i16 %load to i32
				%icmp = icmp ugt i32 %sum.02, %ext
				%min = select i1 %icmp, i32 %sum.02, i32 %ext
				%iv.next = add i32 %iv, 1
				%exitcond = icmp eq i32 %iv.next, 256
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				%ret = trunc i32 %min to i16
				ret i16 %ret
				}