This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Set vectorizer-maximize-bandwidth as default true
AbandonedPublic

Authored by zatrazz on Apr 30 2018, 1:11 PM.

Details

Summary

Although this shows virtually no gain in SPEC CPU2006 on a Cortex-A72:

Benchmark Diff
400.perlbench +1.55
401.bzip2 -1.22
403.gcc +0.73
429.mcf +3.00
445.gobmk -0.39
456.hmmer -0.90
458.sjeng -0.41
462.libquantum -1.91
464.h264ref 0.00
471.omnetpp -0.64
473.astar -0.38
483.xalancbmk 0.90
geomean: 0.04

It shows some good improvements in generic loop code where each
element is truncated to a narrower type. For instance, the vector body
for the following code:

void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
   *dst++ = *src++;
  }
}

It currently compiles to:


.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldp     w14, w15, [x11, #-4]
        add     x11, x11, #8            // =8
        subs    x13, x13, #2            // =2
        sturb   w14, [x12, #-1]
        strb    w15, [x12], #2
        b.ne    .LBB0_4

With this patch it now compiles to:


.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x11, #-64]
        ldp     q2, q3, [x11, #-32]
        ldp     q4, q5, [x11]
        ldp     q6, q7, [x11, #32]
        xtn     v0.4h, v0.4s
        xtn     v2.4h, v2.4s
        xtn2    v2.8h, v3.4s
        xtn2    v0.8h, v1.4s
        xtn     v6.4h, v6.4s
        xtn     v4.4h, v4.4s
        xtn     v0.8b, v0.8h
        xtn2    v0.16b, v2.8h
        xtn2    v6.8h, v7.4s
        xtn2    v4.8h, v5.4s
        xtn     v1.8b, v4.8h
        xtn2    v1.16b, v6.8h
        add     x11, x11, #128          // =128
        subs    x13, x13, #32           // =32
        stp     q0, q1, [x12, #-16]
        add     x12, x12, #32           // =32
        b.ne    .LBB0_4

This is an increase of about 12% in throughput in a micro-benchmark with an
array of 16777216 elements.

Diff Detail

Event Timeline

zatrazz created this revision. Apr 30 2018, 1:11 PM
rengolin added a subscriber: mcrosier.

This seems like an improvement, but we have to be careful with wide variations and little gain. The geomean is almost null but the standard deviation is higher than 75% of the results.

This is not a good sign and is a reflection of what I have complained about exchanging numbers and not tracking which benchmarks get worse/better, then risking hurting more one side than another in the long term.

The patch sets the default to *all* AArch64 cores, which is a wide range and you have only tested on one.

To enable such flag by default, we'd need to run on more cores (preferably including A53) as well as other vendors (like Apple, Samsung, Qualcomm), which this patch is likely to impact.

@t.p.northover @evandro @mcrosier please have a look.

> This seems like an improvement, but we have to be careful with wide variations and little gain. The geomean is almost null but the standard deviation is higher than 75% of the results.
>
> This is not a good sign and is a reflection of what I have complained about exchanging numbers and not tracking which benchmarks get worse/better, then risking hurting more one side than another in the long term.
>
> The patch sets the default to *all* AArch64 cores, which is a wide range and you have only tested on one.
>
> To enable such flag by default, we'd need to run on more cores (preferably including A53) as well as other vendors (like Apple, Samsung, Qualcomm), which this patch is likely to impact.
>
> @t.p.northover @evandro @mcrosier please have a look.

My understanding is that 'vectorizer-maximize-bandwidth' was previously enabled and reverted as the default for all targets, and that history shows it caused functionality issues more than performance ones [1]. I couldn't find any follow-up on that thread, so I took a more conservative approach and enabled it only on AArch64. I will check more recent workloads (SPEC CPU2017) for gains or regressions and also rerun on master. But I do agree that this kind of change should be taken with care.

[1] [llvm] r305960 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
[llvm] r305990 - Revert "Enable vectorizer-maximize-bandwidth by default." (Diana Picus)
[llvm] r306336 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
[llvm] r306344 - revert r306336 for breaking ppc test. (Dehao)
[llvm] r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default. (Dehao)
[llvm] r306792 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default." (Daniel)
[llvm] r306933 - Enable vectorizer-maximize-bandwidth by default.
[llvm] r306934 - revert r306336 for breaking ppc test.
[llvm] r306935 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
[llvm] r306936 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."

rengolin requested changes to this revision. Apr 30 2018, 2:17 PM

> [1] [llvm] r305960 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
> [llvm] r305990 - Revert "Enable vectorizer-maximize-bandwidth by default." (Diana Picus)
> [llvm] r306336 - Enable vectorizer-maximize-bandwidth by default. (Dehao)
> [llvm] r306344 - revert r306336 for breaking ppc test. (Dehao)
> [llvm] r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default. (Dehao)
> [llvm] r306792 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default." (Daniel)
> [llvm] r306933 - Enable vectorizer-maximize-bandwidth by default.
> [llvm] r306934 - revert r306336 for breaking ppc test.
> [llvm] r306935 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default.
> [llvm] r306936 - Revert "r306473 - re-commit r306336: Enable vectorizer-maximize-bandwidth by default."

This is a horrible history, even worse if it broke things, not just performance...

Worse still is the number of targets involved (Arm, PPC), and that it was re-committed a number of times without modification.

This trend is worrying, and we should look *really* hard at the overall implementation before assuming that, because it didn't break SPEC2k6 on A72, it's a good patch.

We need a full analysis of what happened: why it's breaking so much, why the noise is so great, and how we can reduce it while fixing the actual bugs.

Re-committing the patch as is simply won't work.

cheers,
--renato

This revision now requires changes to proceed. Apr 30 2018, 2:17 PM
zatrazz updated this revision to Diff 147815. May 21 2018, 10:41 AM
zatrazz added a comment. Edited May 21 2018, 2:33 PM

For some reason I did not attach the intended comments in this update. This is an update of the previous patch with an extended analysis. I checked a bootstrap build with TargetTransformInfo::shouldMaximizeVectorBandwidth enabled for both armhf (r332595) and powerpc64le (r332840). On armhf I did not see any regression; however, on powerpc64le I found an issue related to how the current code handles the MaximizeBandwidth option. The testcase 'Transforms/LoopVectorize/PowerPC/pr30990.ll' explicitly sets vectorizer-maximize-bandwidth to 0, yet the code checks for:

In lib/Transforms/Vectorize/LoopVectorize.cpp:

  unsigned MaxVF = MaxVectorSize;
  if (TTI.shouldMaximizeVectorBandwidth(OptForSize) ||
      (MaximizeBandwidth && !OptForSize)) {

This is what enables/disables the optimization. I think a possible fix would be to check whether MaximizeBandwidth was explicitly disabled (instead of checking its default value):

diff --git a/lib/Transforms/Vectorize/LoopVectorize.cpp b/lib/Transforms/Vectorize/LoopVectorize.cpp
index a65dc09..40c6583 100644
--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4968,7 +4968,8 @@ LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,
   }
 
   unsigned MaxVF = MaxVectorSize;
-  if (TTI.shouldMaximizeVectorBandwidth(OptForSize) ||
+  if (TTI.shouldMaximizeVectorBandwidth(OptForSize &&
+                                        !MaximizeBandwidth.getNumOccurrences()) ||
       (MaximizeBandwidth && !OptForSize)) {
     // Collect all viable vectorization factors larger than the default MaxVF
     // (i.e. MaxVectorSize).

But I think it should not act as a blocker for this patch (I can send it as a separate patch if required).

Now, regarding performance differences: for SPEC CPU2006 I see mixed results. The machine I am testing on shows some variance, and even trying to minimize OS jitter (by pinning CPU allocation to a specific node and disabling OS services), in two runs I see:

  • RUN 1 (1 iteration)

Benchmark Difference (%) *
400.perlbench 0.73
401.bzip2 0.01
403.gcc 0.53
429.mcf 0.05
445.gobmk 0.99
456.hmmer -0.25
458.sjeng 1.02
462.libquantum 0.04
464.h264ref 0.28
471.omnetpp 0.30
473.astar -0.11
483.xalancbmk 1.92

433.milc 0.03
444.namd -0.38
447.dealII 0.95
450.soplex 0.99
453.povray 1.24
470.lbm -0.88
482.sphinx3 1.43

  • RUN 2 (3 iterations, best result taken)

Benchmark Difference (%) *
400.perlbench 0.66
401.bzip2 -2.84
403.gcc 0.09
429.mcf 0.46
445.gobmk 0.03
456.hmmer -1.34
458.sjeng 0.12
462.libquantum 0.06
464.h264ref 0.45
471.omnetpp -0.74
473.astar 1.05
483.xalancbmk -0.57

433.milc -0.04
444.namd 0.14
447.dealII -0.37
450.soplex 0.97
453.povray -0.90
470.lbm -0.88
482.sphinx3 0.28

On SPEC CPU2017 the results are slightly more stable:

Benchmark Difference (%) *
600.perlbench_s 0.41
602.gcc_s 0.96
605.mcf_s -0.87
620.omnetpp_s 1.74
623.xalancbmk_s 1.80
625.x264_s 0.16
631.deepsjeng_s 0.33
641.leela_s 0.38
657.xz_s -0.14

619.lbm_s -0.45
638.imagick_s 0.09
644.nab_s -0.10

[EDIT: removed geekbench5 data points since it is not released yet]

  • Difference between r332336 with and without the patch; positive values represent a higher score, indicating an improvement with the patch.

SPEC06 results look too noisy to conclude anything, especially bzip, xalan and povray. Can you find a more stable machine?

On the Geekbench side, I guess we need to understand what's wrong with PDFRendering before proceeding.

mcrosier resigned from this revision. May 24 2018, 12:54 PM

Indeed the machine I was using for SPEC CPU2006 was not the most suitable. I used a different one now (TX1, Cortex-A57) with extra care to lower variance (CPU binding, services disabled, etc.), and it indeed showed better results:

Benchmark        Difference (%)
400.perlbench    +1.04
401.bzip2        -2.04
403.gcc          +0.63
429.mcf          +0.05
445.gobmk        +0.18
456.hmmer        -0.90
458.sjeng        +0.29
462.libquantum   +0.10
464.h264ref      +1.73
471.omnetpp      +0.18
473.astar        -0.21
483.xalancbmk    +0.12

433.milc         -0.36
444.namd         +0.86
447.dealII       -0.66
450.soplex       -0.13
453.povray       +0.37
470.lbm          +0.08
482.sphinx3      +0.54

I will check whether the slight 401.bzip2 drop is just noise or something related to this patch, but regardless I do think this change should yield better performance in most scenarios. Taking the example where I am trying to get a better auto-vectorization pattern:

void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
   *dst++ = *src++;
  }
}

I can only get the maximum throughput when auto-vectorization tries large vectorization factors. I did try to optimize the trunc of 4 x i32 with a custom LowerTruncStore, but as far as I understand, without either an extra transformation or pass the AArch64 backend can't really fuse the high-half vector instruction (xtn2 in this case) for maximum throughput. Something I am investigating is whether selecting the largest VF for 'MaximizeVectorBandwidth' is the best strategy or whether we should add an architecture hook to enable/disable it.

On the Geekbench side I will investigate PDFRendering, but I really think it is just missing vectorization tuning and I am not sure we should consider it a blocker.

> 401.bzip2    -2.04
>
> I will check whether the slight 401.bzip2 drop is just noise or something related to this patch, but regardless I do think this change should yield better performance in most scenarios.

We have to be careful, though. In the past we have accepted this kind of scenario as "obviously good" because geomean is better. But that's a fallacy that has brought some pain over the past few years.

While a positive geomean is good, a -2% on bzip will mean someone will be trying to fix that later on, and might just as well undo the "good" changes this patch brings.

This is a never-ending scenario where the overall win is zero (plus or minus something).

I'm not saying we should stop any improvement because of one bad result, but here the worst regression is larger than any single improvement. This is still worrying, and may mean that the changes bring instability to later passes (including back-end ones).

> I can only get the maximum throughput when auto-vectorization tries large vectorization factors. I did try to optimize the trunc of 4 x i32 with a custom LowerTruncStore, but as far as I understand, without either an extra transformation or pass the AArch64 backend can't really fuse the high-half vector instruction (xtn2 in this case) for maximum throughput. Something I am investigating is whether selecting the largest VF for 'MaximizeVectorBandwidth' is the best strategy or whether we should add an architecture hook to enable/disable it.

I believe the fact that LLVM gets this right with the extra flag is a side-effect of how the vectoriser used to work: keep trying until whatever. Very soon this will no longer be the case and the approach for this would probably end up in a VPlan or something.

While I understand the need to get it "faster" on benchmarks, I think we need to look at the bigger picture here.

> On the Geekbench side I will investigate PDFRendering, but I really think it is just missing vectorization tuning and I am not sure we should consider it a blocker.

Until we know more about the reason bzip and PDF are so much worse, there is no way to know if this is a blocker or not.

fhahn requested changes to this revision. Jan 11 2019, 9:20 AM

Requesting changes as additional info on the bzip and PDF regressions is needed.

This revision now requires changes to proceed. Jan 11 2019, 9:20 AM
zatrazz abandoned this revision. Jan 22 2019, 3:08 AM