This is an archive of the discontinued LLVM Phabricator instance.

Enable vectorizer-maximize-bandwidth by default.
ClosedPublic

Authored by danielcdh on May 18 2017, 2:44 PM.


Details

Reviewers

hfinkel
mkuper
davidxl
chandlerc

Summary

vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of enabling it by default on the SPEC CPU2006 benchmarks on Sandy Bridge machines. The results show a non-negative overall impact:

spec/2006/fp/C++/444.namd 26.84 -0.31%
spec/2006/fp/C++/447.dealII 46.19 +0.89%
spec/2006/fp/C++/450.soplex 42.92 -0.44%
spec/2006/fp/C++/453.povray 38.57 -2.25%
spec/2006/fp/C/433.milc 24.54 -0.76%
spec/2006/fp/C/470.lbm 41.08 +0.26%
spec/2006/fp/C/482.sphinx3 47.58 -0.99%
spec/2006/int/C++/471.omnetpp 22.06 +1.87%
spec/2006/int/C++/473.astar 22.65 -0.12%
spec/2006/int/C++/483.xalancbmk 33.69 +4.97%
spec/2006/int/C/400.perlbench 33.43 +1.70%
spec/2006/int/C/401.bzip2 23.02 -0.19%
spec/2006/int/C/403.gcc 32.57 -0.43%
spec/2006/int/C/429.mcf 40.35 +0.27%
spec/2006/int/C/445.gobmk 26.96 +0.06%
spec/2006/int/C/456.hmmer 24.4 +0.19%
spec/2006/int/C/458.sjeng 27.91 -0.08%
spec/2006/int/C/462.libquantum 57.47 -0.20%
spec/2006/int/C/464.h264ref 46.52 +1.35%

geometric mean +0.29%
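
The reported geometric mean can be cross-checked from the per-benchmark deltas. A minimal sketch, assuming the figure is the geometric mean of the per-benchmark ratios (1 + delta/100); the function name `geomeanDeltaPercent` and the constant `kSpec2006Deltas` are illustrative and not part of the patch:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark changes, returned as a percentage.
// Each input is a percentage delta, e.g. -0.31 means -0.31%.
double geomeanDeltaPercent(const std::vector<double>& deltasPercent) {
  if (deltasPercent.empty()) return 0.0;
  double logSum = 0.0;
  for (double d : deltasPercent) {
    assert(d > -100.0 && "a delta of -100% or less has no valid ratio");
    logSum += std::log1p(d / 100.0);  // log of the ratio (1 + delta)
  }
  const double meanLog = logSum / static_cast<double>(deltasPercent.size());
  return std::expm1(meanLog) * 100.0;  // back to a percentage delta
}

// Deltas copied from the table above, in the same order.
const std::vector<double> kSpec2006Deltas = {
    -0.31, +0.89, -0.44, -2.25, -0.76, +0.26, -0.99,  // fp
    +1.87, -0.12, +4.97,                              // int, C++
    +1.70, -0.19, -0.43, +0.27, +0.06, +0.19, -0.08,  // int, C
    -0.20, +1.35};
```

Under that assumption, `geomeanDeltaPercent(kSpec2006Deltas)` comes out at about +0.29%, consistent with the figure quoted above.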

The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.

I started this patch to solicit upstream opinions. It would be greatly appreciated if the community could help test the performance impact of this change on other architectures so that we can decide whether this should be target-dependent.

Diff Detail

Build Status

Buildable 7259
Build 7259: arc lint + arc unit

Event Timeline

danielcdh created this revision. May 18 2017, 2:44 PM

Herald added a subscriber: mzolotukhin. May 18 2017, 2:44 PM

To get more feedback, you might want to send an llvm-dev email with the benchmark data you have and ask others to benchmark by passing the flag there? More folks would see that email, and you can point at this patch as a place to have detailed discussion.

Also, while I don't really expect it, have you looked at any code size changes? Might be good just to check off the list.

RKSimon added a subscriber: RKSimon. May 19 2017, 12:20 PM

We're seeing nice improvements but also significant degradations, which we would like to investigate before the patch is committed.

dorit added a subscriber: dorit. May 30 2017, 12:49 AM

In D33341#767418, @magabari wrote:

We're seeing nice improvements but also significant degradations, which we would like to investigate before the patch is committed.

Please follow up on the llvm-dev thread?

Specifically, if you want this change to be delayed, it seems important to provide test cases that reproduce whatever issue you're seeing as no one else has reported issues.

bjope added a subscriber: bjope. May 30 2017, 2:28 PM

Any update on this patch?

In D33341#779138, @davidxl wrote:

Any update on this patch?

Dehao indicated on the llvm-dev thread that he planned to land it by end of day today unless someone else hit issues. Marking as LGTM just as a formality...

This revision is now accepted and ready to land. Jun 13 2017, 11:02 AM

As discussed in llvm-dev:

"Thanks for the update. I think the overall consensus is supportive of this change. If no one objects by the end of tomorrow (Tuesday), I will submit the patch to make vectorizer-maximize-bandwidth on by default."

Could someone help approve the patch so that I can "arc commit" directly Wednesday (6/14/2017) morning?

Thanks,
Dehao

In D33341#779146, @chandlerc wrote:

In D33341#779138, @davidxl wrote:

Any update on this patch?

Dehao indicated on the llvm-dev thread that he planned to land it by end of day today unless someone else hit issues. Marking as LGTM just as a formality...

Thanks Chandler!

Updated tests. PTAL. Thanks!

Herald added subscribers: javed.absar, sanjoy, rengolin. Jun 14 2017, 1:02 PM

Harbormaster completed remote builds in B7259: Diff 102597. Jun 14 2017, 1:02 PM

I thought I had been clear, sorry if not. This LGTM based on the llvm-dev discussion, please go ahead. =]

danielcdh closed this revision. Jun 21 2017, 3:02 PM

The patch was reverted in r305990 because it exposed an AArch64 bug. A fix has been proposed in https://reviews.llvm.org/D34641.

This revision is now accepted and ready to land. Jun 26 2017, 1:09 PM

danielcdh closed this revision. Jun 26 2017, 2:34 PM

Accidentally closed it.

The exposed bug (https://bugs.llvm.org/show_bug.cgi?id=33600) has been fixed in r306334.

I'm now retesting this patch and preparing to commit it again.

This revision is now accepted and ready to land. Jun 26 2017, 2:39 PM

danielcdh closed this revision. Jun 26 2017, 2:41 PM

hsaito mentioned this in D44735: [LV] Add TTI::shouldMaximizeVectorBandwidth to allow enabling it per target. Mar 21 2018, 9:03 AM
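
The title of that follow-up revision suggests moving the decision into a per-target hook. One plausible shape for such a policy, sketched with stand-in types: an explicit command-line setting wins, otherwise the target's preference applies. `FlagState`, `TargetModel`, and `useMaxBandwidth` are hypothetical names for illustration and do not reflect the actual code in D44735:

```cpp
#include <cassert>

// Whether the user passed the option explicitly, and with which value.
enum class FlagState { Unset, ForcedOff, ForcedOn };

// Stand-in for a target description; this is not LLVM's TTI class.
struct TargetModel {
  bool prefersMaxBandwidth;
};

// An explicit flag overrides the target; otherwise defer to the target.
bool useMaxBandwidth(FlagState flag, const TargetModel& target) {
  switch (flag) {
    case FlagState::ForcedOn:
      return true;
    case FlagState::ForcedOff:
      return false;
    case FlagState::Unset:
      return target.prefersMaxBandwidth;
  }
  assert(false && "unhandled FlagState");
  return false;
}
```

With this arrangement a target can opt in or out by default, while an explicit flag still allows the kind of A/B benchmarking requested in the summary.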

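Revision Contents

Path    Size
lib/Transforms/Vectorize/LoopVectorize.cpp    2 lines
test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll    32 lines
test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll    16 lines
test/Transforms/LoopVectorize/ARM/gcc-examples.ll    6 lines
test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll    4 lines
test/Transforms/LoopVectorize/X86/gcc-examples.ll    15 lines
test/Transforms/LoopVectorize/X86/masked_load_store.ll    60 lines
test/Transforms/LoopVectorize/X86/no_fpmath.ll    2 lines
test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll    2 lines
test/Transforms/LoopVectorize/X86/reduction-crash.ll    2 lines
test/Transforms/LoopVectorize/X86/vectorization-remarks-loopid-dbg.ll    2 lines
test/Transforms/LoopVectorize/X86/vectorization-remarks.ll    2 lines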

Diff 102597

lib/Transforms/Vectorize/LoopVectorize.cpp


	/// We don't vectorize loops with a known constant trip count below this number.			/// We don't vectorize loops with a known constant trip count below this number.
	static cl::opt<unsigned> TinyTripCountVectorThreshold(			static cl::opt<unsigned> TinyTripCountVectorThreshold(
	"vectorizer-min-trip-count", cl::init(16), cl::Hidden,			"vectorizer-min-trip-count", cl::init(16), cl::Hidden,
	cl::desc("Don't vectorize loops with a constant "			cl::desc("Don't vectorize loops with a constant "
	"trip count that is smaller than this "			"trip count that is smaller than this "
	"value."));			"value."));

	static cl::opt<bool> MaximizeBandwidth(			static cl::opt<bool> MaximizeBandwidth(
	"vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,			"vectorizer-maximize-bandwidth", cl::init(true), cl::Hidden,
	cl::desc("Maximize bandwidth when selecting vectorization factor which "			cl::desc("Maximize bandwidth when selecting vectorization factor which "
	"will be determined by the smallest type in loop."));			"will be determined by the smallest type in loop."));

	static cl::opt<bool> EnableInterleavedMemAccesses(			static cl::opt<bool> EnableInterleavedMemAccesses(
	"enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,			"enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,
	cl::desc("Enable vectorization on interleaved memory accesses in a loop"));			cl::desc("Enable vectorization on interleaved memory accesses in a loop"));

	/// Maximum factor for an interleaved memory access.			/// Maximum factor for an interleaved memory access.
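
The option description above says the vectorization factor is determined by the smallest type in the loop. A simplified model of that idea, assuming the upper bound on the factor is the vector register width divided by an element width (the widest type when the flag is off, the smallest when it is on). The helper `maxVectorizationFactor` is hypothetical and is not LLVM's implementation; the real pass also consults a cost model before settling on a factor:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper for illustration only.
// registerBits: width of one vector register, in bits.
// elementBits:  widths of the scalar types used in the loop, in bits.
unsigned maxVectorizationFactor(unsigned registerBits,
                                const std::vector<unsigned>& elementBits,
                                bool maximizeBandwidth) {
  assert(!elementBits.empty() && "loop must use at least one element type");
  const auto range =
      std::minmax_element(elementBits.begin(), elementBits.end());
  const unsigned smallest = *range.first;
  const unsigned widest = *range.second;
  assert(smallest > 0 && "element widths must be non-zero");
  // Flag off: the widest type must fit in one register.
  // Flag on: fill a register with the smallest type instead.
  const unsigned divisor = maximizeBandwidth ? smallest : widest;
  return std::max(1u, registerBits / divisor);
}
```

For a 128-bit register and a loop mixing i8 and i16, the model gives 8 with the flag off and 16 with it on, which matches the direction of the `add_c` expectation change in the AArch64 test in this diff.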

test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll

[...] for.body: ; preds = %entry, %for.body
store i16 %conv1, i16* %arrayidx3		store i16 %conv1, i16* %arrayidx3
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %len		%exitcond = icmp eq i32 %lftr.wideiv, %len
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

; CHECK-LABEL: @add_c(		; CHECK-LABEL: @add_c(
; CHECK: load <8 x i8>, <8 x i8>*		; CHECK: load <16 x i8>, <16 x i8>*
; CHECK: add <8 x i16>		; CHECK: add <16 x i16>
; CHECK: store <8 x i16>		; CHECK: store <16 x i16>
; Function Attrs: nounwind		; Function Attrs: nounwind
define void @add_c(i8* noalias nocapture readonly %p, i16* noalias nocapture %q, i32 %len) #0 {		define void @add_c(i8* noalias nocapture readonly %p, i16* noalias nocapture %q, i32 %len) #0 {
entry:		entry:
%cmp8 = icmp sgt i32 %len, 0		%cmp8 = icmp sgt i32 %len, 0
br i1 %cmp8, label %for.body, label %for.cond.cleanup		br i1 %cmp8, label %for.body, label %for.cond.cleanup

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
ret void		ret void
[...] for.body: ; preds = %entry, %for.body
store i16 %conv1, i16* %arrayidx3		store i16 %conv1, i16* %arrayidx3
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %len		%exitcond = icmp eq i32 %lftr.wideiv, %len
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

; CHECK-LABEL: @add_d(		; CHECK-LABEL: @add_d(
; CHECK: load <4 x i16>		; CHECK: load <8 x i16>
; CHECK: add nsw <4 x i32>		; CHECK: add nsw <8 x i32>
; CHECK: store <4 x i32>		; CHECK: store <8 x i32>
define void @add_d(i16* noalias nocapture readonly %p, i32* noalias nocapture %q, i32 %len) #0 {		define void @add_d(i16* noalias nocapture readonly %p, i32* noalias nocapture %q, i32 %len) #0 {
entry:		entry:
%cmp7 = icmp sgt i32 %len, 0		%cmp7 = icmp sgt i32 %len, 0
br i1 %cmp7, label %for.body, label %for.cond.cleanup		br i1 %cmp7, label %for.body, label %for.cond.cleanup

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
ret void		ret void

[...] for.body: ; preds = %for.body, %for.body.lr.ph
store i8 %conv19, i8* %arrayidx21		store i8 %conv19, i8* %arrayidx21
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %len		%exitcond = icmp eq i32 %lftr.wideiv, %len
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

; CHECK-LABEL: @add_f		; CHECK-LABEL: @add_f
; CHECK: load <8 x i16>		; CHECK: load <16 x i16>
; CHECK: trunc <8 x i16>		; CHECK: trunc <16 x i16>
; CHECK: shl <8 x i8>		; CHECK: shl <16 x i8>
; CHECK: add <8 x i8>		; CHECK: add <16 x i8>
; CHECK: or <8 x i8>		; CHECK: or <16 x i8>
; CHECK: mul <8 x i8>		; CHECK: mul <16 x i8>
; CHECK: and <8 x i8>		; CHECK: and <16 x i8>
; CHECK: xor <8 x i8>		; CHECK: xor <16 x i8>
; CHECK: mul <8 x i8>		; CHECK: mul <16 x i8>
; CHECK: store <8 x i8>		; CHECK: store <16 x i8>
define void @add_f(i16* noalias nocapture readonly %p, i8* noalias nocapture %q, i8 %arg1, i8 %arg2, i32 %len) #0 {		define void @add_f(i16* noalias nocapture readonly %p, i8* noalias nocapture %q, i8 %arg1, i8 %arg2, i32 %len) #0 {
entry:		entry:
%cmp.32 = icmp sgt i32 %len, 0		%cmp.32 = icmp sgt i32 %len, 0
br i1 %cmp.32, label %for.body.lr.ph, label %for.cond.cleanup		br i1 %cmp.32, label %for.body.lr.ph, label %for.cond.cleanup

for.body.lr.ph: ; preds = %entry		for.body.lr.ph: ; preds = %entry
%conv11 = zext i8 %arg2 to i32		%conv11 = zext i8 %arg2 to i32
%conv13 = zext i8 %arg1 to i32		%conv13 = zext i8 %arg1 to i32

test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll

	; short reduction_i16_2(char a, char b, int n) {			; short reduction_i16_2(char a, char b, int n) {
	; short sum = 0;			; short sum = 0;
	; for (int i = 0; i < n; ++i)			; for (int i = 0; i < n; ++i)
	; sum += (a[i] + b[i]);			; sum += (a[i] + b[i]);
	; return sum;			; return sum;
	; }			; }
	;			;
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK: phi <8 x i16>			; CHECK: phi <16 x i16>
	; CHECK: [[Ld1:%[a-zA-Z0-9.]+]] = load <8 x i8>			; CHECK: [[Ld1:%[a-zA-Z0-9.]+]] = load <16 x i8>
	; CHECK: zext <8 x i8> [[Ld1]] to <8 x i16>			; CHECK: zext <16 x i8> [[Ld1]] to <16 x i16>
	; CHECK: [[Ld2:%[a-zA-Z0-9.]+]] = load <8 x i8>			; CHECK: [[Ld2:%[a-zA-Z0-9.]+]] = load <16 x i8>
	; CHECK: zext <8 x i8> [[Ld2]] to <8 x i16>			; CHECK: zext <16 x i8> [[Ld2]] to <16 x i16>
	; CHECK: add <8 x i16>			; CHECK: add <16 x i16>
	; CHECK: add <8 x i16>			; CHECK: add <16 x i16>
	;			;
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK: [[Rdx:%[a-zA-Z0-9.]+]] = call i16 @llvm.experimental.vector.reduce.add.i16.v8i16(<8 x i16>			; CHECK: [[Rdx:%[a-zA-Z0-9.]+]] = call i16 @llvm.experimental.vector.reduce.add.i16.v16i16(<16 x i16>
	; CHECK: zext i16 [[Rdx]] to i32			; CHECK: zext i16 [[Rdx]] to i32
	;			;
	define i16 @reduction_i16_2(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %n) {			define i16 @reduction_i16_2(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %n) {
	entry:			entry:
	%cmp.14 = icmp sgt i32 %n, 0			%cmp.14 = icmp sgt i32 %n, 0
	br i1 %cmp.14, label %for.body.preheader, label %for.cond.cleanup			br i1 %cmp.14, label %for.body.preheader, label %for.cond.cleanup

	for.body.preheader:			for.body.preheader:

test/Transforms/LoopVectorize/ARM/gcc-examples.ll

[...] ; <label>:1 ; preds = %1, %0
%exitcond = icmp eq i32 %lftr.wideiv, 256		%exitcond = icmp eq i32 %lftr.wideiv, 256
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

;CHECK-LABEL: @example10b(		;CHECK-LABEL: @example10b(
;CHECK: load <4 x i16>		;CHECK: load <8 x i16>
;CHECK: sext <4 x i16>		;CHECK: sext <8 x i16>
;CHECK: store <4 x i32>		;CHECK: store <8 x i32>
;CHECK: ret void		;CHECK: ret void
define void @example10b(i16* noalias nocapture %sa, i16* noalias nocapture %sb, i16* noalias nocapture %sc, i32* noalias nocapture %ia, i32* noalias nocapture %ib, i32* noalias nocapture %ic) nounwind uwtable ssp {		define void @example10b(i16* noalias nocapture %sa, i16* noalias nocapture %sb, i16* noalias nocapture %sc, i32* noalias nocapture %ia, i32* noalias nocapture %ib, i32* noalias nocapture %ic) nounwind uwtable ssp {
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]		%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
%2 = getelementptr inbounds i16, i16* %sb, i64 %indvars.iv		%2 = getelementptr inbounds i16, i16* %sb, i64 %indvars.iv
%3 = load i16, i16* %2, align 2		%3 = load i16, i16* %2, align 2

test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll

	; RUN: opt < %s -mcpu=core-avx2 -loop-vectorize -S \| llc -mcpu=core-avx2 \| FileCheck %s			; RUN: opt < %s -mcpu=core-avx2 -loop-vectorize -S \| llc -mcpu=core-avx2 \| FileCheck %s

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx"			target triple = "x86_64-apple-macosx"

	@n = global i32 10000, align 4			@n = global i32 10000, align 4
	@double_array = common global [10000 x double] zeroinitializer, align 16			@double_array = common global [10000 x double] zeroinitializer, align 16
	@unsigned_array = common global [10000 x i32] zeroinitializer, align 16			@unsigned_array = common global [10000 x i32] zeroinitializer, align 16

	; If we need to scalarize the fptoui and then use inserts to build up the			; If we need to scalarize the fptoui and then use inserts to build up the
	; vector again, then there is certainly no value in going 256-bit wide.			; vector again, then there is certainly no value in going 256-bit wide.
	; CHECK-NOT: vpinsrd			; But as we default to maximize bandwidth, we should convert it to 256-bit
				; anyway.
				; CHECK: vpinsrd

	define void @convert() {			define void @convert() {
	entry:			entry:
	%0 = load i32, i32* @n, align 4			%0 = load i32, i32* @n, align 4
	%cmp4 = icmp eq i32 %0, 0			%cmp4 = icmp eq i32 %0, 0
	br i1 %cmp4, label %for.end, label %for.body.preheader			br i1 %cmp4, label %for.end, label %for.body.preheader

	for.body.preheader: ; preds = %entry			for.body.preheader: ; preds = %entry

test/Transforms/LoopVectorize/X86/gcc-examples.ll

[...] ; <label>:1 ; preds = %1, %0
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, 256		%exitcond = icmp eq i32 %lftr.wideiv, 256
br i1 %exitcond, label %8, label %1		br i1 %exitcond, label %8, label %1

; <label>:8 ; preds = %1		; <label>:8 ; preds = %1
ret void		ret void
}		}

; Select VF=4 because sext <8 x i1> to <8 x i32> is expensive.
;CHECK-LABEL: @example10b(		;CHECK-LABEL: @example10b(
;CHECK: load <4 x i16>		;CHECK: load <8 x i16>
;CHECK: sext <4 x i16>		;CHECK: sext <8 x i16>
;CHECK: store <4 x i32>		;CHECK: store <8 x i32>
;CHECK: ret void		;CHECK: ret void
;UNROLL-LABEL: @example10b(		;UNROLL-LABEL: @example10b(
;UNROLL: load <4 x i16>		;UNROLL: load <8 x i16>
;UNROLL: load <4 x i16>		;UNROLL: load <8 x i16>
;UNROLL: store <4 x i32>		;UNROLL: store <8 x i32>
;UNROLL: store <4 x i32>		;UNROLL: store <8 x i32>
;UNROLL: ret void		;UNROLL: ret void
define void @example10b(i16* noalias nocapture %sa, i16* noalias nocapture %sb, i16* noalias nocapture %sc, i32* noalias nocapture %ia, i32* noalias nocapture %ib, i32* noalias nocapture %ic) nounwind uwtable ssp {		define void @example10b(i16* noalias nocapture %sa, i16* noalias nocapture %sb, i16* noalias nocapture %sc, i32* noalias nocapture %ia, i32* noalias nocapture %ib, i32* noalias nocapture %ic) nounwind uwtable ssp {
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]		%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
%2 = getelementptr inbounds i16, i16* %sb, i64 %indvars.iv		%2 = getelementptr inbounds i16, i16* %sb, i64 %indvars.iv
%3 = load i16, i16* %2, align 2		%3 = load i16, i16* %2, align 2

test/Transforms/LoopVectorize/X86/masked_load_store.ll

	;			;
	; for (int i=0; i<10000; i++) {			; for (int i=0; i<10000; i++) {
	; if (trigger[i] < 100) {			; if (trigger[i] < 100) {
	; A[i] = B[i] + trigger[i];			; A[i] = B[i] + trigger[i];
	; }			; }
	; }			; }
	;}			;}

	;AVX-LABEL: @foo3			;AVX1-LABEL: @foo3
	;AVX: icmp slt <4 x i32> %wide.load, <i32 100, i32 100,			;AVX1: icmp slt <4 x i32> %wide.load, <i32 100, i32 100,
	;AVX: call <4 x double> @llvm.masked.load.v4f64.p0v4f64			;AVX1: call <4 x double> @llvm.masked.load.v4f64.p0v4f64
	;AVX: sitofp <4 x i32> %wide.load to <4 x double>			;AVX1: sitofp <4 x i32> %wide.load to <4 x double>
	;AVX: fadd <4 x double>			;AVX1: fadd <4 x double>
	;AVX: call void @llvm.masked.store.v4f64.p0v4f64			;AVX1: call void @llvm.masked.store.v4f64.p0v4f64
	;AVX: ret void			;AVX1: ret void

				;AVX2-LABEL: @foo3
				;AVX2: icmp slt <8 x i32> %wide.load, <i32 100, i32 100,
				;AVX2: call <8 x double> @llvm.masked.load.v8f64.p0v8f64
				;AVX2: sitofp <8 x i32> %wide.load to <8 x double>
				;AVX2: fadd <8 x double>
				;AVX2: call void @llvm.masked.store.v8f64.p0v8f64
				;AVX2: ret void

	;AVX512-LABEL: @foo3			;AVX512-LABEL: @foo3
	;AVX512: icmp slt <8 x i32> %wide.load, <i32 100, i32 100,			;AVX512: icmp slt <16 x i32> %wide.load, <i32 100, i32 100,
	;AVX512: call <8 x double> @llvm.masked.load.v8f64.p0v8f64			;AVX512: call <16 x double> @llvm.masked.load.v16f64.p0v16f64
	;AVX512: sitofp <8 x i32> %wide.load to <8 x double>			;AVX512: sitofp <16 x i32> %wide.load to <16 x double>
	;AVX512: fadd <8 x double>			;AVX512: fadd <16 x double>
	;AVX512: call void @llvm.masked.store.v8f64.p0v8f64			;AVX512: call void @llvm.masked.store.v16f64.p0v16f64
	;AVX512: ret void			;AVX512: ret void


	; Function Attrs: nounwind uwtable			; Function Attrs: nounwind uwtable
	define void @foo3(double* %A, double* %B, i32* %trigger) #0 {			define void @foo3(double* %A, double* %B, i32* %trigger) #0 {
	entry:			entry:
	%A.addr = alloca double*, align 8			%A.addr = alloca double*, align 8
	%B.addr = alloca double*, align 8			%B.addr = alloca double*, align 8
	[...]
	;			;
	; for (int i=SIZE-1; i>=0; i--) {			; for (int i=SIZE-1; i>=0; i--) {
	; if (trigger[i] > 0) {			; if (trigger[i] > 0) {
	; out[i] = in[i] + (double) 0.5;			; out[i] = in[i] + (double) 0.5;
	; }			; }
	; }			; }
	;}			;}
	;AVX2-LABEL: @foo6			;AVX2-LABEL: @foo6
	;AVX2: icmp sgt <4 x i32> %reverse, zeroinitializer			;AVX2: icmp sgt <8 x i32> %reverse, zeroinitializer
	;AVX2: shufflevector <4 x i1>{{.*}}<4 x i32> <i32 3, i32 2, i32 1, i32 0>			;AVX2: shufflevector <8 x i1>{{.*}}<8 x i32> <i32 7, i32 6, i32 5, i32 4
	;AVX2: call <4 x double> @llvm.masked.load.v4f64.p0v4f64			;AVX2: call <8 x double> @llvm.masked.load.v8f64.p0v8f64
	;AVX2: fadd <4 x double>			;AVX2: fadd <8 x double>
	;AVX2: call void @llvm.masked.store.v4f64.p0v4f64			;AVX2: call void @llvm.masked.store.v8f64.p0v8f64
	;AVX2: ret void			;AVX2: ret void

	;AVX512-LABEL: @foo6			;AVX512-LABEL: @foo6
	;AVX512: icmp sgt <8 x i32> %reverse, zeroinitializer			;AVX512: icmp sgt <16 x i32> %reverse, zeroinitializer
	;AVX512: shufflevector <8 x i1>{{.*}}<8 x i32> <i32 7, i32 6, i32 5, i32 4			;AVX512: shufflevector <16 x i1>{{.*}}<16 x i32> <i32 15, i32 14, i32 13, i32 12
	;AVX512: call <8 x double> @llvm.masked.load.v8f64.p0v8f64			;AVX512: call <16 x double> @llvm.masked.load.v16f64.p0v16f64
	;AVX512: fadd <8 x double>			;AVX512: fadd <16 x double>
	;AVX512: call void @llvm.masked.store.v8f64.p0v8f64			;AVX512: call void @llvm.masked.store.v16f64.p0v16f64
	;AVX512: ret void			;AVX512: ret void


	define void @foo6(double* %in, double* %out, i32 %size, i32* %trigger) {			define void @foo6(double* %in, double* %out, i32 %size, i32* %trigger) {
	entry:			entry:
	%in.addr = alloca double*, align 8			%in.addr = alloca double*, align 8
	%out.addr = alloca double*, align 8			%out.addr = alloca double*, align 8
	%size.addr = alloca i32, align 4			%size.addr = alloca i32, align 4
	[...]
	; bool * __restrict__ trigger, unsigned size) {			; bool * __restrict__ trigger, unsigned size) {
	;			;
	; for (unsigned i=0; i<size; i++)			; for (unsigned i=0; i<size; i++)
	; if (trigger[i] && (in[i] != 0))			; if (trigger[i] && (in[i] != 0))
	; out[i] = (double) 0.5;			; out[i] = (double) 0.5;
	; }			; }

	;AVX512-LABEL: @foo7			;AVX512-LABEL: @foo7
	;AVX512: call <8 x double> @llvm.masked.load.v8p0f64.p0v8p0f64(<8 x double>*			;AVX512: call <64 x double> @llvm.masked.load.v64p0f64.p0v64p0f64(<64 x double>*
	;AVX512: call void @llvm.masked.store.v8f64.p0v8f64			;AVX512: call void @llvm.masked.store.v64f64.p0v64f64
	;AVX512: ret void			;AVX512: ret void

	define void @foo7(double* noalias %out, double** noalias %in, i8* noalias %trigger, i32 %size) #0 {			define void @foo7(double* noalias %out, double** noalias %in, i8* noalias %trigger, i32 %size) #0 {
	entry:			entry:
	%out.addr = alloca double*, align 8			%out.addr = alloca double*, align 8
	%in.addr = alloca double**, align 8			%in.addr = alloca double**, align 8
	%trigger.addr = alloca i8*, align 8			%trigger.addr = alloca i8*, align 8
	%size.addr = alloca i32, align 4			%size.addr = alloca i32, align 4
	[...]
	;void foo8 (double* __restrict__ out, fp* __restrict__ in, bool * __restrict__ trigger, unsigned size) {			;void foo8 (double* __restrict__ out, fp* __restrict__ in, bool * __restrict__ trigger, unsigned size) {
	;			;
	; for (unsigned i=0; i<size; i++)			; for (unsigned i=0; i<size; i++)
	; if (trigger[i] && (in[i] != 0))			; if (trigger[i] && (in[i] != 0))
	; out[i] = (double) 0.5;			; out[i] = (double) 0.5;
	;}			;}

	;AVX512-LABEL: @foo8			;AVX512-LABEL: @foo8
	;AVX512: call <8 x i32 ()> @llvm.masked.load.v8p0f_i32f.p0v8p0f_i32f(<8 x i32 ()>* %			;AVX512: call <64 x i32 ()> @llvm.masked.load.v64p0f_i32f.p0v64p0f_i32f(<64 x i32 ()>* %
	;AVX512: call void @llvm.masked.store.v8f64.p0v8f64			;AVX512: call void @llvm.masked.store.v64f64.p0v64f64
	;AVX512: ret void			;AVX512: ret void

	define void @foo8(double* noalias %out, i32 ()** noalias %in, i8* noalias %trigger, i32 %size) #0 {			define void @foo8(double* noalias %out, i32 ()** noalias %in, i8* noalias %trigger, i32 %size) #0 {
	entry:			entry:
	%out.addr = alloca double*, align 8			%out.addr = alloca double*, align 8
	%in.addr = alloca i32 ()**, align 8			%in.addr = alloca i32 ()**, align 8
	%trigger.addr = alloca i8*, align 8			%trigger.addr = alloca i8*, align 8
	%size.addr = alloca i32, align 4			%size.addr = alloca i32, align 4

test/Transforms/LoopVectorize/X86/no_fpmath.ll

	; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks='loop-vectorize' -pass-remarks-missed='loop-vectorize' -pass-remarks-analysis='loop-vectorize' 2>&1 \| FileCheck %s			; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks='loop-vectorize' -pass-remarks-missed='loop-vectorize' -pass-remarks-analysis='loop-vectorize' 2>&1 \| FileCheck %s

	; CHECK: remark: no_fpmath.c:6:11: loop not vectorized: cannot prove it is safe to reorder floating-point operations			; CHECK: remark: no_fpmath.c:6:11: loop not vectorized: cannot prove it is safe to reorder floating-point operations
	; CHECK: remark: no_fpmath.c:6:14: loop not vectorized			; CHECK: remark: no_fpmath.c:6:14: loop not vectorized
	; CHECK: remark: no_fpmath.c:17:14: vectorized loop (vectorization width: 2, interleaved count: 2)			; CHECK: remark: no_fpmath.c:17:14: vectorized loop (vectorization width: 4, interleaved count: 2)

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.10.0"			target triple = "x86_64-apple-macosx10.10.0"

	; Function Attrs: nounwind readonly ssp uwtable			; Function Attrs: nounwind readonly ssp uwtable
	define double @cond_sum(i32* nocapture readonly %v, i32 %n) #0 !dbg !4 {			define double @cond_sum(i32* nocapture readonly %v, i32 %n) #0 !dbg !4 {
	entry:			entry:
	%cmp.7 = icmp sgt i32 %n, 0, !dbg !3			%cmp.7 = icmp sgt i32 %n, 0, !dbg !3

test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll

	; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks=loop-vectorize -pass-remarks-missed=loop-vectorize -pass-remarks-analysis=loop-vectorize -pass-remarks-with-hotness 2>&1 \| FileCheck %s			; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks=loop-vectorize -pass-remarks-missed=loop-vectorize -pass-remarks-analysis=loop-vectorize -pass-remarks-with-hotness 2>&1 \| FileCheck %s
	; RUN: opt < %s -passes=loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks=loop-vectorize -pass-remarks-missed=loop-vectorize -pass-remarks-analysis=loop-vectorize -pass-remarks-with-hotness 2>&1 \| FileCheck %s			; RUN: opt < %s -passes=loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks=loop-vectorize -pass-remarks-missed=loop-vectorize -pass-remarks-analysis=loop-vectorize -pass-remarks-with-hotness 2>&1 \| FileCheck %s

	; CHECK: remark: no_fpmath.c:6:11: loop not vectorized: cannot prove it is safe to reorder floating-point operations (hotness: 300)			; CHECK: remark: no_fpmath.c:6:11: loop not vectorized: cannot prove it is safe to reorder floating-point operations (hotness: 300)
	; CHECK: remark: no_fpmath.c:6:14: loop not vectorized			; CHECK: remark: no_fpmath.c:6:14: loop not vectorized
	; CHECK: remark: no_fpmath.c:17:14: vectorized loop (vectorization width: 2, interleaved count: 2) (hotness: 300)			; CHECK: remark: no_fpmath.c:17:14: vectorized loop (vectorization width: 4, interleaved count: 2) (hotness: 300)

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.10.0"			target triple = "x86_64-apple-macosx10.10.0"

	; Function Attrs: nounwind readonly ssp uwtable			; Function Attrs: nounwind readonly ssp uwtable
	define double @cond_sum(i32* nocapture readonly %v, i32 %n) #0 !dbg !4 !prof !29 {			define double @cond_sum(i32* nocapture readonly %v, i32 %n) #0 !dbg !4 !prof !29 {
	entry:			entry:
	%cmp.7 = icmp sgt i32 %n, 0, !dbg !3			%cmp.7 = icmp sgt i32 %n, 0, !dbg !3

test/Transforms/LoopVectorize/X86/reduction-crash.ll

	; RUN: opt -S -loop-vectorize -mcpu=prescott -disable-basicaa < %s \| FileCheck %s			; RUN: opt -S -loop-vectorize -mcpu=prescott -disable-basicaa < %s \| FileCheck %s

	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:32:32-n8:16:32-S128"			target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:32:32-n8:16:32-S128"
	target triple = "i386-apple-darwin"			target triple = "i386-apple-darwin"

	; PR15344			; PR15344
	define void @test1(float* nocapture %arg, i32 %arg1) nounwind {			define void @test1(float* nocapture %arg, i32 %arg1) nounwind {
	; CHECK-LABEL: @test1(			; CHECK-LABEL: @test1(
	; CHECK: preheader			; CHECK: preheader
	; CHECK: insertelement <2 x double> zeroinitializer, double %tmp, i32 0			; CHECK: insertelement <4 x double> zeroinitializer, double %tmp, i32 0
	; CHECK: vector.memcheck			; CHECK: vector.memcheck

	bb:			bb:
	br label %bb2			br label %bb2

	bb2: ; preds = %bb			bb2: ; preds = %bb
	%tmp = load double, double* null, align 8			%tmp = load double, double* null, align 8
	br i1 undef, label %bb3, label %bb12			br i1 undef, label %bb3, label %bb12

test/Transforms/LoopVectorize/X86/vectorization-remarks-loopid-dbg.ll

	; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks='loop-vectorize' 2>&1 | FileCheck -check-prefix=VECTORIZED %s
	; RUN: opt < %s -loop-vectorize -force-vector-width=1 -force-vector-interleave=4 -mtriple=x86_64-unknown-linux -S -pass-remarks='loop-vectorize' 2>&1 | FileCheck -check-prefix=UNROLLED %s
	; RUN: opt < %s -loop-vectorize -force-vector-width=1 -force-vector-interleave=1 -mtriple=x86_64-unknown-linux -S -pass-remarks-analysis='loop-vectorize' 2>&1 | FileCheck -check-prefix=NONE %s

	; RUN: llc < %s -mtriple x86_64-pc-linux-gnu -o - | FileCheck -check-prefix=DEBUG-OUTPUT %s
	; DEBUG-OUTPUT-NOT: .loc
	; DEBUG-OUTPUT-NOT: {{.*}}.debug_info

-	; VECTORIZED: remark: vectorization-remarks.c:17:8: vectorized loop (vectorization width: 4, interleaved count: 1)
+	; VECTORIZED: remark: vectorization-remarks.c:17:8: vectorized loop (vectorization width: 16, interleaved count: 1)
	; UNROLLED: remark: vectorization-remarks.c:17:8: interleaved loop (interleaved count: 4)
	; NONE: remark: vectorization-remarks.c:17:8: loop not vectorized: vectorization and interleaving are explicitly disabled, or vectorize width and interleave count are both set to 1

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

	define i32 @foo(i32 %n) #0 !dbg !4 {
	entry:
	%diff = alloca i32, align 4
	(57 lines not shown)

test/Transforms/LoopVectorize/X86/vectorization-remarks.ll

	; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks='loop-vectorize' 2>&1 | FileCheck -check-prefix=VECTORIZED %s
	; RUN: opt < %s -loop-vectorize -force-vector-width=1 -force-vector-interleave=4 -mtriple=x86_64-unknown-linux -S -pass-remarks='loop-vectorize' 2>&1 | FileCheck -check-prefix=UNROLLED %s
	; RUN: opt < %s -loop-vectorize -force-vector-width=1 -force-vector-interleave=1 -mtriple=x86_64-unknown-linux -S -pass-remarks-analysis='loop-vectorize' 2>&1 | FileCheck -check-prefix=NONE %s

	; RUN: llc < %s -mtriple x86_64-pc-linux-gnu -o - | FileCheck -check-prefix=DEBUG-OUTPUT %s
	; DEBUG-OUTPUT-NOT: .loc
	; DEBUG-OUTPUT-NOT: {{.*}}.debug_info

-	; VECTORIZED: remark: vectorization-remarks.c:17:8: vectorized loop (vectorization width: 4, interleaved count: 1)
+	; VECTORIZED: remark: vectorization-remarks.c:17:8: vectorized loop (vectorization width: 16, interleaved count: 1)
	; UNROLLED: remark: vectorization-remarks.c:17:8: interleaved loop (interleaved count: 4)
	; NONE: remark: vectorization-remarks.c:17:8: loop not vectorized: vectorization and interleaving are explicitly disabled, or vectorize width and interleave count are both set to 1

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

	define i32 @foo(i32 %n) #0 !dbg !4 {
	entry:
	%diff = alloca i32, align 4
	(56 lines not shown)
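
The only expectation that changes in the two remark tests above is the reported vectorization width, which goes from 4 to 16. A rough sketch of arithmetic that is consistent with that change follows. It assumes 128-bit vector registers and a loop whose widest element type is 32 bits and whose narrowest is 8 bits; the loop body is not shown here, so those sizes are assumptions. `lanesPerRegister` is a hypothetical helper for illustration only, not an LLVM API, and the real vectorizer still consults its cost model before settling on a width.

```cpp
// Hypothetical helper (not part of LLVM): how many elements of a given
// bit width fit into one vector register of a given bit width.
constexpr unsigned lanesPerRegister(unsigned RegisterBits, unsigned ElementBits) {
  return RegisterBits / ElementBits;
}

// Assumed sizes, chosen only because they match the widths in the remarks.
constexpr unsigned AssumedRegisterBits = 128; // e.g. a baseline SSE register
constexpr unsigned AssumedWidestBits = 32;    // e.g. an i32 accumulator
constexpr unsigned AssumedSmallestBits = 8;   // e.g. i8 loads

// Sizing the vector by the widest type gives the old expectation (width 4).
static_assert(lanesPerRegister(AssumedRegisterBits, AssumedWidestBits) == 4,
              "widest-type sizing");

// Sizing by the smallest type, which is what maximizing bandwidth aims for,
// gives the new expectation (width 16).
static_assert(lanesPerRegister(AssumedRegisterBits, AssumedSmallestBits) == 16,
              "smallest-type sizing");
```

Under these assumptions, the wider factor means each narrow load fills a whole register, at the cost of needing several registers for the wider intermediate values.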

Revision Contents

Diff 102597

lib/Transforms/Vectorize/LoopVectorize.cpp

test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll

test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll

test/Transforms/LoopVectorize/ARM/gcc-examples.ll

test/Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll

test/Transforms/LoopVectorize/X86/gcc-examples.ll

test/Transforms/LoopVectorize/X86/masked_load_store.ll

test/Transforms/LoopVectorize/X86/no_fpmath.ll

test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll

test/Transforms/LoopVectorize/X86/reduction-crash.ll

test/Transforms/LoopVectorize/X86/vectorization-remarks-loopid-dbg.ll

test/Transforms/LoopVectorize/X86/vectorization-remarks.ll
