This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable interleaved memory accesses by default
Closed, Public

Authored by mkuper on Oct 6 2016, 4:10 PM.

Details

Reviewers

spatel
RKSimon
delena
DavidKreitzer

Commits

rGb2443ed62bcf: [X86] Enable interleaved memory access by default
rL284779: [X86] Enable interleaved memory access by default

Summary

Following r283480, we should, hopefully, have no regressions from this.
If you want to give this a spin with your workloads before I commit, let me know.

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 73865. Oct 6 2016, 4:10 PM

mkuper retitled this revision from to [X86] Enable interleaved memory accesses by default.

mkuper updated this object.

mkuper added reviewers: delena, RKSimon, spatel, DavidKreitzer.

mkuper added subscribers: llvm-commits, zvi, Farhana and 2 others.

mkuper added inline comments. Oct 6 2016, 5:29 PM

test/Transforms/LoopVectorize/X86/interleaving.ll
Line 1 (On Diff #73865): Note that this test's only purpose is to verify that interleaved memory access is turned on for x86; it is not a test of interleaved memory access functionality. The general functional test is Transforms/LoopVectorize/interleaved-accesses.ll
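For readers unfamiliar with the transformation being enabled, here is a minimal C++ sketch of what the interleaving.ll test checks for. It is a hand-written model, not vectorizer output: the function names are made up for illustration, and the "wide load" and "shuffles" are plain array copies that mirror the `<8 x i32>` load and the `<0, 2, 4, 6>` / `<1, 3, 5, 7>` shuffle masks in the test's CHECK lines, assuming a vectorization factor of 4.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar form of the loop in the test: a[i] = b[2*i] + b[2*i + 1].
std::vector<int> sumPairsScalar(const std::vector<int> &b) {
  assert(b.size() % 2 == 0 && "expects an even number of elements");
  std::vector<int> a(b.size() / 2);
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = b[2 * i] + b[2 * i + 1];
  return a;
}

// Model of the interleaved form at VF = 4 (an assumption for illustration):
// one contiguous 8-element "wide load", two stride-2 "shuffles" that split
// it into even and odd elements, then one 4-wide add.
std::vector<int> sumPairsInterleaved(const std::vector<int> &b) {
  assert(b.size() % 2 == 0 && "expects an even number of elements");
  constexpr std::size_t VF = 4;
  std::vector<int> a(b.size() / 2);
  std::size_t i = 0;
  for (; i + VF <= a.size(); i += VF) {
    std::array<int, 2 * VF> wide;
    for (std::size_t k = 0; k < 2 * VF; ++k)
      wide[k] = b[2 * i + k];          // contiguous wide load
    std::array<int, VF> even, odd;
    for (std::size_t k = 0; k < VF; ++k) {
      even[k] = wide[2 * k];           // shuffle mask <0, 2, 4, 6>
      odd[k] = wide[2 * k + 1];        // shuffle mask <1, 3, 5, 7>
    }
    for (std::size_t k = 0; k < VF; ++k)
      a[i + k] = odd[k] + even[k];     // 4-wide add
  }
  for (; i < a.size(); ++i)            // scalar remainder
    a[i] = b[2 * i] + b[2 * i + 1];
  return a;
}
```

Both functions should return the same result for any even-length input; the second only changes how the elements of `b` are fetched, which is where the shuffle-lowering concerns discussed in this review come from.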

Any performance test gains/regressions?

In D25350#566175, @RKSimon wrote:

Any performance test gains/regressions?

We've had performance gains on internal benchmarks. (And, anecdotally, a performance regression, which turned out to be a gain - we've started vectorizing a loop we were not vectorizing before, and really should have better performance when vectorized - but the dynamic loop count on that loop tends to be 1 or 2...)
As far as I know Intel also had performance gains on the benchmarks they run, but I'll let Ayal/Elena/Dorit speak for themselves.
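The trip-count anecdote above can be made concrete with a small sketch. This is a simplified model, not how the vectorizer actually lays out its runtime checks: `LoopStats` and `splitTripCount` are hypothetical names, and the model assumes a vector body of width VF followed by a scalar remainder loop.

```cpp
#include <cassert>
#include <cstddef>

struct LoopStats {
  std::size_t vectorIterations;  // iterations of the VF-wide body
  std::size_t scalarIterations;  // iterations left to the scalar remainder
};

// Hypothetical helper: splits a dynamic trip count the way a vectorized loop
// with a scalar remainder would, for a given vectorization factor.
LoopStats splitTripCount(std::size_t tripCount, std::size_t vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  return {tripCount / vf, tripCount % vf};
}
```

Under this model, `splitTripCount(2, 4)` yields zero vector iterations and two scalar ones, so a loop whose trip count is 1 or 2 pays for the extra checks and code size without ever reaching the vector body.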

As to public benchmarks - I don't have an up-to-date SPEC run with this unfortunately. I can run one if you want - unless Intel already happen to have the results handy?

Anyway, if you have internal benchmarks you want to run this on pre-commit, please do - codegen isn't necessarily happy with the shuffle sequences this generates. We've had some really bad regressions initially, which led to r283480. I don't expect any more big surprises, since the x86 cost model is still *really* conservative w.r.t interleaved memory accesses (teaching it to be more precise is a separate issue, and probably ties in with Farhana's work in D24681), but who knows.

mkuper added a subscriber: dorit. Oct 10 2016, 8:25 AM

SPEC2006 looks flat.

We're seeing some minor but consistent regressions (< 1%) on some internal tests.

In D25350#569327, @RKSimon wrote:

We're seeing some minor but consistent regressions (< 1%) on some internal tests.

That's really interesting - that means it's firing. :-)

Can you provide a reproducer? I can see basically three possible sources of regressions:

1. We're still doing very bad lowering for some shuffle sequences.
2. The cost model is too optimistic regarding the cost of the sequence, even with optimal lowering.
3. "Normal" vectorizer flakiness - the cost model is doing the right thing about the interleave sequence, but we shouldn't be vectorizing regardless. So without interleaving enabled we were just getting lucky.

I really hope what you're seeing is a case of (1).

In D25350#569385, @mkuper wrote:

In D25350#569327, @RKSimon wrote:

We're seeing some minor but consistent regressions (< 1%) on some internal tests.

That's really interesting - that means it's firing. :-)

Can you provide a reproducer? I can see basically three possible sources of regressions:

We're trying to drill down to the codegen diffs - it's a large codebase and is putting up a fight... The regression shrank when we enabled LTO and/or PGO.

Hi Michael,

I thought the plan was to avoid interleaved vectorization when targeting Atom. Did that plan change or get handled in some other way?

In D25350#569425, @DavidKreitzer wrote:

Hi Michael,

I thought the plan was to avoid interleaved vectorization when targeting Atom. Did that plan change or get handled in some other way?

No, this is my mistake, sorry about that.
I'll update the patch, thanks.

Updated to disable this on Atom.
Intel are still looking into why they're getting regressions on Atom - for the same code that performs much better on Intel big cores.

Hi Michael,

We see a few eembc (mp2decode on SLM) and coremark regressions on the order of 3-6%, and a couple of 3-6% geekbench (HSW) regressions, which we can follow up on.

In addition to this, we see significant (60%+) gains in denbench/rgb tests on HSW, which are nice.

Performance-wise, I think the change looks good to us.

I can give you (or anyone else) more specifics on the gains/regressions if interested.

Thanks,
Zia.

Thanks Zia!

As I've discussed with Ayal, the SLM-specific regressions should probably be looked at separately - but I'd appreciate more details on the HSW regressions.

Michael

Simon, any news on your end?

In D25350#574974, @mkuper wrote:

Simon, any news on your end?

So looking through the before + after code we're seeing 2 types of diff:

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.). This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind) and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.
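No reproducer for the first item was posted in this thread, so the following is only a guess at the general shape of the pattern described: 16 integers read from non-contiguous positions, clamped, and summed. The function name, the index table, and the clamp bounds are all hypothetical; the comments note which x86 instructions the description associates with each step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical kernel: gather 16 integers from sparse positions, clamp each
// to [lo, hi], and accumulate the sum (a horizontal reduction).
int clampedSparseSum(const int *data, const std::size_t (&idx)[16],
                     int lo, int hi) {
  assert(lo <= hi && "clamp bounds must be ordered");
  int sum = 0;
  for (std::size_t k = 0; k < 16; ++k) {
    int v = data[idx[k]];   // sparse load (vpinsrd build-vector when vectorized)
    v = std::max(v, lo);    // lower clamp (pmaxsd)
    v = std::min(v, hi);    // upper clamp (pminsd)
    sum += v;               // reduction (hadd when vectorized)
  }
  return sum;
}
```

A kernel like this has 16 independent loads and a single reduction, which is why it can be handled as 4 x v4i32 when vectorized, and why losing that vectorization leaves 16 scalar clamp-and-add chains.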

Thanks for investigating this, Simon!

In D25350#575481, @RKSimon wrote:

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.).

Is this AVX or AVX2? I mean, do we get this just because of having to perform integer ops on xmms, or is this just part of the resulting shuffle sequence?

This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

We should be able to create one and address this as a follow up issue.

2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.).

Is this AVX or AVX2? I mean, do we get this just because of having to perform integer ops on xmms, or is this just part of the resulting shuffle sequence?

This is AVX1 on a Jaguar CPU - so internally it's a 128-bit ALU that double pumps ymm instructions. It can be sensitive to large amounts of dependent ymm code like this.

This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

I have a possible shuffle patch that covers both of these, but haven't had time to finish it - it's a rewrite of lowerVectorShuffleByMerging128BitLanes that acts a bit like lowerShuffleAsRepeatedMaskAndLanePermute but in reverse (multiple input lane permute followed by repeated mask).

and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

Sorry, I meant such a change could fix the regression - and possibly allow a great deal more shuffle folding. It'll be a fine balance as to when to let it fire though.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

Yes, I think I'm happy for this to go ahead. The regression areas we can work on afterward; most can be solved during lowering and are existing issues - it's just that interleaving makes them a little more obvious!

In D25350#575681, @RKSimon wrote:

1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

We should be able to create one and address this as a follow up issue.

Great, please CC me on the PR when you have one, I'll look into it.

This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

I have a possible shuffle patch that covers both of these, but haven't had time to finish it - it's a rewrite of lowerVectorShuffleByMerging128BitLanes that acts a bit like lowerShuffleAsRepeatedMaskAndLanePermute but in reverse (multiple input lane permute followed by repeated mask).

Sounds good, feel free to add me to the review.

and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

Sorry, I meant such a change could fix the regression - and possibly allow a great deal more shuffle folding. It'll be a fine balance as to when to let it fire though.

Sorry, I wasn't clear - I understood what you meant. I'm just confused about why we're producing this pattern.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

Yes, I think I'm happy for this to go ahead. The regression areas we can work on afterward; most can be solved during lowering and are existing issues - it's just that interleaving makes them a little more obvious!

Of course this doesn't introduce CG issues, it was just a question of which of the ones it exposes we "prefetch" vs. which ones we handle later. :)

Anyway, this sounds good to me, thanks a lot.
I'm going to push this in. If we see significant regressions coming from other people, I'm perfectly ok with reverting and reapplying after they're fixed.

LGTM

This revision is now accepted and ready to land. Oct 20 2016, 1:51 PM

Closed by commit rL284779: [X86] Enable interleaved memory access by default (authored by mkuper). Oct 20 2016, 2:13 PM

This revision was automatically updated to reflect the committed changes.

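Revision Contents

Path                                                                Size
llvm/trunk/lib/Target/X86/X86TargetTransformInfo.h                  2 lines
llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp                7 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/cost-model.ll          2 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/gather_scatter.ll      16 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/interleaving.ll        35 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/masked_load_store.ll   4 lines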

Diff 75353

llvm/trunk/lib/Target/X86/X86TargetTransformInfo.h

[87 lines not shown] public:
    int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                      Type *Ty);
    bool isLegalMaskedLoad(Type *DataType);
    bool isLegalMaskedStore(Type *DataType);
    bool isLegalMaskedGather(Type *DataType);
    bool isLegalMaskedScatter(Type *DataType);
    bool areInlineCompatible(const Function *Caller,
                             const Function *Callee) const;

+   bool enableInterleavedAccessVectorization();
  private:
    int getGSScalarCost(unsigned Opcode, Type *DataTy, bool VariableMask,
                        unsigned Alignment, unsigned AddressSpace);
    int getGSVectorCost(unsigned Opcode, Type *DataTy, Value *Ptr,
                        unsigned Alignment, unsigned AddressSpace);

    /// @}
  };

  } // end namespace llvm

  #endif

llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp

[1,761 lines not shown] bool X86TTIImpl::areInlineCompatible(const Function *Caller,
    const FeatureBitset &CalleeBits =
        TM.getSubtargetImpl(*Callee)->getFeatureBits();

    // FIXME: This is likely too limiting as it will include subtarget features
    // that we might not care about for inlining, but it is conservatively
    // correct.
    return (CallerBits & CalleeBits) == CalleeBits;
  }

+ bool X86TTIImpl::enableInterleavedAccessVectorization() {
+   // TODO: We expect this to be beneficial regardless of arch,
+   // but there are currently some unexplained performance artifacts on Atom.
+   // As a temporary solution, disable on Atom.
+   return !(ST->isAtom() || ST->isSLM());
+ }

llvm/trunk/test/Transforms/LoopVectorize/X86/cost-model.ll

[61 lines not shown] for:
    %indvars.iv = phi i64 [ 0, %preheader ], [ %indvars.iv.next, %for ]
    %s.02 = phi float [ 0.0, %preheader ], [ %add4, %for ]
    %arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
    %t1 = load float, float* %arrayidx, align 4
    %arrayidx3 = getelementptr inbounds float, float* %b, i64 %indvars.iv
    %t2 = load float, float* %arrayidx3, align 4
    %add = fadd fast float %t1, %s.02
    %add4 = fadd fast float %add, %t2
-   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 8
+   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 32
    %cmp1 = icmp slt i64 %indvars.iv.next, %t0
    br i1 %cmp1, label %for, label %loopexit

  loopexit:
    %add4.lcssa = phi float [ %add4, %for ]
    br label %for.end

  for.end:
    %s.0.lcssa = phi float [ 0.0, %entry ], [ %add4.lcssa, %loopexit ]
    ret float %s.0.lcssa
  }

llvm/trunk/test/Transforms/LoopVectorize/X86/gather_scatter.ll

[79 lines not shown]

  for.end: ; preds = %for.cond
    ret void
  }

  ; The source code
  ;void foo2 (In * __restrict__ in, float * __restrict__ out, int * __restrict__ trigger) {
  ;
- ; for (int i=0; i<SIZE; ++i) {
+ ; for (int i=0; i<SIZE; i += 16) {
  ; if (trigger[i] > 0) {
  ; out[i] = in[i].b + (float) 0.5;
  ; }
  ; }
  ;}

  %struct.In = type { float, float }

  ;AVX512-LABEL: @foo2
- ;AVX512: getelementptr inbounds %struct.In, %struct.In* %in, <16 x i64> %{{.*}}, i32 1
+ ;AVX512: getelementptr inbounds %struct.In, %struct.In* %in, <16 x i64> {{.*}}, i32 1
  ;AVX512: llvm.masked.gather.v16f32
- ;AVX512: llvm.masked.store.v16f32
+ ;AVX512: llvm.masked.scatter.v16f32
  ;AVX512: ret void
  define void @foo2(%struct.In* noalias %in, float* noalias %out, i32* noalias %trigger, i32* noalias %index) #0 {
  entry:
    %in.addr = alloca %struct.In*, align 8
    %out.addr = alloca float*, align 8
    %trigger.addr = alloca i32*, align 8
    %index.addr = alloca i32*, align 8
    %i = alloca i32, align 4
[33 lines not shown] if.then: ; preds = %for.body
    store float %add, float* %arrayidx5, align 4
    br label %if.end

  if.end: ; preds = %if.then, %for.body
    br label %for.inc

  for.inc: ; preds = %if.end
    %9 = load i32, i32* %i, align 4
-   %inc = add nsw i32 %9, 1
+   %inc = add nsw i32 %9, 16
    store i32 %inc, i32* %i, align 4
    br label %for.cond

  for.end: ; preds = %for.cond
    ret void
  }

  ; The source code
  ;struct Out {
  ; float a;
  ; float b;
  ;};
  ;void foo3 (In * __restrict__ in, Out * __restrict__ out, int * __restrict__ trigger) {
  ;
- ; for (int i=0; i<SIZE; ++i) {
+ ; for (int i=0; i<SIZE; i += 16) {
  ; if (trigger[i] > 0) {
  ; out[i].b = in[i].b + (float) 0.5;
  ; }
  ; }
  ;}

  ;AVX512-LABEL: @foo3
- ;AVX512: getelementptr inbounds %struct.In, %struct.In* %in, <16 x i64> %{{.*}}, i32 1
+ ;AVX512: getelementptr inbounds %struct.In, %struct.In* %in, <16 x i64> {{.*}}, i32 1
  ;AVX512: llvm.masked.gather.v16f32
  ;AVX512: fadd <16 x float>
- ;AVX512: getelementptr inbounds %struct.Out, %struct.Out* %out, <16 x i64> %{{.*}}, i32 1
+ ;AVX512: getelementptr inbounds %struct.Out, %struct.Out* %out, <16 x i64> {{.*}}, i32 1
  ;AVX512: llvm.masked.scatter.v16f32
  ;AVX512: ret void

  %struct.Out = type { float, float }

  define void @foo3(%struct.In* noalias %in, %struct.Out* noalias %out, i32* noalias %trigger) {
  entry:
    %in.addr = alloca %struct.In*, align 8
[36 lines not shown] if.then: ; preds = %for.body
    store float %add, float* %b6, align 4
    br label %if.end

  if.end: ; preds = %if.then, %for.body
    br label %for.inc

  for.inc: ; preds = %if.end
    %9 = load i32, i32* %i, align 4
-   %inc = add nsw i32 %9, 1
+   %inc = add nsw i32 %9, 16
    store i32 %inc, i32* %i, align 4
    br label %for.cond

  for.end: ; preds = %for.cond
    ret void
  }
  declare void @llvm.masked.scatter.v16f32(<16 x float>, <16 x float*>, i32, <16 x i1>)

llvm/trunk/test/Transforms/LoopVectorize/X86/interleaving.ll

+ ; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine < %s | FileCheck %s --check-prefix=NORMAL
+ ; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=atom < %s | FileCheck %s --check-prefix=ATOM
+
+ ; NORMAL-LABEL: foo
+ ; NORMAL: %[[WIDE:.*]] = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
+ ; NORMAL: %[[STRIDED1:.*]] = shufflevector <8 x i32> %[[WIDE]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+ ; NORMAL: %[[STRIDED2:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+ ; NORMAL: add nsw <4 x i32> %[[STRIDED2]], %[[STRIDED1]]
+
+ ; ATOM-LABEL: foo
+ ; ATOM: load i32
+ ; ATOM: load i32
+ ; ATOM: store i32
+ define void @foo(i32* noalias nocapture %a, i32* noalias nocapture readonly %b) {
+ entry:
+   br label %for.body
+
+ for.cond.cleanup: ; preds = %for.body
+   ret void
+
+ for.body: ; preds = %for.body, %entry
+   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+   %0 = shl nsw i64 %indvars.iv, 1
+   %arrayidx = getelementptr inbounds i32, i32* %b, i64 %0
+   %1 = load i32, i32* %arrayidx, align 4
+   %2 = or i64 %0, 1
+   %arrayidx3 = getelementptr inbounds i32, i32* %b, i64 %2
+   %3 = load i32, i32* %arrayidx3, align 4
+   %add4 = add nsw i32 %3, %1
+   %arrayidx6 = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
+   store i32 %add4, i32* %arrayidx6, align 4
+   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+   %exitcond = icmp eq i64 %indvars.iv.next, 1024
+   br i1 %exitcond, label %for.cond.cleanup, label %for.body
+ }

llvm/trunk/test/Transforms/LoopVectorize/X86/masked_load_store.ll

[335 lines not shown]
  for.end: ; preds = %for.cond
    ret void
  }

  ; The source code:
  ;
  ;void foo4(double *A, double *B, int *trigger) {
  ;
- ; for (int i=0; i<10000; i++) {
+ ; for (int i=0; i<10000; i += 16) {
  ; if (trigger[i] < 100) {
  ; A[i] = B[i*2] + trigger[i]; << non-cosecutive access
  ; }
  ; }
  ;}

  ;AVX-LABEL: @foo4
  ;AVX-NOT: llvm.masked
[52 lines not shown] if.then: ; preds = %for.body
    store double %add, double* %arrayidx7, align 8
    br label %if.end

  if.end: ; preds = %if.then, %for.body
    br label %for.inc

  for.inc: ; preds = %if.end
    %12 = load i32, i32* %i, align 4
-   %inc = add nsw i32 %12, 1
+   %inc = add nsw i32 %12, 16
    store i32 %inc, i32* %i, align 4
    br label %for.cond

  for.end: ; preds = %for.cond
    ret void
  }

  @a = common global [1 x i32*] zeroinitializer, align 8
[297 lines not shown]