This is an archive of the discontinued LLVM Phabricator instance.

Differential D66688

[LoopVectorize] Leverage speculation safety to avoid masked.loads
ClosedPublic

Authored by reames on Aug 23 2019, 3:03 PM.


Details

Reviewers

hsaito
Ayal

Commits

rG7403569be751: [LoopVectorize] Leverage speculation safety to avoid masked.loads
rL371452: [LoopVectorize] Leverage speculation safety to avoid masked.loads

Summary

This is my first real patch to the loop vectorizer. There's some obvious room for API cleanup in the patch, but before I iterate on that, I want to make sure I'm approaching the problem the "right way" and that my attempt at clarifying comments are actually correct. :)

The intention of the patch is to avoid generating masked loads for cases where we can simply speculate the load, and then ignore the results. This is a building block on the path to a longer term goal, which is to eventually support early exits in the vectorizer. (But that's out of scope for the moment.)
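The idea in the summary can be modeled one vector lane at a time in plain C++. This is a hedged sketch rather than code from the patch: `maskedLoadLane` and `speculatedLoadLane` are hypothetical names, and the array stands in for memory that is assumed to be dereferenceable at every index the loop touches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one lane of a masked load: memory is touched only
// when the mask bit is set; otherwise the pass-through value is produced.
int32_t maskedLoadLane(const int32_t *base, std::size_t i, bool mask,
                       int32_t passthru) {
  assert(base != nullptr);
  return mask ? base[i] : passthru;
}

// Hypothetical model of the speculated form: the load always executes, which
// is only legal when base[i] is known to be dereferenceable. A select then
// ignores the loaded value for lanes whose predicate is false.
int32_t speculatedLoadLane(const int32_t *base, std::size_t i, bool mask,
                           int32_t passthru) {
  assert(base != nullptr);
  const int32_t loaded = base[i];  // unconditional load
  return mask ? loaded : passthru; // select on the original predicate
}
```

Both functions return the same value for every input; the only difference is whether memory is read when the predicate is false, which is why the second form needs the dereferenceability proof.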

Diff Detail

Repository: rL LLVM

Event Timeline

reames created this revision. Aug 23 2019, 3:03 PM

Herald added a project: Restricted Project. Aug 23 2019, 3:03 PM

Herald added subscribers: bollu, mcrosier.

lebedev.ri added a subscriber: lebedev.ri. Aug 23 2019, 11:09 PM

lebedev.ri added inline comments.

test/Transforms/LoopVectorize/load-deref-pred.ll
1 ↗	(On Diff #216963)	Precommit?

vivekvpandya added inline comments. Aug 24 2019, 3:39 AM

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
918 ↗	(On Diff #216963)	Minor: loan -> load
975 ↗	(On Diff #216963)	Minor: loop loop -> loop or you mean loop, loop header

xbolva00 added a subscriber: xbolva00. Aug 24 2019, 4:13 AM

xbolva00 added inline comments.

test/Transforms/LoopVectorize/load-deref-pred.ll
17 ↗	(On Diff #216963)	`br i1 false`: the vectorizer should clean this up a bit.

rscottmanley added a subscriber: rscottmanley. Aug 25 2019, 8:23 AM

There's some obvious room for API cleanup in the patch, but before I iterate on that, I want to make sure I'm approaching the problem the "right way" and that my attempt at clarifying comments are actually correct. :)

Yes, this approach of refining SafePointe[r]s, originally designed to express dereferenceability, LGTM.

It may be better to place isSafeToLoadUnconditionallyInLoop() in Load.{h,cpp} rather than in LoopVectorizationLegality.cpp; this needs to also work with FoldTail / IsAnnotatedParallel; and not sure about extending the load reasoning to stores etc., but all those can be part of "API cleanup".

Added some minor comments.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
20 ↗	(On Diff #216963)	nit: lex ordering (the above admittedly sets an unordered example..)
950 ↗	(On Diff #216963)	I'm a bit confused by Loads.cpp's comment, wrt `Size`: /// This uses the pointee type to determine how many bytes need to be safe to /// load from the pointer.
976 ↗	(On Diff #216963)	Definitely better explanation, committable irrespective of this patch as NFC.
981 ↗	(On Diff #216963)	The original if blockNeedsPredication(BB) condition was an early-continue; now more suitable that its complementary condition be an early-continue instead. OTOH, can fuse the two "for (Instruction &I : BB) if (auto Ptr = ...)" loops together, with an early-continue if (!blockNeedsPredication(BB)) inside - or w/o it - in case isSafeToSpeculativelyExecute() will return true just as fast(?). (Can check if Ptr is already in SafePointe[r]s to retain an early-continue ;-).
982 ↗	(On Diff #216963)	typo: a[n] address
990 ↗	(On Diff #216963)	Sure, we're not trying to hoist/LICM the load out of the loop.
993 ↗	(On Diff #216963)	Does this "generic version" add anything, given that it returns false for stores, and does it account for the addresses accessed along all iterations of the loop?
1008 ↗	(On Diff #216963)	Fold these conditions (of the generic version) into isSafeToLoadUnconditionallyInLoop()?
test/Transforms/LoopVectorize/load-deref-pred.ll
17 ↗	(On Diff #216963)	Vectorizer leaves it and others like it for subsequent passes to clean up.
89 ↗	(On Diff #216963)	This "for (i=0; i<4096; ++i) {val = 0; if (i < len) val = a[i]; acc += val}" loop is indeed probably a minimal example of a speculative load. (It does deserve a better optimization though - that of trimming the loop bound ;-)
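The loop quoted in the last comment can be written out as a small C++ sketch. The function names are hypothetical, the accumulator is widened to `int64_t` only to keep the sketch free of signed overflow, and `a` is assumed to point at 4096 or more readable elements, which is the precondition that makes the speculated form legal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr int64_t kTripCount = 4096;

// The loop as described in the comment: the load is guarded by i < len.
int64_t sumGuarded(const int32_t *a, int64_t len) {
  assert(a != nullptr);
  int64_t acc = 0;
  for (int64_t i = 0; i < kTripCount; ++i) {
    int32_t val = 0;
    if (i < len)
      val = a[i];
    acc += val;
  }
  return acc;
}

// Speculated form: the load runs on every iteration and a select discards
// the value when the guard is false. Assumes a[0..4095] is dereferenceable.
int64_t sumSpeculated(const int32_t *a, int64_t len) {
  assert(a != nullptr);
  int64_t acc = 0;
  for (int64_t i = 0; i < kTripCount; ++i) {
    const int32_t loaded = a[i];
    acc += (i < len) ? loaded : 0;
  }
  return acc;
}

// The "trim the loop bound" alternative mentioned in the comment: iterate
// only over the indices for which the guard is true.
int64_t sumTrimmed(const int32_t *a, int64_t len) {
  assert(a != nullptr);
  const int64_t n = std::max<int64_t>(0, std::min<int64_t>(len, kTripCount));
  int64_t acc = 0;
  for (int64_t i = 0; i < n; ++i)
    acc += a[i];
  return acc;
}
```

All three compute the same sum; they differ in which elements of `a` are read, so only the first and third are safe when fewer than 4096 elements are readable.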

Thanks for the patch, looks good with the outstanding comments addressed. Additionally, it would be great to have a few more negative test cases (ptr operand not safe to speculatively execute, dynamic bounds, access in predicated block outside of loop).

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
997 ↗	(On Diff #216963)	We only use the generic version (isSafeToSpeculativelyExecute) for pointer operands, right?
1020 ↗	(On Diff #216963)	This is not needed, right?

This is a building block on the path to a longer term goal, which is to eventually support early exits in the vectorizer. (But that's out of scope for the moment.)

Is there anything more you can share on this in terms of design plans?

reames mentioned this in rG2de97888155c: Preland test cases for D66688 to make diffs clear. Aug 26 2019, 1:41 PM

Rebase over landed tests (including a bunch of negative ones).

Note that this reveals a bug in the patch. Will fix, and iterate from there.

Expect several rounds of incremental improvements before ready for real review.

reames planned changes to this revision. Aug 26 2019, 1:42 PM

reames mentioned this in rL369959: Preland test cases for D66688 to make diffs clear. Aug 26 2019, 1:44 PM

reames marked 4 inline comments as done. Aug 26 2019, 1:48 PM

reames mentioned this in rL369962: Add a clarify comment for meaning of SafePointes [NFC]. Aug 26 2019, 1:52 PM

reames mentioned this in rGcf3b55597395: Add a clarify comment for meaning of SafePointes [NFC].

xbolva00 added inline comments. Aug 26 2019, 1:58 PM

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1006 ↗	(On Diff #217233)	Maybe also SanitizeMemory?

In D66688#1645843, @reames wrote:

Rebase over landed tests (including a bunch of negative ones).

Note that this reveals a bug in the patch. Will fix, and iterate from there.

Expect several rounds of incremental improvements before ready for real review.

Sure. Additional points of thought, possibly as separate follow-up patches:

Interleave groups may also be affected, e.g., two speculative interleaved loads from distinct basic blocks could potentially be combined; useMaskedInterleavedAccesses() deserves attention, as raised by test_non_unit_stride().
Perhaps caching SafePointers(InLoop) could be offloaded altogether from LV to Load.cpp.

Rebase over landed changes, functional issue identified in last iteration was fixed in rL370102.

Starting style cleanup, now passing all vectorize tests.

Still more work planned.

Actually ready for review now. I think I've covered all the corner cases and the structure of the code should be more reasonable.

I left in two todos about moving code. As a matter of practicality, I'd prefer to land as is, then do NFC moves in a follow up. Keeps the rebuild times much more manageable.

p.s. There was one comment in previous review about needing to handle the interleave case. I'm interpreting that as a request for further optimization (i.e. broadened scope) and have left it for future work. If it was actually a correctness concern, then please explain more.

ping

Herald added a subscriber: ychen. Sep 6 2019, 9:28 AM

xbolva00 added inline comments. Sep 6 2019, 9:32 AM

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
976 ↗	(On Diff #218190)	Also SanitizeMemory?

reames marked an inline comment as done. Sep 6 2019, 3:49 PM

reames added inline comments.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
976 ↗	(On Diff #218190)	If so, as a separate change. This exactly matches the list in ValueTracking.

This LGTM, just some minor nits.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
918 ↗	(On Diff #218190)	/// Return true if we can [prove] that the given
928 ↗	(On Diff #218190)	What about loads from uniform addresses? If desired their unmasking can be delayed to a later patch, worth leaving a TODO.
935 ↗	(On Diff #218190)	indent "LoadSize"? This refers to the memory size of a single element.
936 ↗	(On Diff #218190)	don't touch every byte >> have gaps
942 ↗	(On Diff #218190)	It would be great to handle symbolic trip counts known to be sufficiently small.
945 ↗	(On Diff #218190)	"Size"? This refers to the entire memory interval accessed by all iterations of the loop. Set Size = LoadSize * TC? Handling gaps would imply (only?) setting Size accordingly, right? That may be useful for unmasking gathers / interleave-groups.
948 ↗	(On Diff #218190)	StartS must be LoopInvariant, iiuc. ("All operands of an AddRec are required to be loop invariant.") This method is currently employed only by the innermost loop vectorizer; if/when considered by outer loop vectorization, there will be more than a single loop to look after; i.e., StartS may also be an AddRec to check.
957 ↗	(On Diff #218190)	"For the moment": as in TODO? "the access size": of a single element. Do we (need to) check that "the base is aligned"?
968 ↗	(On Diff #218190)	Can first refactor ValueTracking.h/cpp's isSafeToSpeculativelyExecute().
969 ↗	(On Diff #218190)	mustSuppressSpecula[ta]tion
1007 ↗	(On Diff #218190)	a[n] address
1009 ↗	(On Diff #218190)	For the moment - as in TODO?
1020 ↗	(On Diff #218190)	If DL is left to be computed by callee rather than caller, can simplify into: if (LI && !mustSuppressSpeculation(LI) && isDereferenceableAndAlignedInLoop(LI, TheLoop, SE, DT)) SafePointes.insert(LI->getPointerOperand());
20 ↗	(On Diff #216963)	ValueTracking.h < VectorUtils.h
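The conditions discussed in the comments on lines 935 to 957 (unit stride in elements, a constant trip count, total size = element size times trip count, alignment) can be summarized with a standalone model. This is a sketch under stated assumptions, not the LLVM implementation: the struct, its field names, and `isDerefAndAlignedInLoopModel` are hypothetical, plain integers replace `APInt` and SCEV, and the final comparison is only a stand-in for the real `isDereferenceableAndAlignedPointer` query.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the facts the legality check reasons about.
struct LoadInLoop {
  uint64_t eltSize;        // store size of the loaded type, in bytes
  int64_t step;            // constant byte stride of the address per iteration
  uint64_t tripCount;      // constant trip count, 0 if not known
  uint64_t loadAlign;      // alignment required by the load, in bytes
  uint64_t baseDerefBytes; // bytes known dereferenceable at the base pointer
  uint64_t baseAlign;      // known alignment of the base pointer, in bytes
};

bool isDerefAndAlignedInLoopModel(const LoadInLoop &L) {
  assert(L.loadAlign != 0 && "alignment must be resolved before the check");
  // Consecutive accesses only: no gaps and no reverse iteration.
  if (L.step < 0 || static_cast<uint64_t>(L.step) != L.eltSize)
    return false;
  // Symbolic trip counts are not handled.
  if (L.tripCount == 0)
    return false;
  // Every element stays aligned only if the element size is a multiple of
  // the requested alignment (and the base itself is aligned, checked below).
  if (L.eltSize % L.loadAlign != 0)
    return false;
  // Whole interval touched by all iterations (overflow ignored in this sketch).
  const uint64_t accessSize = L.tripCount * L.eltSize;
  // Stand-in for the dereferenceable-and-aligned query on the base pointer.
  return L.baseDerefBytes >= accessSize && L.baseAlign % L.loadAlign == 0;
}
```

For example, an `i32` load with a 4-byte stride, 4096 iterations, and a 4-byte-aligned base known to be dereferenceable for 16384 bytes passes the model, while the same load with an 8-byte stride does not.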

This revision was not accepted when it landed; it landed in state Needs Review. Sep 9 2019, 1:56 PM

Closed by commit rL371452: [LoopVectorize] Leverage speculation safety to avoid masked.loads (authored by reames).

This revision was automatically updated to reflect the committed changes.

reames marked an inline comment as done.

In D66688#1653401, @reames wrote:

...
p.s. There was one comment in previous review about needing to handle the interleave case. I'm interpreting that as a request for further optimization (i.e. broadened scope) and have left it for future work. If it was actually a correctness concern, then please explain more.

Sorry for late response. This was indeed referring to potential for further optimizations, such as:

Current logic for handling an interleave group (of loads) with gaps at the end is to ensure there's at least one scalar iteration following all vector iterations. This may result in running VF*UF scalar iterations instead of a single last vector/unrolled iteration. With the ability to "close this gap", i.e. to prove that all missing elements of the last vector iteration can be loaded speculatively, this can be improved.
Current logic requires all members of an interleave group to belong to same basic-block. Unmasking loads should help facilitate forming interleave groups across distinct basic-blocks.

xbolva00 added inline comments. Sep 15 2019, 3:13 AM

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1017	Do you plan to work on stores?

MaskRay added a subscriber: MaskRay. Sep 14 2020, 4:36 PM

MaskRay added inline comments.

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977	Is this isUnordered or isSimple? IIUC isSimple is a subset of isUnordered.

Herald added a subscriber: dantrushin. Sep 14 2020, 4:36 PM

MaskRay mentioned this in D87538: [VectorCombine] Don't vectorize scalar load under asan/hwasan/memtag/tsan. Sep 14 2020, 4:36 PM

reames added inline comments. Sep 15 2020, 11:36 AM

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
957 ↗	(On Diff #218190)	"the base is aligned" is checked within the called function a line below.
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977	Well, it's written as intended, but that might not be what you're trying to ask? Do you have an example you're wondering about?
1017	I am not going to get back to this any time soon.

MaskRay added inline comments. Sep 15 2020, 11:42 AM

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977	In D87538, `VectorCombine::vectorizeLoadInsert` uses the utility to suppress some cases. It uses `Load->isSimple()`, but I don't know whether it should use isUnordered or whether mustSuppressSpeculation should use isSimple. I'd appreciate it if you could take a look at that piece of code (you can ignore most of the complexity there; the main thing is that it is a load widening).

reames added inline comments. Sep 15 2020, 11:53 AM

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977	It is safe to speculate an unordered atomic load. We're conservative in a bunch of places in the optimizer by using isSimple where isUnordered is legal, please don't add new ones if you can avoid it.

reames added inline comments. Sep 15 2020, 11:55 AM

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977	In general, it is not safe to combine arbitrary atomic loads as we may not be able to lower a wider atomic. The code in InstCombine you reference also looks correct, but I don't really see the connection between the two reviews. They're reasoning about different aspects of legality.
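The relationship between the two predicates in this thread can be illustrated with a simplified model. This is a sketch, not LLVM code: the `Ordering` enum and `LoadModel` struct are hypothetical stand-ins that only mirror the names of LLVM's atomic orderings, and the predicate bodies reflect my understanding of `LoadInst::isSimple()` and `LoadInst::isUnordered()`, which should be checked against the LLVM headers.

```cpp
#include <cassert>

// Simplified stand-in for LLVM's atomic orderings (part of the sketch only).
enum class Ordering {
  NotAtomic,
  Unordered,
  Monotonic,
  Acquire,
  SequentiallyConsistent
};

// Hypothetical model of the load properties the two predicates inspect.
struct LoadModel {
  Ordering ordering;
  bool isVolatile;
};

// Simple: neither atomic nor volatile.
constexpr bool isSimple(const LoadModel &L) {
  return L.ordering == Ordering::NotAtomic && !L.isVolatile;
}

// Unordered: non-volatile with ordering no stronger than "unordered".
constexpr bool isUnordered(const LoadModel &L) {
  return (L.ordering == Ordering::NotAtomic ||
          L.ordering == Ordering::Unordered) &&
         !L.isVolatile;
}

// Checks, over every modeled combination, that isSimple implies isUnordered.
void checkSimpleImpliesUnordered() {
  const Ordering all[] = {Ordering::NotAtomic, Ordering::Unordered,
                          Ordering::Monotonic, Ordering::Acquire,
                          Ordering::SequentiallyConsistent};
  for (Ordering o : all)
    for (int v = 0; v < 2; ++v) {
      const LoadModel l{o, v != 0};
      assert(!isSimple(l) || isUnordered(l));
    }
}
```

In this model the only case where the two predicates disagree is a non-volatile load with `Unordered` ordering, which is the case reames describes as safe to speculate.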

dcaballe mentioned this in D111846: [LV] Drop integer poison-generating flags from instructions that need predication. Nov 7 2021, 2:30 AM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	89 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/load-deref-pred.ll	32 lines
llvm/trunk/test/Transforms/LoopVectorize/hoist-loads.ll	4 lines

Diff 219431

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

[9 lines hidden]
 // resided in LoopVectorize.cpp for a long time.
 //
 // At this point, it is implemented as a utility class, not as an analysis
 // pass. It should be easy to create an analysis pass around it if there
 // is a need (but D45420 needs to happen first).
 //
 #include "llvm/Transforms/Vectorize/LoopVectorize.h"
 #include "llvm/Transforms/Vectorize/LoopVectorizationLegality.h"
+#include "llvm/Analysis/Loads.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/Analysis/VectorUtils.h"
 #include "llvm/IR/IntrinsicInst.h"

 using namespace llvm;

 #define LV_NAME "loop-vectorize"
 #define DEBUG_TYPE LV_NAME

[885 lines hidden; enclosing context: for (Instruction &I : *BB) {]
     }
     if (I.mayThrow())
       return false;
   }

   return true;
 }

+/// Return true if we can prove that the given load would access only
+/// dereferenceable memory, and be properly aligned on every iteration.
+/// (i.e. does not require predication beyond that required by the the header
+/// itself) TODO: Move to Loads.h/cpp in a separate change
+static bool isDereferenceableAndAlignedInLoop(LoadInst *LI, Loop *L,
+                                              ScalarEvolution &SE,
+                                              DominatorTree &DT) {
+  auto &DL = LI->getModule()->getDataLayout();
+  Value *Ptr = LI->getPointerOperand();
+  auto *AddRec = dyn_cast<SCEVAddRecExpr>(SE.getSCEV(Ptr));
+  if (!AddRec || AddRec->getLoop() != L || !AddRec->isAffine())
+    return false;
+  auto* Step = dyn_cast<SCEVConstant>(AddRec->getStepRecurrence(SE));
+  if (!Step)
+    return false;
+  APInt StepC = Step->getAPInt();
+  APInt EltSize(DL.getIndexTypeSizeInBits(Ptr->getType()),
+                DL.getTypeStoreSize(LI->getType()));
+  // TODO: generalize to access patterns which have gaps
+  // TODO: handle uniform addresses (if not already handled by LICM)
+  if (StepC != EltSize)
+    return false;
+
+  // TODO: If the symbolic trip count has a small bound (max count), we might
+  // be able to prove safety.
+  auto TC = SE.getSmallConstantTripCount(L);
+  if (!TC)
+    return false;
+
+  const APInt AccessSize = TC * EltSize;
+
+  auto *StartS = dyn_cast<SCEVUnknown>(AddRec->getStart());
+  if (!StartS)
+    return false;
+  assert(SE.isLoopInvariant(StartS, L) && "implied by addrec definition");
+  Value *Base = StartS->getValue();
+
+  Instruction *HeaderFirstNonPHI = L->getHeader()->getFirstNonPHI();
+
+  unsigned Align = LI->getAlignment();
+  if (Align == 0)
+    Align = DL.getABITypeAlignment(LI->getType());
+  // For the moment, restrict ourselves to the case where the access size is a
+  // multiple of the requested alignment and the base is aligned.
+  // TODO: generalize if a case found which warrants
+  if (EltSize.urem(Align) != 0)
+    return false;
+  return isDereferenceableAndAlignedPointer(Base, Align, AccessSize,
+                                            DL, HeaderFirstNonPHI, &DT);
+}

+/// Return true if speculation of the given load must be suppressed for
+/// correctness reasons. If not suppressed, dereferenceability and alignment
+/// must be proven.
+/// TODO: Move to ValueTracking.h/cpp in a separate change
+static bool mustSuppressSpeculation(const LoadInst &LI) {
+  if (!LI.isUnordered())
+    return true;
+  const Function &F = *LI.getFunction();
+  // Speculative load may create a race that did not exist in the source.
+  return F.hasFnAttribute(Attribute::SanitizeThread) ||
+    // Speculative load may load data from dirty regions.
+    F.hasFnAttribute(Attribute::SanitizeAddress) ||
+    F.hasFnAttribute(Attribute::SanitizeHWAddress);
+}

 bool LoopVectorizationLegality::canVectorizeWithIfConvert() {
   if (!EnableIfConversion) {
     reportVectorizationFailure("If-conversion is disabled",
                                "if-conversion is disabled",
                                "IfConversionDisabled",
                                ORE, TheLoop);
     return false;
   }

   assert(TheLoop->getNumBlocks() > 1 && "Single block loops are vectorizable");

   // A list of pointers which are known to be dereferenceable within scope of
   // the loop body for each iteration of the loop which executes. That is,
   // the memory pointed to can be dereferenced (with the access size implied by
   // the value's type) unconditionally within the loop header without
   // introducing a new fault.
   SmallPtrSet<Value *, 8> SafePointes;

   // Collect safe addresses.
   for (BasicBlock *BB : TheLoop->blocks()) {
-    if (blockNeedsPredication(BB))
-      continue;
-
-    for (Instruction &I : *BB)
-      if (auto *Ptr = getLoadStorePointerOperand(&I))
-        SafePointes.insert(Ptr);
+    if (!blockNeedsPredication(BB)) {
+      for (Instruction &I : *BB)
+        if (auto *Ptr = getLoadStorePointerOperand(&I))
+          SafePointes.insert(Ptr);
+      continue;
+    }
+
+    // For a block which requires predication, a address may be safe to access
+    // in the loop w/o predication if we can prove dereferenceability facts
+    // sufficient to ensure it'll never fault within the loop. For the moment,
+    // we restrict this to loads; stores are more complicated due to
+    // concurrency restrictions.
+    ScalarEvolution &SE = *PSE.getSE();
+    for (Instruction &I : *BB) {
+      LoadInst *LI = dyn_cast<LoadInst>(&I);
+      if (LI && !mustSuppressSpeculation(*LI) &&
+          isDereferenceableAndAlignedInLoop(LI, TheLoop, SE, *DT))
+        SafePointes.insert(LI->getPointerOperand());
+    }
   }

   // Collect the blocks that need predication.
   BasicBlock *Header = TheLoop->getHeader();
   for (BasicBlock *BB : TheLoop->blocks()) {
     // We don't support switch statements inside loops.
     if (!isa<BranchInst>(BB->getTerminator())) {
       reportVectorizationFailure("Loop contains a switch statement",
[273 lines hidden]

llvm/trunk/test/Transforms/LoopVectorize/X86/load-deref-pred.ll

[61 lines hidden]
 ; CHECK-NEXT: [[TMP18:%.*]] = icmp slt <4 x i64> [[STEP_ADD1]], [[BROADCAST_SPLAT10]]
 ; CHECK-NEXT: [[TMP19:%.*]] = icmp slt <4 x i64> [[STEP_ADD2]], [[BROADCAST_SPLAT12]]
 ; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP0]]
 ; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP4]]
 ; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP8]]
 ; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP12]]
 ; CHECK-NEXT: [[TMP24:%.*]] = getelementptr inbounds i32, i32* [[TMP20]], i32 0
 ; CHECK-NEXT: [[TMP25:%.*]] = bitcast i32* [[TMP24]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP25]], i32 4, <4 x i1> [[TMP16]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP25]], align 4
 ; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds i32, i32* [[TMP20]], i32 4
 ; CHECK-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP26]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD13:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP27]], i32 4, <4 x i1> [[TMP17]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD13:%.*]] = load <4 x i32>, <4 x i32>* [[TMP27]], align 4
 ; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds i32, i32* [[TMP20]], i32 8
 ; CHECK-NEXT: [[TMP29:%.*]] = bitcast i32* [[TMP28]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD14:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP29]], i32 4, <4 x i1> [[TMP18]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD14:%.*]] = load <4 x i32>, <4 x i32>* [[TMP29]], align 4
 ; CHECK-NEXT: [[TMP30:%.*]] = getelementptr inbounds i32, i32* [[TMP20]], i32 12
 ; CHECK-NEXT: [[TMP31:%.*]] = bitcast i32* [[TMP30]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD15:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP31]], i32 4, <4 x i1> [[TMP19]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD15:%.*]] = load <4 x i32>, <4 x i32>* [[TMP31]], align 4
 ; CHECK-NEXT: [[TMP32:%.*]] = xor <4 x i1> [[TMP16]], <i1 true, i1 true, i1 true, i1 true>
 ; CHECK-NEXT: [[TMP33:%.*]] = xor <4 x i1> [[TMP17]], <i1 true, i1 true, i1 true, i1 true>
 ; CHECK-NEXT: [[TMP34:%.*]] = xor <4 x i1> [[TMP18]], <i1 true, i1 true, i1 true, i1 true>
 ; CHECK-NEXT: [[TMP35:%.*]] = xor <4 x i1> [[TMP19]], <i1 true, i1 true, i1 true, i1 true>
-; CHECK-NEXT: [[PREDPHI:%.*]] = select <4 x i1> [[TMP16]], <4 x i32> [[WIDE_MASKED_LOAD]], <4 x i32> zeroinitializer
-; CHECK-NEXT: [[PREDPHI16:%.*]] = select <4 x i1> [[TMP17]], <4 x i32> [[WIDE_MASKED_LOAD13]], <4 x i32> zeroinitializer
-; CHECK-NEXT: [[PREDPHI17:%.*]] = select <4 x i1> [[TMP18]], <4 x i32> [[WIDE_MASKED_LOAD14]], <4 x i32> zeroinitializer
-; CHECK-NEXT: [[PREDPHI18:%.*]] = select <4 x i1> [[TMP19]], <4 x i32> [[WIDE_MASKED_LOAD15]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI:%.*]] = select <4 x i1> [[TMP16]], <4 x i32> [[WIDE_LOAD]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI16:%.*]] = select <4 x i1> [[TMP17]], <4 x i32> [[WIDE_LOAD13]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI17:%.*]] = select <4 x i1> [[TMP18]], <4 x i32> [[WIDE_LOAD14]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI18:%.*]] = select <4 x i1> [[TMP19]], <4 x i32> [[WIDE_LOAD15]], <4 x i32> zeroinitializer
 ; CHECK-NEXT: [[TMP36]] = add <4 x i32> [[VEC_PHI]], [[PREDPHI]]
 ; CHECK-NEXT: [[TMP37]] = add <4 x i32> [[VEC_PHI4]], [[PREDPHI16]]
 ; CHECK-NEXT: [[TMP38]] = add <4 x i32> [[VEC_PHI5]], [[PREDPHI17]]
 ; CHECK-NEXT: [[TMP39]] = add <4 x i32> [[VEC_PHI6]], [[PREDPHI18]]
 ; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16
 ; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[STEP_ADD2]], <i64 4, i64 4, i64 4, i64 4>
 ; CHECK-NEXT: [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
 ; CHECK-NEXT: br i1 [[TMP40]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
[143 lines hidden]
 ; CHECK-NEXT: [[TMP62:%.*]] = insertelement <4 x i1> [[TMP61]], i1 [[TMP58]], i32 2
 ; CHECK-NEXT: [[TMP63:%.*]] = insertelement <4 x i1> [[TMP62]], i1 [[TMP59]], i32 3
 ; CHECK-NEXT: [[TMP64:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP0]]
 ; CHECK-NEXT: [[TMP65:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP4]]
 ; CHECK-NEXT: [[TMP66:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP8]]
 ; CHECK-NEXT: [[TMP67:%.*]] = getelementptr inbounds i32, i32* [[BASE]], i64 [[TMP12]]
 ; CHECK-NEXT: [[TMP68:%.*]] = getelementptr inbounds i32, i32* [[TMP64]], i32 0
 ; CHECK-NEXT: [[TMP69:%.*]] = bitcast i32* [[TMP68]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP69]], i32 4, <4 x i1> [[TMP39]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP69]], align 4
 ; CHECK-NEXT: [[TMP70:%.*]] = getelementptr inbounds i32, i32* [[TMP64]], i32 4
 ; CHECK-NEXT: [[TMP71:%.*]] = bitcast i32* [[TMP70]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD7:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP71]], i32 4, <4 x i1> [[TMP47]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD7:%.*]] = load <4 x i32>, <4 x i32>* [[TMP71]], align 4
 ; CHECK-NEXT: [[TMP72:%.*]] = getelementptr inbounds i32, i32* [[TMP64]], i32 8
 ; CHECK-NEXT: [[TMP73:%.*]] = bitcast i32* [[TMP72]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD8:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP73]], i32 4, <4 x i1> [[TMP55]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <4 x i32>, <4 x i32>* [[TMP73]], align 4
 ; CHECK-NEXT: [[TMP74:%.*]] = getelementptr inbounds i32, i32* [[TMP64]], i32 12
 ; CHECK-NEXT: [[TMP75:%.*]] = bitcast i32* [[TMP74]] to <4 x i32>*
-; CHECK-NEXT: [[WIDE_MASKED_LOAD9:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP75]], i32 4, <4 x i1> [[TMP63]], <4 x i32> undef)
+; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <4 x i32>, <4 x i32>* [[TMP75]], align 4
 ; CHECK-NEXT: [[TMP76:%.*]] = xor <4 x i1> [[TMP39]], <i1 true, i1 true, i1 true, i1 true>
 ; CHECK-NEXT: [[TMP77:%.*]] = xor <4 x i1> [[TMP47]], <i1 true, i1 true, i1 true, i1 true>
 ; CHECK-NEXT: [[TMP78:%.*]] = xor <4 x i1> [[TMP55]], <i1 true, i1 true, i1 true, i1 true>
 ; CHECK-NEXT: [[TMP79:%.*]] = xor <4 x i1> [[TMP63]], <i1 true, i1 true, i1 true, i1 true>
-; CHECK-NEXT: [[PREDPHI:%.*]] = select <4 x i1> [[TMP39]], <4 x i32> [[WIDE_MASKED_LOAD]], <4 x i32> zeroinitializer
-; CHECK-NEXT: [[PREDPHI10:%.*]] = select <4 x i1> [[TMP47]], <4 x i32> [[WIDE_MASKED_LOAD7]], <4 x i32> zeroinitializer
-; CHECK-NEXT: [[PREDPHI11:%.*]] = select <4 x i1> [[TMP55]], <4 x i32> [[WIDE_MASKED_LOAD8]], <4 x i32> zeroinitializer
-; CHECK-NEXT: [[PREDPHI12:%.*]] = select <4 x i1> [[TMP63]], <4 x i32> [[WIDE_MASKED_LOAD9]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI:%.*]] = select <4 x i1> [[TMP39]], <4 x i32> [[WIDE_LOAD]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI10:%.*]] = select <4 x i1> [[TMP47]], <4 x i32> [[WIDE_LOAD7]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI11:%.*]] = select <4 x i1> [[TMP55]], <4 x i32> [[WIDE_LOAD8]], <4 x i32> zeroinitializer
+; CHECK-NEXT: [[PREDPHI12:%.*]] = select <4 x i1> [[TMP63]], <4 x i32> [[WIDE_LOAD9]], <4 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP80]] = add <4 x i32> [[VEC_PHI]], [[PREDPHI]]			; CHECK-NEXT: [[TMP80]] = add <4 x i32> [[VEC_PHI]], [[PREDPHI]]
	; CHECK-NEXT: [[TMP81]] = add <4 x i32> [[VEC_PHI4]], [[PREDPHI10]]			; CHECK-NEXT: [[TMP81]] = add <4 x i32> [[VEC_PHI4]], [[PREDPHI10]]
	; CHECK-NEXT: [[TMP82]] = add <4 x i32> [[VEC_PHI5]], [[PREDPHI11]]			; CHECK-NEXT: [[TMP82]] = add <4 x i32> [[VEC_PHI5]], [[PREDPHI11]]
	; CHECK-NEXT: [[TMP83]] = add <4 x i32> [[VEC_PHI6]], [[PREDPHI12]]			; CHECK-NEXT: [[TMP83]] = add <4 x i32> [[VEC_PHI6]], [[PREDPHI12]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16
	; CHECK-NEXT: [[TMP84:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096			; CHECK-NEXT: [[TMP84:%.*]] = icmp eq i64 [[INDEX_NEXT]], 4096
	; CHECK-NEXT: br i1 [[TMP84]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4			; CHECK-NEXT: br i1 [[TMP84]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
	; CHECK: middle.block:			; CHECK: middle.block:

llvm/trunk/test/Transforms/LoopVectorize/hoist-loads.ll

	}			}

	; However, we can't hoist loads whose address we have not seen unconditionally			; However, we can't hoist loads whose address we have not seen unconditionally
	; accessed. One wide load is fine, but not the second.			; accessed. One wide load is fine, but not the second.
	; CHECK-LABEL: @dont_hoist_cond_load(			; CHECK-LABEL: @dont_hoist_cond_load(
	; CHECK: load <2 x float>			; CHECK: load <2 x float>
	; CHECK-NOT: load <2 x float>			; CHECK-NOT: load <2 x float>

	define void @dont_hoist_cond_load() {			define void @dont_hoist_cond_load([1024 x float]* %a) {
	entry:			entry:
	br label %for.body			br label %for.body
	for.body:			for.body:
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end9 ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end9 ]
	%arrayidx = getelementptr inbounds [1024 x float], [1024 x float]* @A, i64 0, i64 %indvars.iv			%arrayidx = getelementptr inbounds [1024 x float], [1024 x float]* %a, i64 0, i64 %indvars.iv
	%arrayidx2 = getelementptr inbounds [1024 x float], [1024 x float]* @B, i64 0, i64 %indvars.iv			%arrayidx2 = getelementptr inbounds [1024 x float], [1024 x float]* @B, i64 0, i64 %indvars.iv
	%0 = load float, float* %arrayidx2, align 4			%0 = load float, float* %arrayidx2, align 4
	%cmp3 = fcmp oeq float %0, 0.000000e+00			%cmp3 = fcmp oeq float %0, 0.000000e+00
	br i1 %cmp3, label %if.end9, label %if.else			br i1 %cmp3, label %if.end9, label %if.else

	if.else:			if.else:
	%1 = load float, float* %arrayidx, align 4			%1 = load float, float* %arrayidx, align 4
	br label %if.end9			br label %if.end9