This is an archive of the discontinued LLVM Phabricator instance.

Differential D131118

[LV] Add generic scalarization support for unpredicated scalable vectors
Abandoned · Public

Authored by reames on Aug 3 2022, 2:54 PM.

Details

Reviewers

fhahn
david-arm
paulwalker-arm

Summary

This change adds generic support for scalarizing scalable vector operations. Unlike fixed length vectors, we can't simply unroll a scalable vector as we don't know how long it is at compile time. However, there's nothing that prevents us from emitting the loop directly as we do know the dynamic number of elements.
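
The contrast drawn above, between unrolling a fixed-length vector and looping over a lane count known only at run time, can be sketched in plain C++. This is a minimal illustration and not the patch's implementation: `ScalableVec` and `scalarizeWithLoop` are hypothetical names, and a `std::vector` stands in for a scalable vector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a scalable vector: the lane count is a runtime
// property (think vscale x 4), not a compile-time constant.
using ScalableVec = std::vector<int>;

// Scalarize a lane-wise operation with a loop over the runtime lane count.
// Each iteration models: extract one lane, run the scalar op, insert the
// result back into the result vector.
ScalableVec scalarizeWithLoop(const ScalableVec &Operand,
                              const std::function<int(int)> &ScalarOp) {
  assert(ScalarOp && "expected a callable scalar operation");
  const std::size_t RuntimeVF = Operand.size();
  ScalableVec Result(RuntimeVF);
  for (std::size_t Lane = 0; Lane != RuntimeVF; ++Lane) {
    int Extracted = Operand[Lane];      // models extractelement
    Result[Lane] = ScalarOp(Extracted); // models scalar op + insertelement
  }
  return Result;
}
```

For a fixed-length vector the same work could be emitted as straight-line code with one copy per lane; the loop is only needed because the trip count is not a compile-time constant.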

For testing purposes, I have hooked this up to the uniform memory op path. This is not an optimal lowering for uniform mem ops, but it nicely demonstrates the value of having a fallback scalarization strategy available when smarter and more optimal things haven't yet been implemented.

From here, I plan on doing the following:

1. Add the support on the predicated path. This is quite a bit more involved and requires setting up VPBlocks for the CFG.
2. Generalize the definition of uniform memory op to allow internal predication. (This fundamentally requires the fully general predicated scalarization fallback, so it makes a good test to make sure we haven't missed anything.)
3. Write generic cost modeling for scalable scalarization, and start enabling other paths that we currently unconditionally bail out from.
4. Implement a dedicated recipe for the uniform memory op case in the current predication-due-to-tail-folding-only form. The loop form will probably be removed via LICM, but we should really stop relying on pass ordering here.

Diff Detail

Event Timeline

reames created this revision. Aug 3 2022, 2:54 PM

Herald added a project: Restricted Project. Aug 3 2022, 2:54 PM

Herald added subscribers: frasercrmck, luismarques, apazos and 21 others.

reames requested review of this revision. Aug 3 2022, 2:54 PM

Herald added a project: Restricted Project. Aug 3 2022, 2:54 PM

Herald added subscribers: alextsao1999, pcwang-thead, vkmr, MaskRay.

reames added parent revisions: D131093: [LV] Restructure isPredicatedInst and isScalarWithPredication (w/a fix for uniform mem ops), D131015: [LV] Track all IR blocks corresponding to VPBasicBlock. Aug 3 2022, 2:54 PM

Harbormaster completed remote builds in B179135: Diff 449793. Aug 3 2022, 3:59 PM

david-arm added inline comments. Aug 11 2022, 1:02 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
28	Hi @reames, we definitely don't want to be doing this for SVE as it will likely hurt performance - scatters are still likely to be better. We made a conscious effort to avoid scalarising this way for SVE because we managed to find other ways of solving the same problems using existing instructions. Also, if we ever encountered a situation where we'd have to scalarise a scalable vector then the performance will likely be terrible and so we choose to return an Invalid cost from the cost model and skip that VF. We may as well just use NEON or not vectorise, because this is almost certainly better.

What use cases are you trying to solve here? This patch doesn't seem to fix any actual bugs, so I'm assuming this is for performance reasons. It looks like this change only affects one test (`@uniform_store_of_loop_varying`) and I guess the IR in this test is not a common idiom. Have you tested performance before and after to see if this is worthwhile?

Matt added a subscriber: Matt. Aug 12 2022, 12:36 PM

reames added inline comments. Aug 15 2022, 9:12 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
28	The motivation here is primarily robustness - both correctness-wise and performance-wise.

On the correctness front, the existing code pretty fundamentally assumes that scalarization is an available strategy. The invalid bailouts for scalable have been added in, but I've already hit (and fixed) several cases where this was incomplete. From a code-complexity and likelihood-of-bugs perspective, having fixed and scalable have the same capabilities seems highly useful.

On the performance side, there's a two-level argument here.

For many cases, you're entirely right that there are other strategies available. You'll note that I have a patch on review right now which implements a more optimal strategy for divides. However, we have not implemented all of them. As a result, the current situation is that these cases become strictly non-vectorizable. This means that even otherwise hugely profitable to vectorize loops are prevented from vectorization. This is basically an "order of work" argument to minimize the pain while that work is happening.

However, there are also cases which we can vectorize, but fundamentally don't have a better strategy for. One that we could vectorize - but don't today - is a call to a readonly function in an otherwise vectorizable loop. We can't assume the existence of a vector variant of the call (since it's not a known function), but we could scalarize in general. Now, put that call down a rarely taken predicated block and you start seeing why having the capability of scalarization is useful.

Uniform memops are only a testing vehicle here. Nothing more. Please don't fixate on the generality of a uniform store test as that is not what this patch is really about.

For aarch64, if scatters are profitable and legal, then the code should pick that already. This is an example where the cost model believes scalarization is profitable. It didn't make it into the final patch, but I played with some pretty serious cost penalties, and this case didn't change behavior. If you have particular suggestions on how to adjust cost to penalize scalarizing "enough", I'm completely open to them.
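
The readonly-call case mentioned above can be sketched as a per-lane loop that only invokes the scalar function on active lanes. This is a hedged illustration in plain C++, not the vectorizer's code: `scalarizePredicatedCall`, its parameters, and the pass-through convention for inactive lanes are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-lane predicated scalarization. ScalarFn models a readonly
// function with no known vector variant; it is called only for lanes whose
// mask bit is set. Inactive lanes keep the corresponding PassThru value.
std::vector<long>
scalarizePredicatedCall(const std::vector<long> &Args,
                        const std::vector<bool> &Mask,
                        const std::vector<long> &PassThru,
                        const std::function<long(long)> &ScalarFn) {
  assert(Args.size() == Mask.size() && Args.size() == PassThru.size());
  std::vector<long> Result = PassThru;
  for (std::size_t Lane = 0; Lane != Args.size(); ++Lane) {
    if (!Mask[Lane])
      continue; // predicated off: the call is never made for this lane
    Result[Lane] = ScalarFn(Args[Lane]);
  }
  return Result;
}
```

When the predicated block is rarely taken, most lanes skip the call entirely, which is why such a fallback need not dominate the cost of an otherwise profitable vector loop.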

david-arm added inline comments. Aug 15 2022, 9:19 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870	I think you can avoid the scalarisation for SVE here simply by asking for the scalarisation cost of the instruction, similarly to how it's done elsewhere. For SVE this should return Invalid. Alternatively you could add a TTI hook to ask if the target should scalarise or not, i.e.

  if (!foldTailByMasking())
    return isLegalToScalarize();

We have always considered it 'illegal' to scalarise for SVE.

reames added inline comments. Aug 15 2022, 9:23 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870	You seem to be missing the point of the code. The costing here is done at two levels: TTI, and LV. TTI returns invalid costs for a bunch of cases which LV then turns around and decides to scalarize by reasoning about the scalarization cost and scalar op cost directly. Doing as you suggest here completely prevents the use of the added lowering. Adding a TTI hook is feasible, but I don't have a good name/semantic for it. Any suggestions?
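
The selection between lowering strategies discussed here depends on invalid costs losing every comparison. A minimal sketch of that idea follows, assuming a hypothetical `Cost` type and `chooseWidening` helper; neither is LLVM's `InstructionCost` API, and treating the non-gather/scatter branch as scalarization is an assumption of the example.

```cpp
#include <cassert>

// Hypothetical cost value that may be invalid ("this strategy is unavailable").
struct Cost {
  bool Valid;
  unsigned Value;
  static Cost invalid() { return {false, 0}; }
  static Cost of(unsigned V) { return {true, V}; }
};

// An invalid cost behaves as if it were larger than any valid cost.
bool isCheaper(const Cost &A, const Cost &B) {
  if (!A.Valid)
    return false;
  if (!B.Valid)
    return true;
  return A.Value < B.Value;
}

enum class Decision { GatherScatter, Scalarize, Abort };

// Pick the cheaper strategy; if neither is available, give up on this VF.
Decision chooseWidening(const Cost &GatherScatter, const Cost &Scalarization) {
  if (!GatherScatter.Valid && !Scalarization.Valid)
    return Decision::Abort;
  return isCheaper(GatherScatter, Scalarization) ? Decision::GatherScatter
                                                 : Decision::Scalarize;
}
```

Under this scheme a strategy is only ever ruled out by being marked invalid or by costing more than the alternative, so an underestimated scalarization cost wins the comparison even when it is slower in practice.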

reames added inline comments. Aug 15 2022, 9:25 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870	Another possibility here - would you be okay with the scalarization being behind a flag for now, and the cost model discussion (i.e. profitability) being done in a separate review? I'm wanting to avoid too much coupling here, and when I tried various different approaches to costing, a ton of unrelated issues started falling out.

david-arm added inline comments. Aug 15 2022, 10:21 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870	That might be an option too. I'll think about this more in the morning!
6893	It sounds like getUniformMemOpCost is not returning Invalid here for SVE. I'm not sure why, but by returning true from isLegalToScalarize it relies on the cost being sensible. When I'm back at work tomorrow morning I'll take a better look.

reames added inline comments. Aug 15 2022, 11:54 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870	A further point for your consideration. The scatter based lowering here has the following cost:

  Found an estimated cost of 81 for VF vscale x 4 For instruction: store i16 %ld, i16* %dst, align 2

Assuming this is a correct cost for the scatter, this is blatantly unprofitable. Essentially, we've switched from one unprofitable vectorization to another. As such, I don't think this test is particularly interesting.

david-arm added a reviewer: paulwalker-arm. Aug 16 2022, 1:11 AM

fhahn added inline comments. Aug 16 2022, 1:15 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
28	In general it seems like having this extra lowering strategy would be desirable with respect to guaranteeing that we will always be able to scalarize any construct, rather than crash. This is an assumption made in various places. It should be up to the cost model to avoid scalarizing in cases where it won't be profitable, but I can't comment on this particular test case or on whether there are scenarios where this will be profitable in practice.

Hi @reames, by the way I completely understand the intent of the patch and it seems like a useful addition to be able to fall back on scalarisation this way for scalable vectors.

However, just for some context on why I'm worried about this patch as it stands currently - I tried out this patch on a really simple (and obviously contrived!) loop like this:

void inv_store_i16(short* __restrict__ dst, short* __restrict__ src, long n) {
  for (long i = 0; i < n; i++) {
    *dst = src[i];
  }
}

corresponding to the test inv_store_i16 changed in this patch. On a neoverse-v1 (SVE) machine I created a micro-benchmark that iterated over this a number of times. Here are some rough results:

scalar loop: 4.9s
vector loop with SVE scatters: 4.5s
vector loop with scalarised op (using this patch): 13.7s

That strongly suggests that the cost model is broken now because we're now choosing the worst cost-based widening decision. At the moment the patch regresses performance for SVE. In another comment I've suggested two possible ways forward and I'd be happy with either. Of course, I'm happy to listen to other ideas too!

I'll try to review the technical changes in the patch properly soon ...
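
For readers following the benchmark above, the observable effect of the loop is simply that `*dst` ends up holding `src[n - 1]`, which is why a scalar store of one extracted lane per vector iteration is enough. The sketch below illustrates that equivalence in plain C++; `invStoreLastLane` and its fixed `VF` parameter are hypothetical, and it assumes a non-volatile store with no aliasing between `dst` and `src`.

```cpp
#include <cassert>
#include <vector>

// Scalar reference: every iteration overwrites *dst, so the last write wins.
void invStoreScalar(short *dst, const short *src, long n) {
  for (long i = 0; i < n; i++)
    *dst = src[i];
}

// Chunked variant: for each full chunk of VF elements, store only the value
// held by the last lane of the chunk. Leftover elements run the scalar loop.
void invStoreLastLane(short *dst, const short *src, long n, long VF) {
  assert(VF > 0);
  long i = 0;
  for (; i + VF <= n; i += VF)
    *dst = src[i + VF - 1]; // models extract of the last lane + scalar store
  for (; i < n; i++)
    *dst = src[i]; // scalar remainder
}
```

Both functions leave the same value in `*dst` for any `n`, including `n == 0`, where neither performs a store.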

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870	Given the comment

  // If not predicated, we can now scalarize generically with a loop

then this looks wrong here. I think we should be doing:

  if (!blockNeedsPredicationForAnyReason())

instead of

  if (!foldTailByMasking())

so that we also don't scalarise loops with conditional uniform memory operations. Or if the scalarisation code does support conditional ops, then perhaps the comment needs updating?
6893	I'm now convinced that in the current form the patch is wrong because `getUniformMemOpCost` assumes that for stores we will perform a normal vector store, i.e.:

  StoreInst *SI = cast<StoreInst>(I);
  bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand());
  return TTI.getAddressComputationCost(ValTy) +
         TTI.getMemoryOpCost(Instruction::Store, ValTy, Alignment, AS, CostKind) +
         (isLoopInvariantStoreValue ? 0 : TTI.getVectorInstrCost(Instruction::ExtractElement, VectorTy, VF.getKnownMinValue() - 1));

whereas we're actually scalarising in some cases through generation of an inner vector loop. I can think of two ways forward here:

1. Introduce this feature under control of a flag, which is off by default. Then in a later patch fix up the costs to properly calculate the scalarisation cost, which includes the loop overhead (pre-header, phi nodes, IV cmp+branch, etc.).
2. Fix the cost as part of this patch.

As you say, provided the scalarisation cost is fairly sensible the vectoriser should make the right decision and choose the best widening decision. But right now we're making decisions based on what look like very incorrect costs.
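
To make the point about loop overhead concrete, here is one hypothetical way such a cost could be composed. Nothing in this sketch comes from the patch or from LLVM's cost model: the struct, the function, the unit costs, and the idea of using an estimated lane count for a scalable VF are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical unit costs for the pieces of a scalarization loop.
struct ScalarizationLoopCosts {
  unsigned Preheader;       // one-time setup before the inner loop
  unsigned PerIterOverhead; // phi nodes, IV increment, compare + branch
  unsigned Extract;         // extracting one lane of a vector operand
  unsigned ScalarOp;        // the scalar instruction itself
};

// Estimate the cost of scalarizing one instruction with an inner loop.
// EstimatedLanes is a guess at the runtime lane count, since the real
// count is not known at compile time for a scalable vector.
unsigned estimateLoopScalarizationCost(const ScalarizationLoopCosts &C,
                                       unsigned EstimatedLanes,
                                       bool NeedsExtract) {
  unsigned PerLane =
      C.ScalarOp + (NeedsExtract ? C.Extract : 0) + C.PerIterOverhead;
  return C.Preheader + EstimatedLanes * PerLane;
}
```

Even with small unit costs the total grows linearly with the lane count, whereas the formula quoted above charges for a single store and at most one extract regardless of the number of lanes.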
llvm/lib/Transforms/Vectorize/VPlan.cpp
800	nit: Stray comment.

Marking as Plan Changes as @david-arm has convinced me that the cost model on this needs to be significantly reworked. Going to think about this a bit, and probably split this into a couple of pieces.

fhahn mentioned this in D131015: [LV] Track all IR blocks corresponding to VPBasicBlock. Sep 6 2022, 10:40 AM

Abandoning this. I still think having the LV be able to scalarize is a worthwhile fallback, but given this has run into significant resistance in review, and my main motivating cases have now all been handled in scalable vectorization without fallback, I'm going to discontinue working on this.

Revision Contents

Path                                                              Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                   140 lines
llvm/lib/Transforms/Vectorize/VPlan.cpp                           10 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll       25 lines
llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll    58 lines

Diff 449793

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ (491 lines not shown) @@ public:
   /// Generates a sequence of scalar instances for each lane between \p MinLane
   /// and \p MaxLane, times each part between \p MinPart and \p MaxPart,
   /// inclusive. Uses the VPValue operands from \p RepRecipe instead of \p
   /// Instr's operands.
   void scalarizeInstruction(Instruction *Instr, VPReplicateRecipe *RepRecipe,
                             const VPIteration &Instance, bool IfPredicateInstr,
                             VPTransformState &State);
 
+  /// Same as above, except that the lane comes from a runtime value, and the
+  /// cloned instruction is returned instead of being directly stored into
+  /// the transform state.
+  Instruction *
+  scalarizeInstruction(Instruction *Instr, VPReplicateRecipe *RepRecipe,
+                       unsigned Part, Value *Lane, bool IfPredicateInstr,
+                       VPTransformState &State);
+
   /// Construct the vector value of a scalarized value \p V one lane at a time.
   void packScalarIntoVectorValue(VPValue *Def, const VPIteration &Instance,
                                  VPTransformState &State);
 
   /// Try to vectorize interleaved access group \p Group with the base address
   /// given in \p Addr, optionally masking the vector operations if \p
   /// BlockInMask is non-null. Use \p State to translate given VPValues to IR
   /// values in the vectorized loop.
@@ (2,266 lines not shown) @@ void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,
   if (auto *II = dyn_cast<AssumeInst>(Cloned))
     AC->registerAssumption(II);
 
   // End if-block.
   if (IfPredicateInstr)
     PredicatedInstructions.push_back(Cloned);
 }
 
+
+Instruction *
+InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,
+                                          VPReplicateRecipe *RepRecipe,
+                                          unsigned Part, Value *Lane,
+                                          bool IfPredicateInstr,
+                                          VPTransformState &State) {
+  assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
+
+  // llvm.experimental.noalias.scope.decl intrinsics can just be dropped
+  // TODO: add special first lane handling
+  if (isa<NoAliasScopeDeclInst>(Instr))
+    return nullptr;
+
+  // Does this instruction return a value ?
+  bool IsVoidRetTy = Instr->getType()->isVoidTy();
+
+  Instruction *Cloned = Instr->clone();
+  if (!IsVoidRetTy)
+    Cloned->setName(Instr->getName() + ".cloned");
+
+  // If the scalarized instruction contributes to the address computation of a
+  // widen masked load/store which was in a basic block that needed predication
+  // and is not predicated after vectorization, we can't propagate
+  // poison-generating flags (nuw/nsw, exact, inbounds, etc.). The scalarized
+  // instruction could feed a poison value to the base address of the widen
+  // load/store.
+  if (State.MayGeneratePoisonRecipes.contains(RepRecipe))
+    Cloned->dropPoisonGeneratingFlags();
+
+  if (Instr->getDebugLoc())
+    State.setDebugLocFromInst(Instr);
+
+  // Replace the operands of the cloned instructions with their scalar
+  // equivalents in the new loop.
+  for (auto &I : enumerate(RepRecipe->operands())) {
+    VPValue *Operand = I.value();
+    if (VPReplicateRecipe *OperandR = dyn_cast<VPReplicateRecipe>(Operand))
+      if (OperandR->isUniform()) {
+        VPIteration First = {Part, VPLane::getFirstLane()};
+        Cloned->setOperand(I.index(), State.get(Operand, First));
+        continue;
+      }
+    auto *VecPart = State.get(Operand, Part);
+    auto *Extract = Builder.CreateExtractElement(VecPart, Lane);
+    Cloned->setOperand(I.index(), Extract);
+  }
+  State.addNewMetadata(Cloned, Instr);
+
+  // Place the cloned scalar in the new loop.
+  State.Builder.Insert(Cloned);
+
+  // If we just cloned a new assumption, add it the assumption cache.
+  if (auto *II = dyn_cast<AssumeInst>(Cloned))
+    AC->registerAssumption(II);
+
+  // End if-block.
+  if (IfPredicateInstr)
+    PredicatedInstructions.push_back(Cloned);
+
+  return Cloned;
+}
+
+
 Value *InnerLoopVectorizer::getOrCreateTripCount(BasicBlock *InsertBlock) {
   if (TripCount)
     return TripCount;
 
   assert(InsertBlock);
   IRBuilder<> Builder(InsertBlock->getTerminator());
   // Find the loop boundaries.
   ScalarEvolution *SE = PSE.getSE();
@@ (3,996 lines not shown) @@ for (Instruction &I : *BB) {
         NumPredStores++;
 
       if (Legal->isUniformMemOp(I)) {
         auto isLegalToScalarize = [&]() {
           if (!VF.isScalable())
             // Scalarization of fixed length vectors "just works".
             return true;
 
+          // If not predicated, we can now scalarize generically with a loop
+          // if needed. The remainder of the code below is about checking
+          // for cases we can scalarize with predication without hitting
+          // the generic replicate path which isn't yet implemented.
+          if (!foldTailByMasking())
+            return true;
+
           // For scalable vectors, a uniform memop load is always
           // uniform-by-parts and we know how to scalarize that.
           if (isa<LoadInst>(I))
             return true;
 
           // A uniform store isn't neccessarily uniform-by-part
           // and we can't assume scalarization.
           auto &SI = cast<StoreInst>(I);
           return TheLoop->isLoopInvariant(SI.getValueOperand());
         };
 
         const InstructionCost GatherScatterCost =
           isLegalGatherOrScatter(&I, VF) ?
           getGatherScatterCost(&I, VF) : InstructionCost::getInvalid();
 
         // Load: Scalar load + broadcast
         // Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
         // TODO: Avoid replicating loads and stores instead of relying on
         // instcombine to remove them.
         const InstructionCost ScalarizationCost = isLegalToScalarize() ?
           getUniformMemOpCost(&I, VF) : InstructionCost::getInvalid();
 
         // Choose better solution for the current VF, Note that Invalid
         // costs compare as maximumal large. If both are invalid, we get
         // scalable invalid which signals a failure and a vectorization abort.
         if (GatherScatterCost < ScalarizationCost)
           setWideningDecision(&I, VF, CM_GatherScatter, GatherScatterCost);
         else
@@ (2,769 lines not shown) @@ if (IsUniform) {
     // unrolled copy.
     for (unsigned Part = 0; Part < State.UF; ++Part)
       State.ILV->scalarizeInstruction(getUnderlyingInstr(), this,
                                       VPIteration(Part, 0), IsPredicated,
                                       State);
     return;
   }
 
+  if (State.VF.isScalable()) {
+    // For scalable vectors, we can scalarize by using an inner loop to
+    // execute the statically unknown number of iterations required.
+    // TODO: This strategy could be used for long fixed length vectors if
+    // profitable.
+    // TODO: Instead of one sub-loop per part, we could use one loop
+    // processing lanes of each unrolled copy at once.
+    auto *Instr = getUnderlyingInstr();
+
+    auto &Builder = State.Builder;
+    Value *RunTimeVF = getRuntimeVF(Builder, Builder.getInt32Ty(), State.VF);
+
+    for (unsigned Part = 0; Part < State.UF; ++Part) {
+
+      auto *InsertPt = &*Builder.GetInsertPoint();
+      BasicBlock *PreheaderBB = InsertPt->getParent();
+      BasicBlock *HeaderBB = SplitBlock(InsertPt->getParent(), InsertPt);
+      BasicBlock *ExitBB = SplitBlock(InsertPt->getParent(), InsertPt);
+
+      HeaderBB->getTerminator()->eraseFromParent();
+      Builder.SetInsertPoint(HeaderBB);
+      auto *IV = Builder.CreatePHI(RunTimeVF->getType(), 2);
+      IV->addIncoming(ConstantInt::get(RunTimeVF->getType(), 0), PreheaderBB);
+      PHINode *ResultIV = nullptr;
+      if (!Instr->getType()->isVoidTy()) {
+        auto *ResultTy = VectorType::get(Instr->getType(), State.VF);
+        ResultIV = Builder.CreatePHI(ResultTy, 2);
+        ResultIV->addIncoming(PoisonValue::get(ResultTy), PreheaderBB);
+      }
+
+      Instruction *Cloned =
+        State.ILV->scalarizeInstruction(Instr, this, Part, IV, IsPredicated,
+                                        State);
+
+      if (ResultIV) {
+        auto *Insert = Builder.CreateInsertElement(ResultIV, Cloned, IV);
+        ResultIV->addIncoming(Insert, HeaderBB);
+        State.set(this, Insert, Part);
+      }
+
+      auto *Inc = Builder.CreateAdd(IV, ConstantInt::get(RunTimeVF->getType(), 1));
+      IV->addIncoming(Inc, HeaderBB);
+      auto *Cmp = Builder.CreateICmpNE(IV, RunTimeVF);
+      Builder.CreateCondBr(Cmp, HeaderBB, ExitBB);
+
+      // Update the state so that we can continue at the right point for the next
+      // recipe, and have valid analysis results once done transforming.
+      assert(State.CFG.VPBB2IRBB[getParent()].back() == PreheaderBB);
+      State.CFG.VPBB2IRBB[getParent()].push_back(HeaderBB);
+      State.CFG.VPBB2IRBB[getParent()].push_back(ExitBB);
+      State.CFG.PrevBB = ExitBB;
+
+      assert(State.CurrentVectorLoop);
+      State.CurrentVectorLoop->addBasicBlockToLoop(HeaderBB, *State.LI);
+      State.CurrentVectorLoop->addBasicBlockToLoop(ExitBB, *State.LI);
+
+      Builder.SetInsertPoint(ExitBB, ExitBB->begin());
+    }
+    return;
+  }
+
   // Generate scalar instances for all VF lanes of all UF parts.
   assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
   const unsigned EndLane = State.VF.getKnownMinValue();
   for (unsigned Part = 0; Part < State.UF; ++Part)
     for (unsigned Lane = 0; Lane < EndLane; ++Lane)
       State.ILV->scalarizeInstruction(getUnderlyingInstr(), this,
                                       VPIteration(Part, Lane), IsPredicated,
                                       State);
@@ (982 lines not shown) @@

llvm/lib/Transforms/Vectorize/VPlan.cpp

	Show First 20 Lines • Show All 791 Lines • ▼ Show 20 Lines
	void VPlan::addLiveOut(PHINode PN, VPValue V) {			void VPlan::addLiveOut(PHINode PN, VPValue V) {
	assert(LiveOuts.count(PN) == 0 && "an exit value for PN already exists");			assert(LiveOuts.count(PN) == 0 && "an exit value for PN already exists");
	LiveOuts.insert({PN, new VPLiveOut(PN, V)});			LiveOuts.insert({PN, new VPLiveOut(PN, V)});
	}			}

	void VPlan::updateDominatorTree(DominatorTree DT, BasicBlock LoopHeaderBB,			void VPlan::updateDominatorTree(DominatorTree DT, BasicBlock LoopHeaderBB,
	BasicBlock *LoopLatchBB,			BasicBlock *LoopLatchBB,
	BasicBlock *LoopExitBB) {			BasicBlock *LoopExitBB) {
				//LoopHeaderBB->getParent()->dump();
				david-armUnsubmitted Not Done Reply Inline Actions nit: Stray comment. david-arm: nit: Stray comment.
	// The vector body may be more than a single basic-block by this point.			// The vector body may be more than a single basic-block by this point.
	// Update the dominator tree information inside the vector body by propagating			// Update the dominator tree information inside the vector body by propagating
	// it from header to latch, expecting only triangular control-flow, if any.			// it from header to latch, expecting only triangular control-flow, if any.
	BasicBlock *PostDomSucc = nullptr;			BasicBlock *PostDomSucc = nullptr;
	for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {			for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {
	// Get the list of successors of this block.			// Get the list of successors of this block.
	std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));			std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));
	assert(Succs.size() <= 2 &&			assert(Succs.size() <= 2 &&
	"Basic block in vector loop has more than 2 successors.");			"Basic block in vector loop has more than 2 successors.");
	PostDomSucc = Succs[0];			PostDomSucc = Succs[0];
	if (Succs.size() == 1) {			if (Succs.size() == 1) {
	assert(PostDomSucc->getSinglePredecessor() &&			DT->addNewBlock(PostDomSucc, BB);
	"PostDom successor has more than one predecessor.");			continue;
				}
				if (Succs[0] == BB \|\| Succs[1] == BB) {
				// simple one block loop - i.e. scalable scalarization
				if (Succs[0] == BB)
				PostDomSucc = Succs[1];
	DT->addNewBlock(PostDomSucc, BB);			DT->addNewBlock(PostDomSucc, BB);
	continue;			continue;
	}			}
	BasicBlock *InterimSucc = Succs[1];			BasicBlock *InterimSucc = Succs[1];
	if (PostDomSucc->getSingleSuccessor() == InterimSucc) {			if (PostDomSucc->getSingleSuccessor() == InterimSucc) {
	PostDomSucc = Succs[1];			PostDomSucc = Succs[1];
	InterimSucc = Succs[0];			InterimSucc = Succs[0];
	}			}
	[289 lines not shown]
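
The successor choice made in the hunk above while walking from header to latch can be sketched in isolation. This is a minimal standalone model, not the LLVM code: the `CFG` alias and the `nextOnSpine` helper are hypothetical names introduced for illustration, and blocks are plain integers rather than `BasicBlock *`. It assumes, as the patch does, that a block has at most two successors and that any branching is either a one-block self loop or a triangle.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a CFG: block id -> successor ids.
using CFG = std::map<int, std::vector<int>>;

// Returns the successor that the header-to-latch walk should visit next,
// i.e. the block that becomes the new dominator-tree child of BB.
int nextOnSpine(const CFG &G, int BB) {
  const std::vector<int> &Succs = G.at(BB);
  assert(!Succs.empty() && Succs.size() <= 2 && "expected 1 or 2 successors");
  if (Succs.size() == 1)
    return Succs[0];
  // One-block loop (the scalable scalarization case): skip the back edge
  // and continue with the exit successor.
  if (Succs[0] == BB)
    return Succs[1];
  if (Succs[1] == BB)
    return Succs[0];
  // Triangle: if Succs[0] is the interim block that falls through to
  // Succs[1], then Succs[1] is the join block; otherwise Succs[0] is.
  const std::vector<int> &S0 = G.at(Succs[0]);
  if (S0.size() == 1 && S0[0] == Succs[1])
    return Succs[1];
  return Succs[0];
}
```

With a graph such as `0 -> 1`, `1 -> {1, 2}`, `2 -> {3, 4}`, `3 -> 4`, the walk visits 0, 1, 2, 4: the self loop on block 1 is stepped over, and the interim block 3 of the triangle is skipped in favour of the join block.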

llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll

	[13 lines not shown]
	; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4			; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP3]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP3]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i16*> poison, i16* [[DST:%.*]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i16*> poison, i16* [[DST:%.*]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16*> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i16*> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16*> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i16*> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY_SPLIT_SPLIT:%.*]] ]
	; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0			; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, i16* [[SRC:%.*]], i64 [[TMP4]]			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, i16* [[SRC:%.*]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, i16* [[TMP5]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, i16* [[TMP5]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = bitcast i16* [[TMP6]] to <vscale x 4 x i16>*			; CHECK-NEXT: [[TMP7:%.*]] = bitcast i16* [[TMP6]] to <vscale x 4 x i16>*
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, <vscale x 4 x i16>* [[TMP7]], align 2			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, <vscale x 4 x i16>* [[TMP7]], align 2
	; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i16.nxv4p0i16(<vscale x 4 x i16> [[WIDE_LOAD]], <vscale x 4 x i16*> [[BROADCAST_SPLAT]], i32 2, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))			; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vscale.i32()
	david-arm: Hi @reames, we definitely don't want to be doing this for SVE as it will likely hurt performance; scatters are still likely to be better. We made a conscious effort to avoid scalarising this way for SVE because we managed to find other ways of solving the same problems using existing instructions. Also, if we ever encountered a situation where we had to scalarise a scalable vector, the performance would likely be terrible, so we choose to return an Invalid cost from the cost model and skip that VF. We may as well just use NEON or not vectorise, because that is almost certainly better.
	What use cases are you trying to solve here? This patch doesn't seem to fix any actual bugs, so I'm assuming this is for performance reasons. It looks like this change only affects one test (`@uniform_store_of_loop_varying`), and I guess the IR in this test is not a common idiom. Have you tested performance before and after to see if this is worthwhile?
	reames (Author): The motivation here is primarily robustness, in terms of both correctness and performance.
	On the correctness front, the existing code pretty fundamentally assumes that scalarization is an available strategy. The invalid-cost bailouts for scalable vectors have been added in, but I've already hit (and fixed) several cases where this was incomplete. From a code complexity and likelihood-of-bugs perspective, having fixed and scalable vectors share the same capabilities seems highly useful.
	On the performance side, there's a two-level argument here. For many cases, you're entirely right that there are other strategies available. You'll note that I have a patch on review right now which implements a more optimal strategy for divides. However, we have not implemented all of them. As a result, the current situation is that these cases become strictly non-vectorizable. This means that even loops that would otherwise be hugely profitable to vectorize are prevented from vectorizing. This is basically an "order of work" argument to minimize the pain while that work is happening.
	However, there are also cases which we can vectorize, but fundamentally don't have a better strategy for. One that we could vectorize, but don't today, is a call to a readonly function in an otherwise vectorizable loop. We can't assume the existence of a vector variant of the call (since it's not a known function), but we could scalarize in general. Now, put that call down a rarely taken predicated block and you start seeing why having the capability of scalarization is useful.
	Uniform memops are only a testing vehicle here. Nothing more. Please don't fixate on the generality of a uniform store test, as that is not what this patch is really about.
	For aarch64, if scatters are profitable and legal, then the code should pick that already. This is an example where the cost model believes scalarization is profitable. It didn't make it into the final patch, but I played with some pretty serious cost penalties, and this case didn't change behavior. If you have particular suggestions on how to adjust cost to penalize scalarizing "enough", I'm completely open to them.
	fhahn: In general it seems like having this extra lowering strategy would be desirable with respect to guaranteeing that we will always be able to scalarize any construct, rather than crash. This is an assumption made in various places. It should be up to the cost model to avoid scalarizing in cases where it won't be profitable, but I can't comment on this particular test case or on whether there are scenarios where this will be profitable in practice.
	; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP9:%.*]] = mul i32 [[TMP8]], 4
	; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 4			; CHECK-NEXT: br label [[VECTOR_BODY_SPLIT:%.*]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP9]]			; CHECK: vector.body.split:
	; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP10:%.*]] = phi i32 [ 0, [[VECTOR_BODY]] ], [ [[TMP13:%.*]], [[VECTOR_BODY_SPLIT]] ]
	; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <vscale x 4 x i16> [[WIDE_LOAD]], i32 [[TMP10]]
				; CHECK-NEXT: [[TMP12:%.*]] = extractelement <vscale x 4 x i16*> [[BROADCAST_SPLAT]], i32 [[TMP10]]
				; CHECK-NEXT: store i16 [[TMP11]], i16* [[TMP12]], align 2
				; CHECK-NEXT: [[TMP13]] = add i32 [[TMP10]], 1
				; CHECK-NEXT: [[TMP14:%.*]] = icmp ne i32 [[TMP10]], [[TMP9]]
				; CHECK-NEXT: br i1 [[TMP14]], label [[VECTOR_BODY_SPLIT]], label [[VECTOR_BODY_SPLIT_SPLIT]]
				; CHECK: vector.body.split.split:
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_INC24:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_INC24:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[FOR_BODY14:%.*]]			; CHECK-NEXT: br label [[FOR_BODY14:%.*]]
	; CHECK: for.body14:			; CHECK: for.body14:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY14]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY14]] ]
	[108 lines not shown]
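
For orientation, the `vector.body.split` block in the updated checks above is a loop over the lanes of the scalable vector: extract one element, perform the scalar store, advance the lane index, and repeat until the runtime lane count is reached. A rough C++ model of that shape is sketched below. It is only an illustration under stated assumptions: `runtimeLaneCount` and `scalarizedUniformStore` are hypothetical names, the scalable vector is modelled as a `std::vector` whose length is `vscale` times the minimum lane count, and no LLVM API is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: the lane count of <vscale x MinLanes x T> is only
// known at run time, as VScale * MinLanes.
std::size_t runtimeLaneCount(std::size_t VScale, std::size_t MinLanes) {
  return VScale * MinLanes;
}

// Scalarized uniform store: every lane writes to the same address in lane
// order, so the value of the last lane is what remains in memory.
void scalarizedUniformStore(const std::vector<std::int16_t> &Wide,
                            std::int16_t *Dst) {
  assert(Dst != nullptr && "destination must be valid");
  // Because the trip count is dynamic, this cannot be unrolled at compile
  // time the way a fixed-length vector would be; it stays a real loop.
  for (std::size_t Lane = 0; Lane != Wide.size(); ++Lane)
    *Dst = Wide[Lane]; // models extractelement followed by a scalar store
}
```

The point of contention in the review is visible here: the loop is always available as a fallback, but it executes one scalar store per lane, which is why the reviewers want the cost model to decide when it is acceptable.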

llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll

	[847 lines not shown]

	for.end:			for.end:
	ret void			ret void
	}			}

	define void @uniform_store_of_loop_varying(ptr noalias nocapture %a, ptr noalias nocapture %b, i64 %v, i64 %n) {			define void @uniform_store_of_loop_varying(ptr noalias nocapture %a, ptr noalias nocapture %b, i64 %v, i64 %n) {
	; SCALABLE-LABEL: @uniform_store_of_loop_varying(			; SCALABLE-LABEL: @uniform_store_of_loop_varying(
	; SCALABLE-NEXT: entry:			; SCALABLE-NEXT: entry:
				; SCALABLE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; SCALABLE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP0]]
				; SCALABLE-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; SCALABLE: vector.ph:
				; SCALABLE-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
				; SCALABLE-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP1]]
				; SCALABLE-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
				; SCALABLE-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 1 x ptr> poison, ptr [[B:%.*]], i32 0
				; SCALABLE-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 1 x ptr> [[BROADCAST_SPLATINSERT]], <vscale x 1 x ptr> poison, <vscale x 1 x i32> zeroinitializer
				; SCALABLE-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 1 x i64> poison, i64 [[V:%.*]], i32 0
				; SCALABLE-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 1 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 1 x i64> poison, <vscale x 1 x i32> zeroinitializer
				; SCALABLE-NEXT: br label [[VECTOR_BODY:%.*]]
				; SCALABLE: vector.body:
				; SCALABLE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY_SPLIT_SPLIT:%.*]] ]
				; SCALABLE-NEXT: [[TMP2:%.*]] = call <vscale x 1 x i64> @llvm.experimental.stepvector.nxv1i64()
				; SCALABLE-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 1 x i64> poison, i64 [[INDEX]], i32 0
				; SCALABLE-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 1 x i64> [[DOTSPLATINSERT]], <vscale x 1 x i64> poison, <vscale x 1 x i32> zeroinitializer
				; SCALABLE-NEXT: [[TMP3:%.*]] = add <vscale x 1 x i64> zeroinitializer, [[TMP2]]
				; SCALABLE-NEXT: [[TMP4:%.*]] = mul <vscale x 1 x i64> [[TMP3]], shufflevector (<vscale x 1 x i64> insertelement (<vscale x 1 x i64> poison, i64 1, i32 0), <vscale x 1 x i64> poison, <vscale x 1 x i32> zeroinitializer)
				; SCALABLE-NEXT: [[TMP5:%.*]] = add <vscale x 1 x i64> [[DOTSPLAT]], [[TMP4]]
				; SCALABLE-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0
				; SCALABLE-NEXT: [[TMP7:%.*]] = call i32 @llvm.vscale.i32()
				; SCALABLE-NEXT: br label [[VECTOR_BODY_SPLIT:%.*]]
				; SCALABLE: vector.body.split:
				; SCALABLE-NEXT: [[TMP8:%.*]] = phi i32 [ 0, [[VECTOR_BODY]] ], [ [[TMP11:%.*]], [[VECTOR_BODY_SPLIT]] ]
				; SCALABLE-NEXT: [[TMP9:%.*]] = extractelement <vscale x 1 x i64> [[TMP5]], i32 [[TMP8]]
				; SCALABLE-NEXT: [[TMP10:%.*]] = extractelement <vscale x 1 x ptr> [[BROADCAST_SPLAT]], i32 [[TMP8]]
				; SCALABLE-NEXT: store i64 [[TMP9]], ptr [[TMP10]], align 8
				; SCALABLE-NEXT: [[TMP11]] = add i32 [[TMP8]], 1
				; SCALABLE-NEXT: [[TMP12:%.*]] = icmp ne i32 [[TMP8]], [[TMP7]]
				; SCALABLE-NEXT: br i1 [[TMP12]], label [[VECTOR_BODY_SPLIT]], label [[VECTOR_BODY_SPLIT_SPLIT]]
				; SCALABLE: vector.body.split.split:
				; SCALABLE-NEXT: [[TMP13:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[TMP6]]
				; SCALABLE-NEXT: [[TMP14:%.*]] = getelementptr inbounds i64, ptr [[TMP13]], i32 0
				; SCALABLE-NEXT: store <vscale x 1 x i64> [[BROADCAST_SPLAT2]], ptr [[TMP14]], align 8
				; SCALABLE-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; SCALABLE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP15]]
				; SCALABLE-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; SCALABLE-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
				; SCALABLE: middle.block:
				; SCALABLE-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
				; SCALABLE-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; SCALABLE: scalar.ph:
				; SCALABLE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; SCALABLE-NEXT: br label [[FOR_BODY:%.*]]			; SCALABLE-NEXT: br label [[FOR_BODY:%.*]]
	; SCALABLE: for.body:			; SCALABLE: for.body:
	; SCALABLE-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]			; SCALABLE-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
	; SCALABLE-NEXT: store i64 [[IV]], ptr [[B:%.*]], align 8			; SCALABLE-NEXT: store i64 [[IV]], ptr [[B]], align 8
	; SCALABLE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[IV]]			; SCALABLE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]
	; SCALABLE-NEXT: store i64 [[V:%.*]], ptr [[ARRAYIDX]], align 8			; SCALABLE-NEXT: store i64 [[V]], ptr [[ARRAYIDX]], align 8
	; SCALABLE-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1			; SCALABLE-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
	; SCALABLE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024			; SCALABLE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
	; SCALABLE-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END:%.*]], label [[FOR_BODY]]			; SCALABLE-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
	; SCALABLE: for.end:			; SCALABLE: for.end:
	; SCALABLE-NEXT: ret void			; SCALABLE-NEXT: ret void
	;			;
	; FIXEDLEN-LABEL: @uniform_store_of_loop_varying(			; FIXEDLEN-LABEL: @uniform_store_of_loop_varying(
	; FIXEDLEN-NEXT: entry:			; FIXEDLEN-NEXT: entry:
	; FIXEDLEN-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; FIXEDLEN-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; FIXEDLEN: vector.ph:			; FIXEDLEN: vector.ph:
	; FIXEDLEN-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[V:%.*]], i32 0			; FIXEDLEN-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[V:%.*]], i32 0
	[281 lines not shown]
	; SCALABLE-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 0			; SCALABLE-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 0
	; SCALABLE-NEXT: store i64 [[V]], ptr [[B:%.*]], align 1			; SCALABLE-NEXT: store i64 [[V]], ptr [[B:%.*]], align 1
	; SCALABLE-NEXT: [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[TMP2]]			; SCALABLE-NEXT: [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[TMP2]]
	; SCALABLE-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[TMP3]], i32 0			; SCALABLE-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[TMP3]], i32 0
	; SCALABLE-NEXT: store <vscale x 1 x i64> [[BROADCAST_SPLAT]], ptr [[TMP4]], align 8			; SCALABLE-NEXT: store <vscale x 1 x i64> [[BROADCAST_SPLAT]], ptr [[TMP4]], align 8
	; SCALABLE-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()			; SCALABLE-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
	; SCALABLE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]			; SCALABLE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
	; SCALABLE-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; SCALABLE-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; SCALABLE-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]			; SCALABLE-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
	; SCALABLE: middle.block:			; SCALABLE: middle.block:
	; SCALABLE-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]			; SCALABLE-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
	; SCALABLE-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]			; SCALABLE-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
	; SCALABLE: scalar.ph:			; SCALABLE: scalar.ph:
	; SCALABLE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; SCALABLE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; SCALABLE-NEXT: br label [[FOR_BODY:%.*]]			; SCALABLE-NEXT: br label [[FOR_BODY:%.*]]
	; SCALABLE: for.body:			; SCALABLE: for.body:
	; SCALABLE-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]			; SCALABLE-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
	; SCALABLE-NEXT: store i64 [[V]], ptr [[B]], align 1			; SCALABLE-NEXT: store i64 [[V]], ptr [[B]], align 1
	; SCALABLE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]			; SCALABLE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]
	; SCALABLE-NEXT: store i64 [[V]], ptr [[ARRAYIDX]], align 8			; SCALABLE-NEXT: store i64 [[V]], ptr [[ARRAYIDX]], align 8
	; SCALABLE-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1			; SCALABLE-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
	; SCALABLE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024			; SCALABLE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
	; SCALABLE-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]			; SCALABLE-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP13:![0-9]+]]
	; SCALABLE: for.end:			; SCALABLE: for.end:
	; SCALABLE-NEXT: ret void			; SCALABLE-NEXT: ret void
	;			;
	; FIXEDLEN-LABEL: @uniform_store_unaligned(			; FIXEDLEN-LABEL: @uniform_store_unaligned(
	; FIXEDLEN-NEXT: entry:			; FIXEDLEN-NEXT: entry:
	; FIXEDLEN-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; FIXEDLEN-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; FIXEDLEN: vector.ph:			; FIXEDLEN: vector.ph:
	; FIXEDLEN-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[V:%.*]], i32 0			; FIXEDLEN-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[V:%.*]], i32 0
	[128 lines not shown]