
Details

Reviewers

sdesmalen
david-arm
fhahn
efriedma
Ayal

Commits

rG0d748b4d32cb: [LoopVectorize] Extract the last lane from a uniform store

Summary

Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.

setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.

This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.
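Why storing only the last lane is enough can be illustrated with a small standalone model. This is a sketch under stated assumptions, not code from the patch: `Lanes`, `storeAllLanes` and `storeLastLane` are hypothetical names, and a vector iteration is modelled as plain arrays of the values the scalar loop would have stored to the one uniform address.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: each unrolled part holds the VF values that the scalar
// loop would have written, in order, to the same (uniform) address.
using Lanes = std::vector<int>;

// Reference behaviour: replay every scalar store, part by part, lane by lane.
void storeAllLanes(int &Dst, const std::vector<Lanes> &Parts) {
  for (const Lanes &Part : Parts)
    for (int V : Part)
      Dst = V;
}

// Behaviour described in the summary: a single store of the last lane of the
// last unrolled part. Only valid for unconditional stores to one address.
void storeLastLane(int &Dst, const std::vector<Lanes> &Parts) {
  assert(!Parts.empty() && !Parts.back().empty() && "need at least one lane");
  Dst = Parts.back().back();
}
```

Both functions leave the same value in `Dst`, because every earlier store is overwritten before anything can observe it. A conditional store does not fit this model, which matches the summary restricting the change to unconditional uniform stores.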

Diff Detail

Event Timeline

kmclaughlin created this revision. Oct 28 2021, 8:27 AM

Herald added a reviewer: efriedma. Oct 28 2021, 8:27 AM

Herald added subscribers: psnobl, hiraditya, tschuett.

kmclaughlin requested review of this revision. Oct 28 2021, 8:27 AM

Herald added a project: Restricted Project. Oct 28 2021, 8:27 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B131210: Diff 383039. Oct 28 2021, 9:27 AM

sdesmalen added inline comments. Oct 29 2021, 12:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7470–7471	This condition is always true, because it is enclosed by `if (Legal->isUniformMemOp(I))`
9817	This code needs a comment with rationale for extracting the last lane.
9818	While it is a concrete problem for scalable vectors, I don't think this is necessarily specific to scalable vectors and so we may want to do the same thing for fixed-width vectors. I'd expect other passes to remove the redundant scalar stores that are currently created, but it would be nice if those would not be generated in the first place.
9818	I would expect `isScalarAfterVectorization` to be set to `true` when the memory address is uniform, and that to be used instead of `!IsUniform && isUniformMemOp(*I)`. Can you check whether this can be used instead?

Removed redundant Legal->isUniformMemOp(I) check from setCostBasedWideningDecision
Added a comment to VPReplicateRecipe::execute
Removed the State.VF.isScalable() check from VPReplicateRecipe::execute & updated the tests affected by this change. Also added a test of uniform stores for fixed-width.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9818	Hi @sdesmalen, I've removed `State.VF.isScalable()` and added a new test for fixed-width to LoopVectorize/uniform-store.ll. There were also a few existing tests with uniform stores affected by this change which have been updated.
9818	I tried removing `!IsUniform && isUniformMemOp(*I)` and replacing it with `isScalarAfterVectorization`, though this returns false for the instruction here and we continue on to hit the "Can't scalarize a scalable vector" assert below.

Harbormaster completed remote builds in B131446: Diff 383385. Oct 29 2021, 9:59 AM

Matt added a subscriber: Matt. Nov 1 2021, 2:32 PM

kmclaughlin mentioned this in D113034: [LoopVectorize] Mark store instructions as uniform in collectLoopUniforms. Nov 2 2021, 10:40 AM

Removed the uniform-store.ll test added in the previous revision.

kmclaughlin added a child revision: D113034: [LoopVectorize] Mark store instructions as uniform in collectLoopUniforms. Nov 2 2021, 10:44 AM

Harbormaster completed remote builds in B132012: Diff 384161. Nov 2 2021, 11:23 AM

kmclaughlin marked an inline comment as not done. Nov 2 2021, 11:25 AM

kmclaughlin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9818	@sdesmalen I think you're right that `isScalarAfterVectorization` should be true for the store instruction and I've created D113034 (which changes `collectLoopUniforms` to also consider uniform stores) to address this. I think we still need to check `isUniformMemOp` here though, since Scalars collects more than just uniform instructions and we only want to generate the last lane for uniform stores.

I think the title needs updating after the latest update (remove SVE).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9818	If D113034 would be landed first, shouldn't `IsUniform` be set correctly? Ideally the uniform information should be explicit in the recipe and Legal should not be accessed during codegen.

Merged with D113034, which makes changes to collectLoopUniforms to collect uniform store instructions.

kmclaughlin added inline comments. Nov 3 2021, 9:59 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9818	Hi @fhahn, I don't think D113034 could be landed first since it requires changes from this patch to generate the last lane for stores? Since these patches are closely related I've merged them here so that I don't need to access Legal from VPReplicateRecipe.

Harbormaster completed remote builds in B132261: Diff 384494. Nov 3 2021, 10:38 AM

The changes look good to me @kmclaughlin! I'll look through the remaining non-aarch64 tests later today. I had a couple of minor comments so far ...

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9820	This change looks sensible to me, since I think we mark unrolled scalar stores as uniform.
llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
29 ↗	(On Diff #384494)	Perhaps it's worth deleting the CHECK lines from `middle.block` onwards as they don't add much value?
llvm/test/Transforms/LoopVectorize/AArch64/sve-uniform-store.ll
6	I wonder if it's perhaps worth moving these tests into sve-inv-store.ll, since they're testing the same thing?

david-arm added inline comments. Nov 4 2021, 4:40 AM

llvm/test/Transforms/LoopVectorize/X86/illegal-parallel-loop-uniform-write.ll
85 ↗	(On Diff #384494)	Hi @kmclaughlin, I don't think this looks right sadly. I wonder if now we're marking the store as uniform that you've exposed an existing bug in `collectLoopUniforms` or somewhere else like `handleReplication`? It looks like we're also now treating the load as uniform, which is wrong in this case because we still want to do the vector load and store out the last lane. I'd expected something like:
	[[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP22]], align 4
	[[TMP23:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], <i32 1, i32 1, i32 1, i32 1>
	[[TMP27:%.*]] = extractelement <4 x i32> [[TMP23]], i32 3
	store i32 [[TMP27]], i32* [[ARRAYIDX7_US]], align 4

fhahn requested changes to this revision. Nov 4 2021, 5:17 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5465	Unfortunately this is incorrect now. The worklist currently relies on instructions only demanding the first lane, and this is used to propagate this property later on, using the rule: if all users of an operand demand the first lane only, the operand itself also only needs to compute the first lane. I added a clarifying comment in b4992dbb21ff9159285ae0aec73f3d760344b0e5. Adding stores violates that. You should probably be able to work around that by adding them to `Uniforms` without adding them to the worklist. It might be worth calling out that entries in Uniforms may demand the first or last lane.
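The propagation rule in this comment can be sketched as a fixed-point computation over a toy use graph. This is a simplified illustration, not the code in `collectLoopUniforms`: `UseMap` and `propagateFirstLane` are hypothetical names, values are identified by strings, and the real pass carries additional checks.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy IR: each value name maps to the names of its users.
using UseMap = std::map<std::string, std::vector<std::string>>;

// Seeds are values assumed to demand only lane 0 of their operands. A value
// joins the set once every one of its users is already in the set.
std::set<std::string> propagateFirstLane(const UseMap &Users,
                                         const std::set<std::string> &Seeds) {
  std::set<std::string> FirstLaneOnly = Seeds;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &Entry : Users) {
      const std::string &Val = Entry.first;
      const std::vector<std::string> &Us = Entry.second;
      if (Us.empty() || FirstLaneOnly.count(Val))
        continue;
      bool AllFirstLane = true;
      for (const std::string &U : Us)
        AllFirstLane = AllFirstLane && FirstLaneOnly.count(U) != 0;
      if (AllFirstLane) {
        FirstLaneOnly.insert(Val);
        Changed = true;
      }
    }
  }
  return FirstLaneOnly;
}
```

With a chain `load -> add -> store`, seeding the set with the store pulls in both the `add` and the `load`, although the store actually needs the last lane of the `add`. Leaving the store out of the seeds keeps both values wide, which is the effect of recording stores in `Uniforms` without putting them on the worklist.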

This revision now requires changes to proceed. Nov 4 2021, 5:17 AM

fhahn added a reviewer: Ayal. Nov 4 2021, 5:18 AM

Add store instructions to the Uniforms list in collectLoopUniforms, instead of the worklist. Added more comments to clarify that instructions in Uniforms may demand the first or last lane.
Moved the new tests in sve-uniform-store.ll into sve-inv-store.ll.
Removed the CHECK lines from middle.block from @inv_store_i16

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5465	Thank you for adding a comment to addToWorklistIfAllowed, @fhahn. I've changed this so that store instructions are only added to `Uniforms` as suggested, and added some comments which I hope make this clear.
llvm/test/Transforms/LoopVectorize/X86/illegal-parallel-loop-uniform-write.ll
85 ↗	(On Diff #384494)	Thanks @david-arm, the load instruction was incorrectly being marked as uniform here. I think now that I've changed collectLoopUniforms so that stores are not added to the worklist, the output of this test looks as expected?

Harbormaster completed remote builds in B132490: Diff 384795. Nov 4 2021, 12:17 PM

LGTM! It looks like you've addressed all the review comments here. Thanks @kmclaughlin!

LGTM, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5467	nit: perhaps throw in an `only` (e.g. we instead add these to Uniforms only). Otherwise it may sound like the other instructions won't get added to `Uniforms`, which they will eventually.

This revision is now accepted and ready to land. Nov 7 2021, 1:43 AM

Nice improvement @kmclaughlin, thanks for addressing my comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5465–5466	nit: Should the comment on line 5443 be updated?

This revision was landed with ongoing or failed builds. Nov 9 2021, 6:44 AM

Closed by commit rG0d748b4d32cb: [LoopVectorize] Extract the last lane from a uniform store (authored by kmclaughlin).

This revision was automatically updated to reflect the committed changes.

kmclaughlin marked 2 inline comments as done.

kmclaughlin added a commit: rG0d748b4d32cb: [LoopVectorize] Extract the last lane from a uniform store.

Herald added a subscriber: zzheng. Nov 9 2021, 6:44 AM

kmclaughlin added a reverting change: rG6f16ee5e14a0: Revert "[LoopVectorize] Extract the last lane from a uniform store". Nov 10 2021, 3:23 AM

Diff 383039

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ public:
   /// Create a broadcast instruction. This method generates a broadcast
   /// instruction (shuffle) for loop invariant values and for the induction
   /// value. If this is the induction variable then we extend it to N, N+1, ...
   /// this is needed because each iteration in the loop corresponds to a SIMD
   /// element.
   virtual Value *getBroadcastInstrs(Value *V);

+  LoopVectorizationLegality *getLegal() const { return Legal; }
+
 protected:
   friend class LoopVectorizationPlanner;

   /// A small list of PHINodes.
   using PhiVector = SmallVector<PHINode *, 4>;

   /// A type for scalarized values in the new loop. Each value from the
   /// original loop, when scalarized, is represented by UF x VF scalar values
@@ for (auto &I : *BB) {
       // If there's no pointer operand, there's nothing to do.
       auto *Ptr = getLoadStorePointerOperand(&I);
       if (!Ptr)
         continue;

       // A uniform memory op is itself uniform. We exclude uniform stores
       // here as they demand the last lane, not the first one.
       if (isa<LoadInst>(I) && Legal->isUniformMemOp(I))
         addToWorklistIfAllowed(&I);
    fhahn (Not Done): Unfortunately this is incorrect now. The worklist currently relies on instructions only demanding the first lane, and this is used to propagate this property later on, using: if all users of an operand demand the first lane only, the operand itself also only needs to compute the first lane. I added a clarifying comment in b4992dbb21ff9159285ae0aec73f3d760344b0e5. Adding stores violates that. You should probably be able to work around that by adding them to `Uniforms` without adding them to the worklist. It might be worth calling out that entries in Uniforms may demand the first or last lane.
    kmclaughlin (Author, Done): Thank you for adding a comment to addToWorklistIfAllowed, @fhahn. I've changed this so that store instructions are only added to `Uniforms` as suggested and added some comments which I hope makes this clear.

    sdesmalen (Done): nit: Should the comment on line 5443 be updated?
       if (isUniformDecision(&I, VF)) {
    fhahn (Done): nit: perhaps throw in an `only` (e.g. we instead add these to Uniforms only). Otherwise it may sound like the other instructions won't get added to `Uniforms`, which they will eventually.
         assert(isVectorizedMemAccessUse(&I, Ptr) && "consistency check");
         HasUniformUse.insert(Ptr);
       }
     }

   // Add to the worklist any operands which have only uniform (e.g. lane 0
   // demanding) users. Since loops are assumed to be in LCSSA form, this
   // disallows uses outside the loop as well.
@@ for (Instruction &I : *BB) {
         // Load: Scalar load + broadcast
         // Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
         InstructionCost Cost;
         if (isa<StoreInst>(&I) && VF.isScalable() &&
             isLegalGatherOrScatter(&I)) {
           Cost = getGatherScatterCost(&I, VF);
           setWideningDecision(&I, VF, CM_GatherScatter, Cost);
         } else {
-          assert((isa<LoadInst>(&I) || !VF.isScalable()) &&
+          assert((isa<LoadInst>(&I) || !VF.isScalable() ||
+                  Legal->isUniformMemOp(I)) &&
    sdesmalen (Done): This condition is always true, because it is enclosed by `if (Legal->isUniformMemOp(I))`
                  "Cannot yet scalarize uniform stores");
           Cost = getUniformMemOpCost(&I, VF);
           setWideningDecision(&I, VF, CM_Scalarize, Cost);
         }
         continue;
       }

       // We assume that widening is the best solution when possible.
@@ if (AlsoPack && State.VF.isVector()) {
             VectorType::get(getUnderlyingValue()->getType(), State.VF));
         State.set(this, Poison, State.Instance->Part);
       }
       State.ILV->packScalarIntoVectorValue(this, *State.Instance, State);
     }
     return;
   }

+  Instruction *I = getUnderlyingInstr();
    sdesmalen (Done): This code needs a comment with rationale for extracting the last lane.
+  if (!IsUniform && State.VF.isScalable() && isa<StoreInst>(I) &&
    sdesmalen (Done): While it is a concrete problem for scalable vectors, I don't think this is necessarily specific to scalable vectors and so we may want to do the same thing for fixed-width vectors. I'd expect other passes to remove the redundant scalar stores that are currently created, but it would be nice if those would not be generated in the first place.
    kmclaughlin (Author, Done): Hi @sdesmalen, I've removed `State.VF.isScalable()` and added a new test for fixed-width to LoopVectorize/uniform-store.ll. There were also a few existing tests with uniform stores affected by this change which have been updated.
    sdesmalen (Not Done): I would expect `isScalarAfterVectorization` to be set to `true` when the memory address is uniform and that to be used instead of `!IsUniform && isUniformMemOp(*I)`. Can you check whether this can be used instead?
    kmclaughlin (Author, Not Done): I tried removing `!IsUniform && isUniformMemOp(*I)` and replacing it with `isScalarAfterVectorization`, though this returns false for the instruction here and we continue on to hit the "Can't scalarize a scalable vector" assert below.
    kmclaughlin (Author, Not Done): @sdesmalen I think you're right that `isScalarAfterVectorization` should be true for the store instruction and I've created D113034 (which changes `collectLoopUniforms` to also consider uniform stores) to address this. I think we still need to check `isUniformMemOp` here though, since Scalars collects more than just uniform instructions and we only want to generate the last lane for uniform stores.
    fhahn (Not Done): If D113034 would be landed first, shouldn't `IsUniform` be set correctly? Ideally the uniform information should be explicit in the recipe and Legal should not be accessed during codegen.
    kmclaughlin (Author, Not Done): Hi @fhahn, I don't think D113034 could be landed first since it requires changes from this patch to generate the last lane for stores? Since these patches are closely related I've merged them here so that I don't need to access Legal from VPReplicateRecipe.
+      State.ILV->getLegal()->isUniformMemOp(*I)) {
+    VPLane Lane = VPLane::getLastLaneForVF(State.VF);
    david-arm (Not Done): This change looks sensible to me, since I think we mark unrolled scalar stores as uniform.
+    State.ILV->scalarizeInstruction(
+        I, this, *this, VPIteration(State.UF - 1, Lane), IsPredicated, State);
+    return;
+  }
+
   // Generate scalar instances for all VF lanes of all UF parts, unless the
   // instruction is uniform inwhich case generate only the first lane for each
   // of the UF parts.
   unsigned EndLane = IsUniform ? 1 : State.VF.getKnownMinValue();
   assert((!State.VF.isScalable() || IsUniform) &&
          "Can't scalarize a scalable vector");
   for (unsigned Part = 0; Part < State.UF; ++Part)
     for (unsigned Lane = 0; Lane < EndLane; ++Lane)
-      State.ILV->scalarizeInstruction(getUnderlyingInstr(), this, *this,
-                                      VPIteration(Part, Lane), IsPredicated,
-                                      State);
+      State.ILV->scalarizeInstruction(I, this, *this, VPIteration(Part, Lane),
+                                      IsPredicated, State);
 }

 void VPBranchOnMaskRecipe::execute(VPTransformState &State) {
   assert(State.Instance && "Branch on Mask works only on single instance.");

   unsigned Part = State.Instance->Part;
   unsigned Lane = State.Instance->Lane.getKnownLane();


llvm/test/Transforms/LoopVectorize/AArch64/sve-uniform-store.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -loop-vectorize -scalable-vectorization=preferred -mtriple aarch64-linux-gnu -mattr=+sve -S < %s | FileCheck %s
				target triple = "aarch64-unknown-linux-gnu"

				define void @uniform_store_i1(i1* noalias %dst, i64* noalias %start, i64 %N) {
				; CHECK-LABEL: @uniform_store_i1(
				david-arm (Done): I wonder if it's perhaps worth moving these tests into sve-inv-store.ll, since they're testing the same thing?
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
				; CHECK-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 4
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP4]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i64, i64* [[START:%.*]], i64 [[N_VEC]]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64*> poison, i64* [[START]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64*> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64*> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT10:%.*]] = insertelement <vscale x 2 x i64*> poison, i64* [[START]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT11:%.*]] = shufflevector <vscale x 2 x i64*> [[BROADCAST_SPLATINSERT10]], <vscale x 2 x i64*> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[INDEX]], i32 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 0, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), [[TMP5]]
				; CHECK-NEXT: [[TMP7:%.*]] = add <vscale x 2 x i64> [[DOTSPLAT]], [[TMP6]]
				; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i64, i64* [[START]], <vscale x 2 x i64> [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX]], 0
				; CHECK-NEXT: [[NEXT_GEP2:%.*]] = getelementptr i64, i64* [[START]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 1
				; CHECK-NEXT: [[NEXT_GEP3:%.*]] = getelementptr i64, i64* [[START]], i64 [[TMP9]]
				; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 2
				; CHECK-NEXT: [[DOTSPLATINSERT4:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP11]], i32 0
				; CHECK-NEXT: [[DOTSPLAT5:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT4]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 2 x i64> [[DOTSPLAT5]], [[TMP5]]
				; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 2 x i64> [[DOTSPLAT]], [[TMP12]]
				; CHECK-NEXT: [[NEXT_GEP6:%.*]] = getelementptr i64, i64* [[START]], <vscale x 2 x i64> [[TMP13]]
				; CHECK-NEXT: [[TMP14:%.*]] = add i64 [[TMP11]], 0
				; CHECK-NEXT: [[TMP15:%.*]] = add i64 [[INDEX]], [[TMP14]]
				; CHECK-NEXT: [[NEXT_GEP7:%.*]] = getelementptr i64, i64* [[START]], i64 [[TMP15]]
				; CHECK-NEXT: [[TMP16:%.*]] = add i64 [[TMP11]], 1
				; CHECK-NEXT: [[TMP17:%.*]] = add i64 [[INDEX]], [[TMP16]]
				; CHECK-NEXT: [[NEXT_GEP8:%.*]] = getelementptr i64, i64* [[START]], i64 [[TMP17]]
				; CHECK-NEXT: [[TMP18:%.*]] = add i64 [[INDEX]], 0
				; CHECK-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP20:%.*]] = mul i64 [[TMP19]], 2
				; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[TMP20]], 0
				; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 1
				; CHECK-NEXT: [[TMP23:%.*]] = add i64 [[INDEX]], [[TMP22]]
				; CHECK-NEXT: [[TMP24:%.*]] = getelementptr i64, i64* [[NEXT_GEP2]], i32 0
				; CHECK-NEXT: [[TMP25:%.*]] = bitcast i64* [[TMP24]] to <vscale x 2 x i64>*
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, <vscale x 2 x i64>* [[TMP25]], align 4
				; CHECK-NEXT: [[TMP26:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP27:%.*]] = mul i32 [[TMP26]], 2
				; CHECK-NEXT: [[TMP28:%.*]] = getelementptr i64, i64* [[NEXT_GEP2]], i32 [[TMP27]]
				; CHECK-NEXT: [[TMP29:%.*]] = bitcast i64* [[TMP28]] to <vscale x 2 x i64>*
				; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <vscale x 2 x i64>, <vscale x 2 x i64>* [[TMP29]], align 4
				; CHECK-NEXT: [[TMP30:%.*]] = getelementptr inbounds i64, <vscale x 2 x i64*> [[NEXT_GEP]], i64 1
				; CHECK-NEXT: [[TMP31:%.*]] = getelementptr inbounds i64, <vscale x 2 x i64*> [[NEXT_GEP6]], i64 1
				; CHECK-NEXT: [[TMP32:%.*]] = icmp eq <vscale x 2 x i64*> [[TMP30]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP33:%.*]] = icmp eq <vscale x 2 x i64*> [[TMP31]], [[BROADCAST_SPLAT11]]
				; CHECK-NEXT: [[TMP34:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP35:%.*]] = mul i32 [[TMP34]], 2
				; CHECK-NEXT: [[TMP36:%.*]] = sub i32 [[TMP35]], 1
				; CHECK-NEXT: [[TMP37:%.*]] = extractelement <vscale x 2 x i1> [[TMP33]], i32 [[TMP36]]
				; CHECK-NEXT: store i1 [[TMP37]], i1* [[DST:%.*]], align 1
				; CHECK-NEXT: [[TMP38:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP39:%.*]] = mul i64 [[TMP38]], 4
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP39]]
				; CHECK-NEXT: [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP40]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				;
				entry:
				br label %for.body

				for.body:
				%first.sroa = phi i64* [ %incdec.ptr, %for.body ], [ %start, %entry ]
				%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]
				%iv.next = add i64 %iv, 1
				%0 = load i64, i64* %first.sroa
				%incdec.ptr = getelementptr inbounds i64, i64* %first.sroa, i64 1
				%cmp.not = icmp eq i64* %incdec.ptr, %start
				store i1 %cmp.not, i1* %dst
				%cmp = icmp ult i64 %iv, %N
				br i1 %cmp, label %for.body, label %end

				end:
				ret void
				}

				; Ensure conditional i1 stores do not vectorize
				define void @cond_store_i1(i1* noalias %dst, i8* noalias %start, i32 %cond, i64 %N) {
				; CHECK-LABEL: @cond_store_i1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[FIRST_SROA:%.*]] = phi i8* [ [[INCDEC_PTR:%.*]], [[IF_END:%.*]] ], [ null, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, i8* [[FIRST_SROA]], i64 1
				; CHECK-NEXT: [[TMP0:%.*]] = load i8, i8* [[INCDEC_PTR]], align 1
				; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i8 [[TMP0]], 10
				; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label [[IF_END]], label [[IF_THEN:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[CMP_STORE:%.*]] = icmp eq i8* [[START:%.*]], [[INCDEC_PTR]]
				; CHECK-NEXT: store i1 [[CMP_STORE]], i1* [[DST:%.*]], align 1
				; CHECK-NEXT: br label [[IF_END]]
				; CHECK: if.end:
				; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i8* [[INCDEC_PTR]], [[START]]
				; CHECK-NEXT: br i1 [[CMP_NOT]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%first.sroa = phi i8* [ %incdec.ptr, %if.end ], [ null, %entry ]
				%incdec.ptr = getelementptr inbounds i8, i8* %first.sroa, i64 1
				%0 = load i8, i8* %incdec.ptr
				%tobool.not = icmp eq i8 %0, 10
				br i1 %tobool.not, label %if.end, label %if.then

				if.then:
				%cmp.store = icmp eq i8* %start, %incdec.ptr
				store i1 %cmp.store, i1* %dst
				br label %if.end

				if.end:
				%cmp.not = icmp eq i8* %incdec.ptr, %start
				br i1 %cmp.not, label %for.end, label %for.body

				for.end:
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Extract the last lane from a uniform store
ClosedPublic
